fedora_coreos_meeting
LOGS
16:31:36 <dustymabe> #startmeeting fedora_coreos_meeting
16:31:36 <zodbot> Meeting started Wed Jan 12 16:31:36 2022 UTC.
16:31:36 <zodbot> This meeting is logged and archived in a public location.
16:31:36 <zodbot> The chair is dustymabe. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:31:36 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:31:36 <zodbot> The meeting name has been set to 'fedora_coreos_meeting'
16:31:42 <dustymabe> #topic roll call
16:31:43 <bgilbert> .hi
16:31:44 <zodbot> bgilbert: bgilbert 'Benjamin Gilbert' <bgilbert@backtick.net>
16:31:52 <nemric> o/
16:32:33 <dustymabe> .hi
16:32:34 <zodbot> dustymabe: dustymabe 'Dusty Mabe' <dusty@dustymabe.com>
16:32:34 <jlebon> .hello2
16:32:37 <zodbot> jlebon: jlebon 'None' <jonathan@jlebon.com>
16:32:45 <davdunc> .hi
16:32:46 <zodbot> davdunc: davdunc 'David Duncan' <davdunc@amazon.com>
16:32:48 <jlebon> woah cool, new room
16:32:56 <dustymabe> jlebon: fresh paint
16:32:58 <jlebon> test TEST Test
16:33:03 <jlebon> acoustics seem nice
16:33:09 <nemric> :D
16:33:36 <nemric> no fridge yet ?
16:34:23 <dustymabe> no fridge, but there is a pony keg with beer in it
16:34:23 <travier> .hello siosm
16:34:26 <zodbot> travier: siosm 'Timothée Ravier' <travier@redhat.com>
16:34:50 <jdoss> .hello
16:34:50 <zodbot> jdoss: (hello <an alias, 1 argument>) -- Alias for "hellomynameis $1".
16:34:55 <saqali> .hello
16:34:55 <skunkerk> .hello sohank2602
16:34:58 <zodbot> saqali: (hello <an alias, 1 argument>) -- Alias for "hellomynameis $1".
16:34:58 <miabbott> .hello miabbott
16:35:01 <zodbot> skunkerk: sohank2602 'Sohan Kunkerkar' <skunkerk@redhat.com>
16:35:04 <zodbot> miabbott: miabbott 'Micah Abbott' <miabbott@redhat.com>
16:35:14 <jdoss> .hello2
16:35:15 <zodbot> jdoss: jdoss 'Joe Doss' <joe@solidadmin.com>
16:35:17 <saqali> .hello saqali
16:35:19 <zodbot> saqali: saqali 'Saqib Ali' <saqali@redhat.com>
16:36:29 <dustymabe> #chair bgilbert nemric jlebon davdunc travier jdoss saqali skunkerk miabbott
16:36:29 <zodbot> Current chairs: bgilbert davdunc dustymabe jdoss jlebon miabbott nemric saqali skunkerk travier
16:36:36 <dustymabe> #chair lorbus
16:36:36 <zodbot> Current chairs: bgilbert davdunc dustymabe jdoss jlebon lorbus miabbott nemric saqali skunkerk travier
16:36:48 <lorbus> .hi
16:36:49 <zodbot> lorbus: lorbus 'Christian Glombek' <cglombek@redhat.com>
16:37:41 <dustymabe> #topic Action items from last meeting
16:37:50 <dustymabe> There were no action items from last meeting!
16:38:13 <dustymabe> other than the usual  `cat everything > /dev/jlebon`
16:38:22 <jdoss> \o/
16:38:33 <jlebon> :)
16:38:40 <dustymabe> oops
16:38:42 <dustymabe> meant
16:38:54 <dustymabe> `cat everything >> /dev/jlebon` - can't overwrite the backlog
16:39:02 <travier> :D
16:39:14 <dustymabe> ha
16:39:23 <jlebon> -ENOSPC
16:39:32 <jdoss> Just redirect it all to stdtravier
16:39:37 <travier> :)
16:39:56 <travier> (Can we do https://github.com/coreos/fedora-coreos-tracker/issues/194 ? But only if nothing else is more pressing?)
16:40:22 <dustymabe> travier: can try
16:40:29 <dustymabe> let's start with something else first
16:40:36 <travier> =1
16:40:39 <travier> +1
16:40:39 <dustymabe> #topic FYI: some xen instance types might fail to boot on latest testing and next streams
16:40:44 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/1066
16:41:05 <dustymabe> This one is mostly an FYI to raise awareness (we'll probably be sending out a communication about it as well)
16:41:31 <dustymabe> some xen instance types that take advantage of enhanced networking via the ixgbevf driver are failing to boot
16:41:44 <jmarrero> hi.
16:41:48 <jmarrero> .hi
16:41:49 <zodbot> jmarrero: jmarrero 'Joseph Marrero' <jmarrero@redhat.com>
16:41:55 <dustymabe> the failure rate is ~95% but somehow our tests passed when `testing` and `next` were cut last week
16:42:20 <dustymabe> (for that instance type we only launch one instance and run a `basic` test)
16:42:23 <miabbott> #chair jmarrero
16:42:23 <zodbot> Current chairs: bgilbert davdunc dustymabe jdoss jlebon jmarrero lorbus miabbott nemric saqali skunkerk travier
16:42:43 <travier> Should we report that to AWS folks? davdunc?
16:42:46 <dustymabe> The current proposal is to revert the kernel back to a known good and pursue a fix upstream
16:42:50 <dustymabe> travier: already done! :)
16:42:53 <davdunc> thanks dustymabe for that. I am investigating..
16:42:54 <travier> great!
16:43:03 <davdunc> we have a kernel ticket in for it.
16:43:14 <davdunc> I'll add that for reference in the issue.
16:43:35 <dustymabe> thanks davdunc
16:43:56 <dustymabe> AFAIK there isn't any way to work around the issue other than reverting the kernel
16:44:15 <dustymabe> and we'll have to come up with some steps for people to recover their instances if they've fallen into this trap :(
16:44:50 <dustymabe> I have some ideas for improvements on how we can not hit this again, but I'll leave them for later
16:45:05 <jlebon> wow how lucky were we that both runs were in those 5%. that's a ...0.25% probability
16:45:16 <jlebon> likely some other factor at play?
16:45:31 <jlebon> this is a good example though of how CI will never really catch everything
16:45:57 <travier> Maybe something else changed in AWS between that time and now?
16:46:00 <dustymabe> jlebon: yeah there is definitely something else going on underneath the covers. Maybe some changes on AWS backend?
16:46:14 <dustymabe> that make it more consistently failing
16:46:36 <jlebon> +1
16:46:45 <davdunc> there is a nitro wrapper for specifically older instance types, like the m2 and m3 instances.
16:46:53 <dustymabe> there were other contributing factors to why we either didn't see this or ignored it for a period of time: https://github.com/coreos/fedora-coreos-tracker/issues/1066#issuecomment-1009978326
16:47:14 <dustymabe> i did a deep dive in mantle last night and found some skeletons we need to address
16:47:26 <davdunc> a lot of isolation was required after spectre/meltdown
16:47:28 <dustymabe> which would have given us a much clearer red X failure
16:47:38 <dustymabe> for the testing-devel runs
16:47:46 <travier> It's always Jenkins' fault!
16:48:26 <dustymabe> anywho I've spent too much time on this already.. FYI bgilbert looks like you're up in the ad-hoc release rotation: https://hackmd.io/WCA8XqAoRvafnja01JG_YA
16:48:35 <dustymabe> will collaborate with you
16:48:37 <mnguyen> this is one of the CVEs fixed in the kernel https://bugzilla.redhat.com/show_bug.cgi?id=2031199
16:48:39 <bgilbert> yup
16:49:06 <dustymabe> ok next topic
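Editor's note: the "0.25%" figure quoted above follows from the ~95% failure rate if the two test launches (`testing` and `next`) are treated as independent events. A minimal sketch of that arithmetic, assuming independence (which the discussion itself doubts); the function name is illustrative only:

```python
# Back-of-the-envelope check of the probability quoted in the discussion.
# Assumption: each launch boots independently with the same failure rate.
failure_rate = 0.95            # ~95% of boots fail, per the discussion
pass_rate = 1 - failure_rate   # so ~5% of boots succeed


def prob_all_pass(pass_rate: float, runs: int) -> float:
    """Probability that every one of `runs` independent launches boots."""
    return pass_rate ** runs


both = prob_all_pass(pass_rate, 2)
print(f"chance both runs passed: {both:.2%}")
```

A number this small is why the participants suspect some other factor changed between the test runs and the later failures.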
16:49:27 <dustymabe> #topic networking: consider the effects of BOOTIF kernel argument on nm-initrd-generator
16:49:34 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/1048
16:50:14 <dustymabe> There is a downstream issue I'm working on that this has implications on so I prioritized it
16:50:53 <dustymabe> Basically there is a gap we've had in the past where we didn't consider the change in behavior a BOOTIF kernel argument would have on nm-initrd-generator
16:51:03 <dustymabe> or rather a gap we "have"
16:51:04 <jlebon> #chair jbrooks
16:51:04 <zodbot> Current chairs: bgilbert davdunc dustymabe jbrooks jdoss jlebon jmarrero lorbus miabbott nemric saqali skunkerk travier
16:52:17 <dustymabe> the question is, should we start to tell nm-initrd-generator to ignore that argument or not. Ignoring it gets us back into our happy place that we thought we were in to begin with.
16:53:04 <jlebon> i think we should ignore it, but it'd be nice if there was a nicer way to do that than changing the kargs defaults
16:53:28 <dustymabe> though, ignoring it could have implications (behavior change) for some (i.e. if there are a ton of NICs on a machine or something)
16:53:29 <jlebon> like some flag to nm-initrd-generator or something
16:54:42 <jlebon> i think it's worth highlighting that BOOTIF usually is not something users provide
16:55:02 <dustymabe> true, it's usually provided by the PXE executable
16:55:08 <jlebon> it's meant as a way for a PXE boot to know from which interface it booted for informational purposes
16:55:26 <travier> Could someone be relying on that right now and we would break it by changing how this behaves?
16:55:57 <jlebon> in theory, yes
16:56:21 <dustymabe> jlebon: another option for us could be to enhance the code that attempts to determine if the user supplied any networking configuration or not to consider `BOOTIF`
16:56:29 <travier> I would prefer we improve our detection logic but that might not be ideal (and would require more work)
16:57:16 <dustymabe> I guess we need to game out all of the scenarios
16:57:18 <jlebon> dustymabe: i.e. and not propagate?
16:57:23 <dustymabe> jlebon: correct
16:57:36 <jlebon> but still let it have an effect on initrd networking
16:57:56 <dustymabe> jlebon: right it still would have an effect on initrd networking
16:58:12 * dustymabe needs to read the bug again to see if that would actually help or not
16:58:17 <jlebon> yeah, could make sense. it's a more conservative change
16:58:56 <dustymabe> the problem with that scenario arises when someone has their ignition config on a different network/NIC than they PXE booted from
16:59:27 <dustymabe> which I guess in that case we tell them to add the `rd.bootif=0` arg?
17:00:00 <jlebon> there's also a knob they could use to have it not inject BOOTIF=
17:00:03 <dustymabe> i'll take this back to investigation and add more info to the ticket and circle back next meeting with alternative options/implications
17:00:14 <dustymabe> right `rd.bootif=0` ?
17:00:28 <bgilbert> it looks like the nm-initrd-generator glue respects the dracut cmdline glue
17:00:37 <jlebon> dustymabe: no on the pxe configuration side
17:00:43 <bgilbert> so we could drop rd.bootif=0 in /etc/cmdline.d/foo.conf
17:00:52 <dustymabe> oh, yeah, that could be a nice option jlebon
17:01:16 <bgilbert> and /proc/cmdline can still override it
17:01:18 <dustymabe> bgilbert: right, that's the same thing as adding it to our default kargs (the original proposal)
17:01:36 <bgilbert> dustymabe: yes, but without having to change the user-visible kargs
17:02:02 <dustymabe> right, sorry that wasn't written down (that was the implementation I was thinking of in my head)
17:02:03 <jlebon> UX wise though, apply the principle of least surprise, i still think it'd make more sense for us to ignore it
17:02:24 <travier> bgilbert: would it still be possible to override it from kargs then?
17:03:03 <bgilbert> travier: I haven't traced nmi-cmdline-reader.c, but if nm-i-g respects last-arg-wins, yes
17:03:38 <travier> It's both nice but more convoluted thus harder to figure out
17:03:50 <dustymabe> here is where I was originally going to update: ./overlay.d/05core/usr/lib/dracut/modules.d/35coreos-network/50-afterburn-network-kargs-default.conf
17:03:54 <dustymabe> https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-network/50-afterburn-network-kargs-default.conf#L7
17:04:22 <bgilbert> travier: update: yes it would
17:04:52 <dustymabe> either way let me try to come back to this next time with more information
17:05:08 <jlebon> there's a semantic difference between changing the afterburn fallback and shipping a cmdline.d dropin
17:05:10 <dustymabe> so the decision is easier to make
17:05:59 <jlebon> +1
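Editor's note: the drop-in idea discussed above relies on "last argument wins" semantics, where a default such as `rd.bootif=0` shipped in a cmdline.d file can still be overridden from `/proc/cmdline`. A minimal sketch of that precedence rule follows. This is not the nm-initrd-generator or dracut parser; the function names and the sample MAC address are hypothetical, and collapsing arguments into a dict is a simplification (repeated arguments such as multiple `ip=` are not preserved).

```python
import shlex


def parse_kargs(cmdline: str) -> dict:
    """Parse a kernel command line; a later key overrides an earlier one."""
    result = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        # Flags without '=' (e.g. "quiet") are stored with a value of None.
        result[key] = value if sep else None
    return result


def effective_kargs(dropin: str, proc_cmdline: str) -> dict:
    """Apply the drop-in defaults first, then the real command line on top."""
    return parse_kargs(f"{dropin} {proc_cmdline}")


# Hypothetical contents of a drop-in such as /etc/cmdline.d/foo.conf
dropin = "rd.bootif=0"

# PXE loader injected BOOTIF=; the user did not touch rd.bootif
default_case = effective_kargs(dropin, "ip=dhcp BOOTIF=01-52-54-00-12-34-56")
print(default_case["rd.bootif"])   # drop-in default applies

# User explicitly opts back in on the kernel command line
override_case = effective_kargs(dropin, "BOOTIF=01-52-54-00-12-34-56 rd.bootif=1")
print(override_case["rd.bootif"])  # command line wins
```

Under this model the user-visible kargs stay unchanged, while anyone who depends on `BOOTIF` can restore the old behavior explicitly.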
17:06:29 <dustymabe> #topic Release notes
17:06:50 <travier> #link https://github.com/coreos/fedora-coreos-tracker/issues/194#issuecomment-992334650
17:06:50 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/194
17:07:16 <travier> So the subject is broad. I'm suggesting we scope it to a smaller subset first
17:07:41 <travier> Improving the way we track and display what issues are fixed in which releases
17:08:23 <travier> This is partly inspired by the layout at https://www.flatcar.org/releases/
17:09:17 <dustymabe> travier: I say run with it.
17:09:26 <jlebon> i feel like we should be able to avoid any manual work to get this. e.g. just a label on tracker issues which marks it to be added to the notes
17:09:33 <travier> The idea is to list issues fixed in a release in a json file
17:09:36 <travier> Once we have that we could use the same logic to make a job that generates lists for CVEs too
17:10:26 <travier> jlebon: agree, there could be a bot acting on it
17:11:07 <dustymabe> i like the idea of using labels, but I wonder if we can get away with not creating a label for every released version ID
17:11:23 <jlebon> so e.g. the release job runs a script which collects all the issues with a certain label and auto-generates the notes to push to s3 and then drops the labels
17:11:27 <travier> We can do it manually until we make a bot. It's not that painful to update.
17:11:35 <jlebon> dustymabe: i think we'd just need one per stream
17:12:19 <travier> The idea behind a separated json stream and not just adding that to the main one is that we can update the list at any time
17:12:24 <travier> and correct things
17:12:32 <dustymabe> jlebon: yeah, I was just thinking about automation and going back later to correct things (i.e. we thought we fixed an issue, but we didn't and it's still broken)
17:12:53 <dustymabe> is there a way to link issues to the fedora-coreos-streams issue (checklist)
17:12:58 <dustymabe> and then pull the information from there?
17:13:26 <travier> We could create per-release milestones on Github
17:13:35 <dustymabe> at least in that case we have a single issue that represents a release, if we can find a way to associate other issues with it and pull that information then we'd be set
17:14:24 <travier> https://github.com/coreos/fedora-coreos-tracker/milestones
17:14:47 <jlebon> dustymabe: hmm yeah could work. we could have the job that pulls that info and converts to JSON be separate. it gets triggered by the release job, but could be rerun if we changed something
17:14:59 <dustymabe> travier: feels heavy, but maybe it could work
17:15:17 <dustymabe> either way I think we're very much in favor of the intermediate proposal (we have nothing right now)
17:15:31 <jlebon> +1
17:15:34 <dustymabe> but we're just opining about how to achieve it with least effort (which we can talk about later)
17:15:41 <travier> I don't know if we can have a bug in multiple milestones
17:15:59 <bgilbert> last time I checked, a bug can only have one milestone
17:16:15 <bgilbert> linking from the streams issue seems pretty heavy, since we'd need to switch to another repo each time
17:16:19 <travier> https://github.com/isaacs/github/issues/797
17:16:22 <travier> we can not
17:16:26 <travier> so this is not an option
17:17:01 <dustymabe> ok let's agree to discuss implementation details further outside of the meeting
17:17:22 <dustymabe> I don't think we need a #proposed #agreed for this
17:17:32 <jlebon> agreed
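Editor's note: the label-based approach floated above (a per-stream label on tracker issues, collected by a job into a JSON file that can be regenerated later) could look roughly like the sketch below. Everything here is an assumption for illustration: the label name, the release identifier, the JSON layout, and the sample issues are all made up, and the issue data is held in memory rather than fetched from GitHub.

```python
import json


def build_release_notes(issues: list, stream_label: str, release: str) -> dict:
    """Collect issues carrying `stream_label` into a release-notes record."""
    fixed = [
        {"number": issue["number"], "title": issue["title"], "url": issue["url"]}
        for issue in issues
        if stream_label in issue["labels"]
    ]
    fixed.sort(key=lambda entry: entry["number"])
    return {"release": release, "issues": fixed}


# Hypothetical tracker data; a real job would query the issue tracker instead.
sample_issues = [
    {"number": 2, "title": "example fix B", "url": "https://example.invalid/2",
     "labels": ["fixed-in/testing"]},
    {"number": 3, "title": "unrelated", "url": "https://example.invalid/3",
     "labels": ["meeting"]},
    {"number": 1, "title": "example fix A", "url": "https://example.invalid/1",
     "labels": ["fixed-in/testing", "kind/bug"]},
]

notes = build_release_notes(sample_issues, "fixed-in/testing", "EXAMPLE-RELEASE")
print(json.dumps(notes, indent=2))
```

Keeping the generator a pure function of the issue list means it can be rerun at any time to correct an entry, which matches the stated reason for keeping this data separate from the main release metadata.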
17:17:35 <dustymabe> trying to pick over the remaining meeting items
17:17:41 <dustymabe> https://github.com/coreos/fedora-coreos-tracker/issues?q=is%3Aissue+is%3Aopen+label%3Ameeting
17:17:44 <dustymabe> anything time pressing?
17:18:04 <dustymabe> if not we're going to discuss "Large and growing PXE RAM requirement"
17:19:11 <jlebon> SGTM
17:19:26 <dustymabe> #topic Large and growing PXE RAM requirement
17:19:40 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/1055
17:19:44 <dustymabe> bgilbert: you have the stage
17:20:15 <bgilbert> I discovered that the documented 3 GiB RAM requirement for PXE appended rootfs is no longer enough
17:20:27 <bgilbert> we never really understood why it needed to be that large
17:21:07 <bgilbert> I did some digging.  part of the issue, for both coreos.live.rootfs_url and appended rootfs, is that initrd / is on tmpfs and tmpfs will only use 50% of RAM by default
17:21:34 <bgilbert> but that doesn't completely explain the memory requirements
17:22:02 <dustymabe> bgilbert: how much memory of the system is reserved for the kernel? could that explain some of it
17:22:22 <bgilbert> dustymabe: not very much
17:22:26 <bgilbert> we're missing hundreds of MB
17:22:51 <bgilbert> I don't have a single concrete proposal here
17:22:54 <nemric> I've a server that only run in live env .... is there a command line to get these info for you now ?
17:23:28 <bgilbert> nemric: we have testing environments; the problem is figuring out what's going on
17:23:41 <nemric> ok
17:23:56 <bgilbert> one major piece of this is: how much do we care about the PXE appended rootfs case?
17:24:03 <bgilbert> we went to some effort to support it
17:24:16 <bgilbert> but from what I've seen, I suspect everyone just uses the rootfs_url karg
17:24:26 <bgilbert> which is also faster and more RAM-efficient
17:25:14 <bgilbert> if we deemphasize that case, we have more control, since e.g. the fetcher script could remount the tmpfs
17:25:19 <dustymabe> hmm
17:25:33 <travier> Last option: bump the requirement for the concatenate option and mention in the doc the rootfs url option as faster and more efficient?
17:25:40 <dustymabe> `rootfstype=` is a kernel argument that can be set on the pxe server side/
17:25:43 <dustymabe> side?
17:26:00 <bgilbert> travier: I mean, yeah.  it's also possible to improve the UX, so we actually tell the user they're OOM rather than failing on something random
17:26:01 <miabbott> do we know if the downstream consumers (i.e. assisted installer) are using the appended rootfs case?
17:26:44 <bgilbert> dustymabe: yeah, but see the notes in the bug.  we can't do exactly what we want, and the workaround has potential unknown consequences
17:26:49 <bgilbert> miabbott: AFAIK they are
17:26:51 <bgilbert> *not
17:27:18 <jlebon> bgilbert: that seems like it'd be a nice improvement (OOM) without too much work
17:27:48 <jlebon> personally ok with just requiring more RAM and keep supporting it
17:27:50 <bgilbert> jlebon: if we completely drop appended mode, it's not worth doing, but otherwise yes
17:27:54 <travier> If we think there are better options, I think we should emphasize those and do the minimum to keep the one we have sane?
17:28:09 <bgilbert> one trivial change we can make
17:28:10 <dustymabe> I lean towards the second half of 1. (obviously we need to test some more to see if there are side effects) and then maybe we poke the upstream PR for tmpfs to see if that can gain any traction
17:28:29 <bgilbert> is to allowlist TFTP in rootfs_url.  that allows using rootfs_url without setting up HTTP
17:28:54 <jlebon> dustymabe: meh, the risk doesn't seem worth it IMO
17:29:05 <dustymabe> what are the risks?
17:29:08 <bgilbert> we omitted that to cut down on the support matrix, since appended initrds exist
17:29:34 <travier> I don't think it's worth optimising for low RAM if we already have another, more RAM-efficient option. If you want to use less RAM, use the other option
17:29:49 <dustymabe> travier: fair
17:29:54 <bgilbert> dustymabe: at runtime we'd be reading our rootfs out of a squashfs out of a minimal ramfs that no one uses
17:30:07 <jlebon> dustymabe: our initramfs is already special and extremely complex. this would further increase the gap between what we do and everyone else does
17:30:20 <bgilbert> *ramfs implementation
17:31:13 <dustymabe> ok so current proposal is to:
17:31:14 <bgilbert> thoughts about allowlisting TFTP in rootfs_url?  it would close a functionality gap in the preferred path
17:31:34 <dustymabe> update docs to mention higher RAM reqs if you're going to concatenate and make OOM reporting better?
17:31:37 <bgilbert> it's not 100% equivalent, since you have to repeat your TFTP server address in kargs instead of leaving it implicit
17:32:00 <bgilbert> but it might help people migrate away from appending
17:32:04 <jlebon> bgilbert: sounds sane, but can you expand on the UX for that in the issue?
17:32:16 <bgilbert> sure.  it's just rootfs_url=tftp://
17:32:24 <dustymabe> :)
17:32:43 <bgilbert> working on a #proposed
17:32:54 <jlebon> ahhh, well in that case SGTM :)
17:32:58 <travier> (FYI we're past time)
17:33:10 <dustymabe> bgilbert: were the things I mentioned part of that?
17:33:13 <bgilbert> yup
17:33:16 <dustymabe> +1
17:34:08 <bgilbert> #proposed We will update our docs for the apparent new RAM requirements for PXE appended rootfs, and we'll improve OOM reporting for the appended case.  We'll also pursue supporting TFTP in the rootfs_url karg, and have the docs encourage people to use that karg when possible.
17:34:42 <bgilbert> s/use/prefer/
17:34:52 <dustymabe> "use that karg when possible" as a possible alternative to appending if they have limited RAM?
17:35:07 <bgilbert> regardless of RAM
17:35:14 <bgilbert> it's also faster and more debuggable
17:35:36 <jlebon> bgilbert: hmm, i wonder if there's a way to auto-query the IP of the server served us somehow so the UX could be even simpler
17:35:48 <jlebon> the server that* served us
17:35:51 <bgilbert> we can weaken that last part of the #proposed if desired
17:36:04 <bgilbert> jlebon: the bootloader would need to pass that info on
17:36:39 <dustymabe> I don't have strong opinions but it feels like we should just encourage people to use rootfs_url and then the rootfs_url docs can mention tftp or http
17:36:42 <travier> +1
17:36:58 <dustymabe> either way I think:
17:37:00 <bgilbert> we could get it from DHCP next-server, assuming the DHCP response doesn't change based on the client ID.  which it very well might.
17:37:01 <dustymabe> +1
17:37:05 <jlebon> bgilbert: yeah. wonder if it already does somehow. maybe some obscure ethtool knob against the interface
17:37:19 <jlebon> anyway, we can discuss this elsewhere :)
17:37:20 <jmarrero> +1
17:38:05 <bgilbert> #agreed We will update our docs for the apparent new RAM requirements for PXE appended rootfs, and we'll improve OOM reporting for the appended case.  We'll also pursue supporting TFTP in the rootfs_url karg, and have the docs encourage people to prefer that karg when possible.
17:38:12 <bgilbert> thanks all
17:38:24 <dustymabe> thanks
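Editor's note: part of the RAM requirement discussed above comes from tmpfs defaulting to a size limit of 50% of physical RAM, so whatever is unpacked into the initrd root has to fit in half of memory. A minimal sketch of that constraint follows; the payload size used here is a made-up placeholder, not a measured Fedora CoreOS figure, and as noted in the discussion this limit alone does not account for the full observed requirement.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def tmpfs_default_limit(ram_bytes: int) -> int:
    """tmpfs defaults to size=50% of physical RAM when no size= is given."""
    return ram_bytes // 2


def fits_in_tmpfs(ram_bytes: int, payload_bytes: int) -> bool:
    """True if the payload fits under the default tmpfs size limit."""
    return payload_bytes <= tmpfs_default_limit(ram_bytes)


# Hypothetical size of everything unpacked into the initrd root filesystem.
payload = 1600 * MIB

for ram_gib in (3, 4):
    ram = ram_gib * GIB
    limit_mib = tmpfs_default_limit(ram) // MIB
    print(f"{ram_gib} GiB RAM: tmpfs limit {limit_mib} MiB, "
          f"fits={fits_in_tmpfs(ram, payload)}")
```

This also illustrates why a fetcher script that can remount the tmpfs with a larger limit, as suggested for the `rootfs_url` path, has more headroom than the appended rootfs case.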
17:38:27 <dustymabe> #topic open floor
17:38:32 <dustymabe> sorry for the long meeting
17:38:36 <dustymabe> any topics for open floor?
17:38:59 <dustymabe> #info dustymabe updated the f36 changes list: https://github.com/coreos/fedora-coreos-tracker/issues/918
17:39:15 <dustymabe> i'm thinking maybe we should do a video meeting soon to go through the list
17:39:15 <jlebon> are we due for another video meeting soon?
17:39:20 <dustymabe> ha
17:39:21 <jlebon> woah :)
17:39:32 <jlebon> same second
17:39:46 <dustymabe> we could just ad-hoc schedule one for next week
17:40:04 <dustymabe> we'll make jdoss run it
17:40:08 <dustymabe> thoughts?
17:40:41 <jlebon> SGTM
17:40:43 <jdoss> Oh crap.
17:40:59 <jlebon> dustymabe: maybe not the whole meeting. leave some time for something more fun
17:41:01 <dustymabe> jdoss: don't worry, </joke>
17:41:19 <dustymabe> jlebon: IOW we should have something else on the agenda?
17:41:31 <jlebon> yeah. unless you meant it as a meeting separate from the community meeting
17:41:44 <dustymabe> was going to use the same timeslot
17:41:58 <dustymabe> ok let's discuss more offline
17:42:01 <dustymabe> #endmeeting