ansible_aws_community_meeting
LOGS
17:34:30 <abuzachis[m]> #startmeeting Ansible AWS Community Meeting
17:34:30 <zodbot> Meeting started Thu Mar 24 17:34:30 2022 UTC.
17:34:30 <zodbot> This meeting is logged and archived in a public location.
17:34:30 <zodbot> The chair is abuzachis[m]. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
17:34:30 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
17:34:30 <zodbot> The meeting name has been set to 'ansible_aws_community_meeting'
17:35:25 <abuzachis[m]> I'm not too familiar with this, but I guess we can start with the agenda https://github.com/ansible/community/issues/654
17:35:48 <abuzachis[m]> And if you have other things to bring, feel free to do so
17:35:56 <tremble> O/
17:36:00 <abuzachis[m]> #chair jill
17:36:00 <zodbot> Current chairs: abuzachis[m] jill
17:36:08 <jillr> #link https://github.com/ansible/community/issues/654
17:36:12 <jillr> #chair tremble
17:36:39 <jillr> aw, zodbot doesn't want to let me give chair, abuzachis[m] I think you'll need to do it
17:36:58 <abuzachis[m]> #chair tremble
17:36:58 <zodbot> Current chairs: abuzachis[m] jill tremble
17:37:14 <mandar242[m]> O/
17:37:14 <abuzachis[m]> #chair jillr
17:37:14 <zodbot> Current chairs: abuzachis[m] jill jillr tremble
17:37:46 <abuzachis[m]> #chair markw
17:37:46 <zodbot> Current chairs: abuzachis[m] jill jillr markw tremble
17:38:00 <tremble> #chair jillr
17:38:00 <zodbot> Current chairs: abuzachis[m] jill jillr markw tremble
17:38:42 <jillr> looking at the agenda, it looks like we have some discussion topics,
17:38:46 <tremble> I've not used zodbot in a long time
17:38:49 <jillr> #topic Clarity on update path for module_utils
17:39:19 <markw[m]> I've not ever, so all new to me :)
17:39:26 <tremble> Let's not worry about it today
17:39:44 <jillr> markw[m]: welcome, and greetings on behalf of the bot!  this was your topic I think?
17:42:26 <markw[m]> Who starts normally, or is it a case of free-for-all once we're clear on a discussion point?
17:42:45 <jillr> markw[m]: we usually do the topics in order that are in the agenda, then an open floor at the end
17:43:00 <jillr> the `#topic` command tells the bot when we change topics, those become headers in the log
17:43:28 <jillr> you had added what I think is the first topic, to talk about module utils that are consumed in community.aws from amazon.aws
17:43:28 <tremble> Looks like the 3 agenda items are: 1) module_utils (markw) 2) test stability (markw) and 3) prs to target for next release
17:45:03 <markw[m]> Cool, so on the first point I just wondered if there was a process that could be better documented, to make it easier to know what to update for module_utils and when you could consume that in the community repo
17:45:14 <jillr> typically whoever adds the topic is asked to introduce it to the meeting and then we discuss. One or two people usually run the meeting, making sure we address all the topics (that are possible in the time) and sort of guiding the conversation as needed
17:45:19 <markw[m]> as currently it's a little unclear and can be a bit confusing
17:46:15 <markw[m]> rds for example I think is only used in the community repo, but the utils are in the amazon.aws repo
17:47:40 <jillr> that one is a bit historical, the plan was always to promote the modules to amazon.aws but it's taken us rather longer than expected to do that :)
17:48:02 <jillr> I agree that we should have better documentation and policy around that though
17:48:34 <tremble> Yeah, with the perfect lens of hindsight I think dropping all of module_utils into amazon.aws was a mistake. Going forward I'd suggest new things start in community.aws and move if/when the modules move
17:49:28 <jillr> in retrospect I actually might have done a shared utility collection, like cloud.common or amazon.common, but that feels like a massive change (and more dependencies) to do now
17:49:49 <markw[m]> Cool, I think some new utils have gone into community.aws since, so it definitely makes sense to keep them closer together until they're moved or needed more widely :)
17:50:26 <markw[m]> that all makes sense to me
17:51:15 <tremble> I remember agreeing with jillr about putting them into amazon.aws originally, and there was a logic to it
17:51:37 <markw[m]> Got it ! I guess soon it will be easier when rds etc moves to amazon.aws
17:51:59 <markw[m]> Do I just move the topic on with that command?
17:52:30 <tremble> Anyone anything else to add in the topic?
17:52:31 <jillr> I think it might have been because we knew c.aws would always depend on amazon.aws, and if a module in amazon.aws did want to use a module utility, moving the util from c.aws to amazon.aws would be a massive ask
17:52:38 <jillr> and amazon.aws can not depend on c.aws
17:53:02 <jillr> do we have any decisions, votes, or actions (like documentation) that we want out of this topic before we move on?
17:53:42 <markw[m]> nope all makes sense
17:54:17 <tremble> #topic test stability (markw)
17:54:19 <jillr> markw[m]: do you mind if I edit your comment on the agenda to have checkboxes instead of bullets, so we can mark off the topics as we discuss them?
17:55:17 <jillr> oh, nm, I don't have edit permissions on the community repo! :)
17:56:15 <markw[m]> Me again :D So currently we tend to see tests intermittently failing a fair amount. Some of it can simply be AWS transient issues, but sometimes it can be a Zuul timeout or similar, meaning we have to run the recheck / regate command a fair amount. I'm just wondering if there would be a way to auto-retry (to handle transient failures), or I guess a plan to try and stabilise tests as issues occur more often
17:56:46 <markw[m]> I appreciate it's a little vague
17:57:02 <markw[m]> but it's just something I observed quite a bit over the weeks
17:57:42 <tremble> Yeah, so from experience this is a hard problem...
17:58:02 <tremble> It usually stems from a couple of issues:
17:58:11 <tremble> 1) Rate limits
17:58:29 <tremble> We don't see this as often any more, because we're not running in parallel so much.
17:59:30 <markw[m]> I can edit my comment if you like no trouble at all !
17:59:47 <tremble> 2) Race conditions
18:00:10 <tremble> This is often caused by modules trying to modify straight after creation.
18:00:40 <tremble> A lot of the time the way to deal with 1 and 2 is to use AWSRetry
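A minimal sketch of the AWSRetry pattern mentioned above, assuming the import path commonly used by amazon.aws / community.aws modules at the time (the exact module_utils location has moved between releases, so treat it as an assumption):

    # Wrap a flaky AWS call so throttling and other transient errors are
    # retried with jittered exponential backoff instead of failing the task.
    from ansible_collections.amazon.aws.plugins.module_utils.ec2 import AWSRetry

    @AWSRetry.jittered_backoff(retries=10, delay=3)
    def describe_instances_with_backoff(client, **params):
        # Paginate so large result sets are aggregated into one response.
        paginator = client.get_paginator("describe_instances")
        return paginator.paginate(**params).build_full_result()

The same decorator can also be handed to AnsibleAWSModule's client() helper (e.g. retry_decorator=AWSRetry.jittered_backoff()) so every call on that client gets retried.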
18:00:58 <tremble> Sometimes you also need waiters.
18:01:23 <markuman[m]> ...and not every boto3 service has waiters
18:01:23 <tremble> The trouble is that the issues are often inconsistent which makes testing stability fixes hard
18:01:39 <markuman[m]> I know you made a PR to implement custom waiters ....
18:02:04 <tremble> If you need help with waiters ping me.  I have some experience in building custom waiters, even for services that don't have them
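As an illustration of the custom waiters being described here, a hedged sketch using botocore's generic waiter machinery (the approach amazon.aws's module_utils takes); the waiter name, operation, and acceptors below are made up for the example, and ec2_client / route_table_id are assumed to already exist:

    import botocore.waiter

    # Data-driven waiter definition for a state that has no built-in waiter.
    waiter_config = {
        "version": 2,
        "waiters": {
            "RouteTableExists": {
                "operation": "DescribeRouteTables",
                "delay": 5,
                "maxAttempts": 40,
                "acceptors": [
                    {"matcher": "path", "argument": "length(RouteTables[]) > `0`",
                     "expected": True, "state": "success"},
                    {"matcher": "error", "expected": "InvalidRouteTableID.NotFound",
                     "state": "retry"},
                ],
            }
        },
    }

    model = botocore.waiter.WaiterModel(waiter_config)
    waiter = botocore.waiter.create_waiter_with_client("RouteTableExists", model, ec2_client)
    # Blocks until the route table is visible (or attempts are exhausted).
    waiter.wait(RouteTableIds=[route_table_id])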
18:02:33 <markuman[m]> For me, the most annoying are the backport rechecks
18:02:43 <tremble> However, I will likely be very flaky this side of Easter due to personal/family issues
18:02:48 <markw[m]> Got it, yep, I think a lot of modules have been improved to add the retries, but some are still missing, so it would be good to go through and add them where possible as it's quite quick to do. On the waiters front I'm quite new to custom ones, I've only used the default ones, so yep that sounds good
18:02:53 <jillr> more recently, we've also had some issues with the underlying nodes the test containers run on (like /tmp running out of space, so we needed to change jobs to use /var/tmp)
18:02:54 <markuman[m]> because in the original PR they succeed
18:03:18 <abuzachis[m]> We can probably come up with a list of modules hitting these timeouts more frequently and try to improve them with waiters or AWSRetry where needed.
18:03:26 <tremble> Yeah, I used to go through the "daily" tests every now and then and try and stabilise things, but we don't have them any more.
18:03:28 <jillr> and today the CI cluster is very busy, so we're looking at adding more nodes (I believe) to help with job queueing
18:04:13 <markuman[m]> PRs 1005 and 1006 currently run into timeouts... and they are just backports of the AWSRetry implementation :)
18:04:50 <markw[m]> I think those 2 might be the splitter issue, needing the PR that abuzachis has open to be backported
18:05:13 <tremble> Beware assuming the "timeout" tests are just running out of time. There is some auto-retry stuff in there to get more verbose output IIRC. Sometimes the original test leaves enough behind to break the second.
18:05:31 <tremble> And sometimes the second test doesn't have enough time because the first failed late enough
18:06:04 <markw[m]> yep that follows, especially so with the really slow tests from ec2-related / rds etc
18:06:22 <markw[m]> on the missing backoff / waiters, I could maybe build a list of modules without backoff / waiting, if that helps?
18:06:54 <tremble> If someone's able to get a list of flakes and links to the test failures that's half the work
18:06:59 <abuzachis[m]> So, the problem is that we didn't backport https://github.com/ansible-collections/community.aws/pull/986 to stable-2 and stable-3 and the CI was still running for the whole collection. Don't remember if those ones were affected by this thing.
18:07:24 <abuzachis[m]> rechecking should work since both backports have been merged
18:07:27 <markuman[m]> jillr: are the nodes rebooting frequently? that might solve the /tmp issue?
18:07:30 <tremble> abuzachis: #986 hides the issue rather than fixing the underlying cause
18:07:58 <jillr> markuman[m]: I have to be honest, I don't know  :)  someone who knows way more than me about it looked at it, and fixed it!
18:08:18 <jillr> they're shared infrastructure so I'm a little fuzzy on the details
18:08:37 <abuzachis[m]> So, we had to do that because the splitter is also used by other collections that work differently.
18:09:35 <abuzachis[m]> Adding that fake integration test suite should help for those modules that do not have integration tests at the moment. Once we cover all of them, we can remove it.
18:09:47 <abuzachis[m]> It is supposed to be a temporary solution.
18:11:38 <tremble> If someone's able to identify the flaky tests maybe we can have a mini-hackathon to try and stabilize things
18:12:01 <markw[m]> Sounds good to me I can start building a list
18:12:16 <abuzachis[m]> Awesome!
18:12:17 <markw[m]> On the more nodes, fingers crossed that helps. I wonder if using the container nodes in Zuul would help, if they could be used for the integration tests, as they are generally quicker and maybe more reliable as they are ephemeral
18:13:10 <jillr> markw[m]: the tests are containerized, iirc we get something like 25 containers at a time on a host?  (I could be misremembering though)
18:13:33 <markuman[m]> hm ok, in PR 1005 some rds test is failing
18:13:37 <markuman[m]> > 2022-03-23 11:20:45.374543 | controller | botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the CreateDBInstance operation: Invalid DB engine
18:13:54 <markuman[m]> that sounds like the entire test is wrong....
18:14:05 <markuman[m]> and not flaky
18:14:54 <tremble> Yeah, AWS have been dropping support for things recently
18:15:04 <markuman[m]> and 1006 (same backport PR) hits other errors.
18:15:11 <markw[m]> yeah, that's one of the 2 that is running all tests rather than just elb
18:15:12 <tremble> That may just need an update
18:15:16 <markuman[m]> > 2022-03-22 10:58:39.923626 | controller | TypeError: get_paginator() missing 1 required positional argument: 'client'
18:15:44 <tremble> Yeah, that's a bug :)
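For context, the TypeError quoted above is what Python raises when a helper that takes the boto3 client as its first positional argument gets called without it; a hypothetical illustration (not the actual community.aws code, names are made up):

    # Hypothetical retry-wrapped paginator helper; import path and names
    # are illustrative only.
    from ansible_collections.amazon.aws.plugins.module_utils.ec2 import AWSRetry

    @AWSRetry.jittered_backoff()
    def get_paginator(client, **params):
        # Aggregate all pages of DescribeDBInstances into one result.
        return client.get_paginator("describe_db_instances").paginate(**params).build_full_result()

    # get_paginator(Filters=filters)              # TypeError: missing 'client'
    # get_paginator(rds_client, Filters=filters)  # correct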
18:15:45 <markuman[m]> I wonder how the original PR was ever able to succeed?
18:15:55 <markuman[m]> I mean, it's just a backport PR
18:15:58 <markw[m]> huh! That is what I was just thinking
18:16:02 <abuzachis[m]> markuman: is that related in some way to cloud.py?
18:16:08 <tremble> Probably something in _utils that didn't trigger the right tests
18:17:04 <markuman[m]> it ran minutes/hours before the first backport PR
18:17:04 <markw[m]> Ahh got it thanks jillr
18:18:28 <markw[m]> I'll take a look at 1006 and see if there is any bug I introduced
18:18:46 <markuman[m]> <markuman[m]> "> 2022-03-23 11:20:45.374543..." <- nvm, that's not the failure. it was `"engine": "thisisnotavalidengine",` and ignored.
18:18:54 <tremble> :)
18:19:47 <markw[m]> markuman: Yep it is a bug I made :( not sure how it passed before
18:19:57 <markuman[m]> ok... but that's off-topic if we're digging deeper into 1005/1006 now
18:19:57 <tremble> I suggest markw (and anyone else) tries to build a list, and we come back next time and either try to arrange a hackathon or discuss causes
18:20:09 <markw[m]> Yep I'll start making a list !
18:20:38 <abuzachis[m]> Do we have anything else to add here or can we move next?
18:20:40 <jillr> #action markw[m] to make a list of modules facing recurring CI issues
18:20:43 <jillr> thanks markw[m]!
18:21:14 <markuman[m]> markw[m]: just a note. we can just close the backport PR. currently it's just in the main branch and not relevant for 3.2.0/2.4.0 imo
18:21:54 <markw[m]> sounds good
18:22:35 <markw[m]> #topic PRs for next v3 release
18:23:00 <markuman[m]> #chair markuman
18:23:28 <tremble> You generally can't #chair yourself :)
18:23:51 <jillr> #chair markuman[m]
18:23:51 <zodbot> Current chairs: abuzachis[m] jill jillr markuman[m] markw tremble
18:23:56 <jillr> sorry we missed you  :)
18:24:10 <jillr> #topic PRs for next v3 release
18:24:54 <markw[m]> Just a general one I added for some PRs that would be good to get over the line and merged for v3. I think most are basically ready, and I'm not sure when v3 should be released, but they would be good to be included: https://github.com/ansible-collections/community.aws/pull/972
18:24:58 <markw[m]> https://github.com/ansible-collections/community.aws/pull/963
18:25:05 <markw[m]> https://github.com/ansible-collections/community.aws/pull/940
18:25:13 <abuzachis[m]> I guess we can use "#info"?
18:26:10 <markuman[m]> markw: in four days https://github.com/ansible-collections/community.aws/issues/890#issuecomment-1065404882
18:26:15 <tremble> Yup, I'd suggest #info for each, and if folks have time, please go take a look :)
18:27:01 <abuzachis[m]> #info https://github.com/ansible-collections/community.aws/pull/973
18:27:02 <tremble> #info https://github.com/ansible-collections/community.aws/pull/963
18:27:16 <abuzachis[m]> This one has been merged. 🎉
18:27:37 <markuman[m]> maybe in 827 the first-time contributor lost motivation?
18:28:12 <markw[m]> It looks like it was close
18:28:44 <tremble> It happens, does it just need minor docs fixes?
18:28:45 <markuman[m]> no github activity this year ...
18:28:57 <abuzachis[m]> #info https://github.com/ansible-collections/community.aws/pull/827
18:29:07 <tremble> Hmm, and a rebase
18:29:16 <jillr> I have to run for another meeting, thank you everyone for showing up and participating!
18:29:43 <markuman[m]> there was a "purge" discussion. that's a bit more than doc fixes I guess
18:30:24 <markw[m]> Yep, the purge option made sense. I think that and enabling the tests is all that is left for that PR
18:30:45 <markuman[m]> I have to put my son to bed. See you
18:30:58 <markw[m]> Have a good evening !
18:31:01 <abuzachis[m]> Thank you markuman and jillr
18:31:32 <markw[m]> and thank you! Have a good afternoon
18:32:40 <abuzachis[m]> 963, I guess, is ready to be merged.
18:34:01 <markw[m]> yep  it looks good
18:34:28 <abuzachis[m]> How would we like to proceed with 827?
18:34:44 <tremble> Someone willing to take it over to get it past the finish line?
18:35:14 <markw[m]> Happy to do that as it would be a great change to get in
18:35:40 <abuzachis[m]> Awesome! Thank you markw
18:35:57 <abuzachis[m]> I guess there is only one PR missing on our agenda
18:36:03 <abuzachis[m]> #info https://github.com/ansible-collections/community.aws/pull/940
18:36:21 <markw[m]> Not sure if I can make the cutoff on 827 but I can definitely carry it on
18:36:38 <markw[m]> and get it ready
18:37:35 <tremble> I wouldn't worry too much about missing the cutoff, we're doing much better at getting releases out, so it's (hopefully) not like you'll need to wait 6 months for the next release...
18:38:09 <markw[m]> :D cool
18:38:53 <markw[m]> on that last PR, 940, it's quite a big one as I had to refactor the module quite a bit to support the extra notification targets, but I think it's all good to go if people are happy with it
18:38:55 <tremble> We're at time, and folks have had to leave (and I should too); anyone got other urgent business?
18:39:31 <markw[m]> yep that's fine :) thanks all
18:39:52 <abuzachis[m]> Yes, the PR looks good to me.
18:40:02 <abuzachis[m]> Thank you very much for joining.
18:40:14 <abuzachis[m]> I guess we can stop the meeting now.
18:40:51 <tremble> Have a good rest-of-day all.
18:41:16 <markw[m]> abuzachis: No worries, happy to join and help out! And yourselves!
18:41:36 <abuzachis[m]> Thank you markw
18:41:49 <abuzachis[m]> Have a nice rest of day everyone!
18:41:54 <markw[m]> does the meeting need a command to end?
18:42:02 <abuzachis[m]> #endmeeting