gluster_community_meeting_20170104
LOGS
12:00:23 <kshlm> #startmeeting Gluster community meeting 20170104
12:00:23 <zodbot> Meeting started Wed Jan  4 12:00:23 2017 UTC.  The chair is kshlm. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:23 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
12:00:23 <zodbot> The meeting name has been set to 'gluster_community_meeting_20170104'
12:00:29 <kshlm> Hi all!
12:00:38 <kshlm> #topic Rollcall
12:00:57 * ndevos _o/
12:01:24 <kshlm> ndevos, You're here after a **really** long time
12:01:43 * shyam is here
12:01:45 <ndevos> kshlm: yeah, travelling and holidays got in my way
12:02:16 <kshlm> shyam, hey!
12:02:29 * Saravanakmr is here
12:02:43 <kshlm> Hey Saravanakmr!
12:02:59 <kshlm> Also, Happy New Year everyone!
12:03:01 <nigelb> o/
12:03:17 <kshlm> I'll wait for a couple more minutes before I start.
12:03:24 <nigelb> Sorry about skipping out on a previous meeting. I had a family emergency.
12:04:13 <kshlm> nigelb, You didn't miss much. We haven't had any meetings for the past 4 weeks.
12:04:21 <nigelb> Oh, hehe.
12:04:28 <nigelb> well, I've added a bunch of infra updates to this one.
12:04:39 <nigelb> And I have a few items to introduce for discussion.
12:04:56 <kshlm> nigelb, Awesome.
12:05:21 <kshlm> If anyone else wants to add their updates to the meeting pad, you can do it now as well.
12:05:40 <kshlm> The pad is at https://bit.ly/gluster-community-meetings
12:06:04 <kshlm> So let's begin the first meeting of the new year
12:06:15 * rastar is here
12:06:18 <kshlm> #topic STM and backports
12:06:26 <kshlm> shyam, You're up.
12:06:33 <shyam> ok :)
12:06:57 <shyam> So, I wanted to discuss or bring to light that a lot of backports are happening to the 3.9 branch, which is an STM release
12:07:22 <shyam> My understanding is that it gets backports based on user requests, and not general bugfixes the way an LTM release does
12:07:41 <shyam> So, is there a gap in my understanding, or are people doing more work by backporting stuff to the STM release?
12:07:51 <shyam> thoughts?
12:08:17 <ndevos> not sure, but we should only backport bugfixes that have a noticeable effect for users
12:08:39 <atinmu> shyam, I had a different understanding which was probably incorrect, I thought we'd continue to have minor updates on 3.9 till it expires
12:09:00 <ndevos> in general, I suggest to have backports to stick to http://www.gluster.org/pipermail/maintainers/2016-May/000706.html
12:09:07 <nigelb> Does it make sense to close branches for open merges after release? And have the release manager(s) hit the submit button to actually merge changes? That should prevent accidental backports.
12:09:08 <shyam> yes, we will have minor updates to 3.9, ideally 2 in number
12:09:36 <atinmu> shyam, then the backports are valid right? (if they are bug fixes)
12:10:10 <shyam> ndevos: agree with your mail, I am further stating that STM gets backports for only those issues that users report and want to continue testing and hence need the fix
12:10:38 <shyam> atinmu: it is an STM, so why backport all/any bugfixes that we backport to 3.8 as well?
12:10:40 <ndevos> shyam: well, anything that gets backported to 3.8 or 3.7 should also be backported to 3.9
12:10:59 <ndevos> shyam: otherwise users that upgrade from 3.7/3.8 to 3.9 could run into regressions
12:11:09 <shyam> isn't an STM for testing and providing feedback on a short release that isn't supported for long?
12:11:17 <shyam> STM is meant for production?
12:12:18 <ndevos> users should be able to run STM in production, they may want to have some volumes with new features for testing, while still serving others more conservatively (is that a word?)
12:12:45 <nigelb> At the very least, we shouldn't bring down their test cluster.
12:12:58 <jdarcy> I think any release is meant for production.  If they want to live life on the edge they can build from master.
12:13:05 <ndevos> we don't suggest that users run an STM in production, it'll require them to upgrade within 3 months again, and that is probably not what most want from a storage environment
12:13:26 <nigelb> I believe the "less work" is defined as "we'll only do backports for 3 months"
12:13:31 <rastar> i agree with what jdarcy says
12:13:33 <shyam> Hmmm... that is not the perspective I had, I thought an STM is to release features quickly to test, and LTM is a more maintained release
12:13:33 <nigelb> rather than "we'll only backport a few things".
12:14:10 <rastar> every release is supported, STM is supported for only 3 months
12:14:19 <shyam> Well, most or all of you think otherwise, so let's proceed to the next agenda item
12:14:30 <shyam> otherwise to what I had in mind, that is :)
12:15:10 <kshlm> shyam, Okay then.
12:15:11 <atinmu> then the question is when is 3.9.1?
12:15:22 <shyam> 1 month from 3.9 right?
12:15:39 <atinmu> it's been 2 months I guess
12:15:57 <kshlm> The release was announced end of November.
12:16:21 <kshlm> And there have been 95 changes merged in since.
12:16:51 <kshlm> But we don't have the release-maintainers around to give information
12:17:27 <kshlm> We'll take that up later on the maintainers list.
12:17:41 <atinmu> thanks kshlm
12:17:51 <kshlm> #action Need to find out when 3.9.1 is happening
12:18:08 <kshlm> Onto the next topic.
12:18:20 <kshlm> #topic A common location for testing-tools
12:18:44 <kshlm> I added this topic on behalf of ShwethaHP
12:18:58 <kshlm> Some info about this,
12:19:08 <kshlm> QE tests being ported upstream with Glusto use some custom tools
12:19:15 <kshlm> They need to be provided from a well known location owned by the community
12:19:23 <kshlm> Currently, one tool (arequal) is being hosted in copr at https://copr.fedorainfracloud.org/coprs/nigelbabu/arequal/
12:19:44 <kshlm> I talked about this with nigelb before the meeting.
12:20:04 <nigelb> (That was the cheapest way in terms of time and effort to get a repo without hosting it ourselves)
12:20:14 <kshlm> And our opinion is that we can host the tools in a community accesible way, on copr.
12:20:33 <kshlm> But we need someone to maintain the packages.
12:20:39 <ndevos> are there plans to push those tools into distributions properly, like dbench and probably others?
12:21:00 <nigelb> I don't see a large value in distributing this into distributions.
12:21:10 <kshlm> I thought dbench was in distributions already.
12:21:12 <nigelb> It's used to test a very specific gluster usecase.
12:21:22 <nigelb> Practically most tools we use are already in distributions.
12:21:27 <nigelb> dbench/fio/etc.
12:21:36 <nigelb> The exceptions are arequal and smallfiles.
12:21:57 <nigelb> arequal is slightly more icky because it's in C. It's harder to just clone and run. It needs to be compiled.
12:22:04 <ndevos> smallfiles is not gluster specific, it would probably be good to have it in distributions?
12:22:08 <nigelb> I built packages so we don't have to compile it for every test.
12:22:16 <shyam> So we need packages, or git-clone/build/install-like schemes, for this
12:22:18 <nigelb> Possibly, but we don't intend to put smallfiles on copr.
12:22:35 <nigelb> If it goes into a repo, it's going upstream rather than copr.
12:22:43 <nigelb> upstream = fedora/epel
12:22:52 <ndevos> if arequal is in the distribution, it'll get compiled for all its supported architectures and versions automatically, that saves recurring work?
12:23:02 <nigelb> There is no recurring work?
12:23:10 <nigelb> The tool hasn't been touched in years.
12:23:20 <ndevos> well, rebuild of the packages is some action
12:23:25 <nigelb> There's a redhat internal package, but there wasn't one available to the community.
12:23:45 <nigelb> If you want to take that up, please do.
12:23:51 <nigelb> I don't see value in distributing arequal
12:23:56 <nigelb> when it has such a narrow use case.
12:25:01 <kshlm> nigelb, Is it just these two tools that are needed?
12:25:05 <shyam> knowing some parts of arequal I would agree with nigelb
12:25:12 <nigelb> As far as I know, yeah.
12:25:12 <jdarcy> If there's an internal package, that can probably be massaged into an external one more easily than starting from scratch.
12:25:29 <nigelb> I started from scratch and already finished :)
12:25:40 <jdarcy> Oh.
12:25:48 <ndevos> I don't know how arequal works, I've seen test results with it, but never looked at the details
12:26:04 <jdarcy> I think I looked at it once, then suppressed the memory.
12:26:15 <shyam> lol!
12:26:25 <ndevos> if it is gluster specific, maybe it should be part of the extras/ directory instead of separate?
12:26:32 <kshlm> If we don't need to provide a package for smallfiles, then what we have now in copr should be enough.
12:26:58 <nigelb> I specifically didn't create a gluster account on copr because that makes it harder to track *who* maintains it.
12:27:28 <kshlm> nigelb, I think you can have group accounts tied to FAS groups.
12:27:30 <nigelb> But I'm also not a copr expert, so find me after the meeting if you have strong opinions on that.
12:27:51 <ndevos> I think we as a Gluster Community should work together with distributions and provide users/maintainers of those distributions easy access to tools for testing, a 2nd repository is less helpful
12:28:46 <kshlm> ndevos, Right now these tools are only used by our tests, which run on CentOS/RHEL.
12:29:20 <ndevos> kshlm: sure, and we do have users running those tests on their environment every now and then
12:29:23 <kshlm> So I don't think there is a need to get them packaged for everything, unless we plan to test on them.
12:29:31 <sankarshan> Is the point of discussion that a copr served build is not "sanctioned" enough and thus needs to be under the Gluster org? In that event this is essentially a question of who continues to maintain and build packages for consumption
12:29:54 <ndevos> I do not think it is about 'us' testing it, it is about enabling others to test as well
12:30:36 <nigelb> Perhaps it's useful to actually get to a point where we test with arequal ourselves before investing more time and effort into upstreaming it.
12:30:47 <nigelb> On a quick glance, the entire C program could be replaced by a python script.
12:31:10 <nigelb> considering glusto is written in python, the program might be entirely redundant
12:31:40 <nigelb> The easiest way to distribute it is by distributing along with the test that use it.
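As a rough illustration of nigelb's point above that the core of arequal could be a small Python script, here is a minimal sketch of an arequal-style tree checksum. It is an assumption-laden example, not the actual arequal-checksum implementation:

```python
#!/usr/bin/env python3
"""Minimal sketch: hash file contents plus relative paths so two directory
trees (e.g. two replicas or two mounts) can be compared for equality."""
import hashlib
import os
import sys

def tree_checksum(root):
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # make the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    digest.update(block)
    return digest.hexdigest()

if __name__ == "__main__":
    print(tree_checksum(sys.argv[1]))
```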
12:32:10 <ndevos> still, if arequal is used in our testing and is gluster specific, it is easy enough to include it in the glusterfs sources under extras/ or tests/utils/ or something
12:33:26 <nigelb> If you have time to do this, have gluster nightly builds also produce arequal builds, please go ahead.
12:33:59 <ndevos> anyway, smallfiles should probably be packaged for Fedora and EPEL - to me that is proper community collaboration
12:34:34 <ndevos> if arequal is only useful for gluster, it can be part of the gluster specific test-suite, no need to have a arequal package at all?
12:35:17 <Saravanakmr> ndevos, nigelb arequal is used as part of geo-rep tests by developers - it's good to have it as part of glusterfs sources.
12:35:36 <ndevos> we compile many different c programs during testing already, adding arequal to that does not look like a problem to me
12:35:55 <shyam> Where are these geo-rep tests that use arequal?
12:36:11 <Saravanakmr> to check for data checksum after geo-rep sync.
12:36:26 <nigelb> we already do have it as part of glusterfs sources
12:36:28 <nigelb> See ./tests/utils/arequal-checksum.c
12:36:56 <nigelb> I suspect the code is the same.
12:37:06 <Saravanakmr> nigelb, ok..I remember pulling it from github earlier - Thanks!
12:37:34 <nigelb> It was moved for the exact reason we're talking about.
12:37:58 <nigelb> The case here is we need it as a binary when testing with glusto.
12:38:13 <nigelb> for which having a tiny package made sense for ease of install.
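To make that concrete, a hedged sketch of the kind of glusto-side check Saravanakmr described (comparing arequal-checksum output for a geo-rep master mount and a slave mount) might look like the following; the mount points and the `-p <path>` flag are assumptions for illustration, not details from this meeting:

```python
#!/usr/bin/env python3
"""Sketch: verify geo-rep sync by comparing arequal-checksum output on the
master and slave mounts. Assumes an arequal-checksum binary on PATH (e.g.
installed from the copr package) and that it accepts `-p <path>`."""
import subprocess
import sys

def checksum(mount):
    result = subprocess.run(["arequal-checksum", "-p", mount],
                            capture_output=True, text=True, check=True)
    return result.stdout

def main(master_mount, slave_mount):
    if checksum(master_mount) == checksum(slave_mount):
        print("geo-rep sync verified: checksums match")
        return 0
    print("checksum mismatch between master and slave", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main("/mnt/master", "/mnt/slave"))  # hypothetical mount points
```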
12:38:36 <ndevos> so we could just make it part of the glusterfs-devel package or another one so that others can use the tool too?
12:39:07 <nigelb> if someone wants to take an action item to do that and port the changes to all branches, that would solve this, yes.
12:39:25 <kshlm> We have a glusterfs-tests package.
12:39:42 <kshlm> That could contain this.
12:39:46 <ndevos> kshlm: yes, but that does not get built in distributions :)
12:40:35 <kshlm> Okay.
12:40:56 <kshlm> But right now, I'm just okay with tiny package that nigelb built.
12:41:04 <kshlm> That should do for now.
12:41:04 <shyam> So we convert this into an AI for someone to look at the what and how and move on? Just a suggestion, unless there is more to discuss here...
12:41:17 <kshlm> If we need to visit this again later we can.
12:41:35 <ndevos> someone should just file a bug for getting arequal included in one of the glusterfs packages and we'll get it done
12:42:18 <shyam> I'll file the bug, kshlm add that AI to me
12:42:46 <kshlm> #action shyam will file a bug to get arequal included in glusterfs packages
12:42:51 <kshlm> Thanks shyam.
12:43:02 <kshlm> Onto the last topic for the day.
12:43:16 <kshlm> #topic Developer workflow problems
12:43:17 <nigelb> And that's me o/
12:43:49 <nigelb> Context: We created a separate branch for fb to land patches and gave them exceptions for regression runs.
12:44:16 <nigelb> This has actually pointed out that our developer workflow is less than ideal.
12:44:23 <nigelb> this is bad and we should be better.
12:44:52 <nigelb> We have tests that take 4h+ and fail intermittently.
12:45:15 <nigelb> Around releases, this adds to the frustration as well
12:45:21 <nigelb> when someone is fixing up that final patch.
12:45:31 <nigelb> We've more or less grown accustomed to it over time.
12:45:51 <nigelb> What we haven't noticed is how unfriendly our process is to contributors.
12:45:57 <kshlm> I've previously mentioned several times that IMO running regression for every patch is a pain.
12:47:12 <nigelb> I'm happy to help solve this with infra help.
12:47:17 <nigelb> But this is not something I can lead.
12:47:48 <ndevos> what kind of changes do you propose?
12:48:02 <shyam> There is much to desire here, it sort of starts all the way from filing multiple bugs for the same issue -> waiting for regressions -> rinse/repeat for each patch. Maybe the current focus is just the regressions?
12:48:14 <nigelb> Honestly, I'd like to hear what everyone thinks as the best way forward.
12:48:32 <nigelb> I don't have a solution. How can I enable this to be fixed?
12:48:36 <ndevos> we're (slowly) working on fixing the spurious regression failures, and we could run tests per component/tests-directory in parallel
12:48:45 <nigelb> Is that safe?
12:49:05 <ndevos> per directory tests? that should be on different systems
12:49:17 <nigelb> Don't we have a lot of directories?
12:49:41 <kshlm> rastar got this working.
12:49:55 <nigelb> so one idea is running tests in parallel. I'm talking to rastar on Friday to hear what he'd done and see if we can run it upstream.
12:50:28 <ndevos> maybe it is all part of the run-tests-in-vagrant.sh script?
12:50:37 <kshlm> In my view, we should be doing this smarter. Not just throw more resources at it.
12:50:50 <nigelb> Would taking a long look at the tests help?
12:51:14 <nigelb> Possibly looking at tests that take the longest time?
12:51:14 <kshlm> We should do away with per patch regression runs, and instead have regular regression runs on a branch.
12:51:19 <shyam> Another is CI that runs for each patch as kshlm states, but when that fails... I am not sure how we can handle that.
12:51:36 <nigelb> The problem is without fixing intermittent failures, the regular runs make things worse.
12:51:57 <nigelb> Because now you're not sure if you've got a real failure or if it's a one off.
12:51:58 <shyam> yes, agree
12:52:25 <nigelb> I thought throwing more resources would help.
12:52:35 <nigelb> But running the tests on centos CI does not reduce intermittency or time taken.
12:52:40 <shyam> We need to take a look at the tests, we have had such discussions in the past, and to improve tests for a release etc. so I think that is surely something we should do in addition to other ideas.
12:52:52 <jdarcy> IMO we should get rid of tests that create more interference than information, which is about half.  Then we should run the remainder in parallel, in decreasing order of importance / information content
12:53:19 <kshlm> Running tests regularly should give us a better idea about what the intermittent failures are.
12:53:42 <shyam> kshlm: we do that already, no? the posts to maintainers are from such runs if I am not wrong.
12:53:53 <nigelb> Indeed, we do.
12:53:53 <jdarcy> We should also consider running low-information-content tests periodically instead of per-commit.
12:54:03 <nigelb> I'm guessing we don't have this meta-data?
12:54:27 <kshlm> I'd forgotten about that.
12:55:03 <nigelb> For now my suggestion is, figuring out what it would take to make wait time for centos regression under 1h.
12:55:12 <nigelb> Right now it takes 4h+
12:55:19 <nigelb> which basically means half a working day.
12:55:36 <ndevos> nigelb: the one job in the CI that gets started in a loop, your fstat script can use the statistics from that, right?
12:55:47 <nigelb> Yep.
12:55:55 <nigelb> I was going to show that one separately.
12:56:07 <ndevos> cool, that would be valuable
12:56:15 <jdarcy> nigelb: I'd halve that.  With tests running in parallel, we should be able to get under *half* an hour.
12:56:46 <nigelb> What's our next action to this step?
12:57:00 <nigelb> Should we talk about this offline? Or bring it up on the list?
12:58:22 <kshlm> We should try rastar's parallel method first to speed up things, follow that up by doing a smarter CI.
12:58:22 <jdarcy> I think we need to bring a specific plan to the list.
12:58:23 <ndevos> I'd suggest to think about how to run tests in parallel as first step, then see how many tests we should remove/disable/select/.. for running on each posted change
12:59:27 <nigelb> OK. I'll take an item to talk to rastar and figure out what we know.
12:59:39 <rastar> running tests in parallel on the same machine takes a lot of work
12:59:41 <rastar> see https://cwrap.org/socket_wrapper.html
12:59:44 <kshlm> We need to reduce the time taken to identify the bad tests, before we do something about it. Parallelizing will help.
12:59:56 <nigelb> rastar: The suggestion is to use multiple machines.
12:59:56 <rastar> that is how samba does parallel tests on the same machine ^^^
13:00:03 <nigelb> Our existing machines have plenty of capacity on a normal day.
13:00:10 <rastar> nigelb: multiple machines would make it easy
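As a sketch of the multiple-machines idea being discussed, something along these lines could fan the tests/ subdirectories out across a few builders. The builder hostnames, the checkout path, and the assumption that run-tests.sh accepts individual .t files as arguments are all illustrative, not decisions from this meeting:

```python
#!/usr/bin/env python3
"""Sketch: split the regression suite per tests/ subdirectory and run each
group on its own builder over ssh, in parallel."""
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

BUILDERS = ["builder01.example.org", "builder02.example.org"]  # hypothetical

def test_dirs(repo="."):
    # one work unit per tests/ subdirectory (basic/, bugs/, features/, ...)
    return sorted(d for d in Path(repo, "tests").iterdir() if d.is_dir())

def partition(items, n):
    # round-robin split into n groups, one group per builder
    return [items[i::n] for i in range(n)]

def run_on_builder(builder, dirs):
    tests = " ".join(str(t) for d in dirs for t in sorted(d.glob("**/*.t")))
    cmd = ["ssh", builder, f"cd /root/glusterfs && ./run-tests.sh {tests}"]
    return builder, subprocess.run(cmd).returncode

if __name__ == "__main__":
    groups = partition(test_dirs(), len(BUILDERS))
    with ThreadPoolExecutor(max_workers=len(BUILDERS)) as pool:
        for builder, rc in pool.map(run_on_builder, BUILDERS, groups):
            print(builder, "OK" if rc == 0 else f"FAILED (rc={rc})")
```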
13:00:30 <rastar> I really like kshlm's suggestion on not running tests per patch
13:00:38 <rastar> have a build train like a release train
13:00:42 <ndevos> rastar: cwrap and fuse mounts? I guess that can get tricky :)
13:01:01 <jdarcy> We also need a way to measure coverage/overlap.  I'm sure we have a hundred tests which test *nothing* that some predecessor hadn't covered already.  Total waste of time.
13:01:05 <nigelb> To get there, we need to fix the intermittent failure issue.
13:01:07 <rastar> ndevos: hence the point on it being difficult
13:01:43 <rastar> ok, nigelb and I would work on getting a doc out after friday meeting
13:01:48 <rastar> we can continue from there
13:01:53 <nigelb> Can someone from the devs rotate every month to chase down intermittent failures?
13:02:04 <nigelb> Because it really needs someone to drive it.
13:02:18 <nigelb> I'm happy to help with data and any logs you need.
13:02:47 <rastar> ndevos:  that is the reason I say it is very difficult
13:03:09 <ndevos> rastar: yes, I got that :)
13:03:13 <nigelb> It's not *fixing* the failure. It's chasing down who will fix and filing the bug and following up.
13:03:28 <jdarcy> Would it be fair to say that any "recheck" indicates a failure believed to be spurious?
13:03:46 <rastar> nigelb: devs do chase down. But usually all devs become busy with their *own* work closer to release.
13:03:50 <nigelb> well, it could be someone writing a WIP patch :)
13:03:50 <ndevos> nigelb: reporting a bug for the most frequently occurring failure(s) should be sufficient to get the component developers' attention
13:03:52 <rastar> jdarcy: yes
13:04:04 <jdarcy> If so, then we should separate those from failures that *do* result in code change, and log them all in one place as a work list.
13:04:26 <nigelb> Shall I then convert the monthly email to 5 bugs?
13:04:50 <ndevos> nigelb: I think that would be a good idea
13:04:58 <shyam> Should we increase the frequency? To tackle this faster?
13:05:05 <nigelb> biweekly?
13:05:09 <shyam> Better
13:05:38 <ndevos> maybe limit to x bugs per run?
13:05:48 <ndevos> bugs per component that is
13:06:17 <nigelb> So, I'm looking at http://fstat.gluster.org/weeks/2
13:06:22 <nigelb> There's just one failure.
13:06:31 <nigelb> okay, 2 failures that I'd file a bug.
13:06:53 <ndevos> good, no problem there then
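A small sketch of what "convert the monthly email to 5 bugs" could look like in practice: group the reported regression failures by test and keep only the most frequent few per period. The CSV layout used here is a hypothetical export, not the actual fstat.gluster.org format:

```python
#!/usr/bin/env python3
"""Sketch: pick the top failing tests from a failure report so each period
produces at most a handful of bugs."""
import csv
from collections import Counter

MAX_BUGS_PER_PERIOD = 5  # "convert the monthly email to 5 bugs"

def top_failures(report_csv):
    counts = Counter()
    with open(report_csv, newline="") as f:
        for row in csv.DictReader(f):      # expects a 'test' column (assumed)
            counts[row["test"]] += 1
    return counts.most_common(MAX_BUGS_PER_PERIOD)

if __name__ == "__main__":
    for test, hits in top_failures("regression-failures.csv"):
        print(f"would file: spurious failure in {test} ({hits} hits this period)")
```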
13:07:30 <nigelb> we're 7 minutes past time.
13:07:43 <kshlm> I didn't notice that.
13:07:45 <shyam> Bug for packaging arequal-checksum, here, https://bugzilla.redhat.com/show_bug.cgi?id=1410100
13:08:06 <ndevos> thanks shyam!
13:08:09 <nigelb> I'll bring this up again on the maintainers meeting for the audience there as well.
13:08:18 <kshlm> nigelb, Thanks.
13:08:41 <kshlm> I'll end the meeting now.
13:08:46 <jdarcy> Just looking at the failwhale results, I immediately realized that I was skipping netbsd failures and concentrating first on failures that happened for multiple patches (not just successive versions of the same patch).
13:09:15 <nigelb> the quota-rename one really needs to be disabled for netbsd.
13:09:21 <nigelb> until someone fixes it.
13:09:58 <kshlm> Now I'm actually ending it. Please continue your conversations on #gluster-dev
13:10:03 <kshlm> #endmeeting