Infrastructure meeting 2014-10-09: full IRC log
18:00:03 <nirik> #startmeeting Infrastructure (2014-10-09)
18:00:03 <zodbot> Meeting started Thu Oct  9 18:00:03 2014 UTC.  The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:03 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:04 <nirik> #meetingname infrastructure
18:00:04 <zodbot> The meeting name has been set to 'infrastructure'
18:00:04 <nirik> #topic aloha
18:00:04 <nirik> #chair smooge relrod nirik abadger1999 lmacken dgilmore mdomsch threebean pingou puiterwijk
18:00:04 <zodbot> Current chairs: abadger1999 dgilmore lmacken mdomsch nirik pingou puiterwijk relrod smooge threebean
18:00:15 * lanica is here for the infra meeting.
18:00:21 * danielbruno here
18:00:22 * pingou 
18:00:28 * tflink is here
18:00:30 * bwood09_ here
18:00:57 * lmacken 
18:01:22 * mpduty is here
18:01:32 * threebean is here
18:01:35 * roshi lurks
18:01:49 <nirik> welcome everyone.
18:02:03 <smooge> hi
18:02:04 <nirik> #topic New folks introductions and Apprentice tasks
18:02:14 <nirik> any new folks like to introduce themselves?
18:02:23 <nirik> Or apprentices with questions/comments/ideas?
18:02:35 * puiterwijk is here, but busy (as announced)
18:02:52 <nirik> note: I'm going to be doing my monthly cleanup of the apprentice group later today... so if you haven't sent in your monthly status email, please do so asap. ;)
18:03:28 <danofsatx-dt> I am here, busy installing openstack
18:04:07 * oddshocks is here
18:04:13 <danofsatx-dt> I did have one question for the fedora cloud folks - what is your management/front end? is it all CLI virsh commands, or is there a web interface somewhere?
18:04:34 * danofsatx-dt checks his sent folder for nirik
18:04:52 <nirik> danofsatx-dt: we have a dashboard (horizon?) for one-off things, but we use ansible to manage the persistent instances.
18:05:05 <nirik> ansible spins up the instance and configures it as needed.
18:05:06 <danofsatx-dt> ok, that's what I wanted to check
18:05:14 <vgologuz> nirik, i'm not sure if it's the right time to ask, but still: please point me to where to look/ask about backups. I've found out that copr has no backups in fedora-infra; mirek did them by hand. I'm going to add a new component to copr - a package signer - and it will surely need its gpg keys backed up (in a secure way, btw)
18:05:27 <nirik> http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/
18:05:29 <danofsatx-dt> a lot of the stuff I'm deploying is one-off instances
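(A minimal sketch of that ansible-driven flow, for reference; the inventory and playbook paths below are illustrative, not the actual layout of the infra ansible repo:)

    # Run the group playbook for a persistent instance: the provisioning tasks
    # create the cloud instance if it does not already exist, then configure it.
    # (inventory and playbook path are hypothetical)
    ansible-playbook -i inventory playbooks/groups/copr-backend.yml

    # Truly one-off instances can instead be started by hand from the dashboard
    # or the nova CLI and thrown away afterwards, e.g. (image name hypothetical):
    nova boot --image fedora-20 --flavor m1.small scratch-instance-01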
18:06:16 <nirik> vgologuz: yeah, we aren't doing backups of it now... but it would be good to do so. Can you email me (or perhaps make a ticket) with what things should be backed up and I can set it up.
18:06:37 <nirik> we have a backup server; it would connect and use rdiff-backup to back up whatever directory trees/volumes we want.
18:07:00 <nirik> I guess we should back up all the rpms/repos too?
18:07:17 <vgologuz> and second question, does fedora-infra have an instance of cacti/zabbix to send custom monitoring stats to (e.g. length of the build queue)?
18:07:42 <vgologuz> nirik, and the DB, I think
18:08:07 <nirik> vgologuz: yeah, db too. :) we should set up a cron to dump the db to a file and back that up along with the keys and rpms.
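(A minimal sketch of what those pieces could look like, assuming a PostgreSQL copr database and an rdiff-backup pull from the backup server; the database name, hostnames and paths are illustrative:)

    # on the copr host, an /etc/cron.d entry: nightly dump of the DB into a
    # directory the backup run will pick up (paths and db name hypothetical)
    0 2 * * * postgres pg_dump -Fc coprdb > /backups/db-copr/coprdb-$(date +\%F).dump

    # on the backup server: pull the db dumps, the signing keys and the rpm/repo trees
    rdiff-backup root@copr-be.example.org::/backups/db-copr /fedora_backups/copr-be/db-copr
    rdiff-backup root@copr-be.example.org::/var/lib/copr-keys /fedora_backups/copr-be/keys
    rdiff-backup root@copr-be.example.org::/var/lib/copr/public_html /fedora_backups/copr-be/repos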
18:08:11 <nirik> vgologuz: we have nagios...
18:08:18 <nirik> and collectd
18:08:25 <nirik> depending on if you want to monitor, or alert
18:08:32 <vgologuz> i think nagios is only about critical states?
18:08:41 <nirik> yeah.
18:09:40 <nirik> vgologuz: sure. so, if you could file tickets we can work on it, or you can do so. ;)
18:10:01 <vgologuz> ok, i will review copr and file a ticket about backups
18:10:05 <nirik> #info copr backups, monitoring and alerting work coming up.
18:10:07 <vgologuz> haven't heard about collectd, where should i look in infra?
18:10:20 <vgologuz> except collectd.org
18:10:27 <nirik> http://admin.fedoraproject.org/collectd/
18:10:36 <nirik> it's also in ansible (configuring and setup). it's a role.
18:11:08 <nirik> it collects normal stats... load, cpu, etc... and we can make plugins for extra stuff we want.
18:11:27 <vgologuz> thanks, i'll read up on plugins
18:11:37 <nirik> .tiny https://admin.fedoraproject.org/collectd/bin/index.cgi?hostname=busgateway01.phx2.fedoraproject.org&plugin=fedmsg&timespan=86400&action=show_selection&ok_button=OK
18:11:38 <zodbot> nirik: http://tinyurl.com/o8srljr
18:11:50 <nirik> for example, a fedmsg plugin for our busgateway that shows fedmsgs
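(For reference, a minimal sketch of how a custom stat such as the copr build queue length could be fed to collectd via its exec plugin; the helper command and identifier names are hypothetical:)

    #!/bin/sh
    # Run by the collectd exec plugin: emit one PUTVAL line per interval.
    HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
    INTERVAL="${COLLECTD_INTERVAL:-60}"
    while sleep "$INTERVAL"; do
        queue_len=$(copr-queue-length)   # hypothetical helper printing a number
        echo "PUTVAL \"$HOST/copr/gauge-build_queue\" interval=$INTERVAL N:$queue_len"
    done

The matching LoadPlugin exec / Exec stanza would then live in the collectd configuration that the ansible role manages.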
18:12:18 <nirik> anyhow, yeah, do let us know if you have questions or need info.
18:12:55 <nirik> vgologuz: oh, you were fixing up the copr playbooks? last time I tried to run them they didn't finish... could you look into fixing that? I think it was a missing source file.
18:13:29 <nirik> #topic Applications status / discussion
18:13:37 <nirik> any applications news ?
18:14:07 <threebean> been roping anitya and koschei into fedmsg this week.. lots of fun.  ;)
18:14:21 <pingou> threebean: and tnh? :)
18:14:32 * threebean nods
18:14:45 <threebean> started work on the new backend for anitya
18:14:47 * tflink is working on taskotron monitoring setup, most of the other issues have been fixed but waiting for new builds
18:14:48 <threebean> https://github.com/fedora-infra/the-new-hotness
18:14:59 <pingou> I've been working on anitya with threebean
18:15:10 <tflink> at this point, we need to decide how wise it is to switch off autoqa right before beta freeze
18:15:21 <pingou> and spent some time on progit yesterday to make it a little less fedora-centric (i.e. allow local accounts instead of relying on FAS)
18:15:38 <threebean> tflink: the goal was to switch it off a few days ago, no?
18:15:58 <pingou> oh, and I got the new fedocal out the door (which benefited from quite a bit of help from trashy)
18:16:35 * mirek-hm is here
18:17:09 <tflink> threebean: that was the original hope, yes
18:17:20 <tflink> some bits took longer than I wanted them to
18:17:25 * threebean nods
18:17:26 <nirik> #info new fedocal releases this week (see changelog on list)
18:17:34 * lmacken has been doing a lot of bodhi masher development lately. Almost ready to start testing pushes in stg.
18:17:50 <nirik> #info taskotron monitoring has been added.
18:17:55 <threebean> if the only pieces that are left are monitoring, I'd be +1 to moving forwards with taskotron for this portion of the release cycle -- and killing autoqa.
18:17:58 <pingou> tflink: 2 questions: a/ how easy/hard is it to turn it back on if needed? b/ can both systems work in parallel?
18:18:05 <nirik> #info anitya and new backend work moving along
18:18:24 <pingou> nirik: and pkgdb2 also got a released pushed in prod, on Monday
18:18:33 <nirik> tflink: there's still a bit more monitoring to add? and we wanted to backup some more stuff?
18:18:39 <tflink> pingou: if we don't delete the autoqa01 vm, trivial to turn back on. they can't work 100% in parallel due to how we provide feedback in bodhi comments
18:19:03 <tflink> nirik: yeah, I'm working on the website monitoring right now, I don't think that the buildbot plugin is going to be ready this week
18:19:09 <pingou> then I'm +1 on moving forward and just keep the autoqa01 vm around for now
18:19:18 <tflink> there are some files on taskotron01.qa that need to be backed up as well
18:19:27 * tflink doesn't remember if he filed a ticket for that
18:19:39 <nirik> tflink: I'm happy to assist adding the website monitoring and backups...
18:19:49 <nirik> no ticket yet, but if you file one I can get it going. ;)
18:19:50 <tflink> I'm going to do a new libtaskotron build later today and reset the history on taskotron01.qa
18:20:09 <threebean> tflink: any chance of a new resultsdb release before prime time?
18:20:12 <tflink> nirik: if you have the time to do the nagios stuff, I can work on the new builds and cleanup
18:20:27 <tflink> threebean: yeah, was planning on a new build/release for that today as well
18:20:32 <threebean> rad, rad.
18:20:36 <nirik> tflink: also, there's a report email going to admin I think about success and failures for each {prod|stg|dev}... should that better go to qa-devel? or test list?
18:21:30 <tflink> nirik: odd, I'm only seeing it go out to sysadmin-qa-members
18:21:39 <nirik> oh, perhaps I misread.
18:21:41 * nirik looks
18:21:41 <tflink> it's not supposed to go out to admin@
18:22:10 <nirik> oh, you are right. I misread. ;)
18:22:24 <nirik> but still, would those be better going to qa-devel? or is sysadmin-qa good?
18:22:39 <tflink> sysadmin-qa is good for now - the information in those emails is of limited utility
18:22:50 <tflink> a very limited audience, rather
18:22:56 <nirik> ok. yeah, I wasn't sure if there was anything to do with them. ;)
18:23:14 <tflink> in practice, anyone interested in the emails is part of sysadmin-qa
18:23:29 <nirik> fair enough.
18:23:35 <tflink> that may change in the future, but we'll have better methods of reporting by then, I think
18:24:10 * nirik likes tracking down automated things that send email and finding out if they are needed/going to the right place. ;)
18:24:18 <nirik> anyhow, any other applications news?
18:24:40 <pingou> oh, we flooded fedmsg earlier this week :)
18:24:59 <pingou> but people cannot complain anymore that the information stored in pkgdb2 about their packages is incorrect (for most of them)
18:25:11 <nirik> pingou: is that a cron job now?
18:25:19 <nirik> or a fedmsg trigger? ;)
18:25:23 <threebean> ooo
18:25:32 <pingou> basically, we now have a cron job that, on a weekly basis, takes the metadata from rawhide and updates the package information in pkgdb with it
18:25:38 <pingou> nirik: cron :)
18:25:50 <nirik> cool. worth a blog or note to devel-announce?
18:26:33 <pingou> maybe a blog post, it seems too small to be worth devel-announce (imho)
18:26:48 <nirik> yeah, fair
18:27:05 <nirik> #info pkgdb info on packages is now updated once a week from rawhide metadata.
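(As a rough illustration of the shape of such a job; the script name, user and URL are hypothetical, not the actual pkgdb2 tooling:)

    # /etc/cron.d entry: every Sunday, pull rawhide repo metadata and refresh
    # the summary/description stored for each package in pkgdb
    0 3 * * 0 apache /usr/bin/pkgdb-refresh-from-rawhide --metadata-url https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/x86_64/os/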
18:27:23 <pingou> which means it's missing some packages, those only present in the other branches
18:27:33 <nirik> yeah, or epel only or whatever.
18:28:06 <nirik> anything we want to try and land before freeze next week?
18:28:14 <nirik> or are we in pretty good shape on the apps side for that freeze?
18:29:18 <pingou> I have some big changes coming up in pkgdb land, but that's something to coordinate with rel-eng
18:29:29 <pingou> and it'll most likely wait for after the freeze
18:29:35 <nirik> sounds good
18:29:41 <nirik> #topic Sysadmin status / discussion
18:29:50 <nirik> lets see... on sysadmin side of the world.
18:30:21 <nirik> I've been seeing problems with our nightly ansible check/diff cron job not completing. :( Still investigating... it's just being really really slow when run from cron.
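(For context: that job is basically a dry run of everything with ansible's check and diff flags, presumably mailed out for review; the paths and report address below are illustrative:)

    # /etc/cron.d entry on the control host: report what ansible *would* change
    # (inventory path, playbook glob and mail address are hypothetical)
    0 4 * * * root ansible-playbook --check --diff -i /srv/ansible/inventory /srv/ansible/playbooks/groups/*.yml 2>&1 | mail -s 'ansible check/diff report' sysadmins@example.org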
18:30:35 <nirik> #info bastion02 reinstalled with rhel7 and ansible
18:30:54 <nirik> I've reinstalled bastion02... and I would like to take some quiet off-hours time to test it as the vpn hub.
18:30:55 <pingou> \ó/
18:31:08 <nirik> it would be a short blip as everything reconnects (if all goes well)
18:31:38 <tflink> there's more movement on the new qa boxes that were supposed to have been ordered in Q2, hopefully they'll be ordered in the next week or so
18:31:50 <nirik> tflink: great. Just keep us posted.
18:31:55 <tflink> will do
18:32:05 <smooge> I am working on two things
18:32:12 <smooge> 1) getting a rack for the QA machines...
18:32:19 <nirik> tflink: I was going to ask what you would think of moving all the instances off virthost-comm01.qa to 03? but I'm not sure we will have time before freeze...
18:32:29 <smooge> 2) starting to inventory what we have and what we will want for next fiscal year
18:32:58 * nirik nods.
18:33:15 <nirik> smooge: might be good to check support status for everything too... make sure we didn't miss any renewals.
18:33:20 <mirek-hm> I learned that juno should be released next week https://wiki.openstack.org/wiki/Juno_Release_Schedule
18:33:28 <smooge> nirik, will do so
18:33:37 <smooge> oh BOY
18:33:37 <mirek-hm> so we may install juno for next fedora cloud
18:33:52 <nirik> mirek-hm: fun. ;) what changes will we need to make for that?
18:33:57 <mirek-hm> by that time i should have that EqualLogic attached
18:34:05 <nirik> mirek-hm: was also going to ask... yeah, about that... ;)
18:34:14 <smooge> start from scratch, burn the old to the ground, add kerosene and matches
18:34:18 <mirek-hm> nirik: I do not know, hopefully nothing :)
18:34:56 <nirik> ok, yeah, we should go with newer if we at all can.
18:36:00 <mirek-hm> packstack is primarily developed in RH, and most development is backported immediately to RDO, so our installation should be identical or need just a few touches. everything else is over the api, which should be the same
18:36:57 <nirik> ok, great.
18:37:06 <nirik> thanks for working on it mirek-hm
18:37:09 <smooge> thanks mirek-hm
18:37:16 <pingou> smooge: on or off the matches?
18:37:19 <mirek-hm> my idea was that we can use this time to try upgrades, i.e. keep the current installation and, before we burn it down, try to upgrade it to juno. that way we will be more sure of what we are doing when we upgrade juno to k-something
18:37:58 <nirik> mirek-hm: well, I'd be ok with that, but I think we should migrate manually anything important off it to the new one first
18:38:01 <smooge> so icehouse to juno?
18:38:07 <nirik> folsom
18:38:08 <smooge> or folsom to juno?
18:38:17 * nirik doesn't think it will work at all. ;)
18:38:27 <smooge> oh god.. you are a braver man than I
18:38:34 <mirek-hm> icehouse to juno, then reprovision it, and install juno from scratch
18:39:02 <mirek-hm> but it will prepare us to upgrade from juno to k-something next year
18:39:07 <nirik> mirek-hm: or we can save that for after. Install juno now, get migrated, then install an icehouse and play with upgrades on a single node?
18:39:36 <nirik> I really want to get off this folsom one. ;)
18:39:45 <mirek-hm> me too :)
18:40:08 <nirik> when we installed folsom they said upgrades were not supported at all.
18:40:17 <nirik> glad to hear that it might work now. ;)
18:40:39 <nirik> anyhow, anything we can do to accelerate moving to a new one is good with me. We can then take time after to test things or whatever.
18:41:14 <nirik> #info openstack juno out next week, will try and move to that for our new cloud.
18:41:20 <nirik> #info need to test openstack upgrades
18:41:38 <nirik> ok, one other thing I wanted to bring up:
18:42:14 <nirik> currently for rhel6 hosts, we use denyhosts. It's dead upstream and has no epel7 branch (nor will it get one), so we have to move to something else for rhel7 hosts.
18:42:31 <nirik> I tried fail2ban and could not get it working at all. It crashed my test machine too.
18:42:45 <nirik> I tried pam_abl (didn't work at all) and pam_shield
18:43:10 <nirik> pam_shield works, but only if you allow one or both of password auth or token auth in sshd.
18:43:29 <nirik> Anyone have any better ideas in the area of blocking brute force sshd junk?
18:44:03 <nirik> #info ideas wanted for rhel7 denyhost replacement
18:44:32 <nirik> I guess we could just put up with the log noise, or do iptables hashlimit for now.
18:44:50 <nirik> it's just annoying there's no working solution in this space. ;)
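(The iptables hashlimit idea mentioned above would look roughly like this; rate and burst numbers are illustrative:)

    # allow new SSH connections from a given source IP only below a modest rate;
    # everything over the limit gets dropped, which kills most brute-force noise
    iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
        -m hashlimit --hashlimit-name ssh --hashlimit-mode srcip \
        --hashlimit-upto 3/minute --hashlimit-burst 5 -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j DROP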
18:45:08 <lmacken> can't we disable password auth?
18:45:14 <nirik> we have. long ago
18:45:46 <nirik> the issue is external hosts get 10,000 ssh attempts... so logs get filled with 'failed login for admin' 'failed login for root'
18:47:04 <nirik> anyhow, can take that out of meeting if anyone has ideas. ;)
18:47:09 <nirik> anything else sysadmin wise?
18:47:11 <lanica> Is there an open ticket and is someone assigned?
18:47:31 <nirik> lanica: I think tickets for this kind of thing are bad, but I could open one I suppose.
18:47:52 <lanica> I don't quite follow...
18:48:08 <nirik> good ticket: "do x and y", bad ticket: "figure out the best way around this problem that will take a lot of discussion and it's not clear what we should do yet"
18:48:31 <nirik> IMHO tickets are poor for open discussion on something, but great when there's a known thing to do or action.
18:48:55 <nirik> The list is probably better for this...
18:49:00 <lanica> Understood.  But to get to specific steps someone needs to dig in and figure out what works, unless someone has hit this and has the answers.
18:49:02 <lanica> Good point though.
18:49:18 <lanica> I might try to work on it, so I'll talk on list if so ;)
18:49:38 <nirik> I'll post to the list... there's a lot of things I have already looked at and don't work. ;)
18:49:52 <nirik> #topic nagios/alerts recap
18:50:41 <nirik> .tiny https://admin.fedoraproject.org/nagios/cgi-bin//summary.cgi?report=1&displaytype=3&timeperiod=last7days&smon=10&sday=1&syear=2014&shour=0&smin=0&ssec=0&emon=10&eday=9&eyear=2014&ehour=24&emin=0&esec=0&hostgroup=all&servicegroup=all&host=all&alerttypes=3&statetypes=3&hoststates=7&servicestates=120&limit=25
18:50:48 <zodbot> nirik: http://tinyurl.com/qevmn8u
18:51:11 <nirik> so the top two there... need us to fix our monitoring. ;)
18:51:14 <nirik> Rank	Producer Type	Host	Service	Total Alerts
18:51:14 <nirik> #1	Service	collab03	mail_queue	226
18:51:14 <nirik> #2	Service	lockbox01	Zombie Processes	182
18:51:33 <nirik> collab03 mail_queue notices because sometimes it has more than a few emails in queue because it's sending to a large list.
18:51:50 <lanica> Zombies... it's not quite Halloween yet....
18:52:07 <nirik> lockbox01 alerts on zombie processes because something like ansible runs over tons of machines and some of them show up as zombies until they are reaped.
18:52:12 <nirik> (or puppet might also do it)
18:52:44 <nirik> those are things I can (and will) file tickets on. ;)
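(One likely fix for both is simply loosening the relevant check thresholds; the plugin invocations and numbers below are illustrative:)

    # collab03: tolerate a bigger outbound queue while mailman delivers to large lists
    check_mailq -w 500 -c 2000
    # lockbox01: tolerate the transient zombies left behind by mass ansible/puppet runs
    check_procs -s Z -w 20 -c 50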
18:53:14 <nirik> #topic Upcoming Tasks/Items
18:53:14 <nirik> https://apps.fedoraproject.org/calendar/list/infrastructure/
18:53:25 <nirik> any upcoming items anyone wants to schedule or note?
18:53:35 <pingou> when is freeze again?
18:53:51 * pingou wonders if we should add it to the calendar
18:54:22 <nirik> tuesday.
18:54:26 <nirik> sure! we should
18:54:42 <nirik> 2014-10-14 f21 beta freeze
18:54:42 <nirik> 2014-10-28 f21 beta release
18:54:59 <nirik> #topic Open Floor
18:55:05 <nirik> anyone have any items for open floor?
18:56:12 <threebean> oh, in case anyone should need me, I'll be mostly afk friday morning through sunday.
18:56:22 <pingou> threebean: enjoy :)
18:56:48 <nirik> threebean: cool. ;)
18:56:57 <nirik> alright, thanks for coming everyone. ;)
18:57:00 * threebean waves
18:57:01 <nirik> #endmeeting