infrastructure
MINUTES
19:00:00 <nirik> #startmeeting Infrastructure (2011-07-28)
19:00:00 <zodbot> Meeting started Thu Jul 28 19:00:00 2011 UTC.  The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:00 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:01 <nirik> #meetingname infrastructure
19:00:01 <zodbot> The meeting name has been set to 'infrastructure'
19:00:01 <nirik> #topic Robot Roll Call
19:00:01 <nirik> #chair smooge skvidal codeblock ricky nirik abadger1999
19:00:01 <zodbot> Current chairs: abadger1999 codeblock nirik ricky skvidal smooge
19:00:07 * skvidal is here
19:00:20 * abadger1999 here
19:00:29 <CodeBlock> hi there
19:00:51 <nirik> hello all.
19:01:14 * nirik waits another minute or so for more folks to wander in.
19:01:21 * herlo is here
19:01:24 <herlo> mostly
19:02:00 <skvidal> herlo: how goes your avian project?
19:02:09 <herlo> skvidal: it goes
19:02:25 <herlo> making decent progress, not as fast as I'd like, but good
19:02:31 <nirik> ok, I guess let's go ahead and get started...
19:02:37 <skvidal> waterfowl are gross, gross creatures. I wouldn't expect dealing with them would be a quick thing
19:02:42 <smooge> I am ready
19:02:44 <nirik> #topic New folks introductions and apprentice tasks/feedback
19:02:47 <herlo> skvidal: haha, yeah
19:02:47 <skvidal> herlo: I wish you luck
19:02:50 <herlo> thx
19:03:10 <nirik> any apprentice folks want to ask questions or note tickets? any new folks want to introduce themselves?
19:04:08 <nirik> I've seen a tapering off of apprentice activity of late... possibly due to schools restarting and people getting busy? or just general summer malaise?
19:04:23 <skvidal> both
19:04:24 <skvidal> I suspect
19:04:29 <skvidal> people chilling out for their summer
19:04:35 <skvidal> and prepping for school
19:04:38 <skvidal> I suspect in oct or so
19:04:49 <skvidal> when people are riding back comfortably in the reins of school
19:04:53 <skvidal> then we'll see them come back
19:05:03 <nirik> #info If you are an apprentice and want to get (re) involved, look at easyfix tickets and/or chime in on other topics in channel to get something to work on. ;)
19:05:09 <nirik> yeah, could well be.
19:05:14 <skvidal> bonus points if you get the literary reference there
19:05:44 * nirik doesn't off hand. ;(
19:06:23 <herlo> dead poet's society? and I didn't google it, especially since I'm probably wrong :)
19:06:54 <nirik> carpe diem!
19:06:58 <herlo> :P
19:07:02 <skvidal> herlo: ray bradbury - something wicked this way comes
19:07:10 <nirik> anyhow, if nothing else on apprentice tasks, moving on...
19:07:14 <herlo> not much of a literary buff
19:07:18 <nirik> #topic Moving SOP docs from wiki to git
19:07:25 <skvidal> w00t
19:07:35 <smooge> csi?
19:07:36 <skvidal> nirik: so - we just need a decision - put them with the csi docs or put them in their own repo
19:07:42 <nirik> so, there are some advantages and disadvantages here.
19:07:48 <StylusEater> skvidal: put them on github? :-)
19:07:56 <nirik> but I think the advantages outweigh the disadvantages.
19:07:57 <skvidal> StylusEater: <stab>
19:08:06 <skvidal> what are the disads?
19:08:08 * nirik looks at the csi docs repo.
19:08:18 <skvidal> nirik: is csi docs repo even a repo?
19:08:20 <skvidal> or is it just a dir?
19:08:24 <smooge> it is a repo
19:08:25 <nirik> less ability for $otherpeople to correct things/contribute.
19:08:34 <skvidal> smooge: where is it housed?
19:08:57 <smooge> Most of the stuff under it is git clone git://git.fedorahosted.org/csi.git
19:09:04 <skvidal> okay
19:09:04 <nirik> it's on hosted.
19:09:07 <skvidal> that I didn't understand
19:09:18 <skvidal> I'm not in favor of our sop docs being on hosted
19:09:29 <nirik> yeah, that drops one of the advantages.
19:09:31 <skvidal> right
19:09:41 <smooge> well it could be moved.
19:09:42 <skvidal> and I'd rather have to protect a single basket in the event of disaster
19:09:44 <skvidal> not multiple ones
19:09:45 <abadger1999> yeah
19:10:02 <nirik> it's also using publican I think...
19:10:20 <abadger1999> harder to point $otherpeople at how to do something.
19:10:32 <smooge> Correct. The CSI documents are supposed to be the policies and the SOPs are supposed to be the how to complete the policies
19:11:38 <skvidal> right
19:11:38 <skvidal> so
19:11:42 <nirik> so, if we had a repo on infrastructure, could we have it allow the same groups as the wiki does to edit? cla_done+1 or whatever.
19:11:42 <smooge> The main thing I wanted was to make sure that we keep both in sync
19:11:43 <abadger1999> think we're agreed.... git repo on lockbox separate from the other git repos males more sense.
19:11:51 <smooge> yes I agree on that
19:12:07 <abadger1999> *makes sheesh, the typos today.
19:12:10 <skvidal> in the event of a disaster
19:12:15 <skvidal> I don't need to see our policies
19:12:18 <nirik> we probably need to give the CSI docs a good lookover/edit/cleanup someday.
19:12:20 <skvidal> I will need to see our SOPs
19:12:24 <skvidal> cleaning up CSI makes sense
19:12:35 <skvidal> but I wouldn't want to clean it up at the detriment of SOPs being current
19:12:44 <herlo> yeah, that was my question nirik, what is the format going to be, publican? or something else?
19:12:49 <skvidal> herlo: txt
19:12:50 <skvidal> txt file
19:13:02 <nirik> yeah, text would be fine with me.
19:13:06 <skvidal> any system which involves effort to maintain == doom
19:13:09 <skvidal> b/c people will avoid it
19:13:15 <herlo> something like rst would be nice
19:13:17 <skvidal> and the SOPs don't have anything that is not text-able
19:13:23 <herlo> and it still is basically text
19:13:30 <nirik> it doesn't need to be pretty, just contain the data. Hopefully so you can cut and paste things.
19:13:35 <herlo> but I suppose we could make that later...
19:13:40 <herlo> yeah
19:13:53 <smooge> I would prefer the SOP's to be in text
19:14:03 <skvidal> nirik: paste++++
19:14:32 <nirik> so, any objections to that plan? If not, we need to plan the move, convert docs and redirect as we go... I could write up a plan for it.
19:15:03 <skvidal> nirik: I can take the action item to make the repo and get the pushing happening to a path in infra/web
19:15:10 <skvidal> err /srv/web/infra/
19:15:32 <nirik> skvidal: ok. Thanks. Perhaps we need a 'new infra git repo with hooks' SOP. ;)
19:15:54 <skvidal> nirik: I genericized the hook code
19:16:00 <skvidal> so I don't have to do it a billion times
19:16:04 <nirik> #action skvidal to make new repo
19:16:04 <skvidal> when I setup infra-hosts
19:16:11 <nirik> #action nirik to write up migration plan
19:16:23 <abadger1999> I'm still lamenting lack of TOC and hyperlinks but.... getting it off the wiki is pretty important.
19:16:47 <skvidal> abadger1999: suggestions welcome - but not publican
19:16:52 <nirik> we could make an index.html? ;)
19:16:52 <skvidal> abadger1999: that's A LOT of infrastructure for links
19:16:52 <abadger1999> *shudders*
19:17:36 <nirik> #action nirik will investigate updating CSI docs, or see what we can do to update them.
19:17:41 <abadger1999> I wouldn't wish publican on this in a million years.
19:18:06 * CodeBlock is back, sorry, had to talk to my boss about a $dayjob project.
19:18:09 <smooge> having done this in the long ago past.. you use a minimal markup at the top to say things like: Title:, Reason:, Keywords:, and then use a quick txt2html wrapper which makes html files with just <pre></pre> and index.html to link them
19:18:30 * nirik dealt with docbook when he did his howto, would really prefer to avoid that complexity (and publican is another layer on top of that)
19:18:59 <abadger1999> smooge: Yeah, something more like that is what I was envisioning.
19:19:05 <skvidal> nirik: docbook----
19:19:15 <skvidal> nirik: I've done it there, too - with the nfs-howto - it made me cranky
19:19:17 <abadger1999> smooge: Maybe parse some sort of internal headers as well.
19:19:20 <smooge> basically a checkin rebuilds the stuff and we go
19:19:25 <skvidal> abadger1999: I suspect someone has this
19:19:29 <skvidal> abadger1999: and it is lightweight
19:19:30 <skvidal> and simple
19:19:39 <skvidal> abadger1999: CodeBlock mentioned something
19:19:41 <skvidal> markdown?
19:19:46 <nirik> we can look at implementation details out of band?
19:19:57 <ianweller> markdown is an *okay* language, it's not great
19:20:03 <skvidal> nirik: nod
19:20:04 <smooge> skvidal, only in their 0.1 versions.. then people start asking for features and it becomes an XML/SGML/publican thing in the 0.2 version
19:20:20 <abadger1999> +1
19:20:25 <smooge> yeah offline after meeting
19:20:35 <nirik> I think possibly just adding a link/description to a index.html when you add a SOP could be enough...
19:20:50 <ianweller> nirik: directory listing?
19:21:10 <skvidal> post-meeting
19:21:12 <nirik> ie, "GomGabbar SOP - use this when you want to install a new Gom Gabbar device"
19:21:14 <skvidal> in fedora-admin
19:21:15 <nirik> anyhow, yeah.
19:21:29 * nirik cares not what colour the bike shed is.
19:21:45 <nirik> anything else on this? or shall we move on?
19:21:45 * skvidal is partial to blue-gray
19:21:47 <CodeBlock> ianweller: I like Markdown because when it's not parsed, stuff still stands out/looks good/is distinguishable in plaintext. As in headers are underlined (with - or =) etc.
19:21:58 * CodeBlock shuts up so we can move on :P
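A rough sketch of the txt2html idea smooge describes above: plain-text SOPs with a few "Key: value" lines at the top, each wrapped in <pre></pre>, plus an index.html that links them with a one-line description. This is only an illustration, not what was deployed. The header names (Title, Reason, Keywords) come from smooge's description; the function names, file layout, and sample file name are hypothetical.

```python
import html
from pathlib import Path

# Header fields assumed from smooge's description; not an agreed format.
HEADER_KEYS = ("Title", "Reason", "Keywords")


def parse_sop(text):
    """Split a plain-text SOP into (headers, body).

    Leading "Key: value" lines with a known key are headers; the first
    line that is not such a header ends the header block.
    """
    headers = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        key, sep, value = lines[i].partition(":")
        if sep and key.strip() in HEADER_KEYS:
            headers[key.strip()] = value.strip()
            i += 1
        else:
            break
    body = "\n".join(lines[i:]).lstrip("\n")
    return headers, body


def render_sop(name, text):
    """Wrap the SOP body in <pre> so it stays copy-and-paste friendly."""
    headers, body = parse_sop(text)
    title = html.escape(headers.get("Title", name))
    return (
        "<html><head><title>%s</title></head><body>\n"
        "<h1>%s</h1>\n<pre>%s</pre>\n</body></html>\n"
        % (title, title, html.escape(body))
    )


def render_index(entries):
    """Build index.html from (href, title, reason) tuples."""
    items = "\n".join(
        '<li><a href="%s">%s</a> - %s</li>'
        % (html.escape(href), html.escape(title), html.escape(reason))
        for href, title, reason in entries
    )
    return "<html><body>\n<ul>\n%s\n</ul>\n</body></html>\n" % items


def build(src_dir, out_dir):
    """Convert every *.txt in src_dir and write an index; returns the entries."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    entries = []
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        headers, _ = parse_sop(text)
        target = path.stem + ".html"
        (out / target).write_text(render_sop(path.stem, text), encoding="utf-8")
        entries.append(
            (target, headers.get("Title", path.stem), headers.get("Reason", ""))
        )
    (out / "index.html").write_text(render_index(entries), encoding="utf-8")
    return entries
```

Something like build("sops/", "/srv/web/infra/...") could run from the checkin hook so that "a checkin rebuilds the stuff"; the source directory name here is made up and the output path is left elided as in the discussion.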
19:22:19 <nirik> #topic QA network setup
19:22:34 <nirik> ok, so we talked about this some last week... here's the conclusion I came to:
19:22:56 <nirik> monitoring - use our nagios for monitoring, and have alerts go to sysadmin-qa folks.
19:23:30 <nirik> config management - try out bcfg2 there. This means removing virthost-comm01 and bastion-comm01 from our puppet and adding them into bcfg2 there.
19:24:06 <nirik> I'm undecided if they should have a separate lockbox-comm01 for bcfg2 or not... I guess so to be careful.
19:24:17 <nirik> so, if anyone wants to help with that setup, please do. ;)
19:24:44 <skvidal> ok
19:24:55 <smooge> ok will do so
19:25:02 <nirik> once we get it setup, we can re-eval down the road... if bcfg2 isn't working we can switch them out.
19:25:16 <smooge> cfengine
19:25:16 <nirik> they will still need to use our setup for a few things... like repos I suspect.
19:25:48 <nirik> also, in the qa space a few more things down the road:
19:26:25 <nirik> it would be good if we could add some fedora ks files if we want them to use our kickstarts (which I think is probably easiest, since they have to use our repos anyhow)
19:27:11 <nirik> probably before too long we will be adding some secondary arch signing instances there... which will let us test out the new sigul/setup
19:27:30 <skvidal> heh
19:27:32 <nirik> that's all I had off hand on qa network stuffs. Any questions/concerns/ideas?
19:27:36 * skvidal read that as 'sighing instances'
19:27:49 <nirik> yeah, basically. ;)
19:28:39 <nirik> ok, moving on.
19:29:02 <nirik> #topic Upcoming Tasks/Items
19:29:14 <nirik> So, monday morning we have some reboots...
19:29:29 <nirik> tuesday is the start of f16alpha freeze.
19:30:11 <nirik> we have some new machines hopefully racked or being racked, so they will need installing and adding to monitoring.
19:30:46 <nirik> Any other upcoming items folks would like to note/schedule/ask about/plan for?
19:32:01 <nirik> ok. moving on then.
19:32:09 <nirik> #topic Meeting tagged tickets
19:32:24 <nirik> https://fedorahosted.org/fedora-infrastructure/query?status=new&status=assigned&status=reopened&group=milestone&keywords=~Meeting&order=priority
19:32:30 <nirik> any here folks want to discuss?
19:32:49 * skvidal looks
19:33:06 <smooge> not me
19:33:38 <smooge> 2501 will need us to get hardware up at ibiblio working I believe
19:33:44 <nirik> There's some old stuff there I might close or remove meeting from.
19:33:57 <skvidal> smooge: the hw is there now
19:34:02 <skvidal> when I hear from reuning
19:34:05 <skvidal> I'll go over and hitch it up
19:34:10 <smooge> ok cool
19:34:15 <nirik> well, hosted needs a plan. I might try and tackle that in my copious free time....
19:34:22 <skvidal> and you shall surely hear my cries and curses when the ipv6 stuff doesn't work
19:34:34 <skvidal> nirik: so I had a thought
19:34:35 <skvidal> on hosted
19:34:39 <skvidal> that you may well hate
19:34:43 <skvidal> but I wanted to bring it up
19:34:48 * nirik gets ready on the hate button.
19:35:06 <skvidal> what if we did it piece-meal?
19:35:17 <skvidal> ie: could we setup a new infrastructure that let us do a project at a time
19:35:36 <smooge> I thought that was what we were going to do
19:35:37 <nirik> yeah, I did think of that too.
19:35:46 <skvidal> ah, hmm
19:35:47 <nirik> I don't hate it, but we still need a plan...
19:35:50 <smooge> put a proxy in front and then move through that
19:35:59 <nirik> ie, how many instances, separated/connected how, etc
19:36:22 <skvidal> nirik: would hosted be an example of service that is well suited to having 'in the cloud'?
19:36:23 <nirik> and preferably a way to make it spread out more.
19:36:49 <nirik> skvidal: I'm not sure. Possibly... it does get hit pretty hard... so it would use a lot of resources.
19:37:03 <skvidal> nirik: which is sorta the point...
19:37:20 <skvidal> nirik: did we ever get an answer on cloud-money?
19:37:35 <nirik> skvidal: no. sadly.
19:37:45 <skvidal> nirik: okay, so I didn't just miss that meeting
19:37:46 <skvidal> okay
19:37:47 <skvidal> thx
19:38:07 <nirik> but any plan could look at how to split it out, and if we have cloud, we could use cloud for some or part of it if it makes sense.
19:38:48 <skvidal> okay
19:38:53 <skvidal> so really this needs some focus
19:38:56 <nirik> yes.
19:39:03 <skvidal> I suspect that's why it's not gotten very far ;)
19:39:08 <skvidal> or rather :-\
19:39:27 <nirik> I think we all agree it would be good to get it updated/upgraded, but we also want to try and make it less SPOF and such at the same time...
19:39:46 <nirik> anyhow, I can try and at least whip up some plan for people to be inspired to counterpropose. ;)
19:39:46 <skvidal> nod
19:40:02 <skvidal> nirik: not that we have the time right now
19:40:05 <skvidal> nor (probably) the money
19:40:18 <skvidal> but hosted migration sure seems like we'd benefit from a FAD
19:40:21 <nirik> it may be that we should be less far reaching... just plan for moving it the way it is now, and do a longer term thing later.
19:40:34 <nirik> yeah, that could be the case...
19:40:35 <skvidal> nirik: which will, likely, be kicked down the road forever :(
19:41:48 <skvidal> okay
19:41:50 <skvidal> what else?
19:41:51 <nirik> ok, moving on.
19:41:55 <nirik> #topic Open Floor
19:42:00 <nirik> anything for open floor?
19:42:27 <herlo> oh, I do
19:42:37 <nirik> herlo: fire away
19:42:39 <herlo> I had a nice little bug for fpaste-server
19:42:48 <herlo> with django-tracking, it needed to be updated
19:42:58 <herlo> and I'm going to be pushing that back some to accommodate
19:43:04 <herlo> but I needed to ask a couple questions regarding it
19:43:18 <nirik> sure, what are the questions?
19:43:19 <herlo> one is, do we plan to host it on its own vm? or will it live with other services?
19:44:03 <nirik> excellent question. :) I think this is exactly the sort of thing we should be figuring out when a new resource is in the 'dev' stage. ;)
19:44:08 <herlo> I've currently got the package deploying an fpaste.conf which I need to alter to better accommodate vhosts
19:44:51 <herlo> I think it could run with other vhosts as I don't think the load is going to be too high at first.
19:45:32 <nirik> so, there's a spectrum here: on one side is totally separate: its own instance with its own db and own webserver. On the other end is in our proxy/app mix: it's hit via proxy and uses varnish/haproxy, runs on the app servers and talks to a single backend db.
19:46:04 <nirik> there's also some middle ground where it could be using proxy/caching, but have its own instance and db
19:46:16 <herlo> which if we did the latter, would need to be similarly setup somewhere along the way, right?
19:46:25 <herlo> would that be in dev or rather in staging?
19:46:50 * herlo thinks most of this convo can go offline, but just wanted to bring up these thoughts
19:47:00 <abadger1999> Are our proxies a limited resource?
19:47:16 <herlo> especially since we're working on the SOPs for dev and staging rollout
19:47:28 <nirik> dev => no proxy or other setup, stg => setup like it would be in prod
19:47:33 <herlo> abadger1999: a good question, wish I knew
19:47:35 <nirik> abadger1999: I don't think so...
19:48:08 <herlo> nirik: k, I'll work with it that way then...
19:48:10 <herlo> thanks
19:48:18 <abadger1999> nirik: I get a good feeling about the middle way, then, treat the app as a separate resource but the infrastructure around it as shared.
19:48:49 <abadger1999> so that we can upgrade the host/db for one app independently of the host for a different app.
19:48:50 <nirik> yeah, adding more to a single db is something I wish to avoid... more eggs in one basket.
19:49:17 <abadger1999> I'm not sure what makes the most sense from a sysadmin/money perspective though.
19:49:46 <nirik> I guess if we had better db replication it might be less annoying.
19:51:12 <nirik> I guess it also depends on load/how popular something becomes.
19:51:44 <nirik> if it's really popular and we need more app resources, we could also move an app from its own instance out to the app* machines to spread that out...
19:51:49 <abadger1999> I'm not sure if separate db's gains us as much as separate app servers... I think it would distribute load and allow tweaking individual dbs for different types of queries but I'm not sure we have those issues.
19:52:10 <smooge> I would prefer middle road proxies -> app-fpasteXX -> db-fpasteXX
19:52:14 <abadger1999> it's a the db is a SPOF for an app whether it's in a shared db or separate.
19:52:22 <abadger1999> s/it's a//
19:52:35 <smooge> yeah but instead of 20 apps going down for an hour.. we may have 1.
19:52:50 <abadger1999> smooge: All depends....
19:52:57 <smooge> right now when we reboot the db servers, most of everything Fedora is offline
19:53:02 <smooge> and if the db doesn't come back
19:53:30 <abadger1999> why did it go down?  postgres update or hardware?  Did it take out fas?  etc etc.
19:53:40 <herlo> so I basically setup my app as normal and then we add it to the lb structure? That sound about right?
19:54:04 <nirik> herlo: yeah, that work takes place in stg step... but do be thinking about it now, I think that's good.
19:54:07 <herlo> there isn't a lot to fpaste-server.
19:54:09 <smooge> usually it goes down because we need to reboot or the hardware itself needs to reboot and then it all waits til it comes back
19:54:11 <abadger1999> smooge: <nod>... otoh, if we have five different db servers, doesn't that mean we're five times as likely to have any single piece of hardware go bad?
19:54:39 <smooge> abadger1999, no.. murphy is kind to us there. The hardware will go bad sometime.
19:54:44 <herlo> cool, this sounds like we could do some cool stuff like sticky sessions and add an additional fpaste machine if ever needed.
19:54:45 <smooge> anywhere
19:54:47 * herlo likes
19:54:47 <abadger1999> :-)
19:55:26 <nirik> I guess memory on app servers is a limited resource...
19:55:29 <nirik> and cpu there.
19:55:29 <smooge> abadger1999, you can't go over 100% failure possibility
19:55:35 <abadger1999> hehe :-)
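The shared-versus-separate db trade-off argued above can be put into numbers with a small sketch. It assumes each db server fails independently with the same probability in some window; the 5% figure in the usage note is made up for illustration, and the 20 apps come from smooge's remark. The function names are hypothetical.

```python
def p_any_failure(p_single, n_servers):
    """Probability that at least one of n independent servers fails.

    Grows with n but never exceeds 1, so five servers are less than
    five times as likely to see a failure as one server.
    """
    return 1.0 - (1.0 - p_single) ** n_servers


def apps_down_per_failure(n_apps, n_servers):
    """Apps taken out by a single server failure, apps spread evenly."""
    return n_apps / n_servers


def expected_apps_down(p_single, n_apps, n_servers):
    """Expected number of apps down in the window.

    Each server fails with p_single and takes n_apps / n_servers apps
    with it, so the expectation does not depend on n_servers.
    """
    return n_servers * p_single * apps_down_per_failure(n_apps, n_servers)
```

With p_single = 0.05 and 20 apps, one shared server has a 5% chance of a failure that takes down all 20 apps, while five servers have roughly a 22.6% chance of some failure that takes down 4 apps. Under these assumptions the expected number of apps down is the same either way (1.0); what separate servers buy is a smaller blast radius per incident, which is smooge's point, at the cost of more frequent incidents, which is abadger1999's.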
19:56:23 <nirik> if someone would like to add these questions into https://fedoraproject.org/wiki/Request_for_resources_SOP that would be great. ;)
19:56:28 <abadger1999> nirik: Hmm... memory is the limiting factor for dbs in my experience....
19:56:33 <smooge> I mean I figure if we could devote time and effort we could have failover and clustering of some sort in place.. but I think that has been the equivalent of bug#1 since 2005?
19:57:20 * nirik nods.
19:57:29 <nirik> so, anything more? or shall we call it a meeting?
19:57:30 <abadger1999> smooge: Well... I think the reason it's still bug #1 is that no one's laid down what we want to solve and what limitations we're willing to live with.
19:58:19 <abadger1999> ie: We could have warm backups of postgres and master/slave of mysql right now... but if we anticipate switching to the failover machines in case of db outage we're going to have to accept that we might lose data.
19:58:52 <smooge> abadger1999, I agree.
19:58:55 <abadger1999> I think we could accept that... but no one wants to actually commit to it.
19:58:58 * nirik nods again.
19:59:06 <smooge> I commit to losing data
19:59:22 <smooge> that will make a great Compass goal
19:59:27 <abadger1999> hehe :-)
19:59:43 <skvidal> abadger1999: can we quantify the kind of loss?
20:00:06 <nirik> I had this random thought yesterday... I might start tossing out disaster scenarios to the mailing list. ;) "phx2 is gone. What do we have left? how would we recover" "serverbeach down, what do we have, how do we recover" etc.
20:00:48 <abadger1999> skvidal: I'm thinking in many cases, it would be minimal -- postgres warm backups rsync the transaction logs at a period you define.  So you'd lose the data from that period.
20:00:59 <skvidal> then +1
20:01:05 <abadger1999> skvidal: We'd probably set it somewhat low... maybe 5 minutes.
20:02:13 <abadger1999> mysql master-slave can get out of sync.  if it does and master fails in that time, we lose all the data that didn't get synced.  If we fix any out-of-sync errors promptly and we don't fail when out of sync, we'd only lose a few minutes of data there as well.
20:03:22 <abadger1999> nirik: Perhaps we should do that with one of the new apps?  Set up a replicated db server for it.
20:03:26 <abadger1999> See how it works.
20:03:37 <nirik> yeah, that's a possibility...
20:03:46 <nirik> one in phx2 and one elsewhere?
20:03:46 <abadger1999> nirik: The one issue I'd see is... db servers take a lot of memory to be high performance.
20:04:12 <abadger1999> And you want the two boxes to have the same specs so that if you have to switch to the other box, it can take the load.
20:04:15 <abadger1999> well...
20:04:27 <nirik> yeah. true
20:04:32 <abadger1999> network *latency* can be an issue with app to db.
20:04:46 <nirik> although sometimes slower performance is still better than down.
20:05:03 <abadger1999> I think the app servers not at phx show that when they try to talk to the db there.
20:05:31 <abadger1999> so it may make sense to have the db servers not at phx if our intent is protecting data in case phx disappears.
20:05:51 <nirik> or if there's some way to do master/master. ;)
20:05:52 <abadger1999> but it may not if our intent is to have a db server to drop into place if the db server in phx goes kaput.
20:06:10 <abadger1999> master/master tends to slow everything down from what I can tell.
20:06:18 <nirik> yeah, I would imagine so...
20:06:28 <abadger1999> Since you have to wait for the slowest master to commit the data.
20:06:42 <nirik> ok, let's keep pondering on it, and discuss more out of meeting/next week. ;)
20:06:48 <abadger1999> <nod>
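The data-loss window abadger1999 describes for periodically shipped transaction logs can be sketched as below. This is a toy model, not how postgres itself behaves in detail: it assumes logs are shipped at fixed multiples of the interval (time 0, interval, 2 * interval, ...), that everything committed up to a shipment is safely on the standby, and that times are plain integer seconds. The 300-second interval matches the "maybe 5 minutes" figure above; the function names and sample commit times are hypothetical.

```python
def last_shipped_time(failure_time, ship_interval):
    """Time of the most recent shipment at or before the failure."""
    return (failure_time // ship_interval) * ship_interval


def lost_transactions(commit_times, failure_time, ship_interval):
    """Commits made after the last shipment and up to the failure.

    These exist only on the failed primary, so switching to the
    standby loses them.
    """
    cutoff = last_shipped_time(failure_time, ship_interval)
    return [t for t in commit_times if cutoff < t <= failure_time]


# Example: 5-minute shipping, primary dies 1000 seconds in.
commits = [100, 590, 610, 890, 905, 999]
lost = lost_transactions(commits, failure_time=1000, ship_interval=300)
# The last shipment was at 900 s, so only the commits at 905 and 999 are lost.
```

In this model the worst case is a failure just before the next shipment, which loses almost one full interval of commits; shortening the interval shrinks that window at the cost of shipping more often.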
20:06:59 <nirik> Thanks for coming everyone!
20:07:04 <nirik> #endmeeting