infrastructure
LOGS
16:00:21 <lenkaseg> #startmeeting Infrastructure (2021-11-25)... (full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/f41e9cef1bbe06d4d48d7abf25e8bf514f0fb6b5)
16:00:21 <zodbot> Meeting started Thu Nov 25 16:00:21 2021 UTC.
16:00:21 <zodbot> This meeting is logged and archived in a public location.
16:00:21 <zodbot> The chair is lenkaseg. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:00:21 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:00:21 <zodbot> The meeting name has been set to 'infrastructure_(2021-11-25)..._(full_message_at_https://libera.ems.host/_matrix/media/r0/download/libera.chat/f41e9cef1bbe06d4d48d7abf25e8bf514f0fb6b5)'
16:00:36 <mobrien> .hi
16:00:37 <zodbot> mobrien: mobrien 'Mark O'Brien' <markobri@redhat.com>
16:01:04 <lenkaseg> Hello everybody, today unexpectedly with me as a chair!
16:01:18 <mobrien> lenkaseg: you will need to paste line by line on matrix
16:01:22 <austinpowered_> .hi
16:01:23 <zodbot> austinpowered_: Sorry, but user 'austinpowered_' does not exist
16:01:29 <lenkaseg> ah, true
16:01:36 <lenkaseg> #meetingname infrastructure
16:01:36 <zodbot> The meeting name has been set to 'infrastructure'
16:01:43 <lenkaseg> #chair nirik siddharthvipul mobrien zlopez pingou bodanel dtometzki jnsamyak computerkid
16:01:43 <zodbot> Current chairs: bodanel computerkid dtometzki jnsamyak lenkaseg mobrien nirik pingou siddharthvipul zlopez
16:01:50 <lenkaseg> #info Agenda is at: https://board.net/p/fedora-infra
16:02:02 <lenkaseg> #info About our team: https://docs.fedoraproject.org/en-US/cpe/
16:02:08 <lenkaseg> #topic greetings!
16:03:03 <nils> \o
16:03:08 <nils> .hello nphilipp
16:03:09 <zodbot> nils: nphilipp 'Nils Philippsen' <nphilipp@redhat.com>
16:03:10 <austinpowered> .hi
16:03:12 <zodbot> austinpowered: austinpowered 'T.C. Williams' <fedoraproject@wootenwilliams.com>
16:04:34 <lenkaseg> #topic New folks introductions
16:04:43 <lenkaseg> #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
16:04:52 <lenkaseg> #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
16:05:39 <lenkaseg> Anybody new here, who would like to introduce themselves?
16:07:30 <lenkaseg> Ok then, let's continue
16:07:37 <lenkaseg> #topic Next chair
16:07:42 <lenkaseg> #info magic eight ball says:
16:07:50 <lenkaseg> #info chair 2021-11-25 - lenkaseg
16:08:13 <lenkaseg> Anybody interested in chairing next week?
16:08:28 <mobrien> I will do it
16:08:35 <lenkaseg> mobrien: sold!
16:08:48 <lenkaseg> #info char 2021-12-02 - mobrien
16:09:35 <lenkaseg> I guess we can determine the next next chair next time
16:09:45 <lenkaseg> That is a lot of next in one sentence
16:09:57 <lenkaseg> Let's move on
16:09:59 <mobrien> We are probably down volunteers due to thanksgiving
16:10:28 <lenkaseg> #topic announcements and information
16:10:33 <lenkaseg> #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1030 Europe/paris in #centos-meeting
16:10:40 <lenkaseg> #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3
16:10:48 <lenkaseg> #info If your team wants support from the Fedora Program Management Team, file an isssue: https://pagure.io/fedora-pgm/pgm_team/issues?template=support_request
16:10:54 <lenkaseg> #info Happy Thanksgiving to those in US!
16:11:21 <lenkaseg> Is there some more info worth sharing?
16:13:01 <lenkaseg> Let's move on
16:13:04 <lenkaseg> #topic Oncall
16:13:11 <lenkaseg> #info https://fedoraproject.org/wiki/Infrastructure/Oncall
16:13:17 <lenkaseg> #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/
16:13:24 <lenkaseg> ## .oncalltakeeu .oncalltakeus
16:13:35 <lenkaseg> #info mobrien on call from 2021-11-18 to 2021-11-25
16:13:43 <lenkaseg> #info lenkaseg on call from 2021-11-25 to 2021-12-02
16:14:12 <lenkaseg> I guess I can take the oncall right now...
16:14:29 <lenkaseg> .oncalltakeeu
16:14:29 <zodbot> lenkaseg: Kneel before zod!
16:14:35 <mobrien> There was no oncall ping for the last week
16:14:49 <lenkaseg> Thanks for the update mobrien
16:16:03 <lenkaseg> Any volunteers for the week after December 2nd?
16:16:33 <lenkaseg> If not we can leave it for the next week
16:17:04 <mobrien> I think leave it until more people are here
16:17:34 <lenkaseg> ok, let's leave it for next time
16:17:48 <lenkaseg> ummm, now the topic...
16:18:15 <lenkaseg> I guess the Monitoring discussion with nirik was last week
16:18:29 <mobrien> I can do the monitoring topic this week
16:18:54 <lenkaseg> #topic Monitoring discussion
16:19:02 <lenkaseg> mobrien: ok, all yours!
16:19:09 <mobrien> So not a huge amount to report
16:19:26 <mobrien> copr are having some issues with their dev instances in AWS
16:19:38 <mobrien> They will work on it over the next week but no rush
16:20:07 <mobrien> proxy09 is flapping a bit which I should really look into
16:20:44 <mobrien> otherwise a few long running issues that we should clean up sometime but not urgent
16:21:31 <mobrien> That's about it so we can go to the learning topic for abompard now I would say
16:21:51 <lenkaseg> #topic Learning topic
16:22:01 <lenkaseg> #info 2021-11-25 - Migration of databases in Fedora Infra [abompard]
16:22:19 <lenkaseg> The floor is yours abompard!
16:22:24 <abompard> Thanks!
16:22:30 <abompard> How long do I have?
16:22:58 <abompard> OK if anybody has questions, feel free to ask them when they pop in your head, I'll try to answer right away
16:23:24 <abompard> First, let's talk about libraries, because it's the simplest
16:23:31 <lenkaseg> 15-20 mins, but really up to you
16:23:55 <abompard> There are currently two database libraries that we use in infra: sqlalchemy (most of the time) and Django's ORM (when using Django obviously)
16:24:13 <abompard> sqlalchemy has the "alembic" library dedicated to handling DB migrations
16:24:19 <abompard> so that's what we use most of the time
16:24:51 <abompard> Things to look out for:
16:25:36 <abompard> Make sure you stop all connections to the DB
16:25:48 <abompard> Set a downtime in Nagios before you migrate
16:26:01 <abompard> Only then can you upgrade your DB
16:26:13 <abompard> and then, obviously, restart the services and reactivate nagios
16:26:35 <abompard> Note: for some reason I cannot cut-n-paste into this window, which is a bit annoying
16:27:03 <abompard> There are mainly two situations one can be in in our infra :
16:27:13 <abompard> - the app is running in a VM
16:27:20 <abompard> - or the app is running in Openshift
16:27:41 <nils> How about dry-running migrations on a copy of the DB?
16:28:03 <nils> Is this anything we routinely do?
16:28:09 <abompard> Ah, good question nils, I don't think we do that currently
16:28:36 <abompard> but we could standardize on creating two databases for each app, one main and one for the migration tests
16:28:51 <abompard> I think it would be interesting
16:29:06 <abompard> although of course migrations should have been tested in staging
16:29:08 <copperi[m]> sure
16:29:16 <abompard> but staging is not always a perfect image of prod
16:29:16 <nils> It could quickly become infeasible, thinking of e.g. the datanommer DB
16:29:35 <abompard> yeah datanommer is a special case, I don't think any DB comes close
16:29:44 <abompard> maybe Koji's DB is big too
16:29:47 <nils> Yeah, if size allows it testing the migration on a recent copy would give me the most confidence
16:29:47 <abompard> but not as big
16:30:01 <mobrien> In the case of a live migration such as the datanommer one going on now, how does a migration guarantee the integrity of the database and ensure that what is copied over is up to date?
16:30:02 <abompard> That's a good idea, we should investigate that
16:30:17 <abompard> I'll go to the datanommer case next
16:30:49 <abompard> Let me get back, if your app is running in a VM, then you write a manual playbook
16:30:57 <abompard> it'll be in playbooks/manual/upgrade
16:31:02 <abompard> example : https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/upgrade/pagure.yml
16:31:08 <abompard> (I can cut-n-paste, yeah!)
16:31:22 <abompard> If your app is running in Openshift, then you have two options
16:31:44 <abompard> You can run your upgrade during the deployment phase, using the "recreate" strategy
16:32:01 <abompard> openshift will take down a pod, run a script, and startup a pod with the new code
16:32:18 <abompard> The doc is here: https://docs.openshift.com/container-platform/4.9/applications/deployments/deployment-strategies.html#deployments-recreate-strategy_deployment-strategies
16:32:29 <abompard> it's good, but you have downtime while the DB is being migrated
16:33:06 <abompard> but at least you know that none of your code will be running while the DB is migrated, and that your older code will never connect to a DB that's been migrated
16:33:19 <abompard> That's the strategy that Bodhi uses : https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/bodhi/templates/deploymentconfig.yml#_21
16:33:51 <abompard> There is another option: use a manual playbook there too
16:34:02 <abompard> That's the option that we settled for in Datanommer
16:34:11 <abompard> because multiple apps access the Datanommer DB directly
16:34:21 <abompard> so we need to shut them all down before upgrading
16:34:40 <abompard> For that we use a manual playbook : https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/upgrade/datanommer.yml
16:34:56 <abompard> which calls an OpenShift "Job" : https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/datanommer/templates/job.yml
16:35:10 <abompard> a Job is a pod that openshift starts with a custom command
16:35:23 <abompard> here it's the DB upgrade command obviously
16:35:50 <abompard> The trick with OS jobs is that you have to monitor them for completion and delete the job manually
16:36:02 <abompard> OS does not do that for you (or we didn't find the option)
16:36:18 <abompard> OK, that's what we currently do in infra
16:36:39 <abompard> If there's no questions on it, let's talk about long-running migrations
16:37:01 <abompard> there's one at the moment, the Datanommer migration. It's a huge DB
16:37:18 <abompard> the migration is expected to last about 60 days, and there's about 40 left
16:37:40 <abompard> when you have to undertake such a migration, I strongly recommend designing a process/script that you can restart
16:37:55 <abompard> because servers need to reboot sometimes, connections get lost, etc
16:38:18 <abompard> If you use SQLAlchemy, also make sure you use windowed queries: https://github.com/sqlalchemy/sqlalchemy/wiki/RangeQuery-and-WindowedRangeQuery
16:39:20 <abompard> The advantage of being able to recover on failure is also that you can handle the data that has arrived in between
16:39:28 <abompard> that should anser mobrien's question
16:39:33 <mobrien> So are windowed queries taking smaller chunks of a table and migrating them 1 at a time?
16:39:42 <abompard> yes
16:40:04 <abompard> and when the migration is done, we'll restart it, and it'll migrate the 60 days worth of data that has accumulated
16:40:22 <abompard> it'll take maybe an hour, then we restart it again, etc
16:40:33 <mobrien> May be an edge case but what happens if a part of that is deleted after its window has been migrated?
16:40:36 <abompard> when there's almost no data left, then we have the downtime and we migrate over
16:40:48 <abompard> ah, with datanommer we don't delete data
16:40:55 <abompard> so that's not a problem we have
16:41:07 <abompard> but if you have deletion then you need a smarter sync
16:41:17 <mobrien> ah ok, that's the part I was unsure of
16:41:28 <abompard> Something that would compare the unique ids on each side and decide what to remove
16:41:38 <abompard> (not the primary keys, those would change)
16:41:52 <abompard> (but for example the msg-ids in DN's case)
16:42:17 <mobrien> makes sense
16:42:26 <abompard> Alright, that's all I had, any other questions?
16:43:06 * lenkaseg still reading
16:43:38 <abompard> nils could you open a ticket in our tracker to investigate how we could have testing databases for apps? I think it's a very interesting idea.
16:44:28 <nils> abompard, do you mean pagure.io/fedora-infrastructure?
16:44:45 <abompard> yeah, I think it could be akin to a mini-initiative
16:45:27 <mobrien> postgres natively supports hot standbys which would be great but resource consuming
16:45:56 <abompard> Huh that's cool, never tried it before
16:46:27 <abompard> Oh, also I don't think we have the habit of writing reverse migrations in Alembic
16:46:30 <mobrien> Azure and AWS leverage it for their HA db solutions
16:46:36 <abompard> but it would be cool to do it
16:46:48 <mobrien> https://wiki.postgresql.org/wiki/Hot_Standby
16:47:04 <lenkaseg> `when you have to undertake such a migration, I strongly recommend designing a process/script that you can restart`
16:47:04 <lenkaseg> Is the script somewhere available?
16:47:16 <mobrien> I think it takes incremental snapshots and has a wal with a short time to make up the difference
16:47:59 <abompard> lenkaseg: well that depends on your migration, what I mean is that you should design your script to withstand disconnections
16:48:15 <abompard> the Datanommer migration script is available of course, but it only works for datanommer
16:48:51 <abompard> if you don't need to handle deletions, I'd say look at your last inserted unique ID, find it in the source DB and start there
16:49:17 <lenkaseg> Aurélien B: of course, i meant the datanommer one
16:49:28 <nils> I'm not sure how windowd migration meshes with alembic, though. At least the default mode is to work, convert, etc. in one live database.
16:49:36 <nils> (Sorry, was out for a moment)
16:49:58 <abompard> lenkaseg: it's there: https://github.com/fedora-infra/datanommer/blob/develop/tools/timescaledb/migrate-to-timescaledb.py
16:50:21 <nils> So I assume this is not about using Alembic, but hand-sewn migration scripts?
16:50:36 <abompard> nils: if you need to do a long running operation, I don't think you should use alembic
16:50:41 <abompard> yeah
16:50:44 <lenkaseg> Aurélien B: Thanks! I could not find it :)
16:50:47 <nils> ah gotcha
16:50:53 <abompard> it should be very rare
16:51:02 <abompard> Datanommer is one case, the mailman migration was another
16:51:42 <nils> one recipe to allow delete-with-sync is to have an additional "uuid" in the tables which are assigned randomly on creation but would stay stable between different "incarnations" of a database
16:52:15 <nils> which is probably not very efficient, but what is with huge DBs?
16:52:24 <abompard> Yes, in general I would avise against using the primary key only for unicity
16:53:25 <nils> you probably won't need it day to day though
16:53:45 <abompard> then it depends on your data
16:53:54 <nils> I guess you could tack it onto a database schema using an alembic migration before you use the real DB-to-another-DB migration script
16:54:22 <abompard> for anything that needs to leave the DB, such as messages or async tasks, a unique id is needed
16:54:37 <abompard> nils: yes indeed
16:54:40 <nils> yeah
16:55:02 <mobrien> some data may have it anyway right? like unique message ids?
16:55:11 <nils> I was thinking single system there, i.e. a user talks to one service and the ids only make sense in the context of short-term communication between both
16:55:21 <nils> mobrien, yeah
16:55:23 <abompard> mobrien: yes, most messaging systems have those anyway
16:56:38 <abompard> Any other question?
16:57:32 <mobrien> I have no other questions but I enjoyed the talk. Thanks Aurélien B
16:57:32 <abompard> Well if you do feel free to ping me in #apps:fedoraproject.org
16:57:43 <abompard> Sure, I hope it was useful
16:57:43 <mobrien> abompard++
16:57:43 <zodbot> mobrien: Karma for abompard changed to 4 (for the current release cycle):  https://badges.fedoraproject.org/tags/cookie/any
16:57:51 <pmoura> abompard++
16:57:52 <lenkaseg> abompard++
16:57:52 <zodbot> lenkaseg: Karma for abompard changed to 5 (for the current release cycle):  https://badges.fedoraproject.org/tags/cookie/any
16:58:08 <abompard> yum !
16:58:14 <nils> abompard++
16:58:36 <lenkaseg> Aurélien B: Thanks a lot, I'm saving this discussion.
16:58:36 <lenkaseg> Where the logs are available by the way?
16:58:41 * nils eyes zodbot
16:58:52 <lenkaseg> 2 mins left, so left move to open floor
16:58:55 <lenkaseg> #topic Open Floor
16:59:01 <abompard> A link to the logs should be displayed at the end of the meeting
16:59:20 <abompard> I should add that it was lenkase
16:59:41 <abompard> I should add that it was lenkaseg who figured out the openshift job playbook thingie
16:59:47 <abompard> for datanommer
16:59:49 <abompard> so, well done
17:00:13 <mobrien> Good work lenkaseg
17:00:22 <mobrien> lenkaseg++
17:00:22 <zodbot> mobrien: Karma for lenkaseg changed to 5 (for the current release cycle):  https://badges.fedoraproject.org/tags/cookie/any
17:00:24 <nils> 👍
17:00:24 <lenkaseg> no no. the job was done by Aurélien B , I only added that thingy for checking if it's done I think
17:00:46 <pmoura> lenkaseg++
17:00:53 <mobrien> lenkaseg: the logs are here if you ever want to look back on historical topics https://meetbot.fedoraproject.org/sresults/?group_id=infrastructure&type=team
17:00:55 <lenkaseg> Thanks all for attending the meeting!
17:00:57 <nils> lenkaseg++
17:01:00 <lenkaseg> #endmeeting