rolekitweekly
14:30:03 <sgallagh> #startmeeting rolekit (2015-11-10)
14:30:03 <zodbot> Meeting started Tue Nov 10 14:30:03 2015 UTC.  The chair is sgallagh. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:30:03 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
14:30:03 <sgallagh> #meetingname rolekitweekly
14:30:03 <zodbot> The meeting name has been set to 'rolekitweekly'
14:30:03 <sgallagh> #chair sgallagh twoerner nilsph
14:30:03 <zodbot> Current chairs: nilsph sgallagh twoerner
14:30:04 <sgallagh> #topic init process
14:30:10 <nilsph> .hello nphilipp
14:30:11 <zodbot> nilsph: nphilipp 'Nils Philippsen' <nphilipp@redhat.com>
14:30:12 <sgallagh> Hello, folks. Who do we have today?
14:30:14 <twoerner> .hello twoerner
14:30:15 <zodbot> twoerner: twoerner 'Thomas Woerner' <twoerner@redhat.com>
14:30:18 <sgallagh> .hello sgallagh
14:30:20 <zodbot> sgallagh: sgallagh 'Stephen Gallagher' <sgallagh@redhat.com>
14:30:25 <sgallagh> Great, the gang's all here.
14:30:30 <sgallagh> #topic Agenda
14:30:59 <sgallagh> #info Agenda Item: Issue Triage
14:31:07 <sgallagh> #info Agenda Item: Status Report
14:31:19 <sgallagh> Any other topics?
14:31:49 <twoerner> nope
14:32:21 <nilsph> not from me either
14:32:37 <sgallagh> ok
14:32:42 <sgallagh> #topic Issue Triage
14:32:53 <sgallagh> There are three issues that have been filed since the last time we did one of these.
14:33:29 <sgallagh> Oh, and a pull request
14:33:59 <sgallagh> #topic https://github.com/libre-server/rolekit/issues/54 - The Domain Controller role should support setting up the Vault
14:34:23 <sgallagh> (All three of the new tickets are around the Domain Controller, BTW)
14:34:24 <nilsph> That looks easy enough to do.
14:34:47 <nilsph> Do we support differentiating between versions of underlying software?
14:35:05 <sgallagh> Right, that's the tricky part.
14:35:10 <nilsph> I.e. can we optionally support that cmdline option on versions of freeipa?
14:35:19 <nilsph> (that do have it)
14:35:28 <sgallagh> I think in this case, we can probably Ask Forgiveness Instead Of Permission.
14:35:38 <nilsph> What I thought
14:35:40 <sgallagh> (Try to launch with --setup-kra and if that fails, retry without it)
14:35:58 <sgallagh> Because ipa-server-install fails on unknown arguments
14:36:18 <nilsph> My take is: if the user specifies it, add it to the cmdline and if not, leave it out. Error out if it fails regardless.
14:36:38 <nilsph> Otherwise users will think they have the option when they haven't (because we caught the error and retried without it)
14:36:44 <sgallagh> nilsph: Well, the approach we've taken with the DC so far is to install all options unless asked not to
14:36:51 <nilsph> Ahh.
14:36:57 <nilsph> That's an option, too :).
14:37:17 <sgallagh> nilsph: I think it might be reasonable to fail if it was explicitly requested and was unavailable
14:37:26 <nilsph> Yeah, retrying without it seems good then.
14:37:34 <sgallagh> But if it's just taking the defaults, go with whatever it supports.
14:37:51 <nilsph> I thought we didn't provide options and just add all features?
14:38:04 <sgallagh> nilsph: We provide the option, just default it to enabled.
14:38:33 <sgallagh> For the DNS server, for example, it's "serve_dns"
14:39:11 <sgallagh> Technically, the internal CA can be skipped as well, if you have a chained CA, but we don't support that right now.
14:39:20 <sgallagh> I should probably open a ticket for that, but I don't want to rush to implement it.
14:39:25 <nilsph> So would it then be: add the option, default to "try-enable" it (you know what I mean), but choosing does "enable-or-fail".
14:39:44 <sgallagh> nilsph: Yeah, I think that's what I was saying.
14:39:55 <twoerner> yes, that sounds reasonable
14:40:23 <sgallagh> (And in the future if other options are added, we'll have to have a staged fall back, of course)
14:40:26 <nilsph> Not sure whether the defaults system supports that already, but I guess I can try.
14:41:05 <sgallagh> nilsph: I kind of fudged it with the domain_name function.
14:41:32 <nilsph> Then I'll have to see if I can copy that. I like fudge. ;)
14:41:32 <sgallagh> So "None" would be "try-enable" in this case.
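The tri-state behaviour agreed on above ("None" means try-enable, an explicit request means enable-or-fail) can be sketched roughly as follows. This is a minimal sketch, not rolekit's actual code: the names `candidate_commands`, `deploy`, `setup_vault` and the injected `run` callable are hypothetical, and all other ipa-server-install arguments are omitted. Only `--setup-kra` comes from the discussion.

```python
from typing import Callable, List, Optional

Command = List[str]


def candidate_commands(setup_vault: Optional[bool]) -> List[Command]:
    """Return the command lines to try, in order of preference.

    setup_vault is a hypothetical role setting:
      True  -> enable-or-fail (only try with --setup-kra)
      False -> never pass --setup-kra
      None  -> try-enable (try with --setup-kra, then fall back)
    """
    base = ["ipa-server-install"]  # other arguments omitted in this sketch
    with_kra = base + ["--setup-kra"]
    if setup_vault is True:
        return [with_kra]
    if setup_vault is False:
        return [base]
    return [with_kra, base]


def deploy(setup_vault: Optional[bool], run: Callable[[Command], int]) -> Command:
    """Run candidates until one exits with 0; return the command that worked."""
    last_rc = None
    for cmd in candidate_commands(setup_vault):
        last_rc = run(cmd)
        if last_rc == 0:
            return cmd
    raise RuntimeError("ipa-server-install failed (exit code %r)" % last_rc)
```

Note that this sketch retries on any non-zero exit. A real implementation would probably want to tell "unknown argument" apart from other installation failures before falling back, and a staged fallback for several optional features would need more than a two-entry list.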
14:41:45 <sgallagh> /me nods
14:41:49 <nilsph> That doesn't even sound like fudging to me.
14:41:53 <sgallagh> I take it you are volunteering to implement this? :)
14:42:11 <nilsph> I guess so. :)
14:42:28 <sgallagh> Excellent. I'll put it on the F24 schedule, but not as a blocker.
14:43:26 <sgallagh> /me adds a summary comment to the issue. Please hold.
14:44:34 <nilsph> *cue hold music*
14:45:00 <sgallagh> OK, updated
14:45:03 <sgallagh> (and assigned)
14:45:27 <sgallagh> OK, next ticket
14:45:42 <sgallagh> #topic https://github.com/libre-server/rolekit/issues/55 - Ensure static IP for domain controller
14:46:06 <sgallagh> This came as part of the fallout from the F22->F23 upgrade.
14:46:25 <sgallagh> Some bits in the networking stack changed under the hood and my DHCP-assigned static address changed.
14:46:48 <sgallagh> In reality, we probably don't ever want FreeIPA to rely on a DHCP address, since it is so closely tied to DNS services.
14:47:10 <sgallagh> My proposal is that it would be useful to have us tell NetworkManager to make the configuration static.
14:47:37 <sgallagh> That said, I'm not sure if it's feasible to do this automatically (since we'd be attempting to claim an IP from within the configured DHCP range)
14:48:11 <sgallagh> So I think we should think about this carefully. There's no rush, because the simple workaround would be the admin setting the IP manually using nmcli or Cockpit anyway.
14:48:38 <sgallagh> Oh, I forgot to #info the previous ticket. One moment while I tinker with the logs.
14:48:40 <sgallagh> #undo
14:48:40 <zodbot> Removing item from minutes: <MeetBot.items.Topic object at 0x7f08bd3996d0>
14:48:45 <nilsph> There might be (Fort Knox-like) setups where machines have to call DHCP to get their assigned IP addresses, otherwise they'll get blocked at the switches or something. Not sure if we want to cater to these.
14:48:53 <sgallagh> #info Assigned to nilsph for Fedora 24 nice-to-have enhancement
14:48:58 <sgallagh> #topic https://github.com/libre-server/rolekit/issues/55 - Ensure static IP for domain controller
14:49:34 <sgallagh> nilsph: Those systems are guaranteed to be broken by changes like the device naming one that we just hit.
14:49:41 <nilsph> I think if we want to add that, make it a (default enabled?) option.
14:49:45 <sgallagh> It is basically how I had my VMs deployed, in fact
14:50:17 <nilsph> Yes. That's why we should leave the option of "trust me, I know what I'm doing" open to the admin ;).
14:51:22 <nilsph> And/or telling people that they need to take care that the IP address doesn't change (if we don't pin it).
14:52:25 <sgallagh> The more I think about this, the more I think we should just scrap the issue entirely and maybe just add some notes in the man page that the IP must be static for safe operation
14:52:33 <sgallagh> And rely on them to make that happen however makes sense
14:52:40 <nilsph> +1
14:52:49 <sgallagh> OK, I'll convert this to a doc bug and handle it myself
14:53:41 <twoerner> +1 for skipping this issue and adding a section to the domain controller man page
14:54:31 <twoerner> making the IP static is not something we should, or could easily, do
14:54:33 <sgallagh> #info Making this a documentation bug, assigned to sgallagh for Fedora 24
14:54:58 <sgallagh> #topic https://github.com/libre-server/rolekit/issues/57 - Support redeploy() for Domain Controller
14:55:10 <sgallagh> I added this one just before the meeting, and it's non-trivial.
14:55:26 <twoerner> wow
14:55:42 <sgallagh> There's no urgency on this, however.
14:55:50 <twoerner> wouldn't this be more like an undeploy and a fresh deploy?
14:55:51 * nilsph disappears into the woodwork
14:55:53 <sgallagh> For now, it's here for tracking.
14:55:56 <nilsph> ;)
14:55:56 <sgallagh> twoerner: No, actually
14:56:02 <twoerner> no?
14:56:14 <sgallagh> twoerner: IPA has tools for doing this already on a live system
14:56:22 <twoerner> adding and removing components might be a bit too much for redeploy
14:56:23 <nilsph> sgallagh: you want to be able to switch options off and on without disrupting the rest of the instance, right?
14:56:51 <sgallagh> twoerner: This is the exact example I used when defining the redeploy() option :)
14:56:58 <twoerner> ok.. if there is working support in IPA, then this is ok for me..
14:57:17 <sgallagh> nilsph: Well, short-term disruption is okay.
14:57:28 <sgallagh> (Like, outage-window type disruption)
14:57:49 <nilsph> I meant, permanently, as in deleting the instance and starting over (which is redeploy AIUI)
14:58:19 <twoerner> the description in the issue is a bit short .. not mentioning that there is support for this in IPA with a tool .. :-)
14:58:36 <sgallagh> nilsph: Don't use memcached as a good example of redeploy()
14:58:39 <sgallagh> That *is* a hack
14:58:53 <sgallagh> That particular example we only get away with because memcache has no state
14:59:29 <sgallagh> twoerner: Yeah, like I said I was creating it just as the meeting started.
14:59:31 <sgallagh> I was rushing :)
14:59:42 <nilsph> Time to explain "redeploy" to me, again, I guess :)
15:00:18 <sgallagh> nilsph: redeploy() is meant to be an in-place modification of solution-level changes.
15:00:34 <sgallagh> I hacked the decommission/redeploy for memcache because it was easier.
15:00:52 <sgallagh> But this ticket is the exact feature I always meant for redeploy() to handle
15:01:01 <nilsph> Without the marketing buzzwords: "change options in a deployed instance"?
15:01:01 <sgallagh> I just only now got around to adding it to the list.
15:01:14 <sgallagh> nilsph: Well, options *of the instance*.
15:01:35 <sgallagh> e.g. Add or remove optional components vs. add or remove users in the domain
15:01:49 <nilsph> Of course of the instance, we're not talking modifying the role. I'm not at least.
15:01:58 <nilsph> Understood.
15:02:03 <sgallagh> nilsph: Right, I'm just trying to be extremely clear in the meeting logs :)
15:02:08 <nilsph> Good :)
15:02:43 <sgallagh> In any case, this is going to be a large effort and probably not immediately urgent.
15:02:55 <nilsph> I think there are types of options which shouldn't be open to this, e.g. "database name" which is not an installation option of Postgres, really.
15:03:01 <sgallagh> So I'm going to suggest we drop this in the "Future" milestone and revisit it later
15:03:07 <nilsph> +1
15:03:36 <sgallagh> nilsph: Right, not all options have to be modifiable.
15:03:59 <twoerner> +1
15:04:19 <sgallagh> I'm also going to do the same to any of the help-wanted items on the list
15:04:31 <sgallagh> So we can easily just look at the "no milestone" set when figuring out what to triage.
15:04:43 <nilsph> +1
15:05:23 <sgallagh> #info Domain Controller redeploy() is deferred to the Future milestone for now
15:06:47 <sgallagh> ok, next topic
15:06:57 <sgallagh> #topic Status Report
15:07:21 <sgallagh> I went to the systemd conference last week. I had a chance to discuss a few topics with Lennart around how we are working with target units.
15:07:41 <nilsph> What's his take?
15:08:02 <sgallagh> #link https://github.com/systemd/systemd/issues/1797
15:08:36 <sgallagh> Lennart agreed with me that not propagating failures up to BindsTo units is a bug.
15:08:42 <sgallagh> So we will be able to rely on that in the future.
15:09:40 <sgallagh> What this means is that, once this is implemented, we'll be able to tell the difference between a target that is stopped because it was manually stopped or one that crashed.
15:09:47 <nilsph> Good. Even if we find that we don't need it in the end (pointing to our discussion of Monday last week ;))
15:10:00 <sgallagh> Well, this is still useful *information*
15:10:20 <nilsph> Yeah, having targets just for book-keeping is still worthwhile.
15:10:29 <sgallagh> I think I've revised the way I want to handle this somewhat, and I'm working on implementing it.
15:10:46 <sgallagh> I'm going to try to describe my new plan first, then I'll EOF and you can poke holes in it, please :)
15:11:14 <twoerner> ok, will do :-)
15:11:47 <sgallagh> First, we'll stop recording the state as an attribute of the role XML. Instead, we will rely entirely on the ActiveState attribute of the role target unit to represent the RUNNING vs. READY-TO-START state of the system.
15:12:12 <sgallagh> Whenever this attribute is requested, we will query systemd for it (with a reasonable cache in rolekit)
15:12:51 <sgallagh> Second, we will modify the role target unit so that it will attempt to auto-restart unless it fails repeatedly.
15:13:30 <sgallagh> If the target unit fails on *startup*, it does end up in the failure state and in this case we can propagate that into rolekit as being in the ERROR state. (Since obviously something is broken that needs attention)
15:14:29 <sgallagh> By relying on the state of the target unit, we no longer need the OnFailure units that fire d-bus messages indicating error to the rolekit daemon, since our status will always be accurate on lookup.
15:15:09 <sgallagh> This means we will be able to reduce the set of unit files we create down to one per role (rather than the target unit, failure unit and N signaling units, where N was every Requires: entry)
15:15:28 <sgallagh> This reduction in complexity will make for easier maintenance of the code and easier cleanup of the system.
15:15:31 <sgallagh> EOF
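The first part of the plan above (derive the role state from the target unit's ActiveState, with a short cache in rolekit) could look roughly like this. It is a sketch under assumptions: the mapping table, the rolekit state names, the `RoleStateCache` class and the 2-second TTL are illustrative and not taken from the rolekit source. The only external call is `systemctl show --property=ActiveState`, and the real daemon might just as well query systemd over D-Bus.

```python
import subprocess
import time
from typing import Callable, Dict, Tuple

# Assumed mapping from systemd ActiveState values to rolekit state names.
ACTIVE_STATE_TO_ROLE_STATE = {
    "active": "running",
    "inactive": "ready-to-start",
    "activating": "starting",
    "deactivating": "stopping",
    "failed": "error",
}


def query_active_state(unit: str) -> str:
    """Ask systemd for a unit's ActiveState (output looks like 'ActiveState=active')."""
    out = subprocess.run(
        ["systemctl", "show", "--property=ActiveState", unit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out.partition("=")[2]


class RoleStateCache:
    """Look up role states via systemd, caching each answer for `ttl` seconds."""

    def __init__(self,
                 query: Callable[[str], str] = query_active_state,
                 ttl: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._query = query
        self._ttl = ttl
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def state(self, unit: str) -> str:
        now = self._clock()
        cached = self._cache.get(unit)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Any ActiveState value not in the table is reported as "error" here.
        role_state = ACTIVE_STATE_TO_ROLE_STATE.get(self._query(unit), "error")
        self._cache[unit] = (now, role_state)
        return role_state
```

The query function and clock are injected so the lookup logic can be exercised without a running systemd. The cache lifetime is a trade-off: a longer TTL means fewer queries but a wider window in which a manual `systemctl stop` goes unnoticed.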
15:15:35 <twoerner> ok, this is working if the instance at least reached the running state.. before reaching it, there is no unit - we also need to track the transitional states and the other persistent states
15:15:59 <sgallagh> The transitional states are actually representable by ActiveState values.
15:16:01 <twoerner> s/unit/target unit/
15:16:13 <sgallagh> Well, except deploying and decommissioning, of course
15:16:21 <nilsph> which can be tracked internally (I expect the rolekitd instance to stay around for the duration of deployment)
15:16:22 <sgallagh> But those are active states and the rolekit daemon must be running
15:16:27 <sgallagh> Right
15:16:51 <nilsph> +1
15:16:56 <nilsph> sounds good
15:17:04 <twoerner> roled is not suspended while an instance is in a transitional state
15:17:10 <sgallagh> twoerner: Before reaching ready-to-start, it can only be one of "nascent", "deploying" or "error"
15:17:24 <twoerner> yes
15:17:42 <sgallagh> twoerner: Right, as per a previous bug, if it starts up and we're in a transitional state, then it treats it as an error
15:17:52 <nilsph> coincides with how I think things should be working
15:17:56 <sgallagh> /me nods
15:18:20 <sgallagh> OK, I realize though I won't be able to remove the state from the XML, if only because we still want to know if we're stuck in a transitional state during startup.
15:18:22 <nilsph> less persistent state == good
15:18:33 <nilsph> (that we have to take care of)
15:18:33 <twoerner> yes
15:18:45 <sgallagh> But we can always go to "ask systemd if we think it's supposed to be ready-to-start or running"
15:19:02 <sgallagh> And therefore get the real value.
15:19:02 <twoerner> so.. if the instance is in READY_TO_START or RUNNING state, then we ask systemd
15:19:05 <sgallagh> yes
15:19:24 <sgallagh> (and if systemd comes back with a transitional state, that'll be an interesting situation :) )
15:19:43 <twoerner> is that possible?
15:20:06 <sgallagh> Yes, a race condition where someone called `systemctl stop target.unit` just as we were trying to read it.
15:20:20 <twoerner> ok.. yes
15:20:23 <twoerner> right
15:20:30 <sgallagh> But in that case, I think we probably want to just report the reality and expect the client to wait a bit and retry.
15:20:52 <sgallagh> Or register with the magical job system we're going to build :)
15:21:17 <twoerner> ok.. how about this:
15:21:36 <nilsph> "magical job system"?
15:21:37 <twoerner> remove READY_TO_START and RUNNING and add SYSTEMD instead?
15:21:54 <nilsph> sounds sensible to me
15:21:58 <twoerner> just to make sure that this is an expected and also valid state
15:22:00 <sgallagh> Interesting... go on
15:22:30 <nilsph> SYSTEMD meaning "we might not know the state, but ask systemd and cache for a reasonable time"
15:22:31 <sgallagh> nilsph: https://github.com/libre-server/rolekit/issues/18
15:22:35 <twoerner> and that systemd will be consulted to get the result
15:22:55 <nilsph> sgallagh: ahh
15:22:57 <nilsph> tnx
15:24:03 <sgallagh> twoerner: I'm not sure I see the value.
15:24:04 <twoerner> SYSTEMD might be our internal state, which is then replaced by the value roled gets back from systemd
15:24:16 <sgallagh> Ah
15:24:29 <sgallagh> Yeah, I suppose that could be reasonable
15:24:35 <nilsph> logic vs interface
15:24:43 <sgallagh> I have to think about that a bit.
15:25:11 <sgallagh> (and we're running out of meeting)
15:25:36 <sgallagh> #info Lots of discussion on the redesign of the unit file creation and monitoring. See full logs for details
15:25:46 <twoerner> but we need to think about the actions that can be done using the SYSTEMD state
15:25:55 <sgallagh> nilsph, twoerner: With five minutes remaining, a fast update on your status? :)
15:26:01 <twoerner> redeploy, start/stop, ..
15:26:28 <sgallagh> twoerner: I think SYSTEMD might just be a transitional state at startup
15:26:47 <sgallagh> All our real operations should be done from ready-to-start or running, after we've established where we actually are
15:29:22 <twoerner> yes, but with READY_TO_START and RUNNING not being real states (we do not have a simple way to get notified of a stopped service that was required by a running instance) we can not distinguish between these states
15:30:04 <sgallagh> twoerner: Sorry, I'm not following. What do you mean by "we do not have a simple way to get notified of a stopped service that was required by a running instance"?
15:30:15 <twoerner> therefore running will not get 'reduced' to ready-to-start or error
15:30:56 <twoerner> as soon as the instance has a target unit we need to rely on systemd completely to get the state
15:31:01 <nilsph> I think these ideas could be split into "who's responsible for XYZ" and "what state is XYZ in", and rolekitd presents an amalgam of both to the outside world -- if we're responsible, we present the state we know, otherwise ask systemd.
15:32:08 <nilsph> Or even, our state has a dual meaning, either "state is ..." or "ask systemd". I guess that's what twoerner was proposing.
15:32:34 <twoerner> I think "ask systemd" is not good.. we should provide this to the user
15:32:35 <sgallagh> twoerner: Sure, but we can also register for notification when that unit changes state
15:32:40 <twoerner> s/we/rolekit/
15:33:10 <sgallagh> nilsph: I don't want us ever to be telling the user "ask systemd". We can do that on their behalf
15:33:24 <nilsph> twoerner: I meant that as "if someone asks us now, we ask systemd, translate that appropriately, and then respond with the state"
15:33:30 <nilsph> sgallagh: ^^
15:33:39 <twoerner> yes
15:33:42 <sgallagh> OK, that was unclear.
15:34:12 <nilsph> just that the meaning of our internal state would drift away from "the state of the instance" and becomes a hybrid
15:34:29 <nilsph> nothing should change to the outside in this scheme
15:34:49 <twoerner> yes, therefore I want to replace the states that are hybrid to simply be SYSTEMD
15:34:49 <nilsph> only less (or later) work for rolekitd
15:34:55 <nilsph> exactky
15:35:00 <nilsph> -k+l
15:35:04 <twoerner> to make sure that this is really only one state and not several
15:35:11 <nilsph> +11
15:35:40 <sgallagh> The thing I'm getting at is that I'm not sure it makes sense to have SYSTEMD as a state.
15:35:50 <twoerner> and to make sure that if someone is looking at the XML file, there is something we can easily explain
15:36:03 <sgallagh> If we're not nascent or error at startup, then we get it from systemd and use the known real state
15:36:17 <sgallagh> I don't know that having a systemd transitional state makes sense
15:36:40 <twoerner> this is not a transitional state in my opinion
15:36:47 <sgallagh> twoerner: OK, I guess I can *kind of* see where it might be useful in the XML file
15:36:51 <twoerner> it is a persistent state
15:37:07 <twoerner> for rolekit... to make sure to ask systemd
15:37:18 <twoerner> in this state
15:37:20 <nilsph> sgallagh: the state could even be "None" unless we want that to express something else?
15:37:33 <nilsph> as in "we don't know ourselves"
15:37:34 <sgallagh> nilsph: I'd prefer to avoid None at least
15:37:39 <nilsph> thought so
15:37:45 <twoerner> we reached the state where systemd needs to be consulted to get the state
15:38:11 <nilsph> I don't think that the state should be on disk for a deployed instance, in this scheme.
15:38:13 <twoerner> yes, please do not use "None" or ""
15:38:18 <sgallagh> Yeah, I see the value in having it and keeping it in the XML
15:38:31 <sgallagh> But never under any circumstances exposed to end-users
15:38:39 <twoerner> yes
15:38:41 <sgallagh> OK, I think we're on the same page
15:38:44 <twoerner> this is for rolekit
15:39:07 <nilsph> While deploying or decommissioning, it needs to go somewhere (hint: probably not in /etc), but while it's deployed properly there's no need for it.
15:39:11 <twoerner> to make sure that we can have a simple way to detect this internal state
15:40:00 <nilsph> How that is represented in the running rolekitd is another matter, state==SYSTEMD or whatever.
15:40:21 <twoerner> I think there is a need for this to make sure that we really reached at least the 'ready-to-start' state
15:40:42 <sgallagh> ok
15:40:48 <twoerner> and are prepared to ask systemd
15:40:52 <sgallagh> I think I have enough to go on, here.
15:40:56 <twoerner> to get a meaningful reply
15:41:05 <nilsph> Presence of deployment/decommissioning state info -> not ready-to-start/SYSTEMD or whatever
15:41:20 <sgallagh> I'll work on getting a patch out this week and we can iterate on it from there
15:41:24 <nilsph> cool
15:41:26 <twoerner> ok
15:41:28 <twoerner> good
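The startup rule the discussion converges on (keep nascent or error as stored, treat a leftover transitional state as an error, and otherwise ask systemd) might be sketched as below. This only illustrates the idea: the function name, the lowercase state strings and the persisted "systemd" marker are assumptions based on the proposal in this meeting, not rolekit's implemented behaviour.

```python
from typing import Callable

# States owned by rolekit itself; finding one of these on disk at daemon
# startup means a deployment or decommission was interrupted.
ROLEKIT_TRANSITIONAL = {"deploying", "decommissioning"}


def resolve_state_at_startup(persisted: str,
                             ask_systemd: Callable[[], str]) -> str:
    """Translate the state stored in the role XML into the state to report.

    `persisted` is the value read from the instance's settings; `ask_systemd`
    is a hypothetical callable returning the state derived from the role's
    target unit (for example "running" or "ready-to-start").
    """
    if persisted in ("nascent", "error"):
        return persisted
    if persisted in ROLEKIT_TRANSITIONAL:
        # The daemon went away mid-operation, so the instance needs attention.
        return "error"
    if persisted == "systemd":
        # Internal marker only: it is never shown to users, who always see
        # the translated answer from systemd.
        return ask_systemd()
    raise ValueError("unexpected persisted state: %r" % persisted)
```

Keeping the marker purely internal matches the point made above that users should never be told to "ask systemd" themselves.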
15:44:06 <sgallagh> Alright, I think we've been in this meeting long enough, and I have to run the Server SIG meeting in 15 minutes.
15:44:11 <sgallagh> Thanks for coming, folks!
15:44:18 <twoerner> yes, thanks
15:44:27 <sgallagh> #endmeeting