Re: taxonomy of dependencies from Jonathan de Boyne Pollard on 2015-06-08 (supervision)

From: Jonathan de Boyne Pollard <J.deBoynePollard-newsgroups_at_NTLWorld.com>
Date: Mon, 08 Jun 2015 01:56:56 +0100

post-sysv:
> Of course, your particular example would be made less gruesome simply
> by introducing a rate limit on startup failures. This strategy seems
> to be employed frequently in launchd setups.

It's a standard daemontools thing, also. It's hardwired into the
original "supervise".

I had to get rid of it for nosh; again, because what's fine on a
hobbyist PC isn't fine in a datacentre. In the daemontools-style
avoid-restarting-too-often 1 second sleep that ensued whilst dnscache
was doing a restart (to quickly clear the cache of a bogus DNS resource
record set), application X processed several hundred transaction
requests. Unfortunately, since application X was talking to dnscache
over the loopback interface, the UDP/IP subsystem merrily informed the
DNS client library that it couldn't reach port 53. (On a non-loopback
interface, the ICMP messages would return too late.) And thus instead
of waiting and retransmitting, the DNS client library immediately
returned a failure to application X for all of that 1 second's worth of
requests.

Sometimes, one does _not_ want these things. If it's doing a graceful
restart, I want dnscache back up *right now*, not 1 second from now.
Application X, whose rate of continual DNS lookups is why there's a
local dnscache in the first place, needs as close to uninterrupted DNS
service as it can get, even in the face of system administrators who
know that "we can just clear that problem out of the local cache and get
things fixed today by killing the DNS server and letting it
auto-restart, can't we?" and then terminate the service twice.

What I have in nosh now, of course, is that this is user-configurable.
You want a 1 second sleep? Put "sleep 1" in the "restart" script. You
don't want one? Don't do that, then. You want to sleep in the event of
a "bad" signal but restart immediately in the event of normal
termination or a "good" signal? Use a case statement and the parameter
passed to restart which encodes the process termination status. And so
forth.
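
As a rough illustration of the case-statement approach just described,
here is a minimal sketch. It assumes that the first argument handed to
"restart" is a word naming how the service ended; the words used below
(exit, term, kill, abort, crash) and the function name restart_delay
are assumptions made for the example, so check the nosh documentation
for the actual encoding before relying on them.

```shell
#!/bin/sh
# Hypothetical helper for a "restart" script.
# ASSUMPTION: $1 is a word describing how the service terminated.
# The category words below are illustrative, not taken from the post.
restart_delay() {
    case "$1" in
        exit|term)
            # Normal termination or a "good" signal: restart at once.
            echo 0
            ;;
        kill|abort|crash)
            # A "bad" signal: daemontools-style 1 second pause.
            echo 1
            ;;
        *)
            # Anything unrecognised: be conservative and pause.
            echo 1
            ;;
    esac
}
```

A "restart" script built on this would then do something like
sleep "$(restart_delay "$1")" before returning, so that a graceful
restart incurs no pause at all.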

And convert-systemd-units can thus write a "restart" script that covers
anything from that to (say) RestartSec=60 and Restart=on-abort, since
the mechanism is flexible enough.
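
To make the RestartSec=60 plus Restart=on-abort case concrete, here is
a hand-written sketch of the kind of logic involved. It is not the
actual output of convert-systemd-units. It assumes, purely for
illustration, that "restart" receives a termination category word as
its first argument (kill, abort, crash and so on) and that a zero exit
status from the script means "go ahead and restart"; the function name
and the RESTART_SEC variable are likewise made up for the example.

```shell
#!/bin/sh
# Hypothetical approximation of Restart=on-abort with RestartSec=60.
# ASSUMPTIONS: $1 is an illustrative termination category word, and
# returning 0 means "restart the service"; neither is from the post.
# RESTART_SEC is a made-up knob standing in for RestartSec=.
should_restart_on_abort() {
    case "$1" in
        kill|abort|crash)
            # Ended by an unclean signal: wait, then allow a restart.
            sleep "${RESTART_SEC:-60}"
            return 0
            ;;
        *)
            # Clean exit or a "good" signal: do not restart.
            return 1
            ;;
    esac
}
```

The script proper would end with should_restart_on_abort "$1", so that
its exit status carries the decision.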

But even a restart interval of 1 minute isn't enough to cope with the
times when it takes rabbitmq-server a fair fraction of an hour to come
up and the number of waiting clients is in 3 figures. Rate limits are a
sticking plaster, not the answer.
Received on Mon Jun 08 2015 - 00:56:56 UTC
