Re: runit kill runsv from Martin \ on 2016-06-27 (supervision)

From: Martin \ <et.code_at_ethome.sk>
Date: Mon, 27 Jun 2016 15:25:32 +0200

On Mon, 27 Jun 2016 14:02:31 +0200
Joan Picanyol i Puig <lists-supervision_at_biaix.org> wrote:
> However, couldn't they know whether their child did not cease to run because
> of a signal they sent?

Some systems (Linux) allow a process to register a signal that the kernel
sends to it on "parent death", but that is not portable, and it actually
requires both parties to be aware that such a mechanism is in place (e.g. both
supervisor and daemon would have to support it).
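
For illustration, here is a minimal sketch of the Linux mechanism,
prctl(PR_SET_PDEATHSIG), called through ctypes. The function names are
hypothetical, the option numbers are the ones from <linux/prctl.h>, and
nothing here is taken from runit or s6, which do not use this facility.

```python
import ctypes
import os
import signal
import sys

# Option numbers from <linux/prctl.h>.
PR_SET_PDEATHSIG = 1
PR_GET_PDEATHSIG = 2


def _libc():
    # Hypothetical helper: refuse to run where the facility does not exist.
    if not sys.platform.startswith("linux"):
        raise OSError("parent-death signals are a Linux-only facility")
    return ctypes.CDLL(None, use_errno=True)


def set_parent_death_signal(sig=signal.SIGTERM):
    """Ask the kernel to signal the calling process when its parent dies.

    Passing 0 clears a previously registered signal.
    """
    libc = _libc()
    if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def get_parent_death_signal():
    """Return the currently registered signal number (0 means none)."""
    libc = _libc()
    current = ctypes.c_int(0)
    if libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(current), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return current.value
```

The setting applies only to the process that makes the call, so it would have
to run in the child between fork and exec (for instance through the preexec_fn
argument of subprocess.Popen), or inside the daemon itself.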

On Fri, 24 Jun 2016 08:33:50 +0800
Thomas Lau <tlau_at_tetrioncapital.com> wrote:
> ... if we could fine tune every parts which makes it more reliable for
> our case, which doesn't seems possible but that's fine.

I am pretty confused and forgetful sometimes myself. I have also been messing
with, and abusing, supervisors for some time now, yet I haven't seen either
runsv or s6-supervise die from some internal state breakage. I have broken my
experimentation VMs and a few real boxes in really bad ways, but the
supervisor ran unfazed.

As was already said, a parent that manages children should "never die" if it
is to avoid problems. From this point of view, it really seems that extreme
care was taken so that neither runit nor s6 ever dies during normal operation.
Great care here means that almost all I/O calls are encapsulated in protected
wrappers, and memory is usually pre-allocated statically. Think about it for
a second.
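
To make the "protected wrapper" idea concrete, here is a rough sketch of the
general pattern: retry a failing system-level operation after a pause instead
of letting the error kill the process. This is not code from runit or s6; the
name retry_forever and its parameters are made up for illustration.

```python
import time


def retry_forever(operation, pause=5.0, on_error=None, sleep=time.sleep):
    """Call operation() until it succeeds and return its result.

    OSError (failed I/O, fork, open, ...) is never propagated: it is
    reported through on_error, if given, and the call is retried after
    `pause` seconds.
    """
    while True:
        try:
            return operation()
        except OSError as exc:
            if on_error is not None:
                on_error(exc)
            sleep(pause)
```

A supervisor built this way would wrap calls such as opening its control pipe
or spawning the child, e.g. retry_forever(lambda: open("/dev/null")), so that
a temporary resource shortage only delays it.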

Although neither supervisor seems to use an OS "protected process" mechanism,
the runtime footprint of the supervisor's parts is so minuscule that they are
probably the smallest processes running on the machine.

Speaking of protection, the BSDs have the madvise(MADV_PROTECT) call, which
marks a process as "important" (this breaks in FreeBSD jails), but not even
the official init uses it. I bet Linux has something similar. However, given
the way these things are coded, that is probably not worth the effort.

I wonder whether the situation you described really happened naturally, or
whether it was the result of some manual intervention (`kill -9`, `kill -6`,
or a libc abort perhaps?), because the chances of a supervisor crashing are so
insanely low.
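
As a side note, whoever is the parent of a process can tell these cases apart
from the wait status. A small sketch, with hypothetical helper names, relying
on the documented behaviour of Python's subprocess module (on POSIX a negative
returncode -N means "terminated by signal N"):

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess returncode into a short human-readable description."""
    if returncode < 0:
        # Negative value: the child was terminated by signal -returncode.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run argv to completion and describe how it ended."""
    return describe_exit(subprocess.run(argv).returncode)


if __name__ == "__main__":
    # The child sends SIGKILL to itself, like an external `kill -9` would.
    suicide = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    print(run_and_describe([sys.executable, "-c", suicide]))
```

A `kill -6` or a libc abort() would show up the same way, as termination by
signal 6 (SIGABRT), whereas a normal exit yields a non-negative status.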

To minimize PEBKAC, I made a similar rule (like Colin) for myself:
- either always use the control program provided by the supervision package,
  or learn the signals (used internally by the supervisor of choice) by heart.

But apart from messing with things manually and some research exercises,
there should be no point in learning the signals in question, since both the
"scandir monitor" and the supervisor should be "uncrashable" under normal
conditions.

To reiterate, I believe anything else on the machine should crash sooner than
any part of runit or s6. Maybe you were having some physical memory
corruption, or somebody else somehow terminated the supervisor?

> I am wondering how does Solaris do their supervision? Their supervision
> program is well known for solid running.

From what I was able to dig out, without bothering to actually try it in a VM,
both Solaris and Illumos use the "contracts" subsystem, which is an in-kernel
facility exposed through the filesystem. SMF probably relies on that, much as
systemd relies on cgroups on Linux, or launchd relies on Mach IPC on OS X.

All these interfaces are usually completely orthogonal to the classic basic
Unix concepts (besides being exposed through the fs) and very, very system
specific. There is usually not even direct parity in functionality.

Both the s6 and runit authors put quite a lot of thought and work into the
portability of their packages, so it seems very unlikely to me that these
system-specific capabilities will ever be supported directly.

eto
Received on Mon Jun 27 2016 - 13:25:32 UTC