Re: interesting claims from Jeff on 2019-05-16 (supervision)

From: Jeff <sysinit_at_yandex.com>
Date: Thu, 16 May 2019 19:10:50 +0200

16.05.2019, 10:31, "Laurent Bercot" <ska-supervision_at_skarnet.org>:
>> The Question: As a newbie outsider I wonder, after following the
>> discussion of supervision and tasks on stages (1,2,3), that there is a
>> restrictive linear progression that prevents reversal. In terms of pid1
>> that I may not totally understand, is there a way that an admin can
>> reduce the system back to pid1 and restart processes instead of taking
>> the system down and restarting? If a glitch is found, usually it is
>> corrected and we find it simple to just do a reboot. What if you can
>> fix the problem and do it on the fly. The question would be why (or why
>> not), and I am not sure I can answer it, but if you theoretically can do
>> so, then can you also kill pid2 while pid10 is still running. With my
>> limited vision I see stages as one-way check valves in a series of fluid
>> linear flow.

take a look at (the now defunct) depinit:
http://sf.net/p/depinit/
http://depinit.sf.net/

it is said to provide very extended rollback of dependencies
(so extended gettys will not work with it according to the docs).

> Stage 1 isn't reversible; once it's done, you never touch it again,
> you don't need to "reverse" it. It would be akin to also unloading
> the kernel from memory before shutting down - it's just not necessary.

indeed.
and when something fails in that first stage a super-user rescue shell
should be started to fix it instead of any services that depend on it.
(stupid example: sethostname failed for some reason, spawn a rescue
shell for the admin to do something about it ;-).
in such cases it has to be considered whether this failure important
enough to justify interuption of the boot phase.

if not: start as much other services as possible,
output/log an error message, keep calm, and carry on,
things can be handled when a getty is up.

> stage 4

i would prefer to call it "stage 3b" since stage 4 would be start after
stage3a + b, i. e. process #1 execs into another executable, maybe
required in connection with initramfs, anopa provides such a stage 4
execline script.

> - If you want to kill every process but pid 1 and have the system
> reconstruct itself from there, then yes, it is possible, and that is
> the whole point of having a supervision tree rooted in pid 1. When
> you kill every process, the supervision tree respawns, so you always
> have a certain set of services running, and the system can always
> recover from whatever you throw at it. Try it: grab a machine with
> a supervision tree and a root shell, run "kill -9 -1", see what happens.

i wonder what happens if process #1 reacts to, say SIGTERM,
by starting the shutdown phase and doing reboot afterwards.
what if process #1 is signaled "accidently" by kill -TERM 1
(as we saw in preceding posts -1 will not reach it).
nothing is restarted and the system goes down instead since
it is assumed that the signal was not sent "accidently".

in the case of a process #1 not supervising anything, supervisor
runs with 1 < PID when killing everything "accidently"
(via kill ( -1, SIGKILL ) for example), system is bricked, reset
button has to be used:

only a privileged process can reach everything with PID > 1 that
way. there seems to be something wrong that should be fixed ASAP.
in the case of process #1 respawning the supervisor:
it restarts everything, maybe the "accident" happens again, and so on ...
could lead to the system being caught in such an "endless" loop.
maybe this can also only get fixed by powering down ...

non supervising process #1: same, but worse: reset button has to
be used, state is lost, fs are not unmounted cleanly and what not.

but in the situation of a supervising process #1 it can also be possible
to be prevented from entering the shutdown phase cleanly.
Received on Thu May 16 2019 - 17:10:50 UTC

This archive was generated by hypermail 2.3.0 : Sun May 09 2021 - 19:44:19 UTC