Re: s6 daemon restart feature enhancement suggestion from Laurent Bercot on 2024-05-26 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Sun, 26 May 2024 19:53:39 +0000

>Let me say that a $daemon i.e. wpa_supplicant or iwd providing
>$service=WiFi{wpa} has been pulled into the s6-rc compiled db and
>started in the supervision tree.
>But the system doesn't have the hardware to support that, or some
>important resource is unavailable.

  So, here's my question: if the system doesn't have the hardware to
support that, why is the daemon in the database in the first place?

  s6-rc, in its current incarnation, is very static when it comes to its
service database; this is by design. The point is that when you have a
compiled service database, you know what's in there, you know what it
does, and you know what services will be running when you boot your
system.
  Adding dynamism goes against that design. I understand the value of
flexibility (this is why most distributions won't use s6-rc as is: they
need more flexibility in their service manager) but there's a trade-off
with reliability, and s6-rc weighs heavily on the reliability side.

  If you are building a distribution aimed at supporting several kinds
of hardware, I suggest adding flexibility at the *source database*
level, and building the compiled database at system configuration time
(or, in extreme cases, at boot time, though I do not recommend that if
you can avoid it, since you lose the static bootability guarantee).

  If your machine can't run wpa_supplicant, then the service manager
should not attempt to run wpa_supplicant in the first place, so the
wpa_supplicant service should not appear in the top bundle.
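
  A minimal sketch of what pruning at the source database level could
look like at configuration time. It assumes a source format where a
bundle lists its members as files in a contents.d directory; the bundle
name "default", the /etc/s6-rc paths and both function names are
hypothetical, and the wireless check relies on the layout Linux exposes
under /sys/class/net.

```shell
#!/bin/sh
# Sketch only: drop a WiFi service from a bundle when the machine has no
# wireless interface. Names and paths are illustrative assumptions.

# has_wireless SYSFS_NET_DIR
# Succeeds if any interface directory has a "wireless" subdirectory.
has_wireless() {
    for iface in "$1"/*; do
        [ -d "$iface/wireless" ] && return 0
    done
    return 1
}

# prune_wifi SOURCE_DIR SYSFS_NET_DIR
# Removes wpa_supplicant from the (hypothetical) "default" bundle when no
# wireless hardware is found; leaves the bundle untouched otherwise.
prune_wifi() {
    if ! has_wireless "$2"; then
        rm -f "$1/default/contents.d/wpa_supplicant"
    fi
}

# At configuration time, one might then run something like:
#   prune_wifi /etc/s6-rc/source /sys/class/net
#   s6-rc-compile /etc/s6-rc/compiled.new /etc/s6-rc/source
```

  In practice you would probably prune a copy of the source database
rather than the original, and also handle any service that depends on
the one you removed.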

  Lacking resources is a different issue: it's a temporary error, and
it makes sense for the service to fail (and be restarted) if it cannot
reserve the resources it needs. If you want to stop trying to bring the
service up after a certain amount of time, and report permanent failure
instead, you can write a timeout-up file, or have a finish script exit
125; see below.

>A mechanism should be prepared, to let $daemon inform it's instance of
>s6-supervise that it can't run, or can't provide $service / it's
>services.

  If you have the information before the machine boots, you should use
the information to prune your service database, and compile a database
that you know will work with your system.

  If you don't have the information before the machine boots, then a
service failing to start is a normal temporary failure, and s6 will
keep attempting to restart the service until permanent failure is
reported.

  You have several ways of marking a service as permanently failed:

  - (only with s6-rc) you can have a timeout-up file, see
  https://skarnet.org/software/s6-rc/s6-rc-compile.html and look for
"timeout-up"

  - (generic s6) you can have a finish script that uses data that has
been collected by s6-supervise to determine whether a permanent failure
should be reported or not. A finish script can report permanent failure
by exiting 125.
  For instance, s6-permafailon (see
  https://skarnet.org/software/s6/s6-permafailon.html ) allows you to
tell s6 that if the service exits nonzero too many times in a given
number of seconds, then it's hopeless.
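
  A minimal sketch of the finish script approach, written by hand
rather than with s6-permafailon. s6-supervise runs ./finish with the
exit code of ./run as the first argument (256 if it was killed by a
signal) and the signal number as the second. The convention that the
daemon exits 78 when its hardware is missing is purely an assumption
for illustration, as is the function name.

```shell
#!/bin/sh
# Sketch only: decide in ./finish whether to report permanent failure.

# Hypothetical convention: the daemon (or a wrapper around it) exits 78
# when the hardware it needs is not present.
NO_HARDWARE=78

# finish_status EXITCODE SIGNAL
# Prints the code the finish script should exit with:
# 125 tells s6-supervise the failure is permanent, 0 lets it restart.
finish_status() {
    if [ "$1" -eq "$NO_HARDWARE" ]; then
        echo 125
    else
        echo 0
    fi
}

# In a real finish script, the last line would be something like:
#   exit "$(finish_status "$1" "$2")"
```

  Any other exit code, or a death by signal, is treated here as a
temporary failure, so the service is restarted as usual. For the "too
many deaths in a given number of seconds" case, s6-permafailon already
does the bookkeeping; see its documentation for the exact arguments.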

  Does this help?

--
  Laurent

Received on Sun May 26 2024 - 21:53:39 CEST
