Re: [PATCH] s6-supervise: Optionally run child in a new pid namespace

From: Jesse Young <jlyo_at_jlyo.org>
Date: Sun, 16 Jul 2017 14:48:57 -0500

On Sun, 16 Jul 2017 08:41:03 +0000
"Laurent Bercot" <ska-skaware_at_skarnet.org> wrote:

> As I told Jesse on IRC, the patch isn't going in. I'm not including
> OS-specific code into s6, even with a compile-time option. The main
> reason for it is that it changes the API: the choice to spawn the
> service in a new namespace or not should be made at run time, so
> it would introduce a new file in the service directory that would only
> be valid under Linux, and the file would need to be supported by
> s6-rc and friends even on other systems, etc. This is exactly the kind
> of complexity created by OS divergences that plagues the Unix world
> and that I very much want to avoid. This change itself looks quite
> simple, but it would be a precedent and the slope is extremely
> slippery.
>
>
> >Though as Jesse explained, this requires some sort of exit/signal
> >proxing, which isn't the case here. Here the direct child of
> >s6-supervise remains the daemon itself - in its own pid ns - which is
> >much better.
> It would unarguably be more elegant, yes, but since there's a way to
> do without it, it's only about elegance, not feasability - and I
> really think the cost for elegance is too high.
>
> execline's 'trap' binary can adequately perform the needed proxying
> at low resource cost.

There's a reliability issue as well, one thing trap won't be able to
do is relay KILL and other uncatchable signals. The proxy and the
service aren't completely sharing the same fate as each other, which
may break users expectations when they run, e.g. s6-svc -k. This fate
sharing mismatch is similar to how s6-sudod may keep running despite
s6-sudoc's death.

To reliably deliver a KILL signal to the service without re-introducing
pid races, the proxy would need a non-fatal way to know to send the
signal, requiring a new API. Improving on execline's trap program
to handle this would probably transform it into a version of daemontools
supervise that exits rather than respawns.

If there were no interposing process, an added benefit is that killing
the pid 1 in a namespace also kills all the other processes in that
namespace. This is a secondary thought, but we get it for free. Using
pid namespaces isn't necessary to reliably kill a set of processes on
Linux, another technique is to use a cgroup freezer, which essentially
puts a write lock on a set of processes, so that an unrelated process
can send signals without the TOCTOU races.

> If more various namespace feature requests come in, at some point I
> will look for a way to integrate some namespace functions into
> skalibs, with proper compile-time guards and stubs, and then
> reconsider; but as long as there are ways to achieve the desired
> outcome with external tools, it's not a priority.

Unsharing everything, except for the pid namespace, should take effect
immediately on the calling process. CLONE_NEWNS is special since the
pid 1 effect gets applied to the process' first child, rather than the
process itself. This means the unshare tool from util-linux should be
able to do everything else in the usual exec chainloading style.

Alternative to patching s6-supervise, what do you think about adding an
option to spawn a different supervisor from s6-svscan? An example API
could be, s6-svscan will spawn "./supervisor" for each service dir it
tracks, if it doesn't exist, it just spawns s6-supervise as usual.

Note that having ./supervisor consist of "unshare -p s6-supervise" will
prevent s6-supervise from re-spawning its supervised child. fork() will
succeed the first time, but error with ENOMEM on subsequent calls unless
unshare(CLONE_NEWPID) is called again. In fact, the patch could
probably avoid refactoring the child into a continuation by doing
unshare() ; fork() ; rather than clone().
Received on Sun Jul 16 2017 - 19:48:57 UTC

This archive was generated by hypermail 2.3.0 : Sun May 09 2021 - 19:38:49 UTC