0032 — Graceful server upgrade (sessions survive a binary update)
0032 — Graceful server upgrade (sessions survive a binary update)
TL;DR. Upgrading the phux binary today means phux kill → restart, which
drops every live pane (your shells, your editor, your agent). Adopt
graceful in-place upgrade via execve re-exec with fd inheritance + VT
snapshot replay: on phux upgrade, the running server snapshots every pane
(the same vt_replay_bytes + scrollback_bytes it already sends a client on
attach), serializes the session/window/pane tree and an fd→pane manifest, then
re-execs the new binary, passing the PTY master fds and the UDS listener as
inherited descriptors. The new server adopts the fds and rebuilds each pane’s
libghostty Terminal by replaying its snapshot. Child processes never notice
(fds survive execve); clients blink-reconnect. A separate fd-keeper daemon
and live engine-state serialization are the rejected alternatives.
Status: Proposed Date: 2026-06-09
Context
phux is a daemon mux: the server runs setsid-detached
(commands/server.rs:52), owns one libghostty Terminal plus a PTY master per
pane, and accepts clients on a UnixListener. Closing a terminal only kills
the client; panes survive and you re-attach — the core value prop.
But a binary upgrade breaks that. The running process is the old code in
memory; the only way onto a new build is to kill the daemon, which closes every
PTY master → the children get SIGHUP/EOF and die. This bit the maintainer
directly: installing a fix required killing the very session that hosted the
work. tmux has the same limitation — tmux has no in-place server upgrade.
For a tool whose whole point is session durability, “lose everything on
update” is a UX hole, and it hurts three audiences: the maintainer iterating on
phux (rebuild → lose panes), an agent (Claude) whose long-running session lives
in a pane, and non-dev users on brew upgrade / package updates who never
expect a refresh to nuke their work.
phux is unusually positioned to fix this. Under ADR-0013 the wire is VT bytes
and the server already synthesizes a self-contained vt_replay_bytes +
scrollback_bytes snapshot per pane (the SnapshotSynthesizer, used on every
client attach). Reconstructing a pane’s grid is therefore already a solved
problem — it is “attach to yourself.” The missing pieces are descriptor
hand-off and the re-exec orchestration, not state capture.
Decision
Add a graceful upgrade path. On phux upgrade (a control command to the
running server; later, optionally auto-triggered when a newer binary appears on
disk), the server:
- Snapshots every pane to
vt_replay_bytes+scrollback_bytesvia the existing synthesizer, and serializes the session/window/pane tree, the per-pane child metadata, and an fd→pane manifest into a state blob (passed via an inherited memfd / temp fd, not argv). - Clears
FD_CLOEXECon every PTY master and theUnixListener, thenexecves the new binary asphux server --resume <state-fd>. Open fds survive the exec, so the children stay alive and attached to their PTY masters across the swap, and the socket path stays bound with no rebind race. - The new server reads the state blob, adopts the inherited PTY masters and
listener, and rebuilds each pane’s
Terminalbyvt_write-ing its snapshot — the same path a fresh client attach drives.
Clients reconnect: when the old image is replaced, accepted client connection fds close, the client sees a disconnect and re-attaches to the (inherited, still-bound) listener, getting a fresh snapshot. v1 is a brief blink-reconnect; passing the live client sockets through for a flicker-free swap is a v2 refinement (more fd-passing, not a different design).
Phasing: v1 is the explicit phux upgrade command. v2 may auto-detect a
newer on-disk binary (e.g. after cargo install / brew upgrade) and prompt or
auto-upgrade per config — the install path then “just works” for non-dev users.
Rationale
fds surviveexecve— the standard graceful-reload primitive (nginx, HAProxy, systemd socket activation). Keeping the PTY masters open across the exec is exactly what keeps the children alive, with zero cooperation from them.- Reuse, not reinvent. The hard part of any hot-swap is reconstructing terminal state; phux already emits a replayable snapshot for attach, so the new server rebuilds via a path that is already tested. tmux would have to bespoke-serialize its grid model; phux replays the wire it already speaks.
- No second process in the steady state — the server stays a single daemon; the cost is paid only during the sub-second upgrade.
Alternatives rejected
- Kill + restart (status quo). Loses every child process. The thing this ADR exists to fix.
- Separate fd-keeper / state daemon that permanently owns the PTYs +
listener while the logic server restarts and reconnects via
SCM_RIGHTS. Cleaner process isolation and survives a logic-server crash, not just an upgrade — but it is a permanent second process, a second IPC seam, and more moving parts. Recorded as the v2 hardening if re-exec proves fragile (e.g. if adopting fds across exec turns out error-prone in practice). - Serialize the libghostty engine state directly. Not feasible across a version bump (no stable engine snapshot format) and would drift from the wire. Replaying the VT snapshot sidesteps it entirely.
Consequences
- Sessions survive binary updates — the headline UX win, for maintainer, agent, and non-dev users alike.
- Upgrade is a brief stop-the-world per server: snapshot + exec + rebuild, targeted sub-second; a client reconnect blink. Acceptable for an explicit upgrade.
- Snapshot fidelity bounds what survives: viewport + bounded scrollback as the
snapshot captures it (styled-scrollback is still plain-text pending
phux-q0x7); transient out-of-band registries (OSC 8, kitty graphics) follow the same deferral as attach. A pane mid-SPAWNor with in-flight input needs a quiesce step before the snapshot. - A re-exec that fails to bind/adopt must not strand the children: the upgrade
path needs a fallback (abort the exec, keep serving on the old image) — so the
command validates the new binary (
--version/ a dry--resumeprobe) before committing. - New surface: a
--resume <fd>server mode, the state-blob format (versioned, since old↔new server versions straddle it), and the fd manifest. The state format is an explicit compatibility boundary maintained across releases.