Overseer

The Overseer is an ephemeral container that Holden spawns when it needs to recreate itself. Docker containers can't modify their own labels or switch to a freshly pulled image while running, and a container that removes itself also stops the process doing the work. Holden therefore delegates recreation to a short-lived Overseer process.

The Overseer is the same Docker image as Holden, started with the --overseer flag.

Idle Mode

When Holden spawns an Overseer, it sets an in-memory idle flag and stops reconciling apps. This is just a variable — no container renames or Docker API calls needed.

If Holden restarts while idle (process crash, server reboot), the flag resets. Holden boots fresh and checks for a running holden-overseer or stale holden-next to decide whether to spawn a new Overseer.
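A minimal Python sketch of the idle flag, assuming a hypothetical `Holden` class with `enter_idle_mode` and `reconcile` methods (none of these names come from Holden's code). It shows that idling is a plain attribute check, and that a restart, modeled here as constructing a new object, starts with the flag cleared.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Holden:
    """Hypothetical model of Holden's reconcile loop; names are illustrative."""

    idle: bool = False  # lives only in memory, so a restart starts at False again
    reconciled: List[str] = field(default_factory=list)

    def enter_idle_mode(self) -> None:
        # No container renames or Docker API calls: just flip the flag.
        self.idle = True

    def reconcile(self, apps: List[str]) -> None:
        if self.idle:
            return  # an Overseer is recreating Holden; leave apps alone
        self.reconciled.extend(apps)


holden = Holden()
holden.reconcile(["blog"])
holden.enter_idle_mode()
holden.reconcile(["shop"])  # skipped while idle
print(holden.reconciled)  # ['blog']

holden = Holden()  # a restart builds fresh state: the idle flag is gone
print(holden.idle)  # False
```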

Container Names

During the update process:

Name	Role
`holden`	Current orchestrator (idle while Overseer works)
`holden-next`	Replacement being health-checked
`holden-overseer`	Ephemeral Overseer process

This follows the same -next convention used for zero-downtime app deployments.

When Holden Spawns an Overseer

Initial boot - When Holden starts and wasn’t created by the Overseer (no holden.created-at label), it spawns an Overseer to recreate itself with the correct labels.

Label changes - Docker labels are immutable on running containers. If you change HOLDEN_PUBLIC_DOMAIN, Holden needs new Traefik labels. The Overseer recreates Holden with the correct labels.

Maintenance - After cleanup, Holden always spawns an Overseer to recreate itself, even if no new image is available. This ensures a fresh state and queues all apps for reconciliation.

Recovery - If Holden boots and finds a holden-next container or running holden-overseer, a previous update was interrupted. It spawns an Overseer to clean up and retry.

The Overseer container is always named holden-overseer. Docker prevents duplicate container names, so only one Overseer can run at a time. If Holden tries to spawn an Overseer but one already exists, it knows an update is already in progress and enters idle mode.
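The triggers above and the name-based lock can be sketched as two small functions. This is an illustration under stated assumptions: the helper names, the order in which triggers are checked, and modeling Docker as a set of container names are all hypothetical. A real check would query Docker and compare against labels generated from the environment.

```python
from typing import Dict, Optional, Set

OVERSEER = "holden-overseer"
NEXT = "holden-next"


def overseer_trigger(
    labels: Dict[str, str],
    desired: Dict[str, str],
    names: Set[str],
    after_maintenance: bool = False,
) -> Optional[str]:
    """Return why Holden should spawn an Overseer, or None (hypothetical helper)."""
    if NEXT in names or OVERSEER in names:
        return "recovery"  # a previous update was interrupted or is still running
    if "holden.created-at" not in labels:
        return "initial boot"  # this container was not created by an Overseer
    if any(labels.get(key) != value for key, value in desired.items()):
        return "label change"  # e.g. HOLDEN_PUBLIC_DOMAIN changed
    if after_maintenance:
        return "maintenance"  # always recreate after cleanup
    return None


def spawn_overseer(names: Set[str]) -> str:
    """Simulate creating holden-overseer; Docker rejects duplicate names (409).

    Either outcome leaves Holden in idle mode.
    """
    if OVERSEER in names:
        return "already running"  # an update is already in progress
    names.add(OVERSEER)
    return "spawned"


names = {"holden"}
print(overseer_trigger({}, {}, names))  # initial boot
print(spawn_overseer(names))  # spawned
print(spawn_overseer(names))  # already running
```

Using the container name as the lock means no extra coordination state is needed: Docker itself refuses the second `holden-overseer`.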

Trust Model

The Overseer doesn’t trust instructions from Holden. It inspects the current Holden container, reads the environment variables, and derives the correct configuration independently.

Bootstrap Data

The Overseer gets everything by inspecting the current Holden container:

Data	Source	Why
Environment vars	Container inspection	`HOLDEN_PUBLIC_DOMAIN`, `HOLDEN_TRAEFIK_*`, `GITHUB_PAT`, etc.
Image tag	Container inspection	e.g., `holden:latest`
Volume mounts	Container inspection	Docker socket, `/data`
Network memberships	Container inspection	Traefik network, default network
Port bindings	Container inspection	`6020:6020`, `127.0.0.1:6021:6021`

Labels (Traefik config) are derived from the environment variables.
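As a sketch of how the table maps onto `docker inspect` output, the function below reads the standard inspect fields `Config.Env`, `Config.Image`, `Mounts`, `HostConfig.PortBindings` and `NetworkSettings.Networks`. Note that the image tag lives in `Config.Image`; the top-level `Image` field holds the resolved image ID. The function name, the shape of the returned dict, and the sample data (network names, token value) are hypothetical.

```python
from typing import Any, Dict


def bootstrap_from_inspect(inspect: Dict[str, Any]) -> Dict[str, Any]:
    """Collect what the Overseer replicates from one `docker inspect` result.

    Field paths follow Docker's inspect output; the function itself is a sketch.
    """
    config = inspect["Config"]
    host = inspect.get("HostConfig") or {}

    env: Dict[str, str] = {}
    for item in config.get("Env") or []:
        key, _, value = item.partition("=")  # split on the first '=' only
        env[key] = value

    return {
        "env": env,
        "image": config["Image"],  # the tag as requested, e.g. benjick/holden:latest
        "mounts": inspect.get("Mounts") or [],  # Docker socket, /data volume
        "port_bindings": host.get("PortBindings") or {},
        "networks": list((inspect.get("NetworkSettings") or {}).get("Networks") or {}),
        "public_domain": env.get("HOLDEN_PUBLIC_DOMAIN"),  # drives the Traefik labels
    }


sample = {
    "Config": {
        "Image": "benjick/holden:latest",
        "Env": ["HOLDEN_PUBLIC_DOMAIN=holden.example.com", "GITHUB_PAT=example-token"],
    },
    "HostConfig": {
        "PortBindings": {
            "6020/tcp": [{"HostIp": "", "HostPort": "6020"}],
            "6021/tcp": [{"HostIp": "127.0.0.1", "HostPort": "6021"}],
        }
    },
    "Mounts": [{"Type": "bind", "Source": "/var/run/docker.sock",
                "Destination": "/var/run/docker.sock"}],
    "NetworkSettings": {"Networks": {"traefik": {}, "default": {}}},
}
print(bootstrap_from_inspect(sample)["public_domain"])  # holden.example.com
```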

The Algorithm

sequenceDiagram
    participant H as Holden
    participant O as Overseer
    participant D as Docker

    H->>H: Set idle flag
    H->>D: Spawn holden-overseer container
    O->>D: Find container with holden.orchestrator=true
    O->>D: Inspect → get env vars, image tag, port bindings
    O->>O: Parse env vars → get public_domain
    O->>O: Generate Traefik labels
    loop Until health check passes
        O->>D: Remove holden-next if exists (cleanup)
        O->>D: Pull fresh image
        Note over O,D: Phase 1 — Validate
        O->>D: Create holden-next (no port bindings)
        O->>D: Wait for health check
        alt Health check fails
            O->>D: Remove holden-next
            O->>O: Wait (exponential backoff, 1h cap)
        end
    end
    O->>D: Remove holden-next
    Note over O,D: Phase 2 — Swap
    O->>D: Stop and remove holden
    Note over H: Old Holden gone
    O->>D: Create holden (with port bindings)
    O->>D: Start holden
    O->>O: Exit

The Overseer uses a two-phase approach to avoid port binding conflicts:

Phase 1 (Validate): Create holden-next without port bindings to confirm the image is healthy. The old holden container keeps the ports throughout validation. Once the health check passes, remove holden-next.

Phase 2 (Swap): Stop and remove the old holden container (freeing the ports), then create a new holden container with the correct labels, env, volumes, networks, and port bindings replicated from the original.

There is a brief downtime (seconds) between removing the old container and starting the new one. Port bindings survive because they’re copied from the original container’s inspection data.
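The two phases can be sketched against a hypothetical in-memory `FakeDocker` class that only tracks container names and host ports, and rejects a second container on an already-bound host port. That rejection is why Phase 1 creates holden-next without bindings. `FakeDocker`, `overseer_swap` and their parameters are illustrative, not Holden's or Docker's API; `create` here stands for create-and-start, and health checking is reduced to a callback.

```python
from typing import Callable, Dict, Iterator, List, Optional


class FakeDocker:
    """In-memory stand-in for the Docker API (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.containers: Dict[str, Dict[str, str]] = {}  # name -> {container port: host port}
        self.log: List[str] = []

    def pull(self, image: str) -> None:
        self.log.append(f"pull {image}")

    def create(self, name: str, ports: Optional[Dict[str, str]] = None) -> None:
        if name in self.containers:
            raise RuntimeError(f"409 Conflict: name {name} is already in use")
        taken = {hp for bound in self.containers.values() for hp in bound.values()}
        if taken & set((ports or {}).values()):
            raise RuntimeError("port is already allocated")
        self.containers[name] = dict(ports or {})
        self.log.append(f"create {name}")

    def remove(self, name: str) -> None:  # like `docker rm -f`: stop, then remove
        if self.containers.pop(name, None) is not None:
            self.log.append(f"remove {name}")


def overseer_swap(docker: FakeDocker, image: str, ports: Dict[str, str],
                  is_healthy: Callable[[str], bool], backoff: Iterator[float],
                  sleep: Callable[[float], None]) -> None:
    # Phase 1: validate the fresh image without ports; the old holden keeps them.
    while True:
        docker.remove("holden-next")  # leftover from an interrupted run
        docker.pull(image)
        docker.create("holden-next")  # no port bindings, so no conflict
        if is_healthy("holden-next"):
            break
        docker.remove("holden-next")
        sleep(next(backoff))  # retries forever; only the wait grows
    docker.remove("holden-next")  # validation passed, candidate not needed
    # Phase 2: swap. Downtime lasts from this remove until the create below.
    docker.remove("holden")  # frees the port bindings
    docker.create("holden", ports)  # ports copied from the old inspection data


docker = FakeDocker()
ports = {"6020/tcp": "6020"}
docker.create("holden", ports)  # the running orchestrator holds port 6020
checks = iter([False, True])  # first candidate fails, second passes
waits: List[float] = []
overseer_swap(docker, "benjick/holden:latest", ports,
              is_healthy=lambda name: next(checks),
              backoff=iter([10, 20]), sleep=waits.append)
print(waits)  # [10]
print(docker.containers)  # {'holden': {'6020/tcp': '6020'}}
```

Because the old holden is only removed after the loop exits, a crash anywhere inside Phase 1 leaves it running untouched.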

Step by step:

1. Holden enters idle mode - Sets the in-memory idle flag and spawns the holden-overseer container with the --overseer flag
2. Find current Holden - By the holden.orchestrator=true label
3. Inspect container - Get env vars, image tag, volumes, networks, and port bindings
4. Parse env vars - Get HOLDEN_PUBLIC_DOMAIN, generate Traefik labels
5. Cleanup - Remove holden-next if it exists (from a previous failed run)
6. Pull fresh image - Same tag, but latest digest (force pull)
7. Create holden-next - Without port bindings (validation only)
8. Wait for health check - If it fails: remove holden-next, wait (exponential backoff: 10s initial, 2x multiplier, 1h cap), retry from step 5
9. Remove holden-next - Validation passed, no longer needed
10. Stop and remove old holden - Frees port bindings
11. Create and start new holden - With correct image, labels, env, volumes, networks, and port bindings
12. Overseer exits - Only after success (or if manually stopped)

The Overseer never gives up. If health checks keep failing, it retries forever (at 1h intervals once capped).
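The retry schedule from step 8 (10s initial, 2x multiplier, 1h cap) can be written as an endless generator. The function name and parameters are illustrative; the numbers are the ones stated above.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: float = 10, multiplier: float = 2,
                   cap: float = 3600) -> Iterator[float]:
    """Yield retry waits in seconds, forever: grow by `multiplier`, stop at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * multiplier, cap)


print(list(islice(backoff_delays(), 12)))
# [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 3600, 3600, 3600]
```

The tenth wait is the first capped one. The nine uncapped waits add up to 5,110 seconds (about 85 minutes); after that the Overseer retries once an hour.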

If the Overseer is stopped or crashes during validation (Phase 1), the original holden container is still running with its correct name, so no recovery is needed. The stale holden-next container (if any) gets cleaned up on the next Overseer run.

Bootstrap Requirements

Your initial docker-compose.yml must include the discovery label:

services:
  holden:
    image: benjick/holden:latest
    labels:
      - "holden.orchestrator=true"  # Required for Overseer
    # ... rest of config

Without this label, the Overseer can’t find the Holden container to replace.
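A sketch of the discovery step, assuming input shaped like the Docker Engine API's container list (each entry has `Names` with a leading slash and a `Labels` map). From the CLI, `docker ps --filter label=holden.orchestrator=true` performs the equivalent filter. The function name is hypothetical, and this version simply returns the first match without trying to disambiguate several labelled containers.

```python
from typing import Any, Dict, List


def find_orchestrator(containers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the first container labelled holden.orchestrator=true (a sketch)."""
    for container in containers:
        labels = container.get("Labels") or {}
        if labels.get("holden.orchestrator") == "true":
            return container
    raise LookupError("no container has holden.orchestrator=true; nothing to replace")


running = [
    {"Names": ["/traefik"], "Labels": {}},
    {"Names": ["/holden"], "Labels": {"holden.orchestrator": "true"}},
]
print(find_orchestrator(running)["Names"])  # ['/holden']
```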

Updating docker-compose.yml

After the first boot, the running holden container is managed by the Overseer, not by Compose. If you change your docker-compose.yml (e.g., add an environment variable), Compose won’t update the running container because it doesn’t recognize it as its own.

To apply changes:

docker rm -f holden && docker compose up -d --pull always holden

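This removes the Overseer-managed container (freeing the name) and lets Compose create a fresh one with the updated config. Because that container has no holden.created-at label, Holden treats it as an initial boot and spawns an Overseer, which recreates it with the correct labels.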

See Getting Started for the full docker-compose setup with all required configuration.

Why This Design

Labels are immutable - Docker doesn’t allow modifying labels on a running container. The only way to change labels is to recreate the container.

Same pattern as app deploys - The Overseer uses the same -next convention as zero-downtime app deployments. Create the replacement, health-check it, then swap.

Clean recovery - If the Overseer crashes during validation, the original holden container is still running with its correct name. No rename-back or special recovery logic is needed: just clean up holden-next and retry.

Automatic rollback - If holden-next never becomes healthy, the Overseer removes it and retries. The original holden is untouched until the new one is confirmed healthy.

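Inspection as source of truth - By always deriving its configuration from the current Holden container and its environment variables, the Overseer produces the correct result independently. It doesn't matter what state Holden was in or what instructions it tried to pass.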

Generated Labels

The Overseer adds these labels to the new Holden container:

Creation timestamp - holden.created-at=<ISO timestamp> marks when this container was created. Its presence tells Holden it was properly created by the Overseer.

Traefik labels - When HOLDEN_PUBLIC_DOMAIN is set (shown here for HOLDEN_PUBLIC_DOMAIN=holden.example.com):

traefik.enable: "true"
traefik.http.routers.holden.rule: "Host(`holden.example.com`)"
traefik.http.routers.holden.entrypoints: "websecure"
traefik.http.routers.holden.tls.certresolver: "letsencrypt"
traefik.http.services.holden.loadbalancer.server.port: "6020"

This routes webhook traffic from Traefik to Holden’s port 6020. The entrypoint and certresolver names are configurable. The internal API (6021) remains bound to localhost only, accessible via the CLI.
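The labels above can be generated with a small function. This is a sketch under assumptions: the function name is hypothetical, the entrypoint and certresolver are plain parameters because the text doesn't say how they are configured (defaults taken from the example above), and the timestamp format (`isoformat()` in UTC) is a guess at "ISO timestamp".

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def generate_labels(public_domain: Optional[str], entrypoint: str = "websecure",
                    certresolver: str = "letsencrypt",
                    now: Optional[datetime] = None) -> Dict[str, str]:
    """Build the labels listed above for a new Holden container (a sketch)."""
    created = (now or datetime.now(timezone.utc)).isoformat()
    labels = {"holden.created-at": created}  # marks "created by the Overseer"
    if public_domain:  # Traefik labels only when HOLDEN_PUBLIC_DOMAIN is set
        labels.update({
            "traefik.enable": "true",
            "traefik.http.routers.holden.rule": f"Host(`{public_domain}`)",
            "traefik.http.routers.holden.entrypoints": entrypoint,
            "traefik.http.routers.holden.tls.certresolver": certresolver,
            "traefik.http.services.holden.loadbalancer.server.port": "6020",
        })
    return labels


fixed = datetime(2024, 1, 1, tzinfo=timezone.utc)
for key, value in generate_labels("holden.example.com", now=fixed).items():
    print(f"{key}: {value}")
```

Only port 6020 appears in the labels; the localhost-only binding for 6021 is a port binding copied from the original container, not something Traefik routes.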