
Zero-Downtime Deployments

When your container has a Docker HEALTHCHECK, Holden uses zero-downtime deployment. The new container must be healthy before it receives traffic, and the old container gets time to drain before stopping.

Holden uses network gating and priority-based cutover. The new container starts connected to app networks (postgres, valkey, etc.) but NOT the Traefik network — so Traefik can’t route traffic to it yet. Once healthy, Holden connects it to Traefik with a higher router priority than v1, so all new traffic routes to v2 immediately with no round-robin.
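Under the hood this maps onto Traefik's standard docker-provider labels. A hedged sketch of what the two containers' labels could look like mid-cutover (the router names, rule, and priority values are illustrative assumptions, not Holden's documented output):

```yaml
# v1 — currently serving
labels:
  traefik.enable: "true"
  traefik.http.routers.web-v1.rule: "Host(`example.com`)"  # illustrative rule
  traefik.http.routers.web-v1.priority: "41"

# v2 — joins the traefik network once healthy
labels:
  traefik.enable: "true"
  traefik.http.routers.web-v2.rule: "Host(`example.com`)"
  traefik.http.routers.web-v2.priority: "42"  # higher priority wins: all new traffic
```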

```mermaid
sequenceDiagram
    participant H as Holden
    participant D as Docker
    participant T as Traefik

    Note over H: v1 running, serving traffic
    H->>D: Create v2 on app networks (not Traefik)
    Note over T: Traefik can't see v2 yet
    loop Poll health status
        H->>D: Check container health
    end
    alt Container becomes healthy
        H->>D: Connect v2 to Traefik network
        Note over T: v2 has higher priority, gets all new traffic
        H->>D: docker stop --time drain_timeout v1
        Note over D: v1 drains, then stops
        H->>D: Remove v1
        H->>D: Rename v2 → final name
    else startup_timeout expires
        H->>D: Remove v2
        Note over H: v1 unchanged
    end
```

Step by step:

  1. v1 running — Connected to Traefik network + app networks, serving traffic
  2. Create v2 — Starts on app networks only (postgres, valkey, etc.), NOT Traefik
    • Traefik can’t see v2 yet
    • v2 connects to dependencies, runs initialization
  3. Wait for healthy — Holden polls Docker’s health status
    • Container has HEALTHCHECK: wait until status = “healthy”
    • No healthcheck: “running” = ready (no zero-downtime guarantee)
  4. Connect to Traefik — docker network connect traefik v2
    • v2 has a higher Traefik router priority than v1, so all new traffic routes to v2 immediately
    • v1 only finishes in-flight requests — no round-robin between old and new
  5. Stop v1 gracefully — docker stop --time <drain_timeout> v1
    • v1 receives SIGTERM
    • v1 drains in-flight requests
    • After drain_timeout, SIGKILL if still running
  6. Remove v1 — Container is deleted
  7. Rename — v2 gets its final container name
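The steps above can be condensed into a small sketch using the Docker SDK for Python (docker-py). This is a minimal sketch, not Holden's implementation: names like web, the app network, and the traefik network are assumptions, and the client is duck-typed so real use would pass `docker.from_env()`.

```python
import time

def deploy(client, name="web", image="ghcr.io/you/myapp:latest",
           startup_timeout=300, drain_timeout=10, poll_interval=2):
    """One zero-downtime cycle against a Docker SDK-style client."""
    old = client.containers.get(name)
    # 1-2. Create v2 on the app network only; Traefik cannot route to it yet.
    new = client.containers.run(image, name=f"{name}-next",
                                network="app", detach=True)
    # 3. Wait for Docker to report "healthy" ("running" if no HEALTHCHECK).
    deadline = time.monotonic() + startup_timeout
    while True:
        new.reload()                          # refresh attrs from the daemon
        state = new.attrs["State"]
        status = state.get("Health", {}).get("Status", state["Status"])
        if status in ("healthy", "running"):  # "running" = no HEALTHCHECK defined
            break
        if status == "unhealthy" or time.monotonic() > deadline:
            new.remove(force=True)            # failed deploy; v1 never stopped serving
            raise RuntimeError("new container never became healthy")
        time.sleep(poll_interval)
    # 4. Connect v2 to Traefik; its higher router priority takes all new traffic.
    client.networks.get("traefik").connect(new)
    # 5-7. Drain and remove v1, then give v2 the final name.
    old.stop(timeout=drain_timeout)           # SIGTERM now, SIGKILL after drain_timeout
    old.remove()
    new.rename(name)
```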
holden.yml

```yaml
services:
  web:
    image: ghcr.io/you/myapp:latest
    startup_timeout: 5m  # max time to wait for healthy (default: 5m)
    drain_timeout: 10s   # time between SIGTERM and SIGKILL (default: 10s)
```

| Option | Default | Description |
| --- | --- | --- |
| startup_timeout | 5m | Max time to wait for container to become healthy |
| drain_timeout | 10s | Time for graceful shutdown before SIGKILL |

Health checks are defined in your Dockerfile, not in holden.yml. This keeps app health knowledge with the app.

Dockerfile

```dockerfile
FROM node:24-alpine
# ... your app setup ...
HEALTHCHECK --interval=5s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

Holden polls Docker’s container health status. When Docker reports “healthy”, the container is ready for traffic.

For images without a HEALTHCHECK, wrap them:

```dockerfile
FROM someimage:latest
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
```

Without a HEALTHCHECK, containers are considered ready when “running” — no zero-downtime guarantee.
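That readiness rule is small enough to state as code — a sketch over the State object that docker inspect returns:

```python
def is_ready(state: dict) -> bool:
    """Readiness from a container's inspect State: with a HEALTHCHECK,
    only "healthy" counts; without one, "running" counts (no guarantee)."""
    health = state.get("Health")
    if health is not None:
        return health["Status"] == "healthy"
    return state["Status"] == "running"
```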

If the new container never becomes healthy:

  1. Holden waits up to startup_timeout (default 5m)
  2. If timeout expires or Docker marks container “unhealthy”, deployment fails
  3. Holden removes the new container, old container keeps running

This is a rollback without rolling back — v1 never stopped serving traffic.

If v1 is already unhealthy when you deploy, deployment proceeds normally. The new container must pass its own health check before receiving traffic — v1’s health status doesn’t affect this.

If v1 is unhealthy and v2 also fails to become healthy, you’re left with unhealthy v1 (same as before the deploy attempt). Fix the issue in your code/config and push again.

Because v2 has a higher Traefik router priority, no new requests are routed to v1 after v2 joins the network. The only requests v1 handles are those already mid-flight when the switch happens.

When v1 receives SIGTERM, it should:

  1. Finish processing in-flight requests
  2. Exit cleanly before drain_timeout

For long-lived connections (websockets, large uploads), apps need proper SIGTERM handling to drain gracefully.
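A minimal Python sketch of the pattern (the accept/handle helpers are hypothetical stand-ins for your framework; a Node or Go app would do the equivalent with its own signal API):

```python
import signal
import threading

shutting_down = threading.Event()

def on_sigterm(signum, frame):
    shutting_down.set()          # just set a flag; never do heavy work in a handler

signal.signal(signal.SIGTERM, on_sigterm)

def serve(accept_request, handle_request):
    """Accept-loop sketch: stop taking new work on SIGTERM, finish in-flight work.
    accept_request returns None when nothing is waiting (hypothetical helper)."""
    while not shutting_down.is_set():
        req = accept_request()
        if req is not None:
            handle_request(req)  # in-flight request completes before we fall through
    # After SIGTERM: close listeners, flush, exit 0 within drain_timeout.
```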

Holden sets a Traefik router priority on v2 that’s one higher than v1’s. This priority persists on the container after the deploy completes, so the next deploy reads it and increments again. Over many deploys, the priority grows monotonically.
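The increment itself is simple. A sketch, assuming the priority lives in a Traefik router label on v1 (the exact label key depends on the router name and is illustrative here):

```python
def next_priority(old_labels: dict, base: int = 1) -> int:
    """Compute v2's Traefik router priority from v1's labels.
    Falls back to a base priority when no previous deploy exists."""
    current = old_labels.get("traefik.http.routers.web.priority")
    return int(current) + 1 if current is not None else base
```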

Traefik reserves a range of priorities for its internal routers. The maximum user-defined priority is MaxInt64 - 1000 on 64-bit platforms. At one deploy per minute, it would take over 17 trillion years to reach this limit.
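A quick sanity check of that arithmetic:

```python
MAX_USER_PRIORITY = 2**63 - 1 - 1000   # MaxInt64 minus Traefik's reserved range
deploys_per_year = 60 * 24 * 365       # one deploy per minute, every minute
years_to_exhaust = MAX_USER_PRIORITY / deploys_per_year
# roughly 1.75e13 -- over 17 trillion years
```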

Workers don’t receive HTTP traffic — they pull jobs from a queue. The Traefik network gating doesn’t apply, but the same deployment flow works:

  1. Start new worker, wait for healthy
  2. SIGTERM old worker — should stop pulling new jobs, finish current job
  3. Brief overlap where both workers process jobs is normal (jobs are independent)

Holden does “wait for healthy, then stop old.” The worker framework handles the drain — stop accepting new jobs on SIGTERM, finish in-flight work.
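The worker side of that contract can be sketched in Python (the queue and handler are stand-ins for your real job framework):

```python
import queue
import signal
import threading

stop = threading.Event()
# On SIGTERM: stop pulling new jobs; never abandon the one in progress.
signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())

def worker(jobs: queue.Queue, handle) -> None:
    """Pull jobs until SIGTERM arrives; the current job always runs to completion."""
    while not stop.is_set():
        try:
            job = jobs.get(timeout=1)  # short timeout so the stop flag is rechecked
        except queue.Empty:
            continue
        handle(job)                    # finishes even if SIGTERM lands mid-job
```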