
Zero-Downtime Deployments

When your container has a Docker HEALTHCHECK, Holden uses zero-downtime deployment. The new container must be healthy before it receives traffic, and the old container gets time to drain before stopping.

Holden uses network gating and priority-based cutover. The new container starts connected to app networks (postgres, valkey, etc.) but NOT the Traefik network — so Traefik can’t route traffic to it yet. Once healthy, Holden connects it to Traefik with a higher router priority than v1, so all new traffic routes to v2 immediately with no round-robin.
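Under the hood this maps onto Traefik's standard docker-provider labels. A hedged sketch of what the two containers' labels could look like mid-cutover (the router names, rule, and priority values are illustrative assumptions, not Holden's documented output):

```yaml
# v1 — currently serving
labels:
  traefik.enable: "true"
  traefik.http.routers.web-v1.rule: "Host(`example.com`)"  # illustrative rule
  traefik.http.routers.web-v1.priority: "41"

# v2 — joins the traefik network once healthy
labels:
  traefik.enable: "true"
  traefik.http.routers.web-v2.rule: "Host(`example.com`)"
  traefik.http.routers.web-v2.priority: "42"  # higher priority wins: all new traffic
```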

```mermaid
sequenceDiagram
    participant H as Holden
    participant D as Docker
    participant T as Traefik

    Note over H: v1 running, serving traffic
    H->>D: Create v2 on app networks (not Traefik)
    Note over T: Traefik can't see v2 yet
    loop Poll health status
        H->>D: Check container health
    end
    alt Container becomes healthy
        H->>D: Connect v2 to Traefik network
        Note over T: v2 has higher priority, gets all new traffic
        H->>D: docker stop --time drain_timeout v1
        Note over D: v1 drains, then stops
        H->>D: Remove v1
        H->>D: Rename v2 → final name
    else startup_timeout expires
        H->>D: Remove v2
        Note over H: v1 unchanged
    end
```

Step by step:

  1. v1 running — Connected to Traefik network + app networks, serving traffic
  2. Create v2 — Starts on app networks only (postgres, valkey, etc.), NOT Traefik
    • Traefik can’t see v2 yet
    • v2 connects to dependencies, runs initialization
  3. Wait for healthy — Holden polls Docker’s health status
    • Container has HEALTHCHECK: wait until status = “healthy”
    • No healthcheck: “running” = ready (no zero-downtime guarantee)
  4. Connect to Traefik — docker network connect traefik v2
    • v2 has a higher Traefik router priority than v1, so all new traffic routes to v2 immediately
    • v1 only finishes in-flight requests — no round-robin between old and new
  5. Stop v1 gracefully — docker stop --time <drain_timeout> v1
    • v1 receives SIGTERM
    • v1 drains in-flight requests
    • After drain_timeout, SIGKILL if still running
  6. Remove v1 — Container is deleted
  7. Rename — v2 gets its final container name
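The steps above can be condensed into a small sketch using the Docker SDK for Python (docker-py). This is a minimal sketch, not Holden's implementation: names like web, the app network, and the traefik network are assumptions, and the client is duck-typed so real use would pass `docker.from_env()`.

```python
import time

def deploy(client, name="web", image="ghcr.io/you/myapp:latest",
           startup_timeout=300, drain_timeout=10, poll_interval=2):
    """One zero-downtime cycle against a Docker SDK-style client."""
    old = client.containers.get(name)
    # 1-2. Create v2 on the app network only; Traefik cannot route to it yet.
    new = client.containers.run(image, name=f"{name}-next",
                                network="app", detach=True)
    # 3. Wait for Docker to report "healthy" ("running" if no HEALTHCHECK).
    deadline = time.monotonic() + startup_timeout
    while True:
        new.reload()                          # refresh attrs from the daemon
        state = new.attrs["State"]
        status = state.get("Health", {}).get("Status", state["Status"])
        if status in ("healthy", "running"):  # "running" = no HEALTHCHECK defined
            break
        if status == "unhealthy" or time.monotonic() > deadline:
            new.remove(force=True)            # failed deploy; v1 never stopped serving
            raise RuntimeError("new container never became healthy")
        time.sleep(poll_interval)
    # 4. Connect v2 to Traefik; its higher router priority takes all new traffic.
    client.networks.get("traefik").connect(new)
    # 5-7. Drain and remove v1, then give v2 the final name.
    old.stop(timeout=drain_timeout)           # SIGTERM now, SIGKILL after drain_timeout
    old.remove()
    new.rename(name)
```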
holden.yml

```yaml
services:
  web:
    image: ghcr.io/you/myapp:latest
    startup_timeout: 5m  # max time to wait for healthy (default: 5m)
    drain_timeout: 10s   # time between SIGTERM and SIGKILL (default: 10s)
```

| Option | Default | Description |
| --- | --- | --- |
| startup_timeout | 5m | Max time to wait for container to become healthy |
| drain_timeout | 10s | Time for graceful shutdown before SIGKILL |

Health checks are defined in your Dockerfile, not in holden.yml. This keeps app health knowledge with the app.

Dockerfile

```dockerfile
FROM node:24-alpine
# ... your app setup ...
HEALTHCHECK --interval=5s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

Holden polls Docker’s container health status. When Docker reports “healthy”, the container is ready for traffic.

For images without a HEALTHCHECK, wrap them:

```dockerfile
FROM someimage:latest
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
```

Without a HEALTHCHECK, containers are considered ready when “running” — no zero-downtime guarantee.
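That readiness rule is small enough to state as code — a sketch over the State object that docker inspect returns:

```python
def is_ready(state: dict) -> bool:
    """Readiness from a container's inspect State: with a HEALTHCHECK,
    only "healthy" counts; without one, "running" counts (no guarantee)."""
    health = state.get("Health")
    if health is not None:
        return health["Status"] == "healthy"
    return state["Status"] == "running"
```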

If the new container never becomes healthy:

  1. Holden waits up to startup_timeout (default 5m)
  2. If timeout expires or Docker marks container “unhealthy”, deployment fails
  3. Holden removes the new container, old container keeps running

This is a rollback without rolling back — v1 never stopped serving traffic.

If v1 is already unhealthy when you deploy, deployment proceeds normally. The new container must pass its own health check before receiving traffic — v1’s health status doesn’t affect this.

If v1 is unhealthy and v2 also fails to become healthy, you’re left with unhealthy v1 (same as before the deploy attempt). Fix the issue in your code/config and push again.

Because v2 has a higher Traefik router priority, no new requests are routed to v1 after v2 joins the network. The only requests v1 handles are those already mid-flight when the switch happens.

When v1 receives SIGTERM, it should:

  1. Finish processing in-flight requests
  2. Exit cleanly before drain_timeout

For long-lived connections (websockets, large uploads), apps need proper SIGTERM handling to drain gracefully.
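A minimal Python sketch of the pattern (the accept/handle helpers are hypothetical stand-ins for your framework; a Node or Go app would do the equivalent with its own signal API):

```python
import signal
import threading

shutting_down = threading.Event()

def on_sigterm(signum, frame):
    shutting_down.set()          # just set a flag; never do heavy work in a handler

signal.signal(signal.SIGTERM, on_sigterm)

def serve(accept_request, handle_request):
    """Accept-loop sketch: stop taking new work on SIGTERM, finish in-flight work.
    accept_request returns None when nothing is waiting (hypothetical helper)."""
    while not shutting_down.is_set():
        req = accept_request()
        if req is not None:
            handle_request(req)  # in-flight request completes before we fall through
    # After SIGTERM: close listeners, flush, exit 0 within drain_timeout.
```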

Holden sets a Traefik router priority on v2 that’s one higher than v1’s. This priority persists on the container after the deploy completes, so the next deploy reads it and increments again. Over many deploys, the priority grows monotonically.
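The increment itself is simple. A sketch, assuming the priority lives in a Traefik router label on v1 (the exact label key depends on the router name and is illustrative here):

```python
def next_priority(old_labels: dict, base: int = 1) -> int:
    """Compute v2's Traefik router priority from v1's labels.
    Falls back to a base priority when no previous deploy exists."""
    current = old_labels.get("traefik.http.routers.web.priority")
    return int(current) + 1 if current is not None else base
```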

Traefik reserves a range of priorities for its internal routers. The maximum user-defined priority is MaxInt64 - 1000 on 64-bit platforms. At one deploy per minute, it would take over 17 trillion years to reach this limit.
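A quick sanity check of that arithmetic:

```python
MAX_USER_PRIORITY = 2**63 - 1 - 1000   # MaxInt64 minus Traefik's reserved range
deploys_per_year = 60 * 24 * 365       # one deploy per minute, every minute
years_to_exhaust = MAX_USER_PRIORITY / deploys_per_year
# roughly 1.75e13 -- over 17 trillion years
```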

Workers don’t receive HTTP traffic — they pull jobs from a queue. The Traefik network gating doesn’t apply, but the same deployment flow works:

  1. Start new worker, wait for healthy
  2. SIGTERM old worker — should stop pulling new jobs, finish current job
  3. Brief overlap where both workers process jobs is normal (jobs are independent)

Holden does “wait for healthy, then stop old.” The worker framework handles the drain — stop accepting new jobs on SIGTERM, finish in-flight work.
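The worker side of that contract can be sketched in Python (the queue and handler are stand-ins for your real job framework):

```python
import queue
import signal
import threading

stop = threading.Event()
# On SIGTERM: stop pulling new jobs; never abandon the one in progress.
signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())

def worker(jobs: queue.Queue, handle) -> None:
    """Pull jobs until SIGTERM arrives; the current job always runs to completion."""
    while not stop.is_set():
        try:
            job = jobs.get(timeout=1)  # short timeout so the stop flag is rechecked
        except queue.Empty:
            continue
        handle(job)                    # finishes even if SIGTERM lands mid-job
```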