# Zero-Downtime Deployments
When your container has a Docker HEALTHCHECK, Holden uses zero-downtime deployment. The new container must be healthy before it receives traffic, and the old container gets time to drain before stopping.
## How It Works

Holden uses network gating and priority-based cutover. The new container starts connected to app networks (postgres, valkey, etc.) but NOT the Traefik network — so Traefik can't route traffic to it yet. Once healthy, Holden connects it to Traefik with a higher router priority than v1, so all new traffic routes to v2 immediately with no round-robin.
```mermaid
sequenceDiagram
    participant H as Holden
    participant D as Docker
    participant T as Traefik
    Note over H: v1 running, serving traffic
    H->>D: Create v2 on app networks (not Traefik)
    Note over T: Traefik can't see v2 yet
    loop Poll health status
        H->>D: Check container health
    end
    alt Container becomes healthy
        H->>D: Connect v2 to Traefik network
        Note over T: v2 has higher priority, gets all new traffic
        H->>D: docker stop --time drain_timeout v1
        Note over D: v1 drains, then stops
        H->>D: Remove v1
        H->>D: Rename v2 → final name
    else startup_timeout expires
        H->>D: Remove v2
        Note over H: v1 unchanged
    end
```
Step by step:

1. **v1 running** — connected to the Traefik network + app networks, serving traffic
2. **Create v2** — starts on app networks only (postgres, valkey, etc.), NOT Traefik:
   - Traefik can't see v2 yet
   - v2 connects to dependencies, runs initialization
3. **Wait for healthy** — Holden polls Docker's health status:
   - Container has a HEALTHCHECK: wait until status = "healthy"
   - No healthcheck: "running" = ready (no zero-downtime guarantee)
4. **Connect to Traefik** — `docker network connect traefik v2`:
   - v2 has a higher Traefik router priority than v1, so all new traffic routes to v2 immediately
   - v1 only finishes in-flight requests — no round-robin between old and new
5. **Stop v1 gracefully** — `docker stop --time <drain_timeout> v1`:
   - v1 receives SIGTERM
   - v1 drains in-flight requests
   - After `drain_timeout`, SIGKILL if still running
6. **Remove v1** — the container is deleted
7. **Rename** — v2 gets its final container name
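The control flow above can be sketched as a small loop. This is illustrative Python, not Holden's actual code: `client` and its methods (`health_status`, `network_connect`, `stop`, `remove`, `rename`) are hypothetical stand-ins for Docker API calls.

```python
import time

def wait_until_healthy(client, container_id, startup_timeout, poll_interval=1.0):
    """Poll Docker's health status until "healthy" or the timeout expires."""
    deadline = time.monotonic() + startup_timeout
    while time.monotonic() < deadline:
        status = client.health_status(container_id)
        if status is None:        # no HEALTHCHECK: "running" counts as ready
            return True
        if status == "healthy":
            return True
        if status == "unhealthy":
            return False          # fail fast rather than waiting out the timeout
        time.sleep(poll_interval)  # status is "starting"; keep polling
    return False

def deploy(client, v1, v2, final_name, startup_timeout=300, drain_timeout=10):
    # v2 was created on app networks only, so Traefik can't route to it yet.
    if not wait_until_healthy(client, v2, startup_timeout):
        client.remove(v2)         # failed deploy: v1 never stopped serving
        return False
    client.network_connect("traefik", v2)   # cutover: v2 outranks v1
    client.stop(v1, timeout=drain_timeout)  # SIGTERM, SIGKILL after drain_timeout
    client.remove(v1)
    client.rename(v2, final_name)           # v2 takes over the final name
    return True
```

Note that the failure branch removes only v2 and returns before touching v1, which is why a failed deploy leaves the old version serving traffic.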
## Configuration

```yaml
services:
  web:
    image: ghcr.io/you/myapp:latest
    startup_timeout: 5m   # max time to wait for healthy (default: 5m)
    drain_timeout: 10s    # time between SIGTERM and SIGKILL (default: 10s)
```

| Option | Default | Description |
|---|---|---|
| `startup_timeout` | `5m` | Max time to wait for the container to become healthy |
| `drain_timeout` | `10s` | Time for graceful shutdown before SIGKILL |
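The timeout values are duration strings with unit suffixes. If you need to reason about them in your own tooling, here is a minimal parser sketch, assuming only `s`, `m`, and `h` suffixes (Holden's actual accepted grammar may differ):

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_duration(value: str) -> float:
    """Parse duration strings like "5m", "10s", or "1h30m" into seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)([smh])", value)
    # Reject inputs with leftover characters the regex didn't consume.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"invalid duration: {value!r}")
    return sum(float(n) * UNITS[u] for n, u in parts)
```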
## Docker HEALTHCHECK

Health checks are defined in your Dockerfile, not in holden.yml. This keeps app health knowledge with the app.

```dockerfile
FROM node:24-alpine

# ... your app setup ...

HEALTHCHECK --interval=5s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

Holden polls Docker's container health status. When Docker reports "healthy", the container is ready for traffic.
## Third-Party Images

For images without a HEALTHCHECK, wrap them:

```dockerfile
FROM someimage:latest

HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
```

Without a HEALTHCHECK, containers are considered ready when "running" — no zero-downtime guarantee.
## Failed Deployments

If the new container never becomes healthy:

- Holden waits up to `startup_timeout` (default 5m)
- If the timeout expires or Docker marks the container "unhealthy", the deployment fails
- Holden removes the new container; the old container keeps running
This is a rollback without rolling back — v1 never stopped serving traffic.
## What if v1 is already unhealthy?

Deployment proceeds normally. The new container must pass its own health check before receiving traffic — v1's health status doesn't affect this.
If v1 is unhealthy and v2 also fails to become healthy, you’re left with unhealthy v1 (same as before the deploy attempt). Fix the issue in your code/config and push again.
## In-Flight Requests

Because v2 has a higher Traefik router priority, no new requests are routed to v1 after v2 joins the network. The only requests v1 handles are those already mid-flight when the switch happens.
When v1 receives SIGTERM, it should:

- Finish processing in-flight requests
- Exit cleanly before `drain_timeout`
For long-lived connections (websockets, large uploads), apps need proper SIGTERM handling to drain gracefully.
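What "proper SIGTERM handling" looks like depends on your framework. As a minimal sketch using only Python's standard library (illustrative, not tied to any Holden API): install a handler that stops the accept loop and lets the server return once in-progress work finishes.

```python
import signal
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep request logging quiet
        pass

def serve_with_graceful_shutdown(port=3000):
    server = HTTPServer(("0.0.0.0", port), Handler)

    def on_sigterm(signum, frame):
        # shutdown() must run off the serving thread or it deadlocks; it
        # stops the accept loop after the current request finishes.
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    server.serve_forever()  # returns once shutdown() completes
    server.server_close()   # exit cleanly before drain_timeout brings SIGKILL
```

The same shape applies to websocket servers, except the handler also needs to close long-lived connections (or wait for them) before exiting.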
## Router Priority

Holden sets a Traefik router priority on v2 that's one higher than v1's. This priority persists on the container after the deploy completes, so the next deploy reads it and increments again. Over many deploys, the priority grows monotonically.
Traefik reserves a range of priorities for its internal routers. The maximum user-defined priority is MaxInt64 - 1000 on 64-bit platforms. At one deploy per minute, it would take over 17 trillion years to reach this limit.
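In label terms, the bookkeeping is just "read v1's priority, add one, stay below Traefik's reserved range." A sketch, where the label key follows Traefik's `traefik.http.routers.<name>.priority` convention but the router name `web` is illustrative:

```python
# Traefik reserves the top of the range for internal routers; the maximum
# user-defined priority is MaxInt64 - 1000 on 64-bit platforms.
MAX_USER_PRIORITY = 2**63 - 1 - 1000

def next_priority(v1_labels: dict,
                  key: str = "traefik.http.routers.web.priority") -> int:
    """Return v2's router priority: one higher than v1's, starting at 1."""
    current = int(v1_labels.get(key, 0))
    return min(current + 1, MAX_USER_PRIORITY)
```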
## Background Workers

Workers don't receive HTTP traffic — they pull jobs from a queue. The Traefik network gating doesn't apply, but the same deployment flow works:
- Start new worker, wait for healthy
- SIGTERM old worker — should stop pulling new jobs, finish current job
- Brief overlap where both workers process jobs is normal (jobs are independent)
Holden does “wait for healthy, then stop old.” The worker framework handles the drain — stop accepting new jobs on SIGTERM, finish in-flight work.
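A drain-aware worker loop might look like this sketch (illustrative Python; `queue.pop()` returning `None` on empty is an assumed interface, not a real queue API):

```python
import signal

class Worker:
    """Drain on SIGTERM: stop pulling new jobs, finish the current one."""

    def __init__(self, queue):
        self.queue = queue
        self.draining = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.draining = True  # checked between jobs, never mid-job

    def run(self):
        while not self.draining:
            job = self.queue.pop()  # assumed: returns None when queue is empty
            if job is None:
                break
            job()                   # the current job always runs to completion
```

Because the `draining` flag is only consulted between jobs, a SIGTERM that arrives mid-job lets that job finish, matching the "finish current job" behavior described above.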