A client's users complained about error pages during deployments. Our process was: push to main, build image, SSH in, pull, stop old container, start new one. Fifteen to forty-five seconds of downtime. Sounds minor, but their app processed Stripe webhooks. Those seconds meant missed payments and confused customers.
Zero-downtime deployment is a baseline requirement for any app handling transactions or webhooks. The concept is simple: run the new version alongside the old, shift traffic, stop the old one.
On Railway, it is built in. Railway runs the new deployment alongside the existing one, waits for the health check, shifts traffic, terminates the old instance. You just need a GET /health endpoint returning 200 when ready.
On Fly.io, the same flow is automated with rolling deployments: configure the health check path and timeout in fly.toml.
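A minimal fly.toml sketch of that configuration might look like the following (the port, path, and timings are illustrative, not a recommendation):

```toml
[http_service]
  internal_port = 8080

  # Fly waits for this check to pass before shifting traffic to the
  # new machines during a rolling deploy.
  [[http_service.checks]]
    method = "GET"
    path = "/health"
    interval = "15s"
    timeout = "5s"
    grace_period = "10s"
```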
On bare VPS instances, we use a reverse proxy (Caddy or nginx) with two upstream targets. Deploy script: build new image, start on different port, health check the new container, update proxy config, wait sixty seconds for connection draining, stop old container. About forty lines of bash in the CI pipeline.
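The steps above can be sketched roughly as follows. All names are hypothetical (the `myapp` container names, ports, registry URL, and Caddyfile path are placeholders), and the `DOCKER`/`CURL`/`RELOAD_PROXY` variables exist so the flow can be dry-run without a real host:

```shell
#!/usr/bin/env bash
# Sketch of a zero-downtime VPS deploy. Commands are indirected through
# variables so the script can be exercised with stubs; the real script
# would use docker, curl, and a Caddy reload directly.
set -euo pipefail

DOCKER=${DOCKER:-docker}
CURL=${CURL:-curl}
RELOAD_PROXY=${RELOAD_PROXY:-reload_caddy}

reload_caddy() {
  # Swap the upstream port in the Caddyfile; Caddy reloads its config
  # without dropping in-flight connections.
  sed -i "s/:$1/:$2/" /etc/caddy/Caddyfile
  systemctl reload caddy
}

deploy() {
  local image=$1 old_port=${2:-8080} new_port=${3:-8081}

  # 1. Start the new version on a different port, alongside the old one.
  $DOCKER run -d --name myapp-new -p "$new_port:8080" "$image"

  # 2. Health-check the new container before it receives any traffic.
  local ok=0
  for _ in $(seq 1 30); do
    if $CURL -fsS "http://localhost:$new_port/health" >/dev/null 2>&1; then
      ok=1; break
    fi
    sleep 2
  done
  if [ "$ok" -ne 1 ]; then
    echo "new container never became healthy; aborting" >&2
    $DOCKER rm -f myapp-new
    return 1
  fi

  # 3. Point the proxy at the new port, drain, then retire the old container.
  $RELOAD_PROXY "$old_port" "$new_port"
  sleep "${DRAIN_SECONDS:-60}"
  $DOCKER rm -f myapp-old 2>/dev/null || true
  $DOCKER rename myapp-new myapp-old
}
```

The key property is ordering: the new container must prove itself healthy before the proxy config changes, and the old container must outlive the drain window.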
Database migrations are the tricky part. Our rule: every migration must be backward-compatible. Never rename columns directly -- add the new column, backfill, deploy code using it, remove the old column later. Never add NOT NULL without a default. Each rule adds one extra migration step but eliminates incompatibility windows.
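As an illustration of the rename rule, the expand/contract sequence for renaming a column might look like this (table and column names are hypothetical), with each step shipped as a separate deploy:

```sql
-- Step 1 (old code still running): add the new column and backfill.
-- Old code keeps writing fullname; nothing breaks.
ALTER TABLE users ADD COLUMN display_name TEXT;
UPDATE users SET display_name = fullname WHERE display_name IS NULL;

-- Step 2: deploy application code that reads and writes display_name.

-- Step 3 (only after step 2 is fully rolled out): drop the old column.
ALTER TABLE users DROP COLUMN fullname;
```

At every point in the sequence, both the previous and the current version of the app can run against the same schema, which is exactly what a zero-downtime swap requires.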
The full pipeline: push to main, GitHub Actions builds and tests, the image is pushed to the registry, the platform performs the zero-downtime swap, a post-deploy health check verifies the release, and a failed check triggers automatic rollback.
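A skeleton of that workflow might look like this (the registry URL, image name, test command, and deploy script are placeholders; the swap itself happens on the platform side):

```yaml
# .github/workflows/deploy.yml -- sketch, not a drop-in config
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Build once, test the exact image that will ship.
      - run: docker build -t registry.example.com/myapp:${{ github.sha }} .
      - run: docker run --rm registry.example.com/myapp:${{ github.sha }} ./run-tests.sh
      - run: docker push registry.example.com/myapp:${{ github.sha }}
      # Hands off to the platform (or the VPS script) for the swap.
      - run: ./scripts/deploy.sh ${{ github.sha }}
```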
Total setup time: two to four hours per project. That investment pays for itself the first time a deployment goes out during business hours and nobody notices.
One last thing: test your rollback process before you need it. We run a deliberate rollback test on every new project during the first week. Deploy a known-bad change (a health check that fails intentionally), verify the automatic rollback triggers, and confirm the previous version is restored. A real incident is the worst possible time to discover that your rollback process is broken.
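The drill can itself be scripted. A sketch, assuming the app exposes a version endpoint (the `/version` URL, the wait time, and the deploy command are all hypothetical):

```shell
#!/usr/bin/env bash
# Rollback drill: capture the live version, push a known-bad build,
# wait for the platform's automatic rollback, and assert the old
# version is still serving. CURL is overridable for dry runs.
set -euo pipefail

CURL=${CURL:-curl}
VERSION_URL=${VERSION_URL:-http://localhost:8080/version}

live_version() { $CURL -fsS "$VERSION_URL"; }

rollback_drill() {
  local before after
  before=$(live_version)

  # "$@" is the deploy command for the intentionally broken build,
  # e.g. ./deploy.sh known-bad-tag. It is *expected* to fail.
  "$@" || echo "deploy rejected (expected)" >&2

  # Give the platform time to detect the failed health check and revert.
  sleep "${ROLLBACK_WAIT:-120}"

  after=$(live_version)
  if [ "$before" = "$after" ]; then
    echo "rollback OK: still serving $before"
  else
    echo "rollback FAILED: $before -> $after" >&2
    return 1
  fi
}
```

Run once during project setup; if it ever prints "rollback FAILED", you have found the problem on your schedule rather than during an incident.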
If your deployment has any window where users see errors, it is not finished.
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.