
Zero-Downtime Database Migrations: The Complete Playbook

Every schema migration is a risk. The expand-contract pattern removes most of that risk by making the change invisible to running code. This is the complete playbook — from blue-green databases to online schema changes with gh-ost and pgroll.

Abhishek Sharma · Head of Engg @ Fordel Studios
14 min read

The classic database migration story: rename a column, deploy the new application code that uses the new column name, see the old application code (still running during the deployment) crash because it references the old column name. Downtime. Rollback. Post-mortem.

Teams that do this once learn to do it carefully. Teams that do it at scale — continuous deployment, multiple services, shared databases — learn the expand-contract pattern. It is the bedrock of zero-downtime schema changes, and once you understand it, every other technique in this space falls into place.

···

The Expand-Contract Pattern

Expand-contract (also called parallel change) is a three-phase migration approach that separates the schema change from the application code change. The core insight: never make a breaking schema change and a breaking application code change in the same deployment.

Expand-contract in three phases

01
Expand: add the new structure alongside the old

Add the new column, table, or relationship while keeping the old one. Both exist simultaneously. No application code is changed yet. This is a purely additive, non-breaking migration. The old code keeps working because the old schema is still there.

02
Migrate: backfill and dual-write

Backfill existing rows to populate the new column. Update application code to write to both old and new structures simultaneously (dual-write). Deploy. The new code writes to both; the old code only writes to the old. Data is now consistent in both places.

03
Contract: remove the old structure

Once every application instance is running code that reads and writes only the new structure (the dual-write has been switched off), remove the old column/table/relationship. This is safe because nothing reads or writes the old structure any more. From the running code's perspective, this too is a non-breaking change.

The cost: a schema change that could be done in one migration now requires three separate deployments, spread across days or weeks. The benefit: zero downtime, zero risk of the old code encountering a missing column, and a clean rollback path at every phase.
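Concretely, the three phases map to three separate migrations deployed days apart. A minimal sketch, assuming a hypothetical `users` table whose `full_name` column is being replaced by `display_name` (names are illustrative, not from the case study below):

```sql
-- Phase 1 (expand): purely additive, old code is unaffected
ALTER TABLE users ADD COLUMN display_name VARCHAR(255) NULL;

-- Phase 2 (migrate): backfill, after deploying dual-write application code
UPDATE users SET display_name = full_name WHERE display_name IS NULL;

-- Phase 3 (contract): only after no running code reads or writes full_name
ALTER TABLE users DROP COLUMN full_name;
```

In production the phase 2 backfill would run in batches rather than as one statement; the single UPDATE here is for illustration only.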

···

Online Schema Change Tools

Even with expand-contract, large table alterations are dangerous. Some ALTER TABLE operations on a 500GB MySQL table block writes for the duration of the rebuild, effectively taking the table offline for the length of the operation. Online schema change tools solve this by copying the table in the background while traffic continues, then performing an atomic cutover.

gh-ost (GitHub's Online Schema Transmogrifier)

gh-ost reads MySQL binary logs to replay changes made to the original table onto the shadow table during the copy. When the copy is complete, it pauses writes momentarily, validates consistency, and swaps the tables. The pause is typically under a second. gh-ost is the industry standard for MySQL/MariaDB zero-downtime schema changes.

The operational advantage of gh-ost over pt-online-schema-change (Percona's older tool): gh-ost is controllable during execution. You can pause it, throttle it (reduce the replication lag it causes), and monitor its progress through status files. pt-online-schema-change uses triggers instead of binary log tailing, which has higher overhead on write-heavy tables.

pgroll (PostgreSQL)

PostgreSQL users have a newer option: pgroll, which implements expand-contract at the tool level rather than requiring developers to write it manually. You write a single migration definition; pgroll handles the expansion, creates a view layer that translates old schema to new for running code, backfills data, and waits for you to explicitly complete the migration (which removes the old structure).

| Tool | Database | Mechanism | Pause duration on cutover | Best for |
|---|---|---|---|---|
| gh-ost | MySQL/MariaDB | Binary log tailing | <1 second | Large MySQL tables, production |
| pt-osc | MySQL/MariaDB | Triggers | Varies (longer on heavy writes) | Simpler use cases |
| pgroll | PostgreSQL | Expand-contract + views | Near-zero | Postgres with multiple app versions |
| AWS RDS blue-green | MySQL, Postgres, Aurora | Blue-green at DB layer | Seconds (DNS swap) | RDS deployments, managed cutover |
···

Flyway and Liquibase

Migration tooling and schema change tooling are different concerns. Flyway and Liquibase manage migration versioning — they track which migrations have run, in what order, and provide a history of schema changes. They do not make the migrations safe; they make the process reproducible.

Flyway is SQL-first: you write versioned SQL files (V1__create_users.sql, V2__add_email_index.sql). It is simple, has excellent IDE support, and is sufficient for most teams. Liquibase supports XML, YAML, JSON, and SQL format changelogs, which is useful for teams that want database-agnostic migration definitions or programmatic migration generation.
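A Flyway migration is just a SQL file whose name encodes its version; Flyway applies files in version order and records each one in its schema history table. The `V2__add_email_index.sql` file mentioned above might contain nothing more than the following (the index name is illustrative):

```sql
-- V2__add_email_index.sql
-- Flyway orders migrations by the version prefix (V2) and will refuse
-- to re-run a file whose checksum has changed after being applied.
CREATE INDEX idx_users_email ON users (email);
```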

Both integrate with CI/CD pipelines through CLI tools or Maven/Gradle plugins. The right pattern: run migrations as a separate step before deploying new application code, with the migration step gated on a backup completing successfully.

···

Testing Migrations Against Production Data

The most reliable way to know a migration will succeed in production: run it against a clone of production data before the production window. A migration that works on a 10-row test database may fail on a 200-million-row production table due to index rebuilding time, constraint violations on edge-case data, or foreign key check overhead.

The tooling for this: AWS RDS point-in-time restore creates a clone quickly. Postgres logical replication can maintain a continuously updated clone for regular migration testing. For teams without RDS, restoring a pg_dump into a staging database of comparable size is the minimum acceptable approach.

···

Dual-Write Patterns for Distributed Systems

In microservices architectures, a schema change in one service's database may require changes in downstream services that consume that data via events or APIs. The dual-write pattern extends across service boundaries: produce both old and new event formats during the transition period, update downstream services to consume the new format, then stop producing the old format.

This is the same expand-contract logic applied to event schemas and API contracts. The key discipline: never remove a field from a public event or API response until you have verified no consumer reads that field. This requires consumer-driven contract testing (Pact is the standard tool) to make the "no consumer reads this field" claim provable rather than assumed.

Migration safety checklist before production
  • Migration tested against a production data clone at similar row counts
  • Rollback script written and tested before running forward migration
  • Migration time estimated (test run with EXPLAIN ANALYZE or pg_stat_progress_*)
  • Application code deployed in dual-write mode before running destructive migration step
  • Monitoring alerts configured for query latency spikes during migration window
  • Backup completed and verified in the 2 hours before migration
···

Case Study: Renaming a Column on a 200-Million-Row Table

A payments team needed to rename transaction_type to payment_method on their primary transactions table — 200 million rows in MySQL 8.0, serving 400 queries per second during peak hours. A direct ALTER TABLE RENAME COLUMN would lock the table for an estimated 45 minutes based on their staging test. Forty-five minutes of payment processing downtime is not an option.

They applied expand-contract with gh-ost. Phase one: add the new payment_method column, nullable, with the same VARCHAR type as the old column. This ALTER added the column without blocking reads or writes (MySQL 8.0 supports instant ADD COLUMN for most column types). Phase two: deploy application code that writes to both columns simultaneously and reads from the old column, then backfill existing rows with a batched UPDATE that processes 10,000 rows per transaction, throttled to keep replication lag under 2 seconds.
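The batched backfill can be sketched as a single statement run in a loop (MySQL permits LIMIT on single-table UPDATEs): a driver script repeats it until zero rows are affected, sleeping between batches whenever replication lag approaches the budget. The batch size shown is illustrative:

```sql
-- One backfill batch; the driver script reruns this until it
-- reports 0 rows affected, pausing whenever replication lag
-- exceeds the agreed threshold (2 seconds in this case study).
UPDATE transactions
SET payment_method = transaction_type
WHERE payment_method IS NULL
LIMIT 10000;
```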

Phase three: once backfill completed and all application instances were running dual-write code, switch reads to the new column. Phase four: drop the old column after a 7-day observation period. Total production impact: zero. Total calendar time: 12 days from first migration to cleanup. This is the expand-contract pattern in practice — slower than a rename, but with zero customer impact.

The 12-day timeline surprises teams used to shipping migrations in a single deploy. But the alternative — a 45-minute outage window scheduled at 3 AM with a war room on standby — is more expensive in engineering time, stress, and risk. The expand-contract overhead is predictable and low-stress. The "just run the migration" approach is unpredictable and high-stress.

0 seconds of downtime: achieved on a 200M-row table rename using expand-contract with gh-ost in MySQL 8.0.
···

Blue-Green Database Deployments

Blue-green deployments are well understood at the application layer. At the database layer, they are less common but increasingly supported. AWS RDS Blue/Green Deployments create a synchronised replica (the green environment), apply the migration to it, then swap the DNS endpoint. The swap takes seconds rather than the minutes or hours a direct migration would require.

The limitation: blue-green database deployments only work for migrations that are backward-compatible at the replication level. You cannot add a NOT NULL constraint without a default value in the green environment because the blue environment will continue writing rows without that column value, breaking replication. This brings you back to the expand-contract pattern — the same discipline applies whether you are doing the migration directly or through a blue-green swap.

For PostgreSQL teams not on RDS, pglogical or native logical replication can achieve a similar pattern: replicate to a secondary, apply the migration on the secondary, then promote it to primary. This requires more manual orchestration but gives you the same zero-downtime property.
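With native logical replication, the blue-green setup amounts to a publication on the current primary and a subscription on the future primary. A sketch, with placeholder connection details; note that logical replication does not copy DDL, so the green database's schema must be created separately before subscribing:

```sql
-- On the blue (current) primary: publish every table
CREATE PUBLICATION bluegreen_pub FOR ALL TABLES;

-- On the green (future) primary: subscribe, let it catch up,
-- apply the migration here, then promote it
CREATE SUBSCRIPTION bluegreen_sub
  CONNECTION 'host=blue-primary dbname=app user=replicator'
  PUBLICATION bluegreen_pub;
```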

···

Common Anti-Patterns

01
Running migrations inside the application startup sequence

If your application runs pending migrations on boot, you have a race condition: multiple instances starting simultaneously will attempt the same migration concurrently. Migrations should run as a separate CI/CD step, not as part of application startup.

02
Adding a NOT NULL column without a default value

On MySQL before 8.0 this rewrites every row in the table (the column is filled with the type's implicit default), effectively a full table lock on large tables; PostgreSQL rejects the ALTER outright on a non-empty table because existing rows would violate the constraint. Always add columns as nullable first, backfill, then add the constraint.
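The safe sequence, sketched in MySQL syntax for a hypothetical orders.status column (table, column, and backfill value are illustrative):

```sql
-- 1. Add the column nullable (instant ADD COLUMN in MySQL 8.0)
ALTER TABLE orders ADD COLUMN status VARCHAR(20) NULL;

-- 2. Backfill in batches; repeat until 0 rows affected
UPDATE orders SET status = 'unknown' WHERE status IS NULL LIMIT 10000;

-- 3. Tighten the constraint only once every row has a value
ALTER TABLE orders MODIFY COLUMN status VARCHAR(20) NOT NULL;
```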

03
Creating indexes synchronously on large tables

CREATE INDEX on a 100M+ row table can take minutes and block writes. PostgreSQL supports CREATE INDEX CONCURRENTLY, which builds the index without locking writes. MySQL's online DDL handles most index operations without blocking in 8.0+, but verify the behaviour for your specific storage engine and operation.
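The non-blocking PostgreSQL form looks like this (table and index names are illustrative). Two caveats worth knowing: it cannot run inside a transaction block, so migration tools must mark it non-transactional, and a failed build leaves behind an INVALID index that has to be dropped before retrying:

```sql
-- Builds the index without taking a write lock on the table.
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```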

04
Skipping migration testing because "it worked in staging"

Staging data distributions rarely match production. A migration that completes in 2 seconds on a 10,000-row staging table may take 30 minutes on a 50-million-row production table. Always test against production-scale data volumes.

···

Migration Tooling Decision Tree

Choosing the right migration approach depends on table size, write volume, and acceptable risk. Small tables (under 1 million rows) can usually tolerate direct DDL with a brief lock. Medium tables (1-100 million rows) benefit from online schema change tools. Large tables (100 million+ rows) require expand-contract with careful backfill throttling. The decision is not just about size — a 10-million-row table with 2,000 writes per second is harder to migrate than a 500-million-row table with 5 writes per second. Write volume during the migration window matters as much as row count. For teams managing complex distributed systems, the same principles apply to API contract migrations across service boundaries.

| Table size | Write volume | Recommended approach | Expected impact |
|---|---|---|---|
| < 1M rows | Any | Direct DDL (ALTER TABLE) | Sub-second lock |
| 1-100M rows | Low (< 100 w/s) | Direct DDL with monitoring | < 30 seconds lock |
| 1-100M rows | High (> 100 w/s) | gh-ost / pt-osc / pgroll | Near-zero |
| 100M+ rows | Any | Expand-contract + online DDL | Zero (multi-day process) |
···

Monitoring During Migrations

Running a migration without monitoring is like driving with your eyes closed. At minimum, track: replication lag (if using gh-ost or logical replication), query latency percentiles (p50, p95, p99) on the migrating table, lock wait timeouts, and disk I/O utilisation. Set up alerts that fire if replication lag exceeds your threshold (typically 5-10 seconds for gh-ost) or if query latency doubles. For comprehensive monitoring approaches, see our guide on observability-driven development.

gh-ost provides built-in throttling: it monitors replication lag and pauses the copy when lag exceeds a configurable threshold. This self-regulating behaviour is one of the reasons gh-ost is preferred over pt-online-schema-change for high-write tables — it adapts to the system's capacity rather than running at a fixed rate regardless of impact.

···

Foreign Key Constraints and the Migration Trap

Foreign keys are the most underestimated source of migration complexity. Adding a foreign key constraint to an existing table requires MySQL to validate every row against the referenced table — a full table scan that can take minutes on large tables and holds a metadata lock throughout. PostgreSQL's default behaviour is similar, but adding the constraint as NOT VALID skips the validation scan (making the ALTER near-instant); the constraint then applies only to new rows until you run VALIDATE CONSTRAINT separately.

The pattern: add the foreign key as NOT VALID (Postgres) or defer the constraint check (MySQL 8.0 with SET FOREIGN_KEY_CHECKS=0 in a controlled migration script). Backfill any violations. Then validate. This separates the schema change (fast) from the data validation (slow, non-blocking). Teams that skip this and add foreign keys directly on 100M+ row tables learn the lesson the hard way — through a multi-minute table lock during business hours.
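In PostgreSQL the two-step pattern looks like this (table, column, and constraint names are illustrative):

```sql
-- Fast: records the constraint without scanning existing rows
ALTER TABLE payments
  ADD CONSTRAINT fk_payments_account
  FOREIGN KEY (account_id) REFERENCES accounts (id) NOT VALID;

-- Slow but non-blocking: validates existing rows
-- without locking out concurrent writes
ALTER TABLE payments VALIDATE CONSTRAINT fk_payments_account;
```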

···

DynamoDB and NoSQL Migration Patterns

The expand-contract pattern applies to NoSQL databases too, but the mechanics differ. DynamoDB has no ALTER TABLE — schema is defined by the data you write, not by a formal DDL. A "migration" in DynamoDB means changing the shape of items you write and backfilling existing items to the new shape.

The challenge: DynamoDB charges for write capacity units consumed during backfill. Backfilling 100 million items at 1KB each consumes roughly 100 million WCUs. At on-demand pricing, this costs approximately $125 — not prohibitive, but worth budgeting. Provisioned capacity mode requires you to temporarily increase write capacity for the backfill and reduce it after. Auto-scaling helps but reacts slowly to burst writes. For teams running complex data pipelines alongside migrations, the same cost discipline applies to AI infrastructure costs.

MongoDB migrations follow a similar pattern: write both old and new field names, backfill with a bulk update operation, then remove the old field. MongoDB 5.0+ supports schema validation rules that can enforce the new shape, providing a safety net similar to database constraints. The key principle is the same across all databases: never break running code by changing the shape of data it reads.

···

Rollback Strategy: The Plan Nobody Writes

Every migration plan should have a rollback plan. Most do not. The rollback plan for an expand-contract migration is straightforward: at phase 1 (expand), rollback is "drop the new column." At phase 2 (dual-write), rollback is "deploy the previous application version that only writes to the old column." At phase 3 (contract), rollback is complex because you have removed the old structure — which is why phase 3 should be the last step, executed only after extensive validation.
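For the expand and migrate phases, the rollback script can be as small as one statement, written and tested before the forward migration runs. A sketch, assuming a hypothetical users.display_name column added during an expand phase:

```sql
-- Rollback for phase 1 (expand): nothing reads the new column yet,
-- so dropping it is safe.
ALTER TABLE users DROP COLUMN display_name;

-- Rollback for phase 2 (migrate) is an application deploy, not SQL:
-- redeploy the previous version that writes only to the old column.
```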

A migration without a rollback plan is not a migration. It is a bet. The cost of writing the rollback script is 30 minutes. The cost of not having one during a failed migration at 2 AM is immeasurable.
···

When to Accept Downtime

Not every migration requires zero-downtime. If your application has a natural maintenance window (a B2B SaaS product used Monday-Friday during business hours has weekends available), a planned 30-minute downtime window during off-hours may be simpler, safer, and cheaper than the engineering overhead of expand-contract. The decision framework: calculate the cost of the downtime (lost revenue, SLA penalties, customer trust impact) and compare it to the cost of the zero-downtime approach (engineering time, operational complexity, multi-day migration timeline).

For most startups with fewer than 1,000 active users and no SLA commitments, a planned maintenance window with clear communication is the right approach. For SaaS products with 99.9% uptime SLAs, enterprise customers, and revenue-generating traffic at all hours, the zero-downtime approach is not optional — the SLA violation cost alone justifies the engineering investment. The expand-contract pattern is a tool, not a religion. Use it when the cost of downtime exceeds the cost of avoiding it.
