Zero-Downtime Database Migrations: A Practical Guide for PostgreSQL

Database migrations are one of the few operations that can take a perfectly healthy production application offline in seconds. You deploy a one-line schema change, and suddenly your API is returning 500 errors, your connection pool is exhausted, and your users are staring at a loading spinner. The worst part is that the migration itself might only take three seconds — but the damage lasts far longer.

This guide explains why migrations break production PostgreSQL databases, how to prevent it using the Expand and Contract pattern, which tools actually help, and what your ORM is doing behind your back. Everything here is based on documented PostgreSQL behavior, open-source tooling, and real outage stories. If you deploy to production, this is required reading.

1. Why Migrations Break Production

The core problem is simple: every DDL statement in PostgreSQL acquires a lock on the affected table. When you run ALTER TABLE, CREATE INDEX, DROP COLUMN, or any other schema change, PostgreSQL must ensure that no other transaction is reading or writing the table in a way that would conflict with your change. It does this by acquiring locks — and some of those locks are exclusive, meaning nothing else can touch the table until the migration finishes.

Here is a real story that illustrates how badly this can go wrong.

The 45-minute outage: A 3-second ALTER TABLE ADD COLUMN took down an application for 45 minutes. The migration itself was fast — adding a nullable column to a table is nearly instantaneous in modern PostgreSQL. But it queued behind a long-running analytics query that held a conflicting lock on the same table. While the ALTER TABLE waited for that lock, it blocked every subsequent query on the table. Within seconds, 2,000 connections were queued, the connection pool was exhausted, and the entire application became unresponsive. The migration had not even started yet — it was just waiting in line.

This is the fundamental trap. The migration does not need to be slow to cause an outage. It just needs to wait for a lock, and while it waits, it blocks everything behind it. PostgreSQL's lock queue is a first-come, first-served line. Your ALTER TABLE gets in line behind the analytics query. Every new SELECT, INSERT, and UPDATE that arrives after your ALTER TABLE gets in line behind it. The table is effectively frozen.

Which Operations Are Dangerous?

Not all migrations are equally risky. Here is a breakdown of common operations and their lock behavior:

Operation	Lock Type	Risk Level	Duration
Add nullable column (no default)	AccessExclusiveLock	Low	Instant (metadata only)
Add column with DEFAULT	AccessExclusiveLock	High (pre-PG11)	Rewrites entire table
CREATE INDEX	ShareLock	High	Minutes to hours
CREATE INDEX CONCURRENTLY	ShareUpdateExclusiveLock	Low	Minutes (non-blocking)
Change column type	AccessExclusiveLock	High	Rewrites entire table
Change primary key	AccessExclusiveLock	High	Rewrites table + indexes
Add CHECK constraint (NOT VALID)	ShareRowExclusiveLock	Low	Instant (skips validation)
Reorganize table (CLUSTER/VACUUM FULL)	AccessExclusiveLock	High	Minutes to hours

The pattern is clear. Any operation that requires an AccessExclusiveLock is dangerous in production because it blocks all reads and writes. Operations that rewrite the table — adding a column with a default value on PostgreSQL versions before 11, changing a column type, reorganizing tables, changing primary keys — can hold that lock for minutes or even hours on large tables.

Even "safe" operations like adding a nullable column still acquire AccessExclusiveLock briefly. If that brief moment coincides with a long-running transaction holding a conflicting lock, you get the 45-minute outage scenario described above.

The Lock Queue Problem

The most counterintuitive aspect of PostgreSQL locking is the queue effect. When your DDL statement requests a lock it cannot immediately acquire, it enters a wait queue. But here is the critical detail: while it is waiting, it blocks all subsequent lock requests that conflict with the lock it is requesting — even though it has not acquired the lock yet. So a simple ALTER TABLE that is waiting behind one slow query will block every other query on that table. The damage cascates exponentially.

This is why setting lock_timeout before running migrations is essential. If your migration cannot acquire a lock within a few seconds, it is better to fail fast and retry later than to sit in the queue blocking everything.

                    -- Always set this before a production migration

                    SET lock_timeout = '5s';

                    -- Now run your migration

                    ALTER TABLE users ADD COLUMN last_login_at TIMESTAMPTZ;

                    -- If the lock cannot be acquired in 5 seconds,

                    -- PostgreSQL will raise an error instead of waiting

2. The Expand and Contract Pattern

The Expand and Contract pattern is the gold standard for zero-downtime migrations. The idea is simple: never make a breaking change in one step. Instead, break every migration into two phases.

Expand phase: Add the new schema alongside the old one. Both versions coexist. The application is updated to work with both schemas. No data is removed, no columns are dropped, nothing breaks.

Contract phase: Once all application instances are using the new schema and the old schema is no longer needed, remove the old schema. Clean up the leftover columns, tables, or constraints.

Here is what the full lifecycle looks like for a common scenario — renaming a column from email to email_address:

                    EXPAND AND CONTRACT: Renaming a Column

                    ========================================

                    Phase 1: EXPAND (add new, keep old)

                      +------------------+      +------------------+

                      |     users      |      |     users      |

                      |------------------|  ==>  |------------------|

                      | id               |      | id               |

                      | email            |      | email            |  <-- old (still active)

                      | name             |      | email_address    |  <-- new (added)

                      +------------------+      | name             |

                                              +------------------+

                      1. ALTER TABLE users ADD COLUMN email_address TEXT;

                      2. Deploy app code that writes to BOTH columns

                      3. Backfill: UPDATE users SET email_address = email

                         WHERE email_address IS NULL;

                      4. Deploy app code that reads from email_address

                    Phase 2: CONTRACT (remove old)

                      +------------------+      +------------------+

                      |     users      |      |     users      |

                      |------------------|  ==>  |------------------|

                      | id               |      | id               |

                      | email            |      | email_address    |

                      | email_address    |      | name             |

                      | name             |      +------------------+

                      +------------------+

                      5. Verify no code reads from old "email" column

                      6. ALTER TABLE users DROP COLUMN email;

The key properties of this pattern are:

Backwards compatible. At every step, the old version of the application can still function. If you need to roll back a deployment, the old code still works because the old schema is still there.
Reversible. If something goes wrong during the expand phase, you simply drop the new column and you are back to where you started. No data loss.
No downtime. The expand phase adds a nullable column (instant in PostgreSQL). The backfill runs as a background operation. The contract phase drops a column that nothing reads anymore. At no point is the table locked for more than milliseconds.

Expand and Contract for Column Type Changes

Changing a column type (for example, from INTEGER to BIGINT) is one of the most dangerous migrations because PostgreSQL must rewrite the entire table. The Expand and Contract pattern handles this with a dual-write strategy:

                    DUAL-WRITE PATTERN: Changing Column Type

                    ==========================================

                    Step 1: Add new column with the target type

                      ALTER TABLE orders ADD COLUMN amount_v2 BIGINT;

                    Step 2: Deploy code that writes to BOTH columns

                      INSERT INTO orders (amount, amount_v2, ...)

                      VALUES (value, value, ...);

                    Step 3: Backfill existing rows (in batches)

                      UPDATE orders SET amount_v2 = amount

                      WHERE amount_v2 IS NULL

                      LIMIT 10000;

                      -- Repeat until all rows are backfilled

                    Step 4: Switch reads to new column

                      -- Deploy code that reads from amount_v2

                    Step 5: Drop old column, rename new

                      ALTER TABLE orders DROP COLUMN amount;

                      ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;

This is more work than a single ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT, but that single command would lock your table for the entire rewrite — which could be hours for a table with millions of rows. The dual-write approach breaks it into steps that each take milliseconds of lock time.

Rule of thumb: If your migration touches a table with more than 100,000 rows and requires a table rewrite, use the Expand and Contract pattern. For small tables or development environments, a direct ALTER TABLE is fine. The cost of the pattern is complexity. Only pay that cost when the table size justifies it.

3. Tool Comparison: pgroll vs Alembic vs Prisma

Several tools exist to help you run safe migrations. They differ significantly in philosophy, automation level, and what guardrails they provide. Here is an honest comparison of the most relevant options.

pgroll (by Xata)

pgroll is an open-source CLI tool that implements the Expand and Contract pattern automatically. It is the most opinionated and safety-focused tool in this comparison.

When you run a migration with pgroll, it does not simply execute your SQL. Instead, it:

Creates a new column alongside the old one
Sets up triggers so that writes to the old column are automatically copied to the new column (and vice versa)
Backfills existing rows in the background
Serves multiple schema versions simultaneously using PostgreSQL schemas — old application instances see the old schema, new instances see the new schema
Once you confirm the migration is complete, it drops the old column and removes the triggers

                    # Define your migration in JSON

                    $ cat migrations/001_add_email.json

                    {

                      "name": "001_add_email",

                      "operations": [

                        {

                          "alter_column": {

                            "table": "users",

                            "column": "email",

                            "name": "email_address"

                          }

                        }

                      ]

                    }

                    # Start the migration (expand phase)

                    $ pgroll start migrations/001_add_email.json

                    # Both old and new schemas are active

                    # Deploy your new app code, verify everything works

                    # Complete the migration (contract phase)

                    $ pgroll complete

The key advantage of pgroll is that it handles the entire dual-schema lifecycle for you, including the triggers that keep both columns in sync during the transition. The key limitation is that it only works with PostgreSQL and requires you to define migrations in its JSON format rather than raw SQL.

Reshape

Reshape is another open-source tool that implements Expand and Contract for PostgreSQL. It was designed with coding agents in mind — the interface is intentionally simple so that automated tools can drive it.

                    # Start a migration

                    $ reshape migration start

                    # The old schema is preserved, new schema is active

                    # Deploy, test, verify

                    # Complete the migration

                    $ reshape migration complete

Reshape's philosophy is similar to pgroll but with a simpler interface. It uses TOML files for migration definitions and focuses on making the expand/contract workflow as frictionless as possible.

Alembic (SQLAlchemy)

Alembic is the migration tool for the Python SQLAlchemy ecosystem. It does not implement Expand and Contract automatically — it generates migration scripts that you can customize. The advantage is full control; the disadvantage is that safety is your responsibility.

                    # Generate a migration

                    $ alembic revision --autogenerate -m "add email_address"

                    # Review and edit the generated migration file

                    # (this is where you add lock_timeout, batch backfills, etc.)

                    # Apply it

                    $ alembic upgrade head

Alembic autogenerates migrations by comparing your SQLAlchemy models to the actual database schema. It catches most changes, but it does not know about safety concerns like lock durations. You must manually review every generated migration and add safety measures yourself.

pg_repack

pg_repack is not a migration tool in the traditional sense. It reorganizes tables online without holding exclusive locks for the duration of the operation. It works by creating a shadow copy of the table, replaying any changes that happen during the copy, and then performing a brief lock swap at the end.

Use pg_repack when you need to reclaim disk space after mass deletions, rebuild bloated indexes, or change the physical order of rows (CLUSTER). These operations normally require VACUUM FULL or CLUSTER, both of which hold AccessExclusiveLock for the entire duration. pg_repack brings the exclusive lock down to a brief moment at the end of the operation.

Enterprise Options: Flyway and Liquibase

For Java/enterprise environments, two tools dominate:

Flyway is the simpler choice. It runs versioned SQL scripts in order. No magic, no abstraction — you write the SQL, Flyway tracks which scripts have been applied. Choose Flyway when you want simplicity and full control over your SQL.
Liquibase offers rollback guarantees. Migrations are defined in XML, YAML, or JSON, and Liquibase can auto-generate rollback scripts. Choose Liquibase when you need guaranteed rollback capability and work in a team that prefers declarative migration definitions.

Comparison Table

Tool	Expand/Contract	Language	Best For
pgroll	Automatic	Any (CLI)	Teams wanting maximum safety
Reshape	Automatic	Any (CLI)	Coding agents, simple workflows
Alembic	Manual	Python	SQLAlchemy projects, full control
Prisma Migrate	Manual	TypeScript/JS	Prisma ORM projects
Flyway	Manual	Java / SQL	Enterprise, simplicity
Liquibase	Manual	Java / XML / YAML	Enterprise, rollback guarantees
pg_repack	N/A (table reorg)	Any (CLI)	Online table reorganization

Our recommendation: If you are running PostgreSQL and want the safest possible migrations, use pgroll. It automates the hard parts of Expand and Contract. If you need more control or are using an ORM, use your ORM's migration tool but manually apply the safety practices from Section 4. No tool eliminates the need to understand PostgreSQL's locking behavior.

4. Step-by-Step Safe Migration Workflow

Regardless of which tool you use, every production migration should follow this workflow. This is a practical checklist that you can print out and tape to your monitor.

Before You Migrate

Review the generated SQL. Every ORM and migration tool generates SQL under the hood. Before running any migration in production, extract and read the actual SQL that will execute. alembic upgrade head --sql, prisma migrate diff, rails db:migrate:status — whatever your tool provides. Review the generated SQL before running it in production. Always.

Test against a backup first. Take a recent backup of your production database (or use a read replica) and run the migration against it. Measure how long it takes. Check for lock contention. Verify the results. Never run untested migrations on production. This sounds obvious, but the majority of migration outages happen because someone skipped this step.

Check for long-running transactions. Before starting your migration, query pg_stat_activity to find any long-running transactions that might hold locks on your target table. Kill or wait for them to finish before proceeding.

                    -- Find long-running queries on your target table

                    SELECT pid, now() - pg_stat_activity.query_start AS duration,

                           query, state

                    FROM pg_stat_activity

                    WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'

                      AND state != 'idle'

                    ORDER BY duration DESC;

During the Migration

Set lock_timeout. Always set lock_timeout before running any DDL in production. A value of 3–5 seconds is reasonable. If the migration cannot acquire a lock in that time, it fails fast instead of queuing and blocking everything. You can retry later when the table is less busy.

                    -- Set lock timeout for this session

                    SET lock_timeout = '5s';

                    -- Set statement timeout as a safety net

                    SET statement_timeout = '30s';

                    -- Now run your migration

                    ALTER TABLE users ADD COLUMN phone TEXT;

Use CREATE INDEX CONCURRENTLY. A regular CREATE INDEX holds a ShareLock on the table, which blocks all writes for the duration of the index build. On a large table, this can take hours. CREATE INDEX CONCURRENTLY uses a weaker lock that allows reads and writes to continue. It takes longer but does not block your application.

                    -- WRONG: Blocks all writes until the index is built

                    CREATE INDEX idx_users_email ON users(email);

                    -- RIGHT: Allows reads and writes during index creation

                    CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Add constraints in two steps. Adding a NOT NULL constraint or a CHECK constraint normally validates all existing rows while holding a lock. Split it into two steps: first add the constraint as NOT VALID (instant, no validation), then validate it separately (scans the table but with a weaker lock).

                    -- Step 1: Add constraint without validating (instant)

                    ALTER TABLE orders

                      ADD CONSTRAINT orders_amount_positive

                      CHECK (amount > 0) NOT VALID;

                    -- Step 2: Validate existing rows (weaker lock)

                    ALTER TABLE orders

                      VALIDATE CONSTRAINT orders_amount_positive;

Backfill in batches. If your migration requires backfilling data (populating a new column with values derived from existing data), do it in small batches. A single UPDATE users SET new_col = old_col on a million-row table will hold a lock for the entire duration. Instead, update 5,000–10,000 rows at a time with a short sleep between batches.

                    -- Backfill in batches of 5000

                    WITH batch AS (

                      SELECT id FROM users

                      WHERE email_address IS NULL

                      LIMIT 5000

                      FOR UPDATE SKIP LOCKED

                    )

                    UPDATE users SET email_address = email

                    WHERE id IN (SELECT id FROM batch);

                    -- Repeat until no rows remain

                    -- Add pg_sleep(0.1) between batches to reduce load

After the Migration

Verify before contracting. After the expand phase, verify that the new schema is working correctly. Check application logs for errors. Run your test suite against the new schema. Monitor query performance. Only proceed to the contract phase (dropping old columns) once you are confident everything works.

Wait at least one deployment cycle. Do not run the contract phase immediately after the expand phase. Wait at least one full deployment cycle to ensure that all application instances have been updated to use the new schema. If you use rolling deployments, some instances might still be running old code that reads from the old columns.

Clean up and document. After the contract phase, update your documentation. Remove any dual-write code from the application. Update your ORM models to reflect the final schema. Future developers should not have to guess why there is a commented-out column reference in the codebase.

The most important step is number 2: test against a backup first. Every other step on this list mitigates risk. Testing against a backup eliminates it. If a migration works correctly on a copy of your production data, it will work correctly on production. If it fails on the copy, you just saved yourself an outage.

5. ORM-Specific Gotchas

Most developers do not write raw SQL migrations. They use an ORM that generates the SQL for them. This is convenient, but it means the ORM is making decisions about locking, transaction boundaries, and migration safety — and those decisions are not always correct for production. Here is what each major ORM does and what to watch for.

Django

Django's migration framework is one of the more production-friendly ORMs. On PostgreSQL 11 and later, Django's AddField operation generates safe SQL for adding columns with defaults. PostgreSQL 11 introduced the ability to add a column with a default value without rewriting the table — it stores the default in the catalog and applies it lazily. Django takes advantage of this.

However, Django does not use CREATE INDEX CONCURRENTLY by default. When you add db_index=True to a field or create an index through a migration, Django generates a regular CREATE INDEX, which blocks writes on the table. You need to use AddIndex with the concurrently=True parameter (available via django.contrib.postgres) to get non-blocking index creation.

                    # Django: WRONG (blocks writes)

                    class Migration(migrations.Migration):

                        operations = [

                            migrations.AddIndex(

                                model_name='user',

                                index=models.Index(fields=['email'], name='idx_email'),

                            ),

                        ]

                    # Django: RIGHT (non-blocking, requires atomic=False)

                    from django.contrib.postgres.operations import AddIndexConcurrently

                    class Migration(migrations.Migration):

                        atomic = False  # Required for CONCURRENTLY

                        operations = [

                            AddIndexConcurrently(

                                model_name='user',

                                index=models.Index(fields=['email'], name='idx_email'),

                            ),

                        ]

Note the atomic = False. CREATE INDEX CONCURRENTLY cannot run inside a transaction, and Django wraps migrations in transactions by default. You must explicitly disable this.

Ruby on Rails

Rails' ActiveRecord migrations have a significant gotcha: add_index generates a regular CREATE INDEX by default, which blocks writes. You must explicitly pass algorithm: :concurrently and wrap the migration in disable_ddl_transaction!.

                    # Rails: WRONG (blocks writes)

                    class AddEmailIndex < ActiveRecord::Migration[7.1]

                      def change

                        add_index :users, :email

                      end

                    end

                    # Rails: RIGHT (non-blocking)

                    class AddEmailIndex < ActiveRecord::Migration[7.1]

                      disable_ddl_transaction!

                      def change

                        add_index :users, :email, algorithm: :concurrently

                      end

                    end

Rails also does not set lock_timeout by default. The strong_migrations gem is widely recommended for Rails projects — it catches dangerous migrations before they run and suggests safer alternatives.

Prisma

Prisma Migrate wraps all migrations in a transaction by default. This is a problem because CREATE INDEX CONCURRENTLY cannot run inside a transaction. If Prisma detects that it needs to create an index, it will generate a regular CREATE INDEX, which blocks writes. There is no built-in way to disable the transaction wrapper for a single migration.

                    -- Prisma generates this (inside a transaction):

                    BEGIN;

                    CREATE INDEX "User_email_idx" ON "User"("email");

                    COMMIT;

                    -- What you actually need (outside a transaction):

                    CREATE INDEX CONCURRENTLY "User_email_idx" ON "User"("email");

The workaround is to generate the migration with prisma migrate diff, manually edit the generated SQL file to use CONCURRENTLY and remove the transaction wrapper, and then apply it with prisma db execute. This is not ideal, but it is the current state of Prisma's migration tooling for PostgreSQL.

Prisma gotcha: Prisma wraps migrations in transactions, which prevents CREATE INDEX CONCURRENTLY from working. Always review Prisma's generated SQL before running it in production. For index creation on large tables, you may need to bypass Prisma's migration runner entirely and execute the SQL directly.

SQLAlchemy / Alembic

Alembic gives you the most control of any ORM migration tool, but that means you are responsible for safety. By default, Alembic generates migrations that run in a transaction. For CREATE INDEX CONCURRENTLY, you need to explicitly configure the migration to run outside a transaction.

                    # Alembic: Configure for concurrent index creation

                    from alembic import op

                    import sqlalchemy as sa

                    # Disable transaction for this migration

                    # In env.py, use context.configure(transaction_per_migration=True)

                    # Then in your migration:

                    def upgrade():

                        # This runs outside the transaction block

                        op.execute("SET lock_timeout = '5s'")

                        op.create_index(

                            'idx_users_email',

                            'users',

                            ['email'],

                            postgresql_concurrently=True

                        )

                    def downgrade():

                        op.drop_index('idx_users_email', 'users')

Alembic's postgresql_concurrently=True parameter generates the correct SQL. The gotcha is that you must also ensure the migration is not wrapped in a transaction, which requires configuring the migration context. Alembic's autogenerate will not add postgresql_concurrently for you — you must add it manually after reviewing the generated migration.

Summary of ORM Gotchas

ORM	Safe Defaults?	Key Gotcha	Recommended Action
Django	Mostly	`AddField` safe on PG11+; indexes are not concurrent	Use `AddIndexConcurrently`
Rails	No	`add_index` blocks writes by default	Use `algorithm: :concurrently` + strong_migrations gem
Prisma	No	Transaction wrapper prevents `CONCURRENTLY`	Manually edit SQL, bypass migration runner for indexes
Alembic	No	Autogenerate omits `concurrently`; manual review needed	Add `postgresql_concurrently=True`, disable txn

The common thread across all ORMs is the same: review the generated SQL before running it in production. Every ORM makes trade-offs between convenience and safety. The ORM does not know the size of your tables, the load on your database, or whether a 30-second lock will take down your application. You do. Act on that knowledge.

Final thought: Zero-downtime migrations are not about finding the perfect tool. They are about understanding what PostgreSQL does when you change a schema, and making deliberate choices about how and when to do it. The Expand and Contract pattern works with any tool or ORM. Setting lock_timeout works with any tool or ORM. Testing against a backup works with any tool or ORM. Master the principles, and the specific tool becomes a detail.

Zero-Downtime Database Migrations: A Practical Guide for PostgreSQL

1. Why Migrations Break Production

Which Operations Are Dangerous?

The Lock Queue Problem

2. The Expand and Contract Pattern

Expand and Contract for Column Type Changes

3. Tool Comparison: pgroll vs Alembic vs Prisma

pgroll (by Xata)

Reshape

Alembic (SQLAlchemy)

pg_repack

Enterprise Options: Flyway and Liquibase

Comparison Table

4. Step-by-Step Safe Migration Workflow

Before You Migrate

During the Migration

After the Migration

5. ORM-Specific Gotchas

Django

Ruby on Rails

Prisma

SQLAlchemy / Alembic

Summary of ORM Gotchas

Sources