Database migrations are one of the few operations that can take a perfectly healthy production application offline in seconds. You deploy a one-line schema change, and suddenly your API is returning 500 errors, your connection pool is exhausted, and your users are staring at a loading spinner. The worst part is that the migration itself might only take three seconds — but the damage lasts far longer.
This guide explains why migrations break production PostgreSQL databases, how to prevent it using the Expand and Contract pattern, which tools actually help, and what your ORM is doing behind your back. Everything here is based on documented PostgreSQL behavior, open-source tooling, and real outage stories. If you deploy to production, this is required reading.
1. Why Migrations Break Production
The core problem is simple: every DDL statement in PostgreSQL acquires a lock on the affected table. When you run ALTER TABLE, CREATE INDEX, DROP COLUMN, or any other schema change, PostgreSQL must ensure that no other transaction is reading or writing the table in a way that would conflict with your change. It does this by acquiring locks — and some of those locks are exclusive, meaning nothing else can touch the table until the migration finishes.
Here is a real story that illustrates how badly this can go wrong.
The 45-minute outage: A 3-second ALTER TABLE ADD COLUMN took down an application for 45 minutes. The migration itself was fast — adding a nullable column to a table is nearly instantaneous in modern PostgreSQL. But it queued behind a long-running analytics query that held a conflicting lock on the same table. While the ALTER TABLE waited for that lock, it blocked every subsequent query on the table. Within seconds, 2,000 connections were queued, the connection pool was exhausted, and the entire application became unresponsive. The migration had not even started yet — it was just waiting in line.
This is the fundamental trap. The migration does not need to be slow to cause an outage. It just needs to wait for a lock, and while it waits, it blocks everything behind it. PostgreSQL's lock queue is a first-come, first-served line. Your ALTER TABLE gets in line behind the analytics query. Every new SELECT, INSERT, and UPDATE that arrives after your ALTER TABLE gets in line behind it. The table is effectively frozen.
Which Operations Are Dangerous?
Not all migrations are equally risky. Here is a breakdown of common operations and their lock behavior:
| Operation | Lock Type | Risk Level | Duration |
|---|---|---|---|
| Add nullable column (no default) | AccessExclusiveLock | Low | Instant (metadata only) |
| Add column with DEFAULT | AccessExclusiveLock | High (pre-PG11) | Rewrites entire table |
| CREATE INDEX | ShareLock | High | Minutes to hours |
| CREATE INDEX CONCURRENTLY | ShareUpdateExclusiveLock | Low | Minutes (non-blocking) |
| Change column type | AccessExclusiveLock | High | Rewrites entire table |
| Change primary key | AccessExclusiveLock | High | Rewrites table + indexes |
| Add CHECK constraint (NOT VALID) | ShareRowExclusiveLock | Low | Instant (skips validation) |
| Reorganize table (CLUSTER/VACUUM FULL) | AccessExclusiveLock | High | Minutes to hours |
The pattern is clear. Any operation that requires an AccessExclusiveLock is dangerous in production because it blocks all reads and writes. Operations that rewrite the table — adding a column with a default value on PostgreSQL versions before 11, changing a column type, reorganizing tables, changing primary keys — can hold that lock for minutes or even hours on large tables.
Even "safe" operations like adding a nullable column still acquire AccessExclusiveLock briefly. If that brief moment coincides with a long-running transaction holding a conflicting lock, you get the 45-minute outage scenario described above.
The Lock Queue Problem
The most counterintuitive aspect of PostgreSQL locking is the queue effect. When your DDL statement requests a lock it cannot immediately acquire, it enters a wait queue. But here is the critical detail: while it is waiting, it blocks all subsequent lock requests that conflict with the lock it is requesting — even though it has not acquired the lock yet. So a simple ALTER TABLE that is waiting behind one slow query will block every other query on that table. The damage cascates exponentially.
This is why setting lock_timeout before running migrations is essential. If your migration cannot acquire a lock within a few seconds, it is better to fail fast and retry later than to sit in the queue blocking everything.
SET lock_timeout = '5s';
-- Now run your migration
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMPTZ;
-- If the lock cannot be acquired in 5 seconds,
-- PostgreSQL will raise an error instead of waiting
2. The Expand and Contract Pattern
The Expand and Contract pattern is the gold standard for zero-downtime migrations. The idea is simple: never make a breaking change in one step. Instead, break every migration into two phases.
Expand phase: Add the new schema alongside the old one. Both versions coexist. The application is updated to work with both schemas. No data is removed, no columns are dropped, nothing breaks.
Contract phase: Once all application instances are using the new schema and the old schema is no longer needed, remove the old schema. Clean up the leftover columns, tables, or constraints.
Here is what the full lifecycle looks like for a common scenario — renaming a column from email to email_address:
========================================
Phase 1: EXPAND (add new, keep old)
+------------------+ +------------------+
| users | | users |
|------------------| ==> |------------------|
| id | | id |
| email | | email | <-- old (still active)
| name | | email_address | <-- new (added)
+------------------+ | name |
+------------------+
1. ALTER TABLE users ADD COLUMN email_address TEXT;
2. Deploy app code that writes to BOTH columns
3. Backfill: UPDATE users SET email_address = email
WHERE email_address IS NULL;
4. Deploy app code that reads from email_address
Phase 2: CONTRACT (remove old)
+------------------+ +------------------+
| users | | users |
|------------------| ==> |------------------|
| id | | id |
| email | | email_address |
| email_address | | name |
| name | +------------------+
+------------------+
5. Verify no code reads from old "email" column
6. ALTER TABLE users DROP COLUMN email;
The key properties of this pattern are:
- Backwards compatible. At every step, the old version of the application can still function. If you need to roll back a deployment, the old code still works because the old schema is still there.
- Reversible. If something goes wrong during the expand phase, you simply drop the new column and you are back to where you started. No data loss.
- No downtime. The expand phase adds a nullable column (instant in PostgreSQL). The backfill runs as a background operation. The contract phase drops a column that nothing reads anymore. At no point is the table locked for more than milliseconds.
Expand and Contract for Column Type Changes
Changing a column type (for example, from INTEGER to BIGINT) is one of the most dangerous migrations because PostgreSQL must rewrite the entire table. The Expand and Contract pattern handles this with a dual-write strategy:
==========================================
Step 1: Add new column with the target type
ALTER TABLE orders ADD COLUMN amount_v2 BIGINT;
Step 2: Deploy code that writes to BOTH columns
INSERT INTO orders (amount, amount_v2, ...)
VALUES (value, value, ...);
Step 3: Backfill existing rows (in batches)
UPDATE orders SET amount_v2 = amount
WHERE amount_v2 IS NULL
LIMIT 10000;
-- Repeat until all rows are backfilled
Step 4: Switch reads to new column
-- Deploy code that reads from amount_v2
Step 5: Drop old column, rename new
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;
This is more work than a single ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT, but that single command would lock your table for the entire rewrite — which could be hours for a table with millions of rows. The dual-write approach breaks it into steps that each take milliseconds of lock time.
Rule of thumb: If your migration touches a table with more than 100,000 rows and requires a table rewrite, use the Expand and Contract pattern. For small tables or development environments, a direct ALTER TABLE is fine. The cost of the pattern is complexity. Only pay that cost when the table size justifies it.
3. Tool Comparison: pgroll vs Alembic vs Prisma
Several tools exist to help you run safe migrations. They differ significantly in philosophy, automation level, and what guardrails they provide. Here is an honest comparison of the most relevant options.
pgroll (by Xata)
pgroll is an open-source CLI tool that implements the Expand and Contract pattern automatically. It is the most opinionated and safety-focused tool in this comparison.
When you run a migration with pgroll, it does not simply execute your SQL. Instead, it:
- Creates a new column alongside the old one
- Sets up triggers so that writes to the old column are automatically copied to the new column (and vice versa)
- Backfills existing rows in the background
- Serves multiple schema versions simultaneously using PostgreSQL schemas — old application instances see the old schema, new instances see the new schema
- Once you confirm the migration is complete, it drops the old column and removes the triggers
$ cat migrations/001_add_email.json
{
"name": "001_add_email",
"operations": [
{
"alter_column": {
"table": "users",
"column": "email",
"name": "email_address"
}
}
]
}
# Start the migration (expand phase)
$ pgroll start migrations/001_add_email.json
# Both old and new schemas are active
# Deploy your new app code, verify everything works
# Complete the migration (contract phase)
$ pgroll complete
The key advantage of pgroll is that it handles the entire dual-schema lifecycle for you, including the triggers that keep both columns in sync during the transition. The key limitation is that it only works with PostgreSQL and requires you to define migrations in its JSON format rather than raw SQL.
Reshape
Reshape is another open-source tool that implements Expand and Contract for PostgreSQL. It was designed with coding agents in mind — the interface is intentionally simple so that automated tools can drive it.
$ reshape migration start
# The old schema is preserved, new schema is active
# Deploy, test, verify
# Complete the migration
$ reshape migration complete
Reshape's philosophy is similar to pgroll but with a simpler interface. It uses TOML files for migration definitions and focuses on making the expand/contract workflow as frictionless as possible.
Alembic (SQLAlchemy)
Alembic is the migration tool for the Python SQLAlchemy ecosystem. It does not implement Expand and Contract automatically — it generates migration scripts that you can customize. The advantage is full control; the disadvantage is that safety is your responsibility.
$ alembic revision --autogenerate -m "add email_address"
# Review and edit the generated migration file
# (this is where you add lock_timeout, batch backfills, etc.)
# Apply it
$ alembic upgrade head
Alembic autogenerates migrations by comparing your SQLAlchemy models to the actual database schema. It catches most changes, but it does not know about safety concerns like lock durations. You must manually review every generated migration and add safety measures yourself.
pg_repack
pg_repack is not a migration tool in the traditional sense. It reorganizes tables online without holding exclusive locks for the duration of the operation. It works by creating a shadow copy of the table, replaying any changes that happen during the copy, and then performing a brief lock swap at the end.
Use pg_repack when you need to reclaim disk space after mass deletions, rebuild bloated indexes, or change the physical order of rows (CLUSTER). These operations normally require VACUUM FULL or CLUSTER, both of which hold AccessExclusiveLock for the entire duration. pg_repack brings the exclusive lock down to a brief moment at the end of the operation.
Enterprise Options: Flyway and Liquibase
For Java/enterprise environments, two tools dominate:
- Flyway is the simpler choice. It runs versioned SQL scripts in order. No magic, no abstraction — you write the SQL, Flyway tracks which scripts have been applied. Choose Flyway when you want simplicity and full control over your SQL.
- Liquibase offers rollback guarantees. Migrations are defined in XML, YAML, or JSON, and Liquibase can auto-generate rollback scripts. Choose Liquibase when you need guaranteed rollback capability and work in a team that prefers declarative migration definitions.
Comparison Table
| Tool | Expand/Contract | Language | Best For |
|---|---|---|---|
| pgroll | Automatic | Any (CLI) | Teams wanting maximum safety |
| Reshape | Automatic | Any (CLI) | Coding agents, simple workflows |
| Alembic | Manual | Python | SQLAlchemy projects, full control |
| Prisma Migrate | Manual | TypeScript/JS | Prisma ORM projects |
| Flyway | Manual | Java / SQL | Enterprise, simplicity |
| Liquibase | Manual | Java / XML / YAML | Enterprise, rollback guarantees |
| pg_repack | N/A (table reorg) | Any (CLI) | Online table reorganization |
Our recommendation: If you are running PostgreSQL and want the safest possible migrations, use pgroll. It automates the hard parts of Expand and Contract. If you need more control or are using an ORM, use your ORM's migration tool but manually apply the safety practices from Section 4. No tool eliminates the need to understand PostgreSQL's locking behavior.
4. Step-by-Step Safe Migration Workflow
Regardless of which tool you use, every production migration should follow this workflow. This is a practical checklist that you can print out and tape to your monitor.
Before You Migrate
Review the generated SQL. Every ORM and migration tool generates SQL under the hood. Before running any migration in production, extract and read the actual SQL that will execute. alembic upgrade head --sql, prisma migrate diff, rails db:migrate:status — whatever your tool provides. Review the generated SQL before running it in production. Always.
Test against a backup first. Take a recent backup of your production database (or use a read replica) and run the migration against it. Measure how long it takes. Check for lock contention. Verify the results. Never run untested migrations on production. This sounds obvious, but the majority of migration outages happen because someone skipped this step.
Check for long-running transactions. Before starting your migration, query pg_stat_activity to find any long-running transactions that might hold locks on your target table. Kill or wait for them to finish before proceeding.
SELECT pid, now() - pg_stat_activity.query_start AS duration,
query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state != 'idle'
ORDER BY duration DESC;
During the Migration
Set lock_timeout. Always set lock_timeout before running any DDL in production. A value of 3–5 seconds is reasonable. If the migration cannot acquire a lock in that time, it fails fast instead of queuing and blocking everything. You can retry later when the table is less busy.
SET lock_timeout = '5s';
-- Set statement timeout as a safety net
SET statement_timeout = '30s';
-- Now run your migration
ALTER TABLE users ADD COLUMN phone TEXT;
Use CREATE INDEX CONCURRENTLY. A regular CREATE INDEX holds a ShareLock on the table, which blocks all writes for the duration of the index build. On a large table, this can take hours. CREATE INDEX CONCURRENTLY uses a weaker lock that allows reads and writes to continue. It takes longer but does not block your application.
CREATE INDEX idx_users_email ON users(email);
-- RIGHT: Allows reads and writes during index creation
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
Add constraints in two steps. Adding a NOT NULL constraint or a CHECK constraint normally validates all existing rows while holding a lock. Split it into two steps: first add the constraint as NOT VALID (instant, no validation), then validate it separately (scans the table but with a weaker lock).
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;
-- Step 2: Validate existing rows (weaker lock)
ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
Backfill in batches. If your migration requires backfilling data (populating a new column with values derived from existing data), do it in small batches. A single UPDATE users SET new_col = old_col on a million-row table will hold a lock for the entire duration. Instead, update 5,000–10,000 rows at a time with a short sleep between batches.
WITH batch AS (
SELECT id FROM users
WHERE email_address IS NULL
LIMIT 5000
FOR UPDATE SKIP LOCKED
)
UPDATE users SET email_address = email
WHERE id IN (SELECT id FROM batch);
-- Repeat until no rows remain
-- Add pg_sleep(0.1) between batches to reduce load
After the Migration
Verify before contracting. After the expand phase, verify that the new schema is working correctly. Check application logs for errors. Run your test suite against the new schema. Monitor query performance. Only proceed to the contract phase (dropping old columns) once you are confident everything works.
Wait at least one deployment cycle. Do not run the contract phase immediately after the expand phase. Wait at least one full deployment cycle to ensure that all application instances have been updated to use the new schema. If you use rolling deployments, some instances might still be running old code that reads from the old columns.
Clean up and document. After the contract phase, update your documentation. Remove any dual-write code from the application. Update your ORM models to reflect the final schema. Future developers should not have to guess why there is a commented-out column reference in the codebase.
The most important step is number 2: test against a backup first. Every other step on this list mitigates risk. Testing against a backup eliminates it. If a migration works correctly on a copy of your production data, it will work correctly on production. If it fails on the copy, you just saved yourself an outage.
5. ORM-Specific Gotchas
Most developers do not write raw SQL migrations. They use an ORM that generates the SQL for them. This is convenient, but it means the ORM is making decisions about locking, transaction boundaries, and migration safety — and those decisions are not always correct for production. Here is what each major ORM does and what to watch for.
Django
Django's migration framework is one of the more production-friendly ORMs. On PostgreSQL 11 and later, Django's AddField operation generates safe SQL for adding columns with defaults. PostgreSQL 11 introduced the ability to add a column with a default value without rewriting the table — it stores the default in the catalog and applies it lazily. Django takes advantage of this.
However, Django does not use CREATE INDEX CONCURRENTLY by default. When you add db_index=True to a field or create an index through a migration, Django generates a regular CREATE INDEX, which blocks writes on the table. You need to use AddIndex with the concurrently=True parameter (available via django.contrib.postgres) to get non-blocking index creation.
class Migration(migrations.Migration):
operations = [
migrations.AddIndex(
model_name='user',
index=models.Index(fields=['email'], name='idx_email'),
),
]
# Django: RIGHT (non-blocking, requires atomic=False)
from django.contrib.postgres.operations import AddIndexConcurrently
class Migration(migrations.Migration):
atomic = False # Required for CONCURRENTLY
operations = [
AddIndexConcurrently(
model_name='user',
index=models.Index(fields=['email'], name='idx_email'),
),
]
Note the atomic = False. CREATE INDEX CONCURRENTLY cannot run inside a transaction, and Django wraps migrations in transactions by default. You must explicitly disable this.
Ruby on Rails
Rails' ActiveRecord migrations have a significant gotcha: add_index generates a regular CREATE INDEX by default, which blocks writes. You must explicitly pass algorithm: :concurrently and wrap the migration in disable_ddl_transaction!.
class AddEmailIndex < ActiveRecord::Migration[7.1]
def change
add_index :users, :email
end
end
# Rails: RIGHT (non-blocking)
class AddEmailIndex < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def change
add_index :users, :email, algorithm: :concurrently
end
end
Rails also does not set lock_timeout by default. The strong_migrations gem is widely recommended for Rails projects — it catches dangerous migrations before they run and suggests safer alternatives.
Prisma
Prisma Migrate wraps all migrations in a transaction by default. This is a problem because CREATE INDEX CONCURRENTLY cannot run inside a transaction. If Prisma detects that it needs to create an index, it will generate a regular CREATE INDEX, which blocks writes. There is no built-in way to disable the transaction wrapper for a single migration.
BEGIN;
CREATE INDEX "User_email_idx" ON "User"("email");
COMMIT;
-- What you actually need (outside a transaction):
CREATE INDEX CONCURRENTLY "User_email_idx" ON "User"("email");
The workaround is to generate the migration with prisma migrate diff, manually edit the generated SQL file to use CONCURRENTLY and remove the transaction wrapper, and then apply it with prisma db execute. This is not ideal, but it is the current state of Prisma's migration tooling for PostgreSQL.
Prisma gotcha: Prisma wraps migrations in transactions, which prevents CREATE INDEX CONCURRENTLY from working. Always review Prisma's generated SQL before running it in production. For index creation on large tables, you may need to bypass Prisma's migration runner entirely and execute the SQL directly.
SQLAlchemy / Alembic
Alembic gives you the most control of any ORM migration tool, but that means you are responsible for safety. By default, Alembic generates migrations that run in a transaction. For CREATE INDEX CONCURRENTLY, you need to explicitly configure the migration to run outside a transaction.
from alembic import op
import sqlalchemy as sa
# Disable transaction for this migration
# In env.py, use context.configure(transaction_per_migration=True)
# Then in your migration:
def upgrade():
# This runs outside the transaction block
op.execute("SET lock_timeout = '5s'")
op.create_index(
'idx_users_email',
'users',
['email'],
postgresql_concurrently=True
)
def downgrade():
op.drop_index('idx_users_email', 'users')
Alembic's postgresql_concurrently=True parameter generates the correct SQL. The gotcha is that you must also ensure the migration is not wrapped in a transaction, which requires configuring the migration context. Alembic's autogenerate will not add postgresql_concurrently for you — you must add it manually after reviewing the generated migration.
Summary of ORM Gotchas
| ORM | Safe Defaults? | Key Gotcha | Recommended Action |
|---|---|---|---|
| Django | Mostly | AddField safe on PG11+; indexes are not concurrent |
Use AddIndexConcurrently |
| Rails | No | add_index blocks writes by default |
Use algorithm: :concurrently + strong_migrations gem |
| Prisma | No | Transaction wrapper prevents CONCURRENTLY |
Manually edit SQL, bypass migration runner for indexes |
| Alembic | No | Autogenerate omits concurrently; manual review needed |
Add postgresql_concurrently=True, disable txn |
The common thread across all ORMs is the same: review the generated SQL before running it in production. Every ORM makes trade-offs between convenience and safety. The ORM does not know the size of your tables, the load on your database, or whether a 30-second lock will take down your application. You do. Act on that knowledge.
Final thought: Zero-downtime migrations are not about finding the perfect tool. They are about understanding what PostgreSQL does when you change a schema, and making deliberate choices about how and when to do it. The Expand and Contract pattern works with any tool or ORM. Setting lock_timeout works with any tool or ORM. Testing against a backup works with any tool or ORM. Master the principles, and the specific tool becomes a detail.
Sources
- Xata — Zero-Downtime Schema Migrations in PostgreSQL
- pgroll — Zero-Downtime PostgreSQL Schema Migrations (GitHub)
- Reshape — Zero-Downtime Schema Migrations for PostgreSQL (GitHub)
- Antonio Di Pinto — Zero-Downtime Schema Migrations in PostgreSQL (Medium)
- Lob — Running Database Changes with Zero Downtime
- DEV Community — Zero-Downtime Database Migration: The Definitive Guide