A schema change that takes 30 seconds on staging can lock a 50-million-row production table for 20 minutes. I have sat in the war room for that incident more than once. The engineer who wrote the migration did nothing wrong by any normal definition. They added a column, ran Doctrine Migrations, tested it, shipped it. The staging database had 40,000 rows. The production database had 50 million. The difference between those two numbers, not any difference in the code, is what caused the outage.
Zero-downtime schema migration is not a framework feature. It is a design discipline. It requires that you stop thinking about schema changes as atomic events and start thinking about them as multi-step choreographies that stretch across several deploys, each of which leaves the system in a shippable state.
This essay is the habit set I have ended up with after running this choreography enough times to have it wired into muscle memory. It assumes PostgreSQL, because most of my clients run on it, but the principles transfer cleanly to MySQL with the locking caveats noted where they apply.
Why the naive migration fails
The naive migration is the one Doctrine generates for you by default: ALTER TABLE post ADD COLUMN tenant_id UUID NOT NULL DEFAULT gen_random_uuid();. It passes review. It passes staging. Then it hits production and takes the lock.
On PostgreSQL, that statement does two things. It acquires an ACCESS EXCLUSIVE lock on the table, which blocks every other query against it (including reads). Then it rewrites every row to add the new column with its default value. On a 50-million-row table, the rewrite takes minutes to tens of minutes, during which every request that touches the table blocks, then eventually times out, then fails. Your users see 500 errors. Your on-call pages. You revert.
The statement is valid. It is just not valid at your scale during serving hours. The fix is not to get better at single-statement migrations. The fix is to stop doing single-statement migrations on large tables. Break the change into steps that each take a lock for milliseconds, not minutes.
The expand-contract pattern
The pattern that makes this work is called expand-contract (also “parallel change”), and if you learn one thing from this essay, learn it.
Any schema change, no matter how small it looks, is modeled as a sequence of four phases. Each phase is a separate deploy. The system is fully functional after each.
- Expand. Add the new schema (column, table, index) in a form that is safe to add online. The old schema still exists and is still authoritative.
- Migrate writes. Application code is updated so that every write updates both old and new schema. Reads still come from the old schema.
- Backfill. Historical data in the new schema is populated from the old, in batches, in the background, over hours if needed. Reads are now available from either side.
- Contract. Reads switch to the new schema. After a cooldown period during which you verify no caller reads the old schema, drop it.
Each of these phases ships independently. Each phase leaves the application fully functional. If any phase reveals a problem, you stop there and either fix forward or roll back to the previous phase. You never ship a “big bang” migration that has to either succeed completely or not at all.
This is more deploys, more code, and more elapsed time than the naive approach. That is the cost. The benefit is that no single deploy can take the database down.
Example: renaming a column on a large table
Concrete case. You have a users table with a username column. Product renamed the concept to “handle.” Engineering wants the column to match. The table has 80 million rows across multiple tenants.
The naive migration:
ALTER TABLE users RENAME COLUMN username TO handle;
On PostgreSQL this is actually fast (metadata-only change, microseconds). The reason it still causes an outage is not the SQL; it is the deploy. Application code on version N calls u.username. Application code on version N+1 calls u.handle. For the window in which both versions are live (rolling deploy, blue-green, any real deploy strategy), one of them is wrong. The column has been renamed and old pods now fail, or it has not and new pods fail.
The expand-contract version:
Deploy 1 (expand). Add handle as a new column, nullable, no default. This is cheap on PostgreSQL (metadata-only, no row rewrite).
public function up(Schema $schema): void
{
$this->addSql('ALTER TABLE users ADD COLUMN handle VARCHAR(255)');
}
Application code is unchanged. Every row has handle = NULL and username = '<value>'.
Deploy 2 (dual-write). Application code writes both columns on every create and update.
public function setUsername(string $value): void
{
$this->username = $value;
$this->handle = $value;
}
Better: introduce a single setHandle(string $value) that writes both, and route all callers through it. New rows now have both fields populated. Existing rows still have handle = NULL.
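A minimal sketch of that setter, assuming the entity keeps both properties during the transition (names match the example above):

public function setHandle(string $value): void
{
    $this->handle = $value;
    // Keep the legacy column in sync until the contract phase drops it.
    $this->username = $value;
}

/** @deprecated Call setHandle(); kept only until username is dropped. */
public function setUsername(string $value): void
{
    $this->setHandle($value);
}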
Deploy 3 (backfill). A one-off job (Symfony command, or a migration with batch loop) sets handle = username for every row where handle IS NULL, in batches of 1,000, with a small delay between batches. This runs over hours or days. The database is not blocked; individual batches are fast enough that they do not notice.
#[AsCommand(name: 'app:users:backfill-handle')]
final readonly class BackfillHandleCommand
{
public function __construct(
private UserRepositoryInterface $users,
private EntityManagerInterface $em,
) {}
public function __invoke(OutputInterface $output): int
{
$batchSize = 1000;
do {
// PostgreSQL has no UPDATE ... LIMIT, so pick a batch of ids in a subquery.
$affected = $this->em->getConnection()->executeStatement(
'UPDATE users
SET handle = username
WHERE id IN (SELECT id FROM users WHERE handle IS NULL LIMIT :batch)',
['batch' => $batchSize],
['batch' => \Doctrine\DBAL\ParameterType::INTEGER],
);
$output->writeln(\sprintf('Backfilled %d rows', $affected));
\usleep(100_000); // 100ms, keeps the table responsive
} while (0 < $affected);
return Command::SUCCESS;
}
}
At the end of the backfill, every row has a non-null handle.
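Before flipping reads, it is worth a cheap verification that nothing was missed; a sketch, assuming the same DBAL connection used in the backfill command:

// One-off sanity check before the read switch: any row still missing a handle?
$unbackfilled = $this->em->getConnection()->fetchOne(
    'SELECT COUNT(*) FROM users WHERE handle IS NULL'
);
if (0 < (int) $unbackfilled) {
    throw new \RuntimeException(\sprintf('%d rows still have no handle', $unbackfilled));
}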
Deploy 4 (read switch). Application code now reads from handle everywhere it previously read username. Writes still populate both. Production runs this for a week; you watch metrics for any errors related to handle being null or missing.
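If you want the switch to tolerate stragglers, a defensive getter is one option; an illustrative sketch, unnecessary if the verification above came back clean:

public function getHandle(): string
{
    // The fallback covers any row written before dual-write went live
    // and not yet reached by the backfill.
    return $this->handle ?? $this->username;
}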
Deploy 5 (contract). Once you are confident no caller reads username, drop the column:
public function up(Schema $schema): void
{
$this->addSql('ALTER TABLE users DROP COLUMN username');
}
On most RDBMS this is a metadata-only operation, instant even on large tables.
Five deploys to rename a column. Elapsed time: typically two to four weeks including the backfill soak. Downtime: zero. That is the trade.
The lock table: know what each statement does
Before you ship any migration on a large table, you need to know what lock it takes and for how long. Staging will lie to you. A 10-MB staging table never blocks long enough to notice. A 100-GB production table will.
Here is the short version for PostgreSQL 15+, which covers most of what you will hit. This is worth printing and sticking on the wall.
Fast (metadata only, safe at any size):
- ADD COLUMN with no default or with a constant default (on PG 11+).
- DROP COLUMN.
- RENAME COLUMN, RENAME TABLE.
- ADD CONSTRAINT ... NOT VALID (deferred validation).
- ALTER TYPE ... ADD VALUE on an enum (PG 12+).
Potentially blocking (test on a replica first):
- ADD COLUMN with a volatile default (gen_random_uuid(), clock_timestamp()): forces a full table rewrite even on PG 11+; the metadata-only fast path applies only to constant and stable defaults (a plain now() qualifies).
- CREATE INDEX without CONCURRENTLY: holds a SHARE lock for the entire build, blocking every write to the table. Never do this on a large table in production.
- ALTER COLUMN SET NOT NULL: requires a full table scan under an ACCESS EXCLUSIVE lock. On a large table, that is minutes of blocking.
- ALTER COLUMN TYPE ... that requires a rewrite (e.g., VARCHAR to TEXT is safe; INT to BIGINT is a rewrite): blocks for the duration.
Always dangerous on large tables:
- ALTER TABLE ... ADD FOREIGN KEY ... without NOT VALID: acquires locks on both the referencing and referenced tables and scans the referencing table.
- VACUUM FULL: rewrites the entire table, blocks everything.
- Dropping an index: takes an ACCESS EXCLUSIVE lock; brief, but it still blocks.
To check any non-trivial migration, I reach for strong_migrations as a mental model (it is Rails-ecosystem, but the rules translate) or cross-check each statement against the PostgreSQL documentation.
MySQL adds its own wrinkles around ALGORITHM=INPLACE and the specific InnoDB behavior. If you are on MySQL, the equivalent of this lock table is even more important because the gotchas are different.
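If you are on MySQL 8.0 / InnoDB, one defensive idiom is to assert the algorithm and lock level in the statement itself, so the migration errors out instead of silently falling back to a blocking table copy. A sketch (table and column names are illustrative):

public function up(Schema $schema): void
{
    // Fails with an error if InnoDB cannot do this in place without locking,
    // rather than quietly running a full table copy.
    $this->addSql('ALTER TABLE posts ADD COLUMN reviewed_at DATETIME NULL, ALGORITHM=INPLACE, LOCK=NONE');
}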
The four statements that need special handling
Four specific schema changes come up often enough that they deserve explicit recipes.
Adding a NOT NULL column
The naive form (ADD COLUMN status VARCHAR(32) NOT NULL DEFAULT 'active') is safe on recent PostgreSQL because constant defaults are stored as metadata. But if you want a dynamic default, a computed default, or no default at all, the safe form is:
- Add the column as nullable.
- Deploy application code that populates it on every write.
- Backfill historical rows.
- Add a NOT NULL constraint via ADD CONSTRAINT ... CHECK (column IS NOT NULL) NOT VALID, then VALIDATE CONSTRAINT (which takes only a SHARE UPDATE EXCLUSIVE lock, allowing concurrent reads and writes).
- Convert the check constraint into a real NOT NULL column constraint with SET NOT NULL (PG 12+ sees the validated check and skips the full scan), then drop the now-redundant check constraint.
This is four migrations, but each step is fast and safe.
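As Doctrine migrations, the constraint steps look roughly like this; a sketch with an illustrative orders.status column and constraint name, split so the slow scan happens under the weaker lock:

// Migration A: attach the constraint without validating existing rows (fast lock).
public function up(Schema $schema): void
{
    $this->addSql(
        'ALTER TABLE orders ADD CONSTRAINT chk_orders_status_not_null
         CHECK (status IS NOT NULL) NOT VALID'
    );
}

// Migration B, after the backfill: validate under SHARE UPDATE EXCLUSIVE,
// then promote to a real NOT NULL (PG 12+ reuses the validated check, no scan).
public function up(Schema $schema): void
{
    $this->addSql('ALTER TABLE orders VALIDATE CONSTRAINT chk_orders_status_not_null');
    $this->addSql('ALTER TABLE orders ALTER COLUMN status SET NOT NULL');
    $this->addSql('ALTER TABLE orders DROP CONSTRAINT chk_orders_status_not_null');
}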
Adding a foreign key
A foreign key addition scans the referencing table to verify every row points at a valid row in the referenced table. On large tables, this scan blocks.
The safe form uses NOT VALID:
ALTER TABLE posts ADD CONSTRAINT fk_posts_author
FOREIGN KEY (author_id) REFERENCES authors(id)
NOT VALID;
This adds the constraint without validating existing rows. New writes are validated; old rows are not checked yet. The lock is fast.
Then, in a separate migration (or later, when load is low):
ALTER TABLE posts VALIDATE CONSTRAINT fk_posts_author;
VALIDATE takes a SHARE UPDATE EXCLUSIVE lock, which allows reads and writes to continue. It scans the table, but it does not block traffic.
Creating an index on a large table
CREATE INDEX blocks. CREATE INDEX CONCURRENTLY does not. Always use the latter on a large production table:
CREATE INDEX CONCURRENTLY idx_posts_published_at ON posts (published_at);
The catch: CONCURRENTLY cannot run inside a transaction. Doctrine Migrations wraps every migration in a transaction by default. You need to opt out for this migration:
public function isTransactional(): bool
{
return false;
}
The other catch: if CREATE INDEX CONCURRENTLY fails partway through (say, due to a conflict), it leaves an invalid index behind. Check pg_index for entries with indisvalid = false and drop them before retrying.
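To find the leftovers, a sketch using the same DBAL connection as the backfill command (the query is what matters; running it from psql works just as well):

// List indexes left invalid by an interrupted CREATE INDEX CONCURRENTLY.
$invalid = $this->em->getConnection()->fetchAllAssociative(
    'SELECT c.relname AS index_name
     FROM pg_index i
     JOIN pg_class c ON c.oid = i.indexrelid
     WHERE NOT i.indisvalid'
);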
Changing a column’s type
This one is painful because there is no single safe recipe. The pattern depends on whether the new type is binary-compatible with the old (VARCHAR(100) to VARCHAR(200) is free; VARCHAR to TEXT on recent PG is free; INT to BIGINT is a rewrite).
For a compatible change, the migration is a single ALTER COLUMN TYPE, metadata only. For an incompatible change, you are back in expand-contract:
- Add the new column with the new type.
- Dual-write to both.
- Backfill.
- Switch reads.
- Drop the old.
There is no shortcut. If someone proposes a shortcut, check the lock they are taking.
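For the compatible case mentioned above, the whole change really is one cheap statement; a sketch (the column and width are illustrative):

public function up(Schema $schema): void
{
    // Widening a varchar is a catalog-only change on PostgreSQL: no rewrite, no scan.
    $this->addSql('ALTER TABLE users ALTER COLUMN handle TYPE VARCHAR(500)');
}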
Backfill: batched, throttled, observable
The backfill phase is where most outages hide when teams graduate from naive migrations to expand-contract. The backfill itself has to be run carefully, or it becomes the new outage.
Rules of thumb for a safe backfill:
Batch size. Start with 1,000 rows. If the table is hot (heavy write load), drop to 500 or 250. If it is cold, you can go to 5,000 or 10,000. The signal you are looking for: individual batches complete in under 100 ms. If they take longer, you are holding locks long enough to interfere with real traffic.
Throttling. Between batches, sleep. Even 100 ms between batches dramatically reduces the pressure on the database and gives other queries room to breathe. If the table is very hot, increase to 500 ms.
Monitoring. The backfill should emit metrics. Rows backfilled per second. Time per batch. Lock wait events. You want to see the backfill run on a dashboard and stop it instantly if anything looks wrong. I have killed more than one backfill that was running fine until p95 latency on the API started climbing.
Idempotency. The backfill query should always be a WHERE new_column IS NULL (or equivalent guard), so that restarting it picks up where it left off. Never write a backfill that needs to know its own state.
Kill switch. The command should accept a rate-limit flag that lets you reduce batch size or increase delay without redeploying. A Symfony command with #[Option] parameters works well. If the on-call engineer sees latency rise at 2 AM, they should be able to slow the backfill without merging code.
public function __invoke(
OutputInterface $output,
#[Option(description: 'Rows per batch')]
int $batchSize = 1000,
#[Option(description: 'Milliseconds between batches')]
int $delay = 100,
): int {
// ...
}
The migration habits
Beyond the specific recipes, three habits make all of this easier to sustain.
Review migrations for lock behavior, not just SQL correctness. Every migration PR in the teams I work with goes through a lock review: which lock does this take, for how long, against what size of table in production? “We ran it on staging” is not an answer. “On a table this size in production, this statement takes an ACCESS EXCLUSIVE lock for an estimated 8 minutes” is.
Never modify a merged migration. Once a migration is in production, it is frozen. If it was wrong, write a new migration that fixes it forward. Editing history breaks every other environment that has already run the old version. The CLAUDE.md convention for this is explicit and right.
Separate schema migrations from data migrations. Schema changes go in Doctrine migrations. Bulk data changes (backfills, cleanups, rewrites) go in Symfony commands. Mixing the two in a single migration is a common mistake that turns a 1-second schema change into a 20-minute data operation wrapped in a transaction. The transaction holds the schema lock for the duration of the data work, and that is how you get a 20-minute outage.
Test migrations against production-sized data. Every engagement I do, I ask to see the staging database size. On average, it is 0.1 percent of production. That is useless for verifying migration safety. The fix is to keep a scrubbed, anonymized copy of production (size-matched, not data-matched) somewhere engineering can run migrations against it. This is infrastructure work, but it pays for itself the first time it catches a pre-production lock disaster.
When you can skip all of this
There are cases where the full expand-contract dance is overkill:
- Tables under a few hundred thousand rows. The naive migration is fine. Even a full table rewrite completes in seconds. Do not introduce the overhead of expand-contract for a settings table with 200 rows.
- True maintenance windows. If the business has a weekly maintenance window during which the application is legitimately offline, you can run a traditional migration inside the window. (Most businesses no longer have real maintenance windows. Check before assuming.)
- New tables. Creating a table that no code yet references is a single migration. There is nothing to coordinate. You only need expand-contract when the table already has callers.
The discipline is for the cases where it matters: large tables with traffic that cannot be paused. Those are the cases that cause the incidents.
The playbook, condensed
The checklist I walk every migration PR through, in one place:
- What is the row count on this table in production?
- What lock does this statement take, and for how long?
- Can this be split into expand-contract steps?
- If there is a backfill, is it batched, throttled, idempotent, and observable?
- Does the migration opt out of transactions if needed (CREATE INDEX CONCURRENTLY)?
- Has a lock review happened in the PR, with numbers?
- Is the deploy sequenced so the application is functional at every step?
If every answer is “yes, and I can point to the line that shows it,” the migration is safe. If any answer is “I didn’t think about that,” send the PR back. Production will ask the question you skipped, and it will not ask gently.
If you are looking at a database that is too large to migrate safely with the patterns your team currently uses, my scaling engagement includes a full migration audit: the lock table translated to your stack, a template for expand-contract migrations, and the review checklist wired into your PR process, so the next schema change does not page anyone.
References
- ParallelChange by Danilo Sato (Martin Fowler's bliki): the canonical writeup of the expand / migrate / contract pattern this essay is built on.
- PostgreSQL: Explicit Locking: official reference for ACCESS EXCLUSIVE, SHARE UPDATE EXCLUSIVE, and the full lock compatibility matrix.
- PostgreSQL: CREATE INDEX: documents the CONCURRENTLY option and its transaction restrictions.
- PostgreSQL: ALTER TABLE: reference for ADD CONSTRAINT ... NOT VALID, VALIDATE CONSTRAINT, and the constant-default ADD COLUMN optimisation.
- Doctrine Migrations: the migration library and the isTransactional() opt-out used in the CONCURRENTLY recipe.
- strong_migrations (Rails ecosystem): a catalogue of unsafe migration patterns whose rules translate cleanly to Doctrine.