Schema migrations with Alembic, Python and PostgreSQL
PublishedIn his latest Write Stuff article, Gigi Sayfan takes a dive into database migrations with an Alembic Tour-de-Force.
Database schema migrations are the bane of agile development and continuous deployment. Some migrations can be trivial if you just add a new table or a new column that can be null. But, in other cases it can be very complicated when it involves denormalizing existing tables or changing column data types. In some cases, the schema change will require data migration too. What makes data schema migrations particularly complicated is that often they must be done in tandem with corresponding code changes. In large distributed systems you often don't have the capability to deploy a whole new version of your code + update your database schema instantly. In these cases, you must must resort to multi-step rollout. Then, there is the concern of being able to rollback your changes if something goes wrong.
In this article, I'll discuss all these scenarios and use case. I'll use Compose PostgreSQL as the target database, Python 3.5.1 as the programming language, SQLAlchemy 1.0.13 as the DB access library and Alembic 0.8.6 as the the database migration toolkit.
Sample Application - Root Cause Analysis
To make everything more concrete let's design a database to solve a real problem - analyzing the root cause of bugs that were discovered in production. As big believer in tests, I always say that code that is not tested properly is broken. The opposite is that every bug can be traced back to missing or insufficient test. The domain is then bugs and their root causes. For each bug I want to record its root cause and later I'll be able to analyze the root causes and find ways to improve the process. Here is the initial database design with some test data:
It's a single table called bug
. I capture the URL of the bug in the bug tracking system where a lot of additional information can be recorded. The root_cause
is just a string. Finally there is the who
and when
columns.
This is pretty basic, but can serve as a good starting point. I'll evolve it during the article.
BTW, bug-tracker.org is a fictional bug tracking system. There is nothing there.
SQLAlchemy Models
SQLAlchemy is arguably the best relational database access toolkit. It provides several interfaces. In this article I'll stick to the SQLAlchemy ORM. Here is the bug table defined as a SQLAlchemy model in the models.py file.
from sqlalchemy import Column, DateTime, String, Integer, func
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Bug(Base):
__tablename__ = 'bug'
id = Column(Integer, primary_key=True)
bug_tracker_url = Column(String, unique=True)
root_cause = Column(String)
who = Column(String)
when = Column(DateTime, default=func.now())
def __repr__(self):
return 'id: {}, root cause: {}'.format(self.id, self.root_cause)
I will not explain too much about SQLAlchemy. The main point is that it provides a convenient object-oriented / Pythonic way to define your schema and later to evolve it.
Alembic Basics
Alembic is designed to work with SQLAlchemy and provides excellent migration support. Alembic works by managing change management scripts. Whenever you modify your models you create (or let Alembic create for you) a Python script that Alembic can invoke and upgrade your actual database schema. To install SQLAlchemy, Psycopg2 (the Postgress driver) and Alembic you can pip install -r requirements.txt
. Once all the requirements are installed you need to prepare your project for working with alembic. First verify that alembic is installed properly. Type alembic
and you should see something like:
(alembic-tour-de-force)(G)/alembic-tour-de-force > alembic
usage: alembic [-h] [-c CONFIG] [-n NAME] [-x X] [--raiseerr]
{branches,current,downgrade,edit,heads,history,init,list_templates,merge,revision,show,stamp,upgrade}
To prepare your project to work with alembic type alembic init alembic
in your project's directory:
(G)/alembic-tour-de-force > alembic init alembic
Creating directory /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic ... done
Creating directory /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/versions ... done
Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic.ini ... done
Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/env.py ... done
Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/README ... done
Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/script.py.mako ... done
Please edit configuration/connection/logging settings in '/Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic.ini' before proceeding.
That will create a sub-directory called alembic
that contains the following files and directories:
(alembic-tour-de-force)(G)/alembic-tour-de-force > tree alembic
alembic
├── README
├── env.py
├── script.py.mako
└── versions
The README file is not very interesting. You can add your own instrutions for how to upgrade. The env.py is a Python script that performas a lot of the heavy lifting and script.py.mako is a template for generating migration scripts. You don't call them directly, but use the alembic
command-line tool. Finally, the versions directory is where your migration scripts go. Another important file that is generated is the alembic.ini file in your project's directory. You need to set the sqlalchemy.url to point to your database. You can leave the rest as is or modify to your liking.
Generate a Baseline Script
Let's generate a baseline script for the current state of the database. The baseline script can generate a database schema from scratch. I use the revision
command to generate a new revision of the database (the first in this case). The -m
flag is the revision name. Alembic will concatenate a uuid before to ensure uniqueness and create a migration script in the alembic/versions sub-directory.
alembic revision -m "baseline"
Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/versions/bc25b5f8939f_baseline.py ... done
This generates an empty script with upgrade() and downgrade() functions. In order to generate the baseline script I added some code to generate our bug
table:
"""baseline
Revision ID: bc25b5f8939f
Revises:
Create Date: 2016-05-29 15:00:18.721577
"""
# revision identifiers, used by Alembic.
revision = 'bc25b5f8939f'
down_revision = None
branch_labels = None
depends_on = None
from alembic import op
import sqlalchemy as sa
def upgrade():
op.create_table(
'bug',
sa.Column('id', sa.Integer, primary_key=True),
sa.Column('bug_tracker_url', sa.String(), nullable=False),
sa.Column('root_cause', sa.String()),
sa.Column('who', sa.String()),
sa.Column('when', sa.DateTime(), default=sa.func.now()))
def downgrade():
op.drop_table('bug')
Note, the revision ID bc25b5f8939f. Using this revisions ID and the down_revision alembic keeps track of the sequence of revisions. Since this is the first revision there is no down_revision.
To test the revision I dropped the bug
table in compose.io PostgressSQL SQL tab: http://i.imgur.com/asbicmT.png
Then I ran the alembic upgrade head
command. This command makes sure that the database is up to date with the most recent revision.
(G)/alembic-tour-de-force > alembic upgrade head
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> bc25b5f8939f, baseline
As you can see alembic correctly ran the baseline revision, created the bug table according to the upgrade instruction and also added its own table alembic_version
. This table has exactly one row with one column called version_num
that contains the current alembic revision ID. That's how alembic keeps track of where the database is and knows what revisions to apply to bring it up to date.
Modify the Models
The current modeling is fragile. The root cause is just a text field. There is no curated list of root causes. Anyone can write whatever text they want for the root cause. Typos can sneak in. Did you notice the typo for the root cause of bug 22? The root cause is 'no integration test' instead of 'no integration test'. Let's create a separate table for root causes that bugs will be related to via a foreign key.
from sqlalchemy import Column, DateTime, String, Integer, func, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
Base = declarative_base()
class RootCause(Base):
__tablename__ = 'root_cause'
id = Column(Integer, primary_key=True)
name = Column(String)
class Bug(Base):
__tablename__ = 'bug'
id = Column(Integer, primary_key=True)
root_cause_id = Column(ForeignKey('root_cause.id'),
nullable=False,
index=True)
bug_tracker_url = Column(String, unique=True)
who = Column(String)
when = Column(DateTime, default=func.now())
root_cause = relationship(RootCause)
def __repr__(self):
return 'id: {}, root cause: {}'.format(self.id, self.root_cause.name)
Auto-generate a Migration Script
I need to write a migration script from the current single Bug table to the new Bug + RootCause tables including setting the foreign key correctly. This is annoying especially when the DB schema is changed a lot. Alembic's killer feature is its ability to auto-generate the migration scripts. Alembic looks at the current models to compare them with current DB schema and figures out what changed.
Before you can use the auto-generation you need to tell alembic where to find your model's metadata. This is done in the env.py
file. Here is the relevant section where I added the last line:
# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = models.Base.metadata
Auto-generating the migration script uses the --autogenerate
flag to the alembic revision
command.
(G)/alembic-tour-de-force > alembic revision --autogenerate -m "add root_cause table"
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.autogenerate.compare] Detected added table 'root_cause'
INFO [alembic.ddl.postgresql] Detected sequence named 'bug_id_seq' as owned by integer column 'bug(id)', assuming SERIAL and omitting
INFO [alembic.autogenerate.compare] Detected added column 'bug.root_cause_id'
INFO [alembic.autogenerate.compare] Detected NULL on column 'bug.bug_tracker_url'
INFO [alembic.autogenerate.compare] Detected added index 'ix_bug_root_cause_id' on '['root_cause_id']'
INFO [alembic.autogenerate.compare] Detected added unique constraint 'None' on '['bug_tracker_url']'
INFO [alembic.autogenerate.compare] Detected added foreign key (root_cause_id)(id) on table bug
INFO [alembic.autogenerate.compare] Detected removed column 'bug.root_cause'
Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/versions/6c55e0560fd4_add_root_cause_table.py ... done
Alembic generated a complete migration script:
"""add root_cause table
Revision ID: 6c55e0560fd4
Revises: bc25b5f8939f
Create Date: 2016-05-30 23:57:53.858201
"""
# revision identifiers, used by Alembic.
revision = '6c55e0560fd4'
down_revision = 'bc25b5f8939f'
branch_labels = None
depends_on = None
from alembic import op
import sqlalchemy as sa
def upgrade():
### commands auto generated by Alembic - please adjust! ###
op.create_table('root_cause',
sa.Column('id', sa.Integer(), nullable=False),
sa.Column('name', sa.String(), nullable=True),
sa.PrimaryKeyConstraint('id')
)
op.add_column('bug', sa.Column('root_cause_id', sa.Integer(), nullable=False))
op.alter_column('bug', 'bug_tracker_url',
existing_type=sa.VARCHAR(),
nullable=True)
op.create_index(op.f('ix_bug_root_cause_id'), 'bug', ['root_cause_id'], unique=False)
op.create_unique_constraint(None, 'bug', ['bug_tracker_url'])
op.create_foreign_key(None, 'bug', 'root_cause', ['root_cause_id'], ['id'])
op.drop_column('bug', 'root_cause')
### end Alembic commands ###
def downgrade():
### commands auto generated by Alembic - please adjust! ###
op.add_column('bug', sa.Column('root_cause', sa.VARCHAR(), autoincrement=False, nullable=True))
op.drop_constraint(None, 'bug', type_='foreignkey')
op.drop_constraint(None, 'bug', type_='unique')
op.drop_index(op.f('ix_bug_root_cause_id'), table_name='bug')
op.alter_column('bug', 'bug_tracker_url',
existing_type=sa.VARCHAR(),
nullable=False)
op.drop_column('bug', 'root_cause_id')
op.drop_table('root_cause')
### end Alembic commands ###
Note the down_revision, which is the baseline. To bring the database itself up to date let's upgrade to the new head.
(G)/alembic-tour-de-force > alembic upgrade head
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade bc25b5f8939f -> 6c55e0560fd4, add root_cause table
Checking the database itself the schema has been changed indeed and there is a new root_cause table: http://i.imgur.com/0Zqn28P.png.
Limitations of Auto-generation
Auto-generation is magical but it has some limits you need to be aware of. First of all if you rename a table or a column alembic will consider it a removal and addition of a new table/column. You will lose all the data. You should manually adjust the aut-generated script and replace the drop/create with an explicit rename operation that will preserve the data.
You should name all your constraints. Alembic can't handle anonymously named constraints.
There are some unimplemented features that auto-generation will detect and handle in the future:
- Some free-standing constraint additions and removals, like CHECK, PRIMARY KEY
- Sequence additions and removals
SQL Generation
You may not be comfortable directly running a migration script via alembic upgrade
or you may not be sure what the impact is going to be on your database. Alembic has for this purpose special offline mode where it generate a SQL file with all the SQL statements that represent the upgrade script instead of executing them. Then, you can observe the SQL file, make sure it doesn't do anything fishy and also perform the migration using the SQL file.
The command is alembic upgrade --sql
. You need to provide also a start and end revision. Let's see the SQL that generated our most recent upgrade.
(G)/alembic-tour-de-force > alembic upgrade --sql bc25b5f8939f:6c55e0560fd4
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Generating static SQL
INFO [alembic.runtime.migration] Will assume transactional DDL.
BEGIN;
INFO [alembic.runtime.migration] Running upgrade bc25b5f8939f -> 6c55e0560fd4, add root_cause table
-- Running upgrade bc25b5f8939f -> 6c55e0560fd4
CREATE TABLE root_cause (
id SERIAL NOT NULL,
name VARCHAR,
PRIMARY KEY (id)
);
ALTER TABLE bug ADD COLUMN root_cause_id INTEGER NOT NULL;
ALTER TABLE bug ALTER COLUMN bug_tracker_url DROP NOT NULL;
CREATE INDEX ix_bug_root_cause_id ON bug (root_cause_id);
ALTER TABLE bug ADD UNIQUE (bug_tracker_url);
ALTER TABLE bug ADD FOREIGN KEY(root_cause_id) REFERENCES root_cause (id);
ALTER TABLE bug DROP COLUMN root_cause;
UPDATE alembic_version SET version_num='6c55e0560fd4' WHERE alembic_version.version_num = 'bc25b5f8939f';
COMMIT;
Multi-Stage Upgrade
Naive schema changes may drop data. For example, if you look at the SQL generated above you'll note that the root_cause column was dropped from the bug
table. In order to preserve the data you'll have to perform a multi-stage migration. Here's how it can work for the root_cause case. First step just add the root_cause table and the root_cause_id foreign key, but DON'T remove the root_cause column just yet. Perform the migration and the code will still read the root cause information from the root_cause column in the bug
table. Then, the information in the root_cause column of the bug
table will be inserted into the new root_cause table. The code will be changed to read from the "root_cause" table. At this point the root_cause column is not used anymore and can be dropped in a second migration.
This approach may be necessary even if there is no concern about data loss. For example, in large distributed systems it is difficult if not impossible to perform synchronized changes to the DB schema and the code on all the servers. In these case every breaking change, must be done in a multi-stage approach (unless the system is designed to handle temporary mismatches and can recover gracefully later).
Conclusion
Database schema migrations are an important aspect of large-scale systems. They must be handled properly in order to prevent data loss and system availability. Alembic is an excellent solution for SQLAlchemy-based systems. It provides a methodical approach and supports auto-generation of migration scripts.
attribution Bitterherbs1