Schema migrations with Alembic, Python and PostgreSQL

In his latest Write Stuff article, Gigi Sayfan takes a dive into database migrations with an Alembic Tour-de-Force.

Database schema migrations are the bane of agile development and continuous deployment. Some migrations can be trivial if you just add a new table or a new column that can be null. But, in other cases it can be very complicated when it involves denormalizing existing tables or changing column data types. In some cases, the schema change will require data migration too. What makes data schema migrations particularly complicated is that often they must be done in tandem with corresponding code changes. In large distributed systems you often don't have the capability to deploy a whole new version of your code + update your database schema instantly. In these cases, you must must resort to multi-step rollout. Then, there is the concern of being able to rollback your changes if something goes wrong.

In this article, I'll discuss all these scenarios and use case. I'll use Compose PostgreSQL as the target database, Python 3.5.1 as the programming language, SQLAlchemy 1.0.13 as the DB access library and Alembic 0.8.6 as the the database migration toolkit.

Sample Application - Root Cause Analysis

To make everything more concrete let's design a database to solve a real problem - analyzing the root cause of bugs that were discovered in production. As big believer in tests, I always say that code that is not tested properly is broken. The opposite is that every bug can be traced back to missing or insufficient test. The domain is then bugs and their root causes. For each bug I want to record its root cause and later I'll be able to analyze the root causes and find ways to improve the process. Here is the initial database design with some test data:

It's a single table called bug. I capture the URL of the bug in the bug tracking system where a lot of additional information can be recorded. The root_cause is just a string. Finally there is the who and when columns.

This is pretty basic, but can serve as a good starting point. I'll evolve it during the article.

BTW, bug-tracker.org is a fictional bug tracking system. There is nothing there.

SQLAlchemy Models

SQLAlchemy is arguably the best relational database access toolkit. It provides several interfaces. In this article I'll stick to the SQLAlchemy ORM. Here is the bug table defined as a SQLAlchemy model in the models.py file.

from sqlalchemy import Column, DateTime, String, Integer, func  
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Bug(Base):  
    __tablename__ = 'bug'
    id = Column(Integer, primary_key=True)
    bug_tracker_url = Column(String, unique=True)
    root_cause = Column(String)
    who = Column(String)
    when = Column(DateTime, default=func.now())

    def __repr__(self):
        return 'id: {}, root cause: {}'.format(self.id, self.root_cause)

I will not explain too much about SQLAlchemy. The main point is that it provides a convenient object-oriented / Pythonic way to define your schema and later to evolve it.

Alembic Basics

Alembic is designed to work with SQLAlchemy and provides excellent migration support. Alembic works by managing change management scripts. Whenever you modify your models you create (or let Alembic create for you) a Python script that Alembic can invoke and upgrade your actual database schema. To install SQLAlchemy, Psycopg2 (the Postgress driver) and Alembic you can pip install -r requirements.txt. Once all the requirements are installed you need to prepare your project for working with alembic. First verify that alembic is installed properly. Type alembic and you should see something like:

(alembic-tour-de-force)(G)/alembic-tour-de-force > alembic
usage: alembic [-h] [-c CONFIG] [-n NAME] [-x X] [--raiseerr]  
               {branches,current,downgrade,edit,heads,history,init,list_templates,merge,revision,show,stamp,upgrade}

To prepare your project to work with alembic type alembic init alembic in your project's directory:

(G)/alembic-tour-de-force > alembic init alembic
  Creating directory /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic ... done
  Creating directory /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/versions ... done
  Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic.ini ... done
  Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/env.py ... done
  Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/README ... done
  Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/script.py.mako ... done
  Please edit configuration/connection/logging settings in '/Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic.ini' before proceeding.

That will create a sub-directory called alembic that contains the following files and directories:

(alembic-tour-de-force)(G)/alembic-tour-de-force > tree alembic
alembic  
├── README
├── env.py
├── script.py.mako
└── versions

The README file is not very interesting. You can add your own instrutions for how to upgrade. The env.py is a Python script that performas a lot of the heavy lifting and script.py.mako is a template for generating migration scripts. You don't call them directly, but use the alembic command-line tool. Finally, the versions directory is where your migration scripts go. Another important file that is generated is the alembic.ini file in your project's directory. You need to set the sqlalchemy.url to point to your database. You can leave the rest as is or modify to your liking.

Generate a Baseline Script

Let's generate a baseline script for the current state of the database. The baseline script can generate a database schema from scratch. I use the revision command to generate a new revision of the database (the first in this case). The -m flag is the revision name. Alembic will concatenate a uuid before to ensure uniqueness and create a migration script in the alembic/versions sub-directory.

alembic revision -m "baseline"  
  Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/versions/bc25b5f8939f_baseline.py ... done

This generates an empty script with upgrade() and downgrade() functions. In order to generate the baseline script I added some code to generate our bug table:

"""baseline

Revision ID: bc25b5f8939f  
Revises:  
Create Date: 2016-05-29 15:00:18.721577

"""

# revision identifiers, used by Alembic.
revision = 'bc25b5f8939f'  
down_revision = None  
branch_labels = None  
depends_on = None

from alembic import op  
import sqlalchemy as sa


def upgrade():  
    op.create_table(
        'bug',
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('bug_tracker_url', sa.String(), nullable=False),
        sa.Column('root_cause', sa.String()),
        sa.Column('who', sa.String()),
        sa.Column('when', sa.DateTime(), default=sa.func.now()))


def downgrade():  
    op.drop_table('bug')

Note, the revision ID bc25b5f8939f. Using this revisions ID and the down_revision alembic keeps track of the sequence of revisions. Since this is the first revision there is no down_revision.

To test the revision I dropped the bug table in compose.io PostgressSQL SQL tab: http://i.imgur.com/asbicmT.png

Then I ran the alembic upgrade head command. This command makes sure that the database is up to date with the most recent revision.

(G)/alembic-tour-de-force > alembic upgrade head
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.  
INFO  [alembic.runtime.migration] Will assume transactional DDL.  
INFO  [alembic.runtime.migration] Running upgrade  -> bc25b5f8939f, baseline  

As you can see alembic correctly ran the baseline revision, created the bug table according to the upgrade instruction and also added its own table alembic_version. This table has exactly one row with one column called version_num that contains the current alembic revision ID. That's how alembic keeps track of where the database is and knows what revisions to apply to bring it up to date.

Modify the Models

The current modeling is fragile. The root cause is just a text field. There is no curated list of root causes. Anyone can write whatever text they want for the root cause. Typos can sneak in. Did you notice the typo for the root cause of bug 22? The root cause is 'no integration test' instead of 'no integration test'. Let's create a separate table for root causes that bugs will be related to via a foreign key.

from sqlalchemy import Column, DateTime, String, Integer, func, ForeignKey  
from sqlalchemy.ext.declarative import declarative_base  
from sqlalchemy.orm import relationship

Base = declarative_base()


class RootCause(Base):  
    __tablename__ = 'root_cause'
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Bug(Base):  
    __tablename__ = 'bug'
    id = Column(Integer, primary_key=True)
    root_cause_id = Column(ForeignKey('root_cause.id'),
                           nullable=False,
                           index=True)
    bug_tracker_url = Column(String, unique=True)
    who = Column(String)
    when = Column(DateTime, default=func.now())

    root_cause = relationship(RootCause)

    def __repr__(self):
        return 'id: {}, root cause: {}'.format(self.id, self.root_cause.name)

Auto-generate a Migration Script

I need to write a migration script from the current single Bug table to the new Bug + RootCause tables including setting the foreign key correctly. This is annoying especially when the DB schema is changed a lot. Alembic's killer feature is its ability to auto-generate the migration scripts. Alembic looks at the current models to compare them with current DB schema and figures out what changed.

Before you can use the auto-generation you need to tell alembic where to find your model's metadata. This is done in the env.py file. Here is the relevant section where I added the last line:

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = models.Base.metadata  

Auto-generating the migration script uses the --autogenerate flag to the alembic revision command.

(G)/alembic-tour-de-force > alembic revision --autogenerate -m "add root_cause table"
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.  
INFO  [alembic.runtime.migration] Will assume transactional DDL.  
INFO  [alembic.autogenerate.compare] Detected added table 'root_cause'  
INFO  [alembic.ddl.postgresql] Detected sequence named 'bug_id_seq' as owned by integer column 'bug(id)', assuming SERIAL and omitting  
INFO  [alembic.autogenerate.compare] Detected added column 'bug.root_cause_id'  
INFO  [alembic.autogenerate.compare] Detected NULL on column 'bug.bug_tracker_url'  
INFO  [alembic.autogenerate.compare] Detected added index 'ix_bug_root_cause_id' on '['root_cause_id']'  
INFO  [alembic.autogenerate.compare] Detected added unique constraint 'None' on '['bug_tracker_url']'  
INFO  [alembic.autogenerate.compare] Detected added foreign key (root_cause_id)(id) on table bug  
INFO  [alembic.autogenerate.compare] Detected removed column 'bug.root_cause'  
  Generating /Users/gigi/Documents/dev/github/alembic-tour-de-force/alembic/versions/6c55e0560fd4_add_root_cause_table.py ... done  

Alembic generated a complete migration script:

"""add root_cause table

Revision ID: 6c55e0560fd4  
Revises: bc25b5f8939f  
Create Date: 2016-05-30 23:57:53.858201

"""

# revision identifiers, used by Alembic.
revision = '6c55e0560fd4'  
down_revision = 'bc25b5f8939f'  
branch_labels = None  
depends_on = None

from alembic import op  
import sqlalchemy as sa


def upgrade():  
    ### commands auto generated by Alembic - please adjust! ###
    op.create_table('root_cause',
    sa.Column('id', sa.Integer(), nullable=False),
    sa.Column('name', sa.String(), nullable=True),
    sa.PrimaryKeyConstraint('id')
    )
    op.add_column('bug', sa.Column('root_cause_id', sa.Integer(), nullable=False))
    op.alter_column('bug', 'bug_tracker_url',
               existing_type=sa.VARCHAR(),
               nullable=True)
    op.create_index(op.f('ix_bug_root_cause_id'), 'bug', ['root_cause_id'], unique=False)
    op.create_unique_constraint(None, 'bug', ['bug_tracker_url'])
    op.create_foreign_key(None, 'bug', 'root_cause', ['root_cause_id'], ['id'])
    op.drop_column('bug', 'root_cause')
    ### end Alembic commands ###


def downgrade():  
    ### commands auto generated by Alembic - please adjust! ###
    op.add_column('bug', sa.Column('root_cause', sa.VARCHAR(), autoincrement=False, nullable=True))
    op.drop_constraint(None, 'bug', type_='foreignkey')
    op.drop_constraint(None, 'bug', type_='unique')
    op.drop_index(op.f('ix_bug_root_cause_id'), table_name='bug')
    op.alter_column('bug', 'bug_tracker_url',
               existing_type=sa.VARCHAR(),
               nullable=False)
    op.drop_column('bug', 'root_cause_id')
    op.drop_table('root_cause')
    ### end Alembic commands ###

Note the down_revision, which is the baseline. To bring the database itself up to date let's upgrade to the new head.

(G)/alembic-tour-de-force > alembic upgrade head
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.  
INFO  [alembic.runtime.migration] Will assume transactional DDL.  
INFO  [alembic.runtime.migration] Running upgrade bc25b5f8939f -> 6c55e0560fd4, add root_cause table  

Checking the database itself the schema has been changed indeed and there is a new root_cause table: http://i.imgur.com/0Zqn28P.png.

Limitations of Auto-generation

Auto-generation is magical but it has some limits you need to be aware of. First of all if you rename a table or a column alembic will consider it a removal and addition of a new table/column. You will lose all the data. You should manually adjust the aut-generated script and replace the drop/create with an explicit rename operation that will preserve the data.

You should name all your constraints. Alembic can't handle anonymously named constraints.

There are some unimplemented features that auto-generation will detect and handle in the future:

SQL Generation

You may not be comfortable directly running a migration script via alembic upgrade or you may not be sure what the impact is going to be on your database. Alembic has for this purpose special offline mode where it generate a SQL file with all the SQL statements that represent the upgrade script instead of executing them. Then, you can observe the SQL file, make sure it doesn't do anything fishy and also perform the migration using the SQL file.

The command is alembic upgrade --sql. You need to provide also a start and end revision. Let's see the SQL that generated our most recent upgrade.

(G)/alembic-tour-de-force > alembic upgrade --sql bc25b5f8939f:6c55e0560fd4
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.  
INFO  [alembic.runtime.migration] Generating static SQL  
INFO  [alembic.runtime.migration] Will assume transactional DDL.  
BEGIN;

INFO  [alembic.runtime.migration] Running upgrade bc25b5f8939f -> 6c55e0560fd4, add root_cause table  
-- Running upgrade bc25b5f8939f -> 6c55e0560fd4

CREATE TABLE root_cause (  
    id SERIAL NOT NULL,
    name VARCHAR,
    PRIMARY KEY (id)
);

ALTER TABLE bug ADD COLUMN root_cause_id INTEGER NOT NULL;

ALTER TABLE bug ALTER COLUMN bug_tracker_url DROP NOT NULL;

CREATE INDEX ix_bug_root_cause_id ON bug (root_cause_id);

ALTER TABLE bug ADD UNIQUE (bug_tracker_url);

ALTER TABLE bug ADD FOREIGN KEY(root_cause_id) REFERENCES root_cause (id);

ALTER TABLE bug DROP COLUMN root_cause;

UPDATE alembic_version SET version_num='6c55e0560fd4' WHERE alembic_version.version_num = 'bc25b5f8939f';

COMMIT;  

Multi-Stage Upgrade

Naive schema changes may drop data. For example, if you look at the SQL generated above you'll note that the root_cause column was dropped from the bug table. In order to preserve the data you'll have to perform a multi-stage migration. Here's how it can work for the root_cause case. First step just add the root_cause table and the root_cause_id foreign key, but DON'T remove the root_cause column just yet. Perform the migration and the code will still read the root cause information from the root_cause column in the bug table. Then, the information in the root_cause column of the bug table will be inserted into the new root_cause table. The code will be changed to read from the "root_cause" table. At this point the root_cause column is not used anymore and can be dropped in a second migration.

This approach may be necessary even if there is no concern about data loss. For example, in large distributed systems it is difficult if not impossible to perform synchronized changes to the DB schema and the code on all the servers. In these case every breaking change, must be done in a multi-stage approach (unless the system is designed to handle temporary mismatches and can recover gracefully later).

Conclusion

Database schema migrations are an important aspect of large-scale systems. They must be handled properly in order to prevent data loss and system availability. Alembic is an excellent solution for SQLAlchemy-based systems. It provides a methodical approach and supports auto-generation of migration scripts.

Gigi Sayfan is the director of software infrastructure at Aclima (http://aclima.io), a start-up company that designs and deploys distributed sensor networks that enable a higher level of environmental awareness. Gigi has been developing software professionally for 20 years in domains as diverse as instant messaging, morphing, chip fabrication process control, embedded multi-media application for game consoles, brain-inspired machine learning, custom browser development, web services for 3D distributed game platform and most recently IoT/sensors.
Image via Bitterherbs1

This article is licensed with CC-BY-NC-SA 4.0 by Compose.