VersionAlchemy: Tracking row changes with SQLAlchemy

July 14th 2016

Our Problem

Over at https://www.nerdwallet.com/rewards-credit-cards we have a tool to help you choose the best credit card. A credit card has features like an annual fee, signup bonus, rewards points, interest rate and so on, so you can imagine a credit cards table that looks like:

+----+---------------------+------------+--------------+----------------+---------------+
| Id |        Name         | Annual Fee | Signup Bonus | Rewards Points | Interest Rate |
+----+---------------------+------------+--------------+----------------+---------------+
|  1 | The Karl Marx Card  | $0         | $0           | No rewards     | 0%            |
|  2 | The Adam Smith Card | $100       | $100         | 1% cash back   | 25%           |
+----+---------------------+------------+--------------+----------------+---------------+

Let’s say the Adam Smith Card increases its signup bonus to $200, but increases the annual fee to $200.

How do we track changes to our data over time (change capture), and how do we keep a record of who changed what?

Introducing VersionAlchemy

VersionAlchemy stores a copy of each row on every insert, update or delete. To make use of the versioned data, VersionAlchemy includes a historical API to support single and bulk row history queries between start_time and end_time. Using the historical API, the state of a table can be retrieved at any point in time.

VersionAlchemy also provides audit trails for free - who changed the row, and when they changed it.

VersionAlchemy and SQLAlchemy

To use VersionAlchemy you need to use the SQLAlchemy ORM. To start versioning a table, create a history table to store row changes for the original table:

+-----------------------+-------------+--------------------------------------+-------------------------------------------------------------------+
|      Column Name      | Column Type |                Index                 |                            Description                            |
+-----------------------+-------------+--------------------------------------+-------------------------------------------------------------------+
| id                    | INTEGER     | primary key                          | Unique row identifier                                             |
| (original table's PK) | ??          | UNIQUE(original table's PK, version) |                                                                   |
| version               | INTEGER     | see above                            | An auto-incrementing version keeping track of changes to row data |
| deleted               | BOOLEAN     |                                      | Soft delete flag                                                  |
| timestamp             | DATETIME    |                                      |                                                                   |
| data                  | TEXT        |                                      | JSON string representation of row in target table                 |
| user_id               | UUID        |                                      | User that performed the action. Used for keeping an audit trail.  |
+-----------------------+-------------+--------------------------------------+-------------------------------------------------------------------+

Creating the history table is a manual step right now, but we plan to make this automatic in a future version of VersionAlchemy.

VersionAlchemy assumes the primary key won’t change. This lets us easily look up row history:

SELECT *
FROM credit_cards_history
WHERE credit_card_id = 1
ORDER BY version DESC

To keep things simple, the history table stores a copy of the entire row for every version of a row rather than just the changes. We chose to encode the row as a JSON string so that the schema of the original table can be changed, as long as it isn’t the primary key, without requiring a schema change in the history table.

Deletes are recorded as soft deletes with a flag indicating the row was deleted from the original table. This means we can reconstruct the state of the original table from the history table at any point in time.

To use VersionAlchemy, we create a model for the history table that inherits from the LogTableMixin class and a model for the original table that inherits from the LogModelMixin class:

class CreditCardHistory(Base, LogTableMixin):
    ...

class CreditCard(Base, LogModelMixin):
    ...

Then, we attach the VersionAlchemy listeners when your app starts up:

@app.before_first_request
def initialize():
    CreditCard.register(
        ArchiveTable=CreditCardHistory,
        Session=db.Session,
        engine=db.engine,
        # Primary key columns
        'credit_card_id',
        'tag'
    )

How VersionAlchemy works

VersionAlchemy (VA) versions changes to rows in a SQL database by hooking into SQLAlchemy. It works by listening for insert, update and delete events in SQLAlchemy ORM and writing a copy of row changes to a history table before writing to the original table.

The easiest way to understand what VersionAlchemy does is to look at what it does when you do an insert, update and delete on a versioned table. We’ll use SQL to describe roughly how each operation is versioned.

Insert

Insert a new row in the history table with version = 0

INSERT INTO history_table (primary_key, data, version, timestamp)
VALUES(primary_key, JSON(row), 0, NOW())

Update

Insert a new row in the history table with version = version + 1

new_version = SELECT MAX(version) + 1 FROM archive_table WHERE candidate_key = {candidate_key}

INSERT INTO history_table (candidate_key, data, version, timestamp)
VALUES(candidate_key, JSON(row), new_version, NOW())

Delete

Insert a new row in the archive table with version = version + 1 and deleted = True

new_version = SELECT MAX(version) + 1 FROM archive_table WHERE primary_key = {primary_key}

INSERT INTO history_table (primary_key, data, version, timestamp, deleted)
VALUES(primary_key, JSON(row), new_version, NOW(), True)

Special Cases

Insert (previously deleted row)

If primary_key exists in history_table:

new_version = SELECT MAX(version) + 1 FROM history_table WHERE primary_key = {old_primary_key}

INSERT INTO history_table (primary_key, data, version, timestamp)
VALUES(old_primary_key, JSON(row), new_version, NOW())
Update (primary key changes)
  1. Insert a new row in the history table with version = version + 1 and deleted = True with the primary key value before the update

  2. Insert a new row in the history table with version = 0 with the primary key value after the update

new_version = SELECT MAX(version) + 1 FROM history_table WHERE primary_key = {old_primary_key}

INSERT INTO history_table (primary_key, data, version, timestamp, deleted)
VALUES(old_primary_key, JSON(row), new_version, NOW(), True)

INSERT INTO history_table (primary_key, data, version, timestamp)
VALUES(primary_key, JSON(row), 0, NOW())

Plans for the future

We want to open-source VersionAlchemy as soon as possible, but there are a few things we want to do first:

  • Automatically generate history tables

  • Hard deletes of history for redacting sensitive information