Manual data migrations with Ecto

stefanchrobot · February 10, 2021, 10:04am

Hi, I need a way to handle data migrations in my application. I found an article by @wojtekmach about manual migrations: Automatic and manual Ecto migrations - Dashbit Blog. That was very informative.

What the blogpost suggests is to have a new kind of migrations: manual migrations. But they still use the same rules as the usual (automatic) migrations: data is changed via SQL.

The challenge I’m facing right now is that I need to do a bit more involved data migrations. Schema-less queries is an option, but I’m not sure it will cover all the use cases. Some of the data in the database is encrypted at rest and can only be read/written by the application (it can be copied though). The other thing is that some of those migrations are more like backfills and they require the application logic which means they need to rely on the structs which has already bitten us and caused a migration to fail.

Any thoughts on how to approach this? I’ve started going the way of adding a new release task that runs the data migrations, but then I loose the benefit of the Ecto migrations keeping track of what has already been run. Writing idempotent data migrations is a solution but there’s no easy way to enforce that and it sounds like a decent potential for really bad errors.

My other concern about migrations is - why are they done as scripts and are not compiled into the application? The concern is coming from my thought of using structs in the migrations: the script may become broken over time (this has already happened at least once).

LostKobrakai · February 10, 2021, 10:18am

Migrations are meant to be sources of sql operations kept around for reproducability (change multiple systems concurrently). The way of making migration be that is by having immutable migrations. Immutability means you may not have mutable dependencies in their code. Using schemas is usually exactly that: a mutable external dependency.

If you cannot guarantee immutability you cannot guarantee reproducability.

What I can’t savely tell from your initial post is if you actually need reproducability between migrations?

Why would they need to be compiled into the app? They’re only needed on demand when doing an actual migration.

stefanchrobot · February 10, 2021, 10:38am

Is it really the case? I’ve never seen migrations that would be run on multiple databases. And I don’t mean concurrently, but at all.

So I’m thinking about migrations more like diffs. If my DB is at state S1, the migrations would apply “patches” that will take it through - say - state S2 (migration M2), state S3 (migration M3) and land in state S4 (migration M4). If be reproducibility you mean that M2, M3, M4 will take me from S1 to S4 every time, then yes - that’s a feature that I’m after.

Let’s say:

M2 is: add a column with default NULL,
M3 is: backfill the column,
M4 is: make column not-NULL

What if M3 cannot be done via SQL updates? Then the migrations are no longer “reproducible”. How would you address that?

They’re compiled on demand which is postponing discovery of the bugs to a later stage. If they were compiled into the app, they would be checked by the compiler at least when the release is assembled. Otherwise you’re shipping code that will potentially fail to compile on prod. I just don’t see the advantage of them being scripts.

LostKobrakai · February 10, 2021, 11:02am

Your dev, test and prod databases are already multiple dbs however people usually do not think of them like that. And then you have 2 devs and it’s even more. Those are multiple databases with different data needing to evolve concurrently at best without breaking each other, though it’s not so important to not break another devs db.

That’s what I mean.

Why are they not used in your tests? Imo this is a problem of your test harness not of them being scripts.

To the actual meat of this thread: Can you elaborate why M3 cannot be done via SQL updates? In some form or another it needs to boil down to sql in the end, because that’s the only thing your database can deal with. So I guess your problem is not the sql, but the immutability guarantees. You also seem to have code, which can handle whatever the migrations need to handle. So my wild suggestion would be: Make all that code, which supports your migrations operation immutable code as well. If someone needs to apply changes append a new migration adding only that change instead of altering existing code.

Basically mentally separate application code – which usually is only aware of how things work today – maybe being able to support one/few earlier versions of the db for ease of deployment – from code dealing with migrations, which needs to be aware how things worked a year ago to be able to reproduce it.

As soon as migrations are applied on all of your databases you can look into “moving the baseline”. Remove old migrations and replace them with a schema dump (+ maybe data seeds) of their result. This means you can no longer migrate from e.g. an empty db, but you can only migrate of of this set baseline.

What you also can do is consolidate migrations if you can be sure non of your databases is in one of the intermediate states.

For those last two points you’d need to evaluate if those are things you can do based on your usecase. They help in not infinitely growing the code for migrations.

stefanchrobot · February 10, 2021, 11:34am

That’s a very good point. One of the issues popped up because the dev and prod databases where being modified in different ways: it would work one applying migrations one by one, but not all of them together.

Valid point. But to catch that particular error case I think we’d need to introduce an integration branch and put a CI on that.

It needs to work with the data that’s encrypted in the database, or
Uses complex logic which should not be recreated in the migration script (that would be too error prone).

Thanks for the suggestions. I’ll go down the path of separating the data migration code from schema migrations. One thing it’s going to miss is the “run once” guarantee, but I’ll make the migrations idempotent which should balance the equation.

LostKobrakai · February 10, 2021, 11:36am

Maybe you could wrap your own setup around running the migrations.

stefanchrobot · February 10, 2021, 11:46am

OK, I’m having second thoughts about this. I think I’ll integrate the process into the one described by the article.

al2o3cr · February 10, 2021, 2:57pm

IMO the best way to handle these kinds of one-offs is to just write a script; anything much more complicated than DB-manipulation statements should have tests written for it, so migrations are at best a thin wrapper.

It does mean you need to be careful about idempotency, but if you’re writing a complex manipulation in a migration you’d need to be just as careful writing the down function.

Once you’ve done the backfill and are happy with the results, you avoid long-term maintainability hassles by deleting the backfiller - it’s still in Git history if anyone really needs to know, and then it isn’t laying around like an unexploded munition waiting for some unfortunate to find it.