How to perform Data Migrations aligned with regular Schema Migrations

I’m trying to find a good way to migrate data in a Phoenix application.

Phoenix has regular (Ecto) migrations, and they look a lot like Rails migrations.
My previous Rails project was big, and we often wrote data migrations (to update or delete some records).
Rule of thumb: keep regular (schema) migrations separate from data migrations.
For Rails there are several libraries, like https://github.com/ilyakatz/data-migrate,
which you can call like this:

rake db:migrate:with_data 

I found https://github.com/samsamai/ecto_immigrant, but it looks outdated and it has no options.

Does anybody know a good library for data migrations, with the ability to run a data migration after a specific schema migration?

There are different opinions on how to run data migrations: link them to a release and run them after the schema migrations, mix schema and data migrations together, or just run the data migrations after the schema migrations have run.

Which scheme to use depends on your app.


Can you explain the benefit of that? To me this sounds like it’s just waiting for some inconsistency between those two to happen.

In two years you can accumulate many migrations with outdated code that you need to fix from time to time (a module gets removed or renamed). That’s hard to do if you mix schema changes and data changes.

With a separate file for each data migration you get some control: e.g. you can run only the schema migrations on a local machine (dev and test environments), or disable a specific data migration and skip its evaluation entirely.
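
One way to get that kind of separation, assuming you keep data migrations in their own directory such as priv/repo/data_migrations (a convention, not something Ecto mandates), is to run the two sets with separate ecto.migrate invocations:

mix ecto.migrate                                              # schema migrations from priv/repo/migrations
mix ecto.migrate --migrations-path priv/repo/data_migrations  # data migrations, run only when and where you choose

Dev and test machines can then stick to the first command and never evaluate the data migrations at all.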

It’s all about control, testability, and manageability.

Usually a data migration has a short lifetime:

  1. Create it
  2. Write tests for it
  3. Release it to all environments
  4. Comment it out or remove it somehow

OK, I don’t have such problems, because I migrate data using plain SQL; therefore I can keep those migrations around without problems.
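
For context, that could be an ordinary Ecto migration whose body is nothing but SQL; the table and column names below are made up for illustration:

defmodule MyApp.Repo.Migrations.MarkOldNotificationsRead do
  use Ecto.Migration

  def up do
    # plain SQL, so this keeps working even if the surrounding Elixir code is renamed or deleted
    execute("UPDATE notifications SET state = 'read' WHERE read_at IS NOT NULL")
  end

  def down do
    # nothing meaningful to undo for a backfill like this
    :ok
  end
end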


I’d highly encourage writing these kinds of transformations as pure modules and functions that wrap straightforward Ecto calls, i.e. Repo.update_all. This approach lends itself well to testing, repeatable execution, and will look familiar to developers as they’re essentially specialized context modules. Tying them to the mindset of migrations that can be “checked off” as completed and don’t need to happen again doesn’t often mesh with my experience in reality. More often I see situations where it takes several passes through the data with the same intent to fully clean it in-place before a stricter DDL design can be applied. You can still drop the modules and their tests from your codebase once the need is conclusively finished, but you won’t need any acrobatics to repeat the effort again.
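
A minimal sketch of that idea (the module, table, and column names here are invented): the transformation is just a context-style module wrapping Repo.update_all, so it can be unit-tested and re-run as many passes as the data needs:

defmodule MyApp.DataFixes.NormalizeEmails do
  @moduledoc "Repeatable data fix: lowercase user emails. Only touches rows that still need it."

  import Ecto.Query

  alias MyApp.Repo

  def run do
    from(u in "users",
      where: fragment("? <> lower(?)", u.email, u.email),
      update: [set: [email: fragment("lower(?)", u.email)]]
    )
    |> Repo.update_all([])
  end
end

Once the data is conclusively clean you can delete the module and its tests like any other dead code, without touching the migration history.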

That’s also why I generally recommend that traditional Ecto migrations only contain DDL statements and no updates or insertions if at all avoidable.


I even skip anything involving MyApp.Repo, because it would mean :my_app needs to be started to be able to run the migrations. If I really don’t know the SQL, I use MyApp.Repo.to_sql/2 and copy the result into the migration, where it’s run using execute(sql) or execute(up, down).
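
A sketch of that workflow (the query, table, and migration name are invented for illustration): generate the SQL once with to_sql/2 in IEx, then paste the result into the migration so that nothing from :my_app has to be running when it executes:

defmodule MyApp.Repo.Migrations.BackfillNotificationState do
  use Ecto.Migration

  # SQL obtained once in IEx with something like:
  #   MyApp.Repo.to_sql(:update_all,
  #     from(n in "notifications", where: is_nil(n.state), update: [set: [state: "new"]]))
  # and then pasted below.
  def change do
    execute(
      "UPDATE notifications SET state = 'new' WHERE state IS NULL",
      "UPDATE notifications SET state = NULL WHERE state = 'new'"
    )
  end
end

execute/2 takes both the up and the down command, so the migration stays reversible even though the repo module itself is never referenced at migration time.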


True, but for one-off, deterministic, and simple conversions from an old column to a new column, Ecto update code inside the migration works okay even months into the future (something like the sketch below).
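
A minimal sketch of such a conversion (table and column names invented), using the repo/0 and flush/0 helpers that Ecto.Migration provides, so no application schemas are involved:

defmodule MyApp.Repo.Migrations.AddAndBackfillDisplayName do
  use Ecto.Migration

  import Ecto.Query, only: [from: 2]

  def up do
    alter table(:users) do
      add :display_name, :string
    end

    # make sure the column added above exists before writing to it
    flush()

    # schemaless update through repo(), copying the old column into the new one
    from(u in "users", update: [set: [display_name: u.name]])
    |> repo().update_all([])
  end

  def down do
    alter table(:users) do
      remove :display_name
    end
  end
end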

Agreed with everything else.

I understand your point. It’s mostly about the structure of the migrations and their testability.

On a previous Rails project we used a similar approach: a class that contains all the logic, which makes it easy to test. The data migration itself was a kind of humble object that just ran a Migrator class.

Let me show a Rails example with a data migration file:

# cat db/data/20190814115720_remove_redundant_records.rb
class RemoveRedundantRecords < ActiveRecord::Migration
  def self.up
    DataMigrationJob.perform_sync Migrators::OldNotificationsDestroyer.name, 0
  end

  def self.down
  end
end

You can test the Migrator like a regular class and run it during db:migrate.

Also, to turn a migration off or remove it, it should first have been executed everywhere; after a certain period, e.g. one year, it can be disabled or removed.
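
For comparison, a rough Elixir analogue of that shape (all module names and the data_migrations path here are hypothetical, since this thread hasn’t settled on a library): the data migration stays a humble wrapper, and all of the logic lives in a separately tested module, much like the context-module approach described above:

# priv/repo/data_migrations/20190814115720_remove_redundant_records.exs
defmodule MyApp.Repo.DataMigrations.RemoveRedundantRecords do
  use Ecto.Migration

  def up do
    # the Migrator-equivalent module holds all the logic and its own tests
    MyApp.Migrators.OldNotificationsDestroyer.run()
  end

  def down do
    :ok
  end
end

The wrapper stays trivial, so disabling it later is just a matter of commenting out the single call.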

One thing that saved my sanity and my colleagues’, both in Ruby on Rails and in Elixir/Phoenix projects, was to periodically squash all migrations into a single .sql file containing the current schema and then just “restart” migrations (i.e. have zero of them after the squash).

At some point you can absolutely feel that maintaining the old migration scripts, and making sure every newly onboarded dev can go through them without errors, is just not worth it.

But I still don’t recommend doing the squash / pruning before 6 months (or 50-100 migrations) have accumulated.


A really nice compromise on this idea is Ecto.Migrator.with_repo/2 which was introduced in the 3.x series. You won’t have your full OTP app running, but you can start a skeleton Ecto Repo with minimal connections for the duration of your DB interactions, which don’t have to be traditional migrations. You can readily use Repo calls or context modules, as long as the connection count is set appropriately for your needs. We use this in Distillery custom commands, seeds, etc. to avoid having the whole supervision tree start up just to do some INSERTs.
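
A sketch of that pattern (the module and the data-fix call are assumptions for illustration, not Distillery’s API): load the app’s configuration, start only the repo with a couple of connections, do the work, and let everything stop again:

defmodule MyApp.ReleaseTasks do
  @app :my_app

  # Runs a data fix without booting the whole supervision tree:
  # with_repo/3 starts MyApp.Repo (and its dependencies) just for the duration of the function.
  def fix_data do
    Application.load(@app)

    {:ok, _result, _started_apps} =
      Ecto.Migrator.with_repo(
        MyApp.Repo,
        fn _repo -> MyApp.DataFixes.NormalizeEmails.run() end,
        pool_size: 2
      )

    :ok
  end
end

You can then call MyApp.ReleaseTasks.fix_data/0 from a Distillery custom command, a seeds script, or any other entry point without starting the full application.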

Something fairly similar is built in as mix ecto.dump and mix ecto.load, and doesn’t require you to fully discard the previous migration code. Instead it just loads a structure snapshot and then marks previous migrations up to that point as applied, and they can still be replayed from scratch using the normal ecto.migrate to ensure continuity for environments that aren’t able to use this approach, like your production database. I don’t personally endorse fully squashing or discarding the historical content when you can get all of the same speed improvements using built-in functionality and still have a great escape hatch when needed.
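
For reference, the two tasks in question, assuming the default priv/repo/structure.sql location:

mix ecto.dump   # dump the current structure (including the schema_migrations versions) to priv/repo/structure.sql
mix ecto.load   # load that snapshot into an empty database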


This is part of the disconnect, perhaps, because I’m advocating for treating data-munging tasks as operationally and conceptually separate from schema evolution, and considering that a virtue rather than an inconvenience.


In a big team you can get into a weird situation where somebody else changes the same data after you, and your data migration has to run exactly after a specific schema migration to avoid clashes.

Some applications also keep data outside the database, e.g. in other microservices, Redis, or third-party APIs.

That sounds like a process and communication problem being solved with a technical solution.

I can only say in so many ways that I think your mindset is what is creating these inflexible constraints, and that they are not intrinsic to the problem itself. Treating data transforms as a migration is only one of several possible solutions. I’m going to drop off the thread now; good luck!

I don’t disagree, and I value historical records a bit too much myself in all my work (tables containing the history of changes made in other tables are one guilty pleasure of mine).

But I have always worked in fairly small Elixir teams (the maximum was 4 people, and that was only once), and we had to optimize for minimum confusion and quick onboarding of future members, so we opted for a single consolidated schema (through the mix tasks you pointed at) and a minimal set of migration scripts.

This is a common problem for legacy systems.