Seeds as migrations?

ecto
migrations
seed

#1

While working through a Docker build/deployment, it hit me that seeds are weird. Apart from requiring some extra setup (e.g. some variant of https://hexdocs.pm/distillery/guides/running_migrations.html), you can wind up in trouble if you accidentally run the seeds when you didn’t expect.

In some cases, the problems with seeds could be solved if they were treated as a migration – I even saw some Ruby Gems that do this, so other people have thought of this.

What are the thoughts about putting seeds into a migration? I don’t think there needs to be any consideration to rolling back a migration like that… the table structure or the table itself would just get altered or deleted.


Why do Ecto migrations not use raw SQL and instead rely on Ecto Schemas?
#2

Do you have the same seeds for test/dev/prod ?


#3

Good question – I usually have a basic set of seeds that I need for prod/dev. For other non-prod environments I can add regular seeds in addition to the migrations.


#4

Personally, I also have to fight this attitude/approach at work and it’s a hill I’m willing to die on: this is an abuse of the migrations system. Just use a seeds file, or preferably a more robust vanilla-Elixir module that calls either raw Ecto or Context functions. Expend the modicum of effort to make sure they’re idempotent, and they’re just as safe to run and rerun as mix ecto.migrate can be.

Table/column/index definitions aren’t the same thing as prepopulated data and they shouldn’t be managed the same way. Data insertions in migrations should be minimal, and any data updates done in a migration should be as necessary due to column changes, and execute as quick as possible. More general data munging and sanitization are an operational task that can and should happen out of band.

The only benefit you get from using migrations for this is quick proof that they have been run in the past, and a shortcut to preventing them from running again. Proper idempotence is far more preferable.

You can’t really undo a single migration from history, without just creating an inverse migration, but you can easily clean up rows that are no longer necessary through normal means.

What happens as the data you seed evolves with the application? Do you really want a new migration every time a value changes or an attribute gets added? We should never be editing existing migration files that have been broadly applied. If you adhere to that, is there actual value being captured in all of the prior historical migrations, that aren’t already captured by your VCS data? It’s just cruft, as we often only care about the current/latest state of that info.

Eventually you’ll likely run into the scenario where not every environment gets all the same bits of prepopulated data, and the migrations system is going to be inflexible enough to make this feel pretty clunky.


#5

I agree with some of that, but it’s so much more work to get seeds working in prod. That’s high risk, whereas migrations are shepherded. Why re-invent that particular wheel? For some of the tables (like a list of ISO country codes), the data is really static and importantly, the app won’t work without it. So it’s not the same as run-of-the-mill seed data: in some cases it’s critical to the app functionality. I’m not sure it’s worth dying on that hill one way or another, there are tradeoffs in the approaches.


#6

What issues are you facing getting them to work in production? Let’s chat about that!


#7

Re deployments: I can’t figure out any other way to get my migrations and seeds to run other than having a separate duplicated image for my app in docker-compose.

version: "3"

services:
  # This container is responsible for running migrations and seeds
  myapp-migrate:
    image: myapp:latest
    command: "seed"
    links:
      - "postgres"
    environment:
      - DB=myapp_docker
      - DB_HOST=myapp-db
      - DB_USER=myapp
      - DB_PASS=myapppw
      # ... etc...
  myapp:
    image: myapp:latest
    command: "foreground"
    ports: 
      - "4000:4000/tcp"
      - "14000:14000/tcp"
      - "24000:24000/tcp"
    links:
      - "postgres"
    depends_on:
      - "myapp-migrate"
    environment:
    - DB=myapp_docker
    - DB_HOST=myapp-db
    - DB_USER=myapp
    - DB_PASS=myapppw

  postgres:
    image: postgres:latest
    volumes:
      - ./docker-postgres-init:/docker-entrypoint-initdb.d
    ports:
      - "55432:5432"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres

It just feels really smelly. And having to construct a separate app (e.g. as outlined by https://hexdocs.pm/distillery/guides/running_migrations.html) also feels like it needs improvement.

Ideally, I would love to have a simpler setup for deployments:

  • 2 containers only (one for my Elixir umbrella app, one for postgres)
  • Use mix cmd mix ecto.migrate on startup
  • Use a custom mix alias where I can manually list all the seed files that the root umbrella app should run, e.g.
defp aliases do
    [
      # Enumerate your migrations here
      "seed": [
        "run apps/app_one/priv/repo/seeds.exs",
        "run apps/app_two/priv/repo/seeds.exs",
        # ... etc ...
      ],
      # ... 
    ]
end

I deal with docker-compose maybe once or twice per year, so my understanding is pretty fragile… I usually figure out how to deploy the app I’m working on and then forget about everything. I just know it was a lot simpler in the previous apps I built (PHP apps). Any feedback on this would be great!


#8

Are you using Docker-Compose for your production deployments?

I’d probably use docker-compose run --rm myapp migrate and docker-compose run --rm myapp seed after setting up the ReleaseTasks module per the document you linked, plus the two migrate and seed Distillery commands. Obviously, this doesn’t automate the execution of the migrations/seeds as you bring up the application if you’re only typing docker-compose up [-d], but it’s a simple, documentable step that doesn’t leave you with the second container lingering.

If you are running docker-compose on your production deployment targets, hopefully you’re wrapping it with a Systemd unit or something, where you could add an ExecStartPost or similar hook.

I keep a skeleton repo to be a companion to blog posts and other open demonstrations, which has these implemented if you’d like to see a concrete example:


#9

Cool, that’s helpful. Thank you for the references!


#10

Thinking about this more… so the Dockerfile’s default CMD is “foreground” (from Distillery). So in order to get things started up and seeded, I think I can do one of the following:

  1. If the docker-compose.yml specifies that our Elixir service depends_on the database, then we can run it this way:
docker-compose run --rm myapp migrate
docker-compose run --rm myapp seed
docker-compose run up

That works but it always seems to kick off a few errors when the migrations are started:

05:14:53.970 [error] Postgrex.Protocol (#PID<0.146.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (postgres:5432): connection refused - :econnrefused
05:14:53.970 [error] Postgrex.Protocol (#PID<0.145.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (postgres:5432): connection refused - :econnrefused

and a few errors when the seeds are finished:

** (exit) exited in: GenServer.call(Mix.State, {:get, {Map, :get, [:env, :dev]}}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir) lib/gen_server.ex:979: GenServer.call/3
    (stdlib) erl_eval.erl:680: :erl_eval.do_apply/6
    (stdlib) erl_eval.erl:273: :erl_eval.expr/5
    (stdlib) eval_bits.erl:88: :eval_bits.eval_field/3
    (stdlib) eval_bits.erl:68: :eval_bits.expr_grp/4
    (stdlib) erl_eval.erl:484: :erl_eval.expr/5
    (stdlib) erl_eval.erl:232: :erl_eval.expr/5
    (stdlib) erl_eval.erl:233: :erl_eval.expr/5

OR, if there is no depends_on stipulation:

  1. In one tab, get the app running:
docker-compose run up

and in another, fire up the temporary images + commands:

docker-compose run --rm myapp migrate
docker-compose run --rm myapp seed

Which also works, but it gets the error upon conclusion of running the seeds.

And thirdly… the more I think about it, the more I think that there ARE cases where seeds should be included as migrations. I have a handful of behaviours that rely on dynamic dispatch, and the value that triggers the dispatch ultimately comes from the database. It’s not quite as on the nose as having a module name in the database (or some variant of polymorphism?), but we have situations where stuff like an order includes a unique string value representing say a vendor ID. And that vendor ID needs to trigger specific code to execute. So when we onboard a new vendor, we will have a new module (i.e. a new implementation of the order handling behaviour) and we’ll have a new row in the vendors table. Those two things need to go out together. I realize that may be an edge case, but it really feels like a migration is the proper way to handle that particular seed.

Sorry – 3 things in that post, but I’d love to hear your thoughts.


#11

Populating your vendors table sounds like it will work fine with a seeds file. Just create a new entry if the vendor_id doesn’t exist.


#12

We can’t simply use a seeds file because don’t know all the vendors beforehand. We might onboard new ones every couple weeks. So our options are:

Option One:

  1. Push out new code (a new module dedicated to handling the vendor-specific logic).
  2. Manually log in to our admin portal and create a new vendor record with the proper slug ID (so that the dynamic dispatch can resolve correctly to the new vendor module) OR have shell access to the server to run a prepared seed file.

Pros: no migration required
Cons: manual, prone to human errors (misspellings etc) OR it requires shell access to server.

Option Two:

  1. Push out new code (a new module dedicated to handling the vendor-specific logic).
  2. Include a migration that adds the proper record to the vendors table.
    Pros: automated, exact.
    Cons: ???

Keep in mind that in this simplified example, the “vendors” table is very narrow: it’s almost just a key/value store.


Enums

Another use case where “seed migrations” make sense: we have stopped using enums in our database columns because they don’t play nice with migrations. Some divisions “solve” this problem by letting the application restrict the column values. That causes headaches for BI because when they are doing any reporting on those databases, they have to crack open the application code to know what the allowable values are, e.g. status_id 1 means “active” and status_id 13 means “cancelled” etc. And god forbid you accidentally let a bad value in. It’s very opaque, there are no constraints inside the database, and you experience something akin to “vendor lock” with your application code so that even if you wanted to replace the application that sits on top of that database, you’d be stuck coding in the same ad-hoc ersatz enums somewhere inside the application. Boo.

So instead of using enums in the database, we create a dedicated table, e.g. “statuses” and then reference it via a foreign key (I know, it’s hardly revolutionary). It’s more work to set up, but it makes things more transparent for BI/reporting and it plays nice with migrations.

And that seems like a really good use-case for using a migration instead of a seed: the application code has specific logic built around those statuses. The application falls apart if you don’t have specific values in your database. And importantly, when you need to add a new status along with new behavior, a migration seems like a really clean way to introduce that change. Importantly if the enum columns played nice with migrations, then you’d modify the allowed values via a migration anyhow.

If you think of migrations as strictly defining database structure, that solution might annoy you, but if you think of migrations as managed database changes, it seems like a pretty viable solution.


#13

A package that implements the “seeds as migrations” idea is phil_columns. I haven’t found a problem that needed it yet, but worth checking out.


#14

What I’m advocating for (in this specific “vendors” table case) is Option 3:

  1. Push out new code
  2. Include an update to the seeds file that adds the new vendor
    Pros: automated, exact
    Note: seeds file runs on every deploy that changes it in the same way that you run migrations on every deploy that change/add any

For enums, we use Postgres enums at work and I’m totally in agreement about adding new values to the enums in a migration.

Also I want to be clear that (unlike shane) I’m not advocating to never “update data” in a migration, just to prefer using seeds instead of migrations for data.


#15

In the Option 3 scenario, wouldn’t your seeds file need to check if the records already exist? Or would you need to update the seeds file so it only contained the new records?


#16

Yes exactly that’s what shane and I meant about making the seeds file idempotent. For example by checking if the specific vendor_id already exists in the database.


#17

What about making your own GenServer that pre-populates all currently known vendors in a memory cache? And include that in your startup supervision tree so those records are available after the app init finishes? Why bother with migrations or seeds in the first place?

You didn’t make it sound like anybody else except the programmers should be able to change them. Apologies if I misunderstood.


#18

If I fully groked GenServers, that might be an option (but to-date, I have not fully grasped what GenServers are and do. Even though I have read the documentation and done the tutorials and made them work, I still have never once woke up in the morning and said to myself “You know what would solve this problem? A GenServer!” So, hold that thought… but if you have an example of how to warm up cache on startup using a GenServer, I would eat that up).

There is a more practical reason why I do not want to store these items in memory (and yes, you are right, in many cases, they do not change much):

  1. We need transparency for BI. It’s fun if your application code juggles memory storage, but reporting goes a lot easier if you have your data grounded transparently in a database where all the analysts can see it.
  2. Storing stuff in application memory is kind of like vendor lock: you become over-dependent on your implementation. It might be fast and work well etc. but if you wanted some other service or language to handle those calls, you can’t easily make the transition when the data lives inside memory. Storing data in a database is a much more flexible option in this regard.