Plan to create a seeder library

dgigafox · March 26, 2022, 4:21pm

Hi guys. Would like to ask for some help cause I am new to creating libraries.
So I am currently creating an elixir library for seeding that would solve this problem:

Seeding a database is simple using seeds.exs but fairly messy and complicated when the app is huge and has a bunch of tables. For example, the project that I am working on currently is taking like 30 to 40 minutes to seed which does not count DB timeouts which you need to re-run again if it happens.

Now the interface of my library is simple. Seeders have 2 callbacks:

@callback deps() :: [atom()]
@callback run(Context.t()) :: struct() | [struct()]

deps are list of seeders that should be run before the current seeder while run returns a struct/structs. It will look something like this

defmodule UserSeed do
  use ThisLibrary, key: :user

  @impl true
  def deps, do: []

  @impl true
  def run(%ThisLibrary.Context{}) do
    {:ok, users} = ...some_insert_operation()
    users
  end
end

After this seed runs it will be put to Context.seeds with the key which can be pattern matched by other seeders next in line. Something like:

defmodule PostSeed do
  use ThisLibrary, key: :post

  @impl true
  def deps, do: [UserSeed]

  @impl true
  def run(%ThisLibrary.Context{seeds: %{users: users}}) do
    {:ok, posts} = ...some_insert_operation(users)
    posts
  end
end

Now, I would like my implementation to be using GenServers. Each seeder library is a GenServer periodically checking if all deps already ran. If all deps run then that’s the time it will call run/0 and do the DB operation. The library runs a dynamic supervisor that starts each seeder as a child. So in case of error it can save the state in whatever format (maybe database, json file, etc…), re-running the seeder library will not start from the beginning but will just run from where it stopped. Also would like to run seeders asynchronously as long as its deps are complete for faster seeding.

Question is. Is this a good implementation? Am I overcomplicating things, or am I missing something?

al2o3cr · March 26, 2022, 8:59pm

Roughly how many rows are involved with this seed? 30-40 minutes seems like it should be tens of millions, which seems excessive.

Hitting DB timeouts suggests two things could be happening:

inserting too many rows in a single transaction
if there aren’t many rows involved, inserting things in ways that cause lock contention (unique indexes, etc)

It’s also worth investigating the database you’re seeding into - is it hitting a performance bottleneck? If so, dividing work up among GenServers is just going to mean more processes waiting for an overloaded DB.

dgigafox · March 27, 2022, 6:35am

@al2o3cr Thanks for the response!

Not actually 10s of millions but just estimate 40,000 rows from all tables. You are right about two things:

There are transactions that are taking too long thus result to DB timeout
If insertions will be through GenServer processes and DB is the bottleneck, the process will just wait for DB to open a connection

We need to update our seed from time to time when new features are introduced and sometimes when we update a function in the seed it can result to timeout which occurs intermittently. When we (especially QA) want a fresh seed to our database, and it times out, we need to fix the seed and re-run it from the beginning.

So I think it will be better to just run the seeds sequentially to avoid overloading the database? Also it will be easier to track where it stopped and where to fix

al2o3cr · March 28, 2022, 9:47pm

30-40 minutes to insert 40k rows seems remarkably slow; I’d recommend you start by figuring out where all that time is going. For instance, changing up your seeding workflow won’t help much if (for instance) you’re using bcrypt_elixir and inserting a lot of users with log_rounds set too high.

If you need updatable seeds, a package like phil_columns could be what you’re looking for.