Is this `seeds.exs` idiomatic Elixir?

One of the requirements of the Book Club exercise on elixirland.dev is to seed data.

I am referring to this requirement in the exercise.

## Requirements

...

  ### **Seeding**
  * Running `mix ecto.setup` not only creates the database tables but also seeds the database
  * Seeding inserts 4,000 books that each have 10 pages
  * Some seeded books have an active page, but not all
  * Seeding is fast

How happy or unhappy would you be if a co-worker created a pull request containing this `seeds.exs`? Is it written in idiomatic Elixir? What about the use of comments, etc.?

import Ecto.Query, only: [from: 2]
alias BookClub.Repo
alias BookClub.Books.{Book, Page}

# Setup
# Sets log level to :info for performance
initial_log_level = Logger.level()
Logger.configure(level: :info)
IO.puts("Start database seeding")
start_time = System.os_time(:millisecond)

# Constants
n_books = 4000
n_pages_per_book = 10
inserted_at = NaiveDateTime.utc_now(:second)
batch_size = 800

# Insert 4,000 books
books =
  for _ <- 1..n_books do
    %{
      title: XlFaker.generate_title(),
      inserted_at: inserted_at,
      updated_at: inserted_at
    }
  end

books
|> Enum.chunk_every(batch_size)
|> Enum.each(&Repo.insert_all(Book, &1))

# Batch insert 10 pages per book
# Gives some books an active page
book_ids =
  from(b in Book, select: b.id)
  |> Repo.all()

pages =
  book_ids
  |> Enum.flat_map(fn book_id ->
    pages =
      for i <- 1..n_pages_per_book do
        %{
          book_id: book_id,
          content: XlFaker.generate_page(),
          number: i,
          status: :inactive,
          inserted_at: inserted_at,
          updated_at: inserted_at
        }
      end

    List.update_at(
      pages,
      :rand.uniform(n_pages_per_book) - 1,
      &Map.put(&1, :status, Enum.random([:active, :inactive]))
    )
  end)

pages
|> Enum.chunk_every(batch_size)
|> Enum.each(&Repo.insert_all(Page, &1))

# Teardown
end_time = System.os_time(:millisecond)
IO.puts("Finish database seeding")
IO.puts("Seeded #{n_books} books in #{end_time - start_time}ms")
Logger.configure(level: initial_log_level)

I don’t think the log level should be overwritten. Either the system is configured to the info level and info is logged, or it isn’t. I’d also consider changing the IO.puts calls to use Logger.

The book IDs could be returned from the Repo.insert_all calls with the :returning option plus a flat_map.
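
A minimal sketch of that suggestion, reusing the `books`, `batch_size`, and `Repo` names from the script above (the `:returning` option assumes a database that supports `RETURNING`, such as PostgreSQL):

```elixir
# Collect book IDs directly from the inserts instead of querying afterwards.
book_ids =
  books
  |> Enum.chunk_every(batch_size)
  |> Enum.flat_map(fn batch ->
    {_count, rows} = Repo.insert_all(Book, batch, returning: [:id])
    Enum.map(rows, & &1.id)
  end)
```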

And as with all code generating data: This “works”, but it doesn’t really do much in terms of making sure it continues to work over time and in the face of changes to the system it seeds. There’s a reason tests commonly use factories instead of code like this. Though this is one piece of code, tests are usually quite many, so it’s less of a problem.

  1. Seeds should IMO be idempotent, which means you should have a corresponding Repo.delete or Repo.delete_all before the inserts. This of course assumes that you can in fact identify the data that must be seeded and remove it at all, which in many cases is not possible, i.e. I worked in places where certain tenants / companies / users had to always exist and some even had hardcoded IDs. You can still achieve idempotency by simply not deleting and not inserting anything if you detect that these special records that must always exist are already there.

  2. I would not rely on the current datetime; I’d hardcode now as a fixed datetime in the past. To me determinism trumps almost all other concerns. However that too is not a hard rule because there are businesses where certain records are only valid if they’ve been created no more than f.ex. 6 months ago. So use your best judgement but still hardcode as much as you can if you can get away with it. Again – determinism.
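
A hedged sketch of both points, using a hypothetical `User` schema and email address purely for illustration: insert only if the special record is missing, with a hardcoded timestamp for determinism.

```elixir
import Ecto.Query, only: [from: 2]

# Hypothetical schema and field names; the pattern is what matters.
seeded_at = ~N[2020-01-01 00:00:00]

unless Repo.exists?(from u in User, where: u.email == "admin@example.com") do
  Repo.insert!(%User{
    email: "admin@example.com",
    inserted_at: seeded_at,
    updated_at: seeded_at
  })
end
```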

Disagreed, seeds are practically their own universe and the only thing they share with the main app is the storage (in most cases a relational DB). It’s quite OK to override various runtime properties like logging in there. I even stuffed OTel tracing in one project’s seeds because they were taking mysteriously long (spoiler alert: one table was too big and the seeds scanned for records + did various levels of locking; we fixed it after).


+1 here, the only thing I would do differently is to use upserts instead of delete/insert. Which I think is what you meant by:

You can still achieve idempotency by simply not deleting and not inserting anything if you detect that these special records that must always exist are already there.

Pretty much agree with the rest as well :+1:
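
A hedged sketch of the upsert approach with Repo.insert_all's `:on_conflict` option, assuming a unique index on `:title` (an assumption, not part of the original exercise). Existing rows are updated in place instead of being deleted and re-inserted.

```elixir
Repo.insert_all(Book, rows,
  on_conflict: {:replace, [:updated_at]},
  conflict_target: [:title]
)
```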


Yep. I didn’t say it clearly but that’s what I had in mind.


I’d be okay if this script is just a temporary solution. The code itself is good enough. My concerns are not about the style of the Elixir code but more about the approach in general. It works for simple cases, but as the project grows, I’d not put

* Seeding is fast

in the requirements.
What is more important to me is the correctness of the data. The approach of inserting raw data will quickly become a mess across several dozen tables. I’m a proponent of using business functions in seed scripts. I’m okay with waiting a bit longer. Inserting seed data can be sped up through parallelization (Task.async_stream or flow can help). Because of that, having progress bars in seeds is also a good thing (owl can help).

The number of inserted records must be configurable, so it is possible to specify tiny numbers and run the seeds as part of the test suite. Just create 1 record for each record type to ensure that the script doesn’t fail. Most likely, you won’t be happy when the script stops working when you need it the most :slight_smile:
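
One way to make the counts configurable, sketched with hypothetical environment variable names (`SEED_N_BOOKS`, `SEED_N_PAGES`):

```elixir
# Allows e.g. `SEED_N_BOOKS=1 SEED_N_PAGES=1 mix run priv/repo/seeds.exs`
# in the test suite, while keeping the full data set as the default.
n_books = "SEED_N_BOOKS" |> System.get_env("4000") |> String.to_integer()
n_pages_per_book = "SEED_N_PAGES" |> System.get_env("10") |> String.to_integer()
```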

On one of the projects, we have modules for creating seeds defined in lib as regular .ex files because we want these modules to be available in the release. We run them when we launch a new server for testing. This happens quite often thanks to CI/CD. So, removing seeds.exs completely is OK, if someone is struggling with this just because the file is shipped with the default Phoenix setup.


It’s always been my understanding (though I’ve been known to have problems with understanding) that the particular seeds mechanism in Phoenix (though I’m taking this understanding from Rails from years ago) is just for data that your app requires to run on initial setup (like anything in a settings table, an initial user, etc.), not for seeding dev/test data. Of course, I see it used this way a lot. I like the idea of flags if one was going to use it this way, so that it can cover all situations. I always create a whole separate task for adding “dev data.” I guess that’s essentially the same thing, now that I write that out; it’s just keeping the concepts separate. Potato/potato, perhaps.

I also do upserts for this reason.


Your comments led me to make changes to both the exercise description and the example solution of the Book Club exercise.

For example, I added a motivation for the requirement that the database can be seeded.

"""
Chris is a member of a book club and wants to develop a simple web API to streamline their group activities. His vision is to create a system that stores books and tracks the pages currently being read and discussed. The book club reads multiple books simultaneously. Although there is already a system in place for tracking who is reading which books, there is no easy way to find the current page being read in any given book.

Chris shares his idea with other members who are also developers. They brainstorm additional features for future stages of the API that they could build together. For now, Chris will create the first version of the API. He and the other developers agreed to include functionality to seed the database with books and pages in this initial version, allowing Chris to easily demonstrate the API on his machine.
"""

Here are some of the changes I have made to the seeds.exs file of the example solution.

  • Added concurrent database inserts, which drastically improved seeding time.
  • Cleared all books and pages before any inserts.
  • Returned book IDs from Repo.insert_all/3 instead of querying them separately.

Some of the community suggestions were great (!), but not directly applicable to this Elixirland exercise. However, let me know if I messed something up in the new version of the seeds.exs and/or exercise description.

This is the current version.

require Logger
alias BookClub.Repo
alias BookClub.Books.{Book, Page}

# Setup
initial_log_level = Logger.level()
# Comment out the next line to keep the configured log level. The default
# log level in applications generated with `mix phx.new` is `:debug`, which
# produces verbose output.
Logger.configure(level: :info)
Logger.info("Start database seeding")
start_time = System.os_time(:millisecond)

# Clear data
Repo.delete_all(Page)
Repo.delete_all(Book)

# Constants
n_books = 4_000
n_pages_per_book = 400
inserted_at = ~N[2000-01-01 12:00:00]
batch_size = 800
max_concurrency = 8

# Batch insert books
books =
  for _ <- 1..n_books do
    %{
      title: XlFaker.generate_title(),
      inserted_at: inserted_at,
      updated_at: inserted_at
    }
  end

insert_batch = fn batch -> Repo.insert_all(Book, batch, returning: [:id]) end

book_ids =
  Enum.chunk_every(books, batch_size)
  |> Task.async_stream(insert_batch, max_concurrency: max_concurrency)
  |> Enum.flat_map(fn {:ok, {_, books}} -> books end)
  |> Enum.map(fn %{id: id} -> id end)

# Batch insert multiple pages for each book.
# Let some books have an active page.
build_pages_for_book =
  fn book_id ->
    for i <- 1..n_pages_per_book do
      %{
        book_id: book_id,
        number: i,
        status: :inactive,
        content: XlFaker.generate_page(),
        inserted_at: inserted_at,
        updated_at: inserted_at
      }
    end
    |> List.update_at(
      :rand.uniform(n_pages_per_book) - 1,
      &Map.put(&1, :status, Enum.random([:active, :inactive]))
    )
  end

Enum.chunk_every(book_ids, batch_size)
|> Task.async_stream(
  fn batch ->
    Enum.each(
      batch,
      &Repo.insert_all(Page, build_pages_for_book.(&1))
    )
  end,
  max_concurrency: max_concurrency,
  timeout: :infinity
)
# Task.async_stream is lazy; force the stream or the inserts never run.
|> Stream.run()

# Teardown
end_time = System.os_time(:millisecond)
run_time = end_time - start_time
Logger.info("Finish database seeding")
Logger.info("Seeded #{n_books} books and #{n_books * n_pages_per_book} pages in #{run_time}ms")
Logger.configure(level: initial_log_level)

What is the point of doing that? You want your seeds to be idempotent; what if you have data besides those seeds in those tables?

Well, maybe I messed up!

But here’s the reasoning. As far as the exercise is concerned, only books and pages exist. If you run mix ecto.reset, there is no point in clearing the data first. But if you run the seeding script separately, then you’re making sure you end up with the same number of books and pages each time.

There are by the way no specific books or pages required to be in the seeded data. All content is “random”. Upserts are not necessary.

What you are describing is more fitting to a custom mix task and not seeds. I’m not sure what would be the point of seeds that do destructive operations.

That kinda depends on the use case. I’ve seen people use seeds both for seeding data for development purposes – so you’re not dealing with an empty app – and for seeding the DB with data that the application depends on to function (properly) in the first place, and which is hence needed in any environment. Those use cases have quite different constraints.


That’s what I was hoping to clarify with the motivation for the seeding.

One tiny optimization could be applied with placeholders. Probably, it won’t be noticeable, but still good as an exercise.
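
A hedged sketch of that optimization, using the `:placeholders` option of Repo.insert_all/3 (available in Ecto 3.8+): the shared timestamp is sent to the database once instead of once per row.

```elixir
placeholders = %{now: ~N[2000-01-01 12:00:00]}

rows =
  for i <- 1..10 do
    %{
      number: i,
      status: :inactive,
      inserted_at: {:placeholder, :now},
      updated_at: {:placeholder, :now}
    }
  end

Repo.insert_all(Page, rows, placeholders: placeholders)
```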


I forget where I initially saw it, but another thing to consider, which I often do, is making environment-specific seed files so it is more apparent what their purpose is.

i.e. in mix.exs

"ecto.seed": ["run priv/repo/seeds.#{Mix.env()}.exs"]

and then you would have priv/repo/seeds.dev.exs, priv/repo/seeds.test.exs, and priv/repo/seeds.prod.exs files.


mix is not available in prod,
and I’m not sure what the use case is for the test env, as usually each test creates the data it needs.

I always take a second look at any line of code using Enum.each. In your case, Repo.insert_all could be returning a result indicating that nothing was inserted, but that won’t be made visible, because Enum.each always returns :ok.
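
One way to surface that result, sketched against the `pages` and `batch_size` names from the original script: accumulate the counts that Repo.insert_all returns instead of discarding them.

```elixir
total =
  pages
  |> Enum.chunk_every(batch_size)
  |> Enum.reduce(0, fn batch, acc ->
    # insert_all returns {count, returned_rows}; keep the count visible.
    {count, _returned} = Repo.insert_all(Page, batch)
    acc + count
  end)

if total != length(pages) do
  raise "expected #{length(pages)} inserts, got #{total}"
end
```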


True, you have to run the seeds from release.ex in production if, for instance, you’re using releases.

I found test env seeds can be useful in cases where you have slow setup tasks for data that doesn’t change. The specific case I ran into was using triplex to set up tenant schemas. But yeah, this is all pretty case-specific, so maybe not worth it in most instances.

Also, I didn’t do any error handling. Is that an issue?

Never seen this, but would seeds.demo.exs or demo_seeds.exs be helpful?

Using just seeds.exs for demo data seems too ambiguous or even misleading, based on this thread.