Blink - Fast bulk seeding for Ecto/PostgreSQL with clean, declarative syntax

Blink is a library for fast bulk data insertion into PostgreSQL databases using the COPY command. It provides a clean, declarative syntax for defining seeders.

Features:

  • Uses PostgreSQL’s COPY for fast bulk inserts
  • Tables inserted in declaration order to respect foreign key constraints
  • Access data from previously defined tables when building subsequent tables
  • Store auxiliary context data that won’t be inserted into the database
  • Load data from CSV/JSON files with Blink.from_csv/2 and Blink.from_json/2 (sketched after the example below)
  • :transform option for type conversion when loading from files
  • Integrates nicely with ExMachina
  • Rollback on errors
  • Adapter pattern for supporting other databases

Example:

defmodule MyApp.Seeder do
  use Blink

  def call do
    new()
    |> add_table(:users)
    |> add_table(:posts)
    |> insert(MyApp.Repo)
  end

  def table(_store, :users) do
    [
      %{id: 1, name: "Alice", email: "alice@example.com"},
      %{id: 2, name: "Bob", email: "bob@example.com"}
    ]
  end

  def table(store, :posts) do
    users = store.tables.users

    # Build posts referencing the users defined above
    Enum.map(users, fn user ->
      %{id: user.id, title: "Post by #{user.name}", user_id: user.id}
    end)
  end
end
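
For the file-loading feature, a minimal sketch. The file path is hypothetical, and the exact shape of :transform is an assumption (here, a per-row function over maps with string keys):

def table(_store, :users) do
  Blink.from_csv("priv/seeds/users.csv",
    # Assumed shape: a per-row function used for type conversion
    transform: fn row -> Map.update!(row, "id", &String.to_integer/1) end
  )
end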

11 Likes

Good library.

I’ve read the code and found a couple of fairly obvious bugs (like unescaped strings in the generated CSV) and limitations (like reading everything into memory), so I made a PR with fixes.

I also offer fairly cheap consultancy services if you want this kind of review and contribution in your private projects.

2 Likes

Great. Ty.

I was aware of the memory issue and had a fix in mind similar to the one in your PR. I’ll have a closer look when I have time.

v0.5.0 Released

Version 0.5.0 is now available. This release marks a big step toward 1.0.0 — it covers all the major changes I had planned. Now the focus shifts to gathering feedback, fixing bugs, and addressing any remaining breaking changes before 1.0.0 (though I don’t have any in mind).

The headline feature is stream support, which enables memory-efficient seeding of large datasets.

Both table/2 clauses return streams in the example below, but returning lists still works as before.

defmodule Blog.Seeder do
  use Blink

  def call do
    new()
    |> with_table("users")
    |> with_table("posts")
    |> run(Blog.Repo, timeout: :infinity)
  end

  def table(_seeder, "users") do
    Stream.map(1..200_000, fn i ->
      %{
        id: i,
        name: "User #{i}",
        email: "user#{i}@example.com",
        ...
        inserted_at: ~U[2024-01-01 00:00:00Z],
        updated_at: ~U[2024-01-01 00:00:00Z]
      }
    end)
  end
  
  def table(seeder, "posts") do
    users_stream = seeder.tables["users"]

    Stream.flat_map(users_stream, fn user ->
      for i <- 1..20 do
        %{
          id: (user.id - 1) * 20 + i,
          title: "Post #{i} by #{user.name}",
          body: "This is the content of post #{i}",
          user_id: user.id,
          ...
          inserted_at: ~U[2024-01-01 00:00:00Z],
          updated_at: ~U[2024-01-01 00:00:00Z]
        }
      end
    end)
  end
end

Other highlights

  • JSONB support — nested maps are automatically JSON-encoded during insertion (sketched below)
  • Configurable timeout — :timeout option for long-running transactions
  • Configurable batch size — :batch_size option controls stream chunking (default: 10,000 rows)
  • Performance improvement — CSV encoding executes significantly faster
  • Bug fix — CSV escaping now correctly handles pipes, quotes, newlines, and backslashes
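
To illustrate the first three items, a hedged sketch: the posts table and its metadata jsonb column are assumptions here, while :timeout and :batch_size are the documented options.

def call do
  new()
  |> with_table("posts")
  |> run(Blog.Repo, timeout: :infinity, batch_size: 50_000)
end

def table(_seeder, "posts") do
  [
    %{
      id: 1,
      title: "Hello",
      # Assuming a jsonb column named metadata: this nested map
      # is JSON-encoded automatically during insertion
      metadata: %{tags: ["elixir", "postgres"], draft: false}
    }
  ]
end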

Breaking changes

  • Blink.Store → Blink.Seeder
  • insert/3 → run/3
  • add_table/2 → with_table/2
  • add_context/2 → with_context/2
  • Return values simplified to :ok (raises on failure)
  • Adapter call/4 callback now receives table_name as a string
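
For illustration, a typical pipeline migrates like this under the renames above (note that table names are strings in the new API):

# Before
new()
|> add_table(:users)
|> insert(Blog.Repo)

# After (0.5.0)
new()
|> with_table("users")
|> run(Blog.Repo)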

Full changelog: v0.5.0 release

2 Likes

You missed a couple of other important things from my PR:

  1. Doing

    try do
      adapter.call(...)
    rescue
      UndefinedFunctionError ->
        raise "Module #{inspect adapter} must implement call/4"
    end
    

    is a strange approach. Removing the try completely would result in a more readable and meaningful exception.

    Plus, it is a buggy approach. Take, for example, a situation where the call function itself calls an undefined function. This try clause would hide that error, making debugging a nightmare.

  2. Your new approach opens and parses a CSV file twice in stream mode: once to get the headers and a second time to stream the data. This is not an issue when there is one huge file, but it is an issue when there are a lot of small files. Opening a file is a more expensive operation than reading from one.
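
For reference, a single pass can be sketched with Stream.transform, carrying the header row as accumulator state so the file is opened only once. This is an illustration, not the actual fix from the PR, and the naive comma split ignores quoting:

def stream_csv(path) do
  path
  |> File.stream!()
  |> Stream.transform(nil, fn
    # First line: keep the headers as state, emit no rows
    line, nil ->
      {[], line |> String.trim() |> String.split(",")}

    # Later lines: zip values with the stored headers
    line, headers ->
      values = line |> String.trim() |> String.split(",")
      {[headers |> Enum.zip(values) |> Map.new()], headers}
  end)
end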

2 Likes
  1. Yes, I see. That does make sense. I removed the try-rescue block.
  2. Also changed this now.

Together these changes make up version 0.5.1 (see the changelog).

Thank you.

I’m currently exploring concurrent database connections for faster seeding while keeping the API clean.

1 Like