Module caching variable for saving database hits

I’m coming over from the Ruby world and have decided to play with Elixir and Phoenix. I have started reading some books and online tutorials, but for this particular issue I haven’t found a clear, concise answer. I truly appreciate any help.

I have a CSV file in the format of:

artist,album
SomeBand,SomeAlbum
SomeBand,AnotherAlbum
...
SomeBand,YetAnotherAlbum

And so on: perhaps up to 50 rows for SomeBand, and maybe another 50 for SomeOtherBand.

I have a script that iterates over this CSV file. Instead of inserting the artist into the artists table on every row, I want it to first check a Map to see whether the artist has already been inserted, and if so, what its id is.

I have this so far in lib/populate_data.ex:

...

alias NimbleCSV.RFC4180, as: CSV

def parse_and_import do
  artists_map = %{}

  "priv/data/artists_and_albums.csv"
  |> File.stream!(read_ahead: 100_000)
  |> CSV.parse_stream()
  |> Enum.map(fn row ->
    [artist, _album, _type, _release_date, _catalog, _label, _main_release] = row

    artists_map = check_or_insert_artist(artist, artists_map)
  end)
end

  def check_or_insert_artist(artist, artists_map) do
    if Map.has_key?(artists_map, artist) do
      artists_map
    else
      insert_artist(artist, artists_map)
    end
  end

  def insert_artist(artist, artists_map) do
    case Repo.insert(%Artist{artist: artist}) do
      {:ok, struct} ->
        Map.put(artists_map, artist, struct.artist_id)

      {:error, _} ->
        artists_map
    end
  end

What I am doing above is iterating over the CSV file and, on each row, checking if the artist already exists in the artists_map variable. I am sure many of you can see the gotcha: since artists_map is rebound inside an Enum.map loop in the parse_and_import function, the rebinding is not visible on the second iteration, or any subsequent one. I think it’s because I have:

 |> Enum.map(fn row ->
   ....
   artists_map = check_or_insert_artist(artist, artists_map)

I do not think artists_map is within scope on the next run - I am guessing this is the problem. If I debug:

Before the call to check_or_insert_artist:
%{}
After the call to check_or_insert_artist:
[debug] QUERY OK db=1.6ms decode=0.9ms queue=0.4ms idle=14.7ms
INSERT INTO "artists" ("artist") VALUES ($1) RETURNING "artist_id" ["Nine Inch Nails"]
%{"Nine Inch Nails" => 162}

Second run:

Before the call to check_or_insert_artist:
%{}
# error for trying to re-insert the same artist into the unique key field

Essentially, coming from the Ruby world, I’d do:

if artist_hash.key?(artist)
  artist_hash[artist]
else
  # insert into artists table and then:
  artist_hash[artist] = new_id
end
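My guess at the closest Elixir equivalent (just a sketch, with the DB insert stubbed out as a function argument) is that the function has to hand the updated map back to the caller, since maps are immutable:

```elixir
defmodule ArtistCache do
  # Unlike a Ruby Hash, an Elixir map can't be mutated in place:
  # the updated map must be returned and re-bound by the caller.
  def fetch_or_put(cache, artist, insert_fun) do
    case Map.fetch(cache, artist) do
      {:ok, id} ->
        {id, cache}

      :error ->
        id = insert_fun.(artist)  # stand-in for the DB insert returning a new id
        {id, Map.put(cache, artist, id)}
    end
  end
end

{id, cache} = ArtistCache.fetch_or_put(%{}, "Nine Inch Nails", fn _ -> 162 end)
162 = id

# Second call is a cache hit: the insert function is never invoked.
{162, ^cache} = ArtistCache.fetch_or_put(cache, "Nine Inch Nails", fn _ -> raise "no insert" end)
```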

I have found conflicting answers on the best approach to this - I know I can use an Agent, but I’m wondering if it’s overkill here. Also, there are other attributes (label, type) that I would want to handle with a similar pattern.

Thank you for any help in this matter.

Since Elixir is not object oriented, there are no module-level variables (“properties”) that you could set like this. A process can keep state containing that data, which you could accomplish with an Agent or GenServer, but a single process may (or may not) become a bottleneck for the queries. Another option is to put the data in ETS, perhaps via some caching wrapper. There is also :persistent_term, which is a kind of global, but writing to it is expensive.

Personally I would question whether I really need to do this, as databases are already quite good at checking a unique index. But if I did need it and was running on a single node, I would use ETS and get the ID from there.
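For example, a minimal ETS sketch (the table name and the fake id generator are placeholders; in your script the empty-lookup branch would call Repo.insert and cache the returned id):

```elixir
# A named, public ETS table acts as a process-independent cache.
:ets.new(:artist_cache, [:named_table, :set, :public])

get_or_insert = fn artist ->
  case :ets.lookup(:artist_cache, artist) do
    [{_artist, id}] ->
      id

    [] ->
      id = :erlang.unique_integer([:positive])  # stand-in for the DB-generated id
      :ets.insert(:artist_cache, {artist, id})
      id
  end
end

id = get_or_insert.("Nine Inch Nails")
^id = get_or_insert.("Nine Inch Nails")  # second lookup comes from the cache
```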

Thank you - this is a one-time setup, so I would assume ETS would work out. But thinking about it, since the database is good at checking unique indexes and this is a one-time setup, maybe I shouldn’t care too much about caching (I suppose I was challenging myself). If I went the database route, do I pass the option on_conflict: :nothing? I would still need to get the id, however. Perhaps I will look for a solution that does something like find_or_create (I’m assuming a check with Repo.one first, and an insert if nothing is found)…

There’s some material here: Constraints and Upserts — Ecto v3.6.1

Also check out the :returning option.
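For example, a hedged sketch building on the schema in this thread (note: with Postgres, on_conflict: :nothing suppresses RETURNING when a conflict occurs, so the returned struct comes back without an id; a common workaround is a no-op replace of the conflicting column so RETURNING fires either way):

```elixir
{:ok, %Artist{artist_id: id}} =
  %Artist{}
  |> Artist.changeset(%{artist: artist})
  |> Repo.insert(
    # DO UPDATE SET artist = EXCLUDED.artist: a no-op update that makes
    # the row come back via RETURNING whether it was inserted or not.
    on_conflict: {:replace, [:artist]},
    conflict_target: :artist,
    returning: [:artist_id]
  )
```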

Correct. A minimal solution to this problem is to switch from Enum.map (useful when the transformation of each element is independent of the others) to Enum.reduce, which passes an accumulator to each iteration:

  |> Enum.reduce(%{}, fn row, artists_map ->
    [artist, _album, _type, _release_date, _catalog, _label, _main_release] = row

    check_or_insert_artist(artist, artists_map)
  end)
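A runnable toy version of the same idea, with the Repo call stubbed out, shows the accumulator surviving between iterations:

```elixir
# The accumulator threads through every iteration, so the second
# "NIN" row sees the map built by the first one.
rows = [["NIN", "album_a"], ["NIN", "album_b"], ["Tool", "album_c"]]

{cache, inserts} =
  Enum.reduce(rows, {%{}, 0}, fn [artist | _rest], {cache, inserts} ->
    if Map.has_key?(cache, artist) do
      {cache, inserts}
    else
      # Stand-in for Repo.insert/1; fake id derived from the cache size.
      {Map.put(cache, artist, map_size(cache) + 1), inserts + 1}
    end
  end)

# Three rows, but only two "inserts": the cache survived between iterations.
```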

I would just put everything in a list of maps and call:

Repo.insert_all(Artist, artist_maps, on_conflict: :nothing)

The on_conflict: :nothing option tells the database to ignore rows that already exist.

This would speed up your execution enough with minimal effort, offloading some of the responsibility to the db: if the record exists it is ignored, if not it is inserted, all in one batch and a single network call.

Thank you all for your suggestions - I like the reduce option; however, I’ll be dealing with more than just artists, but with other attributes as well.

In the end, I decided the database hits weren’t worth worrying about, so I skipped ETS and Agent (though I will look into those for my own sake). From the suggested help and posted links, I did this in the Artist model:

  def changeset(artist, params \\ %{}) do
    artist
    |> cast(params, [:artist])
    |> validate_required([:artist])
    |> unique_constraint(:artist)
  end

And then inserted within lib/populate_data.ex:

    "priv/data/artists_and_albums.csv"
    |> File.stream!(read_ahead: 100_000)
    |> CSV.parse_stream
    |> Enum.map(fn row ->
      [artist, _album, _type, _release_date, _catalog, _label, _main_release, _duration, _wiki_link] = row
      IO.inspect insert_artist(artist)
    end)
  end

  def insert_artist(artist) do
    case Repo.insert(Artist.changeset(%Artist{}, %{artist: artist})) do
      {:ok, struct} ->
        struct.artist_id

      {:error, _} ->
        Repo.get_by!(Artist, artist: artist).artist_id
    end
  end

I see errors whizz by but they are caught, and the output is:

[debug] QUERY OK source="artists" db=0.1ms idle=2.8ms
SELECT a0."artist_id", a0."artist" FROM "artists" AS a0 WHERE (a0."artist" = $1) ["Nine Inch Nails"]
[debug] QUERY ERROR db=0.2ms idle=2.8ms
INSERT INTO "artists" ("artist") VALUES ($1) RETURNING "artist_id" ["Nine Inch Nails"]
452

452 being the ID. I see the insert fail, but it then selects the existing entry. I’m not worried about optimizing this as I learn, but I may revisit it with the other suggestions.

I was thinking maybe I could iterate through the CSV file, collect the artists, then do an insert_all (as suggested above), and afterwards fetch all the records into a Map so I could do artist_map[artist] to get the ID that way?
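Something like this sketch, I suppose (untested, reusing the module names from above):

```elixir
# Collect the unique artist names from the CSV.
artists =
  "priv/data/artists_and_albums.csv"
  |> File.stream!(read_ahead: 100_000)
  |> CSV.parse_stream()
  |> Enum.map(fn [artist | _rest] -> artist end)
  |> Enum.uniq()

# One batch insert; rows that already exist are skipped at the database level.
Repo.insert_all(Artist, Enum.map(artists, &%{artist: &1}), on_conflict: :nothing)

# Then read everything back once, keyed by name, for cheap id lookups.
artist_map = Map.new(Repo.all(Artist), fn a -> {a.artist, a.artist_id} end)
```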

Awesome suggestions - thank you all!

Actually, maybe I spoke too soon - it works, but I was thinking a more elegant pattern might be:

lib/populate_data.ex:

"priv/data/artists_and_albums.csv"
    |> File.stream!(read_ahead: 100_000)
    |> CSV.parse_stream
    |> Enum.map(fn row ->
      [artist, _album, _type, _release_date, _catalog, _label, _main_release, _duration, _wiki_link] = row

      IO.inspect Artist.fetch_or_insert_artist(artist)
    end)
  end

And then in the Artist model:

defmodule MyApp.Artist do
  use Ecto.Schema
  import Ecto.Changeset

  alias MyApp.Repo

  @primary_key {:artist_id, :id, autogenerate: true}

  schema "artists" do
    field :artist, :string
  end

  def fetch_or_insert_artist(artist) do
    Repo.get_by(MyApp.Artist, artist: artist) ||
     insert_artist(artist)
  end

  def insert_artist(artist) do
    case Repo.insert(MyApp.Artist.changeset(%MyApp.Artist{}, %{artist: artist})) do
      {:ok, struct} ->
        struct
      {:error, _} ->
        Repo.get_by!(MyApp.Artist, artist: artist)
    end
  end

  def changeset(artist, params \\ %{}) do
    artist
    |> cast(params, [:artist])
    |> validate_required([:artist])
    |> unique_constraint(:artist)
  end
end

Is this a good pattern - to move this to the model? Also referencing MyApp.Artist within the model seems…redundant - is there a better way?

Thank you - I am really loving this!

No, you have muddled your model with business actions rather than keeping it to just data and changes.

You should use a context to operate on your model and define your business use cases / actions.
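For instance, a hypothetical MyApp.Music context (the name is made up) could own the action, leaving the schema module with only data and changesets:

```elixir
defmodule MyApp.Music do
  alias MyApp.{Repo, Artist}

  # Business action lives in the context, not in the schema module.
  def fetch_or_insert_artist(name) do
    Repo.get_by(Artist, artist: name) || insert_artist(name)
  end

  defp insert_artist(name) do
    case Repo.insert(Artist.changeset(%Artist{}, %{artist: name})) do
      {:ok, artist} ->
        artist

      # The unique constraint fired (e.g. a race): fetch the existing row.
      {:error, _changeset} ->
        Repo.get_by!(Artist, artist: name)
    end
  end
end
```

As for the redundancy you noticed: inside a module you can use `__MODULE__` (or `alias __MODULE__`) instead of spelling out the full module name.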

Did you arrive at a solution you are happy with?

Yes, 100%. Thank you all!