Populating database from CSV file

kylelw23 · December 27, 2021, 8:03am

I have a Phoenix app that already set up a schema, a controller and a router…
I would like to populate data into my created schema through seeds.exs
I have a csv file and I created a file CSVUtil.ex which is like this:

defmodule CampaignsApi.Store.Campaign do
  @moduledoc """
    Utility module to ingest `campaigns.csv`
  """

  alias NimbleCSV.RFC4180, as: CSV
  alias CampaignsApi.{Store.Campaign, Repo}

  def column_data(file) do
    column_names = get_column_names(file)

    file
    |> File.stream!()
    |> CSV.parse_stream(skip_headers: true)
    |> Enum.map(fn row ->
      row
      |> Enum.with_index()
      |> Map.new(fn {val, num} -> {column_names[num], val} end)
      |> create_or_skip()
    end)
  end

  def get_column_names(file) do
    file
    |> File.stream!()
    |> CSV.parse_stream(skip_headers: false)
    |> Enum.fetch!(0)
    |> Enum.with_index()
    |> Map.new(fn {val, num} -> {num, val} end)
  end

  def create_or_skip(row) do
    case Repo.get_by(Campaign,
           id: row["id"]
         ) do
      nil ->
        Repo.insert(
          %Campaign{}
          |> Campaign.changeset(%{
            id: Integer.new(row["id"]),
            name: row["name"],
            start_date: row["start_date"],
            end_date: row["end_date"],
            budget: Integer.new(row["budget"]),
            hashtags: row["hashtags"],
            team_id: Integer.new(row["team_id"]),
            description: row["description"]
          })
        )

      campaign ->
        {:ok, campaign}
    end
  end
end

I ran mix test and my schema is working fine, and after I ran a seed file that run functions from CSVUtil.
I got this error:

** (Module.Types.Error) found error while checking types for CampaignsApi.Store.update_campaign/2

def update_campaign(%CampaignsApi.Store.Campaign{} = campaign, attrs) do
  CampaignsApi.Repo.update(
    CampaignsApi.Store.Campaign.changeset(
      campaign,
      attrs
    )
  )
end

Please report this bug: https://github.com/elixir-lang/elixir/issues

** (UndefinedFunctionError) function CampaignsApi.Store.Campaign.__struct__/0 is undefined or private
        CampaignsApi.Store.Campaign.__struct__()
        (elixir 1.13.1) lib/module/types/of.ex:131: Module.Types.Of.struct/3
        (elixir 1.13.1) lib/module/types/pattern.ex:750: Module.Types.Pattern.of_shared/4
        (elixir 1.13.1) lib/module/types/pattern.ex:47: Module.Types.Pattern.of_pattern/3
        (elixir 1.13.1) lib/module/types/helpers.ex:93: Module.Types.Helpers.do_map_reduce_ok/3
        (elixir 1.13.1) lib/module/types/pattern.ex:12: Module.Types.Pattern.of_head/4
        (elixir 1.13.1) lib/module/types.ex:62: Module.Types.warnings_from_clause/6
        (elixir 1.13.1) lib/module/types.ex:22: anonymous fn/8 in Module.Types.warnings/5

I am also new to this technology and I dont know what to do to get that CSVUtil works probably…

soup · December 27, 2021, 8:58am

Im not sure, but you’re defining your import module as CampaignsApi.Store.Campaign, then aliasing CampaignsApi.Store.Campaign (I assume this is supposed to be your struct).

As shown, CampaignsApi.Store.Campaign isn’t a struct, nor a schema? So ecto is probably giving up when it can’t find the “type” because it’s given a Module, not a Struct. CampaignsApi.Store.Campaign.__struct__/0 is undefined or private

Can you make your importer CampaignsApi.Import or something?

kylelw23 · December 27, 2021, 12:00pm

Thank you, turn out that I should define the module as CampaignsApi.Store.CSVUtil not .Campaign

trisolaran · December 27, 2021, 7:54pm

Hi @kylelw23 and welcome!

Apart from getting your code to run, as you already did with @soup’s help, I think there are a couple of things you could improve:

is somebody consuming the return value of column_data ? If yes, then I suggest you keep it a stream in order to avoid storing all the inserted structs in memory when calling Enum.map(). Thus, consider replacing Enum.map() with Stream.map(). Using the latter will allow the result of the mapping step to be lazily evaluated by the consumer, which is more efficient if you’re dealing with a large number of rows. If, on the other hand, no one is consuming the return value of column_data and you are only interested in the side effect of inserting the rows into the DB, then do replace Enum.map() with Enum.each() to make this clear.
In create_or_skip you are hitting the DB twice: once to check if a row with the given ID already exists, and then to insert the row if it doesn’t. Assuming you have a primary key constraint on the id column in your DB schema you could get rid of the first call by calling Repo.insert() with the on_conflict: :nothing option. The only downside is that the campaign struct returned by the insert, in case of a conflict, won’t be the one in the DB but the one you were trying to insert. This may or may not be a problem depending on who (if ever) is consuming this result (the answer to my question at point 1.)

I hope this helps!

kylelw23 · December 28, 2021, 10:03am

Hi @trisolaran

my seeds.exs is consuming the return value of column_data, I tried to change it to Stream.map() and nothing happened, sorry for my novice knowledge.
Yeah I realized that did not work so I changed it to

 CampaignsApi.Repo.insert!(%CampaignsApi.Store.Campaign{budget: String.to_integer(row["budget"]),
    description: row["description"],
    end_date: row["end_date"],
    hashtags: row["hashtags"],
    name: row["name"],
    start_date: row["start_date"],
    team_id: String.to_integer(row["team_id"])})

and it works normally, but I just wanted to update the database next time if I use column_data again, so It wont duplicate inputs that’s already added to the database.

dimitarvp · December 28, 2021, 11:49am

When using Stream you always have to use a function that collects the results at the end of the pipe because Stream functions only return other functions. Only when you put e.g. Enum.to_list or Stream.run in the end will the Stream functions in the pipe get executed.

Example:

"/path/to/file.csv"
|> File.stream!()
|> NimbleCSV.RFC41080.parse_stream()
|> Stream.map( ....... )
|> Enum.to_list()

Without the last function the code above only returns a function. Appending Enum.to_list forces that function to get executed. It’s how Stream works in Elixir and many other languages.

But be advised: using Stream incurs some performance penalty. Only use it if you have big collections of elements and you don’t want to have intermediate collections that get processed and then thrown away. Also it’s a good idea to only reach for Stream when you have several steps of processing. Your code above definitely does NOT need Stream as it is.

trisolaran · December 28, 2021, 12:27pm

His code above is already using a stream . The output of:

|> File.stream!()
|> NimbleCSV.RFC41080.parse_stream()

is a stream. What his code doesn’t need is Enum.map. It could just be Enum.each, provided nobody is using the result of column_data.

That’s why I’d like to see how column_data is called. @kylelw23 can you show us?

dimitarvp · December 28, 2021, 12:39pm

Sure, I’ve seen it above. I was seeking to remove confusion as to why only using Stream will seem to not work.

kylelw23 · December 28, 2021, 12:54pm

Hi @trisolaran ,
seeds.exs:

CampaignsApi.Store.CSVUtil.column_data("priv/repo/data/campaigns.csv")

Then I run mix run seeds.exs

trisolaran · December 28, 2021, 1:02pm

Ok then you should ditch Enum.map and use Enum.each instead, because you are only running column_data for the side effects, you don’t care about what it returns.

It makes the code clearer and I think also more efficient: Enum.map will store the result of the mapping in memory, and this could be a lot if you have a lot of rows.