Insert all lines from text file into DB with seed

This is what I am trying to do. I have a text file (movies.csv) with 40k lines. I wrote a function to parse each line using Stream/Enum. It parses the data I want, and I am able to view the parsed data from IEx (though I can only see a partial list… say 200+ lines).

I am using the parser to insert the data into the DB. What is interesting is that it only inserts a small subset of the entire file. I am unable to insert all the records into the DB (even though the small set takes only 10 ms)…

How can I insert all 40k records into the database?

Elixir : 1.9.4
Erlang : 22
Postgresql : 12
Debian : 10

Inserting a .csv file into Postgres will be much, much faster if you use the Postgres COPY statement. That said, it’s very difficult to give guidance without seeing the code you are using to do the inserting.
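For reference, COPY can also be issued from Elixir (a minimal sketch, assuming an Ecto repo named Repo and a movies table matching the CSV columns; note that a plain server-side COPY reads the file on the database host, with psql’s \copy as the client-side alternative):

```elixir
# Sketch only: assumes Repo is an Ecto repo and a movies table exists
# with these columns. The path must be readable by the Postgres server
# process, and it is interpolated into the SQL, so pass trusted input only.
defmodule Movies.BulkLoad do
  def copy_from_csv(path) do
    Ecto.Adapters.SQL.query!(
      Repo,
      "COPY movies (name, type, genres) FROM '#{path}' WITH (FORMAT csv, HEADER true)"
    )
  end
end
```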


Task.async_stream or Flow should sort you out.
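For example (a sketch, assuming a movies/0 function that returns the structs to insert, and an Ecto Repo):

```elixir
# Sketch only: assumes movies/0 returns %Movie{} structs and Repo is an Ecto repo.
# Task.async_stream runs the inserts concurrently; Repo.insert!/1 raises on
# failure, so a bad row surfaces immediately instead of being silently skipped.
movies()
|> Task.async_stream(&Repo.insert!/1, max_concurrency: 10, timeout: :infinity)
|> Stream.run()
```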

Sorry, had to step out. Postgresql COPY is an option, but I want to learn to do it using Elixir/Phoenix.

Here is the code snippet. It is not the prettiest (I have to work to make it better), but it does insert.

@name_file "data/movies.csv"

def get_movies do
  [_header | data_rows] =
    File.read!(@name_file)
    |> String.split("\n")

  data_rows
  |> Stream.map(fn row -> String.split(row, ",") end)
  |> Stream.filter(fn row ->
    case row do
      [movie_name, movie_type, genres, _homepage, -imdb_id] -> true
      _ -> false
    end
  end)
  |> Enum.map(fn row ->
    [movie_name, movie_type, genres, _homepage, -imdb_id] = row

    [movie_name, movie_type, genres]
  end)
end

ParseData.get_movies()
    |> Enum.map(fn row ->
      [name, type, genres] = row

      Repo.insert(%Movie{name: name, type: type, genres: genres})
    end)

I am mapping the data to an Ecto function, which inserts fine, except it only inserts the first 239 records.

Short version

You have -imdb_id, not _imdb_id, which I suspect means you end up with only 239 matching rows in total (not tested).

Longer version

Here’s a slightly cleaned up version, which I did just to understand what you’re trying to do. It’s how I saw that you had mistyped _imdb_id. I also added {:ok, movie} = Repo.insert(movie), because if the insert fails you’ll get a match error and therefore know about it. This is a helpful debugging approach. I also changed the Stream functions to Enum functions, since you are already reading the whole file into memory, so there’s nothing much to be saved and some performance to be gained.

defmodule Movie do
  @name_file "data/movies.csv"

  def movies do
    [_header | data_rows] =
      File.read!(@name_file)
      |> String.split("\n")

    data_rows
    |> Enum.map(&String.split(&1, ","))
    |> Enum.map(fn 
      [name, type, genres, _homepage, _imdb_id] ->
        %Movie{name: name, type: type, genres: genres}
      _ -> nil
    end)
    |> Enum.reject(&is_nil/1)
  end

  def load_database do
    Enum.each(movies(), fn movie ->
      {:ok, movie} = Repo.insert(movie)
    end)
  end
end

@kip. Thank you very much. I made the changes you suggested, but still no luck. It only inserts 239. I also noticed it is inserting the rows in random order, i.e., it is not inserting rows sequentially…

I want to learn what exactly is happening when it only inserts a small subset of records (instead of the entire file), but due to time constraints, I am going to use Postgresql COPY.

If anyone has suggestions or pointers, they will be highly appreciated.

Maybe using Repo.insert_all can help. Something like…

  def load_database do
    max_batch = 5_000
    movies()
    |> Enum.chunk_every(max_batch)
    |> Enum.each(fn group_of_movies -> 
      Repo.insert_all(Movie, group_of_movies)
    end)
  end
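One caveat with that approach: Repo.insert_all/3 expects plain maps (or keyword lists) rather than %Movie{} structs, so the structs from movies/0 need to be stripped down first. A sketch, assuming movies/0 as defined above:

```elixir
def load_database do
  max_batch = 5_000

  movies()
  # insert_all takes bare maps, not structs, so drop the Ecto metadata
  |> Enum.map(&Map.take(&1, [:name, :type, :genres]))
  |> Enum.chunk_every(max_batch)
  |> Enum.each(fn batch -> Repo.insert_all(Movie, batch) end)
end
```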

If you’re not getting match errors on the insert (with the code example I suggested) then I suspect something else is going on.

I’d be interested to know what you get with:

  def load_database do
    IO.inspect(Enum.count(movies()), label: "Number of movies")

    Enum.each(movies(), fn movie ->
      {:ok, movie} = Repo.insert(movie)
    end)
  end

@kip. Thank you. The count helped find the issue. There were records whose homepage and genres fields had “,” in them. I used a CSV library and resolved it.
I really love the quick support.
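For anyone hitting the same thing: a CSV library (NimbleCSV shown here as one common choice) handles quoted fields that contain commas, which a plain String.split(row, ",") cannot. The sample row below is hypothetical:

```elixir
# NimbleCSV (add {:nimble_csv, "~> 1.2"} to deps) compiles a parser module.
NimbleCSV.define(MoviesParser, separator: ",", escape: "\"")

# Hypothetical row: the quoted genres field contains a comma.
line = ~s(Inception,movie,"Action, Sci-Fi",https://example.com,tt1375666\n)

[[name, _type, genres, _homepage, _imdb_id]] =
  MoviesParser.parse_string(line, skip_headers: false)

# genres comes back as the single value "Action, Sci-Fi", whereas
# String.split(line, ",") would have produced six fields for this row.
```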