How do I get rid of duplicates in a CSV file before importing into a database?

Hi,

Any idea how I can filter a CSV file to get rid of duplicate rows (based on the values in two columns) before I save it into a DB?

The current code that I have looks something like this:

    def import_products(account, path) do
      result =
        path
        |> File.stream!()
        |> Stream.drop(1)            # drop the header line
        |> decode_product_csv()      # decode the remaining rows

When I detect duplicates in those columns in the CSV file, I want to log the error and the offending row, and not run any DB operations.

I think dedup_by may be a solution given a valid data structure.

Use Enum.uniq_by

Enumerates the enumerable, by removing the elements for which function fun
returned duplicate items.

The function fun maps every element to a term. Two elements are considered
duplicates if the return value of fun is equal for both of them.

The first occurrence of each element is kept.

## Example

    iex> Enum.uniq_by([{1, :x}, {2, :y}, {1, :z}], fn {x, _} -> x end)
    [{1, :x}, {2, :y}]
    
    iex> Enum.uniq_by([a: {:tea, 2}, b: {:tea, 2}, c: {:coffee, 1}], fn {_, y} -> y end)
    [a: {:tea, 2}, c: {:coffee, 1}]
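
Applied to your two-column case, that would be something along these lines (the rows variable and column keys are just placeholders, not from your code):

    # Keep only the first row seen for each {col_a, col_b} pair.
    Enum.uniq_by(rows, fn row -> {row["col_a"], row["col_b"]} end)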

How large is the CSV? Enum.uniq_by won’t help you detect duplicates; it’ll just remove them.

If the goal is to have something like “no products should have duplicate names” or similar, then the easiest route is to just define a unique database constraint. Then load all the CSV rows in a single transaction, and roll back the transaction if one of the inserts fails.
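
A minimal sketch of that approach with Ecto, assuming a hypothetical products table; the column names (name, sku) and module names are illustrative, not from this thread:

    # Migration: enforce uniqueness at the database level.
    defmodule MyApp.Repo.Migrations.AddProductsUniqueIndex do
      use Ecto.Migration

      def change do
        create unique_index(:products, [:name, :sku])
      end
    end

    # Import: run every insert inside one transaction. A violation of the
    # unique index raises, which rolls the whole transaction back, so no
    # partial import is persisted.
    defmodule MyApp.Importer do
      alias MyApp.{Repo, Product}

      def import_products(rows) do
        Repo.transaction(fn ->
          Enum.each(rows, fn attrs ->
            %Product{}
            |> Product.changeset(attrs)
            |> Repo.insert!()
          end)
        end)
      end
    end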


I definitely agree with a database-focused approach. Although, depending on the requirements, one potential downside is that I believe you will only get a warning on the first duplicate encountered. If there are 20 more after that, you won’t know until you remove that line from the file and try again. But that is most likely okay, since it would indicate other errors in the file.
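
If reporting every duplicate up front matters, here is one rough sketch (assuming the rows have already been decoded into maps; the :name and :sku keys are hypothetical):

    require Logger

    # Group rows by the two-column key, keeping their CSV line numbers,
    # so every duplicate can be logged rather than just the first one hit.
    duplicates =
      rows
      |> Enum.with_index(2)   # offset 2: line 1 was the header
      |> Enum.group_by(fn {row, _line} -> {row[:name], row[:sku]} end)
      |> Enum.filter(fn {_key, entries} -> length(entries) > 1 end)

    for {key, entries} <- duplicates do
      lines = Enum.map(entries, fn {_row, line} -> line end)
      Logger.error("Duplicate #{inspect(key)} on CSV lines #{inspect(lines)}")
    end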

Ah yeah that’s a good point.

It all sorta depends a bit on the CSV size, execution time constraints, and so on. If the CSV is too large to hold in memory that will affect the answers a bit.

So what should the approach be if the CSV file is large? The CSV file will typically contain over 1,000 lines.

Ah that’s not too bad at all. Just load it into memory, look for duplicates, error if there are some, and if there aren’t use Repo.insert_all.

If you end up with 10s of thousands to 100s of thousands of rows or more you can look at other options.
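
As a sketch, that flow might look roughly like this (assuming the rows are already decoded into maps that Repo.insert_all accepts; Product and the key columns are placeholders):

    # Load all rows, fail fast if any two-column key repeats,
    # otherwise bulk insert everything in one statement.
    duplicate_keys =
      rows
      |> Enum.frequencies_by(fn row -> {row.name, row.sku} end)
      |> Enum.filter(fn {_key, count} -> count > 1 end)
      |> Enum.map(fn {key, _count} -> key end)

    if duplicate_keys == [] do
      Repo.insert_all(Product, rows)
    else
      {:error, {:duplicates, duplicate_keys}}
    end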


Personally I’d consider this a data integrity problem, which would be solved by the database. Importing the CSV into the database would be the extent of the Elixir application.

This would be easier and faster to achieve by using something like Postgres’ INSERT INTO ... VALUES (...) ON CONFLICT (...) DO NOTHING, as seen in the documentation here: https://www.postgresql.org/docs/10/static/sql-insert.html

As well, MySQL allows INSERT IGNORE as documented: https://dev.mysql.com/doc/refman/5.5/en/insert.html

Following an approach like this, you’d need to define what counts as duplicate data and figure out any necessary normalization.
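
For reference, Ecto exposes the Postgres ON CONFLICT ... DO NOTHING behaviour through the :on_conflict and :conflict_target options of Repo.insert_all; a minimal sketch, with the schema and conflict columns assumed:

    # Insert all rows, silently skipping any that collide with an existing
    # (name, sku) pair; this relies on a unique index over those columns.
    Repo.insert_all(Product, rows,
      on_conflict: :nothing,
      conflict_target: [:name, :sku]
    )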