Enumerates the enumerable, removing the elements for which function fun
returned duplicate values.
The function fun maps every element to a term. Two elements are considered
duplicates if the return value of fun is equal for both of them.
The first occurrence of each element is kept.
## Examples
    iex> Enum.uniq_by([{1, :x}, {2, :y}, {1, :z}], fn {x, _} -> x end)
    [{1, :x}, {2, :y}]

    iex> Enum.uniq_by([a: {:tea, 2}, b: {:tea, 2}, c: {:coffee, 1}], fn {_, y} -> y end)
    [a: {:tea, 2}, c: {:coffee, 1}]
How large is the CSV? Enum.uniq_by/2 won’t help you detect duplicates; it’ll just remove them.
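If you do need to report duplicates rather than silently drop them, here’s a minimal sketch using Enum.frequencies_by/2 (Elixir 1.10+); the row shape and the :name key are just assumptions for illustration:

```elixir
rows = [
  %{name: "tea", qty: 2},
  %{name: "tea", qty: 3},
  %{name: "coffee", qty: 1}
]

# Count how many times each name occurs, then keep only the repeats.
duplicates =
  rows
  |> Enum.frequencies_by(& &1.name)
  |> Enum.filter(fn {_name, count} -> count > 1 end)

# duplicates == [{"tea", 2}]
```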
If the goal is something like “no products should have duplicate names”, then the easiest route is to define a unique database constraint, load all the CSV rows in a single transaction, and roll back the transaction if one of the inserts fails.
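A rough sketch of that transaction, assuming an Ecto Product schema whose changeset calls unique_constraint(:name) backed by a unique index (Repo, Product, and rows are placeholders):

```elixir
Repo.transaction(fn ->
  Enum.each(rows, fn attrs ->
    case %Product{} |> Product.changeset(attrs) |> Repo.insert() do
      {:ok, _product} ->
        :ok

      # Repo.rollback/1 aborts the transaction: every insert so far is
      # undone and Repo.transaction/1 returns {:error, changeset}.
      {:error, changeset} ->
        Repo.rollback(changeset)
    end
  end)
end)
```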
I definitely agree with a database-focused approach, although depending on the requirements one potential downside is that you’ll only get an error for the first duplicate encountered. If there are 20 more after that, you won’t know until you remove that line from the file and re-run the import. But that’s most likely okay, since a duplicate would indicate other errors in the file anyway.
It all depends a bit on the CSV size, execution-time constraints, and so on. If the CSV is too large to hold in memory, that will affect the answers.
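For a file that doesn’t fit in memory, here’s a minimal sketch of streaming the rows in batches instead of loading everything at once (the file name, the naive comma splitting, and the batch size are assumptions; a real import would use a CSV library such as NimbleCSV):

```elixir
"products.csv"
|> File.stream!()
|> Stream.map(&String.trim_trailing/1)
# Naive split; does not handle quoted fields with embedded commas.
|> Stream.map(&String.split(&1, ","))
|> Stream.chunk_every(500)
|> Enum.each(fn batch ->
  # Insert (or validate) one batch at a time so memory use stays bounded.
  IO.puts("processing #{length(batch)} rows")
end)
```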
Personally I’d consider this a data-integrity problem, which is best solved by the database. Importing the CSV into the database would be the extent of the Elixir application.
This would be easier and faster to achieve using something like Postgres’ INSERT INTO ... VALUES (...) ON CONFLICT (...) DO NOTHING, as described in the documentation here: https://www.postgresql.org/docs/10/static/sql-insert.html
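From Elixir, Ecto exposes that clause through the :on_conflict option of Repo.insert_all/3. A sketch assuming a products table with a unique index on name (Repo and the row shape are placeholders):

```elixir
rows = [
  %{name: "tea", qty: 2},
  %{name: "tea", qty: 3},      # skipped by the database, not by Elixir
  %{name: "coffee", qty: 1}
]

# Emits INSERT ... ON CONFLICT (name) DO NOTHING under the hood.
{inserted, _} =
  Repo.insert_all("products", rows,
    on_conflict: :nothing,
    conflict_target: :name
  )

# inserted == 2 (the duplicate row was silently skipped)
```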