Data processing in Parallel (Hackathon project)

There is a mistake in your data sample file: the first line contains a shell prompt:

anonymized@ANONYMIZED:~/file_watch_example$ head -100

The rest is valid CSV.

You are sorted. Here is the PR: “Make it work” by dimitarvp (Pull Request #1 on pascal-chenevas/file_watch_example, GitHub).

I made it use batch inserting again and it works just fine. It takes only 108 seconds on my Intel Mac to insert exactly 10 million records.

Try my branch on your machine and let us know. There is no reason for it not to work unless your MySQL instance is broken, or you are not showing all your code and something else in it is causing the bug.
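For reference, chunked batch inserting along these lines is the usual pattern (a minimal sketch, not the actual PR code; `MyApp.Repo`, the column names, and the naive comma split are assumptions):

```elixir
# Sketch of chunked batch inserting: stream the CSV file and insert
# rows in batches with Repo.insert_all/3. Batches of ~1000 rows keep
# each statement well under MySQL's prepared-statement placeholder limit.
# Note: String.split(",") is a simplification; a real CSV should be
# parsed with a proper parser such as NimbleCSV.
defmodule IngestCSV do
  alias MyApp.Repo

  def load(path) do
    path
    |> File.stream!()
    |> Stream.drop(1)                 # skip the header line
    |> Stream.map(&parse_row/1)
    |> Stream.chunk_every(1_000)
    |> Enum.each(&Repo.insert_all("products", &1))
  end

  defp parse_row(line) do
    [name, price] = line |> String.trim() |> String.split(",")
    %{name: name, price: price}
  end
end
```

Streaming (rather than `File.read!`) keeps memory flat even for a 10-million-row file.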

5 Likes

@dimitarvp, I have merged your modifications and then cloned the repo into a new local directory.
Afterward, I ran the Ecto migration with mix ecto.migrate (so at that moment the products table was empty).
When the migration finished, I copied the CSV file into data/in.
I am very surprised at how fast the records are created in the database! :slight_smile:

mysql> select count(*) from products;
+----------+
| count(*) |
+----------+
| 84040000 |
+----------+

As you can see the number I got is much higher than the number of rows the file contains:

 cat products.csv | wc -l
10000000

Maybe file_watcher.ex is not properly configured (and the CSV file is loaded endlessly)?

The ecto.migrate command does not delete old data; it runs any pending migrations, and if everything is already up to date it does nothing.

In your case you want to delete the entire database before inserting new data. This can be done by explicitly running mix ecto.reset, or mix ecto.drop followed by mix ecto.create and mix ecto.migrate.
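For context, ecto.reset is not a built-in Mix task; it is conventionally defined as a Mix alias that chains exactly those steps (a sketch of the typical mix.exs aliases; whether this project defines them is an assumption):

```elixir
# In mix.exs: the conventional aliases so that `mix ecto.reset`
# drops, recreates, and migrates the database in one command.
defp aliases do
  [
    "ecto.setup": ["ecto.create", "ecto.migrate"],
    "ecto.reset": ["ecto.drop", "ecto.setup"]
  ]
end
```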

For better visibility into what happens in your database, I would strongly recommend using a CLI tool to query it, or even better a visual tool like DBeaver.

2 Likes

Man, this is getting tiring. Just drop the DB, create it again, run the migration, and then start the code.

And indeed as @D4no0 said, post some command output e.g. mysql -e 'select count(id) from products;' before and after.

The code I did is working perfectly, you are missing something else and I believe it’s time you get a grip and make it work. :wink:

Are you opening iex -S mix and then running IngestCSV.load("10_million_products.csv")? Is there any other code you’re not showing?

2 Likes

Thank you for giving me the command you use to load the file. After using the same command I can confirm that I got the same result as you. It works!

That confirms my suspicion from my previous comment:

Maybe file_watcher.ex is not properly configured (and the CSV file is loaded endlessly)?

How I proceeded to run into this error:

  1. mix deps.get (if I run the app for the first time)
  2. docker-compose up --build
  3. docker exec -ti file_watch_example_new_elixir_app_1 bash
     3.1 mix ecto.drop
     3.2 mix ecto.create
     3.3 mix ecto.migrate
  4. outside the container: cp products.csv data/in

In the Elixir app, I have a “file watcher” that should check for new incoming files in the data/in directory. If a new file is detected, it should be loaded by IngestCSV (see line #20). So I think this process is wrongly defined/configured and seems to load the CSV file endlessly.
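One common way to keep a watcher from picking the same file up again and again is to move it out of the watched directory once it has been ingested (a sketch only; the module name, callback shape, and data/processed directory are assumptions, not the project's actual code):

```elixir
# After a successful import, move the file out of data/in so that
# subsequent watcher events or rescans do not trigger a re-import.
defmodule FileWatcher do
  @done_dir "data/processed"

  def handle_new_file(path) do
    :ok = IngestCSV.load(path)
    File.mkdir_p!(@done_dir)
    File.rename!(path, Path.join(@done_dir, Path.basename(path)))
  end
end
```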


Looks like you don’t have duplicate detection, so you’re constantly inserting the same records.

Your next step is to add a unique index on the name (or another field) to make sure you don’t duplicate so much.
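A unique index plus an on_conflict option on the insert would make reloads idempotent (a sketch; the name column follows the suggestion above, and the migration module name is an assumption):

```elixir
# Migration sketch: enforce uniqueness at the database level so that
# re-ingesting the same file cannot create duplicate rows.
defmodule FileWatchExample.Repo.Migrations.AddUniqueIndexToProducts do
  use Ecto.Migration

  def change do
    create unique_index(:products, [:name])
  end
end
```

With the index in place, the batch insert can be told to skip rows that would violate it, e.g. `Repo.insert_all("products", batch, on_conflict: :nothing)`.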

And it would have been useful for you to mention your exact steps before. We’re not mind readers. :smiley:

I consider this thread completed. Mark one of my answers as the solution (if you don’t mind) and if you have other problems, they are for a new thread IMO.

2 Likes

Looks like you don’t have duplicate detection, so you’re constantly inserting the same records.

Your next step is to add a unique index on the name (or another field) to make sure you don’t duplicate so much.

Yes I will do that!

And it would have been useful for you to mention your exact steps before. We’re not mind readers. :smiley:

I will take care of that for my future posts :slight_smile: Sorry about that!

I consider this thread completed. Mark one of my answers as the solution (if you don’t mind) and if you have other problems, they are for a new thread IMO.

I already did it :slight_smile:

Thank you all for your support!

1 Like