I have a problem with Ecto timeouts (or so it seems). In an Elixir application I wrote a module for importing data. It takes a CSV file, goes through it line by line (with a Stream.map()), checks whether the dataset already exists in the database, and either updates it or inserts it as a new dataset. Pretty basic.
To make the operation restartable I wrap the whole process in a database transaction. The number of datasets is pretty large, and the import can take up to an hour.
I know about Ecto timeouts, so I start the transaction with timeout: :infinity to avoid timeout problems. It works in development, but in production I get strange errors:
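The setup described above looks roughly like this (a minimal sketch; MyApp.Repo, MyApp.Importer and the upsert logic are assumed names, and CSV.decode!/2 is assumed to come from the csv package):

```elixir
defmodule MyApp.Importer do
  # Wrap the whole streamed import in one transaction so a crash rolls
  # everything back and the job can simply be restarted.
  def import_file(path) do
    MyApp.Repo.transaction(
      fn ->
        path
        |> File.stream!()
        |> CSV.decode!(headers: true)
        |> Stream.map(&upsert_row/1)
        |> Stream.run()
      end,
      # lifts the transaction timeout, but NOT the per-query timeouts
      timeout: :infinity
    )
  end

  # Placeholder for the real "update it or insert it" logic.
  defp upsert_row(row), do: row
end
```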
Basically, connection errors at random places in the application. I can avoid the problem by setting timeout: :infinity in the repo configuration, but that seems wrong and dangerous to me.
Is it the transaction that times out, or an individual query?
As far as I remember from database lessons at university, individual queries inside a transaction can take longer than outside of one, due to the double bookkeeping of indexes or linear scans over data that is only visible inside the transaction but not yet committed to the database. Can you check whether the problem persists if you increase the timeout of the individual queries?
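One way to test that, sketched with assumed names (MyApp.Repo and a Dataset schema): pass a larger :timeout to the individual repo calls inside the transaction, since the transaction's own timeout option does not apply to them.

```elixir
defmodule MyApp.ExistenceCheck do
  # The lookup that runs once per CSV line; give this single query two
  # minutes instead of Ecto's 15-second default.
  def find_existing(key) do
    MyApp.Repo.get_by(Dataset, [external_key: key], timeout: 120_000)
  end
end
```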
The timeouts appear all over the place during the import job, in parts of the application that have nothing to do with the import and are not wrapped in the transaction. The queries inside the transaction do not seem to time out; I did not see the connection errors there.
insert_all is not a viable solution in this case: I either update existing rows or insert new ones, and not all in one table. For example, one CSV row can result in 4 rows in 4 different tables.
Instead of inserting new rows one by one, I keep them in a list and process them all in one go.
If you have different tables, you can still keep multiple lists (one per table) and run multiple insert_all calls.
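The "one list per table" idea can be sketched like this (table names, field names and MyApp.Repo are made up for illustration): accumulate the rows each CSV line produces, group them by target table, then do a single insert_all per table.

```elixir
defmodule BulkImport do
  # Turn one parsed CSV row into {table, attrs} pairs; as described above,
  # one CSV row may produce rows for several tables.
  def explode(csv_row) do
    [
      {"people", %{name: csv_row["name"]}},
      {"addresses", %{city: csv_row["city"]}}
    ]
  end

  # Group all generated rows by their target table.
  def group(csv_rows) do
    csv_rows
    |> Enum.flat_map(&explode/1)
    |> Enum.group_by(fn {table, _} -> table end, fn {_, attrs} -> attrs end)
  end

  # One insert_all per table instead of one insert per row.
  def insert(grouped) do
    Enum.each(grouped, fn {table, rows} ->
      MyApp.Repo.insert_all(table, rows)
    end)
  end
end
```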
I had a similar constraint (huge text file, multiple tables, over 1’000’000 records) with update-or-create. I had to be careful about race conditions when processing the file concurrently.
After trying tasks and poolboy, I finally set up my import pipeline with GenStage and a couple of insert_all calls.
Hm. Race conditions on the data are a non-issue in this case. But I will overhaul my import pipeline, thanks for your input…
That said, I think I have found the underlying cause of this problem. Your hint about concurrently importing the data gave me the idea: the CSV library uses workers to parallelize the CSV reading and parsing. That means my Stream.map() executions are parallelized as well.
Parallelized execution means the Ecto calls come from different processes. And since the server has traffic and more cores than my workstation, this means more Stream workers and more connections in use, so the pool (I was using the default size of 15) could be exhausted pretty fast.
And indeed: if I reduce the pool size on my workstation to 2 and the timeout to 1 second, I get the same errors all over the place.
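For reference, reproducing it locally takes nothing more than a repo config along these lines (app and module names are placeholders):

```elixir
import Config

# config/dev.exs — shrink the pool and the timeout so pool exhaustion
# shows up within seconds instead of only under production load
config :my_app, MyApp.Repo,
  pool_size: 2,
  timeout: 1_000
```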
To avoid that I now use the :caller option on all repo calls, and it works like a charm on my dev machine. I still have to test it in production, though.
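The fix, as a hedged sketch (MyApp.Repo, CSV.decode!/2 and the changeset logic are assumed names): capture the importing process's pid before the stream starts, and pass it via :caller to every repo call made from a CSV worker, so the workers reuse the owner's checked-out connection instead of grabbing their own from the pool.

```elixir
defmodule MyApp.CallerImport do
  def import_file(path) do
    # pid of the process that owns the transaction's connection
    parent = self()

    MyApp.Repo.transaction(
      fn ->
        path
        |> File.stream!()
        |> CSV.decode!(headers: true)
        |> Stream.map(fn row ->
          # runs on a CSV worker process, but on the parent's connection
          MyApp.Repo.insert!(to_changeset(row), caller: parent)
        end)
        |> Stream.run()
      end,
      timeout: :infinity
    )
  end

  # Placeholder for the real changeset construction.
  defp to_changeset(row), do: row
end
```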
This also means my import did not actually run inside the transaction anyway…
Other applications using Ecto on the same server work like a charm… none of my queries seems to take more than 20 ms, most are below 10 ms. The :pool_size is 50. I’m seriously confused.
Are you running in a Docker container? I discovered a bug in the official Elixir 1.8.1 container, which is backed by Erlang 21.3 and caused SSL problems like these in production. I had to switch to a custom image running Erlang 21.2.7.
But check your OTP and Elixir versions and make sure you are on the latest patch release of whatever you are using… (there was recently a transient bug with ConnectionError that was fixed in a later OTP release…)
Also, can you check/post mix hex.outdated for the relevant dependencies…
I’m not using distillery (haven’t had the time to wrap my head around it yet); I just checked out the source and compiled it on the server, starting the service with mix phx.server via systemd.