Timeout errors with long-running transactions

ecto

#1

Hi there,

I have a problem with Ecto timeouts (so it seems). In an Elixir application I wrote a module for importing data. It takes a CSV file, goes through it line by line (with a Stream.map()), checks whether the dataset already exists in the database, and either updates it or inserts it as a new record. Pretty basic.

To make the operation restartable I wrap the whole process in a database transaction. The number of records is quite large, and the import can take up to an hour.

I know about Ecto's timeouts, so I start the transaction with timeout: :infinity to avoid timeout problems. It works in development, but in production I get strange errors:

16:02:34.435 [error] Postgrex.Protocol (#PID<0.423.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed

or

 ** (exit) an exception was raised:
     ** (DBConnection.ConnectionError) ssl send: closed
         (ecto_sql) lib/ecto/adapters/sql.ex:624: Ecto.Adapters.SQL.raise_sql_call_error/1
         (ecto_sql) lib/ecto/adapters/sql.ex:557: Ecto.Adapters.SQL.execute/5
         (ecto) lib/ecto/repo/queryable.ex:147: Ecto.Repo.Queryable.execute/4
         (ecto) lib/ecto/repo/queryable.ex:18: Ecto.Repo.Queryable.all/3
         (ecto) lib/ecto/repo/queryable.ex:66: Ecto.Repo.Queryable.one/3
         (termitool) lib/termitool/meta/meta.ex:332: Termitool.Meta.get_user_by/1
         (termitool) lib/termitool/meta/meta.ex:439: Termitool.Meta.get_by_username/1
         (termitool) lib/termitool/meta/meta.ex:465: Termitool.Meta.username_password_auth/2

Basically, connection errors at random places in the application. I can avoid the problem by setting timeout: :infinity in the repo configuration, but that seems wrong and dangerous to me.

The code is basically:

Repo.transaction(fn ->
  stream
  |> Stream.map(fn {row, idx} -> update_or_create_row(row) end)
end,
timeout: :infinity)

I am using a stream because that’s what I get from the CSV parsing library.

What am I doing wrong?

Best regards,
CK


#2

This looks strange to me: the last step in such a processing pipeline should always be an Enum call. Stream.map/2 is lazy and just returns a composed stream; nothing is executed until something consumes it.

Are you sure any actual work is done by this code in production?
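The laziness is easy to see in IEx:

```elixir
# Stream.map/2 only builds a lazy description of the work; nothing is printed here.
s = Stream.map([1, 2, 3], fn x ->
  IO.puts("processing #{x}")
  x * 2
end)

# Only when an Enum function consumes the stream does the work actually run:
Enum.to_list(s)
# prints "processing 1" through "processing 3" and returns [2, 4, 6]
```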


#3

Sorry - it’s an Enum.map, not a Stream.map

Edit: ok, I’m at the PC again.

Sorry, I simplified the code a bit; in fact it is more like this:

Repo.transaction(fn ->
  stream
  |> Stream.map(&create_or_replace/1)
  |> Enum.reduce(…)
end,
timeout: :infinity)

The reduce counts and throws away the success results and collects the error results with their changesets.


#4

Nothing, nobody? Am I missing some important information?


#5

Is it the transaction that times out or the individual query?

As far as I remember from DB lessons at university, individual queries inside a transaction might take longer than outside of one, due to the double bookkeeping of indexes or linear scans over rows that are only visible inside the transaction but not yet committed to the database. Can you check whether the problem persists if you increase the timeout of the individual queries?
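To illustrate the per-query knob (a sketch: ImportRow and its fields are made up; :timeout is in milliseconds and overrides the repo default for that one call):

```elixir
import Ecto.Query

# Hypothetical lookup from the import; this query alone may take up to 60 s.
Repo.all(from(r in ImportRow, where: r.batch_id == ^batch_id), timeout: 60_000)

# The same option works for single-row reads:
Repo.get_by(ImportRow, [batch_id: batch_id], timeout: 60_000)
```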


#6

There are some tools for handling large imports, for example GenStage, which provides back-pressure support.

I would also reach for insert_all, because it’s more efficient to send one query with 10,000 inserts than 10,000 single inserts.

Enum.chunk_every can be useful for processing the data in smaller batches.

But none of this fits well inside one huge transaction.
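A batched insert along those lines might be sketched like this (MyApp.Repo, Row, and row_to_map/1 are hypothetical names, and conflict handling is left out):

```elixir
rows
|> Stream.map(&row_to_map/1)     # each element becomes a plain map of column values
|> Stream.chunk_every(1_000)     # keep batches well under PostgreSQL's 65_535-parameter cap
|> Enum.each(fn chunk ->
  # one INSERT with 1_000 rows instead of 1_000 single INSERTs
  MyApp.Repo.insert_all(Row, chunk)
end)
```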


#7

The timeouts appear all over the place during the import job, in parts of the application that have nothing to do with the import and are not wrapped in the transaction. The queries inside the transaction do not seem to time out; I did not see the connection errors there.


#8

insert_all is not a viable solution in this case: I either update existing rows or insert new ones, and not all into one table. For example, one CSV row can result in 4 rows across 4 different tables.


#9

Instead of inserting rows one at a time, I keep them in a list and process them all in one go.

If you have different tables, you can still keep multiple lists (one per table) and run multiple insert_all calls.

I had a similar constraint (a huge text file, multiple tables, over 1,000,000 records) with update-or-create semantics. I had to be careful about race conditions when processing the file concurrently.

After trying tasks and poolboy, I finally set up my import pipeline with GenStage and a couple of insert_all calls.
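To sketch the multiple-lists idea (every name here is hypothetical; csv_row_to_records/1 stands in for whatever turns one CSV row into its target-table records):

```elixir
rows
|> Enum.flat_map(&csv_row_to_records/1)   # one CSV row -> [{schema, attrs}, ...]
|> Enum.group_by(fn {schema, _attrs} -> schema end, fn {_schema, attrs} -> attrs end)
|> Enum.each(fn {schema, attrs_list} ->
  MyApp.Repo.insert_all(schema, attrs_list)  # one bulk insert per target table
end)
```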


#10

Hm. Race conditions on the data are a non-issue in this case. But I will overhaul my import pipeline, thanks for your input…

That said, I think I found the underlying cause of the problem. Your hint about importing the data concurrently gave me the idea: the CSV library uses workers to parallelize reading and parsing the CSV. That means my Stream.map() executions are parallelized as well.

Parallelized execution means different processes are involved, and thus different connections from Ecto's pool. And since the server has real traffic and more cores than my workstation, that means more stream workers on top of more user connections, so the pool (I was using the default size of 15) could be exhausted pretty quickly.

And indeed: if I reduce the pool size on my workstation to 2 and the timeout to 1 second, I get the same errors all over the place.

To avoid that I now pass the :caller option on all repo calls, and it works like a charm on my dev machine. I still have to test it in production, though :wink:

This also means my import did not run inside the transaction anyway…
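For reference, the :caller idea looks roughly like this (a sketch; how create_or_replace/2 forwards the option is an assumption on my part):

```elixir
parent = self()  # the process that checks out the connection via Repo.transaction

Repo.transaction(fn ->
  stream
  # the CSV library may run these lambdas in worker processes; passing the
  # parent pid lets their repo calls reuse the parent's connection
  |> Stream.map(fn row -> create_or_replace(row, caller: parent) end)
  |> Enum.to_list()
end, timeout: :infinity)

# ...and inside create_or_replace/2, every repo call forwards the option, e.g.:
#   Repo.get_by(Record, [key: row.key], caller: opts[:caller])
```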


#11

Dammit.

The errors appeared again, random connection drops by Postgrex:

11:38:50.415 [error] Postgrex.Protocol (#PID<0.425.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed
11:38:52.415 [error] Postgrex.Protocol (#PID<0.431.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed
11:38:54.415 [error] Postgrex.Protocol (#PID<0.418.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed
11:38:56.415 [error] Postgrex.Protocol (#PID<0.424.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed
11:38:58.415 [error] Postgrex.Protocol (#PID<0.433.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed
11:39:00.415 [error] Postgrex.Protocol (#PID<0.426.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed

And this time there was not even an import job running; the only thing I had changed was removing timeout: :infinity from the repo configuration.

What’s going on?! :flushed:


#12

Other applications using Ecto on the same server work like a charm… none of my queries seems to take more than 20 ms, most are below 10 ms. The :pool_size is 50. I’m seriously confused.


#13

Are you running in a Docker container? I discovered a bug in the official Elixir 1.8.1 container, which is backed by Erlang 21.3, that caused SSL problems like these in production. I had to switch to a custom image running Erlang 21.2.7.


#14

A fishing expedition here…

But check your OTP and Elixir versions and make sure that you are on the latest patch release of whatever you are using… (there was recently a transient bug producing ConnectionErrors, fixed in a later OTP release…)

Also, can you check/post mix hex.outdated for the relevant dependencies…

Also, is this using distillery, and which version?


#15

Nope, bare metal with Ubuntu 18.04 LTS:

ckruse@ares:/var/www/termitool$ erl -version
Erlang (SMP,ASYNC_THREADS,HIPE) (BEAM) emulator version 10.3
ckruse@ares:/var/www/termitool$ elixir --version
Erlang/OTP 21 [erts-10.3] [source] [64-bit] [smp:48:48] [ds:48:48:10] [async-threads:1] [hipe]

Elixir 1.8.1 (compiled with Erlang/OTP 20)
ckruse@ares:/var/www/termitool$

#16

I’m not using distillery (I haven’t had the time to wrap my head around it yet); I just checked out the source and compiled it on the server, using mix phx.server via systemd to start the service.

ckruse@ares:/var/www/termitool$ mix hex.outdated
Dependency        Current  Latest  Update possible
absinthe_phoenix  1.4.3    1.4.3
absinthe_plug     1.4.6    1.4.6
appsignal         1.9.4    1.9.4
arc               0.11.0   0.11.0
arc_ecto          0.11.1   0.11.1
bamboo            1.2.0    1.2.0
bamboo_smtp       1.6.0    1.6.0
bcrypt_elixir     1.1.1    2.0.1   No
cachex            3.1.3    3.1.3
comeonin          4.1.2    5.1.1   No
csv               2.2.0    2.2.0
dataloader        1.0.6    1.0.6
ecto_sql          3.0.5    3.0.5
ex_phone_number   0.2.0    0.2.0
exldap            0.6.3    0.6.3
gettext           0.16.1   0.16.1
guardian          1.2.1    1.2.1
jason             1.1.2    1.1.2
phoenix           1.4.2    1.4.2
phoenix_ecto      4.0.0    4.0.0
phoenix_pubsub    1.1.2    1.1.2
plug              1.7.2    1.7.2
plug_cowboy       2.0.1    2.0.1
porcelain         2.0.3    2.0.3
postgrex          0.14.1   0.14.1
timex             3.5.0    3.5.0

#17

I think you are hitting this OTP 21.3 bug that is just emerging, also mentioned in the “Heads up: problems with the official Elixir docker images” thread:

https://bugs.erlang.org/browse/ERL-883
https://bugs.erlang.org/browse/ERL-884 (identical, but with a patch)

Can you downgrade to OTP 21.2.7? Or wait for the patch release (presumably later today)…


#18

Huh…

{:bad_action_from_state_function, {:reply, :undefined, {:error, :timeout}}}

I’ve actually been getting this error nightly for a few days now; I hadn’t had the time to look into it yet. So it seems very likely that you are right.

Phew, you just saved me from some headaches…


#19

OK, after downgrading, everything just works. Thanks for the hint, @outlog and @engineeringdept!