Repeating a select query leads to high memory usage

Hello all,

When I repeat a query that selects a “large” number of records (around 470k in my test), the query works as expected until the nth repetition, which consumes all the RAM allocated to my pod and starts writing to disk. So far I have been unable to figure out the root cause.

My environment:
I test on 2 pods:
Elixir: 1.10.3 - Ecto: 3.7.1 Postgrex: 0.15.13
or
Elixir: 1.11.4 - Ecto: 3.9.0 Postgrex: 0.16.5

AWS RDS PostgreSQL 13.7 - I use multi-tenancy and materialized views.

The queried materialized view represents customers; it has 650K records and is about 100MB in size, with btree indexes on the email and phone columns.

In order to reproduce my issue, I execute this piece of code:

{:ok, pid} = Postgrex.start_link(hostname: ..., username: ..., password: ..., database: ...)
query = "SELECT DISTINCT ON (c0.email) c0.email, c0.phone, c0.country FROM my_schema.customers AS c0 WHERE (NOT (c0.phone IS NULL)) AND (c0.phone != '') AND (c0.email != '');"
Postgrex.query(pid, query, [])

I repeat Postgrex.query(pid, query, []) until it fails; most of the time it fails on the 2nd attempt, and sometimes it happens on the first try.

When the query works fine, it takes a few seconds and I get the result:

{:ok,
 %Postgrex.Result{
   columns: ["email", "phone", "country"],
   command: :select,
   connection_id: 17465,
   messages: [],
   num_rows: 471782,
   rows: [..]
}}

The RAM usage increases by about 80MB, which is consistent; looking at Performance Insights in the AWS console, only CPU is being used.

When the request fails, it times out:

{:error,
 %DBConnection.ConnectionError{
   message: "tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)",
   reason: :error, 
   severity: :error
 }}

All the RAM is consumed and disk I/O increases by 5Mbps; on the Performance Insights side I can see 5% CPU usage and 95% Client:ClientWrite (the client is stuck).

I have also tested using a stream and faced the same behaviour:

Postgrex.transaction(pid, fn(conn) ->
  conn
  |> Postgrex.stream(query, [], max_rows: 5000)
  |> Stream.map(fn %Postgrex.Result{rows: rows} -> rows end)
  |> Enum.to_list()
end)

Has anyone already run into this kind of trouble? Any idea how I can move the investigation forward?

Thanks in advance for your help

Regards

I can’t help you with the RAM issue and why it is occurring. But I’m curious: do you really need all the emails in memory at the same time?

Using streaming can usually achieve your goals with a minimal memory footprint.


When you use a stream and then call Enum.to_list(), everything is just copied into memory, negating the benefits of streams. You should do your work in stream chunks, avoiding loading everything into memory. Could that disk usage be because the OS just ran out of memory and had to kill some processes to stay alive?

Have you tried calling :erlang.garbage_collect() between calls? I don’t know how the Elixir GC works, but maybe it’s not triggering fast enough. You can also run each query inside a Task (another process) and process it there (see the sketch below). When the task exits, all the memory it used is released, because the process that used it dies – except for reference-counted binaries it shared with other processes.
Strings are also binaries in Elixir, and any string larger than 64 bytes would be reference counted, if I have understood how things work in the VM correctly.
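
For illustration, a minimal sketch of that Task idea (pid and query are the same variables as in your first snippet; the 5-minute await timeout is just an arbitrary value I picked):

result_count =
  Task.async(fn ->
    # Run the query in a throwaway process so its heap is reclaimed when it exits
    {:ok, %Postgrex.Result{num_rows: num_rows}} = Postgrex.query(pid, query, [])
    # Return only a small value; returning the full rows would copy them
    # back to the caller and keep them alive there
    num_rows
  end)
  |> Task.await(300_000)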

Maybe the problem is similar to what C#'s .NET had, where automatic detection of memory limits (at least under Docker) caused .NET to allocate too much memory because the GC wasn’t aggressive enough, causing OOM. So setting a memory limit when running Elixir could help. A quick search showed this is the way to do it: Running Elixir with limited heap size - #3 by alvises
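
Not taken from the linked thread itself, just an illustration of what such a limit could look like – a sketch assuming the +hmax emulator flag / :max_heap_size process flag are the mechanism meant there, with an arbitrary 100-million-word limit:

# Cap the default maximum heap size of every process (value is in words);
# a process that exceeds it is killed instead of taking the whole node down
elixir --erl "+hmax 100000000" -S mix run --no-halt

# Or per process, from code:
Process.flag(:max_heap_size, %{size: 100_000_000, kill: true, error_logger: true})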


Hello Dimitarvp,

Thank you for your message.

In our database we use multi-tenancy, so we can have 1 schema per client, and most of the time we don’t have that many emails to process; that’s why we have preferred to get them all so far.

We know we should/could use ‘stream’ to reduce memory usage etc.; as I mentioned, I gave it a try and the behaviour was the same.

It is not the same: in the end you are calling Enum.to_list, which gives you the entire list in memory again, so using streaming becomes meaningless.

You have to combine Repo.stream with processing each email in place (or in batches / chunks of emails). You must not call Enum on the whole result anywhere, or else using streaming only half-way won’t matter at all, as you yourself discovered.

Let me illustrate:

Postgrex.transaction(pid, fn(conn) ->
  conn
  |> Postgrex.stream(query, [], max_rows: 1000)
  |> Stream.map(fn %Postgrex.Result{rows: rows} -> rows end)
  |> Stream.each(fn batch_of_emails ->
    Enum.each(batch_of_emails, fn email ->
      # maybe send message to email here
    end)
  end)
  |> Stream.run() # This runs the whole thing, otherwise nothing would happen.
end)

Since you are already emitting batches, we make use of them and go through each one with Enum.each – the crucial difference here is that you only ever have a single list of 1000 emails in memory, and not the entirety of them all in the database.


Hello Wanton7,

Many thanks for your message and suggestion.

Could that disk usage be because the OS just ran out of memory and had to kill some processes to stay alive?

Yes, this is also my understanding: disk I/O happens when the OS is OOM.

Have you tried calling :erlang.garbage_collect()

Yes, but it didn’t help.

So setting a memory limit when running Elixir could help

Definitely, I’m going to dig into the memory settings.

Many thanks for your explanation, I will give it a try.

Nevertheless, I still don’t understand why, on the same connection, the same query can consume the expected memory (80MB) or, when it fails, all the available memory (>1GB).

Not sure there is much more to understand – just don’t load huge collections of stuff from the DB. :smiley: That’s what full streaming mode is for (the one I demonstrated).

Create a minimal reproducible code sample and put it in a public GitHub repo. I mean, create a new Elixir project and copy over only the code that makes this problem happen. Then also include seeding code for the DB that generates enough data to make the problem appear, simulating your current data (see the sketch below for one way to do that). If you can do this, reproduce the problem, and prove the problem is in Postgrex, then create an issue in the Postgrex GitHub repo about it.
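
For illustration only, a hypothetical seeding snippet (the schema and column names mirror the query above; I used a plain table instead of a materialized view and the data is made up, so adjust it to whatever reproduces your case):

{:ok, pid} = Postgrex.start_link(hostname: "localhost", username: "postgres", password: "postgres", database: "repro")

Postgrex.query!(pid, "CREATE SCHEMA IF NOT EXISTS my_schema", [])

# ~650K fake customers with distinct emails and non-empty phones
Postgrex.query!(pid, """
CREATE TABLE IF NOT EXISTS my_schema.customers AS
SELECT 'user' || i || '@example.com' AS email,
       '+3361234' || i               AS phone,
       'FR'                          AS country
FROM generate_series(1, 650000) AS i;
""", [])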

I googled your error and found this: How to deal with DBConnection.ConnectionError? by Michał Dolata
According to that post, the error you are describing happens when the query timeout is exceeded.

  • the second type usually means that there was a transaction/query that lasted too long and exceeded the timeout. I will refer to them as query timeout.

(DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)

(DBConnection.ConnectionError) client #PID<0.368.0> timed out because it checked out the connection for longer than 15000ms

Still, that RAM usage sounds very odd, even if the problem could be a query timeout. Are you running any other queries against the same DB while you run this code? That could explain why there is a query timeout sometimes.

Thanks wanton7

Yes, I have already “played” with the timeout parameter; I need to set it to 1 minute in order for the query to succeed (see the snippet below).
I believe the timeout is related to the Client:ClientWrite I can observe on the DB graph and is a consequence of the OOM: since the node writes to disk, it slows down the processing and the “dialogue” with the DB.
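
For reference, this is the kind of call I mean (a sketch – :timeout is a standard option on Postgrex.query, and 60 seconds is simply the value that works for me):

# Same query as before, but allow up to 60s instead of the 15s default
Postgrex.query(pid, query, [], timeout: 60_000)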

No, I tested this code at night when there is nothing else running against the DB. Indeed, the RAM usage is my main concern. :frowning: I could probably circumvent it using stream…

Do a minimal repro project like I said. If there is indeed a bug in Postgrex, knowing that and how to reproduce it will be very valuable for its dev team.

Edit: if you try to create a repro but can’t make it appear locally, try some virtualized environment, e.g. VirtualBox with limited memory, with PostgreSQL running on the machine that is running VirtualBox.

You wrote that :erlang.garbage_collect() is not releasing memory. If you call :erlang.memory(:total), then :erlang.garbage_collect(), and then :erlang.memory(:total) again, has nothing changed? If that is the case, I feel like you are keeping a reference to the data you are returning. Are you sure you are just calling Postgrex.query(pid, query, []) and not storing the result in a variable, for example?

I’m also not sure how the VM checks that data has gone out of scope. Is returning data into a function scope, even without putting it into a variable, enough to keep it alive until the function ends? If the amount of memory allocated isn’t changing, I would try creating a function that calls the query and returns nil, and then calling :erlang.garbage_collect() in a different function scope, like this:

do_query() # returns nil
:erlang.garbage_collect()
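
To make that concrete, here is a minimal sketch of what I mean (QueryProbe.do_query/2 is a hypothetical helper, not anything from Postgrex; pid and query are the same as in the first snippet):

defmodule QueryProbe do
  # Hypothetical helper: run the query but deliberately return nil,
  # so the caller cannot keep a reference to the ~470k rows
  def do_query(pid, query) do
    {:ok, _result} = Postgrex.query(pid, query, [])
    nil
  end
end

before = :erlang.memory(:total)
QueryProbe.do_query(pid, query)
:erlang.garbage_collect()
after_gc = :erlang.memory(:total)
IO.puts("memory before: #{before}, after GC: #{after_gc}")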