Weird memory leak issue

ream88 · August 3, 2022, 3:12pm

Hey everybody,

so I have a really weird memory leak problem: I have an app that calculates a schedule for a given month. To do that, I have 350+ users who assign themselves to free slots in predefined shifts. Basically there is a 1:n relation between shifts and slots, and a 1:1 relation between slots and users. A standard month therefore has ~50k slots to check and assign users to, based on some application logic rules. Rules are just modules that follow a certain behaviour and return a score; the user with the highest score then “wins” the slot and is assigned to the shift. So far so good, I have unit tests for everything. Now the weird thing is that after the calculation is finished I always run into a weird memory leak, see the image below.

At first this happened when starting a supervised Task from my LiveView, which would take care of the calculation. Thanks to @hubertlepicki I was able to figure this problem out: it looks like copying a lot of Ecto records, including associations, from one process to another is not a very good idea.
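A minimal sketch of the copying issue described above, using plain maps instead of Ecto records. The `load_data` function is a hypothetical stand-in for whatever loads the schedule data, and `Task.async/1` is used here only for brevity in place of a supervised Task. Anything a task's closure captures is copied onto the new process's heap, so one way to sidestep the copy is to pass only a small key and load the data inside the task:

```elixir
# Hypothetical loader standing in for the real database queries.
load_data = fn month -> for day <- 1..28, do: %{month: month, day: day} end

# Variant 1: `data` is built in the caller and captured by the closure,
# so the whole list is copied into the task process.
data = load_data.(8)
copied = Task.async(fn -> length(data) end) |> Task.await()

# Variant 2: only the small integer is handed over; the list is built
# inside the task process and never crosses a process boundary.
loaded_inside = Task.async(fn -> 8 |> load_data.() |> length() end) |> Task.await()

IO.inspect({copied, loaded_inside})
```

Both variants compute the same result; they differ only in where the large term lives and whether it has to be copied.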

Now I even get the memory leak in iex when I try to autocomplete like this:

scheduler.sche...

scheduler is a state struct of the Scheduler module, which basically looks like this:

defstruct schedule: [], rules: [], slots: []

Are 50k+ records in memory too much for the BEAM to handle? Also, why does the autocomplete crash, while accessing the field without autocomplete does not? Could it be that autocomplete tries to copy the data to another process? And yes, phx.server also crashes with the memory leak after the calculation; I'm not sure where the copying happens there yet.

al2o3cr · August 3, 2022, 3:41pm

Unless those records are QUITE substantial (19GB memory / 50k records = 380 kilobytes apiece) this doesn’t seem like the whole story.

ream88 · August 3, 2022, 4:05pm

Is there a way to get the memory size of a variable? The things I tried always resulted in further crashes. Or could it be that preloading all the associations leads to an endless cycle? I must say I’m not 100% familiar with Ecto preloading.

al2o3cr · August 3, 2022, 4:39pm

If the associations that you’re preloading involve the same records more than once, you might encounter headaches due to loss of sharing. There’s a good discussion in the BEAM efficiency guide with some examples of getting the byte-size of various terms.

An example of what I’m thinking of in the standard posts/comments/authors pattern would look like:

Repo.all(Post) |> Repo.preload(comments: [author: :posts])

Here, there are going to be a LOT more resulting Post structs than rows in the Post table since every comment has a new set of the author’s posts.

In some ORMs in languages with mutable data this can be made efficient with an identity map, resulting in a cyclic graph of objects - but that shape is not available in Elixir.
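To make the blow-up concrete, here is a rough simulation of the nested shape that such a preload produces, using plain maps rather than Ecto. All row counts and field names are made up for illustration:

```elixir
# Plain-map stand-ins for table rows: 100 posts and 1,000 comments,
# spread evenly over 10 authors.
posts = for id <- 1..100, do: %{id: id, author_id: rem(id, 10) + 1}

comments =
  for id <- 1..1_000 do
    %{id: id, post_id: rem(id, 100) + 1, author_id: rem(id, 10) + 1}
  end

posts_by_author = Enum.group_by(posts, & &1.author_id)
comments_by_post = Enum.group_by(comments, & &1.post_id)

# Mimic the shape of `comments: [author: :posts]`: every comment carries
# its author, and every author carries the full list of their posts.
nested =
  for post <- posts do
    nested_comments =
      for comment <- Map.get(comments_by_post, post.id, []) do
        author = %{
          id: comment.author_id,
          posts: Map.get(posts_by_author, comment.author_id, [])
        }

        Map.put(comment, :author, author)
      end

    Map.put(post, :comments, nested_comments)
  end

# Count how many post entries are reachable below the top-level posts.
nested_post_count =
  nested
  |> Enum.flat_map(& &1.comments)
  |> Enum.map(&length(&1.author.posts))
  |> Enum.sum()

IO.inspect(nested_post_count, label: "nested post entries")
```

With these numbers, 100 post rows turn into 10,000 nested post entries. In this sketch the repeated lists are shared references within one process, so the cost mainly shows up once the term is flattened, for example when it is copied to another process.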

benwilson512 · August 3, 2022, 4:45pm

:erts_debug.size/1 will emit the size of a variable in words.
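A small sketch of how that could be used. `TermSize` is a hypothetical helper module, not part of any library. It converts words to bytes via `:erlang.system_info(:wordsize)` and also reports `:erts_debug.flat_size/1`, which ignores sharing and so approximates what the term would occupy after being copied to another process (assuming the default runtime, which does not preserve sharing on copy):

```elixir
defmodule TermSize do
  # Hypothetical helper for illustration only.

  # Bytes the term occupies on the heap, counting shared subterms once.
  def shared_bytes(term) do
    :erts_debug.size(term) * :erlang.system_info(:wordsize)
  end

  # Bytes the term would occupy with all sharing expanded.
  def flat_bytes(term) do
    :erts_debug.flat_size(term) * :erlang.system_info(:wordsize)
  end
end

shared = Enum.to_list(1..1_000)

# The same list is referenced 100 times but stored only once.
nested = List.duplicate(shared, 100)

IO.inspect(TermSize.shared_bytes(nested), label: "shared bytes")
IO.inspect(TermSize.flat_bytes(nested), label: "flat bytes")
```

A large gap between the two numbers is a hint that the term relies heavily on sharing and will grow when sent between processes.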

Without seeing your code it’s a bit hard to guess at things like this. I would turn on debug ecto logging and make sure that you aren’t seeing unexpected querying.
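As a sketch, debug query logging could be enabled with something like the following in a dev config. The application name `:my_app` and repo module `MyApp.Repo` are placeholders and need to match your own project:

```elixir
# config/dev.exs
import Config

# Let Logger emit debug-level messages.
config :logger, level: :debug

# Log every query issued through this repo at the :debug level.
config :my_app, MyApp.Repo, log: :debug
```

With this in place, each query shows up in the console, which makes repeated or unexpectedly large preload queries easier to spot.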

ream88 · August 3, 2022, 5:34pm

I will try to share more code, but :erts_debug.size/1 now already runs for more than 10 minutes, so I guess the data is too complex.

ream88 · August 3, 2022, 5:37pm

I guess this is the problem… Thinking now about a way to prevent this somehow. I really wanted to make the code as independent from Ecto as possible. In fact I had a working (in-memory) solution before even adding Ecto to the app.

benwilson512 · August 3, 2022, 5:56pm

Yeah this means you have a giant value in a single variable. This doesn’t sound like a memory leak, this sounds like you’re somehow putting tons of data into a single value.

Exadra37 · August 3, 2022, 6:01pm

If your problem is the Ecto preload, and not a huge dataset in one variable, then you may want to see if the official docs on preload queries can help:

https://hexdocs.pm/ecto/Ecto.Query.html#preload/3-preload-queries

Sourced from this Stackoverflow question:

Another resource I found on preload is this talk:

In the past Ecto preload had issues with memory as per this issue:

github.com/elixir-ecto/ecto

Huge memory spike for parallel preload

opened 12:07PM - 08 May 20 UTC

closed 03:54PM - 14 Jun 20 UTC

ku1ik

I've debugged curious case of OOM error in our system and found out that there's… unexpected, huge memory spike in certain situations with `Repo.preload`.

### Environment

* Elixir version (elixir -v): 1.9.4, 1.10.2
* Database and version: PostgreSQL 11.7
* Ecto version (mix deps): 3.4.3
* Database adapter and version (mix deps): ecto_sql 3.4.3, postgrex 0.15.3
* Operating system: Linux, macOS

### Current behavior

When ALL of the following are true, the 2nd `Repo.preload` causes memory spike of couple of GB:

1. preload twice (preload again after it was already preloaded)
2. the preloads need to have `[has_many_assoc_for_schema_a: [belongs_to_assoc_for_schema_b: has_many_assoc_for_schema_a]`
3. there's more than 1 top level assoc preloaded
4. preloading is executed concurrently (`in_parallel: true`, _which is default_)

With this you can observe the spike (~1 GB in the minimal example app I link below, and we see 3-5 GB in our system). With explicit `in_parallel: false` there's no mem usage at all, and the 2nd preload is basically a no-op like it should be.

I created a minimal project with `ecto` + `ecto_sql`, with couple of schemas that mimics our setup, which allows reproducing this problem in consistent way: https://github.com/sickill/ecto-preload-bug (instructions on how to trigger and observe the spike are in the README).

The amount of records for the top-level has-many assoc for schema "a" plays a role here, with a handful of records the spike is there but rather not very noticable. As the amount of records grows the spike seems to grow exponential-ish.

First preload:

<img width="694" alt="single-preload" src="https://user-images.githubusercontent.com/17589/81404128-1633a900-9135-11ea-988f-6654773f9a7f.png">

All data gets preloaded and you can't even see any mem usage change.

Now, when you try to preload again:

<img width="713" alt="double-preload" src="https://user-images.githubusercontent.com/17589/81404187-382d2b80-9135-11ea-8214-1c6fe074c4d3.png">

### Expected behavior

No significant extra memory usage, like with `in_parallel: false` case.

ream88 · August 3, 2022, 7:12pm

Why is this exactly a problem? I mean, even if I split this up into multiple lists or whatever … having something like a struct as a container above the data ends up in a single giant value again.

benwilson512 · August 3, 2022, 7:31pm

I mean that there is a bug in your code where you’re loading more data than you intend to. There’s no way 50k records, even with preloads, should be taking 19gb. There’s gotta be a bug.

dimitarvp · August 8, 2022, 1:34pm

You got somewhere with this? Super curious.

ream88 · August 30, 2022, 8:55am

Hey, sorry for the long silence. So it was related to using nested schemas and copying them around between processes. I solved it by extracting the associations into their own variables: instead of passing one potentially huge schema into my calculation function, I now pass three schemas. At first I did not like this approach because it felt like dealing with DB problems inside my calculation function, which smelled like mixing concerns. But then I figured that if the three schemas I currently use didn’t come from the same database anyway, I would need to solve it like this. So in the end it’s a better separation of concerns! Thank you all for helping me here!
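For readers landing here later, a rough sketch of what passing separate flat collections could look like. This is not the actual code from my app: the module name, the fields, and the scoring rule are all invented for illustration, and the data is plain maps linked by ids rather than nested associations.

```elixir
defmodule FlatScheduler do
  # Hypothetical sketch: three flat collections go in, and records are
  # joined by id inside the calculation instead of being nested up front.
  def assign(shifts, slots, users) do
    shifts_by_id = Map.new(shifts, &{&1.id, &1})

    for slot <- slots, into: %{} do
      shift = Map.fetch!(shifts_by_id, slot.shift_id)
      winner = Enum.max_by(users, &score(&1, shift))
      {slot.id, winner.id}
    end
  end

  # Placeholder rule: a user scores 1 if the shift falls on their
  # preferred day, 0 otherwise.
  defp score(user, shift) do
    if user.preferred_day == shift.day, do: 1, else: 0
  end
end

shifts = [%{id: 1, day: :mon}, %{id: 2, day: :tue}]
slots = [%{id: 10, shift_id: 1}, %{id: 11, shift_id: 2}]
users = [%{id: 100, preferred_day: :mon}, %{id: 101, preferred_day: :tue}]

assignments = FlatScheduler.assign(shifts, slots, users)
IO.inspect(assignments)
```

Because each record appears exactly once in its own list, the total size stays proportional to the number of rows, no matter how the records reference each other.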