I’m deep into debugging a very high memory usage problem in a group of GenServers.
There are two types of GenServer implementations I’m examining:
# module:
MessageEngine.Thought
# example state:
%DB.Thought{__meta__: #Ecto.Schema.Metadata<:loaded, "thoughts">, active: false,
score: 0.35795454545454547,
conversation: #Ecto.Association.NotLoaded<association :conversation is not loaded>,
conversation_id: 1621, id: 129158,
inserted_at: #Ecto.DateTime<2017-03-07 21:32:19>,
lost_against: %{"129129" => [51952, 51955, 51955, 51938, 51931, 51951, 51944], ...},
message: #Ecto.Association.NotLoaded<association :message is not loaded>,
message_id: 12748,
text: "Yes because will she listen to them or the people.",
updated_at: #Ecto.DateTime<2017-03-07 21:44:21>,
user: #Ecto.Association.NotLoaded<association :user is not loaded>,
user_id: 51959, vector: [],
won_against: %{"129129" => [51946, 51934, 51934, 51942, 51954, 51957], ...}}
# module:
MessageEngine.User
# example state:
%DB.MessageUser{__meta__: #Ecto.Schema.Metadata<:loaded, "messages_users">,
accepting_choices: false,
all_choices: [%{"c" => 129138, "nc" => 129154}, ...],
comparisons: [%{"a" => 129138, "b" => 129154}, ...],
conversation: #Ecto.Association.NotLoaded<association :conversation is not loaded>,
conversation_id: 1621, id: 132055,
inferred_choices: [%{"c" => 129138, "nc" => 129154}, ...],
manual_choices: [%{"c" => 129138, "nc" => 129130}, ...],
message: #Ecto.Association.NotLoaded<association :message is not loaded>,
message_id: 12748, rid: nil,
user: #Ecto.Association.NotLoaded<association :user is not loaded>,
user_id: 51959}
I don’t want to dig too deeply into why the states are shaped the way they are; suffice it to say that they have been well researched and tested, and explaining them would require too much industry context.
Now, we have been monitoring our app in production for a while, and noticed that, as the number of these processes alive increases, memory usage goes up almost exponentially.
With 600 MessageEngine.User processes and 600 MessageEngine.Thought processes, we measured almost 35GB of RAM being used across the cluster.
I first tried to measure the amount of memory used just by the state of each process, but the state doesn’t seem to hold nearly enough data to have that substantial an impact.
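For reference, here is a rough sketch of how that comparison can be made from IEx. It assumes the processes are ordinary OTP processes, so `:sys.get_state/1` works on them; `MemProbe` is a hypothetical helper name, not something from the app. `:erts_debug.flat_size/1` is a debugging function that returns a term's size in words, and `Process.info/2` reports what the VM attributes to the whole process.

```elixir
defmodule MemProbe do
  # Hypothetical helper for illustration; not part of the application.
  # Compares the size of a process's state with the memory the VM
  # attributes to the whole process (heap, stack, mailbox, bookkeeping).
  def compare(pid) do
    # Works for any OTP-compliant process such as a GenServer or Agent.
    state = :sys.get_state(pid)
    word_size = :erlang.system_info(:wordsize)

    # flat_size/1 returns a size in words and ignores sharing of subterms.
    state_bytes = :erts_debug.flat_size(state) * word_size

    info = Process.info(pid, [:memory, :total_heap_size, :message_queue_len])

    %{
      state_bytes: state_bytes,
      process_bytes: info[:memory],
      heap_bytes: info[:total_heap_size] * word_size,
      message_queue_len: info[:message_queue_len]
    }
  end
end
```

Calling `MemProbe.compare(pid)` on one of the MessageEngine.User processes puts the two numbers side by side. A large gap between `state_bytes` and `process_bytes` could point at heap garbage that has not been collected yet, or at a mailbox backlog, rather than at the state itself.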
I popped into observer to learn more, and ran the following tests:
30 users and 30 thoughts
- With:
length(MessageEngine.User.all_choices) = 0
length(MessageEngine.User.manual_choices) = 0
length(MessageEngine.User.inferred_choices) = 0
length(MessageEngine.User.comparisons) = 0
One MessageEngine.User process was consuming 139KB of memory
One MessageEngine.Thought process was consuming 3KB of memory
- With:
length(MessageEngine.User.all_choices) = 53
length(MessageEngine.User.manual_choices) = 20
length(MessageEngine.User.inferred_choices) = 33
length(MessageEngine.User.comparisons) = 53
One MessageEngine.User process was consuming 502KB of memory
One MessageEngine.Thought process was consuming 25KB of memory
300 users and 300 thoughts
- With:
length(MessageEngine.User.all_choices) = 0
length(MessageEngine.User.manual_choices) = 0
length(MessageEngine.User.inferred_choices) = 0
length(MessageEngine.User.comparisons) = 0
One MessageEngine.User process was consuming 1089KB of memory
One MessageEngine.Thought process was consuming 6KB of memory
- With:
length(MessageEngine.User.all_choices) = 53
length(MessageEngine.User.manual_choices) = 20
length(MessageEngine.User.inferred_choices) = 33
length(MessageEngine.User.comparisons) = 53
One MessageEngine.User process was consuming 4023KB of memory
One MessageEngine.Thought process was consuming 41KB of memory
So, as you can see, not only does per-process memory usage scale up a lot just from adding ~50 maps to a list, the usage of each process also seems to depend on the number of processes alive! Going from 30 to 300 of each process type, an order of magnitude more, increases the memory usage of each MessageEngine.User process by roughly 8x, while each MessageEngine.Thought process grows by about 2x or less.
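Dividing the 300-process figures by the 30-process figures reported above makes the growth per module explicit. The snippet below is only arithmetic on those numbers; the module name `Growth` and the labels are made up for illustration.

```elixir
defmodule Growth do
  # Per-process memory in KB as {at 30 processes, at 300 processes},
  # copied from the observer measurements above.
  @measurements [
    user_empty: {139, 1089},
    user_full: {502, 4023},
    thought_empty: {3, 6},
    thought_full: {25, 41}
  ]

  # Growth factor of each measurement when going from 30 to 300 processes.
  def ratios do
    Map.new(@measurements, fn {label, {at_30, at_300}} ->
      {label, Float.round(at_300 / at_30, 1)}
    end)
  end
end

IO.inspect(Growth.ratios())
```

The factors come out at about 7.8 and 8.0 for MessageEngine.User, and 2.0 and 1.6 for MessageEngine.Thought, so the dependence on process count is much stronger for the user processes.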
This seems like really weird behavior to me, and I’m kinda stuck on where to go next, because, by my calculations, the memory usage of the state of these processes should be more like 20-50KB each (I used this guide: http://erlang.org/doc/efficiency_guide/advanced.html#id68680).
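As a cross-check on the hand calculation, the VM can size a synthetic state with the same list lengths as the larger test. This is a sketch under assumptions: `StateEstimate` is a hypothetical module, the IDs are invented, and only the four list fields are modeled, so the result is an estimate rather than the size of a real `%DB.MessageUser{}`.

```elixir
defmodule StateEstimate do
  # Hypothetical stand-in for the four list fields of the user state;
  # the IDs are made up and only the list lengths match the larger test.
  def sample_state do
    choice = fn i -> %{"c" => 129_000 + i, "nc" => 129_100 + i} end
    pair = fn i -> %{"a" => 129_000 + i, "b" => 129_100 + i} end

    %{
      all_choices: Enum.map(1..53, choice),
      manual_choices: Enum.map(1..20, choice),
      inferred_choices: Enum.map(1..33, choice),
      comparisons: Enum.map(1..53, pair)
    }
  end

  # Size of a term in bytes: flat_size/1 counts words, so multiply by word size.
  def bytes(term) do
    :erts_debug.flat_size(term) * :erlang.system_info(:wordsize)
  end
end

IO.puts(StateEstimate.bytes(StateEstimate.sample_state()))
```

If the printed value lands in the same range as the hand calculation, the state alone cannot account for the megabytes that observer reports per process.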
Here’s a full dump of the state of a process that was using 4023kb of RAM: https://gist.github.com/pdilyard/92a04ccad39be87d05e466ed4dbea193
Any help would be greatly appreciated.