Are there any env vars I can set to reduce memory used by the VM?

OldhamMade · June 20, 2019, 1:44pm

I’m hosting a project on a Heroku hobby instance, and I’m starting to get warnings about exceeding the memory quota on the dyno. Thing is, the spikes I’m seeing seem quite random; the average load on the dyno at the time isn’t any higher than at other times, the service isn’t seeing a spike in requests, and the memory spikes don’t coincide with dyno restarts.

Are there any env vars I can set for the BEAM that can help to keep memory usage down on machines with limited RAM/swap space?

I’d like to avoid moving up to the next dyno level if possible. $7/m isn’t too much of an out-of-pocket expense for a project that doesn’t make any money, but the next level is $25/m… that’s a bit too costly!

michalmuskala · June 21, 2019, 7:16am

Unfortunately there are no magic env vars like that. From your description, though, it seems to me like maybe you have an endpoint that is very memory-hungry. My usual suspects to look for would be loading a lot of data from the database with something like Repo.all on a huge table.

outlog · June 21, 2019, 7:47am

There is the fullsweep_after system flag, which can also be set through the ERL_FULLSWEEP_AFTER env var: http://erlang.org/doc/man/erlang.html#system_flag-2 (can’t deep link for some reason…)

But I would be very careful with it, to the point of warning against using it… It treats the symptoms rather than the problem, and it can create a whole lot of other problems: increased CPU usage, higher and less predictable latency, etc.
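For anyone who wants to look before touching anything, here is a minimal sketch of reading and changing the VM-wide value from IEx. The value 20 is only an illustration taken from the range the Erlang docs suggest, not a recommendation, and a runtime change only applies to processes spawned afterwards.

```elixir
# Read the current VM-wide default; the result is a tagged tuple.
{:fullsweep_after, current} = :erlang.system_info(:fullsweep_after)
IO.puts("VM-wide fullsweep_after: #{current}")

# Change it at runtime. system_flag/2 returns the previous value, and the
# new value only affects processes spawned after this call.
# 20 is an illustrative value, not a tuned one.
previous = :erlang.system_flag(:fullsweep_after, 20)
IO.puts("previous value was: #{previous}")

# Put the original value back so this snippet leaves the VM unchanged.
:erlang.system_flag(:fullsweep_after, previous)
```

Setting ERL_FULLSWEEP_AFTER in the environment achieves the same VM-wide default at boot, without any code change.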

What is the nature of your system? Do you have periodic cron/quantum jobs that cause the spike? Is it one user that has an obscene amount of data, etc.?

so identify what process/code is responsible for the spike…
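One way to start narrowing that down, sketched under the assumption that you can get a remote IEx session (or log this periodically): list the processes holding the most memory. The module name MemoryTop is made up for this example.

```elixir
defmodule MemoryTop do
  # Hypothetical helper: returns the `n` processes using the most memory,
  # largest first, as {pid, info_map} tuples.
  def top(n \\ 5) do
    Process.list()
    |> Enum.map(fn pid ->
      # Process.info/2 returns nil if the process exited in the meantime.
      case Process.info(pid, [:memory, :registered_name, :current_function]) do
        nil -> nil
        info -> {pid, Map.new(info)}
      end
    end)
    |> Enum.reject(&is_nil/1)
    |> Enum.sort_by(fn {_pid, info} -> info.memory end, &>=/2)
    |> Enum.take(n)
  end
end

# VM-wide breakdown in bytes (total, processes, binary, ets, ...).
IO.inspect(:erlang.memory(), label: "memory")
IO.inspect(MemoryTop.top(5), label: "top processes")
```

If the `binary` figure from `:erlang.memory()` is what grows during a spike, that points at reference-counted binaries rather than process heaps.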

OldhamMade · June 21, 2019, 7:47am

Unfortunately there are no magic env vars like that.

Ah, that’s a shame. Good to know, though.

loading a lot of data from the database

Thanks for the pointer. This app doesn’t actually have a database as such, though it does transfer Redis metrics to a postgresql db for long-term storage each night. Checking the metrics, I’m not seeing any load increases during that window, but I’ll be sure to investigate whether things are “cleaning up” after completion.

Interestingly, after posting I was reviewing code and found ERL_COMPILER_OPTIONS='native' was set in the deploy config, hidden away in a rarely-touched config file. I removed that and redeployed ~12hrs ago. Average response time has gone up slightly (though still generally <10ms); however, memory usage, swap usage, and average load on the server have all gone down.

outlog · June 21, 2019, 7:50am

also what OTP/elixir is this… upgrading could help a bit here and there…

OldhamMade · June 21, 2019, 7:59am

do you have periodic cron/quantum jobs that cause the spike

No, it seems to be truly random. I do have a nightly quantum job that migrates Redis metrics to postgresql for persistence, but the spikes aren’t around that time. I can go for days without issue, then I’ll experience an outage and a 3-4x memory spike. Bounce the server and it’ll run fine again… till the next time.

Traffic-wise, there’s no peak at the time of the spike; requests are generally quite consistent.

is it one user that has an obscene amount of data

Yeah, good thought. I should’ve mentioned in the original post: this is a true “micro-service”, so there’s no auth, and very little processing, mostly reading binary data and returning it as json.

there is the fullsweep_after or ERL_FULLSWEEP_AFTER

Actually, this is very interesting. From the docs (I’ve bolded the parts relevant to my case):

A few cases when it can be useful to change fullsweep_after:

If binaries that are no longer used are to be thrown away as soon as possible. (Set Number to zero.)

A process that mostly have short-lived data is fullsweeped seldom or never, that is, the old heap contains mostly garbage. To ensure a fullsweep occasionally, set Number to a suitable value, such as 10 or 20.

In embedded systems with a limited amount of RAM and no virtual memory, you might want to preserve memory by setting Number to zero.

I think removing ERL_COMPILER_OPTIONS='native' may be the resolution, but if not I’ll try setting ERL_FULLSWEEP_AFTER=0 and see how I get on, and report back on this thread if I do (others may find the results useful).

OldhamMade · June 21, 2019, 8:01am

also what OTP/elixir is this… upgrading could help a bit here and there…

I’m using the latest Heroku offers:

erlang version = 21.3.7
elixir version = 1.8.2

LostKobrakai · June 21, 2019, 8:04am

Do you use long-running processes to handle those binaries? There are known situations where large binaries handled by long-running processes stick around because no GC is happening. There was a recent topic talking a bit about that (and a potential solution): When is 'Hibernation' of Processes useful?
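To make that concrete, here is a rough sketch of the two usual ways to get a long-running process to release binaries: returning `:hibernate` from a GenServer callback, or forcing a collection from outside. BinaryWorker and its checksum function are invented for illustration; hibernating after every call only makes sense for processes that are mostly idle.

```elixir
defmodule BinaryWorker do
  use GenServer

  # Hypothetical worker that touches a large binary on each call.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def checksum(pid, bin), do: GenServer.call(pid, {:checksum, bin})

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:checksum, bin}, _from, state) do
    # Returning :hibernate makes the process do a full garbage collection
    # and shrink its heap once the reply has been sent, which drops its
    # references to binaries it no longer needs.
    {:reply, :erlang.crc32(bin), state, :hibernate}
  end
end

{:ok, pid} = BinaryWorker.start_link()
crc = BinaryWorker.checksum(pid, :binary.copy(<<0>>, 1_000_000))
IO.inspect(crc, label: "crc32")

# Alternative without hibernation: force a collection from outside.
true = :erlang.garbage_collect(pid)
```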

OldhamMade · June 21, 2019, 8:29am

Do you use long-running processes to handle those binaries?

Yes… poolboy is used to manage a pool of workers which access the binaries. It’s not so bad that they’re hanging around in memory; they need to be accessed constantly – my app rarely drops below 40 reqs/s. With this model, the memory usage is pretty stable, varying +/- 10%. Until it randomly jumps by 3 or 4 times.

Thanks for the link to the topic though, I’ll give it a read!

LostKobrakai · June 21, 2019, 8:30am

Ah ok. I thought those backup processes might hang around doing nothing for nearly a day.

OldhamMade · June 21, 2019, 8:42am

Thought a picture might help (and could be interesting for readers):

The yellow box is where the nightly quantum job happens. The label underneath each graph marks the point of deployment of the new build without the ERL_COMPILER_OPTIONS='native' env var set. As you can see, since this deploy the load average has been lower than before. Throughput is down, but it tends to roll in waves, so another 5hrs should see it peak again. I’m looking forward to seeing whether there’s another memory incident with the flag disabled.

outlog · June 21, 2019, 9:19am

Always interesting… I suggest gathering data for a few days before toying around with fullsweep_after… It might also be good to figure out how to set fullsweep_after only on the worker processes…

OldhamMade · June 21, 2019, 9:31am

Good call. Hadn’t considered that.

outlog · June 21, 2019, 10:02am

I think you should be able to pass spawn_opt to GenServer.start_link in your poolboy setup: http://erlang.org/doc/man/erlang.html#spawn_opt-2
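Something along these lines, as a sketch rather than a drop-in: PoolWorker is a made-up stand-in for whatever worker module poolboy starts, and the only relevant part is the `spawn_opt` passed to `GenServer.start_link/3`.

```elixir
defmodule PoolWorker do
  use GenServer

  # Hypothetical worker module. poolboy starts workers by calling
  # start_link/1 with the configured worker args.
  def start_link(args) do
    # fullsweep_after: 0 applies to this process only, not the whole VM.
    GenServer.start_link(__MODULE__, args, spawn_opt: [fullsweep_after: 0])
  end

  @impl true
  def init(args), do: {:ok, args}
end

# Start one worker directly and confirm the per-process GC setting.
{:ok, pid} = PoolWorker.start_link([])
{:garbage_collection, gc} = Process.info(pid, :garbage_collection)
IO.inspect(gc[:fullsweep_after], label: "worker fullsweep_after")
```

This keeps the VM-wide default untouched, so only the pool workers pay the extra GC cost.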

OldhamMade · June 27, 2019, 3:57pm

So, quick follow-up to this thread. It looks like ERL_COMPILER_OPTIONS='native' was the culprit. I’ve had 0 issues since rebuilding without that flag, I’ve hit the lower memory quota only twice since (during automated restarts) and haven’t hit the higher quota, and throughput is steady, topping out at ~69reqs/s.

I’ve not needed to set ERL_FULLSWEEP_AFTER=0, but it’s good to have it “in the bag” for future issues.

Next steps are to move from elixir 1.8 to 1.9 using a “real” release (current deploy is naively using mix phx.server because I’m lazy and this is a side-project) and from OTP 21 to 22 (which has only this week become available on Heroku).