Phoenix server crashing randomly

True, it’s just that mine (2.6.0) doesn’t use it yet.

I still doubt it’s comeonin though; it might be a hardware problem + logger_file_backend, but that’s a really long shot.

1 Like

I asked on the erlang-questions mailing list. My case looks quite likely to be an OOM issue.

2 Likes

@Nils try something like this on your server; it might shed some light on the issue:

grep -ri 'killed process' /var/log/*
grep -ri total_vm /var/log/*
grep -ri oom /var/log/*

1 Like

I’m wondering if I’m starting my server in production mode incorrectly. htop seems to think there are a number of processes running for the BEAM VM, and this is a single-core machine…

I’m effectively running it like this:

MIX_ENV=prod PORT=4001 elixir -pa _build/prod/consolidated -S mix phoenix.server

1 Like

Those are just threads of the main BEAM OS process; it spawns a few for non-async’able operations, and they remain idle 99% of the time regardless. ^.^

1 Like

By default Erlang will launch one scheduler thread per core and 10 “async threads” to do blocking things like IO.

Additionally, more modern versions will fork a process called “erl_child_setup” that is used for spawning ports, to avoid forking the main VM process, which may be costly.
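
A quick way to check (or override) those numbers, using the standard emulator flags +S (schedulers) and +A (async pool); a minimal sketch, nothing here is specific to this app:

# print scheduler and async-thread counts as the running VM sees them
elixir -e "IO.inspect({:erlang.system_info(:schedulers_online), :erlang.system_info(:thread_pool_size)})"

# start with explicit values instead of the defaults
elixir --erl "+S 1 +A 10" -e "IO.inspect(:erlang.system_info(:thread_pool_size))"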

3 Likes

@yurko thanks, here are the results; not much I can see here. Just one thing I am wondering about from the third command: “release-upgrade invoked oom-killer”. This was logged at around six in the morning and I did not do a release at that time, but maybe it just means that this process was started the day before by this command…

/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.789442] Killed process 18588 (erl_child_setup) total-vm:4368kB, anon-rss:76kB, file-rss:568kB
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.963894] Killed process 18491 (beam) total-vm:2228888kB, anon-rss:55580kB, file-rss:5112kB
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.651762] Killed process 31933 (erl_child_setup) total-vm:4368kB, anon-rss:72kB, file-rss:564kB

/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.779066] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.956619] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647842] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name

/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.778627] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.778778] [] oom_kill_process+0x202/0x3c0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.779066] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.954112] release-upgrade invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.954174] [] oom_kill_process+0x202/0x3c0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.956619] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647551] lxcfs invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647672] [] oom_kill_process+0x202/0x3c0
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647842] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name

It would help if we could at least see some sort of stack trace to know which port/NIF caused the crash, but I could not figure it out from the crash.dump.
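
Not a stack trace, but a couple of things are readable straight from the dump without the viewer (a sketch; the dump file name and location are assumptions):

head -n 5 erl_crash.dump            # the Slogan line records why the VM aborted
grep -c '^=port:' erl_crash.dump    # rough count of ports that were open at dump time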

I had the last crash on Feb 14, so one week ago. After that I started to build the release on my server and not inside a Docker container. So that might be the cause. I still have to investigate what the difference between these two environments is.

DO Server:

Linux service-check 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.1 LTS

Docker:

Linux 831ca2d6d9c0 4.4.43-boot2docker #1 SMP Wed Jan 18 18:33:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.1 LTS

So a different kernel version, but same Ubuntu version.

Both are using

openssl 1.0.2g-1ubuntu4.6

So the next step might be to find out which packages have different versions, or first which system libraries are used by Erlang/Elixir/dependencies, and then compare the versions. But that would be hard to keep in sync later on…
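
If it comes to that, a rough way to see which system libraries the VM and any compiled NIFs actually link against (a sketch; the _build path is just this project’s default build directory):

ERL_ROOT="$(erl -noshell -eval 'io:format("~s", [code:root_dir()]), halt().')"
ldd "$ERL_ROOT"/erts-*/bin/beam*                 # libraries the emulator links against
find _build/prod -name '*.so' -exec ldd {} \;    # run from the project root; covers compiled NIFs (e.g. comeonin)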

1 Like

Ah cool, I was unsure if it was always a set number or some multiple of the cores or something, good to know. :slight_smile:

1 Like

It seems the problem is that the system ran out of memory, which caused the process to be killed, as stated in @gcmurphy’s erlang-questions thread. See this thread http://stackoverflow.com/questions/9199731/understanding-the-linux-oom-killers-logs or google to learn more.

I’m not sure what the reason for it could be, but there is something you can do to prevent it, unless you have a leak.

Run free and see if you have swap enabled. If not, run the following to set up an 8 GB swap file at /var/swapfile (substitute 8M with another count to change the size in GB; use sudo to run these):

dd if=/dev/zero of=/var/swapfile bs=1024 count=8M
mkswap /var/swapfile
swapon /var/swapfile

For this to be activated after reboot, add this line to /etc/fstab:

/var/swapfile swap swap sw 0 0
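
To confirm it is active afterwards (standard util-linux/procps commands):

swapon --show    # or swapon -s on older releases
free -m          # the Swap line should now show the new total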

If the cause of the crash is indeed a memory problem, swap will eliminate it, unless you have something in the app that overflows all available memory. And if it’s not, swap wouldn’t hurt, so it should be safe to try.

3 Likes

Wow, a memory issue? :scream: I never thought about that, since DigitalOcean reported a max memory usage of 66%…

I did not use swap during the crashes. I added it later, since it was needed to compile the server on this small DO instance (only 512 MB). I guess the graphs might be incorrect, since they do not show the spike during compilation on Feb 14. :frowning:
At the moment I am using a swap of 1 GB.

So the next step would be to build the release again in the docker container and deploy it.
Thanks a lot for this analysis, I will report how the release from the docker container works.

I have also attached the full kern.log from the crash: https://www.dropbox.com/s/unpanhy987bm7r9/kern.log?dl=0

Killed process 18588 (erl_child_setup) total-vm:4368kB, anon-rss:76kB, file-rss:568kB

This process only used 4 MB, but the beam process was at over 2 GB of total-vm:

Killed process 18491 (beam) total-vm:2228888kB, anon-rss:55580kB, file-rss:5112kB

Really strange: if I sum up the total_vm column I only get 113191 kB -> 113 MB.
I will take a deeper look at the logs tomorrow. Thank you so much so far!

3 Likes

So you didn’t have a crash after adding the 1 GB swap?

Before going into production I’d disable it without changing anything else and see if the system crashes (wait till it crashes). This way you’d know for sure it’s a memory issue and could probably adjust the app to minimize it.
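
For completeness, turning the swap file from above back off is just the reverse; remember to also remove the /etc/fstab line so it stays off after a reboot:

sudo swapoff /var/swapfile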

As for the logs, “total-vm:2228888kB” looks like too much for a mini droplet :slight_smile:

2 Likes

That’s correct.
I have disabled the swap now, so let’s wait and see what happens.
Also I have looked at all three erl_crash.dumps using the Erlang crashdump viewer, and in all three cases only about 50 MB of memory was used.

But the OOM message from the log file states that the beam process (pid 18491) used 557222 (I guess that’s in kB), so 500 MB, which would explain the memory problem on a DO mini instance.
So I am wondering how there can be such a big difference. I don’t know much about Unix memory management; from what I read, it could mean that this is just the reserved memory, not the actual used memory?
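
For anyone comparing the two views later: the dump carries the VM’s own counters from the moment it was written, while the kernel line reports reserved address space (total-vm) separately from resident memory (anon-rss/file-rss). A rough way to line them up, assuming the dump sits in the app directory:

sed -n '/^=memory/,/^=/p' erl_crash.dump     # the VM's own memory totals at dump time
grep 'Killed process' /var/log/kern.log.1    # the kernel's view: total-vm vs rss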

1 Like

I’m not a memory management guru either, but I’m pretty sure processes don’t get killed unless absolutely needed, and from what we can deduce it’s pretty clear that this is the case here. The OOM killer just tries to save the system while doing the least possible damage: http://unix.stackexchange.com/questions/153585/how-the-oom-killer-decides-which-process-to-kill-first

So I think the crash dump does not mean that it only used 50 MB while being killed; what it means is that the “last thing it remembers” was using 50 MB, and then it probably needed much more and got killed. See this thread http://erlang.org/pipermail/erlang-questions/2012-August/068477.html - when killed by the OOM killer, the dump might not have the most current info.

After (if) you reproduce the crash, check if the app has any in-memory structures that might cause a problem: agents, ETS tables, etc. that are not properly managed. 500 MB is not that much, but we have a few smaller apps running on such droplets and they have no problems (some of them do a lot, though), so it’s possible there is a solvable problem here.

The fact that Erlang just gets killed when it runs out of memory is also seemingly by design, the root “let it crash” case :slight_smile: You can supervise the app and restart it automatically; we even happen to have a wiki for that: Elixir apps as systemd services - info & wiki. But I’d really try to find the issue before setting it up.

2 Likes

Thanks for your links. The second one also describes an interesting step:

If you disable the Linux oom-killer and allow the BEAM to consume memory until it crashes when a memory allocator cannot allocate more memory, you should get an Erlang dump

When the crash happens again, I can also try that and hope that the dump file will show more memory. Unfortunately the server has not crashed since yesterday… The awkward moment when you want the server to crash :smile: :joy:
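
If it comes to that, one way to exempt just the BEAM from the OOM killer while debugging (a sketch; -1000 removes the process from the killer’s consideration entirely, and the pgrep pattern is an assumption about the process name):

echo -1000 | sudo tee /proc/"$(pgrep -n beam)"/oom_score_adj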

I am really wondering what can use so much memory, since this is just a standard Phoenix app with some HTML views, some JSON API endpoints and Postgres queries. Nothing special. I am not directly using ETS, custom Tasks/GenServers or even channels.

I am going to wait a couple of days, before I change anything.

Also there is a thread on DigitalOcean where people are having the same problem with other software. Swap seemed to solve it for some people. I know that 500 MB is not enough for a production site, but I expected it to run smoothly during testing: https://www.digitalocean.com/community/questions/getting-regular-out-of-memory-kill-process-how-to-resolve-this-isse
Are you using swap on your smaller droplets?

One thing I just noticed is that all three crashes happened at almost the same time:
crash 1: Feb 4, at 06:25:20
crash 2: Feb 12, at 06:47:03
crash 3: Feb 14, at 06:25:21
Very strange coincidence. There are no cron jobs running.
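
In case it helps rule things out, user crontabs aren’t the only source of scheduled work on a stock Ubuntu box; the system-wide jobs and timers can be listed like this:

cat /etc/crontab                    # the daily/weekly run-parts jobs are scheduled here
ls /etc/cron.d/ /etc/cron.daily/
systemctl list-timers --all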

1 Like

I’m not sure how droplets run, but that makes it sound to me like they are sharing RAM among containers (or do multiple users just share a single box?) and that they have oversold their RAM. It might not be the EVM that is eating the memory then, but perhaps another container on the same box?

2 Likes

Good idea, I have just opened a ticket on DO to ask about the timeframe and shared memory.

1 Like

Answer from DO:

Memory isn’t shared between Droplets on a server—each Droplet has its own slice of memory from the physical server—so this would be a resource contention issue inside the Droplet. These can be especially common on our smallest Droplet size (512 MB) as users install more software.

For more details about what’s happening, you might want to set up some logging with the free or sar programs. They can give you more data from the perspective of inside the Droplet and may help focus your search.
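
A minimal version of that suggestion might look like this (assuming the sysstat package for sar; the cron line is just an example of periodic logging):

free -m        # one-off snapshot
sar -r 60 5    # memory utilisation, 5 samples at 60-second intervals (needs sysstat)
# in root's crontab, log a snapshot every 5 minutes:
# */5 * * * * free -m >> /var/log/free.log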

So let’s keep waiting :slight_smile:

1 Like

Just checked a couple of projects on mini droplets - no swap, no crashes either.

One thing I just noticed is that all three crashes happened at almost the same time

Good point. If it’s not because of memory sharing, maybe they had hardware problems at that moment? Which data center do you use? We are in Germany; maybe yours had a local problem that we didn’t “share”.

From what you write, your app should indeed run on a mini droplet without swap, so unless you get a crash in the near future it might have been an external problem (one that you can mitigate by supervising your app with a systemd service).

4 Likes

No changes so far. I will keep waiting a couple of days before deploying my latest changes and continuing to test.
I am going to post here if anything crashes again. Thanks to everyone who helped investigate this.

I am also using the German data center :smile: :thumbsup:

2 Likes

However, what actually fixed your issues with the crashes, or did they just stop? I kind of lost track of this part reading through this topic…

1 Like