Phoenix server crashing randomly

I have deployed my Phoenix server to DigitalOcean and it has crashed three times in the last month without any real traffic. The only traffic comes from me (testing the website) and bots trying out different URLs that do not exist (GET /phpMyAdmin/scripts/setup.php and so on…).

From the erl_crash.dump the error says:

“Slogan: erl_child_setup closed”

The only thing I could find on Google was this thread: [erlang-questions] Need help with the error dump - erl_child_setup closed
It says that a crash like this was fixed in Erlang 19.1; since I am using Erlang 19.2, that's not the bug I am facing.
You can find the full crash dump here (Dropbox link).

I wasn't sure where to post this (Phoenix, Plug, Cowboy, or the Erlang GitHub) since I cannot say for sure where this error happens, so I am trying here first.
Some more information on my setup:

erlang: 19.2
elixir: 1.4.1
Deployment: building a release with edeliver and distillery locally in a Docker container running ubuntu:16.04 (the same Ubuntu version as on DigitalOcean).

The following packages are used, extracted from mix.lock:
base64url: 0.0.1
bunt: 0.2.0
canada: 1.0.1
canary: 1.1.0
comeonin: 3.0.1
connection: 1.0.4
cowboy: 1.1.2
cowlib: 1.0.2
credo: 0.6.1
db_connection: 1.1.0
decimal: 1.3.1
dialyxir: 0.4.4
distillery: 1.1.2
ecto: 2.1.1
edeliver: 1.4.2
elixir_make: 0.4.0
fs: 0.9.2
gettext: 0.13.1
guardian: 0.14.2
jose: 1.8.0
logger_file_backend: 0.0.9
mime: 1.0.1
phoenix: 1.2.1
phoenix_ecto: 3.2.1
phoenix_html: 2.9.3
phoenix_live_reload: 1.0.8
phoenix_pubsub: 1.0.1
plug: 1.3.0
poison: 2.2.0
poolboy: 1.5.1
postgrex: 0.13.0
ranch: 1.3.2
scrivener: 2.2.1
scrivener_ecto: 1.1.3
uuid: 1.1.6

So it's just a Phoenix web project, nothing special going on and not a lot of traffic. I don't know where to look anymore; some things I will try:

  • Build the release on the DigitalOcean server rather than in the Docker container
  • Start the server without a release, directly from source code

It is hard to test any changes since the server will sometimes run for days without a problem and then crash at six in the morning… This time the crash happened within 12 hours of the last deployment.

Is anybody else facing something like that or has any ideas what I should try?

2 Likes

Perhaps it has something to do with NIFs? You are using elixir_make/comeonin/guardian, which I believe all use NIFs - I haven't seen this before though, so I'm just guessing.

3 Likes

Interesting, I thought all this "simple" stuff would not use NIFs. So I will try to create the release on the production server to rule out any issues with different build environments.
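I will also check from a remote console which of those apps actually ship a compiled NIF library, by looking for .so files in their priv dirs. Just a rough sketch (the app names are only examples from my deps, and it assumes the release keeps the usual priv layout):

Enum.map([:comeonin, :guardian, :jose], fn app ->
  # :code.priv_dir/1 returns the app's priv directory, or {:error, :bad_name} if the app is unknown
  case :code.priv_dir(app) do
    {:error, :bad_name} -> {app, :not_loaded}
    priv -> {app, Path.wildcard(Path.join(priv, "**/*.so"))}
  end
end)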
Thanks for the idea!

1 Like

I hit the same thing this morning on a super basic site on digitalocean. Same deal with the bots.

  • Elixir 1.4.1
  • Erlang 19.2.3
  • Ubuntu 16.04.2
  • PostgreSQL 9.5.5
  • Nginx 1.10.0

These are the dependencies. I had to override the poison version in mailgun because of https://github.com/chrismccord/mailgun/pull/46, so I'm not sure if that is related to my problem.

├── distillery ~> 1.1 (Hex package)
├── poison ~> 2.2.0 (Hex package) *override*
├── phoenix_pubsub ~> 1.0 (Hex package)
├── edeliver ~> 1.4 (Hex package)
│   └── distillery >= 0.8.0 (Hex package)
├── gettext ~> 0.11 (Hex package)
├── postgrex >= 0.0.0 (Hex package)
│   ├── connection ~> 1.0 (Hex package)
│   ├── db_connection ~> 1.1 (Hex package)
│   │   ├── connection ~> 1.0.2 (Hex package)
│   │   └── poolboy ~> 1.5 (Hex package)
│   └── decimal ~> 1.0 (Hex package)
├── mailgun ~> 0.1.2 (Hex package)
│   └── poison ~> 1.4 (Hex package)
├── cowboy ~> 1.0 (Hex package)
│   ├── cowlib ~> 1.0.0 (Hex package)
│   └── ranch ~> 1.0 (Hex package)
├── phoenix ~> 1.2.1 (Hex package)
│   ├── cowboy ~> 1.0 (Hex package)
│   ├── phoenix_pubsub ~> 1.0 (Hex package)
│   ├── plug ~> 1.1 (Hex package)
│   │   ├── cowboy ~> 1.0.1 or ~> 1.1 (Hex package)
│   │   └── mime ~> 1.0 (Hex package)
│   └── poison ~> 1.5 or ~> 2.0 (Hex package)
├── phoenix_live_reload ~> 1.0 (Hex package)
│   ├── fs ~> 0.9.1 (Hex package)
│   └── phoenix ~> 1.0 or ~> 1.2-rc (Hex package)
├── phoenix_html ~> 2.6 (Hex package)
│   └── plug ~> 1.0 (Hex package)
└── phoenix_ecto ~> 3.0 (Hex package)
    ├── ecto ~> 2.1 (Hex package)
    │   ├── db_connection ~> 1.1 (Hex package)
    │   ├── decimal ~> 1.2 (Hex package)
    │   ├── poison ~> 2.2 or ~> 3.0 (Hex package)
    │   ├── poolboy ~> 1.5 (Hex package)
    │   └── postgrex ~> 0.13.0 (Hex package)
    ├── phoenix_html ~> 2.9 (Hex package)
    └── plug ~> 1.0 (Hex package)

Will keep an eye and update if I figure out the cause.

2 Likes

I tried to build my release on the production server instead of in a Docker container, so that all Ubuntu packages/dependencies are the same. So far no crash, but one of my previous crashes happened on day six or seven, so I will keep monitoring it.
What I find strange is that everything seems to work, even the stuff that seems to use NIFs, like authorization/registration and so on.

1 Like

I have a Phoenix app that we use as a base for other apps; it has comeonin, guardian, ueberauth etc. (though only comeonin uses NIFs afaik) and is deployed on DO's Ubuntu 16.04.

I haven't done anything with it since December of last year and never experienced this issue; since Dec 26 it has just been sitting there getting spammed by bots. Just checked it - no problems, everything works fine.

I'm compiling from source on the server though and have no elixir_make dependency. You might want to look closer into the one that needs elixir_make; it's not comeonin nor guardian.

I would like to know if distillery can really cause such problems, please post any further findings.

2 Likes

This is definitely not distillery, nor releases in general; that error message is coming from the runtime and is specific to the handling of ports - the question is what's causing this problem to occur on DO.

2 Likes

Actually comeonin is the one using elixir_make in the latest version: https://github.com/riverrun/comeonin/blob/master/mix.exs#L28

1 Like

True, it's just that mine (2.6.0) doesn't use it yet.

I still doubt it's comeonin though; it might be a hardware problem + logger_file_backend, but that's a really long shot.

1 Like

I asked on the erlang-questions mailing list. My case looks quite likely to be an OOM issue.

2 Likes

@Nils try something like this on your server; it might shed some light on the issue:

grep -ri 'killed process' /var/log/*
grep -ri total_vm /var/log/*
grep -ri oom /var/log/*

1 Like

I'm wondering if I'm starting my server in production mode incorrectly. htop seems to think there are a number of processes running for the BEAM VM, and this is a single-core machine…

I'm effectively running it like this:

MIX_ENV=prod PORT=4001 elixir -pa _build/prod/consolidated -S mix phoenix.server

1 Like

Those are just threads of the main VM process; it spawns a few for non-async'able operations, and they remain idle 99% of the time regardless. ^.^

1 Like

By default Erlang will launch one scheduler thread per core and 10 “async threads” to do blocking things like IO.

Additionally, more modern versions will fork a process called "erl_child_setup" that is used for spawning ports, to avoid forking the main VM process, which may be costly.
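You can check those numbers on a running node with, for example:

:erlang.system_info(:schedulers_online)             # scheduler threads, normally one per core
:erlang.system_info(:thread_pool_size)              # async threads for blocking I/O, 10 by default on these versions
:erlang.system_info(:logical_processors_available)  # what the VM thinks the machine has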

3 Likes

@yurko thanks, here are the results; not much I can see here. Just one thing I am wondering about from the third command: “release-upgrade invoked oom-killer” was logged at six in the morning and I did not do a release at that time, but maybe it just means that this process was started the day before by that command…

/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.789442] Killed process 18588 (erl_child_setup) total-vm:4368kB, anon-rss:76kB, file-rss:568kB
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.963894] Killed process 18491 (beam) total-vm:2228888kB, anon-rss:55580kB, file-rss:5112kB
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.651762] Killed process 31933 (erl_child_setup) total-vm:4368kB, anon-rss:72kB, file-rss:564kB

/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.779066] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.956619] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647842] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name

/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.778627] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.778778] [] oom_kill_process+0x202/0x3c0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.779066] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.954112] release-upgrade invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.954174] [] oom_kill_process+0x202/0x3c0
/var/log/kern.log.1:Feb 12 06:47:03 kernel: [566162.956619] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647551] lxcfs invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647672] [] oom_kill_process+0x202/0x3c0
/var/log/kern.log.1:Feb 14 06:25:21 kernel: [737660.647842] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name

It would help if we could at least see some sort of stack trace to know which port/NIF caused the crash, but I could not figure that out from the crash dump.

I had the last crash on Feb 14, so one week ago. After that I started to build the release on my server and not inside a Docker container, so that might be the fix. I still have to investigate what the difference between these two environments is.

DO Server:

Linux service-check 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.1 LTS

Docker:

Linux 831ca2d6d9c0 4.4.43-boot2docker #1 SMP Wed Jan 18 18:33:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.1 LTS

So a different kernel version, but the same Ubuntu version.

Both are using

openssl 1.0.2g-1ubuntu4.6

So the next step might be to find out which packages have different versions, or first which system libraries are used by Erlang/Elixir/the dependencies, and then compare their versions. But that would be hard to keep in sync later on…

1 Like

Ah cool, I was unsure if it was always a set number or some multiple of the cores or something, good to know. :slight_smile:

1 Like

It seems the problem is that the system ran out of memory, which caused the process to be killed, as stated in @gcmurphy's erlang-questions thread. See http://stackoverflow.com/questions/9199731/understanding-the-linux-oom-killers-logs or Google to learn more.

I'm not sure what the reason for it could be, but there is something you can do to prevent it, unless you have a leak.

Run free and see if you have swap enabled. If not, run the following to set up an 8 GB swap file at /var/swapfile (substitute 8M with another number to change the size in GB; use sudo to run these):

dd if=/dev/zero of=/var/swapfile bs=1024 count=8M
mkswap /var/swapfile
swapon /var/swapfile

For this to be activated after reboot, add this line to /etc/fstab: /var/swapfile swap swap sw 0 0

If the cause of the crash is indeed a memory problem, swap will eliminate it unless you have something in the app that keeps consuming all available memory. And if it's not - swap wouldn't hurt, so it should be safe to try.

3 Likes

Wow, a memory issue? :scream: I never thought about that, since DigitalOcean reported a max memory usage of 66%…

I did not use swap during the crashes. I added it later, since it was needed to compile the server on this small DO instance (only 512 MB). I guess the graphs might be incorrect, since they do not show the spike during compilation on Feb 14. :frowning:
At the moment I am using 1 GB of swap.
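
Since the DO graphs might be misleading, I am also thinking about logging what the BEAM itself thinks it is using, roughly like this (MemoryLogger is just a name I made up, and it would need to be started from the supervision tree):

defmodule MemoryLogger do
  # Hypothetical helper: logs BEAM-reported memory usage once a minute.
  use GenServer
  require Logger

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  def init(:ok) do
    schedule()
    {:ok, nil}
  end

  def handle_info(:log, state) do
    # :erlang.memory(:total) is the memory currently allocated by the BEAM, in bytes
    total_mb = div(:erlang.memory(:total), 1024 * 1024)
    Logger.info("BEAM memory: #{total_mb} MB total")
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :log, 60_000)
end

Adding something like worker(MemoryLogger, []) to the supervisor children should then give a memory trace in the log to compare against the droplet graphs.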

So the next step would be to build the release again in the Docker container and deploy it.
Thanks a lot for this analysis; I will report how the release from the Docker container behaves.

I have also attached the full kern.log from the crash (Dropbox link).

Killed process 18588 (erl_child_setup) total-vm:4368kB, anon-rss:76kB, file-rss:568kB

This process only used 4mb, but the beam process tried to use 2gb:

Killed process 18491 (beam) total-vm:2228888kB, anon-rss:55580kB, file-rss:5112kB

Really strange: if I sum up the total_vm column I only get 113191 kB → 113 MB.
I will take a deeper look at the logs tomorrow. Thank you so much so far!

3 Likes

So you didn't have a crash after adding the 1 GB of swap?

Before going to production I'd disable it without changing anything else and see if the system crashes (wait till it crashes). This way you'd know for sure it's a memory issue and could probably adjust the app to minimize it.

As for the logs, “total-vm:2228888kB” looks like too much for a mini droplet :slight_smile:

2 Likes

That’s correct.
I have disabled the swap now, so let’s wait and see what happens.
Also, I have looked at all three erl_crash.dumps using the Erlang crashdump viewer, and in all three cases only about 50 MB of memory was used.

But the OOM message from the logfile states that the beam process (pid 18491) used 557222 (I guess that's in kB), so about 500 MB, which would explain the memory problem on a DO mini instance.
So I am wondering how there can be such a big difference. I don't know much about Unix memory management; from what I read it could mean that this is just the reserved memory, not the actual used memory?
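
If I understand the OOM output correctly, total_vm in that per-process table is counted in 4 kB pages, so 557222 would be about 2.2 GB of virtual address space - which matches the total-vm:2228888kB in the kill line - while anon-rss (what was actually resident) was only about 55 MB for beam, roughly in line with the ~50 MB I see in the crash dumps. Next time the node is up I will also compare what the VM itself reports from a remote console, e.g.:

:erlang.memory(:total)  # bytes currently allocated by the BEAM allocators
:erlang.memory()        # breakdown into processes, binary, ets, code, …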

1 Like