Phoenix server crashing randomly

I’m not a memory management guru either, but I’m pretty sure processes don’t get killed unless absolutely needed, and from what we can deduce it’s pretty clear that that’s the case here. The OOM killer just tries to save the system while doing the least possible damage: http://unix.stackexchange.com/questions/153585/how-the-oom-killer-decides-which-process-to-kill-first

So I think the crash dump does not mean that it was only using 50 MB when it was killed; what it means is that the “last thing it remembers” was using 50 MB, and then it probably needed much more and got killed. See this thread http://erlang.org/pipermail/erlang-questions/2012-August/068477.html - when the BEAM is killed by the OOM killer, the dump might not contain the most current info.
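As a side note: if a dump does get written, you can open it with the crashdump_viewer tool that ships with OTP (part of the observer application). A rough sketch, assuming you copy the dump off the droplet to a machine where the GUI can run - the path is just a placeholder:

```elixir
# Run locally in iex, not on the headless droplet, since the viewer needs a GUI.
# The path below is just an example location for the copied dump.
:crashdump_viewer.start('/path/to/erl_crash.dump')
```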

After (if) you reproduce the crash, check whether the app has any in-memory structures that might cause a problem - agents, ETS tables, etc. that are not properly managed (a quick way to check is sketched below). 500 MB is not that much, but we have a few smaller apps running on such droplets and they have no problems (some of them do a lot, though), so it’s possible there is a solvable problem here.
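Something like this from a remote iex session attached to the running node would give a rough picture; it only uses the standard :erlang, :ets and Process APIs, so nothing here is specific to your app:

```elixir
# Overall VM memory usage, broken down by category (bytes)
:erlang.memory()

# The ten biggest ETS tables (memory is reported in words, not bytes)
:ets.all()
|> Enum.map(&{&1, :ets.info(&1, :memory)})
|> Enum.filter(fn {_tab, mem} -> is_integer(mem) end)
|> Enum.sort_by(fn {_tab, mem} -> -mem end)
|> Enum.take(10)

# The ten biggest processes by memory
Process.list()
|> Enum.map(&{&1, Process.info(&1, :memory)})
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, {:memory, bytes}} -> -bytes end)
|> Enum.take(10)
```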

The fact that Erlang just gets killed when it runs out of memory also seems to be by design - the ultimate “let it crash” case :slight_smile: You can supervise the app and restart it automatically; we even happen to have a wiki for that: Elixir apps as systemd services - info & wiki. But I’d really try to find the issue before setting that up.

2 Likes

Thanks for your links. The second one also describes an interesting step:

If you disable the Linux oom-killer and allow the BEAM to consume memory until it crashes when a memory allocator cannot allocate more memory, you should get an Erlang dump

When the crash happens again, I can also try that and hope that the dump file will show more memory. Unfortunately, the server has not crashed since yesterday… The awkward moment when you actually want the server to crash :smile: :joy:

I am really wondering what could be using so much memory, since this is just a standard Phoenix app with some HTML views, some JSON API endpoints and Postgres queries. Nothing special. I am not directly using ETS, custom Tasks/GenServers or even channels.

I am going to wait a couple of days before I change anything.

Also, there is a thread on DigitalOcean where people are having the same problem with other software. Swap seemed to solve it for some of them. I know that 500 MB is not enough for a production site, but I expected it to run smoothly during testing: Question | DigitalOcean
Are you using swap on your smaller droplets?

One thing I just noticed is that all three crashes happened at almost the same time:
crash 1: Feb 4, at 06:25:20
crash 2: Feb 12, at 06:47:03
crash 3: Feb 14, at 06:25:21
Very strange coincidence. There are no cron jobs running.

1 Like

I’m not sure how droplets run, but that makes it sound to me like they are sharing RAM among containers (or do they just put users on a single box?) and that they have oversold their RAM. It might not be the EVM that is eating the memory then, but perhaps another container on the same box?

2 Likes

Good idea. I have just opened a ticket with DO to ask about the timeframe and shared memory.

1 Like

Answer from DO:

Memory isn’t shared between Droplets on a server—each Droplet has its own slice of memory from the physical server—so this would be a resource contention issue inside the Droplet. These can be especially common on our smallest Droplet size (512 MB) as users install more software.

For more details about what’s happening, you might want to set up some logging with the free or sar programs. They can give you more data from the perspective of inside the Droplet and may help focus your search.

So let’s keep waiting :slight_smile:

1 Like

Just checked a couple of projects on mini droplets - no swap, no crashes either.

One thing I just noticed is that all three crashes happened at almost the same time

Good point. If it’s not because of memory sharing, maybe they had hardware problems at that moment? Which data center do you use? We are in Germany; maybe yours had a local problem that we didn’t “share”.

From what you write, your app should indeed run on a mini droplet without swap, so unless you get a crash in the near future it might have been an external problem (one you can mitigate by supervising your app with a systemd service).

4 Likes

No changes so far. I will keep waiting a couple of days before deploying my latest changes and continuing to test.
I am going to post here if anything crashes again. Thanks to everyone who helped investigate this.

I am also using the German data center :smile: :thumbsup:

2 Likes

So what actually fixed your crashes - or did they just stop? I kind of lost track of that part while reading through this topic…

1 Like

They just stopped (for now). We thought it might be a memory issue, so I turned off swap again a couple of days ago to try to reproduce it (swap was not active when the crashes were happening in the past).
Unfortunately I was not able to reproduce any crashes, so they might still happen within the next few days.
I have made no changes to the code so far. I am still a bit sceptical, so I am going to wait a few more days. Then I am going to continue working on the Phoenix app and do deployments again.

One remaining difference is that the current version was built on the production server, while all the crashes happened with builds from a Docker server. So that’s something I will also try again next week.

1 Like

OK, then I’m really looking forward to your investigation, because this case seems very strange.

1 Like

Had a crash again this morning at 06:47:03.
So that fits with the other crash times:
crash 1: Feb 4, at 06:25:20
crash 2: Feb 12, at 06:47:03
crash 3: Feb 14, at 06:25:21

Feb 12 was also a Sunday…

So I guess this has nothing to do with Erlang. Previously I thought there were no cron jobs running, because crontab showed no jobs. But I forgot about the /etc/cron.* folders.
I can also see in the syslog that, 2 seconds before the crash, a cron job was executed:

Feb 26 06:47:01 service-check CRON[31380]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ))

Not sure if this was also the case with the previous crashes, since my syslog has been rotated since the last crash :frowning:
In the cron.weekly folder there are three jobs configured: fstrim, man-db and update-notifier-common.
So my guess is that these need a little more memory and then the 512 MB isn’t enough.
Also, from the Phoenix log I could see that the last request was at 06:20, over 20 minutes before the crash, so there was no real activity inside the Erlang VM.

Unfortunately there are no more logs regarding cron, so I am not sure which task was running. I will try to configure more cron logging.

1 Like

My Phoenix server crashed once too. It is a plain app from the phoenix.new command, with only one static page. (Yeah, a static page served by Cowboy :-))

/var/log/erlang.log.1

===== Fri Mar  3 10:15:56 UTC 2017
erl_child_setup closed
Crash dump is being written to: erl_crash.dump...done

mix.exs

  defp deps do
    [{:phoenix, "~> 1.2.1"},
     {:phoenix_pubsub, "~> 1.0"},
     {:phoenix_html, "~> 2.6"},
     {:phoenix_live_reload, "~> 1.0", only: :dev},
     {:gettext, "~> 0.11"},
     {:cowboy, "~> 1.0"},
     {:edeliver, "~> 1.4.0"},
     {:distillery, ">= 0.8.0", warn_missing: false}]
  end 

Digital Ocean

512 MB Memory / 20 GB Disk / SGP1 - Ubuntu 16.10 (GNU/Linux 4.8.0-34-generic x86_64)
Without any server configuration.

My fix was mix edeliver restart production

What should I do from now on? I am looking for ways to monitor the server and set up some alerts for when things go wrong.

1 Like

Activate swap: https://www.digitalocean.com/community/tutorials/how-to-add-swap-space-on-ubuntu-16-04
And set up systemd as yurko described. There are also some other threads in this forum about monitoring. MMonit is also an interesting alternative to systemd. If you want something inside the app itself, a simple memory logger is sketched below.
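For in-app monitoring, a periodic memory logger is easy to sketch. This is only an illustration - the module name and interval are made up - and you would add it to your application’s supervision tree yourself:

```elixir
defmodule MyApp.MemoryLogger do
  @moduledoc """
  Hypothetical example: logs the BEAM's memory usage once a minute so you can
  correlate VM memory with OOM events in the syslog. Add it to your
  application's supervision tree, e.g. `worker(MyApp.MemoryLogger, [])`.
  """
  use GenServer
  require Logger

  @interval :timer.minutes(1)

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  def init(:ok) do
    schedule_tick()
    {:ok, %{}}
  end

  def handle_info(:tick, state) do
    # :erlang.memory/0 reports totals in bytes, broken down by category
    mem = :erlang.memory()

    Logger.info(
      "VM memory (MB): total=#{mb(mem[:total])} processes=#{mb(mem[:processes])} " <>
        "binary=#{mb(mem[:binary])} ets=#{mb(mem[:ets])}"
    )

    schedule_tick()
    {:noreply, state}
  end

  defp schedule_tick, do: Process.send_after(self(), :tick, @interval)

  defp mb(bytes), do: Float.round(bytes / 1_048_576, 1)
end
```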

2 Likes