How to debug beam.smp segfaults in a docker container

Hi there! Beam Debugging Noob here.

I have a Hetzner Cloud VPS (1 vCPU, 2 GB RAM) with Ubuntu 18.04.2 and dokku installed. An Elixir Phoenix application runs there in a docker container on Erlang 20.1 and Elixir 1.7.4. That’s not the latest, but I got errors trying to deploy newer versions. The machine experiences almost no load and reports having more than 1 GB of RAM available.

Sometimes, after the container has been running for a while, the beam.smp process in the docker container will segfault, like it has done in the past few days:

Jun 16 19:48:40 nepasheli-1 kernel: [16533055.262703] 1_scheduler[12454]: segfault at 208 ip 0000000000608ce4 sp 00007f135077e940 error 4 in beam.smp[400000+339000]
...
Jun 19 12:58:56 nepasheli-1 kernel: [58674.014500] 1_scheduler[15398]: segfault at 208 ip 0000000000608ce4 sp 00007fcbc50be8e0 error 4 in beam.smp[400000+339000]

The first segfault on Jun 16 went unnoticed by me; apparently some dokku magic redeployed the container. But after today’s segfault on Jun 19 it was still down when I noticed it about 20 minutes later. I had to redeploy the container to get things running again.

Does anyone recognize this behavior? Is error 4 in beam.smp a well-known error? How do I debug such crashes inside a docker container? Googling ‘beam.smp error 4’ did not return any really good results. All help, pointers and insights are welcome.
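
For what it’s worth, the kernel line itself can be decoded a little. The sketch below assumes the x86 page-fault error code layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and simply plugs in the numbers from the log lines above. The function names are made up for illustration.

```shell
#!/bin/sh
# Values copied from the kernel log line:
#   segfault at 208 ip 0000000000608ce4 ... error 4 in beam.smp[400000+339000]
ip=0x608ce4
map_base=0x400000
err=4

# Describe the low three bits of an x86 page-fault error code.
decode_pf_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

# Offset of the faulting instruction from the start of the mapping.
ip_offset() {
    printf '0x%x\n' $(( $1 - $2 ))
}

decode_pf_error "$err"
ip_offset "$ip" "$map_base"
```

Read this way, error 4 is a user-mode read of a page that is not mapped, and the fault address 0x208 is small enough to look like a field access through a NULL pointer. The instruction pointer lies inside the beam.smp mapping itself, at offset 0x208ce4 of 0x339000, and it is identical in both crashes. That alone does not rule out memory corruption caused elsewhere.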

Set the environment variable ERL_CRASH_DUMP to put the crash dump file in a specific location - http://erlang.org/doc/man/erl.html#environment-variables . Run the container with a volume mount to the local filesystem and make sure the crash dump path points into that mount. Analyze the file with the crash dump viewer. I think (if memory serves me right, it’s been a while) you can view the file with observer. If all else fails, you can set the Linux core dump location using the instructions here - https://unix.stackexchange.com/a/192836 - and drop down to gdb and friends to analyze.
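
As a rough sketch of the volume-plus-environment-variable setup with plain docker: the host directory, mount point and image name below are placeholders, not anything from this thread. The script only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to your own host path, mount point and image.
HOST_DIR=${HOST_DIR:-/var/lib/my_app/dumps}
MOUNT_POINT=${MOUNT_POINT:-/dumps}
IMAGE=${IMAGE:-my_app:latest}

# Build the docker invocation: -v mounts the host directory into the container,
# -e points ERL_CRASH_DUMP at a file inside that mount.
crash_dump_run_cmd() {
    echo "docker run -v $1:$2 -e ERL_CRASH_DUMP=$2/erl_crash.dump $3"
}

crash_dump_run_cmd "$HOST_DIR" "$MOUNT_POINT" "$IMAGE"
```

Once a dump file shows up in the host directory, it can be opened from an Erlang shell with crashdump_viewer:start(), which ships with the observer application.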

Excellent. I’m going to try to get a beam crash dump.

But I was wondering: If the beam.smp process segfaults, does it still have time/resources to write a dump?

I don’t know - that’s why you might need to drop down to a core dump.

And take a look at the docker logs - it might be hitting a docker-imposed limit.

Ok, I think I got it. Volume mounted with help from dokku storage:mount docs and ENV var set with dokku config:set ERL_CRASH_DUMP=/storage/erl_crash.dump.
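
Before waiting for the next crash, it may be worth checking from inside the container that the dump location is actually writable. A minimal sketch, assuming the /storage/erl_crash.dump path from above as the default (dump_target_ok is just an illustrative name):

```shell
#!/bin/sh
# Succeeds if the directory that would hold the crash dump exists and is writable.
dump_target_ok() {
    dir=$(dirname "$1")
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Use ERL_CRASH_DUMP if it is set, otherwise fall back to the path used above.
target=${ERL_CRASH_DUMP:-/storage/erl_crash.dump}
if dump_target_ok "$target"; then
    echo "ok: $target can be written"
else
    echo "problem: $(dirname "$target") is missing or not writable"
fi
```

To test the whole path end to end on a throwaway instance, calling erlang:halt/1 with a string slogan makes the emulator exit and write a crash dump, which should then appear in the mounted directory.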

Now let’s wait for the next crash. :grimacing:

I would say that the segfault is almost surely related to some NIF (Native Implemented Function) in a library that you are using. Otherwise, segfaults in beam.smp are not a regular occurrence.

I’ve been searching a bit but could not find a library in my project that uses a NIF. I could be wrong there, though.

FYI here are the dependencies of the project:

  defp deps do
    [
      {:phoenix, "~> 1.4.0"},
      {:phoenix_pubsub, "~> 1.1"},
      {:phoenix_ecto, "~> 4.0"},
      {:ecto_sql, "~> 3.0"},
      {:postgrex, ">= 0.0.0"},
      {:phoenix_html, "~> 2.11"},
      {:phoenix_live_reload, "~> 1.2", only: :dev},
      {:gettext, "~> 0.11"},
      {:jason, "~> 1.0"},
      {:plug_cowboy, "~> 2.0"},
      {:ex_aws, "~> 2.1"},
      {:ex_aws_s3, "~> 2.0"},
      {:sweet_xml, "~> 0.6"},
      {:hackney, "~> 1.9"},
      {:ua_inspector, "~> 0.20"},
      {:timex, "~> 3.5"},
      # pagination
      {:scrivener_ecto, "~> 2.0"},
      {:scrivener_html, "~> 1.8"},
      {:mox, "~> 0.4", only: :test},
      {:ex_machina, "~> 2.3", only: :test},
      {:credo, "~> 1.0.0", only: [:dev, :test], runtime: false},
      {:dialyxir, "~> 1.0.0-rc.6", only: [:dev, :test], runtime: false}
    ]
  end

Yeah, that list of dependencies looks pretty standard and I don’t think any of them use a NIF. Will be interesting to see what the crashdump reveals.

Make sure that you enable core file generation:

docker run --ulimit core=-1 hello_world

and that you can access the core file after the container terminates. Where the core file is placed is controlled by the host’s /proc/sys/kernel/core_pattern, so you want to take note of what that is set to.
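
A small sketch for reading that setting on the host. It relies on the documented core_pattern convention that a leading "|" means the kernel pipes the core to a helper program instead of writing a file; describe_core_pattern is an illustrative name.

```shell
#!/bin/sh
# Classify a core_pattern value.
describe_core_pattern() {
    case $1 in
        "|"*) echo "piped to helper: ${1#?}" ;;
        /*)   echo "written to absolute path pattern: $1" ;;
        *)    echo "written relative to the crashing process's working directory: $1" ;;
    esac
}

# Show the current settings where available (Linux only).
if [ -r /proc/sys/kernel/core_pattern ]; then
    describe_core_pattern "$(cat /proc/sys/kernel/core_pattern)"
fi
echo "core size limit in this shell: $(ulimit -c 2>/dev/null || echo unknown)"
```

If the pattern is piped to a helper, the core may never land where the container can see it, so switching to a plain file pattern inside a mounted directory is probably the easier route. The core can then be loaded with gdb together with the same beam.smp binary that crashed.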

mix deps.tree if you want to see transitive dependencies as well

Good point, here you go:

my_app
├── gettext ~> 0.11 (Hex package)
├── jason ~> 1.0 (Hex package)
│   └── decimal ~> 1.0 (Hex package)
├── dialyxir ~> 1.0.0-rc.6 (Hex package)
│   └── erlex ~> 0.2.1 (Hex package)
├── mox ~> 0.4 (Hex package)
├── hackney ~> 1.9 (Hex package)
│   ├── certifi 2.5.1 (Hex package)
│   │   └── parse_trans ~>3.3 (Hex package)
│   ├── idna 6.0.0 (Hex package)
│   │   └── unicode_util_compat 0.4.1 (Hex package)
│   ├── metrics 1.0.1 (Hex package)
│   ├── mimerl ~>1.1 (Hex package)
│   └── ssl_verify_fun 1.1.4 (Hex package)
├── timex ~> 3.5 (Hex package)
│   ├── combine ~> 0.10 (Hex package)
│   ├── gettext ~> 0.10 (Hex package)
│   └── tzdata ~> 0.1.8 or ~> 0.5 (Hex package)
│       └── hackney ~> 1.0 (Hex package)
├── sweet_xml ~> 0.6 (Hex package)
├── ex_aws ~> 2.1 (Hex package)
│   ├── hackney 1.6.3 or 1.6.5 or 1.7.1 or 1.8.6 or ~> 1.9 (Hex package)
│   └── sweet_xml ~> 0.6 (Hex package)
├── ex_aws_s3 ~> 2.0 (Hex package)
│   ├── ex_aws ~> 2.0 (Hex package)
│   └── sweet_xml >= 0.0.0 (Hex package)
├── credo ~> 1.0.0 (Hex package)
│   ├── bunt ~> 0.2.0 (Hex package)
│   └── jason ~> 1.0 (Hex package)
├── ua_inspector ~> 1.0 (Hex package)
│   ├── hackney ~> 1.0 (Hex package)
│   └── yamerl ~> 0.7 (Hex package)
├── phoenix_pubsub ~> 1.1 (Hex package)
├── postgrex >= 0.0.0 (Hex package)
│   ├── connection ~> 1.0 (Hex package)
│   ├── db_connection ~> 2.0 (Hex package)
│   │   └── connection ~> 1.0.2 (Hex package)
│   ├── decimal ~> 1.0 (Hex package)
│   └── jason ~> 1.0 (Hex package)
├── ecto_sql ~> 3.0 (Hex package)
│   ├── db_connection ~> 2.0 (Hex package)
│   ├── ecto ~> 3.0.2 (Hex package)
│   │   ├── decimal ~> 1.5 (Hex package)
│   │   └── jason ~> 1.0 (Hex package)
│   ├── postgrex ~> 0.14.0 (Hex package)
│   └── telemetry ~> 0.2.0 (Hex package)
├── ex_machina ~> 2.3 (Hex package)
│   ├── ecto ~> 2.2 or ~> 3.0 (Hex package)
│   └── ecto_sql ~> 3.0 (Hex package)
├── scrivener_ecto ~> 2.0 (Hex package)
│   ├── ecto ~> 3.0 (Hex package)
│   └── scrivener ~> 2.4 (Hex package)
├── phoenix_html ~> 2.11 (Hex package)
│   └── plug ~> 1.5 (Hex package)
│       ├── mime ~> 1.0 (Hex package)
│       └── plug_crypto ~> 1.0 (Hex package)
├── plug_cowboy ~> 2.0 (Hex package)
│   ├── cowboy ~> 2.5 (Hex package)
│   │   ├── cowlib ~> 2.7.0 (Hex package)
│   │   └── ranch ~> 1.7.0 (Hex package)
│   └── plug ~> 1.7 (Hex package)
├── phoenix ~> 1.4.0 (Hex package)
│   ├── jason ~> 1.0 (Hex package)
│   ├── phoenix_pubsub ~> 1.1 (Hex package)
│   ├── plug ~> 1.7 (Hex package)
│   └── plug_cowboy ~> 1.0 or ~> 2.0 (Hex package)
├── phoenix_live_reload ~> 1.2 (Hex package)
│   ├── file_system ~> 0.2.1 or ~> 0.3 (Hex package)
│   └── phoenix ~> 1.4 (Hex package)
├── scrivener_html ~> 1.8 (Hex package)
│   ├── phoenix ~> 1.0 and < 1.5.0 (Hex package)
│   ├── phoenix_html ~> 2.2 (Hex package)
│   ├── plug ~> 1.1 (Hex package)
│   └── scrivener ~> 1.2 or ~> 2.0 (Hex package)
└── phoenix_ecto ~> 4.0 (Hex package)
    ├── ecto ~> 3.0 (Hex package)
    ├── phoenix_html ~> 2.9 (Hex package)
    └── plug ~> 1.0 (Hex package)
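
Reading a tree like this by eye only goes so far. One rough way to look for native code is to scan the fetched dependencies for a c_src directory or a compiled .so file. This is a heuristic sketch, not a definitive check: find_native_deps is an illustrative name, it assumes the usual Mix layout with a deps directory in the project root, and it will miss native code that is laid out differently.

```shell
#!/bin/sh
# Print the names of dependencies under the given root that contain
# a c_src directory or a shared object file.
find_native_deps() {
    root=${1:-deps}
    (
        cd "$root" 2>/dev/null || exit 0
        find . \( -type d -name c_src -o -type f -name '*.so' \) |
            cut -d/ -f2 | sort -u
    )
}

# Run from the project root; prints nothing if no match is found.
find_native_deps deps
```

An empty result for deps would be consistent with none of the Hex packages shipping a NIF, though it says nothing about native code inside Erlang/OTP itself.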