Staging environment? How to debug out-of-memory errors in production on Fly.io?

tadasajon · September 30, 2021, 9:45pm

So I have a hello-world Phoenix app deployed on Fly.io, which I achieved by following this guide: Build, Deploy and Run an Elixir Application

When I say that the app is a hello-world app, I mean that I literally have not done anything other than run mix phx.new myapp and mix phx.gen.auth Accounts User users. I then added the Dockerfile, fly.toml, etc needed for deployment and successfully deployed the app to Fly.io.

It still shows the “Welcome to Phoenix! Peace of mind from prototype to production” landing page, so I really haven’t touched anything.

Once deployed, the app appears to work fine – I can register for an account, log in, log out, change the background color and redeploy, etc.

But if I come back to the app the next day, the user registration and login system does not work. If I attempt to register a user the app will just crash. This has happened to me twice now, with two separate hello-world Phoenix apps.

This is what the logs show:

app[893322ef] maa [info] [21576.418891] Out of memory: Killed process 510 (beam.smp) total-vm:1388056kB, anon-rss:169764kB, file-rss:0kB, shmem-rss:79684kB, UID:65534 pgtables:704kB oom_score_adj:0
proxy[893322ef] iad [error] error.code=2003 error.message="App connection closed before request/response completed" request.method="POST" request.url="https://quiz-bot.fly.dev/users/register" request.id="01FGW9VXXXXXXXXXX1HVCS67DC" response.status=502
app[893322ef] maa [info] Main child exited with signal (with signal 'SIGKILL', core dumped? false)
app[893322ef] maa [info] Reaped child process with pid: 570 and signal: SIGUSR1, core dumped? false
app[893322ef] maa [info] Starting clean up.
app[893322ef] maa [info] Process appears to have been OOM killed!
runner[893322ef] maa [info] Starting instance
runner[893322ef] maa [info] Configuring virtual machine
runner[893322ef] maa [info] Pulling container image
runner[893322ef] maa [info] Unpacking image
runner[893322ef] maa [info] Preparing kernel init

The error message when I try to register a user is “App connection closed before request/response completed” and also “Process appears to have been OOM killed!”

So one thing that has occurred to me is that there is some kind of crucial difference between my development environment and my production environment that I do not understand. The most obvious candidate is the amount of RAM available.

So what I would like to do is figure out some way to run my phoenix app in the exact same docker configuration that it will have in production – i.e., with the same amount of RAM available. I guess I would call this a “staging environment”.

Is it possible to run a Phoenix app inside docker on my local machine with exactly the same amount of RAM specified as I will have in production? Also, what can I do to make my phoenix app use less RAM?

I also honestly don’t understand why the app would work correctly for a few hours before ceasing to work properly – I would think that an out-of-memory error would either appear immediately or else not appear at all – because I don’t see where I could have a slow memory leak over time, especially if the app is being used by exactly no one.

ruslandoga · September 30, 2021, 9:50pm

I guess you might be using argon2 for password hashing. It would explain why the app is OOM killed during registration and login.

Some more info: Argon2.Stats — argon2_elixir v2.4.0

tadasajon · September 30, 2021, 9:58pm

Ahh, thank you!

I noticed that one difference between Phoenix 1.6-rc.0 and Phoenix 1.6 was that {:bcrypt_elixir, "~> 2.0"} was replaced with {:argon2_elixir, "~> 2.0"} in the mix.exs dependencies.

I still don’t understand why I’m able to successfully register and log in / log out right after I deploy, and the problem only appears when I come back to the app the next day. If the hashing process uses so much RAM, why doesn’t it fail the very first time I try to register a user after successfully deploying?

I guess for now I should just pay Fly for some more RAM.

tadasajon · September 30, 2021, 10:25pm

On second thought, I’m first going to try putting this setting in the config…

config :argon2_elixir,
  m_cost: 12

17 is the default for “memory cost” and the docs say that you can set 8 to speed things up in development and testing, but that this is too low for production.

brainlid · September 30, 2021, 11:07pm

Yes, argon2 is a “memory hard” hashing algorithm. See: https://www.password-hashing.net/argon2-specs.pdf

This means it requires more RAM to operate. This doesn’t seem to fit in the smallest Fly.io size with the default settings. You can change the defaults like this:

config :argon2_elixir,
  t_cost: 4,
  m_cost: 16

Or you can use different values for dev/test/prod.
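
A minimal sketch of per-environment values, assuming a Phoenix 1.6-style project with a config/runtime.exs and Elixir 1.11 or later for config_env/0. The production values are the ones suggested above. The low dev/test values are an assumption meant only to keep local runs and tests fast, and are too weak for production:

```elixir
# config/runtime.exs
import Config

if config_env() == :prod do
  # Values suggested in this thread for the smallest Fly.io instance.
  config :argon2_elixir, t_cost: 4, m_cost: 16
else
  # Assumed dev/test values: fast and cheap, NOT safe for production.
  config :argon2_elixir, t_cost: 1, m_cost: 8
end
```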

If you switch to bcrypt (the default for new Phoenix applications), it uses an algorithm that is not memory hard and works just fine on the smallest Fly instances.

EDIT: I didn’t catch that the default for new Phoenix apps changed with 1.6 to use {:argon2_elixir, "~> 2.0"}! That’s going to cause problems for people creating new apps on the smallest instances of many platforms.

Morzaram · October 5, 2021, 8:01pm

Hey just wanted to respond to this for anyone else who is looking for help.

I can confirm on {:argon2_elixir, "~> 2.0"} (currently at 2.4 as of this post)

Putting the following in runtime.exs and deploying to Fly.io on the lowest plan works smoothly:

config :argon2_elixir, t_cost: 4, m_cost: 16

Thank you @brainlid !

dnsbty · October 5, 2021, 9:01pm

I ran into this same thing, and I would highly recommend looking into the documentation that @ruslandoga posted above. I would especially recommend checking out the section on choosing parameters. The ones that have been provided will work, but I think it’s good to understand why they work and make sure you’re making the best tradeoffs for your application.

The m_cost being provided refers to the amount of memory that is used for hashing and the t_cost refers to the amount of time used for hashing (specifically the number of iterations being performed to arrive at the final hash). By default the m_cost is set to 17, which means that 2^17 KiB of memory (128 MiB) will be used out of the 256 MiB provided on the lowest tier Fly.io instance. I have several different applications running on this lowest tier, and it looks like all of them use between 150-170 MiB of memory at rest, so allocating 128 MiB for hashing will cause an OOM every time. The Argon2 RFC recommends that you use the highest amount of memory you can, but you can increase the number of iterations if less memory is available.

The recommendation above to use a t_cost of 4 and an m_cost of 16 should work fine for low-traffic applications. 2^16 KiB is 64 MiB of memory, which combined with the memory used at rest would take your instance to somewhere around 225 MiB. However, if you have a lot of concurrent websocket users, use ets heavily for caching and storage, or run multiple hashing operations concurrently, this can still lead to an OOM.
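
The arithmetic above can be sketched as a small helper. This is only a rough estimate: the module and function names are hypothetical, the 160 MiB resting figure is an assumed midpoint of the 150-170 MiB range mentioned above, and it assumes each concurrent hash holds its own 2^m_cost KiB buffer:

```elixir
defmodule Argon2Budget do
  # Memory used by one hash in MiB: m_cost is an exponent, 2^m_cost KiB.
  def hash_mib(m_cost), do: :math.pow(2, m_cost) / 1024

  # Rough peak usage: resting memory plus one buffer per concurrent hash.
  def peak_mib(at_rest_mib, m_cost, concurrent_hashes \\ 1) do
    at_rest_mib + hash_mib(m_cost) * concurrent_hashes
  end
end

IO.inspect(Argon2Budget.hash_mib(17))         # 128.0
IO.inspect(Argon2Budget.peak_mib(160, 17))    # 288.0, over a 256 MiB instance
IO.inspect(Argon2Budget.peak_mib(160, 16))    # 224.0
IO.inspect(Argon2Budget.peak_mib(160, 15, 3)) # 256.0, three logins at once
```

Even at an m_cost of 15, a handful of simultaneous logins can reach the instance limit, which is why the concurrency caveat matters.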

For my applications I decided to go with the following parameters instead:

config :argon2_elixir, t_cost: 18, m_cost: 15

This will use only 32 MiB of memory per hash, but will use more iterations to obtain the final hash. I arrived at those numbers by following the guide provided in the RFC: decide first on the memory cost, then measure the time it takes to create a hash and adjust the time cost until the total time is around 500ms (the duration they recommend for security purposes).
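
One way to take that measurement is sketched below. HashTiming is a hypothetical helper, not part of argon2_elixir. It times any zero-arity function with Erlang's :timer.tc/1, and the commented-out line shows how it might be used with Argon2.hash_pwd_salt/2 from an iex -S mix session on the target instance:

```elixir
defmodule HashTiming do
  # Returns the wall-clock time of one call to `fun`, in milliseconds.
  def time_ms(fun) when is_function(fun, 0) do
    {micros, _result} = :timer.tc(fun)
    micros / 1000
  end
end

# In a project that depends on argon2_elixir, try different t_cost values:
# HashTiming.time_ms(fn -> Argon2.hash_pwd_salt("password", t_cost: 18, m_cost: 15) end)

# Stand-in workload so the sketch runs without the dependency.
IO.inspect(HashTiming.time_ms(fn -> :timer.sleep(20) end))
```

A single run is noisy, so it is probably worth repeating the measurement a few times on the instance size you actually deploy to.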

That said, 500ms is what’s recommended as the most secure option, but others suggest considering timings as fast as 50ms for user experience purposes. Here are timings measured inside a lowest-tier Fly.io instance, ranging from roughly 50 to 500ms. I would recommend selecting the one that makes the best tradeoffs for your application:

config :argon2_elixir, t_cost: 2, m_cost: 15 # ~60ms
config :argon2_elixir, t_cost: 4, m_cost: 15 # ~115ms
config :argon2_elixir, t_cost: 8, m_cost: 15 # ~208ms
config :argon2_elixir, t_cost: 12, m_cost: 15 # ~307ms
config :argon2_elixir, t_cost: 16, m_cost: 15 # ~415ms
config :argon2_elixir, t_cost: 18, m_cost: 15 # ~460ms
config :argon2_elixir, t_cost: 20, m_cost: 15 # ~525ms

I should add that the most secure solution will be to give more memory to your application, but even with more memory, you should still make sure that the parameters you’re using make the most sense.

Morzaram · October 5, 2021, 9:25pm

Wow thank you for this write up!

kamidev · October 9, 2021, 9:41pm

Yes, this is valuable info for anyone looking at upgrading to argon. Thank you!

Note that from Phoenix 1.6.1 the default for new apps has reverted to bcrypt. From the changelog:

“[phx.gen.auth] No longer set argon2 as the default hash algorithm for phx.gen.auth in favor of bcrypt for performance reasons on smaller hardware”

jordelver · January 29, 2022, 1:41pm

Thanks for this! I was having out of memory trouble with my app on Fly and this was my exact issue.