Staging environment? How to debug out-of-memory errors in production on Fly.io?

dnsbty · October 5, 2021, 9:01pm

I ran into this same thing, and I would highly recommend looking into the documentation that @ruslandoga posted above. I would especially recommend checking out the section on choosing parameters. The ones that have been provided will work, but I think it’s good to understand why they work and make sure you’re making the best tradeoffs for your application.

The m_cost being provided refers to the amount of memory that is used for hashing, and the t_cost refers to the amount of time used for hashing (specifically, the number of iterations performed to arrive at the final hash). By default the m_cost is set to 17, which means that 2^17 KiB of memory (128 MiB) will be used out of the 256 MiB provided on the lowest tier Fly.io instance. I have several different applications running on this lowest tier, and it looks like all of them use between 150 and 170 MiB of memory at rest, so allocating 128 MiB for hashing will cause an OOM every time. The Argon2 RFC recommends that you use the highest amount of memory you can, but you can increase the number of iterations if less memory is available.

The recommendation above to use a t_cost of 4 and an m_cost of 16 should work fine for low traffic applications. 2^16 KiB is 64 MiB of memory, so combined with the memory used at rest it would take your instance to somewhere around 225 MiB of memory used. However, if you have a lot of concurrent websocket users, or you’re heavily using ETS for caching and storage, or you have multiple concurrent hashing operations, this will still lead to an OOM.
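
To make the arithmetic above easier to check for your own app, here is a rough sketch in plain Elixir. The `MemoryBudget` module and its function names are made up for illustration (they are not part of argon2_elixir), and the 160 MiB at-rest figure is just the midpoint of what I see on my apps, so substitute your own measurement.

```elixir
defmodule MemoryBudget do
  # Hypothetical helper for illustration only; not part of argon2_elixir.

  # Memory used by one hash, in MiB: m_cost means 2^m_cost KiB.
  def hash_mib(m_cost), do: div(Bitwise.bsl(1, m_cost), 1024)

  # Estimated peak: memory at rest plus one allocation per concurrent hash.
  def peak_mib(rest_mib, m_cost, concurrent_hashes) do
    rest_mib + hash_mib(m_cost) * concurrent_hashes
  end

  # Does the estimated peak stay within the instance's memory limit?
  def fits?(rest_mib, m_cost, concurrent_hashes, limit_mib) do
    peak_mib(rest_mib, m_cost, concurrent_hashes) <= limit_mib
  end
end

# Assumed numbers: ~160 MiB at rest on a 256 MiB instance.
IO.inspect(MemoryBudget.hash_mib(17))           # 128
IO.inspect(MemoryBudget.peak_mib(160, 16, 1))   # 224
IO.inspect(MemoryBudget.fits?(160, 17, 1, 256)) # false
IO.inspect(MemoryBudget.fits?(160, 16, 2, 256)) # false
```

This only counts the hashing allocation on top of memory at rest. It ignores websocket connections, ETS growth, and anything else your app allocates under load, so treat a result close to the limit as a likely OOM.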

For my applications I decided to go with the following parameters instead:

config :argon2_elixir, t_cost: 18, m_cost: 15

This will use only 32 MiB of memory per hash, but will use several more iterations to obtain that final hash. I arrived at those numbers by following the guide provided in the RFC: decide first on the memory cost, then test the time it takes to create a hash and adjust the time cost until the total time is around 500ms (the value recommended for security purposes).
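
If you want to repeat that tuning on your own instance, something like the sketch below should do it. `HashTiming` is a made-up module name, and it takes the hashing function as an argument so the timing loop itself has no dependencies. The commented usage line assumes argon2_elixir is in your deps and that `Argon2.hash_pwd_salt/2` accepts `t_cost` and `m_cost` as options, so double check that against the docs for the version you have installed.

```elixir
defmodule HashTiming do
  # Hypothetical helper for illustration only; not part of argon2_elixir.

  # Calls hash_fun once per t_cost with a keyword list of options and
  # returns a list of {t_cost, elapsed_milliseconds} tuples.
  def measure(hash_fun, t_costs, m_cost) do
    for t_cost <- t_costs do
      # :timer.tc/1 returns {elapsed_microseconds, result}.
      {micros, _result} =
        :timer.tc(fn -> hash_fun.(t_cost: t_cost, m_cost: m_cost) end)

      {t_cost, div(micros, 1000)}
    end
  end
end

# Usage inside `iex -S mix` on the target instance (requires argon2_elixir):
#
#   HashTiming.measure(&Argon2.hash_pwd_salt("password", &1), [2, 4, 8, 12, 16, 18, 20], 15)
```

Run it on the instance you actually deploy to, not your laptop, since the timings depend heavily on the CPU you get. A single run per setting is noisy, so it is worth running it a few times before settling on a value.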

That said, 500ms is what’s recommended as the most secure option, but others suggest considering timings as fast as 50ms for user experience purposes. Here are timings from runs inside the lowest tier Fly.io instances, covering roughly 50 to 500ms. I would recommend selecting the one that makes the best tradeoffs for your application:

config :argon2_elixir, t_cost: 2, m_cost: 15 # ~60ms
config :argon2_elixir, t_cost: 4, m_cost: 15 # ~115ms
config :argon2_elixir, t_cost: 8, m_cost: 15 # ~208ms
config :argon2_elixir, t_cost: 12, m_cost: 15 # ~307ms
config :argon2_elixir, t_cost: 16, m_cost: 15 # ~415ms
config :argon2_elixir, t_cost: 18, m_cost: 15 # ~460ms
config :argon2_elixir, t_cost: 20, m_cost: 15 # ~525ms

I should add that the most secure solution will be to give more memory to your application, but even with more memory, you should still make sure that the parameters you’re using make the most sense.