Best way to isolate processes from misbehaving 3rd party libraries: `try-catch` vs "let it crash"

chrishop · November 10, 2021, 4:56pm

I have a production web application that handles ~1000 rps.
We use Cowboy, which creates a process per connection.
For each request we make a synchronous call to a 3rd party caching library (Cachex), which errors every so often and makes that request return a 500.

I want to isolate the connection process from 3rd party libraries, so that if my cache library crashes, the request won’t 500.

In this situation as I see it there are 3 options:

1. Use try-catch blocks around misbehaving calls to the library
2. Use Tasks to make calls to the library
3. Use a single GenServer to interact with the library

Option 3 could introduce bottlenecks into the application, as the calls to the library happen every request.

Option 2 means that for each connection we create another process which is created and destroyed (like in [1]). Moreover communication between processes requires copying of the message [2], wouldn’t this have performance ramifications?
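For reference, option 2 could look roughly like the sketch below. It assumes a `Task.Supervisor` and uses `async_nolink`, because a plain `Task.async` links the task to the caller, so a crash in the task would still take the request process down. `IsolatedCall` and the functions passed to it are hypothetical stand-ins, not part of Cachex or of the application described here.

```elixir
defmodule IsolatedCall do
  # Runs `fun` in an unlinked, monitored task so that a crash inside it
  # is reported to the caller as a value instead of killing the caller.
  def run(supervisor, fun, timeout \\ 1_000) do
    task = Task.Supervisor.async_nolink(supervisor, fun)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, value} -> {:ok, value}
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end

{:ok, sup} = Task.Supervisor.start_link()

# A well-behaved call returns its value.
{:ok, "cached page"} = IsolatedCall.run(sup, fn -> "cached page" end)

# A call that exits (stand-in for the failing library call) becomes an error tuple.
{:error, _reason} = IsolatedCall.run(sup, fn -> exit(:cache_down) end)
```

The crashed task still logs its own error report; only the caller is protected. The result is also copied back to the caller, which is the cost the question asks about.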

Option 1 isn’t the “OTP” way, but I can’t seem to find a good reason not to use it: there’s no risk of performance problems from bottlenecks or copying. However, the documentation says that try-catch / try-rescue is rarely used because OTP patterns are used instead. [3]

To me using try-catch seems like the way to go. However it seems to fly in the face of everything that I’ve read. In what situations is it best to use try-catch, and is this one of them?

Thanks!

[1] I have tried to use combination of catch and rescue, curious that is a good way - #4 by josevalim
[2] The Zen of Erlang
[3] try, catch, and rescue - The Elixir programming language

ruslandoga · November 10, 2021, 6:32pm

If you use cowboy 2, it spawns a process per request, not just per connection, so it’s already isolated.

Other than that, whether to try/catch or not depends largely on the error you are dealing with. I think posting the errors you are facing would lead to better suggestions / answers.

al2o3cr · November 10, 2021, 9:33pm

It’s hard to make a specific recommendation without understanding why a caching library is failing, but some things to think about:

when that call fails, what should happen?
if failure leads to a retry, what happens if the retry also fails?
if failure leads to a slow calculation (presumably the thing-being-cached is expensive to compute), what happens if many calls fail at once?

chrishop · November 11, 2021, 7:43am

Hi thank you for replying, this is the error I’m getting:

Task #PID<0.32063.7888> started from #PID<0.26981.7902> terminating
** (stop) exited in: GenServer.call(:cache_locksmith, {:transaction, ["9e323b9335689400e1249ccb8a80bd84"], #Function<0.107038532/0 in Cachex.Actions.Touch.execute/3>}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.11.3) lib/gen_server.ex:1017: GenServer.call/3
    (cachex 3.3.0) lib/cachex.ex:1296: Cachex.touch/3
    (my_app 0.2.0) lib/my_app/cache/local.ex:23: MyApp.Cache.Local.fetch/2

Sorry, what I meant to say is that I want to isolate my request process. I did read this [1], but as you say there is one process per connection plus one process per request and response. Basically, I don’t want a cache error to bring down a request.

[1] Nine Nines: Flow diagram

chrishop · November 11, 2021, 8:25am

Hi thanks for your reply.

Some context:

The application itself is a normal web server returning a webpage with text/image responses. I have a cache in front because of the volume of traffic it gets. The logs show that this error only shows up about 10 times a week, so not very often.

I’m thinking more about what’s the best way to handle a 3rd party library playing up. try-catch or Tasks? Would the copying of the responses from the processes Cachex -> Task -> request process make my application slower?

In answer to your questions:

When a call to cache fails, we should carry on and produce a response as normal
We could retry, but events like this happen so infrequently it’s not worth it
If many calls fail at once, we should just produce responses for all of those as normal
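As a rough sketch of that "carry on as normal" behaviour with try-catch: the error above is an exit from `GenServer.call`, not a raised exception, so `rescue` alone would not catch it and a `catch :exit, _` clause is needed. `CacheFallback`, `cache_fun` and `compute_fun` are hypothetical names used for illustration; the real cache call would take the place of `cache_fun`.

```elixir
defmodule CacheFallback do
  # `cache_fun` stands in for the real cache lookup and is assumed to return
  # {:ok, value} on a hit. `compute_fun` builds the response without the cache.
  def fetch(key, cache_fun, compute_fun) do
    try do
      case cache_fun.(key) do
        {:ok, value} -> value
        _miss -> compute_fun.(key)
      end
    catch
      # A call to a dead or unregistered GenServer exits; treat it as a miss.
      :exit, _reason -> compute_fun.(key)
    end
  end
end

# Calling a name that is not registered exits with a :noproc reason.
dead_cache = fn key -> GenServer.call(:no_such_cache, {:get, key}) end
compute = fn key -> "computed:" <> key end

"computed:a" = CacheFallback.fetch("a", dead_cache, compute)
"hit" = CacheFallback.fetch("a", fn _key -> {:ok, "hit"} end, compute)
```

This only catches exits, so a genuine bug that raises inside the cache call would still surface as an error.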

ruslandoga · November 11, 2021, 9:26am

In this case I’d investigate why the GenServer process is not started. Maybe there is a race condition during application startup where the web endpoint is started before the cache.
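A minimal sketch of the ordering idea, assuming nothing about the real application: a supervisor starts its children in the order they are listed, so the cache should come before anything that calls it. `Demo.Cache` and `Demo.Endpoint` are hypothetical stand-ins for the cache and the web endpoint.

```elixir
defmodule Demo.Cache do
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def get(key), do: Agent.get(__MODULE__, &Map.get(&1, key))
end

defmodule Demo.Endpoint do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # Reads from the cache during startup; this exits with :noproc
    # if the cache has not been started yet.
    _ = Demo.Cache.get(:warmup)
    {:ok, %{}}
  end
end

# Cache first, endpoint second. With :rest_for_one, a cache crash also
# restarts the endpoint that depends on it.
children = [Demo.Cache, Demo.Endpoint]
{:ok, _sup} = Supervisor.start_link(children, strategy: :rest_for_one)
```

Swapping the two entries in `children` makes the supervisor fail to start, which is the kind of ordering problem worth ruling out.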

chrishop:

Would the copying of the responses from the processes Cachex -> Task -> request process make my application slower?

You can benchmark Cachex against Cachex -> Task with your own data using Benchee to find out. It would be slower since more work is involved, but by how much is hard to say: it could be 1%, it could be more.
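Benchee is the better tool for a real measurement. As a dependency-free approximation, the sketch below uses `:timer.tc` from the standard library to compare returning a value directly with passing it through a task. `CopyCost` and the payload are made up for illustration; real numbers depend on the size and shape of the cached data.

```elixir
defmodule CopyCost do
  # Average microseconds per call of `fun` over `runs` calls.
  def avg_us(fun, runs \\ 1_000) do
    {total_us, _result} = :timer.tc(fn -> Enum.each(1..runs, fn _ -> fun.() end) end)
    total_us / runs
  end
end

# Stand-in for a cached response: a list of 10_000 small tuples.
payload = Enum.map(1..10_000, &{&1, "row"})

# Same process: no copying involved.
direct = fn -> payload end

# Through a task: the payload is copied into the task and back to the caller.
via_task = fn -> Task.await(Task.async(fn -> payload end)) end

IO.puts("direct:   #{CopyCost.avg_us(direct)} us/call")
IO.puts("via task: #{CopyCost.avg_us(via_task)} us/call")
```

Large binaries (over 64 bytes) are reference-counted and shared between processes rather than copied, so a cache that stores rendered pages as binaries may show a smaller gap than this list-based payload.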

al2o3cr · November 11, 2021, 2:20pm

chrishop:

Task #PID<0.32063.7888> started from #PID<0.26981.7902> terminating
** (stop) exited in: GenServer.call(:cache_locksmith, {:transaction, ["9e323b9335689400e1249ccb8a80bd84"], #Function<0.107038532/0 in Cachex.Actions.Touch.execute/3>}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

This seems like a symptom of a configuration / supervision issue, not something that application code should be handling at all.