I’m no Elixir guru, so take everything below with a healthy dose of skepticism.
Elixir/Erlang are famous for their fault tolerance, but they’re not magic: if you design your project poorly it won’t continue running just because it happens to be on the BEAM.
In this case, it seems that the cause of failure was indeed poor application design. If you’re connecting to a remote service, you have to plan for “what happens when it’s unreachable or too slow?”. According to your description, it seems like the decision you made was “just let it crash”.
While that phrase is often repeated as part of Erlang’s design philosophy, it’s important to understand why people recommend the approach. When designing systems in OTP, the expectation is that restarting a process will bring it back into a known good, stable state. It’s basically “turning it off and on again”, which solves so many issues when technology goes bonkers.
In the case you describe, you can’t just “let it crash” because when the service restarts, the remote service you’re trying to connect to could still be unreachable. In other words, restarting your process didn’t (and couldn’t!) solve your issue at all.
Instead, it’s recommended to design with a so-called “error kernel”: a portion of the system that MUST ALWAYS be correct. If that error kernel gets corrupted, everything should get shut down.
In your design, you basically had the connection to the third party system within your error kernel (in effect having your code say “this remote system will always be reachable”), and when that wasn’t the case, the system failed.
Since that’s apparently not what you want to happen, you should instead design your system so that connecting to the third party system isn’t part of the error kernel. In other words, your processes should continue operating even if unable to reach the remote service.
One way this could be achieved is by starting a task to connect to the remote service, and shutting the task down if it doesn’t return before a certain cutoff period (i.e. a timeout). Your GenServer would then respond with either {:ok, response} if it got a response, or {:error, :service_unavailable} if the remote service couldn’t be reached in time (either because it’s down or the response came too late). Then the client can decide what to do if the remote service is unreachable (try later, etc.).
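To make that a bit more concrete, here is a minimal sketch of the idea, not a definitive implementation. The module name RemoteClient, the fetch_fun argument (a zero-arity function standing in for whatever actually talks to your remote service), and the default 1 second timeout are all assumptions I made up for illustration. It uses Task.Supervisor.async_nolink so that a crash inside the task doesn’t take the GenServer down with it, plus the Task.yield/Task.shutdown idiom to enforce the cutoff.

```elixir
defmodule RemoteClient do
  use GenServer

  # fetch_fun is a hypothetical zero-arity function that calls the remote
  # service; timeout_ms is the cutoff after which we give up on it.
  def start_link(fetch_fun, timeout_ms \\ 1_000) when is_function(fetch_fun, 0) do
    GenServer.start_link(__MODULE__, {fetch_fun, timeout_ms})
  end

  # Returns {:ok, response} or {:error, :service_unavailable}.
  # Assumes timeout_ms stays well below the default 5 second call timeout.
  def fetch(server), do: GenServer.call(server, :fetch)

  @impl true
  def init({fetch_fun, timeout_ms}) do
    # A private task supervisor, so tasks are not linked to this GenServer.
    {:ok, sup} = Task.Supervisor.start_link()
    {:ok, %{fetch_fun: fetch_fun, timeout_ms: timeout_ms, sup: sup}}
  end

  @impl true
  def handle_call(:fetch, _from, state) do
    task = Task.Supervisor.async_nolink(state.sup, state.fetch_fun)

    # Wait up to timeout_ms; if nothing arrived, kill the task and check
    # whether a result slipped in at the last moment.
    reply =
      case Task.yield(task, state.timeout_ms) || Task.shutdown(task, :brutal_kill) do
        {:ok, response} -> {:ok, response}
        # The task crashed (e.g. connection refused raised an error).
        {:exit, _reason} -> {:error, :service_unavailable}
        # The task did not finish before the cutoff.
        nil -> {:error, :service_unavailable}
      end

    {:reply, reply, state}
  end
end
```

Keep in mind that in this sketch the GenServer is blocked while it waits for the task, so calls are handled one at a time. If that matters for you, you’d want to reply asynchronously (GenServer.reply/2) or have the callers run the task themselves, but the principle is the same: a slow or dead remote service becomes an ordinary return value instead of a crash.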
Here is some more info on the idea of service guarantees and error kernels (note there are 4 links, I don’t know why some don’t display properly):
https://ferd.ca/it-s-about-the-guarantees.html
https://medium.com/@jlouis666/error-kernels-9ad991200abd
https://www.reactivedesignpatterns.com/patterns/error-kernel.html
Hope this helps!