Feedback on my Phoenix app architecture

I have a Phoenix application in production that handles between 1 and 2 million requests an hour. Clients invoke the application endpoint in a fire-and-forget manner (asynchronously). In terms of implementation, I have a Supervisor that starts 25 (configurable) GenServers. When a client sends a request, I use cast to forward the request to the Supervisor and then respond to the client. The Supervisor selects a GenServer to handle the request using the :erlang.phash2 function with the client’s phone number. The GenServer then starts and monitors a Task, which does the processing and, once complete, logs a few things to the database and sends an SMS to the client. There’s a lot of logging to file as well, since I call various endpoints.
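Roughly, the dispatch looks like this; a simplified sketch, with illustrative worker names:

```elixir
# Simplified sketch of the described dispatch: hash the client's
# phone number to pick one of N workers (names are illustrative).
defmodule MyApp.Dispatcher do
  @pool_size 25

  def dispatch(phone_number, request) do
    # :erlang.phash2/2 maps the same phone number to the same
    # 0..@pool_size-1 index every time, so a given client always
    # lands on the same worker.
    index = :erlang.phash2(phone_number, @pool_size)
    GenServer.cast(worker_name(index), {:handle, request})
  end

  # Assumes each GenServer registered itself under :worker_0..:worker_24.
  defp worker_name(index), do: :"worker_#{index}"
end
```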
I have three instances of the application running behind HAProxy.
I would like to know if my architecture is sound. Any general advice is welcome; I’ve pitched the idea of writing most of our applications in Elixir, and I need to back it up with a high-performance Elixir application.
Thanks.


Can you explain how the system is meant to handle overload? What should happen if your 25 GenServers and their tasks cannot keep up with the volume of requests coming in via HTTP?

I have a number of instances running behind the soft load balancer.
I’m not quite sure how to scale automatically without starting a new instance, or without increasing the number of GenServers and restarting the instance.
Any advice on that would be extremely helpful.

Hi @kodepett! It’s great to hear that the architecture seems to be serving your current needs.

I think the general concern with architectures like this is that if your server can’t keep up, you’re guaranteed to lose data. Autoscaling won’t help, because scaling up can only solve the issue if your application servers are the bottleneck. If your database becomes the bottleneck, your app will simply run out of memory and crash.

To elaborate on the memory issue: GenServer.cast sends a message to the process mailbox. If the process can’t keep up, the mailbox will just keep filling up until you run out of memory. When your app runs out of memory, the whole thing goes down.
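To make that concrete, here’s a minimal sketch (hypothetical module) showing the backlog accumulating; cast returns :ok immediately no matter how far behind the consumer is:

```elixir
# Minimal sketch: cast always succeeds immediately, so a slow
# consumer's mailbox grows without bound (module is hypothetical).
defmodule SlowConsumer do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(nil), do: {:ok, nil}

  @impl true
  def handle_cast({:work, _payload}, state) do
    Process.sleep(100) # simulate slow processing
    {:noreply, state}
  end
end

# In IEx:
# {:ok, pid} = SlowConsumer.start_link([])
# for i <- 1..10_000, do: GenServer.cast(pid, {:work, i})
# Process.info(pid, :message_queue_len)
# => {:message_queue_len, 9998}  (the backlog sits in memory)
```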

If you can tolerate data loss, then I’d consider turning the GenServers into some kind of bounded queue. There may even be implementations of a bounded queue using :ets, which may provide better read/write performance. When the queue gets full, you can start dropping data explicitly. While not great, that’s better than running out of memory and crashing.
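As a rough sketch of that idea, using a plain GenServer and :queue (module name and size limit are illustrative; an :ets-backed version would follow the same shape):

```elixir
# Rough sketch of a bounded queue that sheds load explicitly
# instead of letting a mailbox grow without bound.
defmodule BoundedQueue do
  use GenServer

  @max_size 10_000 # illustrative limit

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def push(item), do: GenServer.call(__MODULE__, {:push, item})
  def pop, do: GenServer.call(__MODULE__, :pop)

  @impl true
  def init(nil), do: {:ok, {:queue.new(), 0}}

  @impl true
  def handle_call({:push, item}, _from, {q, size}) when size < @max_size do
    {:reply, :ok, {:queue.in(item, q), size + 1}}
  end

  def handle_call({:push, _item}, _from, state) do
    # Full: drop the item and tell the caller, rather than crashing later.
    {:reply, {:error, :overloaded}, state}
  end

  def handle_call(:pop, _from, {q, size}) do
    case :queue.out(q) do
      {{:value, item}, q2} -> {:reply, {:ok, item}, {q2, size - 1}}
      {:empty, q2} -> {:reply, :empty, {q2, 0}}
    end
  end
end
```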

If data loss isn’t acceptable, then you’ll need to rethink the architecture. A common approach is to change the client-facing code to make a synchronous call that writes to a persistent queue, which you can then process.


I fully agree with all the salient points. I once encountered a large message queue in the test environment, hence my adoption of load balancing to spread the load across multiple instances.

Data loss is not acceptable, so I would like to look into the persistent queue solution; any resources would be welcome.

I would also like to know whether I can respond to an incoming web request and continue processing without using a GenServer.

It depends on what you want the response to mean. If you want the response to mean “We have your data and we won’t lose it,” then you shouldn’t send that response until the data is persisted. Anything less and you’re making promises to the client that you can’t keep. 2 million requests per hour is roughly 550 requests a second, and 550 inserts per second into Postgres should be completely doable.

EDIT: Blah, on a train, wifi keeps eating my responses.

The main point is that you don’t actually need to do all of the processing before you send the response; you just need to stick the client’s data somewhere so that you don’t lose it. Postgres isn’t an optimal queue, but it’s easy to use and will perform perfectly well for sustained writes into the tens of thousands of records per second without having to overthink things. So: get a request from the client, write the data down in a table, send the “OK” response. Then you have N GenServers working their way through the table in parallel, ideally sharded by a consistent hash on the ID or similar. If the work for each request is independent, you can just throw all of this into an Oban job queue.
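As a rough sketch of that persist-then-ack flow using Oban (module names like MyApp.SmsWorker and MyApp.Sms.deliver/2 are hypothetical):

```elixir
# Rough sketch: persist the work first, then acknowledge the client.
defmodule MyAppWeb.RequestController do
  use MyAppWeb, :controller

  def create(conn, params) do
    # The insert is the durability guarantee; only then do we say "OK".
    %{phone: params["phone"], payload: params}
    |> MyApp.SmsWorker.new()
    |> Oban.insert!()

    send_resp(conn, 202, "accepted")
  end
end

defmodule MyApp.SmsWorker do
  use Oban.Worker, queue: :sms, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # The slow work happens here: call external endpoints, log,
    # send the SMS. Failures are retried up to max_attempts.
    MyApp.Sms.deliver(args["phone"], args["payload"])
  end
end
```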

From there you just need metrics to make sure that you aren’t falling behind. If you start falling behind, you need more GenServers / app servers. If the database can’t keep up with reads, that’s probably because the queries aren’t efficient or you aren’t pruning old data. If the writes can’t keep up, consider Postgres 12 declarative partitioning and a beefier server. If you’re hitting millions of records per second, look at Kafka.
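For the “are we falling behind” metric, something as simple as polling the backlog works; a sketch assuming the Oban setup above (repo name and threshold are illustrative):

```elixir
# Sketch of a backlog check against Oban's oban_jobs table.
defmodule MyApp.QueueMonitor do
  @backlog_threshold 50_000 # illustrative

  def backlog do
    %{rows: [[count]]} =
      MyApp.Repo.query!(
        "SELECT count(*) FROM oban_jobs WHERE state = 'available'"
      )

    count
  end

  def falling_behind?, do: backlog() > @backlog_threshold
end
```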


Thanks a lot, everyone, for the feedback on my Phoenix app architecture. I’ll give the recommendations a shot, see how it goes, and update you on my findings. I’m most grateful.