josefrichter
Load testing advice
I have an application with several different versions of architecture.
[TLDR: the goal of the application is to let university students enrol into classes for next semester. These systems always failed 20 years ago when I was in uni, and according to my research, they still always fail today.]
To summarize my architectural experiments:
Thousands of people trying to enroll into hundreds of classes with limited capacity
- just throw it all to postgres, lock table on writes, cut off when capacity filled
- [1b) throw it all to postgres, no locking, just write, and then read first 30 in class]
- throw all to single genserver that serializes it and writes chunks in bulk to postgres every x-items/y-seconds
- separate genserver for every single class, that then writes to postgres in bulk
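For readers who want something concrete, the last variant (one GenServer per class) could look roughly like the sketch below. This is not the code from the repo: `ClassServer`, `enroll/2` and `flush/1` are made-up names, the timer-driven flush is left out, and the Postgres bulk write is only marked with a comment.

```elixir
defmodule ClassServer do
  # Hypothetical sketch: one process per class. Because a GenServer handles
  # one message at a time, the capacity check needs no database lock.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def enroll(server, student_id), do: GenServer.call(server, {:enroll, student_id})

  # Returns the enrolments accepted since the last flush, in arrival order.
  def flush(server), do: GenServer.call(server, :flush)

  @impl true
  def init(opts) do
    capacity = Keyword.fetch!(opts, :capacity)
    {:ok, %{capacity: capacity, enrolled: MapSet.new(), pending: []}}
  end

  @impl true
  def handle_call({:enroll, student_id}, _from, state) do
    cond do
      MapSet.member?(state.enrolled, student_id) ->
        {:reply, {:error, :already_enrolled}, state}

      MapSet.size(state.enrolled) >= state.capacity ->
        {:reply, {:error, :full}, state}

      true ->
        state = %{
          state
          | enrolled: MapSet.put(state.enrolled, student_id),
            pending: [student_id | state.pending]
        }

        {:reply, :ok, state}
    end
  end

  def handle_call(:flush, _from, state) do
    # Assumption: a real app would bulk-insert this batch into Postgres here.
    {:reply, Enum.reverse(state.pending), %{state | pending: []}}
  end
end

# Usage: a class with 2 seats and 5 students arriving one after another.
{:ok, pid} = ClassServer.start_link(capacity: 2)
results = for id <- 1..5, do: ClassServer.enroll(pid, id)
IO.inspect(results)
# → [:ok, :ok, {:error, :full}, {:error, :full}, {:error, :full}]
```

The trade-off in this sketch is that accepted enrolments live only in process memory until the next flush, so a crash between flushes would lose them.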
I am doing some load testing of these variants on localhost - I “deploy” a production version on localhost and then use K6 that can create ~150 concurrent POST requests to my API endpoint created solely for testing purposes. I’ve been able to reach ~200k enrolments in 1 minute on localhost, but that number might as well be completely meaningless without fully understanding the whole context.
Obviously this way of testing is very approximate and skipping half of the workflow.
I’d be curious to hear your advice where to go from here, please.
My inclination would be to:
- deploy this to a real server like Fly.io or Heroku
- do some load testing that simulates the real human workflow, i.e. basically a human logging in, going to certain page, hit “enroll” button, going to another page, hit “enroll” button again, etc.
My guesstimate of real-world scenario is something like 10-30 thousand people trying to enrol to ~1 thousand different classes “all at once”. Each person is trying to enrol to ~30 classes out of that 1 thousand.
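As a rough sanity check, those guesstimates can be turned into a total number of enrolment attempts and compared with the ~200k/minute measured on localhost. The module and function names below are made up for illustration, and the comparison assumes the localhost rate would carry over to a real deployment, which it probably will not.

```elixir
defmodule Estimate do
  # Total enrolment attempts if every student tries every class on their list.
  def attempts(students, classes_per_student), do: students * classes_per_student

  # Minutes needed to work through the attempts at a given sustained rate.
  def minutes(attempts, attempts_per_minute), do: attempts / attempts_per_minute
end

for students <- [10_000, 30_000] do
  attempts = Estimate.attempts(students, 30)
  minutes = Estimate.minutes(attempts, 200_000)
  IO.puts("#{students} students -> #{attempts} attempts, ~#{minutes} min at the localhost rate")
end
# → 10000 students -> 300000 attempts, ~1.5 min at the localhost rate
# → 30000 students -> 900000 attempts, ~4.5 min at the localhost rate
```

So the whole rush is somewhere between 300k and 900k writes, before counting logins and page loads.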
Could you please give me some hints how to test these scenarios, compare those variants, and especially reveal the real bottlenecks of each solution?
I know I could start fiddling with RabbitMQ/Kafka/GenStage or squeeze in ETS/Redis/Mnesia/Whatnot, but I’d be a headless chicken running here and there without knowing any real data. Now it’s time to understand and measure what my attempts so far can do.
This is becoming a “here be dragons” area for me, so any hints, guidance or mentorship are highly welcome, please. Happy to add you as collaborator to my repo, if you wish. (btw. should I read any particular book on this?)
Thank you.
Most Liked Responses
akoutmos
I may be a little biased on this one given that I am the author of the library, but I would include PromEx as a dependency of your project and capture the BEAM, Phoenix, and Ecto metrics after running your K6 test suite with each architecture. That way you can test in a production-esque environment and see how your system behaves. I actually wrote a blog post about how to set all this up on Fly.io if that helps: Monitoring Elixir Apps on Fly.io With Prometheus and PromEx · The Fly Blog
LostKobrakai
Maybe it’s not everybody, but I’d expect exactly that to be the problem why those systems fail.
These things to my knowledge need to be fair, which usually means nobody can be denied their chance from within the system.
In general I’d also try to figure out how fast students need confirmation of their enrolments. In other words: can you accept enrolment attempts and only later come back to the student telling them whether they’re successfully enrolled in the course? This “later” doesn’t need to be long; even a delay of minutes, or processing once a minute, could allow you to do certain checks less often than per request, potentially even on a separate node, by splitting writes from reads. It’ll also allow you to better cache/cache-bust read-heavy parts of the system, which will likely be hit hard as well, especially when they become the success indicators. CQRS in general can be a good step towards an event-driven system, which allows for a few infrastructure scenarios useful for scaling things independently as needed.
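To make the “accept now, confirm later” idea a bit more tangible, here is a minimal sketch of the settling step as a pure function: attempts are only recorded at request time, and a periodic job later walks them in arrival order and hands out seats. `DeferredEnrollment.settle/2` and the data shapes are assumptions for illustration, not an existing API.

```elixir
defmodule DeferredEnrollment do
  # attempts:   list of {student_id, class_id} tuples in arrival order
  # capacities: map of class_id => free seats
  # Returns the attempts split into accepted and rejected, order preserved.
  def settle(attempts, capacities) do
    {accepted, rejected, _seats_left} =
      Enum.reduce(attempts, {[], [], capacities}, fn {_student, class} = attempt,
                                                     {acc, rej, left} ->
        case Map.get(left, class, 0) do
          seats when seats > 0 ->
            {[attempt | acc], rej, Map.put(left, class, seats - 1)}

          _none_left ->
            {acc, [attempt | rej], left}
        end
      end)

    %{accepted: Enum.reverse(accepted), rejected: Enum.reverse(rejected)}
  end
end

attempts = [{1, :math}, {2, :math}, {3, :math}, {1, :art}]
IO.inspect(DeferredEnrollment.settle(attempts, %{math: 2, art: 1}))
# → %{accepted: [{1, :math}, {2, :math}, {1, :art}], rejected: [{3, :math}]}
```

Because the request path then only appends an attempt, the capacity check runs once per batch instead of once per request, and it could run on a different node than the one serving reads.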
stefanchrobot
I’m not an expert on this, but here are my thoughts.
With load testing I would definitely try to get as close to the production infrastructure as possible to get meaningful results. This includes:
- The number of servers (application and DB),
- The CPU and RAM config of the servers,
- The DB config,
- Intermediate servers (e.g. proxies),
- The amount of data already in the DB.
As far as I understood, you’re building a new app, so I’d go and create two identical production environments and dedicate one to load testing.
for having some initial guesstimates on the amount of traffic. More things to think about:
- I’m guessing not everybody is going to sit there on minute one and try to sign up for the classes,
- Are people allowed to open multiple tabs at once and try to sign up that way?
- Are they going to be able to prepare an open tab and keep on refreshing it and then hit the “sign up” button? Or will they be able to log into the system only after a specified point in time?
What I’d be trying to do with the questions above is to predict the behaviour patterns of the users. Then it would be great to replicate them in an automated test. If the app is a SPA, I think I would go with testing it via the API. If it’s SSR, I would replicate what the browser does. Either way, these tests could still be written in Elixir, but it should be a separate app (don’t use Phoenix test helpers as they short circuit certain things; use some HTTP client and Floki).
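A bare-bones version of such a separate Elixir driver might look like the sketch below. It only shows the concurrency and bookkeeping; `LoadSketch` is a made-up name, and the `journey` function is a stand-in for the real steps (log in, open a page, hit “enroll”), which would use an HTTP client and are not shown here.

```elixir
defmodule LoadSketch do
  # Runs `journey` once per user id with at most `concurrency` users in
  # flight, and reports how many journeys succeeded and how long it took.
  # `journey` must return :ok on success; anything else counts as a failure.
  # Note: a journey that raises will crash the caller, since tasks are linked.
  def run(user_ids, journey, concurrency) do
    {micros, results} =
      :timer.tc(fn ->
        user_ids
        |> Task.async_stream(journey,
          max_concurrency: concurrency,
          timeout: 30_000,
          on_timeout: :kill_task
        )
        |> Enum.to_list()
      end)

    ok = Enum.count(results, &match?({:ok, :ok}, &1))
    %{ok: ok, failed: length(results) - ok, seconds: micros / 1_000_000}
  end
end

# Stand-in journey: every tenth user finds the class already full.
fake_journey = fn id -> if rem(id, 10) == 0, do: {:error, :full}, else: :ok end
IO.inspect(LoadSketch.run(1..100, fake_journey, 150))
```

This is nowhere near a replacement for an established load testing tool, but it can be handy for quickly checking that a scripted user journey behaves as expected before scaling it up.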
My prediction for the first run is that it would fail because of:
- The server not being configured to accept that many connections at once,
- Too few DB connections,
- DB connection queueing timeouts,
- Testing tool not being able to produce the needed load (that’s an argument for using a well established load testing tool).
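For the two DB-related points, the relevant knobs live in the Ecto repo config. The fragment below is only a sketch: `:my_app` and `MyApp.Repo` are placeholder names and the numbers are illustrative, not recommendations.

```elixir
import Config

config :my_app, MyApp.Repo,
  # Number of connections in the pool (Ecto's default is 10).
  pool_size: 20,
  # Target time in ms a request should wait for a connection (default 50).
  queue_target: 50,
  # Window in ms over which queue times are evaluated (default 1000).
  # When checkouts keep exceeding the target, requests start failing
  # with DBConnection.ConnectionError instead of queueing forever.
  queue_interval: 1_000
```

Keep in mind that the total pool size across all app nodes has to stay below what the Postgres server itself allows (`max_connections`).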
One thing I would consider is to step back and think about whether it’s possible to have a product (not technical) design which would make some of those problems disappear. The UX doesn’t have to be great - people use it once per semester. Maybe there’s a way to spread out the load, e.g. let more senior students sign up first?