Load testing advice

josefrichter · January 23, 2022, 12:34am

I have an application that I have built in several different architectural variants.

[TLDR: the goal of the application is to let university students enrol into classes for next semester. These systems always failed 20 years ago when I was in uni, and according to my research, they still always fail today.]

To summarize my architectural experiments:

Thousands of people trying to enroll into hundreds of classes with limited capacity

1) just throw it all to postgres, lock table on writes, cut off when capacity filled
[1b) throw it all to postgres, no locking, just write, and then read first 30 in class]
2) throw all to single genserver that serializes it and writes chunks in bulks to postgres every x-items/y-seconds
3) separate genserver for every single class, that then writes to postgres in bulk

I am doing some load testing of these variants on localhost - I “deploy” a production version on localhost and then use K6 that can create ~150 concurrent POST requests to my API endpoint created solely for testing purposes. I’ve been able to reach ~200k enrolments in 1 minute on localhost, but that number might as well be completely meaningless without fully understanding the whole context.

Obviously this way of testing is very approximate and skips half of the workflow.

I’d be curious to hear your advice where to go from here, please.

My inclination would be to:

deploy this to a real server like fly.io or heroku
do some load testing that simulates the real human workflow, i.e. basically a human logging in, going to certain page, hit “enroll” button, going to another page, hit “enroll” button again, etc.

My guesstimate of real-world scenario is something like 10-30 thousand people trying to enrol to ~1 thousand different classes “all at once”. Each person is trying to enrol to ~30 classes out of that 1 thousand.
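To put those guesstimates next to the localhost measurement, here is a back-of-the-envelope sketch in Python (used only for the arithmetic; it is not part of the Elixir app). It assumes every student submits each enrolment exactly once, with no retries and no extra devices, and it optimistically assumes the ~200k/minute localhost figure would carry over to a real deployment, which it probably would not.

```python
# Head counts are the guesstimates from the post; the throughput is the
# localhost k6 figure, ASSUMED here to hold on a real server.
students_low, students_high = 10_000, 30_000
classes_per_student = 30
measured_per_minute = 200_000  # ~200k enrolments/minute on localhost


def total_attempts(students: int, per_student: int) -> int:
    """Enrolment attempts if every student submits each request exactly once."""
    return students * per_student


def minutes_to_drain(attempts: int, per_minute: int) -> float:
    """Time to process all attempts at a constant throughput."""
    return attempts / per_minute


low = total_attempts(students_low, classes_per_student)
high = total_attempts(students_high, classes_per_student)
print(low, high)  # 300000 900000
print(minutes_to_drain(low, measured_per_minute),
      minutes_to_drain(high, measured_per_minute))  # 1.5 4.5
```

So even under these generous assumptions the burst is 300-900 thousand requests, before counting logins, page loads, and retries.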

Could you please give me some hints how to test these scenarios, compare those variants, and especially reveal the real bottlenecks of each solution?

I know I could start fiddling with RabbitMQ/Kafka/GenStage or squeeze in ETS/Redis/Mnesia/Whatnot, but I’d be a headless chicken running here and there without knowing any real data. Now it’s time to understand and measure what my attempts so far can do.

This is becoming a “here be dragons” area for me, so any hints, guidance or mentorship are highly welcome, please. Happy to add you as collaborator to my repo, if you wish. (btw. should I read any particular book on this?)

Thank you.

akoutmos · January 23, 2022, 5:10am

I may be a little biased on this one given I am the author of the library, but I would include PromEx as a dependency of your project and capture the BEAM, Phoenix, and Ecto metrics after running your K6 test suite with each architecture. That way you can test in a production-esque environment and see how your system behaves. I actually wrote a blog post about how to set all this up on Fly.io if that helps: “Monitoring Elixir Apps on Fly.io With Prometheus and PromEx”.

stefanchrobot · January 24, 2022, 8:38am

I’m not an expert on this, but here are my thoughts.

With load testing I would definitely try to get as close to the production infrastructure as possible to get meaningful results. This includes:

The number of servers (application and DB),
The CPU and RAM config of the servers,
The DB config,
Intermediate servers (e.g. proxies),
The amount of data already in the DB.

As far as I understood, you’re building a new app, so I’d go and create two identical production environments and dedicate one to load testing.

Good call on having some initial guesstimates of the amount of traffic. More things to think about:

I’m guessing not everybody is going to sit there on minute one and try to sign up for the classes,
Are people allowed to open multiple tabs at once and try to sign up that way?
Are they going to be able to prepare an open tab, keep refreshing it, and then hit the “sign up” button? Or will they be able to log in to the system only after a specified point in time?

What I’d be trying to do with the questions above is to predict the behaviour patterns of the users. Then it would be great to replicate them in an automated test. If the app is a SPA, I think I would go with testing it via the API. If it’s SSR, I would replicate what the browser does. Either way, these tests could still be written in Elixir, but it should be a separate app (don’t use Phoenix test helpers as they short circuit certain things; use some HTTP client and Floki).

My prediction for the first run is that it would fail because of:

The server not being configured to accept that many connections at once,
Too few DB connections,
DB connection queueing timeouts,
Testing tool not being able to produce the needed load (that’s an argument for using a well established load testing tool).

One thing I would consider is to step back and think about whether it’s possible to have a product (not technical) design which would make some of those problems disappear. The UX doesn’t have to be great - people use it once per semester. Maybe there’s a way to spread out the load, e.g. let more senior students sign up first?

LostKobrakai · January 24, 2022, 8:57am

Maybe it’s not everybody, but I’d expect exactly that to be the reason those systems fail.

These things to my knowledge need to be fair, which usually means nobody can be denied their chance from within the system.

In general I’d also try to figure out how fast students need confirmations about their enrolments, i.e. can you accept enrolment attempts and only later come back to the student to tell them whether they’re successfully enrolled in the course? This “later” doesn’t need to be long: even checking once a minute or every few minutes could allow you to do certain checks less often than per request, potentially even on a separate node, by splitting writes from reads. It’ll also allow you to better cache/cache-bust read-heavy parts of the system, which will likely be hit hard as well, especially when they become the success indicators. CQRS in general can be a good step towards an event-driven system, which allows for a few infrastructure scenarios useful for scaling things independently as needed.

josefrichter · January 24, 2022, 3:36pm

Thank you, yes I think I will try to deploy to production (although probably just basic config), seed realistic data, and then I want to run the tests. So I am trying to figure out how to make it as realistic as possible.

In fact this is EXACTLY what is happening, as also noted by @LostKobrakai

The universities don’t know a better way to make the system “fair” than announcing a datetime when the gates will open.

Which means almost all students of that university sit at their computers at that very minute and second and start firing requests all at once, it’s a brutal race for free slots.

What’s worse, they generally have as many devices or tabs open as possible, they even ask their friends and family to log in from all their devices and help them in this race. I’m not joking, I have this confirmed in interviews.

So at the exact second there will be 10-20 thousand people hitting “enroll” on 5-10 different devices all at once and praying for success. Of course, the majority of these requests time out and fail, so they frantically hit refresh buttons and try over and over again, spiralling the problem out of control.

This process usually takes several hours before everyone is able to enrol to at least the minimum they need, and of course their schedule is a disastrous mess, far from optimal.

There are literally stories of people who then need to go around and ask professors to take them into a class over capacity, otherwise they won’t have enough credits, they cannot finish their degree, etc. And there are also stories of students who actually dropped out of university because they couldn’t create a meaningful schedule that would let them work while studying, because they’re from a poor family and have to work to get by.

Given the above, they unfortunately need confirmation in semi-realtime (worst case, let’s say, within seconds), because they need to act immediately and frantically try to find another course that was their “plan B”, and slots are filling super quickly.

^ so this is the summary of the horrors

Now my step 1 solution would be not to persuade universities that the system is stupid; that’s an uphill battle. And you cannot really uproot something that has worked this way for 20 years (again, not kidding). Instead I’d like to have a system that can handle that load, because 20 thousand people trying to enroll in 20 classes each doesn’t sound like impossible numbers, especially in the Elixir world.

Right now I have something that can make ~240 thousand enrollments in 1 minute on a standard macbook with postgres.

What I’m doing right now (one of the approaches) is that each class is a genserver that collects the enrollment requests (casts with a microsecond timestamp) and serializes them, which is exactly what we need in this approach. It then writes them in bulk to postgres, every 1 second or every 1000 enrollments, whichever comes first. So I don’t do 1000 writes every second, but 1 write of 1000 records every second, basically. [if you look at my original post, this is actually the most complex solution; I want to measure the much easier ones too - basically postgres can ingest gazillions]
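The flush rule described here (write when the buffer reaches a size limit or when an interval has elapsed) can be sketched as follows. This is a single-threaded Python illustration of the logic only, not the actual GenServer: `ClassBatcher`, `write_bulk`, `tick`, and the injected `now_us` clock are hypothetical names, and `write_bulk` stands in for one multi-row insert into postgres.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Enrolment = Tuple[int, str]  # (timestamp in microseconds, student id)


@dataclass
class ClassBatcher:
    """Buffers enrolments for one class and hands them over in bulk."""
    write_bulk: Callable[[List[Enrolment]], None]  # stand-in for one bulk INSERT
    max_batch: int = 1000
    flush_interval_us: int = 1_000_000  # 1 second
    buffer: List[Enrolment] = field(default_factory=list)
    last_flush_us: int = 0

    def enrol(self, now_us: int, student_id: str) -> None:
        """Accept a request; arrival order is the serialization order."""
        self.buffer.append((now_us, student_id))
        if len(self.buffer) >= self.max_batch:
            self.flush(now_us)

    def tick(self, now_us: int) -> None:
        """Called periodically; flushes pending items once the interval elapsed."""
        if self.buffer and now_us - self.last_flush_us >= self.flush_interval_us:
            self.flush(now_us)

    def flush(self, now_us: int) -> None:
        """Hand the whole buffer to write_bulk as a single batch."""
        batch, self.buffer = self.buffer, []
        self.last_flush_us = now_us
        self.write_bulk(batch)


# Usage: a small batch size makes the two flush triggers easy to see.
writes: List[List[Enrolment]] = []
batcher = ClassBatcher(write_bulk=writes.append, max_batch=3)
for i, student in enumerate(["ann", "bob", "cy", "dee"]):
    batcher.enrol(now_us=i, student_id=student)
batcher.tick(now_us=1_000_002)  # interval elapsed, so the leftover is flushed
print([len(b) for b in writes])  # [3, 1]
```

Passing the clock in as an argument keeps the sketch deterministic; a real process would use its own timer and would also need to decide what happens to the buffer if the bulk write fails.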

This way the user gets an immediate response that their enrolment was received (response from genserver) and then 1 second later they get another green checkmark that it was persisted in postgres. And this is actually shown in LiveView, so it’s pretty cool watching it unfold.

Of course step 2 would be to move away from this nonsense and create a system where everyone can pre-enroll their “ideal” schedule, with some backup options, and the system would optimize it to find the best possible combination for everyone.

Some universities do try to spread the load by opening enrolments gradually and letting more senior students enrol first. That makes sense, although the “fairness” of that system is also questionable in various scenarios. But at least it’s an attempt to take into account a more logical ranking criterion than “who hits the button quicker”.

Some universities allow you to pre-enroll your ideal schedule. But they generally don’t do the step 2 of optimizing the result for you. They basically just report back that 70% was successfully written for you and the remaining 30% failed. And now you can get into the system and scrape the leftovers (which may be nothing). So this also doesn’t work great…

LostKobrakai · January 24, 2022, 3:41pm

It seems like you already did the important step here though, which is decoupling accepting enrollment attempts from showing a result of “you’re in or not in”.

josefrichter · January 24, 2022, 4:33pm

Yeah exactly. I cast the enrolment to class genserver. The genserver adds it to its state and broadcasts it, so that all liveviews can pick it up - that way you know genserver got it, coz it shows up in your liveview, but it’s completely async.

And then the genserver writes it to postgres in bulk, and if that’s successful, there’s another broadcast which again updates the liveview with green checkmarks (meaning it’s received AND persisted; sort of inspired by WhatsApp double checkmarks here).

My guess is these two steps avoid the two biggest potential bottlenecks. But I now need to test my hypothesis that these were the two biggest potential bottlenecks.

stefanchrobot · January 25, 2022, 7:21am

Wow, what a mess. I think that letting students of later years register before the students of previous years at least gives one a chance to register to the most wanted courses. Otherwise you randomly roll the dice and might not have a chance to attend a course at all.

Since you’re planning to use LiveView, it seems it should not be that difficult to force users to use just one session (deny more than one web socket connection), which should put a known cap on the concurrency that you need.

How about this: allow each student to prepare a prioritised list of courses they want to attend and spin up a GenServer for each student that attempts to sign up for the courses from top to bottom of the list. You’d basically simulate what people are trying to do, but in a controlled way - you can inject fair sleeping periods between requests to lower the load. Sounds like a solution that isn’t too far from the way the current system works, so there should be no complaints about the change of behaviour.

josefrichter · January 25, 2022, 1:36pm

That’s a neat first solution for step 2 I think! I was thinking of various ways how to do the optimization, but this might be the easiest first attempt. Just ask the users to make a list sorted by their priority, and then go in rounds, kinda like NHL draft

There are some additional quirks though: your list might look like this:

1) Maths 101 on Monday
2) Maths 101 on Wednesday
3) Maths 101 on Friday
…

^ coz you simply need Maths 101. And this way you might end up with 3 enrolments and unnecessarily block 2 slots. So I’d need to add some more logic, e.g. to skip 2 and 3 if 1 was already successful, etc.
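A minimal Python sketch of that idea: go through the prioritised lists in draft-style rounds, one attempt per student per round, and skip any remaining alternatives for a course the student already holds. The function name `run_draft`, the `(course, section)` representation, and the rule that students are served in dictionary order within a round are all assumptions made for illustration; the ordering within a round is a fairness policy that would need a real decision. The sketch does not look at timetable clashes.

```python
from typing import Dict, List, Tuple

Choice = Tuple[str, str]  # (course, section), e.g. ("Maths 101", "Mon")


def run_draft(
    wishlists: Dict[str, List[Choice]],
    capacity: Dict[Choice, int],
) -> Dict[str, List[Choice]]:
    """Round-robin allocation: each student gets one attempt per round."""
    remaining = dict(capacity)
    queues = {student: list(wishes) for student, wishes in wishlists.items()}
    result: Dict[str, List[Choice]] = {student: [] for student in wishlists}

    while any(queues.values()):
        for student, queue in queues.items():
            held = {course for course, _ in result[student]}
            # Skip alternatives for courses the student is already enrolled in.
            while queue and queue[0][0] in held:
                queue.pop(0)
            if not queue:
                continue
            choice = queue.pop(0)
            if remaining.get(choice, 0) > 0:
                remaining[choice] -= 1
                result[student].append(choice)
    return result


wishlists = {
    "ann": [("Maths 101", "Mon"), ("Maths 101", "Wed"), ("English", "Wed")],
    "bob": [("Maths 101", "Mon"), ("Maths 101", "Wed")],
}
capacity = {("Maths 101", "Mon"): 1, ("Maths 101", "Wed"): 1, ("English", "Wed"): 1}
print(run_draft(wishlists, capacity))
```

In this example ann gets Maths 101 on Monday in round one, her Wednesday alternative is skipped, and the freed Wednesday slot goes to bob, whose first choice was already full.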

But it gets more complicated. Your list might be:

1) Maths on Monday 10:00
2) Maths on Tuesday 10:00
3) English on Wednesday 10:00
4) English on Monday 10:00

^ if only 1 and 4 succeed and 2 was automatically skipped, you’re screwed…