Our government in Czechia (10 mil. people) has launched several projects in recent months: 1. Toll e-vignette eshop 2. COVID vaccination booking 3. Census system.
All three failed on day one and remained down for 24+ hours.
I am well aware that high concurrency can be fairly difficult. But what do you think is the main source of failures in the real world?
Wrong choice of technology?
Underestimating the problem?
Not understanding the problem?
This is a classic IT problem and we do have solutions to it, don’t we? No solution is bulletproof, but I am wondering why the failure rate in the real world is in fact so high. Do you have similar experience?
I think this and ‘not architecting things properly’ are probably good enough reasons for why highly-concurrent projects in the real world might fail.
Architecting concurrent applications, especially when it’s your first time doing so, is not an easy task. The actor model and other design patterns of Elixir/Erlang are not taught often in school (or especially in bootcamps), so people often learn how to architect concurrent applications themselves. This can lead to gaps in their knowledge or the Dunning-Kruger effect rearing its ugly head, which in turn leads to preventable architectural errors being introduced.
not architecting things properly & Dunning-Kruger effect
It seems to be part of that, as usual. They are currently blaming one of the components (typeahead for your home address) for being buggy and generating too much load on the database.
The interesting thing is that they needed several hours to realize that they had an issue with the load caused by this service. Also, a properly designed and deployed distributed system should not be brought to its knees by one misbehaving component. Like, where is graceful degradation, service isolation, etc.?
And also, are you telling me that you claim to have a microservice architecture and yet have a shared database?
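To make the "graceful degradation / service isolation" point above a bit more concrete, here is a minimal circuit-breaker sketch in Python. It is only an illustration of the general pattern, not how the census system was (or should have been) built. The names CircuitBreaker, max_failures and reset_after are made up for this example.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a while and serves a fallback instead.

    Hypothetical sketch: thresholds and names are assumptions, not taken from any real system.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before trying the dependency again
        self.clock = clock               # injectable so the behaviour can be tested without sleeping
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed (calls are allowed)

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # circuit open: do not touch the dependency at all
            # Cool-down elapsed: let one trial call through (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip the breaker
            return fallback()
        self.failures = 0                # a success resets the failure count
        return result
```

With something like this in front of the address typeahead, a struggling suggestion service would cost users the suggestions (fallback: an empty list, type the address by hand) rather than the whole form.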
I can’t speak for other countries or other US states, but from things I saw in the past here in the state of Florida, USA, when the state would put out a request for proposal (RFP), the RFP would lay out the requirements of what needed to be done, and then go on to tell you what software and hardware you were allowed to use in order to accomplish it!
That was 15 years ago, and back then most of the time they wanted Dell hardware & a Windows Server OS, but the budgets were astounding! If you could manage to get a contract, there was plenty of money to be made.
Here in Florida, back in 2013, they spent $63 million USD to build CONNECT, Florida’s new unemployment system which failed in a spectacular fashion when COVID hit. And as usual, plenty of finger pointing and blame to go around. In 2020, during the COVID pandemic, they spent an additional $25 million USD patching it up to make it halfway work. Now, the price tag to replace it and build a whole new system is $70 million+ USD. It’s just unreal!
I look at something like Florida’s CONNECT failure and think, “Gee, I could have done that better with Elixir & Phoenix and a few Linux containers!” LOL But then I remember those old RFPs and think “Nah, they have you set up for failure before you even get started!”
Very often the case. The companies that have the necessary connections to secure government contracts usually skimp a lot on dev salaries so they have some people who are, shall we say, not very professional, and also very set in their ways. In my home country (Bulgaria) a lot of these people still deem PHP a cutting-edge technology. Some of them started praising Java as an upcoming modern tech lately…
No offense to Java or PHP btw; they both work fine in their niches. But the websites that regularly receive “the hug of death” (hugely spiking usage) are not their strongest areas – to be fair, it’s not the strongest area for 99% of the languages out there save for just a few (Erlang/Elixir included). If you are unaware of Elixir or Rust or Golang you’d definitely need an autoscaling Kubernetes cluster in AWS. And they have no clue those things even exist.
So it’s usually a mix of “the wrong tool for the job” and “when all you have is a hammer all problems look like a nail”.
Not very fashionable to say what I am about to say in this crazy extreme politically correct landscape but… yes, this is very often what’s going on as well (as I implied a little above).
When I asked some of these guys what kind of thread / process pools they use, one of them just hand-waved all scalability problems with “meh, we have 20 PHP processes running with a load balancer ON THE SAME MACHINE, that is enough” and most of the programmers at the table discreetly facepalmed, me included.
These are the people who work in the companies that secure governmental contracts.
Connections with the right people – and being at the right place at the right time – overrule any other considerations, by a lot.
For the recent census system, I think they reported there were 170K people working with the system at the same time on Saturday. The HTTP application layer is one thing, but the database load it created is another. For the typeahead I would expect that something like Elasticsearch was used. And for the data storage load, something like Kafka or Apache Pulsar can be used to handle a huge throughput with low latency.
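Just to illustrate why the typeahead does not have to hit the main database at all, here is a tiny in-memory prefix lookup in Python. To be clear, this is not Elasticsearch and not what the census system used. It is a swapped-in toy with made-up names (PrefixIndex, min_chars, max_results) that only shows the idea of answering suggestions from a separate, read-only index.

```python
import bisect


class PrefixIndex:
    """Answers prefix queries from a sorted in-memory list instead of the main database.

    Hypothetical sketch: entries are lowercased for matching, so the original casing is not kept.
    """

    def __init__(self, entries, min_chars=3, max_results=10):
        self._keys = sorted(e.lower() for e in entries)
        self.min_chars = min_chars       # ignore very short prefixes that would match too much
        self.max_results = max_results   # cap the work done per keystroke

    def suggest(self, prefix):
        prefix = prefix.strip().lower()
        if len(prefix) < self.min_chars:
            return []
        # Binary search for the first key that is >= prefix, then scan forward.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:start + self.max_results]:
            if not key.startswith(prefix):
                break
            matches.append(key)
        return matches
```

Each lookup is a binary search plus at most max_results comparisons, and it never competes with the writes of submitted forms for database capacity.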
Applications like this create temporary spikes and thus should contain some protective layer that politely refuses users and asks them to come back later, rather than failing. The census is a one-time system, so it’s probably not worth creating some monster architecture to handle one or two days of high load.
But generally these governmental projects really don’t shine here in the Czech Republic.
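The "protective layer" mentioned above is usually called load shedding or admission control. A minimal sketch in Python, assuming a hypothetical AdmissionGate class and a made-up limit (nothing here comes from the actual census system):

```python
import threading


class AdmissionGate:
    """Admits at most `limit` concurrent requests; everyone else is told to retry later.

    Hypothetical sketch: the status codes and the message are illustrative only.
    """

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, work):
        # Non-blocking acquire: never let requests pile up behind a saturated backend.
        if not self._slots.acquire(blocking=False):
            return 503, "The system is busy, please try again in a few minutes."
        try:
            return 200, work()
        finally:
            self._slots.release()   # free the slot even if work() raises
```

The point of the design is that a rejected request costs almost nothing, so the users who do get in still see a working system instead of everyone sharing a dead one.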
Thank you everyone. I was not trying to bash the local government, that’s a different topic unto itself
Putting aside political dirt, I am simply curious why it happens so much in general. It is such a common problem. It is the number one problem anyone would think of for any such system launched at a single point in time to 10 million people. Yet the failure rate seems to be close to 100%. Even if it’s written in PHP, this is manageable, to a certain extent.
It’s not so much about the language / framework, it’s about whether the people who use language X or framework Y are skilled. And oftentimes they are not; they are just warming seats in a company with a lot of political connections.
It’s closely related to the topic of the high failure rate of governmental projects.
Frankly, NSW (New South Wales, Australia) governmental projects aren’t bad at all. I’d even say impressively good, from the customer standpoint. To be frank, I’d even rate many Service NSW projects heaps above commercial services like, say, Telstra.
Ah, if anything, NBN isn’t really a government project, so it does not count.
OK, another new example from my home country. The health ministry is splitting the EU donations among applicants, on a first-come, first-served basis, by letting them submit applications at 14:00 sharp. 400M EUR in play. As a result, they have teams of folks sitting at computers, waiting to hit the submit button at the same time. One of the applicants appealed against the system, coz they ended up in 68th position, allegedly submitting at 14:00:03…
I admit this is not necessarily a high concurrency problem though - there are probably just a few hundred incoming requests, not hundreds of thousands. Yet the system crashed, of course.