From $erverless to Elixir

Really interesting post trending from a tweet earlier today (as seen in another thread). Felt like this one deserved its own topic though.

And then the post that José asked him to write…

23 Likes

That sounds about right for a service I ran about 5 years ago. I was paying about $300/month for a dedicated server (32 cores, 64 GB of RAM, a 1 Gbit dedicated line that was almost always saturated, etc.) and was wondering if I could do it cheaper. I started looking into AWS, and the estimate came out to between $15,000 and $20,000/month.

Ever since then I can’t comprehend why anyone would use such a service. It is practically trivial to set up your own servers if you have even an inkling of what you are doing (I may be biased; I’ve been running a multitude of my own servers for two decades now), even spread around the globe…

7 Likes

API Gateway is insanely expensive. There is almost no platform or architecture you could move to that wouldn’t save you a lot of money when you are running that kind of volume. But still, good to see good press for Elixir.

1 Like

Serverless is great when your volume is low and you have bursts of traffic. If you have sustained load, you can definitely build a more performant app by going serverful :slight_smile:

If you have an hour of activity every day with fewer than 33,000 requests, you end up with roughly a million Lambda requests a month, which falls within the free tier, and even at double that traffic you’d still end up paying less with a serverless model.
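A quick back-of-the-envelope (assuming AWS’s published Lambda request pricing of 1M free requests per month and $0.20 per million after that, and leaving duration charges aside):

```elixir
# 33,000 requests in one active hour per day, over a 30-day month:
requests_per_month = 33_000 * 30
# => 990_000, just inside the 1M/month free tier

# Double the traffic:
doubled = 2 * requests_per_month
# => 1_980_000

# Request charges on the overage, at an assumed $0.20 per million:
overage_cost = (doubled - 1_000_000) / 1_000_000 * 0.20
# => ~$0.20/month
```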

3 Likes

AWS is useful for scaling. #1, you can get huge numbers of servers almost instantly if you need to scale up; #2, for loads with substantial peaks, which describes a lot of internet services, you can actually save money over buying all that hardware, iff you do it right.

But yeah, “everything has to be in the cloud on somebody else’s hardware” is a bit faddish; like you, I run my own dang servers for some services. (But I’ve also worked with AWS a lot, and SoftLayer, and colo for base load + AWS for peaks…)

2 Likes

When I first started working on this project, API Gateway was an “easy choice” because our infrastructure was kind of a mess and this tool was, in part, going to monitor our clients and our web stack.

Once we had solid operational footing on Kubernetes, it was a no-brainer that the service would be better off there, especially since usage had gone way beyond what I could have imagined.

I think serverless architectures are still good solutions for certain problems, and for people coming out of boot camps who want to build something but don’t necessarily have the experience to run their own servers.

There are other products popping up everywhere. kubeless, fission, and knative are great ways to get that functionality cheaply in your own k8s cluster, and there are other tools with great developer experiences, like Zeit, that solve the same problem as APIGW+Lambda for what I believe is much less money (at least when I evaluated it as a potential solution vs. running it in Elixir).

Happy to answer any questions!

2 Likes

@coryodaniel How important is data preservation in your use case? One thing that’s kept me away from using GenServers as a front-line buffer is that if any of them crashes for any reason, that data is basically just gone. Is the nature of your data such that that’s OK, or is there some other recovery mechanism in place should such a crash occur?

2 Likes

We have some loss tolerance, but try our best to not lose anything.

This system needs to be up when other systems go wrong. So if other systems are hunky-dory and we blip and lose a batch of 500 or so (the max that’s ever in our queue)^1, that is acceptable.

We have three GenServers divvied up by concerns.

The one that handles incoming requests makes sure the data is valid and sends it off to our Queue GenServer. The request-handling GenServer only tracks stats in its internal state, so if it goes down we just lose those stats, but that’s fine; they’ve probably been scraped by Prometheus already.
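As a rough illustration (module and function names here are hypothetical, not the actual code), that first GenServer might look something like:

```elixir
defmodule Collector.Ingress do
  use GenServer

  # Minimal sketch: validate incoming events, forward them to the
  # queue, and keep only counters in local state, which is safe to
  # lose on a crash.
  def start_link(_opts) do
    GenServer.start_link(__MODULE__, %{accepted: 0, rejected: 0}, name: __MODULE__)
  end

  def init(stats), do: {:ok, stats}

  def handle_cast({:event, event}, stats) do
    if valid?(event) do
      GenServer.cast(Collector.Queue, {:enqueue, event})
      {:noreply, %{stats | accepted: stats.accepted + 1}}
    else
      {:noreply, %{stats | rejected: stats.rejected + 1}}
    end
  end

  # Stand-in validation; the real checks depend on the event schema.
  defp valid?(event), do: is_map(event) and Map.has_key?(event, :id)
end
```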

The second GenServer (our queue) is the one holding data that could be lost. This GenServer traps exits and, in that case, immediately dequeues everything. We do rolling deploys in Kubernetes and give a Docker container about 45 seconds to “cool down” with no traffic, so regular exits work out fine. Besides overwhelming that process’s heap, I’m not sure what else could destroy it. I guess I’ll find out one day :smiley:
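A minimal sketch of that trap-exit/flush-on-terminate pattern (same hypothetical names as above, with the pending-copy bookkeeping elided):

```elixir
defmodule Collector.Queue do
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def init(:ok) do
    # Trapping exits means terminate/2 runs on a normal shutdown
    # (e.g. a rolling deploy), giving us a chance to drain the queue.
    Process.flag(:trap_exit, true)
    {:ok, :queue.new()}
  end

  def handle_cast({:enqueue, event}, queue) do
    {:noreply, :queue.in(event, queue)}
  end

  def terminate(_reason, queue) do
    # Dequeue everything still in memory and hand it to the worker
    # before the process goes down.
    queue
    |> :queue.to_list()
    |> Enum.each(&Collector.Worker.dispatch/1)
  end
end
```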

The third GenServer is our real worker. And I think you wrote a lot of it (ex_aws)! It dispatches the events it receives (via a cast) to ex_aws, and if that fails to POST to the Kinesis API, it requeues that batch. (Our error rate with Kinesis’s API is less than 1 in 1,000,000 POSTs, so we rarely requeue.) When the queue sends this GenServer the data, it also keeps a copy locally, marked as pending. When the worker is initialized, it asks the queue for anything that is still pending (assuming it had crashed previously) and re-submits it. (There is probably a better way to do this, but I was rushing… aren’t we all.)
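Sketched out under the same assumptions (ExAws.Kinesis.put_records/2 is from the ex_aws_kinesis package; the stream name and Collector.Queue.pending/0 / requeue/1 are hypothetical stand-ins for the pending bookkeeping described above):

```elixir
defmodule Collector.Worker do
  use GenServer

  def dispatch(batch), do: GenServer.cast(__MODULE__, {:dispatch, batch})

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def init(:ok) do
    # On (re)start, ask the queue for anything still marked pending
    # from a previous crash and re-submit it.
    {:ok, :ok, {:continue, :resubmit_pending}}
  end

  def handle_continue(:resubmit_pending, state) do
    Enum.each(Collector.Queue.pending(), &dispatch/1)
    {:noreply, state}
  end

  def handle_cast({:dispatch, batch}, state) do
    # "events-stream" is a placeholder stream name.
    case ExAws.Kinesis.put_records("events-stream", batch) |> ExAws.request() do
      {:ok, _response} -> :ok
      {:error, _reason} -> Collector.Queue.requeue(batch)
    end

    {:noreply, state}
  end
end
```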

We have a hard requirement that all systems interacting with this one pre-generate UUIDs for events, so if something funky happens during dequeueing and an event gets sent downstream twice, we are always upserting, which removes the duplicates.
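In Ecto terms (with a hypothetical Event schema), that downstream upsert could be as simple as:

```elixir
# Because the UUID primary key is pre-generated upstream, inserting
# the same event twice is a no-op the second time.
Repo.insert(%Event{id: event_id, payload: payload},
  on_conflict: :nothing,
  conflict_target: :id
)
```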

I suppose if we lost network connectivity the third GenServer could have some issues that would result in the queue building up, and if that happened to explode we’d lose that data, but if our VPC lost network connectivity I think we’d have some other problems :smiley:

  1. We have our queues configured to handle sets of batches, but they dequeue so quickly it’s rarely above a single batch.
14 Likes