We have some loss tolerance, but we try our best not to lose anything.
This system needs to be up when other systems go wrong. So if other systems are hunky dory and we blip and lose a batch of 500 or so (the max that's ever in our queue)^1, that is acceptable.
We have three GenServers divvied up by concerns.
The one that handles incoming requests makes sure the data is valid and sends it off to our Queue GenServer. The request-handling GenServer only tracks stats in its internal state, so if it goes down we just lose those stats, but that's fine; they've probably already been scraped by Prometheus.
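If it helps to picture it, that first piece is roughly this shape (Ingest, Queue, and the validation here are made up for the sketch, not our real code):

```elixir
defmodule Ingest do
  use GenServer

  def start_link(opts \\ []),
    do: GenServer.start_link(__MODULE__, %{accepted: 0, rejected: 0}, Keyword.put_new(opts, :name, __MODULE__))

  def init(stats), do: {:ok, stats}

  # Validate, hand the event off to the queue, and keep only counters in our own state.
  def handle_call({:event, event}, _from, stats) do
    if valid?(event) do
      GenServer.cast(Queue, {:enqueue, event})
      {:reply, :ok, %{stats | accepted: stats.accepted + 1}}
    else
      {:reply, {:error, :invalid}, %{stats | rejected: stats.rejected + 1}}
    end
  end

  # Stand-in validation; the real checks are more involved.
  defp valid?(event), do: is_map(event) and Map.has_key?(event, :id)
end
```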
The second GenServer (our queue) is the one holding data that could be lost. This GenServer traps exits and, in that case, immediately dequeues everything. We do rolling deploys in Kubernetes and give a Docker container about 45 seconds to "cool down" with no traffic, so regular exits work out fine. Besides overwhelming that process's heap, I'm not sure what else could destroy it. I guess I'll find out one day.
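The trap-exit part is pretty standard OTP; a stripped-down sketch with a lot of detail (batching, the pending bookkeeping) left out:

```elixir
defmodule Queue do
  use GenServer

  def start_link(opts \\ []),
    do: GenServer.start_link(__MODULE__, [], Keyword.put_new(opts, :name, __MODULE__))

  def init(events) do
    # Trap exits so terminate/2 runs during a normal k8s shutdown, not just on crashes.
    Process.flag(:trap_exit, true)
    {:ok, events}
  end

  def handle_cast({:enqueue, event}, events), do: {:noreply, [event | events]}

  # Flush whatever is still buffered to the worker before this process goes away.
  def terminate(_reason, events) do
    unless events == [], do: GenServer.cast(Worker, {:dispatch, Enum.reverse(events)})
    :ok
  end
end
```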
The third GenServer is our real worker. And I think you wrote a lot of it (ex_aws)! It dispatches the events it receives (via a cast) to ex_aws, and if that POST to the Kinesis API fails, it requeues that batch. (Our error rate with Kinesis's API is less than 1 in 1,000,000 POSTs, so we rarely requeue.) When the queue sends this GenServer data, the queue also keeps a copy locally, marked as pending. When the worker is initialized, it asks the queue for anything still pending (assuming it had crashed previously) and re-submits it. (There is probably a better way to do this, but I was rushing… aren't we all.)
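A very loose sketch of that worker, with the queue's pending-tracking side left out (the Queue messages, the stream name, Jason for encoding, and the exact ExAws.Kinesis call shape are my approximations here, not our real code):

```elixir
defmodule Worker do
  use GenServer

  @stream "events"  # placeholder stream name

  def start_link(opts \\ []),
    do: GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))

  def init(:ok) do
    # If a previous incarnation crashed mid-flight, the queue still has those
    # batches marked pending, so ask for them and re-dispatch to ourselves.
    for batch <- GenServer.call(Queue, :pending_batches) do
      GenServer.cast(self(), {:dispatch, batch})
    end

    {:ok, :no_state}
  end

  def handle_cast({:dispatch, batch}, state) do
    records = Enum.map(batch, &%{data: Jason.encode!(&1), partition_key: &1.id})

    case ExAws.Kinesis.put_records(@stream, records) |> ExAws.request() do
      {:ok, _resp} -> GenServer.cast(Queue, {:ack, batch})      # queue can drop its pending copy
      {:error, _} -> GenServer.cast(Queue, {:requeue, batch})   # rare Kinesis failure: put it back
    end

    {:noreply, state}
  end
end
```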
We have a hard requirement on all systems that interact with this one to pre-generate UUIDs for events, so if something funky happens during dequeueing and an event gets sent downstream twice, we are always upserting to remove duplicates.
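Purely for illustration (elixir_uuid is just an example library), the idea is that the producer mints the id before the event ever hits us:

```elixir
# The producer generates the UUID up front, so a re-sent event carries the same id
# and the downstream upsert collapses the duplicate.
event = %{id: UUID.uuid4(), type: "page_view", payload: %{path: "/"}}
```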
I suppose if we lost network connectivity, the third GenServer could have some issues that would result in the queue building up, and if that happened to explode we'd lose that data. But if our VPC lost network connectivity, I think we'd have some other problems.
- We have our queues configured to handle sets of batches, but they dequeue so quickly it's rarely above a single batch.