Best practices: gen_stage to prevent overloading database

I’m resurrecting some old code that relied on gen_stage to prevent the databases (postgres and mongo) from being overloaded. However, now that I’m looking at it, I’m not sure it follows best practices. From what I remember, we had problems when processing large amounts of data that required database lookups. Conceptually, this boiled down to something like

big_list_of_records
|> Enum.reject(fn x -> already_exists_in_db?(x) end)
|> cont_to_next_step()
# ... etc...

From what I remember, when we had a bunch of processes running code like this, the database could get slammed (I don’t remember whether this affected both mongo and postgres, only one of them, or whether it was really a driver issue).

Our solution at the time was to break up our pipelines so the database lookups happened in their own process (controlled by GenStage, but without any control over the demand). I.e. the “database producer” did this:

def handle_demand(_demand, state), do: {:noreply, [], state}

The thought was that it couldn’t magically summon demand; it just chewed on whatever the other processes sent to it. The effect was that this gave us a mailbox for the items needing database cross-checks, plus some control over concurrency via the various GenStage options. A rough sketch of the shape is below.
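To make that concrete, here is a minimal sketch of the pattern (module and function names are placeholders, not our real ones): upstream processes push records in with a cast, handle_demand never emits anything on its own, and GenStage’s internal buffer holds the records until a consumer asks for them.

defmodule DbCheckProducer do
  use GenStage

  def start_link(opts \\ []) do
    GenStage.start_link(__MODULE__, :ok, opts)
  end

  # Upstream stages push work in; this producer never "summons" its own.
  def submit(producer, records) do
    GenStage.cast(producer, {:submit, records})
  end

  @impl true
  def init(:ok) do
    {:producer, :no_state}
  end

  # Demand is acknowledged but ignored -- we can't create records out of thin air.
  @impl true
  def handle_demand(_demand, state), do: {:noreply, [], state}

  # Pushed records are emitted as events; GenStage buffers them until
  # consumers (with their own max_demand) pull them off.
  @impl true
  def handle_cast({:submit, records}, state) do
    {:noreply, records, state}
  end
end

The actual throttling then comes from the consumer side, e.g. subscribe_to: [{DbCheckProducer, max_demand: 50}] (numbers illustrative) -- that’s the "control over concurrency with the various GenStage options" part.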

However, now that I’m looking at this, I’m wondering if organizing our stages this way is the anti-pattern the gen_stage docs warn about. It feels like we are reinventing the wheel. Our processes get bottlenecked (by design) by these database lookups – it’s almost like having a rate limiter in front of the database. Yes, this does prevent us from over-taxing the database, but shouldn’t the database drivers have their own queuing/pooling built in so we don’t have to think about it? Not to mention that by sending all this data into another process’s mailbox, we end up copying it across process boundaries at the cost of some efficiency.
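For what it’s worth, on the postgres side we do already get pooling and queueing from Ecto/DBConnection, and part of this could presumably be pushed into repo configuration. Something roughly like this (MyApp.Repo and the numbers are just illustrative):

# config/config.exs
config :my_app, MyApp.Repo,
  pool_size: 10,        # max concurrent Postgres connections
  queue_target: 50,     # DBConnection: target wait (ms) for checking out a connection
  queue_interval: 1000  # DBConnection: interval (ms) over which queue_target is measured

That caps concurrency at the connection pool, though it only queues and times out checkouts -- it doesn’t decide what to do with excess work, which is probably the part we were trying to solve with GenStage.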

My gut is telling me I should just implement each use case as a “siloed” GenStage producer + consumer and only put a layer of protection around the database if it really is needed.

I’m working solo on this, so I appreciate a sanity check from the forum comrades. Thanks!

Yes, there are queues everywhere, but queues don’t fix overload. You’ll need to be able to detect when you’re trying to do too much (e.g. by watching GenStage’s backpressure versus how many events pile up in the producer) and then decide what you want to do when there’s more work than you can handle.
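One cheap way to make that visible (a sketch against the push-style producer above, not your exact code) is to cap the producer’s internal buffer with the :buffer_size option, so that when more comes in than the consumers can pull, GenStage starts discarding and logging instead of queueing without bound:

@impl true
def init(:ok) do
  # Anything beyond 5_000 buffered events gets dropped (and a warning logged
  # by GenStage), turning "silently falling behind" into a visible signal.
  # 5_000 is an arbitrary example; the default is 10_000.
  {:producer, :no_state, buffer_size: 5_000}
end

From there you can decide whether dropping, shedding load upstream, or scaling out is the right response.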

That was a great article, thank you!

I will keep this in mind as we redesign our input strategy.