I’m resurrecting some old code that relied on gen_stage to prevent the databases (Postgres and Mongo) from being overloaded. However, now that I’m looking at it, I’m not sure it follows best practices. From what I remember, we had problems when processing large amounts of data that required database lookups. Conceptually, this boiled down to something like:
big_list_of_records
|> Enum.reject(fn x -> already_exists_in_db?(x) end)
|> cont_to_next_step()
# ... etc...
When a bunch of processes ran code like this at the same time, the database could get slammed. (I don’t remember whether this affected both Mongo and Postgres or only one of them, or whether it came down to a driver issue.)
Our solution at the time was to break up our pipelines so the database lookups happened in their own process (controlled by GenStage, but without any control over the demand). I.e. the “database producer” did this:
def handle_demand(_demand, state), do: {:noreply, [], state}
The thought was that it couldn’t magically summon demand; it just chewed on the stuff that the other processes sent to it. The effect was a mailbox for the items needing database cross-checks, plus some control over concurrency via the various GenStage options.
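To make the shape of that stage concrete, here is a minimal sketch of a push-style producer. The module name `DbLookupProducer` and the `enqueue/1` function are hypothetical, and using a synchronous call (rather than a cast) to hand records over is an assumption, not necessarily what the original code did.

```elixir
# Sketch only. Mix.install lets this run as a standalone script;
# in a Mix project, add gen_stage to mix.exs instead.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule DbLookupProducer do
  use GenStage

  def start_link(opts \\ []) do
    GenStage.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Hypothetical entry point: other processes push records that need a
  # database cross-check. Returns :ok once the stage has accepted the record.
  def enqueue(record, timeout \\ 5_000) do
    GenStage.call(__MODULE__, {:enqueue, record}, timeout)
  end

  @impl true
  def init(:ok), do: {:producer, :ok}

  @impl true
  def handle_call({:enqueue, record}, _from, state) do
    # Emit the record as an event. GenStage dispatches it if a consumer has
    # outstanding demand; otherwise it sits in the stage's internal buffer.
    {:reply, :ok, [record], state}
  end

  @impl true
  def handle_demand(_demand, state) do
    # Demand is ignored: events only exist once someone calls enqueue/1.
    {:noreply, [], state}
  end
end
```

One thing to keep in mind with this shape: when no consumer is asking, events pile up in the producer's internal buffer, which is bounded by the `:buffer_size` option, so a slow database can eventually cause events to be dropped rather than apply back-pressure to the callers.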
However, now that I’m looking at this, I’m wondering whether organizing our stages this way is the anti-pattern the gen_stage docs warn about. It feels like we are re-inventing a wheel. Our processes get bottlenecked (by design) by these database lookups; it’s almost like having a rate-limiter in front of the database. Yes, this does prevent us from over-taxing the database, but shouldn’t the database drivers have their own queuing/pooling built in so we don’t have to think about it? On top of that, sending all this data into another process’s mailbox means copying it across process boundaries, at the cost of some efficiency.
My gut is telling me I should just implement each use case as a “siloed” GenStage producer + consumer and only put a layer of protection around the database if it really is needed.
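If a layer of protection does turn out to be needed, one lightweight option that avoids a dedicated stage is to cap the lookup concurrency inside the pipeline itself. The sketch below swaps the plain `Enum.reject/2` for `Task.async_stream/3`; the `Dedup` module, `reject_existing/3`, and the `exists_fun` argument (a stand-in for `already_exists_in_db?/1`) are hypothetical names for illustration.

```elixir
defmodule Dedup do
  # Drops records for which exists_fun returns true, running at most
  # max_concurrency lookups at a time. Input order is preserved because
  # Task.async_stream/3 is ordered by default.
  def reject_existing(records, exists_fun, max_concurrency \\ 4) do
    records
    |> Task.async_stream(
      fn record -> {record, exists_fun.(record)} end,
      max_concurrency: max_concurrency,
      timeout: 5_000
    )
    |> Enum.flat_map(fn
      {:ok, {record, false}} -> [record]
      {:ok, {_record, true}} -> []
    end)
  end
end

# Example usage with an in-memory stand-in for the database check.
existing = MapSet.new([2, 4])
Dedup.reject_existing([1, 2, 3, 4, 5], &MapSet.member?(existing, &1), 2)
```

Note that this only caps lookups within a single call. If many pipelines run at once, the total load on the database is still the sum of their individual limits, so a global cap would have to come from somewhere shared, such as the driver's connection pool or a stage like the one described above.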
I’m working solo on this, so I’d appreciate a sanity check from the forum comrades. Thanks!