Hi guys. I would like to understand a couple of scenarios with ecto. (assume they are not tests)
1 - Let’s imagine I have one single query and run this query inside 5 different processes at the same time. What happens? I’ll get 5 different connections from the pool to execute each one or I’ll share the same connection among 5 processes?
2 - Let’s imagine I have the same situation above, but this time I run those 5 processes inside one single transaction. What happens?
3 - Can I assume a transaction can share the same connection among all queries inside of it?
4 - If I use Task.async_stream/3 to run 5 different queries into 5 different processes. This function is going to consume 5 connections from the pool or share the same for each item in the enumerable?
5 - Repo.stream creates a lazy way to run queries, but imagine I have 200 records and my max_rows argument is equal to 100. What happens? I’ll hit the database twice to get 100 records each?
6 - if I use Task.async_stream/3 with Repo.stream. Would this increase the performance in those scenarios ?
Thank you guys!
This is based partly on the size of your Ecto connection pool but with enough pool size you will typically get 5 separate connections if they all occur at ~exactly the same moment. You can think of it like an actual book-library stocked with 5 copies of the same book, and the only way one copy of the book could wind up in the possession of two or more people is if the first person returns the book before the next person asks for it.
AFAIK you can’t share a transaction across multiple processes; one process would need to hold the transaction open and then do work with it based on sending and receiving Elixir process messages to other processes. There might be something in the guts of
Ecto.Adapters.SQL.Sandbox that proves or disproves this.
Yes, although I would phrase it as “All queries that occur in the same transaction also occur on the same connection” to get the relationship straight.
Similar to 1, this depends but most likely you’ll see a maximum of 5 connections being used for those Tasks if they all start very close together.
Yes - it will
BEGIN a SQL transaction, then perform
SELECT with limits of 100 until it has seen all of the rows that match your criteria in the order (un)specified, and only then will it release the transaction. The transaction is non-negotiable, and this is to preserve the accuracy of successive reads across separate queries. It is the biggest tradeoff to
Repo.stream in the scenarios you might intuitively reach for it - if you’re trying to do heavy work with all the rows, you wind up with a transaction (and possibly conflicting locks) for the entire duration.
The main mitigation I know of is to select or stream only row IDs to collect the list of candidates, then do your heavy work outside of the transaction using something like GenStage/Flow. You do need accept that some individual operations might fail as rows change out from under you if the table is high-churn, and will possibly need to be retried on another pass. If you design for this mental model intentionally you can generally be pretty satisfied with the results.
This will probably have “interesting” complications due to what I wrote about for 5. If every individual Task has no need to see either the work going on Repo.stream or the effects of its sibling Tasks, this model might work okay in isolation but also might not be sustainable in practice due to the transaction.
: I don’t personally view Tasks as resilient/controlled enough for this kind of effort, even under a supervisor. I always outgrow them in practice. Not really the point of the thread here so I won’t dwell on it.
Thank you very much for spending time on it. I really appreciate your answers. I’ll use them for testing.