Hi, I’m trying to understand concurrent jobs in Elixir. I have a job in Java that I want to rewrite using Elixir’s concurrency capabilities: it loops over a large number of elements, modifies some values, and saves them back to the DB. The elements are independent of each other, so the work can be done in parallel. I’m struggling a bit to understand whether to use Task, GenServer, or just spawn separate processes. I don’t need a reply from the processes when they complete, just a log line saying whether the update was successful or not. I would have:
Enum.each(list, function)
where “function” would create a new Task and call the function I need to execute on the element.
But what about GenServers/OTP? When is it useful to use GenServers? Would it make sense to create one GenServer for each element, or is that not the idea?
You would use GenServer if you wanted processes to maintain state between messages. It sounds like you don’t need that. Using Task.start/1 is an easy way to start fire-and-forget processes. If you need additional control over the number of concurrent processes, a GenStage ConsumerSupervisor would be a useful tool.
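As a minimal sketch of the fire-and-forget idea, assuming a hypothetical `update_fun` that stands in for "modify the element and save it" and returns `:ok` or `{:error, reason}` (neither the module name nor that return shape comes from the original question):

```elixir
defmodule FireAndForget do
  require Logger

  # update_fun is a hypothetical stand-in for the real per-element work.
  # It is assumed to return :ok or {:error, reason}.
  def run(elements, update_fun) do
    Enum.each(elements, fn element ->
      # Task.start/1 starts an unlinked process and returns immediately;
      # nobody waits for the result, so the outcome is only logged.
      Task.start(fn ->
        case update_fun.(element) do
          :ok ->
            Logger.info("updated #{inspect(element)}")

          {:error, reason} ->
            Logger.error("failed #{inspect(element)}: #{inspect(reason)}")
        end
      end)
    end)
  end
end
```

Note that this starts one process per element with no upper bound, so it does nothing to limit how many updates reach the database at once.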
I suspect that you aren’t sharing some assumptions that you are making here. If you don’t need a reply, how are you planning to save information back to the database? It could work if you plan to use Ecto, as it is an independent OTP application that handles all DB requests concurrently. However, even then you should ideally restructure your processing to fetch as much data as possible upfront, to avoid each separate process repeatedly hammering the database.
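One way to read "fetch as much data as possible upfront" is to work in batches rather than one query per element. The sketch below is only an illustration: `fetch_batch`, `transform`, and `save_batch` are hypothetical functions supplied by the caller, not Ecto APIs.

```elixir
defmodule BatchedUpdate do
  # fetch_batch.(ids)  -> list of rows   (hypothetical: one query per batch)
  # transform.(row)    -> modified row   (hypothetical: the per-element change)
  # save_batch.(rows)  -> any result     (hypothetical: one write per batch)
  def run(ids, batch_size, fetch_batch, transform, save_batch) do
    ids
    |> Enum.chunk_every(batch_size)
    |> Enum.map(fn chunk ->
      chunk
      |> fetch_batch.()
      |> Enum.map(transform)
      |> save_batch.()
    end)
  end
end
```

With a batch size of 500, for instance, 10,000 elements would cost 20 reads and 20 writes instead of 10,000 of each, assuming the database and adapter in use support batched reads and writes.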
If you need the capability to throttle processing then, as already suggested, GenStage is worth considering. Typically you would organize processing a bit differently: rather than having a Task process an “element” end-to-end, you set up a processing pipeline where each stage does just one (short) phase of the work before sending it to the next stage, which does the next (short) phase of the work, and so on.
Wow, a bit off topic first: I’ve never had this kind of quality answer so quickly in other forums! Hats off to this community.
All the answers have very good points.
I’m thinking of two different approaches. The first is to leave my Java code as it is, behind an API endpoint: I would call one endpoint to get the list of elements, then call another endpoint for each element and let Java do all the work. I would use Elixir only to make this processing concurrent. The downside is that I’m not sure whether this could have a negative impact on the DB, as I’ll be making several calls to the DB in parallel. What do you think?
The other approach is to rewrite the code in Elixir and use Ecto to handle the DB calls. Do you think this approach would be better performance-wise? I read somewhere that Ecto uses some kind of pool, so even if I run 300 processes and those 300 processes query the DB at the same time, will Ecto manage that so as not to affect DB performance?
I don’t really think that would gain you that much, though I don’t know the details. The database is likely to become the bottleneck even with a relatively small number of parallel queries, so unleashing hundreds of queries at the DB simultaneously may not have the desired effect. Also, a number of times I’ve come across this type of situation where the queries issued in Java were ultra-simple, and by leveraging the facilities available in SQL a lot of the looping constructs (and Java code) could be eliminated. Again, your situation may be different. Furthermore, this type of solution would still require the same sort of hardware resources as before, while adding the overhead of the (I assume) HTTP requests.
This approach sounds potentially more profitable to me. However, given that you are using Java, there is a pretty good chance that you are using Oracle. The rub with Ecto is that it’s typically used with PostgreSQL. For Oracle there is the possibility of writing your own adapter or joining somebody else’s effort in writing one. The other possibility is to forget about Ecto and pursue some of the options mentioned here, like using ODBC.
Ultimately a lot depends on how many parallel queries (and updates which may create locks that slow things down) your particular database/schema setup can reasonably sustain. Because if that number is fairly low all the parallelism in the BEAM isn’t going to help you.
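If the sustainable number of parallel queries turns out to be small, the concurrency can be capped to match it. The sketch below uses `Task.async_stream/3` from the standard library, which is not something suggested earlier in this thread (GenStage was), so treat it as one possible alternative. `update_fun` is again a hypothetical function assumed to return `:ok` or `{:error, reason}`, and the default of 10 is an arbitrary placeholder to be tuned against the real database.

```elixir
defmodule BoundedUpdate do
  require Logger

  # At most max_concurrency elements are processed at the same time.
  # Note: tasks are linked to the caller, so a raise inside update_fun
  # will crash the calling process unless it traps exits.
  def run(elements, update_fun, max_concurrency \\ 10) do
    elements
    |> Task.async_stream(update_fun,
      max_concurrency: max_concurrency,
      timeout: 30_000,
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{ok: 0, failed: 0}, fn
      {:ok, :ok}, acc ->
        %{acc | ok: acc.ok + 1}

      {:ok, {:error, reason}}, acc ->
        Logger.error("update failed: #{inspect(reason)}")
        %{acc | failed: acc.failed + 1}

      # Reached when a task is killed for exceeding the timeout.
      {:exit, reason}, acc ->
        Logger.error("task exited: #{inspect(reason)}")
        %{acc | failed: acc.failed + 1}
    end)
  end
end
```

Unlike the fire-and-forget style, the caller blocks here until every element is done and gets back a count of successes and failures.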