Understanding concurrent jobs and OTP applications

nelson687 · June 28, 2017, 3:08pm

Hi, I’m trying to understand concurrent jobs in Elixir. I have a job in Java that I want to rewrite using Elixir’s concurrency capabilities. It loops over a large number of elements, modifies some values and saves them back to the DB. The elements are independent of each other, so the work can be done in parallel. I’m struggling a bit to understand whether to use Task, GenServer, or just spawn separate processes. I don’t need a reply from the processes when they complete, just a log line somewhere saying whether the update was successful or not. I would have:

Enum.each(list, function)

where “function” would create a new Task and call the function I need to execute on the element.
But what about GenServers/OTP? When is it useful to use GenServers? Would it make sense to create one GenServer for each element, or is that not the idea?

Thanks!

kokolegorille · June 28, 2017, 4:24pm

Task is an Elixir abstraction over a process that would fit well for your “one time” job.

GenServer is an Erlang abstraction over a process too, but it is a better fit for situations where you need a longer-lived process.

While You could do everything with GenServer …

Tasks are processes meant to execute one particular action throughout their lifetime, often with little or no communication with other processes

from this page Task — Elixir v1.16.0

I use GenServer in almost any case where the lifetime of the process is unknown.

Like a cache server, a latency server etc.

And I use Task when I need concurrent work done, like scraping an API, processing multiple files etc.
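To make the contrast concrete, here is a minimal sketch, not taken from the thread. `SimpleCache` is a hypothetical module name: it shows a GenServer holding state across many messages for as long as it lives, next to a Task that runs one computation and exits.

```elixir
defmodule SimpleCache do
  @moduledoc "Hypothetical long-lived key/value cache kept in a GenServer's state."
  use GenServer

  # Client API
  def start_link(initial \\ %{}) do
    GenServer.start_link(__MODULE__, initial)
  end

  # Asynchronous write: returns :ok immediately.
  def put(pid, key, value), do: GenServer.cast(pid, {:put, key, value})

  # Synchronous read: waits for the server's reply.
  def get(pid, key), do: GenServer.call(pid, {:get, key})

  # Server callbacks
  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_cast({:put, key, value}, state) do
    {:noreply, Map.put(state, key, value)}
  end

  @impl true
  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end
end

# Long-lived process: the state survives between messages.
{:ok, cache} = SimpleCache.start_link()
SimpleCache.put(cache, :answer, 42)
IO.inspect(SimpleCache.get(cache, :answer))

# One-off concurrent work: the Task computes a single result and terminates.
task = Task.async(fn -> Enum.sum(1..100) end)
IO.inspect(Task.await(task))
```

The cache process keeps running until it is stopped, while the Task's process is gone as soon as its function returns.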

mmartinson · June 28, 2017, 4:44pm

You would use GenServer if you wanted processes to maintain state between messages. It sounds like you don’t need that. Using Task.start/1 is an easy way to start fire-and-forget processes. If you need additional control over the number of concurrent processes, a GenStage ConsumerSupervisor would be a useful tool.

github.com

elixir-lang/gen_stage/blob/master/examples/consumer_supervisor.exs

# Usage: mix run examples/consumer_supervisor.exs
#
# Hit Ctrl+C twice to stop it.

defmodule Counter do
  @moduledoc """
  This is a simple producer that counts from the given
  number whenever there is a demand.
  """

  use GenStage

  def start_link(initial) when is_integer(initial) do
    GenStage.start_link(__MODULE__, initial, name: __MODULE__)
  end

  ## Callbacks

  def init(initial) do
    {:producer, initial}

This file has been truncated.
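The fire-and-forget approach with Task.start/1 mentioned above could look roughly like the sketch below. `FireAndForget` and its `update/1` function are hypothetical stand-ins for the real "modify and save to the DB" step; only `Task.start/1`, `Enum.each/2` and `Logger` are real library calls.

```elixir
defmodule FireAndForget do
  require Logger

  # Hypothetical per-element update; stands in for the real DB write.
  def update(element) when is_integer(element), do: {:ok, element * 2}
  def update(_other), do: {:error, :invalid}

  # Starts one unlinked, unsupervised process per element and returns at once.
  # Nobody waits for the results; each process only logs its own outcome.
  def run(elements) do
    Enum.each(elements, fn element ->
      Task.start(fn ->
        case update(element) do
          {:ok, _value} ->
            Logger.info("updated #{inspect(element)}")

          {:error, reason} ->
            Logger.error("failed #{inspect(element)}: #{inspect(reason)}")
        end
      end)
    end)
  end
end

# Make sure Logger is running when this is executed as a plain script.
{:ok, _apps} = Application.ensure_all_started(:logger)

FireAndForget.run([1, 2, :bad])

# Give the spawned processes a moment to log before the script exits.
Process.sleep(100)
```

Note that this starts as many processes as there are elements, with no upper bound, which is exactly where a tool for limiting concurrency becomes relevant.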

kandros5591 · June 28, 2017, 5:06pm

@nelson687 I suggest you take a look at these videos, really good material.

https://www.youtube.com/channel/UCp01DFl8kp-239gW289C0ew

peerreynders · June 28, 2017, 5:20pm

I suspect that you aren’t sharing some assumptions that you are making here. If you don’t need a reply - how are you planning to save information back to the database? Now it could work if you plan to use Ecto, as it is an independent OTP application that handles all DB requests concurrently. However, even then you should ideally restructure your processing to fetch as much data as possible up front, to avoid each separate process repeatedly hammering the database.

This topic may help you to get a better feel for all things Task, and GenServer:
https://elixirforum.com/t/sasa-jurics-beyond-task-async-blog/6107

If you need the capability to throttle processing then, as already suggested, GenStage is worth considering. Typically you would organize processing a bit differently - rather than having a Task process an “element” end-to-end, you set up a processing pipeline where each stage does just one (short) phase of the work before sending it to the next stage, which does the next (short) phase of the work - etc.

nelson687 · June 28, 2017, 6:10pm

Wow, a bit off topic first: I’ve never had this kind of quality answers so quickly in other forums! Hats off to this community.

All the answers have very good points.

I’m thinking of two different approaches. The first is to leave my Java code as it is, behind an API endpoint: I would call one endpoint to get the list of elements, then call another endpoint for each element and let Java do all the work. I would just use Elixir to make this processing concurrent. The downside is that I’m not sure whether this could have a negative impact on the DB, as I’ll be making several calls to the DB in parallel. What do you think?

The other approach is to rewrite the code in Elixir and use Ecto to handle DB calls. Do you think this approach would be better performance-wise? I read somewhere that Ecto uses some kind of pool, so even if I run 300 processes and those 300 processes query the DB at the same time, will Ecto manage that so that DB performance is not affected?

kokolegorille · June 28, 2017, 6:54pm

Ecto is using poolboy (https://github.com/devinus/poolboy) for managing (and limiting) DB access.

And you could use it as well if you want to limit API calls.
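As an illustration of the limiting idea only: the sketch below does not use poolboy. It swaps in the standard library's `Task.async_stream/3` with its `max_concurrency` option to cap how many elements are processed at once. `BoundedUpdate`, `update/1` and the limit of 3 are hypothetical.

```elixir
defmodule BoundedUpdate do
  # Hypothetical update; in a real job this would be the DB write or API call.
  def update(element), do: {:ok, element * 2}

  # Runs update/1 on every element with at most `limit` processes in flight.
  # Task.async_stream/3 wraps each successful return value in {:ok, value},
  # and by default yields results in the same order as the input.
  def run(elements, limit) do
    elements
    |> Task.async_stream(&update/1, max_concurrency: limit, timeout: 5_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# At most 3 updates run concurrently, however long the input is.
results = BoundedUpdate.run(1..10, 3)
IO.inspect(results)
```

Unlike a connection pool, this bounds the number of worker processes rather than the number of connections, but the effect on the database is similar: no more than `limit` requests are in flight from this job at any time.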

peerreynders · June 28, 2017, 8:52pm

On the first approach (keeping the Java code behind API endpoints): I don’t really think that would gain you that much - not knowing the details. The database is likely to become the bottleneck even with a relatively small number of parallel queries - so unleashing hundreds of queries at the DB simultaneously may not have the desired effect. Also, a number of times I’ve come across this type of situation where the queries issued in Java were ultra-simple - and by leveraging the facilities available in SQL, a lot of the looping constructs (and Java code) could be eliminated - again, your situation may be different. Furthermore, this type of solution would still require the same sort of hardware resources as before - adding the overhead of the (I assume) HTTP requests.

The second approach (rewriting in Elixir with Ecto) sounds potentially more profitable to me - however, given that you are using Java, there is a pretty good chance that you are using Oracle. The rub with Ecto is that it’s typically used with PostgreSQL. For Oracle there is the possibility of writing your own adapter or joining somebody else’s effort in writing one. The other possibility is to forget about Ecto and to pursue some of the options mentioned here, like using ODBC.

Ultimately a lot depends on how many parallel queries (and updates, which may create locks that slow things down) your particular database/schema setup can reasonably sustain. If that number is fairly low, all the parallelism in the BEAM isn’t going to help you.

nelson687 · August 3, 2017, 1:21pm

We are actually using MySQL. I’ll try to rewrite the code in Elixir and post the results I get.