Using concurrency to quickly collate data from multiple sources

Maxximiliann · February 28, 2020, 1:10am

Given:

Vehicles is a list of 50 VIN numbers (vin_number).

Colors, Makes, Models, Transmissions, Fuel_economy, Horsepower and Torques are each a list of 50 VIN numbers with their respective features.

For instance:

colors = [%{vin_number: "5YJSA1DG9DFP14705", color: "Black"}, ...]
makes = [%{vin_number: "5YJSA1DG9DFP14705", make: "DB8 GT"}, ...]

get_colors, get_makes, get_models, get_transmissions, get_fuel_economy, get_horsepower and get_torques are functions which get the respective colors, makes, models, etc., etc., for a particular vin_number.

Example:

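def get_colors(vin_number) do
  # colors/0 is assumed to return the list of color maps shown above.
  # Enum.find does a linear scan for the entry with a matching VIN.
  colors()
  |> Enum.find(&(&1.vin_number == vin_number))
  |> Map.get(:color)
end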

Collating all of this data into a new map, super_cars, by vin_number:

def super_cars do
  # vehicles/0 is assumed to return the list of vehicles described above.
  Enum.map(vehicles(), fn %{vin_number: vin} ->
    %{
      vin: vin,
      color: get_colors(vin),
      make: get_makes(vin),
      model: get_models(vin),
      transmission: get_transmissions(vin),
      fuel_economy: get_fuel_economy(vin),
      horsepower: get_horsepower(vin),
      torque: get_torques(vin)
    }
  end)
end

So here’s my question:

How can Task.async be utilized to optimize the time it takes to create the new super_cars map? (Is there perhaps a better approach? Ecto, maybe?)

As always, thanks for your generous and patient insights.

chrisjowen · February 28, 2020, 2:13am

I suppose the question is how often do you need to materialize this super cars map?

It doesn’t sound like a lot of data to me, and while you could parallelize this, it may not give you a huge perf increase due to scheduling and message-passing overhead.

This also doesn’t seem like a frequently changing data set, which means it’s probably suited for caching. If this is the case, then the time to build the map becomes less important, and it can be done periodically (or whenever the data changes) in a worker process and the resulting map can be cached.
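A minimal sketch of that caching idea, assuming a GenServer as the worker process. The module name SuperCarsCache and the build_fun argument are made up for illustration; build_fun stands in for whatever function produces the collated data.

```elixir
defmodule SuperCarsCache do
  use GenServer

  # `build_fun` is a hypothetical zero-arity function that returns the collated data.
  def start_link(build_fun) do
    GenServer.start_link(__MODULE__, build_fun, name: __MODULE__)
  end

  # Readers get the cached result without rebuilding it.
  def get, do: GenServer.call(__MODULE__, :get)

  # Call this when the source data changes.
  def refresh, do: GenServer.cast(__MODULE__, :refresh)

  @impl true
  def init(build_fun), do: {:ok, %{build: build_fun, cars: build_fun.()}}

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state.cars, state}

  @impl true
  def handle_cast(:refresh, state), do: {:noreply, %{state | cars: state.build.()}}
end

# Sample data is hypothetical.
{:ok, _pid} = SuperCarsCache.start_link(fn -> [%{vin: "5YJSA1DG9DFP14705", color: "Black"}] end)
cars = SuperCarsCache.get()
```

For a periodic rebuild, a timer message scheduled with Process.send_after/3 could trigger the same refresh from inside the process.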

One small optimization you can make before thinking about parallelism or caching is to convert each list (e.g. colors) into a map of vin => color. Keying by VIN would make the lookups much faster for subsequent calls.

Maxximiliann · February 28, 2020, 2:25am

Great questions! To better model my project’s use case, let’s say that in this scenario data is frequently updated and so a current map of super_cars needs to be generated as quickly as possible. Would it still not make sense to use concurrency?

al2o3cr · February 28, 2020, 2:30am

This isn’t directly relevant to your question about using parallelism, but if you’re concerned about performance consider converting your lists into maps:

colors = [%{vin_number: "5YJSA1DG9DFP14705", color: "Black"}, ...]

map_colors = Map.new(colors, fn c -> {c.vin_number, c} end)

Then a function like get_colors is a map lookup, not a linear search.
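As a rough sketch of how the collation could look once the lists are keyed by VIN: the second VIN and the "Red"/"Accord" values are made-up sample data, and this variant stores only the attribute value under each VIN rather than the whole map.

```elixir
# Hypothetical sample data; the real lists would hold 50 entries each.
colors = [
  %{vin_number: "5YJSA1DG9DFP14705", color: "Black"},
  %{vin_number: "1HGCM82633A004352", color: "Red"}
]

makes = [
  %{vin_number: "5YJSA1DG9DFP14705", make: "DB8 GT"},
  %{vin_number: "1HGCM82633A004352", make: "Accord"}
]

# Key each list by VIN once, keeping only the attribute value.
color_by_vin = Map.new(colors, fn c -> {c.vin_number, c.color} end)
make_by_vin = Map.new(makes, fn m -> {m.vin_number, m.make} end)

vins = ["5YJSA1DG9DFP14705", "1HGCM82633A004352"]

# Each attribute is now a map lookup instead of an Enum.find scan.
super_cars =
  Enum.map(vins, fn vin ->
    %{vin: vin, color: Map.get(color_by_vin, vin), make: Map.get(make_by_vin, vin)}
  end)
```

Building the keyed maps costs one pass per list, so it pays off as soon as more than a handful of lookups are made against them.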

chrisjowen · February 28, 2020, 2:34am

Maybe, as mentioned there are other overheads in concurrency. If you only have 50 records the question is how many records would each async task process. If you process say 1 item per task it may work out slower than the single process call.

The only way to tell is to try this with different configurations. My gut is that you would be better off keeping this as a single process call and doing smaller optimisations like I mentioned ( @al2o3cr just showed what I mean in their answer, keying by vin number will reduce your lookup time).
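One way to try different configurations is to time a single-process run against a chunked Task.async run with :timer.tc. This is only a sketch: CollateBench, build/2 and the generated VINs are placeholders, and a proper benchmark would repeat each run many times.

```elixir
defmodule CollateBench do
  # Hypothetical stand-in for building one super_cars entry.
  def build(vin, color_by_vin), do: %{vin: vin, color: Map.get(color_by_vin, vin)}

  # Single-process version.
  def sequential(vins, color_by_vin), do: Enum.map(vins, &build(&1, color_by_vin))

  # One Task per chunk of `chunk_size` VINs; awaiting in order keeps the results in order.
  def chunked(vins, color_by_vin, chunk_size) do
    vins
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(fn chunk ->
      Task.async(fn -> Enum.map(chunk, &build(&1, color_by_vin)) end)
    end)
    |> Enum.flat_map(&Task.await/1)
  end
end

# 50 made-up VINs, all mapped to the same color.
vins = for i <- 1..50, do: "VIN#{i}"
color_by_vin = Map.new(vins, &{&1, "Black"})

{seq_us, seq} = :timer.tc(fn -> CollateBench.sequential(vins, color_by_vin) end)
{par_us, par} = :timer.tc(fn -> CollateBench.chunked(vins, color_by_vin, 10) end)

IO.puts("sequential: #{seq_us} us, chunked: #{par_us} us")
```

Varying chunk_size (1, 10, 25, ...) shows where the per-task overhead starts to outweigh the work done in each task.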

Maxximiliann · February 28, 2020, 2:53am

Thanks, I’ll refactor and run some benchmarks!

Maxximiliann · February 28, 2020, 2:54am

Thanks for your recommendation!

chrisjowen · February 28, 2020, 3:08am

No worries, and although I am not convinced you will need such things for this (I could be wrong, there’s just not enough info), I think it’s worth pointing out your options if you do need to look at parallel processing in Elixir.

Firstly, all abstractions, including Task.async, live on top of the core process model of the BEAM, and it’s really worth your time understanding this fully.

The next stage is to understand about GenServers and how they encapsulate generic process behaviour (https://hexdocs.pm/elixir/GenServer.html)

After this you may want to still use Task.async or maybe https://hexdocs.pm/elixir/Task.html#async_stream/3
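For reference, a small sketch of what Task.async_stream could look like for this kind of collation. The SuperCars module and its get_color/get_make stubs are hypothetical placeholders for the real lookups.

```elixir
defmodule SuperCars do
  # Hypothetical lookups: each takes a VIN and returns one attribute.
  def get_color(vin), do: "color of " <> vin
  def get_make(vin), do: "make of " <> vin

  def build(vin), do: %{vin: vin, color: get_color(vin), make: get_make(vin)}

  def collate(vins) do
    vins
    # Runs build/1 concurrently, at most one task per scheduler at a time.
    |> Task.async_stream(&build/1, max_concurrency: System.schedulers_online(), ordered: true)
    # Each successful result arrives wrapped as {:ok, value}.
    |> Enum.map(fn {:ok, car} -> car end)
  end
end

cars = SuperCars.collate(["A", "B", "C"])
```

Compared with spawning one Task.async per item by hand, async_stream caps the number of tasks in flight and keeps the results in input order.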

If this is not enough for you then the excellent GenStage (https://hexdocs.pm/gen_stage/GenStage.html) gives you some real control when producing/consuming large datasets.

Finally, there are interesting abstractions above GenStage such as:

Basically, there are a lot of ways to do concurrent data processing in Elixir.

Maxximiliann · February 28, 2020, 3:09am

Thanks for all the great resources, I really appreciate it!

LostKobrakai · February 28, 2020, 10:00am

Depending on how you source the data you might be able to parallelize data gathering by putting the source data into an ETS table with read concurrency. So you don’t need to copy the source data, but parallelize the querying part.
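A rough sketch of that idea, assuming the color data has been loaded into an ETS table. The table name, the sample row and the "UNKNOWN" VIN are made up for illustration.

```elixir
# Table name and sample row are hypothetical.
table = :ets.new(:colors, [:set, :protected, read_concurrency: true])
:ets.insert(table, {"5YJSA1DG9DFP14705", "Black"})

# Returns the color for a VIN, or nil when the VIN is not in the table.
lookup = fn vin ->
  case :ets.lookup(table, vin) do
    [{^vin, color}] -> color
    [] -> nil
  end
end

# Each task reads the table directly; only the looked-up row is copied to it.
colors =
  ["5YJSA1DG9DFP14705", "UNKNOWN"]
  |> Task.async_stream(lookup)
  |> Enum.map(fn {:ok, color} -> color end)
```

A :protected table lets any process read while only the owner writes, which fits a setup where one process loads the data and many query it.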

chasers · February 28, 2020, 8:15pm

Yeah, I mean, I assume these are all in a database somewhere, so I’d focus on making a process for each attribute which independently caches those locally in ETS periodically. And then just look up the VIN in ETS when you need it. If not ETS, even just consolidating them in a single Postgres table…

Maxximiliann · February 29, 2020, 1:34am

Thank you, gentlemen, for your kind help. It really means a lot, especially since I have zero software development background.