Help with Optimizing a Large Data Processing Task

Hi Community,

I'm working on a task that involves processing a huge dataset in Elixir. The dataset consists of millions of records, and the processing involves several complex transformations and calculations. Right now the processing is quite slow, and I'm looking for advice on how to optimize it.

Here are a few particulars:

  • Each record undergoes multiple transformations.
  • There are several CPU-intensive calculations.
  • I’m using GenStage to handle the data flow (a stripped-down sketch follows below).
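
For context, here is roughly the shape of the pipeline I have. The module names, the in-memory record source, and the transform functions are placeholders, not my real code:

```elixir
defmodule Pipeline.Producer do
  use GenStage

  def start_link(records) do
    GenStage.start_link(__MODULE__, records, name: __MODULE__)
  end

  def init(records), do: {:producer, records}

  # Hand out up to `demand` records from the in-memory list.
  def handle_demand(demand, records) do
    {batch, rest} = Enum.split(records, demand)
    {:noreply, batch, rest}
  end
end

defmodule Pipeline.Consumer do
  use GenStage

  def start_link(_opts) do
    GenStage.start_link(__MODULE__, :ok)
  end

  def init(:ok) do
    {:consumer, :ok, subscribe_to: [{Pipeline.Producer, max_demand: 500}]}
  end

  def handle_events(records, _from, state) do
    # The real transformations and calculations happen here, per record.
    Enum.each(records, fn record ->
      record
      |> transform_a()
      |> transform_b()
      |> expensive_calculation()
    end)

    {:noreply, [], state}
  end

  # Placeholders for the actual (CPU-heavy) work.
  defp transform_a(record), do: record
  defp transform_b(record), do: record
  defp expensive_calculation(record), do: record
end
```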

I’ve tried using Task.async_stream to parallelize some of the work, but the improvement is marginal. I am wondering if there’s a better approach to parallel processing in Elixir or any specific libraries that could help with this type of workload. Any tips on optimizing performance or managing large data efficiently in Elixir would be greatly appreciated.
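To be concrete about what I tried: inside the consumer sketched above, I only parallelized the expensive calculation, roughly like this (step names are placeholders again). This is the change that gave only a marginal improvement:

```elixir
# Only the expensive calculation is parallelized; the other
# transformations still run sequentially for each batch of records.
def handle_events(records, _from, state) do
  records
  |> Enum.map(&transform_a/1)
  |> Enum.map(&transform_b/1)
  |> Task.async_stream(&expensive_calculation/1,
    max_concurrency: System.schedulers_online(),
    ordered: false
  )
  |> Stream.run()

  {:noreply, [], state}
end
```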

I also read this thread (https://elixirforum.com/t/massive-distributed-parallel-processing-of-large-data-sets-cissp-with-elixir/49584) on the forum, but it didn’t answer my question.

Thanks in advance.

Steve

As surface-level advice: you can skip the stages entirely and put all the processing of a single record inside a single function, then use Task.async_stream with it.
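
Something like this, as a minimal sketch (the transform functions are just placeholders for your own work):

```elixir
defmodule Pipeline do
  # All per-record work composed into one function; no stages needed.
  def process_record(record) do
    record
    |> transform_a()
    |> transform_b()
    |> expensive_calculation()
  end

  def process_all(records) do
    records
    |> Task.async_stream(&process_record/1,
      max_concurrency: System.schedulers_online(),
      ordered: false,
      timeout: :timer.minutes(5)
    )
    |> Stream.run()
  end

  # Placeholders for the actual transformations and calculations.
  defp transform_a(record), do: record
  defp transform_b(record), do: record
  defp expensive_calculation(record), do: record
end
```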

I presume you’re already using database batches?


Hey @dimitarvp,
Thanks for your response. Yes, I am using database batches.

Well, your post doesn’t give us much to work with. If you can post some more info, data and code, maybe we can help further.

On that matter, what database are you using? If you’re using Postgres, this discussion on the elixir-ecto mailing list might give you some insights into optimizing large operations in the database:
https://groups.google.com/g/elixir-ecto/c/ucMvVCtYubM
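
Not from that discussion specifically, but assuming Postgres with Ecto, one common pattern for working through a big table in batches is Repo.stream inside a transaction. A rough sketch, where MyApp.Repo, MyApp.Record, and process_record/1 are placeholders for your own repo, schema, and per-record work:

```elixir
defmodule MyApp.Batcher do
  import Ecto.Query

  # Repo.stream/2 must run inside a transaction; max_rows controls how many
  # rows are fetched from Postgres per round trip, so memory stays flat.
  def run do
    MyApp.Repo.transaction(
      fn ->
        MyApp.Record
        |> order_by(asc: :id)
        |> MyApp.Repo.stream(max_rows: 1_000)
        |> Stream.chunk_every(1_000)
        |> Enum.each(fn batch -> Enum.each(batch, &process_record/1) end)
      end,
      timeout: :infinity
    )
  end

  # Placeholder for the real per-record transformations.
  defp process_record(record), do: record
end
```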