How to run some function asynchronously and concurrently on large DataSet with fault tolerance?

collegeimprovements · July 27, 2018, 4:43pm

Hi Folks,

We have a translate function which converts Chinese Text into English.
We need to convert about 50k to 60k records from Chinese to English with it.
For a single record it takes between 5s to 20s to convert each column to English.
Then we store it in another DB which takes hardly a second.

Step-1: Fetch Record(s) from DB.
Step-2: For the Record(s) → Run Translate Function
Step-3: Store the translation to another table

Translate Function: It takes record from Source Table(Chinese) and make about 10-15 calls to AWS and generate English Text for each column and Store that data in Target Table(English).
It takes about 5s to 20s to load.

How to perform the above translate operation on 50k DB Records efficiently ?
We have few concerns:

What if program fails at say 10k record. We don’t want to re-translate those 10k again.
Where should we be doing book-keeping… in the DB itself (on target table) ?
How to perform this task efficiently i.e. Concurrenlty ?
What’s a good way to approach this taks? Do we need to use GenServer ?

Hi Folks,
I’ve a fairly simple looking problem.
We have a translate function which converts Chinese Text into English.
We need to convert about 50k to 60k records from Chinese to English with it.
For a single record it takes between 5s to 20s to convert each column to English.
Then we store it in another DB which takes hardly a second.

How to perform the above translate operation on 50k DB Records efficiently ?
We have few concerns:

What if program fails at say 10k record. We don’t want to re-translate those 10k again.
Where should we be doing book-keeping… in the DB itself (on target table) ?
How to perform this task efficiently i.e. Concurrenlty ?
What’s a good way to approach this taks? Do we need to use GenServer ?

crazymevt · July 27, 2018, 8:44pm

This sounds like a pretty good place to use Task.async_stream

list
|> Task.async_stream(Enum, Mod, Fun, args, [max_concurrency: 10])
|> Enum.map(fn {:ok, val} -> val end)

This will provide you with a tuple {:ok, value}, or an error where you can parse out the good / bad translations. This also allows you to set the number of concurrent translations you want to perform at any given time.

tty · July 28, 2018, 8:30am

Do the bookkeeping in the DB. You are looking at a 3 states: ready, processing, done
If you can tolerate work duplication you don’t even need to lock the table.
A process per record. Or a process per line. Or a process per word. This is something you have to decide. Best if you can run a simple load test to find the best chunk of work. Once you decided:
crazymevt already suggested Task. I would also suggest looking at rpc:pmap and the pool module as well. You would want a GenServer(s) to coordinate starting the Task/pool/pmap

Task/rpc:pmap/pool for concurrency. GenServer/Supervisor for fault tolerance.

collegeimprovements · July 30, 2018, 6:50am

Thanks a lot @crazymevt, @tty .
I tried to follow the same and it seems to work.
I hit another roadblock. As AWS throttle their api for more than 1000 requests in a minute.
So now i have to run it sequentially

Though i’m trying to run this command via GenServer now so that it can keep running.