Create an OTP data import app for a large warehouse

I want to structure an umbrella app that will take a request from a Rails app for data import.

Process Steps

  1. Call an app in the umbrella, e.g. a CSV importer, which receives the file path to be imported and starts the process to parse the data.
  2. Call the XML generator umbrella sub-app.
  3. Call the XML uploader umbrella sub-app to upload to a dedicated S3 bucket.
  4. Call the data insertion API after XML generation.

The import data is usually in the MB range, from a minimum of 4 MB to a maximum of 20 MB.

I want to achieve this using OTP only; using a job processing library is a last resort.
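Roughly, I imagine kicking off each import as a supervised task, something like this (a sketch only; the module names are placeholders for the umbrella apps above, and it assumes a Task.Supervisor named DataProcessor.TaskSupervisor is started in the application's supervision tree):

defmodule DataProcessor.Import do
  # Run the whole import in a supervised task so a crash is isolated
  # from the caller.
  def start(file_path) do
    Task.Supervisor.start_child(DataProcessor.TaskSupervisor, fn ->
      file_path
      |> CsvImporter.parse()      # step 1: parse the file
      |> XmlGenerator.generate()  # step 2: generate the XML
      |> XmlUploader.upload()     # step 3: upload to S3
      |> DbInserter.insert()      # step 4: call the data insertion API
    end)
  end
end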

Could you please elaborate?

Why not something as simple as this?

filepath |> YourModule.parse_data() |> YourModule.generate_xml() |> YourModule.upload_to_s3()

Do You need OTP because:

  • You want back pressure -> GenStage
  • You want a Supervisor with simple_one_for_one GenServer workers (sketch below)
  • or simply a Task

It depends on the volume, the situation, etc. Maybe You can explain the context.
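In the meantime, a minimal sketch of the supervisor option could look like this (DynamicSupervisor is the modern replacement for a :simple_one_for_one supervisor; ImportWorker and ImportSupervisor are placeholder names):

defmodule ImportWorker do
  # :transient so a normal exit is not restarted, but a crash is
  use GenServer, restart: :transient

  def start_link(file_path), do: GenServer.start_link(__MODULE__, file_path)

  def init(file_path) do
    # kick off the work after init returns, so the supervisor is not blocked
    send(self(), :run)
    {:ok, file_path}
  end

  def handle_info(:run, file_path) do
    # parse the file, generate the XML, upload to S3... then stop normally
    {:stop, :normal, file_path}
  end
end

defmodule ImportSupervisor do
  use DynamicSupervisor

  def start_link(_arg), do: DynamicSupervisor.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok), do: DynamicSupervisor.init(strategy: :one_for_one)

  # one worker per import request
  def import(file_path) do
    DynamicSupervisor.start_child(__MODULE__, {ImportWorker, file_path})
  end
end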


May I ask what your actual question is?

Do you want to know if this is possible at all?
Do you want to hire someone to do this for you?
Do you need some help implementing this?
Do you need help planning this in more detail?


BTW, there is this video which describes a similar problem.

This is a multi-tenant importer, where multiple tenants can start/auto-start imports from multiple data providers' APIs.
The data providers are Amazon & Amazon FBA, Bigcommerce, Ebay, Ecomdash, Magento, ShippingEasy, ShipStation, ShipWorks, Shopify, and Teapplix.

Currently, data is imported using Rails delayed jobs. The system can at any time hit maximum resource utilisation, due to massive imports running simultaneously on 10 workers.

So we thought of using Golang to reduce the pressure. We were actually getting a performance boost with Golang in a Lambda process, but migrating to Golang is time-consuming.

So now we are thinking of using Elixir to come to the rescue.

So we need the import to be fault-tolerant, recoverable, and more efficient than Rails.

The current structure is like this:

data processor (umbrella)
└── apps
    ├── api
    ├── parser (one per API, e.g. Shopify, Amazon, etc.)
    ├── xml generator for INDIVIDUAL records (if there are 100 records, create 100 XML files in concurrent tasks)
    └── uploader (uploads each individual XML to S3 and calls the DB inserter API)

The parser will receive an API request from CSV, Shopify, etc. and should pass the data to the specific worker.
Then it should parse the data and call the XML generator for individual records, in chunks of maybe 100 records.
Then push to S3 and call the DB API.
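Roughly, the dispatch in the parser app might look like this (all module names are placeholders for the per-provider parsers):

defmodule DataProcessor.Parser do
  # Sketch only: route an incoming import request to a provider-specific parser.
  @parsers %{
    "csv"     => DataProcessor.Parser.CSV,
    "shopify" => DataProcessor.Parser.Shopify,
    "amazon"  => DataProcessor.Parser.Amazon
    # ...one module per data provider
  }

  def parse(provider, payload) do
    case Map.fetch(@parsers, provider) do
      {:ok, parser} -> parser.parse(payload)
      :error        -> {:error, {:unknown_provider, provider}}
    end
  end
end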

That really seems like a perfect use case for Elixir… I did not notice at first that parsing the data will produce 100 requests, but it seems more than doable.

I would probably add poolboy (for the chunks), to avoid being blacklisted by Amazon…

Maybe like this?

def do_stuff(filepath) do
  filepath
  |> parse_data()
  |> Enum.chunk_every(100)
  |> Enum.map(&process_batch/1)
end

defp process_batch(batch) do
  batch
  |> Enum.map(&Task.async(__MODULE__, :process_xml, [&1]))
  |> Enum.map(&Task.await(&1, 10_000))
end

def process_xml(record) do
  # generate the XML for this record, then do what You want
end

This should process concurrently in chunks of 100. Of course, it’s just pseudo code…

Also, I am not sure You need an umbrella with as many contexts. Maybe one module with 3 functions would be enough.
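For the poolboy idea above, a rough sketch could look like this (XmlWorker and :xml_worker_pool are placeholder names; it assumes poolboy is added as a dependency and the pool is started in the application's supervision tree):

# Pool definition for the supervision tree, e.g.:
#
#   :poolboy.child_spec(
#     :xml_worker_pool,
#     [name: {:local, :xml_worker_pool}, worker_module: XmlWorker,
#      size: 10, max_overflow: 5],
#     []
#   )

defmodule XmlWorker do
  use GenServer

  # poolboy calls start_link/1 with the worker args from the child spec
  def start_link(_args), do: GenServer.start_link(__MODULE__, nil)

  def init(state), do: {:ok, state}

  def handle_call({:process, record}, _from, state) do
    # generate the XML for this record, upload it to S3, call the DB API
    {:reply, {:ok, record}, state}
  end
end

defmodule XmlPool do
  # Checks a worker out of the pool for one record; the pool size caps
  # how many records are in flight at the same time.
  def process_record(record) do
    :poolboy.transaction(
      :xml_worker_pool,
      fn pid -> GenServer.call(pid, {:process, record}, 15_000) end,
      20_000
    )
  end
end

The pool size then becomes the knob for how hard the importer hits S3 and the DB API.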
