artem

What is the elixir way of processing long excels/csv files line by line (or any similar large inputs that is itemizable)

Hi there

I am still learning my elixir ways, doing my first real elixir project. It is going to process excels or CSV files of user feedback, analyze them line by line with AI and provide useful summaries on e.g. whether users complain more about performance rather than some functionality.

Coming from JVM world once the file is uploaded I would use something like SpringBatch or Quartz with csv/excel readers able to stream through the file and create records in database (or push to RabbitMQ) in the atomic manner (so if server got rebooted while parsing row 124, job would continue from row 124 when restarted).

And then a similar job for actual analysis would fetch rows (from db or rabbitmq) and again in atomic way would process things row by row.

What is the elixir way of doing it?
I am googling and reading this forum about batch jobs in elixir and it seems like the most mature way that should work out of the box is Oban (the nice controlling dashboard for it is expensive though, but maybe I could live without it).

The part I am missing is this “atomicity” or parsing files and processing already parsed rows.

Could somebody, please, point me into the direction to research?
Or is Oban the wrong framework for it and not really suited for splitting large job into small atomically failing or succeeding items? What would you look at then?

10 comments

/oban #queues

1 2944 10

2023-07-28 20:39:58 UTC

Most Liked

fireproofsocks

I spent a fair amount of time doing large-scale data processing in Elixir. Once you have things loaded up into a queue of some sort, then you have many more tools available for handling failures. But that first pass over the big data file is hard to avoid, and I’m not aware of any tools in any language which fully “solve” for the problem of keeping track of your progress while doing that initial processing of the file.

I didn’t find a silver bullet for recovering seamlessly from crashes, but here are 2 methods I found to help with the chore of getting data from the files and into the queue:

Use Bash’s split utility to separate a larger file into smaller files. This helps you divide and conquer, so if things do really go off the rails, you can at least keep track of which files have been successfully processed. One pattern I used grouped items together and relied on Ecto’s insert_all/2 function – this was much more efficient than performing individual database operations, for example:

# Split will create files with an `x` prefix, e.g. `xaa`, `xab`, etc.
input_files = ~w(
/foo/xaa
/foo/xab
/foo/xac
# ... etc...
)

Task.Supervisor.start_link(name: TmpTaskSupervisor)
Task.Supervisor.async_stream_nolink(
  TmpTaskSupervisor,
  input_files,
  fn input_file ->
    input_file
    |> File.stream!()
    |> Stream.chunk_every(1000)
    |> Stream.each(fn chunk ->
      rows = chunk
        |> Enum.map(fn line ->
          %{payload: String.trim(line), foo: "bar", etc: "etc"}
        end)
      MyApp.Repo.insert_all(MyApp.Something, rows)
    end)
    |> Stream.run()

  end,
  timeout: 86400_000,
  max_concurrency: 50
)
|> Enum.to_list()

I had some success in some cases using Stream.with_index/2 to help me keep track of which line I was processing – e.g. I could write this number to a file (but in some circumstances, e.g. on AWS, these file operations became a bottleneck). If I needed to recover after a failure, I could pass the number of the last successful line processed to Stream.drop/2 and use that to skip past rows that had already been processed. This can still take a while on a long file, but it’s much faster than re-parsing the whole file.

Relatedly: I’ve been doing some benchmarking of Elixir (e.g. with Python), and one of the tasks involved reading over a large CSV. There are a handful of different patterns I tried, e.g. ex_vs_py/lib/eds/vet_files/control.ex at main · fireproofsocks/ex_vs_py · GitHub My hot take on this was that Python was quite performant for these types of quick one-off tasks – I think any tool is fair game for you to “prime the pump” and get your data out of files and into a queue so you can benefit from the supervision tools available to Elixir as it handles the long-running process.

Post #9

josevalim

Creator of Elixir

Hi @artem!

I think the architecture can be quite similar to the Java one. You can use either NimbleCSV or Explorer packages for parsing CSV. For processing later on, you can use Oban or something like Broadway RabbitMQ.

Post #2

dimitarvp

It is quite common but as others have said, your scenario is quite specific so there’s no custom library just for that.

Your task sounds like you would read every CSV record and put them in a DB and mark them with “not processed yet”, and then have a consumer that will dispatch them to processing agents. Very easy and trivial in Elixir, especially to make use of all CPU cores and maximize throughput. I’ve done such tasks in Elixir, Java, Golang, Rust.

Look at Oban and Broadway, they have what you need.

Post #8

Last Post!

zac

Just a variation here – I work with event streams a lot (effectively, infinite streams of incoming data, often arriving in “chunks”). My goto solution is to use Kafka as the inbound message layer. In your case, you could simply grab an inbound CSV file, break it into lines, and feed those lines into Kafka (effectively turning each line into an event). Once it’s in Kafka, it’s safe… on the other end, you have something reading the events (“lines”) one at a time and processing them. If something goes wrong and it crashes, you just restart at the same location, reading from the queue.

Pretty much exactly what others have alluded to, but thought I’d spell it out. Kafka offers you a lot of excellent consistency guarantees and is one of the highest performing tools when it comes to sheer throughput.

I very much doubt you would need to (because of the sheer speed Kafka runs) but, you could also chunk the data, say 5 lines at a time, to try and squeeze a little more performance out… but, I’d be pretty surprised if the gain would be worth it.

Post #11