Help with Broadway pipeline

Hi,

I have a task to import data that was unloaded from Redshift to S3 into our DB. It seemed like a good fit for Broadway and a chance to try it out, but I’m running into issues and need help designing the pipeline.

The files in the S3 bucket look like this:

bucket/output/file1.manifest
bucket/output/file1
bucket/output/file2.manifest
bucket/output/file2.1
bucket/output/file2.2
bucket/output/file3.manifest
bucket/output/file3

Each manifest has a list of entry files associated with it, and there can be multiple entries per manifest (as with file2.manifest).

The content of each entry file is a CSV:

id,name
1,Alice
2,Bob
...
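Parsing one of those entry files into maps shaped like the messages below could look something like this rough sketch (plain `String.split` for illustration only — a real import would probably use NimbleCSV — and note the values come out as strings, so `id` would still need an `Integer.parse`):

```elixir
defmodule EntryParser do
  # Hypothetical helper: turn a CSV binary into a list of maps,
  # using the header row as the keys.
  def parse(csv_binary) do
    [header | rows] =
      csv_binary
      |> String.split("\n", trim: true)
      |> Enum.map(&String.split(&1, ","))

    keys = Enum.map(header, &String.to_atom/1)

    Enum.map(rows, fn row -> keys |> Enum.zip(row) |> Map.new() end)
  end
end

# EntryParser.parse("id,name\n1,Alice\n2,Bob")
# => [%{id: "1", name: "Alice"}, %{id: "2", name: "Bob"}]
```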

I thought I would do something like this:

Poller (polls for manifest files) -> publish a message for each manifest path -> ManifestProcessor (downloads entries for the manifest) -> publish each entry (%{id: 1, name: "Alice"}) -> EntryProcessor (insert to DB)

But a structure like this will spam the queue with entry messages. Maybe I’m missing how to go from one message carrying the manifest to a batch of messages carrying its entries, while still keeping back-pressure.

Or maybe I’m way over-engineering this :slight_smile:


I haven’t done anything with Broadway myself, but this blog post might help: https://blog.appsignal.com/2019/12/12/how-to-use-broadway-in-your-elixir-application.html


I think you only need two steps: the producer and the processor. The producer emits each file, and the processor parses the file and saves it to the database. To max out your CPU/IO usage, bump up the concurrency to run multiple processors.

You can also start multiple producers, but they will need a way to coordinate so they don’t send the same file twice. You could do it with simple partitioning, though (one producer takes even-numbered files, the other takes odd ones).
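A minimal sketch of that two-step shape — assuming a hypothetical `S3FilePoller` producer that lists the manifests and emits one message per entry file, with `download/1`, `parse_csv/1`, and `insert_rows/1` as placeholders for your S3 and DB calls:

```elixir
defmodule ImportPipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # S3FilePoller is a made-up GenStage producer module that polls
        # bucket/output/*.manifest and emits one message per entry key.
        module: {S3FilePoller, bucket: "bucket", prefix: "output/"},
        concurrency: 1
      ],
      processors: [
        # Bump this up until CPU/IO is saturated.
        default: [concurrency: 10]
      ]
    )
  end

  @impl true
  def handle_message(_processor, %Message{data: file_key} = message, _context) do
    # Each message carries one entry-file key; download it, parse the
    # CSV, and insert the rows. Failing here marks the message failed,
    # which gives you Broadway's back-pressure and retry semantics.
    file_key
    |> download()
    |> parse_csv()
    |> insert_rows()

    message
  end
end
```

The back-pressure concern from the original design goes away here because the demand flows from the processors back to the producer: files are only emitted as fast as they can be parsed and inserted.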


Thanks a lot! I will try it right now.

Worked like a charm, thanks again. Next challenge - make it run in a cluster, where nodes might get killed…
