Hi,
I have a task to import data that was unloaded from Redshift to S3 into our DB. I thought it would be a good fit for Broadway and a chance to try it out, but I’m running into design issues and could use some help.
The files in the S3 bucket look like this:
bucket/output/file1.manifest
bucket/output/file1
bucket/output/file2.manifest
bucket/output/file2.1
bucket/output/file2.2
bucket/output/file3.manifest
bucket/output/file3
Each manifest lists the entry files associated with it; there can be more than one (as with file2.manifest).
Each entry file contains CSV data:
id,name
1,Alice
2,Bob
...
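To give an idea of the payload, I turn each row into a map roughly like this (a naive split that ignores CSV quoting/escaping; the column names are just the ones from the sample header):

```elixir
defmodule Entry do
  # Hypothetical helper: turn one CSV row into the message payload.
  def parse_line(line) do
    [id, name] =
      line
      |> String.trim()
      |> String.split(",")

    %{id: String.to_integer(id), name: name}
  end
end

Entry.parse_line("1,Alice")
# => %{id: 1, name: "Alice"}
```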
I thought I would do something like this:
Poller (polls for manifest files)
-> publish a message for each manifest path
-> ManifestProcessor (downloads the entries for the manifest)
-> publish a message for each entry (e.g. %{id: 1, name: "Alice"})
-> EntryProcessor (inserts into the DB)
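To make the middle stage concrete, here is roughly what I sketched for the manifest side (a sketch only — it assumes the :broadway package, and ManifestPoller, list_entries/1, download_rows/1, and publish_entry/1 are all made-up placeholders for the S3/queue plumbing):

```elixir
defmodule ManifestPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Hypothetical producer that emits one message per manifest path.
        module: {ManifestPoller, []},
        concurrency: 1
      ],
      processors: [default: [concurrency: 4]]
    )
  end

  @impl true
  def handle_message(_processor, %Broadway.Message{data: manifest_path} = msg, _ctx) do
    # Download every entry file listed in the manifest and publish one
    # message per CSV row to the entries queue — which is exactly the
    # part that spams the queue.
    for entry_path <- list_entries(manifest_path),
        row <- download_rows(entry_path) do
      publish_entry(row)
    end

    msg
  end

  defp list_entries(_manifest_path), do: []
  defp download_rows(_entry_path), do: []
  defp publish_entry(_row), do: :ok
end
```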
But a structure like this will spam the queue with entries. Maybe I’m missing a way to go from one message with a manifest to a batch of messages with that manifest’s entries while keeping back pressure. Or maybe I’m way over-engineering this.