A second opinion on importing multiple large XML files

I’m fairly new to Elixir/Phoenix and just wanted some second opinions on my approach. As a side project I’m developing by myself, in a language none of my colleagues are familiar with, I’m missing the ability to bounce ideas around. I’m hoping you can help!

The problem

Periodically (say every 4 hours), I want to import ten (eventually more) Google Shopping XML feeds, process each of the products, filter some out, and then store the data in Postgres and create an Elasticsearch document for each.

Solution

  • A scheduled Oban task to fetch the XML files from their URLs and then create an SQS event for each of the products in each feed (ten feeds x 1,000+ products each)
  • Broadway with an SQS consumer to ingest the tens of thousands of products, filter them, save them to the database, and create Elasticsearch documents (a rough sketch follows below).
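
For concreteness, here’s a rough sketch of what that Broadway consumer might look like. It assumes the broadway_sqs producer, JSON-encoded message bodies, a placeholder queue URL, and hypothetical MyApp.Catalog / MyApp.Search context modules; it’s not a full implementation:

```elixir
defmodule MyApp.ProductPipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Assumes the broadway_sqs package; the queue URL is a placeholder.
        module: {BroadwaySQS.Producer, queue_url: "https://sqs.eu-west-1.amazonaws.com/000000000000/products"}
      ],
      processors: [default: [concurrency: 10]],
      batchers: [default: [batch_size: 100, batch_timeout: 2_000]]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # Each SQS message body is assumed to be one JSON-encoded product.
    Message.update_data(message, &Jason.decode!/1)
  end

  @impl true
  def handle_batch(_batcher, messages, _batch_info, _context) do
    # MyApp.Catalog and MyApp.Search are hypothetical context modules.
    messages
    |> Enum.map(& &1.data)
    |> Enum.filter(&keep?/1)
    |> tap(&MyApp.Catalog.upsert_products/1)
    |> tap(&MyApp.Search.index_products/1)

    messages
  end

  # Placeholder filter; the real filtering rules would live here.
  defp keep?(product), do: product["price"] not in [nil, ""]
end
```

The batcher stage is the main reason to reach for Broadway here: it lets you write products to Postgres and Elasticsearch in batches rather than one row or document at a time.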

Question(s)
Do you think this is a suitable solution? Overkill? Am I missing something much easier that I’m not aware of?

Before we can answer that: what’s your goal with that project? Why Elastic docs? Why events?

It’s essentially a price comparison website. Standard use is like any regular eCommerce store: you can browse products, search (hence Elastic), etc. The main difference is that the product page itself shows the price at multiple retailers; you click the one you want to purchase from and we affiliate-link you directly to the product on their website.

Each retailer provides a product feed, and I need to import them, filter out unwanted products, match them up to existing MPNs, and save the prices at each retailer.

Having thought about it some more, there’s an option to avoid events and queues and just use something like Flow inside each Oban worker/task. In short, I just need a reliable and relatively quick (in terms of completing the task) way to import from multiple feeds.

The more I think about it, Flow may well be a better way to tackle this initially.
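
Purely as a sketch, the Flow-inside-a-worker idea could look like this: one Oban worker per feed that streams the parsed products through Flow and writes them in batches. MyApp.Feeds.stream_products/1, keep?/1, normalize/1 and MyApp.Catalog.upsert_products/1 are hypothetical placeholders:

```elixir
defmodule MyApp.Workers.ImportFeed do
  use Oban.Worker, queue: :feeds, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"feed_url" => feed_url}}) do
    # MyApp.Feeds.stream_products/1 is a hypothetical function returning a
    # stream of parsed product maps for the given feed URL.
    feed_url
    |> MyApp.Feeds.stream_products()
    |> Flow.from_enumerable(max_demand: 500)
    |> Flow.filter(&MyApp.Feeds.keep?/1)
    |> Flow.map(&MyApp.Feeds.normalize/1)
    # Flow implements Enumerable, so the results can be chunked and
    # upserted in batches as they come out of the flow.
    |> Stream.chunk_every(500)
    |> Enum.each(&MyApp.Catalog.upsert_products/1)

    :ok
  end
end
```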

At a glance this does seem like overkill. Mixing technologies is a large source of complexity, and you can accomplish all of this with fewer tools. Given my natural biases, I would stick to Oban and PostgreSQL:

  1. Define a periodic job that runs every four hours. The job either fetches the XML itself or spawns sub-jobs to fetch the feeds in parallel.
  2. Sub-jobs process each feed either as a whole or in chunks; 1,000 items won’t take long to process.
  3. Write documents directly into Postgres and add some indexes to speed up full-text search.

This implementation only requires one database and one or two Oban workers, depending on how you define them. Then all of the effort goes into your feed handling and processing, not into gluing multiple systems together.
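
Roughly, steps 1 and 2 could look like this. It’s a minimal sketch that assumes the Cron plugin is configured as shown in the comment, and that MyApp.Feeds.retailer_urls/0 and the per-feed ImportFeed worker (like the Flow sketch above) are placeholders:

```elixir
# In config/config.exs, the Cron plugin runs the parent job every four hours:
#
#   plugins: [{Oban.Plugins.Cron,
#     crontab: [{"0 */4 * * *", MyApp.Workers.ScheduleFeedImports}]}]

defmodule MyApp.Workers.ScheduleFeedImports do
  use Oban.Worker, queue: :feeds

  @impl Oban.Worker
  def perform(_job) do
    # MyApp.Feeds.retailer_urls/0 is a hypothetical function returning the feed URLs.
    # Each sub-job is a per-feed worker that fetches and processes one feed.
    MyApp.Feeds.retailer_urls()
    |> Enum.map(&MyApp.Workers.ImportFeed.new(%{"feed_url" => &1}))
    |> Oban.insert_all()

    :ok
  end
end
```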

Down the line, if it is working well, you could make ingestion into a batch and layer notification callbacks on it. For now, start simple!


You’re right, and the solution I threw together this morning pretty much mirrors this.

I have an Oban worker that runs every few hours and iterates through the retailers, creating a new worker for each that fetches the XML, streams through it, and saves it into the database.
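
The streaming part can be done with something like SweetXml’s stream_tags! (a SAX parser such as Saxy would also work). This is only a sketch with placeholder XPath selectors and a hypothetical MyApp.Catalog context, not the actual worker:

```elixir
defmodule MyApp.Feeds do
  import SweetXml, only: [sigil_x: 2, xpath: 2]

  # Streams <item> elements out of a downloaded feed file without loading
  # the whole document into memory, then upserts the products in batches.
  def import_file(path) do
    path
    |> File.stream!()
    |> SweetXml.stream_tags!(:item, discard: [:item])
    |> Stream.map(fn {_tag, element} ->
      %{
        # Placeholder selectors: Google Shopping feeds namespace most fields
        # under g:, which may need SweetXml.add_namespace/3 to resolve.
        mpn: xpath(element, ~x"./g:mpn/text()"s),
        title: xpath(element, ~x"./title/text()"s),
        price: xpath(element, ~x"./g:price/text()"s),
        link: xpath(element, ~x"./link/text()"s)
      }
    end)
    |> Stream.chunk_every(500)
    |> Enum.each(&MyApp.Catalog.upsert_products/1)
  end
end
```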

You read about these things and try to use them, often forgetting that they’re really not necessary (especially at this sort of MVP stage).


The huge advantage of your (and @sorentwo’s) approach is that you can extend it in the future by reading the Postgres records and importing them into a proper search backend like Elastic. But you don’t have to do it immediately and can get away with a half-baked search through Postgres’s full-text search abilities for a time.
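
For that interim Postgres full-text search, a migration along these lines is often enough to start with. It’s a sketch assuming Postgres 12+ (for generated columns) and a hypothetical products table with a title column:

```elixir
defmodule MyApp.Repo.Migrations.AddProductsSearchIndex do
  use Ecto.Migration

  def up do
    # Generated tsvector column plus a GIN index over it.
    execute """
    ALTER TABLE products
      ADD COLUMN search_vector tsvector
      GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, ''))) STORED
    """

    execute "CREATE INDEX products_search_idx ON products USING GIN (search_vector)"
  end

  def down do
    execute "DROP INDEX products_search_idx"
    execute "ALTER TABLE products DROP COLUMN search_vector"
  end
end
```

Queries can then filter with an Ecto fragment such as `fragment("search_vector @@ plainto_tsquery('english', ?)", ^term)`, and you can swap over to Elastic later without touching the ingestion side.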
