Is Broadway suitable for this task?

jordelver · February 12, 2020, 5:29pm

I’ve got a potential project that I think Broadway is well matched for, but I wanted to ask here for a sanity check.

The basic workflow is:

Periodically fetch data from a JSON HTTP API
Store / cache the data
Convert data / munge data into another format
Provide data over XML HTTP API

To begin with there will only be one endpoint, but in the future there will be many different sources. Each may have different schedules for fetching data.

If I was doing this in Ruby (my background) I would probably use some sort of background job to fetch and convert the data, and then serve the data via Rails.

One reason Broadway is attractive to me is because it has features like rate limiting, which I can see being useful as this scales.

Would you use Broadway for this?

axelson · February 15, 2020, 1:23am

Disclaimer: I have never actually used broadway

I don’t think Broadway is a good fit here. It won’t help you with this piece, because the official and unofficial producers all pull from an event stream of sorts:

Currently we officially support three Broadway producers:

    Amazon SQS: Source - Guide
    Google Cloud Pub/Sub: Source - Guide
    RabbitMQ: Source - Guide

Although perhaps someone who’s actually used broadway could offer more guidance.

sorentwo · February 15, 2020, 1:52am

I’m extremely biased, but I would use Oban. It has periodic jobs, scheduled jobs, retries, resiliency, etc. Broadway is tailored for ingesting high-throughput event streams.

jordelver · February 15, 2020, 2:11pm

I’m extremely biased, but I would use Oban. It has periodic jobs, scheduled jobs, retries, resiliency, etc. Broadway is tailored for ingesting high-throughput event streams.

Oban was actually on my list for consideration. I discovered it last week and haven’t had time to play with it yet, but it looks really good.

jordelver · February 15, 2020, 2:11pm

I don’t think that broadway seems like a good fit since it won’t help you with this piece since the official and unofficial producers all pull from an event stream

That’s actually part of the reason I was asking the question - the README does focus on usage with event streams as you say. However, this article uses fetching JSON as an example, but not quite in the same way as I was planning, which confused my thinking.

Thanks for your response

svilen · February 15, 2020, 3:12pm

Disclaimer: I haven’t used Oban

I don’t see a reason not to use Broadway, but I haven’t used it extensively myself either. You can give it a go by creating a custom producer as shown in the official guide:

https://hexdocs.pm/broadway/custom-producers.html#content

It’s easy to make it do work periodically using Process.send_after/4.

It’s true that Broadway is great for data-ingestion pipelines because you can implement event acknowledgement when dealing with SQS. But if you don’t need that, you can still roll your own producer without acknowledgements and use the rest of what Broadway provides.

chasers · February 16, 2020, 2:27am

I’ve used Broadway extensively. Have not used Oban.

You can use Broadway to poll; you just put stuff in your own producer. So you can totally use Broadway here and it would be great.

With Broadway, you do need to understand the lifecycle of a process and your app if you want to make sure you get every event.

Oban is backed by a database so your state is always there. You don’t have to worry about deploys affecting processes, etc. You sacrifice some throughput for this.

So those are mostly your high-level trade-offs.

If you’re not super comfortable with GenServers, I’d say use Oban. If you need all the throughput, use Broadway.

Edit: I should maybe clarify. Broadway won’t do the polling for you. Make your poller and put the results in a queue somewhere (ETS probably), and then have your producer pull from that.
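
A minimal sketch of that kind of ETS-backed queue, using only the standard library. The module name `FetchBuffer` and its functions are hypothetical, not part of Broadway: the poller would call `push/2`, and a custom producer could call `take/2` with the demand it receives. Because the table lives in memory, anything still buffered is lost when the owning process or the node goes down, which is the lifecycle concern mentioned above.

```elixir
defmodule FetchBuffer do
  @moduledoc """
  Hypothetical FIFO buffer between a poller and a producer, kept in ETS.
  Illustrative only; not part of Broadway's API.
  """

  # Creates a named, public, ordered table. The calling process owns it,
  # so the table disappears when that process exits.
  def new(name \\ :fetch_buffer) do
    :ets.new(name, [:ordered_set, :public, :named_table])
  end

  # Appends an item. Monotonic unique integers are strictly increasing,
  # so an :ordered_set keeps items in insertion order.
  def push(table, item) do
    key = System.unique_integer([:monotonic])
    true = :ets.insert(table, {key, item})
    :ok
  end

  # Removes and returns up to `demand` of the oldest items.
  def take(table, demand) when is_integer(demand) and demand > 0 do
    do_take(table, demand, [])
  end

  defp do_take(_table, 0, acc), do: Enum.reverse(acc)

  defp do_take(table, remaining, acc) do
    case :ets.first(table) do
      :"$end_of_table" ->
        Enum.reverse(acc)

      key ->
        case :ets.take(table, key) do
          [{^key, item}] -> do_take(table, remaining - 1, [item | acc])
          # Another process removed this entry first; try the next key.
          [] -> do_take(table, remaining, acc)
        end
    end
  end
end
```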

sorentwo · February 16, 2020, 4:45pm

There is definitely a difference in throughput between batch ingesting SQS events and pulling from a transactional database. However, I think you’ll be limited by the actual job processing before the job processor’s throughput comes into play.

In my benchmarking Oban can process 15k no-op jobs per second on a single node—and batch processing is currently in the works which will increase that throughput significantly.

akoutmos · February 16, 2020, 6:58pm

I have written a few articles/tutorials on Broadway and perhaps those can help inform your decision:
https://akoutmos.com/post/using-broadway/
https://akoutmos.com/post/broadway-rabbitmq-and-the-rise-of-elixir/
https://akoutmos.com/post/broadway-rabbitmq-and-the-rise-of-elixir-two/

I think some more information may be required to better answer your question:

How often are you fetching data from this JSON API?
Is it a single piece of information that needs to be processed (i.e., no benefit from concurrent processing)?
Do you require a message queue to persist messages across deployments of your service?

On the simple side of the spectrum, you could have a simple GenServer with a send_after to do everything that you outlined and just start that process up in your application.ex supervision tree. On the complex side of things, you could run RabbitMQ and Broadway to do your processing, but that depends on the answers to the questions above. As others have mentioned, Oban is also a good tool for the requirements that you outlined.
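
For the simple end of that spectrum, a rough sketch of a polling GenServer might look like the following. The module name `PeriodicFetcher` and the `:fetch_fun` and `:interval_ms` options are assumptions made for illustration; the fetch function stands in for whatever HTTP call and conversion the real project needs.

```elixir
defmodule PeriodicFetcher do
  @moduledoc """
  Hypothetical poller: calls `fetch_fun` on a fixed interval and keeps
  the most recent result in its state.
  """
  use GenServer

  # Options (all names are assumptions for this sketch):
  #   :fetch_fun   - zero-arity function that fetches and converts the data
  #   :interval_ms - delay between polls, defaults to one minute
  #   :name        - optional registered name
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, Keyword.take(opts, [:name]))
  end

  # Returns the most recently fetched value, or nil before the first poll.
  def latest(server), do: GenServer.call(server, :latest)

  @impl true
  def init(opts) do
    state = %{
      fetch_fun: Keyword.fetch!(opts, :fetch_fun),
      interval_ms: Keyword.get(opts, :interval_ms, 60_000),
      latest: nil
    }

    # Trigger the first poll immediately after init returns.
    send(self(), :poll)
    {:ok, state}
  end

  @impl true
  def handle_call(:latest, _from, state), do: {:reply, state.latest, state}

  @impl true
  def handle_info(:poll, state) do
    result = state.fetch_fun.()
    # Schedule the next poll only after this one finishes, so slow
    # fetches never overlap. If fetch_fun raises, the process crashes
    # and its supervisor decides whether to restart it.
    Process.send_after(self(), :poll, state.interval_ms)
    {:noreply, %{state | latest: result}}
  end
end
```

To cover several sources on different schedules, one option is to start one such process per source under the supervisor, each with its own `:interval_ms`.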

jordelver · February 17, 2020, 3:40pm

Thanks. I’ll have a read of those docs

jordelver · February 17, 2020, 3:43pm

Your top article is the one I quoted above. Thanks for writing it. I haven’t read your others yet, but will.

To be completely honest, I don’t know the answers to your questions yet - this project is in the very early stages. It’s great to be aware of multiple strategies that may fit, so thanks for contributing to this post.

jordelver · February 17, 2020, 3:47pm

Thanks @chasers and @sorentwo for this interesting discussion. At the moment, I feel like I’d be more comfortable implementing a solution using Oban, mainly because I understand the approach more clearly.

One thing that I like about the Oban approach is observability. It seems to me that I can look at the state in the database and have a better idea of what is happening. I know Broadway has Telemetry hooks, but I’m not sure how to use those at the moment - that sounds like a totally new topic.