BroadwaySQS pulling far more messages at a time than expected

I’ve been using Broadway with SQS for a little more than two years, and I’ve just run into an issue I can’t understand or explain. For reasons too complex to get into here, the application is still using Broadway v0.6.2 with BroadwaySQS v0.6.1. UPDATE: I have a local version running with Broadway v1.0.1 and Broadway Dashboard, but I’m still not sure what’s going on.

I have a multi-stage Broadway application where each of the 15 stages is started with these options:

[
  producer: [
    module: {BroadwaySQS.Producer,
     [
       receive_interval: 1000,
       wait_time_seconds: 1,
       max_number_of_messages: 10,
       visibility_timeout: 30
     ]},
    concurrency: 10
  ],
  processors: [default: [concurrency: 100, min_demand: 5, max_demand: 10]],
  batchers: [sns: [concurrency: 1, batch_size: 100, batch_timeout: 1000]]
]
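
For context, each of those option lists is passed to Broadway.start_link/2 inside a module roughly like the one below. The module name and callback bodies here are simplified placeholders, not our real code:

defmodule MyApp.SomeStage do
  use Broadway

  alias Broadway.Message

  # `opts` is the keyword list shown above
  def start_link(opts) do
    Broadway.start_link(__MODULE__, [name: __MODULE__] ++ opts)
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # the quick database update happens here, then the message
    # is routed to the :sns batcher
    Message.put_batcher(message, :sns)
  end

  @impl true
  def handle_batch(:sns, messages, _batch_info, _context) do
    # publish the batch onward; returning the messages acks them to SQS
    messages
  end
end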

I’m running a single instance of the application that this pipeline is part of.

I had a situation today where I noticed the system seemed to be choking – my logs showed the same message coming through multiple times at 30-second intervals. The work – just a quick and simple database update – was completing almost immediately, but the message was never being acked back to SQS, causing it to drop back into the queue when its visibility timeout expired.

I looked at the SQS console, and the queue in question showed more than 15,000 messages in flight. There are no consumers for this queue other than the 10 BroadwaySQS producers above.

I shut down the application, waited for all of the timeouts to expire, and purged the queue. I restarted the application and waited for everything to initialize, and then flooded the queue with messages again just to see what would happen. The number in the “Messages in flight” column started increasing rapidly again while the “Messages available” column dropped. Once again, the application couldn’t keep up with the acknowledgements, and messages started timing out and requeueing. We’ve experienced a few timeouts here and there in the past, but nothing like this.

I’ve been under the impression that the number of messages in flight would always max out somewhere around [producer concurrency] * [max_number_of_messages] (in my case, 10 * 10 = 100 per pipeline) while the producer waits for messages to be processed (successfully or not) before requesting more. Have I fundamentally misunderstood how BroadwaySQS polls for messages this whole time?

I wish I could use the Broadway Dashboard to get a good visual representation of what’s going on inside the pipeline, but like I said, we’re stuck with Broadway v0.6.2 for the time being.


Have you been able to figure this out?

Can you explain this a little bit more? Does it mean that you have 15 instances of this pipeline? If so, then I think it all makes sense.

From GenStage docs:

When implementing consumers, we often set the :max_demand and :min_demand on subscription. The :max_demand specifies the maximum amount of events that must be in flow while the :min_demand specifies the minimum threshold to trigger for more demand. For example, if :max_demand is 1000 and :min_demand is 750, the consumer will ask for 1000 events initially and ask for more only after it processes at least 250.

So you have 100 processors, each asking for up to 10 messages, and this is instantiated 15 times (100 * 10 * 15 = 15,000). It looks like the producers are able to keep up with that demand.
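
Just to illustrate the effect (these numbers are made up, not a recommendation), shrinking the processor pool and its demand per pipeline lowers that worst case considerably:

processors: [default: [concurrency: 10, min_demand: 0, max_demand: 5]]

# worst case in flight across all 15 pipelines:
# 10 processors * 5 max_demand * 15 pipelines = 750 messages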


I have 15 pipelines, not 15 instances of the same pipeline. (My terminology here gets confusing because what we call our “processing pipeline” is actually 15 different Broadway GenServers that pass data forward among them. So the meaning of “pipeline” gets overloaded depending on whether I use the Broadway definition or our internal app’s definition.)

But yes, changing the producer/processor concurrency settings seems to have done the trick, and I understand how Broadway handles demand a lot better now.
