I’ve been using Broadway with SQS for a little more than two years, and I’ve just run into an issue I can’t understand or explain. For reasons too complex to get into here, the application is still using Broadway v0.6.2 with BroadwaySQS v0.6.1. UPDATE: I have a local version running with Broadway v1.0.1 and Broadway Dashboard, but I’m still not sure what’s going on.
I have a multi-stage Broadway application where each of the 15 stages is started with these options:
[
producer: [
module: {BroadwaySQS.Producer,
[
receive_interval: 1000,
wait_time_seconds: 1,
max_number_of_messages: 10,
visibility_timeout: 30
]},
concurrency: 10
],
processors: [default: [concurrency: 100, min_demand: 5, max_demand: 10]],
batchers: [sns: [concurrency: 1, batch_size: 100, batch_timeout: 1000]]
]
I’m running a single instance of the application this is part of.
I had a situation today where I noticed the system seemed to be choking – my logs showed the same message coming through multiple times at 30 second intervals. The work – just a quick and simple database update – was completing almost immediately, but the message was never being ack
ed back to SQS, causing it to drop back into the queue when its visibility timeout expired.
I looked at the SQS console, and the queue in question showed more than 15,000 messages in flight. There are no consumers for this queue other than the 10 BroadwaySQS producers above.
I shut down the application, waited for all of the timeouts to expire, and purged the queue. I restarted the application and waited for everything to initialize, and then flooded the queue with messages again just to see what would happen. The number in the “Messages in flight” column started increasing rapidly again while the messages “Messages available” column dropped. Once again, the application couldn’t keep up with the acknowledgements, and messages started timing out and requeueing. We’ve experienced a few timeouts here and there in the past, but nothing like this.
I’ve been under the impression that the number of messages in flight would always max out somewhere around [producer concurrency] * [max_number_of_messages]
while the producer waits for messages to be processed (successfully or not) before requesting more. Have I fundamentally misunderstood how BroadwaySQS polls for messages this whole time?
I wish I could use the Broadway Dashboard to get a good visual representation of what’s going on inside the pipeline, but like I said, we’re stuck with Broadway v0.6.2 for the time being.