Fault-tolerant handling of exclusive AMQP queue

krasenyp · November 13, 2020, 3:06pm

Hi everyone. I have an interesting case at work. I’m developing a service which ingests data from an external data provider via an exclusive AMQP queue. Since it’s an exclusive queue, the queue is destroyed when the consumer disconnects, and having more than one queue means I have to de-duplicate messages. My task is to minimize the chance of missing messages when the connection is flaky or when something happens to the container the service is running in.

By using supervision, I can minimize the missed messages when the connection is flaky, but how do I handle containers crashing? I’ve identified two ways to tackle the problem; both presume a cluster of at least two instances of the service. One approach is to register the AMQP connection process as a global process by using :global or horde. The second approach consists of creating a process group with :pg and synchronizing the connection and channel where the queue is declared. I couldn’t get it to work reliably when nodes are started at the same time.

Are there any alternatives, or am I missing something? What approach would you advise me to use? Thanks in advance.

benwilson512 · November 13, 2020, 3:58pm

Ideally, the external data provider needs to require that you ACKnowledge data that you receive. This way, if the connection crashes, they can re-send any data that hasn’t been acknowledged. This requires some minor de-duplication on your end, but it also guarantees that you don’t miss messages.
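As a rough illustration of the de-duplication step, here is a minimal sketch in Python. It assumes every message carries a unique identifier (for example an AMQP `message_id` property), which the thread does not confirm. `Deduplicator`, `handle_deliveries`, and `max_size` are hypothetical names made up for this example. It keeps a bounded window of recently seen IDs, so a re-sent message is dropped as long as its ID is still in the window.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recently seen message IDs and flags repeats.

    Assumption: each message has a unique ID supplied by the provider.
    """

    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self._seen = OrderedDict()  # message_id -> True, oldest first

    def is_new(self, message_id):
        if message_id in self._seen:
            # Repeat: refresh its position so it is evicted later.
            self._seen.move_to_end(message_id)
            return False
        self._seen[message_id] = True
        if len(self._seen) > self.max_size:
            # Evict the least recently seen ID to bound memory use.
            self._seen.popitem(last=False)
        return True


def handle_deliveries(deliveries, dedup):
    """Return payloads of first-time messages, in arrival order.

    `deliveries` is an iterable of (message_id, payload) pairs.
    """
    return [payload for message_id, payload in deliveries if dedup.is_new(message_id)]
```

The window size is a trade-off: it has to cover at least as many messages as the provider might re-send after a reconnect, otherwise an old duplicate could slip through once its ID has been evicted.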

krasenyp · November 13, 2020, 5:00pm

The problem is that the queue is exclusive, so when the service reconnects, a new queue is created and all the messages sent between the lost connection and the reconnect are lost.

benwilson512 · November 13, 2020, 5:28pm

This is an intrinsically fault-intolerant design. It cannot be made to stop dropping messages without changing something about the design. You can try to minimize faults by keeping things online as much as possible, but the very idea of “fault tolerance” is that a fault can happen and things can still recover.

krasenyp · November 19, 2020, 4:13pm

For the sake of continuity, I managed to minimize the possibility of losing messages with a two-node cluster, using horde to manage a global AMQP connection. If a connection dies, it gets restarted, and if a node dies, the connection is restarted on the other node. It’s not perfect but it’s better than nothing.