Queues cannot guarantee at-least-once delivery

Strong theorems state that a distributed system cannot guarantee that a message will be delivered exactly once. If you ever tried, then a freak sequence of network partitions, delays and interruptions would lead to the message either disappearing, or being received more than once.

The next best thing is to guarantee that a message will be delivered at least once—there are no theorems about at-least-once-delivery being impossible—and rework the receiver so that duplicate messages do not cause issues. It does not matter if the user's profile picture is resized twice, or their PDF is generated twice; it would be troublesome if the profile picture was never resized, or the PDF was never generated.

Queues are a common way of delivering messages at least once. They are quite convenient: the sender inserts the message into the queue and forgets about it, the receiver reads the message and then deletes it. Multiple delivery occurs if the receiver fails to delete the message, leaving it in the queue to be read again.

This has reached the point where software engineers, who should know better, think « this needs at-least-once delivery semantics » and immediately reach out for their distributed queue of choice. Bad news : that queue does not not guarantee at-least-once delivery!

  1. Can the queue server crash and lose all data since the last backup? Messages not delivered.

  2. Can queues be deleted, given administrator privileges? Not delivered!

  3. Are toxic messages automatically dropped after a number of reads?
    After spending too long in the queue?

  4. Can messages become unreadable if the serialization format changes while they are in the queue?

  5. Can messages be deleted from queues by anyone but the recipient, such as a broken debug version running on a developer's machine? If the notion of delivered includes being piped to /dev/null, then it is useless.

Rather than theorems, these are practical issues that any software team will face, and a theoretical guarantee that at-least-once-delivery will happen if none of the above occur, is not very useful unless you spend the necessary effort to prevent them from happening:

  1. Have queue backups on the same level as the production database.

  2. Sysadmin procedure requires that queues must be emptied before deletion.

  3. Toxic messages remain in the queue until production code is deployed to process them.

  4. Queue messages must be migrated prior to serialization changes, or the receivers must be backwards-compatible with the old format, or new-serialization messages are pushed to a new queue, with old-serialization receivers remaining active until the old queue is drained.

  5. No direct developer access to queues—but then, who migrates the messages?

At this point, queues become production-critical repositories of state, just like the database, but without the invariants, management tools, and explorability. Why not use a database? But it does not matter: queues never go this far anyway. At-least-once-delivery is a lie.

Don't believe me?

Consider that queue-as-a-service providers do not promise at-least-once delivery. Azure Queue Storage does not mention it (but does mention their SLA for not losing messages), and Amazon SQS has a section titled At-least-once-delivery which merely states that at-most-once-delivery is not supported:

Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message.

If this occurs, the copy of the message isn't deleted on that unavailable server, and you might get that message copy again when you receive messages. Design your applications to be idempotent (they should not be affected adversely when processing the same message more than once).

Still unconvinced?

When was the last time you reviewed code that pushed messages to a queue, and did not have one of these two thoughts:

In truth, delivery semantics are irrelevant. What matters is whether the message will be processed, because that processing should be followed by an effect that was the original intent behind the message. Whether the PDF-generation-worker will receive the PDF-generation-request is only important as a condition of the document being generated. The other condition is for the PDF-generation-worker to generate the document (or emit an error message) and notify the requester. Receiving the message does not guarantee that this other condition will succeed.

The real question is: what part of the distributed system is responsible for the PDF generation? Not for the act of generating the PDF, but for ensuring that the PDF will be generated.

Obviously, until the request is inserted into the queue, it is the sender's responsibility—and if the queue server does not respond, or if the message is rejected because the queue is full, the sender is expected to retry or to propagate an error message back. What about after the message is inserted into the queue? Is the sender is still responsible?

Maybe nobody is responsible, also known as « the user is the timeout » and repeats their request, slightly more annoyed than before. But this is fine if nobody cares, and the result is nice-to-have rather than absolutely necessary.

Maybe the queue is responsible, at least until the receiver has deleted the message. At this point, the queue becomes a production-grade database.

Maybe the sender is responsible, remembering the request in its own persistent database until the response is received, and re-inserting a message into the queue when it determines that the previous message is lost. At this point, the distributed queue's delivered-at-least-once guarantee becomes redundant: the sender already implements a processed-at-least-once guarantee that is strictly stronger.

Of course, distributed queues remain useful for their other properties: being multi-reader and multi-writer, high-throughput, and not requiring the sender and receiver to be alive at the same time or even able to reach one another. But at-least-once delivery semantics are dead to me.