
Aug 26, 2024

Using Celery with an SQS broker

by Sean Kerr

If you know me, you know that I’ve pretty much married Celery and RabbitMQ together ever since I discovered them. They work well in unison for large projects where the cost of RabbitMQ is justified. Sometimes it’s easy to forget that Celery supports other brokers such as Redis and SQS, both of which are excellent choices.

Reliability

Most people know what Redis is, but I feel that SQS isn’t so widely known in the Celery community. It stands for Simple Queue Service, and it’s AWS’ undeniably reliable message queue. And when I say reliable, I’m doing the SQS name a disservice. Here are their words, not mine:

Amazon SQS stores all message queues and messages within a single, highly-available AWS region with multiple redundant Availability Zones (AZs), so that no single computer, network, or AZ failure can make messages inaccessible.

Why use SQS over RabbitMQ or Redis?

All that aside, SQS is a great choice as a Celery broker because there are no servers to maintain, no upfront costs, and no reason to ever worry about scaling because it’s practically infinite. You can queue an effectively unlimited number of SQS messages (tasks, in our case) for up to 14 days, and storage itself costs nothing; you only pay per request.

Maybe there is a cost…

The only cost that comes to mind is that you can’t run Flower. Without Flower, you won’t be able to see your currently running Celery workers or get task insights. These are mission critical requirements for large applications, but are negotiable for smaller ones.

Let’s get down to the nitty gritty…

ACKs late

My recommendation is to enable the acks_late Celery configuration setting. By default, with acks_late disabled, Celery acknowledges the task as soon as it is received, which deletes the message from SQS and renders the other important features such as the visibility timeout, dead letter queues, and redrive policies useless.
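As a reference point, here’s a minimal sketch of that configuration. The app name, region, and queue prefix are placeholders I’ve assumed for illustration, not values from any particular project:

```python
# Minimal sketch: Celery with an SQS broker and late acknowledgement.
from celery import Celery

app = Celery(
    "myapp",
    # "sqs://" with no credentials lets boto3 fall back to its normal
    # credential chain (environment variables, config files, or an IAM role).
    broker="sqs://",
)

app.conf.update(
    # Don't delete the SQS message until the task has finished, so the
    # visibility timeout, redrive policy, and dead letter queue still apply.
    task_acks_late=True,
    broker_transport_options={
        "region": "us-east-1",          # assumed region
        "queue_name_prefix": "myapp-",  # assumed, optional prefix
    },
)
```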

Visibility timeout

SQS has the concept of a visibility timeout. This simply means that when a message is delivered to a consumer, that consumer is the only one able to see the message while the visibility timeout is in effect. If the consumer doesn’t delete the message before the visibility timeout expires, the message becomes visible again and will be delivered to another consumer.

In Celery terms, this means you need to do one of two things when you process the task: either finish (and ACK) the task before the visibility timeout expires, or choose a visibility timeout long enough that the task, including all of its retries, will always finish within it.

This behavior differs from a typical broker installation, where a task executes and, regardless of ACK status, the same task won’t be redelivered to another consumer on a timer. ACKing late in that scenario only protects you from losing your task in the event of a worker failure.

In my recent implementation, I found 600 seconds (10 minutes) to be a good visibility timeout. That’s enough time to run the task, with room for all retries as well as potential HTTP timeouts on outbound requests. If my task fails all retries, Celery logs the exception. Shortly after, the visibility timeout in SQS expires and the exact same task runs again. This repeats until the redrive policy kicks in.
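If you manage that timeout through Celery’s transport options rather than directly on the queue, it might look something like this, building on the app object from the earlier sketch:

```python
# Sketch: a 600 second (10 minute) visibility timeout via the SQS transport
# options. An unacknowledged task becomes visible again, and is redelivered,
# once this window expires, so it should comfortably exceed the longest task
# runtime including retries and outbound HTTP timeouts.
app.conf.broker_transport_options = {
    "region": "us-east-1",      # assumed region
    "visibility_timeout": 600,  # seconds
}
```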

Redrive policy and dead letter queues

When you configure a redrive policy for an SQS queue, you specify the maxReceiveCount setting, which tells SQS how many times a message can be received before it is moved to the policy’s dead letter queue.

If you want SQS to function the same way RabbitMQ and Redis do, you can set the maxReceiveCount setting to 1. If your task completes successfully before the visibility timeout expires, it’ll be deleted from the queue. But if the task does not complete successfully, SQS will move it to the dead letter queue when the visibility timeout expires. From there you can examine the cause of failure, and possibly move messages back into a working queue once you’ve corrected the code causing the failure.
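For illustration, wiring that up with boto3 might look roughly like this; the queue names and region are assumptions on my part, not prescriptions:

```python
# Sketch: attach a dead letter queue to the main Celery queue with
# maxReceiveCount set to 1, so a task that outlives its visibility timeout
# is moved straight to the DLQ on the next receive.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region

# "myapp-celery" assumes the queue_name_prefix from the earlier sketch.
main_queue_url = sqs.get_queue_url(QueueName="myapp-celery")["QueueUrl"]
dlq_url = sqs.create_queue(QueueName="myapp-celery-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sqs.set_queue_attributes(
    QueueUrl=main_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "1"}
        )
    },
)
```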

There are various reasons why you would want to raise the maxReceiveCount, but they are for you to decide.