Last post, we went over RabbitMQ best practices. This time I’ll go over how we’ve failed with RabbitMQ, both in development and in production, along with the side effects we observed and how each issue was resolved.
Unlimited Prefetch
One of the first issues we encountered was using the default prefetch, which is unlimited. When we load tested our application, we saw the consumer run out of memory and restart. The prefetch should always be set for consumers. The rule of thumb is to set the prefetch value to (total round trip time) / (processing time on the client for each message) if you have a single consumer, or a prefetch of 1 if there are multiple consumers and/or slow consumers.
For example, if it takes 10 ms for the message to be delivered to the consumer, 20 ms to process the message, and 10 ms to send the acknowledgement, the prefetch should be (10 + 20 + 10 ms) / (20 ms) = 2. However, for multi-threaded consumers the prefetch should be larger than this rule of thumb suggests, since more than one message can be processed simultaneously.
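As a rough sketch of what setting the prefetch looks like in code, here it is with the Python pika client (the queue name, connection details, and process function are placeholders, not our actual setup):

```python
import pika

def process(body):
    ...  # placeholder for the actual message handling

# Placeholder connection details for illustration.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Cap the number of unacknowledged messages delivered to this consumer.
# Using the rule of thumb above: (10 + 20 + 10 ms) / 20 ms = 2.
channel.basic_qos(prefetch_count=2)

def handle_message(ch, method, properties, body):
    process(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="work-queue", on_message_callback=handle_message)
channel.start_consuming()
```

Leaving out the basic_qos call is what gives you the unlimited prefetch described above.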
RabbitMQ Sizing
Another failure case we encountered was a crash of RabbitMQ caused by too many connections for the instance size. We had doubled the number of connections we were making without checking that our RabbitMQ instance could handle them. Connection counts can be monitored through the metrics in the RabbitMQ Management UI. When scaling up the number of consumers and messages, it’s a good idea to first assess, based on these metrics, whether a larger instance size is required.
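The same numbers are also exposed by the management plugin’s HTTP API if you want to check them outside the UI. A minimal sketch, assuming the default management port and placeholder credentials:

```python
import requests

# Assumes the rabbitmq_management plugin is enabled on its default
# port (15672); the host and credentials here are placeholders.
MGMT_URL = "http://localhost:15672/api/connections"

resp = requests.get(MGMT_URL, auth=("guest", "guest"), timeout=5)
resp.raise_for_status()
print(f"Open connections: {len(resp.json())}")
```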
After this incident, we bumped the instance size and moved over to a two-node cluster for failover.
The RabbitMQ plans can be found here.
Requeueing Failed Messages
Nack’ing failed messages without setting requeue to false, or throwing an error from the consumer, causes them to be requeued. These messages will be redelivered to the consumer until they are rejected with requeue = false or they are successfully processed. This can lead to messages being stuck in a failure loop and can be catastrophic if there are side effects before the consumer fails.
We protected against this by retrying once before rejecting/nack’ing with requeue = false and then logging these messages for debugging.
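One simple way to get a single retry is to check the redelivered flag on the delivery. Here is a sketch of such a consumer callback with pika (the handler and logging details are illustrative, not our exact code), which would be registered with basic_consume as in the earlier example:

```python
import logging

logger = logging.getLogger(__name__)

def process(body):
    ...  # placeholder for the actual message handling

def handle_message(ch, method, properties, body):
    try:
        process(body)
    except Exception:
        if method.redelivered:
            # Second failure: log the message for debugging and reject it
            # with requeue=False so it can't loop forever.
            logger.exception("Dropping message after one retry: %r", body)
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            # First failure: requeue it exactly once.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
    else:
        ch.basic_ack(delivery_tag=method.delivery_tag)
```

Rejecting with requeue = false will also route the message to a dead letter exchange if one is configured on the queue, which is another place these failures can be captured.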
Too Many Connections
Earlier we discussed running into resource constraints due to having too many connections. Another cause can be a bug in an application that makes it open more and more connections. We had an API that was making a new connection for each request when it should have been reusing a single connection. Fortunately, we were monitoring metrics and caught this issue soon after it was deployed, before our alerts were triggered.
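The fix amounts to opening the connection once and sharing it across requests. A rough sketch of the pattern with pika (the names here are made up for illustration):

```python
import pika

_connection = None
_channel = None

def get_channel():
    """Open the connection once and reuse it, reopening only if it has dropped."""
    global _connection, _channel
    if _connection is None or _connection.is_closed:
        _connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        _channel = _connection.channel()
    return _channel

def publish_event(payload: bytes):
    # Called from request handlers: publishes over the shared channel
    # instead of opening (and leaking) a new connection per request.
    get_channel().basic_publish(exchange="", routing_key="events", body=payload)
```

One caveat: pika’s BlockingConnection is not thread-safe, so an API serving requests from multiple threads would need one connection per thread or an async client rather than a single global.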
Having alerting set up through one of the RabbitMQ integrations with Datadog, Kibana, etc. can help catch these situations early. Having a development environment with its own RabbitMQ server would also help catch these issues before they get deployed to production.
RabbitMQ Management UI Issues
Another issue we ran into was the failure of the RabbitMQ Management UI to load. After login, the headers would be visible with no data in the body. We were sometimes able to work around it by logging in through the CloudAMQP management dashboard. We also realized that the logstream queue was missing a consumer, which meant logs weren’t being ingested. Although we haven’t diagnosed the root cause, the issue was resolved by restarting the server.