Generic server report and stopping Virtual host '/'. What is the root problem? #14075
-
Describe the bug

Disclaimer: I post this because I want to understand the root cause of the problem so that we can avoid it in the future. Our services had problems while communicating with RabbitMQ.

Error logs from RabbitMQ:

Reproduction steps

We run RabbitMQ as a single replica inside Docker Swarm. Our services couldn't start due to some problems with RabbitMQ. This happened for the first time (in years) and only sporadically.

Expected behavior

RabbitMQ runs without closing the virtual host (i.e. stays healthy).

Additional context

It happened at rabbitmq:

Docker Compose spec:

```yaml
rabbit:
  image: custom/rabbitmq:3.13.7-management
  init: true
  hostname: "{{.Node.Hostname}}-{{.Task.Slot}}"
  environment:
    RABBITMQ_DEFAULT_USER: ${RABBIT_USER}
    RABBITMQ_DEFAULT_PASS: ${RABBIT_PASSWORD}
  volumes:
    - rabbit_data:/var/lib/rabbitmq
  networks:
    - default
    - computational_services_subnet
    - interactive_services_subnet
    - autoscaling_subnet
  healthcheck:
    # see https://www.rabbitmq.com/monitoring.html#individual-checks for info about health checks available in RabbitMQ
    test: rabbitmq-diagnostics -q status
    interval: 5s
    timeout: 30s
    retries: 5
    start_period: 5s
```

Dockerfile:

```dockerfile
ARG VERSION
FROM rabbitmq:${VERSION}

# install plugins
RUN rabbitmq-plugins enable \
    --offline rabbitmq_management \
    rabbitmq_management_agent \
    rabbitmq_web_dispatch \
    rabbitmq_prometheus
```

Metrics:
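On the healthcheck in the compose spec above: the monitoring guide linked in that comment documents cheaper, more targeted checks than a full `rabbitmq-diagnostics status` dump. A minimal sketch of those individual checks, for reference only (whether and how to combine them in the compose `test:` line is a deployment choice, not something prescribed in this thread):

```bash
# Individual health checks from https://www.rabbitmq.com/monitoring.html#individual-checks,
# roughly ordered from cheapest to most thorough:
rabbitmq-diagnostics -q ping                     # node responds to CLI tool connections
rabbitmq-diagnostics -q check_running            # the rabbit application is running on the node
rabbitmq-diagnostics -q check_local_alarms       # no local memory/disk alarms are in effect
rabbitmq-diagnostics -q check_port_connectivity  # all active listeners accept TCP connections
```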
-
According to the stack trace, this crash happens when the classic queue shared message store tries to delete a message file (an rdq file) that it no longer needs (all messages in the file have been removed from queues).

You mentioned the root cause, but since you added the bug label I want to mention that rdq file deletion was refactored in 4.1.1, so this "bug" won't happen there any more and we might consider it fixed (something similar might still happen in a slightly different scenario during compaction). A GH discussion would be more appropriate.

Regarding the root cause: when I saw this crash, all I could find out was that the shared message store usually stores a given message only once, even if it was published to multiple queues. However, in certain scenarios the same message can be stored multiple times. In this case the same message is stored multiple times and in different files. That in itself is fine, but what is unexpected is that the stored copies have different sizes on disk (that is what I suspect). The code is not prepared for this and crashes. I've seen two examples where the size was different for the same message:

I would be curious what the Core Team thinks about how this latter case can happen (what route does the message need to take within the broker? dead-lettering? maybe shovelling with a direct connection?) and whether it can theoretically still cause a similar crash during compaction in 4.1 and above.
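For orientation, the rdq files discussed above are the segment files of the classic queue shared message store on disk. A minimal sketch for locating them in a deployment like the one in this thread, assuming the stock data directory layout under the `/var/lib/rabbitmq` volume from the compose file (the container name is a placeholder):

```bash
# List shared message store segment (.rdq) files and their on-disk sizes.
# The path layout assumes an unmodified RabbitMQ 3.x data directory.
docker exec <rabbit-container> sh -c \
  'find /var/lib/rabbitmq/mnesia -path "*msg_store*" -name "*.rdq" -exec ls -lh {} +'
```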
-
Could you refer us to the commit and release notes related to this refactoring / fix?
-
This is the PR that eliminates the "scanning" operation before the file delete (it's the scanning that crashed for you):
-
If we update to 4.1.1, is there any way to be sure that we won't encounter this bug ("in a slightly different scenario") anymore?
-
As already mentioned, you are looking at #13951 (or a variation of it).