(EU) Incident Report: 11/08/24 - 13/08/24
Resolved
Aug 15 at 10:14am CEST
☁︎ Incident Summary
On August 11, 2024, at approximately 20:00 CET, the platform on the EU instance became unstable. The instability was followed by downtime and a maintenance window, and the incident lasted until August 13, 2024, at 14:00 CET.
☁︎ Impact
● Services Affected:
○ Platform UI: Partial availability, followed by downtime
○ Platform services: Partial availability, followed by downtime
○ Integration flows: Stopped for the duration of a 6-hour platform maintenance window
☁︎ Timeframe
● Entire incident:
○ 11.08.2024 20:00 CET - 13.08.2024 14:00 CET
● Detailed:
○ 11.08.2024 20:00 CET - 13.08.2024 ≈1:00 CET
The platform continued to receive and store incoming messages. For approximately
6 hours within this period, messages were only partially processed.
○ 13.08.2024 ≈1:00 CET - 13.08.2024 ≈8:00 CET
Partial availability of platform services.
○ 13.08.2024 ≈8:00 CET - 13.08.2024 14:00 CET
Maintenance. All flows have been stopped for the duration of the maintenance.
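The length of each phase follows directly from the timestamps above. The short Python sketch below does the arithmetic. The phase labels are illustrative names chosen here, not official terminology, and the boundaries marked with ≈ in the report are treated as exact, so the per-phase figures are approximate.

```python
from datetime import datetime

# Timestamp format used in the Timeframe section (DD.MM.YYYY HH:MM).
FMT = "%d.%m.%Y %H:%M"

# Phase boundaries copied from the report; labels are illustrative only.
PHASES = [
    ("instability", "11.08.2024 20:00", "13.08.2024 01:00"),
    ("partial availability", "13.08.2024 01:00", "13.08.2024 08:00"),
    ("maintenance", "13.08.2024 08:00", "13.08.2024 14:00"),
]


def phase_hours(phases):
    """Return a dict mapping each phase label to its duration in hours."""
    result = {}
    for name, start, end in phases:
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        result[name] = delta.total_seconds() / 3600
    return result


if __name__ == "__main__":
    hours = phase_hours(PHASES)
    for name, value in hours.items():
        print(f"{name}: {value:.0f} h")
    print(f"total: {sum(hours.values()):.0f} h")
```

Under these assumptions the three phases last roughly 29, 7, and 6 hours, which add up to the 42 hours of the entire incident.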
☁︎ Root Cause
The root cause was a networking problem on one of the RabbitMQ nodes. Because RabbitMQ runs as a cluster, this problem did not affect the platform until 20:00 CET. Load typically increases during the evening, which is expected behaviour for our platform and does not generally cause any problems. This time, however, the malfunctioning RabbitMQ node entered a partial split-brain state, in which the cluster nodes no longer agreed on cluster membership.
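One way to spot this kind of condition is to ask each node which peers it considers partitioned. The sketch below is not the procedure used during this incident, which the report does not describe. It assumes the RabbitMQ management plugin is enabled and relies on the `partitions` field that the plugin's `/api/nodes` endpoint reports for every node. The base URL and credentials passed to `fetch_nodes` are placeholders.

```python
import base64
import json
import urllib.request


def find_partitions(nodes):
    """Map each node name to the peers it reports as partitioned.

    `nodes` is the parsed JSON list returned by GET /api/nodes.
    Nodes that report no partitions are left out of the result.
    """
    return {
        node["name"]: node["partitions"]
        for node in nodes
        if node.get("partitions")
    }


def fetch_nodes(base_url, user, password, timeout=10):
    """Fetch /api/nodes from the management plugin using HTTP basic auth."""
    request = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Sample data shaped like the API response; node names are made up.
    sample = [
        {"name": "rabbit@node-1", "partitions": []},
        {"name": "rabbit@node-2", "partitions": ["rabbit@node-3"]},
        {"name": "rabbit@node-3", "partitions": ["rabbit@node-2"]},
    ]
    print(find_partitions(sample))
```

In a partial partition, only some nodes appear in the result, so an alert on any non-empty result is a reasonable starting point.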
☁︎ Description
During the period of platform instability, our Development and DevOps teams concentrated on stabilizing the platform. Every action was taken with the primary objective of preventing any loss of incoming data.
Despite extensive efforts, every available option for recovering the existing RabbitMQ cluster failed, and the platform instability continued. Consequently, we decided to take the more drastic step of resetting the RabbitMQ cluster.
This step required temporarily stopping all flows. Once the reset was complete, all affected platform services returned to normal operation.
Additionally, the following steps were performed:
- Upgrading the RabbitMQ nodes
- Reworking the RabbitMQ queue policies
The maintenance allowed our team to improve the stability of the RabbitMQ cluster and of the platform services with respect to queue management. These actions will help protect the platform against similar issues in the future.
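For readers unfamiliar with queue policies, the sketch below shows how one can be expressed for the RabbitMQ management HTTP API (`PUT /api/policies/{vhost}/{name}`). The report does not say which policies were changed, so the policy name, the queue name pattern, and the limits here are purely hypothetical. Only the request shape and the `max-length` and `overflow` keys are standard RabbitMQ.

```python
import json
from urllib.parse import quote


def build_policy_request(vhost, name, pattern, definition, priority=0):
    """Build the path and JSON body for PUT /api/policies/{vhost}/{name}.

    The vhost and policy name are percent-encoded, so the default
    vhost "/" becomes "%2F" in the path.
    """
    path = f"/api/policies/{quote(vhost, safe='')}/{quote(name, safe='')}"
    body = {
        "pattern": pattern,        # regex matched against queue names
        "definition": definition,  # policy keys and values
        "priority": priority,
        "apply-to": "queues",
    }
    return path, json.dumps(body)


if __name__ == "__main__":
    # Hypothetical example: cap queue length and reject new publishes when full.
    path, body = build_policy_request(
        vhost="/",
        name="bounded-queues",
        pattern="^example\\.",
        definition={"max-length": 100000, "overflow": "reject-publish"},
    )
    print(path)
    print(body)
```

Bounding queue length is only one possible goal of such a policy. The appropriate keys depend on the queue type and the RabbitMQ version in use.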
☁︎ Additional Measures
● Elastic.io will enhance the status page to provide more detailed information about the status of platform services.
● Elastic.io will schedule maintenance to apply the same enhancements to the US instance, AU instance, and private instances.
● Elastic.io will provide a new, enhanced platform notification service.
Affected services
EU cluster API
EU cluster frontend
EU cluster WebHook
EU cluster Realtime
US cluster API