(EU) Incident Report: 11/08/24 - 13/08/24

Aug 15 at 10:14am CEST
Affected services
EU cluster API
EU cluster frontend
EU cluster WebHook
EU cluster Realtime
US cluster API

Resolved
Aug 15 at 10:14am CEST

☁︎ Incident Summary

On August 11, 2024, at approximately 20:00 CET, the EU instance of the platform became unstable. The instability was followed by downtime and a maintenance window, and the incident lasted until August 13, 2024, at 14:00 CET.

☁︎ Impact

● Services Affected:

○ Platform UI: Partial availability, followed by downtime
○ Platform services: Partial availability, followed by downtime
○ Integration flows: stopped for the 6-hour platform maintenance window

☁︎ Timeframe

● Entire incident:

○ 11.08.2024 20:00 CET - 13.08.2024 14:00 CET

● Detailed:

○ 11.08.2024 20:00 CET - 13.08.2024 ≈1:00 CET
The platform received and stored incoming messages. For approximately 6 hours of this period, messages were only partially processed.
○ 13.08.2024 ≈1:00 CET - 13.08.2024 ≈8:00 CET 
Partial availability of platform services.
○ 13.08.2024 ≈8:00 CET - 13.08.2024 14:00 CET
Maintenance. All flows were stopped for the duration of the maintenance.
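
As a quick cross-check of the timeline above, the phase durations can be computed from the listed timestamps. This is only an illustrative sketch: the phase labels are paraphrased, and the times marked with ≈ are treated as exact, so the results are approximate.

```python
from datetime import datetime

# Timestamp format used in the timeline above (all times in the same zone).
FMT = "%d.%m.%Y %H:%M"

# Phases as listed in the report; labels are paraphrased for illustration.
PHASES = [
    ("messages stored, partially processed", "11.08.2024 20:00", "13.08.2024 01:00"),
    ("partial availability", "13.08.2024 01:00", "13.08.2024 08:00"),
    ("maintenance, all flows stopped", "13.08.2024 08:00", "13.08.2024 14:00"),
]


def phase_hours(start: str, end: str) -> float:
    """Return the length of a phase in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


durations = {label: phase_hours(start, end) for label, start, end in PHASES}
total_hours = sum(durations.values())

for label, hours in durations.items():
    print(f"{label}: {hours:.0f} h")
print(f"entire incident: {total_hours:.0f} h")
```

The maintenance phase comes out at 6 hours, which matches the 6-hour maintenance mentioned in the Impact section.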

☁︎ Root Cause

The incident originated from a networking issue on one of the RabbitMQ nodes. Because RabbitMQ operates as a cluster, this issue did not affect the platform until 20:00 CET. Load typically increases in the evening, which is expected behaviour for our platform and does not generally cause any problems. This time, however, the malfunctioning RabbitMQ node entered a partial split-brain state.
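
For readers who operate their own RabbitMQ clusters, a partition of this kind can usually be spotted in the node list exposed by the RabbitMQ management plugin (`GET /api/nodes`), where each node reports a `running` flag and a `partitions` list. The following is a minimal sketch, not a description of Elastic.io's monitoring: the node names and the sample payload are hypothetical, and fetching the JSON from a real broker is left out.

```python
from typing import Any


def find_partitioned_nodes(nodes: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each node name to the peers it reports being partitioned from."""
    return {
        node["name"]: list(node["partitions"])
        for node in nodes
        if node.get("partitions")
    }


def find_stopped_nodes(nodes: list[dict[str, Any]]) -> list[str]:
    """Return the names of nodes that are not reported as running."""
    return [node["name"] for node in nodes if not node.get("running", False)]


# Hypothetical payload shaped like the management API's node list.
sample_nodes = [
    {"name": "rabbit@node-1", "running": True, "partitions": []},
    {"name": "rabbit@node-2", "running": True, "partitions": []},
    {"name": "rabbit@node-3", "running": True, "partitions": ["rabbit@node-1"]},
]

partitioned = find_partitioned_nodes(sample_nodes)
stopped = find_stopped_nodes(sample_nodes)

if partitioned or stopped:
    print("cluster needs attention:", partitioned, stopped)
```

In the sample, only one node reports a partition while its peers do not, which is the kind of one-sided view that makes a partial split-brain hard to notice until load increases.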

☁︎ Description

During the period of platform instability, our Development and DevOps teams focused on stabilizing the platform. All of their actions were taken with the primary objective of preventing any loss of incoming data.
Despite extensive efforts, every available option for recovering the existing RabbitMQ functionality was unsuccessful, and the platform instability continued. Consequently, we decided to take a more significant step and reset the RabbitMQ cluster.
This step required temporarily stopping all flows. Once the reset was completed, all affected platform services returned to normal operation.

Additionally, the following steps were carried out:
- Upgrading the RabbitMQ nodes
- Reworking the RabbitMQ queue policies

The maintenance allowed our team to improve the stability of the RabbitMQ cluster and of the platform services with respect to queue management. These actions will protect the platform's functionality against similar issues in the future.

☁︎ Additional Measures

● Elastic.io will enhance the status page to provide more detailed information about the status of platform services.
● Elastic.io will schedule maintenance to apply the same enhancements to the US instance, AU instance, and private instances.
● Elastic.io will provide a new, enhanced platform notification service.