Downtime

EU cluster API, EU cluster frontend, EU cluster WebHook, EU cluster Realtime, and 2 other services are down

Jan 21 at 12:06am CET
Affected services
EU cluster API
EU cluster frontend
EU cluster WebHook
EU cluster Realtime

Resolved
Jan 23 at 11:21am CET

Post-mortem

Issue

On Friday 20 Jan, 2023, we observed a significant spike in data volume, accompanied by an increase in data package size. The resulting increases in data processing times and message queue processing times, combined with the higher data volume, overwhelmed our Kubernetes master node.
At 16:02 CET, the platform service became unavailable.

Google support were unable to correct this, so it was necessary to recreate the platform. During the downtime, our Webhook and queueing system remained available, except for brief periods while platform versions were being relaunched.

Status / Resolution

At 02:53 CET on Saturday 21 Jan, 2023, a stable version of the platform was relaunched. Because of the large increase in load and the backlog of messages, it was necessary to restart integration flows in phases while monitoring platform health. The last phase of flow restarts occurred at 10:00 CET on Saturday 21 Jan, 2023.

Prevention / Corrective Actions

We will prevent these issues in the future through the pending upgrade of our core scheduling service. Implementation is scheduled for mid-February 2023.

In addition, we will work with clients to optimise their integration flow design to avoid such incidents, and we may migrate higher-volume clients to a separate elastic.io instance.

We apologise sincerely for any inconvenience and thank you for your trust in elastic.io.

Updated
Jan 21 at 03:00am CET

EU cluster Realtime recovered.

The platform has been relaunched and service has resumed, unfortunately with reduced capacity. Significant delays in processing data are to be expected, especially where message sizes are large.

Data loss is believed to be minimal: our webhook and queueing system were down only briefly while platform versions were being redeployed.

We apologise sincerely for the disruption in service and will continue to monitor and optimise over the weekend. Delays in data processing are to be expected over the weekend as we work to clear the message backlog.

Kind regards
Your elastic.io Team

Updated
Jan 21 at 01:30am CET

EU cluster WebHook recovered.

Updated
Jan 21 at 01:28am CET

EU cluster API recovered.

Updated
Jan 21 at 01:28am CET

EU cluster frontend recovered.

Updated
Jan 21 at 01:19am CET

EU cluster WebHook went down.

Updated
Jan 21 at 01:18am CET

EU cluster frontend went down.

Updated
Jan 21 at 01:18am CET

EU cluster API went down.

Created
Jan 21 at 12:06am CET

EU cluster Realtime went down.