EU cluster API, EU cluster frontend, EU cluster WebHook, EU cluster Realtime, and 2 other services are down
Resolved
Jan 23 at 11:21am CET
Post-mortem
Issue
On Friday 20 Jan, 2023, we observed a significant spike in data volume, accompanied by an increase in data package size. The resulting increases in data processing times and message queue processing times, combined with the higher data volume, overwhelmed our Kubernetes master node.
At 16:02 CET, the platform service became unavailable.
Google support were unable to correct this, so it was necessary to recreate the platform. During the downtime, our webhook and queueing system remained available, except for brief periods while platform versions were being relaunched.
Status / Resolution
At 02:53 CET on Saturday 21 Jan, 2023, a stable version of the platform was relaunched. Because of the large increase in load and the backlog of messages, it was necessary to restart integration flows in phases while monitoring platform health. The last phase of flow restarts occurred at 10:00 CET on Saturday 21 Jan, 2023.
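For readers who want a concrete picture of a phased restart, the following is a minimal illustrative sketch, not the procedure or tooling actually used during this incident. The function `restart_in_phases` and its parameters `start_flow` and `is_healthy` are hypothetical placeholders for whatever mechanism starts a flow and reports platform health.

```python
from typing import Callable, List, Sequence


def restart_in_phases(
    flow_ids: Sequence[str],
    phase_size: int,
    start_flow: Callable[[str], None],   # hypothetical: starts one integration flow
    is_healthy: Callable[[], bool],      # hypothetical: reports current platform health
) -> List[str]:
    """Start flows in fixed-size phases, stopping at the first failed health check.

    Returns the IDs of the flows that were started.
    """
    if phase_size <= 0:
        raise ValueError("phase_size must be positive")

    started: List[str] = []
    for offset in range(0, len(flow_ids), phase_size):
        # Check health before each phase so that a struggling platform
        # is not given additional load.
        if not is_healthy():
            break
        for flow_id in flow_ids[offset:offset + phase_size]:
            start_flow(flow_id)
            started.append(flow_id)
    return started
```

In this sketch, flows that were not started remain pending and can be passed to a later call once the health check succeeds again.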
Prevention / Corrective Actions
The pending upgrade of our core scheduling service will prevent these issues in the future. Implementation is scheduled for mid-February 2023.
In addition, we will work with clients to optimise their integration flow design to avoid such incidents, and we may migrate higher-volume clients to a separate elastic.io instance.
We apologise sincerely for any inconvenience and thank you for your trust in elastic.io.
Affected services
EU cluster Realtime
Updated
Jan 21 at 03:00am CET
EU cluster Realtime recovered.
The platform has been relaunched and service has resumed, unfortunately with reduced capacity. Significant delays in processing data are to be expected, especially where message sizes are large.
Data loss is believed to be minimal: our webhook and queueing system were down only briefly while platform versions were being redeployed.
We apologise sincerely for the disruption in service and will continue to monitor and optimise over the weekend. Delays in data processing are to be expected over the weekend as we work to clear the message backlog.
Kind regards
Your elastic.io Team
Affected services
EU cluster Realtime
Updated
Jan 21 at 01:30am CET
EU cluster WebHook recovered.
Affected services
EU cluster WebHook
Updated
Jan 21 at 01:28am CET
EU cluster API recovered.
Affected services
EU cluster API
Updated
Jan 21 at 01:28am CET
EU cluster frontend recovered.
Affected services
EU cluster frontend
Updated
Jan 21 at 01:19am CET
EU cluster WebHook went down.
Affected services
EU cluster WebHook
Updated
Jan 21 at 01:18am CET
EU cluster frontend went down.
Affected services
EU cluster frontend
Updated
Jan 21 at 01:18am CET
EU cluster API went down.
Affected services
EU cluster API
Created
Jan 21 at 12:06am CET
EU cluster Realtime went down.
Affected services
EU cluster Realtime