Previous incidents

March 2023
Mar 21, 2023
1 incident

EU cluster Realtime is down

Downtime

Resolved Mar 21 at 06:45pm CET

EU cluster Realtime recovered.

1 previous update

February 2023
Feb 10, 2023
1 incident

EU cluster frontend is down

Downtime

Resolved Feb 10 at 03:51pm CET

EU cluster frontend recovered.

1 previous update

Feb 09, 2023
1 incident

EU cluster API and EU cluster frontend are down

Downtime

Resolved Feb 09 at 10:09pm CET

EU cluster frontend recovered.

5 previous updates

Feb 02, 2023
2 incidents

Brief downtime

Resolved Feb 02 at 05:43pm CET

Two short periods of degraded service occurred today, from 15:51 to 15:57 CET and from 16:13 to 16:21 CET. API, front-end, and webhook responses were slow during these periods, and some customers may have experienced timeouts.
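
For reference, the length of the two windows above can be worked out with a few lines of Python. This is only an illustrative sketch: the times are copied from the report, while the names `WINDOWS` and `window_minutes` are hypothetical and not part of the platform.

```python
from datetime import datetime

# Degraded-service windows as stated in the report (CET, same day).
WINDOWS = [("15:51", "15:57"), ("16:13", "16:21")]


def window_minutes(start: str, end: str) -> int:
    """Return the length of a same-day HH:MM window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Per-window durations and the combined total.
durations = [window_minutes(start, end) for start, end in WINDOWS]
total_minutes = sum(durations)
print(durations, total_minutes)
```

Under these assumptions the two windows last 6 and 8 minutes, 14 minutes of degraded service in total.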

No data loss resulted.

The cause was flows that abused the storage (Maester) and queueing service limits. These flows have been suspended, and we are working with the client to optimise them.

All services are restored and working normally....

EU cluster API, EU cluster frontend, EU cluster Realtime, and 1 other service...

Downtime

Resolved Feb 02 at 04:21pm CET

EU cluster frontend recovered.

11 previous updates

Feb 01, 2023
1 incident

EU cluster API, EU cluster frontend, EU cluster Realtime, and 1 other service...

Downtime

Resolved Feb 01 at 07:58pm CET

EU cluster frontend recovered.

11 previous updates

January 2023
Jan 31, 2023
1 incident

EU cluster API, EU cluster frontend, EU cluster Realtime, and 1 other service...

Downtime

Resolved Jan 31 at 08:10pm CET

EU cluster frontend recovered.

10 previous updates

Jan 21, 2023
1 incident

EU cluster API, EU cluster frontend, EU cluster WebHook, EU cluster Realtime,...

Downtime

Resolved Jan 23 at 11:21am CET

Post-mortem

Issue

On Friday, 20 Jan 2023, a significant spike in data volume was observed, accompanied by an increase in data package size. The resulting increases in data processing times and message queue processing times, combined with the data volume, overwhelmed our Kubernetes master node.
At 16:02 CET, the platform service became unavailable.

Google support were unable to correct this, so it was necessary to recreate the platform. During the downt...

8 previous updates

Jan 20, 2023
1 incident

EU cluster API, EU cluster frontend, EU cluster WebHook, EU cluster Realtime,...

Downtime

Resolved Jan 20 at 10:59pm CET

Most of the flows have started. We are monitoring the situation.

11 previous updates

Jan 05, 2023
1 incident

Flow sample retrieval is not working

Degraded

Resolved Jan 05 at 06:02pm CET

Issue

The recent issues experienced on the platform were caused by our core scheduling service becoming temporarily overwhelmed by a high volume of messages, combined with many errors generated by integration flows transporting large batches of data.

Issues experienced included:
* Slow and failing sample retrieval
* Slow processing of message queues

Status/Resolution

The issues are now resolved, and no data loss is observed ...

3 previous updates