Increased number of errors on integration flows
Incident Report for elastic.io GmbH
Postmortem

Incidents on 28th and 29th of January

Late in the afternoon of January 28th, some of our customers reported performance problems caused by slower-than-expected shutdown of the containers responsible for flow execution, and hence delays in starting new flows and in processing data. Initial analysis made it obvious that one of the Kubernetes worker nodes had a problem with its Docker daemon, which resulted in a higher than usual number of containers stuck in the ‘Terminating’ state; this was, unfortunately, not detected by our monitoring system. After manual intervention by our operations team, the slow-performing node was marked as invalid and evicted from the cluster.
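
For illustration only, a check along the lines of the sketch below (a minimal example using the official Kubernetes Python client; the threshold and any names are illustrative rather than part of our actual tooling) would surface pods that stay in the ‘Terminating’ state for too long and attribute them to a node:

    # Minimal sketch: flag pods stuck in 'Terminating' longer than a threshold,
    # grouped by node. Assumes the official 'kubernetes' Python client and a
    # local kubeconfig; the threshold is illustrative, not a real setting.
    from collections import Counter
    from datetime import datetime, timezone

    from kubernetes import client, config

    STUCK_AFTER_SECONDS = 600  # illustrative threshold

    config.load_kube_config()  # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    now = datetime.now(timezone.utc)
    stuck_per_node = Counter()

    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        # A pod being deleted has deletion_timestamp set; if it is still listed
        # long after that, the container runtime on its node is likely stuck.
        if pod.metadata.deletion_timestamp is None:
            continue
        age = (now - pod.metadata.deletion_timestamp).total_seconds()
        if age > STUCK_AFTER_SECONDS:
            stuck_per_node[pod.spec.node_name] += 1

    for node, count in stuck_per_node.items():
        print(f"{node}: {count} pods stuck in Terminating for more than {STUCK_AFTER_SECONDS}s")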

Unfortunately, due to resource limitations in the underlying IaaS provider’s data center, the new node had to be started in a different network partition, which required a networking configuration change (a). This change was made as a temporary fix and was scheduled to be removed during the regular maintenance window on the following day. After the new node joined the cluster, overall system performance returned to normal.

Early the next morning (January 29th), our monitoring system reported performance issues on one of the back-end DB nodes. This is a normal situation in cloud-based systems, where a single ‘server’ or ‘node’ may be unreliable or affected by a ‘noisy neighbour’ or other data-center issues. We were well prepared for this, as all critical systems are deployed in a highly available setup.

That day, however, the replica set election process did not work as expected: it was affected by the temporary networking change made on January 28th (a), which led to a split-brain situation that quickly spread to the other cluster members. As a result, 3 out of 4 requests to the DB timed out. This degraded the performance of the storage subsystem and affected the integration flows of some users. Because these errors occurred in very quick succession, our system monitor started suspending the failing flows as a preventive measure.
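
To illustrate the invariant that was violated, the sketch below (assuming a MongoDB-style replica set reachable via pymongo; host names and the timeout are hypothetical) checks that the reachable members agree on exactly one primary, which is exactly what stopped being true during the split brain:

    # Minimal sketch, assuming a MongoDB-style replica set reachable via pymongo.
    # Host names and the timeout are hypothetical. The check asserts that the
    # reachable members agree on exactly one primary.
    from pymongo import MongoClient

    SEEDS = ["db-0.internal:27017", "db-1.internal:27017", "db-2.internal:27017"]  # hypothetical hosts

    primaries = set()
    for seed in SEEDS:
        member = MongoClient(seed, directConnection=True, serverSelectionTimeoutMS=2000)
        try:
            status = member.admin.command("replSetGetStatus")
            primaries.update(m["name"] for m in status["members"] if m["stateStr"] == "PRIMARY")
        except Exception as exc:
            print(f"{seed}: unreachable ({exc})")
        finally:
            member.close()

    if len(primaries) != 1:
        print(f"ALERT: members do not agree on a single primary: {sorted(primaries) or 'none'}")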

What did we do to address the situation?

First, we found the root cause of the DB problem and restarted the affected replica sets. This took some time, because requests to the DB kept coming in and the load was more than a single DB node could handle. At the same time, we stopped the service that suspends the flows.

Why did we not anticipate this?

The downtime or slow performance of a single DB cluster node was anticipated. However, because of the temporary networking change made the day before, two low-probability events overlapped and caused a larger disruption of the overall system. Despite the downtime and degraded performance, no customer data was lost.

What are we doing to avoid this problem in the future?

We are going to put additional measures in place, such as:

  1. Establish anomaly monitoring to detect stuck Docker processes earlier and react to them with preventive measures
  2. Make sure the cluster can be extended without an additional network partition
  3. Increase resources on all DB clusters to handle higher load
  4. Make sure networking changes are not made as a reaction to cluster extension
Posted Jan 30, 2020 - 11:34 CET

Resolved
This incident has been resolved.
Posted Jan 29, 2020 - 14:07 CET
Monitoring
We have addressed the immediate problems and are now monitoring the situation.
Posted Jan 29, 2020 - 12:29 CET
Update
We are restarting some services to synchronise messaging. This might trigger some notifications about the platform being down.
Posted Jan 29, 2020 - 12:21 CET
Update
The execution errors are still happening, but flows should no longer be suspended because of them.
Posted Jan 29, 2020 - 12:11 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 29, 2020 - 12:02 CET
Update
We are continuing to investigate this issue.
Posted Jan 29, 2020 - 11:49 CET
Update
We are still investigating the problem with flow suspensions.
Posted Jan 29, 2020 - 11:48 CET
Investigating
We are investigating reports of an increased number of errors on integration flows, which cause flow suspensions. Our team is looking into the problem right now.
Posted Jan 29, 2020 - 11:03 CET
This incident affected: elastic.io app.