Late in the afternoon of January 28th, some of our customers reported performance problems related to slower-than-expected shutdown of the containers responsible for flow execution, which caused delays in starting new flows and processing data. Initial analysis made it obvious that one of the Kubernetes worker nodes had a problem with its Docker daemon, which resulted in a higher-than-usual number of containers stuck in the 'terminating' state. Unfortunately, this was not detected by our monitoring system. After manual intervention by our operations team, the slow-performing node was marked as invalid and evicted from the cluster.
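A monitoring check along these lines could have caught the stuck containers. This is a minimal sketch under stated assumptions, not our actual monitoring code: the function name is illustrative, and the pod dictionaries are modelled on the Kubernetes API's `metadata.deletionTimestamp` field, which Kubernetes sets when a pod enters the terminating state.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical check: a pod is "stuck terminating" when its
# deletionTimestamp (set by Kubernetes when deletion begins) is
# older than the allowed grace period.
def stuck_terminating(pods, max_grace=timedelta(minutes=5), now=None):
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pods:
        ts = pod["metadata"].get("deletionTimestamp")
        if ts is None:
            continue  # pod is not being deleted at all
        deleted_at = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if now - deleted_at > max_grace:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

In production such a check would run periodically per node and page the operations team as soon as the returned list is non-empty.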
Unfortunately, due to resource limitations in the underlying IaaS provider's data center, the new node was started in a different network partition, which required a networking configuration change (a). This change was intended to be temporary and was scheduled to be removed during the regular maintenance window the following day. After the new node joined the cluster, overall system performance returned to normal.
From early the next morning (January 29th), our monitoring system reported performance issues on one of the back-end DB nodes. This is a normal situation in cloud-based systems, where a single 'server' or 'node' may be unreliable or affected by a 'noisy neighbour' or other data-center issues. We were well prepared for that, as all critical systems are deployed in a highly available configuration.
That day, however, the replica set election process did not work as expected: it was affected by the temporary networking change made on January 28th (a), which led to a split-brain situation that quickly spread to the other cluster members. As a result, 3 out of 4 requests to the DB timed out. This degraded the performance of the storage subsystem and affected the integration flows of some users. Because these errors accumulated very quickly within a short period, our system monitor started suspending the flows with errors as a preventive measure.
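The preventive suspension described above behaves like a simple error-rate breaker. The following is a hypothetical sketch of that idea (class name, thresholds, and API are illustrative, not our actual implementation): a flow is suspended once it accumulates too many errors inside a sliding time window.

```python
import time
from collections import defaultdict, deque

class FlowErrorMonitor:
    """Suspend a flow when it records too many errors in a short window.

    Illustrative sketch of a sliding-window error-rate breaker; the
    real system monitor is more involved than this.
    """

    def __init__(self, max_errors=5, window_seconds=60):
        self.max_errors = max_errors
        self.window = window_seconds
        self.errors = defaultdict(deque)   # flow_id -> error timestamps
        self.suspended = set()

    def record_error(self, flow_id, now=None):
        """Record one error; return True if the flow is now suspended."""
        now = time.time() if now is None else now
        q = self.errors[flow_id]
        q.append(now)
        # Drop errors that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_errors:
            self.suspended.add(flow_id)  # preventive suspension
        return flow_id in self.suspended
```

With a burst of DB timeouts like the one above, many flows cross the threshold almost simultaneously, which is why so many flows ended up suspended at once.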
First, we found the root cause of the DB problem and restarted the affected replica sets. This took some time, since requests to the DB were still coming in and the load was more than a single DB node could handle. At the same time, we stopped the service that suspends flows.
Downtime or slow performance of a single DB cluster node was anticipated; however, due to the temporary networking change made the day before, two low-probability events overlapped and caused a larger disruption of the overall system. Despite the downtime and degraded performance, no customer data was lost.
We are going to put in place additional measures such as: