When Clouds Clash: Understanding Kubernetes Cluster Failures and How to Overcome Them
We recently faced a Kubernetes cluster crash. It failed classically - on a Friday 😬. Most of the pods were in a failed state, and restarting the cluster didn't help at all.
The cluster was running AKS Kubernetes version 1.19.11, while the up-to-date version was 1.24.10.
Upgrading the cluster natively was the best option we had; the alternative would have been creating a new cluster from scratch.
However, upgrading the cluster version wasn't a walk in the park, as they say :)
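On paper, the native upgrade is one CLI call per minor version. A minimal sketch, assuming illustrative resource group and cluster names, and keeping in mind that an AKS control plane only hops one minor version at a time (so 1.19 to 1.24 takes several passes):

az aks upgrade --resource-group my-rg --name cluster --kubernetes-version 1.20.15
# ...then repeat for 1.21.x, 1.22.x, and so on up to 1.24.10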
Symptoms
Failed to upgrade the AKS cluster - we couldn't update the cluster from the portal or from the CLI. The detailed issues we faced are listed below.
Most of the cluster's pods were failing - they failed to start due to internal cluster networking problems.
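A quick way to see the blast radius (a sketch; the field selector just filters out pods that are currently Running):

kubectl get pods --all-namespaces --field-selector=status.phase!=Running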
Root cause
The cluster is relatively old, and its version hadn't been updated for at least two years.
Fixing the cluster
When we started the upgrade, we ran into the following errors:
Not enough IP addresses available
For some reason, AKS started allocating IP addresses from the pool:
Failed to save Kubernetes service 'cluster'. Error: Pre-allocated IPs 327
exceeds IPs available 251 in Subnet Cidr 10.5.1.0/24, Subnet Name subnet_k8s.
http://aka.ms/aks/insufficientsubnetsize
For context, a /24 subnet holds 256 addresses and Azure reserves 5 in every subnet, which is where the 251 available IPs come from; with Azure CNI, the node pool pre-allocates an IP for every potential pod up front, which is how the request grew to 327. The recommended solution is to create a new subnet and then a new node pool connected to it, but that didn't work for us
because of the cluster's outdated version. It's a chasing-your-own-tail situation: you can't upgrade the cluster version because the node pool has allocated too many IP addresses, and you can't create a new node pool attached to a bigger subnet
because the cluster's version is no longer supported.
Solution: Scaling down all pods worked for us - it released the allocated resources, including the IP addresses.
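A minimal sketch of the scale-down, assuming an illustrative namespace name (repeat per application namespace, and record the original replica counts first so you can restore them later):

# Note the current replica counts before touching anything
kubectl get deployments -n my-app-namespace
# Scale every deployment in the namespace down to zero, freeing the pod IPs
kubectl scale deployment --all --replicas=0 -n my-app-namespace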
Version conflict between Kubernetes and the node pool
After the cluster version was upgraded, the node pool version remained 1.19.11, which left the cluster in a failed state and prevented saving the configuration:
Failed to save Kubernetes service 'cluster'. Error: Node pool version 1.19.11
and control plane version 1.24.10 are incompatible. Minor version of node pool
cannot be more than 2 versions less than control plane's version. Minor
version of node pool is 19 and control plane is 24
It also blocked other operations, such as restarting the cluster:
Failed to stop the Kubernetes service 'cluster'. Error: Unable to perform
'stopping' operation since the cluster is in 'failed' provision state. Please
use 'az aks update' to reconcile the cluster to succeeded state.
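The reconcile the message asks for is an update call with no configuration changes. A sketch with the same illustrative names as above:

# Reconcile the cluster back to a 'Succeeded' provisioning state
az aks update --resource-group my-rg --name cluster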
Solution (sketched in commands after the list):
- Create a new System node pool with a version matching the cluster version
- Once the new node pool is up, delete the old one
- Scale the previously scaled-down pods back up
- Verify everything works
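A rough sketch of that sequence, again with illustrative resource, node pool, and namespace names (the node count and replica count are assumptions - match them to your workload):

# Add a System-mode node pool at the control plane's version
az aks nodepool add --resource-group my-rg --cluster-name cluster --name syspool2 --mode System --node-count 3 --kubernetes-version 1.24.10
# Once the new pool is Ready, remove the old one
az aks nodepool delete --resource-group my-rg --cluster-name cluster --name syspool1
# Restore the replica counts recorded during the scale-down
kubectl scale deployment --all --replicas=2 -n my-app-namespace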
Lessons learned
Keep your clusters up to date before they crash
Pay attention to alerts from Azure (or your cloud provider), and plan and schedule cluster updates as part of the technical debt work in your sprints - you don't need surprises
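Checking for pending upgrades is cheap and easy to schedule. A sketch with our illustrative names:

# List the Kubernetes versions the cluster can currently upgrade to
az aks get-upgrades --resource-group my-rg --name cluster --output table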
Keep the documentation up to date
The last time we dealt with this dev cluster was about two years ago. Back then, IaC scripts were created that are pretty descriptive and helped us quickly bring Istio back to life. However,
detailed setup steps wouldn't have hurt.
🚀 Turbocharge Your Infrastructure with Our Terraform Template Kits! 🚀
🌟 Slash deployment time and costs! Discover the ultimate solution for efficient, cost-effective cloud infrastructure. Perfect for DevOps enthusiasts looking for a reliable, scalable setup. Click here to revolutionize your workflow!
Learn More about Starter Terraform Kits for AKS, EKS, and GKE