When Clouds Clash: Understanding Kubernetes Cluster Failures and How to Overcome Them
We recently faced a Kubernetes cluster crash. It failed classically - on a Friday 😬. Most of the pods were in a failed state, and restarting the cluster didn't help at all.
The cluster was running AKS Kubernetes version 1.19.11, while the up-to-date version was 1.24.10.
Upgrading the cluster natively was the best option we had; the alternative would have been creating a new cluster from scratch.
However, upgrading the cluster version wasn't a walk in the park, as they say :)
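On paper, the native upgrade is one CLI call per minor version. A minimal sketch, assuming illustrative resource group and cluster names, and keeping in mind that an AKS control plane only hops one minor version at a time (so 1.19 to 1.24 takes several passes):

az aks upgrade --resource-group my-rg --name cluster --kubernetes-version 1.20.15
# ...then repeat for 1.21.x, 1.22.x, and so on up to 1.24.10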
Symptoms
Failed to upgrade the AKS cluster - we couldn't update the cluster from the portal or from the CLI. The detailed issues we faced are listed below.
Most of the cluster's pods were failing - they failed to start due to internal cluster networking problems.
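A quick way to see the blast radius (a sketch; the field selector just filters out pods that are currently Running):

kubectl get pods --all-namespaces --field-selector=status.phase!=Running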
Root cause
The cluster is relatively old, and its version hadn't been updated for at least two years.
Fixing the cluster
When we started the upgrade, we ran into the following errors:
Not enough IP addresses available
For some reason, AKS started allocating IP addresses from the pool:
Failed to save Kubernetes service 'cluster'. Error: Pre-allocated IPs 327
exceeds IPs available 251 in Subnet Cidr 10.5.1.0/24, Subnet Name subnet_k8s.
http://aka.ms/aks/insufficientsubnetsize
For context, a /24 subnet holds 256 addresses and Azure reserves 5 in every subnet, which is where the 251 available IPs come from; with Azure CNI, the node pool pre-allocates an IP for every potential pod up front, which is how the request grew to 327. The recommended solution is to create a new subnet and then a new node pool connected to it, but that didn't work for us
because of the cluster's outdated version. It's a chasing-your-own-tail situation: you can't upgrade the cluster version because the node pool has allocated too many IP addresses, and you can't create a new node pool attached to a bigger subnet
because the cluster's version is no longer supported.
Solution: Scaling down all pods worked for us - it released the allocated resources, including the IP addresses.
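A minimal sketch of the scale-down, assuming an illustrative namespace name (repeat per application namespace, and record the original replica counts first so you can restore them later):

# Note the current replica counts before touching anything
kubectl get deployments -n my-app-namespace
# Scale every deployment in the namespace down to zero, freeing the pod IPs
kubectl scale deployment --all --replicas=0 -n my-app-namespace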
Version conflict between Kubernetes and the node pool
After the cluster version was upgraded, the node pool version remained 1.19.11, which left the cluster in a failed state and prevented saving the configuration:
Failed to save Kubernetes service 'cluster'. Error: Node pool version 1.19.11
and control plane version 1.24.10 are incompatible. Minor version of node pool
cannot be more than 2 versions less than control plane's version. Minor
version of node pool is 19 and control plane is 24
It also blocked other operations, such as restarting the cluster:
Failed to stop the Kubernetes service 'cluster'. Error: Unable to perform
'stopping' operation since the cluster is in 'failed' provision state. Please
use 'az aks update' to reconcile the cluster to succeeded state.
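The reconcile the message asks for is an update call with no configuration changes. A sketch with the same illustrative names as above:

# Reconcile the cluster back to a 'Succeeded' provisioning state
az aks update --resource-group my-rg --name cluster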
Solution (sketched in commands after the list):
- Create a new System node pool with a version matching the cluster version
- Once the new node pool is up, delete the old one
- Scale the previously scaled-down pods back up
- Verify everything works
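A rough sketch of that sequence, again with illustrative resource, node pool, and namespace names (the node count and replica count are assumptions - match them to your workload):

# Add a System-mode node pool at the control plane's version
az aks nodepool add --resource-group my-rg --cluster-name cluster --name syspool2 --mode System --node-count 3 --kubernetes-version 1.24.10
# Once the new pool is Ready, remove the old one
az aks nodepool delete --resource-group my-rg --cluster-name cluster --name syspool1
# Restore the replica counts recorded during the scale-down
kubectl scale deployment --all --replicas=2 -n my-app-namespace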
Lessons learned
Keep your clusters up to date before they crash
Pay attention to alerts from Azure (or your cloud provider), and plan and schedule cluster updates as part of the technical debt work in your sprints - you don't need surprises
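Checking for pending upgrades is cheap and easy to schedule. A sketch with our illustrative names:

# List the Kubernetes versions the cluster can currently upgrade to
az aks get-upgrades --resource-group my-rg --name cluster --output table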
Keep the documentation up to date
The last time we dealt with this dev cluster was about two years ago. Back then, IaC scripts were created that are pretty descriptive and helped us quickly bring Istio back to life. However,
detailed setup steps wouldn't have hurt.
🚀 Turbocharge Your Infrastructure with Our Terraform Template Kits! 🚀
🌟 Slash deployment time and costs! Discover the ultimate solution for efficient, cost-effective cloud infrastructure. Perfect for DevOps enthusiasts looking for a reliable, scalable setup. Click here to revolutionize your workflow!
Learn More about Starter Terraform Kits for AKS, EKS, and GKE