
When Clouds Clash: Understanding Kubernetes Cluster Failures and How to Overcome Them

  • July 17, 2023
  • by Alexander

We recently faced a Kubernetes cluster crash. It failed classically, on a Friday 😬. Most of the pods were in a failed state, and restarting the cluster didn't help at all.

The AKS cluster was running Kubernetes 1.19.11, while the latest available version was 1.24.10.

Upgrading the cluster in place was the best option we had; the alternative would have been creating a new cluster from scratch.
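
Before starting, it's worth checking the available upgrade path with the Azure CLI. A minimal sketch, assuming a resource group named my-rg and a cluster named cluster (both placeholders):

# List the Kubernetes versions the cluster can be upgraded to
az aks get-upgrades --resource-group my-rg --name cluster --output table

# Upgrade the cluster; AKS only moves one minor version at a time,
# so getting from 1.19 to 1.24 takes several rounds
az aks upgrade --resource-group my-rg --name cluster --kubernetes-version <next-minor-version>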

However, upgrading the cluster version wasn't exactly a walk in the park :)

Symptoms

The AKS cluster upgrade failed: we couldn't update the cluster either from the portal or from the CLI. The detailed issues we faced are listed below.

Most of the cluster's pods were failing: they couldn't start because of an internal cluster networking problem.
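
A quick way to survey the damage, assuming kubectl access to the cluster:

# List every pod that is not currently in the Running phase
kubectl get pods --all-namespaces --field-selector=status.phase!=Running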

Root cause

The cluster is relatively old, and its version hadn't been updated for at least two years.

Fixing the cluster

When we started the upgrade, we ran into the following errors:

Not enough IP addresses available

For some reason, the upgrade started pre-allocating IP addresses from the subnet pool. A /24 subnet holds 256 addresses, of which Azure reserves 5, leaving 251 usable, fewer than the 327 the upgrade tried to pre-allocate:

Failed to save Kubernetes service 'cluster'. Error: Pre-allocated IPs 327
exceeds IPs available 251 in Subnet Cidr 10.5.1.0/24, Subnet Name subnet_k8s.
http://aka.ms/aks/insufficientsubnetsize

Though the recommended solution is to create a new subnet and then a new node pool connected to it, that didn't work for us because of the outdated cluster version. It was a chasing-your-own-tail situation: you can't upgrade the cluster version because the node pool has allocated too many IP addresses, and you can't create a new node pool attached to a bigger subnet because the cluster version is no longer supported.

Solution: scaling down all pods worked for us; it released the allocated resources, and the IP addresses along with them.
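
A minimal sketch of the scale-down, assuming the workloads live in a namespace called my-app (a placeholder); note the original replica counts before zeroing them out so they can be restored later:

# Record current replica counts for later
kubectl get deployments -n my-app -o wide > replicas-before.txt

# Scale every deployment in the namespace down to zero
kubectl scale deployment --all --replicas=0 -n my-app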

Version conflict between Kubernetes and the node pool

After the cluster version was upgraded, the node pool version remained 1.19.11, which left the cluster in a failed state and prevented saving the configuration:

Failed to save Kubernetes service 'cluster'. Error: Node pool version 1.19.11
and control plane version 1.24.10 are incompatible. Minor version of node pool
cannot be more than 2 versions less than control plane's version. Minor
version of node pool is 19 and control plane is 24

It also blocked other operations, such as restarting the cluster:

Failed to stop the Kubernetes service 'cluster'. Error: Unable to perform
'stopping' operation since the cluster is in 'failed' provision state. Please
use 'az aks update' to reconcile the cluster to succeeded state.
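
As the error itself suggests, the failed provisioning state can be reconciled with az aks update before retrying (resource group and cluster names are placeholders):

# Running az aks update with no other flags reconciles the cluster
az aks update --resource-group my-rg --name cluster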

Solution:

  1. Create a new System node pool with a version matching the cluster version (see the CLI sketch after this list)
  2. Once the new node pool is up, delete the old one
  3. Scale the previously scaled-down pods back up
  4. Verify that everything works
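
A minimal sketch of the recovery, assuming a resource group my-rg, a cluster named cluster, node pools named nodepool1 (old) and nodepool2 (new), and workloads in a my-app namespace; all of these names are placeholders:

# 1. Create a new System node pool at the control plane's version
az aks nodepool add --resource-group my-rg --cluster-name cluster \
  --name nodepool2 --mode System --node-count 3 \
  --kubernetes-version 1.24.10

# 2. Delete the old, incompatible node pool
az aks nodepool delete --resource-group my-rg --cluster-name cluster \
  --name nodepool1

# 3. Scale the workloads back up (adjust replica counts to the values recorded earlier)
kubectl scale deployment --all --replicas=1 -n my-app

# 4. Verify that the pods are healthy
kubectl get pods -n my-app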

Lessons learned

Keep your cluster up to date before it crashes.

Pay attention to Azure (or other cloud provider) alerts, and plan and schedule cluster updates as part of the technical debt work in your sprints; you don't want surprises like this.

Keep the documentation up to date

The last time we dealt with this dev cluster was about two years ago. Back then, IaC scripts were created that are pretty descriptive and helped us get Istio up and running quickly. However, detailed setup steps wouldn't have hurt.

Tags: Azure, DevOps, Kubernetes

