I still remember upgrading a Kubernetes cluster for the first time. Despite taking great care and following all the documentation, I managed to break some applications. Luckily, the impact was minimal, and the issue was solved quickly. The most interesting part is that the same set of steps had worked perfectly on the non-production clusters, but Murphy’s law struck, and things went south only in the production cluster because old versions of some applications were still running there.

This post covers the high-level process of upgrading a Kubernetes cluster rather than documenting the detailed steps involved.

Components involved in the upgrade

At a high level, there are four main components that need to be considered before upgrading the cluster.

  1. Control Plane
  2. Nodes
  3. Critical Cluster Applications
  4. Applications running on the Kubernetes cluster

The official documentation mentions the following order for the upgrade:

  1. Upgrade the control plane.
  2. Upgrade the nodes in your cluster.
  3. Upgrade clients such as kubectl.
  4. Adjust manifests and other resources based on the API changes that accompany the new Kubernetes version.

This order is fine, but a few more details need to be considered before the upgrade. Let’s examine each of them below.

Pre-upgrade checks

Imagine upgrading your Kubernetes cluster, and after performing the upgrade, some applications start crashing. You have also updated the manifests and Helm charts to match your new Kubernetes cluster version, but the applications still crash :(.

An important point to note is that, in addition to the manifests where you specify the API object versions, many applications’ code invokes certain Kubernetes APIs with specific versions. If the versions used by your applications are removed in the upgraded version of Kubernetes, there is a very good chance that your applications will fail.

To prevent this, always go through the changelogs in detail to understand the changes that have gone in. Changelogs are cumbersome to read, though. Thankfully, to make our lives easier, the release notes have a section called “Urgent Upgrade Notes” (e.g., the 1.31 release notes). The Kubernetes team has really tried to get your attention with the warning: “No, really, you MUST read this before you upgrade.” Do not skip this section before a cluster upgrade: it highlights key removals that can make or break your cluster, as well as deprecations that need to be addressed sooner or later.
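
Beyond reading the notes, you can check whether anything in the cluster still calls deprecated APIs. A minimal sketch, assuming you have permission to read the API server’s metrics endpoint, which exposes the apiserver_requested_deprecated_apis metric:

    # Ask the API server which deprecated APIs are still being requested.
    kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

    # List the API versions the cluster currently serves, to compare against
    # the versions your manifests and application code request.
    kubectl api-versions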

For example: 

  1. In the 1.27 release, the in-tree AWS EBS storage plugin was removed. If a cluster was upgraded from 1.26 to 1.27 without first installing the external EBS CSI driver, all functionality related to EBS storage would have run into issues.
  2. In the 1.26 release, the v2beta2 API version of HorizontalPodAutoscaler was removed, so applications needed to move to the “autoscaling/v2” API version. If this is not taken care of, any application that uses a HorizontalPodAutoscaler could run into issues (see the manifest sketch after this list).
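
To illustrate the second example, here is a hypothetical HorizontalPodAutoscaler manifest (the name “web” and the scaling targets are made up) after moving it to the autoscaling/v2 API version:

    # Hypothetical HPA migrated off the removed v2beta2 API version.
    kubectl apply -f - <<'EOF'
    apiVersion: autoscaling/v2        # was autoscaling/v2beta2 before 1.26
    kind: HorizontalPodAutoscaler
    metadata:
      name: web
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
    EOF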

Critical cluster applications

All the essential software components that ensure the proper functioning of the cluster can be classified as critical cluster applications (for lack of a better name 🙂). If any of these critical applications fail, the entire cluster’s functionality might be compromised.

For example, if CoreDNS fails, name resolution across the entire cluster will fail. Similarly, if the kube-proxy running as a DaemonSet fails, routing within the cluster will be disrupted.

List all the critical cluster applications and ensure they are compatible with the upgraded Kubernetes version. Applications and Kubernetes are generally designed so that an upgrade by a single minor version should not have any impact. However, if an application cannot run on the upgraded Kubernetes version, a release compatible with both the old and the new Kubernetes versions needs to be installed before the upgrade.

Go through each application’s documentation for its supported Kubernetes versions (e.g., the Cluster Autoscaler compatibility matrix), deploy the applications in a test environment, and test the functionality thoroughly.
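
As a starting point for that inventory, here is a minimal sketch that lists the images (and hence versions) of everything running in kube-system:

    # List each pod in kube-system together with its container images, so the
    # versions of CoreDNS, kube-proxy, CNI plugins, etc., can be checked
    # against their documented Kubernetes compatibility.
    kubectl get pods -n kube-system \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'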

Applications

So, what is the major difference between regular applications and critical cluster applications? The short answer is the impact when things fail. When a critical cluster application breaks, it can take down the entire cluster, and with it, potentially every application running on the cluster. When a regular application breaks, the impact is limited to that application and any closely related ones.

Similar to how the critical cluster applications are validated, all the applications need to be checked for compatibility with the upgraded version of Kubernetes.

Control plane

If the control plane is self-managed, its components should be upgraded in a specific order. The supported version skew between components is also documented, and it is advisable to upgrade while respecting both the order and the version compatibility. The control plane should always be upgraded in increments of one minor version. For example, never upgrade directly from 1.24 to 1.27; the upgrade should follow the path 1.24 -> 1.25 -> 1.26 -> 1.27. (Worker nodes also need to be upgraded, which we will cover in the next section.)
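
For a kubeadm-managed control plane, for example, the upgrade of the first control plane node looks roughly like this (a sketch; the package upgrades for kubeadm and the kubelet vary by distribution and are omitted, and the target version is illustrative):

    # Preview the upgrade and run preflight checks.
    kubeadm upgrade plan

    # Upgrade the first control plane node (substitute the exact target version).
    kubeadm upgrade apply v1.25.4

    # On each remaining control plane node:
    kubeadm upgrade node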

For a managed or hosted control plane, upgrades are easy; it’s just a click of a button or an API call. However, many hosted control plane upgrades are irreversible, which means that if you upgrade your control plane and something goes wrong, there is no way to reverse the change. Either you have to recreate the environment in a different Kubernetes cluster, or the issue needs to be fixed in place.
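
For example, on a hosted platform such as Amazon EKS, the control plane upgrade is a single CLI call (the cluster name here is hypothetical), which is exactly why the pre-upgrade checks matter so much:

    # Hypothetical EKS example; note that this operation cannot be rolled back.
    aws eks update-cluster-version \
      --name my-cluster \
      --kubernetes-version 1.27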

Nodes

The kubelet running on a node can be up to three minor versions older than the kube-apiserver (control plane). Although upgrading the nodes during a cluster upgrade is not mandatory, it is still advisable to upgrade them once the control plane has been upgraded. This ensures that components like the kubelet are also upgraded.
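
The current skew is easy to verify: the VERSION column of kubectl get nodes shows each node’s kubelet version, which can be compared against the server version.

    # Compare each node's kubelet version (VERSION column) with the
    # API server version to confirm the skew is within the supported range.
    kubectl get nodes
    kubectl version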

One of the approaches to upgrading nodes is to use a strategy similar to rolling upgrades.

The following are the steps (a minimal command sketch follows the list):

  1. Add new nodes with the upgraded version to the cluster. The number of nodes to add depends on the environment. After this step, there will be a set of nodes on the old version and a set of nodes on the new version.
  2. Cordon the nodes running the old version. This marks them unschedulable, so no new pods can be placed on them.
  3. Migrate the applications to the new nodes. This can be done by adding node affinity or sometimes by simply deleting the pods; pods that are part of a Deployment, StatefulSet, etc., will be recreated on the new nodes.
  4. Drain the old-version nodes one by one.
  5. Remove the drained nodes from the cluster.
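
A minimal kubectl sketch of steps 2, 4, and 5 (the node name is hypothetical; step 3 largely happens as a side effect of the eviction during the drain):

    # Step 2: mark the old node unschedulable.
    kubectl cordon old-node-1

    # Step 4: evict the pods so they are rescheduled onto the new nodes.
    kubectl drain old-node-1 --ignore-daemonsets --delete-emptydir-data

    # Step 5: remove the drained node from the cluster.
    kubectl delete node old-node-1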

Plan for disaster

Despite all the effort and precautions taken, things might still break and cause issues. Always plan for disaster recovery scenarios and have a business continuity plan for your applications.

Tools like Velero help back up and restore applications running on Kubernetes. Do test runs of the application restore before the actual migration. If your Kubernetes application ships its own backup and restore option, test it out as well; application-specific recovery can be more accurate and efficient.
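
A minimal Velero sketch, assuming Velero is already installed in the cluster (the backup and namespace names are hypothetical):

    # Back up one namespace before the upgrade.
    velero backup create pre-upgrade-backup --include-namespaces my-app

    # Verify the backup completed, then rehearse a restore from it.
    velero backup get
    velero restore create --from-backup pre-upgrade-backup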

Flow

The chart below shows the overall upgrade process.

K8s Upgrade Flow

How to upgrade a Kubernetes cluster: a conclusion

  1. Never skip the changelogs before upgrading your Kubernetes cluster.
  2. Always go through the recommended order and process from the official documentation.
  3. Always upgrade the cluster in increments of one minor version.
  4. Keep all the applications up to date.
  5. Never plan for an upgrade without a disaster recovery plan.
  6. If you still manage to break the upgrade, welcome to the club. You are not alone 🙂 

 

Ready to elevate your Kubernetes management? Check out Percona Everest for seamless database control on Kubernetes, offering automated scaling, unified management, and open source flexibility. 

FAQs

1. What are the main components involved in upgrading a Kubernetes cluster, and why is their order important?

The main components involved in upgrading a Kubernetes cluster are the Control Plane, Nodes, Critical Cluster Applications, and Applications running on the Kubernetes cluster. The order of upgrading these components is crucial because it ensures system stability and minimizes downtime. Typically, the Control Plane should be upgraded first, followed by the nodes, then critical cluster applications, and finally, regular applications. This order allows for a smooth transition and helps maintain compatibility between different parts of the system during the upgrade process.

2. How can pre-upgrade checks help prevent issues during a Kubernetes cluster upgrade, and what specific areas should be focused on?

Pre-upgrade checks are essential in preventing potential issues during a Kubernetes cluster upgrade. These checks should focus on several key areas:

  • Reviewing changelogs and “Urgent Upgrade Notes” to identify any API changes, deprecations, or removals that might affect your applications.
  • Ensuring compatibility of critical cluster applications with the new Kubernetes version.
  • Verifying that all regular applications running on the cluster are compatible with the upgraded version.
  • Checking for any custom resources or controllers that might be affected by the upgrade.

By thoroughly examining these areas, you can identify and address potential problems before they occur during the upgrade process, significantly reducing the risk of unexpected issues and downtime.

3. What strategies can be employed for upgrading nodes in a Kubernetes cluster, and what precautions should be taken?

A common strategy for upgrading nodes in a Kubernetes cluster is to use a rolling upgrade approach. This involves:

  1. Adding new nodes with the upgraded version to the cluster.
  2. Cordoning off old nodes to prevent new pods from being scheduled on them.
  3. Migrating applications to the new nodes, either by adding affinity rules or by manually redeploying them.
  4. Draining the old nodes one by one to safely evict all pods.
  5. Removing the drained nodes from the cluster.

Precautions to take during this process include:

  • Ensuring sufficient capacity in the cluster to handle the workload during the migration.
  • Testing the upgrade process in a non-production environment first.
  • Having a rollback plan in case of unforeseen issues.
  • Monitoring the cluster closely during and after the upgrade for any anomalies.

4. Why is disaster recovery planning crucial when upgrading a Kubernetes cluster, and what tools can assist in this process?

Disaster recovery planning is crucial when upgrading a Kubernetes cluster because, despite all precautions, unforeseen issues can still arise. A solid disaster recovery plan ensures that you can quickly restore your applications and data if something goes wrong during the upgrade process.

Tools like Velero can assist in disaster recovery planning by:

  • Allowing you to backup and restore entire applications, including their data and configurations.
  • Providing the ability to migrate applications between clusters, which can be useful if you need to revert to a previous version.
  • Offering scheduling capabilities for regular backups, ensuring you always have a recent backup available.

It’s important to not only have these tools in place but also to regularly test your disaster recovery procedures to ensure they work as expected when needed.
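
For instance, the scheduling capability looks like this (a sketch; the schedule name and cron expression are made up):

    # Create a recurring daily backup at 2 AM so a recent backup always exists.
    velero schedule create daily-backup --schedule="0 2 * * *"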

5. What are some best practices for managing API version changes during a Kubernetes upgrade, and how can potential issues be mitigated?

Managing API version changes during a Kubernetes upgrade requires careful planning and execution. Some best practices include:

  • Thoroughly reviewing the changelog and deprecation notices for the new Kubernetes version.
  • Identifying all resources in your cluster that use deprecated API versions.
  • Updating manifests, Helm charts, and other configuration files to use the new API versions before the upgrade.
  • Using tools like kubectl convert to automatically update your resource definitions to the latest API versions (see the example after this list).
  • Implementing a gradual rollout strategy, starting with non-critical workloads to identify any unforeseen issues.
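
For instance, the kubectl convert step could look like this (a sketch; the file name is hypothetical, and the kubectl-convert plugin must be installed separately):

    # Rewrite a manifest that uses an older API version to a newer one.
    kubectl convert -f old-hpa.yaml --output-version autoscaling/v2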

To mitigate potential issues:

  • Maintain a comprehensive inventory of all applications and their API dependencies.
  • Use static analysis tools to scan your codebase and configurations for deprecated API usage.
  • Implement automated testing that includes API version compatibility checks.
  • Consider using admission controllers or custom validators to prevent the deployment of resources using deprecated API versions.
  • Keep your continuous integration and deployment pipelines updated to work with the latest API versions.