This is the tale of a low-severity incident in our Kubernetes cluster fleet at LinkedIn that taught me a lot about how to think about the off-the-shelf components we bring from the ecosystem into our critical path.
Many years ago, when Kubernetes was still a lab experiment at LinkedIn, node-feature-discovery (NFD) looked like a fun little DaemonSet that we could install on every node to expose the bare-metal node’s features (like CPU type, PCI devices, NUMA enablement) as node labels on Kubernetes, and allow workloads to specify these labels as scheduling predicates.
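For readers unfamiliar with the mechanism, here is a minimal sketch of what that looks like from the workload side: a Pod that asks the scheduler to only place it on nodes carrying one of NFD’s feature labels. The specific label key and value are illustrative, not something our clusters actually depend on.

```go
// A hypothetical workload pinning itself to nodes that carry an NFD-produced
// feature label. NFD publishes labels under the feature.node.kubernetes.io/
// prefix; the exact key/value below is illustrative only.
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func examplePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-pinned-workload"},
		Spec: corev1.PodSpec{
			// The scheduler will only consider nodes that have this label,
			// which NFD derives from the hardware it detects on each node.
			NodeSelector: map[string]string{
				"feature.node.kubernetes.io/cpu-model.vendor_id": "Intel",
			},
			Containers: []corev1.Container{
				{Name: "app", Image: "example.com/app:latest"},
			},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(examplePod(), "", "  ")
	fmt.Println(string(out))
}
```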
After 5 years of happily running a really old version of NFD, we decided to upgrade the component. The new version seemed architecturally better: node agents in the old version used to communicate with the master component (which does the labeling) over gRPC. With the new version, the agents would instead use a new NodeFeature custom resource API that a controller watches, decoupling feature detection from the labeling process.
As with any big upgrade done after a long while, things went really poorly. We ran into bugs that dropped all node labels (breaking pod scheduling in our clusters), and observed scale issues that manifested only in our largest (and, as it happens, most critical) Kubernetes clusters.
In the end, we decided to roll back to the five-year-old version that was working fine, postpone the upgrade indefinitely, and instead phase the component out of our Kubernetes clusters, while giving the open source project feedback so that others can benefit from the component.
The case against off-the-shelf components
As we matured our Kubernetes infrastructure over time, I came to realize that very few projects outside kubernetes/kubernetes work as advertised and scale well in large clusters. (For fun, try your favorite Kubernetes visualization tool in a 3,000-node, high-pod-density cluster and watch your apiserver melt.)
I say this as someone who’s responsible for creating OSS tools in this ecosystem (some under the kubernetes-sigs org), and those projects wouldn’t pass my code quality and testing bar today: do your own due diligence any time you bring in an off-the-shelf Kubernetes component that’s not battle-hardened at your scale.
If you’re a large company deploying off-the-shelf components into your critical path, it’s your job to read the source code to:
- audit how well the code is written and tested
- understand the failure modes and scalability parameters
- run the scale tests for the component
That said, this is easier said than done. We all take shortcuts, or never get around to re-evaluating our dependencies as our scale grows. More often than not, we find that off-the-shelf controllers in the open source ecosystem don’t hold themselves to a craft bar as high as ours.
Scale issues
A major architectural shift in modern NFD versions is that the NFD workers now write to the NodeFeature custom resource (which previously didn’t exist) to communicate the node’s features to the controller that labels the Node on the Kubernetes API server. This didn’t seem concerning prior to the upgrade, as we write our fair share of Kubernetes controllers at LinkedIn.
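To make the shape of that pattern concrete, here is a minimal sketch (not NFD’s implementation) of the labeling half: given a map of detected features for a node, merge-patch them onto the Node object’s labels. The node name and feature map below are hypothetical.

```go
// Minimal sketch (not NFD's implementation) of the labeling half of the new
// architecture: merge-patch a map of detected features onto a Node's labels.
// The node name and feature map below are hypothetical.
package main

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func labelNode(ctx context.Context, client kubernetes.Interface, nodeName string, features map[string]string) error {
	// Build a strategic merge patch that only touches metadata.labels, so
	// existing labels managed by other components are left alone.
	patch, err := json.Marshal(map[string]any{
		"metadata": map[string]any{"labels": features},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	features := map[string]string{
		"feature.node.kubernetes.io/cpu-model.vendor_id": "Intel", // hypothetical detected feature
	}
	if err := labelNode(context.Background(), client, "worker-0", features); err != nil {
		panic(err)
	}
}
```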
As we rolled out the new version, everything went smoothly in our pre-production environment.
It wasn’t until the new version hit the larger clusters in production that we found out the hard way that each of these NodeFeature custom resources takes up about 140 KB on the apiserver. This was partially because NFD reports a ton of kernel settings that we didn’t use out of the box, and partially because the Kubernetes API server doesn’t store custom resources in a space-efficient wire format (it stores them as JSON).
If you do the math for 4,000 nodes, you’re looking at roughly 560 MB just to store some features of the nodes in a cluster. Given that etcd has a suggested storage limit of 8 GB and the apiserver maintains a watch cache for all resources by default, this put more strain on both components.
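If you want a rough sense of how much a given custom resource is contributing to that footprint in your own cluster, one crude check is to list the objects and sum their JSON-encoded sizes. A sketch follows; the nfd.k8s-sigs.io/v1alpha1 group/version is an assumption you should verify against the CRD installed in your cluster.

```go
// A crude way to estimate how much apiserver/etcd space a custom resource is
// consuming: list the objects and sum their JSON-encoded sizes. The
// nfd.k8s-sigs.io/v1alpha1 nodefeatures GVR below is an assumption; verify it
// against the CRD installed in your cluster. Note that in a very large cluster
// this list is itself an expensive request.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeatures"}
	list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	total := 0
	for _, item := range list.Items {
		b, _ := json.Marshal(item.Object)
		total += len(b)
	}
	fmt.Printf("%d objects, ~%.1f MB as JSON\n", len(list.Items), float64(total)/(1<<20))
}
```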
The large object sizes made the NFD controller unable to list the large number of NodeFeature objects from the apiserver, causing its list requests to repeatedly time out.
Unfortunately, the NFD controller proceeding to start without a successful list response from the apiserver triggered a worse bug in NFD.
Bugs leading to node label removals
We rely on node labels to route workloads to the correct node pool (and similarly, to keep unwanted workloads away from dedicated single-tenant node pools). Even though our node labels are rather static, if the component that manages your node labels decides to remove them, you have a big problem on your hands.
Something we relied on NFD for was to keep existing node labels unless it was certain that a node label should be removed. However, after upgrading, we lost all node labels managed by NFD nearly simultaneously.
It turned out that in NFD v0.16.0 (and in many versions prior), the controller starts up without an authoritative list of NodeFeatures from the apiserver. So when the controller could not find a Node’s NodeFeature object because its cache was incomplete, it would treat the node’s list of labels as “empty” and go ahead and remove all node labels.
Normally, Kubernetes controllers must not start unless they have successfully built their informer cache. However, NFD did not check the return value of the WaitForCacheSync() method (which would’ve told the controller not to start with a missing cache while its list request was timing out). This issue was reported here.
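For reference, here is the shape of the missing check, as a minimal sketch rather than NFD’s actual code: an informer-based controller that refuses to start when its cache never syncs.

```go
// Minimal sketch (not NFD's code) of the informer startup pattern with the
// WaitForCacheSync return value actually checked. If the initial LIST keeps
// timing out, HasSynced stays false and the controller exits instead of
// reconciling against an empty cache.
package main

import (
	"context"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	factory.Start(ctx.Done())

	// The part that was missing: bound the wait and refuse to run if the cache
	// never becomes authoritative. Ignoring this return value is how a controller
	// ends up treating "I couldn't list anything" as "there is nothing".
	syncCtx, syncCancel := context.WithTimeout(ctx, 2*time.Minute)
	defer syncCancel()
	if !cache.WaitForCacheSync(syncCtx.Done(), nodeInformer.HasSynced) {
		klog.Fatal("informer cache never synced; refusing to start")
	}

	klog.Info("cache synced, safe to start workers")
	// ... start reconcile workers here ...
}
```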
This bug was easily reproducible on a kind cluster: install NFD v0.16.0 and observe that the kind node gets its feature labels. Next, create 1,000 fake Nodes and NodeFeatures, and watch the previously added feature labels on the kind node disappear as the nfd-master controller runs into list timeouts from the apiserver (for which there’s now a fix).
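A load generator for this kind of repro can be as simple as creating bare Node objects through the API (they simply stay NotReady); a sketch follows. The matching fake NodeFeature objects would be created the same way through a dynamic client, with payloads matching whatever schema your installed NodeFeature CRD defines.

```go
// Sketch of a load generator for this kind of repro: create many bare Node
// objects through the API so the controller's LIST grows large. The fake nodes
// simply stay NotReady; no kubelet is involved.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for i := 0; i < 1000; i++ {
		node := &corev1.Node{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("fake-node-%04d", i)},
		}
		if _, err := client.CoreV1().Nodes().Create(context.Background(), node, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
	fmt.Println("created 1,000 fake nodes")
}
```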
Upon further auditing the code, we found several other failure modes that would similarly lead to node label removals under different conditions:
- We found other controllers, like nfd-gc, that also did not check the return value of the WaitForCacheSync() method, which would similarly cause node label removals. This was reported here and fixed here and here.
- We recently found newly introduced code where this mistake is repeated once again (reported here, yet to be fixed). This rather proves my point that implementing controllers correctly is inherently hard. NFD uses the Kubernetes Go client directly and manages the informer lifecycle (which is a low-level primitive) by hand; using a higher-level controller development framework like controller-runtime would make it impossible to get this wrong (see the sketch after this list).
- NodeFeature custom resources have their owner references set to the nfd-worker Pod, which is a bad idea, because these Pods get deleted all the time (during upgrades, for example) and Kubernetes would garbage-collect the NodeFeature resources along with them. This issue is reported here. The newer v0.16.6 version changes the parent object to the DaemonSet, which I think is still the wrong fix (deletion of a DaemonSet should not cause all node labels to be nuked). The NFD worker would treat the lack of the NodeFeature object as “node has no labels” and proceed with node label removals as well. This was reported here and is still not fully fixed in a way that makes sense to me.
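To illustrate the controller-runtime point from the list above, here is a minimal sketch (again, not NFD’s code) of a node-labeling reconciler built on the framework: the manager owns the shared cache, waits for it to sync, and fails to start rather than delivering events against an incomplete cache, so the WaitForCacheSync mistake has no place to happen.

```go
// Minimal sketch of a node-labeling reconciler built on controller-runtime.
// The manager owns the shared cache and won't deliver events to Reconcile
// (or keep running) unless the cache syncs.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

type nodeLabeler struct {
	client.Client
}

func (r *nodeLabeler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// By the time this runs, the cache behind r.Get is guaranteed to have synced.
	// ... compute and apply labels here ...
	return ctrl.Result{}, nil
}

func main() {
	ctrl.SetLogger(zap.New())

	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Complete(&nodeLabeler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	// Start blocks; it returns an error (instead of silently running) if the
	// caches never sync.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```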
Overall, we decided we’re probably better off not relying on a dynamic component for node labels: we’re moving all static node labels into the kubelet configuration, and using our existing node lifecycle management controllers to manage the dynamic node labels.
Conclusions
Kubernetes controllers are harder to write correctly than they look. Tooling and frameworks like controller-runtime are really good at giving you the impression that you have something working. That said, correctness and scalability aren’t obvious things to test for if you aren’t aware of the pitfalls you might be exposed to.
A big lesson I’ve learned over time is that you really need to deeply understand how an external off-the-shelf controller you bring into your ecosystem works, and do extensive due diligence, before putting the component in your critical path.