This is the tale of a low-severity incident in our Kubernetes cluster fleet at LinkedIn that taught me a lot about how to think about the off-the-shelf components we bring from the ecosystem into the critical path.

Many years ago, when Kubernetes was still a lab experiment at LinkedIn, node-feature-discovery (NFD) looked like a fun little DaemonSet that we could run on every node to expose the bare-metal node’s features (like CPU type, PCI devices, NUMA enablement) as node labels on Kubernetes, and let workloads use these labels as scheduling predicates.
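
For context, this is roughly what consuming those labels looks like from the workload side. The sketch below is illustrative (written against the client-go API types), and the specific feature label key is an assumption based on NFD’s usual feature.node.kubernetes.io/ prefix rather than something from our clusters:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The label key below is illustrative; NFD publishes feature labels under
	// the feature.node.kubernetes.io/ prefix.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "avx512-workload"},
		Spec: corev1.PodSpec{
			// Scheduling predicate: only land on nodes whose CPU supports AVX-512.
			NodeSelector: map[string]string{
				"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
			},
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/app:latest", // placeholder image
			}},
		},
	}
	fmt.Printf("nodeSelector: %v\n", pod.Spec.NodeSelector)
}
```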

After 5 years of happily running a really old version of NFD, we decided to upgrade the component. The new version seemed architecturally better: in the old version, node agents communicated with the master component (which does the labeling) over gRPC. In the new version, the agents instead write to a new NodeFeature custom resource API that a controller watches, decoupling feature detection from the labeling process.

As with any big upgrade done after a long while, things went really poorly. We ran into bugs that dropped all node labels (breaking pod scheduling in our clusters) and observed scale issues that manifested only in our largest (which also happen to be our most critical) Kubernetes clusters.

In the end, we decided to roll back to the five-year-old version that was working fine, hold off on upgrading indefinitely, and instead phase the component out of our Kubernetes clusters, while giving the open source project feedback so that others can benefit from the component more.

The case against off-the-shelf components

As we matured our Kubernetes infrastructure over time, I came to realize that very few projects outside kubernetes/kubernetes work as advertised and scale well in large clusters. (For fun, try your favorite Kubernetes visualization tool on a 3,000-node, high-pod-density cluster and watch your apiserver melt.)

I say this as someone responsible for creating OSS tools in this ecosystem (some under the kubernetes-sigs org), and those projects wouldn’t pass my code quality and testing bar today: do your own due diligence any time you bring in an off-the-shelf Kubernetes component that isn’t battle-hardened at your scale.

As a large company deploying off-the-shelf components into your critical path, it’s your job to read the source code to:

  • audit how well the code is written and tested
  • understand the failure modes and scalability parameters
  • run the scale tests for the component

That said, this is easier said than done. We all take shortcuts, or never get around to re-evaluating our dependencies as our scale grows. More often than not, we find that off-the-shelf controllers in the open source ecosystem don’t hold themselves to as high a craft bar as ours.

Scale issues

A major architectural shift in modern NFD versions is that the NFD workers now write to the NodeFeature custom resource (which previously didn’t exist) to communicate the node’s features to the controller that labels the Node objects on the Kubernetes API server. This didn’t seem concerning prior to the upgrade, as we write our fair share of Kubernetes controllers at LinkedIn.

As we rolled out the new version, everything went smoothly in our pre-production environment.

It wasn’t until the new version hit the larger clusters in production that we found out the hard way that each of these NodeFeature custom resources takes up 140 KB on the apiserver. This was partially because NFD reports, out of the box, a ton of kernel settings that we didn’t use, and partially because the Kubernetes API server doesn’t store custom resources in a space-efficient wire format (it stores them as JSON).

If you do the math for 4,000 nodes, you’re looking at roughly 540 MB just to store some features of the nodes in a cluster. Given that etcd has a suggested size limit of 8 GB and the apiserver keeps a watch cache for all resources by default, this put more strain on both components.

The large object sizes made the NFD controller unable to list the large number of NodeFeatures from the apiserver, causing its list requests to repeatedly time out.

Unfortunately, the NFD controller proceeding to start without a successful list response from the apiserver triggered a worse bug in NFD.

Bugs leading to node label removals

We rely on node labels to route workloads to the correct node pools (and similarly, to keep unwanted workloads away from dedicated single-tenant node pools). Even though our node labels are rather static, if the component that manages your node labels decides to remove them, you’ll have a big problem on your hands.

Something we relied on NFD for was to keep the node labels in place unless it was certain that a node label should be removed. However, after upgrading, we lost all node labels managed by NFD nearly simultaneously.

It turned out that in NFD v0.16.0 (and in many versions prior), the controller starts up without an authoritative list of NodeFeatures from the apiserver. So when the controller could not find a Node’s NodeFeature object because its cache was incomplete, it would treat the node’s list of labels as “empty” and go ahead and remove all of that node’s labels.

Normally, Kubernetes controllers must not start until they have successfully built their informer caches. However, NFD did not check the return value of the WaitForCacheSync() method (which would have told the controller not to start with a missing cache while its list requests were timing out). This issue was reported here.
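
For illustration, here is a minimal client-go sketch of the pattern that was missing. This is not NFD’s actual code; it just shows why the boolean returned by WaitForCacheSync() matters when the initial list keeps timing out:

```go
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Bound the wait so that a list that keeps timing out eventually surfaces
	// as a failed sync instead of blocking forever.
	syncCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// The check that was missing: WaitForCacheSync reports whether the informer
	// completed its initial list. If it returns false, the cache is empty or
	// partial, and reconciling from it looks like "this node has no labels".
	if ok := cache.WaitForCacheSync(syncCtx.Done(), nodeInformer.HasSynced); !ok {
		log.Fatal("informer cache never synced; refusing to reconcile from an incomplete cache")
	}

	// ...safe to start workers that read from the informer cache here...
}
```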

This bug was easily reproducible on a kind cluster: install NFD v0.16.0 and observe that the kind node gets its feature labels. Next, create 1,000 fake Nodes and NodeFeatures, and watch the previously added feature labels on the kind node disappear as the nfd-master controller runs into list timeouts from the apiserver (which there’s now a fix for).
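
If you want to approximate that repro, the sketch below fills a cluster with fake Node objects using client-go. It is only a rough approximation of what’s described above: the original reproduction also created a NodeFeature custom resource per fake node (via the dynamic client), which is what actually inflates the list responses; that part is omitted here for brevity.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Talks to whatever cluster the local kubeconfig points at (e.g. kind).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Register 1,000 fake Node objects. They never become Ready, but the
	// apiserver (and anything listing Nodes) sees all of them.
	for i := 0; i < 1000; i++ {
		node := &corev1.Node{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("fake-node-%04d", i)},
		}
		if _, err := client.CoreV1().Nodes().Create(context.Background(), node, metav1.CreateOptions{}); err != nil {
			log.Fatalf("creating %s: %v", node.Name, err)
		}
	}
}
```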

Upon auditing the code further, we found several other failure modes that would similarly lead to node label removals under different conditions:

  1. We found other controllers like nfd-gc that also did not check the return value of the WaitForCacheSync() method, which would similarly cause node label removals. This was reported here and fixed here and here.

  2. We recently found newly introduced code where this mistake is repeated once again (reported here, yet to be fixed).

    This rather proves my point that implementing controllers correctly is inherently hard. NFD uses the Kubernetes Go client directly and manages the informer lifecycle itself (a low-level primitive); using a higher-level controller development framework like controller-runtime would make it impossible to get this wrong (see the sketch after this list).

  3. NodeFeature custom resources have their owner references set to the nfd-worker Pod, which is a bad idea, because these Pods get deleted all the time (during upgrades, etc.), and Kubernetes then garbage-collects the NodeFeature resources. This issue is reported here.

    The newer v0.16.6 version makes a change to set the parent object to the DaemonSet, which I think is still the wrong fix (deletion of a DaemonSet should not cause all node labels to be nuked). NFD would treat the lack of the NodeFeature object as “node has no labels” and proceed with node label removals as well. This was reported here and is still not fully fixed in a way that makes sense to me.
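
To illustrate the controller-runtime point from item 2 above, here is a minimal, hypothetical sketch (not NFD’s code) of what a node-labeling controller looks like on that framework: the manager starts its shared caches, waits for them to sync, and only then begins calling Reconcile, so a failed initial list becomes a startup error rather than an empty cache that reads as “no labels”.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeLabelReconciler is a stand-in for a controller that manages node labels.
type nodeLabelReconciler struct {
	client.Client
}

func (r *nodeLabelReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	// Reads go through the manager's cache, which has completed its initial
	// list before Reconcile is ever called.
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ...decide label changes here...
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Complete(&nodeLabelReconciler{Client: mgr.GetClient()}); err != nil {
		log.Fatal(err)
	}
	// Start blocks: it runs the caches, waits for them to sync, and only then
	// starts the reconciler. A cache that never syncs fails the whole manager.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```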

Overall, we’ve decided we’re probably better off not relying on NFD to manage node labels dynamically: we’re moving all static node labels into the kubelet configuration and using our existing node lifecycle management controllers to manage the dynamic node labels.
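
For the dynamic labels, the controller-owned approach boils down to something like the sketch below (a generic illustration, not our internal controller; the node name and label key are made up): the controller computes the one label it owns and applies it with a narrow JSON merge patch, so nothing else on the Node object is touched.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// applyNodeLabel sets (or overwrites) a single label on a Node with a JSON
// merge patch, leaving every other label on the object alone.
func applyNodeLabel(ctx context.Context, client kubernetes.Interface, nodeName, key, value string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:%q}}}`, key, value))
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical node name and label; a real controller would derive these
	// from its own node lifecycle state.
	if err := applyNodeLabel(context.Background(), client, "worker-001", "example.com/pool", "batch"); err != nil {
		log.Fatal(err)
	}
}
```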

Conclusions

Kubernetes controllers are harder to write correctly than they appear. Tooling and frameworks like controller-runtime are really good at giving you the impression that you have something working. That said, correctness and scalability aren’t always obvious things to test for if you aren’t aware of the pitfalls you might be exposed to.

A big lesson I’ve learned over time is that you really need to deeply understand how an external off-the-shelf controller works, and do extensive due diligence, before bringing it into your ecosystem and putting it in your critical path.