This is the tale of a low-severity incident in our Kubernetes cluster fleet at LinkedIn that taught me a lot about how to think about the off-the-shelf components we bring from the ecosystem into our critical path.
Many years ago, when Kubernetes was still a lab experiment at LinkedIn, node-feature-discovery (NFD) looked like a fun little DaemonSet that we could install on every node to expose the bare-metal node’s features (like CPU type, PCI devices, NUMA enablement) as node labels on Kubernetes, and allow workloads to specify these labels as scheduling predicates.
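For readers unfamiliar with the mechanism, here is a minimal sketch of what that looks like from the workload side: a Pod that asks the scheduler to only place it on nodes carrying one of NFD’s feature labels. The specific label key and value are illustrative, not something our clusters actually depend on.

```go
// A hypothetical workload pinning itself to nodes that carry an NFD-produced
// feature label. NFD publishes labels under the feature.node.kubernetes.io/
// prefix; the exact key/value below is illustrative only.
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func examplePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-pinned-workload"},
		Spec: corev1.PodSpec{
			// The scheduler will only consider nodes that have this label,
			// which NFD derives from the hardware it detects on each node.
			NodeSelector: map[string]string{
				"feature.node.kubernetes.io/cpu-model.vendor_id": "Intel",
			},
			Containers: []corev1.Container{
				{Name: "app", Image: "example.com/app:latest"},
			},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(examplePod(), "", "  ")
	fmt.Println(string(out))
}
```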
After 5 years of happily running a really old version of NFD, we decided to upgrade the component. The new version seemed architecturally better: node agents in the old version used to communicate with the master component (which does the labeling) over gRPC. With the new version, the agents would instead use a new NodeFeature custom resource API that a controller watches, decoupling feature detection from the labeling process.
As with any big upgrade done after a long while, things went really poorly. We ran into bugs that dropped all node labels (breaking pod scheduling in our clusters), and observed scale issues that manifested only in our largest (and, as it happens, most critical) Kubernetes clusters.
In the end, we decided to roll back to the five-year-old version that was working fine, postpone the upgrade indefinitely, and instead phase the component out of our Kubernetes clusters, while giving the open source project feedback so that others can benefit from the component.
The case against off-the-shelf components
As we matured our Kubernetes infrastructure over time, I came to realize that very few projects outside kubernetes/kubernetes work as advertised and scale well in large clusters. (For fun, try your favorite Kubernetes visualization tool in a 3,000-node, high-pod-density cluster and watch your apiserver melt.)
I say this as someone who’s responsible for creating OSS tools in this ecosystem (some under the kubernetes-sigs org), and those projects wouldn’t pass my code quality and testing bar today: do your own due diligence any time you bring in an off-the-shelf Kubernetes component that’s not battle-hardened at your scale.
If you’re a large company deploying off-the-shelf components into your critical path, it’s your job to read the source code to:
- audit how well the code is written and tested
- understand the failure modes and scalability parameters
- run the scale tests for the component
That said, this is easier said than done. We all take shortcuts, or never get around to re-evaluating our dependencies as our scale grows. More often than not, we find that off-the-shelf controllers in the open source ecosystem don’t hold themselves to a craft bar as high as ours.
Scale issues
A major architectural shift in modern NFD versions is that the NFD workers now write to the NodeFeature custom resource (which previously didn’t exist) to communicate the node’s features to the controller that labels the Node on the Kubernetes API server. This didn’t seem concerning prior to the upgrade, as we write our fair share of Kubernetes controllers at LinkedIn.
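To make the shape of that pattern concrete, here is a minimal sketch (not NFD’s implementation) of the labeling half: given a map of detected features for a node, merge-patch them onto the Node object’s labels. The node name and feature map below are hypothetical.

```go
// Minimal sketch (not NFD's implementation) of the labeling half of the new
// architecture: merge-patch a map of detected features onto a Node's labels.
// The node name and feature map below are hypothetical.
package main

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func labelNode(ctx context.Context, client kubernetes.Interface, nodeName string, features map[string]string) error {
	// Build a strategic merge patch that only touches metadata.labels, so
	// existing labels managed by other components are left alone.
	patch, err := json.Marshal(map[string]any{
		"metadata": map[string]any{"labels": features},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	features := map[string]string{
		"feature.node.kubernetes.io/cpu-model.vendor_id": "Intel", // hypothetical detected feature
	}
	if err := labelNode(context.Background(), client, "worker-0", features); err != nil {
		panic(err)
	}
}
```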
As we rolled out the new version, everything went smoothly in our pre-production environment.
It wasn’t until the new version hit the larger clusters in production that we found out the hard way that each of these NodeFeature custom resources takes up about 140 KB on the apiserver. This was partially because NFD reports a ton of kernel settings that we didn’t use out of the box, and partially because the Kubernetes API server doesn’t store custom resources in a space-efficient wire format (it stores them as JSON).
If you do the math for 4,000 nodes, you’re looking at roughly 560 MB just to store some features of the nodes in a cluster. Given that etcd has a suggested storage limit of 8 GB and the apiserver maintains a watch cache for all resources by default, this put more strain on both components.
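If you want a rough sense of how much a given custom resource is contributing to that footprint in your own cluster, one crude check is to list the objects and sum their JSON-encoded sizes. A sketch follows; the nfd.k8s-sigs.io/v1alpha1 group/version is an assumption you should verify against the CRD installed in your cluster.

```go
// A crude way to estimate how much apiserver/etcd space a custom resource is
// consuming: list the objects and sum their JSON-encoded sizes. The
// nfd.k8s-sigs.io/v1alpha1 nodefeatures GVR below is an assumption; verify it
// against the CRD installed in your cluster. Note that in a very large cluster
// this list is itself an expensive request.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeatures"}
	list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	total := 0
	for _, item := range list.Items {
		b, _ := json.Marshal(item.Object)
		total += len(b)
	}
	fmt.Printf("%d objects, ~%.1f MB as JSON\n", len(list.Items), float64(total)/(1<<20))
}
```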
The large object sizes made the NFD controller unable to list the large number of NodeFeature objects from the apiserver, causing its list requests to repeatedly time out.
Unfortunately, the NFD controller proceeding to start without a successful list response from the apiserver triggered a worse bug in NFD.
Bugs leading to node label removals
We rely on node labels to route workloads to the correct node pool (and similarly, to keep unwanted workloads away from dedicated single-tenant node pools). Even though our node labels are rather static, if the component that manages your node labels decides to remove them, you have a big problem on your hands.
Something we relied on NFD for was to keep existing node labels unless it was certain that a node label should be removed. However, after upgrading, we lost all node labels managed by NFD nearly simultaneously.
It turned out that in NFD v0.16.0 (and in many versions prior), the controller starts up without an authoritative list of NodeFeatures from the apiserver. So when the controller could not find a Node’s NodeFeature object because its cache was incomplete, it would treat the node’s list of labels as “empty” and go ahead and remove all node labels.
Normally, Kubernetes controllers must not start unless they have successfully built their informer cache. However, NFD did not check the return value of the WaitForCacheSync() method (which would’ve told the controller not to start with a missing cache while its list request was timing out). This issue was reported here.
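For reference, here is the shape of the missing check, as a minimal sketch rather than NFD’s actual code: an informer-based controller that refuses to start when its cache never syncs.

```go
// Minimal sketch (not NFD's code) of the informer startup pattern with the
// WaitForCacheSync return value actually checked. If the initial LIST keeps
// timing out, HasSynced stays false and the controller exits instead of
// reconciling against an empty cache.
package main

import (
	"context"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	factory.Start(ctx.Done())

	// The part that was missing: bound the wait and refuse to run if the cache
	// never becomes authoritative. Ignoring this return value is how a controller
	// ends up treating "I couldn't list anything" as "there is nothing".
	syncCtx, syncCancel := context.WithTimeout(ctx, 2*time.Minute)
	defer syncCancel()
	if !cache.WaitForCacheSync(syncCtx.Done(), nodeInformer.HasSynced) {
		klog.Fatal("informer cache never synced; refusing to start")
	}

	klog.Info("cache synced, safe to start workers")
	// ... start reconcile workers here ...
}
```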
This bug was easily reproducible on a kind cluster: install NFD v0.16.0 and observe that the kind node gets its feature labels. Next, create 1,000 fake Nodes and NodeFeatures, and watch the previously added feature labels on the kind node disappear as the nfd-master controller runs into list timeouts from the apiserver (for which there’s now a fix).
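A load generator for this kind of repro can be as simple as creating bare Node objects through the API (they simply stay NotReady); a sketch follows. The matching fake NodeFeature objects would be created the same way through a dynamic client, with payloads matching whatever schema your installed NodeFeature CRD defines.

```go
// Sketch of a load generator for this kind of repro: create many bare Node
// objects through the API so the controller's LIST grows large. The fake nodes
// simply stay NotReady; no kubelet is involved.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for i := 0; i < 1000; i++ {
		node := &corev1.Node{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("fake-node-%04d", i)},
		}
		if _, err := client.CoreV1().Nodes().Create(context.Background(), node, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
	fmt.Println("created 1,000 fake nodes")
}
```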
Upon further auditing the code, we found several other failure modes that would similarly lead to node label removals under different conditions:
- We found other controllers, like nfd-gc, that also did not check the return value of the WaitForCacheSync() method, which would similarly cause node label removals. This was reported here and fixed here and here.
- We recently found newly introduced code where this mistake is repeated once again (reported here, yet to be fixed). This rather proves my point that implementing controllers correctly is inherently hard. NFD uses the Kubernetes Go client directly and manages the informer lifecycle (which is a low-level primitive) by hand; using a higher-level controller development framework like controller-runtime would make it impossible to get this wrong (see the sketch after this list).
- NodeFeature custom resources have their owner references set to the nfd-worker Pod, which is a bad idea, because these Pods get deleted all the time (during upgrades, for example) and Kubernetes would garbage-collect the NodeFeature resources along with them. This issue is reported here. The newer v0.16.6 version changes the parent object to the DaemonSet, which I think is still the wrong fix (deletion of a DaemonSet should not cause all node labels to be nuked). The NFD worker would treat the lack of the NodeFeature object as “node has no labels” and proceed with node label removals as well. This was reported here and is still not fully fixed in a way that makes sense to me.
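To illustrate the controller-runtime point from the list above, here is a minimal sketch (again, not NFD’s code) of a node-labeling reconciler built on the framework: the manager owns the shared cache, waits for it to sync, and fails to start rather than delivering events against an incomplete cache, so the WaitForCacheSync mistake has no place to happen.

```go
// Minimal sketch of a node-labeling reconciler built on controller-runtime.
// The manager owns the shared cache and won't deliver events to Reconcile
// (or keep running) unless the cache syncs.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

type nodeLabeler struct {
	client.Client
}

func (r *nodeLabeler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// By the time this runs, the cache behind r.Get is guaranteed to have synced.
	// ... compute and apply labels here ...
	return ctrl.Result{}, nil
}

func main() {
	ctrl.SetLogger(zap.New())

	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Complete(&nodeLabeler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	// Start blocks; it returns an error (instead of silently running) if the
	// caches never sync.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```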
Overall, we decided we’re probably better off not relying on a dynamic component for node labels: we’re moving all static node labels into the kubelet configuration, and using our existing node lifecycle management controllers to manage the dynamic node labels.
Conclusions
Kubernetes controllers are harder to write correctly than they look. Tooling and frameworks like controller-runtime are really good at giving you the impression that you have something working. That said, correctness and scalability aren’t obvious things to test for if you aren’t aware of the pitfalls you might be exposed to.
A big lesson I’ve learned over time is that you really need to deeply understand how an external off-the-shelf controller you bring into your ecosystem works, and do extensive due diligence, before putting the component in your critical path.