Last month Google introduced GKE Autopilot. It’s a Kubernetes cluster that feels serverless: you don’t see or manage machines, it auto-scales for you, it comes with some limitations, and you pay for what you use: per-Pod, per-second (CPU/memory), instead of paying for machines.
In this article, I’ll do a hands-on review of GKE Autopilot by poking at its nodes and API, and by running a 0-to-500 Pod autoscaling test to see how well it scales from a user’s perspective.
- Cluster creation
- Poking at nodes
- System Pods (kube-system)
- Resource limits/requests and overriding behavior
- Autoscaling under pressure: zero to 500 pods
- Conclusion
New Autopilot Cluster
GKE cluster creation has always been super simple, a single command that took you to a managed Kubernetes cluster:
gcloud container clusters create [NAME] [--zone=...|--region=...]
Creating an Autopilot cluster is no different, just a single-word change in the command:
gcloud container clusters create-auto [NAME] [--zone=...|--region=...]
Creating the cluster took 6 minutes, a little longer than creating a single-zone GKE Standard cluster (3.2 minutes). I think that’s okay, since you probably aren’t creating tens of clusters per day.
As part of the Autopilot launch, the Cloud Console also got a nice redesign that helps you choose a cluster type:
Poking at Nodes
It seems my cluster started with 2 nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gk3-autopilot-cluster-default-pool-bcd71fbe-6qh9 Ready <none> 5m39s v1.18.15-gke.1501
gk3-autopilot-cluster-default-pool-cfb54b99-qsk7 Ready <none> 5m27s v1.18.15-gke.1501
Funny enough, GKE used to prefix VMs with gke-, but these are prefixed with gk3-, and I’m not sure why (and it doesn’t really matter).
Unlike technologies like virtual-kubelet, there are still real nodes in the cluster, but these VMs are inaccessible to me. If I list my Compute Engine VM instances, I don’t see anything:
$ gcloud compute instances list
Listed 0 items.
However, if I run kubectl describe on these nodes, I can find their name, compute zone and external IP address. This is enough information to attempt to SSH into the node:
$ gcloud compute ssh --zone=us-central1-f gk3-autopilot-cluster-default-pool-cfb54b99-qsk7
This account is currently not available.
Connection to 34.123.136.111 closed.
As you can see, opening an SSH session to nodes is blocked by design. If you could get SSH access to an Autopilot node, you could run things on the node without going through the Kubernetes API. Google runs a vulnerability reward program for those who manage to get (unintended) access to Autopilot nodes.
System Pods (kube-system)
You don’t have access to kube-system beyond querying objects. It’s a read-only namespace, to prevent tampering with the cluster components:
kubectl delete pods --all -n kube-system
Error from server (Forbidden): pods "event-exporter-gke-564fb97f9-5g5xj" is
forbidden: User "[email protected]" cannot delete resource "pods" in API group ""
in the namespace "kube-system": GKEAutopilot authz: the namespace
"kube-system" is managed and the request's verb "delete" is denied.
In the error message above, I notice there might be a custom authorization layer ("GKEAutopilot authz") written in addition to Kubernetes’ RBAC. I’m guessing this is done to prevent cluster admins from tampering with the system namespace.
Here’s the list of Pods, nothing interesting stands out:
kubectl get pods -n kube-system
NAME READY
event-exporter-gke-564fb97f9-5g5xj 2/2
fluentbit-gke-224zv 2/2
fluentbit-gke-zvxkc 2/2
gke-metadata-server-4bdsq 1/1
gke-metadata-server-hvvm6 1/1
gke-metrics-agent-2kt54 1/1
gke-metrics-agent-rqp52 1/1
kube-dns-57fcf698d8-4jtzc 4/4
kube-dns-57fcf698d8-nmwrt 4/4
kube-dns-autoscaler-7f89fb6b79-vg7gx 1/1
kube-proxy-gk3-autopilot-cluster-[...] 1/1
kube-proxy-gk3-autopilot-cluster-[...] 1/1
l7-default-backend-7fd66b8b88-wwfrv 1/1
metrics-server-v0.3.6-7c5cb99b6f-j8mxx 2/2
netd-bbct4 1/1
netd-ckd5z 1/1
node-local-dns-h9w2s 1/1
node-local-dns-vtmlk 1/1
pdcsi-node-bnzqm 2/2
pdcsi-node-l2gr8 2/2
stackdriver-metadata-agent-cluster-level-[...] 2/2
Based on my observation, GKE runs at least 2 nodes at all times to host these cluster components. You are not paying for these kube-system pods, but there’s a per-cluster fee ($0.10/hour), with the first cluster free.
Resource Limits/Requests and overriding behavior
GKE Autopilot is somewhat restrictive when it comes to CPU and memory requests: CPU requests are allowed in 250m increments (10m for DaemonSets), and the memory-to-CPU ratio must fall in the allowed range of 1.0 to 6.5 GiB per vCPU.
I wanted to see what happens if I specify an invalid cpu or memory combination on my Deployment, as well as different requests vs limits.
I found out that:
- Autopilot replaces limits with the given requests
- invalid CPU/memory values are rounded up to a valid value
When GKE Autopilot silently overwrites limits with the requests (or corrects the cpu/memory specs to allowed values), there are no visible warnings or errors. However, Autopilot adds an autopilot.gke.io/resource-adjustment annotation on the Deployment when a resource config is overridden (and I’m told it will soon present a kubectl warning when this happens).
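If you want to see what was changed, you can read that annotation back from the object. Here’s a quick sketch (assuming a Deployment named nginx, like the one used later in this article):

$ kubectl get deployment nginx \
    -o jsonpath='{.metadata.annotations.autopilot\.gke\.io/resource-adjustment}'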
Cloud Run is similar in this regard (you’re charged for memory/CPU during requests to the container instance), as it ignores requests and uses limits.
When you specify a value that’s invalid, the control plane will silently round up your input to the nearest allowed value.¹ For example, cpu: 100m becomes 250m, and 255m becomes 500m.
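Putting the two behaviors together, here’s a sketch of a container resources block and what Autopilot turns it into (values picked purely for illustration):

resources:
  requests:
    cpu: "100m"      # rounded up to 250m, the nearest allowed increment
    memory: "512Mi"
  limits:
    cpu: "2"         # effectively ignored: limits are replaced with the (adjusted) requests
    memory: "1Gi"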
I tried to see which component overrides the resource spec inputs, but since querying mutatingwebhookconfigurations is forbidden², I could not find anything. I also browsed the apiserver /logs endpoint through kubectl proxy, but could not see any messages or Kubernetes Event objects warning the user about this behavior.
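For reference, here’s roughly how to poke at both of those yourself (a sketch, using kubectl proxy’s default port):

$ kubectl get mutatingwebhookconfigurations   # forbidden on Autopilot
$ kubectl proxy &                             # proxies the Kubernetes API on localhost:8001
$ curl http://localhost:8001/logs/            # lists the log files the apiserver exposes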
Autoscaling under pressure (0→500 Pods)
Before we start, it’s useful to remind yourself that Kubernetes is not really designed for rapid scale-up events, since there are still machines that need to be provisioned under the covers. That’s why Kubernetes users typically over-provision compute capacity and leave some room for dynamic scaling based on load patterns.
Keep in mind that the autoscaling scenario we’re about to try is close to the worst case; you would very rarely need to go from 0 to 500 Pods in a very short timeframe.
That said, it’s possible to add spare capacity to GKE Autopilot by creating “balloon pods” that hold onto the nodes and get preempted when there are actual workloads to run on these nodes (such as burst scale-up).
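Here’s a minimal sketch of that idea (the PriorityClass name, replica count and Pod sizes are placeholders I made up, not an official recipe): a low-priority Deployment of pause containers reserves capacity, and those Pods are evicted as soon as real workloads need the room.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority
value: -10                    # below the default priority (0), so any workload can preempt it
preemptionPolicy: Never       # balloon Pods never preempt anything themselves
globalDefault: false
description: "Placeholder priority for capacity-reserving balloon Pods."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 5
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon-priority
      terminationGracePeriodSeconds: 0
      containers:
      - name: pause
        image: registry.k8s.io/pause   # does nothing; just holds the reservation
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"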
If you want the ability to have burst scaling (e.g. 0-to-1000 instances instantly), without pre-provisioning infrastructure, you might want to check out Cloud Run.
Now let’s put the autoscaling promise to the test and see how fast the cluster scales up to meet the demand created by unscheduled Pods. I will deploy a 500-replica Deployment using the minimum allowed resources (250m CPU, 512 MiB memory).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 500
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: app
        image: nginx
        resources:
          limits:
            memory: "512Mi"
            cpu: "250m"
NOTE
By deploying too many Pods, we easily run out of capacity, which triggers the cluster autoscaler to add many more nodes to the cluster, which in turn triggers a master resize event on GKE.
The master resize happens on GKE because the VMs that run the Kubernetes API server and etcd need to be resized to a larger machine type when the node count grows.
This would normally mean Kubernetes API downtime, but Autopilot clusters are regional, which means we run 3 master nodes in different zones, and therefore the Kubernetes API stays up. While this event is happening, your GKE cluster will show status RECONCILING.
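To reproduce the counts below, one-liners like these can be run in a loop (my own helper commands, not part of the original methodology; nginx-500.yaml is a placeholder filename for the manifest above):

$ kubectl apply -f nginx-500.yaml
$ kubectl get nodes --no-headers | awk '{print $2}' | sort | uniq -c   # nodes by Ready/NotReady
$ kubectl get pods --no-headers | awk '{print $3}' | sort | uniq -c    # Pods by status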
Below are the observed results from the first test run.
Node count over time (first run):
t | Ready | NotReady |
---|---|---|
0 | 2 | 0 |
2m | 2 | 13 |
3m | 3 | 40 |
5m | 7 | 36 |
7m | 25 | 19 |
10m | 29 | 16 |
11m | 40 | 5 |
13m45s | 48 | 0 |
Pod count over time per phase (first run):
t | Running | Not Running |
---|---|---|
0 | 0 | 0 |
10s | 1 | 44 |
1m30s | 1 | 110 |
3m50s | 1 | 405 |
4m | 7 | 403 |
6m | 114 | 329 |
9m | 250 | 146 |
13m | 473 | 26 |
14m | 500 | 0 |
We reached 500 running Pods in 13 minutes; however, I could not help but notice:
- It took 4 minutes for 400 out of 500 Pods to have a record on the API (even as Pending). I think this is because of the ongoing “master resize” event (though there might also be some rate limiting), but it did not reproduce on the second run.
- Time to run the second Pod is about 4 minutes. It reproducibly requires a new node to be provisioned to run the second Pod. (It seems there’s only enough leftover space in the first two nodes running system pods.) This takes several minutes, so it might feel awkward while getting started. Upon further testing, it seems this was caused by the “master resize” that occurred due to the growing cluster size. Further testing showed that it takes roughly 1.5 minutes to add a new node to run a “pending” Pod.
- The cluster autoscaler could not immediately determine how many nodes it needed. The cluster eventually reached 48 nodes, but the autoscaler took several minutes to realize we needed ~40 nodes.
- There are over 200 Pods left in OutOfpods and OutOfcpu status, which means the node has hit its max Pod count or ran out of CPU allocation. This doesn’t impact functionality, but it definitely pollutes the kubectl get pods output, and there doesn’t seem to be a controller cleaning these API entries. They seem to stay there as long as the underlying ReplicaSet exists. (See the cleanup sketch after this list.)
- When I ran the test again, I noticed the same workload fit in 35 nodes (as opposed to 48). I don’t know why this disparity occurs, but it does not impact GKE Autopilot users since you aren’t paying for the nodes.
- Despite deploying 500 identical Pods, the underlying nodes were created using a cocktail of e2-micro, e2-medium, e2-standard-4 and e2-standard-8 machine types. Similar to the previous bullet point, I think the autoscaler does not act deterministically, and the timing of pending Pods influences its decisions.
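If the leftover OutOfpods/OutOfcpu entries bother you, a one-liner like this should clear them, since those Pods end up in the Failed phase (my own cleanup command, not an official recommendation):

$ kubectl delete pods --field-selector=status.phase=Failed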
Below are the observed results from the second test run. I deleted the first Deployment (and let the worker nodes scale back down to 3). This time the masters are already on larger VMs and will not be resized during the rollout:
Node count over time (second run):
t | Ready | NotReady |
---|---|---|
0 | 3 | 0 |
2m | 8 | 29 |
2m30s | 37 | 0 |
Pod count over time per phase (second run):
t | Running | Not Running |
---|---|---|
0 | 2 | 141 |
1m | 26 | 474 |
2m20s | 105 | 384 |
2m40s | 497 | 2 |
3m | 500 | 0 |
The second run shows that once the master resize is out of the way, scaling up the cluster by adding more nodes and reaching 500 Pods is a lot quicker. This time we reached 500 running Pods in 3 minutes!
Conclusion
I’m excited about GKE Autopilot. Finding the optimal size for your cluster and nodes can be quite tricky. GKE Autopilot solves that problem (and a lot more) for you, and gives you the Kubernetes API as the abstraction layer.
As with all managed services, it may seem more expensive per compute unit (CPU/memory/seconds) compared to GKE Standard. However, comparing GKE Autopilot and GKE Standard is not apples to apples.
GKE Autopilot can be cheaper³, considering the engineer-hours, the opportunity cost and the total cost of ownership (TCO) you would otherwise spend managing a cluster and ensuring high utilization on the nodes you’re paying for.
Watch this space, as I think GKE Autopilot will become the preferred way of running Kubernetes workloads on Google Cloud and the list of limitations will only get shorter from here.
Hope you enjoyed this review. Follow @WilliamDenniss and his blog (as well as GKE product blog) to keep up with GKE Autopilot news. (Thanks to the GKE Autopilot team for reading drafts of this and providing corrections.)
1. An invalid CPU/memory combination could’ve been an error in the API, but that would make migrations from GKE Standard harder. Similarly, AWS Fargate also rounds up CPU/memory configurations. ↩︎
2. I’m not sure why listing mutatingwebhookconfigurations is forbidden. (I realize creating mutating webhooks is forbidden, which is a current limitation of Autopilot.) That being said, you can list/write validatingwebhookconfigurations. ↩︎
3. Not to mention, you are not paying for the OS/kernel overhead and the compute resources used by kube-system pods, which take away from your “node allocatable” space that you pay for in GKE Standard. ↩︎