
Node Pool Management#

Subsystem Goal#

This subsystem is responsible for ensuring:

  • There are enough machine resources to complete the required work
  • Workloads are grouped by team/group/org to add an additional security boundary and support cost accounting

Hands-on Capabilities

This subsystem is designed for spinning up and tearing down nodes using AWS APIs, making it largely dependent on AWS environments. Therefore, it has limited functionality on local clusters. For details on how we manage on-prem clusters, see the section titled Running Node Pools on AWS vs. On-Prem below.

Components in Use#

  • Karpenter - provides the ability to define node provisioners using Kubernetes objects to support direct scaling
  • Cluster Autoscaler - provides the ability to scale machines up and down by leveraging AWS auto-scaling groups
  • Gatekeeper - by using Gatekeeper's mutation support, we can force tenant pods into their respective node pool

Background#

In Kubernetes, nodes are simply machines that can run workloads, and they can come and go. It is important to note that there is no native concept of a node pool in Kubernetes. However, we can build the idea of node pools out of other Kubernetes primitives.

Understanding Taints and Tolerations#

For our purposes, a node pool is essentially a collection of nodes designated to run specific workloads. A node pool might exist for a specific team, for a team's CI workloads, or for anything else we can think of. The goal is that node pools should be easy to define and flexible.

To create a node pool in Kubernetes, we will use a combination of taints and tolerations with node affinity and labels. The idea can be stated as follows:

  • Taints are put on a node to ensure pods aren't accidentally scheduled onto it
  • Tolerations are placed on pods to indicate they tolerate the taint
  • Node affinity requires the pod to be scheduled onto nodes with the specified labels
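
As a minimal sketch (the key, value, and image here are illustrative), a pod that opts into a hypothetical sample pool combines a toleration with node affinity like this:

apiVersion: v1
kind: Pod
metadata:
  name: node-pool-example
spec:
  # Tolerate the pool's taint so the scheduler will consider its nodes
  tolerations:
    - key: node-pool
      operator: Equal
      value: sample
      effect: NoSchedule
  # Require nodes that carry the pool's label
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-pool
                operator: In
                values: ["sample"]
  containers:
    - name: app
      image: nginx:alpine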

To play with taints/tolerations, try the following:

  1. Get the name of your local node:

    kubectl get nodes
    

    You should see output like the following:

    NAME             STATUS   ROLES                  AGE   VERSION
    docker-desktop   Ready    control-plane,master   30d   v1.21.4
    
  2. Let's add a taint to the node to prevent pods from accidentally being scheduled on this node.

    Run this command, replacing docker-desktop with the name of your node:

    kubectl taint node docker-desktop node-pool=sample:NoSchedule
    
  3. Now, try launching a new pod.

    kubectl run --namespace=default --image=nginx:alpine taint-experiment
    

    You should see that the pod object was created, but it won't actually start; the scheduler leaves it in a Pending state, as shown below.
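
    Confirm this by checking the pod's status (the AGE shown is just an example):

    kubectl get pods -n default

    NAME               READY   STATUS    RESTARTS   AGE
    taint-experiment   0/1     Pending   0          30s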

  4. If we describe the pod, we should see why it doesn't start.

    kubectl describe pod -n default taint-experiment
    

    And output:

    Name:         taint-experiment
    Namespace:    default
    ...
    Events:
    Type     Reason            Age   From               Message
    ----     ------            ----  ----               -------
    Warning  FailedScheduling  79s   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-pool: sample}, that the pod didn't tolerate.
    Warning  FailedScheduling  7s    default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-pool: sample}, that the pod didn't tolerate.
    

    It's the taint that is preventing the pod from being scheduled on the node!

  5. Let's add a toleration to the pod and see if it starts up. We're going to add the following YAML to the pod spec:

    spec:
      tolerations:
        - effect: NoSchedule
          key: node-pool
          operator: Equal
          value: sample
    

    The command to do so is:

    kubectl patch pod taint-experiment -n default --type=json -p='[{"op": "add", "path": "/spec/tolerations/-", "value": {"effect":"NoSchedule", "key":"node-pool", "operator":"Equal", "value":"sample"}}]'
    
  6. Examine the pod now and you should see that it's successfully running!

    kubectl get pods -n default
    

    And you should see output similar to the following:

    NAME               READY   STATUS    RESTARTS   AGE
    taint-experiment   1/1     Running   0          3m36s
    
  7. Before we go too much further, let's remove our test pod and the taint. And yes... the syntax to remove a taint looks a little odd.

    kubectl delete pod taint-experiment
    kubectl taint node docker-desktop node-pool=sample:NoSchedule-
    

Defining our Node Pools with Karpenter#

Now that we understand how taints and tolerations work, let's look at how we actually manage the various node pools. Using Karpenter, we can simply define a Provisioner, which provides configuration on how to spin up nodes.

The following Provisioner will be able to create nodes carrying a taint like the one we used before and also adds labels that can be used for node affinity. We'll talk about the cost code pieces in a moment.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: platform-docs
spec:
  # Put taints on the nodes to prevent accidental scheduling
  taints:
    - key: platform.it.vt.edu/node-pool
      value: platform-docs
      effect: NoSchedule

  # Scale down nodes after they have been empty for this many seconds (if unset, empty nodes are not scaled down)
  ttlSecondsAfterEmpty: 300

  # Kubernetes labels to be applied to the nodes
  labels:
    platform.it.vt.edu/cost-code: platform
    platform.it.vt.edu/node-pool: platform-docs

  provider:
    instanceProfile: karpenter-profile

    # Tags used to discover the subnets/security groups to use
    securityGroupSelector:
      Name: "*eks_worker_sg"
      kubernetes.io/cluster/vt-common-platform-prod-cluster: owned
    subnetSelector:
      karpenter.sh/discovery: "*"

    # Tags to be applied to the EC2 nodes themselves
    tags:
      CostCode: platform
      Project: platform-docs
      NodePool: platform-docs

Now, when a pod is defined but unable to be scheduled (for example, due to a lack of available resources), Karpenter will try to fix the issue. Karpenter will use this Provisioner if the pod spec has matching node affinity (or node selector) and tolerations.
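
As a quick illustration (the pod name and image are placeholders), a pod targeting this pool would carry the matching node selector and toleration:

apiVersion: v1
kind: Pod
metadata:
  name: docs-example
spec:
  # Select nodes created by the platform-docs Provisioner
  nodeSelector:
    platform.it.vt.edu/node-pool: platform-docs
  # Tolerate the taint the Provisioner places on those nodes
  tolerations:
    - key: platform.it.vt.edu/node-pool
      operator: Equal
      value: platform-docs
      effect: NoSchedule
  containers:
    - name: docs
      image: nginx:alpine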

Supporting Cost Accounting#

One of the original goals of the platform was to gain an understanding of how much each team was spending on machine resources (acknowledging that there are other costs associated with running a platform beyond just machines). To support this, we are leveraging AWS Cost Allocation tags on the machine resources themselves. Both the CostCode and Project tags are cost allocation tags, allowing us to track spending accurately at both the team and project levels.

  • CostCode - represents the higher-level team/organization
  • Project - the individual node pool

By separating these tags, we can have separate node pools for different functions (such as dev, CI, or production), yet roll the costs into a higher-level team/org cost.
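
For example (a sketch; the dates and metric below are placeholders), spend can then be broken down by these tags with the AWS CLI:

aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=CostCode Type=TAG,Key=Project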

In addition to AWS tags, we are also utilizing Kubecost, a Kubernetes cost monitoring and management tool. Kubecost provides real-time cost visibility and insights into our Kubernetes clusters, enabling us to allocate costs down to the level of individual namespaces, workloads, and even labels. This ensures that we not only understand the cost at a broader level but can also monitor and optimize expenses for specific Kubernetes resources across the platform.
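
As a sketch (the namespace, service name, and port are assumptions based on a default Kubecost install), the Kubecost allocation API can be queried to break costs down by namespace:

kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
curl "http://localhost:9090/model/allocation?window=7d&aggregate=namespace"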

Forcing Tenants into their Node Pools#

Just as we did with log forwarding, we can leverage Gatekeeper's mutation support to mutate pods and add the correct toleration and node selector. Doing this, the idea of node pools can be mostly invisible to the tenants themselves.

The following will add a nodeSelector to all pods in the sample-tenant namespace, forcing them to run on nodes with a label of platform.it.vt.edu/node-pool=sample-pool. Note that this uses Gatekeeper's Assign mutator, since AssignMetadata can only modify labels and annotations.

apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: sample-tenant-nodepool-selector
  namespace: gatekeeper-system
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["Pod"]
    namespaces: ["sample-tenant"]
  location: "spec.nodeSelector"
  parameters:
    assign:
      value:
        platform.it.vt.edu/node-pool: "sample-pool"

And then the following mutation will add a toleration, which will allow the pod to actually run on the nodes with the node pool taints.

apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: sample-tenant-nodepool-toleration
  namespace: gatekeeper-system
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["Pod"]
    namespaces: ["sample-tenant"]
  location: "spec.tolerations"
  parameters:
    assign:
      value:
        - key: platform.it.vt.edu/node-pool
          operator: "Equal"
          value: "sample-pool"

The landlord chart makes it possible for us to define the node pools themselves (which creates the Provisioner objects) and define the necessary mutations needed for tenant workloads to run on the correct nodes.
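
As a purely hypothetical sketch (these field names are illustrative, not the chart's actual schema), a landlord definition might tie the pieces together like so:

# Hypothetical landlord values -- field names are illustrative only
tenants:
  sample-tenant:
    costCode: sample-org   # would become the CostCode tag / cost-code label
    nodePool: sample-pool  # would drive the Provisioner and the Gatekeeper mutations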

Running Node Pools on AWS vs. On-Prem#

In our platform, node pool management varies depending on the environment. We use Karpenter for AWS-based clusters and EKS Anywhere (EKSA) for on-prem clusters. EKSA brings the power and flexibility of AWS EKS to on-premises environments, allowing us to maintain a consistent Kubernetes experience across both cloud and on-prem setups.

Karpenter is tightly integrated with AWS services and is specifically designed to manage node pools within AWS. For our on-prem clusters, which are built on VMware, we configure node pools using EKSA to suit the specific needs of our on-prem infrastructure.

Here’s a simplified example of a node pool configuration in an on-prem environment using EKSA:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: example-nodepool
  namespace: default
spec:
  datastore: global-datastore
  diskGiB: 50
  memoryMiB: 8192
  numCPUs: 4
  osFamily: bottlerocket
  resourcePool: /AISB-Common-Platform/host/plat-isb-cluster/Resources

This configuration defines the resource allocation for a specific node pool in a VMware-based cluster. It specifies the datastore, disk size, memory, CPU count, and operating system for the nodes, ensuring that each node pool is optimized for its intended workload.
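
To round out the sketch (the cluster and node group names are placeholders), this machine config is referenced from a worker node group in the EKSA Cluster spec, where the same node-pool taint and labels can be applied:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: example-cluster
spec:
  # ...control plane and other cluster configuration omitted...
  workerNodeGroupConfigurations:
    - name: example-nodepool
      count: 3
      machineGroupRef:
        kind: VSphereMachineConfig
        name: example-nodepool
      labels:
        platform.it.vt.edu/node-pool: example-nodepool
      taints:
        - key: platform.it.vt.edu/node-pool
          value: example-nodepool
          effect: NoSchedule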

By leveraging EKSA’s integration with VMware, we achieve smooth operations within our existing IT infrastructure while maintaining consistency across our cloud and on-prem environments. This approach ensures that our resources are utilized efficiently and securely, regardless of where the clusters are running.