Video: https://www.youtube.com/watch?v=0O_06RNEiL4
The vanilla scheduler is becoming insufficient for increasingly diverse workloads.
Advanced scheduling requirements: gang scheduling, topology awareness, GPU bin-packing, etc.
The Scheduling Framework offers plugin extension points for building custom scheduling behavior - we can leverage this to build different types of schedulers.
There are different methods for extending the vanilla scheduler
- Modify the scheduler directly (ad-hoc)
- Use webhook extensions (the scheduler extender) - extension points are very limited (e.g. post-filter, post-score), unlike the Scheduling Framework's plugins; the main scheduler calls an external service over HTTP at only those handful of stages, and the webhook round-trips add latency, so performance is low
- Multiple schedulers - run several independent schedulers side by side; this has high overhead and can cause conflicts (multiple schedulers competing for the same resources), gives medium performance, and requires compiling a separate binary for each scheduler
- Scheduling Framework - write lightweight plugins that extend the default scheduler and can hook into any stage; you do need to recompile the scheduler, but it is still just one scheduler running
In essence approaches three and four are similar
- One should aim for approach four: a single scheduler with multiple plugins, to avoid running several different schedulers. If separate schedulers are absolutely needed, you would compile approach four multiple times (one binary per custom scheduler).
- Approach three is the older way to solve the problem; it may not even involve plugins, just completely new binaries compiled and running concurrently.
⏳ So, When Do You Need to Recompile?
Action | Recompilation Required? | Why? |
---|---|---|
Writing a new plugin inside the scheduler code (like hardcoding a plugin inside the scheduler that must be run in a certain stage across all schedulers) | ✅ Yes | The scheduler binary must include the new plugin |
Modifying an existing plugin inside the scheduler code | ✅ Yes | The binary must be updated |
Enabling/disabling an already-compiled plugin via config | ❌ No | The plugin is already built; just update the config |
Changing a plugin running as an external gRPC service | ❌ No | The scheduler is unchanged; the plugin runs separately |
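For the "enable/disable via config" row, the knob is the scheduler's KubeSchedulerConfiguration file. A minimal sketch, assuming a plugin named CustomFilterPlugin (the hypothetical plugin from the example below) is already compiled into the binary:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      filter:
        enabled:
          - name: CustomFilterPlugin   # already compiled in; config just turns it on
        # disabled:
        #   - name: NodeResourcesFit   # built-in plugins can be switched off the same way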
A single scheduler with “multiple plugins” could look like the following:
Custom filtering for CPU vs. GPU workloads
Example Code for a Custom Plugin (CPU vs. GPU Filtering)
package customfilter

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// CustomFilterPlugin filters nodes based on the pod's "workload-type" label.
type CustomFilterPlugin struct{}

// Name returns the plugin name used in the scheduler configuration.
func (p *CustomFilterPlugin) Name() string { return "CustomFilterPlugin" }

// Filter rejects nodes that cannot satisfy the pod's declared workload type.
func (p *CustomFilterPlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // Check pod labels or resource requests
    if val, exists := pod.Labels["workload-type"]; exists {
        if val == "CPU-intensive" {
            if !nodeHasEnoughCPU(nodeInfo) {
                return framework.NewStatus(framework.Unschedulable, "Node lacks CPU resources")
            }
        } else if val == "GPU-intensive" {
            if !nodeHasGPU(nodeInfo) {
                return framework.NewStatus(framework.Unschedulable, "Node lacks GPU resources")
            }
        }
    }
    return framework.NewStatus(framework.Success, "")
}

// Helper functions to check resources

// nodeHasEnoughCPU checks the node's allocatable CPU against an example threshold.
func nodeHasEnoughCPU(nodeInfo *framework.NodeInfo) bool {
    return nodeInfo.Allocatable.MilliCPU >= 2000 // Example threshold: 2 CPUs
}

// nodeHasGPU reports whether the node advertises any nvidia.com/gpu capacity.
func nodeHasGPU(nodeInfo *framework.NodeInfo) bool {
    if gpu, ok := nodeInfo.Node().Status.Capacity["nvidia.com/gpu"]; ok {
        return !gpu.IsZero()
    }
    return false
}
CPU & GPU Pod
apiVersion: v1
kind: Pod
metadata:
  name: cpu-workload
  labels:
    workload-type: "CPU-intensive" # This is what the scheduler plugin reads
spec:
  containers:
    - name: app
      image: my-app
      resources:
        requests:
          cpu: "2"
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  labels:
    workload-type: "GPU-intensive"
spec:
  containers:
    - name: app
      image: my-gpu-app
      resources:
        limits:
          nvidia.com/gpu: "1" # This will trigger GPU filtering
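To actually use CustomFilterPlugin, it has to be compiled into the scheduler binary (per the recompilation table above) and registered with the scheduler command. A minimal sketch, assuming the plugin lives at the placeholder module path example.com/custom-scheduler/customfilter; note the plugin factory signature has changed across Kubernetes versions (recent releases pass a context):
package main

import (
    "context"
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/kubernetes/cmd/kube-scheduler/app"
    "k8s.io/kubernetes/pkg/scheduler/framework"

    customfilter "example.com/custom-scheduler/customfilter" // placeholder import path
)

// newCustomFilter is the plugin factory the scheduler calls at startup.
// (Older Kubernetes releases use a factory without the context parameter.)
func newCustomFilter(_ context.Context, _ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    return &customfilter.CustomFilterPlugin{}, nil
}

func main() {
    // One scheduler binary: all the default plugins plus our custom one.
    cmd := app.NewSchedulerCommand(app.WithPlugin("CustomFilterPlugin", newCustomFilter))
    if err := cmd.Execute(); err != nil {
        os.Exit(1)
    }
}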
Scheduling for StatefulPod: say we have a static IP address and we want to ensure the pod lands on the node/rack that holds this IP. How do we handle this?
PreFilter stage (gathers relevant info before node filtering happens): first it syncs IP reservation information, fetching IP allocation details via the node/pod informers; it also checks whether the pod is stateful and requires an IP reservation.
Filter stage: checks the pod's requirements and whether the node has free IP addresses available.
Note: PreFilter runs once per pod scheduling attempt (it fetches global data that all Filter calls can use), while the Filter stage runs for every node - with 1,000 nodes the Filter function runs 1,000 times for a single pod. This is why PreFilter reduces redundant API queries when evaluating many nodes. A minimal sketch of this pattern follows.
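A minimal sketch of the PreFilter/Filter split, assuming a hypothetical IPReservationPlugin and lookup helper; the data is stashed in the CycleState so Filter can read it per node without extra API calls (the PreFilter signature shown matches recent scheduler framework versions):
package ipreservation

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

const ipStateKey = framework.StateKey("ip-reservation-state")

// IPReservationPlugin is a hypothetical plugin that honors static IP reservations.
type IPReservationPlugin struct{}

// ipReservationState is the per-pod data computed once in PreFilter.
type ipReservationState struct {
    nodesWithFreeIP map[string]bool // node name -> has a free reserved IP
}

func (s *ipReservationState) Clone() framework.StateData { return s }

func (p *IPReservationPlugin) Name() string { return "IPReservation" }

// PreFilter runs once per pod: gather IP reservation info from informer-backed caches.
func (p *IPReservationPlugin) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
    info := &ipReservationState{nodesWithFreeIP: lookupIPReservations(pod)} // hypothetical helper
    state.Write(ipStateKey, info)
    return nil, framework.NewStatus(framework.Success, "")
}

func (p *IPReservationPlugin) PreFilterExtensions() framework.PreFilterExtensions { return nil }

// Filter runs once per node: read the cached state instead of hitting the API server again.
func (p *IPReservationPlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    data, err := state.Read(ipStateKey)
    if err != nil {
        return framework.AsStatus(err)
    }
    info := data.(*ipReservationState)
    if !info.nodesWithFreeIP[nodeInfo.Node().Name] {
        return framework.NewStatus(framework.Unschedulable, "node has no free reserved IP")
    }
    return framework.NewStatus(framework.Success, "")
}

// lookupIPReservations is a stand-in for reading IP allocation data from a lister/cache.
func lookupIPReservations(pod *v1.Pod) map[string]bool {
    return map[string]bool{} // placeholder
}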
Gang Scheduling (all-or-nothing scheduling): the Coscheduling plugin
- Ensure pods of the same PodGroup are scheduled together (QueueSort stage)
- Create or update the PodGroup if not present, and validate whether the total number of pods belonging to the PodGroup reaches minAvailable (PreFilter)
- Unreserve if the waiting pods exceed the timeout limit (Reserve/Unreserve)
- If the PodGroup meets minAvailable, allow its pods to get scheduled; otherwise put the pod in a waiting state (Permit) - see the sketch below
A pod informer is also needed so the PodGroup gets cleaned up or updated.
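A simplified Permit-stage sketch of the gang-scheduling idea (not the actual Coscheduling plugin; podGroupOf and podGroupNameOf are hypothetical helpers that read the pod's group membership and minAvailable): pods wait at Permit until enough group members have arrived, then all waiting pods of the group are released.
package coscheduling

import (
    "context"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// Coscheduling holds a framework.Handle so it can inspect waiting pods.
type Coscheduling struct {
    handle framework.Handle
}

func (cs *Coscheduling) Name() string { return "Coscheduling" }

// Permit runs after a node has been chosen for the pod but before binding.
func (cs *Coscheduling) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
    groupName, minAvailable := podGroupOf(pod) // hypothetical helper
    if groupName == "" {
        return framework.NewStatus(framework.Success, ""), 0 // not part of a gang
    }

    // Count this pod plus the group's pods already parked at Permit.
    ready := 1
    cs.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
        if podGroupNameOf(wp.GetPod()) == groupName { // hypothetical helper
            ready++
        }
    })

    if ready < minAvailable {
        // Not enough members yet: wait, with a timeout after which the pod is rejected.
        return framework.NewStatus(framework.Wait, ""), 30 * time.Second
    }

    // Quorum reached: release every waiting pod of the same group.
    cs.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
        if podGroupNameOf(wp.GetPod()) == groupName {
            wp.Allow(cs.Name())
        }
    })
    return framework.NewStatus(framework.Success, ""), 0
}

// podGroupOf and podGroupNameOf are stand-ins for reading PodGroup membership
// (e.g. from a label) and the group's minAvailable (e.g. from a PodGroup object).
func podGroupOf(pod *v1.Pod) (string, int) { return pod.Labels["pod-group"], 2 }
func podGroupNameOf(pod *v1.Pod) string    { return pod.Labels["pod-group"] }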
Future features:
- Custom preemption: if we evict one pod in a PodGroup, should we evict all pods in the PodGroup?
- Reservation with backfill: say two PodGroups are waiting for resources, one smaller and one larger. Backfill lets the smaller PodGroup be scheduled temporarily while waiting for the larger one; when the larger group is ready, it preempts the smaller group.
- Generic sorting plugins: treat sorting of pods and PodGroups as the same thing, so no custom logic is needed.
For very large clusters - hundreds, thousands, or even tens of thousands of nodes - the Kubernetes scheduler is limiting because it schedules pod by pod; these clusters also deal with autoscaling, so having each pod evaluated against thousands of nodes becomes infeasibly slow.
Optimizations for large clusters: customizing scheduler parameters
- percentageOfNodesToScore - the scheduler stops searching for feasible nodes once it has found them on this percentage of the cluster, and only those nodes are scored, instead of evaluating every node. It can be set globally and, in newer releases, per scheduling profile. If unset, the default is adaptive: roughly 50% for a cluster just over 100 nodes, decreasing linearly with cluster size down to a 5% floor (and at least 100 feasible nodes are always sought). An example configuration follows.
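A sketch of setting it in the KubeSchedulerConfiguration (the per-profile field is only available in newer Kubernetes releases):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 30     # global: stop once feasible nodes are found on 30% of the cluster
profiles:
  - schedulerName: default-scheduler
    percentageOfNodesToScore: 20 # per-profile override (newer releases only)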
Group scoring: score a group of pods with identical resource requirements at the same time. Sorting all the pods at the sort stage (PodGroupSort - a custom sort plugin implemented in custom schedulers like Volcano or via custom Kubernetes scheduler plugins) is more efficient because it lets the scheduling steps run once for the entire group, as opposed to scoring separately for each pod.
Then we score nodes once for a single pod in the group and assign the top k scoring nodes to the k pods, as sketched below.
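A toy sketch of the top-k assignment idea (illustrative only, not scheduler framework code): score nodes once for a representative pod, then hand the k best nodes to the k identical pods.
package main

import (
    "fmt"
    "sort"
)

type nodeScore struct {
    node  string
    score int
}

// assignTopK sorts the node scores once and maps the k best nodes onto k identical pods.
func assignTopK(pods []string, scores []nodeScore) map[string]string {
    sort.Slice(scores, func(i, j int) bool { return scores[i].score > scores[j].score })
    assignment := map[string]string{}
    for i, pod := range pods {
        if i >= len(scores) {
            break // fewer feasible nodes than pods
        }
        assignment[pod] = scores[i].node
    }
    return assignment
}

func main() {
    pods := []string{"worker-0", "worker-1", "worker-2"}
    scores := []nodeScore{{"node-a", 72}, {"node-b", 91}, {"node-c", 55}, {"node-d", 88}}
    fmt.Println(assignTopK(pods, scores)) // worker-0 -> node-b, worker-1 -> node-d, worker-2 -> node-a
}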
If multiple plugins are enabled in a stage, they run sequentially in a predefined order
PluginConfig allows passing additional args to a plugin; an example follows.
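A sketch of pluginConfig in the KubeSchedulerConfiguration, using the hypothetical CustomFilterPlugin from earlier (for custom plugins the args block is an arbitrary object that the plugin's factory decodes itself):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: CustomFilterPlugin
        args:                          # hypothetical fields, decoded by the plugin
          cpuThresholdMilli: 2000
          gpuResourceName: nvidia.com/gpu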
The performance difference between the webhook-based Scheduler Extender and a scheduler plugin is drastic.
- A Predicate Extender (which is webhook-based) can take up to 50% of the scheduling algorithm duration.
- A FilterPlugin (which is a scheduler plugin-based approach) takes only up to 4% of the scheduling algorithm duration.
Coscheduling
Informers (Pod & Node)
The pod informer tracks pod states, assignments, and metadata (e.g. does this pod already have a reserved IP?); the node informer tracks node availability, capacity, labels, and conditions (e.g. does this node have enough IPs?).
Informers are always running in the background, keeping an up-to-date cache of Kubernetes objects. They reduce API server load by letting the scheduler use local data instead of making constant queries.
Default Kubernetes informers run in the scheduler - the scheduler itself is a controller that contains informers to watch nodes, pods, and persistent volumes. These informers are started inside the scheduler process itself. A minimal typed-informer example follows.
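A minimal client-go sketch of the same machinery - a shared informer factory for pods and nodes (clientset construction is assumed/omitted):
package main

import (
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

func runInformers(clientset kubernetes.Interface, stopCh <-chan struct{}) {
    // One factory shares watches and caches across every informer it creates.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)

    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            pod := obj.(*v1.Pod)
            fmt.Println("pod added:", pod.Namespace+"/"+pod.Name)
        },
    })

    nodeLister := factory.Core().V1().Nodes().Lister() // reads come from the local cache

    factory.Start(stopCh)            // start the watches in the background
    factory.WaitForCacheSync(stopCh) // block until the caches are warm

    nodes, _ := nodeLister.List(labels.Everything())
    fmt.Println("nodes in cache:", len(nodes))
}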
There are other default informers:
Kubernetes Component | Which Controller Runs the Informer? | What it Watches? |
---|---|---|
Scheduler | Scheduler Controller | Pods, Nodes, PVs |
Kubelet | Kubelet Controller | Pods assigned to the node |
ReplicaSet Controller | ReplicaSet Controller | ReplicaSets, Pods |
Custom Controller | Custom Controller | Any CRD or object |
Informers do not exist for CRDs by default; you must explicitly create one in your custom controller.
How to create an informer for a CRD - TODO: dynamic vs. typed client in Kubernetes? DynamicSharedInformerFactory vs. typed client. One possible approach, using the dynamic client, is sketched below.
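A sketch using the dynamic client and a DynamicSharedInformerFactory; the PodGroup GroupVersionResource shown is an assumption - substitute your CRD's actual group/version/resource (a typed-client alternative would use code-generated clients and informers instead):
package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func watchCRD(config *rest.Config, stopCh <-chan struct{}) error {
    dynClient, err := dynamic.NewForConfig(config)
    if err != nil {
        return err
    }

    // GroupVersionResource of the CRD to watch (assumed values).
    gvr := schema.GroupVersionResource{Group: "scheduling.x-k8s.io", Version: "v1alpha1", Resource: "podgroups"}

    factory := dynamicinformer.NewDynamicSharedInformerFactory(dynClient, 30*time.Second)
    informer := factory.ForResource(gvr).Informer()
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            u := obj.(*unstructured.Unstructured) // dynamic informers deliver unstructured objects
            fmt.Println("PodGroup added:", u.GetNamespace()+"/"+u.GetName())
        },
    })

    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    return nil
}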
Client-go vs. Controller-runtime
There is overlapping functionality between these two libraries; client-go is a lower-level library than controller-runtime.
Feature | client-go | controller-runtime |
---|---|---|
Purpose | Low-level library for interacting with the Kubernetes API | Higher-level framework for building Kubernetes controllers |
Complexity | Requires more boilerplate code | Provides abstractions to simplify controller development |
Use Case | Direct interaction with Kubernetes API (CRUD operations, watches, informers) | Building custom controllers and operators |
Informer Handling | Manually managed informers and watches | Uses Manager pattern for automatic lifecycle management |
Reconciler Pattern | Not built-in, needs manual implementation | Provides built-in reconcile loops |
Dependency | Core Kubernetes library, used by Kubernetes itself | Built on top of client-go, simplifies controller design |
RBAC Handling | Needs manual configuration | Integrates with RBAC settings automatically |
- If you need fine-grained control over Kubernetes API interactions, client-go is the way to go.
- If you are building a custom controller/operator, controller-runtime makes it much easier to manage.
Client-go call:
// client-go: patch a pod directly through the API server
// (patch is the raw patch body; types is k8s.io/apimachinery/pkg/types)
clientset.CoreV1().Pods(pod.Namespace).Patch(
    context.TODO(),
    pod.Name,
    types.StrategicMergePatchType, // or types.MergePatchType; metav1 has no patch-type constant
    patch,
    metav1.PatchOptions{},
)
Controller-runtime call:
// controller-runtime: reads from the manager's informer-backed cache, not the API server directly
r.client.Get(ctx, req.NamespacedName, &pod)
Controller-runtime is generally more efficient for controllers
Comparing Both Approaches
Feature | client-go (clientset) | controller-runtime (client) |
---|---|---|
API Call Type | Direct API requests | Uses cached client with lazy loading |
Efficiency | Less efficient (every call hits API server) | More efficient (uses informers & local cache) |
Use Case | One-off scripts, CLI tools, direct API interactions | Controllers, operators, reconciliation loops |
Code Complexity | Requires manually handling watches and sync | Manages lifecycle and reconciliation |
When Should You Use client-go vs. controller-runtime?
Scenario | Use client-go | Use controller-runtime |
---|---|---|
Writing a Kubernetes controller/operator | 🚫 Not ideal | ✅ Best approach |
One-time script or tool | ✅ Simple and direct | 🚫 Overkill |
Making API requests without a long-running process | ✅ Best fit | 🚫 Not needed |
Automating Kubernetes object reconciliation | 🚫 Too much manual work | ✅ Handles it automatically |
- client-go (clientset): direct API calls every time → better for short-lived tasks.
- controller-runtime (client): uses a cached client → better for long-running controllers.
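A minimal controller-runtime reconciler sketch (a Pod reconciler as a stand-in example; names like PodReconciler are illustrative), showing the built-in reconcile loop and cached client:
package main

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// PodReconciler reconciles Pods; the embedded client reads from the manager's cache.
type PodReconciler struct {
    client.Client
}

// Reconcile is invoked by controller-runtime whenever a watched Pod changes.
func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var pod corev1.Pod
    if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
        // The Pod may have been deleted; nothing to do.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // ... reconciliation logic goes here ...
    return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler; the manager wires up informers, caches, and the work queue.
func (r *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&corev1.Pod{}).
        Complete(r)
}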