Based on https://www.youtube.com/watch?v=eDkE4WNWKUc

The traditional K8s scheduler used predicates (hard requirements) and priorities (soft requirements).

As K8s has evolved, these requirements have moved into the scheduler framework, where scheduling plugins (callbacks) hook into a series of extension points.

Here’s the full set of plugin extension points in the Kubernetes scheduler:

| Plugin Extension Point | Purpose | Can Be Configured in KubeSchedulerConfiguration? |
|---|---|---|
| QueueSort | Orders pods waiting in the scheduling queue. | ✅ Yes |
| PreEnqueue | Runs before adding a pod to the scheduling queue. | ✅ Yes |
| PreFilter | Runs before filtering nodes; can pre-process pod state. | ✅ Yes |
| Filter | Determines whether a node is suitable for a pod. | ✅ Yes |
| PostFilter | Runs when no nodes are viable; can implement preemption. | ✅ Yes |
| PreScore | Runs before scoring; prepares shared scoring state. | ✅ Yes |
| Score | Assigns scores to nodes. | ✅ Yes |
| NormalizeScore | Adjusts node scores before final selection. | Part of a Score plugin's ScoreExtensions, not a separate config entry |
| Reserve | Temporarily holds a node for a pod. | ✅ Yes |
| Permit | Approves, rejects, or delays a pod before binding. | ✅ Yes |
| WaitOnPermit | Waits for external approval before proceeding. | Internal step of the binding cycle, not separately configurable |
| PreBind | Runs just before binding a pod to a node. | ✅ Yes |
| Bind | Actually assigns the pod to a node. | ✅ Yes |
| PostBind | Runs after binding is complete. | ✅ Yes |

A PreEnqueue plugin returns Success if the pod is allowed to enter the active queue. If it fails, the pod is placed in an internal unschedulable pods list and retried later.

There are two separate goroutines here, so these phases can be handled in parallel: one for the scheduling cycle (filtering and scoring) and another for the binding cycle.

Sorting (the QueueSort extension point) lets you change the order in which pods are popped from the scheduling queue; a configuration sketch follows below.
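
A minimal sketch of configuring the queue sort extension point, assuming the built-in PrioritySort plugin (the default); note that only one queueSort plugin can be enabled, and it must be the same across all profiles:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      queueSort:
        enabled:
          - name: PrioritySort   # pops higher-priority pods off the queue first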

Binding includes a Permit step (e.g. only allow a pod to proceed to binding once some condition is met; this is used for gang scheduling, for example).

Filtering is split into three stages. PreFilter runs first and optimizes scheduling by pre-processing pod state and rejecting pods early when no node could possibly fit. Filter then checks each node for feasibility. PostFilter runs only when no node passed filtering and can make further adjustments or corrections, for example preempting lower-priority pods to handle constraints that couldn't be satisfied initially.

Applying Scheduler Plugins

In order to apply any custom scheduler plugins, you need a custom scheduler build. This is done by compiling the scheduler with your custom plugin (e.g. go build -o custom-scheduler ./cmd/kube-scheduler) and then deploying the custom scheduler as a pod.

Pod affinity and anti-affinity are rules for how to schedule pods based on other pods. They can be either hard (like predicates), via requiredDuringSchedulingIgnoredDuringExecution, or soft (like priorities), via preferredDuringSchedulingIgnoredDuringExecution; preferred rules take a weight value into account for how strong the preference is. There is also requiredDuringSchedulingRequiredDuringExecution, which would require the condition to hold both during scheduling and afterwards while the pod keeps running (this variant has been proposed but is not implemented in Kubernetes today). A sketch follows below.
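
A minimal sketch of how these rules appear in a Pod spec (the app labels and topology keys are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard rule
        - labelSelector:
            matchLabels:
              app: cache             # co-locate with pods labelled app=cache
          topologyKey: "kubernetes.io/hostname"
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:  # soft rule with weight
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web             # prefer spreading away from app=web pods
            topologyKey: "topology.kubernetes.io/zone"
  containers:
    - name: app
      image: nginx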

There are also node affinity and anti-affinity, which are rules for how to schedule pods based on node labels (see the sketch below).
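
A minimal node affinity sketch (the disktype label and the zone value are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft preference with weight
        - weight: 50
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eu-west-1a"]
  containers:
    - name: app
      image: nginx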

Each pod (anti-)affinity term has a topologyKey associated with it, which determines the level at which the rule applies:

  • “kubernetes.io/hostname” Applies at the node level
  • “topology.kubernetes.io/zone” Applies at the zone level
  • “topology.kubernetes.io/region” Applies at the region level

Apart from affinities, there are topologySpreadConstraints, which determine how to spread pods across topology domains (e.g. nodes). For example:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: "kubernetes.io/hostname"
      whenUnsatisfiable: "ScheduleAnyway"
      labelSelector:
        matchLabels:
          app: web # applies to pods with the label app=web
  containers:
    - name: nginx
      image: nginx

maxSkew: 1 means the difference in the number of matching pods between the most and least populated topology domains (here, nodes) can be at most one.

The DaemonSet controller just sets node affinity on the pod (pinning it to a specific node) and then leaves it up to the scheduler to schedule the pod. A sketch of the injected affinity follows below.
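
Roughly what that injected affinity looks like on a DaemonSet pod (the node name is illustrative):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchFields:              # matches on the Node object's metadata.name field
            - key: metadata.name
              operator: In
              values: ["node-1"]    # the specific node this DaemonSet pod targets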

Pod Overhead

Pods may need additional system resources beyond the application resources specified in the container spec. For example, if running in a sandboxed environment like Kata Containers, you need extra resources to run the sandbox itself. To account for this you can define a RuntimeClass with an overhead, like:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata
overhead:
  podFixed:
    cpu: "250m"
    memory: "64Mi"

And then apply the RuntimeClass to a pod:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  runtimeClassName: kata-containers
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"
          memory: "128Mi"
        limits:
          cpu: "1000m"
          memory: "256Mi"

This overhead is added to the resource requests of every pod scheduled with that RuntimeClass. With the example above, the scheduler looks for a node with at least 750m CPU and 192Mi of memory available (500m + 250m and 128Mi + 64Mi).

In-place update of pod resources - this is a newer feature, enabled via the InPlacePodVerticalScaling feature gate, that allows resizing a running pod's resources in place (see the sketch below).
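
A rough sketch of the pod spec fields involved, assuming the InPlacePodVerticalScaling feature gate is enabled (the resizePolicy field comes from that feature and may evolve):

apiVersion: v1
kind: Pod
metadata:
  name: resizable-pod
spec:
  containers:
    - name: app
      image: nginx
      resizePolicy:                    # per-resource behaviour when resources are resized
        - resourceName: cpu
          restartPolicy: NotRequired   # CPU can change without restarting the container
        - resourceName: memory
          restartPolicy: RestartContainer
      resources:
        requests:
          cpu: "500m"
          memory: "128Mi"
        limits:
          cpu: "1"
          memory: "256Mi"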

Multiple Schedulers

We can define multiple scheduling profiles (each with its own schedulerName) in the KubeSchedulerConfiguration, e.g.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: "default-scheduler"
    plugins:
      filter:
        disabled:
          - name: "NodeResourcesFit"  # Disable NodeResourcesFit for default profile
  - schedulerName: "custom-scheduler"
    plugins:
      filter:
        enabled:
          - name: "NodeResourcesFit"  # Enable it for the custom profile

or

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: "default-scheduler"
    pluginConfig:
      - name: "MyCustomPlugin"
        args:
          enableForPodsWithAnnotation: "custom-scheduling=true"  # Only applies to Pods with this annotation

If you run multiple schedulers, each scheduler is a separate controller. In your Pod spec you then specify the schedulerName like so:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-pod
spec:
  schedulerName: my-custom-scheduler
  containers:
    - name: app
      image: nginx

If no scheduler is specified, the default kube-scheduler is used. Running a custom scheduler as a pod (here via a Deployment) is the most common approach:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-custom-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      component: my-custom-scheduler
  template:
    metadata:
      labels:
        component: my-custom-scheduler
    spec:
      containers:
        - name: scheduler
          image: my-scheduler-image
          command: ["my-custom-scheduler", "--config", "/etc/scheduler/config.yaml"]

If a pod doesn't have a matching scheduler (e.g. it specifies a schedulerName that doesn't exist), the pod remains unscheduled (Pending) forever.

The Cluster Autoscaler has the scheduler built in, which is why for, say, the Kubernetes 1.16 release the Cluster Autoscaler version needs to be 1.16 as well: it uses the scheduler code from upstream. This matters because the Cluster Autoscaler needs to determine whether the nodes it is creating are enough for the pending pods, and it reuses the scheduler's filtering mechanisms to determine this.

Descheduler

The default scheduler doesn't care about pods once they are scheduled. If the descheduler sees pods that no longer conform to their scheduling constraints, it can evict those pods (note it doesn't reschedule them; it leaves that up to the scheduler). It can also evict pods under other policies, like pods having too many restarts or pods violating (pod or node) affinity rules. This could happen when, say, new nodes join the cluster.

Create a ConfigMap for the descheduler policy:

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      RemovePodsViolatingNodeAffinity:
        enabled: true
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 3
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
 

Deploy the descheduler as a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/10 * * * *"  # Runs every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.26.0
              command:
                - "/bin/descheduler"
                - "--policy-config-file"
                - "/policy-dir/policy.yaml"
              volumeMounts:
                - name: policy-volume
                  mountPath: /policy-dir
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy
          restartPolicy: Never

RBAC: a ClusterRoleBinding attaches the service account to a ClusterRole that lets the descheduler list and delete pods:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-rolebinding
subjects:
  - kind: ServiceAccount
    name: descheduler-sa
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: descheduler-role
  apiGroup: rbac.authorization.k8s.io

Resource requests and limits are defined at the container level, not the pod level. Kubernetes adds up the container requests to determine the pod's scheduling requirements; see the sketch below.
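
A small sketch of how container requests add up (all values illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: two-container-pod
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "250m"
          memory: "128Mi"
    - name: sidecar
      image: busybox
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "100m"
          memory: "64Mi"
  # For scheduling, the pod is treated as requesting 350m CPU and 192Mi of memory in total.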

kubernetes scheduling