Go语言kube-scheduler深度剖析开发之scheduler初始化

这篇文章主要介绍了Go语言kube-scheduler深度剖析开发之scheduler初始化实现过程示例，有需要的朋友可以借鉴参考下，希望能够有所帮助，祝大家多多进步，早日升职加薪

引言

为了深入学习 kube-scheduler，本系从源码和实战角度深度学习kube-scheduler，该系列一共分6篇文章，如下：

kube-scheduler 整体架构
本文：初始化一个 scheduler
一个 Pod 是如何调度的
如何开发一个属于自己的scheduler插件
开发一个 prefilter 扩展点的插件
开发一个 socre 扩展点的插件

上一篇，我们说了 kube-scheduler 的整体架构，是从整体的架构方面来考虑的，本文我们说说 kube-scheduler 是如何初始化出来的，kube-scheduler 里面都有些什么东西。

因为 kube-scheduler 源码内容比较多，对于那些不是关键的东西，就忽略不做讨论。

Scheduler之Profiles

下面我们先看下 Scheduler 的结构

type Scheduler struct { Cache internalcache.Cache Extenders []framework.Extender NextPod func() *framework.QueuedPodInfo FailureHandler FailureHandlerFn SchedulePod func(ctx context.Context, fwk framework.Framework, state  *framework.CycleState, pod *v1.Pod) (ScheduleResult, error) StopEverything <-chan struct{} SchedulingQueue internalqueue.SchedulingQueue Profiles profile.Map client clientset.Interface nodeInfoSnapshot *internalcache.Snapshot percentageOfNodesToScore int32 nextStartNodeIndex int }

上一篇我们说过，为一个 Pod 选择一个 Node 是按照固定顺序运行扩展点的；在扩展点内，是按照插件注册的顺序运行插件，如下图

上面的这些扩展点在 kube-scheduler 中是固定的，而且也不支持增加扩展点（实际上有这些扩展点已经足够了），而且扩展点顺序也是固定执行的。

下图是插件（以preFilter为例）运行的顺序，扩展点内的插件，你既可以调整插件的执行顺序（实际很少会修改默认的插件执行顺序），可以关闭某个内置插件，还可以增加自己开发的插件。

那么这些插件是怎么注册的，注册在哪里呢，自己开发的插件又是怎么加进去的呢？

我们来看下 Scheduler 里面最重要的一个成员：Profiles profile.Map

// 路径：pkg/scheduler/profile/profile.go // Map holds frameworks indexed by scheduler name. type Map map[string]framework.Framework

Profiles 是一个 key 为 scheduler name，value 是 framework.Framework 的map，表示根据 scheduler name 来获取 framework.Framework 类型的值，所以可以有多个scheduler。或许你在使用 k8s 的时候没有关注过 pod 或 deploment 里面的 scheduler，因为你没有指定的话，k8s 就会自动设置为默认的调度器，下图是 deployment 中未指定 schedulerName 被设置了默认调度器的一个deployment

假设现在我想要使用自己开发的一个名叫 my-scheduler-1 的调度器，这个调度器在 preFilter 扩展点中增加了 zoneLabel 插件，怎么做？

使用 kubeadm 部署的 k8s 集群，会在 /etc/kubernetes/manifests 目录下创建 kube-scheduler.yaml 文件，kubelet 会根据这个文件自动拉起来一个静态 Pod，一个 kube-scheduler pod就被创建了，而且这个 kube-scheduler 运行的参数是直接在命令行上指定的。

apiVersion: v1 kind: Pod metadata: creationTimestamp: null labels: component: kube-scheduler tier: control-plane name: kube-scheduler namespace: kube-system spec: containers: - command: - kube-scheduler - --address=0.0.0.0 - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf - --bind-address=127.0.0.1 - --kubeconfig=/etc/kubernetes/scheduler.conf - --leader-elect=true image: k8s.gcr.io/kube-scheduler:v1.16.8 ....

其实 kube-scheduler 运行的时候可以指定配置文件，而不直接把参数写在启动命令上,如下形式。

./kube-scheduler --config /etc/kube-scheduler.conf

于是乎，我们就可以在配置文件中配置我们调度器的插件了

apiVersion: kubescheduler.config.k8s.io/v1beta2 kind: KubeSchedulerConfiguration leaderElection: leaderElect: true clientConnection: kubeconfig: "/etc/kubernetes/scheduler.conf" profiles: - schedulerName: my-scheduler plugins: preFilter: enabled: - name: zoneLabel disabled: - name: NodePorts

我们可以使用 enabled，disabled 开关来关闭或打开某个插件。通过配置文件，还可以控制扩展点的调用顺序，规则如下：

如果某个扩展点没有配置对应的扩展，调度框架将使用默认插件中的扩展
如果为某个扩展点配置且激活了扩展，则调度框架将先调用默认插件的扩展，再调用配置中的扩展
默认插件的扩展始终被最先调用，然后按照 KubeSchedulerConfiguration 中扩展的激活 enabled 顺序逐个调用扩展点的扩展
可以先禁用默认插件的扩展，然后在 enabled 列表中的某个位置激活默认插件的扩展，这种做法可以改变默认插件的扩展被调用时的顺序

还可以添加多个调度器，在 deployment 等控制器中指定自己想要使用的调度器即可：

apiVersion: kubescheduler.config.k8s.io/v1beta2 kind: KubeSchedulerConfiguration leaderElection: leaderElect: true clientConnection: kubeconfig: "/etc/kubernetes/scheduler.conf" profiles: - schedulerName: my-scheduler-1 plugins: preFilter: enabled: - name: zoneLabel - schedulerName: my-scheduler-2 plugins: queueSort: enabled: - name: mySort

当然了，现在我们在配置文件中定义的 mySort，zoneLabel 这样的插件还不能使用，我们需要开发具体的插件注册进去，才能正常运行，后面的文章会详细讲。

好了，现在 Profiles 成员（一个map）已经包含了两个元素，{"my-scheduler-1": framework.Framework ,"my-scheduler-2": framework.Framework}。当一个 Pod 需要被调度的时候，kube-scheduler 会先取出 Pod 的 schedulerName 字段的值，然后通过 Profiles[schedulerName]，拿到 framework.Framework 对象，进而使用这个对象开始调度，我们可以用下面这种张图总结下上面描述的各个对象的关系。

那么重点就来到了 framework.Framework ，下面是 framework.Framework 的定义：

// pkg/scheduler/framework/interface.go type Framework interface { Handle QueueSortFunc() LessFunc RunPreFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod) (*PreFilterResult, *Status) RunPostFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status) RunPreBindPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status RunPostBindPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) RunReservePluginsReserve(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status RunReservePluginsUnreserve(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) RunPermitPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status WaitOnPermit(ctx context.Context, pod *v1.Pod) *Status RunBindPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status HasFilterPlugins() bool HasPostFilterPlugins() bool HasScorePlugins() bool ListPlugins() *config.Plugins ProfileName() string }

Framework 是一个接口，需要实现的方法大部分形式为：Run***Plugins，也就是运行某个扩展点的插件，那么只要实现这个 Framework 接口就可以对 Pod 进行调度了。那么需要用户自己实现么？答案是不用，kube-scheduler 已经有一个该接口的实现：frameworkImpl

// pkg/scheduler/framework/runtime/framework.go type frameworkImpl struct { registry             Registry snapshotSharedLister framework.SharedLister waitingPods          *waitingPodsMap scorePluginWeight    map[string]int queueSortPlugins     []framework.QueueSortPlugin preFilterPlugins     []framework.PreFilterPlugin filterPlugins        []framework.FilterPlugin postFilterPlugins    []framework.PostFilterPlugin preScorePlugins      []framework.PreScorePlugin scorePlugins         []framework.ScorePlugin reservePlugins       []framework.ReservePlugin preBindPlugins       []framework.PreBindPlugin bindPlugins          []framework.BindPlugin postBindPlugins      []framework.PostBindPlugin permitPlugins        []framework.PermitPlugin clientSet       clientset.Interface kubeConfig      *restclient.Config eventRecorder   events.EventRecorder informerFactory informers.SharedInformerFactory metricsRecorder *metricsRecorder profileName     string extenders []framework.Extender framework.PodNominator parallelizer parallelize.Parallelizer }

frameworkImpl 这个结构体里面包含了每个扩展点插件数组，所以某个扩展点要被执行的时候，只要遍历这个数组里面的所有插件，然后执行这些插件就可以了。我们看看 framework.FilterPlugin 是怎么定义的（其他的也类似）：

type Plugin interface { Name() string } type FilterPlugin interface { Plugin Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status }

插件数组的类型是一个接口，那么某个插件只要实现了这个接口就可以被运行。实际上，我们前面说的那些默认插件，都实现了这个接口，在目录 pkg/scheduler/framework/plugins 目录下面包含了所有内置插件的实现，主要就是对上面说的这个插件接口的实现。我们可以简单用图描述下 Pod被调度的时候执行插件的流程

那么这些默认插件是怎么加到framework里面的，自定义插件又是怎么加进来的呢？

分三步：

根据配置文件（--config指定的）、系统默认的插件，按照扩展点生成需要被加载的插件数组（包括插件名字，权重信息），也就是初始化 KubeSchedulerConfiguration 中的 Profiles 成员。

type KubeSchedulerConfiguration struct { metav1.TypeMeta Parallelism int32 LeaderElection componentbaseconfig.LeaderElectionConfiguration ClientConnection componentbaseconfig.ClientConnectionConfiguration HealthzBindAddress string MetricsBindAddress string componentbaseconfig.DebuggingConfiguration PercentageOfNodesToScore int32 PodInitialBackoffSeconds int64 PodMaxBackoffSeconds int64 Profiles []KubeSchedulerProfile Extenders []Extender }

创建 registry 集合，这个集合内是每个插件实例化函数，也就是插件名字->插件实例化函数的映射，通俗一点说就是告诉系统：1.我叫王二； 2. 你应该怎么把我创建出来。那么张三、李四、王五分别告诉系统怎么创建自己，就组成了这个集合。

type PluginFactory = func(configuration runtime.Object, f framework.Handle) (framework.Plugin, error) type Registry map[string]PluginFactory

这个集合是内置（叫inTree）默认的插件映射和用户自定义（outOfTree）的插件映射的并集，内置的映射通过下面函数创建：

// pkg/scheduler/framework/plugins/registry.go func NewInTreeRegistry() runtime.Registry { fts := plfeature.Features{ EnableReadWriteOncePod:                       feature.DefaultFeatureGate.Enabled(features.ReadWriteOncePod), EnableVolumeCapacityPriority:                 feature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority), EnableMinDomainsInPodTopologySpread:          feature.DefaultFeatureGate.Enabled(features.MinDomainsInPodTopologySpread), EnableNodeInclusionPolicyInPodTopologySpread: feature.DefaultFeatureGate.Enabled(features.NodeInclusionPolicyInPodTopologySpread), } return runtime.Registry{ selectorspread.Name:                  selectorspread.New, imagelocality.Name:                   imagelocality.New, tainttoleration.Name:                 tainttoleration.New, nodename.Name:                        nodename.New, nodeports.Name:                       nodeports.New, nodeaffinity.Name:                    nodeaffinity.New, podtopologyspread.Name:               runtime.FactoryAdapter(fts, podtopologyspread.New), nodeunschedulable.Name:               nodeunschedulable.New, noderesources.Name:                   runtime.FactoryAdapter(fts, noderesources.NewFit), noderesources.BalancedAllocationName: runtime.FactoryAdapter(fts, noderesources.NewBalancedAllocation), volumebinding.Name:                   runtime.FactoryAdapter(fts, volumebinding.New), volumerestrictions.Name:              runtime.FactoryAdapter(fts, volumerestrictions.New), volumezone.Name:                      volumezone.New, nodevolumelimits.CSIName:             runtime.FactoryAdapter(fts, nodevolumelimits.NewCSI), nodevolumelimits.EBSName:             runtime.FactoryAdapter(fts, nodevolumelimits.NewEBS), nodevolumelimits.GCEPDName:           runtime.FactoryAdapter(fts, nodevolumelimits.NewGCEPD), nodevolumelimits.AzureDiskName:       runtime.FactoryAdapter(fts, nodevolumelimits.NewAzureDisk), nodevolumelimits.CinderName:          runtime.FactoryAdapter(fts, nodevolumelimits.NewCinder), interpodaffinity.Name:                interpodaffinity.New, queuesort.Name:                       queuesort.New, defaultbinder.Name:                   defaultbinder.New, defaultpreemption.Name:               runtime.FactoryAdapter(fts, defaultpreemption.New), } }

那么用户自定义的插件怎么来的呢？这里咱们先不展开，在后面插件开发的时候再详细讲，不影响我们理解。我们假设用户自定义的也已经生成了 registry，下面的代码就是把他们合并在一起

// pkg/scheduler/scheduler.go registry := frameworkplugins.NewInTreeRegistry() if err := registry.Merge(options.frameworkOutOfTreeRegistry); err != nil { return nil, err }

现在内置插件和系统默认插件的实例化函数映射已经创建好了

将（1）中每个扩展点的每个插件（就是插件名字）拿出来，去（2）的映射（map）中获取实例化函数，然后运行这个实例化函数，最后把这个实例化出来的插件（可以被运行的）追加到上面提到过的 frameworkImpl 中对应扩展点数组中，这样后面要运行某个扩展点插件的时候就可以遍历运行就可以了。我们可以把上述过程用下图表示

Scheduler 之 SchedulingQueue

上面我们介绍了 Scheduler 第一个关键成员 Profiles 的初始化和作用，下面我们来谈谈第二个关键成员：SchedulingQueue。

// pkg/scheduler/scheduler.go podQueue := internalqueue.NewSchedulingQueue( profiles[options.profiles[0].SchedulerName].QueueSortFunc(), informerFactory, // 1s internalqueue.WithPodInitialBackoffDuration(time.Duration(options.podInitialBackoffSeconds)*time.Second), // 10s internalqueue.WithPodMaxBackoffDuration(time.Duration(options.podMaxBackoffSeconds)*time.Second), internalqueue.WithPodNominator(nominator), internalqueue.WithClusterEventMap(clusterEventMap), // 5min internalqueue.WithPodMaxInUnschedulablePodsDuration(options.podMaxInUnschedulablePodsDuration), )

func NewSchedulingQueue( lessFn framework.LessFunc, informerFactory informers.SharedInformerFactory, opts ...Option) SchedulingQueue { return NewPriorityQueue(lessFn, informerFactory, opts...) }

type PriorityQueue struct { framework.PodNominator stop  chan struct{} clock clock.Clock podInitialBackoffDuration time.Duration podMaxBackoffDuration time.Duration podMaxInUnschedulablePodsDuration time.Duration lock sync.RWMutex cond sync.Cond activeQ *heap.Heap podBackoffQ *heap.Heap unschedulablePods *UnschedulablePods schedulingCycle int64 moveRequestCycle int64 clusterEventMap map[framework.ClusterEvent]sets.String closed bool nsLister listersv1.NamespaceLister }

SchedulingQueue 是一个 internalqueue.SchedulingQueue 的接口类型，PriorityQueue 对这个接口进行了实现，创建 Scheduler 的时候 SchedulingQueue 会被 PriorityQueue 类型对象赋值。

PriorityQueue 中有关键的3个成员：activeQ、podBackoffQ、unschedulablePods。

activeQ 是一个优先队列，用来存放待调度的 Pod，Pod 按照优先级存放在队列中
podBackoffQ 用来存放异常的 Pod，该队列里面的 Pod 会等待一定时间后被移动到 activeQ 里面重新被调度
unschedulablePods 中会存放调度失败的 Pod，它不是队列，而是使用 map 来存放的，这个 map 里面的 Pod 在一定条件下会被移动到 activeQ 或 podBackoffQ 中

PriorityQueue 还有两个方法：flushUnschedulablePodsLeftover 和 flushBackoffQCompleted

flushUnschedulablePodsLeftover：调度失败的 Pod 如果满足一定条件，这个函数会将这种 Pod 移动到 activeQ 或 podBackoffQ
flushBackoffQCompleted：运行异常的 Pod 等待时间完成后，flushBackoffQCompleted 将该 Pod 移动到 activeQ

Scheduler 在启动的时候，会创建2个协程来定期运行这两个函数

func (p *PriorityQueue) Run() { go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop) go wait.Until(p.flushUnschedulablePodsLeftover, 30*time.Second, p.stop) }

上面是定期对 Pod 在这些队列之间的转换，那么除了定期刷新的方式，还有下面情况也会触发队列转换：

有新节点加入集群
节点配置或状态发生变化
已经存在的 Pod 发生变化
集群内有Pod被删除

至于他们之间是如何转换的，我们在下一篇文章里面详细介绍

Scheduler 之 cache

要说 cache 最大的作用就是提升 Scheduler 的效率，降低 kube-apiserver(本质是 etcd)的压力，在调用各个插件计算的时候所需要的 Node 信息和其他 Pod 信息都缓存在本地，在需要使用的时候直接从缓存获取即可，而不需要调用 api 从 kube-apiserver 获取。cache 类型是 internalcache.Cache 的接口，cacheImpl 实现了这个接口。

下面是 cacheImpl 的结构

type Cache interface NodeCount() int PodCount() (int, error) AssumePod(pod *v1.Pod) error FinishBinding(pod *v1.Pod) error ForgetPod(pod *v1.Pod) error AddPod(pod *v1.Pod) error UpdatePod(oldPod, newPod *v1.Pod) error RemovePod(pod *v1.Pod) error GetPod(pod *v1.Pod) (*v1.Pod, error) IsAssumedPod(pod *v1.Pod) (bool, error) AddNode(node *v1.Node) *framework.NodeInfo UpdateNode(oldNode, newNode *v1.Node) *framework.NodeInfo RemoveNode(node *v1.Node) error UpdateSnapshot(nodeSnapshot *Snapshot) error Dump() *Dump }

type cacheImpl struct { stop   <-chan struct{} ttl    time.Duration period time.Duration mu sync.RWMutex assumedPods sets.String podStates map[string]*podState nodes     map[string]*nodeInfoListItem headNode *nodeInfoListItem nodeTree *nodeTree imageStates map[string]*imageState }

cacheImpl 中的 nodes 存放集群内所有 Node 信息；podStates 存放所有 Pod 信息；，assumedPods 存放已经调度成功但是还没调用 kube-apiserver 的进行绑定的（也就是还没有执行 bind 插件）的Pod，需要这个缓存的原因也是为了提升调度效率，将绑定和调度分开，因为绑定需要调用 kube-apiserver，这是一个重操作会消耗比较多的时间，所以 Scheduler 乐观的假设调度已经成功，然后返回去调度其他 Pod，而这个 Pod 就会放入 assumedPods 中，并且也会放入到 podStates 中，后续其他 Pod 在进行调度的时候，这个 Pod 也会在插件的计算范围内（如亲和性），然后会新起协程进行最后的绑定，要是最后绑定失败了，那么这个 Pod 的信息会从 assumedPods 和 podStates 移除，并且把这个 Pod 重新放入 activeQ 中，重新被调度。

Scheduler 在启动时首先会 list 一份全量的 Pod 和 Node 数据到上述的缓存中，后续通过 watch 的方式发现变化的 Node 和 Pod，然后将变化的 Node 或 Pod 更新到上述缓存中。

Scheduler 之 NextPod 和 SchedulePod

到了这里，调度框架 framework 和调度队列 SchedulingQueue 都已经创建出来了，现在是时候开始调度Pod了。

Scheduler 中有个成员 NextPod 会从 activeQ 队列中尝试获取一个待调度的 Pod，该函数在 SchedulePod 中被调用，如下：

// 启动 Scheduler func (sched *Scheduler) Run(ctx context.Context) { sched.SchedulingQueue.Run() go wait.UntilWithContext(ctx, sched.scheduleOne, 0) <-ctx.Done() sched.SchedulingQueue.Close() } // 尝试调度一个 Pod，所以 Pod 的调度入口 func (sched *Scheduler) scheduleOne(ctx context.Context) { // 会一直阻塞，直到获取到一个Pod ...... podInfo := sched.NextPod() ...... }

NextPod 它被赋予如下函数：

// pkg/scheduler/internal/queue/scheduling_queue.go func MakeNextPodFunc(queue SchedulingQueue) func() *framework.QueuedPodInfo { return func() *framework.QueuedPodInfo { podInfo, err := queue.Pop() if err == nil { klog.V(4).InfoS("About to try and schedule pod", "pod", klog.KObj(podInfo.Pod)) for plugin := range podInfo.UnschedulablePlugins { metrics.UnschedulableReason(plugin, podInfo.Pod.Spec.SchedulerName).Dec() } return podInfo } klog.ErrorS(err, "Error while retrieving next pod from scheduling queue") return nil } }

Pop 会一直阻塞，直到 activeQ 长度大于0，然后去取出一个 Pod 返回

// pkg/scheduler/internal/queue/scheduling_queue.go func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) { p.lock.Lock() defer p.lock.Unlock() for p.activeQ.Len() == 0 { // When the queue is empty, invocation of Pop() is blocked until new item is enqueued. // When Close() is called, the p.closed is set and the condition is broadcast, // which causes this loop to continue and return from the Pop(). if p.closed { return nil, fmt.Errorf(queueClosed) } p.cond.Wait() } obj, err := p.activeQ.Pop() if err != nil { return nil, err } pInfo := obj.(*framework.QueuedPodInfo) pInfo.Attempts++ p.schedulingCycle++ return pInfo, nil }

到了这里我们就介绍完了 Scheduler 中最重要的几个成员，简单总结下：