zhangguanzhang's Blog

Node shutdown makes cluster DNS intermittently unavailable before pod eviction

Word count: 3.7k · Reading time: 19 min
2021/02/02

Background

Over the past few days we have been running disaster-recovery tests for a new internal project; all of the workloads run on K8S. The test is simply to pick a node and run shutdown -h now. After a shutdown, a colleague noticed errors on the pages, and the root cause turned out to be that in-cluster DNS resolution would intermittently fail.

Per the SVC flow: after a node powers off, its kubelet can no longer update its own status, so for a while the node and its pods still look normal when fetched from the apiserver. Only after kube-controller-manager's --node-monitor-grace-period has elapsed, plus a further --pod-eviction-timeout, does pod eviction start. That is the rough sequence.

Before pod eviction there is a window of roughly 5m by default. Throughout that window, every Pod IP on the dead node is still present in the SVC endpoints, and the node my colleague shut down happened to host a coredns pod. So for those ~5m, each lookup failed with odds of about one in the number of coredns replicas.
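The arithmetic above can be sketched quickly. The 40s and 5m values below are the upstream kube-controller-manager defaults for the two flags; the replica count of 3 is this cluster's coredns deployment:

```go
package main

import "fmt"

// staleWindow returns the worst-case number of seconds during which a
// dead node's pod IPs remain in the SVC endpoints, derived from the two
// kube-controller-manager flags (upstream defaults shown in main).
func staleWindow(nodeMonitorGracePeriod, podEvictionTimeout int) int {
	return nodeMonitorGracePeriod + podEvictionTimeout
}

// failChance is the per-query chance of hitting the dead replica when
// one of n coredns replicas was on the powered-off node.
func failChance(n int) float64 {
	return 1.0 / float64(n)
}

func main() {
	// --node-monitor-grace-period=40s, --pod-eviction-timeout=5m0s
	fmt.Printf("stale-endpoint window: ~%ds\n", staleWindow(40, 300))
	fmt.Printf("per-query failure odds: %.1f%%\n", 100*failChance(3))
}
```

With the defaults this gives a window of roughly 340s during which about one lookup in three fails.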

Environment

This actually has nothing to do with the K8S version, since the SVC and eviction behavior is the same everywhere. I did tune every parameter related to how fast a node reports its own status, down to the point where pods were evicted within 20s, yet within those 20s lookups could still fail. (With the intervals tuned that aggressively, a new bug appeared: a pod selected by an svc was still running, but the kubelet failed to update its own status in time, so kube-controller-manager patched the pod's status to not-true and the svc endpoint disappeared; reverting the intervals made that bug go away.) I also asked around in community groups, and it seems almost nobody has ever run this kind of shutdown test, presumably because everyone is on public cloud these days...

$ kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:16:51Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:07:57Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

Troubleshooting

Does node-local-dns really work?

The obvious first choice is the node-local-dns approach; search for it and you will find plenty of writeups. In short, you run a hostNetwork node-cache process on every node as a caching proxy, and use a dummy interface plus NAT rules to intercept the DNS requests headed for the kube-dns SVC IP and serve them from cache.

In the official yaml, __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__SERVER__ need to be replaced with the dummy interface IP and the kube-dns SVC IP, and __PILLAR__DNS__DOMAIN__ should be changed per the docs. The remaining variables are substituted at startup, which you can verify in the logs.

In practice it still failed, so I walked through the flow again. The yaml contains this SVC plus the node-cache startup args:

apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
...
spec:
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns
...
args: [ ..., "-upstreamsvc", "kube-dns-upstream" ]

The startup logs show the rendered config file:

cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . /etc/resolv.conf
    prometheus :9253
}

Because NAT is used to hook requests to the kube-dns SVC IP (172.26.0.2) while node-cache itself still needs to reach kube-dns, the yaml creates a second SVC with the same selector as kube-dns and passes its name in the startup args; as shown above, node-cache forwards to that SVC's IP. Since enableServiceLinks is enabled by default, the pod gets environment variables like these:

$ docker exec dfa env | grep KUBE_DNS_UPSTREAM_SERVICE_HOST
KUBE_DNS_UPSTREAM_SERVICE_HOST=172.26.189.136

In the code you can see it simply converts the - in the SVC name to _, upper-cases it, and reads that env var to obtain the SVC IP when rendering the config file:

func toSvcEnv(svcName string) string {
	envName := strings.Replace(svcName, "-", "_", -1)
	return "$" + strings.ToUpper(envName) + "_SERVICE_HOST"
}
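As a quick sanity check, dropping that function into a standalone program shows the mapping from the flag value to the env var name (the main wrapper is mine; the function body is verbatim from node-cache):

```go
package main

import (
	"fmt"
	"strings"
)

// Verbatim from node-cache: turn an svc name into the
// $<NAME>_SERVICE_HOST env var reference that gets rendered
// into the Corefile.
func toSvcEnv(svcName string) string {
	envName := strings.Replace(svcName, "-", "_", -1)
	return "$" + strings.ToUpper(envName) + "_SERVICE_HOST"
}

func main() {
	fmt.Println(toSvcEnv("kube-dns-upstream"))
	// → $KUBE_DNS_UPSTREAM_SERVICE_HOST, which enableServiceLinks
	// populates with the SVC's ClusterIP (172.26.189.136 here).
}
```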

So under the default config, the cluster1.local:53 zone still forwards to an SVC, and the problem remains.

The only fundamental fix is to bypass the SVC entirely. I switched coredns to port 153 with hostNetwork: true and pinned it to the three masters with a nodeSelector. The config file then became:

cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
...

Testing again, lookups could still occasionally fail. I remembered that 米开朗基杨 had shared dnsredir, a coredns plugin with failover support, so I tried compiling it in.

After building it per the docs, the binary could not parse the config file, because node-cache is not plain coredns with extra plugins bolted on; it is its own codebase that imports coredns's built-in plugins.

The details are in this issue: include coredns plugin at node-cache don't work expect

The bind plugin in the official node-cache is what provides the dummy interface and the iptables NAT part. That feature appealed to me, so I decided to keep trying to get it working.

An unexpected find

While I was testing the dnsredir plugin, 米开朗基杨 asked me to try a minimal config section to check for interference, so I switched back and forth between these two configs:


Corefile: |
  cluster1.local:53 {
      errors
      reload
      dnsredir . {
          to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
          max_fails 1
          health_check 1s
          spray
      }
      #forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
      #    max_fails 1
      #    policy round_robin
      #    health_check 0.4s
      #}
      prometheus :9253
      health 169.254.20.10:8080
  }
#----------
Corefile: |
  cluster1.local:53 {
      errors
      reload
      #dnsredir . {
      #    to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
      #    max_fails 1
      #    health_check 1s
      #    spray
      #}
      forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
          max_fails 1
          policy round_robin
          health_check 0.4s
      }
      prometheus :9253
      health 169.254.20.10:8080
  }

And then, surprisingly, lookups no longer failed at all:

$ function d(){ while :;do sleep 0.2; date;dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short; done; }
$ d
Tue Feb  2 12:54:43 CST 2021
172.26.158.130
Tue Feb  2 12:54:44 CST 2021
172.26.158.130
Tue Feb  2 12:54:44 CST 2021
172.26.158.130
Tue Feb  2 12:54:44 CST 2021 <--- a master was shut down at this moment
172.26.158.130
Tue Feb  2 12:54:45 CST 2021
172.26.158.130
Tue Feb  2 12:54:47 CST 2021
172.26.158.130
Tue Feb  2 12:54:48 CST 2021
172.26.158.130
Tue Feb  2 12:54:48 CST 2021
172.26.158.130
Tue Feb  2 12:54:48 CST 2021
172.26.158.130
Tue Feb  2 12:54:51 CST 2021
172.26.158.130
Tue Feb  2 12:54:51 CST 2021
172.26.158.130
Tue Feb  2 12:54:52 CST 2021
172.26.158.130

At that point I dropped the dnsredir experiments and had a colleague test it: no problems. He then asked me to apply the change to another environment and retest, and there the failures came back:

$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.26.0.2 account-gateway +short
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short

After many more rounds with minimal zone configs, the comparison pointed to reverse resolution: with reverse lookups disabled there was no problem at all. Comment out the following:

#in-addr.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
#ip6.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}

With that change, shutting down any node hosting coredns during the resolution loop caused no failures:

$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
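For context on how those reverse zones get involved at all: a PTR lookup for an IPv4 address is sent as a name under in-addr.arpa, which is exactly what the commented-out zone blocks used to match and forward. A minimal sketch of that name construction (my own helper, not code from node-cache):

```go
package main

import (
	"fmt"
	"strings"
)

// reverseName builds the in-addr.arpa query name for an IPv4
// address: the octets are reversed and suffixed with the
// reverse-lookup zone.
func reverseName(ip string) string {
	o := strings.Split(ip, ".")
	return fmt.Sprintf("%s.%s.%s.%s.in-addr.arpa.", o[3], o[2], o[1], o[0])
}

func main() {
	// A PTR query for this pod IP would land in the in-addr.arpa:53
	// zone block and be forwarded upstream.
	fmt.Println(reverseName("172.26.158.130"))
	// → 130.158.26.172.in-addr.arpa.
}
```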

The yaml, roughly

apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNSUpstream"
spec:
  clusterIP: 172.26.0.3 # <---- pin it; you can also query this IP directly, bypassing node-cache, for testing
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 153
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 153
  selector:
    k8s-app: kube-dns
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
data:
  Corefile: |
    cluster1.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
            force_tcp
            max_fails 1
            policy round_robin
            health_check 0.5s
        }
        prometheus :9253
        health 169.254.20.10:8070
    }
    #in-addr.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    #ip6.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    k8s-app: node-local-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
      annotations:
        prometheus.io/port: "9253"
        prometheus.io/scrape: "true"
    spec:
      imagePullSecrets:
      - name: regcred
      priorityClassName: system-node-critical
      serviceAccountName: node-local-dns
      hostNetwork: true
      dnsPolicy: Default # Don't use cluster DNS.
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      containers:
      - name: node-cache
        image: xxx.lan:5000/k8s-dns-node-cache:1.16.0
        resources:
          requests:
            cpu: 25m
            memory: 10Mi
        args: [ "-localip", "169.254.20.10,172.26.0.2", "-conf", "/etc/Corefile", "-upstreamsvc", "kube-dns-upstream", "-health-port","8070" ]
        securityContext:
          privileged: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9253
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            host: 169.254.20.10
            path: /health
            port: 8070
          initialDelaySeconds: 40
          timeoutSeconds: 3
        volumeMounts:
        - mountPath: /run/xtables.lock
          name: xtables-lock
          readOnly: false
        - name: config-volume
          mountPath: /etc/coredns
        - name: kube-dns-config
          mountPath: /etc/kube-dns
      volumes:
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      - name: config-volume
        configMap:
          name: node-local-dns
          items:
          - key: Corefile
            path: Corefile.base
---
# A headless service is a service with a service IP but instead of load-balancing it will return the IPs of our associated Pods.
# We use this to expose metrics to Prometheus.
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9253"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-local-dns
  name: node-local-dns
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9253
    targetPort: 9253
  selector:
    k8s-app: node-local-dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: coredns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: Reconcile
  name: system:coredns
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  - pods
  - namespaces
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: EnsureExists
  name: system:coredns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
- kind: ServiceAccount
  name: coredns
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  Corefile: |
    .:153 {
        errors
        health :8180
        kubernetes cluster1.local. in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - kube-dns
              topologyKey: kubernetes.io/hostname
      hostNetwork: true
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      imagePullSecrets:
      - name: regcred
      containers:
      - name: coredns
        image: xxxx.lan:5000/coredns:1.7.1
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 270Mi
          requests:
            cpu: 100m
            memory: 150Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 153
          name: dns
          protocol: UDP
        - containerPort: 153
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8180
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9153"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 172.26.0.2
  ports:
  - name: dns
    port: 53
    targetPort: 153
    protocol: UDP
  - name: dns-tcp
    port: 53
    targetPort: 153
    protocol: TCP

My own approach

Later, though, the CPU usage proved too high, so I decided to build my own solution. After many attempts, I ended up extracting the dummy-interface part of the source into a standalone tool (so the SVC IP does not need to change) and handling high availability by other means. The key part is replacing the nodelocaldns piece:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    k8s-app: node-local-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
    spec:
      imagePullSecrets:
      - name: regcred
      priorityClassName: system-node-critical
      serviceAccountName: node-local-dns
      hostNetwork: true
      dnsPolicy: Default # Don't use cluster DNS.
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      containers:
      - name: dummy-tool
        #image: registry.aliyuncs.com/zhangguanzhang/dummy-tool:v0.1
        image: {{ docker_repo_url }}/dummy-tool:v0.1
        args:
        - -local-ip=169.254.20.10,172.26.0.2
        - -health-port=8070
        - -interface-name=nodelocaldns
        securityContext:
          privileged: true
        livenessProbe:
          httpGet:
            host: 169.254.20.10
            path: /health
            port: 8070
          initialDelaySeconds: 40
          timeoutSeconds: 3
      - name: dnsmasq
        #image: registry.aliyuncs.com/zhangguanzhang/dnsmasq:2.83
        image: {{ docker_repo_url }}/dnsmasq:2.83
        command:
        - dnsmasq
        - -d
        - --conf-file=/etc/dnsmasq/dnsmasq.conf
        resources:
          requests:
            cpu: 25m
            memory: 10Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /etc/localtime
          name: host-localtime
        - name: config-volume
          mountPath: /etc/dnsmasq
      volumes:
      - name: config-volume
        configMap:
          name: node-local-dns
      - hostPath:
          path: /etc/localtime
        name: host-localtime
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  dnsmasq.conf: |
    no-resolv
    all-servers
    server=10.11.86.107#153
    server=10.11.86.108#153
    server=10.11.86.109#153
    #log-queries
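The all-servers line is what gives dnsmasq its resilience here: it forwards each query to every upstream in parallel and returns the first answer, so one powered-off master costs only a lost packet, not the whole lookup. A toy sketch of that race, with channels standing in for UDP sockets and made-up timings:

```go
package main

import (
	"fmt"
	"time"
)

// query simulates asking one upstream; a dead server never answers.
func query(name string, delay time.Duration, alive bool, out chan<- string) {
	if !alive {
		return // dead upstream: no reply, like a powered-off master
	}
	time.Sleep(delay)
	out <- name
}

// allServers mimics dnsmasq's all-servers mode: fan the query out to
// every upstream at once and return whichever replies first.
func allServers() string {
	out := make(chan string, 3)
	go query("10.11.86.107", 30*time.Millisecond, false, out) // shut down
	go query("10.11.86.108", 10*time.Millisecond, true, out)
	go query("10.11.86.109", 20*time.Millisecond, true, out)
	return <-out
}

func main() {
	fmt.Println("first answer from:", allServers())
}
```

With one upstream dead, the fastest live server still answers, which is exactly the failover behavior the forward/dnsredir experiments above were chasing.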

References
