zhangguanzhang's Blog

A Beginner-Friendly Guide to Kubernetes Certificates

2025/06/02

A walk through k8s certificates written from a beginner's point of view.

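Background

Many people shy away from k8s certificates and kubeconfig, and have no idea where to start when a certificate expires or a related error shows up. There are articles and blog posts about certificates out there, but long stretches of theory tend to lose readers. What people actually need is concrete steps and a direction to follow when solving a problem.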

Theory

A brief look at the theory behind certificates.

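Mutual SSL

To visit an https website, the target web server must be configured with an SSL certificate, and that certificate comes from one of two sources:

  1. A CA (certificate authority) uses its private key to sign a root CA certificate (which carries the public key). Browsers and operating systems ship with these root CA certificates built in, and only certificates signed by such a CA get the green padlock.
  2. A certificate signed with a CA private key that you generated yourself, using a certificate tool or a library that follows the certificate standards. Browsers show a red warning for it (connection not secure / invalid certificate).

k8s uses mutual SSL based on the X.509 v3 standard: when a client and a server communicate, each side verifies the other's certificate by checking that it was signed by the same CA, and the private key of that CA is one you generate yourself. You can see that the command lines of the k8s components all carry ca-file / cert-file style flags.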

Certificate Recommendations

Validity Period

Whether you use openssl or cfssl, it is recommended to set a generous expiry:

$ openssl req -x509 ... -days 10000
$ cat ca-config.json
...
"expiry": "876000h"

For kubeadm, there are write-ups online that patch and recompile it, or that override the default validity at go build time. Search for them yourself.
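
Whatever validity you pick, it helps to be able to ask "does this certificate run out soon?" from a script. Below is a minimal sketch, assuming openssl is installed. The function name cert_expires_within and its arguments are made up for illustration; the only real piece is the -checkend option of openssl x509, which takes a number of seconds.

```shell
# cert_expires_within FILE DAYS
# Returns 0 (true) if the certificate in FILE expires within DAYS days,
# 1 otherwise. An unreadable file also counts as "expiring", because
# openssl exits non-zero in that case as well.
cert_expires_within() {
    cert_file=$1
    days=$2
    seconds=$((days * 86400))
    if openssl x509 -noout -checkend "$seconds" -in "$cert_file" >/dev/null 2>&1; then
        return 1   # still valid once the window has passed
    fi
    return 0       # expires (or has already expired) inside the window
}
```

For example, `cert_expires_within /etc/kubernetes/pki/apiserver.crt 30 && echo "renew soon"` could run from cron.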

certSAN

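In k8s, kube-apiserver and etcd are both deployed on several machines for high availability. With openssl/cfssl/kubeadm it is recommended to add domain names in addition to IPs, in case the IPs change later.

Taking an IPv4 cluster created from a kubeadm YAML config as the example, these are the entries you would typically write: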

$ cat initconfig.yaml
apiServer:
  certSANs:
  - 10.96.0.1 # first IP of the service CIDR
  - 127.0.0.1 # with multiple masters, lets you debug quickly via localhost when the load balancer breaks
  - localhost
  - apiserver.k8s.local # domain name or VIP of the load balancer
  - 172.19.0.2 # IPs of the three kube-apiserver instances
  - 172.19.0.3
  - 172.19.0.4
  - apiserver01.k8s.local
  - apiserver02.k8s.local
  - apiserver03.k8s.local
  - apiserver04.k8s.local # reserved domain names
  - apiserver05.k8s.local
  - master
  - kubernetes
  - kubernetes.default
  - kubernetes.default.svc
  - kubernetes.default.svc.cluster.local # in-cluster DNS search domains and clusterDomain
...
etcd: # https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2#Etcd
  local:
    serverCertSANs:
    - localhost
    - 127.0.0.1
    - 172.19.0.2
    - 172.19.0.3
    - 172.19.0.4
    - etcd01.k8s.local
    - etcd02.k8s.local
    - etcd03.k8s.local
    - etcd04.k8s.local # reserved domain names
    - etcd05.k8s.local
    peerCertSANs:
    - localhost
    - 127.0.0.1
    - 172.19.0.2
    - 172.19.0.3
    - 172.19.0.4
    - etcd01.k8s.local
    - etcd02.k8s.local
    - etcd03.k8s.local
    - etcd04.k8s.local # reserved domain names
    - etcd05.k8s.local

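The list above covers IPv4 only. If you expect to go dual-stack later, add those entries up front. If a domain name or IP is missing, admin components, or SDK clients running inside pods, will get an error like this when they access kube-apiserver:

Unable to connect to the server: tls: failed to verify certificate: x509: certificate is valid for 10.96.0.1, yyyy, not xxxx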

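In other words, the certSANs of the certificate contain only 10.96.0.1 and yyyy, but not xxxx. You can sign a new certificate with the existing CA. On the command line you can inspect a certificate's certSANs with openssl; from a programming language, using a library is recommended:

# Certificates usually live in /etc/kubernetes/pki; if they are not there, look for the path in the apiserver cmdline flags
$ openssl x509 -noout -text -in apiserver.crt
X509v3 Subject Alternative Name:
DNS:shaolin, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:10.96.0.1, IP Address: 172.19.0.2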
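
The x509 error message already tells you which names the certificate covers and which one is missing, so it can be picked apart before you even open the certificate. A small sketch using only sed and tr. The function names missing_san and covered_sans are invented here, and the pattern assumes the error wording shown above:

```shell
# missing_san "ERROR MESSAGE"
# Prints the host name or IP the client tried to reach, i.e. the part
# after the final ", not " in a "certificate is valid for ..." error.
missing_san() {
    printf '%s\n' "$1" |
        sed -n 's/.*certificate is valid for .*, not \(.*\)$/\1/p'
}

# covered_sans "ERROR MESSAGE"
# Prints the names the certificate does cover, one per line.
covered_sans() {
    printf '%s\n' "$1" |
        sed -n 's/.*certificate is valid for \(.*\), not .*$/\1/p' |
        tr ',' '\n' |
        sed 's/^ *//'
}
```

The name printed by missing_san is the one that has to be added to the certSANs.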

The openssl output above contains a lot of information, including the expiry time, and openssl x509 has many more options:

# Combine -certopt with -text to print only the certSANs
openssl x509 -noout -in apiserver.crt -certopt no_subject,no_header,no_version,no_serial,no_signame,no_validity,no_issuer,no_pubkey,no_sigdump,no_aux -text

# Show only the validity dates
openssl x509 -noout -in apiserver.crt -dates

# Show only the subject
openssl x509 -noout -in apiserver.crt -subject

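Only certificates signed by the same CA pass verification. You can check this with openssl:

# Check whether apiserver.crt was signed by the CA (the ca.key that belongs to ca.crt)
$ openssl verify -CAfile ca.crt apiserver.crt
apiserver.crt: OK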
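
To see both outcomes without touching a real cluster, the check can be reproduced with throwaway files. This is only a sketch, assuming openssl is installed. The names ca-a, ca-b and leaf are invented for the demo, and everything is written to a temporary directory:

```shell
# Throwaway demo: ca-a signs a leaf certificate, ca-b is unrelated.
demo_dir=$(mktemp -d)

# Two independent self-signed CAs.
for name in ca-a ca-b; do
    openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
        -subj "/CN=$name" \
        -keyout "$demo_dir/$name.key" -out "$demo_dir/$name.crt" 2>/dev/null
done

# A leaf key and CSR, signed by ca-a only.
openssl req -new -newkey rsa:2048 -nodes -subj "/CN=leaf" \
    -keyout "$demo_dir/leaf.key" -out "$demo_dir/leaf.csr" 2>/dev/null
openssl x509 -req -in "$demo_dir/leaf.csr" \
    -CA "$demo_dir/ca-a.crt" -CAkey "$demo_dir/ca-a.key" -CAcreateserial \
    -days 1 -out "$demo_dir/leaf.crt" 2>/dev/null

# Verify against the signing CA, then against the unrelated one.
if openssl verify -CAfile "$demo_dir/ca-a.crt" "$demo_dir/leaf.crt" >/dev/null 2>&1; then
    same_ca=ok
else
    same_ca=fail
fi
if openssl verify -CAfile "$demo_dir/ca-b.crt" "$demo_dir/leaf.crt" >/dev/null 2>&1; then
    other_ca=ok
else
    other_ca=fail
fi

echo "signed by ca-a, checked against ca-a: $same_ca"
echo "signed by ca-a, checked against ca-b: $other_ca"
```

The second check is expected to fail. That is the "not signed by the same CA" situation, and it is a different problem from an expired certificate or a missing certSAN.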

k8s Roles and RBAC

Once mutual TLS has passed, how does k8s do the actual access control? It uses the CN (Common Name) and O (Organization) fields in the Subject of the X.509 certificate, which map to the User Name and the Group respectively, and those are what k8s RBAC works with:

$ kubectl get clusterrolebinding
NAME ROLE AGE
cluster-admin ClusterRole/cluster-admin 81m
kubeadm:get-nodes ClusterRole/kubeadm:get-nodes 81m
kubeadm:kubelet-bootstrap ClusterRole/system:node-bootstrapper 81m
kubeadm:node-autoapprove-bootstrap ClusterRole/system:certificates.k8s.io:certificatesigningrequests:nodeclient 81m
kubeadm:node-autoapprove-certificate-rotation ClusterRole/system:certificates.k8s.io:certificatesigningrequests:selfnodeclient 81m
kubeadm:node-proxier ClusterRole/system:node-proxier 81m
...
system:coredns ClusterRole/system:coredns 81m
system:discovery ClusterRole/system:discovery 81m
system:kube-controller-manager ClusterRole/system:kube-controller-manager 81m
system:kube-dns ClusterRole/system:kube-dns 81m
system:kube-scheduler ClusterRole/system:kube-scheduler 81m
system:monitoring ClusterRole/system:monitoring 81m
system:node ClusterRole/system:node 81m
system:node-proxier ClusterRole/system:node-proxier 81m
system:public-info-viewer ClusterRole/system:public-info-viewer 81m
system:service-account-issuer-discovery ClusterRole/system:service-account-issuer-discovery 81m
system:volume-scheduler ClusterRole/system:volume-scheduler 81m

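kube-apiserver creates the default clusterrolebindings above after it starts. kubectl is essentially just a client plus a kubeconfig that talks to kube-apiserver. To find out which kubeconfig kubectl is currently using, use the -v option that every k8s binary accepts and read it from the verbose output:

$ kubectl config view -v=6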
I0604 10:40:46.836786 14513 loader.go:395] Config loaded from file: /root/.kube/config
...

From the output above it is /root/.kube/config. Taking kubeadm as the example, that file contains the certificates themselves:

$ cat /root/.kube/config
certificate-authority-data base64-encoded contents of /etc/kubernetes/pki/ca.crt
client-certificate-data base64-encoded contents of /etc/kubernetes/pki/admin.crt
client-key-data base64-encoded contents of /etc/kubernetes/pki/admin.key
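
Because the values are just base64, the embedded certificate can be pulled out with plain text tools, even on a machine without kubectl. A rough sketch. The function name embedded_client_cert is invented, and it assumes the usual kubeconfig layout where the key and its value sit on one line; it only looks at the first user entry:

```shell
# embedded_client_cert KUBECONFIG_FILE
# Prints the PEM client certificate embedded in a kubeconfig by decoding
# the first client-certificate-data value. No kubectl required.
embedded_client_cert() {
    awk '$1 == "client-certificate-data:" { print $2; exit }' "$1" | base64 -d
}
```

Usage would look like `embedded_client_cert /root/.kube/config | openssl x509 -noout -subject -enddate`.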

The last two are already embedded, so those files do not exist on disk. On some self-built clusters the values are file paths instead, because --embed-certs was not passed when the kubeconfig was generated with kubectl config. The steps to embed them are:

kubectl --kubeconfig /etc/kubernetes/admin.conf config set-credentials admin \
--client-certificate=/etc/kubernetes/pki/admin.crt \
--embed-certs=true \
--client-key=/etc/kubernetes/pki/admin.key
# rm -f /etc/kubernetes/pki/admin.???

kubeadm's Go code never writes these files to disk; it generates the YAML content directly. We can extract the certificate and take a look:

$ openssl x509 -in <(kubectl config view --raw -o jsonpath="{.users[0]['user']['client-certificate-data']}" | base64 -d ) -noout -text

$ openssl x509 -in <(kubectl config view --raw -o jsonpath="{.users[0]['user']['client-certificate-data']}" | base64 -d ) -noout -subject
subject= /O=system:masters/CN=kubernetes-admin
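
To make the CN/O mapping concrete, here is a small sketch that turns a subject line into the user and group names RBAC will see. The function name subject_to_identity is invented. It handles both the older `/O=.../CN=...` layout shown above and the newer `O = ..., CN = ...` layout printed by recent openssl versions, and it assumes the values themselves contain no `/` or `,`:

```shell
# subject_to_identity "SUBJECT LINE"
# Prints "user=<CN>" and "group=<O>" lines for an openssl subject line.
# Other fields such as OU are ignored.
subject_to_identity() {
    printf '%s\n' "$1" |
        sed -e 's/^subject= *//' -e 's/ = /=/g' |
        tr '/,' '\n\n' |
        sed -n -e 's/^ *//' -e 's/^CN=/user=/p' -e 's/^O=/group=/p'
}
```

A certificate with several O entries would print several group lines, matching the fact that a user can belong to more than one group.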

We can see O=system:masters, which corresponds to the clusterrolebinding cluster-admin:

$ kubectl get clusterrolebinding cluster-admin -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2025-06-03T07:36:44Z"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: cluster-admin
  resourceVersion: "160"
  uid: d5638680-38de-4010-a2c9-084645a8ad21
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:masters

In other words, the group system:masters has the permissions of the clusterrole cluster-admin.


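Hands-On

With the theory covered, let's get practical and use the CN (Common Name) and O (Organization) fields of a certificate to create two certificates with different permissions and test them:

  • User test1 can list pods in the default namespace
  • Group test2 can list pods in all namespaces

To sidestep differences in paths, certificate names, suffixes and personal habits, the hands-on part uses cfssl inside the directory layout produced by kubeadm init.

Get into the habit of making a backup before touching certificates, whether they are corrupted or expired:

cd /etc/kubernetes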
cp -a pki pki.bak

Creating the Certificate

Create the config files:

cd /etc/kubernetes/pki/

# Create the signing config used by the CA. This certificate is for client use only, so usages does not need "server auth".
# If you generate every certificate by hand from one ca-config.json, you can take the shortcut of including "server auth" as well.

cat > ca-config.json << EOF
{
  "signing": {
    "default": {
      "expiry": "876000h"
    },
    "profiles": {
      "kubernetes": {
        "usages": [
          "signing",
          "key encipherment",
          "client auth"
        ],
        "expiry": "876000h"
      }
    }
  }
}
EOF

# CN maps to the user, O maps to the group
cat > test1-csr.json << EOF
{
  "CN": "test1",
  "hosts": [],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "O": "test1",
      "OU": "System"
    }
  ]
}
EOF

Sign the certificate:

cfssl gencert \
-ca=ca.crt \
-ca-key=ca.key \
-config=ca-config.json \
-profile=kubernetes test1-csr.json | cfssljson -bare test1
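
If cfssl is not at hand, roughly the same client certificate can be produced with openssl alone. This is a sketch under assumptions, not a drop-in replacement for the cfssl profile above: the function name sign_client_cert is invented, only CN and O are set (no OU), and 36500 days stands in for the 876000h expiry.

```shell
# sign_client_cert NAME GROUP CA_CERT CA_KEY
# Writes NAME-key.pem and NAME.pem in the current directory, with
# CN=NAME, O=GROUP and clientAuth as the only extended key usage.
sign_client_cert() {
    name=$1 group=$2 ca_cert=$3 ca_key=$4
    ext_file=$(mktemp)
    printf 'keyUsage=critical,digitalSignature,keyEncipherment\nextendedKeyUsage=clientAuth\n' > "$ext_file"

    # New private key plus CSR, then sign the CSR with the existing CA.
    openssl req -new -newkey rsa:2048 -nodes \
        -subj "/CN=$name/O=$group" \
        -keyout "$name-key.pem" -out "$name.csr" 2>/dev/null &&
    openssl x509 -req -in "$name.csr" \
        -CA "$ca_cert" -CAkey "$ca_key" -CAcreateserial \
        -days 36500 -extfile "$ext_file" -out "$name.pem" 2>/dev/null
    status=$?

    rm -f "$ext_file" "$name.csr"
    return "$status"
}
```

Called as `sign_client_cert test1 test1 ca.crt ca.key` inside /etc/kubernetes/pki, it would leave test1.pem and test1-key.pem with the same file names that cfssljson -bare produces.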

Test the certificate's permissions:

# Rename the kubeconfig in the home directory so it does not interfere
$ mv ~/.kube/config ~/.kube/config.bak
$ KUBECONFIG= kubectl --server=https://xxx:6443 \
--certificate-authority=/etc/kubernetes/pki/ca.crt \
--client-certificate=/etc/kubernetes/pki/test1.pem \
--client-key=/etc/kubernetes/pki/test1-key.pem get pod
Error from server (Forbidden): pods is forbidden: User "test1" cannot list resource "pods" in API group "" in the namespace "default"

kube-apiserver is essentially an HTTP server, so we can also test it with curl:

$ curl -X GET \
--cacert /etc/kubernetes/pki/ca.crt \
--cert /etc/kubernetes/pki/test1.pem \
--key /etc/kubernetes/pki/test1-key.pem \
-H "Accept: application/json" \
"https://xxx:6443/api/v1/namespaces/default/pods?limit=500"
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "pods is forbidden: User \"test1\" cannot list resource \"pods\" in API group \"\" in the namespace \"default\"",
  "reason": "Forbidden",
  "details": {
    "kind": "pods"
  },
  "code": 403
}

That is because we have not created any RBAC for test1, that is, a rolebinding. Create one and try again:

$ mv ~/.kube/config.bak ~/.kube/config
$ cat > test1-rbac.yml << EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: test1-role
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: test1-rolebinding
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: test1-role
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: test1
EOF

$ kubectl apply -f test1-rbac.yml

Then test again:

$ mv ~/.kube/config ~/.kube/config.bak
$ KUBECONFIG= kubectl --server=https://xxx:6443 \
--certificate-authority=/etc/kubernetes/pki/ca.crt \
--client-certificate=/etc/kubernetes/pki/test1.pem \
--client-key=/etc/kubernetes/pki/test1-key.pem get pod
No resources found in default namespace.
$ KUBECONFIG= kubectl --server=https://xxx:6443 --certificate-authority=/etc/kubernetes/pki/ca.crt --client-certificate=/etc/kubernetes/pki/test1.pem --client-key=/etc/kubernetes/pki/test1-key.pem get svc
Error from server (Forbidden): services is forbidden: User "test1" cannot list resource "services" in API group "" in the namespace "default"

Test with curl as well:

$ curl -X GET \
--cacert /etc/kubernetes/pki/ca.crt \
--cert /etc/kubernetes/pki/test1.pem \
--key /etc/kubernetes/pki/test1-key.pem \
-H "Accept: application/json" "https://10.xxx.xx.xxx:6443/api/v1/namespaces/default/pods?limit=500"
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "107116"
  },
  "items": []
}
$ curl -X GET \
--cacert /etc/kubernetes/pki/ca.crt \
--cert /etc/kubernetes/pki/test1.pem \
--key /etc/kubernetes/pki/test1-key.pem \
-H "Accept: application/json" "https://10.xxx.xx.xxx:6443/api/v1/namespaces/default/services?limit=500"
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "services is forbidden: User \"test1\" cannot list resource \"services\" in API group \"\" in the namespace \"default\"",
  "reason": "Forbidden",
  "details": {
    "kind": "services"
  },
  "code": 403
}

The certificate's permissions behave as expected. A kubeconfig can hold multiple config sections, and the kubectl steps below generate the matching sections:

# Set the cluster parameters: the CA certificate and the apiserver address
kubectl --kubeconfig=test1.kubeconfig config set-cluster kubernetes \
--certificate-authority=/etc/kubernetes/pki/ca.crt \
--embed-certs=true \
--server=https://xxx:6443

# Set the client credentials: the certificate and private key to use
kubectl --kubeconfig=test1.kubeconfig config set-credentials test1 \
--client-certificate=test1.pem \
--embed-certs=true \
--client-key=test1-key.pem

# Add a context named kubernetes that uses the cluster kubernetes added above and the credentials named test1
kubectl --kubeconfig=test1.kubeconfig config set-context kubernetes \
--cluster=kubernetes --user=test1

# Select the default context
kubectl --kubeconfig=test1.kubeconfig config use-context kubernetes

Then test with that kubeconfig:

[root@zgz pki]# kubectl --kubeconfig=test1.kubeconfig get pod
No resources found in default namespace.
[root@zgz pki]# kubectl --kubeconfig=test1.kubeconfig get svc
Error from server (Forbidden): services is forbidden: User "test1" cannot list resource "services" in API group "" in the namespace "default"

The group test2 works the same way; just pay attention to the O field, and use a clusterrole and a clusterrolebinding instead. Try it yourself as an exercise.
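
For anyone who wants to compare notes afterwards, one possible answer to the RBAC half of the exercise is sketched below. The object names test2-clusterrole and test2-clusterrolebinding are made up; the part that matters is that the subject is of kind Group and its name matches the O field of the certificate.

```shell
# Writes a ClusterRole that can get/list pods and binds it cluster-wide
# to the group test2 (the O field of the client certificate).
cat > test2-rbac.yml << 'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: test2-clusterrole
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: test2-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: test2-clusterrole
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: test2
EOF
```

After `kubectl apply -f test2-rbac.yml`, any certificate signed by the cluster CA with "O": "test2" in its CSR should be able to list pods in every namespace, whatever its CN is.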

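certSAN of kube-apiserver

This part deals with fixing the kube-apiserver certificate (the same steps also work when it has expired). For example, many people hit an error after kubeadm init because a host is missing from the certSANs:

$ echo '127.0.0.1 santest' >> /etc/hosts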
$ kubectl --server https://santest:6443 get pod
...
Unable to connect to the server: tls: failed to verify certificate: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, node, not santest

In that case you can use the CA to sign a new certificate that includes santest:

cd /etc/kubernetes/pki/

# Create the signing config used by the CA. This certificate is for server use only, so usages does not need "client auth".
# If you generate every certificate by hand from one ca-config.json, you can take the shortcut of including "client auth" as well.

cat > ca-config.json << EOF
{
  "signing": {
    "default": {
      "expiry": "876000h"
    },
    "profiles": {
      "kubernetes": {
        "usages": [
          "signing",
          "key encipherment",
          "server auth"
        ],
        "expiry": "876000h"
      }
    }
  }
}
EOF


# Look at the existing certSANs
openssl x509 -noout -in apiserver.crt -certopt no_subject,no_header,no_version,no_serial,no_signame,no_validity,no_issuer,no_pubkey,no_sigdump,no_aux -text

# Write the old entries from the output above, plus the ones to add, into the file
cat > kubernetes-csr.json << EOF
{
  "CN": "kube-apiserver",
  "hosts": [
    "127.0.0.1",
    "::1",
    "localhost",
    "santest",
    "10.xx",
    "10.96.0.1",
    "kubernetes",
    "kubernetes.default",
    "kubernetes.default.svc",
    "kubernetes.default.svc.cluster",
    "kubernetes.default.svc.cluster.local"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "O": "k8s",
      "OU": "Kubernetes"
    }
  ]
}
EOF

# Sign the certificate

cfssl gencert \
-ca=ca.crt \
-ca-key=ca.key \
-config=ca-config.json \
-profile=kubernetes kubernetes-csr.json | cfssljson -bare apiserver2
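
Hand-editing the hosts array is where a missing comma or a duplicated entry tends to sneak in. A small helper can build the array from a plain list instead. This is only a sketch; the function name hosts_json is invented, and it assumes the entries are plain host names or IPs that need no JSON escaping:

```shell
# hosts_json: reads host names / IPs from stdin, one per line, and prints
# them as a JSON array for the "hosts" field of a cfssl CSR file.
# Blank lines and duplicates are dropped; first-seen order is kept.
hosts_json() {
    awk '
        NF && !seen[$1]++ { names[++n] = $1 }
        END {
            printf "["
            for (i = 1; i <= n; i++)
                printf "%s\"%s\"", (i > 1 ? ", " : ""), names[i]
            print "]"
        }
    '
}
```

Feed it the old certSANs plus the new name, one per line, and paste the result after `"hosts":` in kubernetes-csr.json.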

Change the kube-apiserver command line to use apiserver2.pem and apiserver2-key.pem:

$ cd /etc/kubernetes/manifests/
$ grep -E 'apiserver.(crt|key)' /etc/kubernetes/manifests/kube-apiserver.yaml
- --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
- --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
$ sed -ri -e 's#/apiserver.crt#/apiserver2.pem#' -e 's#/apiserver.key#/apiserver2-key.pem#' /etc/kubernetes/manifests/kube-apiserver.yaml
$ grep -E -- '--tls-(cert|private)' /etc/kubernetes/manifests/kube-apiserver.yaml
- --tls-cert-file=/etc/kubernetes/pki/apiserver2.pem
- --tls-private-key-file=/etc/kubernetes/pki/apiserver2-key.pem

Then test with the santest domain from above:

$ kubectl --server https://santest:6443 get pod
No resources found in default namespace.
$ kubectl --server https://santest:6443 get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 1h

Troubleshooting Cases

kubectl Certificate Expired

$ kubectl apply -f /tmp/test-svc.yml
... x509: certificate has expired or is not yet valid: current time 2025-05-20T23:25:51+08:00 is after 2025-01-16T02:16:34Z

Check the expiry time of the certificate embedded in the kubeconfig:

$ openssl x509 -in <(kubectl config view --raw -o jsonpath="{.users[0]['user']['client-certificate-data']}" | base64 -d ) -noout -enddate
notAfter=Jan 16 02:16:34 2025 GMT

admin.pem had apparently been re-signed on site already, and it has not expired:

$ openssl x509 -in admin.pem -noout -dates
notBefore=Jan 17 03:05:00 2024 GMT
notAfter=Dec 24 03:05:00 2123 GMT

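So all that is needed is to embed the certificate into a newly generated kubeconfig:

# Our certificate suffixes and file paths differ from yours, so do not copy this verbatim. Pointing at the kubeconfig via the env var or via the command line gives the same result.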
cd /etc/kubernetes/cluster1/ssl

KUBECONFIG=/etc/kubernetes/cluster1/.kube/config2 \
kubectl config set-cluster kubernetes \
--certificate-authority=ca.pem \
--embed-certs=true \
--server=https://127.0.0.1:8443

KUBECONFIG=/etc/kubernetes/cluster1/.kube/config2 \
kubectl config set-credentials admin \
--client-certificate=admin.pem \
--embed-certs=true \
--client-key=admin-key.pem

KUBECONFIG=/etc/kubernetes/cluster1/.kube/config2 \
kubectl config set-context kubernetes \
--cluster=kubernetes --user=admin

KUBECONFIG=/etc/kubernetes/cluster1/.kube/config2 \
kubectl config use-context kubernetes

KUBECONFIG=/etc/kubernetes/cluster1/.kube/config2 kubectl get node

Deployment ReplicaSet Creation Fails with an Expired-Certificate Error

$ kubectl -n default describe deploy deployment-example
Name: deployment-example
Namespace: default
CreationTimestamp: Fri, 30 May 2025 15:45:03 +0800
Labels: <none>
Annotations: <none>
Selector: app=nginx
Replicas: 2 desired | 0 updated | 0 total | 0 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.19-alpine
Port: 12343/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Progressing False ReplicaSetCreateError
OldReplicaSets: <none>
NewReplicaSet: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ReplicaSetCreateError 21s (x7 over 21s) deployment-controller Failed to create new replica set "deployment-example-b4f6c7989": Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:06:13+08:00 is after 2024-08-28T14:45:36Z
Warning ReplicaSetCreateError 20s (x2 over 20s) deployment-controller Failed to create new replica set "deployment-example-b4f6c7989": Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:06:14+08:00 is after 2024-08-28T14:45:36Z
Warning ReplicaSetCreateError 18s deployment-controller Failed to create new replica set "deployment-example-b4f6c7989": Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:06:16+08:00 is after 2024-08-28T14:45:36Z
Warning ReplicaSetCreateError 16s deployment-controller Failed to create new replica set "deployment-example-b4f6c7989": Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:06:18+08:00 is after 2024-08-28T14:45:36Z
Warning ReplicaSetCreateError 11s deployment-controller Failed to create new replica set "deployment-example-b4f6c7989": Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:06:23+08:00 is after 2024-08-28T14:45:36Z
Warning ReplicaSetCreateError 0s deployment-controller Failed to create new replica set "deployment-example-b4f6c7989": Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:06:34+08:00 is after 2024-08-28T14:45:36Z

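This environment was deployed from binaries. ReplicaSets are created by kube-controller-manager, so this error calls for a look at the kube-controller-manager logs. The k8s control-plane components use a lease object to make sure only one instance is actually doing the work:

$ kubectl -n kube-system get lease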
NAME HOLDER AGE
kube-controller-manager ubuntu-Standard-PC-i440FX-PIIX-1996_296a57fb-a219-4301-a0a6-62c3cd09e0f2 639d
kube-scheduler ubuntu-Standard-PC-i440FX-PIIX-1996_edd2caff-d647-4633-8bd5-2d9788986e1f 639d

The holder name is generated as follows:

// https://github.com/kubernetes/kubernetes/blob/v1.29.5/cmd/kube-controller-manager/app/controllermanager.go#L256-L286
id, err := os.Hostname()
if err != nil {
	return err
}

// add a uniquifier so that two processes on the same host don't accidentally both become active
id = id + "_" + string(uuid.NewUUID())
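
Given that rule, the hostname can be recovered from a holder identity by cutting off everything from the last underscore. A tiny sketch; the function name holder_hostname is invented, and it assumes the uuid suffix itself contains no underscore:

```shell
# holder_hostname HOLDER_IDENTITY
# Strips the "_<uuid>" suffix that leader election appends, leaving the
# hostname part of a lease holder identity.
holder_hostname() {
    printf '%s\n' "${1%_*}"
}
```

It could be fed from `kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}'`. This only helps when the machines have distinct hostnames.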

It turned out every machine had the same hostname, so there was no way to tell which machine's kube-controller-manager currently held the lease. Fine, check the logs on each one:

$ journalctl -xe --no-pager -u kube-controller-manager.service
-- Logs begin at Fri 2025-05-09 14:24:19 CST, end at Fri 2025-05-30 22:21:03 CST. --
May 22 00:01:48 ubuntu-Standard-PC-i440FX-PIIX-1996 kube-controller-manager[53418]: E0522 00:01:48.204891 53418 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: leader changed

May 30 22:08:57 ubuntu-Standard-PC-i440FX-PIIX-1996 kube-controller-manager[22314]: E0530 22:08:57.593721 22314 deployment_controller.go:495] Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:08:57+08:00 is after 2024-08-28T14:45:36Z
May 30 22:08:57 ubuntu-Standard-PC-i440FX-PIIX-1996 kube-controller-manager[22314]: I0530 22:08:57.593752 22314 deployment_controller.go:496] Dropping deployment "default/deployment-example" out of the queue: Get "https://[::1]:6443/api/v1/namespaces/default/resourcequotas": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:08:57+08:00 is after 2024-08-28T14:45:36Z
May 30 22:08:57 ubuntu-Standard-PC-i440FX-PIIX-1996 kube-controller-manager[22314]: I0530 22:08:57.593824 22314 event.go:291] "Event occurred" object="default/deployment-example" kind="Deployment" apiVersion="apps/v1" type="Warning" reason="ReplicaSetCreateError" message="Failed to create new replica set \"deployment-example-b4f6c7989\": Get \"https://[::1]:6443/api/v1/namespaces/default/resourcequotas\": x509: certificate has expired or is not yet valid: current time 2025-05-30T22:08:57+08:00 is after 2024-08-28T14:45:36Z"

Get the kubeconfig path from the command line:

$ systemctl cat kube-controller-manager.service
# /etc/systemd/system/kube-controller-manager.service
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
ExecStart=/data/kube/bin/kube-controller-manager \
--address=127.0.0.1 \
--kubeconfig=/etc/kubernetes/cluster1/ssl/kube-controller-manager.kubeconfig \
...
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Check the validity dates:

$ kubectl --kubeconfig /etc/kubernetes/cluster1/ssl/kube-controller-manager.kubeconfig \
config view --raw -o jsonpath="{.users[0]['user']['client-certificate-data']}" | base64 -d > test.pem
$ openssl x509 -in test.pem -noout -enddate
notAfter=Aug 5 15:37:00 2123 GMT

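The dates were fine, and the certificate of every kube-controller-manager instance checked out. Restarting the kube-controller-manager instances one at a time made no difference; only restarting kube-apiserver fixed it. It looks like a caching bug in kube-apiserver. Note that the failing request targets https://[::1]:6443, which suggests the expired certificate belonged to a connection kube-apiserver makes to itself rather than to kube-controller-manager.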
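
One thing that narrows the search in a case like this: the error itself quotes the notAfter of the offending certificate (2024-08-28T14:45:36Z here), and that value can be compared with the -enddate output of each candidate certificate. A small sketch for pulling the timestamp out of a log line. The function name expiry_from_error is invented, and the pattern assumes the Go error wording shown in the logs above:

```shell
# expiry_from_error "ERROR MESSAGE"
# Prints the notAfter timestamp quoted in a Go "x509: certificate has
# expired or is not yet valid: current time ... is after ..." error.
expiry_from_error() {
    printf '%s\n' "$1" |
        sed -n 's/.*current time [^ ]* is after \([^ "]*\).*/\1/p'
}
```

The result is in RFC 3339 form while openssl prints dates like `Aug 28 14:45:36 2024 GMT`, so the comparison is by eye or after converting one of the two. A certificate whose end date does not match, like the 2123 one above, is not the certificate the error is about.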

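kubelet Certificate Rotation

The certificates live under /var/lib/kubelet. Sometimes kubelet does not rotate them automatically; in that case back up the certificates in that directory and then restart kubelet. It is also recommended to configure kube-controller-manager so that the rotated certificates it signs are valid for longer.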

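Miscellaneous

This applies not only to k8s certificates. etcd certificates work the same way, and k8s access to etcd is much the same. If the hands-on part above is hard to follow, you can take a shortcut with the usages in ca-config.json and list both client and server as below, so that one CA config file serves every certificate:

"usages": [
  "signing",
  "key encipherment",
  "server auth",
  "client auth"
],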

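Read every certificate-related error message and log carefully. An expired certificate, a certSAN mismatch, and a verification failure because the certificates do not come from the same CA are different problems. Do not blindly follow whatever certificate article or blog post you happen to find. Back up the existing certificates before touching them, and when generating a kubeconfig file, have kubectl write it to a new path and leave the old one alone.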
