
A walkthrough of recovering a k8s master after filesystem corruption

2020/07/23

Background

The dev team reported that in one of their clusters a master's filesystem was corrupted and the machine would not boot. The cluster runs as three VMs on OpenStack, and a failure on the virtualization host had corrupted the VM's filesystem. All three machines are combined master+node. I walked them through the repair and got the machine to boot; the repair steps were the same as in my earlier post about an openSUSE rescue.

After it came back up I logged in to take a look. Since the cluster is HA, only this node was affected and the cluster itself was fine:

[root@k8s-m1 ~]# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
10.252.146.104 NotReady <none> 30d v1.16.9 10.252.146.104 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.105 Ready <none> 30d v1.16.9 10.252.146.105 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.106 Ready <none> 30d v1.16.9 10.252.146.106 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11

Try starting docker:

[root@k8s-m1 ~]# systemctl start docker
Job for docker.service canceled.
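A canceled job usually means systemd aborted it because another unit in the same transaction failed to start. With the docker-ce packaging, docker.service is normally bound to containerd.service; you can confirm the dependency directives in effect on your own unit:

# show which units docker.service is tied to
systemctl show docker -p BindsTo -p Requires -p After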

Sure enough it won't start on its own, so list the units that failed:

containerd

[root@k8s-m1 ~]# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● containerd.service loaded failed failed containerd container runtime

If containerd.service doesn't show up there, it may be managed as a legacy SysV service instead; try service containerd start.
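A quick way to tell which init system owns it (paths here assume a typical CentOS layout):

# native systemd unit, or only a SysV init script?
systemctl list-unit-files 'containerd*'
ls /etc/init.d/ | grep containerd    # if only this matches, fall back to: service containerd start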

Check containerd's logs:

[root@k8s-m1 ~]# journalctl -xe -u containerd
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481459735+08:00" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481472223+08:00" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481517630+08:00" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481562176+08:00" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481964349+08:00" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481996158+08:00" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482048208+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482081110+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482096598+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482112263+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482123307+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482133477+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482142943+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482151644+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482160741+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482184201+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482194643+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482206871+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482215454+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482365838+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482404139+08:00" level=info msg="containerd successfully booted in 0.003611s"
Jul 23 11:20:11 k8s-m1 containerd[9186]: panic: runtime error: invalid memory address or nil pointer dereference
Jul 23 11:20:11 k8s-m1 containerd[9186]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5626b983c259]
Jul 23 11:20:11 k8s-m1 containerd[9186]: goroutine 55 [running]:
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Cursor(...)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:84
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Get(0x0, 0x5626bb7e3f10, 0xb, 0xb, 0x0, 0x2, 0x4)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:260 +0x39
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots.func6(0x7fe557c63020, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x5626b95eec72)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:222 +0xcb
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).ForEach(0xc0003d1780, 0xc00057b640, 0xa, 0xa)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:388 +0x100
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots(0x5626bacedde0, 0xc0003d1680, 0xc0002ee2a0, 0xc00031a3c0, 0xc000527a60, 0x7fe586a43fff)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:216 +0x4df
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked.func1(0xc0002ee2a0, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:359 +0x165
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*DB).View(0xc00000c1e0, 0xc00008b860, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:701 +0x92
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.

Judging from the panic stack trace, this is very similar to my earlier post about a docker startup panic: in both cases a boltdb file is corrupted. Let's get the git revision and find the relevant code paths:

[root@k8s-m1 ~]# systemctl cat containerd | grep ExecStart
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

[root@k8s-m1 ~]# /usr/bin/containerd --version
containerd containerd.io 1.2.13 7ad184331fa3e55e52b890ea95e65ba581ae3429

Browsing GitHub by that commit hash gives a 404, so I had to browse by the release tag instead. From the relevant code, the boltdb file is named meta.db:
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L257
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L79
https://github.com/containerd/containerd/blob/v1.2.13/services/server/server.go#L261-L268

Now find out what the ic.Root path is:

[root@k8s-m1 ~]# /usr/bin/containerd --help | grep config
config information on the containerd config
--config value, -c value path to the configuration file (default: "/etc/containerd/config.toml")

[root@k8s-m1 ~]# grep root /etc/containerd/config.toml
#root = "/var/lib/containerd"
[root@k8s-m1 ~]# find /var/lib/containerd -type f -name meta.db
/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
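Before touching the file, you can optionally confirm it really is damaged; this assumes the bbolt CLI from go.etcd.io/bbolt is available (it is not shipped with containerd):

# run a consistency check over every page; a healthy file prints OK
bbolt check /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db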

Having found the boltdb file, rename it and start containerd again:

[root@k8s-m1 ~]# mv /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db{,.bak}
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-07-23 11:20:11 CST; 17min ago
Docs: https://containerd.io
Process: 9186 ExecStart=/usr/bin/containerd (code=exited, status=2)
Process: 9182 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 9186 (code=exited, status=2)

Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
[root@k8s-m1 ~]# systemctl restart containerd.service
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
Active: active (running) since Thu 2020-07-23 11:25:37 CST; 1s ago
Docs: https://containerd.io
Process: 15661 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 15663 (containerd)
Tasks: 16
Memory: 28.6M
CGroup: /system.slice/containerd.service
└─15663 /usr/bin/containerd

Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496725460+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496734129+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496742793+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496751740+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496775185+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496785498+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496794873+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496803178+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496944458+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496958031+08:00" level=info msg="containerd successfully booted in 0.003994s"
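
As a quick sanity check that the restored socket answers, query the daemon with the bundled ctr client:

# ask the daemon for its version over the unix socket
ctr --address /run/containerd/containerd.sock version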

docker

With containerd up, start docker:

[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-docker.conf
Active: inactive (dead) since Thu 2020-07-23 11:20:13 CST; 18min ago
Docs: https://docs.docker.com
Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 9187 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
Main PID: 9187 (code=exited, status=0/SUCCESS)

Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956503485+08:00" level=error msg="Stop container error: Failed to stop container 68860c8d16b9ce7e74e8efd9db00e70a57eef1b752c2e6c703073c0bce5517d3 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954347116+08:00" level=error msg="Stop container error: Failed to stop container 5ec9922beed1276989f1866c3fd911f37cc26aae4e4b27c7ce78183a9a4725cc with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.953615411+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956557179+08:00" level=error msg="Stop container error: Failed to stop container 6d0096fbcd4055f8bafb6b38f502a0186cd1dfca34219e9dd6050f512971aef5 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954601191+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956600790+08:00" level=error msg="Stop container error: Failed to stop container 6d1175ba6c55cb05ad89f4134ba8e9d3495c5acb5f07938dc16339b7cca013bf with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957188989+08:00" level=info msg="Daemon shutdown complete"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957212655+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957209679+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
Jul 23 11:20:13 k8s-m1 systemd[1]: Stopped Docker Application Container Engine.
[root@k8s-m1 ~]# systemctl start docker
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-docker.conf
Active: active (running) since Thu 2020-07-23 11:26:11 CST; 1s ago
Docs: https://docs.docker.com
Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 16156 ExecStartPost=/sbin/iptables -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
Main PID: 15974 (dockerd)
Tasks: 62
Memory: 89.1M
CGroup: /system.slice/docker.service
└─15974 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.851106564+08:00" level=error msg="cb4e16249cd8eac48ed734c71237195f04d63c56c55c0199b3cdf3d49461903d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.860456898+08:00" level=error msg="d9bbcab186ccb59f96c95fc886ec1b66a52aa96e45b117cf7d12e3ff9b95db9f cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.872405757+08:00" level=error msg="07eb7a09bc8589abcb4d79af4b46798327bfb00624a7b9ceea457de392ad8f3d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.877896618+08:00" level=error msg="f5867657025bd7c3951cbd3e08ad97338cf69df2a97967a419e0e78eda869b73 cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.143661583+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.198200760+08:00" level=info msg="Loading containers: done."
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.219959208+08:00" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.220049865+08:00" level=info msg="Daemon has completed initialization"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.232373131+08:00" level=info msg="API listen on /var/run/docker.sock"
Jul 23 11:26:11 k8s-m1 systemd[1]: Started Docker Application Container Engine.
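
A quick check that the daemon is really healthy again:

# server version plus how many containers have come back up
docker info --format '{{.ServerVersion}}, {{.ContainersRunning}} running'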

etcd

etcd also fails to start; check its status in the journal:

[root@k8s-m1 ~]# journalctl -xe -u etcd
Jul 23 11:26:15 k8s-m1 etcd[18129]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:26:15 k8s-m1 etcd[18129]: etcd Version: 3.3.20
Jul 23 11:26:15 k8s-m1 etcd[18129]: Git SHA: 9fd7e2b80
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go Version: go1.12.17
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go OS/Arch: linux/amd64
Jul 23 11:26:15 k8s-m1 etcd[18129]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:26:15 k8s-m1 etcd[18129]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:26:15 k8s-m1 etcd[18129]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring peer auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for peers on https://10.252.146.104:2380
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring client auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: pprof is enabled under /debug/pprof
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 127.0.0.1:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 10.252.146.104:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: skipped unexpected non snapshot file 000000000000002e-000000000052f2be.snap.broken
Jul 23 11:26:15 k8s-m1 etcd[18129]: recovered store from snapshot at index 5426092
Jul 23 11:26:15 k8s-m1 etcd[18129]: restore compact to 3967425
Jul 23 11:26:15 k8s-m1 etcd[18129]: cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:26:15 k8s-m1 systemd[1]: Failed to start Etcd Service.
[root@k8s-m1 ~]# ll /var/lib/etcd/member/snap/
total 8560
-rw-r--r-- 1 root root 13499 Jul 20 13:36 000000000000002e-000000000052cbac.snap
-rw-r--r-- 2 root root 128360 Jul 20 13:01 000000000000002e-000000000052f2be.snap.broken
-rw------- 1 root root 8617984 Jul 23 11:26 db

This cluster was deployed with my Kubernetes-ansible project (a star is appreciated), which ships with a backup script, but the machine went down three days ago, so the local backups stop at July 20:

[root@k8s-m1 ~]# ll /opt/etcd_bak/
total 41524
-rw-r--r-- 1 root root 8618016 Jul 17 02:00 etcd-2020-07-17-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 18 02:00 etcd-2020-07-18-02:00:01.db
-rw-r--r-- 1 root root 8323104 Jul 19 02:00 etcd-2020-07-19-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 20 02:00 etcd-2020-07-20-02:00:01.db
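
The backups are taken nightly by a cron job; the heart of such a script is just an etcdctl v3 snapshot save against the local plain-HTTP client port seen in the logs above (a sketch, not the exact script):

# nightly snapshot; the file name pattern matches the listing above
ETCDCTL_API=3 etcdctl --endpoints http://127.0.0.1:2379 \
    snapshot save "/opt/etcd_bak/etcd-$(date +%F-%H:%M:%S).db"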

Restoring etcd from a backup file

There is a restore playbook, but the precondition is that etcd v2 and v3 data must not coexist, otherwise the backup cannot be restored; on our production clusters the v2 store is disabled. The key part is lines 26 to 42 of the tasks file. I copied the July 23 etcd backup file from one of the other masters, changed the host in the playbook, and ran it.
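Under the hood those tasks boil down to a standard etcdctl v3 snapshot restore. A minimal manual sketch: the member name etcd-001 and this node's peer URLs come from the logs further below, while the other member names and the cluster token are placeholders you would read from your own config:

# stop etcd, wipe the data dir, rebuild it from the snapshot, then start again
export ETCDCTL_API=3
systemctl stop etcd
rm -rf /var/lib/etcd
etcdctl snapshot restore /opt/etcd_bak/etcd-bak.db \
    --name etcd-001 \
    --data-dir /var/lib/etcd \
    --initial-advertise-peer-urls https://10.252.146.104:2380 \
    --initial-cluster 'etcd-001=https://10.252.146.104:2380,<m2>=https://10.252.146.105:2380,<m3>=https://10.252.146.106:2380' \
    --initial-cluster-token '<cluster-token>'    # placeholders: use your real member names and token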

[root@k8s-m1 ~]# cp -a /var/lib/etcd{,.bak} # back up the etcd data dir first
[root@k8s-m1 ~]# cd Kubernetes-ansible
[root@k8s-m1 Kubernetes-ansible]# ansible-playbook restoreETCD.yml -e 'db=/opt/etcd_bak/etcd-bak.db'

PLAY [10.252.146.104] **********************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *********************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : 检测备份文件存在否] *************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 停止etcd] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 删除etcd数据目录] ************************************************************************************************************************************************************************************************************
ok: [10.252.146.104] => (item=/var/lib/etcd)

TASK [restoreETCD : 分发备份文件] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 恢复备份] ******************************************************************************************************************************************************************************************************************
changed: [10.252.146.104]

TASK [restoreETCD : 启动etcd] ****************************************************************************************************************************************************************************************************************
fatal: [10.252.146.104]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************
10.252.146.104 : ok=7 changed=1 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0

Check the logs:

[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:46 k8s-m1 etcd[58954]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:46 k8s-m1 etcd[58954]: etcd Version: 3.3.20
Jul 23 11:27:46 k8s-m1 etcd[58954]: Git SHA: 9fd7e2b80
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go Version: go1.12.17
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go OS/Arch: linux/amd64
Jul 23 11:27:46 k8s-m1 etcd[58954]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:46 k8s-m1 etcd[58954]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring peer auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring client auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: pprof is enabled under /debug/pprof
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:47 k8s-m1 etcd[58954]: member ac2dcf6aed12e8f1 has already been bootstrapped
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:27:47 k8s-m1 systemd[1]: Failed to start Etcd Service.

The fix for member xxxx has already been bootstrapped is to change the line below in the config file; remember to change it back once the member is up and running:

change initial-cluster-state: 'new' to initial-cluster-state: 'existing'
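
A one-liner against the config file shown in the logs, assuming the value is quoted exactly as above:

# flip the flag in place; run the reverse sed once the member is healthy again
sed -i "s/initial-cluster-state: 'new'/initial-cluster-state: 'existing'/" /etc/etcd/etcd.config.yml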

After that it starts successfully:

[root@k8s-m1 Kubernetes-ansible]# systemctl start etcd
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:55 k8s-m1 etcd[59889]: etcd Version: 3.3.20
Jul 23 11:27:55 k8s-m1 etcd[59889]: Git SHA: 9fd7e2b80
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go Version: go1.12.17
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go OS/Arch: linux/amd64
Jul 23 11:27:55 k8s-m1 etcd[59889]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:55 k8s-m1 etcd[59889]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:27:55 k8s-m1 etcd[59889]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring peer auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring client auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: pprof is enabled under /debug/pprof
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: recovered store from snapshot at index 5952463
Jul 23 11:27:55 k8s-m1 etcd[59889]: restore compact to 4369703
Jul 23 11:27:55 k8s-m1 etcd[59889]: name = etcd-001
Jul 23 11:27:55 k8s-m1 etcd[59889]: data dir = /var/lib/etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: member dir = /var/lib/etcd/member
Jul 23 11:27:55 k8s-m1 etcd[59889]: dedicated WAL dir = /var/lib/etcd/wal
Jul 23 11:27:55 k8s-m1 etcd[59889]: heartbeat = 100ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: election = 1000ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: snapshot count = 5000
Jul 23 11:27:55 k8s-m1 etcd[59889]: advertise client URLs = https://10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: restarting member ac2dcf6aed12e8f1 in cluster 367e2aebc6430cbe at commit index 5952491
Jul 23 11:27:55 k8s-m1 etcd[59889]: ac2dcf6aed12e8f1 became follower at term 47
Jul 23 11:27:55 k8s-m1 etcd[59889]: newRaft ac2dcf6aed12e8f1 [peers: [1e713be314744d53,8b1621b475555fd9,ac2dcf6aed12e8f1], term: 47, commit: 5952491, applied: 5952463, lastindex: 5952491, lastterm: 47]
Jul 23 11:27:55 k8s-m1 etcd[59889]: enabled capabilities for version 3.3
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 1e713be314744d53 [https://10.252.146.105:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 8b1621b475555fd9 [https://10.252.146.106:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member ac2dcf6aed12e8f1 [https://10.252.146.104:2380] to cluster 367e2aebc6430cbe from store

Check the cluster status:

[root@k8s-m1 Kubernetes-ansible]# etcd-ha
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.252.146.104:2379 | ac2dcf6aed12e8f1 | 3.3.20 | 8.3 MB | false | 47 | 5953557 |
| https://10.252.146.105:2379 | 1e713be314744d53 | 3.3.20 | 8.6 MB | false | 47 | 5953557 |
| https://10.252.146.106:2379 | 8b1621b475555fd9 | 3.3.20 | 8.3 MB | true | 47 | 5953557 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
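
etcd-ha here is a shell alias installed by the ansible deployment; the underlying command is roughly the standard endpoint status table (the exact alias definition, including its TLS flags, is an assumption):

# per-endpoint version, db size, leader flag and raft term/index
ETCDCTL_API=3 etcdctl \
    --endpoints https://10.252.146.104:2379,https://10.252.146.105:2379,https://10.252.146.106:2379 \
    endpoint status --write-out=table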

Then, after bringing up the three control-plane components (kube-apiserver and friends) plus the kubelet:

[root@k8s-m1 Kubernetes-ansible]# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
10.252.146.104 Ready <none> 30d v1.16.9 10.252.146.104 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.105 Ready <none> 30d v1.16.9 10.252.146.105 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.106 Ready <none> 30d v1.16.9 10.252.146.106 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11

The pods are gradually self-healing as well.
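
To watch the recovery, list everything that has not yet made it back to Running:

# pods across all namespaces that are still recovering
kubectl get pod -A --no-headers | grep -v Running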
