zhangguanzhang's Blog

A record of a production k8s node failure

2018/06/25
  • Received a Zabbix alert by email: the HTTP status of the business web login page was not 200; it later recovered on its own.

This means the service had gone down at least once. Logging onto the machines, I found that one node in the cluster was in NodeLost state.
On that node, the related services were all down.
Further digging showed the root partition was full: the k8s logs had piled up in /var/log/.

[root@cloudos02 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 219G 216G 0 100% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 12K 63G 1% /dev/shm
tmpfs 63G 226M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda3 197M 136M 61M 70% /boot
/dev/sda2 200M 0 200M 0% /boot/efi
tmpfs 13G 0 13G 0% /run/user/0
[root@cloudos02 /var/log/]# du -shx /var/log/* | grep -P '^\S+?G'
31G /var/log/heat
1.1G /var/log/keystone
163G /var/log/kubernetes
3.7G /var/log/nova
1.7G /var/log/openstack-compute
  • The log file names follow a regular pattern, so I simply deleted the log files older than 20 days (a sketch for automating this cleanup follows after the df output below).
[root@cloudos02 kubernetes]# find -mtime +20 -name 'kube*.cloudos02*' -exec rm -f {} \;
[root@cloudos02 kubernetes]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 219G 117G 91G 57% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 12K 63G 1% /dev/shm
tmpfs 63G 226M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda3 197M 136M 61M 70% /boot
/dev/sda2 200M 0 200M 0% /boot/efi
tmpfs 13G 0 13G 0% /run/user/0
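
To keep the logs from filling the root partition again, the same cleanup can be automated, for example with a daily cron job. This is a minimal sketch, not part of the original setup; the file name, schedule, retention window, and path are assumptions and should be adapted (or replaced with a proper logrotate rule):

# /etc/cron.d/k8s-log-cleanup  (hypothetical file name)
# Every day at 02:00, delete kube* log files under /var/log/kubernetes
# that have not been modified for more than 20 days.
0 2 * * * root find /var/log/kubernetes -mtime +20 -name 'kube*' -exec rm -f {} \;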
  • etcd is the core of k8s, and sure enough etcd was the problem: all the k8s-related services were down.
[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
failed to check the health of member 658a31702f200e95 on http://10.12.0.22:2379: Get http://10.12.0.22:2379/health: dial tcp 10.12.0.22:2379: getsockopt: connection refused
member 658a31702f200e95 is unreachable: [http://10.12.0.22:2379] are all unreachable
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
cluster is healthy
  • Check the logs
[root@cloudos02 ~]# journalctl -xe -u etcd2
(a large amount of output complaining about snap.broken)

From the logs it is clear that etcd's data files were corrupted, almost certainly because the full root partition meant the data being replicated to this member could not be written.
First, find where etcd's data directory lives; the fix is simply to delete this node's data directory and let it resync from the other members.
Since etcd runs directly on the host as a systemd service, go straight to the unit file.

[root@cloudos02 ~]# cat /usr/lib/systemd/system/etcd2.service 
[Unit]
Description=Etcd2 Server

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/kube-etcd-cluster
ExecStart=/opt/bin/etcd --name=${ETCD_NAME} ...... (omitted)

The unit file /usr/lib/systemd/system/etcd2.service does not set a data directory, so etcd falls back to its default, ${name}.etcd, created under the process's working directory; since the unit sets no WorkingDirectory either, systemd runs the service from /, which is where the data directory will show up.
The name in etcd2.service is read from the ETCD_NAME variable in /etc/sysconfig/kube-etcd-cluster.

[root@cloudos02 ~]# cat /etc/sysconfig/kube-etcd-cluster
ETCD_NAME="NODE2"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.12.0.22:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.12.0.22:2379"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER="NODE1=http://10.12.0.21:2380,NODE2=http://10.12.0.22:2380,NODE3=http://10.12.0.23:2380"
ETCD_INITIAL_CLUSTER_STATE="new"

The root directory does indeed contain NODE2.etcd; delete the data directory.

[root@cloudos02 ~]# ll /
drwx------ 3 root root 4096 Jun 29 11:47 NODE2.etcd
[root@cloudos02 ~]# rm -rf /NODE2.etcd

Go to one of the healthy nodes, remove this member from the cluster, and then add it back.

[root@cloudos01 ~]# /opt/bin/etcdctl member remove 658a31702f200e95
Removed member 658a31702f200e95 from cluster
[root@cloudos01 ~]# /opt/bin/etcdctl member add NODE2 http://10.12.0.22:2380

Then go to the faulty node and edit the config file /etc/sysconfig/kube-etcd-cluster:
change ETCD_INITIAL_CLUSTER_STATE=new to ETCD_INITIAL_CLUSTER_STATE=existing (so the member joins the existing cluster instead of trying to bootstrap a new one) and start etcd.

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#new#existing#' /etc/sysconfig/kube-etcd-cluster
[root@cloudos02 ~]# systemctl start etcd2

Check the cluster member health:

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
member f95341f81eb9322c is healthy: got healthy result from http://10.12.0.22:2379
cluster is healthy

Then go back to the node and edit /etc/sysconfig/kube-etcd-cluster again,
changing ETCD_INITIAL_CLUSTER_STATE back from existing to new; now that the data directory exists, etcd ignores the initial-cluster settings, so this simply restores the original configuration.

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#existing#new#' /etc/sysconfig/kube-etcd-cluster

Afterwards, the related services were started and the node was completely back to normal.
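
As a final sanity check (a hedged sketch, not from the original post; it assumes kubectl is available on this host), the node and etcd state can be confirmed with:

# The node should have left the NodeLost/NotReady state and report Ready
kubectl get nodes
# All three etcd members should report healthy
/opt/bin/etcdctl cluster-health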
