记录一次线上k8s节点故障

k8s etcd

字数统计: 926阅读时长: 4 min

 2018/06/25 

邮件收到zabbix的告警,业务的网页登陆状态不是200,后面又自愈了

说明服务挂掉过一次,登陆到机器上发现集群有台节点状态是nodelost状态
上去看到相关服务都挂掉了
然后排查到根分区占满了,排查到是k8s日志堆满了/var/log/

[root@cloudos02 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  216G     0 100% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
[root@cloudos02 /var/log/]# du -shx /var/log/* | grep -P '^\S+?G'
31G   /var/log/heat
1.1G  /var/log/keystone
163G  /var/log/kubernetes
3.7G  /var/log/nova
1.7G  /var/log/openstack-compute

日志文件名有规律,直接删掉20天之前的日志文件

[root@cloudos02 kubernetes]# find -mtime +20 -name 'kube*.cloudos02*' -exec rm -f {} \;
[root@cloudos02 kubernetes]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  117G   91G  57% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0

k8s核心是etcd,果然是etcd有问题,k8s相关服务全部挂了

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
failed to check the health of member 658a31702f200e95 on http://10.12.0.22:2379: Get http://10.12.0.22:2379/health: dial tcp 10.12.0.22:2379: getsockopt: connection refused
member 658a31702f200e95 is unreachable: [http://10.12.0.22:2379] are all unreachable
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
cluster is healthy

查看日志

1 2	[root@cloudos02 ~]# journalctl -xe -u etcd2 一大堆输出说snap.broken

通过日志可以确定etcd的文件损坏了,肯定是由于根分区满了同步过来的数据无法写入导致损坏
先查找etcd的数据目录在哪,解决方法就是删掉此台的数据目录,然后再同步过来就行了
由于是实体服务,直接去找systemd脚本

[root@cloudos02 ~]# cat /usr/lib/systemd/system/etcd2.service 
[Unit]
Description=Etcd2 Server

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/kube-etcd-cluster
ExecStart=/opt/bin/etcd --name=${ETCD_NAME} ......省略

从/usr/lib/systemd/system/etcd2.service看到没有写数据目录,那么默认数据目录是默认为 ${name}.etcd
etcd2.service里的name是从/etc/sysconfig/kube-etcd-cluster里读取变量

[root@cloudos02 ~]# cat /etc/sysconfig/kube-etcd-cluster
ETCD_NAME="NODE2"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.12.0.22:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.12.0.22:2379"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER="NODE1=http://10.12.0.21:2380,NODE2=http://10.12.0.22:2380,NODE3=http://10.12.0.23:2380"
ETCD_INITIAL_CLUSTER_STATE="new"

根目录确实有NODE2.etcd,删掉数据目录

1
2
3

[root@cloudos02 ~]# ll /
drwx------    3 root root  4096 Jun 29 11:47 NODE2.etcd
[root@cloudos02 ~]# rm -rf /NODE2.etcd

去另外正常的节点上移除这个节点,然后再加上

1
2
3

[root@cloudos01 ~]# /opt/bin/etcdctl member remove 658a31702f200e95
Removed member 658a31702f200e95 from cluster
[root@cloudos01 ~]# /opt/bin/etcdctl member add NODE2 http://10.12.0.22:2380

然后去异常节点上修改配置文件/etc/sysconfig/kube-etcd-cluster
将ETCD_INITIAL_CLUSTER_STATE=new，修改为ETCD_INITIAL_CLUSTER_STATE=existing并启动etcd

1 2	[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#new#existing#' /etc/sysconfig/kube-etcd-cluster [root@cloudos02 ~]# systemctl start etcd2

查看集群成员状态

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
member f95341f81eb9322c is healthy: got healthy result from http://10.12.0.22:2379
cluster is healthy

然后去异常节点上修改配置文件/etc/sysconfig/kube-etcd-cluster
将ETCD_INITIAL_CLUSTER_STATE=existing改回new