zhangguanzhang's Blog

opensuse的一次救援

字数统计: 3.4k阅读时长: 18 min
2019/12/05

昨晚吃完晚饭回到办公室,右边同事在控制台看着一个suse起不来一直启动的时候卡在suse的蜥蜴logo背景图那。见我来了叫我看下,他们已经尝试过恢复快照,但是还不行,应该是很久之前损坏的,只不过因为没重启没发现,我叫他重启下看看卡在哪,重启后进入内核后显示背景图那按下esc然后看卡在/sysroot挂载那。目测分区损坏了,经历了ubuntu的安装iso的rescue mode就是渣渣后,我还是信任centos的iso。

处理

先备份和准备工作

关闭虚机,后台拷贝下系统盘的卷先备份下。然后给虚机的IDE光驱挂载了个centos 7.5 DVD的iso,修改虚机启动顺序到ISO,进Troubleshooting –> Rescue a CentOS Linux system
一般损坏的都不建议选1,因为挂载不上,所以是选3手动处理

Device or resource busy

1
2
3
4
5
1) Continue
2) Read-only mount
3) Skip to shell
4) Quit (Reboot)
Please make a selection from the above: 3

最开始我lsblk和看了下硬盘的分区表,最后vgchange -a y激活lvm后xfs_repair /dev/mapper/suse-lv_root的时候提示该设备繁忙,遂查看了下

1
2
sh-4.2# lsof /dev/mapper/suse-lv_root
sh-4.2# ps aux | less

lsof和fuser都是返回空的,最后就ps aux一个个看,发现了个mount进程一直hung在那

1
2
sh-4.2# ps aux | grep moun[t]
root 6126 0.0 0.0 19940 840 pts/0 D+ 11:56 0:00 /usr/bin/mount -t xfs -o defaults,ro /dev/mapper/suse-lv_root /mnt/sysimage

这个进程尝试过了,死活杀不掉,进rescue mode的时候选的Skip to shell,以为是iso的版本bug,换了一个7.6 minimal的iso进入rescue mode后不选择直接ctrl+alt+F2进到tty还是一样会自动挂载,于是想下从父进程的角度上看看能不能处理

1
2
3
4
5
6
[anaconda root@localhost /]# ps aux | grep moun[t]
root 6113 0.0 0.0 19940 840 pts/0 D+ 12:02 0:00 /usr/bin/mount -t xfs -o defaults,ro /dev/mapper/suse-lv_root /mnt/sysimage
[anaconda root@localhost /]# ps -Al | grep mount
4 D 0 6113 5862 0 80 0 - 4985 xfs_bu pts/0 00:00:00 mount
[anaconda root@localhost /]# ps aux | grep 586[2]
root 5862 0.0 0.0 19940 840 pts/0 D+ 12:02 0:00 python anaconda

找到了该mount的父进程是anaconda,也就是我们进入rescue mode的第一个tty提供交互,直接杀掉它,mount终止

1
2
[anaconda root@localhost /]# kill -9 5862;kill 6113
bash: kill: (6113) - No such process

但是还是busy,发现该mount又tm的起来了,最终想了个骚套路,进入rescue mode,然后4个选项不选择,直接ctrl+alt+F2进到tty后杀掉mount的进程后把mount命名改名,执行下面

1
2
3
4
5
6
7
8
[anaconda root@localhost /]# mv /usr/bin/mount{,.bak}  #先改名,再杀进程
[anaconda root@localhost /]# ps aux | grep moun[t]
root 6128 0.3 0.0 19940 844 pts/0 D+ 12:06 0:00 /usr/bin/mount -t xfs -o defaults,ro /dev/mapper/suse-lv_root /mnt/sysimage
[anaconda root@localhost /]# ps -Al | grep mount
4 D 0 6128 5877 0 80 0 - 4985 xfs_bu pts/0 00:00:00 mount
[anaconda root@localhost /]# kill -9 5877;kill 6128
bash: kill: (6128) - No such process
[anaconda root@localhost /]# ps aux | grep moun[t] #然后再也没mount进程

修复

接着前面的,激活lvm,这里不详细说,可以自己去lsblkfdisk -l /dev/vdx去看相关分区信息

1
2
3
[anaconda root@localhost /]# vgchange -a y
1 logical volume(s) in volume group "suse" now active
4 logical volume(s) in volume group "vgsap" now active
1
2
3
4
5
6
7
[anaconda root@localhost /]# xfs_repair /dev/mapper/suse-lv_root
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

该报错大致意思是: 文件系统的log需要在repair之前先mount它来触发回放log,如果无法挂载,使用xfs_repair带上-L选项摧毁log强制修复
正确姿势是先使用xfs_metadump导出metadata,见文章 https://serverfault.com/questions/777299/proper-way-to-deal-with-corrupt-xfs-filesystems
这里因为已经损坏了,没必要测试mount了,并且我未导出metadata,直接-L修复的,下次遇到了相似场景可以试下xfs_metadump

1
[anaconda root@localhost /]# xfs_repair -L /dev/mapper/suse-lv_root

漫长的等待修复,然后卡在了一个inode那,等待了20分钟直接ctrl c取消再来,然当这次不需要带-L选项

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
[anaconda root@localhost /]# xfs_repair /dev/mapper/suse-lv_root
...
...
resetting inode 15847758 nlinks from 0 to 2
resetting inode 16180728 nlinks from 0 to 2
resetting inode 16500950 nlinks from 0 to 2
resetting inode 17347042 nlinks from 0 to 2
resetting inode 19414733 nlinks from 0 to 2
Metadata corruption detected at xfs_dir3_block block 0x2a09ba8/0x1000
libxfs_writebufr: write verufer failed on xfs_dur3_block bno 0x2a09ba8/0x1000
Metadata corruption detected at xfs_dir3_block block 0x145ce28/0x1000
libxfs_writebufr: write verufer failed on xfs_dur3_block bno 0x145ce28/0x1000
...
...
Maximum metadata LSN (1919513701:1600352110) is ahead of log (1:2).
Format log to cycle 1919513704.
releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!done

然后再次修复

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
[anaconda root@localhost /]# xfs_repair /dev/mapper/suse-lv_root
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perfrom inode discovery...
- agno = 0
...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 1
- agno = 0
- agno = 2
- agno = 3
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moveing disconnected inodes to lost_found ...
Phase 7 - verify and correct link counts...
resetting inode 64 nlinks from 25 to 24
done

然后挂载试试

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
[anaconda root@localhost /]# mount /dev/mapper/suse-lv_root /mnt/sysimage
bash: mount: command not found
[anaconda root@localhost /]# mount.bak /dev/mapper/suse-lv_root /mnt/sysimage #漫长等待,大概30多秒
[anaconda root@localhost /]# ls -l /mnt/sysimage
total 40296
drwxr-xr-x 2 root root 4096 Aug 6 2018 bin
drwxr-xr-x 3 root root 6 Aug 6 2018 boot
drwxr-xr-x 22 root root 6 Aug 6 22:19 dev
drwxr-xr-x 131 root root 8192 Nov 30 04:05 etc
drwxr-xr-x 5 root root 46 Oct 18 2018 home
drwxr-xr-x 12 root root 8192 Nov 30 04:04 lib
drwxr-xr-x 7 root root 8192 Aug 6 2018 lib64
drwxr-xr-x 2270 root root 27242496 Dec 4 20:47 lost+found
drwxr-xr-x 2 root root 6 Jun 27 2017 mnt
drwxr-xr-x 2 root root 6 Jun 27 2017 opt
dr-xr-xr-x 190 root root 6 Aug 6 2018 proc
drwx------ 21 root root 4096 Nov 30 04:39 root
drwxr-xr-x 31 root root 6 Aug 6 2018 run
drwxr-xr-x 4 root sapsys 6 Oct 11 2018 sapmnt
drwxr-xr-x 2 root root 8192 Oct 11 2018 sbin
drwxr-xr-x 2 root root 6 Jun 27 2017 selinux
drwxr-xr-x 9 root root 4096 Oct 11 2018 software
drwxr-xr-x 4 root root 28 Aug 6 2018 srv
dr-xr-xr-x 13 root root 0 Dec 4 22:16 sys
drwxrwxrwt 31 root root 4096 Dec 3 22:18 tmp
drwxr-xr-x 14 root root 182 Nov 30 04:37 usr
drwxr-xr-x 13 root root 201 Nov 30 04:37 var

然后取消光驱挂载,修改启动顺序重启,能进到登陆,直接ctrl+alt+F2进到tty登陆,发现没有网络,查看了下失败的启动,控制台观察的,输出不能被复制,下面命令输出大致的写下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● cryptctl-auto-unlock@sys-devices-pci0000:00-0000:00:08.0-virtio3-block-vda.service
● cryptctl-auto-unlock@aD7Wov-Krfg-KPbq-Dnf6-1dAj-e9dM-N7dUir.service
● cryptctl-auto-unlock@abd69e01-d874-4658-b738-1107d33cd84c.service
● cryptctl-auto-unlock@abd69e01-d874-4658-b738-1107d33cd84c.service
● cryptctl-auto-unlock@0zSmA1-nPGR-FuVE-ZIvq-vxhl-2WdX-eh58e2.service
● postfix.service loaded failed failed Postfix Mail Transport Agent
● wicked.service loaded failed failed wicked managed network interfaces
● wickedd-auto4.service loaded failed failed wicked AutoIPv4 supplicant service
● wickedd-dhcp4.service loaded failed failed wicked DHCPv4 supplicant service
● wickedd-dhcp6.service loaded failed failed wicked DHCPv6 supplicant service
● wickedd.service loaded failed failed wicked network management service daemon

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

查看了下系统日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
vi /var/log/messages
2019-12-04T21:42:56.401177+08:00 bpcprdascs1 cryptctl[1506]: open /etc/sysconfig/cryptctl-client: no such file or directory
2019-12-04T21:42:56.403110+08:00 bpcprdascs1 cryptctl[1515]: open /etc/sysconfig/cryptctl-client: no such file or directory
2019-12-04T21:42:56.408475+08:00 bpcprdascs1 cryptctl[1495]: open /etc/sysconfig/cryptctl-client: no such file or directory
2019-12-04T21:42:56.408634+08:00 bpcprdascs1 cryptctl[1529]: open /etc/sysconfig/cryptctl-client: no such file or directory
2019-12-04T21:42:56.408807+08:00 bpcprdascs1 cryptctl[1508]: open /etc/sysconfig/cryptctl-client: no such file or directory
2019-12-04T21:42:56.414068+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@sys-devices-pci0000:00-0000:00:08.0-virtio3-block-vda.service: Main process exited, code=exited, status=1/FAILURE
2019-12-04T21:42:56.414237+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@sys-devices-pci0000:00-0000:00:08.0-virtio3-block-vda.service: Unit entered failed state.
2019-12-04T21:42:56.414319+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@sys-devices-pci0000:00-0000:00:08.0-virtio3-block-vda.service: Failed with result 'exit-code'.
2019-12-04T21:42:56.414403+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@aD7Wov-Krfg-KPbq-Dnf6-1dAj-e9dM-N7dUir.service: Main process exited, code=exited, status=1/FAILURE
2019-12-04T21:42:56.414467+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@aD7Wov-Krfg-KPbq-Dnf6-1dAj-e9dM-N7dUir.service: Unit entered failed state.
2019-12-04T21:42:56.414528+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@aD7Wov-Krfg-KPbq-Dnf6-1dAj-e9dM-N7dUir.service: Failed with result 'exit-code'.
2019-12-04T21:42:56.414596+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@abd69e01-d874-4658-b738-1107d33cd84c.service: Main process exited, code=exited, status=1/FAILURE
2019-12-04T21:42:56.414657+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@abd69e01-d874-4658-b738-1107d33cd84c.service: Unit entered failed state.
2019-12-04T21:42:56.414735+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@abd69e01-d874-4658-b738-1107d33cd84c.service: Failed with result 'exit-code'.
2019-12-04T21:42:56.414794+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@0zSmA1-nPGR-FuVE-ZIvq-vxhl-2WdX-eh58e2.service: Main process exited, code=exited, status=1/FAILURE
2019-12-04T21:42:56.414851+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@0zSmA1-nPGR-FuVE-ZIvq-vxhl-2WdX-eh58e2.service: Unit entered failed state.
2019-12-04T21:42:56.414907+08:00 bpcprdascs1 systemd[1]: cryptctl-auto-unlock@0zSmA1-nPGR-FuVE-ZIvq-vxhl-2WdX-eh58e2.service: Failed with result 'exit-code'.

发现该文件丢失,同样系统的机器上去把内容手动创建,然后重启只剩下这些

1
2
3
4
5
● wicked.service        loaded failed failed wicked managed network interfaces
● wickedd-auto4.service loaded failed failed wicked AutoIPv4 supplicant service
● wickedd-dhcp4.service loaded failed failed wicked DHCPv4 supplicant service
● wickedd-dhcp6.service loaded failed failed wicked DHCPv6 supplicant service
● wickedd.service loaded failed failed wicked network management service daemon

找到相关日志,或者手动启动wicked或者网卡也报错下面类似

1
2
3
4
5
6
7
2019-12-04T21:54:50.170654+08:00 bpcprdascs1 wickedd[1399]: Failed to register dbus bus name "org.opensuse.Network" (Connection ":1.2" is not allowed to own the service "org.opensuse.Network" due to security policies in the configuration file)
2019-12-04T21:54:50.170657+08:00 bpcprdascs1 wickedd[1399]: unable to initialize dbus service
2019-12-04T21:54:50.170659+08:00 bpcprdascs1 systemd[1]: wickedd.service: Main process exited, code=exited, status=1/FAILURE
2019-12-04T21:54:50.170661+08:00 bpcprdascs1 systemd[1]: Failed to start wicked network management service daemon.
...
2019-12-04T22:02:05.868058+08:00 bpcprdascs1 wicked: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
2019-12-04T22:02:05.868883+08:00 bpcprdascs1 wicked: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provided by any .service files

这个错误找了一圈都没正确的解决办法,还是自己突发奇想在/etc/dbus-1/对比了下发现文件丢失
正常机器上

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
bpcprdascs2:/etc/dbus-1/system.d # find /etc/dbus-1/ -type f 
/etc/dbus-1/system.d/org.opensuse.Snapper.conf
/etc/dbus-1/system.d/org.freedesktop.hostname1.conf
/etc/dbus-1/system.d/org.freedesktop.import1.conf
/etc/dbus-1/system.d/org.freedesktop.locale1.conf
/etc/dbus-1/system.d/org.freedesktop.login1.conf
/etc/dbus-1/system.d/org.freedesktop.machine1.conf
/etc/dbus-1/system.d/org.freedesktop.systemd1.conf
/etc/dbus-1/system.d/org.freedesktop.timedate1.conf
/etc/dbus-1/system.d/com.redhat.PrinterDriversInstaller.conf
/etc/dbus-1/system.d/org.freedesktop.UPower.conf
/etc/dbus-1/system.d/org.freedesktop.GeoClue2.Agent.conf
/etc/dbus-1/system.d/org.freedesktop.GeoClue2.conf
/etc/dbus-1/system.d/bluetooth.conf
/etc/dbus-1/system.d/com.redhat.tuned.conf
/etc/dbus-1/system.d/org.freedesktop.PolicyKit1.conf
/etc/dbus-1/system.d/org.freedesktop.UDisks2.conf
/etc/dbus-1/system.d/org.freedesktop.RealtimeKit1.conf
/etc/dbus-1/system.d/org.freedesktop.Accounts.conf
/etc/dbus-1/system.d/org.opensuse.Network.AUTO4.conf
/etc/dbus-1/system.d/org.opensuse.Network.DHCP4.conf
/etc/dbus-1/system.d/org.opensuse.Network.DHCP6.conf
/etc/dbus-1/system.d/org.opensuse.Network.Nanny.conf
/etc/dbus-1/system.d/org.opensuse.Network.conf
/etc/dbus-1/system.d/pulseaudio-system.conf
/etc/dbus-1/system.d/org.freedesktop.PackageKit.conf
/etc/dbus-1/system.d/cups.conf
/etc/dbus-1/system.d/org.opensuse.CupsPkHelper.Mechanism.conf
/etc/dbus-1/system.d/gdm.conf
/etc/dbus-1/session.conf
/etc/dbus-1/system.conf

该故障机器上

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
bpcprdascs1:/var/log # find /etc/dbus-1/ -type f 
/etc/dbus-1/system.d/org.opensuse.Snapper.conf
/etc/dbus-1/system.d/org.freedesktop.hostname1.conf
/etc/dbus-1/system.d/org.freedesktop.import1.conf
/etc/dbus-1/system.d/org.freedesktop.locale1.conf
/etc/dbus-1/system.d/org.freedesktop.login1.conf
/etc/dbus-1/system.d/org.freedesktop.machine1.conf
/etc/dbus-1/system.d/org.freedesktop.systemd1.conf
/etc/dbus-1/system.d/org.freedesktop.timedate1.conf
/etc/dbus-1/system.d/com.redhat.PrinterDriversInstaller.conf
/etc/dbus-1/system.d/org.freedesktop.UPower.conf
/etc/dbus-1/system.d/org.freedesktop.GeoClue2.Agent.conf
/etc/dbus-1/system.d/org.freedesktop.GeoClue2.conf
/etc/dbus-1/system.d/bluetooth.conf
/etc/dbus-1/system.d/com.redhat.tuned.conf
/etc/dbus-1/system.d/org.freedesktop.PolicyKit1.conf
/etc/dbus-1/system.d/org.freedesktop.RealtimeKit1.conf
/etc/dbus-1/session.conf
/etc/dbus-1/system.conf

因为故障机器的网络无法启动,即使手动ip addr add也报错dbus,所以无法通过网络scp。于是在后台正常机器给添加了一个数据盘,把该目录的文件拷贝到数据盘上,再把数据盘挂载到故障机器上。然后cp拷贝完重启,然后网络起来了
只剩下故障

1
2
3
4
5
$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● wickedd-auto4.service loaded failed failed wicked AutoIPv4 supplicant service
● wickedd-dhcp4.service loaded failed failed wicked DHCPv4 supplicant service
● wickedd-dhcp6.service loaded failed failed wicked DHCPv6 supplicant service

上面三个通过系统日志可以找到是文件丢失,其他机器上去拷贝就行了,当然也不是只有这三个服务的文件丢失,其他服务的文件也丢失了,自行看下系统日志处理下

1
2
3
4
5
6
2019-12-04T21:54:50.170648+08:00 bpcprdascs1 display-manager[1429]: /usr/lib/X11/display-manager: line 17: /etc/sysconfig/displaymanager: No such file or directory
2019-12-04T22:18:03.689878+08:00 bpcprdascs1 systemd[1395]: wickedd-dhcp6.service: Failed at step EXEC spawning /usr/lib/wicked/bin/wickedd-dhcp6: No such file or directory
...
2019-12-04T22:18:03.689884+08:00 bpcprdascs1 systemd[1396]: wickedd-dhcp4.service: Failed at step EXEC spawning /usr/lib/wicked/bin/wickedd-dhcp4: No such file or directory
...
2019-12-04T22:18:03.689897+08:00 bpcprdascs1 systemd[1403]: wickedd-auto4.service: Failed at step EXEC spawning /usr/lib/wicked/bin/wickedd-auto4: No such file or directory
CATALOG
  1. 1. 处理
    1. 1.1. 先备份和准备工作
    2. 1.2. Device or resource busy
    3. 1.3. 修复