misterli's Blog.

A Longhorn component restart left a PV unable to mount

2021/09/22

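After the Longhorn components in our cluster restarted abnormally, a PV that we had provisioned with Longhorn could no longer be mounted. The pod's events showed the following error: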

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 99s default-scheduler Successfully assigned devops/nexus3-84c8b98cb-rshlv to node-02
Warning FailedMount 78s kubelet MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 contains a file system with errors, check forced.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Inodes that were part of a corrupted orphan linked list found.

/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
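The error message names the affected block device several times. As a small convenience, the sketch below pulls the distinct `/dev/longhorn/pvc-...` paths out of any text piped to it. The function name `extract_longhorn_devices` is hypothetical, and the pattern assumes volume names of the `pvc-<uuid>` form seen in the events above.

```shell
# Print each distinct /dev/longhorn/pvc-<uuid> path found on stdin.
# Assumes volume names consist of "pvc-" followed by hex digits and hyphens.
extract_longhorn_devices() {
  grep -oE '/dev/longhorn/pvc-[0-9a-f-]+' | sort -u
}
```

For example, `kubectl -n devops describe pod nexus3-84c8b98cb-rshlv | extract_longhorn_devices` should print the single device path that fsck has to be run against.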

The describe output tells us to run fsck manually, so we ran it on the node hosting the PV (node-02):

[root@node-02 e2fsprogs-1.45.6]# fsck.ext4 -cvf /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 
e2fsck 1.42.9 (28-Dec-2013)
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 has unsupported feature(s): metadata_csum
e2fsck: Get a newer version of e2fsck!
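The failure above is a version mismatch: the filesystem uses `metadata_csum`, and the installed e2fsck 1.42.9 does not understand it. A minimal sketch of checking this up front follows. The helper names are hypothetical, `sort -V` requires GNU coreutils, and the minimum version of 1.43 is my assumption about when e2fsprogs gained `metadata_csum` support, so verify it against the release notes for your distribution.

```shell
# Return 0 if version $1 >= version $2, using GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Extract "1.42.9" from a banner line such as "e2fsck 1.42.9 (28-Dec-2013)".
e2fsck_version_from_banner() {
  awk '$1 == "e2fsck" { print $2; exit }'
}

# Assumed minimum e2fsprogs version with metadata_csum support.
MIN_E2FSCK_VERSION="1.43"
```

On a node, `version_ge "$(e2fsck -V 2>&1 | e2fsck_version_from_banner)" "$MIN_E2FSCK_VERSION" || echo "e2fsck too old"` would flag the situation before a repair is attempted.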

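The installed e2fsck (1.42.9) is too old to handle the filesystem's metadata_csum feature, so we upgraded e2fsck first by building e2fsprogs 1.45.6 from source: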

[root@node-02 replicas]# wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
--2021-09-16 11:51:48-- https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
Resolving distfiles.macports.org (distfiles.macports.org)... 151.101.230.132
Connecting to distfiles.macports.org (distfiles.macports.org)|151.101.230.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7938544 (7.6M) [application/x-gzip]
Saving to: 'e2fsprogs-1.45.6.tar.gz'

100%[=======================================================================================================================================>] 7,938,544 747KB/s in 10s

2021-09-16 11:52:04 (747 KB/s) - 'e2fsprogs-1.45.6.tar.gz' saved [7938544/7938544]

[root@node-02 replicas]# tar -zxvf e2fsprogs-1.45.6.tar.gz
e2fsprogs-1.45.6/
e2fsprogs-1.45.6/.gitignore
e2fsprogs-1.45.6/.missing-copyright
e2fsprogs-1.45.6/.release-checklist
.......
[root@node-02 replicas]# cd e2fsprogs-1.45.6/
[root@node-02 e2fsprogs-1.45.6]# ./configure
Generating configuration file for e2fsprogs version 1.45.6
Release date is March, 2020
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking for gcc... gcc
checking whether the C compiler works... yes
.......
[root@node-02 e2fsprogs-1.45.6]# make
cd ./util ; make subst
make[1]: Entering directory '/var/lib/longhorn/replicas/e2fsprogs-1.45.6/util'
CREATE dirpaths.h
CC subst.c
LD subst
make[1]: Leaving directory '/var/lib/longhorn/replicas/e2fsprogs-1.45.6/util'
make[1]: Entering directory '/var/lib/longhorn/replicas/e2fsprogs-1.45.6'
make[1]: 'util/subst.conf' is up to date.
.......
[root@node-02 e2fsprogs-1.45.6]# ls
ABOUT-NLS asm_types.h config.status debian e2fsck include intl MCONFIG parse-types.log RELEASE-NOTES SUBMITTING-PATCHES wordwrap.pl
acinclude.m4 CleanSpec.mk configure debugfs e2fsprogs.lsm INSTALL lib MCONFIG.in po resize tests
aclocal.m4 config configure.ac depfix.sed e2fsprogs.spec INSTALL.elfbin Makefile misc public_config.h scrub util
Android.bp config.log contrib doc ext2ed install-utils Makefile.in NOTICE README SHLIBS version.h
[root@node-02 e2fsprogs-1.45.6]# cd e2fsck/
[root@node-02 e2fsck]# ls
Android.bp dx_dirinfo.c e2fsck.conf.5 ehandler.c flushb.c logfile.o mtrace.c pass2.c pass5.c quota.c region.c scantest.c unix.o
badblocks.c dx_dirinfo.o e2fsck.conf.5.in ehandler.o iscan.c Makefile mtrace.h pass2.o pass5.o quota.o region.o sigcatcher.c util.c
badblocks.o e2fsck e2fsck.h emptydir.c jfs_user.h Makefile.in pass1b.c pass3.c problem.c readahead.c rehash.c sigcatcher.o util.o
CHANGES e2fsck.8 e2fsck.o extend.c journal.c message.c pass1b.o pass3.o problem.h readahead.o rehash.o super.c
dirinfo.c e2fsck.8.in ea_refcount.c extents.c journal.o message.o pass1.c pass4.c problem.o recovery.c revoke.c super.o
dirinfo.o e2fsck.c ea_refcount.o extents.o logfile.c mtrace.awk pass1.o pass4.o problemP.h recovery.o revoke.o unix.c
[root@node-02 e2fsck]# ./e2fsck -V # print the version of the freshly built e2fsck
[root@node-02 e2fsck]# cp e2fsck /sbin # overwrite the system e2fsck with the new build
cp: overwrite '/sbin/e2fsck'? y
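The cp above overwrites `/sbin/e2fsck` in place, which leaves no way back if the new build misbehaves. A slightly more cautious sketch keeps the old binary next to the new one. The function name `install_with_backup` and the `.bak` suffix are hypothetical choices, not part of the original procedure.

```shell
# Copy SRC over DEST, first saving any existing DEST as DEST.bak.
# The body runs in a subshell so its variables do not leak into the caller.
install_with_backup() (
  src="$1"
  dest="$2"
  if [ -e "$dest" ]; then
    # -p preserves mode and timestamps of the original binary.
    command cp -p "$dest" "$dest.bak" || exit 1
  fi
  # "command" bypasses any interactive alias such as cp='cp -i'.
  command cp -f "$src" "$dest"
)
```

Run from the build directory as `install_with_backup e2fsck /sbin/e2fsck`; restoring is then a matter of copying `/sbin/e2fsck.bak` back.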

With the new e2fsck in place, we ran the repair again:

[root@node-02 e2fsck]# fsck.ext4 -cvf /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 
e2fsck 1.45.6 (20-Mar-2020)
Checking for bad blocks (read-only test): done
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 131102 was part of the orphaned inode list. FIXED.
Inode 131103 was part of the orphaned inode list. FIXED.
Inode 131104 was part of the orphaned inode list. FIXED.
Inode 131105 was part of the orphaned inode list. FIXED.
Inode 131106 was part of the orphaned inode list. FIXED.
Inode 131107 was part of the orphaned inode list. FIXED.
Inode 131117 was part of the orphaned inode list. FIXED.
Inode 131402 was part of the orphaned inode list. FIXED.
Inode 131412 was part of the orphaned inode list. FIXED.
Inode 131630 was part of the orphaned inode list. FIXED.
Inode 131638 was part of the orphaned inode list. FIXED.
Inode 131644 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(688640--690326)
Fix<y>? yes
Free blocks count wrong for group #21 (31069, counted=32756).
Fix<y>? yes
Free blocks count wrong (1227977, counted=1229664).
Fix<y>? yes
Inode bitmap differences: -(131101--131107) -131117 -131402 -131412 -131630 -131638 -131644
Fix<y>? yes
Free inodes count wrong for group #16 (7567, counted=7580).
Fix<y>? yes
Free inodes count wrong (325295, counted=325308).
Fix<y>? yes

/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: ***** FILE SYSTEM WAS MODIFIED *****

2372 inodes used (0.72%, out of 327680)
182 non-contiguous files (7.7%)
1 non-contiguous directory (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 2361/3
81056 blocks used (6.18%, out of 1310720)
0 bad blocks
1 large file

1600 regular files
763 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
------------
2363 files
[root@node-02 e2fsck]
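When a repair like the one above is scripted rather than typed, the exit status of e2fsck is the reliable signal, since it is a bitmask documented in the e2fsck(8) man page. The sketch below decodes that bitmask into text. The function name `describe_fsck_status` is hypothetical, and the wording of each message is paraphrased from the man page.

```shell
# Translate an e2fsck exit status (a bitmask) into a readable summary.
# The body runs in a subshell so its variables do not leak into the caller.
describe_fsck_status() (
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "no errors"
    exit 0
  fi
  out=""
  # Each line below is "<bit value> <meaning>", per e2fsck(8).
  while read -r bit msg; do
    if [ $(( code & bit )) -ne 0 ]; then
      out="${out:+$out; }$msg"
    fi
  done <<'EOF'
1 errors corrected
2 system should be rebooted
4 errors left uncorrected
8 operational error
16 usage or syntax error
32 canceled by user request
128 shared library error
EOF
  echo "$out"
)
```

A typical use is to run `fsck.ext4 -f "$dev"; describe_fsck_status $?` and only proceed with remounting when bit 4 (errors left uncorrected) is not set.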

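After the check finished, we ran kubectl describe on the failing pod again. The events still list the earlier fsck failures, and the most recent attempts fail with a mount error instead: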

Normal Scheduled 12m default-scheduler Successfully assigned devops/nexus3-5c9c5545d9-nmfjg to node-02
Normal SuccessfulAttachVolume 12m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90"
Warning FailedMount 3m46s (x12 over 12m) kubelet MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 contains a file system with errors, check forced.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Inodes that were part of a corrupted orphan linked list found.

/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
Warning FailedMount 3m33s kubelet Unable to attach or mount volumes: unmounted volumes=[nexus-data], unattached volumes=[default-token-dv7nx nexus-data]: timed out waiting for the condition
Warning FailedMount 104s kubelet MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 /var/lib/kubelet/pods/8268934a-f1d9-4c14-ad4a-276d6986cee8/volumes/kubernetes.io~csi/pvc-9784831a-3130-4377-9d44-7e7129473b90/mount
Output: mount: /var/lib/kubelet/pods/8268934a-f1d9-4c14-ad4a-276d6986cee8/volumes/kubernetes.io~csi/pvc-9784831a-3130-4377-9d44-7e7129473b90/mount: /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 already mounted or mount point busy.
Warning FailedMount 78s (x4 over 10m) kubelet Unable to attach or mount volumes: unmounted volumes=[nexus-data], unattached volumes=[nexus-data default-token-dv7nx]: timed out waiting for the condition
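The "already mounted or mount point busy" message suggests that something else had the device open or mounted when kubelet retried, though the events alone do not say what. Separately, e2fsck should not be run against a mounted filesystem. A small sketch for checking whether a device currently appears as a mount source is shown below. The function name `device_is_mounted` is hypothetical, and it assumes the device is listed in `/proc/mounts` under the same `/dev/longhorn/...` path, which may not hold on every setup.

```shell
# Return 0 if DEVICE ($1) appears as a mount source in the mounts table.
# The table defaults to /proc/mounts; a different file can be passed as $2.
device_is_mounted() {
  awk -v d="$1" '$1 == d { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

For instance, `device_is_mounted /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 && echo "still mounted"` can be run on the node before starting a manual fsck.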

Deleting the pod so that it is recreated resolves this: the new pod mounts the PV normally.

[root@master-01 nexus]# kubectl -n devops  delete pod nexus3-5c9c5545d9-nmfjg 
pod "nexus3-5c9c5545d9-nmfjg" deleted
[root@master-01 nexus]# kubectl describe pod -n devops nexus3-5c9c5545d9-bm9dk
Name: nexus3-5c9c5545d9-bm9dk
Namespace: devops
Priority: 0
Node: node-02/172.26.204.144
Start Time: Thu, 16 Sep 2021 12:00:11 +0800
Labels: k8s-app=nexus3
pod-template-hash=5c9c5545d9
Annotations: cni.projectcalico.org/podIP: 100.114.252.214/32
Status: Running
IP: 100.114.252.214
IPs:
IP: 100.114.252.214
Controlled By: ReplicaSet/nexus3-5c9c5545d9
Containers:
nexus3:
Container ID: docker://a729451dbf3482c0847397b355a204f4e2fa0681392d28a478276b6efeb7c0a2
Image: sonatype/nexus3:3.32.0
Image ID: docker-pullable://sonatype/nexus3@sha256:4b73d33797727349adb7dff50da9c8eb17298706b481a00b330c589b8a893f36
Ports: 8083/TCP, 8081/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Thu, 16 Sep 2021 12:00:20 +0800
Ready: True
Restart Count: 0
Limits:
memory: 2Gi
Requests:
cpu: 100m
memory: 200Mi
Environment: <none>
Mounts:
/nexus-data from nexus-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dv7nx (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
nexus-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nexus-data
ReadOnly: false
default-token-dv7nx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dv7nx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12s default-scheduler Successfully assigned devops/nexus3-5c9c5545d9-bm9dk to node-02
Normal Pulled 3s kubelet Container image "sonatype/nexus3:3.32.0" already present on machine
Normal Created 3s kubelet Created container nexus3
Normal Started 3s kubelet Started container nexus3
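The delete-and-verify step above can be sketched as a small helper that deletes the stuck pod and then waits for its replacement to become Ready. The names `run` and `recreate_pod`, the `DRY_RUN` switch, and the 300s timeout are hypothetical; the label `k8s-app=nexus3` is taken from the describe output. It assumes the ReplicaSet has already created the replacement pod by the time `kubectl delete` returns.

```shell
# Run a command, or just print it when DRY_RUN=1 (handy for reviewing first).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Delete pod $2 in namespace $1, then wait for pods matching selector $3.
recreate_pod() {
  run kubectl -n "$1" delete pod "$2" &&
    run kubectl -n "$1" wait --for=condition=Ready pod -l "$3" --timeout=300s
}
```

With `DRY_RUN=1` set, `recreate_pod devops nexus3-5c9c5545d9-nmfjg k8s-app=nexus3` only prints the two kubectl commands, which makes it easy to check them before running them for real.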
