misterli's Blog.

A Kuboard Outage: etcd Ran Out of Space

2023/05/05

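Kuboard is the Kubernetes dashboard our developers use. A developer told me it would not open, and when I loaded the page myself it returned a 500 error (no screenshot here; I forgot to save one).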

First I checked the pods, and all of them looked healthy:

[root@dev-tools ~]# kubectl -n kuboard  get pod
NAME READY STATUS RESTARTS AGE
kuboard-etcd-0 1/1 Running 0 26d
kuboard-etcd-1 1/1 Running 0 162d
kuboard-etcd-2 1/1 Running 0 155d
kuboard-v3-57b6fbcf4f-dglkk 1/1 Running 0 153d

Feeling lazy, I first tried the time-honored fix of restarting everything to see whether the page would come back:

kubectl -n kuboard  rollout  restart deployment  kuboard-v3
kubectl -n kuboard rollout restart statefulset kuboard-etcd

It still would not open after the restart, so I started reading the logs.

Kuboard log

{"level":"warn","ts":"2023-04-12T14:38:31.018+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-5cd98b97-dba1-498e-ae6f-4bcf1408145f/kuboard-etcd-0.kuboard-etcd:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
time="2023-04-12T06:38:31Z" level=error msg="Failed to create authorization request: etcdserver: mvcc: database space exceeded"
[GIN] 2023/04/12 - 14:38:31 | 500 | 7.722147ms | 106.15.137.195 | GET "/sso/auth?access_type=offline&client_id=kuboard-sso&redirect_uri=%2Fcallback&response_type=code&scope=openid+profile+email+groups&state=%2Fkuboard%2Fcluster&connector_id=gitlab"
{"level":"warn","ts":"2023-04-12T14:38:32.750+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-5cd98b97-dba1-498e-ae6f-4bcf1408145f/kuboard-etcd-0.kuboard-etcd:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}

etcd log

2023-04-12 06:38:05.763042 W | etcdserver: failed to apply request "header:<ID:12031796812563778311 > txn:<compare:<target:CREATE key:\"kuboard-sso-namespace/auth_req/rmq7qrteaen47y5kdvf3y32w5\" create_revision:0 > success:<request_put:<key:\"kuboard-sso-namespace/auth_req/rmq7qrteaen47y5kdvf3y32w5\" value_size:427 >> failure:<>>" with response "" took (481ns) to execute, err is etcdserver: no space
2023-04-12 06:38:05.763071 W | etcdserver: failed to apply request "header:<ID:12031796812563778314 > txn:<compare:<target:CREATE key:\"kuboard-sso-namespace/auth_req/ym7vdka6tote5euge43on5lj4\" create_revision:0 > success:<request_put:<key:\"kuboard-sso-namespace/auth_req/ym7vdka6tote5euge43on5lj4\" value_size:411 >> failure:<>>" with response "" took (320ns) to execute, err is etcdserver: no space
2023-04-12 06:38:05.763086 W | etcdserver: failed to apply request "header:<ID:12031796812563778315 > txn:<compare:<target:CREATE key:\"kuboard-sso-namespace/auth_req/b2zt4l2qlu5kdfgp62cvdpsby\" create_revision:0 > success:<request_put:<key:\"kuboard-sso-namespace/auth_req/b2zt4l2qlu5kdfgp62cvdpsby\" value_size:411 >> failure:<>>" with response "" took (360ns) to execute, err is etcdserver: no space
2023-04-12 06:38:05.763102 W | etcdserver: failed to apply request "header:<ID:12031796812563778316 > txn:<compare:<target:CREATE key:\"kuboard-sso-namespace/auth_req/iycfih7p4u45ckbhcjydx4nq2\" create_revision:0 > success:<request_put:<key:\"kuboard-sso-namespace/auth_req/iycfih7p4u45ckbhcjydx4nq2\" value_size:372 >> failure:<>>" with response "" took (270ns) to execute, err is etcdserver: no space

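Judging from the logs, etcd had probably run out of space. I vaguely remembered that the etcd DB space quota defaults to 2 GB, and that once the data reaches that limit, writes are rejected.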
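To make the quota concrete, here is a minimal shell sketch that turns a DB size in bytes into a percentage of the quota. It assumes the default quota is 2 GiB (2 x 1024^3 bytes) when `--quota-backend-bytes` is not set; the function name `quota_usage_percent` is made up for this illustration.

```shell
#!/bin/sh
# Assumption: etcd's default backend quota is 2 GiB when no quota flag is set.
DEFAULT_QUOTA_BYTES=$((2 * 1024 * 1024 * 1024))

# quota_usage_percent DB_SIZE_BYTES [QUOTA_BYTES]
# Print how much of the quota is used, as an integer percentage (rounded down).
quota_usage_percent() {
    db_size=$1
    quota=${2:-$DEFAULT_QUOTA_BYTES}
    echo $((db_size * 100 / quota))
}
```

The byte count could be taken from the JSON output of `etcdctl endpoint status --write-out=json`, assuming the size field is named `dbSize` in your etcdctl version. A value close to 100 means writes are about to be rejected.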

Exec into the etcd container to take a look:

[root@dev-tools kuboard]# kubectl exec -it -n kuboard  kuboard-etcd-0  sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
# etcdctl endpoint status --write-out=table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| 127.0.0.1:2379 | 5a62a94db08693c1 | 3.4.14 | 2.1 GB | false | false | 55 | 6115946 | 6115946 | memberID:9834606138033252550 |
| | | | | | | | | | alarm:NOSPACE |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
# etcdctl alarm list
memberID:9834606138033252550 alarm:NOSPACE

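The database had indeed hit its space quota, as the NOSPACE alarm shows. Since changing the configuration to raise the quota was not convenient, the only option left was to compact old data.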

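Note: the alarm here reports NOSPACE. You either need to raise the etcd cluster's space quota (2 GB by default) or compact old data. After freeing or adding space, you must also clear the alarm with etcdctl, otherwise the cluster remains unusable.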
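If raising the quota is an option, the new limit is passed to etcd as a byte count via `--quota-backend-bytes`. Below is a small sketch of computing that value. `quota_flag` is a hypothetical helper, and where the flag ends up (for example the container args of the `kuboard-etcd` StatefulSet) is an assumption about this deployment.

```shell
#!/bin/sh
# quota_flag GIB
# Print the etcd flag for a backend quota of GIB gibibytes.
quota_flag() {
    gib=$1
    echo "--quota-backend-bytes=$((gib * 1024 * 1024 * 1024))"
}
```

For example, `quota_flag 8` produces the flag for an 8 GiB quota. Even after etcd restarts with a larger quota, the NOSPACE alarm still has to be disarmed by hand.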

Get the current etcd revision

# etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
6115560
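The same extraction can be wrapped in a small function that is a bit stricter about what it matches. This is only a sketch: `extract_revision` is a hypothetical helper name, and it assumes the status JSON contains a `"revision":<number>` pair, which is what the command above relies on as well.

```shell
#!/bin/sh
# extract_revision
# Read `etcdctl endpoint status --write-out=json` output on stdin and
# print the first revision number found in it.
extract_revision() {
    grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*'
}
```

It would be used as `rev=$(etcdctl endpoint status --write-out=json | extract_revision)`. Taking only the first match matters when several endpoints are queried at once, since each one reports its own revision.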

Compact old revisions

# etcdctl compact 6115560
compacted revision 6115560

Defragment the database

# etcdctl defrag
Finished defragmenting etcd member[127.0.0.1:2379]

Check the etcd DB size again

# etcdctl endpoint status --write-out=table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| 127.0.0.1:2379 | 5a62a94db08693c1 | 3.4.14 | 4.1 MB | false | false | 55 | 6116017 | 6116017 | memberID:9834606138033252550 |
| | | | | | | | | | alarm:NOSPACE |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+

Disarm the alarm

# etcdctl alarm disarm

After disarming the alarm, the page was accessible again.
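Put together, the steps in this post could be scripted roughly as follows. This is a sketch under assumptions, not a hardened tool: `reclaim_space` is a made-up function name, the `ETCDCTL` variable is a hypothetical override hook for dry runs, and the script talks only to the endpoint etcdctl uses by default, just like the manual commands.

```shell
#!/bin/sh
# Sketch: compact to the current revision, defragment, then disarm the alarm.
# ETCDCTL is a hypothetical hook so the command can be swapped for a stub.
ETCDCTL=${ETCDCTL:-etcdctl}

reclaim_space() {
    # 1. Read the current revision from the status JSON.
    rev=$($ETCDCTL endpoint status --write-out=json \
        | grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*')
    [ -n "$rev" ] || { echo "could not read revision" >&2; return 1; }

    # 2. Drop history older than that revision.
    $ETCDCTL compact "$rev" || return 1

    # 3. Return the freed pages to the filesystem.
    $ETCDCTL defrag || return 1

    # 4. Clear NOSPACE, then list alarms to confirm nothing is left.
    $ETCDCTL alarm disarm || return 1
    $ETCDCTL alarm list
}
```

Run inside the etcd container, a call to `reclaim_space` would stop at the first failing step instead of disarming the alarm on a database that is still full. Defragmentation works per member, so the other members of the cluster probably need the same treatment, for instance by pointing etcdctl at them with `--endpoints`.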
