misterli's Blog.

Chaos Mesh: Chaos Engineering (Part 2)

2021/01/22

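Chaos Mesh configuration

Defining the scope of a chaos experiment

Chaos Mesh provides several selectors for defining the scope of a chaos experiment. They are set in the spec.selector field of the chaos object.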

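Namespace selector

spec:
  selector:
    namespaces:
      - "app-ns"

Label selector

The label selector filters chaos experiment targets by label. It is defined as a map of string keys to string values.

spec:
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"

Field selector

The field selector filters experiment targets by resource field. It is defined as a map of string keys to string values.

spec:
  selector:
    fieldSelectors:
      "metadata.name": "my-pod"

Annotation selector

The annotation selector filters experiment targets by annotation. It is defined as a map of string keys to string values.

spec:
  selector:
    annotationSelectors:
      "example-annotation": "group-a"

Pod phase selector

The pod phase selector filters chaos experiment targets by pod phase. It is defined as a set of strings. Supported phases:

Pending, Running, Succeeded, Failed, Unknown

spec:
  selector:
    podPhaseSelectors:
      - "Running"

Pod selector

The pod selector filters chaos experiment targets by pod name. It is defined as a map of string keys to lists of values: each key is the namespace the pods belong to, and each value under that key is a pod name. If this selector is not empty, the pods listed in the map are used directly and all other selectors are ignored. For example:

spec:
  selector:
    pods:
      tidb-cluster: # namespace of the target pods
        - basic-tidb-0
        - basic-pd-0
        - basic-tikv-0
        - basic-tikv-1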
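To make the selector rules above concrete, here is a minimal Python sketch. It is not Chaos Mesh code: the `Pod` class and `select_pods` function are hypothetical names made up for illustration, and it assumes that the namespace, label and phase selectors are combined with AND. The one rule taken directly from the text is that a non-empty `pods` map overrides every other selector.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical, simplified stand-in for a Kubernetes pod."""
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)
    phase: str = "Running"


def select_pods(pods, selector):
    """Return the pods matched by a selector dict (illustrative only)."""
    # A non-empty `pods` map wins outright; every other selector is ignored.
    explicit = selector.get("pods")
    if explicit:
        return [p for p in pods if p.name in explicit.get(p.namespace, [])]

    matched = []
    for p in pods:
        namespaces = selector.get("namespaces")
        if namespaces and p.namespace not in namespaces:
            continue
        wanted = selector.get("labelSelectors", {})
        if any(p.labels.get(k) != v for k, v in wanted.items()):
            continue
        phases = selector.get("podPhaseSelectors")
        if phases and p.phase not in phases:
            continue
        matched.append(p)
    return matched
```

With this sketch, a selector that sets both `pods` and `namespaces` returns only the pods named in the `pods` map, which mirrors the override behaviour described above.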

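Chaos experiment types

PodChaos experiments

Chaos Mesh only supports certain kinds of pods, namely those created by controllers such as Deployment, StatefulSet and DaemonSet.

PodChaos experiments simulate pod failures or problems with a specific container. Three actions are available: pod-failure, pod-kill and container-kill.

  • pod-failure: Chaos Mesh simulates an ErrImagePull error. The pod is made to use gcr.io/google-containers/pause:latest as its image. If that image cannot be pulled, the pod reports ErrImagePull; if it can be pulled, the pod ends up in CrashLoopBackOff.
  • pod-kill: kills the selected pods.
  • container-kill: kills the specified container in the target pod.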

Here we use the following busybox.yaml file to create a group of pods as the experiment targets:

[root@master-01 pod]# cat busybox.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment
  namespace: default
  labels:
    app: busybox
spec:
  selector:
    matchLabels:
      app: busybox
  replicas: 5
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        enable.version-checker.io/busybox: "true"
    spec:
      containers:
        - name: busybox
          image: busybox:1.29
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          args:
            - /bin/sh
            - -c
            - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600

pod-failure configuration file

This selects pods labeled app: busybox in the default namespace. In fixed mode with value 2, two of them are affected at a time.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-fail-item
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: fixed
  value: '2'
  selector:
    namespaces:
      - default
    labelSelectors:
      app: busybox
  duration: '120s'
  scheduler:
    cron: "@every 3m"
Experiment result:

First look at the events, where we find Pulling image "gcr.io/google-containers/pause:latest":

0s          Normal   Killing             pod/busybox-deployment-7bfd6d554c-n8xsr    Container busybox definition changed, will be restarted
0s Normal Killing pod/busybox-deployment-7bfd6d554c-hs42b Container busybox definition changed, will be restarted
0s Normal Killing pod/busybox-deployment-7bfd6d554c-hs42b Stopping container busybox
0s Normal SuccessfulCreate replicaset/busybox-deployment-7bfd6d554c Created pod: busybox-deployment-7bfd6d554c-nxzdp
0s Normal Killing pod/busybox-deployment-7bfd6d554c-n8xsr Stopping container busybox
<unknown> Normal Scheduled pod/busybox-deployment-7bfd6d554c-nxzdp Successfully assigned default/busybox-deployment-7bfd6d554c-nxzdp to node-01
<unknown> Normal Scheduled pod/busybox-deployment-7bfd6d554c-9s9cf Successfully assigned default/busybox-deployment-7bfd6d554c-9s9cf to node-01
0s Normal SuccessfulCreate replicaset/busybox-deployment-7bfd6d554c Created pod: busybox-deployment-7bfd6d554c-9s9cf
0s Normal Pulling pod/busybox-deployment-7bfd6d554c-n8xsr Pulling image "gcr.io/google-containers/pause:latest"
0s Normal Pulling pod/busybox-deployment-7bfd6d554c-hs42b Pulling image "gcr.io/google-containers/pause:latest"
0s Normal Pulled pod/busybox-deployment-7bfd6d554c-9s9cf Container image "busybox:1.29" already present on machine
0s Normal Pulled pod/busybox-deployment-7bfd6d554c-nxzdp Container image "busybox:1.29" already present on machine
0s Normal Created pod/busybox-deployment-7bfd6d554c-9s9cf Created container busybox
0s Normal Created pod/busybox-deployment-7bfd6d554c-nxzdp Created container busybox
0s Normal Started pod/busybox-deployment-7bfd6d554c-9s9cf Started container busybox
0s Normal Started pod/busybox-deployment-7bfd6d554c-nxzdp Started container busybox
0s Warning Failed pod/busybox-deployment-7bfd6d554c-n8xsr Failed to pull image "gcr.io/google-containers/pause:latest": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Watching how the pods change:

The pods go into ErrImagePull because the image cannot be pulled, and are then deleted and recreated.

[root@master-01 ~]# kubectl get pod -n skywalking  -w 
NAME READY STATUS RESTARTS AGE
busybox-deployment-7bfd6d554c-6rc8b 1/1 Running 0 37s
busybox-deployment-7bfd6d554c-jptvr 1/1 Running 0 4m16s
busybox-deployment-7bfd6d554c-vd4z2 1/1 Running 0 4m16s
busybox-deployment-7bfd6d554c-xtb9x 1/1 Running 0 4m16s
busybox-deployment-7bfd6d554c-xwhwg 1/1 Running 0 4m16s
check-ecs-price-7cdc97b997-2t5gc 1/1 Running 0 13d
web-show-768dd97986-8v7ld 1/1 Running 0 9d
busybox-deployment-7bfd6d554c-jptvr 0/1 ErrImagePull 0 5m1s
busybox-deployment-7bfd6d554c-jptvr 0/1 ImagePullBackOff 0 5m15s
busybox-deployment-7bfd6d554c-xwhwg 0/1 ErrImagePull 0 5m16s
busybox-deployment-7bfd6d554c-xwhwg 0/1 ImagePullBackOff 0 5m29s
busybox-deployment-7bfd6d554c-jptvr 0/1 ErrImagePull 0 5m44s
busybox-deployment-7bfd6d554c-jptvr 0/1 Terminating 0 5m53s
busybox-deployment-7bfd6d554c-jptvr 0/1 Terminating 0 5m53s
busybox-deployment-7bfd6d554c-4b2m2 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-4b2m2 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-xwhwg 0/1 Terminating 0 5m53s
busybox-deployment-7bfd6d554c-99942 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-xwhwg 0/1 Terminating 0 5m53s
busybox-deployment-7bfd6d554c-99942 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-4b2m2 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-99942 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-4b2m2 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-99942 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-4b2m2 1/1 Running 0 2s
busybox-deployment-7bfd6d554c-99942 1/1 Running 0 2s

pod-kill configuration file

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "60"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: busybox
  scheduler:
    cron: "@every 30s"

Experiment result:

We configured 60% of the pods to be killed at random, so 3 of the 5 matching pods are killed at random every 30 seconds.

[root@master-01 ~]# kubectl get pod  -w 
NAME READY STATUS RESTARTS AGE
busybox-deployment-7bfd6d554c-66jxc 1/1 Running 0 5s
busybox-deployment-7bfd6d554c-ccv4x 1/1 Running 0 5s
busybox-deployment-7bfd6d554c-n9z2t 1/1 Running 0 5s
busybox-deployment-7bfd6d554c-sc2j2 1/1 Running 0 5s
busybox-deployment-7bfd6d554c-zmhpm 1/1 Running 0 5s
check-ecs-price-7cdc97b997-2t5gc 1/1 Running 0 13d
web-show-768dd97986-8v7ld 1/1 Running 0 9d
busybox-deployment-7bfd6d554c-zmhpm 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-ccv4x 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-sc2j2 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-6hzc8 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-zmhpm 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-6hzc8 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-ccv4x 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-sc2j2 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-2bmt7 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-2bmt7 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-gcmk8 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-gcmk8 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-6hzc8 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-2bmt7 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-gcmk8 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-6hzc8 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-gcmk8 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-2bmt7 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-gcmk8 1/1 Running 0 2s
busybox-deployment-7bfd6d554c-2bmt7 1/1 Running 0 2s
busybox-deployment-7bfd6d554c-6hzc8 1/1 Running 0 2s
busybox-deployment-7bfd6d554c-2bmt7 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-66jxc 1/1 Terminating 0 60s
busybox-deployment-7bfd6d554c-gcmk8 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-66jxc 1/1 Terminating 0 60s
busybox-deployment-7bfd6d554c-n7fk2 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-2bmt7 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-gcmk8 1/1 Terminating 0 30s
busybox-deployment-7bfd6d554c-n7fk2 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-jwgmg 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-g8tp6 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-jwgmg 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-g8tp6 0/1 Pending 0 0s
busybox-deployment-7bfd6d554c-n7fk2 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-jwgmg 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-g8tp6 0/1 ContainerCreating 0 0s
busybox-deployment-7bfd6d554c-n7fk2 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-jwgmg 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-g8tp6 0/1 ContainerCreating 0 1s
busybox-deployment-7bfd6d554c-jwgmg 1/1 Running 0 3s
busybox-deployment-7bfd6d554c-g8tp6 1/1 Running 0 3s
busybox-deployment-7bfd6d554c-n7fk2 1/1 Running 0 3s

container-kill configuration file

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-example
  namespace: chaos-testing
spec:
  action: container-kill
  mode: one
  containerName: 'prometheus'
  selector:
    labelSelectors:
      'app.kubernetes.io/component': 'monitor'
  scheduler:
    cron: '@every 30s'

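Experiment result:

Field descriptions

  • action: the action PodChaos performs. One of pod-kill, pod-failure, container-kill.
  • mode: the mode in which the action runs, which in effect determines how many pods it affects. Supported modes: one, all, fixed, fixed-percent, random-max-percent.
  • value: depends on mode. If mode is one or all, leave value empty; the action affects one pod or all pods. If mode is fixed, give an integer number of pods to affect. If mode is fixed-percent, give a number from 0 to 100 as the percentage of pods to affect. If mode is random-max-percent, give a number from 0 to 100 as the maximum percentage of pods to affect.
  • selector: selects the target pods; see the section on defining the experiment scope above.
  • containerName: the name of the container in the target pod. Normally needed only for container-kill.
  • gracePeriod: how long to wait, in seconds, before the pod is deleted. Used by pod-kill; must be a non-negative integer. The default is 0, meaning immediate deletion.
  • duration: how long the experiment lasts. The default is 30s, meaning the pod failure lasts 30s. For pod-failure it is advisable to use a larger value.
  • scheduler: how runs of the experiment are scheduled; see robfig/cron.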
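The relationship between mode and value can be sketched in a few lines of Python. This is an illustration only, not the Chaos Mesh implementation: `pick_targets` is a hypothetical helper, and rounding percentages down is an assumption (it agrees with the pod-kill run above, where 60% of 5 pods gave 3 pods).

```python
import random


def pick_targets(pods, mode, value=None, rng=random):
    """Pick the pods an action would affect (illustrative sketch)."""
    total = len(pods)
    if mode == "one":
        count = min(1, total)
    elif mode == "all":
        count = total
    elif mode == "fixed":
        count = min(int(value), total)
    elif mode == "fixed-percent":
        # Assumption: percentages are rounded down.
        count = total * int(value) // 100
    elif mode == "random-max-percent":
        # Anything from zero pods up to the given maximum percentage.
        count = rng.randint(0, total * int(value) // 100)
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return rng.sample(pods, count)
```

For example, `pick_targets(["a", "b", "c", "d", "e"], "fixed-percent", "60")` returns three randomly chosen names.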

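NetworkChaos experiments

NetworkChaos actions fall into two categories:

  • Network Partition: splits pods into separate subnets, blocking communication between pods or between a pod and specified IPs.
  • Network Emulation (Netem) Chaos: covers common network faults such as delay, duplication, loss and corruption.

Preparation:

Edit the busybox.yaml file from the pod experiment and set the replica count to 1. Make a copy, and change the relevant names in the two files to busybox-1 and busybox-2 respectively. After deployment the pods look like this:
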
[root@master-01 network]# kubectl get pod --show-labels 
NAME READY STATUS RESTARTS AGE LABELS
busybox-1-7f5c9995d5-mbncr 1/1 Running 0 2m1s app=busybox-1,pod-template-hash=7f5c9995d5
busybox-2-7fd6c66cc6-7psnc 1/1 Running 0 99s app=busybox-2,pod-template-hash=7fd6c66cc6
check-ecs-price-7cdc97b997-j9w9q 1/1 Running 0 13d app=check-ecs-price,pod-template-hash=7cdc97b997

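Field descriptions

action: which network chaos action to perform.

mode: the mode in which the action runs.

selector: selects the target pods; see the section on defining the experiment scope above.

direction: the direction of the affected traffic. Supported values: from, to, both.

target: the target of the network partition.

duration: how long the experiment lasts.

scheduler: how runs of the experiment are scheduled; see robfig/cron.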

Network Partition configuration file

Communication between a pod and specified IPs

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-example
  namespace: chaos-testing
spec:
  action: partition
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  direction: to
  externalTargets:
    - "8.8.8.8"
    - "www.baidu.com"
    - "8.8.0.0/16"
  duration: "5s"
  scheduler:
    cron: "@every 15s"

Experiment result:

Every 15 seconds, the busybox container is unable to reach www.baidu.com for 5 seconds.

[root@master-01 ~]# kubectl exec -it busybox-deployment-7f5c9995d5-g5qp8  -- sh
/ # ping www.baidu.com
PING www.baidu.com (220.181.38.149): 56 data bytes
64 bytes from 220.181.38.149: seq=0 ttl=52 time=9.316 ms
64 bytes from 220.181.38.149: seq=1 ttl=52 time=9.355 ms
ping: sendto: Operation not permitted
/ # ping www.baidu.com
PING www.baidu.com (220.181.38.149): 56 data bytes
ping: sendto: Operation not permitted
/ # ping www.baidu.com
PING www.baidu.com (220.181.38.149): 56 data bytes
64 bytes from 220.181.38.149: seq=0 ttl=52 time=9.297 ms
64 bytes from 220.181.38.149: seq=1 ttl=52 time=9.324 ms
64 bytes from 220.181.38.149: seq=2 ttl=52 time=9.264 ms
64 bytes from 220.181.38.149: seq=3 ttl=52 time=9.334 ms
64 bytes from 220.181.38.149: seq=4 ttl=52 time=9.349 ms
64 bytes from 220.181.38.149: seq=5 ttl=52 time=9.728 ms
64 bytes from 220.181.38.149: seq=6 ttl=52 time=9.303 ms
64 bytes from 220.181.38.149: seq=7 ttl=52 time=9.298 ms
64 bytes from 220.181.38.149: seq=8 ttl=52 time=12.220 ms
64 bytes from 220.181.38.149: seq=9 ttl=52 time=9.306 ms
ping: sendto: Operation not permitted
/ # ping www.baidu.com
PING www.baidu.com (220.181.38.150): 56 data bytes
ping: sendto: Operation not permitted
/ # ping www.baidu.com
PING www.baidu.com (220.181.38.150): 56 data bytes
ping: sendto: Operation not permitted
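
The on/off pattern in the ping output follows from combining duration with the "@every" schedule. The sketch below is a rough model, not Chaos Mesh code: `chaos_active` is a hypothetical helper, and it assumes a run starts at t = 0 and then repeats exactly every interval.

```python
def chaos_active(t_seconds, every_seconds, duration_seconds):
    """Return True if the fault is injected at time t (simplified model)."""
    # Each cycle starts with `duration_seconds` of chaos, then recovers.
    return (t_seconds % every_seconds) < duration_seconds


# With "@every 15s" and duration "5s": 5 seconds of partition, 10 seconds healthy.
timeline = [chaos_active(t, 15, 5) for t in (0, 4, 5, 14, 15)]
```

Under these assumptions the fault is active for one third of each 15-second cycle.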

Communication between pods

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-example
  namespace: chaos-testing
spec:
  action: partition
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  direction: to
  target:
    selector:
      labelSelectors:
        "app": "busybox-2"
      namespaces:
        - default
    mode: one
  duration: "5s"
  scheduler:
    cron: "@every 15s"

Experiment result:

To keep things simple, no Service was created; the pods are accessed directly by pod IP.

[root@master-01 ~]# kubectl get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox-1-7f5c9995d5-rlzbr 1/1 Running 0 15s 100.67.79.136 node-01 <none> <none>
busybox-2-7fd6c66cc6-7wqdv 1/1 Running 0 13s 100.67.79.137 node-01 <none> <none>
check-ecs-price-7cdc97b997-j9w9q 1/1 Running 0 13d 100.67.79.132 node-01 <none> <none>

Pinging busybox-2 from busybox-1 shows that it is unreachable for 5s out of every 15s:

[root@master-01 ~]# kubectl exec -it busybox-1-7f5c9995d5-rlzbr  -- sh 
/ # ping 100.67.79.137
PING 100.67.79.137 (100.67.79.137): 56 data bytes
ping: sendto: Operation not permitted
/ # ping 100.67.79.137
PING 100.67.79.137 (100.67.79.137): 56 data bytes
ping: sendto: Operation not permitted
/ # ping 100.67.79.137
PING 100.67.79.137 (100.67.79.137): 56 data bytes
ping: sendto: Operation not permitted
/ # ping 100.67.79.137
PING 100.67.79.137 (100.67.79.137): 56 data bytes
64 bytes from 100.67.79.137: seq=0 ttl=63 time=0.150 ms
64 bytes from 100.67.79.137: seq=1 ttl=63 time=0.064 ms
64 bytes from 100.67.79.137: seq=2 ttl=63 time=0.070 ms
64 bytes from 100.67.79.137: seq=3 ttl=63 time=0.051 ms
64 bytes from 100.67.79.137: seq=4 ttl=63 time=0.062 ms
64 bytes from 100.67.79.137: seq=5 ttl=63 time=0.053 ms
64 bytes from 100.67.79.137: seq=6 ttl=63 time=0.058 ms
64 bytes from 100.67.79.137: seq=7 ttl=63 time=0.060 ms
64 bytes from 100.67.79.137: seq=8 ttl=63 time=0.059 ms
64 bytes from 100.67.79.137: seq=9 ttl=63 time=0.060 ms
ping: sendto: Operation not permitted
/ # ping 100.67.79.137
PING 100.67.79.137 (100.67.79.137): 56 data bytes
ping: sendto: Operation not permitted
/ # ping 100.67.79.137
PING 100.67.79.137 (100.67.79.137): 56 data bytes
ping: sendto: Operation not permitted

Pinging busybox-1 from busybox-2 shows that the ping stalls for 5s every 15s:

[root@master-01 ~]# kubectl exec -it busybox-2-7fd6c66cc6-7wqdv  -- sh 
/ # ping 100.67.79.136
PING 100.67.79.136 (100.67.79.136): 56 data bytes
64 bytes from 100.67.79.136: seq=0 ttl=63 time=0.086 ms
64 bytes from 100.67.79.136: seq=1 ttl=63 time=0.089 ms
64 bytes from 100.67.79.136: seq=2 ttl=63 time=0.059 ms
64 bytes from 100.67.79.136: seq=3 ttl=63 time=0.060 ms
64 bytes from 100.67.79.136: seq=4 ttl=63 time=0.068 ms
64 bytes from 100.67.79.136: seq=5 ttl=63 time=0.053 ms
64 bytes from 100.67.79.136: seq=6 ttl=63 time=0.088 ms
64 bytes from 100.67.79.136: seq=7 ttl=63 time=0.074 ms
64 bytes from 100.67.79.136: seq=8 ttl=63 time=0.074 ms
64 bytes from 100.67.79.136: seq=9 ttl=63 time=0.092 ms
## seq jumps straight to 15 here
64 bytes from 100.67.79.136: seq=15 ttl=63 time=0.075 ms
64 bytes from 100.67.79.136: seq=16 ttl=63 time=0.065 ms
64 bytes from 100.67.79.136: seq=17 ttl=63 time=0.087 ms
64 bytes from 100.67.79.136: seq=18 ttl=63 time=0.075 ms
64 bytes from 100.67.79.136: seq=19 ttl=63 time=0.070 ms
64 bytes from 100.67.79.136: seq=20 ttl=63 time=0.058 ms
64 bytes from 100.67.79.136: seq=21 ttl=63 time=0.087 ms
64 bytes from 100.67.79.136: seq=22 ttl=63 time=0.070 ms
64 bytes from 100.67.79.136: seq=23 ttl=63 time=0.079 ms

Network Emulation (Netem)

Netem covers four actions: loss, delay, duplicate and corrupt.

loss

The loss action causes network packets to be dropped at random.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-example
  namespace: chaos-testing
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  loss:
    loss: "50"
    correlation: "50" # 50% packet loss; each packet's drop probability is 50% correlated with the previous packet
  duration: "30s"
  scheduler:
    cron: "@every 60s"

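loss.loss defines the percentage of packets that are lost. Real network conditions do not vary purely at random, so a correlation value can be set to emulate this.

correlation: the correlation coefficient between the value used for the next packet and the one used for the previous packet. Taking delay as the example: network conditions change smoothly, so the delays of adjacent packets within a short period should be similar rather than completely random. The value is a percentage. At 100% it degenerates to a fixed delay; at 0% it degenerates to a fully random delay. For loss, the same coefficient ties each packet's drop decision to the previous one.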
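The idea of correlation can be illustrated with a small simulation. This is a simplified model in the spirit of netem's correlated random numbers, not its actual code: `correlated_loss` is a hypothetical function, and the blending formula (a weighted mix of a fresh random value and the previous value) is an assumption made for illustration.

```python
import random


def correlated_loss(n_packets, loss_pct, correlation_pct, seed=0):
    """Simulate per-packet drop decisions with correlation (sketch)."""
    rng = random.Random(seed)
    rho = correlation_pct / 100.0
    last = rng.random()
    dropped = []
    for _ in range(n_packets):
        # Blend a fresh random value with the previous packet's value.
        value = (1.0 - rho) * rng.random() + rho * last
        last = value
        # Drop the packet when the blended value falls under the threshold.
        dropped.append(value < loss_pct / 100.0)
    return dropped
```

With correlation 0 every packet is decided independently. As correlation grows, consecutive values stay close to each other, so drops tend to arrive in runs rather than being evenly scattered.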

Experiment result:

Packets are lost every few seconds.

/ # ping 100.67.79.136
PING 100.67.79.136 (100.67.79.136): 56 data bytes
64 bytes from 100.67.79.136: seq=0 ttl=63 time=0.075 ms
64 bytes from 100.67.79.136: seq=1 ttl=63 time=0.075 ms
64 bytes from 100.67.79.136: seq=2 ttl=63 time=0.059 ms
64 bytes from 100.67.79.136: seq=3 ttl=63 time=0.045 ms
64 bytes from 100.67.79.136: seq=6 ttl=63 time=0.071 ms
64 bytes from 100.67.79.136: seq=7 ttl=63 time=0.049 ms
64 bytes from 100.67.79.136: seq=9 ttl=63 time=0.058 ms
64 bytes from 100.67.79.136: seq=10 ttl=63 time=0.080 ms
64 bytes from 100.67.79.136: seq=11 ttl=63 time=0.073 ms
64 bytes from 100.67.79.136: seq=12 ttl=63 time=0.054 ms
64 bytes from 100.67.79.136: seq=13 ttl=63 time=0.114 ms
64 bytes from 100.67.79.136: seq=14 ttl=63 time=0.074 ms
64 bytes from 100.67.79.136: seq=15 ttl=63 time=0.063 ms
64 bytes from 100.67.79.136: seq=16 ttl=63 time=0.068 ms
64 bytes from 100.67.79.136: seq=17 ttl=63 time=0.073 ms
64 bytes from 100.67.79.136: seq=19 ttl=63 time=0.076 ms
64 bytes from 100.67.79.136: seq=24 ttl=63 time=0.114 ms
64 bytes from 100.67.79.136: seq=26 ttl=63 time=0.080 ms
64 bytes from 100.67.79.136: seq=27 ttl=63 time=0.053 ms
64 bytes from 100.67.79.136: seq=28 ttl=63 time=0.088 ms
64 bytes from 100.67.79.136: seq=35 ttl=63 time=0.075 ms
64 bytes from 100.67.79.136: seq=41 ttl=63 time=0.072 ms
64 bytes from 100.67.79.136: seq=42 ttl=63 time=0.068 ms
64 bytes from 100.67.79.136: seq=43 ttl=63 time=0.053 ms
64 bytes from 100.67.79.136: seq=44 ttl=63 time=0.058 ms
64 bytes from 100.67.79.136: seq=46 ttl=63 time=0.062 ms
....
64 bytes from 100.67.79.136: seq=143 ttl=63 time=0.052 ms
64 bytes from 100.67.79.136: seq=145 ttl=63 time=0.050 ms
^C
--- 100.67.79.136 ping statistics ---
146 packets transmitted, 106 packets received, 27% packet loss
round-trip min/avg/max = 0.044/0.068/0.235 ms
/ # command terminated with exit code 137

Delay (network delay)

The delay action delays the sending of packets. It takes three action-specific attributes: correlation, jitter and latency.

latency defines how long packets are delayed.

jitter specifies the variation of the delay. The default is 0ms. Technically, jitter is also known as packet delay variation.

correlation specifies the correlation of the jitter. The default is 0.

In the example below, the network delay is 500ms±100ms with a correlation of 50%.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - skywalking
    labelSelectors:
      app: apm-item
  delay:
    latency: "500ms"
    correlation: "50"
    jitter: "100ms"
  duration: "10s"
  scheduler:
    cron: "@every 15s"

Experiment result:

Once the experiment starts, the request time rises from about 0.006s to about 1s.

[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m0.006s
user 0m0.001s
sys 0m0.002s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m0.006s
user 0m0.001s
sys 0m0.002s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m0.005s
user 0m0.002s
sys 0m0.001s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m0.006s
user 0m0.000s
sys 0m0.003s
[root@master-01 network]# kubectl apply -f network-delay.yaml
networkchaos.chaos-mesh.org/network-delay-example created
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m0.998s
user 0m0.001s
sys 0m0.002s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m1.110s
user 0m0.000s
sys 0m0.003s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m0.895s
user 0m0.001s
sys 0m0.002s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m1.056s
user 0m0.002s
sys 0m0.002s
[root@master-01 network]# time curl 10.97.57.107:8082/item
{"id":0,"name":"car","price":10000.0}
real 0m1.086s
user 0m0.001s
sys 0m0.002s

A delay can also be applied only to traffic toward specific pods or specific IPs:

# Delay on pod-to-pod traffic
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  delay:
    latency: "90ms"
    correlation: "25"
    jitter: "90ms"
  direction: to
  target:
    selector:
      labelSelectors:
        "app.kubernetes.io/component": "tikv"
    mode: one
  duration: "10s"
  scheduler:
    cron: "@every 15s"

# Delay on traffic from a pod to specific IPs
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  delay:
    latency: "90ms"
    correlation: "25"
    jitter: "90ms"
  direction: to
  externalTargets:
    - "8.8.8.8"
    - "www.google.com"
    - "8.8.0.0/16"
  duration: "10s"
  scheduler:
    cron: "@every 15s"

duplicate

The duplicate action causes packets to be duplicated.

It takes two attributes, correlation and duplicate. In the example below, the duplication rate is 40%.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-duplicate-example
  namespace: chaos-testing
spec:
  action: duplicate
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  duplicate:
    duplicate: "40"
    correlation: "25"
  duration: "10s"
  scheduler:
    cron: "@every 15s"

Experiment result:

/ # ping 100.67.79.136
PING 100.67.79.136 (100.67.79.136): 56 data bytes
64 bytes from 100.67.79.136: seq=0 ttl=63 time=0.076 ms
64 bytes from 100.67.79.136: seq=1 ttl=63 time=0.086 ms
64 bytes from 100.67.79.136: seq=2 ttl=63 time=0.054 ms
64 bytes from 100.67.79.136: seq=3 ttl=63 time=0.087 ms
64 bytes from 100.67.79.136: seq=4 ttl=63 time=0.080 ms
64 bytes from 100.67.79.136: seq=4 ttl=63 time=0.102 ms (DUP!)
64 bytes from 100.67.79.136: seq=5 ttl=63 time=0.089 ms
64 bytes from 100.67.79.136: seq=5 ttl=63 time=0.107 ms (DUP!)
64 bytes from 100.67.79.136: seq=6 ttl=63 time=0.086 ms
64 bytes from 100.67.79.136: seq=7 ttl=63 time=0.081 ms
64 bytes from 100.67.79.136: seq=8 ttl=63 time=0.083 ms
64 bytes from 100.67.79.136: seq=9 ttl=63 time=0.093 ms
64 bytes from 100.67.79.136: seq=10 ttl=63 time=0.071 ms
64 bytes from 100.67.79.136: seq=11 ttl=63 time=0.082 ms
64 bytes from 100.67.79.136: seq=12 ttl=63 time=0.089 ms
64 bytes from 100.67.79.136: seq=13 ttl=63 time=0.093 ms
64 bytes from 100.67.79.136: seq=14 ttl=63 time=0.078 ms
64 bytes from 100.67.79.136: seq=15 ttl=63 time=0.068 ms
64 bytes from 100.67.79.136: seq=16 ttl=63 time=0.052 ms
64 bytes from 100.67.79.136: seq=17 ttl=63 time=0.073 ms
64 bytes from 100.67.79.136: seq=18 ttl=63 time=0.088 ms
64 bytes from 100.67.79.136: seq=18 ttl=63 time=0.116 ms (DUP!)
64 bytes from 100.67.79.136: seq=19 ttl=63 time=0.091 ms
64 bytes from 100.67.79.136: seq=20 ttl=63 time=0.066 ms
64 bytes from 100.67.79.136: seq=21 ttl=63 time=0.090 ms
64 bytes from 100.67.79.136: seq=22 ttl=63 time=0.086 ms
64 bytes from 100.67.79.136: seq=22 ttl=63 time=0.113 ms (DUP!)
64 bytes from 100.67.79.136: seq=23 ttl=63 time=0.083 ms
64 bytes from 100.67.79.136: seq=23 ttl=63 time=0.102 ms (DUP!)
64 bytes from 100.67.79.136: seq=24 ttl=63 time=0.079 ms
64 bytes from 100.67.79.136: seq=25 ttl=63 time=0.092 ms
64 bytes from 100.67.79.136: seq=26 ttl=63 time=0.083 ms
64 bytes from 100.67.79.136: seq=27 ttl=63 time=0.058 ms
^C
--- 100.67.79.136 ping statistics ---
28 packets transmitted, 28 packets received, 5 duplicates, 0% packet loss
round-trip min/avg/max = 0.052/0.084/0.116 ms

corrupt

The corrupt action causes packets to be corrupted.

It takes two attributes, correlation and corrupt. corrupt specifies the percentage of packets that are corrupted.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-corrupt-example
  namespace: chaos-testing
spec:
  action: corrupt
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  corrupt:
    corrupt: "40"
    correlation: "25"
  duration: "10s"
  scheduler:
    cron: "@every 15s"

Experiment result:

/ # ping 100.67.79.136
PING 100.67.79.136 (100.67.79.136): 56 data bytes
# packets were corrupted here, so seq=0 and seq=1 were never received
64 bytes from 100.67.79.136: seq=2 ttl=63 time=0.056 ms
64 bytes from 100.67.79.136: seq=3 ttl=63 time=0.062 ms
64 bytes from 100.67.79.136: seq=4 ttl=63 time=0.053 ms
64 bytes from 100.67.79.136: seq=5 ttl=63 time=0.052 ms
64 bytes from 100.67.79.136: seq=6 ttl=63 time=0.055 ms
64 bytes from 100.67.79.136: seq=7 ttl=63 time=0.051 ms
64 bytes from 100.67.79.136: seq=8 ttl=63 time=0.076 ms
64 bytes from 100.67.79.136: seq=9 ttl=63 time=0.066 ms
64 bytes from 100.67.79.136: seq=10 ttl=63 time=0.073 ms
64 bytes from 100.67.79.136: seq=11 ttl=63 time=0.062 ms
64 bytes from 100.67.79.136: seq=12 ttl=63 time=0.084 ms
64 bytes from 100.67.79.136: seq=13 ttl=63 time=0.055 ms
64 bytes from 100.67.79.136: seq=14 ttl=63 time=0.066 ms
64 bytes from 100.67.79.136: seq=15 ttl=63 time=0.054 ms
# packet was corrupted here, so seq=16 was never received
64 bytes from 100.67.79.136: seq=17 ttl=63 time=0.062 ms
# packets were corrupted here, so seq=18 and seq=19 were never received
64 bytes from 100.67.79.136: seq=20 ttl=63 time=0.074 ms
64 bytes from 100.67.79.136: seq=21 ttl=63 time=0.058 ms
64 bytes from 100.67.79.136: seq=22 ttl=63 time=0.049 ms
64 bytes from 100.67.79.136: seq=23 ttl=63 time=0.048 ms
64 bytes from 100.67.79.136: seq=24 ttl=63 time=0.049 ms
64 bytes from 100.67.79.136: seq=25 ttl=63 time=0.087 ms
^C
--- 100.67.79.136 ping statistics ---
26 packets transmitted, 21 packets received, 19% packet loss
round-trip min/avg/max = 0.048/0.061/0.087 ms

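Network Bandwidth action

The bandwidth action limits network bandwidth. Injecting a bandwidth fault requires three action-specific attributes: rate, buffer and limit. peakrate and minburst can be set in addition.

rate accepts the units bps, kbps, mbps, gbps and tbps, where bps means bytes per second.

limit is the number of bytes that can be queued while waiting for tokens to become available.

buffer is the maximum number of bytes for which tokens can be available instantaneously.

peakrate is the maximum depletion rate of the bucket.

minburst specifies the size of the peakrate bucket.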
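How rate, buffer and limit interact is easiest to see in a toy token bucket. The sketch below is a deliberately simplified model, not the kernel's token bucket filter and not Chaos Mesh code: `token_bucket` is a hypothetical function, the queue is only drained when a new packet arrives, and peakrate and minburst are ignored.

```python
from collections import deque


def token_bucket(arrivals, rate, buffer, limit):
    """Classify packets as sent, queued or dropped (simplified sketch).

    arrivals: list of (time_seconds, size_bytes), sorted by time.
    rate: token refill rate in bytes per second.
    buffer: bucket size in bytes (largest instantaneous burst).
    limit: maximum number of bytes allowed to wait in the queue.
    """
    tokens = float(buffer)  # the bucket starts full
    queue = deque()         # sizes of packets waiting for tokens
    queued_bytes = 0
    last_t = 0.0
    outcome = []
    for t, size in arrivals:
        # Refill tokens for the elapsed time, capped at the bucket size.
        tokens = min(buffer, tokens + (t - last_t) * rate)
        last_t = t
        # Release queued packets first, in order, while tokens allow.
        while queue and queue[0] <= tokens:
            released = queue.popleft()
            tokens -= released
            queued_bytes -= released
        if not queue and size <= tokens:
            tokens -= size
            outcome.append("sent")
        elif queued_bytes + size <= limit:
            queue.append(size)
            queued_bytes += size
            outcome.append("queued")
        else:
            outcome.append("dropped")
    return outcome
```

A burst larger than buffer cannot leave at once, packets beyond limit bytes of backlog are dropped, and rate decides how quickly the backlog clears.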

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-bandwidth-example
  namespace: chaos-testing
spec:
  action: bandwidth
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  bandwidth:
    rate: 100kbps
    limit: 100
    buffer: 10000
    peakrate: 1000000
    minburst: 1000000
  duration: "10s"
  scheduler:
    cron: "@every 15s"

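DNSChaos experiments

DNSChaos simulates faulty DNS responses to requests, such as a DNS error or a random IP address.

Note: DNSChaos only supports A and AAAA records.

If the DNS service was not enabled when Chaos Mesh was installed with helm, enable it with the following command:
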
helm upgrade chaos-mesh helm/chaos-mesh --namespace=chaos-testing --set dnsServer.create=true

Again using a busybox pod as the example, create it with the following file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment
  namespace: default
  labels:
    app: busybox
spec:
  selector:
    matchLabels:
      app: busybox
  replicas: 1
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox
          image: busybox:1.29
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          args:
            - /bin/sh
            - -c
            - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 60000

Next, define the configuration file for a DNSChaos experiment:

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: busybox-dns-random-chaos
spec:
  action: random
  scope: all
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: busybox
  duration: "90s"
  scheduler:
    cron: "@every 100s"

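Two settings deserve attention here:

  • action: the DNS fault to inject. Supported actions:

    • error - DNS requests fail with an error
    • random - DNS requests return a random IP
  • scope: the scope of the DNS experiment. Supported scopes:

    • outer - DNS chaos applies only to hosts outside the Kubernetes cluster
    • inner - DNS chaos applies only to hosts inside the Kubernetes cluster
    • all - DNS chaos applies to all hosts

Experiment result

First, enter the busybox pod and use nslookup to test the domain www.baidu.com:
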
[root@master-01 ~]# kubectl exec -it busybox-deployment-6c9dfd98bc-46xn5  -- sh 
/ # nslookup -type=A www.baidu.com
Server: 10.96.0.10
Address: 10.96.0.10:53

Non-authoritative answer:
www.baidu.com canonical name = www.a.shifen.com
Name: www.a.shifen.com
Address: 180.101.49.12
Name: www.a.shifen.com
Address: 180.101.49.11

After the DNSChaos configuration file is applied, the result looks as follows. Because the action is random, the A record address returned for the domain changes at random roughly every 10s.

[root@master-01 ~]# kubectl exec -it busybox-deployment-6c9dfd98bc-46xn5  -- sh 
/ # nslookup -type=A www.baidu.com
Server: 10.96.175.95
Address: 10.96.175.95:53

Name: www.baidu.com
Address: 161.208.21.171

/ # nslookup -type=A www.baidu.com
Server: 10.96.175.95
Address: 10.96.175.95:53

Name: www.baidu.com
Address: 159.220.76.203

With the error action, the result is as follows:

[root@master-01 ~]# kubectl exec -it busybox-deployment-6c9dfd98bc-46xn5  -- sh 
/ # nslookup -type=A www.baidu.com
Server: 10.96.175.95
Address: 10.96.175.95:53

** server can't find www.baidu.com: SERVFAIL

/ # nslookup -type=A www.baidu.com
Server: 10.96.175.95
Address: 10.96.175.95:53

** server can't find www.baidu.com: SERVFAIL

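StressChaos experiments (stress testing)

StressChaos generates heavy stress on a set of pods. The stressors are injected into the target pods by chaos-daemon.

stressors defines the stressors used to load system components. One or more of them can be used to apply different kinds of stress, and at least one must be specified. The following stressors are currently supported:

memory

A memory stressor continuously puts pressure on virtual memory.

Option | Type | Required | Description
workers | Integer | True | Number of concurrent stress workers.
size | String | False | Memory consumed by each worker; defaults to the total available memory. Can also be given as a percentage of total available memory, or in units of B, KB/KiB, MB/MiB, GB/GiB, TB/TiB.

cpu

A cpu stressor continuously puts load on the CPU.

Option | Type | Required | Description
workers | Integer | True | Number of concurrent stress workers. In practice this is the number of CPUs to stress when it is smaller than the number of available CPUs.
load | Integer | False | Load percentage for each worker. 0 is effectively sleep (no load) and 100 is full load.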
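As a rough way to reason about workers and load, the expected CPU demand can be estimated with simple arithmetic. The helper below is hypothetical and only an estimate: it assumes each worker at 100% load keeps one CPU (1000 millicores) busy, and that a container CPU limit caps the total.

```python
def expected_cpu_millicores(workers, load_pct, limit_millicores=None):
    """Estimate the CPU demand of a cpu stressor in millicores (sketch)."""
    # Assumption: one worker at 100% load is about 1000m (one full CPU).
    demand = workers * load_pct * 10
    if limit_millicores is not None:
        # A container CPU limit caps what the stressor can actually consume.
        demand = min(demand, limit_millicores)
    return demand


# One worker at 60% load inside a container limited to 1000m.
estimate = expected_cpu_millicores(1, 60, limit_millicores=1000)
```

Under these assumptions, one worker with load 60 in a pod limited to 1000m should show up as roughly 600m of CPU usage.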

Example configuration:

This defines a stress test that uses the cpu stressor:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: burn-cpu
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-1"
    namespaces:
      - default
  stressors:
    cpu:
      workers: 1
      load: 60
  duration: "2m"
  scheduler:
    cron: "@every 3m"

Experiment result:

Modify the earlier busybox-1 YAML file as follows to add a CPU limit. Note that it is best to set requests and limits to the same value:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-1
  namespace: default
  labels:
    app: busybox-1
spec:
  selector:
    matchLabels:
      app: busybox-1
  replicas: 1
  template:
    metadata:
      labels:
        app: busybox-1
      annotations:
        enable.version-checker.io/busybox: "true"
    spec:
      containers:
        - name: busybox
          image: busybox:1.29
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          args:
            - /bin/sh
            - -c
            - sleep 12h
          resources:
            limits:
              cpu: 1000m
            requests:
              cpu: 1000m

After applying the StressChaos above, we can watch the CPU usage change in Prometheus.


Stress testing with memory

The pod under test:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-2
  namespace: default
  labels:
    app: busybox-2
spec:
  selector:
    matchLabels:
      app: busybox-2
  replicas: 1
  template:
    metadata:
      labels:
        app: busybox-2
      annotations:
        enable.version-checker.io/busybox: "true"
    spec:
      containers:
        - name: busybox-2
          image: busybox:1.29
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          args:
            - /bin/sh
            - -c
            - sleep 12h
          resources:
            limits:
              memory: 512Mi
            requests:
              memory: 512Mi

The StressChaos that uses the memory stressor:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: burn-mem
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      "app": "busybox-2"
    namespaces:
      - default
  stressors:
    memory:
      workers: 1
      size: 200MiB
  duration: "60s"
  scheduler:
    cron: "@every 2m"

Result

TimeChaos experiment

TimeChaos modifies the return value of clock_gettime, which shifts the time seen by Go's time.Now(), Rust std's std::time::Instant::now(), and similar calls.

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-shift
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      "app": "go"
    namespaces:
      - default
  clockIds:
    - CLOCK_REALTIME
  timeOffset: "-10m100ns"
  duration: "10s"
  scheduler:
    cron: "@every 15s"

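Apart from the following two parameters, all the other parameters are the same as in the earlier experiments.

  • timeOffset specifies the time offset. It is a duration string with units, such as 300ms or -1.5h. Valid units are "ns", "us" (or "µs"), "ms", "s", "m", and "h".
  • clockIds defines all the clk_id values to be affected; clk_id is the first argument of the clock_gettime call. For most applications, CLOCK_REALTIME is sufficient.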
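
To see how a compound offset such as "-10m100ns" adds up, the sketch below converts a timeOffset string into nanoseconds. It is a minimal Python illustration of the unit list above, not the parser Chaos Mesh uses; the name `offset_to_ns` is hypothetical.

```python
import re

# Nanoseconds per unit, for the units listed for timeOffset.
_NS_PER_UNIT = {
    "ns": 1,
    "us": 1_000,
    "µs": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3600 * 1_000_000_000,
}

# One "<number><unit>" term; two-letter units are tried before "s" and "m".
_TERM = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def offset_to_ns(offset: str) -> int:
    """Convert an offset such as "-10m100ns" into signed nanoseconds."""
    if not offset:
        raise ValueError("empty offset")
    sign, body = 1, offset
    if body[0] in "+-":
        sign = -1 if body[0] == "-" else 1
        body = body[1:]
    if not body:
        raise ValueError(f"invalid offset: {offset!r}")
    total, pos = 0, 0
    while pos < len(body):
        match = _TERM.match(body, pos)
        if match is None:
            raise ValueError(f"invalid offset: {offset!r}")
        number, unit = match.groups()
        total += round(float(number) * _NS_PER_UNIT[unit])
        pos = match.end()
    return sign * total
```

For example, `offset_to_ns("-10m100ns")` returns -600000000100, that is, ten minutes plus 100 nanoseconds into the past.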

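Note:

  • The time modification is injected only into the main process of the container.
  • TimeChaos has no effect on pure syscalls.
  • All injected vDSO calls use pure syscalls to obtain the real time, so clock-related function calls may become much slower.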
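
The clk_id values listed under clockIds are the same constants that a program passes as the first argument of clock_gettime. As a minimal sketch of what such a call looks like, Python's standard time module exposes it on Unix systems; the helper name `read_clocks` is hypothetical, and the sketch makes no claim about how a Python process behaves under TimeChaos.

```python
import time


def read_clocks() -> dict:
    """Read two POSIX clocks, passing each clk_id as the first argument."""
    return {
        "CLOCK_REALTIME": time.clock_gettime(time.CLOCK_REALTIME),
        "CLOCK_MONOTONIC": time.clock_gettime(time.CLOCK_MONOTONIC),
    }


if __name__ == "__main__":
    for name, seconds in read_clocks().items():
        print(f"{name}: {seconds:.9f} s")
```

CLOCK_REALTIME is the wall clock that most applications read, which is why listing it alone in clockIds is usually enough.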

Result:

Here we create a new deployment. The misterli/chaos-mesh-time:v1 image is a service that continuously prints the current time.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: go
  namespace: default
  labels:
    app: go
spec:
  selector:
    matchLabels:
      app: go
  replicas: 1
  template:
    metadata:
      labels:
        app: go
    spec:
      containers:
        - name: golang
          image: misterli/chaos-mesh-time:v1
          imagePullPolicy: IfNotPresent

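After applying the TimeChaos, the current time printed by the service (the 当前时间 field in the log) is shifted back by 10m100ns while the experiment is active: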

kubectl  logs -f go-7c9c5496fb-cbstm golang 
当前时间:2020-12-09 08:33:44.242509095 +0000 UTC m=+590.000164188
当前时间:2020-12-09 08:33:45.242486302 +0000 UTC m=+591.000141485
当前时间:2020-12-09 08:33:46.242474529 +0000 UTC m=+592.000129632
当前时间:2020-12-09 08:33:47.242477013 +0000 UTC m=+593.000132116
当前时间:2020-12-09 08:33:48.2424384 +0000 UTC m=+594.000093574
当前时间:2020-12-09 08:23:49.242405168 +0000 UTC m=+595.000060502
当前时间:2020-12-09 08:23:50.242466362 +0000 UTC m=+596.000121585
当前时间:2020-12-09 08:23:51.242474215 +0000 UTC m=+597.000129509
当前时间:2020-12-09 08:23:52.242473614 +0000 UTC m=+598.000128847
当前时间:2020-12-09 08:23:53.242483071 +0000 UTC m=+599.000138284
当前时间:2020-12-09 08:23:54.242551938 +0000 UTC m=+600.000207242
当前时间:2020-12-09 08:23:55.242465166 +0000 UTC m=+601.000120459
当前时间:2020-12-09 08:23:56.243491971 +0000 UTC m=+602.001147235
当前时间:2020-12-09 08:23:57.242482998 +0000 UTC m=+603.000138302
当前时间:2020-12-09 08:33:58.246163715 +0000 UTC m=+604.003818868
当前时间:2020-12-09 08:33:59.242488497 +0000 UTC m=+605.000143680
当前时间:2020-12-09 08:34:00.24244772 +0000 UTC m=+606.000102823
当前时间:2020-12-09 08:34:01.242478637 +0000 UTC m=+607.000133740
当前时间:2020-12-09 08:34:02.242489246 +0000 UTC m=+608.000144350
当前时间:2020-12-09 08:24:03.242567903 +0000 UTC m=+609.000223176
当前时间:2020-12-09 08:24:04.242612565 +0000 UTC m=+610.000267789
当前时间:2020-12-09 08:24:05.242490617 +0000 UTC m=+611.000145921
当前时间:2020-12-09 08:24:06.242484445 +0000 UTC m=+612.000139689
当前时间:2020-12-09 08:24:07.242487661 +0000 UTC m=+613.000142974
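
The shift can also be measured from the log itself. Go prints the monotonic clock reading as the `m=+...` suffix, and in the log above that reading keeps advancing by about one second per line while the wall clock jumps. The following Python sketch compares the two; it assumes the exact log format shown above, and the helper names `parse_line` and `find_jumps` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Wall-clock timestamp, optional fraction, and the monotonic "m=+" reading.
_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(\.\d+)? \+0000 UTC "
    r"m=\+(\d+(?:\.\d+)?)"
)


def parse_line(line: str) -> tuple:
    """Return (wall_clock_seconds, monotonic_seconds) for one log line."""
    match = _LINE.search(line)
    if match is None:
        raise ValueError(f"unexpected log line: {line!r}")
    stamp, fraction, mono = match.groups()
    wall = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    wall = wall.replace(tzinfo=timezone.utc)
    return wall.timestamp() + float(fraction or 0), float(mono)


def find_jumps(lines: list, threshold: float = 1.0) -> list:
    """Return (line_index, jump_seconds) where the wall clock moved
    differently from the monotonic reading by more than `threshold`."""
    jumps = []
    previous = None
    for index, line in enumerate(lines):
        current = parse_line(line)
        if previous is not None:
            wall_delta = current[0] - previous[0]
            mono_delta = current[1] - previous[1]
            drift = wall_delta - mono_delta
            if abs(drift) > threshold:
                jumps.append((index, round(drift)))
        previous = current
    return jumps
```

Fed the lines around `m=+594` and `m=+595` from the log above, `find_jumps` reports a jump of -600 seconds, which matches the configured offset of -10m100ns once rounded to whole seconds; the later return to the real time shows up as +600.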

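If we enter the container and check the time with the date command, we find that it is unaffected, because the time shift is injected only into the container's main process: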

[root@master-01 network]# kubectl exec -it  go-7c9c5496fb-cbstm  -- sh 
/app # date
Wed Dec 9 08:37:06 UTC 2020
/app # date
Wed Dec 9 08:37:07 UTC 2020
/app # date
Wed Dec 9 08:37:08 UTC 2020
/app # date
Wed Dec 9 08:37:09 UTC 2020
/app # date
Wed Dec 9 08:37:10 UTC 2020
/app # date
Wed Dec 9 08:37:11 UTC 2020
/app # date
Wed Dec 9 08:37:12 UTC 2020
/app # date
Wed Dec 9 08:37:12 UTC 2020
/app # date
Wed Dec 9 08:37:13 UTC 2020
/app # date
Wed Dec 9 08:37:14 UTC 2020
/app # date
Wed Dec 9 08:37:15 UTC 2020
/app # date
Wed Dec 9 08:37:16 UTC 2020
/app # date
Wed Dec 9 08:37:16 UTC 2020
/app # date
Wed Dec 9 08:37:17 UTC 2020
/app # date
Wed Dec 9 08:37:18 UTC 2020
/app # date
Wed Dec 9 08:37:20 UTC 2020
/app # date
Wed Dec 9 08:37:21 UTC 2020
/app # date
Wed Dec 9 08:37:22 UTC 2020