
Installing Kubernetes with Rancher RKE

2020/12/02

Host list

Hostname            Public IP       Private IP    OS           User
rke-test-master-01  180.153.180.33  192.168.0.60  Ubuntu 18.04 robot
rke-test-master-02  180.153.180.11  192.168.0.31  Ubuntu 18.04 robot
rke-test-master-03  180.153.180.23  192.168.0.65  Ubuntu 18.04 robot
rke-test-node-01    180.153.180.34  192.168.0.55  Ubuntu 18.04 robot

SLB load balancer

Port 6443 on the SLB load balancer round-robins to port 6443 on rke-test-master-01, rke-test-master-02, and rke-test-master-03.

Install and configure Docker

Reference: https://docs.docker.com/engine/install/ubuntu/

Remove old versions

If an older version of Docker was installed previously, remove it first:

$ sudo apt-get remove docker docker-engine docker.io containerd runc

Set up the repository

Update the apt package index:

$ sudo apt-get update

Install packages so apt can use a repository over HTTPS:

$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

Add Docker's official GPG key:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Verify the key fingerprint:

$ sudo apt-key fingerprint 0EBFCD88

Set up the stable repository:

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Install

Update the apt package index:

$ sudo apt-get update

Install Docker Engine:

$ sudo apt-get install docker-ce docker-ce-cli containerd.io

Test

$ sudo docker info

Add the working user to the docker group with usermod -aG docker <user_name>:

root@rke-test-master-1:/home/robot# usermod -aG docker robot 
root@rke-test-master-1:/home/robot# id robot
uid=1001(robot) gid=1006(robot) groups=1006(robot),999(docker)

Disable swap

swapoff -a && sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
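
To confirm swap is fully off (an optional check, not part of the original steps):

swapon --show          # prints nothing when no swap is active
free -h | grep -i swap # should report 0B of swap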

Disable SELinux

apt install -y selinux-utils
setenforce 0 && sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

Check kernel modules

Run the following script:

#!/bin/bash
for module in br_netfilter ip6_udp_tunnel ip_set ip_set_hash_ip ip_set_hash_net iptable_filter iptable_nat iptable_mangle iptable_raw nf_conntrack_netlink nf_conntrack nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat nf_nat_ipv4 nf_nat_masquerade_ipv4 nfnetlink udp_tunnel veth vxlan x_tables xt_addrtype xt_conntrack xt_comment xt_mark xt_multiport xt_nat xt_recent xt_set xt_statistic xt_tcpudp; do
    if ! lsmod | grep -q $module; then
        echo "module $module is not present, try to install...";
        modprobe $module
        if [ $? -eq 0 ]; then
            echo -e "\033[32;1mSuccessfully installed $module!\033[0m"
        else
            echo -e "\033[31;1mInstall $module failed!!!\033[0m"
        fi
    fi;
done

Expected output:

robot@rke-test-master-1:~$ sudo bash check.sh 
[sudo] password for robot:
module ip6_udp_tunnel is not present, try to install...
Successfully installed ip6_udp_tunnel!
module ip_set is not present, try to install...
Successfully installed ip_set!
module ip_set_hash_ip is not present, try to install...
Successfully installed ip_set_hash_ip!
module ip_set_hash_net is not present, try to install...
Successfully installed ip_set_hash_net!
module iptable_mangle is not present, try to install...
Successfully installed iptable_mangle!
module iptable_raw is not present, try to install...
Successfully installed iptable_raw!
module veth is not present, try to install...
Successfully installed veth!
module vxlan is not present, try to install...
Successfully installed vxlan!
module xt_comment is not present, try to install...
Successfully installed xt_comment!
module xt_mark is not present, try to install...
Successfully installed xt_mark!
module xt_multiport is not present, try to install...
Successfully installed xt_multiport!
module xt_nat is not present, try to install...
Successfully installed xt_nat!
module xt_recent is not present, try to install...
Successfully installed xt_recent!
module xt_set is not present, try to install...
Successfully installed xt_set!
module xt_statistic is not present, try to install...
Successfully installed xt_statistic!
module xt_tcpudp is not present, try to install...
Successfully installed xt_tcpudp!

Adjust the sysctl configuration

Edit /etc/sysctl.conf and add the following settings:

net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
# required by the official docs
net.bridge.bridge-nf-call-iptables = 1

Run sudo sysctl --system to apply them.
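
To confirm the values are active (an optional check; the net.bridge.* keys only exist once the br_netfilter module is loaded):

sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables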

Disable automatic sleep/hibernation

sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

Open TCP/6443 with iptables

Running just one of the following commands is enough:

# Open TCP/6443 for all
iptables -A INPUT -p tcp --dport 6443 -j ACCEPT

# Open TCP/6443 for one specific IP
iptables -A INPUT -p tcp -s your_ip_here --dport 6443 -j ACCEPT
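
Plain iptables rules do not survive a reboot on Ubuntu. One common way to persist them (an optional extra, not part of the original steps) is the iptables-persistent package:

sudo apt-get install -y iptables-persistent
sudo netfilter-persistent save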

SSH server configuration

The system-wide SSH server configuration file, /etc/ssh/sshd_config, must contain the following line so that TCP forwarding is allowed:

AllowTcpForwarding yes

Then run systemctl restart sshd.
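
To verify the effective setting after the restart (an optional check):

sudo sshd -T | grep -i allowtcpforwarding   # should print "allowtcpforwarding yes"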

Edit /etc/hosts

cat /etc/hosts
....
192.168.0.60 rke-test-master-01
192.168.0.31 rke-test-master-02
192.168.0.65 rke-test-master-03
192.168.0.55 rke-test-node-01
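
If the entries are missing, they can be appended on each node in one go (a small convenience snippet using the same addresses as above):

cat <<'EOF' | sudo tee -a /etc/hosts
192.168.0.60 rke-test-master-01
192.168.0.31 rke-test-master-02
192.168.0.65 rke-test-master-03
192.168.0.55 rke-test-node-01
EOF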

Generate an SSH key

Run ssh-keygen:

robot@rke-test-master-1:~$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/robot/.ssh/id_rsa):
Created directory '/home/robot/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/robot/.ssh/id_rsa.
Your public key has been saved in /home/robot/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:lWbqBm0dVE2gHQ0LaEQrbl/ufd2CmQN6G70xWcETaxg robot@rke-test-master-1
The key's randomart image is:
+---[RSA 2048]----+
| oo.o.E*o |
| oo = *oo |
| ... B + * |
| . o * . . o |
| + S o . |
| . = o.. o |
| +.o.=+ ..|
| ....o=+o o|
| .o.oo . |
+----[SHA256]-----+

Set up passwordless login

#!/bin/bash
for i in rke-test-master-01 rke-test-master-02 rke-test-master-03 rke-test-node-01; do
    ssh-copy-id robot@${i}
done

Configure the SLB
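
The original setup uses a cloud SLB (configured outside this post) that forwards TCP 6443 to the three masters in round-robin. If you reproduce this with a self-managed load balancer instead, a minimal nginx stream configuration on a dedicated LB host might look roughly like this sketch (the nginx package and paths are assumptions, not part of the original setup):

# Assumption: nginx with the stream module installed on a separate LB host.
sudo tee /etc/nginx/nginx.conf > /dev/null <<'EOF'
worker_processes auto;
events { worker_connections 1024; }
stream {
    upstream kube_apiserver {
        # round-robin is nginx's default balancing method
        server 192.168.0.60:6443;
        server 192.168.0.31:6443;
        server 192.168.0.65:6443;
    }
    server {
        listen 6443;
        proxy_pass kube_apiserver;
    }
}
EOF
sudo nginx -t && sudo systemctl restart nginx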

Download RKE

wget https://github.com/rancher/rke/releases/download/v1.0.5/rke_linux-amd64

Note that different Kubernetes versions require different RKE versions.

Reference: https://github.com/rancher/rke/releases
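
After downloading, make the binary executable and put it somewhere on the PATH; the commands below assume /usr/local/bin and that the binary is invoked simply as rke, as in the rest of this post:

chmod +x rke_linux-amd64
sudo mv rke_linux-amd64 /usr/local/bin/rke
rke --version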

Create the cluster.yml file used by RKE

The system images must match the Kubernetes version; see: https://github.com/rancher/kontainer-driver-metadata/blob/master/rke/k8s_rke_system_images.go

cluster.yml:

nodes:
- address: 180.153.180.33             # node IP; must be reachable by RKE
  port: "22"                          # SSH port
  internal_address: 192.168.0.60      # internal address; if unset, address is used for inter-host communication
  role:                               # roles assigned to this node
  - controlplane
  - etcd
  hostname_override: 180.153.180.33   # name used when registering the node
  user: robot                         # SSH user
  docker_socket: /var/run/docker.sock # Docker socket path, defaults to /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa         # key used to connect to the node
  labels:                             # node labels
    node-role.kubernetes.io/master: ""
- address: 180.153.180.11
  port: "22"
  internal_address: 192.168.0.31
  role:
  - controlplane
  - etcd
  hostname_override: 180.153.180.11
  user: robot
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels:
    node-role.kubernetes.io/master: ""
- address: 180.153.180.23
  port: "22"
  internal_address: 192.168.0.65
  role:
  - controlplane
  - etcd
  hostname_override: 180.153.180.23
  user: robot
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels:
    node-role.kubernetes.io/master: ""
- address: 180.153.180.34
  port: "22"
  internal_address: 192.168.0.55
  role:
  - worker
  hostname_override: 180.153.180.34
  user: robot
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels:
    sms-wuxi-test: true
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: false
    retention: ""
    creation: ""
  kube-api:
    image: ""
    extra_args:
      external-hostname: 180.153.180.40
      audit-log-path: "/var/log/audit.log"
      audit-log-format: "json"
      audit-log-maxage: 60
      audit-log-maxbackup: 10
      audit-policy-file: "/etc/kubernetes/audit-rule/rule.yaml"
      feature-gates: PersistentLocalVolumes=true,VolumeScheduling=true,MountPropagation=true,BlockVolume=true
    extra_binds:
    - "/var/log/kubernetes:/var/log"
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: 30000-32767
    pod_security_policy: false
  kube-controller:
    image: ""
    extra_args:
      terminated-pod-gc-threshold: 20
      cluster-signing-cert-file: /etc/kubernetes/ssl/kube-ca.pem
      cluster-signing-key-file: /etc/kubernetes/ssl/kube-ca-key.pem
      cluster-name: "stage"
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args:
      cgroup-driver: cgroupfs
      resolv-conf: ""
      cluster-dns: 169.254.20.10
      max-pods: 250
    extra_binds:
    - "/mnt:/mnt"
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args:
      proxy-mode: ipvs
    extra_binds:
    - "/var/run/dbus:/var/run/dbus"
    extra_env: []
network:
  plugin: flannel
  options: {}
authentication:
  strategy: x509
  options: {}
  sans:
  - "101.198.181.20"
addons: ""
addons_include: []
ssh_key_path: ~/.ssh/id_rsa  # cluster-level SSH key path; if keys are set at both cluster and node level, RKE prefers the node-level key
ssh_agent_auth: false        # whether to use the local ssh agent for SSH connections
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false # whether to skip the supported-Docker-version check before RKE runs
#kubernetes_version: "v1.15.11-rancher1-0" # Kubernetes version
private_registries: []
ingress:
  provider: "none"
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: "rke-test"     # cluster name
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
monitoring:
  provider: "none"
  options: {}
dns:
  provider: coredns
  nodelocal:
    ip_address: "169.254.20.10"
system_images:
  etcd: rancher/coreos-etcd:v3.3.10-rancher1
  alpine: rancher/rke-tools:v0.1.50
  nginx_proxy: rancher/rke-tools:v0.1.50
  cert_downloader: rancher/rke-tools:v0.1.50
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.50
  kubedns: rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.3.0
  coredns: rancher/coredns-coredns:1.3.1
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.3.0
  kubernetes: rancher/hyperkube:v1.15.11-rancher1
  flannel: rancher/coreos-flannel:v0.11.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher5
  calico_node: rancher/calico-node:v3.7.4
  calico_cni: rancher/calico-cni:v3.7.4
  calico_controllers: rancher/calico-kube-controllers:v3.7.4
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.7.4
  canal_cni: rancher/calico-cni:v3.7.4
  canal_flannel: rancher/coreos-flannel:v3.7.4
  weave_node: weaveworks/weave-kube:2.5.2
  weave_cni: weaveworks/weave-npc:2.5.2
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.25.1-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.3

Create the cluster

Run rke up to create the cluster and wait until the success message appears:

robot@rke-test-master-01:~$ rke up 
.....
.....
INFO[0040] [addons] Executing deploy job rke-coredns-addon
INFO[0045] [addons] CoreDNS deployed successfully..
INFO[0045] [dns] DNS provider coredns deployed successfully
INFO[0045] [ingress] Metrics Server is disabled, skipping Metrics server installation
INFO[0045] [ingress] ingress controller is disabled, skipping ingress controller
INFO[0045] [addons] Setting up user addons
INFO[0045] [addons] no user addons defined
INFO[0045] Finished building Kubernetes cluster successfully

During cluster creation, RKE generates a kubeconfig file named kube_config_cluster.yml that can be used to control the Kubernetes cluster; it is essentially the same as the admin.conf produced by a kubeadm installation.

  • cluster.yml: the RKE cluster configuration file.
  • kube_config_cluster.yml: the kubeconfig file for the cluster; it contains credentials with full access to the cluster.
  • cluster.rkestate: the Kubernetes cluster state file, which also contains credentials with full access to the cluster; it is only created when using RKE v0.2.0 or later.

Verify

Install kubectl; matching the Kubernetes version is recommended:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubectl
chmod +x kubectl
mv kubectl /usr/local/bin
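
The later examples call kubectl both with and without --kubeconfig; to avoid passing the flag every time, the generated kubeconfig can be copied to the default location (an optional convenience, not part of the original steps):

mkdir -p ~/.kube
cp kube_config_cluster.yml ~/.kube/config
kubectl get nodes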

Check some basic cluster information:

root@rke-test-master-01:/home/robot# kubectl --kubeconfig kube_config_cluster.yml cluster-info 
Kubernetes master is running at https://180.153.180.33:6443
CoreDNS is running at https://180.153.180.33:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
root@rke-test-master-01:/home/robot# kubectl version --client
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:08:59Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
root@rke-test-master-01:/home/robot# kubectl --kubeconfig kube_config_cluster.yml get nodes
NAME STATUS ROLES AGE VERSION
180.153.180.11 Ready controlplane,etcd,master 48m v1.15.11
180.153.180.23 Ready controlplane,etcd,master 49m v1.15.11
180.153.180.33 Ready controlplane,etcd,master 48m v1.15.11
180.153.180.34 Ready worker 48m v1.15.11

Node management

Add a node

Prepare an additional node:

Hostname          Public IP       Private IP    OS           User
rke-test-node-02  180.153.180.45  192.168.0.78  Ubuntu 18.04 robot

Run the node preparation steps above on the new node first. Then edit cluster.yml, add the extra node, and specify its role in the Kubernetes cluster:

......
- address: 180.153.180.34
  port: "22"
  internal_address: 192.168.0.55
  role:
  - worker
  hostname_override: 180.153.180.34
  user: robot
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels:
    sms-wuxi-test: true
- address: 180.153.180.45
  port: "22"
  internal_address: 192.168.0.78
  role:
  - worker
  hostname_override: 180.153.180.45
  user: robot
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels:
    sms-wuxi-test: true
........

Run rke up --update-only to add or remove worker nodes only. This ignores everything in cluster.yml except the worker nodes.

Note

Adding or removing worker nodes with --update-only may still trigger a redeployment or update of addons and other components.

Once it succeeds, the node shows up in the cluster:

robot@rke-test-master-01:~$ kubectl  get nodes
NAME STATUS ROLES AGE VERSION
180.153.180.11 Ready controlplane,etcd,master 5h31m v1.15.11
180.153.180.23 Ready controlplane,etcd,master 5h31m v1.15.11
180.153.180.33 Ready controlplane,etcd,master 5h31m v1.15.11
180.153.180.34 Ready worker 5h31m v1.15.11
180.153.180.45 Ready worker 13m v1.15.11

Remove a node

Removing a node works much the same as adding one: delete the node's entry from cluster.yml, then run rke up --update-only again:

robot@rke-test-master-01:~$ rke up --update-only
INFO[0000] Running RKE version: v1.0.5
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [./cluster.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [180.153.180.11]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.33]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.23]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.34]
INFO[0000] [network] No hosts added existing cluster, skipping port check
INFO[0000] [certificates] Deploying kubernetes certificates to Cluster nodes
INFO[0000] Checking if container [cert-deployer] is running on host [180.153.180.34], try #1
INFO[0000] Checking if container [cert-deployer] is running on host [180.153.180.33], try #1
INFO[0000] Checking if container [cert-deployer] is running on host [180.153.180.23], try #1
INFO[0000] Checking if container [cert-deployer] is running on host [180.153.180.11], try #1
INFO[0000] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.23]
INFO[0000] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.11]
INFO[0000] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.34]
INFO[0000] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.33]
INFO[0001] Starting container [cert-deployer] on host [180.153.180.33], try #1
INFO[0001] Starting container [cert-deployer] on host [180.153.180.23], try #1
INFO[0001] Starting container [cert-deployer] on host [180.153.180.34], try #1
INFO[0001] Starting container [cert-deployer] on host [180.153.180.11], try #1
INFO[0001] Checking if container [cert-deployer] is running on host [180.153.180.33], try #1
INFO[0001] Checking if container [cert-deployer] is running on host [180.153.180.23], try #1
INFO[0001] Checking if container [cert-deployer] is running on host [180.153.180.11], try #1
INFO[0001] Checking if container [cert-deployer] is running on host [180.153.180.34], try #1
INFO[0006] Checking if container [cert-deployer] is running on host [180.153.180.33], try #1
INFO[0006] Removing container [cert-deployer] on host [180.153.180.33], try #1
INFO[0006] Checking if container [cert-deployer] is running on host [180.153.180.23], try #1
INFO[0006] Checking if container [cert-deployer] is running on host [180.153.180.11], try #1
INFO[0006] Removing container [cert-deployer] on host [180.153.180.23], try #1
INFO[0006] Removing container [cert-deployer] on host [180.153.180.11], try #1
INFO[0006] Checking if container [cert-deployer] is running on host [180.153.180.34], try #1
INFO[0006] Removing container [cert-deployer] on host [180.153.180.34], try #1
INFO[0006] [reconcile] Rebuilding and updating local kube config
INFO[0006] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml]
INFO[0006] [reconcile] host [180.153.180.33] is active master on the cluster
INFO[0006] [certificates] Successfully deployed kubernetes certificates to Cluster nodes
INFO[0006] [reconcile] Reconciling cluster state
INFO[0006] [reconcile] Check etcd hosts to be deleted
INFO[0006] [reconcile] Check etcd hosts to be added
INFO[0006] [hosts] Cordoning host [180.153.180.45]
INFO[0006] [hosts] Deleting host [180.153.180.45] from the cluster
INFO[0006] [hosts] Successfully deleted host [180.153.180.45] from the cluster
INFO[0006] [dialer] Setup tunnel for host [180.153.180.45]
INFO[0007] [worker] Tearing down Worker Plane..
INFO[0007] Removing container [kubelet] on host [180.153.180.45], try #1
INFO[0007] [remove/kubelet] Successfully removed container on host [180.153.180.45]
INFO[0007] Removing container [nginx-proxy] on host [180.153.180.45], try #1
INFO[0008] [remove/nginx-proxy] Successfully removed container on host [180.153.180.45]
INFO[0008] Removing container [service-sidekick] on host [180.153.180.45], try #1
INFO[0008] [remove/service-sidekick] Successfully removed container on host [180.153.180.45]
INFO[0008] [worker] Successfully tore down Worker Plane..
INFO[0008] [hosts] Cleaning up host [180.153.180.45]
INFO[0008] [hosts] Running cleaner container on host [180.153.180.45]
INFO[0008] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.45]
INFO[0008] Starting container [kube-cleaner] on host [180.153.180.45], try #1
INFO[0008] [kube-cleaner] Successfully started [kube-cleaner] container on host [180.153.180.45]
INFO[0008] Waiting for [kube-cleaner] container to exit on host [180.153.180.45]
INFO[0008] Container [kube-cleaner] is still running on host [180.153.180.45]
INFO[0009] Waiting for [kube-cleaner] container to exit on host [180.153.180.45]
INFO[0009] [hosts] Removing cleaner container on host [180.153.180.45]
INFO[0009] Removing container [kube-cleaner] on host [180.153.180.45], try #1
INFO[0009] [hosts] Removing dead container logs on host [180.153.180.45]
INFO[0009] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.45]
INFO[0009] Starting container [rke-log-cleaner] on host [180.153.180.45], try #1
INFO[0010] [cleanup] Successfully started [rke-log-cleaner] container on host [180.153.180.45]
INFO[0010] Removing container [rke-log-cleaner] on host [180.153.180.45], try #1
INFO[0010] [remove/rke-log-cleaner] Successfully removed container on host [180.153.180.45]
INFO[0010] [hosts] Successfully cleaned up host [180.153.180.45]
INFO[0010] [reconcile] Rebuilding and updating local kube config
INFO[0010] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml]
INFO[0010] [reconcile] host [180.153.180.33] is active master on the cluster
INFO[0010] [reconcile] Reconciled cluster state successfully
INFO[0010] Pre-pulling kubernetes images
INFO[0010] Image [rancher/hyperkube:v1.15.11-rancher1] exists on host [180.153.180.33]
INFO[0010] Image [rancher/hyperkube:v1.15.11-rancher1] exists on host [180.153.180.23]
INFO[0010] Image [rancher/hyperkube:v1.15.11-rancher1] exists on host [180.153.180.11]
INFO[0010] Image [rancher/hyperkube:v1.15.11-rancher1] exists on host [180.153.180.34]
INFO[0010] Kubernetes images pulled successfully
INFO[0010] [etcd] Building up etcd plane..
INFO[0010] [etcd] Successfully started etcd plane.. Checking etcd cluster health
INFO[0011] [authz] Creating rke-job-deployer ServiceAccount
INFO[0011] [authz] rke-job-deployer ServiceAccount created successfully
INFO[0011] [authz] Creating system:node ClusterRoleBinding
INFO[0011] [authz] system:node ClusterRoleBinding created successfully
INFO[0011] [authz] Creating kube-apiserver proxy ClusterRole and ClusterRoleBinding
INFO[0011] [authz] kube-apiserver proxy ClusterRole and ClusterRoleBinding created successfully
INFO[0011] Successfully Deployed state file at [./cluster.rkestate]
INFO[0011] [state] Saving full cluster state to Kubernetes
INFO[0011] [state] Successfully Saved full cluster state to Kubernetes ConfigMap: cluster-state
INFO[0011] [worker] Building up Worker Plane..
INFO[0011] [worker] Successfully started Worker Plane..
INFO[0011] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.33]
INFO[0011] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.11]
INFO[0011] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.34]
INFO[0011] Image [rancher/rke-tools:v0.1.50] exists on host [180.153.180.23]
INFO[0011] Starting container [rke-log-cleaner] on host [180.153.180.11], try #1
INFO[0011] Starting container [rke-log-cleaner] on host [180.153.180.34], try #1
INFO[0011] Starting container [rke-log-cleaner] on host [180.153.180.33], try #1
INFO[0011] Starting container [rke-log-cleaner] on host [180.153.180.23], try #1
INFO[0012] [cleanup] Successfully started [rke-log-cleaner] container on host [180.153.180.11]
INFO[0012] Removing container [rke-log-cleaner] on host [180.153.180.11], try #1
INFO[0012] [cleanup] Successfully started [rke-log-cleaner] container on host [180.153.180.33]
INFO[0012] Removing container [rke-log-cleaner] on host [180.153.180.33], try #1
INFO[0012] [cleanup] Successfully started [rke-log-cleaner] container on host [180.153.180.34]
INFO[0012] Removing container [rke-log-cleaner] on host [180.153.180.34], try #1
INFO[0012] [cleanup] Successfully started [rke-log-cleaner] container on host [180.153.180.23]
INFO[0012] Removing container [rke-log-cleaner] on host [180.153.180.23], try #1
INFO[0012] [remove/rke-log-cleaner] Successfully removed container on host [180.153.180.11]
INFO[0012] [remove/rke-log-cleaner] Successfully removed container on host [180.153.180.33]
INFO[0012] [remove/rke-log-cleaner] Successfully removed container on host [180.153.180.34]
INFO[0012] [remove/rke-log-cleaner] Successfully removed container on host [180.153.180.23]
INFO[0012] [sync] Syncing nodes Labels and Taints
INFO[0012] [sync] Successfully synced nodes Labels and Taints
INFO[0012] [network] Setting up network plugin: flannel
INFO[0012] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0012] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0012] [addons] Executing deploy job rke-network-plugin
INFO[0012] [addons] Setting up coredns
INFO[0012] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0012] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0012] [addons] Executing deploy job rke-coredns-addon
INFO[0012] [addons] CoreDNS deployed successfully..
INFO[0012] [dns] DNS provider coredns deployed successfully
INFO[0012] [ingress] Metrics Server is disabled, skipping Metrics server installation
INFO[0012] [ingress] ingress controller is disabled, skipping ingress controller
INFO[0012] [addons] Setting up user addons
INFO[0012] [addons] no user addons defined
INFO[0012] Finished building Kubernetes cluster successfully
robot@rke-test-master-01:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
180.153.180.11 Ready controlplane,etcd,master 8h v1.15.11
180.153.180.23 Ready controlplane,etcd,master 8h v1.15.11
180.153.180.33 Ready controlplane,etcd,master 8h v1.15.11
180.153.180.34 Ready worker 8h v1.15.11

Upgrade the cluster

Note: each RKE release supports a specific list of Kubernetes versions, so pick an RKE version that supports the target Kubernetes version. If a Kubernetes version is defined in both kubernetes_version and system_images, the version in system_images takes precedence over kubernetes_version.

root@rke-test-master-01:/home/robot# rke config --list-version --all
v1.17.4-rancher1-1
v1.16.8-rancher1-1
v1.15.11-rancher1-1

Kubernetes version requirements

When upgrading an existing cluster, Kubernetes can only be upgraded from one minor version to the next, e.g. from v1.16.0 to v1.17.0, or to a patch release within the same minor version, e.g. from v1.16.0 to v1.16.1.

Prerequisites:

  • Make sure cluster.yml no longer contains a system_images section.
  • Make sure the file needed to manage the Kubernetes cluster state, cluster.rkestate, is present in the working directory.

Once the prerequisites are in place, run rke up --config cluster.yml.

Note: upgrading the cluster pulls the new images, which can take quite a while; pre-pulling them on each node shortens the upgrade.
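
For example, most of the upgrade time goes into pulling the new hyperkube image, so it can be pre-pulled on every node beforehand (a sketch; the tag v1.16.8-rancher1 matches the upgrade shown below, and the host list and user are taken from the host table above):

#!/bin/bash
# Pre-pull the target hyperkube image on all nodes to speed up the upgrade.
for host in rke-test-master-01 rke-test-master-02 rke-test-master-03 rke-test-node-01; do
    ssh robot@${host} "docker pull rancher/hyperkube:v1.16.8-rancher1"
done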

robot@rke-test-master-01:~$ rke up --config cluster.yml
INFO[0000] Running RKE version: v1.0.5
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [./cluster.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [180.153.180.23]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.11]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.33]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.34]
INFO[0000] [network] No hosts added existing cluster, skipping port check
.....
INFO[0038] [dns] DNS provider coredns deployed successfully
INFO[0038] [ingress] Metrics Server is disabled, skipping Metrics server installation
INFO[0038] [ingress] ingress controller is disabled, skipping ingress controller
INFO[0038] [addons] Setting up user addons
INFO[0038] [addons] no user addons defined
INFO[0038] Finished building Kubernetes cluster successfully
robot@rke-test-master-01:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
180.153.180.11 Ready controlplane,etcd,master 25h v1.16.8
180.153.180.23 Ready controlplane,etcd,master 25h v1.16.8
180.153.180.33 Ready controlplane,etcd,master 25h v1.16.8
180.153.180.34 Ready worker 25h v1.16.8
robot@rke-test-master-01:~$ kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-5c59fd465f-fqf7t 0/1 ContainerCreating 0 8s
kube-system coredns-5c59fd465f-hhkb8 0/1 ContainerCreating 0 23s
kube-system coredns-autoscaler-d765c8497-cz5qw 1/1 Running 0 23s
kube-system kube-flannel-gnkr8 2/2 Running 2 25h
kube-system kube-flannel-j29qn 2/2 Running 2 25h
kube-system kube-flannel-kttxx 2/2 Running 2 25h
kube-system kube-flannel-ws484 2/2 Running 2 25h
kube-system rke-coredns-addon-deploy-job-sr8nk 0/1 Completed 0 25s
kube-system rke-network-plugin-deploy-job-6lcvq 0/1 Completed 0 25h

Maintenance

Create a one-time snapshot

RKE starts a container to take the backup snapshot and removes that container once the backup is done. A one-time snapshot can also be uploaded to an S3-compatible backend. The procedure is as follows:

  1. First, run the following command to create a one-time snapshot locally:

    rke etcd snapshot-save --config cluster.yml --name snapshot-name

    Result: a snapshot is created and stored under /opt/rke/etcd-snapshots.

Example:

robot@rke-test-master-01:~$ rke etcd snapshot-save --config cluster.yml --name snapshot-name
INFO[0000] Running RKE version: v1.0.5
INFO[0000] Starting saving snapshot on etcd hosts
INFO[0000] [dialer] Setup tunnel for host [180.153.180.23]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.33]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.11]
INFO[0000] [dialer] Setup tunnel for host [180.153.180.34]
INFO[0000] [etcd] Running snapshot save once on host [180.153.180.33]
INFO[0000] Pulling image [rancher/rke-tools:v0.1.52] on host [180.153.180.33], try #1
INFO[0028] Image [rancher/rke-tools:v0.1.52] exists on host [180.153.180.33]
INFO[0029] Starting container [etcd-snapshot-once] on host [180.153.180.33], try #1
INFO[0029] [etcd] Successfully started [etcd-snapshot-once] container on host [180.153.180.33]
INFO[0029] Waiting for [etcd-snapshot-once] container to exit on host [180.153.180.33]
INFO[0029] Container [etcd-snapshot-once] is still running on host [180.153.180.33]
INFO[0030] Waiting for [etcd-snapshot-once] container to exit on host [180.153.180.33]
INFO[0030] Removing container [etcd-snapshot-once] on host [180.153.180.33], try #1
INFO[0030] [etcd] Running snapshot save once on host [180.153.180.11]
INFO[0030] Pulling image [rancher/rke-tools:v0.1.52] on host [180.153.180.11], try #1
INFO[0101] Image [rancher/rke-tools:v0.1.52] exists on host [180.153.180.11]
INFO[0102] Starting container [etcd-snapshot-once] on host [180.153.180.11], try #1
INFO[0102] [etcd] Successfully started [etcd-snapshot-once] container on host [180.153.180.11]
INFO[0102] Waiting for [etcd-snapshot-once] container to exit on host [180.153.180.11]
INFO[0102] Container [etcd-snapshot-once] is still running on host [180.153.180.11]
INFO[0103] Waiting for [etcd-snapshot-once] container to exit on host [180.153.180.11]
INFO[0103] Removing container [etcd-snapshot-once] on host [180.153.180.11], try #1
INFO[0103] [etcd] Running snapshot save once on host [180.153.180.23]
INFO[0104] Pulling image [rancher/rke-tools:v0.1.52] on host [180.153.180.23], try #1
INFO[0145] Image [rancher/rke-tools:v0.1.52] exists on host [180.153.180.23]
INFO[0146] Starting container [etcd-snapshot-once] on host [180.153.180.23], try #1
INFO[0147] [etcd] Successfully started [etcd-snapshot-once] container on host [180.153.180.23]
INFO[0147] Waiting for [etcd-snapshot-once] container to exit on host [180.153.180.23]
INFO[0147] Container [etcd-snapshot-once] is still running on host [180.153.180.23]
INFO[0148] Waiting for [etcd-snapshot-once] container to exit on host [180.153.180.23]
INFO[0148] Removing container [etcd-snapshot-once] on host [180.153.180.23], try #1
INFO[0148] Finished saving/uploading snapshot [snapshot-name] on all etcd hosts
robot@rke-test-master-01:~$ ls /opt/rke/etcd-snapshots/
snapshot-name.zip
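
To have the one-time snapshot uploaded to an S3-compatible backend instead of only the local /opt/rke/etcd-snapshots directory, the same command accepts S3 options; the endpoint, bucket, and credentials below are placeholders:

rke etcd snapshot-save --config cluster.yml --name snapshot-name \
    --s3 \
    --s3-endpoint s3.example.com \
    --bucket-name my-etcd-backups \
    --access-key <ACCESS_KEY> \
    --secret-key <SECRET_KEY>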

Restore the cluster

If your Kubernetes cluster suffers a disaster, you can use rke etcd snapshot-restore to recover etcd. The command restores etcd from a specific snapshot and should be run on an etcd node of the affected cluster.

When you run the command, the following happens:

  • The snapshot is synced, or downloaded from S3 if necessary.
  • The snapshot checksums are compared across the etcd nodes to make sure they are identical.
  • The current cluster is removed and its old data cleaned up via rke remove. This deletes the entire Kubernetes cluster, not just the etcd cluster.
  • The etcd cluster is rebuilt from the chosen snapshot.
  • A new cluster is created via rke up.
  • The cluster system pods are restarted.

Warning: before running rke etcd snapshot-restore, back up any important data in the cluster, because the command deletes your current Kubernetes cluster and replaces it with a new one.

Snapshots used to restore the etcd cluster can be stored locally under /opt/rke/etcd-snapshots or pulled from an S3-compatible backend.

Example: restore from a local snapshot

rke etcd snapshot-restore --config cluster.yml --name mysnapshot
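
After the restore (and the rke up it triggers) finishes, it is worth confirming that the control plane is healthy again (a simple check, not from the original post):

kubectl --kubeconfig kube_config_cluster.yml get nodes
kubectl --kubeconfig kube_config_cluster.yml get pods -n kube-system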

Cleaning up a cluster during installation

# Remove all containers
docker rm -f $(docker ps -qa)

# Remove all container volumes
docker volume rm $(docker volume ls -q)

# Unmount kubelet/rancher mounts
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher; do umount $mount; done

# Back up directories
mv /etc/kubernetes /etc/kubernetes-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/etcd /var/lib/etcd-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/rancher /var/lib/rancher-bak-$(date +"%Y%m%d%H%M")
mv /opt/rke /opt/rke-bak-$(date +"%Y%m%d%H%M")

# Remove leftover paths
rm -rf /etc/ceph \
       /etc/cni \
       /opt/cni \
       /run/secrets/kubernetes.io \
       /run/calico \
       /run/flannel \
       /var/lib/calico \
       /var/lib/cni \
       /var/lib/kubelet \
       /var/log/containers \
       /var/log/pods \
       /var/run/calico

# Clean up network interfaces
network_interface=`ls /sys/class/net`
for net_inter in $network_interface;
do
    if ! echo $net_inter | grep -qiE 'lo|docker0|eth*|ens*';then
        ip link delete $net_inter
    fi
done

# Kill leftover processes listening on cluster ports
port_list='80 443 6443 2376 2379 2380 8472 9099 10250 10254'
for port in $port_list
do
    pid=`netstat -atlnup|grep $port |awk '{print $7}'|awk -F '/' '{print $1}'|grep -v -|sort -rnk2|uniq`
    if [[ -n $pid ]];then
        kill -9 $pid
    fi
done

pro_pid=`ps -ef |grep -v grep |grep kube|awk '{print $2}'`
if [[ -n $pro_pid ]];then
    kill -9 $pro_pid
fi

# Flush iptables
## Note: if the node has custom iptables rules, run the following commands with care
sudo iptables --flush
sudo iptables --flush --table nat
sudo iptables --flush --table filter
sudo iptables --table nat --delete-chain
sudo iptables --table filter --delete-chain

systemctl restart docker

Errors

1. FATA[0024] [workerPlane] Failed to bring up Worker Plane: [Failed to create [kube-proxy] container on host [180.153.180.11]: Failed to create Docker container [kube-proxy] on host [180.153.180.11]: <nil>]

Fix:

cd /etc/docker/
mv key.json key.json.bak

2. CoreDNS errors

robot@rke-test-master-01:~$ kubectl -n kube-system  logs -f coredns-5c8bd56579-twjfz
/etc/coredns/Corefile:6 - Error during parsing: Unknown directive 'ready'

Fix:

1. kubectl edit cm coredns -n kube-system

2. Delete the ready directive from the Corefile

3. Restart the pod:

robot@rke-test-master-01:~$ kubectl -n kube-system   logs -f coredns-5c8bd56579-5kfbq 
.:53
2020-12-01T05:40:02.607Z [INFO] CoreDNS-1.3.1
2020-12-01T05:40:02.607Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2020-12-01T05:40:02.607Z [INFO] plugin/reload: Running configuration MD5 = a84140c77bf2394a38d8cb760902c38f
2020-12-01T05:40:03.608Z [FATAL] plugin/loop: Loop (127.0.0.1:34482 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 1441588974363502120.8928791511667081803."

Fix:

1. kubectl edit cm coredns -n kube-system

2. Delete the loop directive, save and exit

3. Restart the CoreDNS pods: kubectl delete pod coredns… -n kube-system

3. coredns-autoscaler errors

robot@rke-test-master-01:~$ kubectl  -n kube-system  logs -f coredns-autoscaler-d8477bc7f-fdczp 
unknown flag: --nodelabels
Usage of /cluster-proportional-autoscaler:
--alsologtostderr[=false]: log to standard error as well as files
--configmap="": ConfigMap containing our scaling parameters.
--default-params=map[]: Default parameters(JSON format) for auto-scaling. Will create/re-create a ConfigMap with this default params if ConfigMap is not present.
--log-backtrace-at=:0: when logging hits line file:N, emit a stack trace
--log-dir="": If non-empty, write log files in this directory
--logtostderr[=false]: log to standard error instead of files
--namespace="": Namespace for all operations, fallback to the namespace of this autoscaler(through MY_POD_NAMESPACE env) if not specified.
--poll-period-seconds=10: The time, in seconds, to check cluster status and perform autoscale.
--stderrthreshold=2: logs at or above this threshold go to stderr
--target="": Target to scale. In format: deployment/*, replicationcontroller/* or replicaset/* (not case sensitive).
--v=0: log level for V logs
--version[=false]: Print the version and exit.
--vmodule=: comma-separated list of pattern=N settings for file-filtered logging

Fix:

1. kubectl -n kube-system edit deployment coredns-autoscaler

2. Delete the line - --nodelabels=node-role.kubernetes.io/worker=true,beta.kubernetes.io/os=linux
