
Two ways for Prometheus to monitor node-exporter on machines outside the cluster

2024/09/11

Background

We use the kube-prometheus project to run Prometheus monitoring inside the Kubernetes cluster, while the middleware is deployed standalone on servers outside the cluster. The middleware itself is already monitored, but these servers are fairly unreliable (we once had five machines reboot at the same time), so the hosts running MySQL/Redis/ES need to be monitored as well.

Installation

Download a release package from https://github.com/prometheus/node_exporter/releases; here I chose 1.8.2:

cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xf node_exporter-1.8.2.linux-amd64.tar.gz
mv node_exporter-1.8.2.linux-amd64 node_exporter
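
Before wiring it into systemd, you can confirm the binary runs (--version is a standard node_exporter flag):

/opt/node_exporter/node_exporter --version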

Create a dedicated user to run node_exporter; running it as a non-login user improves security:

groupadd prometheus 
useradd -g prometheus -s /sbin/nologin prometheus
chown -R prometheus:prometheus /opt/node_exporter/

Create a systemd unit file:

cat /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/opt/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
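
Optionally, since these exporters will only be scraped from inside the cluster, you can bind node_exporter to the host's internal address instead of all interfaces. A minimal sketch using the standard --web.listen-address flag; the IP 10.251.1.51 is just one of the hosts from this post:

# In the [Service] section, bind to the internal interface only
ExecStart=/opt/node_exporter/node_exporter --web.listen-address=10.251.1.51:9100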

Start the service

systemctl daemon-reload
systemctl enable --now node_exporter
If the output shows "Created symlink from /etc/systemd/system/multi-user.target.wants/node_exporter.service to /usr/lib/systemd/system/node_exporter.service.", the service has been added to the boot startup items.
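
You can also double-check that the service is running and listening on port 9100:

systemctl status node_exporter --no-pager
ss -tlnp | grep 9100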

Visit http://127.0.0.1:9100/metrics to verify:

[root@es-3 ~]# curl  http://127.0.0.1:9100/metrics  |head -100 
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.5111e-05
go_gc_duration_seconds{quantile="0.25"} 6.2745e-05
go_gc_duration_seconds{quantile="0.5"} 6.8598e-05
go_gc_duration_seconds{quantile="0.75"} 9.7717e-05
go_gc_duration_seconds{quantile="1"} 0.000294668
go_gc_duration_seconds_sum 0.05627822
go_gc_duration_seconds_count 659
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 9
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.22.5"

Scraping the data

There are two ways for the Prometheus instance running inside Kubernetes to scrape these node exporters outside the cluster:

Method 1: use a ServiceMonitor

Create an Endpoints object and a matching Service object:

---
apiVersion: v1
kind: Endpoints
metadata:
  name: node-exporter-other
  namespace: monitoring
subsets:
  - addresses:
      - ip: 10.251.1.51
      - ip: 10.251.1.52
      - ip: 10.251.1.66
      - ip: 10.251.1.67
      - ip: 10.251.1.68
      - ip: 10.251.1.63
      - ip: 10.251.1.64
      - ip: 10.251.1.65
    ports:
      - name: metrics
        port: 9100
        protocol: TCP

---
kind: Service
apiVersion: v1
metadata:
  name: node-exporter-other
  namespace: monitoring
  labels:
    app: node-exporter-other
    prometheus.io/monitor: "true"
spec:
  type: ClusterIP
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
---

Create the ServiceMonitor object:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter-other
  namespace: monitoring
  labels:
    prometheus.io/monitor: "true"
spec:
  endpoints:
    - interval: 15s
      port: metrics
      path: /metrics
  jobLabel: node-exporter-other
  selector:
    matchLabels:
      app: node-exporter-other
  namespaceSelector:
    matchNames:
      - monitoring

Load these two objects into the cluster with kubectl apply -f, wait a moment, and the new targets will show up in Prometheus.
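
For reference, a minimal apply-and-verify sketch, assuming both manifests are saved in a hypothetical file node-exporter-other.yaml:

kubectl apply -f node-exporter-other.yaml
# The Endpoints object should list all the external IPs
kubectl -n monitoring get endpoints node-exporter-other
kubectl -n monitoring get servicemonitor node-exporter-other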


Note: do not set the interval too long here, 60s for example, or some queries will return no data. Take increase(node_network_receive_bytes_total{instance="$instance",device="$device"}[1m]): with too long a scrape interval there are not enough samples inside the [1m] range to compute the increase. If you really need a 60s interval, widen the time window instead (a common rule of thumb is a range of at least 4x the scrape interval, e.g. [5m]).
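
If you want to sanity-check a widened window, you can run the query against the Prometheus HTTP API directly. A minimal sketch: the service name prometheus-k8s and namespace monitoring are kube-prometheus defaults, and the device value eth0 is a hypothetical example.

# Forward the Prometheus service locally, then query via the HTTP API
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=increase(node_network_receive_bytes_total{device="eth0"}[5m])'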


Method 2: modify the prometheus-prometheus.yaml file

Edit Prometheus's prometheus-prometheus.yaml file and add the following parameter:

additionalScrapeConfigs:
  name: additional-configs
  key: prometheus-additional.yaml

This option makes the operator load the value of the prometheus-additional.yaml key in the additional-configs Secret and append it to Prometheus's scrape configuration.

cat prometheus-prometheus.yaml

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
.......
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  # add this
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
....

The additional-configs Secret is usually created from a file. For example, create a prometheus-additional.yaml file and then run:

# Delete and then recreate to update the additional-configs Secret; Prometheus reloads it automatically
kubectl delete secrets -n monitoring additional-configs
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
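
To confirm what actually landed in the Secret, decode it back out (the dot in the key name must be escaped in jsonpath):

kubectl -n monitoring get secret additional-configs \
  -o jsonpath='{.data.prometheus-additional\.yaml}' | base64 -d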

The content of prometheus-additional.yaml looks like this; fill it in to suit your needs:

- job_name: "node_exporter-other"
  scrape_interval: 10s
  static_configs:
    - targets: ['10.251.1.61:9100']
      labels:
        instance: redis-m
    - targets: ['10.251.1.63:9100']
      labels:
        instance: redis-s
    - targets: ['10.251.1.64:9100']
      labels:
        instance: mysql-m
    - targets: ['10.251.1.65:9100']
      labels:
        instance: mysql-s
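
After the Secret is updated and reloaded, you can verify the rendered configuration inside the Prometheus pod. A sketch, assuming the kube-prometheus default pod name prometheus-k8s-0 and the operator's usual rendered-config path:

kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- \
  promtool check config /etc/prometheus/config_out/prometheus.env.yaml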

Summary

Both approaches can monitor resources outside the cluster, and they apply not only to node_exporter but to other out-of-cluster services as well. Each has its pros and cons; I personally prefer the first, and reach for the second in more complex scenarios, for example when each target needs its own labels, as in the config above.
