kube-prometheus installation and configuration

1. Download kube-prometheus, extract it, and enter the directory

wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.13.0.tar.gz
tar -xf v0.13.0.tar.gz 
cd kube-prometheus-0.13.0/manifests/

2. Edit the Ingress files for prometheus, grafana, and alertmanager

# This is prometheus-ingress.yaml; grafana-ingress.yaml and alertmanager-ingress.yaml are similar
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-k8s-ingress
  namespace: monitoring
  annotations:
    field.cattle.io/publicEndpoints: >-
      [{"addresses":["192.168.1.181","192.168.1.182","192.168.1.183","192.168.1.184","192.168.1.185"],"port":443,"protocol":"HTTPS","serviceName":"monitoring:prometheus-k8s","ingressName":"monitoring:prometheus-k8s-ingress","hostname":"prometheus.ga.skyvault.cn","path":"/","allNodes":false}]
    kubernetes.io/ingress.class: "nginx"
    prometheus.io/http_probe: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: prometheus.ga.skyvault.cn
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
  tls:
    - hosts:
        - prometheus.ga.skyvault.cn
      secretName: skyvault-cn-tls-certificate
status:
  loadBalancer:
    ingress:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
      - ip: 192.168.1.184
      - ip: 192.168.1.185

3. Images may fail to pull, so change the image source

Change the image references so that they point at a private image registry.
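
One way to do this is to rewrite the registry prefix in the manifests with `sed`. The sketch below is a minimal example, assuming a hypothetical private registry `registry.example.com/mirror` that already holds copies of the images under the same repository paths. The registry name and the two upstream prefixes it rewrites are assumptions, not values from the original setup.

```shell
#!/bin/sh
# Hypothetical private registry; replace with your own mirror.
MIRROR="registry.example.com/mirror"

# rewrite_images FILE: point quay.io and registry.k8s.io image references
# at $MIRROR, keeping a FILE.bak backup of the original.
rewrite_images() {
  sed -i.bak -E "s#(image: *)(quay\.io|registry\.k8s\.io)/#\1${MIRROR}/#" "$1"
}

# Demonstration on a throwaway file rather than the real manifests.
demo=$(mktemp)
printf '        image: quay.io/prometheus/prometheus:v2.46.0\n' > "$demo"
rewrite_images "$demo"
cat "$demo"
```

To apply it to the real files, call `rewrite_images` on each YAML file under manifests/. Other references, such as images hosted on Docker Hub or image names passed as container arguments, would need their own patterns.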

4. Delete the bundled network policies; otherwise all access to the services is blocked

# Run this once the manifests have been applied (step 5); the policies do not exist before that
kubectl -n monitoring delete networkpolicies.networking.k8s.io --all

5. Install kube-prometheus

# Run from the repository root (kube-prometheus-0.13.0/), where manifests/ is a subdirectory
kubectl apply --server-side -f manifests/setup
kubectl wait \
	--for condition=Established \
	--all CustomResourceDefinition \
	--namespace=monitoring
kubectl apply -f manifests/

6. Add the TLS certificate as a Secret in the monitoring namespace, named after the secretName field in the Ingress files

7. List all resources in the monitoring namespace and confirm that they are running normally

kubectl get all -n monitoring
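
With many pods, the ones that are not healthy are easy to miss in the full listing. The sketch below is a minimal filter over `kubectl get pods --no-headers` output. It runs against a canned sample so that it works without a cluster, and the pod names in the sample are made up.

```shell
#!/bin/sh
# not_ready: read `kubectl get pods --no-headers` output on stdin and print
# every pod whose STATUS is neither Running nor Completed, or that is Running
# with fewer ready containers than expected (READY column such as 1/2).
not_ready() {
  awk '{
    split($2, r, "/")
    if (($3 != "Running" && $3 != "Completed") || ($3 == "Running" && r[1] != r[2]))
      print $1, $2, $3
  }'
}

# Canned sample so the sketch runs without a cluster; pod names are illustrative.
sample='prometheus-k8s-0          2/2   Running            0   5m
grafana-7c9f5d8b6-abcde   0/1   ImagePullBackOff   0   5m
node-exporter-x2x9z       1/2   Running            3   5m'

printf '%s\n' "$sample" | not_ready
```

Against a real cluster the same filter would be used as `kubectl get pods -n monitoring --no-headers | not_ready`; empty output means every pod is ready.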

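8. After installation, Prometheus shows three alerts: `Watchdog`, `KubeControllerManagerDown`, and `KubeSchedulerDown`

The cluster does not create a Service for the kube-controller-manager and kube-scheduler system components, so the last two alerts fire. `Watchdog` always fires by design; it only confirms that the alerting pipeline works.

Create the Endpoints and Service for kube-controller-manager: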

#vi cm-prometheus.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: cm-prometheus
  namespace: kube-system
subsets:
  - addresses:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
    ports:
      - name: https-metrics
        port: 10257
        protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: cm-prometheus
  namespace: kube-system
spec:
  type: ClusterIP
  ports:
    - name: https-metrics
      port: 10257
      protocol: TCP
      targetPort: 10257

Create the Endpoints and Service for kube-scheduler:

#vi scheduler-prometheus.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: scheduler-prometheus
  namespace: kube-system
subsets:
  - addresses:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
    ports:
      - name: https-metrics
        port: 10259
        protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: scheduler-prometheus
  namespace: kube-system
spec:
  type: ClusterIP
  ports:
    - name: https-metrics
      port: 10259
      protocol: TCP
      targetPort: 10259
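
The two Endpoints objects above differ only in name, component label, and port, so they can also be generated rather than copied by hand. The sketch below is one possible helper; the function name `gen_endpoints` is hypothetical, and it prints only the Endpoints part, not the matching Service.

```shell
#!/bin/sh
# gen_endpoints NAME COMPONENT PORT IP...
# Print an Endpoints manifest like the two above for the given control-plane IPs.
gen_endpoints() {
  name=$1 component=$2 port=$3
  shift 3
  cat <<EOF
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: ${component}
  name: ${name}
  namespace: kube-system
subsets:
  - addresses:
EOF
  for ip in "$@"; do
    printf '      - ip: %s\n' "$ip"
  done
  cat <<EOF
    ports:
      - name: https-metrics
        port: ${port}
        protocol: TCP
EOF
}

gen_endpoints scheduler-prometheus kube-scheduler 10259 \
  192.168.1.181 192.168.1.182 192.168.1.183
```

The output can be reviewed first and then piped to `kubectl apply -f -`.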

After these are created, the kube-controller-manager and kube-scheduler entries under Targets in Prometheus have scrape targets but report errors. This is because kube-scheduler binds to 127.0.0.1 by default at startup, so requests to its node IP address are refused.

In Rancher: Cluster Management => gongan => Edit Config => Cluster Configuration => Advanced

Find Additional Controller Manager Args and Additional Scheduler Args, and add the argument --bind-address=0.0.0.0 to each.

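9. Configure Grafana

Add a data source with the URL https://prometheus.ga.skyvault.cn:30443

Import a dashboard, either from a JSON file or by ID; the ID used here is 13105.

10. Add custom monitoring

This includes alerting rules for external services.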

#vi prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.46.0
    prometheus: k8s
    role: alert-rules
  name: k8s-prometheus-rules
  namespace: monitoring
spec:
  groups:
  - name: node-rules
    rules:
    - alert: NodeDown
      expr: up{job="node-exporter"} == 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Node is down"
        description: "The node {{ $labels.instance }} has been down for more than 3 minutes."
  - name: namespace
    rules:
    - alert: PodRestart
      expr: (floor(increase(kube_pod_container_status_restarts_total{namespace="kube-system"}[1m])) > 0)
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod restarted in the last minute."
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the last 1 minute."
  - name: cupom
    rules:
    - alert: RequestError
      expr: sum(increase(RequestErrorCount{job="cupom",instance="cupom-api.dev.skyvault.cn:443"}[1m])) by (path,method) > 0
      labels:
        severity: warning
      annotations:
        summary: "New request errors occurred."
        description: "{{ $value }} new errors in the last 1 minute for path {{ $labels.path }} and method {{ $labels.method }} on cupom."
    - alert: ResponseCode4**
      expr: sum(increase(RequestCount{job="cupom",instance="cupom-api.dev.skyvault.cn:443",code=~"4..*"}[1m])) by (path,method) > 2
      labels:
        severity: warning
      annotations:
        summary: "Multiple errors with response code 4** occurred."
        description: "{{ $value }} new errors in the last 1 minute for path {{ $labels.path }} and method {{ $labels.method }} on cupom."
    - alert: ResponseCode5**
      expr: sum(increase(RequestCount{job="cupom",instance="cupom-api.dev.skyvault.cn:443",code=~"5..*"}[1m])) by (path,method) > 2
      labels:
        severity: warning
      annotations:
        summary: "Multiple errors with response code 5** occurred."
        description: "{{ $value }} new errors in the last 1 minute for path {{ $labels.path }} and method {{ $labels.method }} on cupom."

kubectl apply -f prometheus-rules.yaml
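
A note on the `code=~"4..*"` and `code=~"5..*"` matchers in these rules: Prometheus anchors regular expressions in label matchers at both ends, so `4..*` matches any value that starts with `4` and has at least one more character, while `4..` would match exactly three characters. The sketch below imitates that anchoring with `grep -E` and an explicit `^(...)$`; the sample code values are made up for illustration, and `grep` is only a stand-in for Prometheus' own regex engine.

```shell
#!/bin/sh
# matches PATTERN VALUE: succeed when VALUE fully matches PATTERN,
# imitating the implicit ^(...)$ anchoring of =~ label matchers.
matches() {
  printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

# Compare the pattern used in the rules with a stricter three-character one.
for code in 404 40 4040 500; do
  for pattern in '4..*' '4..'; do
    if matches "$pattern" "$code"; then result=match; else result=no-match; fi
    printf '%-5s %-5s %s\n' "$code" "$pattern" "$result"
  done
done
```

If the `code` label only ever holds three-digit status codes, both patterns select the same series; the difference matters only for other values.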

11. Alertmanager alerting configuration

#vi alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.26.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'qutnfirpngmybccf'
      smtp_require_tls: false
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "email_configs":
      - to: '[email protected]'
        send_resolved: true
    - "name": "warning-critical-receiver"
      "webhook_configs":
      - "url": "https://alert.ga.skyvault.cn:30443/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/cca82303-5b6d-4b29-8168-422da1fd3af8"
    - "name": "Watchdog"
    - "name": "null"
    "route":
      "group_by": ['alertname','service']
      "group_interval": "2m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "1h"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      # Matchers within one route must all match, so a single regex covers both severities
      - "matchers":
        - "severity =~ critical|warning"
        "receiver": "warning-critical-receiver"
      - "matchers":
        - "severity = info"
        "receiver": "Default"
type: Opaque

kubectl replace -f alertmanager-secret.yaml

12. Monitor services outside the cluster

Static configuration

Create prometheus-additional.yaml

#vi prometheus-additional.yaml
- job_name: cupom
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - cupom-api.dev.skyvault.cn:443

Create the Secret manifest and deploy it to the monitoring namespace

kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --dry-run=client -o yaml > additional-scrape-configs.yaml

kubectl apply -f additional-scrape-configs.yaml  -n monitoring

Note: to update the configuration, delete the Secret and then create it again

kubectl delete secret additional-scrape-configs -n monitoring
rm additional-scrape-configs.yaml

Add the additionalScrapeConfigs option to prometheus-prometheus.yaml

#vi prometheus-prometheus.yaml
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml

kubectl apply -f prometheus-prometheus.yaml

Import the file Application-GinExporter-1719367918282.json as a dashboard

13. Change the Prometheus Operator data retention time

Prometheus Operator keeps data for 1d by default; change this to 30d.

#vi prometheus-prometheus.yaml
spec:
  retention: 30d

kubectl apply -f prometheus-prometheus.yaml

14. Forward alerts to Feishu through a webhook

Install the open-source project PrometheusAlert, which forwards the messages

kubectl apply -n monitoring -f Prometheus-Deployment.yaml

Below is the part of Prometheus-Deployment.yaml that needs to be changed; the rest stays as it is

fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/**************************************
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    field.cattle.io/publicEndpoints: >-
      [{"addresses":["192.168.1.181","192.168.1.182","192.168.1.183","192.168.1.184","192.168.1.185"],"port":443,"protocol":"HTTPS","serviceName":"monitoring:prometheus-k8s","ingressName":"monitoring:prometheus-k8s-ingress","hostname":"prometheus.ga.skyvault.cn","path":"/","allNodes":false}]
    kubernetes.io/ingress.class: nginx
  name: prometheus-alert-center
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
    - host: alert.ga.skyvault.cn
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: prometheus-alert-center
              port:
                number: 8080
  tls:
    - hosts:
        - alert.ga.skyvault.cn
      secretName: skyvault-cn-tls-certificate
status:
  loadBalancer:
    ingress:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
      - ip: 192.168.1.184
      - ip: 192.168.1.185

Modify the message template

{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}**[Prometheus recovery notice]({{$v.generatorURL}}) ✅**
*[{{$v.labels.alertname}}]({{$var}})*
Severity: {{$v.labels.severity}}
Start time: {{GetCSTtime $v.startsAt}}
End time: {{GetCSTtime $v.endsAt}}
**{{$v.annotations.description}}**{{else}}**[Prometheus alert]({{$v.generatorURL}}) 🔥**
*[{{$v.labels.alertname}}]({{$var}})*
Severity: {{$v.labels.severity}}
Start time: {{GetCSTtime $v.startsAt}}
**{{$v.annotations.description}}**
[Open Grafana](https://grafana.ga.skyvault.cn:30443/dashboards)
[Open Prometheus](https://prometheus.ga.skyvault.cn:30443/alerts)
{{end}}{{ end }}
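
To try the template without waiting for a real alert, a payload in the Alertmanager webhook format can be written to a file and posted by hand. The sketch below is a minimal example: the alert values and the example.com URLs are made up, and the fields are the ones this template reads (`externalURL`, and per alert `status`, `labels`, `annotations`, `startsAt`, `endsAt`, `generatorURL`). Whether PrometheusAlert needs further fields has not been verified here.

```shell
#!/bin/sh
# Write a made-up firing alert in the Alertmanager webhook payload shape
# to a temporary file.
payload=$(mktemp)
cat > "$payload" <<'EOF'
{
  "version": "4",
  "status": "firing",
  "receiver": "warning-critical-receiver",
  "externalURL": "https://alertmanager.example.com",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "NodeDown", "severity": "critical" },
      "annotations": { "description": "The node 192.168.1.181:9100 is down." },
      "startsAt": "2024-06-26T02:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "https://prometheus.example.com/graph"
    }
  ]
}
EOF

# Post it to the webhook URL configured for the Alertmanager receiver, e.g.:
#   curl -H 'Content-Type: application/json' -d @"$payload" '<webhook URL from alertmanager.yaml>'
echo "wrote $payload"
```

Changing `"status"` on the alert to `"resolved"` and setting a real `endsAt` exercises the recovery branch of the template.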
