Installing and Configuring kube-prometheus

1. Download kube-prometheus, extract it, and enter the directory

wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.13.0.tar.gz
tar -xf v0.13.0.tar.gz 
cd kube-prometheus-0.13.0/manifests/

2. Edit the ingress files for Prometheus, Grafana, and Alertmanager

# This is prometheus-ingress.yaml; grafana-ingress.yaml and alertmanager-ingress.yaml are similar
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-k8s-ingress
  namespace: monitoring
  annotations:
    field.cattle.io/publicEndpoints: >-
      [{"addresses":["192.168.1.181","192.168.1.182","192.168.1.183","192.168.1.184","192.168.1.185"],"port":443,"protocol":"HTTPS","serviceName":"monitoring:prometheus-k8s","ingressName":"monitoring:prometheus-k8s-ingress","hostname":"prometheus.ga.skyvault.cn","path":"/","allNodes":false}]
    kubernetes.io/ingress.class: "nginx"
    prometheus.io/http_probe: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: prometheus.ga.skyvault.cn
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
  tls:
    - hosts:
        - prometheus.ga.skyvault.cn
      secretName: skyvault-cn-tls-certificate
status:
  loadBalancer:
    ingress:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
      - ip: 192.168.1.184
      - ip: 192.168.1.185

3. The images may fail to pull, so change the image source

Point the image references in the manifests at a private image registry.
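
One way to do this, sketched under assumptions: if the private registry mirrors the upstream images under the same repository paths and tags, the image references in the manifests can be rewritten in bulk. The registry host registry.example.com and the helper name rewrite_registry are placeholders, not part of the original setup, and the sketch assumes GNU sed.

```shell
# Sketch: rewrite "image: <from>/..." references in every YAML file under a directory.
# ASSUMPTION: registry.example.com is a placeholder for your own registry, and the
# upstream images have already been mirrored there under the same paths and tags.
rewrite_registry() {
  # $1 = directory, $2 = registry to replace, $3 = replacement registry
  # $2 is used as a sed regex, so the dot in e.g. quay.io matches any character;
  # that is harmless for registry hostnames.
  find "$1" -type f -name '*.yaml' -exec \
    sed -i "s#image: $2/#image: $3/#g" {} +
}

# Example, run inside manifests/ (left commented out so nothing changes by accident):
# rewrite_registry . quay.io registry.example.com
# rewrite_registry . registry.k8s.io registry.example.com
```

Image names that are passed as command-line arguments rather than in an image: field (for example the Prometheus Operator's --prometheus-config-reloader flag) are not matched by this pattern and would need a separate edit.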

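4. Delete the bundled network policies; otherwise all access to the services is blocked

The policies only exist once the manifests have been applied, so run this after step 5:

kubectl -n monitoring delete networkpolicies.networking.k8s.io --all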

5. Install kube-prometheus

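# The paths below are relative to the repository root (kube-prometheus-0.13.0/),
# so leave the manifests/ directory entered in step 1 first
cd ..
kubectl apply --server-side -f manifests/setup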
kubectl wait \
	--for condition=Established \
	--all CustomResourceDefinition \
	--namespace=monitoring
kubectl apply -f manifests/

6. Add the TLS certificate as a secret in the monitoring namespace; its name must be the value of the secretName field in the ingress files
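
A minimal sketch of this step with kubectl create secret tls, assuming the certificate and private key are available as PEM files. The file paths and the function name are placeholders; the secret name is the secretName value used in the ingress file above.

```shell
# Sketch: create the TLS secret that the ingress resources reference.
# ASSUMPTION: the certificate and key exist as PEM files; the paths passed in are
# placeholders for wherever your certificate actually lives.
create_ingress_tls_secret() {
  # $1 = certificate file, $2 = private key file
  kubectl create secret tls skyvault-cn-tls-certificate \
    --namespace monitoring \
    --cert="$1" \
    --key="$2"
}

# Example (hypothetical file names):
# create_ingress_tls_secret ./tls.crt ./tls.key
```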

7. List all resources in the monitoring namespace and confirm that they are running normally

kubectl get all -n monitoring

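8. After installation, the Prometheus UI shows three firing alerts: Watchdog, KubeControllerManagerDown, and KubeSchedulerDown

Watchdog fires permanently by design. The other two fire because the cluster does not create a Service for the kube-controller-manager and kube-scheduler system components.

  • Create the Endpoints and Service for kube-controller-manager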

    #vi cm-prometheus.yaml
    apiVersion: v1
    kind: Endpoints
    metadata:
      labels:
        app.kubernetes.io/name: kube-controller-manager
      name: cm-prometheus
      namespace: kube-system
    subsets:
      - addresses:
          - ip: 192.168.1.181
          - ip: 192.168.1.182
          - ip: 192.168.1.183
        ports:
          - name: https-metrics
            port: 10257
            protocol: TCP
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app.kubernetes.io/name: kube-controller-manager
      name: cm-prometheus
      namespace: kube-system
    spec:
      type: ClusterIP
      ports:
        - name: https-metrics
          port: 10257
          protocol: TCP
          targetPort: 10257
    
  • Create the Endpoints and Service for kube-scheduler

    #vi scheduler-prometheus.yaml
    apiVersion: v1
    kind: Endpoints
    metadata:
      labels:
        app.kubernetes.io/name: kube-scheduler
      name: scheduler-prometheus
      namespace: kube-system
    subsets:
      - addresses:
          - ip: 192.168.1.181
          - ip: 192.168.1.182
          - ip: 192.168.1.183
        ports:
          - name: https-metrics
            port: 10259
            protocol: TCP
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app.kubernetes.io/name: kube-scheduler
      name: scheduler-prometheus
      namespace: kube-system
    spec:
      type: ClusterIP
      ports:
        - name: https-metrics
          port: 10259
          protocol: TCP
          targetPort: 10259
    

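Once these are created, the kube-controller-manager and kube-scheduler jobs on the Prometheus Targets page have scrape targets, but the scrapes fail. Both components bind to 127.0.0.1 by default, so connections made to the node IP addresses are refused.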

rancher => Cluster Management => gongan => Edit Config => Cluster Configuration => Advanced

Find Additional Controller Manager Args and Additional Scheduler Args, and add the argument --bind-address=0.0.0.0 to each.
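
After the cluster has been reconfigured, reachability of the two metrics ports can be checked from another machine before looking at the Targets page again. The sketch below is illustrative: the function name is made up, and it only tests whether each port answers on the given IP, not whether Prometheus is authorized to scrape it.

```shell
# Sketch: check that ports 10257 (kube-controller-manager) and 10259 (kube-scheduler)
# accept connections on each given IP address.
check_control_plane_ports() {
  for ip in "$@"; do
    for port in 10257 10259; do
      # Without -f, curl exits 0 on any HTTP response (even 401/403); a refused
      # connection or a timeout gives a non-zero exit status.
      if curl -sk --max-time 5 -o /dev/null "https://${ip}:${port}/healthz"; then
        echo "${ip}:${port} reachable"
      else
        echo "${ip}:${port} NOT reachable"
      fi
    done
  done
}

# Example, using the control-plane addresses from the Endpoints above:
# check_control_plane_ports 192.168.1.181 192.168.1.182 192.168.1.183
```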

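9. Configure Grafana

Add a data source with the URL https://prometheus.ga.skyvault.cn:30443

Import a dashboard, either from a JSON file or by ID; the ID used here is 13105

10. Add custom monitoring rules

These include alert rules for services outside the cluster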

#vi prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.46.0
    prometheus: k8s
    role: alert-rules
  name: k8s-prometheus-rules
  namespace: monitoring
spec:
  groups:
  - name: node-rules
    rules:
    - alert: NodeDown
      expr: up{job="node-exporter"} == 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Node is down"
        description: "The node {{ $labels.instance }} is down for more than 3 minutes."
  - name: namespace
    rules:
    - alert: PodRestart
      expr: (floor(increase(kube_pod_container_status_restarts_total{namespace="kube-system"}[1m])) > 0)
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod restart in last 1 minutes."
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the last 1 minute."
  - name: cupom
    rules:
    - alert: RequestError
      expr: sum(increase(RequestErrorCount{job="cupom",instance="cupom-api.dev.skyvault.cn:443"}[1m])) by (path,method) > 0
      labels:
        severity: warning
      annotations:
        summary: "New request errors occurred."
        description: "{{ $value }} new errors in the last 1 minute for path {{ $labels.path }} and method {{ $labels.method }} on cupom."
    - alert: ResponseCode4**
      expr: sum(increase(RequestCount{job="cupom",instance="cupom-api.dev.skyvault.cn:443",code=~"4..*"}[1m])) by (path,method) > 2
      labels:
        severity: warning
      annotations:
        summary: "Multiple errors with response code 4** occurred."
        description: "{{ $value }} new errors in the last 1 minute for path {{ $labels.path }} and method {{ $labels.method }} on cupom."
    - alert: ResponseCode5**
      expr: sum(increase(RequestCount{job="cupom",instance="cupom-api.dev.skyvault.cn:443",code=~"5..*"}[1m])) by (path,method) > 2
      labels:
        severity: warning
      annotations:
        summary: "Multiple errors with response code 5** occurred."
        description: "{{ $value }} new errors in the last 1 minute for path {{ $labels.path }} and method {{ $labels.method }} on cupom."

kubectl apply -f prometheus-rules.yaml

11. Alertmanager alerting configuration

#vi alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.26.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'qutnfirpngmybccf'
      smtp_require_tls: false
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "email_configs":
      - to: '[email protected]'
        send_resolved: true
    - "name": "warning-critical-receiver"
      "webhook_configs":
      - "url": "https://alert.ga.skyvault.cn:30443/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/cca82303-5b6d-4b29-8168-422da1fd3af8"
    - "name": "Watchdog"
    - "name": "null"
    "route":
      "group_by": ['alertname','service']
      "group_interval": "2m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "1h"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
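      - "matchers":
        # Matchers within one route are ANDed, so two equality matchers on severity
        # can never both hold; a single regex matches either value
        - "severity =~ critical|warning"
        "receiver": "warning-critical-receiver"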
      - "matchers":
        - "severity = info"
        "receiver": "Default"
type: Opaque

kubectl replace -f alertmanager-secret.yaml
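
To confirm what was actually stored, the configuration can be read back from the secret. This is a sketch using only kubectl and base64; the function name is illustrative.

```shell
# Sketch: print the alertmanager.yaml currently stored in the alertmanager-main secret.
show_alertmanager_config() {
  # The key contains a dot, so it has to be escaped in the jsonpath expression.
  kubectl -n monitoring get secret alertmanager-main \
    -o 'jsonpath={.data.alertmanager\.yaml}' | base64 -d
}

# Example:
# show_alertmanager_config > live-alertmanager.yaml
```

If amtool is installed, amtool check-config can then be run against the saved file to validate it.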

12. Monitor services outside the cluster

Static configuration

  • Create prometheus-additional.yaml

    #vi prometheus-additional.yaml
    - job_name: cupom
      honor_timestamps: true
      metrics_path: /metrics
      scheme: https
      static_configs:
      - targets:
        - cupom-api.dev.skyvault.cn:443
    
  • Create the secret manifest and deploy it to the monitoring namespace

    kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --dry-run=client -o yaml > additional-scrape-configs.yaml
    
    kubectl apply -f additional-scrape-configs.yaml  -n monitoring
    

    Note: to update the scrape configuration, delete the secret and create it again

    kubectl delete secret additional-scrape-configs -n monitoring
    rm additional-scrape-configs.yaml
    
  • Add the additionalScrapeConfigs option under spec in prometheus-prometheus.yaml

    #vi prometheus-prometheus.yaml
    spec:
      additionalScrapeConfigs:
        name: additional-scrape-configs
        key: prometheus-additional.yaml
    
    kubectl apply -f prometheus-prometheus.yaml
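
As an alternative to deleting and recreating the secret on every change, the manifest can be regenerated and piped straight into kubectl apply, which creates the secret the first time and updates it afterwards. This is a sketch of that common pattern, not the procedure used above, and the function name is illustrative.

```shell
# Sketch: (re)generate the additional-scrape-configs secret and apply it in one step.
# ASSUMPTION: the scrape config file defaults to prometheus-additional.yaml in the
# current directory; pass another path as the first argument.
apply_additional_scrape_configs() {
  # The secret key must stay "prometheus-additional.yaml", because that is the key
  # referenced by additionalScrapeConfigs in prometheus-prometheus.yaml.
  kubectl create secret generic additional-scrape-configs \
    --namespace monitoring \
    --from-file="prometheus-additional.yaml=${1:-prometheus-additional.yaml}" \
    --dry-run=client -o yaml |
    kubectl apply --namespace monitoring -f -
}

# Example:
# apply_additional_scrape_configs
```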
    

Import Application-GinExporter-1719367918282.json into Grafana as a dashboard

13. Change the Prometheus data retention period

The Prometheus Operator retains data for 1d by default; change this to 30d

#vi prometheus-prometheus.yaml
spec:
  retention: 30d

kubectl apply -f prometheus-prometheus.yaml
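
A quick way to confirm the change was accepted is to read the field back from the Prometheus resource. The sketch assumes the resource keeps its kube-prometheus default name k8s; the function name is illustrative.

```shell
# Sketch: print the retention currently set on the Prometheus custom resource.
# ASSUMPTION: the resource is named k8s, as in the stock kube-prometheus manifests.
show_prometheus_retention() {
  kubectl -n monitoring get prometheus k8s -o 'jsonpath={.spec.retention}'
}

# Example:
# show_prometheus_retention
```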

14. Forward alerts to Feishu through a webhook

  • Install the open-source project PrometheusAlert, which relays the messages

    kubectl apply -n monitoring -f Prometheus-Deployment.yaml
    

    Below is part of Prometheus-Deployment.yaml; only this part needs to be changed

    fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/**************************************
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        field.cattle.io/publicEndpoints: >-
          [{"addresses":["192.168.1.181","192.168.1.182","192.168.1.183","192.168.1.184","192.168.1.185"],"port":443,"protocol":"HTTPS","serviceName":"monitoring:prometheus-k8s","ingressName":"monitoring:prometheus-k8s-ingress","hostname":"prometheus.ga.skyvault.cn","path":"/","allNodes":false}]
        kubernetes.io/ingress.class: nginx
      name: prometheus-alert-center
      namespace: monitoring
    spec:
      ingressClassName: nginx
      rules:
        - host: alert.ga.skyvault.cn
          http:
            paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: prometheus-alert-center
                  port:
                    number: 8080
      tls:
        - hosts:
            - alert.ga.skyvault.cn
          secretName: skyvault-cn-tls-certificate
    status:
      loadBalancer:
        ingress:
          - ip: 192.168.1.181
          - ip: 192.168.1.182
          - ip: 192.168.1.183
          - ip: 192.168.1.184
          - ip: 192.168.1.185
    
  • Edit the message template

    {{ $var := .externalURL}}{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}**[Prometheus recovery notice]({{$v.generatorURL}}) ✅**
    *[{{$v.labels.alertname}}]({{$var}})*
    Severity: {{$v.labels.severity}}
    Started at: {{GetCSTtime $v.startsAt}}
    Ended at: {{GetCSTtime $v.endsAt}}
    **{{$v.annotations.description}}**{{else}}**[Prometheus alert notice]({{$v.generatorURL}}) 🔥**
    *[{{$v.labels.alertname}}]({{$var}})*
    Severity: {{$v.labels.severity}}
    Started at: {{GetCSTtime $v.startsAt}}
    **{{$v.annotations.description}}**
    [Open Grafana](https://grafana.ga.skyvault.cn:30443/dashboards)
    [Open Prometheus](https://prometheus.ga.skyvault.cn:30443/alerts)
    {{end}}{{ end }}
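
To exercise the template without waiting for a real alert, a hand-written payload in Alertmanager's webhook format can be posted to PrometheusAlert. This is a sketch: every value in the payload is made up, the function names are illustrative, and PROMETHEUSALERT_URL is a hypothetical variable that must hold the full webhook URL (including the type, tpl, and fsurl query parameters) configured for the receiver.

```shell
# Sketch: build a minimal Alertmanager-style webhook payload and post it.
# ASSUMPTION: all values are test data; only the field names (status, labels,
# annotations, startsAt, endsAt, generatorURL, externalURL) matter, because they
# are the ones the template reads.
sample_alert_payload() {
  cat <<'EOF'
{
  "version": "4",
  "status": "firing",
  "externalURL": "https://alertmanager.example.com",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "TemplateTest", "severity": "warning" },
      "annotations": { "description": "Test alert for the Feishu template." },
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "https://prometheus.example.com/graph"
    }
  ]
}
EOF
}

send_sample_alert() {
  # PROMETHEUSALERT_URL is a hypothetical variable; set it to your own webhook URL.
  sample_alert_payload |
    curl -sS -X POST -H 'Content-Type: application/json' --data-binary @- \
      "${PROMETHEUSALERT_URL:?set PROMETHEUSALERT_URL first}"
}

# Example:
# PROMETHEUSALERT_URL='<your PrometheusAlert webhook URL>' send_sample_alert
```

Changing "status" to "resolved" in the alert entry exercises the recovery branch of the template instead.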