Installing and configuring kube-prometheus
1. Download kube-prometheus, extract it, and enter the directory
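The original commands for this step did not survive, so the following is only a minimal sketch of one way to do it. The helper name fetch_kube_prometheus and the default version 0.13.0 are assumptions, not values from this guide; choose the release that matches your Kubernetes version.

```shell
# Hypothetical helper: download and unpack a tagged kube-prometheus release,
# then enter the extracted directory.
# The default version is an assumption; adjust it to your cluster.
fetch_kube_prometheus() {
  version="${1:-0.13.0}"
  url="https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v${version}.tar.gz"
  curl -fsSL -o "kube-prometheus-${version}.tar.gz" "$url" &&
    tar -xzf "kube-prometheus-${version}.tar.gz" &&
    cd "kube-prometheus-${version}"
}
```

Calling `fetch_kube_prometheus 0.13.0` should leave the shell inside kube-prometheus-0.13.0, where the manifests directory lives.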
2. Edit the ingress files for prometheus, grafana, and alertmanager
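The ingress files themselves are not shown, so here is a sketch of what the prometheus one might look like. The host, ingress name, service name, and secretName are taken from values that appear elsewhere in this guide; the file name prometheus-ingress.yaml and the service port 9090 are assumptions.

```shell
# Sketch: write an Ingress for the prometheus service.
# File name and port 9090 are assumptions; host and secretName come from this guide.
cat > prometheus-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-k8s-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
    - host: prometheus.ga.skyvault.cn
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-k8s
                port:
                  number: 9090
  tls:
    - hosts:
        - prometheus.ga.skyvault.cn
      secretName: skyvault-cn-tls-certificate
EOF
```

The grafana and alertmanager files would follow the same shape with their own host and backend service.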
3. Images may fail to pull, so the image source needs to be changed
Point the image references at a private image registry
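One way to switch registries in bulk is a small rewrite over the manifests. This is a sketch under stated assumptions: the helper name rewrite_registry and the mirror host registry.example.com are hypothetical, and the pattern only touches lines of the form `image: <registry>/...`.

```shell
# Hypothetical helper: rewrite "image: <from>/..." to "image: <to>/..."
# in every file under a directory. The dot in <from> is treated as a regex
# wildcard, which is loose but sufficient for registry host names.
rewrite_registry() {
  dir="$1"; from="$2"; to="$3"
  grep -rlE "image: *${from}/" "$dir" | while IFS= read -r f; do
    sed "s#\(image: *\)${from}/#\1${to}/#" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
}
```

For example, `rewrite_registry manifests quay.io registry.example.com/quay` would redirect quay.io images. Image references passed as container arguments rather than `image:` fields are not covered and may need the same change by hand.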
4. Delete the bundled network policies; otherwise all access to the services is blocked
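A possible way to remove the policies before installing is to delete their manifest files. The sketch below assumes the files follow the `*-networkPolicy.yaml` naming used by recent kube-prometheus releases; the helper name remove_network_policies is hypothetical.

```shell
# Hypothetical helper: delete the NetworkPolicy manifests so they are never applied.
# Assumes the files are named like grafana-networkPolicy.yaml.
remove_network_policies() {
  find "${1:-manifests}" -type f -name '*networkPolicy*.yaml' -print -delete
}
```

If the stack is already installed, `kubectl -n monitoring delete networkpolicy --all` removes the live objects instead.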
5. Install kube-prometheus
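The install commands are missing here; the sketch below follows the sequence from the kube-prometheus quickstart and assumes it is run from the extracted directory. The wrapper name install_kube_prometheus is hypothetical.

```shell
# Hypothetical wrapper around the quickstart sequence.
install_kube_prometheus() {
  dir="${1:-manifests}"
  # CRDs and the namespace go first; server-side apply copes with the large CRDs.
  kubectl apply --server-side -f "$dir/setup" &&
    # Wait until the CRDs are registered before creating resources that use them.
    kubectl wait --for condition=Established --all CustomResourceDefinition --namespace=monitoring &&
    kubectl apply -f "$dir/"
}
```

Run it as `install_kube_prometheus manifests`.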
6. Add the TLS certificate as a secret: the namespace is monitoring, and the name is the value of the secretName field in the ingress files
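A minimal sketch of creating that secret with kubectl follows. The helper name create_tls_secret and the certificate file names in the usage line are assumptions.

```shell
# Hypothetical helper: create a TLS secret in the monitoring namespace.
# $1 = secret name (must equal secretName in the ingress files),
# $2 = certificate file, $3 = private key file.
create_tls_secret() {
  name="$1"; cert="$2"; key="$3"
  kubectl create secret tls "$name" --cert="$cert" --key="$key" -n monitoring
}
```

With the secretName used later in this guide, the call would be `create_tls_secret skyvault-cn-tls-certificate tls.crt tls.key`.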
7. List all resources in the monitoring namespace and confirm that everything is running
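The check can be sketched as two kubectl queries; the helper name check_monitoring is hypothetical.

```shell
# Hypothetical helper: show everything in the namespace, then only the pods
# whose phase is not Running (for example Pending or Failed).
check_monitoring() {
  kubectl get all -n monitoring
  kubectl get pods -n monitoring --field-selector=status.phase!=Running
}
```

An empty result from the second query suggests that every pod has reached the Running phase.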
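8. After installation, opening prometheus shows three firing alerts: Watchdog, KubeControllerManagerDown, and KubeSchedulerDown.
Watchdog fires permanently by design, to prove that the alerting pipeline works. The other two fire because the cluster does not create a svc for the system components kube-controller-manager and kube-scheduler, so prometheus finds nothing to scrape.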
- Create the endpoints and svc for kube-controller-manager

    # vi cm-prometheus.yaml
    apiVersion: v1
    kind: Endpoints
    metadata:
      labels:
        app.kubernetes.io/name: kube-controller-manager
      name: cm-prometheus
      namespace: kube-system
    subsets:
    - addresses:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
      ports:
      - name: https-metrics
        port: 10257
        protocol: TCP
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app.kubernetes.io/name: kube-controller-manager
      name: cm-prometheus
      namespace: kube-system
    spec:
      type: ClusterIP
      ports:
      - name: https-metrics
        port: 10257
        protocol: TCP
        targetPort: 10257
- Create the endpoints and svc for kube-scheduler

    # vi scheduler-prometheus.yaml
    apiVersion: v1
    kind: Endpoints
    metadata:
      labels:
        app.kubernetes.io/name: kube-scheduler
      name: scheduler-prometheus
      namespace: kube-system
    subsets:
    - addresses:
      - ip: 192.168.1.181
      - ip: 192.168.1.182
      - ip: 192.168.1.183
      ports:
      - name: https-metrics
        port: 10259
        protocol: TCP
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app.kubernetes.io/name: kube-scheduler
      name: scheduler-prometheus
      namespace: kube-system
    spec:
      type: ClusterIP
      ports:
      - name: https-metrics
        port: 10259
        protocol: TCP
        targetPort: 10259
Once these are created, the kube-controller-manager and kube-scheduler entries under prometheus targets list scrape targets, but the scrapes fail. In this cluster both components are started bound to 127.0.0.1, so connections to the node IP addresses are refused.
In rancher, go to Cluster Management => gongan => Edit Config => Cluster Configuration => Advanced, find Additional Controller Manager Args and Additional Scheduler Args, and add the argument --bind-address=0.0.0.0 to each.
9. Configure grafana
Add a data source with the URL https://prometheus.ga.skyvault.cn:30443
Import a dashboard, either from a json file or by id; the id is 13105
10. Add custom monitoring
This includes monitoring rules for external services
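The original rule files are not shown. As an illustration only, a custom rule for the external cupom job configured later in this guide could look like the sketch below. The file name, rule name, alert name, threshold, and the two metadata labels are assumptions; older kube-prometheus releases select rules by the labels prometheus: k8s and role: alert-rules, while newer ones may select all rules.

```shell
# Sketch: write a PrometheusRule that fires when the cupom target stops answering.
# All names below are hypothetical; only the job name "cupom" comes from this guide.
cat > cupom-rules.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cupom-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: cupom.rules
      rules:
        - alert: CupomTargetDown
          expr: up{job="cupom"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            description: 'Scrape target {{ $labels.instance }} of job cupom has been down for 5 minutes.'
EOF
```

Applying it with `kubectl apply -f cupom-rules.yaml` should make the rule appear under the prometheus rules page once the operator reloads the configuration.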
11. alertmanager alert configuration
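The configuration used here is not shown, so the following is a rough sketch of a webhook-based setup. kube-prometheus reads the alertmanager configuration from the alertmanager-main secret under the key alertmanager.yaml. The receiver name, the timing values, and especially the webhook path and query parameters are assumptions that should be checked against the PrometheusAlert documentation; the service name and port 8080 match the ingress backend shown in the Feishu step.

```shell
# Sketch: minimal alertmanager config with one webhook receiver.
cat > alertmanager.yaml <<'EOF'
global:
  resolve_timeout: 5m
route:
  receiver: feishu-webhook
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
  - name: feishu-webhook
    webhook_configs:
      # Assumed PrometheusAlert endpoint; verify the path and parameters.
      - url: 'http://prometheus-alert-center:8080/prometheusalert?type=fs&tpl=prometheus-fs'
        send_resolved: true
EOF

# Hypothetical helper: replace the alertmanager-main secret with the file above.
apply_alertmanager_config() {
  kubectl create secret generic alertmanager-main \
    --from-file=alertmanager.yaml -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Re-applying the whole manifests directory later would overwrite this secret with the bundled one, so the bundled secret manifest may need the same change.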
12. Monitor services outside the cluster
Static configuration
- Create prometheus-additional.yaml

    # vi prometheus-additional.yaml
    - job_name: cupom
      honor_timestamps: true
      metrics_path: /metrics
      scheme: https
      static_configs:
      - targets:
        - cupom-api.dev.skyvault.cn:443
- Create the secret file and deploy it to the monitoring namespace

    kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --dry-run=client -o yaml > additional-scrape-configs.yaml
    kubectl apply -f additional-scrape-configs.yaml -n monitoring

Note: to update the configuration, delete the secret and create it again

    kubectl delete secret additional-scrape-configs -n monitoring
    rm additional-scrape-configs.yaml
- Add the additionalScrapeConfigs option to prometheus-prometheus.yaml

    # vi prometheus-prometheus.yaml (the option goes under spec)
      additionalScrapeConfigs:
        name: additional-scrape-configs
        key: prometheus-additional.yaml

    kubectl apply -f prometheus-prometheus.yaml

Import the file Application-GinExporter-1719367918282.json as a dashboard
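As a possible alternative to deleting and recreating the secret whenever prometheus-additional.yaml changes, the generated manifest can be piped straight into kubectl apply, which updates the secret in place. The helper name update_scrape_secret is hypothetical.

```shell
# Hypothetical helper: regenerate the secret from the current file and apply it,
# creating the secret if it is missing and updating it otherwise.
update_scrape_secret() {
  kubectl create secret generic additional-scrape-configs \
    --from-file=prometheus-additional.yaml -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Run `update_scrape_secret` from the directory that holds prometheus-additional.yaml.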
13. Change the prometheus operator data retention time
The prometheus operator keeps data for 1d by default; change it to 30d
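The original change is not shown. One way to set the retention, sketched here, is to patch the retention field of the Prometheus resource, which kube-prometheus names k8s. The helper name set_retention is hypothetical.

```shell
# Hypothetical helper: set spec.retention on the Prometheus resource "k8s".
# $1 = retention period, 30d if omitted.
set_retention() {
  kubectl -n monitoring patch prometheus k8s --type merge \
    -p "{\"spec\":{\"retention\":\"${1:-30d}\"}}"
}
```

To keep the setting across re-applies, the same `retention: 30d` line can be added under spec in prometheus-prometheus.yaml.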
14. Forward alerts to Feishu through a webhook
- Install the open-source project PrometheusAlert, which relays the messages

    kubectl apply -n monitoring -f Prometheus-Deployment.yaml

Below is the part of Prometheus-Deployment.yaml that needs to be changed; the rest can stay as it is

    fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/**************************************
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        field.cattle.io/publicEndpoints: >-
          [{"addresses":["192.168.1.181","192.168.1.182","192.168.1.183","192.168.1.184","192.168.1.185"],"port":443,"protocol":"HTTPS","serviceName":"monitoring:prometheus-k8s","ingressName":"monitoring:prometheus-k8s-ingress","hostname":"prometheus.ga.skyvault.cn","path":"/","allNodes":false}]
        kubernetes.io/ingress.class: nginx
      name: prometheus-alert-center
      namespace: monitoring
    spec:
      ingressClassName: nginx
      rules:
        - host: alert.ga.skyvault.cn
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: prometheus-alert-center
                    port:
                      number: 8080
      tls:
        - hosts:
            - alert.ga.skyvault.cn
          secretName: skyvault-cn-tls-certificate
    status:
      loadBalancer:
        ingress:
          - ip: 192.168.1.181
          - ip: 192.168.1.182
          - ip: 192.168.1.183
          - ip: 192.168.1.184
          - ip: 192.168.1.185
- Edit the message template

    {{ $var := .externalURL}}{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}**[Prometheus恢复信息]({{$v.generatorURL}}) ✅**
    *[{{$v.labels.alertname}}]({{$var}})*
    告警级别:{{$v.labels.severity}}
    开始时间:{{GetCSTtime $v.startsAt}}
    结束时间:{{GetCSTtime $v.endsAt}}
    **{{$v.annotations.description}}**{{else}}**[Prometheus告警信息]({{$v.generatorURL}}) 🔥**
    *[{{$v.labels.alertname}}]({{$var}})*
    告警级别:{{$v.labels.severity}}
    开始时间:{{GetCSTtime $v.startsAt}}
    **{{$v.annotations.description}}**
    [点击打开Grafana](https://grafana.ga.skyvault.cn:30443/dashboards)
    [点击打开Prometheus](https://prometheus.ga.skyvault.cn:30443/alerts)
    {{end}}{{ end }}