Prometheus notes
Published: 2019-06-13


Introduction

  Prometheus is monitoring software, similar to Nagios.

 

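Installation

  1. Download the prometheus-2.2.0.linux-amd64 tarball from the official site, extract it, and run ./prometheus. The important part is the configuration file.

     a. To reload the configuration remotely without a restart, start Prometheus with the --web.enable-lifecycle flag, then trigger a reload with: curl -X POST http://localhost:9090/-/reload

     b. Get to know prometheus.yml well; Prometheus loads it at startup.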

    [root@vm-local1 prometheus-2.2.0.linux-amd64]# cat prometheus.yml
    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]  # Alertmanager is a separate download; it listens on port 9093 by default

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - rules/mengyuan.rules   # to send alerts you need rules; this is the rule file

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:            # which hosts to scrape
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'   (URL path scraped on each target)
        # scheme defaults to 'http'.
        static_configs:
          - targets: ['localhost:9090','localhost:9100']
            labels:
              group: 'zus'     # the hosts (clients) to scrape; I put both in one group, labelled group=zus
      - job_name: dj           # a second job
        static_configs:
          - targets: ['localhost:8000']   # my custom client, written in Django
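After editing prometheus.yml, the reload call from step 1a can also be made from a script. A minimal sketch using only the Python standard library, assuming Prometheus listens on localhost:9090 and was started with --web.enable-lifecycle; the function names are hypothetical:

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to re-read prometheus.yml.

    Assumes Prometheus was started with --web.enable-lifecycle; without that
    flag the /-/reload endpoint is not enabled.
    """
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for non-2xx responses.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling reload_prometheus() is equivalent to the curl command above.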

 

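 Note:

     localhost:9090 is Prometheus's own metrics endpoint (Prometheus scrapes itself), and port 9100 is node_exporter, the host-monitoring client (exporter) published by the Prometheus project.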
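To make the comments in prometheus.yml concrete, here is a rough sketch of how one static_configs entry becomes scrape URLs and series labels, using the defaults the file mentions: scheme http, metrics_path /metrics, a job label from job_name, and an instance label from the target address. The function is hypothetical, a simplified model rather than Prometheus's own code (relabeling is ignored):

```python
def expand_static_config(job_name, targets, labels=None,
                         scheme="http", metrics_path="/metrics"):
    """Return (scrape_url, series_labels) for each target of one static_config.

    Simplified model: no relabeling, no service discovery.
    """
    result = []
    for address in targets:
        # job comes from job_name, instance from the target address
        series_labels = {"job": job_name, "instance": address}
        # extra static labels (e.g. group: 'zus') are attached to every target
        series_labels.update(labels or {})
        result.append((f"{scheme}://{address}{metrics_path}", series_labels))
    return result
```

For the 'prometheus' job above this yields http://localhost:9090/metrics and http://localhost:9100/metrics, each labelled job="prometheus" and group="zus".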

2. Install a Prometheus client

  Download node_exporter-0.16.0-rc.1.linux-amd64 from the official site, extract it, and run ./node_exporter. It listens on port 9100 by default.
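To check that node_exporter is answering, you can fetch its /metrics page and parse the samples. A minimal sketch, assuming the exporter runs on localhost:9100; the parser is hypothetical and only handles the simple "name value" and "name{labels} value" lines (no timestamps):

```python
import urllib.request


def parse_samples(text):
    """Parse simple Prometheus text-format output into {series: float}.

    Skips blank lines and '#' lines (# HELP / # TYPE). Assumes no trailing
    timestamps, so the value is always the last whitespace-separated field.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)  # float() also accepts NaN and +Inf
    return samples


def fetch_metrics(url="http://localhost:9100/metrics", timeout=5):
    """Download the raw metrics page from an exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Usage: parse_samples(fetch_metrics()) returns a dict you can look up, e.g. by "node_load1".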

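3. Writing a custom client is simple: the endpoint only has to return data in this format, one "metric_name value" pair per line. I used Django, but any framework works as long as the output format is correct.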

  

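    from django.http import HttpResponse

    def metrics(req):
        # One "<metric_name> <value>" sample per line; the last line must also end with "\n".
        # Route this view at /metrics, the default metrics_path for the 'dj' job.
        ss = "feiji 32" + "\n" + "caidian 31" + "\n"
        return HttpResponse(ss, content_type="text/plain; version=0.0.4")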
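The bare "name value" lines above are enough, but the text format also allows # HELP and # TYPE lines that describe each metric. A small hypothetical helper that renders them (the official Python client library, prometheus_client, does this for you; this sketch only uses plain strings):

```python
def render_metrics(metrics):
    """Render {name: (help_text, metric_type, value)} as Prometheus text format.

    metric_type is e.g. "gauge" or "counter". Labels are not supported here.
    """
    lines = []
    for name, (help_text, metric_type, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"  # the format requires a trailing newline
```

In the Django view, the string it returns can be passed straight to HttpResponse in place of ss.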

 

4. Write the rule file rules/mengyuan.rules. Rules are the prerequisite for alerting: nothing is sent unless a rule fires.

 

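    [root@vm-local1 rules]# cat mengyuan.rules
    groups:
    - name: zus
      rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown   # alert name; any name works
        expr: up == 0         # up is 0 when the target cannot be scraped (e.g. the machine is down); when true the alert triggers, and the value is available as $value
        for: 5s               # if still true after 5s, the alert fires and is sent to Alertmanager (use 5m to match the "5 minutes" in the comment and description)
        labels:
          severity: page      # a label attached to this kind of alert
        annotations:
          summary: "Instance {{ $labels.instance }} down dangqian {{ $value }}"   # "dangqian" is pinyin for "current"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."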
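The "for" clause is easy to misread. A simplified, hypothetical model of how Prometheus applies it: the first evaluation where the expression is true makes the alert pending, and it becomes firing once it has stayed true for at least the "for" duration. Rules are evaluated every evaluation_interval (15s here), so "for: 5s" in practice fires on the second consecutive failing evaluation:

```python
def alert_states(samples, for_seconds):
    """Simulate alert state for a sequence of (eval_time_seconds, up_value).

    Returns one state per evaluation: "inactive", "pending" or "firing".
    Simplified: ignores resolved-notification timing and restarts.
    """
    states = []
    active_since = None
    for t, up in samples:
        if up == 0:  # the rule's expr `up == 0` is true
            if active_since is None:
                active_since = t  # alert becomes active (pending)
            held = t - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared; alert resets
            states.append("inactive")
    return states
```

With evaluations at 0s, 15s, 30s and 45s and the target down at 15s and 30s, the states are inactive, pending, firing, inactive.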

  

5. Now install Alertmanager

  a. Download alertmanager-0.15.0-rc.1.linux-amd64 from the official site.

    Again, the configuration file is what matters; create or edit it:

  

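    [root@vm-local1 alertmanager-0.15.0-rc.1.linux-amd64]# cat alertmanager.yml
    route:
      receiver: mengyuan2     # default receiver; one is required and it must match a "- name" under receivers
      group_wait: 1s          # wait 1s before the first notification for a new group
      group_interval: 1s      # wait 1s before notifying about new alerts added to an existing group
      repeat_interval: 1m     # resend a still-firing alert after 1 minute
      group_by: ["zus"]       # groups by a label named zus; no alert has one (the rule's label is group="zus"), so groupLabels is empty in the payload
      routes:                 # alerts labelled severity: page go to receiver mengyuan; everything else (or everything, if routes is omitted) goes to the default mengyuan2
      - receiver: mengyuan
        match:
          severity: page
    receivers:
    - name: 'mengyuan'
      webhook_configs:        # I use a webhook: Alertmanager POSTs the alert as JSON to http://127.0.0.1:8000
      - url: http://127.0.0.1:8000
        send_resolved: true   # also notify when the alert resolves
    - name: 'mengyuan2'
      webhook_configs:
      - url: http://127.0.0.1:8000/2
        send_resolved: true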
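The routing decision in this file can be sketched as follows: the first child route whose match labels all equal the alert's labels wins, otherwise the top-level receiver is used. This is a hypothetical simplification that ignores continue, nested routes, match_re and grouping:

```python
def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver name for an alert (simplified Alertmanager routing).

    routes: list of {"receiver": str, "match": {label: value}}, checked in order.
    """
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver


# Mirrors the routes section of alertmanager.yml above
ROUTES = [{"receiver": "mengyuan", "match": {"severity": "page"}}]
```

So InstanceDown, which carries severity="page", goes to mengyuan (http://127.0.0.1:8000), and any other alert goes to mengyuan2.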

 

6. Receiving alert notifications in Django

  In Django, request.body receives JSON data, roughly like this:

    {
      "receiver": "mengyuan",
      "status": "resolved",
      "alerts": [
        {
          "status": "resolved",
          "labels": {
            "alertname": "InstanceDown",
            "group": "zus",
            "instance": "localhost:9100",
            "job": "prometheus",
            "severity": "page"
          },
          "annotations": {
            "description": "localhost:9100 of job prometheus has been down for more than 5 minutes.",
            "summary": "Instance localhost:9100 down dangqian  0"
          },
          "startsAt": "2018-04-06T22:34:13.51281763+08:00",
          "endsAt": "2018-04-06T23:07:43.514552824+08:00",
          "generatorURL": "http://vm-local1:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1"
        }
      ],
      "groupLabels": {},
      "commonLabels": {
        "alertname": "InstanceDown",
        "group": "zus",
        "instance": "localhost:9100",
        "job": "prometheus",
        "severity": "page"
      },
      "commonAnnotations": {
        "description": "localhost:9100 of job prometheus has been down for more than 5 minutes.",
        "summary": "Instance localhost:9100 down dangqian  0"
      },
      "externalURL": "http://vm-local1:9093",
      "version": "4",
      "groupKey": "{}/{severity=\"page\"}:{}"
    }
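A minimal sketch of handling this payload: a hypothetical function that turns the JSON body into one line per alert, with the Django wiring shown only as comments (the view name alert_webhook is an assumption, not from the original post). csrf_exempt is needed because Alertmanager POSTs without a CSRF token:

```python
import json


def summarize_webhook(body):
    """Turn an Alertmanager webhook body (bytes or str) into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{alert.get('status')}] {labels.get('alertname')} "
                     f"{labels.get('instance')}: {summary}")
    return lines


# Hypothetical Django view wired to http://127.0.0.1:8000:
# from django.http import HttpResponse
# from django.views.decorators.csrf import csrf_exempt
#
# @csrf_exempt
# def alert_webhook(request):
#     for line in summarize_webhook(request.body):
#         print(line)
#     return HttpResponse("ok")
```

For the payload above this produces "[resolved] InstanceDown localhost:9100: Instance localhost:9100 down dangqian  0".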

 At this point I can use the received data to call an email API or any other third-party alerting service.
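For the email case, a sketch using the standard library's email and smtplib modules. The addresses, SMTP host and port are hypothetical placeholders, and the sending part is left as comments because it needs a reachable mail server:

```python
from email.message import EmailMessage


def build_alert_email(lines, sender, recipient):
    """Build a plain-text email whose body is one alert summary per line."""
    msg = EmailMessage()
    msg["Subject"] = f"[Prometheus] {len(lines)} alert(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


# Sending, assuming a hypothetical SMTP server at smtp.example.com:25
# import smtplib
# with smtplib.SMTP("smtp.example.com", 25) as smtp:
#     smtp.send_message(build_alert_email(lines, "alert@example.com", "ops@example.com"))
```

The lines argument could be the output of the webhook summary step in the Django view.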

 

Summary:

   I am just getting started with Prometheus myself; these are my notes.

 

Reposted from: https://www.cnblogs.com/whf191/p/8729460.html
