[Unfinished] Developing a Prometheus exporter in Go



 2020/06/19 

Most of the examples out there are too simple, so here I write down in detail what I know.

Basic concepts and prerequisites

This uses go mod for development, so don't ask how to fetch packages.

Prometheus stores all data as time series, so let's first go over some basic Prometheus concepts.

Metric names and labels

Every time series is uniquely identified by its metric name and a set of key-value pairs (also called labels).

The format of a metric is:

<metric name>{<label name>=<label value>, ...} metrics_value

metrics_value can only be a float64, so anyone hoping to collect logs this way can forget it.

For example:

http_requests_total{host="192.10.0.1", method="POST", handler="/messages"} 278

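http_requests_total is the metric name;
host, method, and handler are three labels, that is, three dimensions;
the value is 278; going by the metric name, this means the endpoint has received 278 POST requests;
queries can filter and aggregate on these labels (dimensions).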
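To make the format concrete, here is a minimal sketch that renders one sample line from a name, a label set, and a float64 value. The function name formatSample is made up for this example and is not part of any Prometheus library; it also sorts labels by name only to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatSample renders one sample as
// <metric name>{<label name>="<label value>",...} <value>.
// It is a hypothetical helper for illustration only.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		// %q quotes the value; good enough for simple ASCII label values.
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}

	v := strconv.FormatFloat(value, 'g', -1, 64)
	if len(pairs) == 0 {
		return name + " " + v
	}
	return name + "{" + strings.Join(pairs, ",") + "} " + v
}

func main() {
	fmt.Println(formatSample("http_requests_total", map[string]string{
		"host":    "192.10.0.1",
		"method":  "POST",
		"handler": "/messages",
	}, 278))
}
```

Note that the value parameter is a float64, which is exactly why there is no room for log lines in a sample.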

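In Prometheus's monitoring architecture, the server sends GET requests to HTTP(S) endpoints that provide metrics, so the target process or exporter must expose its metrics on a web route (for example /metrics). For example, here are three metrics: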

# HELP harbor_exporter_collector_duration_seconds Collector time duration.
# TYPE harbor_exporter_collector_duration_seconds gauge
harbor_exporter_collector_duration_seconds{collector="logs"} 0.04826962
harbor_exporter_collector_duration_seconds{collector="projects"} 0.174844256
harbor_exporter_collector_duration_seconds{collector="reach"} 0.011827241
harbor_exporter_collector_duration_seconds{collector="statistics"} 0.056164916
harbor_exporter_collector_duration_seconds{collector="systeminfo"} 0.032053573
harbor_exporter_collector_duration_seconds{collector="systeminfoVolumes"} 0.030168302
# HELP harbor_exporter_last_scrape_error Whether the last scrape of metrics from harbor resulted in an error (1 for error, 0 for success).
# TYPE harbor_exporter_last_scrape_error gauge
harbor_exporter_last_scrape_error 0
# HELP harbor_exporter_scrapes_total Total number of times harbor was scraped for metrics.
# TYPE harbor_exporter_scrapes_total counter
harbor_exporter_scrapes_total 697

A single metric is laid out on the web page like this:

# HELP <metric name> <help message>
# TYPE <metric name> <metric type>
<metric name>{<label1>="<value1>",<label2>="<value2>"} <metric value1>
<metric name>{<label1>="<value3>",<label2>="<value4>"} <metric value2>
...

The Prometheus client library already wraps all of this, so we can just use it directly.
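To show roughly what the client library saves you from writing, here is a hand-rolled, standard-library-only sketch that renders the HELP line, the TYPE line, and the samples, and serves them from an HTTP handler. The names sample, renderFamily, and metricsHandler are invented for this example, and this is not how the client library is implemented internally.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sample is one line of a metric family: an optional, pre-formatted label
// block such as `collector="logs"`, plus a value.
type sample struct {
	labels string
	value  float64
}

// renderFamily writes the HELP line, the TYPE line, and one line per sample.
func renderFamily(name, help, typ string, samples []sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s %s\n", name, typ)
	for _, s := range samples {
		if s.labels == "" {
			fmt.Fprintf(&b, "%s %v\n", name, s.value)
		} else {
			fmt.Fprintf(&b, "%s{%s} %v\n", name, s.labels, s.value)
		}
	}
	return b.String()
}

// metricsHandler answers a scrape with one hard-coded counter.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderFamily(
		"harbor_exporter_scrapes_total",
		"Total number of times harbor was scraped for metrics.",
		"counter",
		[]sample{{value: 697}},
	))
}

func main() {
	// A real exporter would register the handler and block, e.g.
	//   http.HandleFunc("/metrics", metricsHandler)
	//   log.Fatal(http.ListenAndServe(":9107", nil)) // port chosen arbitrarily
	// Here the handler is called once in-process so the program terminates.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Even this toy version ignores escaping, concurrency, and label validation, which is a good argument for using the client library in real code.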

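Metric types

The Prometheus client libraries offer four core metric types. Note that this is on the client side: the Prometheus server does not distinguish between types and flattens all data into untyped time series.

1. Counter: a cumulative metric that only goes up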

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 380090.49
node_cpu_seconds_total{cpu="0",mode="iowait"} 114.2
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0.05
...

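A Counter is a cumulative metric: its value can only increase monotonically or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.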
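The "only goes up, or resets to zero on restart" rule is what lets a consumer compute growth from raw samples. Below is a simplified sketch of that idea: any drop between two consecutive values is treated as a restart. The function name increase and the sample values are made up, and the real PromQL functions do more than this (for example extrapolation over the time window).

```go
package main

import "fmt"

// increase returns the total growth of a counter across consecutive scraped
// values. A drop between two samples is treated as a restart: the counter is
// assumed to have started again from zero, so the whole new value counts as
// growth.
func increase(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur >= prev {
			total += cur - prev
		} else {
			total += cur // reset detected
		}
	}
	return total
}

func main() {
	// Hypothetical scrape values; the process restarted between 6 and 2.
	fmt.Println(increase([]float64{1, 3, 6, 2, 5}))
}
```

This is also why an exporter should never decrement a counter on purpose: a consumer would read the drop as a restart.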

2. Gauge: a measurement that can go up and down

# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/mapper/centos-home",fstype="xfs",mountpoint="/home"} 1.300291584e+10
node_filesystem_avail_bytes{device="/dev/mapper/centos-root",fstype="xfs",mountpoint="/"} 1.300291584e+10
node_filesystem_avail_bytes{device="rootfs",fstype="rootfs",mountpoint="/"} 1.300291584e+10

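Gauge is the simplest metric type: it holds a single value that can go up or down, or be set to a specific value. For example, whether something is down can be represented by setting the value to 1 or 0.

So a Gauge is typically used to reflect current state, such as the current temperature or current memory usage; it is a metric that can both increase and decrease.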
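As a rough illustration of those semantics, here is a minimal goroutine-safe gauge. The Gauge type below is a hypothetical stand-in written for this post, not the client library's type.

```go
package main

import (
	"fmt"
	"sync"
)

// Gauge holds a single float64 that can be set, raised, or lowered.
// It is a hypothetical, minimal stand-in for a client library gauge.
type Gauge struct {
	mu sync.Mutex
	v  float64
}

// Set overwrites the current value.
func (g *Gauge) Set(v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.v = v
}

// Add changes the value by d; d may be negative.
func (g *Gauge) Add(d float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.v += d
}

// Value returns the current value.
func (g *Gauge) Value() float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.v
}

func main() {
	var up Gauge
	up.Set(1) // target is reachable
	up.Set(0) // target went down
	fmt.Println(up.Value())

	var inFlight Gauge
	inFlight.Add(3)  // three requests started
	inFlight.Add(-1) // one finished
	fmt.Println(inFlight.Value())
}
```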

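3. Histogram: a histogram with built-in buckets for tracking distributions

A Histogram mainly counts how many observations fall into each configured range (bucket), rather than recording the values themselves.

For example, for the distribution of HTTP response times across 0-100ms, 100-200ms, 200-300ms, and >300ms, a Histogram automatically creates three kinds of series:

Total number of events, <basename>_count: for example, 2 HTTP requests have occurred so far
Sum of all observed values, <basename>_sum: for example, the total response time of those 2 HTTP requests is 150ms
Number of observations per bucket, <basename>_bucket{le="upper bound"}: for example, 1 request took 0-100ms and 1 took 100-200ms. The histogram is cumulative, so each le bucket also counts every observation from the smaller buckets.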

# HELP rest_client_request_latency_seconds Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_latency_seconds histogram
rest_client_request_latency_seconds_bucket{path="/",method="GET",code="200",le="0.1"} 1.0
rest_client_request_latency_seconds_bucket{path="/",method="GET",code="200",le="0.2"} 2.0
rest_client_request_latency_seconds_bucket{path="/",method="GET",code="200",le="0.3"} 2.0
rest_client_request_latency_seconds_bucket{path="/",method="GET",code="200",le="+Inf"} 2.0
rest_client_request_latency_seconds_sum{path="/",method="GET",code="200"} 0.150
rest_client_request_latency_seconds_count{path="/",method="GET",code="200"} 2.0
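The cumulative bucket counts in the output above can be reproduced with a small sketch. The Histogram type here is written for illustration only (it is not the client library's type and is not goroutine-safe), and the two latencies 0.04s and 0.11s are assumed values chosen to match the sum of 0.150.

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram counts observations per bucket and tracks their sum and count.
type Histogram struct {
	bounds []float64 // sorted upper bounds, excluding +Inf
	counts []uint64  // per-bucket counts; the last slot is the +Inf bucket
	sum    float64
	count  uint64
}

// NewHistogram creates a histogram with the given upper bounds.
func NewHistogram(bounds ...float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records v in the first bucket whose upper bound is >= v.
func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v) // len(bounds) means +Inf
	h.counts[i]++
	h.sum += v
	h.count++
}

// Cumulative returns one running total per finite bound, then +Inf last.
func (h *Histogram) Cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := NewHistogram(0.1, 0.2, 0.3)
	h.Observe(0.04) // assumed latency of the first request
	h.Observe(0.11) // assumed latency of the second request

	cum := h.Cumulative()
	for i, le := range h.bounds {
		fmt.Printf("bucket le=%v: %d\n", le, cum[i])
	}
	fmt.Printf("bucket le=+Inf: %d\n", cum[len(cum)-1])
	fmt.Printf("sum=%.3f count=%d\n", h.sum, h.count)
}
```

The +Inf bucket always equals the total count, which is why the le="+Inf" line and the _count line agree.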

4. Summary: statistics on the data distribution

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.4846e-05
go_gc_duration_seconds{quantile="0.25"} 1.8948e-05
go_gc_duration_seconds{quantile="0.5"} 3.9602e-05
go_gc_duration_seconds{quantile="0.75"} 5.8061e-05
go_gc_duration_seconds{quantile="1"} 9.6987e-05
go_gc_duration_seconds_sum 0.000772525
go_gc_duration_seconds_count 18

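Here quantile 0 and quantile 1 are the minimum and maximum values. For the others, for example go_gc_duration_seconds{quantile="0.75"} 5.8061e-05 means that 75% of the observed values are at or below 5.8061e-05.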
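To pin down what a quantile value means, here is a naive sketch that computes a nearest-rank quantile over a full slice of observations. This only illustrates the definition: the function name quantile and the input values are made up, and real summary implementations typically estimate quantiles over a stream rather than sorting everything.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the nearest-rank q-quantile (0 <= q <= 1) of values:
// the smallest observed value v such that at least a fraction q of the
// observations are <= v. It returns NaN for an empty input.
func quantile(q float64, values []float64) float64 {
	if len(values) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), values...) // do not reorder the caller's slice
	sort.Float64s(sorted)

	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1 // q == 0 maps to the minimum
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical observations.
	obs := []float64{8, 1, 5, 3, 7, 2, 6, 4}
	for _, q := range []float64{0, 0.5, 0.75, 1} {
		fmt.Printf("quantile=%v value=%v\n", q, quantile(q, obs))
	}
}
```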

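Summary and Histogram are similar: both can track how many events occurred or how large they were, as well as how those values are distributed.

If you need to aggregate, choose histograms.

If you have a fairly clear idea of the range and distribution of the values you will observe, choose histograms. If you need accurate quantiles, choose summary.

Jobs and instances

In Prometheus, an endpoint (IP:Port) that can be scraped is called an instance, and a collection of instances of the same type is called a job.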

- job: api-server
     - instance 1: 1.2.3.4:5670
     - instance 2: 1.2.3.4:5671
     - instance 3: 5.6.7.8:5670
     - instance 4: 5.6.7.8:5671

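When Prometheus scrapes metrics, it automatically attaches some labels that identify where the data came from:

job: the configured job name;
instance: the configured instance name, or the scraped IP:Port if no instance name is configured.

For every scrape of an instance, Prometheus stores the following series by default:

up{job="<job>", instance="<instance>"}: 1 if the instance is healthy, i.e. reachable, otherwise 0;
scrape_duration_seconds{job="<job>", instance="<instance>"}: how long the scrape took;
scrape_samples_post_metric_relabeling{job="<job>", instance="<instance>"}: the number of samples remaining after metric relabeling;
scrape_samples_scraped{job="<job>", instance="<instance>"}: the number of samples the instance exposed.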

The up metric is very useful for monitoring instance health.
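To get a feel for where up and scrape_duration_seconds come from, here is a rough sketch of a single scrape from the server's point of view. The scrape function, the 2-second timeout, and the demo target are all assumptions for this example, and the real Prometheus scrape loop does far more (parsing, relabeling, staleness handling).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// scrape issues one GET against a target and returns an up value (1 if the
// request returned 200 and the body could be read, otherwise 0) together
// with the elapsed time in seconds.
func scrape(url string) (up, seconds float64) {
	client := &http.Client{Timeout: 2 * time.Second} // timeout is an arbitrary choice
	start := time.Now()
	resp, err := client.Get(url)
	if err == nil {
		_, readErr := io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if readErr == nil && resp.StatusCode == http.StatusOK {
			up = 1
		}
	}
	return up, time.Since(start).Seconds()
}

func main() {
	// A throwaway in-process target standing in for an exporter.
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "demo_metric 1")
	}))

	up, d := scrape(target.URL + "/metrics")
	fmt.Printf("up=%v scrape_duration_seconds=%v\n", up, d)

	// After the target is shut down the same scrape should report up=0.
	target.Close()
	up, _ = scrape(target.URL + "/metrics")
	fmt.Printf("up=%v\n", up)
}
```

The point for exporter authors: up only tells you that the exporter itself answered, which is why exporters such as the harbor one above also expose their own last_scrape_error style metric for the system behind them.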