Exporting PostgreSQL Logs as Prometheus Metrics with Vector
Preparing the logs for further use
# postgresql.conf
log_line_prefix = '%m %p %u@%d from %h [vxid:%v txid:%x] [%i] '
Here, %m is a timestamp including milliseconds, %p is the process ID, %u is the user name, and %d is the database name (%h, %v/%x, and %i stand for the client host, the virtual/regular transaction IDs, and the command tag, respectively). With this prefix, a log entry looks like this:
2022-05-12 07:33:54.285 UTC 2672031 @ from  [vxid: txid:0] [] LOG: checkpoint complete: wrote 64 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.266 s, sync=0.004 s, total=6.285 s; sync files=10, longest=0.003 s, average=0.001 s; distance=163840 kB, estimate=163844 kB
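A quick way to sanity-check such a prefix is to parse a sample line with a regular expression. The sketch below is ours, not part of the article's pipeline; the group names are arbitrary, and empty %u/%d/%h fields leave a bare "@" and a doubled space in the line.

```python
import re

# Hypothetical sketch: parse one line produced by the log_line_prefix above.
# Group names are our own choice, not anything PostgreSQL mandates.
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [A-Z]+) '
    r'(?P<pid>\d+) (?P<user_db>\S*@\S*) from (?P<host>\S*) '
    r'\[vxid:(?P<vxid>\S*) txid:(?P<txid>\S*)\] \[(?P<tag>[^\]]*)\] '
    r'(?P<level>\w+):\s+(?P<msg>.*)$'
)

line = ('2022-05-12 07:33:54.285 UTC 2672031 @ from  [vxid: txid:0] [] '
        'LOG: checkpoint complete: wrote 64 buffers (0.0%)')
m = LINE_RE.match(line)
print(m.group('pid'), m.group('level'))  # 2672031 LOG
```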
Searching for the best solution
One candidate was Promtail, the Grafana Loki agent, with pipeline stages like these:

pipeline_stages:
  - multiline:
      firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
  - regex:
      expression: '^(?P<time>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]) (?P<message>(?s:.*))$'
  - metrics:
      log_lines_total:
        type: Counter
        description: "total number of log lines"
        prefix: pg_custom_
        max_idle_duration: 24h
        config:
          match_all: true
          action: inc
      log_bytes_total:
        type: Counter
        description: "total bytes of log lines"
        prefix: pg_custom_
        max_idle_duration: 24h
        config:
          match_all: true
          count_entry_bytes: true
          action: add
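In plain terms, these two counter stages increment one metric per log entry and add each entry's byte length to the other. A toy model of that behavior (our sketch, not Promtail's implementation):

```python
# Toy model of the two Promtail counter stages above (not Promtail code).
counters = {"pg_custom_log_lines_total": 0, "pg_custom_log_bytes_total": 0}

def observe(entry: str) -> None:
    counters["pg_custom_log_lines_total"] += 1                    # action: inc
    counters["pg_custom_log_bytes_total"] += len(entry.encode())  # action: add, count_entry_bytes

for entry in ["first line", "second line"]:
    observe(entry)
print(counters)  # 2 lines, 21 bytes
```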
Vector: parsing logs and exporting them to Prometheus
# playbook-vector.yaml
---
- name: Setup vector
  hosts:
    - pg
  become: yes
  vars:
    arch: amd64
    version: 0.18.1
    vector_template: files/40-vector.toml
    vector_config_file: /etc/vector/vector.toml
  tasks:
    - name: Setup install vector
      become: yes
      apt:
        deb: "https://packages.timber.io/vector/{{ version }}/vector-{{ version }}-{{ arch }}.deb"
        install_recommends: yes
      notify:
        - restart vector
    - name: Copy config
      copy:
        src: "{{ vector_template }}"
        dest: "{{ vector_config_file }}"
        mode: 0644
        owner: vector
        group: vector
      notify: restart vector
    - name: Start Vector
      service:
        state: started
        enabled: yes
        name: vector
  handlers:
    - name: restart vector
      service:
        state: restarted
        daemon_reload: yes
        name: vector
# vector.toml
[sources.postgres_logs.multiline]
start_pattern = '^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}'
mode = "halt_before"
condition_pattern = '^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}'
timeout_ms = 1000
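A rough Python model of this multiline grouping (our sketch, not Vector's implementation): every line that matches the pattern starts a new event, and non-matching continuation lines are appended to the current one.

```python
import re

# Our sketch of "halt_before"-style grouping: a line matching the start
# pattern opens a new event; everything until the next match is appended.
START = re.compile(r'^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}')

def group_multiline(lines):
    events, current = [], []
    for line in lines:
        if START.match(line) and current:
            events.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        events.append('\n'.join(current))
    return events

raw = [
    '2022-05-12 07:33:54.285 UTC 2672031 @ from  [vxid: txid:0] [] LOG: checkpoint complete: write=6.266',
    's, sync=0.004 s, total=6.285 s',
    '2022-05-12 07:34:00.100 UTC 2672031 @ from  [vxid: txid:0] [] LOG: next message',
]
print(len(group_multiline(raw)))  # 2 -- the wrapped line was merged
```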
The halt_before mode means that Vector treats all lines that follow a line matching condition_pattern (and that do not themselves start with it) as a single message. Other multiline.mode values are available as well. For example, the halt_with mode includes in the message all consecutive lines up to and including the first line that matches condition_pattern.
# vector.toml
[transforms.postgres_remap]
type = "remap"
inputs = [ "postgres_logs" ]
source = """
. |= parse_regex!(.message, r'^(?P<timestamp>\\d{4}-[0-1]\\d-[0-3]\\d \\d+:\\d+:\\d+\\.\\d+ [A-Z]{3}) (?P<pid>\\d+) (?P<user>(\\[\\w+\\]@\\w+|@|\\w+@\\w+)) from (?P<host>(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|\\[\\w+\\]|\\s*)) (?P<txid>\\[\\w+:.+:\\d+\\]) (?P<query>(\\[\\]|\\[\\w.+\\])) (?P<level>.*[A-Z]): (?P<message>.*)$')

del(.timestamp)

message_parts, err = split(.message, ", ", limit: 2)
structured = parse_key_value(message_parts[1], key_value_delimiter: ":", field_delimiter: ",") ?? {}
message = message_parts[0]
. = merge(., structured)
del(."please try the setup again")
del(.message)
"""
This transform specifies the log source, sets a regular expression to parse the log message, deletes unnecessary fields, splits the message on the ", " delimiter, saves the result into a map, processes it, and yields JSON output so that we can go on manipulating its fields.
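The split/parse_key_value/merge steps can be mimicked in Python. This is a loose sketch under our own assumptions (the sample message and its ":"-separated fields are invented for illustration), not the VRL semantics verbatim:

```python
# Loose Python model of the remap steps: split the message once on ", ",
# parse the remainder as "key:value" pairs, and merge them into the event.
# The sample message below is invented for illustration.
def remap(message: str) -> dict:
    head, _, rest = message.partition(', ')   # split(.message, ", ", limit: 2)
    structured = {}
    for field in rest.split(','):             # parse_key_value(...)
        key, sep, value = field.partition(':')
        if sep:
            structured[key.strip()] = value.strip()
    structured['message'] = head              # . = merge(., structured)
    return structured

event = remap('connection received, host:10.10.10.1, port:45678')
print(event)  # {'host': '10.10.10.1', 'port': '45678', 'message': 'connection received'}
```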
# vector.toml
[transforms.postgres_filter]
type = "filter"
inputs = [ "postgres_remap" ]
condition = '.level == "ERROR" || .level == "FATAL"'
[transforms.postgres_metric]
type = "log_to_metric"
inputs = [ "postgres_filter" ]

[[transforms.postgres_metric.metrics]]
type = "counter"
field = "level"
name = "error_total"
namespace = "pg_log"

[transforms.postgres_metric.metrics.tags]
level = "{{level}}"
host = "{{host}}"
[sinks.postgres_export_metric]
type = "prometheus_exporter"
inputs = [ "postgres_metric" ]
address = "0.0.0.0:9598"
default_namespace = "pg_log"
根据检索到的指标设置警报
# prometheus.yml
scrape_configs:
- job_name: custom-pg-log-exporter
static_configs:
- targets: ['10.10.10.2:9598', '10.10.10.3:9598', '10.10.10.4:9598']
  - alert: PgErrorCountChangeWarning
    expr: |
      increase(pg_log_error_total{level="ERROR"}[30m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: The amount of errors in pg host {{$labels.host}} log has changed to {{$value}}
      description: |
        There are errors in the PostgreSQL logs on the {{$labels.host}} server.
  - alert: PgErrorCountChangeCritical
    expr: |
      increase(pg_log_error_total{level="FATAL"}[30m]) > 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: The amount of fatal errors in pg host {{$labels.host}} log has changed to {{$value}}
      description: |
        There are fatal errors in the PostgreSQL logs on the {{$labels.host}} server.
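As a toy illustration of what increase() computes for a counter over a window, consider just two samples taken 30 minutes apart (this deliberately ignores the extrapolation the real PromQL function performs):

```python
# Simplified model of PromQL increase(): the counter's growth over a window,
# treating a decrease as a counter reset (e.g. after a Vector restart).
def increase(first_sample: float, last_sample: float) -> float:
    if last_sample < first_sample:   # counter reset detected
        return last_sample           # growth since the reset
    return last_sample - first_sample

print(increase(10, 13))  # 3 new error lines in the window -> alert fires
print(increase(7, 7))    # 0 -> no alert
```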
Here, increase() computes the 30-minute delta of the pg_log_error_total time series. A value greater than zero means the counter has changed, which triggers an alert; the user can then check the PostgreSQL logs to find the cause of the problem.

Conclusion
The vector top CLI tool is helpful for debugging the log pipeline and viewing Vector's own metrics; it displays the information in a nice TUI.