categraf/inputs/dns_query

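Use cases

This plugin is typically used to monitor how DNS servers respond, which helps operations staff locate network problems quickly.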

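Deployment

The plugin does not need to run on every virtual machine. Enable it on a single virtual machine, either a dedicated one or one shared with other workloads.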

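Configuration

The configuration below enables or defines the following:
Checking domain resolution quality through the local DNS.
Checking domain resolution quality through external DNS servers.
Running DNS queries with different record types.
Setting a timeout of 5 seconds for every query.
Adding custom labels, which can be used to filter data and to route alerts more precisely.
Listing the domains to query in the domains field, typically the domains of the company's own business systems or of third-party systems they depend on.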

Edit the dns_query.toml configuration file

[root@aliyun input.dns_query]# cat dns_query.toml
# # collect interval
# interval = 15

### A record

[[instances]]
# # append some labels for series
labels = { cloud="huaweicloud", region="huabei-beijing-4", azone="az1", product="n9e" }

# # interval = global.interval * interval_times
# interval_times = 1

# # use this host's local DNS servers for queries
auto_detect_local_dns_server = true

## servers to query
servers = ["223.5.5.5","114.114.114.114","119.29.29.29"]

## Network is the network protocol name.
# network = "udp"

## Domains or subdomains to query.
domains = ["www.huaweicloud.com", "www.baidu.com", "www.tapd.cn"]

## Query record type.
## Possible values: A, AAAA, CNAME, MX, NS, PTR, TXT, SOA, SPF, SRV.
record_type = "A"

## Dns server port.
# port = 53

## Query timeout in seconds.
timeout = 5


### CNAME record

[[instances]]
# # append some labels for series
labels = { cloud="huaweicloud", region="huabei-beijing-4",azone="az1", product="n9e" }

# # interval = global.interval * interval_times
# interval_times = 1

# # do not use this host's local DNS servers; query only the servers listed below
auto_detect_local_dns_server = false

## servers to query
servers = ["223.5.5.5","114.114.114.114","119.29.29.29"]

## Network is the network protocol name.
# network = "udp"

## Domains or subdomains to query.
domains = ["www.huaweicloud.com", "www.baidu.com", "www.tapd.cn"]

## Query record type.
## Possible values: A, AAAA, CNAME, MX, NS, PTR, TXT, SOA, SPF, SRV.
record_type = "CNAME"

## Dns server port.
# port = 53

## Query timeout in seconds.
timeout = 5


### NS record

[[instances]]
# # append some labels for series
labels = { cloud="huaweicloud", region="huabei-beijing-4",azone="az1", product="n9e" }

# # interval = global.interval * interval_times
# interval_times = 1

# # do not use this host's local DNS servers; query only the servers listed below
auto_detect_local_dns_server = false

## servers to query
servers = ["223.5.5.5","114.114.114.114","119.29.29.29"]

## Network is the network protocol name.
# network = "udp"

## Domains or subdomains to query.
domains = ["www.huaweicloud.com", "www.baidu.com", "www.tapd.cn"]

## Query record type.
## Possible values: A, AAAA, CNAME, MX, NS, PTR, TXT, SOA, SPF, SRV.
record_type = "NS"

## Dns server port.
# port = 53

## Query timeout in seconds.
timeout = 5
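
To get a feel for how much data this configuration produces, the sketch below enumerates the series one collection cycle would emit. It assumes each (record type, server, domain) combination yields the three metrics seen in the test output, and it ignores any servers added by auto_detect_local_dns_server, since their number depends on the host. The variable names are illustrative only.

```python
from itertools import product

# Values copied from the three [[instances]] blocks above.
servers = ["223.5.5.5", "114.114.114.114", "119.29.29.29"]
domains = ["www.huaweicloud.com", "www.baidu.com", "www.tapd.cn"]
record_types = ["A", "CNAME", "NS"]

# Metric names as they appear in the `--test` output of this README.
metrics = [
    "dns_query_rcode_value",
    "dns_query_result_code",
    "dns_query_query_time_ms",
]

# One series per combination (assumption: locally detected servers excluded).
series = [
    (metric, record_type, server, domain)
    for record_type, server, domain, metric in product(
        record_types, servers, domains, metrics
    )
]

print(len(series))  # 81 = 3 record types * 3 servers * 3 domains * 3 metrics
```

Adding a domain or a server multiplies the count, which is worth keeping in mind when sizing the collection interval.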

Test the configuration

./categraf --test --inputs dns_query
....... (the A record output is analogous and omitted)
20:51:34 dns_query_rcode_value agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.tapd.cn product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 0
20:51:34 dns_query_result_code agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.tapd.cn product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 0
20:51:34 dns_query_query_time_ms agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.tapd.cn product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 33.500371

20:51:34 dns_query_rcode_value agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.baidu.com product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 0
20:51:34 dns_query_result_code agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.baidu.com product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 0
20:51:34 dns_query_query_time_ms agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.baidu.com product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 34.328242

20:51:34 dns_query_rcode_value agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.huaweicloud.com product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 0
20:51:34 dns_query_result_code agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.huaweicloud.com product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 0
20:51:34 dns_query_query_time_ms agent_hostname=aliyun.tjf.n9e.001 azone=az1 cloud=huaweicloud domain=www.huaweicloud.com product=n9e record_type=CNAME region=huabei-beijing-4 server=119.29.29.29
.....
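
Each line of the test output above consists of a timestamp, a metric name, a run of key=value labels, and a trailing value. As a minimal sketch for checking this output in a script, the hypothetical helper below (parse_test_line is not part of categraf) splits one line into those parts. It assumes labels never contain spaces, and it returns None for the value when a line is cut off.

```python
def parse_test_line(line: str) -> dict:
    """Split one `--test` output line into time, metric, labels and value."""
    parts = line.split()
    timestamp, metric = parts[0], parts[1]
    labels = {}
    value = None
    for token in parts[2:]:
        if "=" in token:
            # key=value label, e.g. server=119.29.29.29
            key, _, val = token.partition("=")
            labels[key] = val
        else:
            # the only token without "=" is the sample value
            value = float(token)
    return {"time": timestamp, "metric": metric, "labels": labels, "value": value}


sample = (
    "20:51:34 dns_query_query_time_ms agent_hostname=aliyun.tjf.n9e.001 "
    "azone=az1 cloud=huaweicloud domain=www.tapd.cn product=n9e "
    "record_type=CNAME region=huabei-beijing-4 server=119.29.29.29 33.500371"
)
parsed = parse_test_line(sample)
print(parsed["metric"], parsed["labels"]["server"], parsed["value"])
```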

Restart the service

Restart the categraf service so the change takes effect
systemctl daemon-reload && systemctl restart categraf && systemctl status categraf

Check the startup log for errors and warnings
journalctl -f -n 500 -u categraf | grep -E 'E!|W!'

Check the data

After 1 to 2 minutes the data appears in the charts, as in the figure: [image]

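Alert rule configuration

The following is personal experience, for reference only. For DNS resolution latency in general:
Above 2000 ms is P2: push the alert through the WeCom (企业微信) app, and send a recovery notification if it recovers within 3 minutes.
Above 5000 ms is P1: raise a phone voice alert plus a WeCom app alert, and send a recovery notification if it recovers within 3 minutes.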
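
The two thresholds can be written down as a small classification function. This is only a sketch of the rule described above, not a rule format used by any alerting system: the names classify_latency and CHANNELS are hypothetical, and the input is assumed to be a dns_query_query_time_ms sample in milliseconds.

```python
from typing import Optional

P2_THRESHOLD_MS = 2000  # above this: P2
P1_THRESHOLD_MS = 5000  # above this: P1

# Notification channels per severity, as described in the text (labels are illustrative).
CHANNELS = {
    "P2": ["wecom"],
    "P1": ["phone", "wecom"],
}


def classify_latency(query_time_ms: float) -> Optional[str]:
    """Return "P1", "P2", or None when the latency is within limits."""
    if query_time_ms > P1_THRESHOLD_MS:
        return "P1"
    if query_time_ms > P2_THRESHOLD_MS:
        return "P2"
    return None


for latency in (33.5, 2500.0, 6000.0):
    severity = classify_latency(latency)
    print(latency, severity, CHANNELS.get(severity, []))
```

The P1 check comes first so that a latency above 5000 ms is not reported as P2.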

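Why is it designed this way?
When DNS monitoring is in use, the company's business usually spans the whole country, and resolution in any region can fail for a variety of reasons, such as DNS hijacking or the failure of a regional DNS server. These alerts therefore deserve a high severity level.
The 3-minute window between the alert and its recovery notification is meant to filter out short-lived problems while still leaving enough handling time within the SLA (99.99%).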
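
To see how a 3-minute window relates to a 99.99% SLA, the downtime budget can be worked out directly. The measurement period is not stated in the text, so the 30-day and 365-day periods below are assumptions, and downtime_budget_minutes is a hypothetical helper.

```python
from fractions import Fraction

SLA = Fraction(9999, 10000)  # 99.99% availability


def downtime_budget_minutes(days: int) -> Fraction:
    """Minutes of downtime allowed over `days` days at the given SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - SLA)


monthly = downtime_budget_minutes(30)   # assumed 30-day month
yearly = downtime_budget_minutes(365)   # assumed 365-day year

print(float(monthly))  # 4.32
print(float(yearly))   # 52.56
```

Under the 30-day assumption the budget is about 4.32 minutes, so a single incident resolved within 3 minutes still fits inside it, with little room left for a second one in the same month.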

Dashboard configuration

Skipped for now.