BIND 9
(Part 2 - Long-Term Statistics Monitoring and Log Analysis)
Carsten Strotmann and the ISC Team
Created: 2021-03-17 Wed 14:18
Welcome
Welcome to part two of our BIND 9 webinar series
In this Webinar
- Identifying outliers in BIND 9 logfiles
- BIND 9 monitoring from the named.stats file
- BIND 9 monitoring with the statistics channel
- Using open source tools to store and display metrics
- Using open source tools to search and analyze logs
- BIND 9 logs and remote syslog best practice
- Best practices for metrics to monitor for authoritative and recursive servers
Goals of monitoring
- finding outliers / anomalies -> potential security or performance problems
- observing changes in traffic patterns
- observing changes in load (CPU load, traffic load, etc.)
- observing changes in protocol use (new resource records, IPv4 vs. IPv6 usage, UDP vs. TCP usage, DoH/DoT usage)
Identifying outliers in BIND 9 logfiles
Outliers
- outliers in log files are entries that do not appear during normal operation
- one approach to catch outliers is called Artificial Ignorance
Artificial Ignorance
- the concept of AI (Artificial Ignorance) distinguishes two types
  of log messages
- the ones the admin does not care about and that do not need attention
- the ones the admin does care about and that need attention
Artificial Ignorance - Messages that do not need attention
- the log messages that the admin does not care about and that do
  not need attention are noise
- it may still be useful to collect these messages in the logs,
  for example for statistical analysis
- so the AI system will filter them out (suppress them)
- what is left is, by definition, the set of messages that do need
  attention
Artificial Ignorance - Messages that do need attention
- the messages that pass the filter fall into two categories
- new messages that do not indicate a security or performance issue
- a new filter (usually a regular expression) needs to be added
  to the software to hide this type of message in the next run
- new messages that do indicate a potential security or
  performance issue
- the admin needs to investigate the root cause of these messages
  and fix the cause of the log messages
Artificial Ignorance - operation
- the AI software is run periodically (every hour, every 24 hours);
  the results are sent via mail (chat etc.) to the group of
  administrators
- in the ideal case, the mail message will contain no new log
  messages
- the mail (chat etc.) should be sent even if no new information
  is available
- if a new log message appears, it must be dealt with before the
  next run of the software (internal SLA)
Artificial Ignorance - Software
- there are several implementations of "Artificial Ignorance"
available
Artificial Ignorance - additional information
BIND 9 monitoring from the "named.stats" file
BIND 9 "named.stats"
- the command rndc stats will trigger a BIND 9 server to write a
  file with internal statistics
- the statistics content is written to the file named.stats in
  the BIND 9 server's home directory
- the directory and the name of the file can be changed in the
  BIND 9 configuration file named.conf with the statistics-file
  directive (see the example below)
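A minimal sketch of the named.conf setting (the path shown is an
assumption; any location writable by named works):
options {
    // example location, adjust to your environment
    statistics-file "/var/cache/bind/named.stats";
};
The dump is then triggered with the command rndc stats.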
BIND 9 "named.stats" example (1)
BIND 9 "named.stats" example (2)
BIND 9 "named.stats" example (3)
BIND 9 "named.stats" example (4)
BIND 9 "named.stats" example (5)
BIND 9 "named.stats" example (6)
BIND 9 "named.stats" example (7)
Zone-Statistics
- It is also possible to enable the zone statistics only for
  selected zones
- This is done with the statement zone-statistics yes; inside the
  zone block:
zone "example.org" in {
    type primary;
    file "primary/example.org";
    zone-statistics yes;
};
Monitoring with "named.stats"
- many popular monitoring tools offer modules to use the data in
  the named.stats file
Challenges with "named.stats"
- BIND 9 will always append new statistics to the named.stats
  file, so the file will grow indefinitely
- the file should be purged from time to time, as monitoring
  plugins usually read the file from the beginning to find the
  latest information (see the sketch below)
- the named.stats file contains human-readable data, which needs
  to be parsed by a tool
- the contents of named.stats can change with new BIND 9
  releases; monitoring plugins might fail if the parser is not
  robust
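One common workaround is to remove the file before triggering a new
dump, so it only ever contains the latest snapshot (a sketch; the
path is an assumption, match your statistics-file setting):
rm -f /var/cache/bind/named.stats   # drop the accumulated dumps
rndc stats                          # named recreates the file with one fresh dump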
BIND 9 monitoring with the statistics channel
BIND 9 http statistics channel
- The BIND 9 statistics can also be retrieved from a running BIND 9
  server via the HTTP protocol
- BIND 9 has a tiny built-in web server for this purpose
- It provides the statistics data in XML or JSON format
BIND 9 statistics channel vs. "named.stats"
- The BIND 9 statistics channel has some benefits compared to the
  older named.stats statistics
- The statistics can be read over the network
- The statistics come as structured data (XML or JSON) that is
  parseable by software (more robust monitoring)
- The format of the statistics data is versioned
- A change in the statistics format will not break existing
  tools
BIND 9 statistics channel dependencies
BIND 9 statistics channel configuration
- The BIND 9 statistics channel is enabled in the BIND 9
  configuration file named.conf
- Zone statistics can be enabled with the same statements used
  for the named.stats statistics
- It has its own configuration block:
statistics-channels {
    inet 192.0.2.53 port 8053 allow { localhost; adminnets; };
    inet fd00::1053 port 8053 allow { fd00::/64; };
};
BIND 9 statistics channel
JSON Statistics
- JSON (JavaScript Object Notation) is an open standard file format
that uses human-readable text
- JSON is faster to parse than XML
- Some (many?) people find JSON easier to work with compared to
XML
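A quick sketch of retrieving the statistics, using the address from
the configuration example above (/json/v1 and /xml/v3 are the
versioned endpoints of recent BIND 9 releases; jq is optional
pretty-printing):
curl -s http://192.0.2.53:8053/json/v1 | jq .   # JSON statistics
curl -s http://192.0.2.53:8053/xml/v3           # XML statistics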
BIND 9 statistics channel
Security recommendations for the statistics channel
- The BIND 9 statistics channel should not be exposed to the open
Internet without authentication
- It reveals internal information that can be used to attack the
  DNS server
- It increases the attack surface
- Best practices
- Bind the statistics channel only to internal management networks
- Protect the BIND 9 statistics channel with a reverse web proxy
(NGINX, Caddy, OpenBSD httpd etc) with basic authentication or
TLS client certificate authentication
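A minimal NGINX sketch of the reverse proxy approach with basic
authentication (hypothetical paths; the statistics channel is assumed
to listen on 127.0.0.1:8053):
location /bind-stats/ {
    auth_basic "BIND 9 statistics";
    auth_basic_user_file /etc/nginx/htpasswd;   # user database created with htpasswd
    proxy_pass http://127.0.0.1:8053/;
}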
Additional information on the statistics channel
Using open source tools to store and display metrics
Prometheus
- Prometheus is a popular monitoring solution
- Prometheus is easy to deploy and scales from small to large
networks
Prometheus architecture
- small agent programs (called "exporters") collect data
- exporters offer the data over http in a key/value format
- it is easy to test the correct function of an agent with a
  web browser or an http command line tool such as curl (see
  the example below)
- it is easy to write custom exporters
- exporter agents can collect data locally on the BIND 9 server
  (named.stats) or via the network (statistics channel)
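For example, assuming the community bind_exporter on its default port
9119 (an assumption; other exporters use other ports):
curl -s http://localhost:9119/metrics
The output is plain text, one metric per line in the form
metric_name{labels} value.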
Prometheus architecture
- The central Prometheus server collects the data from all agents and
  stores it in a time series database
- Data can be queried over a web interface
- Visualization via the Prometheus Expression Browser (simple) or
  Grafana (elaborate)
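A minimal scrape configuration sketch for prometheus.yml (the target
address assumes the exporter from the previous slide):
scrape_configs:
  - job_name: "bind"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.0.2.53:9119"]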
Prometheus exporter for DNS
Using open source tools to search and analyze logs
The ELK Stack
- ELK is a popular solution for centralized log management. ELK
  combines the open source tools Elasticsearch, Logstash, and Kibana
Logstash
- Logstash collects log data from various sources and formats
- Logstash can normalize and filter the data
- After transformation, Logstash stores the data in a central
  database (usually Elasticsearch)
- other outputs are possible, such as syslog, flat files, MongoDB,
  StatsD, or network monitoring systems
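A minimal Logstash pipeline sketch (the port and the grok pattern are
illustrative and must be adapted to the exact BIND 9 query-log format
in use):
input {
  syslog { port => 5514 }
}
filter {
  # pull the client address and query name/class/type out of a BIND query log line
  grok {
    match => { "message" => "client.* %{IP:client}#%{POSINT:src_port}.*query: %{DATA:qname} %{WORD:qclass} %{WORD:qtype}" }
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}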
Elasticsearch
- Elasticsearch is a distributed search and analysis engine
- Elasticsearch can work with large amounts of data
- Elasticsearch provides log analysis, monitoring,
  anomaly detection, and SIEM (Security Information and
  Event Management) capabilities
Kibana / Grafana
- Kibana / Grafana visualize the data stored in Elasticsearch
- Query the log-data
- Interactive "drill down" into the dataset
- Graphical trend analysis
- Uptime monitoring
Kibana Visualization - DNS Server Load
Kibana Visualization - DNS Query Types
Kibana Visualization - Malware RPZ Hits
BIND 9 logs and remote syslog best practice
Central Log Server
- A central log server helps with correlating log events and enables
  centralized log analysis
- Log data can be transferred via syslog (push) or systemd-journal (push
  or pull)
- Use TLS transport security for sending log data over untrusted
  networks (see the sketch after this list)
- The central server should store the data in a structured way
- Database (SQL or NoSQL)
- for large amounts of log data, the central server might be a
  cluster of multiple machines
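A client-side rsyslog sketch for TLS-protected forwarding (hostnames
and certificate paths are assumptions):
global(defaultNetstreamDriverCAFile="/etc/ssl/log-ca.pem")
action(type="omfwd" target="logs.example.net" port="6514" protocol="tcp"
       StreamDriver="gtls" StreamDriverMode="1"
       StreamDriverAuthMode="x509/name"
       StreamDriverPermittedPeers="logs.example.net")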
Plan your logging
- Estimate the number of events per second
- Plan for the worst case (a DDoS attack)
- Estimate the size of the log messages that need to be stored
  (~100-150 bytes per message; see the example below)
- Estimate the load
- Can your network sustain the data rate?
- Will collecting the logs have a performance impact (CPU, network,
  RAM) on the BIND 9 DNS server?
- Can the central server process the data fast enough
  (normalization, structured data)?
- Can the storage keep up with the data rate (be careful with central
  log servers on virtual machines)?
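A back-of-the-envelope example for the sizing questions above (raw
data, before indexing and replication overhead):
10,000 events/s x 150 bytes = 1.5 MB/s  (~12 Mbit/s on the wire)
1.5 MB/s x 86,400 s/day     ≈ 130 GB of raw log data per day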
Plan your logging
- How long will a typical query into the data take (seconds,
  minutes, hours)?
- How well does the central log analysis / database scale (over
  multiple CPUs, NUMA architectures, multiple machines)?
- How will the log data be secured (GDPR)?
- Encryption at rest
- Encryption in transit
- User authentication
- Log source authentication
- The log server needs monitoring, too
Normalize log data before sending/storing
- Unfortunately, most syslog and BIND 9 log data is unstructured
- Modern logging systems (rsyslog, systemd-journal) can convert
  the unstructured syslog data into structured data
- Structured data is easier to filter and search
- If possible, structure the data already at the source (to help
  with filtering, see next slide)
- Send log data in the newer structured RFC 5424 format (example
  below): https://tools.ietf.org/html/rfc5424
- Log normalization for different formats (mmnormalize):
  https://www.rsyslog.com/log-normalization-for-different-formats/
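An illustrative RFC 5424 message (PRI 30 = facility daemon, severity
info; the fields are version, timestamp, hostname, app-name, procid,
msgid, structured-data, and the free-text message):
<30>1 2021-03-17T14:18:00.000Z ns1.example.org named 1234 - - client 192.0.2.1#53214: query: www.example.org IN A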
Filter before sending
- Some BIND 9 logging categories can be very "chatty"
- during an attack (DDoS), the log data can overload the logging
  server (or the network, adding to the performance pain)
- Try to filter irrelevant information from the logs at the source
  (see "Artificial Ignorance" above, and the sketch below)
- forward the filtered and aggregated information to a central
  server
- You don't want 1 million lines of the same DNS error; you want
  to know that this error happened 1 million times in a time frame
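An rsyslog sketch of source-side filtering (the match string is
illustrative; drop only messages you have classified as noise):
if ($programname == "named") and ($msg contains "lame server") then stop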
Local buffering
- Some syslog server implementations support local buffering
- They write the log data to local storage in case the network or
the remote log server cannot keep up with the amount of data
- Plan for enough local "buffer" storage space
- Make sure the local "buffer" cannot fill the local storage
(dedicated log buffer partition)
- Reliable Forwarding of syslog Messages with Rsyslog (sketch below)
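An rsyslog sketch of a disk-assisted forwarding queue (target, file
name and sizes are assumptions):
# in-memory queue that spills to disk, capped at 1 GB, persisted on
# shutdown, retrying forever instead of dropping messages
action(type="omfwd" target="logs.example.net" port="6514" protocol="tcp"
       queue.type="LinkedList" queue.filename="fwd_buffer"
       queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")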
Log-Server security
- DNS data can contain sensitive information
- IP addresses
- personalized domain names (using URLs with personalized labels
on wildcard domain names)
- If the log data passes untrusted networks (the Internet), encrypt
  the data and authenticate the log server with TLS (see "Encrypting
  Syslog Traffic with TLS" in the rsyslog documentation)
Log-Server security
- don't store large amounts of log data on DNS servers exposed to
the Internet - forward the log data towards an internal, secured
system
- Restrict access to log information (authentication)
- Keep access logs
- Delete old log data (raw data), keep aggregated data and outliers
Log-Server security
- For security sensitive data, apply cryptographic signatures to
the log messages to be able to detect tampering
The human factor
- You can condense and aggregate the log information …
- … but in the end, it is humans who have to check and react
  to the log data
"Nobody can replace a good analyst with a perl script" (Marcus J. Ranum)
Best practices for metrics to monitor for authoritative and recursive servers
Metrics for recursive DNS server (DNS resolver)
- Memory consumption of the BIND 9 process (Cache Memory / Memory
fragmentation)
- CPU load (load per CPU core)
- Network card utilization
- Number of clients per time unit
- Number of concurrent clients over UDP
- Number of concurrent clients over TCP
- Rate of incoming TCP queries vs. UDP queries (Clients to
resolver)
- Rate of outgoing TCP queries vs. UDP queries (Resolver to
authoritative server)
Metrics for recursive DNS server (DNS resolver)
- Number of outgoing SERVFAIL responses (indicator for DNSSEC
validation issues or a server issue)
- Latency of DNS answers from outside authoritative servers
  (generic, and for a set of "well known" important domains like
  google.com, facebook.com etc.)
- Rate of FORMERR responses towards clients (indicator for network
issues, failing CPE updates, malware infected clients)
Metrics for authoritative BIND 9 DNS Server
- Number of queries per time unit (load)
- Number of UDP and TCP queries
- Size of DNS answers (-> EDNS0 / Fragmentation)
- Percentage of truncated answers
- NXDOMAIN answers per time unit (indicator for issues with the zone
content or DDoS attacks -> random subdomain attack)
- SERVFAIL answers per time unit (indicator for server
mis-configuration or DNSSEC issues)
Metrics for authoritative BIND 9 DNS Server
- Network card utilization
- CPU utilization (DNSSEC + NSEC3)
- Zone transfers per time unit / errors with zone transfers
- Response Rate Limiting hits per client IP
- DNSSEC signing (and automated key rollover) events and errors
- SOA serial numbers on primary/secondary zones, zone update latency
- for dynamic zones: updates per time unit
Upcoming Webinars
- April 21: Session 3. Load balancing with DNSdist
- May 19: Session 4. Dynamic zones, pt1 - Basics
- June 16: Session 5. Dynamic zones, pt2 - Advanced topics