BIND 9

(Part 2 - Long-Term Statistics Monitoring and Log Analysis)

Carsten Strotmann and the ISC Team

Created: 2021-03-17 Wed 14:18

Welcome

Welcome to part two of our BIND 9 webinar series

In this Webinar

  • Identifying outliers in BIND 9 logfiles
  • BIND 9 monitoring from the named.stats file
  • BIND 9 monitoring with the statistics channel
  • Using open source tools to store and display metrics
  • Using open source tools to search and analyze logs
  • BIND 9 logs and remote syslog best practice
  • Best practices for metrics to monitor for authoritative and recursive

Goals of monitoring

  • finding outliers / anomalies -> potential security or performance problem
  • observe change in traffic patterns
  • observe change in load (CPU load, traffic load etc)
  • observe change in protocol use (new resource records, IPv4 vs. IPv6 usage, UDP vs. TCP usage, DoH/DoT usage)

Identifying outliers in BIND 9 logfiles

Outliers

  • outliers in log-files are entries that do not appear during normal operation
  • one approach to catch outliers is called Artificial Ignorance

Artificial Ignorance

  • the concept of AI (Artificial Ignorance)
    • there are two types of log messages
      • the ones the admin does not care about, which need no attention
      • the ones the admin does care about, which need attention

Artificial Ignorance - Messages that do not need attention

  • the log messages that the admin does not care about, and that need no attention, are noise
    • it might still be valid to collect these messages in the logs, for example for statistical analysis
    • so the AI system will filter them (suppress them)
    • what is left are, by definition, the messages that do need attention

Artificial Ignorance - Messages that do need attention

  • the messages that pass the filter fall into two categories
    • new messages that do not indicate a security or performance issue
      • a new filter (usually a regular expression) needs to be added to the software to hide this type of message in the next run
    • new messages that do indicate a potential security or performance issue
      • the admin needs to investigate the root cause of the messages and fix it
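
The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not a specific Artificial Ignorance tool: the ignore patterns and the sample log lines are invented for the example, and a real pattern list grows run by run as the admin classifies messages as noise.

```python
import re

# Hypothetical ignore patterns (assumption): each one hides a message
# type that the admin has already classified as noise.
IGNORE_PATTERNS = [
    re.compile(r"zone \S+: sending notifies"),
    re.compile(r"all zones loaded"),
]


def unexpected_lines(lines, patterns=IGNORE_PATTERNS):
    """Return the log lines that match none of the ignore patterns."""
    return [
        line for line in lines
        if not any(p.search(line) for p in patterns)
    ]


# Invented sample lines, only loosely modelled on BIND 9 log output.
sample_log = [
    "17-Mar-2021 14:00:01.000 general: info: all zones loaded",
    "17-Mar-2021 14:00:02.000 notify: info: zone example.org/IN: sending notifies (serial 2021031701)",
    "17-Mar-2021 14:05:10.000 security: error: client 192.0.2.7#53211: example message that was never seen before",
]

remaining = unexpected_lines(sample_log)
for line in remaining:
    print(line)
```

Only the third line passes the filter. It then either gets a new ignore pattern or triggers an investigation.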

Artificial Ignorance - operation

  • the AI software is run periodically (every 24 hours, every hour) and the results are sent via mail (chat etc) to the group of administrators
    • in the ideal case, the mail message will have no new log messages
      • the mail (chat etc) should be sent even if no new information is available
    • in case a log message appears, it must be dealt with before the next run of the software (internal SLA)

Artificial Ignorance - Software

Artificial Ignorance - additional information

BIND 9 monitoring from the "named.stats" file

BIND 9 "named.stats"

  • the command rndc stats will trigger a BIND 9 server to write a file with internal statistics
    • the statistics content is written to the file named.stats in the BIND 9 server's home (working) directory
    • the directory and the name of the file can be changed in the BIND 9 configuration file named.conf with the statistics-file directive

BIND 9 "named.stats" example (1)

bind9-named-stats01.png

BIND 9 "named.stats" example (2)

bind9-named-stats02.png

BIND 9 "named.stats" example (3)

bind9-named-stats03.png

BIND 9 "named.stats" example (4)

bind9-named-stats04.png

BIND 9 "named.stats" example (5)

bind9-named-stats05.png

BIND 9 "named.stats" example (6)

bind9-named-stats06.png

BIND 9 "named.stats" example (7)

bind9-named-stats07.png

Zone-Statistics

  • Zone statistics need to be enabled in the BIND 9 configuration file named.conf

    • the statement zone-statistics yes; inside the options block enables the zone statistics for all zones
         options {
            [...]
            zone-statistics yes;
         };
    

Zone-Statistics

  • It is also possible to enable the zone statistics only for selected zones
    • This is done with the statement zone-statistics yes; inside the zone block:
zone "example.org" in {
     type primary;
     file "primary/example.org";
     zone-statistics yes;
};

Monitoring with "named.stats"

Challenges with "named.stats"

  • BIND 9 always appends new statistics to the named.stats file, so the file keeps growing
    • the file should be purged from time to time, as monitoring plugins usually read the file from the beginning to find the latest information
  • the named.stats file contains human-readable data, which needs to be parsed by a tool
    • the contents of named.stats can change with new BIND 9 releases, and monitoring plugins might fail if their parser is not written robustly
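
A tolerant parser for the latest dump might look like the sketch below. It assumes the usual layout of named.stats: each dump opens with a "+++ Statistics Dump +++ (timestamp)" line, sections are introduced by "++ Name ++" lines, and counters appear as "value name" lines. The sample text is invented, and counters that repeat per view are simply summed here.

```python
import re

DUMP_START = re.compile(r"^\+\+\+ Statistics Dump \+\+\+ \((\d+)\)")
SECTION = re.compile(r"^\+\+ (.+?) \+\+$")
COUNTER = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_latest_dump(text):
    """Parse the last statistics dump found in named.stats content.

    Returns (timestamp, {section: {counter_name: value}}).
    Lines that match no known pattern are skipped, not treated as errors.
    """
    timestamp = None
    sections = {}
    current = None
    for line in text.splitlines():
        start = DUMP_START.match(line)
        if start:
            # A newer dump begins: discard everything collected so far.
            timestamp = int(start.group(1))
            sections = {}
            current = None
            continue
        section = SECTION.match(line)
        if section:
            current = sections.setdefault(section.group(1), {})
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            name = counter.group(2)
            # Same counter name in several views: add the values up.
            current[name] = current.get(name, 0) + int(counter.group(1))
    return timestamp, sections


# Invented sample content with two appended dumps.
sample = """\
+++ Statistics Dump +++ (1615986000)
++ Incoming Requests ++
                  10 QUERY
--- Statistics Dump --- (1615986000)
+++ Statistics Dump +++ (1615989600)
++ Incoming Requests ++
                  25 QUERY
++ Outgoing Rcodes ++
                  20 NOERROR
                   5 NXDOMAIN
--- Statistics Dump --- (1615989600)
"""

timestamp, stats = parse_latest_dump(sample)
print(timestamp, stats)
```

Skipping unknown lines instead of failing on them is what keeps such a parser working when a new BIND 9 release adds sections or counters.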

BIND 9 monitoring with the statistics channel

BIND 9 http statistics channel

  • The BIND 9 statistics can also be retrieved from a running BIND 9 server via the HTTP protocol
    • BIND 9 has a tiny built-in web server
    • It provides the statistics data in XML or JSON format

BIND 9 statistics channel vs. "named.stats"

  • The BIND 9 statistics channel has some benefits compared to the older named.stats statistics
    • The statistics can be read over the network
    • The statistics come as structured data (XML or JSON) that can be parsed by software (more robust monitoring)
    • The format of the statistics data is versioned
      • A change in the statistics format will not break existing tools

BIND 9 statistics channel dependencies

BIND 9 statistics channel configuration

  • The BIND 9 statistics channel is enabled in the BIND 9 configuration file named.conf
    • Zone statistics can be enabled with the same statements used for the named.stats statistics
    • It has its own configuration block
statistics-channels {
     inet 192.0.2.53 port 8053 allow { localhost; adminnets; };
     inet fd00::1053 port 8053 allow { fd00::/64; };
};
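
Reading the channel from a monitoring script could look like the following sketch. It assumes the JSON statistics are served under the /json/v1 path and that the document has a top-level "rcodes" object; both should be checked against the output of the BIND 9 version in use. The sample document is invented and heavily shortened.

```python
import json
import urllib.request


def fetch_stats(base_url, timeout=5):
    """Fetch and decode the JSON statistics from a statistics channel.

    base_url is for example "http://192.0.2.53:8053" (the address from
    the configuration above). The /json/v1 path is an assumption.
    """
    with urllib.request.urlopen(f"{base_url}/json/v1", timeout=timeout) as resp:
        return json.load(resp)


def rcode_counters(stats):
    """Return the response code counters, or an empty dict if absent."""
    return dict(stats.get("rcodes", {}))


# Invented, shortened sample so the helper can be tried without a server.
sample = json.loads(
    '{"json-stats-version": "1.5", "rcodes": {"NOERROR": 950, "SERVFAIL": 3}}'
)
print(sample.get("json-stats-version"), rcode_counters(sample))
```

Against a live server the call would be stats = fetch_stats("http://192.0.2.53:8053"), issued from a host that the allow list of the statistics channel permits.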

BIND 9 statistics channel

BIND9-statistics-channel.png

JSON Statistics

  • JSON (JavaScript Object Notation) is an open standard file format that uses human-readable text
    • JSON is usually faster to parse than XML
    • Some (many?) people find JSON easier to work with compared to XML

BIND 9 statistics channel

BIND9-JSON.png

Security recommendations for the statistics channel

  • The BIND 9 statistics channel should not be exposed to the open Internet without authentication
    • It reveals internal information that can be used to attack the DNS server
    • It increases the attack surface
  • Best practices
    • Bind the statistics channel only to internal management networks
    • Protect the BIND 9 statistics channel with a reverse web proxy (NGINX, Caddy, OpenBSD httpd etc) with basic authentication or TLS client certificate authentication

Additional information on the statistics channel

Using open source tools to store and display metrics

Prometheus

  • Prometheus is a popular monitoring solution
  • Prometheus is easy to deploy and scales from small to large networks

Prometheus architecture

  • small agent programs (called "exporters") collect data
    • exporters offer the data over http in a key/value format
    • easy to test the correct function of an agent with a web browser or an HTTP command line tool (such as curl)
    • it is easy to write custom exporters
    • exporter agents can collect data locally on the BIND 9 server (named.stats) or via the network (statistics channel)
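
The key/value output of an exporter is plain text, so the core of a custom exporter is small. The sketch below renders one counter family in the Prometheus text exposition format; the metric name bind_responses_total, the label name and the values are made up for the example, and serving the text over HTTP is left out.

```python
def render_metrics(name, help_text, samples, label="rcode"):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a label value (for example a response code) to a count.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_value, value in sorted(samples.items()):
        # Label values must escape backslash, double quote and newline.
        escaped = (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'{name}{{{label}="{escaped}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical metric name and invented values.
text = render_metrics(
    "bind_responses_total",
    "DNS responses sent, by response code.",
    {"NOERROR": 950, "NXDOMAIN": 42},
)
print(text, end="")
```

Because the format is this simple, the output of a running exporter can be checked by eye with curl before Prometheus ever scrapes it.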

Prometheus architecture

  • A central Prometheus server collects the data from all agents and stores it in a time series database
  • Data can be queried over a web interface
  • Visualization via Prometheus Expression Browser (simple) or Grafana (elaborate)

Prometheus architecture

Prometheus-Architecture01.png

Prometheus architecture

Prometheus-Architecture02.png

Prometheus architecture

Prometheus-Architecture03.png

Prometheus architecture

Prometheus-Architecture04.png

Prometheus exporter for DNS

Using open source tools to search and analyze logs

The ELK Stack

Logstash

  • Logstash collects log data from various sources and formats
  • Logstash can normalize and filter the data
  • After transformation, Logstash stores the data in a central database (usually into Elastic-Search)
    • other outputs are possible (Syslog, file, MongoDB, StatsD, Network-Monitoring …)

Elastic-Search

  • Elastic-Search is a distributed search and analysis engine
  • Elastic-Search can work with large amounts of data
  • Elastic-Search provides log-analysis, monitoring, anomaly-detection and SIEM capabilities (Security information and event management)

Kibana / Grafana

  • Kibana / Grafana visualize the data stored in Elastic-Search
  • Query the log-data
  • Interactive "drill down" into the dataset
  • Graphical trend analysis
  • Uptime monitoring

ELK Stack Architecture

ELK-Architecture01.png

ELK Stack Architecture

ELK-Architecture02.png

ELK Stack Architecture

ELK-Architecture03.png

ELK Stack Architecture

ELK-Architecture04.png

ELK Stack Architecture

ELK-Architecture05.png

Log-Stash Data Sources

Kibana Visualization - DNS Server Load

Kibana-DNS-Server-Load.png

Kibana Visualization - DNS Query Types

Kibana-Query-Types.png

Kibana Visualization - Malware RPZ Hits

Kibana-Malware-Hits.png

BIND 9 logs and remote syslog best practice

Central Log Server

  • A central log server helps with correlating log events and with central log analysis
  • Log data can be transferred via syslog (push) or Systemd-Journal (push or pull)
  • Use TLS transport security for sending log data over untrusted networks
  • Central server should store the data in a structured way
    • Database (SQL or noSQL)
    • for large amounts of log data, the central server might be a cluster of multiple machines

Plan your logging

  • Estimate the number of events per second
    • plan for the worst case (DDoS attack)
  • Estimate the size of log messages that need to be stored (~ 100-150 bytes per message)
  • Estimate the load
    • Can your network sustain the data rate?
    • Will this log collection have a performance impact (CPU, network, RAM) on the BIND 9 DNS server?
    • Can the central server process the data fast enough (normalization, structured data)?
    • Can the storage keep up with the data rate (Careful with central log servers on virtual machines)?
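
The estimate is simple arithmetic and can be scripted for different scenarios. In this sketch the message size of 150 bytes comes from the range given above, while the event rates (5,000 per second in normal operation, 100,000 per second under attack) are invented example numbers, not measurements.

```python
def log_volume(events_per_second, bytes_per_message=150):
    """Return (megabits per second, gigabytes per day) for a log stream.

    Uses decimal units (1 Mbit = 10**6 bit, 1 GB = 10**9 bytes) and
    ignores transport and storage overhead.
    """
    bytes_per_second = events_per_second * bytes_per_message
    mbit_per_second = bytes_per_second * 8 / 1_000_000
    gb_per_day = bytes_per_second * 86_400 / 1_000_000_000
    return mbit_per_second, gb_per_day


# Invented example rates: normal operation and a DDoS worst case.
normal = log_volume(5_000)
worst_case = log_volume(100_000)
print(normal, worst_case)
```

With these example numbers, normal operation needs about 6 Mbit/s and 64.8 GB per day, while the worst case needs about 120 Mbit/s and roughly 1.3 TB per day.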

Plan your logging

  • How long will a typical query into the data take (seconds, minutes, hours)?
  • How well does the central log analysis / database scale (over multiple CPUs, NUMA architectures, multiple machines)?
  • How will the log data be secured (GDPR)?
    • Encryption on storage
    • Encryption on transport
    • User authentication
    • Log-Source authentication
  • The log-server needs monitoring, too

Normalize log data before sending/storing

  • Unfortunately, most Syslog and BIND 9 log data is unstructured
    • Modern logging systems (rsyslog, systemd-journal) can convert the unstructured syslog data into structured data
    • Structured data is easier to filter and search
    • If possible, structure the data already at the source (to help with filtering, see next slide)
  • Send log data in the newer structured RFC 5424 format https://tools.ietf.org/html/rfc5424
  • Log normalization for different formats (mmnormalize) https://www.rsyslog.com/log-normalization-for-different-formats/
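
To show what the structured RFC 5424 format looks like on the wire, the following sketch builds one message by hand. The SD-ID dns@32473, the parameter names, the host name and the message text are made up for the example (32473 is the enterprise number reserved for documentation). In practice the syslog daemon produces this format, not a script.

```python
from datetime import datetime, timezone


def sd_escape(value):
    """Escape backslash, double quote and ']' in a structured-data value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def rfc5424_message(facility, severity, timestamp, hostname, app,
                    procid, msgid, sd_id, params, msg):
    """Build one RFC 5424 syslog line with a single structured-data element."""
    pri = facility * 8 + severity  # PRI value as defined by RFC 5424
    ts = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")
    ts = ts[:-3] + "Z"  # keep millisecond precision, mark as UTC
    sd = "".join(f' {key}="{sd_escape(val)}"' for key, val in params.items())
    return f"<{pri}>1 {ts} {hostname} {app} {procid} {msgid} [{sd_id}{sd}] {msg}"


# Hypothetical values: facility 3 (daemon), severity 6 (informational).
line = rfc5424_message(
    3, 6,
    datetime(2021, 3, 17, 14, 18, 0, tzinfo=timezone.utc),
    "ns1.example.org", "named", 1234, "query",
    "dns@32473",
    {"client": "192.0.2.7", "qname": "example.com", "qtype": "A"},
    "query received",
)
print(line)
```

The fields inside the square brackets are what a central server can filter and index without parsing the free-text message.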

Filter before sending

  • Some BIND 9 categories can be very "chatty"
    • during an attack (DDoS), the log data can overload a logging server (or the network, adding to the performance pain)
  • Try to filter irrelevant information from the logs at the source (see "Artificial Ignorance" from the beginning)
    • forward the filtered and aggregated information to a central server
    • You don't want 1 million lines of the same DNS error; you want to know that this error happened 1 million times in a time frame
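
Aggregation at the source can be as simple as counting messages after removing the parts that vary. The sketch below is an illustration only: the two normalization patterns and the sample lines are invented, and real BIND 9 messages need patterns matched to the categories being logged.

```python
import re
from collections import Counter

# Hypothetical normalization patterns (assumptions): a leading
# "date time " prefix and the client address after the word "client".
TIMESTAMP = re.compile(r"^\S+ \S+ ")
CLIENT = re.compile(r"client \S+")


def aggregate(lines):
    """Count log messages that are identical after normalization."""
    counts = Counter()
    for line in lines:
        key = CLIENT.sub("client <addr>", TIMESTAMP.sub("", line))
        counts[key] += 1
    return counts


# Invented sample lines.
sample_lines = [
    "17-Mar-2021 14:05:10.000 client 192.0.2.7#53211: query (cache) 'example.com/A/IN' denied",
    "17-Mar-2021 14:05:11.000 client 192.0.2.9#40000: query (cache) 'example.com/A/IN' denied",
    "17-Mar-2021 14:05:12.000 client 198.51.100.4#1234: zone transfer 'example.org/AXFR/IN' denied",
]

summary = aggregate(sample_lines)
for message, count in summary.most_common():
    print(count, message)
```

Only the summary (message plus count per time frame) is then forwarded to the central server.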

Local buffering

  • Some syslog server implementations support local buffering
    • They write the log data to local storage in case the network or the remote log server cannot keep up with the amount of data
    • Plan for enough local "buffer" storage space
    • Make sure the local "buffer" cannot fill the local storage (dedicated log buffer partition)
  • Reliable Forwarding of syslog Messages with Rsyslog

Log-Server security

  • DNS data can contain sensitive information
    • IP addresses
    • personalized domain names (using URLs with personalized labels on wildcard domain names)
  • If the log data passes untrusted networks (the Internet), encrypt the data and authenticate the log server with TLS (Encrypting Syslog Traffic with TLS)

Log-Server security

  • don't store large amounts of log data on DNS servers exposed to the Internet - forward the log data towards an internal, secured system
  • Restrict access to log information (authentication)
  • Keep access logs
  • Delete old log data (raw data), keep aggregated data and outliers

The human factor

  • You can condense and aggregate the log information …
    • … but in the end, it is humans who need to check and react to the log data

nobody can replace a good analyst with a perl script (Marcus J. Ranum)

Best practices for metrics to monitor for authoritative and recursive

Metrics for recursive DNS server (DNS resolver)

  • Memory consumption of the BIND 9 process (Cache Memory / Memory fragmentation)
  • CPU load (load per CPU core)
  • Network card utilization
  • Number of clients per time unit
  • Number of concurrent clients over UDP
  • Number of concurrent clients over TCP
  • Rate of incoming TCP queries vs. UDP queries (Clients to resolver)
  • Rate of outgoing TCP queries vs. UDP queries (Resolver to authoritative server)

Metrics for recursive DNS server (DNS resolver)

  • Number of outgoing SERVFAIL responses (indicator for DNSSEC validation issues or a server issue)
  • Latency of DNS answers from external authoritative servers (generic, and for a set of "well known" important domains like google.com, facebook.com etc)
  • Rate of FORMERR responses towards clients (indicator for network issues, failing CPE updates, malware infected clients)

Metrics for authoritative BIND 9 DNS Server

  • Number of queries per time unit (load)
  • Number of UDP and TCP queries
  • Size of DNS answers (-> EDNS0 / Fragmentation)
  • Percentage of truncated answers
  • NXDOMAIN answers per time unit (indicator for issues with the zone content or DDoS attacks -> random subdomain attack)
  • SERVFAIL answers per time unit (indicator for server mis-configuration or DNSSEC issues)
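
BIND 9 counters are cumulative, so "per time unit" values have to be derived from two samples. A minimal sketch, assuming the counters restart from zero when the server restarts; the function names and the numbers used are illustrative only.

```python
def rate_per_second(previous, current, interval_seconds):
    """Per-second rate between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = current - previous
    if delta < 0:
        # Counter went backwards: assume a restart reset it to zero.
        delta = current
    return delta / interval_seconds


def truncated_percentage(truncated, responses):
    """Share of truncated answers in percent (0.0 if nothing was sent)."""
    return 100.0 * truncated / responses if responses else 0.0


# Invented counter samples taken 60 seconds apart.
nxdomain_rate = rate_per_second(1000, 1600, 60)
after_restart = rate_per_second(5000, 120, 60)
print(nxdomain_rate, after_restart, truncated_percentage(5, 200))
```

Time series databases such as Prometheus offer this calculation as a built-in function, so hand-written code like this is mostly needed in custom plugins.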

Metrics for authoritative BIND 9 DNS Server

  • Network card utilization
  • CPU utilization (DNSSEC + NSEC3)
  • Zone transfers per time unit / zone transfer errors
  • Response-Rate Limiting per client IP
  • DNSSEC signing (and automated key rollover) events and errors
  • SOA serial numbers on primary/secondary zones, zone update latency
  • for dynamic zones: updates per time unit
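
When comparing SOA serial numbers between primary and secondary servers, a plain integer comparison gives wrong results once a serial wraps around. The sketch below uses the serial number arithmetic from RFC 1982; the server names and serials are invented, and fetching the serials (for example with a SOA query per server) is not shown.

```python
def serial_gt(a, b):
    """True if SOA serial a is newer than b under RFC 1982 (32 bit)."""
    half = 1 << 31
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    return (a < b and b - a > half) or (a > b and a - b < half)


def secondaries_behind(primary_serial, secondary_serials):
    """Return the names of secondaries whose serial is older than the primary's."""
    return sorted(
        name for name, serial in secondary_serials.items()
        if serial_gt(primary_serial, serial)
    )


# Invented serials for two hypothetical secondaries.
lagging = secondaries_behind(
    2021031702,
    {"ns2.example.org": 2021031702, "ns3.example.org": 2021031701},
)
print(lagging)
```

A secondary that stays in this list for longer than the expected zone update latency is worth an alert.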

Additional Information

Upcoming Webinars

  • April 21: Session 3. Load balancing with DNSdist
  • May 19: Session 4. Dynamic zones, pt1 - Basics
  • June 16: Session 5. Dynamic zones, pt2 - Advanced topics

Questions and Answers