What to do with a misbehaving BIND server
Author: Cathy Almond
Reference Number: AA-00341
Created: 2011-05-17 13:27
Last Updated: 2015-02-21 07:13


Sometimes a named process appears to behave abnormally: for example, it uses more (or less) CPU or memory than usual, emits unexpected error messages, doesn't respond to queries, or responds negatively or late. It's tempting just to restart named, or to try a reload/reconfig/flush to see if that helps. If it does help, that is good for the production environment at the time, but the opportunity to collect useful troubleshooting information is lost.

Here are some things that we recommend you do, as many as possible, before attempting to clear the problem. Then report the results and submit the collected data along with a full report of the problem that was encountered and its symptoms.

This checklist assumes that you've already established in what way named is not working by using dig to confirm subjective or other reports of failure.
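
As a rough illustration of that first dig check, the sketch below reduces one query to a single word: the status from dig's reply header, or NORESPONSE if no header came back. The function name query_status, the DIG override variable, and the server address and query name in the usage note are all placeholders for illustration and are not part of the original article.

```shell
#!/bin/sh
# Sketch only: classify one dig query against one server.
# DIG can be overridden (an assumption made so the function is testable).
DIG="${DIG:-dig}"

query_status() {
    # $1 = server address, $2 = name to look up
    # +tries=1 +time=3 stops an unresponsive server from stalling the check.
    out=$("$DIG" "@$1" "$2" A +tries=1 +time=3 2>&1) || true

    # dig prints the response code in its header line, e.g. "status: SERVFAIL".
    status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1)

    if [ -n "$status" ]; then
        printf '%s\n' "$status"
    else
        # No header at all: timeout, connection refused, or similar.
        printf 'NORESPONSE\n'
    fi
}
```

For example, `query_status 127.0.0.1 www.example.com` run on the nameserver itself would print a status such as NOERROR or SERVFAIL. Recording the result alongside the time of the test makes the later report easier to correlate with logs.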

  1. Run pstack (or a similar OS-specific tool) against the process 3 or 4 times. This output gives us several snapshots of what named is doing at each instant; by comparing them we can see whether threads are moving or are stuck, e.g. on a lock. We also get clean stack traces of each thread from the run-time environment, without any possibility of mismatched executables/core.

  2. Obtain a snapshot of the current named status (if named is still consuming CPU, it might be useful to repeat this several times along with step 1):
    rndc status
  3. Generate a list of the client queries that named is currently handling (the default filename is named.recursing). It may be useful to repeat this several times if named is still running and consuming CPU, especially if the reported problems relate to recursive resolution:
    rndc recursing
  4. Get a snapshot of the current state of named's cache (the default filename is named_dump.db). It may be useful to repeat this several times if named is still running and consuming CPU, especially if the reported problems relate to recursive resolution:

    rndc dumpdb -all
  5. Toggle query logging on for a few minutes (if it's not already enabled):

    rndc querylog
  6. Temporarily increase the level of server logging for a few minutes (this relies on the logging channels being defined such that this level of logging can be output - it may be necessary to review the logging configuration in named.conf if changing the debug level via rndc does not produce additional logging output anywhere):

    rndc trace 3
  7. Take a snapshot packet trace (wireshark or similar) of both inbound and outbound traffic on the nameserver.  Make sure you trace on all the interfaces on the nameserver host.

  8. If the problem is that a recursive server does not appear to be able to resolve queries that involve recursion, then it is worth running some tests to see if the problem is external to named, perhaps in the network environment.  On the machine that runs the instance of named that you are troubleshooting, try using dig +trace to verify connectivity. For example:

    dig +trace www.facebook.com


    Don't use the dig +trace option from your clients for troubleshooting specific server behaviour problems.

    For more information on the +trace option, read:  Why is the outcome different from dig when using the +trace option?

    Depending on the results of this, you can issue direct queries (emulating named's communication with authoritative servers). For example:

    dig @204.74.67.132 +norec +dnssec +multi www.facebook.com
  9. Check OS resource use and whether any limits appear to have been reached (memory use, number of open sockets per process, network statistics etc.).
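
The repeatable parts of the checklist (steps 1 to 4) can be sketched as a small loop. This is an illustration under assumptions, not an ISC-provided tool: the function name collect_snapshots, the RNDC and PSTACK override variables, the output file names, and the default of 3 rounds with 5 seconds between them are all invented here. Query logging (step 5) and the trace level (step 6) are deliberately left out of the loop, because rndc querylog is a toggle and running it repeatedly would switch logging back off.

```shell
#!/bin/sh
# Sketch only: repeat steps 1-4 a few times and keep the output together.
# RNDC and PSTACK are overridable; substitute your OS-specific stack tool.
RNDC="${RNDC:-rndc}"
PSTACK="${PSTACK:-pstack}"

collect_snapshots() {
    # $1 = PID of named, $2 = output directory,
    # $3 = number of rounds (default 3), $4 = seconds between rounds (default 5)
    pid=$1
    outdir=$2
    rounds=${3:-3}
    pause=${4:-5}

    mkdir -p "$outdir" || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        # Step 1: per-thread stack traces, one file per round for comparison.
        "$PSTACK" "$pid" > "$outdir/pstack.$i.txt" 2>&1 || true

        # Step 2: server status as reported by named.
        "$RNDC" status > "$outdir/status.$i.txt" 2>&1 || true

        # Steps 3 and 4: named writes named.recursing and named_dump.db
        # itself; only rndc's own messages are captured here.
        "$RNDC" recursing >> "$outdir/rndc.log" 2>&1 || true
        "$RNDC" dumpdb -all >> "$outdir/rndc.log" 2>&1 || true

        if [ "$i" -lt "$rounds" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
}
```

A call such as `collect_snapshots 1234 /var/tmp/named-debug` (PID and path are examples only) leaves numbered pstack and status files that can be compared round by round. The files that named writes itself still need to be collected separately from wherever your configuration places them.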

Once you've done all or some of the above, the pressing need to restart the server will probably mean that there is little else you can do.

Please try to capture a core dump, however (gcore or kill -6 should provide one), rather than using rndc to halt the server. Then follow the checklist of files to submit with a core dump, and include the data that was generated prior to stopping named.
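
A minimal sketch of the gcore route follows. It assumes the gcore script shipped with GDB, which accepts `-o prefix` and writes the core to prefix.PID; other platforms ship a gcore with different options, so check the local manual page first. The function name capture_core, the GCORE override variable, and the named.core prefix are hypothetical.

```shell
#!/bin/sh
# Sketch only: take a core of a running named using GDB's gcore.
# GCORE is overridable (an assumption made so the function is testable).
GCORE="${GCORE:-gcore}"

capture_core() {
    # $1 = PID of named, $2 = directory with enough free space for the core
    pid=$1
    dir=$2

    mkdir -p "$dir" || return 1

    # GDB's gcore writes the core file to <prefix>.<pid>.
    if "$GCORE" -o "$dir/named.core" "$pid"; then
        printf '%s\n' "$dir/named.core.$pid"
    else
        # Fall back to advice only; sending the signal is left to the operator.
        echo "gcore failed; 'kill -6 $pid' makes named abort and dump core, subject to the OS core settings" >&2
        return 1
    fi
}
```

With GDB's gcore the process keeps running after the dump is taken, whereas kill -6 (SIGABRT) terminates named, so the fallback is only suggested in a message rather than executed.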


© 2001-2016 Internet Systems Consortium
