Several new tuning options for recursive server behavior were tested in production environments by ISC customers using special feature preview builds of BIND. These features are intended to optimize
recursive server behavior in favor of good client queries, whilst at
the same time limiting the impact of bad client queries (e.g. queries which cannot be resolved, or which
take too long to resolve) on the local recursive server.
Based on successful field testing, we've now rolled the Recursive Client Rate Limiting feature into production Open Source versions of BIND, along with some feature updates based on feedback received. This article documents the revised and improved functionality that is being introduced in BIND 9.9.8, 9.10.3 and 9.11.0. Please refer to the article "Recursive Client Rate limiting in BIND 9.9 Subscription Version and BIND 9.9 and 9.10 Experimental versions" if you need to know how the early releases of this code behaved, or if you are running an older subscription version of BIND.
BIND 9.9 and 9.10 are already stable production versions of BIND. Per ISC's policy that significant feature changes should not be added to stable versions, Recursive Client Rate Limiting is therefore not available by default; it must be explicitly enabled when building BIND by using this new configure option:
Rate-limiting Fetches Per Server
The fetches-per-server option
sets a hard upper limit to the number of outstanding fetches allowed
for a single server. The lower limit is 2% of fetches-per-server, but
never below 1. It also allows you to select what to do with the queries that are being limited - either drop them, or send back a SERVFAIL response.
Based on a moving average of the timeout ratio for
each server, the server's individual quota will be periodically
adjusted up or down. The adjustments up and down are not linear;
instead they follow a curve that is initially aggressive but which has a
The default value for fetches-per-server is 0, which disables this
feature. When fetches-per-server is enabled, the default behaviour when
rate-limiting is active is to SERVFAIL queries that exceed the limit.
The fetch-quota-params option specifies four parameters that control how the per-server fetch limit is calculated.
fetches-per-server 200 fail;
fetch-quota-params 100 0.1 0.3 0.7;
The first number in fetch-quota-params
specifies how often, in number of queries to the server, to recalculate
its fetch quota. The default is to recalculate every 100 queries sent.
The second number specifies the threshold timeout ratio below which the
server will be considered to be "good" and will have its fetch quota
raised if it is below the maximum. The default is 0.1, or 10%.
The third number specifies the threshold timeout ratio above which the
server will be considered to be "bad" and will have its fetch quota
lowered if it is above the minimum. The default is 0.3, or 30%.
The fourth number specifies the weight given to the most recent counting
period when averaging it with the previously held timeout ratio. The
default is 0.7, or 70%.
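The interaction of the four fetch-quota-params values can be easier to follow as a small simulation. The following is only a sketch, not named's actual code: the class name ServerQuota, the record() method and in particular the step factor are hypothetical. The real adjustment follows a non-linear curve that this article does not specify, so a simple halve/double step stands in for it here.

```python
from dataclasses import dataclass


@dataclass
class ServerQuota:
    max_quota: int         # fetches-per-server (hard upper limit)
    interval: int = 100    # 1st parameter: recalculate every N queries sent
    low: float = 0.1       # 2nd parameter: below this ratio the server is "good"
    high: float = 0.3      # 3rd parameter: above this ratio the server is "bad"
    weight: float = 0.7    # 4th parameter: weight of the most recent period
    step: float = 0.5      # ASSUMPTION: illustrative step, not BIND's real curve

    def __post_init__(self):
        # Lower limit is 2% of fetches-per-server, but never below 1.
        self.min_quota = max(1, self.max_quota * 2 // 100)
        self.quota = self.max_quota
        self.ratio = 0.0   # moving average of the timeout ratio
        self.sent = 0
        self.timeouts = 0

    def record(self, timed_out: bool) -> None:
        """Count one query to this server; recalculate once per interval."""
        self.sent += 1
        self.timeouts += int(timed_out)
        if self.sent < self.interval:
            return
        latest = self.timeouts / self.sent
        # Blend the latest period with the previously held ratio.
        self.ratio = self.weight * latest + (1 - self.weight) * self.ratio
        self.sent = self.timeouts = 0
        if self.ratio > self.high and self.quota > self.min_quota:
            self.quota = max(self.min_quota, int(self.quota * self.step))
        elif self.ratio < self.low and self.quota < self.max_quota:
            self.quota = min(self.max_quota, int(self.quota / self.step))
```

With the defaults and fetches-per-server set to 200, one counting period of nothing but timeouts pushes the moving average to 0.7 and lowers the quota; it then takes two clean periods before the average falls back under 0.1 and the quota is raised again.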
By design, this per-server quota should
have little impact on lightly-used servers no matter how responsive (or
not) they are, whilst heavily-used servers will have enough traffic to
keep the moving average of their timeout ratio "fresh" even when they
are deeply penalized for not responding.
Rate-limiting Fetches Per Zone
BIND already had an option that limits how many identical client queries
(that cannot be answered directly from cache or authoritative zone data)
it will accept. When many clients simultaneously query for the same
name and type, the clients will all be attached to the same fetch, up to
the max-clients-per-query limit, and only one iterative query
will be sent. This doesn't help, however, in the situation where client
queries are for the same domain, but the hostname portion of the query
is unique for each.
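To illustrate why max-clients-per-query does not help with unique hostnames, here is a minimal sketch. The function plan_fetches and its return shape are hypothetical, and it assumes that clients beyond the limit are simply not accepted.

```python
def plan_fetches(queries, max_clients_per_query):
    """Return (number of iterative fetches started, clients not accepted).

    queries is a list of (name, type) tuples from clients. Clients asking
    for the same name and type share one fetch, up to the limit.
    """
    attached = {}   # (name, type) -> number of clients attached to that fetch
    rejected = 0
    for name, qtype in queries:
        key = (name.lower(), qtype)
        count = attached.get(key, 0)
        if count >= max_clients_per_query:
            rejected += 1            # limit reached for this name and type
        else:
            attached[key] = count + 1
    return len(attached), rejected
```

Fifty clients asking for the same name and type produce a single fetch, whereas fifty clients each asking for a different hostname under the same domain produce fifty fetches, none of which ever reaches the per-query limit.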
To help with this, we've introduced logic to rate-limit by zone
instead. This is configured using a new option fetches-per-zone
which defines the maximum number of simultaneous iterative queries to
any one domain that the server will permit before blocking new queries
for data in or beneath that zone. If fetches-per-zone is set to zero, then there is no limit on the number of fetches per zone and no queries will be dropped. Similar to fetches-per-server, fetches-per-zone also offers the choice of whether to drop queries or send back a SERVFAIL response when queries are being limited.
The default value for fetches-per-zone is 0, which disables this feature. When fetches-per-zone is enabled, the default behaviour when
rate-limiting is active is to drop queries that exceed the limit (this is not the same as the default for fetches-per-server).
When a fetch context is created to carry out an iterative query, it gets
initialized with the closest known zone cut, and named adds both a cap (the value of which is configured by fetches-per-zone) on the
number of fetches that are allowed to be querying for that same zone cut at a
time, and a counter for those that are currently outstanding (waiting for responses from authoritative servers).
The counters maintained on fetches per zone are reset when there are
no outstanding fetches for that zone. This is because the structure that was
holding them doesn't persist once the last fetch for that zone has completed. The periodic logging of the impact of fetches-per-zone on named's performance will therefore produce unreliable results for monitoring purposes - we recommend using the new counters added to BIND statistics instead.
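The per-zone cap and the counter reset described above can be sketched as follows. This is an illustration only; the class ZoneFetchLimiter and its method names are hypothetical and do not mirror named's internal structures.

```python
class ZoneFetchLimiter:
    def __init__(self, fetches_per_zone):
        self.limit = fetches_per_zone   # 0 means no limit
        self.zones = {}                 # zone cut -> counters

    def start_fetch(self, zone_cut):
        """Return True if a new fetch for this zone cut may start."""
        if self.limit == 0:
            return True
        counters = self.zones.setdefault(
            zone_cut, {"active": 0, "allowed": 0, "dropped": 0})
        if counters["active"] >= self.limit:
            counters["dropped"] += 1
            return False
        counters["active"] += 1
        counters["allowed"] += 1
        return True

    def finish_fetch(self, zone_cut):
        """Mark one outstanding fetch for this zone cut as completed."""
        counters = self.zones.get(zone_cut)
        if counters is None:
            return
        counters["active"] -= 1
        if counters["active"] == 0:
            # The structure is released with the last fetch, so the
            # allowed/dropped counts for this zone are lost at this point.
            del self.zones[zone_cut]
```

Because the entry disappears as soon as the zone has no outstanding fetches, anything that reads these counters periodically will see them restart from zero, which is why the global statistics counters are the better source for monitoring.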
Statistics and logging
The rndc recursing command now reports the list of current fetches, with statistics on how many
are active, how many have been allowed and how many have been dropped
due to exceeding the fetches-per-server and fetches-per-zone quotas.
You can also monitor the BIND statistics - two new counters have been added:
ZoneQuota counts the number of client queries that are dropped or sent SERVFAIL due to the fetches-per-zone limit being reached.
ServerQuota counts the number of client queries that are dropped or sent SERVFAIL due to the fetches-per-server limit being reached.
When applying Recursive Client Rate limiting, logging is emitted at intervals, but the logging of per-zone statistics may sporadically reset back to the original value (when the structure that was capturing the values is released). The logging is useful as an indication that Recursive Client Rate limiting is active during a time period, and to what extent client queries are being dropped, but BIND's statistics provide a much more accurate set of counters for graphing and statistics.
Recursive Client Contexts Soft Quota
Strictly speaking, this is not part of the new Recursive Client Rate Limiting functionality, but it was included with and tested at the same time as the other mitigation techniques we were developing, so it is noted here for completeness.
In the traditional
recursive clients context model, we have both a soft and a hard limit
to the number of recursive clients. When reached, the soft limit acts
by dropping a pending request for each new incoming request that it starts to process. When named
reaches the hard limit, it drops both a pending request, and the new
inbound client query. So ideally we want named to be managing its
backlog of recursive clients before reaching the hard limit - i.e., reaching a soft limit is the preferred mode of operation when under pressure.
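The difference between the two limits can be shown with a short sketch. The function admit is hypothetical, and it assumes that the pending request chosen for dropping is the oldest one, which this article does not state.

```python
from collections import deque


def admit(pending, new_query, soft_limit, hard_limit):
    """Apply the soft and hard recursive client limits to one new query.

    pending is a deque of requests already being processed.
    Returns True if new_query was accepted.
    """
    if len(pending) >= hard_limit:
        pending.popleft()       # hard limit: a pending request is dropped...
        return False            # ...and so is the new inbound query
    if len(pending) >= soft_limit:
        pending.popleft()       # soft limit: make room by dropping one pending
    pending.append(new_query)
    return True
```

At the soft limit the backlog stays the same size but the new query is still served; at the hard limit the new query is lost as well, which is the situation an operator wants to avoid.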
In versions of BIND earlier than 9.9.8 and 9.10.3, there was no soft quota at all when
recursive-clients <= 1000. For recursive-clients > 1000, the soft
quota defaulted to the hard quota minus 100.
This was a particular problem for DNS administrators who used the default recursive-clients (1000) because under high rates of client query traffic, it could happen that legitimate client queries that should be handled (and which would resolve quickly) were being rejected because there was only a hard quota.
Now in BIND 9.9.8 and 9.10.3 (and all newer releases going forward) when recursive-clients
<= 1000 the soft quota is 90% of recursive-clients. When
recursive-clients > 1000, the soft quota will be equal to the hard
quota minus either 100 or the number of worker threads, whichever is
greater.
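As a worked example of the new calculation, the sketch below computes the soft quota for a given recursive-clients value. The function name soft_quota is hypothetical, and it assumes that the larger of the two margins (100 or the number of worker threads) is the one subtracted from the hard quota.

```python
def soft_quota(recursive_clients, worker_threads):
    """Soft quota as described for BIND 9.9.8, 9.10.3 and later."""
    if recursive_clients <= 1000:
        # 90% of the hard quota
        return recursive_clients * 90 // 100
    # ASSUMPTION: the larger margin is the one applied
    return recursive_clients - max(100, worker_threads)
```

With the default recursive-clients of 1000 this gives a soft quota of 900, where earlier versions had none at all.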
SERVFAIL Cache
We did not back-port this feature to BIND 9.9 and 9.10 although it has been tested in some stable preview editions of BIND (not available for public download).
Although not strictly part of the Recursive Client Rate Limiting suite of features, the SERVFAIL cache, newly introduced in BIND 9.11.0 (after being trialed in special preview editions of BIND), may help mitigate server load where clients are repeatedly sending the same failing queries.
By default this is enabled at 1 second - equivalent to:
You can disable the servfail cache by setting the ttl to zero. The maximum is 30s but we do not recommend increasing the value beyond 1 or 2 seconds without operational testing as you may find that transient failures will be cached and persist for longer than you would like them to. Most clients will retry a failed query and for a transient problem it is better that the query is retried versus the server responding with a second SERVFAIL from cache.
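The trade-off between a short and a long SERVFAIL ttl can be seen in a minimal sketch of such a cache. The class ServfailCache is hypothetical and is not how named implements it; in particular, clamping an out-of-range ttl to 30 seconds is an assumption made for the illustration.

```python
import time


class ServfailCache:
    MAX_TTL = 30   # seconds; the maximum stated in this article

    def __init__(self, ttl=1, clock=time.monotonic):
        # ASSUMPTION: out-of-range values are clamped rather than rejected.
        self.ttl = min(max(ttl, 0), self.MAX_TTL)
        self.clock = clock
        self.expiry = {}   # (name, type) -> time at which the entry expires

    def remember(self, name, qtype):
        """Record that resolving this name and type just ended in SERVFAIL."""
        if self.ttl > 0:   # a ttl of zero disables the cache
            self.expiry[(name.lower(), qtype)] = self.clock() + self.ttl

    def is_cached(self, name, qtype):
        """Return True if SERVFAIL should be answered from cache."""
        key = (name.lower(), qtype)
        deadline = self.expiry.get(key)
        if deadline is None:
            return False
        if self.clock() >= deadline:
            del self.expiry[key]
            return False
        return True
```

A client that retries within the ttl gets the cached SERVFAIL without triggering a new resolution attempt; the longer the ttl, the longer a transient failure keeps being served after the underlying problem has cleared.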
Please refer to the Administrator Reference Manual for more details.
© 2001-2017 Internet Systems Consortium