Recursive DNS is a critical piece of infrastructure that requires redundancy, reliability, and high performance. For that reason, most customers deploy many recursive DNS servers and distribute them throughout their networks. Since being physically close to customers is important for minimizing latency, many customers deploy “clusters” where several DNS server instances service a region of customers via a load-balancer, anycast, or other load distribution techniques.
We embrace this design and have provided several (some patented) features like “cache sharing”, “pinning”, and “predictive prefetch” that enhance performance and reliability in these situations while allowing customers to horizontally and inexpensively scale their deployments and clusters to meet whatever transaction rate is needed. Due to AnswerX’s advanced policy engine, these features are much more powerful than similarly touted features in other systems.
As of 2014, a single machine running AnswerX on modern hardware handles 50,000-200,000 peak tps with no adjustments or modifications. Just install and run. We generally recommend deploying enough hardware that each machine handles 10,000-30,000 tps around the clock (24x7), which leaves plenty of room for growth. However, there are situations where very high transaction rates (1,000,000+ tps) are needed. DoS scenarios, network failures where load from other locations must be absorbed, and other emergencies do occur. Although “clusters” of machines can handle these, we recognize that there are always situations where having more capacity is critical.
Akamai AnswerX (then Xerocole) has engineered its 2014.1 and later releases so that a single instance of the system can perform at 1,000,000 or more transactions per second, provided sufficient network bandwidth exists, without special, exotic, or expensive hardware.
Our testing and validation have centered around RedHat/CentOS 6. That said, these configurations and techniques can be used with other operating systems. At extremely high transaction rates, UDP packet inflow tends to get extremely bursty. To avoid dropping packets in the kernel, we need to increase the maximum allowed network receive buffer size. This can be done with the following adjustments:
sysctl -w net.core.rmem_max=50000000
and add this to /etc/sysctl.conf so the setting persists across reboots:
net.core.rmem_max = 50000000
This raises the maximum allowed kernel receive buffer size to 50 MB. We’ve found that values of 20-100 MB are sufficient as a maximum. This is an upper limit, not what every socket will consume. Raising it lets us specify larger values for ReceiveBufferSize in the next step.
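Before relying on a larger ReceiveBufferSize, it can help to confirm that the kernel limit actually took effect. The following is a minimal sketch and not part of AnswerX: check_rmem_max is a hypothetical helper name, and it reads the standard Linux procfs entry behind net.core.rmem_max. The optional second argument only exists so the function can be tried against a sample file instead of the live system.

```shell
# check_rmem_max WANTED_BYTES [FILE]
# Succeeds when the kernel's maximum receive buffer is at least WANTED_BYTES.
# FILE defaults to the procfs entry behind the net.core.rmem_max sysctl.
check_rmem_max() {
    wanted=$1
    file=${2:-/proc/sys/net/core/rmem_max}
    current=$(cat "$file") || return 2
    if [ "$current" -ge "$wanted" ]; then
        echo "ok: rmem_max=$current"
    else
        echo "too small: rmem_max=$current, need at least $wanted"
        return 1
    fi
}

# Example: check that a 10 MB per-port receive buffer would fit under the limit.
# check_rmem_max 10000000
```

After editing /etc/sysctl.conf, running sysctl -p reloads the file without a reboot.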
Due to UDP kernel processing overhead, we budget ~200,000 tps per “incoming port”. In addition, we increase the incoming “receive” buffer for each port. This value, which is set with the ReceiveBufferSize variable, might need to be larger depending on how much load you want to handle.
At 1,000,000 tps, we found that 6 ports with 10 MB of incoming buffer each are sufficient. For example, ports 5001, 5002, 5003, 5004, 5005, and 5006:
IncomingInterfaces: fromClients-5001, fromClients-5002, fromClients-5003, fromClients-5004, fromClients-5005, fromClients-5006
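The port count follows from the per-port budget by simple arithmetic. As a rough sketch (ports_needed is a hypothetical helper, and the 200,000 default is the budget quoted in the text, not a measured limit), the minimum number of ports for a target rate can be computed as:

```shell
# ports_needed TARGET_TPS [TPS_PER_PORT]
# Prints the smallest number of listening ports that covers TARGET_TPS,
# using the ~200,000 tps per-port budget from the text as the default.
ports_needed() {
    target=$1
    per_port=${2:-200000}
    # Integer ceiling division: a partially used port still counts as a whole port.
    echo $(( (target + per_port - 1) / per_port ))
}

# ports_needed 1000000    prints 5
```

By this estimate 1,000,000 tps needs at least five ports, so the six ports in the example leave one port of headroom for bursts.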
To tune ReceiveBufferSize, run the “ss” command while the server is under load and look at the “Recv-Q” column for each port. If that value stays pinned at the maximum you defined here, consider increasing ReceiveBufferSize.
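One way to watch a single port is to filter the ss output down to its Recv-Q value. This is only a sketch: recvq_for_port is a hypothetical helper, and it assumes the usual iproute2 column layout in which the local address:port field sits two columns after Recv-Q. For UDP sockets, Recv-Q is the number of bytes queued and not yet read by the application.

```shell
# recvq_for_port PORT
# Reads `ss -u -l -n` style output on stdin and prints the Recv-Q value
# of every listed UDP socket whose local port is PORT.
recvq_for_port() {
    awk -v port="$1" '
        # Header line: remember which column is labelled Recv-Q.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Recv-Q") col = i; next }
        # Data lines: the local address:port field sits two columns after Recv-Q.
        col && $(col + 2) ~ (":" port "$") { print $col }
    '
}

# Typical use while the server is under load:
# ss -u -l -n | recvq_for_port 5001
```

Sampling this every second or so during a load test shows whether the queue drains between bursts or stays near the configured ReceiveBufferSize.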
Since we are opening multiple ports, client load must be distributed from the published incoming port (port 53) to all the ports where the server is listening. This can be done in many ways. Common methods are off-machine load balancing (a load-balancer or anycast) and iptables rules on the machine itself.
For example, this “iptables” rule spreads incoming port 53 traffic across the 6 ports (5001-5006):
# Use “destination NAT” to spread incoming UDP from port 53 to
# ports 5001-5006 randomly
iptables -t nat -A PREROUTING -p udp -d 10.1.1.1 --dport 53 -j DNAT \
    --to-destination 10.1.1.1:5001-5006 --random
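For servers with a different address or port range, the same rule can be generated and reviewed before it is applied. This is a small sketch: dnat_rule is a hypothetical helper that only prints the command and changes nothing on the system.

```shell
# dnat_rule SERVER_IP FIRST_PORT LAST_PORT
# Prints (does not run) the DNAT rule that spreads UDP port 53 across the range.
dnat_rule() {
    ip=$1
    first=$2
    last=$3
    echo "iptables -t nat -A PREROUTING -p udp -d $ip --dport 53 -j DNAT --to-destination $ip:$first-$last --random"
}

# Review the printed command, then run it as root, for example:
# dnat_rule 10.1.1.1 5001 5006
```

Once a rule is in place, iptables -t nat -L PREROUTING -n -v lists it together with its packet counters, which is a quick way to confirm that port 53 traffic is hitting it.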
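If you are using a load-balancer or anycast, it is also advantageous to make sure that “conntrack” is NOT enabled in “iptables”. Turning it off improves the ability to process incoming packets at high rates. (The DNAT rule above depends on connection tracking, so this applies only when the load is distributed off-machine.)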
Starting in release 2014.2, the key MaxNetworkBuffering needs to be set to true inside the <AnswerX> object. This key allows AnswerX processing threads to extract the maximum number of packets from the network interface as quickly as the operating system allows. It also causes a substantial increase in memory consumption, so having plenty of memory is important.
With these changes, what remains is to have enough CPU power. AnswerX will allocate one core to servicing each incoming port and will use the remaining cores for processing. For 1,000,000+ tps, we recommend having 8+ cores so that policy, tables, and other processing can proceed without interfering with network packet processing.
Results and Comments
We have demonstrated that a machine with two quad-core Intel Xeons (8 physical cores, 16 logical cores with hyperthreading) can handle in excess of 1,000,000 tps while consuming ~50% of the available CPU with a fairly standard AnswerX policy. This leaves the remaining CPU power available for more complex AnswerX policy and data handling.
It took many client machines to produce a steady load of 1,000,000+ tps, so if you want to run simulations, be prepared to dedicate some lab resources. Xerocole will be happy to provide you with load generation tools that can produce real-world client query distributions.
Finally, as your transaction rates go beyond 1,000,000 tps, you will discover that query and response bandwidth begins to fill standard 1 gigabit ethernet links. To continue to scale, you will have to either move to 10 gigabit NICs or open ports across multiple 1 gigabit ethernet interfaces. Both are valid approaches and are accomplished easily using the <Interface> objects detailed above.
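A back-of-the-envelope calculation shows why 1 gigabit links become the limit around this rate. The sketch below uses a hypothetical helper, link_mbps, and the packet sizes in the comments are assumptions for illustration, not measurements. Real sizes vary with query names and record types.

```shell
# link_mbps TPS BYTES_PER_PACKET
# Prints the one-direction bandwidth in megabits per second (integer, truncated).
link_mbps() {
    echo $(( $1 * $2 * 8 / 1000000 ))
}

# Assumed sizes, not measurements: 80-byte queries, 120-byte responses on the wire.
# link_mbps 1000000 80     prints 640
# link_mbps 1000000 120    prints 960
```

Under those assumed sizes, responses alone come close to saturating the outbound side of a 1 gigabit link at 1,000,000 tps, which is consistent with the need for 10 gigabit NICs or multiple interfaces beyond that rate.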
Find out More
Given AnswerX's "devops" architecture, the best way to validate the '1,000,000 tps' data is by trying AnswerX. Contact our AnswerX Team and request a demonstration license.