Tuesday, October 28, 2014

DNS Resolvers Considered Harmful

Authors: Kyle Schomp (CWRU), Mark Allman (ICSI), Michael Rabinovich (CWRU)
Presenter: Kyle Schomp

DNS resolvers abstract away the complexity of name resolution and improve the scalability of the DNS system, but are vulnerable to attacks such as cache injection and DNS amplification. Such attacks continue to persist, and are difficult to prevent. In addition, resolvers have been caught intentionally providing spurious responses (for censorship and for serving ads, for example). Yet, we continue to trust them. Further, resolvers also obscure client location, which can cause CDNs to pick the wrong CDN node for a client to connect to, thus increasing latency. These are well-known issues, and the community is developing solutions (for example, getting rid of open resolvers, DNSSEC, etc.), but these solutions aren't really working (there are still millions of open resolvers, and DNSSEC is 10 years old, but has low adoption).

Kyle proposes that perhaps we should try something different, i.e., getting rid of shared resolvers entirely! Today, the client sends name resolution requests to the DNS infrastructure, which sees a request, processes it in some opaque way, and sends a response back. This work suggests getting rid of the middlemen, and having clients communicate directly with authoritative DNS servers.

The advantages of this approach are:

a. reduced complexity of the name resolution system
b. removal of the target of many attacks, i.e., a reduction of the attack surface
c. better edge-server selection and load balancing for CDNs

However, there is also the potential loss of scalability of the DNS system, anonymity (in that the auth servers would then know what requests a client makes, instead of say, just the local resolver knowing), and the performance benefits of caching. The lack of anonymity may be somewhat fundamental to better, more transparent decision making, Kyle says. The work focuses on the performance and scalability implications.

They make 4 months of observations of DNS traffic from a 100 residences, and compare the increase in name resolution time in these traces to that in simulated client resolutions absent any resolvers. Results do show that name resolution without the shared resolvers does take a bit longer on the whole, but 45% of the time, it doesn't take longer, and 85% of the time the increase in latency is less than 50ms. Further, DNS responses are not used immediately in subsequent traffic, so there's some slack here. In fact, 50% of the responses are not being used at all, due to aggressive prefetching, with 36% being used within 50ms. Combining these observations, Kyle concludes that only 5% of connections will be delayed by 50ms or more.

For DNS scalability, the authors compare the number of queries that reach the auth DNS in traces versus in simulations without shared resolvers: 93% of auth domains don't see any increase in load. Most of these are unpopular ones, so they weren't benefit from caching anyway, but the rest are popular domains, and caching is useful. For 'com' for e.g., without resolvers, the load increases by 3.4 1 on average and 1.14 at the peak. The authors suggest increasing the TTL of records to mitigate this problem. In addition, the DNS protocol supports multiple simultaneous queries, so one can increase the number of questions per protocol message to reduce load. Together these approaches can cut the impact on scalability and performance to reasonable levels (the load increase being 1.33 on average with both ideas in place).

Broadly, this work questions whether the performance and scalability gains from using shared resolvers is worth the expanded attack surface.

Q: (Brad Karp) How did you decide that some of the queries were not used?
A: We paired requests to responses. Even if you get 4 A records and use just 1 of them, we count that as used.

Q: (Brad Karp) There's a DNS study from 2001(http://nms.csail.mit.edu/papers/index.php?detail=63); and some of these observations were known before.
A: I'll have to take a look.

Q: Do you have any data on what happens when auth is very far, but the resolvers are pretty close?
A: Not enough data yet.

Q: The numbers in your measurements can be very different if you're Asia or other places. In USC, for example, I'm 50ms away from auth, but only 5ms from my resolver.
A: The prefetching can mitigate this issue to some extent.

Q: It might be harder to remove resolvers than to fix them. Clients are by default set up to use a particular resolver, and many resolvers are very good. Perhaps a more useful strategy is to get rid of bad resolvers?
A: Definitely, not all resolvers are bad. All this doesn't have to be done across all clients for it to work; if you trust your resolver, don't get rid of it.

Q: (Anja Feldmann) In principle, this could be good. But how does it work with NATs and other "nasty little devices" in the network?
A: I have in fact seen problems using this on my laptop; definitely needs more work.

Q: (John Heidemann). Why would switching to recursive resolution fix some of these attack problems?
A: Kaminsky for e.g. requires timing knowledge which would disappear here.