Thursday, November 21, 2013

HotNets '13: Network Stack Specialisation for Performance

Bonus: Panel discussion for Session 3 at the end!

Speaker: Ilias Marinos

Authors: Ilias Marinos, Robert N. M. Watson (University of Cambridge), Mark Handley (University College London).

Traditionally, servers and OSes have been built to be general purpose, but today we see a high degree of specialization: in a big web service, thousands of machines may be dedicated to a single function, so there is real scope for specializing the stack as well. This paper looks at a specific opportunity in that space. Today's network stacks are good at high throughput for large transfers, but not for the small files that are common in web browsing. For example, in one of the authors' experiments with 8 KB files, the stack ran at ~85% CPU load for only ~50% of link throughput (5 Gbps).

The goal of this paper is to design a specialized network stack that addresses this problem by reducing system overhead and letting memory flow freely between the NIC and applications. The paper introduces Sandstorm, a zero-copy web server that accomplishes this.
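To give a flavor of what "free flow of memory" between NIC and application can look like, here is a minimal C sketch of a netmap-style zero-copy transmit (Sandstorm runs over netmap, but this is illustrative, not the paper's code); the single TX ring and the buffer-index handoff shown here are assumptions of the sketch.

    /* Zero-copy transmit sketch: instead of copying a packet into the
     * NIC's buffer, point the TX slot at a pre-built netmap buffer.
     * pkt_buf_idx must come from the same netmap memory region (an
     * assumption of this sketch). */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    int tx_zero_copy(struct nm_desc *d, uint32_t pkt_buf_idx, uint16_t len)
    {
        struct netmap_ring *ring = NETMAP_TXRING(d->nifp, 0);
        if (nm_ring_empty(ring))
            return -1;                         /* TX ring full; retry later */

        struct netmap_slot *slot = &ring->slot[ring->cur];
        uint32_t spare = slot->buf_idx;        /* the slot's old buffer */
        slot->buf_idx = pkt_buf_idx;           /* hand our buffer to the NIC */
        slot->len = len;
        slot->flags |= NS_BUF_CHANGED;         /* buf_idx was swapped */
        (void)spare;                           /* a real stack recycles it */

        ring->head = ring->cur = nm_ring_next(ring, ring->cur);
        return ioctl(d->fd, NIOCTXSYNC, NULL); /* kick the NIC */
    }

Because the application and the stack share an address space, the buffer holding the pre-built packet is the very buffer the NIC DMAs from; no socket-layer copy ever happens.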

Key design decisions include putting the application and the network stack in the same address space; pre-segmenting static content into packets (see the sketch after this paragraph); batching of data; and bufferless operation. In the performance evaluation, the most impressive improvements are in flow completion time with a small number of connections, and in CPU load, which drops by about 5x. When moving to 6x10 GbE NICs, the throughput improvement also becomes impressive: about 3.5x higher than FreeBSD+nginx, and an even bigger gain over Linux+nginx. Overall, the paper's contribution is a new programming model that tightly couples the application with the network stack, reaching 55 Gbps at 72% CPU load.
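To make the pre-segmentation idea concrete, here is a minimal C sketch (not the paper's code) that cuts a static object into MSS-sized packet buffers with header space reserved up front; the 1460-byte MSS and the 54-byte Ethernet/IPv4/TCP header reservation are assumptions.

    /* Pre-segmentation sketch: split a static object into ready-made
     * packet buffers at load time, so that at send time only a few
     * per-connection header fields (addresses, ports, sequence numbers,
     * checksums) need patching before DMA. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define HDR_SPACE 54   /* 14 B Ethernet + 20 B IPv4 + 20 B TCP, no options */
    #define MSS       1460 /* assumed payload bytes per segment */

    struct preseg_pkt {
        uint32_t len;                      /* headers + payload */
        uint8_t  buf[HDR_SPACE + MSS];
    };

    /* Cut one object into ceil(obj_len / MSS) packet buffers. */
    struct preseg_pkt *presegment(const uint8_t *obj, size_t obj_len,
                                  size_t *npkts)
    {
        *npkts = (obj_len + MSS - 1) / MSS;
        struct preseg_pkt *pkts = calloc(*npkts, sizeof(*pkts));
        if (pkts == NULL)
            return NULL;

        for (size_t i = 0; i < *npkts; i++) {
            size_t chunk = obj_len - i * MSS;
            if (chunk > MSS)
                chunk = MSS;
            /* Headers are patched per connection; payload is fixed here. */
            memcpy(pkts[i].buf + HDR_SPACE, obj + i * MSS, chunk);
            pkts[i].len = (uint32_t)(HDR_SPACE + chunk);
        }
        return pkts;
    }

    int main(void)
    {
        uint8_t obj[8192] = {0};           /* stand-in for an 8 KB file */
        size_t n = 0;
        struct preseg_pkt *pkts = presegment(obj, sizeof(obj), &n);
        int ok = (pkts != NULL && n == 6); /* ceil(8192 / 1460) == 6 */
        free(pkts);
        return ok ? 0 : 1;
    }

Amortizing segmentation across many sends of the same hot object is one reason the small-file case speeds up so much.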

Q&A

Q: Was this with interrupt coalescing on the NIC turned on?
A: Yes.

Q: Would it be possible to turn it off?
A: We can poll.

Q: Does this stack optimization perhaps suggest changes in TCP -- things like, say, selective ACKs, which we maybe thought was a good idea, but is just too complex?
A: Not so far. The real problem is the system stack, not the protocols themselves.

Q: At what level do you demultiplex, if you have two different stacks, legacy and Sandstorm?
A: Could share stack if you are careful, or use two queues and put apps on different stacks.
Q: But current NICs generally don't have enough demultiplexing flexibility to send certain traffic types to different stacks.
A: There are a few NICs that can do that.

Q: Could you fix the stack overhead with presegmentation?
A: Presegmentation is only one part of the problem.
Q: It would be interesting to see a graph showing how much benefit comes from each of these specializations.

Panel discussion for all papers in this session

Q: How does big data affect these works on the data plane?
A: Facebook wouldn't be supporting the Open Compute Project if it thought this didn't matter.

Q (for the Disaggregation paper [Han et al.]): The RAMCloud project also had tight latency constraints, but it treats servers as a pool of memory (not full disaggregation). How do the two approaches compare? Why isn't RAMCloud the solution?
A: RAMCloud doesn't allow you to change the CPU/memory ratio. (Additional discussion not recorded.)

Q: Why not push functionality close to the storage device?

Q: Does it make sense to combine the visions, using an approach like tiny packet programs to get functionality closer to the network?

Q: There are several reasons you have hierarchies of latency. One is because we just don't know how to build really low latency memory. But also there are others, e.g. fast memory is more expensive. The cross-cutting vision of these papers seems to be removing implementation barriers, which then allows you to create a more optimized hierarchy (based on fundamental cost tradeoffs rather than implementation). Is this how you think about this?
A: May depend on workload.
