Tuesday, August 23, 2016

ClickNP: Highly flexible and High-performance Network Processing with Reconfigurable Hardware

Bojie Li (USTC / Microsoft Research), Kun Tan (Microsoft Research), Layong (Larry) Luo (Microsoft), Yanqing Peng (SJTU / Microsoft Research), Renqian Luo (USTC / Microsoft Research), Ningyi Xu (Microsoft Research), Yongqiang Xiong (Microsoft Research), Peng Cheng (Microsoft Research), Enhong Chen (USTC)

Presenter: Bojie Li 

As network functions (NFs) are asked to perform an increasingly varied set of packet processing tasks, hardware NFs are not flexible enough. To counter this, virtual NFs are run on commodity servers. However, this software approach faces major scale-up challenges: the limited processing capacity of each core means that a large number of cores is required (100 cores for an NF at 40 Gbps). The latency of software NFs is also inflated and unstable. The authors therefore propose running NFs on FPGAs, which are now found in NICs. FPGAs provide massive parallelism for packet processing, consume less power, and are cheap. The main challenge of using FPGAs is programmability: they are programmed in hardware description languages such as Verilog, which are obscure to software developers who want to write NFs.

The authors present ClickNP to make FPGA programming easier for network processing. ClickNP lets programmers develop NFs in a high-level language, supports modularization (like the Click modular router), synthesizes high-performance code, and provides functionality for joint CPU/FPGA processing.

NFs are developed in a model similar to multi-core programming. The difference in ClickNP is that cores share information via channels rather than through shared memory, which avoids the shared memory bottleneck. An 'element' is a building block that represents a single-threaded core, and the speaker describes the components of an element. Two extensions are presented, including that an element can be executed on either CPU or FPGA cores. The architecture and runtime components of ClickNP are presented, along with an example of a packet logger. The authors then present several optimizations: parallelism across elements is improved by pipeline parallelism; inside elements, they propose delayed writes to remove read-write memory dependencies; and unbalanced pipeline stages are tackled by offloading the slow path to another element.
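The delayed-write idea mentioned above can be illustrated with a small software model. This is only a sketch under assumptions: the summary does not describe the mechanism, so the code assumes that a write is held in a one-entry buffer register until the next read, and that a read of the same address is served from that buffer. The class `DelayedWriteMemory` and the function `count_packets` are hypothetical names chosen for illustration, and Python is used rather than ClickNP's own language.

```python
class DelayedWriteMemory:
    """Toy model of a memory with a one-entry delayed-write buffer.

    Assumption (not from the talk summary): a write is parked in a
    register and committed when the next read is issued, so a read and
    the preceding write never contend for the memory in the same step.
    """

    def __init__(self, size):
        self.mem = [0] * size
        self.pending = None  # (addr, value) that has not reached mem yet

    def read(self, addr):
        if self.pending is None:
            return self.mem[addr]
        p_addr, p_val = self.pending
        self.pending = None
        # Same address: forward the buffered value instead of reading mem.
        # Different address: the read and the commit touch different cells,
        # so hardware could perform them in parallel.
        value = p_val if p_addr == addr else self.mem[addr]
        self.mem[p_addr] = p_val  # commit the delayed write
        return value

    def write(self, addr, value):
        # Commit any older buffered write before buffering the new one.
        if self.pending is not None:
            p_addr, p_val = self.pending
            self.mem[p_addr] = p_val
        self.pending = (addr, value)

    def flush(self):
        # Commit whatever is still buffered (e.g. at the end of a run).
        if self.pending is not None:
            p_addr, p_val = self.pending
            self.mem[p_addr] = p_val
            self.pending = None


def count_packets(bucket_ids, num_buckets):
    """Read-modify-write loop, the pattern that creates the dependency."""
    table = DelayedWriteMemory(num_buckets)
    for bucket in bucket_ids:
        count = table.read(bucket)
        table.write(bucket, count + 1)
    table.flush()
    return table.mem
```

Calling `count_packets([1, 1, 2, 1, 0], 4)` gives the same counts as a plain array would, including when consecutive packets hit the same bucket, which is the case the forwarding path is there to handle.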

The authors then present the evaluation of ClickNP, for which they implement various components from the Click library (e.g., parser, checksum, AES). The NFs produced by ClickNP have high peak throughput and delay below 1 ms. ClickNP also simplifies the development of NFs: each NF takes tens of lines of code and was developed in one week. The authors present two case studies, an IPSec gateway and an L4 load balancer built using ClickNP, and compare their performance with software NFs. The ClickNP NF can provide 40 Gbps throughput (compared to 628 Mbps with the software NF), and its latency is low and stable (on the order of microseconds). They also compare the resource utilization of ClickNP NFs with NFs handwritten in Verilog and find less than 2 times overhead (which is not much of a concern, as FPGA resources are bound to increase in the future). In summary, ClickNP provides a practical platform for implementing NFs on FPGAs, which brings various advantages.
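To put the quoted numbers side by side, the short calculation below derives the throughput ratio between the ClickNP NF and the software NF, and the per-core rate implied by the "100 cores at 40 Gbps" figure from earlier in this summary. The variable names are made up for illustration, and the per-core figure assumes the load is spread evenly across cores.

```python
# Figures quoted in the talk summary, converted to bits per second.
clicknp_bps = 40e9      # ClickNP NF: 40 Gbps
software_bps = 628e6    # software NF: 628 Mbps
software_cores = 100    # cores quoted for a software NF at 40 Gbps

# How many times faster the ClickNP NF is in this comparison.
speedup = clicknp_bps / software_bps

# Per-core rate implied by "100 cores at 40 Gbps", assuming an even split.
per_core_gbps = clicknp_bps / software_cores / 1e9

print(f"speedup: {speedup:.1f}x")
print(f"implied per-core rate: {per_core_gbps:.1f} Gbps")
```

The ratio comes out at roughly 64x, and the implied per-core rate is 0.4 Gbps.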

Q: There has been a lot of exciting work with P4. Can you provide a sense of how P4 fits with ClickNP?
A: Network Functions are generally complicated like encryption/decryption and stateful load balancers which are difficult to implement using P4. Integration with P4 is a future line of work.

Q: Why do ClickNP elements use channels instead of a shared memory model?
A: This is primarily to avoid the bottleneck of shared memory, though using channels forces programmers to write programs without a shared memory model.
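As a rough software analogy for the channel model discussed in this answer, the sketch below connects two single-threaded elements with FIFO queues, so that each element keeps its state private and only messages cross between them. It is not ClickNP code: the packet format, the `element` runner, and the handlers `parse` and `count_by_src` are all hypothetical, and Python threads stand in for FPGA or CPU cores.

```python
import queue
import threading

STOP = object()  # sentinel that tells an element to shut down


def element(handler, in_ch, out_ch):
    """Run one single-threaded element: private state, channel I/O only."""
    state = {}  # visible to this element alone, never shared
    while True:
        msg = in_ch.get()
        if msg is STOP:
            out_ch.put(STOP)  # propagate shutdown downstream
            return
        out = handler(state, msg)
        if out is not None:
            out_ch.put(out)


def parse(state, raw):
    # Hypothetical packet format "src|dst|payload", for illustration only.
    src, dst, payload = raw.split("|")
    return {"src": src, "dst": dst, "payload": payload}


def count_by_src(state, pkt):
    # Per-source counter kept in this element's private state.
    state[pkt["src"]] = state.get(pkt["src"], 0) + 1
    pkt["seq_from_src"] = state[pkt["src"]]
    return pkt


def run_pipeline(raw_packets):
    ch_in = queue.Queue(maxsize=8)
    ch_mid = queue.Queue(maxsize=8)
    ch_out = queue.Queue()  # unbounded, so elements can always finish
    threads = [
        threading.Thread(target=element, args=(parse, ch_in, ch_mid)),
        threading.Thread(target=element, args=(count_by_src, ch_mid, ch_out)),
    ]
    for t in threads:
        t.start()
    for raw in raw_packets:
        ch_in.put(raw)
    ch_in.put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = ch_out.get()
        if item is STOP:
            break
        results.append(item)
    return results
```

For example, `run_pipeline(["a|b|x", "a|c|y", "d|b|z"])` returns three parsed packets in arrival order. The trade-off raised in the answer shows up here too: `count_by_src` cannot look into the parser's state, so anything it needs has to be carried in the message.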