Name lookup is an important problem, both in networking and in domains outside of networking. This paper focuses on a particular use case: routing in content-centric networking (CCN). CCN routes packets using hierarchical names of arbitrary length (e.g., /com/google/maps) rather than fixed-length addresses. To be practical, CCN needs a lookup mechanism that supports longest prefix matching, achieves high throughput (enough to saturate a 100 Gbps Ethernet link), and delivers low latency (under 100 microseconds per lookup).
Satisfying these constraints requires a carefully designed data structure that works efficiently on GPU hardware. Straightforward approaches fall short: character tries and state transition tables require too much memory. Aligned transition arrays (ATAs) greatly compact the information of a transition table, but they do not support incremental update and remain inefficient because each character must be looked up separately. To address both concerns, the authors implement multi-ATAs, which consume multiple characters per transition and support incremental update.
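To make the idea concrete, here is a minimal sketch of a multi-character transition array lookup. All names, sizes, and the hash function are illustrative assumptions, not the paper's actual design: transitions consume two-character chunks, and each slot records its owning state and chunk so a mismatched slot is detected rather than followed.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 1024   /* illustrative size, not the paper's */
#define CHUNK 2           /* characters consumed per transition */

typedef struct {
    int owner;            /* 1 + owning state id; 0 = empty slot */
    char chunk[CHUNK];    /* the characters this transition consumes */
    int next;             /* next state */
} Slot;

static Slot table[TABLE_SIZE];

/* hypothetical hash mixing the state id with a character chunk */
static unsigned slot_of(int state, const char *chunk) {
    unsigned h = (unsigned)state * 131u;
    for (int i = 0; i < CHUNK; i++)
        h = h * 31u + (unsigned char)chunk[i];
    return h % TABLE_SIZE;
}

static void add_transition(int state, const char *chunk, int next) {
    unsigned s = slot_of(state, chunk);
    table[s].owner = state + 1;
    memcpy(table[s].chunk, chunk, CHUNK);
    table[s].next = next;
}

/* Walk the name CHUNK characters at a time; each step is one
 * array access instead of CHUNK per-character lookups. */
static int lookup(const char *name) {
    int state = 0;
    size_t len = strlen(name);
    for (size_t i = 0; i + CHUNK <= len; i += CHUNK) {
        unsigned s = slot_of(state, name + i);
        if (table[s].owner != state + 1 ||
            memcmp(table[s].chunk, name + i, CHUNK) != 0)
            return -1;    /* no such transition */
        state = table[s].next;
    }
    return state;
}
```

The point of the sketch is the inner loop: advancing by CHUNK characters per array access is what lets a multi-ATA avoid the per-character lookups of a plain ATA.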
Implementing multi-ATAs on a hybrid CPU-GPU platform poses its own challenges: the PCIe bus and the GPU processor are the limiting factors in achieving both high throughput and low latency. Two techniques improve performance. First, pipelining lookups with data transfers improves utilization of the PCIe bus and the GPU processor. Second, interweaving the memory layout of the input names in GPU memory reduces the number of memory accesses per thread.
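The interweaved layout can be sketched as a simple transpose (sizes here are mine, not the paper's). In a plain name-major layout, thread t scanning name t reads bytes that sit MAX_LEN apart from its neighbors' reads; storing the batch character-major puts the bytes read by consecutive threads at consecutive addresses, so a warp's reads can coalesce into far fewer memory transactions on a GPU.

```c
#include <assert.h>

#define N_NAMES 4   /* names per batch (one per thread); illustrative */
#define MAX_LEN 8   /* fixed slot length per name; illustrative */

/* out[j][t] holds character j of name t, so all threads reading
 * character position j touch N_NAMES consecutive bytes. */
void interweave(const char in[N_NAMES][MAX_LEN],
                char out[MAX_LEN][N_NAMES]) {
    for (int t = 0; t < N_NAMES; t++)
        for (int j = 0; j < MAX_LEN; j++)
            out[j][t] = in[t][j];
}
```

The transpose costs one pass over the batch on the CPU side but changes the GPU access pattern from strided to sequential, which is where the reduction in memory accesses comes from.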
The multi-ATA implementation performs well on the CPU-GPU platform. It requires two orders of magnitude less space than a baseline state transition table, and the GPU techniques allow up to ~70 million searches per second, an order of magnitude more than a state transition table. Furthermore, these techniques can saturate a 100 Gbps link with latencies of less than 100 microseconds per search.
Question and answer
Dong Zhou, Carnegie Mellon University
DZ: In your evaluation, you worked only on a local machine, but you didn't actually transfer data onto a NIC. Is this a fair comparison with other systems?
BL: Yes, we only work within the context of a single machine. We're looking at using the NIC in future work.
DZ: Using the NIC will take additional CPU cycles; will that competition for the CPU affect your results?
BL: This is something we need to address, but we think that other hardware acceleration could help.
Srimat Chakradhar, NEC Labs
SC: Right now you are keeping the entire state table on the GPU, will this be possible as state tables get larger?
BL: Currently, we have a 10-million-entry table on the GPU, which uses about one third of the external memory available to a single GPU processor chip (the GTX 590 board actually has two processor chips). We estimate that we could keep a 60-million-entry table on this kind of GPU. Newer GPUs have more memory, and we think we could keep a 120-million-entry table on the new hardware.
Michael Freedman, Princeton University
MF: In the multi-ATA table it looks like collisions map into a second table. This seems to imply that lookups will require multiple accesses, is that the case?
BL: No, we have a proof in the paper.
MF: Did you consider comparing with other hashing algorithms (e.g., cuckoo hashing)?
BL: No, this is a much simpler lookup algorithm.
Gun Sirer, Cornell University
GS: It seems like I should be mining Bitcoin with my GPUs. Right now, this CCN routing lacks security (e.g., self-verifying names). Lack of security as a first-class primitive plagues DNS today. How will your system address security challenges?
BL: We haven't considered that specifically, but it's definitely future work.