Tuesday, August 22, 2017

Session 1, Paper 1 - dRMT: Disaggregated Programmable Switching (Sharad Chole, Andy Fingerhut, Sha Ma, Anirudh Sivaraman, Shay Vargaftik, Alon Berger, Gal Mendelson, Mohammad Alizadeh, Shang-Tse Chuang, Isaac Keslassy, Ariel Orda, Tom Edsall)

Anirudh proposes a new architecture for programmable hardware, dRMT (disaggregated Reconfigurable Match-Action Table). The dRMT design aims to overcome two restrictions of RMT and the drawbacks they cause:
  1. Each match-action pipeline stage in RMT can only access memory local to it, so memory not used by one stage cannot be allocated to another match-action stage. The result is poor memory utilization.
  2. RMT executes operations only in a fixed order: a match followed by an action in each stage. The authors argue that such fixed functionality leads to underutilisation of hardware resources for programs where matches and actions are imbalanced.
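
The first restriction can be illustrated with a toy model. This is only a sketch under simplifying assumptions: each table is pinned to a single stage (real RMT compilers have more freedom than that), and the function names, table demands, and capacities are all hypothetical, not taken from the paper.

```python
def fits_local(demands, per_stage_capacity):
    """Stage-local memory: each table must fit in its own stage's memory."""
    return all(d <= per_stage_capacity for d in demands)


def fits_pooled(demands, per_stage_capacity):
    """Disaggregated memory: all tables share one pool of the same total size."""
    return sum(demands) <= per_stage_capacity * len(demands)


# Hypothetical memory blocks needed by the tables placed in stages 0..3.
demands = [12, 1, 1, 2]
# Hypothetical number of memory blocks local to each stage.
capacity = 4

print(fits_local(demands, capacity))   # False: stage 0 needs 12 blocks but owns 4
print(fits_pooled(demands, capacity))  # True: 16 blocks needed, 16 available in total
```

The total memory is identical in both cases; only the ability to lend unused blocks to another stage changes the outcome.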

dRMT's key idea is to disaggregate the hardware resources of a programmable switch, i.e., memory and compute.
  1. dRMT separates the memory for tables from the processing stages and makes it accessible through a crossbar. The crossbar carries information such as search keys and search results between the match-action stages and memory.
  2. The pipeline stages of RMT are replaced with a set of match-action processors. As in RMT, these processors have match-action units. Unlike RMT, however, packets do not move between pipeline stages. Instead, each packet is sent to a dRMT match-action processor in round-robin fashion, and the entire P4 program for that packet is executed to completion on that processor.
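
The round-robin dispatch described above can be sketched as follows. This is a minimal illustration, not the authors' scheduler: the function name `dispatch_round_robin` is hypothetical, and packets are represented as plain integers.

```python
from itertools import cycle


def dispatch_round_robin(packets, num_processors):
    """Assign each arriving packet to a processor in round-robin order.

    Returns a dict mapping processor index to the list of packets that
    processor runs to completion.
    """
    assignment = {p: [] for p in range(num_processors)}
    for packet, proc in zip(packets, cycle(range(num_processors))):
        assignment[proc].append(packet)
    return assignment


assignment = dispatch_round_robin(packets=list(range(8)), num_processors=3)
print(assignment)  # {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5]}
```

Each packet stays on the processor it was assigned to, in contrast to a pipeline where every packet visits every stage in turn.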

Anirudh then addresses the following three questions:

  1. How can the entire system (memory and processors) be scheduled at compile time so that throughput and latency are deterministic?
  2. How does dRMT compare with RMT on real programs?
  3. Is dRMT feasible?
For more details, please check out the paper.

dRMT is evaluated using three switch.p4 programs and one P4 program from a large switch ASIC manufacturer. The results show significant improvements over RMT: for these programs, a dRMT chip requires 4.5%, 16%, 41%, and 50% fewer processors without compromising line rate. Even for 100 randomly generated P4 programs with characteristics similar to switch.p4, a dRMT chip requires 10% fewer processors. When fewer processors are available, the throughput of dRMT degrades gracefully, whereas RMT's performance falls off a cliff if a program needs more pipeline stages.
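
The contrast between graceful degradation and a cliff can be sketched with a toy model. Both formulas below are assumptions made for illustration, not results from the paper: RMT is assumed to recirculate a packet through the pipeline until it has seen enough stages, and dRMT throughput is assumed to scale with the fraction of the required processors that are available.

```python
import math


def rmt_throughput(stages_needed, stages_available):
    """Assumed model: packets recirculate, so throughput is 1/passes of line rate."""
    passes = math.ceil(stages_needed / stages_available)
    return 1.0 / passes


def drmt_throughput(processors_needed, processors_available):
    """Assumed model: throughput is proportional to the available processors."""
    return min(1.0, processors_available / processors_needed)


# Hypothetical program that needs 32 stages (RMT) or 32 processors (dRMT).
for available in (32, 31, 24, 16):
    print(available,
          rmt_throughput(32, available),
          round(drmt_throughput(32, available), 3))
```

Under these assumptions, removing a single stage halves the RMT throughput, while removing a single processor costs dRMT about 3% of line rate.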

Although the work has no working chip, it presents a design for a dRMT chip and analyses its feasibility and chip area cost. Specifically, dRMT requires some additional chip area to implement the crossbar and the match-action processors. For example, a 32-processor dRMT chip needs an additional 5 mm^2, a modest increase compared to the 200 mm^2 of a typical switching chip.
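
As a quick check on the word "modest", the relative overhead implied by the two figures quoted above can be computed directly. The variable names are illustrative only.

```python
extra_area_mm2 = 5.0       # additional area quoted for a 32-processor dRMT chip
baseline_area_mm2 = 200.0  # typical switching chip area quoted in the talk

overhead = extra_area_mm2 / baseline_area_mm2
print(f"{overhead:.1%}")  # 2.5%
```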

The bottom line, according to Anirudh:
- Disaggregation of hardware resources improves utilization and throughput.
- dRMT requires about the same chip area while supporting even more complex operations.
- More details at http://drmt.technion.ac.il/


Q1: Scaling the wiring may not be a problem; the observed problem is the latency introduced by the crossbar. Is there any study on latency?
Anirudh: Yes, latency does go up with the crossbar, by a few clock cycles (4) in dRMT.

Q2: Was RMT just a reference model that could be implemented in hardware using memory disaggregation, just as dRMT does?
Anirudh: The high-performance programmable switching chips that the authors know of do not disaggregate memory; their memory is local to each stage.

Q3: Could RMT benefit from heterogeneous stages, where each stage has a different ratio of match:action:memory capacity?
Anirudh: Designing a single RMT stage involves significant hardware design effort, so you want to amortise that design cost by replicating the same design for all stages, which makes the stages homogeneous.