Wednesday, August 24, 2016

CODA: Toward Automatically Identifying and Scheduling COflows in the DArk
Presenter: Hong Zhang
Co-authors: Li Chen, Bairen Yi, Kai Chen, Mosharaf Chowdhury, Yanhui Geng

The key problem the authors address in the paper is how to automatically identify and schedule coflows without any application modifications. The motivation is that, for coflows to work, all the distributed data-parallel applications in a shared cluster must be modified to use the same coflow API, which the authors point out is infeasible to enforce in many cases.

CODA is the mechanism designed by the authors. It tries to identify coflows without modifying applications, and at the same time ensures that the coflow scheduler is robust enough that, even if coflow identification is not perfect (which will invariably be the case), the performance of the system is still significantly better than that of per-flow mechanisms.

CODA uses DBSCAN as its base algorithm to identify coflows. For this, it uses both explicit attributes, like arrival times and packet headers, and implicit attributes, like communication patterns. Further, to tolerate coflow identification errors, CODA delays the assignment of flows to particular coflows until they must be scheduled, which reduces the number of stragglers, and couples this with intra-coflow prioritization (prioritizing small flows inside a coflow) to reduce the number of stragglers further.
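To make the clustering step concrete, here is a minimal sketch of plain DBSCAN applied to flow arrival times only. It is an illustration under simplifying assumptions, not CODA's actual identification procedure: the function name dbscan_1d, the parameters eps and min_pts, and the use of a single attribute (arrival time in seconds) are all hypothetical choices made for this example, whereas CODA combines several attributes.

```python
from typing import List, Optional

NOISE = -1  # label for flows that do not fall into any cluster


def dbscan_1d(times: List[float], eps: float, min_pts: int) -> List[int]:
    """Cluster flows by arrival time with textbook DBSCAN.

    Returns one label per flow: 0, 1, 2, ... for clusters (candidate
    coflows in this sketch) and -1 for noise.
    """
    n = len(times)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        # All flows (including i itself) whose arrival time is within eps of flow i.
        return [j for j in range(n) if abs(times[i] - times[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return [label if label is not None else NOISE for label in labels]


# Hypothetical arrival times: two bursts of flows and one isolated flow.
arrival_times = [0.00, 0.01, 0.02, 5.00, 5.01, 5.03, 9.00]
print(dbscan_1d(arrival_times, eps=0.05, min_pts=2))
```

In this toy input, flows that start within a few tens of milliseconds of each other end up in the same cluster, while the isolated flow is marked as noise. A real deployment would need a distance function over more attributes than arrival time alone.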

For evaluation, the authors used both large-scale trace-driven simulations and a small-scale testbed with 40 servers. The key results include over 95% flow identification accuracy and an improvement in completion times of 2x (5x) on average (at the 95th percentile) on traces obtained from Facebook, compared to per-flow mechanisms.

Future work includes applying CODA to a more varied set of applications, extending CODA to take coflow dependencies into account, and studying how to perform error-tolerant scheduling in general.

Q: You use ML to classify applications for scheduling, but this comes with a danger: changing a few lines of code might make the algorithm misclassify, so programmers will now have to take this into account when making code changes.
A: As long as programmers do not change low-level APIs, this should not have much effect. We will explore this further, but we do not expect it to be a very big issue.

Q: Is there a more principled approach to solving this problem than using an ML algorithm?
A: We use a very simple learning algorithm for classification and try to understand which attributes are more effective by adding humans in the loop. Applying this to different frameworks still needs work, as the attributes might change.