Wednesday, August 19, 2015

Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

Session 7.2: Scheduling and resource management: 2

Title: Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

Authors: Virajith Jalaparti (University of Illinois, Urbana-Champaign), Peter Bodik (Microsoft Research), Ishai Menache (Microsoft Research), Sriram Rao (Microsoft), Konstantin Makarychev (Microsoft Research), Matthew Caesar (University of Illinois, Urbana-Champaign)

Paper: http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p407.pdf
Public review: http://conferences.sigcomm.org/sigcomm/2015/pdf/reviews/241pr.pdf
Large compute clusters have become quite common. The jobs on this network are often constrained by the network. Network scheduling is important for data-parallel jobs, because they include Network intensive stages (e.g., shuffle, join).
Several techniques proposed. Consider a simple map-reduce job, where you have a bunch of maps and reduces. You have inputs that are sent to the maps. The focus is on placing tasks, e.g., Tetris and Quincy. There is a focus on scheduling network flows, e.g., Baraat and Varys. The limitation of existing techniques is that the map input data is spread randomly. The maps produce data which is sent to the reducers. When the reducers start running the reduce input is spread randomly.
The authors propose placing data input in a few racks. Map input data is placed in one rack. This then means the reduce input is in the same rack. Furthermore, all the transfers stay within the rack leading to rack locality. The benefit of this proposal include high bandwidth between tasks and reduced contention across jobs. This placement of data is feasible, and it is useful in cases where there are recurring jobs known ahead of time and separate storage and compute clusters.
The challenge of this approach is how many racks to assign a job. The presenter showed a figure that showed that one rack may be sufficient for 75—95% of jobs. However there is a long tail, and we may need more than one rack for 5—25% of jobs. Offline planning using job properties can be used to address this problem.
The authors propose a system called Corral. Corral consists of an offline planner and a cluster scheduler. The planner takes future job estimates and determines which racks are to be used. Corral essentially solves an offline planning problem. For data placement: One data replica is constrained to Racks(i), while other replicas are placed randomly for redundancy.
The planning problem is formulated as follows: Given a set of jobs and their arrival time, find a schedule which meets their goals. There are two scenarios: Batch or online. With the former you minimize the makespan, while the latter minimizes the average job time. This presentation focusses on the batch approach.
Corral was implemented on Yarn with a modified HDFS and Resource manager. The cluster had 210 machines in 7 racks. Three different MapReduce workloads were tested. Corral reduces makespan by 10—33%. Corral improved the network locality. There is reduced contention across the cluster.
Q: Do you consider that different jobs may share data?
A: No we don’t take it into account. That is deferred to future work.