Thursday, August 20, 2015

Low Latency Geo-distributed Data Analytics

Session: Scheduling and resource managment (2)
Authors: Qifan Pu (UC Berkeley/MSR), Ganesh Ananthanarayanan (MSR), Peter Bodik (MSR), Srikanth Kandula (MSR), Aditya Akella (UW Madison), Victor Bahl (MSR), Ion Stoica (UC Berkeley)
Public review:

Organizations are distributed data centers across the globe. You have performance counters and user activities. Right now if an app wants to get access to these counters, all the data is moved to a single data center and then analyzed. This is centralized data analysis. This paper discusses a method to have a single logical analytics cluster across all sites. This prevents the wasteful shipment of data across the globe.
Consider an example analytics job. There will be some map and reduce tasks. These tasks will cause a lot of traffic across the WAN. As a result there might be bottlenecks on some of the links. Typically one splits the tasks evenly. We are then able to calculate how long it might take to transfer data. What if we build a system that is aware of the network? Then we can put more reduce tasks.
Remember from the previous presentation that queries do not arrive at the same time as the generation of data. We present Iridium, which can jointly optimize data and task placement. For a single dataset, Iridium uses iterative heuristics for joint task-data placement as follows:
  1. Identify bottlenecks by solving task placement.
  2. Place the tasks to reduce network bottlenecks during query execution.
Iridium is evaluated with Spark 1.1.0 and HDFS 2.4.1.
Figure 6 from the paper shows that Iridium can get 4—19× speedup for the centralized baseline and 3—4x speed-up for the in-place baseline.

TO sum up. We proposed Iridium which allows analysis of logs spread across data centers. Iridium allows for a single logical analytics cluster across all sites. It incorporates WAN bandwidths and reduces response time over baselines by 3×—19×.

Q: In your Intro, you say the normal behavior is to copy the data to a centralized location. Your results show that copying the data to a central location is worse than leaving the data in place. Are we stupid for doing that?
Comment from audience: People have generally thought that it was stupid to move data.

Q: How is the completion time related? Some reduce tasks cannot start before map jobs are finished.
A: In our environment reducers always run after the mappers finish.