Layer 9: Session 3, Paper 2: Characterizing Load Imbalance in Real-World Networked Caches

Tuesday, October 28, 2014

Session 3, Paper 2: Characterizing Load Imbalance in Real-World Networked Caches

Authors: Qi Huang (Cornell University), Helga Gudmundsdottir, Ymir Vigfusson (Reykjavik University), Daniel Freedman (Technion), Ken Birman, Robbert van Renesse (Cornell University)

Link to paper Characterizing Load Imbalance in Real-World Networked Caches

Modern web services nowadays heavily rely on the in-memory caches to reduce the request latency for users and reduce the load on the back-end storage servers. To avoid overloading a single cache server, data is partitioned into multiple segments (“shard”) and stored across different cache servers. Related objects tend to be mapped into the same shard; and ideally, each cache server should see similar request rates. However, in the real-world workload, we often observe the load imbalance among the cache servers. Potential causes of load imbalance include: hashing schemes, skewed access popularity, etc. Using real workload of Facebook’s TAO cache system, this paper tries to answer the following questions:

What is the state of load imbalance?
What contributes to load imbalance?
How effective are current techniques?
How to improve?

The authors find that significant load imbalance observed in TAO. Possible causes:

Content popularity (ranking top 100 shards, and top 20 objects per server)
Hot objects: (Interestingly) very hot hot object alone are not a major cause of load imbalance of cache server.
Hot shards: Popular shards have much more traffics than the hot object

Next, the presenter evaluates the current techniques used to mitigate the load imbalance (consistent-hashing, hot-content replication) and find room for improvement. 3 approaches for consistent hashing: consistent hashing for in-memory networked caches implemented in libketama, TAO’s hashing, and baseline “perfect hashing”. 3 approaches for replication: None (no replication), TAO’s replication, optimal replication. The combination of these (hashing,replication) are evaluated using Facebook’s TAO workload. Metric used: max/avg, where max,avg are volume of requests of most loaded/average loaded servers). Some key results:

standard hashing (implemented in libketama): 53% higher load on the most loaded server compared to the average. Only achieve 41% higher load when implemented with optimal replication.
TAO’s hashing in general performs better than standard hashing, but still worse than when optimal hashing or replication is implemented in TAO
The streaming algorithm outperforms other replication approaches, but still has room for further improvement.

Future work:

Characterize how hashing and replication affect load imbalance?
Can streaming algorithm replicate content before its popularity surges?
Can we predict the popularity spikes and prevent hotspots.

1 comment:

Angel17May 13, 2023 at 7:34 AM
Thank you for sharing this post. Keep posting.
cary nc gutter installation
ReplyDelete
Replies

Add comment