Layer 9: CoNEXT 2013 Student Workshop: Datacenter

Dissecting bufferbloat: measurement and per-application breakdown of queueing delay
Authors: Andrea Araldo (Telecom ParisTech, Paris, France), Dario Rossi (Telecom ParisTech, Paris, France)
Presenter: Andrea Araldo

Passive methodology to infer queuing delay in the Internet, and an implementation that can be downloaded and used
Validation of the tool: Results from a real ISP network to show per application view and QoE, and the causes of queuing delay.
Evaluate impact of queuing delay on the user experience

Bufferbloat is long queuing delays inside network buffers. It is due to two factors:
- TCP congestion control, which is loss-based: it only reacts to congestion after a buffer is completely full.
- Memory is cheap -> manufacturers make buffers large
Bufferbloat of up to 4 seconds can be seen in a common router
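
To see how a multi-second delay can arise: the worst-case queuing delay of a full buffer is its size divided by the rate of the link draining it. A minimal sketch; the buffer size and link rate are illustrative assumptions, not figures from the talk.

```python
def drain_time_seconds(buffer_bytes: int, link_rate_bps: float) -> float:
    """Worst-case queuing delay: time to drain a full buffer at the link rate."""
    return buffer_bytes * 8 / link_rate_bps


# Assumed numbers: a 512 KiB buffer in front of a 1 Mbit/s uplink.
delay = drain_time_seconds(512 * 1024, 1_000_000)
print(f"{delay:.2f} s")  # prints "4.19 s"
```

Halving the buffer or doubling the link rate halves the bound, which is why large buffers in front of slow access links are the usual suspect.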

There is much previous work:
Active: gives maximum queuing delay rather than the typical
Passive: measures queuing delay across all applications. This says nothing about user experience; high delay can be intolerable depending on the application

Contribution: first to give a per-application view of queuing delay.

Methodology was to place tstat into a real tier-3 ISP network.
Use DPI of tstat to do per-application breakdown.
8 classes of trace:
- Delay tolerant: OTH, Mail, and p2p;
- Middle sensitivity to delay: web, media, chat;
- Highly sensitive to delay: SSH, VoIP
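
The notes do not record how the queuing delay itself is computed. A minimal sketch of a per-application breakdown, assuming a common passive estimator (an RTT sample minus the minimum RTT seen on the same flow) and assuming each sample already carries its DPI class label. The function name and tuple layout are hypothetical, not tstat's.

```python
from collections import defaultdict
from statistics import median


def per_app_queueing_delay(samples):
    """samples: iterable of (flow_id, app_class, rtt_ms).

    Returns {app_class: median queuing delay in ms}. The queuing delay of a
    sample is ASSUMED to be its RTT minus the minimum RTT of the same flow.
    """
    samples = list(samples)
    # First pass: minimum RTT per flow, taken as the "empty queue" baseline.
    min_rtt = {}
    for flow_id, _, rtt in samples:
        min_rtt[flow_id] = min(rtt, min_rtt.get(flow_id, rtt))
    # Second pass: collect the per-sample delay under each application class.
    by_app = defaultdict(list)
    for flow_id, app, rtt in samples:
        by_app[app].append(rtt - min_rtt[flow_id])
    return {app: median(delays) for app, delays in by_app.items()}


# Made-up samples for illustration only.
samples = [
    ("f1", "VoIP", 20.0), ("f1", "VoIP", 25.0), ("f1", "VoIP", 120.0),
    ("f2", "P2P", 50.0), ("f2", "P2P", 850.0),
]
print(per_app_queueing_delay(samples))
```

The same delay reads differently per class: a few hundred milliseconds is tolerable for P2P but not for VoIP, which is the point of the breakdown.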

Root cause analysis: the queuing delay experienced by an application is caused by concurrent applications running on the host at the same time.
Thus need to look at correlation between applications running on the same host
Thus extract most frequent application flow combinations
e.g. (chat, p2p) and (chat, http)
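
Extracting the most frequent combinations can be sketched as counting which application classes are active together on one host. A minimal sketch, assuming flows are given as (host, timestamp, class) records and that "concurrent" means "in the same fixed time window"; both the record layout and the windowing are assumptions.

```python
from collections import Counter, defaultdict


def frequent_combinations(flows, window_s=1.0, top=3):
    """flows: iterable of (host, timestamp_s, app_class).

    Groups flows by host and time window, then counts the sets of classes
    seen together. Windows with a single class are ignored.
    """
    active = defaultdict(set)
    for host, ts, app in flows:
        active[(host, int(ts // window_s))].add(app)
    counts = Counter(
        tuple(sorted(apps)) for apps in active.values() if len(apps) > 1
    )
    return counts.most_common(top)


# Made-up flows for illustration only.
flows = [
    ("h1", 0.1, "chat"), ("h1", 0.5, "p2p"),
    ("h2", 0.2, "chat"), ("h2", 0.9, "p2p"),
    ("h3", 0.3, "chat"), ("h3", 0.4, "http"),
    ("h4", 0.1, "web"),
]
print(frequent_combinations(flows))
```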

Proposes a methodology to infer queuing delay, and provides an open-source tool.

Insights:
How applications suffer queuing delay
What the causes of queuing delay are

Future work:
Deployment of an operation tool for online traffic analysis, using their modified version of tstat.

Q: For bufferbloat, can we find where the problem is?
A: Can’t know exactly where it is in a large ISP
Queuing happens when transiting from a high-rate link to a low-rate link; can use this to infer the location

Q: (statement) It could be useful to run the tool against already collected traces to analyze them

------

Real-Time Diagnosis of TCP Performance in Clouds
Authors: Mojgan Ghasemi (Princeton University), Theophilus Benson (Duke University), Jennifer Rexford (Princeton University)
Presenter: Mojgan Ghasemi

Cloud providers need a tool to:
- Detect performance problems
- Find origin of problem: sender, receiver, or network
- Then drive corrective actions

2 previous approaches
- Gather TCP endpoint stats on end host
Needs to modify guest VM, changes trust model, uses tenant's resources
- Collect offline packet traces on network
No app end host visibility
High measurement overhead
Offline - not effective for real time diagnosis

Solution presented is real time diagnosis
Use hypervisor
Advantages:
- No need to modify tenant VMs, and avoids network overhead

Challenges:
No visibility into guest VMs
Need to scale to a large number of connections (efficient memory)
Low delay and high throughput - needs to be fast while accurate

Every packet sent is captured in the hypervisor
Between the network and the VM
A 5-tuple key is used to look up the flow. If no entry exists, a new flow is created
Update the TCP state in a TCP state machine model, and update the flow stats in a flow table: constants (e.g. MSS), counters (e.g. byte count), sampled stats (e.g. RTT), calculated stats (e.g. CWND).

Also keep linked list of the samples gathered
Not the whole packets, just the key components
When the ACK of a packet is received, remove its sample; otherwise memory requirements grow large
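
The per-flow bookkeeping described above can be sketched as a table keyed by the 5-tuple, with a queue of unacknowledged samples that is pruned on every ACK. This is a minimal sketch, not the presented implementation: class and field names are hypothetical, the caller is assumed to pass the data-direction key for ACKs, and retransmissions and sequence wraparound are ignored.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowStats:
    mss: int = 0                      # constant, inferred as largest payload
    byte_count: int = 0               # counter
    rtt_samples: list = field(default_factory=list)   # sampled stats
    in_flight: deque = field(default_factory=deque)   # (end_seq, send_time)


class FlowTable:
    def __init__(self):
        self.flows = {}

    def on_data(self, key, seq, length, now):
        """Record an outgoing data segment; create the flow if it is new."""
        flow = self.flows.setdefault(key, FlowStats())
        flow.byte_count += length
        flow.mss = max(flow.mss, length)
        # Keep only the key components of the packet, not the packet itself.
        flow.in_flight.append((seq + length, now))

    def on_ack(self, key, ack, now):
        """Drop every sample covered by a cumulative ACK, keeping an RTT each."""
        flow = self.flows.get(key)
        if flow is None:
            return
        while flow.in_flight and flow.in_flight[0][0] <= ack:
            _, sent = flow.in_flight.popleft()
            flow.rtt_samples.append(now - sent)


table = FlowTable()
key = ("10.0.0.1", 1234, "10.0.0.2", 80, "tcp")  # made-up 5-tuple
table.on_data(key, seq=1000, length=1460, now=0.0)
table.on_data(key, seq=2460, length=1460, now=0.001)
table.on_ack(key, ack=2460, now=0.010)
print(len(table.flows[key].in_flight), table.flows[key].rtt_samples)
```

Memory per flow is bounded by the number of segments in flight rather than by the connection's lifetime, which is what makes the approach scale to many connections.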

Also want to measure ongoing connections, i.e. when the handshake was missed
Need to be able to monitor connections midstream
Important as DC connections are long-lived - cannot only monitor new connections
Allows on demand monitoring
Reduces overhead by selective connection diagnosis

Two more challenges:
Don’t know the constants, e.g. MSS and TCP options, as the handshake was missed
Don’t know the TCP state and CWND

Solutions:
1. Constants and options: infer them using moving min/max/averages over the observed packets
2. TCP state and CWND: use heuristics to narrow down the unknown state: rate, spacing and loss.
Rate: e.g. exponential growth means slow start, but if growth is linear the state may not be deducible from rate alone
Packet spacing: observe the number of packets sent before the ACK
Loss: e.g. if 3 duplicate ACKs or a timeout are observed, can use this to work out the point in the state machine
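
The inference heuristics above can be sketched as three small checks. A minimal sketch under stated assumptions: the input is already aggregated into payload sizes, packets per RTT, and a sequence of ACK numbers; the function names, the 25% tolerance, and the "+1 packet per RTT" test for linear growth are all hypothetical choices, not the presenter's.

```python
def infer_mss(payload_sizes):
    """Missed the handshake: take the largest observed payload as the MSS."""
    return max(payload_sizes)


def classify_growth(packets_per_rtt, tol=0.25):
    """Return 'slow_start', 'linear', or 'unknown' from the sending rate.

    'linear' is deliberately not mapped to a TCP state: per the talk, linear
    growth alone may not be enough to deduce the state.
    """
    if len(packets_per_rtt) < 3:
        return "unknown"
    pairs = list(zip(packets_per_rtt, packets_per_rtt[1:]))
    # Roughly doubling every RTT suggests exponential growth (slow start).
    if all(prev > 0 and abs(cur / prev - 2.0) <= 2.0 * tol for prev, cur in pairs):
        return "slow_start"
    if all(cur - prev == 1 for prev, cur in pairs):
        return "linear"
    return "unknown"


def has_triple_dupack(ack_numbers):
    """True if an ACK number is repeated three times after its first arrival."""
    dup = 0
    for prev, cur in zip(ack_numbers, ack_numbers[1:]):
        dup = dup + 1 if cur == prev else 0
        if dup >= 3:
            return True
    return False


print(infer_mss([512, 1448, 1448, 300]))
print(classify_growth([2, 4, 8, 16]), classify_growth([10, 11, 12, 13]))
print(has_triple_dupack([100, 100, 100, 100]))
```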

Active approach if can't work out state:
Fake a loss:
3 duplicate ACKs - force the sender to halve its window
Or a timeout - forces slow start. Don't do this: too much impact on performance (nuclear option)
These are options if the operator determines it is worth the cost
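
Why a faked loss pins down the state: after the sender reacts, its window is known. A minimal sketch, assuming a Reno-style sender following the standard congestion control rules (ssthresh set to half the flight size, with a floor of two segments); the function name and the example numbers are hypothetical.

```python
def reno_after_fake_loss(flight_size, mss, kind):
    """Return (cwnd, ssthresh) in bytes once a Reno-style sender has reacted.

    kind is 'triple_dupack' (fast retransmit, window halved) or
    'timeout' (window collapses to one segment, slow start restarts).
    """
    ssthresh = max(flight_size // 2, 2 * mss)
    if kind == "triple_dupack":
        # Value after fast recovery completes: cwnd is deflated to ssthresh.
        return ssthresh, ssthresh
    if kind == "timeout":
        return mss, ssthresh
    raise ValueError(f"unknown loss kind: {kind}")


# Assumed example: 20 segments of 1460 bytes in flight.
print(reno_after_fake_loss(20 * 1460, 1460, "triple_dupack"))
print(reno_after_fake_loss(20 * 1460, 1460, "timeout"))
```

The gap between the two results is the performance cost the talk refers to: the timeout throws away the whole window, the duplicate ACKs only half of it.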

Questions:

Q: Can you tell which TCP variant is in use by observing packets?
A: nmap can do this
For now assuming Reno
Q: how does nmap do this?
A: unsure

Q: Does it deal with secure connections, such as SSL?
A: yes it does

Q: for active measurements, can divert/delay the packet rather than causing loss?
A: could do? (Not sure of answer)

Q: How can you make sure measuring doesn't affect the traffic?
A: if passive measurement then not affecting traffic

-------
Diagnosing Slow Web Page Access at the Client Side
Authors: Tobias Flach (University of Southern California), Ethan Katz-Bassett (University of Southern California), Ramesh Govindan (University of Southern California)
Presenter: Tobias Flach

Explain why web access is sometimes slow
Infer solutions

Challenges
Web pages become increasingly complex
- Hard to establish which resources are responsible for bad performance

Some performance issues are short-lived
- Hard to reproduce the problem

Related work
Client side:
- Inject measurement scripts into the page (Fathom), (Netalyzr)
- Persistent network performance analysis
Server side:
- Collect data for requested resources

Solution proposed:
Tool that passively monitors browser behavior as well as network traffic
And actively probes the network when it detects an anomaly
Classifies the anomaly and determines the root cause based on collected data and features

Architecture
Browser, data collection, data analysis
Analysis has anomaly rule set, and anomaly detection and classification

Data collection:
- One page may request resources from multiple servers
Passive: TCP packets/connections, browser data (e.g. DOM)
Active: pings and traceroutes: only done if indicators suggest an anomaly is present

Data analysis:
1. Trigger active measurements from performance anomalies. E.g. a timeout, or the user clicks a button to indicate poor performance
2. Annotate recorded packets and connections
3. Cluster traces with common properties, e.g. traces on a common sub-path
4. Map traces onto anomaly types
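
Step 3 can be sketched as grouping the traceroutes of slow connections by the hops they share. A minimal sketch, assuming each trace is a list of hop identifiers per server; the function name, data layout, and addresses are made up for illustration, and the real tool's clustering is not described in these notes.

```python
from collections import defaultdict


def cluster_by_shared_hop(traces):
    """traces: {server: [hop1, hop2, ...]} for connections flagged as slow.

    Returns {hop: sorted list of servers whose path crosses it}, keeping only
    hops shared by at least two servers.
    """
    servers_by_hop = defaultdict(set)
    for server, hops in traces.items():
        for hop in hops:
            servers_by_hop[hop].add(server)
    return {
        hop: sorted(servers)
        for hop, servers in servers_by_hop.items()
        if len(servers) > 1
    }


# Made-up traces: two slow servers share an ISP hop, a third does not.
traces = {
    "cdn.example": ["192.0.2.1", "198.51.100.7", "203.0.113.5"],
    "img.example": ["192.0.2.1", "198.51.100.7", "203.0.113.9"],
    "api.example": ["192.0.2.1", "198.51.100.20"],
}
print(cluster_by_shared_hop(traces))
```

A hop shared by every slow trace (here the first one) points toward a local problem, while a hop shared by only some of them points toward a problem further along that sub-path.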

Conclusion
Tool to detect transient performance anomalies
Working on an implementation for the Chrome browser
Supplements, rather than replaces, existing frameworks, which focus on detecting persistent rather than transient issues

Questions:

Q: what to do when discover a problem?
A: As expertise is often on the client side, use their knowledge to approach the right people. They then have more information than just "page doesn’t work", which is especially important for transient performance problems

Q: How can you make sure measuring doesn't affect performance?
A: definitely want to reduce active overhead, thus probe only when an anomaly is detected. For passive measurement, add additional constraints to minimize resource use, e.g. disable tcpdump if BitTorrent packets are seen.
It is the user's tradeoff whether enabling diagnosis is worth the overhead

-------
MARS: Measurement-based Allocation of VM Resources for Cloud Data Centers
Authors: Chiwook Jeong (Gwangju Institute of Science and Technology (GIST)), Taejin Ha (Gwangju Institute of Science and Technology (GIST)), Jaeseon Hwang (Gwangju Institute of Science and Technology (GIST)), Hyuk Lim (Gwangju Institute of Science and Technology (GIST)), JongWon Kim (Gwangju Institute of Science and Technology (GIST))
Presenter: Chiwook Jeong

Cloud data centers
Conventional resource allocation
Equal resource allocation -> can be imbalanced if resource demands differ -> performance degradation

Equal utilization allocation
- Service performance is not equal to utilization -> can degrade user experience

Utilization is not equal to user experience
High usage rate of VM resources doesn’t always mean low service performance
Propose approach that directly measures the service performance rather than the usage rate of VM resources

Experimental environment for cloud computing
KVM, OpenStack, Open vSwitch
top, virt-top, weighttp
To measure CPU, memory, and network

MARS 1: find the worst-performing VM
VM with longest response time = worst performance = needs more resources

MARS 2: identify the over-utilized resource
Most heavily used resource on that VM = over-utilized = needs more of that resource

MARS 3: reallocate the resource
The under-utilized resources of other VMs are reallocated to the VM with the worst performance
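
The three steps can be sketched as one reallocation round. A minimal sketch under stated assumptions: the dictionary layout, picking a single donor VM, and moving a fixed share of the donor's allocation are all hypothetical choices; the notes do not say how much is moved or from how many VMs.

```python
def mars_step(vms, share=0.25):
    """One round: vms maps name -> {"response_ms", "util", "alloc"}.

    "util" maps resource -> utilization in [0, 1]; "alloc" maps resource ->
    allocated units. Mutates vms and returns (worst, resource, donor, amount).
    """
    # MARS 1: the VM with the longest measured response time.
    worst = max(vms, key=lambda v: vms[v]["response_ms"])
    # MARS 2: its most heavily utilized resource.
    resource = max(vms[worst]["util"], key=vms[worst]["util"].get)
    # MARS 3: take from the VM that uses this resource the least.
    donors = [v for v in vms if v != worst]
    donor = min(donors, key=lambda v: vms[v]["util"][resource])
    amount = vms[donor]["alloc"][resource] * share
    vms[donor]["alloc"][resource] -= amount
    vms[worst]["alloc"][resource] += amount
    return worst, resource, donor, amount


# Made-up measurements for three VMs.
vms = {
    "a": {"response_ms": 120.0, "util": {"cpu": 0.95, "mem": 0.40},
          "alloc": {"cpu": 4.0, "mem": 4.0}},
    "b": {"response_ms": 30.0, "util": {"cpu": 0.20, "mem": 0.50},
          "alloc": {"cpu": 4.0, "mem": 4.0}},
    "c": {"response_ms": 45.0, "util": {"cpu": 0.60, "mem": 0.30},
          "alloc": {"cpu": 4.0, "mem": 4.0}},
}
print(mars_step(vms))
```

Note that the trigger is the measured response time, not utilization: utilization is only consulted afterwards, to decide which resource to move.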

Experimental results
Found improvement of 21.5% in average response time

Measurement based resource allocation strategy proposed for more efficient resource allocation
Future work:
Consider storage resource as well as CPU, memory and network bandwidth
Extend MARS to the VM consolidation problem

Tuesday, December 10, 2013

CoNEXT 2013 Student Workshop: Datacenter — Measurement Session
