Thursday, December 15, 2016

CoNEXT 2016: Session 9: GRETEL: Lightweight Fault Localization for OpenStack

GRETEL: Lightweight Fault Localization for OpenStack
Ayush Goel, Sukrit Kalra and Mohan Dhawan, IBM Research, New Delhi, India
Link to paper: 
Blog notes by Ayon Chakraborty

Cloud Management Systems (CMS, e.g., Apache CloudStack, VMware vSphere, OpenStack) are inherently complex distributed systems that intelligently orchestrate tasks like instance creation, deletion or migration. Typically their components communicate via REST/RPC based mechanisms. Performance problems are very common in such large scale systems, and the admin needs to detect and fix such problems/errors as early as possible. However, there is a disconnect between the system fault that actually occurred and the description of the fault that is propagated to the user via some user interface/dashboard.

The authors propose a system, GRETEL, to localize such faults in a CMS like OpenStack. There are several challenges:
  1.  OpenStack is a closed system. Operations continue to completion unless acted upon by external factors. Faults are a manifestation of perturbations of system state on the physical node on which an OpenStack component executes.
  2.  OpenStack exports a finite set of REST/RPC APIs, which correspond to a finite set of possible high level administrative tasks.
The following is the approach taken by GRETEL. It handles 3 key issues.
  1. Correct fingerprinting of OpenStack operations. This is somewhat like a training phase in which the system sees the sequence (packet stream) of REST and RPC calls on a packet by packet basis in order to fingerprint the operation being executed. GRETEL analyzes such a packet stream, identifies a particular packet as faulty, and tries to locate the source and destination of that faulty packet.
  2. Lightweight fault detection in real time. Operational errors manifest in API return values, so regular expressions can be used to detect such values, for example standard or OpenStack-specified error codes. Performance errors include inordinate API latencies, which can be detected from the REST/RPC timestamps with the help of online anomaly detection algorithms.
  3. Precise operation identification upon fault detection.
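
To make the three steps above concrete, here is a minimal Python sketch. It is not GRETEL's implementation. The `Message` record, the error pattern and the two fingerprints are hypothetical and chosen only for illustration, and the sketch assumes the stream belongs to a single operation, whereas real traffic interleaves many operations.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record for one intercepted REST/RPC message."""
    src: str      # sending component
    dst: str      # receiving component
    call: str     # REST or RPC call label
    status: str   # return value seen on the wire


# Assumed error pattern: HTTP 4xx/5xx codes or an error/exception keyword.
ERROR_RE = re.compile(r"\b[45]\d{2}\b|error|exception", re.IGNORECASE)

# Hypothetical fingerprints learned offline: operation -> ordered call sequence.
FINGERPRINTS = {
    "create_instance": ["POST /servers", "rpc:build_and_run_instance", "POST /ports"],
    "delete_instance": ["DELETE /servers", "rpc:terminate_instance", "DELETE /ports"],
}


def is_prefix_match(observed, fingerprint):
    """True if the observed calls are a prefix of the fingerprint."""
    return observed == fingerprint[:len(observed)]


def localize(stream):
    """Return (faulty message, candidate operations) for the first fault, or None."""
    seen = []
    for msg in stream:
        seen.append(msg.call)
        if ERROR_RE.search(msg.status):
            ops = sorted(op for op, fp in FINGERPRINTS.items()
                         if is_prefix_match(seen, fp))
            return msg, ops
    return None
```

The faulty message carries its own source and destination, so the same lookup that names the operation also says between which two components the fault surfaced.
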
Implementation of GRETEL for OpenStack Liberty:
A) Network Monitoring via Bro/Broccoli:
Intercept OpenStack REST/RPC messages.
RabbitMQ Analyzer for Bro (~60+ LOC C++)
B) System State Monitoring
collectd snapshots
C) Performance Anomalies
R tsoutliers (LS mode)
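
As a rough illustration of what a level shift (LS) in API latencies looks like, here is a simplified Python sketch. It is a stand-in, not the algorithm used by R's tsoutliers and not GRETEL's code. The function name, window size and threshold are assumptions.

```python
from statistics import mean, pstdev


def find_level_shift(latencies, window=5, threshold=3.0):
    """Return the index where the mean latency jumps the most, or None.

    A jump counts only if the difference between the mean of the next
    `window` samples and the mean of the previous `window` samples exceeds
    `threshold` times the standard deviation of the previous window.
    """
    best_index, best_jump = None, 0.0
    for i in range(window, len(latencies) - window + 1):
        before = latencies[i - window:i]
        after = latencies[i:i + window]
        jump = abs(mean(after) - mean(before))
        # Small floor so a perfectly flat window does not give zero noise.
        noise = pstdev(before) or 1e-9
        if jump > threshold * noise and jump > best_jump:
            best_index, best_jump = i, jump
    return best_index
```

For example, a latency series that hovers around 11 ms and then settles around 30 ms is reported at the first sample of the new level, while a series that stays around 11 ms yields None.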

The evaluation used a physical testbed with a three-tiered datacenter topology of 14 switches, 7 servers and 3 compute nodes. The performance metrics include GRETEL's accuracy and precision.

Overall, GRETEL is a fast fault detection and diagnosis system for OpenStack. It leverages non-intrusive system monitoring. It uses fingerprints to quickly identify faulty operations at runtime. It systematically combines different system states to identify the root cause of operational and performance faults in OpenStack. GRETEL provides good performance under stress conditions.

1. How sensitive is GRETEL to the OpenStack model? How sensitive is it to plugins, or to changes in the standard REST APIs?
-- You might need to re-train it for the new flows. The same holds for plugins: if new operations are introduced as plugins, you also need to re-train it.

2. How does it compare to the intrusive approach?
-- Intrusive approaches need source code modifications. GRETEL wants to do away with that.