Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS

Singh, Vikramank; Song, Zhao; Narayanaswamy, Balakrishnan (Murali); Vaidya, Kapil Eknath; Kraska, Tim

dc.contributor.author	Singh, Vikramank
dc.contributor.author	Song, Zhao
dc.contributor.author	Narayanaswamy, Balakrishnan (Murali)
dc.contributor.author	Vaidya, Kapil Eknath
dc.contributor.author	Kraska, Tim
dc.date.accessioned	2024-12-19T17:16:42Z
dc.date.available	2024-12-19T17:16:42Z
dc.date.issued	2024-11-20
dc.identifier.isbn	979-8-4007-1286-9
dc.identifier.uri	https://hdl.handle.net/1721.1/157897
dc.description	SoCC ’24, November 20–22, 2024, Redmond, WA, USA	en_US
dc.description.abstract	Database performance troubleshooting is a complex multi-step process that broadly involves three key stages: (a) Detection: determining what's wrong and when; (b) Root Cause Analysis (RCA): reasoning about why the performance is poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they hardly work in the real world at scale. First, real-world customer workloads are noisy, non-stationary, and quasi-periodic in nature, rendering traditional detectors ineffective. Second, real-world production databases execute a highly diverse set of queries that skew the database statistics into long-tail distributions, causing traditional RCA methods to fail. Third, these databases typically execute millions of such diverse queries every minute, rendering traditional methods inefficient when deployed at scale. In this paper we describe Vista, a machine learning based performance troubleshooting framework for databases, and dive deep into how it addresses the three real-world problems outlined above. Vista deploys a deep auto-regressive model trained on a large and diverse Amazon Relational Database Service (RDS) fleet, with custom skip connections and periodicity alignment features to model long-range and varying periodicity in customer workloads, and detects performance bottlenecks in the form of outliers. Furthermore, it efficiently filters only the few dominating SQL queries from the millions in a problematic workload, and uses a robust causal inference framework to identify the culprit queries and their statistics, leading to low false-positive and false-negative rates. Currently, Vista runs on hundreds of thousands of RDS databases and analyzes millions of workloads every day, bringing down the troubleshooting time for RDS customers from hours to seconds. Finally, we also describe several challenges and learnings from implementing and deploying Vista at Amazon scale.	en_US
dc.publisher	ACM|ACM Symposium on Cloud Computing	en_US
dc.relation.isversionof	https://doi.org/10.1145/3698038.3698519	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.source	Association for Computing Machinery	en_US
dc.title	Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS	en_US
dc.type	Article	en_US
dc.identifier.citation	Singh, Vikramank, Song, Zhao, Narayanaswamy, Balakrishnan (Murali), Vaidya, Kapil Eknath and Kraska, Tim. 2024. "Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS."
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.identifier.mitlicense	PUBLISHER_CC
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2024-12-01T08:54:23Z
dc.language.rfc3066	en
dc.rights.holder	The author(s)
dspace.date.submission	2024-12-01T08:54:24Z
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: license_rdf
Size: 40 bytes
Format: application/rdf+xml

Name: 3698038.3698519.pdf
Size: 14.48Mb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
