
dc.contributor.author: Singh, Vikramank
dc.contributor.author: Song, Zhao
dc.contributor.author: Narayanaswamy, Balakrishnan (Murali)
dc.contributor.author: Vaidya, Kapil Eknath
dc.contributor.author: Kraska, Tim
dc.date.accessioned: 2024-12-19T17:16:42Z
dc.date.available: 2024-12-19T17:16:42Z
dc.date.issued: 2024-11-20
dc.identifier.isbn: 979-8-4007-1286-9
dc.identifier.uri: https://hdl.handle.net/1721.1/157897
dc.description: SoCC ’24, November 20–22, 2024, Redmond, WA, USA [en_US]
dc.description.abstract: Database performance troubleshooting is a complex multi-step process that broadly involves three key stages: (a) Detection: determining what is wrong and when; (b) Root Cause Analysis (RCA): reasoning about why the performance is poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they hardly work in the real world at scale. First, real-world customer workloads are noisy, non-stationary and quasi-periodic in nature, rendering traditional detectors ineffective. Second, real-world production databases execute a highly diverse set of queries that skew the database statistics into long-tail distributions, causing traditional RCA methods to fail. Third, these databases typically execute millions of such diverse queries every minute, rendering traditional methods inefficient when deployed at scale. In this paper we describe Vista, a machine learning based performance troubleshooting framework for databases, and dive deep into how it addresses the three real-world problems outlined above. Vista deploys a deep auto-regressive model, trained on a large and diverse Amazon Relational Database Service (RDS) fleet, with custom skip connections and periodicity alignment features to model long-range and varying periodicity in customer workloads, and detects performance bottlenecks in the form of outliers. Furthermore, it efficiently filters only the top few dominating SQL queries from millions in a problematic workload, and uses a robust causal inference framework to identify the culprit queries and their statistics, leading to low false-positive and false-negative rates. Currently, Vista runs on hundreds of thousands of RDS databases and analyzes millions of workloads every day, bringing down the troubleshooting time for RDS customers from hours to seconds. Finally, we describe several challenges and learnings from implementing and deploying Vista at Amazon scale. [en_US]
dc.publisher: ACM|ACM Symposium on Cloud Computing [en_US]
dc.relation.isversionof: https://doi.org/10.1145/3698038.3698519 [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.source: Association for Computing Machinery [en_US]
dc.title: Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Singh, Vikramank, Song, Zhao, Narayanaswamy, Balakrishnan (Murali), Vaidya, Kapil Eknath and Kraska, Tim. 2024. "Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS."
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2024-12-01T08:54:23Z
dc.language.rfc3066: en
dc.rights.holder: The author(s)
dspace.date.submission: 2024-12-01T08:54:24Z
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

