Statistical Approach to Monitoring Session ID: 1246
Prepared by: Harris Baskaran Database Engineer , Google
1
Oracle at Google! ● 99.5% uptime systems for internal systems. ● Lots of Oracle databases, few MS-SQL and mySQL ● Some large complex databases and many small databases.
2
About me ●
Engineer for 10 years, DBA for last 9 years
●
http://dontbouncethatdb.blogspot.com/
●
https://www.linkedin.com/in/harrisbaskaran
●
[email protected] ●
I love engineering!
●
Been with Google , New York for the last 2 years
3
98% of all statistics are made up. The other 6% are a result of math errors
… this 100% true 4
This talk ● How do you set monitoring thresholds? ● Thinking of database as a snowflake ● Problem and how can statistics help with the solution ● Case study in action ● Learnings ● The future! 5
How do you set monitoring thresholds?
6
Thresholds
Copyright 2016 Google Inc. All Rights Reserved.
Time 7
Thresholds
Copyright 2016 Google Inc. All Rights Reserved.
Time 8
Is a hard threshold really the answer?
9
Adaptive thresholds
10
How big is your problem? • You need to set the thresholds for your databases – Thousands of databases; that makes it thousands of
metrics to modify thresholds for – Upper bound of each system is different – Each database behaves differently
11
Database as a snowflake
12
Database as a snowflake Each database is unique (except physical standbys; if they are not then something is wrong) Some have repeating cycles(month end systems) Some are always chatty Some rarely used
13
Which performance metric? • Average Active Sessions – Number of sessions in the database working or waiting. – DBTime/Elapsed time – Metric can only convey there is a problem
only the Wait model analysis can tell where the problem is at.
14
Scenario I - The Quiet Database Utilization is stable and within system limits. As a DBA; I want to know everytime it goes any higher than usual.
Copyright 2016 Google Inc. All Rights Reserved.
Time 15
Scenario II - The Juggernaut database Noisy databases.
Copyright 2016 Google Inc. All Rights Reserved.
Time 16
Scenario III - The Blue Moon database Quiet most of the time, until it is used
Copyright 2016 Google Inc. All Rights Reserved.
Time 17
Scenario IV - Deja vu databases Repeating pattern of performance profile
Copyright 2016 Google Inc. All Rights Reserved.
Time 18
Statistics and how it can help with this problem
19
Problem definition • Try to get figure out a model where DBAs can get notified when the performance of the database is out of the “ordinary”. • “Ordinary” varies from database to database.
20
Distribution • What is the distribution of the data? – It is basically the frequency of a value occurring in time series • Distribution fitting • Next three slides of math (deep breath….)
21
Example of distribution Time
Metric value Relative freq
Z score
2016-01-01 6:00 AM
26 (26-38.8) / 11.4
-1.12
2016-01-01 8:00 AM
33 (33-38.8) / 11.4
-0.51
2016-01-01 9:00 AM
65 (65-38.8) / 11.4
2.3
2016-01-01 10:00 AM
28 (28-38.8) / 11.5
2016-01-01 11:00 AM
34 (34-38.8) / 11.5
2016-01-01 12:00 PM
55
2016-01-01 1:00 PM
25
2016-01-01 2:00 PM
44
Mean
38.8
std-dev
11.4
2016-01-01 11:00 PM
45
2016-01-02 12:00 AM
32
2016-01-02 1:00 AM
28
2016-01-02 2:00 AM
34
22
Normal distribution
z-Score: -1.2 Value : 26
z-Score: 2.3 Value : 65
23
Not always the case!
user:Tayste / Wikimedia Commons / Public Domain 24
Chebyshev to the rescue!
OUTLIER!
25
Detecting anomalies! Leaving the crazy math out Given a metric(like AAS) we plot the frequency of values. Now that we know the mean and variance and using Chebyshev’s Inequality ● mean + 3.2 * stand deviation would give you 90% of the values and ● mean + 4.2 * stand deviation would give you 95% of the values. 26
Case study in action
27
How to set it up? Approach I : Inside the database create table monitoring_thresholds( inst_id number, mean number, stddev number, value number, begin_time date); insert into monitoring_thresholds select inst_id, avg(value), stddev(value), max(value) keep (dense_rank first order by begin_time desc) as val, max(begin_time) keep (dense_rank first order by begin_time desc) from GV$sysmetric_history where metric_name='Average Active Sessions' and begin_time > sysdate-1 group by inst_id
28
Metric extensions
29
When to alert Setup the Metric extension with the following SQL Select max(a.value) - b.mean +(3.2*b.stddev) from GV$sysmetric_history a, monitoring_thresholds b where metric_name='Average Active Sessions' Where a.begin_time > b.begin_time B.begin_time = (select max(c.begin_time) from Monitoring_thresholds c)
Alert when value is greater than 0. 30
Quiet database
Copyright 2016 Google Inc. All Rights Reserved.
Time 31
Threshold auto adjusts
Copyright 2016 Google Inc. All Rights Reserved.
Time 32
Jaggernaut
Copyright 2016 Google Inc. All Rights Reserved.
Time 33
Blue moon database
Copyright 2016 Google Inc. All Rights Reserved.
Time 34
Deja vu
Copyright 2016 Google Inc. All Rights Reserved.
Time 35
Learnings - what we learnt after an year
36
Learnings 1. We have to add a lower bound threshold for our alert notification 2. We detect and alert both on the 90% breach and 95% breach. 90% helps us be proactive. 3. Tuning the duration for capture controls the sensitivity of the thresholds. Most databases 3 days works well. 4. Don’t alert for every small spike, set a time to alert clause
37
Learnings (contd) 5. AAS is too broad. Does not characteristically model a database too well. - Model the by main wait class - Use AAS as a secondary catch all
38
Copyright 2016 Google Inc. All Rights Reserved.
39
Failings
Copyright 2016 Google Inc. All Rights Reserved.
Time
1. Flaky load are difficult to catch! Unless the load stays high! 2. Also the time to alert matters. 3. Duration of mean and variance controls sensitivity. 4. Deja vu databases 40
Where to from here?
41
Modelling the Database signature
Copyright 2016 Google Inc. All Rights Reserved. 42
Summary ● This is one approach, not the ultimate solution. ● Adaptive threshold works better than static thresholds ● With so many metrics, AAS is a good start ● Monitoring thresholds are VERY important. ○ When to wake up DBAs ○ Are things out of the ordinary
43
Where else statistics used by DBAs - Regression based forecasts of growth estimates for databases - Other system performance analysis
44
Thank you.
Please take a minute to provide session feedback.
45