Statistical Approach to Monitoring 1246

Report 2 Downloads 75 Views
Statistical Approach to Monitoring Session ID: 1246

Prepared by: Harris Baskaran Database Engineer , Google

1

Oracle at Google! ● 99.5% uptime systems for internal systems. ● Lots of Oracle databases, few MS-SQL and mySQL ● Some large complex databases and many small databases.

2

About me ●

Engineer for 10 years, DBA for last 9 years



http://dontbouncethatdb.blogspot.com/



https://www.linkedin.com/in/harrisbaskaran



[email protected]



I love engineering!



Been with Google , New York for the last 2 years

3

98% of all statistics are made up. The other 6% are a result of math errors

… this 100% true 4

This talk ● How do you set monitoring thresholds? ● Thinking of database as a snowflake ● Problem and how can statistics help with the solution ● Case study in action ● Learnings ● The future! 5

How do you set monitoring thresholds?

6

Thresholds

Copyright 2016 Google Inc. All Rights Reserved.

Time 7

Thresholds

Copyright 2016 Google Inc. All Rights Reserved.

Time 8

Is a hard threshold really the answer?

9

Adaptive thresholds

10

How big is your problem? • You need to set the thresholds for your databases – Thousands of databases; that makes it thousands of

metrics to modify thresholds for – Upper bound of each system is different – Each database behaves differently

11

Database as a snowflake

12

Database as a snowflake Each database is unique (except physical standbys; if they are not then something is wrong) Some have repeating cycles(month end systems) Some are always chatty Some rarely used

13

Which performance metric? • Average Active Sessions – Number of sessions in the database working or waiting. – DBTime/Elapsed time – Metric can only convey there is a problem

only the Wait model analysis can tell where the problem is at.

14

Scenario I - The Quiet Database Utilization is stable and within system limits. As a DBA; I want to know everytime it goes any higher than usual.

Copyright 2016 Google Inc. All Rights Reserved.

Time 15

Scenario II - The Juggernaut database Noisy databases.

Copyright 2016 Google Inc. All Rights Reserved.

Time 16

Scenario III - The Blue Moon database Quiet most of the time, until it is used

Copyright 2016 Google Inc. All Rights Reserved.

Time 17

Scenario IV - Deja vu databases Repeating pattern of performance profile

Copyright 2016 Google Inc. All Rights Reserved.

Time 18

Statistics and how it can help with this problem

19

Problem definition • Try to get figure out a model where DBAs can get notified when the performance of the database is out of the “ordinary”. • “Ordinary” varies from database to database.

20

Distribution • What is the distribution of the data? – It is basically the frequency of a value occurring in time series • Distribution fitting • Next three slides of math (deep breath….)

21

Example of distribution Time

Metric value Relative freq

Z score

2016-01-01 6:00 AM

26 (26-38.8) / 11.4

-1.12

2016-01-01 8:00 AM

33 (33-38.8) / 11.4

-0.51

2016-01-01 9:00 AM

65 (65-38.8) / 11.4

2.3

2016-01-01 10:00 AM

28 (28-38.8) / 11.5

2016-01-01 11:00 AM

34 (34-38.8) / 11.5

2016-01-01 12:00 PM

55

2016-01-01 1:00 PM

25

2016-01-01 2:00 PM

44

Mean

38.8

std-dev

11.4

2016-01-01 11:00 PM

45

2016-01-02 12:00 AM

32

2016-01-02 1:00 AM

28

2016-01-02 2:00 AM

34

22

Normal distribution

z-Score: -1.2 Value : 26

z-Score: 2.3 Value : 65

23

Not always the case!

user:Tayste / Wikimedia Commons / Public Domain 24

Chebyshev to the rescue!

OUTLIER!

25

Detecting anomalies! Leaving the crazy math out Given a metric(like AAS) we plot the frequency of values. Now that we know the mean and variance and using Chebyshev’s Inequality ● mean + 3.2 * stand deviation would give you 90% of the values and ● mean + 4.2 * stand deviation would give you 95% of the values. 26

Case study in action

27

How to set it up? Approach I : Inside the database create table monitoring_thresholds( inst_id number, mean number, stddev number, value number, begin_time date); insert into monitoring_thresholds select inst_id, avg(value), stddev(value), max(value) keep (dense_rank first order by begin_time desc) as val, max(begin_time) keep (dense_rank first order by begin_time desc) from GV$sysmetric_history where metric_name='Average Active Sessions' and begin_time > sysdate-1 group by inst_id

28

Metric extensions

29

When to alert Setup the Metric extension with the following SQL Select max(a.value) - b.mean +(3.2*b.stddev) from GV$sysmetric_history a, monitoring_thresholds b where metric_name='Average Active Sessions' Where a.begin_time > b.begin_time B.begin_time = (select max(c.begin_time) from Monitoring_thresholds c)

Alert when value is greater than 0. 30

Quiet database

Copyright 2016 Google Inc. All Rights Reserved.

Time 31

Threshold auto adjusts

Copyright 2016 Google Inc. All Rights Reserved.

Time 32

Jaggernaut

Copyright 2016 Google Inc. All Rights Reserved.

Time 33

Blue moon database

Copyright 2016 Google Inc. All Rights Reserved.

Time 34

Deja vu

Copyright 2016 Google Inc. All Rights Reserved.

Time 35

Learnings - what we learnt after an year

36

Learnings 1. We have to add a lower bound threshold for our alert notification 2. We detect and alert both on the 90% breach and 95% breach. 90% helps us be proactive. 3. Tuning the duration for capture controls the sensitivity of the thresholds. Most databases 3 days works well. 4. Don’t alert for every small spike, set a time to alert clause

37

Learnings (contd) 5. AAS is too broad. Does not characteristically model a database too well. - Model the by main wait class - Use AAS as a secondary catch all

38

Copyright 2016 Google Inc. All Rights Reserved.

39

Failings

Copyright 2016 Google Inc. All Rights Reserved.

Time

1. Flaky load are difficult to catch! Unless the load stays high! 2. Also the time to alert matters. 3. Duration of mean and variance controls sensitivity. 4. Deja vu databases 40

Where to from here?

41

Modelling the Database signature

Copyright 2016 Google Inc. All Rights Reserved. 42

Summary ● This is one approach, not the ultimate solution. ● Adaptive threshold works better than static thresholds ● With so many metrics, AAS is a good start ● Monitoring thresholds are VERY important. ○ When to wake up DBAs ○ Are things out of the ordinary

43

Where else statistics used by DBAs - Regression based forecasts of growth estimates for databases - Other system performance analysis

44

Thank you.

Please take a minute to provide session feedback.

45