Week 3 - Numerical Descriptive Measures
(Key question: what does a set of numerical data mean?)

Central tendency is the extent to which the data values group together around a typical or central value.

Measures of central tendency
a. The mean, or average: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$. =AVERAGE(x1,x2,...,xn)
   i. Σ means ‘sum the following expression, from i=1, i=2, … up to i=n.’
b. The median, which is the middle-position value (or average of middle-position values) in an ordered array. The median is not sensitive to extreme outliers. =MEDIAN(x1,x2,...,xn)
c. The mode, which is the most frequently observed value. The mode is usually reported for discrete or categorical data only. (Shown far right.)
d. The geometric mean, which is often used to measure the rate of change of a variable over time: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$. =GEOMEAN(x1,x2,...,xn)
   i. We use the geometric mean rate of return to measure the average rate of return of an investment over time: $\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
      http://www.financeformulas.net/Geometric_Mean_Return.html
   ii. Example:
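
As a rough illustration of the measures of central tendency listed above, here is a minimal Python sketch. The data values and the two yearly returns are hypothetical and are not taken from these notes; the helper name geometric_mean_return is also made up for this sketch. It assumes Python 3.8 or later for the standard-library functions used.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration (not from the notes).
data = [2, 3, 3, 5, 7, 8, 14]

mean = statistics.mean(data)                # counterpart of =AVERAGE(...)
median = statistics.median(data)            # counterpart of =MEDIAN(...)
modes = statistics.multimode(data)          # list of the most frequent value(s)
geo_mean = statistics.geometric_mean(data)  # counterpart of =GEOMEAN(...)


def geometric_mean_return(returns):
    """Geometric mean rate of return: [(1+R1)(1+R2)...(1+Rn)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns: +50% in year 1, then -50% in year 2.
returns = [0.50, -0.50]
arithmetic_return = statistics.mean(returns)
geometric_return = geometric_mean_return(returns)

print(f"mean={mean}, median={median}, modes={modes}, geometric mean={geo_mean:.3f}")
print(f"arithmetic return={arithmetic_return:.2%}, geometric return={geometric_return:.2%}")
```

With these made-up returns the arithmetic mean return is 0%, yet the investment ends at 1.5 × 0.5 = 0.75 of its starting value. The geometric mean return (about −13.4% per year) reflects that loss, which is why it is preferred for rates of change over time.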
Variation is the amount of dispersion of values around the central value.

Measures of variation
a. The range, which is the distance between the largest value =MAX(array) and the smallest value =MIN(array).
b. The sample variance, which measures how far the values in a sample are spread out around the mean: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$. =VAR.S(x1,x2,...,xn)
   i. By dividing by n − 1 and not n, we recognise that we’re working with a sample and not the whole population.
c. The sample standard deviation, which shows variation about the mean: $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$. =STDEV.S(x1,x2,...,xn)
   i. The more the data are spread out, the greater the range, $S^2$ and $S$.
d. The coefficient of variation, which represents the ratio of the standard deviation to the mean, i.e. the amount of variation relative to the mean: $CV = \frac{S}{\bar{X}} \times 100\%$
e. The z-score for each data point, which is the number of standard deviations a data value is from the mean: $z = \frac{X - \bar{X}}{S}$
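
The measures of variation above can be sketched in Python with the standard library. The sample values are hypothetical and chosen only so the arithmetic is easy to follow; the variable names are likewise made up for this sketch.

```python
import statistics

# Hypothetical sample, chosen only for illustration (not from the notes).
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)    # largest value minus smallest value
mean = statistics.mean(data)
variance = statistics.variance(data)  # divides by n - 1, like =VAR.S(...)
std_dev = statistics.stdev(data)      # square root of the sample variance, like =STDEV.S(...)
coeff_var = std_dev / mean * 100      # coefficient of variation, in percent

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / std_dev for x in data]

print(f"range={data_range}, S^2={variance:.3f}, S={std_dev:.3f}, CV={coeff_var:.1f}%")
print([round(z, 2) for z in z_scores])
```

Here the squared deviations from the mean of 5 sum to 32, so the sample variance is 32 / 7. A value equal to the mean has a z-score of 0, and values below the mean have negative z-scores.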
Shape is the pattern in the distribution of values.

Measures of shape
a. Skewness measures the amount of asymmetry in a distribution.
b. Kurtosis measures the concentration of values in the centre of a distribution, as compared with the tails.

Data analysis in Excel yields numerical values that describe central tendency, variation and shape.
1. Select the Data tab.
2. Select Data Analysis.
3. Select Descriptive Statistics and click OK.
4. Enter the cell range.
5. Check the Summary Statistics box and click OK.

Quartiles, interquartile range and boxplots
Quartiles split ordered data into 4 segments, with an equal number of values per segment. If quartiles come between values, an average is taken. =QUARTILE(array, quartile no.)
Q1 = first quartile
Q2 = second quartile / median
Q3 = third quartile
The interquartile range = Q3 - Q1. It is a measure of variability and is resistant to outliers. Five number summary = (min, Q1, median, Q3, max).
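
The quartiles, interquartile range and five number summary can be sketched in Python as follows. The data are hypothetical, and the choice of method="inclusive" is an assumption of this sketch: quartile conventions differ between tools, so the values may not match =QUARTILE for every data set.

```python
import statistics

# Hypothetical sample, chosen only for illustration (not from the notes).
data = [1, 3, 4, 6, 8, 9, 12]

# method="inclusive" treats the smallest and largest values as the 0th and
# 100th percentiles and interpolates linearly between neighbouring values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
five_number_summary = (min(data), q1, q2, q3, max(data))

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"five number summary={five_number_summary}")
```

Replacing the largest value, 12, with a much larger number changes the maximum and the range but leaves Q1, Q3 and the interquartile range unchanged, which is what "resistant to outliers" means in practice.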
A boxplot graphically displays data with respect to the five number summary. The length of the boxplot and size of the box will vary with central tendency, variation and shape.
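
Shape, which a boxplot reflects visually, can also be put into numbers. The sketch below uses the simple moment-based definitions of skewness and excess kurtosis. This is an assumption of the sketch: Excel's Descriptive Statistics output applies small-sample adjustments, so its values will differ somewhat. The function names and data are hypothetical.

```python
def central_moment(data, order):
    """Average of (x - mean)**order over all values."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth central moment scaled by the variance squared, minus 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


# Hypothetical samples, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(f"symmetric:    skewness={skewness(symmetric):.3f}")
print(f"right-skewed: skewness={skewness(right_skewed):.3f}")
print(f"symmetric:    excess kurtosis={excess_kurtosis(symmetric):.3f}")
```

A skewness near zero suggests a symmetric distribution, while a positive value indicates a longer right tail. Subtracting 3 makes the kurtosis of a normal distribution equal to zero, so negative values indicate a flatter shape than the bell curve.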
Numerical descriptive measures for a population
Q: How is this different from a sample?
Population measures are parameters computed from every value in the population: the mean is μ and the standard deviation is σ, with the variance divided by N rather than n − 1.

When the data follow a bell-shaped distribution, the empirical rule applies:
● ~68% of the data are within 1 std. dev. of the mean, i.e. μ ± 1σ
● ~95% of the data are within 2 std. dev. of the mean, i.e. μ ± 2σ
● ~99.7% of the data are within 3 std. dev. of the mean, i.e. μ ± 3σ

According to Chebyshev’s Rule, regardless of how the data are distributed, at least $(1 - \frac{1}{k^2}) \times 100\%$ of the values will fall within k std. dev. of the mean (k > 1).

Finally, numerical descriptive measures should:
● Document both good and bad results
● Be presented in a fair, objective and neutral manner
● Not use inappropriate summary measures to distort facts.
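
Returning to Chebyshev’s Rule from the population section, the guaranteed minimum can be compared with what a data set actually shows. This is a minimal sketch: the function names and the skewed data set are hypothetical, and the data are treated as a whole population, so the population standard deviation is used.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k std. dev. of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule needs k > 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of values within k population std. dev. of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population std. dev. (divides by N)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)


# Hypothetical, clearly non-bell-shaped data with one extreme value.
data = [1, 1, 2, 2, 3, 3, 4, 20]

for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    observed = fraction_within(data, k)
    print(f"k={k}: at least {bound:.1%} guaranteed, {observed:.1%} observed")
```

For k = 2 the rule guarantees at least 75%, which is weaker than the empirical rule's ~95% because it makes no assumption about the shape of the distribution.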