Basic Statistics
[1] Sampling, Standard Errors, Distribution, Mean, Median, Standard Deviation, Correlation, Mean Test, Median Test
[2] Griffin, John M., Patrick J. Kelly and Federico Nardari, 2010, Do Market Efficiency Measures Yield Correct Inferences? A Comparison of Developed and Emerging Markets, Review of Financial Studies 23(8), 3225‐3277.
• Lies, Damned Lies and Statistics!
• Econometrics is the application of statistical techniques and analyses to the study of problems and issues in economics
• Sampling: Population vs sample
• Description
• Inference
Assoc. Prof. Nuttawat Visaltanachoti, CFA
Special Project II: MSF Chulalongkorn University
The Nature of Econometrics and Economic Data
• Econometrics vs statistics: lack of controlled experimentation
• Observational or retrospective data
• In contrast to a natural science researcher, a social science researcher is a passive collector of the data
“Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping‐pong balls for turtle’s eggs, and, for Chalifougnac vintage 1883, a can of turpentine.” (Valavanis, 1959, p.83)
Empirical Economic Analysis
1) Careful formulation of the question of interest based on an economic model
– Crime model: Nobel Prize winner Gary Becker
– Job training and worker productivity
2) Econometric model
3) Hypotheses
– Slope coefficient(s) = 0
– Intercept coefficient = 0
Financial Data
• Time series
– Annual, quarterly, monthly, weekly, or daily
– Tick data
– Ordering of the data matters
• Cross‐section
– Ordering of the data does not matter
• Panel
– Has both a time series and a cross‐sectional component
• Qualitative and quantitative
– Binary/category (Yes/No; Good/Bad)

Cross‐Sectional Data
• Random sampling from the underlying population
• Sample selection problem
– Study factors that influence the accumulation of family wealth: wealthier families are less likely to disclose their wealth
• Population is not large enough to reasonably assume the observations are independent draws
– Study factors (e.g. wage rates, energy prices, tax rates, workforce quality) that explain new business activity across states. However, business activity is unlikely to be independent across states
Different Time Periods in CS
Time Series Data
• The chronological ordering of observations in a time series conveys important information
• More difficult to analyze than cross‐sectional data because time series are related to their histories
• Spurious regression
• Data frequency: daily, weekly, monthly, quarterly, annually
• Seasonality
Pooled Cross Sections
• Have both cross‐sectional and time series features
• Increase sample size
• E.g. two cross‐sectional surveys in different years
– Same survey questions
– Both cross sections contain a random sample
• Analyzed much like a standard cross section
• Often need to account for differences in the variables across time
• Allow us to see how a key relation has changed over time
Panel or Longitudinal Data
• Time series for each cross‐sectional member
• The same cross‐sectional units are followed over time
• Having multiple observations on the same units allows for the control of certain unobserved characteristics of individuals or firms
– Stack all time series of each cross‐sectional unit to perform data transformations for each cross‐sectional unit across different years
• Allows us to study the importance of lags in behavior or decision making
Causality and Ceteris Paribus
• Association is suggestive but causality is compelling
• Ceteris paribus, meaning ‘other relevant factors being equal’, plays an important role in causal analysis
• Hold other factors fixed to isolate the link between the factor of interest and the outcome
• Have enough other factors been held fixed to make a case for causality?
• Randomness, independence and correlation

Examples
• Effects of fertilizer on crop yield
• Measuring returns to education
– Non‐experimental (an experiment would be unethical)
– People choose their own level of education, so it cannot be determined independently of all other factors
– Omitted and unobserved factors, e.g. innate ability
• Effect of law enforcement on city crime levels
– Inferring causality
• Expectations Hypothesis
– Prediction from economic model
Asset Return
• Simple return: V_T = V_0 (1 + r)^T
• Consider a single period, T = 1, with V_T = P_{t+1} and V_0 = P_t
• P_{t+1} = P_t (1 + r)
• r = P_{t+1}/P_t − 1 = (P_{t+1} − P_t)/P_t
• Continuously compounded return: V_T = V_0 e^{rT}
• Consider a single period, T = 1
• e^r = V_T/V_0 = P_{t+1}/P_t
• r = ln(P_{t+1}/P_t)

Use cont. return when …
• P_t = 100, P_{t+1} = 200, P_{t+2} = 100
• Return in the first period (from t to t+1):
– r_S = (200 − 100)/100 = 100%
– r_C = ln(200/100) = 69.3%
• Return in the second period (from t+1 to t+2):
– r_S = (100 − 200)/200 = −50%
– r_C = ln(100/200) = −69.3%
• Sum of returns over two periods (from t to t+2):
– r_S = 100% + (−50%) = 50% (misleading: the price ends where it started, so the true two‐period return is 0%)
– r_C = 69.3% + (−69.3%) = 0%

Use simple return when …
• From $40, invest $10 in A and $30 in B
• w_a = 10/(10 + 30) = 25%; w_b = 30/(10 + 30) = 75%
• P_{a,t−1} = 10; P_{a,t} = 20; P_{b,t−1} = 30; P_{b,t} = 30
• Simple return:
– r_a = (20 − 10)/10 = 100%; r_b = (30 − 30)/30 = 0%
– Total simple return = (50 − 40)/40 = 25%
– Or r = w_a r_a + w_b r_b = 25%(100%) + 75%(0%) = 25%
• Log return:
– r_a = ln(20/10) = 69.3%; r_b = ln(30/30) = 0%
– Total log return = ln(50/40) = 22.3%
– Or r = w_a r_a + w_b r_b = 25%(69.3%) + 75%(0%) = 17.3%

Simple return vs Continuous return
• The simple return does not have the additive property over time, but the log return does.
• The portfolio return calculated as the weighted average of stock returns is exact when simple returns are used, but only an approximation when log returns are used.
• Continuously compounded returns can be less than −100%
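The two worked examples on the return slides can be reproduced with a short Python sketch. The function and variable names are illustrative only; the prices and weights are the ones given above.

```python
import math

def simple_return(p0, p1):
    """Simple (arithmetic) return over one period."""
    return p1 / p0 - 1.0

def log_return(p0, p1):
    """Continuously compounded (log) return over one period."""
    return math.log(p1 / p0)

# Time aggregation: P_t = 100, P_{t+1} = 200, P_{t+2} = 100
prices = [100.0, 200.0, 100.0]
pairs = list(zip(prices, prices[1:]))
sum_simple = sum(simple_return(a, b) for a, b in pairs)   # summed simple returns
sum_log = sum(log_return(a, b) for a, b in pairs)         # summed log returns
true_simple = simple_return(prices[0], prices[-1])        # actual two-period return

# Portfolio aggregation: $10 in A and $30 in B
start = {"A": 10.0, "B": 30.0}
end = {"A": 20.0, "B": 30.0}
total = sum(start.values())
weights = {k: v / total for k, v in start.items()}

port_simple_weighted = sum(weights[k] * simple_return(start[k], end[k]) for k in start)
port_simple_exact = simple_return(total, sum(end.values()))
port_log_weighted = sum(weights[k] * log_return(start[k], end[k]) for k in start)
port_log_exact = log_return(total, sum(end.values()))

print(sum_simple, sum_log, true_simple)
print(port_simple_weighted, port_simple_exact)
print(port_log_weighted, port_log_exact)
```

Summed log returns match the true two‐period outcome while summed simple returns do not; the weighted average is exact for simple returns and only approximate for log returns.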
Data Transformation
• Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.
• Standardized = [x − mean(x)] / std(x)
• Log transformation = ln(x)
• Logistic = 1/(1 + exp(−x))
• Reciprocal = 1/x

Box‐Cox Transformation (MLE)
• Family of power transformations
• Ensures that the usual assumptions of the linear model hold
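The slide's own Box‐Cox formula did not survive extraction. As a minimal sketch, assuming the standard definition y(λ) = (y^λ − 1)/λ for λ ≠ 0 and ln(y) for λ = 0 (y > 0), the following picks λ by maximizing the profile log‐likelihood over a grid. The intercept‐only model, the grid and the data are all assumptions made for illustration.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform; requires strictly positive data."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive data")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_loglik(y, lam):
    """Profile log-likelihood (up to a constant) for an intercept-only normal model."""
    z = box_cox(y, lam)
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n          # ML variance estimate
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(var) + jacobian

def best_lambda(y, grid):
    """Grid-search ML estimate of lambda."""
    return max(grid, key=lambda lam: profile_loglik(y, lam))

# Made-up data whose logarithms are evenly spaced and symmetric around zero
y = [math.exp(x / 4.0) for x in range(-8, 9)]
grid = [i / 10.0 for i in range(-20, 21)]
lam_hat = best_lambda(y, grid)
print(lam_hat)
```

In practice the likelihood would be built on regression residuals rather than on deviations from the mean; the grid search itself is unchanged.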
Alternatives to Box‐Cox Transformation
• Manly (1971)
– Allows for negative y
– Commonly used to transform a unimodal skewed distribution to a normal distribution
– Not useful for bimodal or U‐shaped distributions
• John and Draper (1980): Modulus transformation
– Allows for negative y
– Works best for symmetric distributions
– Power transformation may introduce some degree of skewness
Univariate analysis
• Histogram
• Although the mean and standard deviation are powerful descriptive statistics, they are inappropriate in some cases.
• Skewed distributions (with long left‐ or right‐hand tails), truncated distributions, the presence of outliers, and other situations arise in which the mean and SD alone can be quite misleading.

Numerical Summary
• Measures of central tendency
– Mean, median, mode
• Measures of dispersion
– Standard deviation, variance, mean absolute deviation, range, inter‐quartile range
• Tchebychev’s rule: for any population, at least 100(1 − 1/m^2)% of the observations lie within m standard deviations of the mean, for m > 1

| m | 1.5 | 2.0 | 2.5 | 3.0 |
|---|---|---|---|---|
| % of population (at least) | 55.6% | 75% | 84% | 88.9% |
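Tchebychev's bound can be checked numerically. The sketch below computes the lower bound for the tabulated values of m and compares it with the observed share in a small, made‐up skewed sample; the sample and function names are assumptions for illustration.

```python
import statistics

def chebyshev_lower_bound(m):
    """Minimum share of any population within m standard deviations of the mean."""
    if m <= 1:
        raise ValueError("the bound is only informative for m > 1")
    return 1.0 - 1.0 / m ** 2

def share_within(data, m):
    """Observed share of data within m population standard deviations of the mean."""
    mu = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(abs(v - mu) <= m * sd for v in data) / len(data)

sample = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]   # made-up, strongly right-skewed
bounds = {m: chebyshev_lower_bound(m) for m in (1.5, 2.0, 2.5, 3.0)}
shares = {m: share_within(sample, m) for m in bounds}

for m in bounds:
    print(m, round(bounds[m], 3), shares[m])
```

The bound holds for any distribution, which is why it is loose: the observed share is usually well above it.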
Equality Hypothesis (independent)
• One‐sample mean test
• Two‐sample mean test
– Equal sample sizes, equal variances
– Unequal sample sizes, equal variances
– Unequal sample sizes, unequal variances
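The test statistics for these cases were lost from the slides. The sketch below uses the textbook forms: the one‐sample t, the pooled‐variance t (which covers both equal‐variance cases) and Welch's t with the Welch‐Satterthwaite degrees of freedom for unequal variances. The function names and data are made up for illustration.

```python
import math
import statistics

def one_sample_t(x, mu0):
    """t statistic and degrees of freedom for H0: mean(x) = mu0."""
    n = len(x)
    se = statistics.stdev(x) / math.sqrt(n)
    return (statistics.fmean(x) - mu0) / se, n - 1

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances (any sample sizes)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (statistics.fmean(x) - statistics.fmean(y)) / se, n1 + n2 - 2

def welch_t(x, y):
    """Two-sample t statistic allowing unequal variances (Welch)."""
    n1, n2 = len(x), len(y)
    a = statistics.variance(x) / n1
    b = statistics.variance(y) / n2
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

x = [1, 2, 3, 4, 5]        # made-up data
y = [2, 4, 6, 8, 10]
t1, df1 = one_sample_t(x, 2)
tp, dfp = pooled_t(x, y)
tw, dfw = welch_t(x, y)
print(t1, df1, tp, dfp, tw, dfw)
```

With equal sample sizes the pooled and Welch statistics coincide and only the degrees of freedom differ.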
Equality Hypothesis (dependent)
• Used when there is only one sample that has been tested twice (repeated measures), or when there are two samples that have been matched or "paired".
• Paired difference test

Mann‐Whitney Wilcoxon Rank Sum Test
• A non‐parametric test for assessing whether two independent samples of observations come from the same distribution.
• Though it is commonly stated that the MWW test tests for differences in medians, this is not strictly true.
• Rather, the test concerns the chance of obtaining greater observations in one population than in the other.
MWW Calculation
• Arrange all the observations into a single ranked series. That is, rank all the observations without regard to which sample they are in.
• Add up the ranks for the observations which came from sample 1. The sum of ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N + 1)/2, where N is the total number of observations.
• U_1 = R_1 − n_1(n_1 + 1)/2, and similarly for U_2.

U test
• The smaller value of U_1 and U_2 is the one used when consulting significance tables.
• The sum of the two values is given by U_1 + U_2 = R_1 − n_1(n_1 + 1)/2 + R_2 − n_2(n_2 + 1)/2. Knowing that R_1 + R_2 = N(N + 1)/2 and N = n_1 + n_2, and doing some algebra, we find that the sum is U_1 + U_2 = n_1 n_2.
• The maximum value of U is the product of the sample sizes for the two samples. In such a case, the "other" U would be 0.
• Large sample: U is approximately normally distributed
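The ranking steps can be sketched in Python as follows. The large‐sample step uses the usual normal approximation with mean n_1 n_2 / 2 and variance n_1 n_2 (n_1 + n_2 + 1)/12, which is an assumption here because the slide's formula was lost; no tie correction is applied, and the data are made up.

```python
import math

def average_ranks(values):
    """Ranks 1..N over the pooled sample; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(sample1, sample2):
    """Return U1, U2 and the normal-approximation z for the smaller U."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u1 = r1 - n1 * (n1 + 1) / 2.0
    u2 = r2 - n2 * (n2 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)   # no tie correction
    z = (min(u1, u2) - mean_u) / sd_u
    return u1, u2, z

s1 = [1.1, 2.3, 3.0, 4.8]            # made-up data
s2 = [2.9, 5.2, 6.1, 7.4, 8.0]
u1, u2, z = mann_whitney_u(s1, s2)
print(u1, u2, z)
```

For samples this small the exact tables should be consulted; z is shown only to illustrate the large‐sample step.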
Wilcoxon Signed Rank Test (dependent)
• A non‐parametric statistical hypothesis test for the case of two related samples or repeated measurements on a single sample.
• It can be used as an alternative to the paired Student's t‐test when the population cannot be assumed to be normally distributed.
• The Wilcoxon signed rank statistic W+ is computed by ordering the absolute values |Z_1|, ..., |Z_n| and assigning each ordered |Z_i| a rank R_i.
• φ_i = I(Z_i > 0) indicates a positive difference, so W+ = Σ φ_i R_i is the sum of the ranks of the positive differences.

Significance of Wilcoxon’s statistic
• To use this table: compare your obtained value of Wilcoxon's test statistic to the critical value in the table (taking into account N, the number of subjects).
• Your obtained value is statistically significant if it is equal to or SMALLER than the value in the table.

| N | 0.05 | 0.01 |
|---|---|---|
| 10 | 8 | 3 |
| 15 | 25 | 16 |
| 20 | 52 | 38 |
| 25 | 89 | 68 |
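Example: the signed ranks sum to 27 − 18 = 9 (sum of positive ranks W+ = 27, sum of negative ranks W− = 18; subject 5 is dropped because its difference is zero).

| Subject (i) | Xi | Yi | Sign of Xi – Yi | Xi – Yi | Absolute Xi – Yi | Rank of Absolute | Signed Rank |
|---|---|---|---|---|---|---|---|
| 1 | 125 | 110 | + | 15 | 15 | 7 | 7 |
| 2 | 115 | 122 | ‐ | ‐7 | 7 | 3 | ‐3 |
| 3 | 130 | 125 | + | 5 | 5 | 1.5 | 1.5 |
| 4 | 140 | 120 | + | 20 | 20 | 9 | 9 |
| 5 | 140 | 140 | ‐‐‐ | 0 | 0 | ‐‐‐ | ‐‐‐ |
| 6 | 115 | 124 | ‐ | ‐9 | 9 | 4 | ‐4 |
| 7 | 140 | 123 | + | 17 | 17 | 8 | 8 |
| 8 | 125 | 137 | ‐ | ‐12 | 12 | 6 | ‐6 |
| 9 | 140 | 135 | + | 5 | 5 | 1.5 | 1.5 |
| 10 | 135 | 145 | ‐ | ‐10 | 10 | 5 | ‐5 |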
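The rank sums in the example can be recomputed from the Xi and Yi columns. The sketch below drops zero differences and gives tied absolute differences their average rank, as in the table; the function name is illustrative.

```python
def signed_rank_sums(x, y):
    """Return (W+, W-, number of non-zero differences) for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]    # drop zero differences
    abs_sorted = sorted(abs(d) for d in diffs)

    def avg_rank(v):
        first = abs_sorted.index(v) + 1                # rank of first occurrence
        count = abs_sorted.count(v)                    # size of the tie group
        return first + (count - 1) / 2.0

    w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
    return w_plus, w_minus, len(diffs)

x = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
y = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
w_plus, w_minus, n_used = signed_rank_sums(x, y)
print(w_plus, w_minus, n_used, abs(w_plus - w_minus))
```

The two rank sums always add up to n(n + 1)/2 for the n non‐zero differences, which is a quick check on the ranking.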
Correlation
• “The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning.” Stephen Jay Gould
• A standard method for describing the relationship between two variables is to present a bivariate scatter diagram accompanied by summary statistics: the standard deviation (SD) and average of the two variables, the correlation coefficient, and the number of observations.
• Pearson’s correlation: cov(x,y)/[sd(x) sd(y)]
• The absolute value of the correlation coefficient tells us how tightly the data are clustered around the line.
• The sign of the correlation coefficient reveals whether there is a positive or negative linear association between the two variables.
• The correlation coefficient always has a value between –1 and 1.
• Spearman’s rank correlation: d = difference between the ranks of each pair
– Case of no tied ranks
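Both coefficients can be computed directly. The Spearman formula was lost from the slide; the sketch assumes the standard no‐ties form ρ = 1 − 6 Σ d_i^2 / [n(n^2 − 1)]. Function names and data are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation: cov(x, y) / [sd(x) sd(y)]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks_no_ties(values):
    """Ranks 1..n; raises if any value is repeated."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch handles only the no-ties case")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(x, y):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(x)
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y_cubed = [1, 8, 27, 64, 125]      # monotone but non-linear in x
y_mixed = [2, 1, 4, 3, 5]          # two adjacent pairs swapped

r_cubed = pearson(x, y_cubed)
rho_cubed = spearman_no_ties(x, y_cubed)
rho_mixed = spearman_no_ties(x, y_mixed)
print(r_cubed, rho_cubed, rho_mixed)
```

For a monotone but non‐linear relation the rank correlation reaches 1 while Pearson's r stays below 1, since r measures only linear association.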
• Significance test: H0: r = 0
• Student t distribution with n − 2 degrees of freedom
• Fisher transformation: Spearman rank correlation

Limitations of correlation coefficients
• Twice the r does not mean twice as much clustering.
• r = 0.60 does not mean that 60 percent of the points are tightly clustered.
• r says absolutely nothing about the value of the slope of the relationship; r is a measure of the tightness of clustering around a line that may have any slope.
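The significance test for H0: r = 0 mentioned above can be sketched as follows. The formulas are assumed, since the slide's own were lost: t = r √(n − 2) / √(1 − r^2) with n − 2 degrees of freedom, and the Fisher transformation z = atanh(r) with approximate standard error 1/√(n − 3). The values of r and n are made up.

```python
import math
from statistics import NormalDist

def correlation_t_stat(r, n):
    """t statistic for H0: r = 0; compare with Student t on n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2), n - 2

def fisher_z_stat(r, n):
    """Fisher-transformed correlation divided by its approximate standard error."""
    return math.atanh(r) * math.sqrt(n - 3)

def two_sided_normal_p(z):
    """Two-sided p-value under the standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

r, n = 0.6, 27                      # made-up inputs
t_stat, df = correlation_t_stat(r, n)
z_stat = fisher_z_stat(r, n)
p_value = two_sided_normal_p(z_stat)
print(t_stat, df, z_stat, p_value)
```

A p‐value for the t statistic needs the Student t distribution, which the Python standard library does not provide, so only the normal approximation is shown.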
Same Correlation
• Homoskedastic vs heteroskedastic
• The spread of y stays constant as x changes (left‐hand side)
• The spread of y varies as x changes (right‐hand side)

Association is not Causation
• A high correlation coefficient means that two variables are highly associated, but association is not the same as causation.
• The fundamental reason that association is mistaken for causation lies in the notion of confounding.
• A prototypical example of confounding involves three variables: x, y, and z.
• Typically, z is a third confounding variable that is causing or driving both x and y, and thus a simple plot or correlation of x and y will make it seem like x causes y when, in fact, x and y are caused by the underlying missing z.
• The problem with observational studies is that the confounding z variable is often deeply hidden in such subtle ways that the investigator is unaware confounding is present.
Confounding factor
• A quick look at the data shows that students from private schools do better on SATs and in college than those from public schools: there is a correlation between type of school (the x variable) and educational success (the y variable).
• Confounding factors: parental support, family income, nutrition, and student motivation

Ecological Correlation
• An ecological correlation is a correlation coefficient based on grouped or aggregated data.
• The difficulty with ecological correlation is that researchers often are interested in correlation at the individual level but must instead content themselves with exploring correlation at the group level.
• Ecological correlations are typically, though not always, bigger in absolute value than individual‐level correlations.
• Researchers can thus be fooled into thinking they have found something important when the correlation they are really interested in is much smaller.
• Sometimes the problem is even worse: the ecological correlation is of opposite sign to the individual‐level correlation.
• The loss of dispersion within the group tends to generate a stronger correlation at the group level compared with the individual level.

Simpson’s Paradox
• A positive trend appears for two separate groups (blue and red), but a negative trend (black, dashed) appears when the data are combined.
• Samuels, M. L. (1993). “Simpson’s Paradox and Related Phenomena,” Journal of the American Statistical Association 88 (421): 81–88.
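Simpson's paradox is easy to reproduce with constructed numbers. In the sketch below the two groups ("blue" and "red") are invented for illustration and are not the data behind the slide's figure: each group has a perfectly positive within‐group correlation, yet the pooled correlation is negative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented groups: low x with high y (blue), high x with low y (red)
blue_x, blue_y = [1, 2, 3, 4], [6, 7, 8, 9]
red_x, red_y = [6, 7, 8, 9], [1, 2, 3, 4]

r_blue = pearson(blue_x, blue_y)
r_red = pearson(red_x, red_y)
r_pooled = pearson(blue_x + red_x, blue_y + red_y)
print(r_blue, r_red, r_pooled)
```

Group membership acts as the confounding variable here: it shifts both x and y, so ignoring it reverses the sign of the association.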
Hands‐on Example
• Market Efficiency Measures
• Griffin, John M., Patrick J. Kelly and Federico Nardari, 2010, Do Market Efficiency Measures Yield Correct Inferences? A Comparison of Developed and Emerging Markets, Review of Financial Studies 23(8), 3225‐3277.

Common Market Efficiency Measures
• In a fully efficient and frictionless market, actual changes in stock prices are unforecastable
• Autocorrelations
– Assume that informationally efficient prices follow a random walk
• Variance ratios
– |VR − 1| measures relative efficiency
• Delay
• Trading costs
VR
• Lo and MacKinlay (1988, RFS)
• The variance ratio is computed by dividing the variance of returns estimated from longer intervals by the variance of returns estimated from shorter intervals (for the same measurement period), and then normalizing this value to one by dividing it by the ratio of the longer interval to the shorter interval.
• Under the null hypothesis of a random walk with uncorrelated increments, variance ratios (VRs) should equal one at all lags.
• A variance ratio that is greater than one suggests that the return series is positively serially correlated or that the shorter‐interval returns trend within the duration of the longer interval.
• A variance ratio that is less than one suggests that the return series is negatively serially correlated or that the shorter‐interval returns tend toward mean reversion within the duration of the longer interval.
• This approach is advantageous in that if a market consists of stocks with both over‐ and under‐reaction to past returns, then both would be captured.
• To control for autocorrelations induced by microstructure effects, we
– a) focus on results at the weekly frequency;
– b) use screens where stocks are required to trade frequently;
– c) skip a trading day in some results, and in some cases also require that this skipped day contain trading activity, following Mech (1993).

Delay
• An R^2‐based measure of the sensitivity of current returns to past market‐wide information.
• Delay is calculated as the difference in R^2 between an unrestricted market model with four weekly lags and a restricted model with no lags
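The variance ratio described in the VR slides can be sketched as below. This is a simplified version using overlapping q‐period sums of log returns, without the small‐sample bias corrections of Lo and MacKinlay (1988); the function name and both return series are made up for illustration.

```python
import statistics

def variance_ratio(returns, q):
    """Variance of overlapping q-period returns over q times the one-period variance."""
    if q < 2 or len(returns) <= q:
        raise ValueError("need q >= 2 and more than q observations")
    mu = statistics.fmean(returns)
    var_1 = statistics.pvariance(returns)
    q_sums = [sum(returns[i:i + q]) for i in range(len(returns) - q + 1)]
    var_q = sum((s - q * mu) ** 2 for s in q_sums) / len(q_sums)
    return var_q / (q * var_1)

# Made-up one-period log returns
mean_reverting = [0.01, -0.01] * 20          # each move reversed next period
trending = [0.01] * 10 + [-0.01] * 10        # moves persist for many periods

vr_reverting = variance_ratio(mean_reverting, 2)
vr_trending = variance_ratio(trending, 2)
print(vr_reverting, vr_trending, abs(vr_reverting - 1), abs(vr_trending - 1))
```

A ratio below one flags negative serial correlation and a ratio above one flags positive serial correlation; |VR − 1| treats both as departures from the random walk benchmark.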
Trading Costs
• LOT, Hasbrouck’s Gibbs, Roll’s Covariance
• For US stock market
– Goyenko, Ruslan, Craig Holden, and Charles Trzcinka. (2009). Do liquidity measures measure liquidity? Journal of Financial Economics, 92, 153‐181.
• For international stock market
– Fong, Kingsley, Craig Holden, and Charles Trzcinka. (2010). Can global stock market liquidity be measured? SSRN Working Paper: http://ssrn.com/abstract=1558447
• For commodity market
– Marshall, Ben, Nguyen Nhut, and Nuttawat Visaltanachoti. (2010). Commodity liquidity measurement. SSRN Working Paper: http://ssrn.com/abstract=1603683