ECO220 Notes

Lecture 1: Sampling Errors & Non-Sampling Errors

Goal → to make inferences about population parameters from sample statistics

Probability: foundation for statistics
Statistics: descriptive and inferential
  o Descriptive → describes what happened (ex. class avg)
  o Inferential → conclusion about data, not 100% sure
  o Describe the sample (data) using statistics
  o Make inferences about the population and its parameters using the observed data (the sample)

- Population = set of all items of interest (ex. all students @ UofT for evaluations)
- Parameter = descriptive measure of a population (something that describes the population, ex. what %/fraction of the population)
- Sample = subset of the population (ex. a small group)
- Statistic = descriptive measure of a sample (ex. of the 200 in the sample, __ responded)

Sampling Error ('white noise', 'sample noise', 'sampling variability') = the purely random difference between a sample and the population that arises b/c the sample is a random subset of the population
- As the sample size gets larger, the sampling error tends to get smaller
- Ex. pick 200 out of 60,000; could get extremely different results (due to chance)
  o Not wrong: it is a random sample, nothing wrong with the survey itself
- The size of the sample determines the size of the sampling error
  o Larger samples = less sampling error

Example: bag of m&m's, choose using a spoon, n = # of m&m's drawn, y = % yellow among the n

  n     y
  4     0/4 = 0%
  13    2/13 = 15.4%

- Population: the whole bag of m&m's (all of them)
- Parameter: what %/portion are yellow
- Sample completed 2 times
- Sample statistic → % yellow

Law of Large Numbers
- Larger samples = smaller sampling errors
  o Sampling error decreases as n (sample size) increases
  o There is no "law of small numbers" (i.e., no 'law' that small samples represent the population)

Example: movies on demand; should a rural company offer a new channel? Randomly select 100 people, ask 2 different questions
- Population: the company's rural customer base
- Sample: 100 customers
- Sample statistics: mean = 2.3, proportion = 0.45
- Population parameters not known b/c of sampling error

The Types of Information
- Variable = characteristic recorded about each individual or case (type of info)
- Quantitative = numerical measurements of a quantity or amount
  o Ex. a 10% decrease in price will lead to a 20% increase in QD
- Qualitative = some assessment of quality or kind
  o Ex. an increase in price tends to lead to a decrease in QD
- Identifier variable = unique code for each product/customer
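A quick simulation sketch (not from the lecture) of the m&m example: assume a hypothetical true population proportion of 20% yellow, and watch the sample statistic wander less as n grows.

```python
# Sketch: sampling error shrinks as n grows (law of large numbers).
# TRUE_P is a made-up population parameter for illustration.
import random

random.seed(1)
TRUE_P = 0.20  # hypothetical fraction of yellow m&m's in the bag

for n in (4, 13, 200, 5000):  # sample sizes
    # Draw one random sample of size n and compute the sample statistic.
    sample = [random.random() < TRUE_P for _ in range(n)]
    p_hat = sum(sample) / n
    print(f"n = {n:5d}: sample % yellow = {p_hat:.3f} (sampling error = {p_hat - TRUE_P:+.3f})")
```

Rerunning with a different seed gives different sample statistics, which is the point: the differences are chance, not a flaw in the survey.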
Data
- Rows of a data table correspond to individual cases
- People who answer a survey = respondents; people experimented on = subjects/participants/experimental units
- # of observations = sample size
- "Data" is plural ('these data are flawed'); "datum" is the singular
3 Types of Data
- Interval = numerical measurements, real numbers that are quantitative/numerical (ex. how many marriages?)
- Ordinal = ranking of categories (ex. how would you rank marital status?)
- Nominal = un-ranked categories that are qualitative/categorical (only uses names)

Hierarchy of Data
1. Interval - real numbers -> all calculations are valid
2. Ordinal - must represent the ranked order -> only calculations based on the ordering are valid
3. Nominal - arbitrary numbers that represent categories -> only calculations based on frequency are valid

3 Types of Data Sets
- Cross-sectional = a snapshot of different units taken in the same time period
  o Ex. annual GDP for 2010 for 20 countries (20 observations)
- Time Series = track something over time
  o Stationary time series = without a strong trend or change in variability (can then use a histogram with time series data)
  o Ex. annual Canadian GDP from 2001 until 2010 (10 observations)
- Panel (Longitudinal) = a cross-section of units where each is followed over time
  o Ex. annual GDP of 20 countries from 2001 until 2010 (200 observations)

Sampling (a small sketch contrasting two of these designs follows below)
- Stratified Sampling = a sampling design in which the population is divided into several homogeneous subpopulations, or strata, and random samples are then drawn from each stratum
  o Strata = subsets of a population that are internally homogeneous but may differ from one another
- Systematic Sampling = a sample drawn by selecting individuals systematically from a sample frame
- Convenience Sampling = a sampling technique that selects individuals who are conveniently available
  o May not represent the population
- Cluster Sampling = a sample design in which groups, or clusters, representative of the population are chosen at random and a census is then taken of each
- Multistage Sampling = sampling schemes that combine several sampling methods
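A minimal sketch of simple random vs. stratified sampling, with a made-up population of 700 undergrads and 300 grads (the strata sizes and sample sizes are assumptions for illustration):

```python
# Sketch: simple random sampling vs. stratified sampling.
import random

random.seed(2)
# Hypothetical population: 70% undergrads, 30% grads (strata differ from each other).
population = [("undergrad", i) for i in range(700)] + [("grad", i) for i in range(300)]

# Simple random sample (SRS): every set of n = 50 individuals is equally likely.
srs = random.sample(population, 50)

# Stratified sample: draw proportionally within each homogeneous stratum.
undergrads = [p for p in population if p[0] == "undergrad"]
grads = [p for p in population if p[0] == "grad"]
stratified = random.sample(undergrads, 35) + random.sample(grads, 15)

print(sum(1 for p in srs if p[0] == "grad"), "grads in SRS (varies by chance)")
print(sum(1 for p in stratified if p[0] == "grad"), "grads in stratified sample (always 15)")
```

The stratified design removes the chance variation in stratum composition, which is why it can reduce sampling error relative to an SRS of the same size.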
- Sample size determines what can be concluded from the data, regardless of the size of the population
- Voluntary Response Sample
  o Hard to define the sample frame; it doesn't correspond to the population
  o Biased toward those with strong opinions (especially negative opinions)
- Simple Random Sample (SRS) = a sample in which each set of n individuals in the population has an equal chance of selection
- Sample Frame = list of individuals from which the sample is drawn

Sampling vs. Non-Sampling Errors
- Sampling Error
  o Pure chance (random) difference between sample & population (aka 'white noise')
  o Random: no one can guess the outcome, but there is some underlying set of outcomes that are equally likely
  o It is impossible to perfectly match a sample to the population b/c there are too many characteristics to think of and match
  o Sampling variability = the sample-to-sample differences
- Non-Sampling Errors
  o Systematic (not random) difference between sample & population
  o Biased estimate = statistic is systematically higher or lower than the parameter
  o Undercoverage = not all portions of the population are sampled
  o Systematic errors in data collection:
    - Systematic lying (ex. people overestimate income)
    - Poor survey instrument design (ex. unclear questions)
  o Non-response bias: low response rate and non-responders are non-random (i.e., selection)
  o Sampling frame differs from the target population
Diagram: Sample → (used to calculate) → Statistic → (used to estimate) → Parameter → (tells us about) → Population
- Population Parameter = a numerically valued attribute of a model for a population (what we hope to estimate from sample data)
- Bias = any systematic failure of a sampling method to represent its population
- Measurement error = intentional or unintentional inaccurate response to a survey question

Valid Survey
- Know what you want to know
- Use the right sampling frame
- Ask specific rather than general questions
- Watch for biases
  o Nonresponse bias = bias introduced to a sample when a large fraction of those sampled fails to respond
  o Voluntary Response Bias
  o Response Bias = tendency of respondents to tailor their responses to please the interviewer; also a consequence of slanted question wording
- Be careful with question phrasing
- Be careful with answer phrasing
  o Measurement errors
  o Pilot Test = a small trial run of a study to check that the methods of the study are sound
- Be sure you really want a representative sample

Lecture 2: Tabulations, Bar/Pie Charts, Histograms & Centre

Describing 1 variable (with few unique values)
- Bar Charts = display the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison
- Pie Charts = show the whole group of cases as a circle (great for 1/2, 1/4, 1/8 comparisons)
- Segmented/Stacked Bar Charts = a bar chart that treats each bar as the "whole" and divides it proportionally into segments corresponding to the % in each group
- Stem & Leaf Display = like a histogram but also gives the individual values (but requires quantitative data)
- Tabulation = list all unique values in the data & their relative frequency (aka frequency table, relative frequency table)
  o Basis of bar/pie charts
  o Interval, ordinal or nominal data

One variable with interval data
- Histogram = a graph that shows how the data are distributed
  o Frequency Histogram = bar height measures the number of observations in the bin
  o Relative Frequency Histogram = bar height measures the fraction of observations in the bin relative to the total number
  o Density Histogram = bar area measures the fraction of observations in the bin relative to the total number
- Classes (bins) = non-overlapping and equal-sized intervals that cover the range
  o The number of bins selected changes the appearance of the histogram
  o Sturges' formula: # of bins = 1 + 3.3*log(n) (see the sketch below)
- Shape of Things **Review Lecture, slides 18-21**
  o Histograms give an overview of a variable with a picture (can make informal inferences about the shape of the population)
  o Symmetric = split equally to left and right
  o Positively Skewed = long tail to the right (skewed to the right)
  o Negatively Skewed = long tail to the left (skewed to the left)
  o Modality = # of major peaks
  o Bell/Normal/Gaussian (means unimodal, symmetric)
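A small sketch of the binning step, applying Sturges' formula from above to hypothetical interval data:

```python
# Sketch: choosing bins with Sturges' formula and building a frequency histogram by hand.
import math
import random

random.seed(3)
data = [random.gauss(50, 10) for _ in range(200)]  # hypothetical interval data, n = 200

n = len(data)
bins = round(1 + 3.3 * math.log10(n))  # Sturges: 1 + 3.3*log10(200) = 8.6, so about 9 bins
print(f"n = {n} -> {bins} bins")

# Equal-sized, non-overlapping classes that cover the range.
lo, hi = min(data), max(data)
width = (hi - lo) / bins
counts = [0] * bins
for x in data:
    i = min(int((x - lo) / width), bins - 1)  # clamp the max value into the last bin
    counts[i] += 1
print(counts)  # bar heights of the frequency histogram
```

Dividing each count by n would give the relative frequency histogram instead.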
Describing 2 variables (with few unique pairs)
- Cross tabulation = measures the frequency with which two variables take each possible pair of values (any kind of data) (aka contingency table or two-way table)
  o Basis of pie/bar charts
  o Shows relationships between two variables
  o Interval, ordinal or nominal data
  o Creates contingency tables
- Marginal distribution = frequency distribution of either one of the variables
- Conditional distribution = the distribution of a variable restricting the WHO to consider only a smaller group of individuals

Sample vs. Population
- A sample contains only a subset of the observations in a population (so it has sampling error too)
- Sampling noise = difference between population and sample simply due to random chance
  o Driven by the size of the sample (and not the size of the sample relative to the size of the population, which is usually assumed infinite)
- Sampling error is always present
  o Statistics is the study of how to make inferences in light of sampling error
  o We never see the population in its perfect form
  o Consider the sample size (n) when making informal inferences (larger = more accurate)
Diagram: the population parameter (real life) has no sampling error; the sample statistic has sampling error
Measures of Central Tendency
- Sample statistics are often called summary statistics b/c they are meant to give a concise idea of what the data "look like"
- Three sample statistics provide numeric measures of central tendency: the mean, median, and mode
- Mean = the sum of the observations divided by the number of observations, n
- Median = middle observation after sorting
  o If n is an even #, calculate the median by averaging the two middle observations
  o Better choice than the mean for skewed data
- Mode = the value that occurs with the greatest frequency
  o With interval data, often use the modal class instead
  o Modal Class = class (bin) with the most observations

Sensitivity to Outliers
- Outliers = extremely large or small values, different from the bulk of the data
- Robust = not sensitive to outliers
  o The mean is not robust b/c it is sensitive to outliers, which can raise/lower it (it balances all the data)
  o The median is robust b/c it ignores the magnitudes of the extreme values, though it can be more subject to sampling error (see the sketch at the end of this section)
**REVIEW LECTURE DIAGRAMS**

Graph Problems (things that violate good practice)
- Area principle = a principle that helps to interpret statistical information by insisting that in a statistical display each data value be represented by the same amount of area (violated by, ex., a 3D pie chart)
- Keep it honest (all % must add up to 100%)
- Look at the data separately too when there is more than 1 variable or a contingency table
- Use a large enough sample size (especially for pie charts)
- Don't overstate your case
- Simpson's Paradox = a phenomenon that arises when averages, or percentages, are taken across different groups, and these group averages appear to contradict the overall averages
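Returning to the robustness point above, a minimal sketch with made-up incomes showing one outlier dragging the mean but barely moving the median:

```python
# Sketch: the median is robust to outliers; the mean is not.
from statistics import mean, median

incomes = [40, 45, 50, 55, 60]           # hypothetical data (in $1000s)
print(mean(incomes), median(incomes))    # 50.0 and 50: symmetric data, so they agree

incomes_with_outlier = incomes + [1000]  # one extreme value
print(mean(incomes_with_outlier))        # 208.33...: the mean is pulled up sharply
print(median(incomes_with_outlier))      # 52.5: the median barely moves
```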
Lecture 3: Describing Interval Data (beyond a histogram & mean/median/mode)

4 Measures of Variability (spread)
Summarize data variability with statistics:
- Range = the difference between the largest and smallest observations
  o A measure of variability: as the difference gets bigger, the data are more variable
  o The sample range is subject to sampling error
  o Uses only 2 observations (the biggest & smallest)
  o Very sensitive to outliers
- Variance = the sum of the squared deviations from the mean divided by the degrees of freedom
  o Always non-negative
  o Numerator: total sum of squares (TSS)
    - An observation far from the mean increases TSS a lot
  o Denominator: degrees of freedom (df, ν, 'nu')
    - Only n - 1 free observations are left after the mean is calculated
  o Population variance: σ² = Σ(xᵢ - μ)² / N
  o Sample variance: s² = Σ(xᵢ - x̄)² / (n - 1)
- Standard Deviation = the square root of the variance
  o Measured in the same units as the original variable (variance is measured in units squared)
  o Differently shaped graphs can have the same s.d. b/c, like the mean, it uses all the data (in the lecture examples the data are all centred around the middle)
  o The s.d. depends on the shape of the graph and the units of measure
  o Possible for range = 0 & s.d. = 0 (all observations identical)
  o **Review Lecture, slide 6**

Empirical Rule (Normal/Bell) ~ if sampling from a normal population
- About 68.3% of all obs. within 1 s.d. of the mean
- About 95.4% of all obs. within 2 s.d. of the mean
- About 99.7% of all obs. within 3 s.d. of the mean

Chebysheff's Theorem (always true, for any shape)
- At least 100*(1 - 1/k²)% of observations lie within k s.d.'s of the mean, for k > 1
  o At least 75% of the obs. lie within 2 s.d. of the mean (1 - 1/2² = 3/4)
  o At least 89% of the obs. lie within 3 s.d. of the mean (1 - 1/3² = 8/9)
  o Can be applied to all samples no matter how the population is distributed
  o Note: k does not have to be an integer (ex. 1.5); the difference from the Empirical Rule is that Chebysheff's Theorem only requires k > 1, not a normal population
- A sketch comparing the two rules follows below

Coefficient of Variation = measures how much variability there is compared to the mean (CV = s / x̄)
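A sketch comparing the Empirical Rule to Chebysheff's guaranteed lower bounds, using simulated draws from a normal population:

```python
# Sketch: Empirical Rule percentages vs. Chebysheff's lower bounds on normal data.
import random
from statistics import mean, stdev

random.seed(4)
xs = [random.gauss(0, 1) for _ in range(10_000)]  # sample from a normal population
m, s = mean(xs), stdev(xs)

for k in (1, 2, 3):
    within = sum(1 for x in xs if abs(x - m) <= k * s) / len(xs)
    bound = 1 - 1 / k**2 if k > 1 else 0.0  # Chebysheff's guarantee only holds for k > 1
    print(f"within {k} s.d.: {within:.1%} (Chebysheff guarantees at least {bound:.1%})")
# Expect roughly 68.3%, 95.4%, 99.7% -- well above the 0%, 75%, 89% bounds.
```

Chebysheff's bounds are loose because they must hold for any shape of distribution, not just the bell curve.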
Measures of Relative Standing
- Percentile = the Pth percentile is the value at the cut-off between the bottom P% & top (1-P)% of obs.
  o Ex. the 90th percentile of checkout time is 20 min (means 10% of customers wait more than 20 min)
  o Ex. the 75th percentile of an income distribution is $64,541 (means that 75% of people earn less than $64,541 and 25% earn more)
  o **Review Lecture, slides 12 & 14**

Measure of Variability: Interquartile Range (IQR) = 75th percentile minus 25th percentile (IQR = Q3 - Q1)
  o Measures the spread of the middle of the data (unlike the range, it looks only at the centre part)
  o Robust to outliers
  o **Review Lecture, slides 17-21**

Standardization (Z-scores)
- To standardize a variable (X) means to transform it by subtracting its mean and dividing by its standard deviation: Z = (X - x̄) / s
- Z tells how many standard deviations a value of X is from the mean: above the mean if positive, below the mean if negative
- Standardizing only re-centres and re-scales the data; it won't change the shape

Linear Transformations
- Can be written as Y = a + bX, where a and b are constants
- Linear transformations change the scale of a variable, not the shape of its distribution
- Change the units of measurement

Grouped Data
- Generally, with a tabulation of interval data you can find the mean, s.d., & other statistics
- Ex. consider a random sample of 100 people asked how many copies of the bible they own (interval data with few answer options, so build a tabulation)
  o If 30% said 0, 50% said 1, and 20% said 2, one is able to find the mean and s.d. (worked sketch below)
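The worked version of the grouped-data example above (the answer percentages are from the notes; the code just grinds through the arithmetic):

```python
# Sketch: mean and s.d. from the grouped "bibles owned" tabulation
# (n = 100; 30% answered 0, 50% answered 1, 20% answered 2).
values = [0, 1, 2]
counts = [30, 50, 20]
n = sum(counts)

mean = sum(v * c for v, c in zip(values, counts)) / n           # 0.9
tss = sum(c * (v - mean) ** 2 for v, c in zip(values, counts))  # total sum of squares = 49
variance = tss / (n - 1)                                        # divide by df = n - 1
print(mean, variance, variance ** 0.5)                          # 0.9, ~0.495, s.d. ~0.70
```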
Lecture 4: Describing Association & Association vs. Causality (used with few variables)

Qualitative Assessment
Use scatter diagrams to qualitatively assess the relationship between 2 variables:
- Positive linear relationship
- Negative linear relationship
- Non-linear relationship
- No relationship (horizontal or vertical: 1 variable changes but the other doesn't respond)
- Strong relationship
- Weak relationship
- **Review Lecture, slides 5-7**

Measures of Linear Relationship
- Statistics quantitatively assess strength
- Two things affect the strength:
  o The extent to which the data are scattered around the line
  o The steepness of the slope
- 3 statistics for a linear relationship:
  o Covariance
  o Coefficient of Correlation (and Determination)
  o Linear Regression (Least Squares Method)

Covariance (only for linear relationships)
- Covariance = measures how two variables vary with respect to each other
- Zero covariance = no linear relationship
- Positive covariance = when X is big, Y tends to be big & when X is small, Y tends to be small
  o Note: "big" & "small" compared to the mean
- Negative covariance = when X is big, Y tends to be small & vice versa
- IS affected by the units of measurement (scale of the axes), so its magnitude is hard to interpret; only its sign is meaningful
- Units: ex. dollars × hours
- **Review Lecture, slides 12-16**

Coefficient of Correlation = for two variables, measures the strength of the linear relationship and shows whether the relationship is positive or negative
- Units: cancel out
- NOT affected by units of measurement (scale of the axes)
- Affected by scatter & slope
- Correlation does not measure the strength of a non-linear relationship
- Interpretation of correlation
  o Value near -1 → strong neg. linear relationship
  o Value near +1 → strong pos. linear relationship
  o Value near 0 → no linear relationship
- A strong correlation does NOT imply that an increase in X causes an increase or decrease in Y
  o Only implies: when X is above average, Y tends to be above average (they move together)
- **Review Lecture, slides 21-25**
- Neither association nor correlation implies causation
- "A correlation between variables implies an association between variables"
(see the sketch below)
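A sketch of covariance and correlation computed by hand on made-up (x, y) data, including the unit-sensitivity point above:

```python
# Sketch: covariance and correlation by hand; only correlation is unit-free.
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]   # e.g., hours (hypothetical data)
y = [2, 4, 5, 4, 5]   # e.g., dollars

n = len(x)
sxy = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / (n - 1)
r = sxy / (stdev(x) * stdev(y))
print(sxy, r)          # covariance 1.5 (in dollar-hours), correlation ~0.77 (no units)

# Rescaling x (hours -> minutes) changes the covariance but not the correlation.
x_min = [60 * xi for xi in x]
sxy2 = sum((xi - mean(x_min)) * (yi - mean(y)) for xi, yi in zip(x_min, y)) / (n - 1)
print(sxy2, sxy2 / (stdev(x_min) * stdev(y)))  # covariance 60x larger, same r
```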
Correlation Properties
- The sign of a correlation coefficient gives the direction of the association
- Correlation is always between -1 & +1
- Correlation treats X and Y symmetrically
- Correlation has no units
- Correlation is not affected by changes in the centre or scale of either variable
- Correlation measures the strength of the linear association between the 2 variables
- Correlation is sensitive to unusual observations

Standardizing
- Use the formula to get rid of the units of measurement; the covariance of the standardized variables then equals the correlation coefficient
- **Review Lecture, slide 27**

Research Question = an inquiry about the causal relationship among variables

Types of Data
- Experimental data = data collected in a controlled experiment where some variables are set by the researchers (the X variable is randomly set)
  o Exogenous variables = variables not affected by the choices or behaviours of agents or by other unobserved variables
  o Researchers randomly set values of the explanatory variables and observe the reactions
- Observational data = all variables may be affected by the choices or behaviours of agents and other unobserved variables, and are not randomly set by a researcher
  o Endogenous variables = variables affected by the choices or behaviours of agents
- Issues with observational data
  o Unobserved variables (lurking variables) = other variables not in your data that affect both your dependent and your independent variables
  o Endogeneity bias = bias when we mistakenly attribute certain effects to observed variables when they are caused by associated unobserved variables
- **Review Lecture, slide 40** Natural Experiment

Reporting the Shape, Centre, and Spread
- Shape skewed → report median & IQR; may also report mean & s.d., explaining why the mean & median differ (= skewness)
- Shape unimodal, symmetric → report mean & s.d. (and median & IQR; if the IQR is not within 1 or 2 s.d., check for skewness or outliers)
- Shape with multiple modes → understand the reason; possibly split the data
- Shape unusual → report mean & s.d. and describe what is unusual
- Note: always pair the median w/ the IQR and the mean w/ the s.d.

5-Number Summary & Boxplots
- 5-number summary = reports a distribution's median, quartiles, and extremes
- From this you can create a boxplot (see the sketch at the end of this lecture)
  o How to make it (p. 98-99)

Scatter Plots
- Explanatory or predictor variable (x-variable) = the variable that accounts for, explains, predicts, or is otherwise responsible for the y-variable
- Response variable (y-variable) = the variable that the scatter plot is meant to explain or predict

Correlation Conditions
- Quantitative Variables Condition → correlation applies only to quantitative variables
- Linearity Condition → the correlation coefficient measures only the strength of the linear association
- Outlier Condition → unusual observations can distort the correlation
- Don't confuse correlation with causation
- Association = vague; correlation = precise
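The boxplot sketch promised above, on hypothetical data (the 1.5*IQR outlier fence is the common textbook rule):

```python
# Sketch: the 5-number summary (extremes, quartiles, median) behind a boxplot.
from statistics import quantiles

data = [3, 5, 7, 8, 9, 12, 13, 14, 18, 21, 45]  # hypothetical data; 45 is an outlier
q1, q2, q3 = quantiles(data, n=4)                # quartiles (method details vary by text)
print(min(data), q1, q2, q3, max(data))          # the 5-number summary: 3 7 12 18 45

iqr = q3 - q1
# Common boxplot rule: flag points beyond 1.5*IQR from the quartiles as outliers.
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print([x for x in data if x < fences[0] or x > fences[1]])  # [45]
```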
Lecture 5: Linear Regressions
Experimental Data vs. Observational Data (example)
- Experimental data: exogenous drug dosage; can infer causation from correlation
- Observational data: endogenous price; demand shifters (taste, income, population, etc.)
Least Squares Method or Ordinary Least Squares (OLS)
- A quantitative method of fitting a straight line through the scatter diagram
  o Formula for a line: y = a + b*x (a = intercept, b = slope)
  o The fitted line always passes through the point of means (x̄, ȳ)
- OLS = method for estimating a and b
  o Returns estimates of a and b such that the predicted values of y are as close as possible to the actual values of y
  o Predicted value of y is ŷ: ŷ = a + bx
  o Residual: eᵢ = yᵢ - ŷᵢ
- Solution to the minimization: b = sxy / sx², a = ȳ - b*x̄

Interpretation
- The slope (b) measures the marginal change in the dependent variable (Y) associated with a change in the independent variable (X)
- **Review Lecture, slides 21-22**
- Language: "will have" implies causality
- **Review Lecture, slides 25-26**

Residual Errors
- Because the regression includes a constant term (a), the mean/avg of the residuals is 0 (otherwise we would not be minimizing the SSE)
- Homoscedasticity = the variance of the residuals is constant, so it makes sense to talk about THE s.d. of the residuals
- **Review Lecture, slides 29-31**

Analysis of Variance
- How well the model fits/explains the data
- Analysis of Variance (ANOVA) = a technique for resolving (decomposing) the total variability of a set of data into systematic and random components
- Total Sum of Squares (SST) = total variation in y
- How much of the total variability in y is explained by the regression model's variable: SSR
- How much of the total variability in y is explained by unobserved error: SSE
- Decomposition: SST = SSR + SSE
- If the least squares line fits the data perfectly, then SSE = 0; whereas if the regression line does not fit the data at all, SSE = SST
- SSE is not a good measure of fit by itself: it depends on the number of observations and is measured in units of y², which is hard to interpret
- Coefficient of Determination (R²) = measures the fit of the linear regression model
  o Fraction of the total variation in the dependent variable (y) explained by the regression model
  o R² = SSR / SST = 1 - SSE / SST (worked sketch below)
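A sketch pulling the OLS pieces together on made-up data: the slope/intercept formulas, residuals, and R²:

```python
# Sketch: OLS estimates via b = sxy/sx^2, a = ybar - b*xbar, then R^2 = 1 - SSE/SST.
from statistics import mean

x = [1, 2, 3, 4, 5]                  # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = mean(x), mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sxy / sxx                         # slope: marginal change in y per unit change in x
a = ybar - b * xbar                   # intercept: the line passes through (xbar, ybar)

yhat = [a + b * xi for xi in x]       # predicted values
residuals = [yi - yh for yi, yh in zip(y, yhat)]
print(sum(residuals))                 # ~0: residuals average to zero by construction

sse = sum(e ** 2 for e in residuals)            # unexplained variation
sst = sum((yi - ybar) ** 2 for yi in y)         # total variation
print(a, b, 1 - sse / sst)            # R^2 = 1 - SSE/SST = SSR/SST
```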
How the candidate measures of fit compare:

  Statistic:              se    SST   SSR   SSE   R²
  Measured in units of:   y     y²    y²    y²    (none)
  Good measure of fit?    No    No    No    No    Yes
Lecture 6: Probability Theory & Expectations

Probability Is the Basis of Inference
- Statistical inference about parameters uses statistics affected by sampling noise
  o Inference involves probabilities, not certainties
- Random Experiment = a process that leads to one of several possible outcomes
- Sample Space: S = {O1, O2, ..., Ok}
  o Exhaustive = lists all possible outcomes
  o Mutually Exclusive = only one outcome can occur
  o Together these imply the probability of each outcome must be between 0 and 1, and the probabilities must sum to 1
- Event = an outcome (or collection of outcomes) of an experiment

Joint Probability
- Joint probability = P(two events both occurring); P(A∩B)
  o Ex. Event A: draw a spade; Event B: draw a king
- Marginal probability = P(single event)
  o If outcomes are exhaustive and mutually exclusive, add up all the joint probabilities containing the single event
- Conditional probabilities = the probability of one event given that another event has occurred
  o P(B|A) = probability of B given A
  o P(A|B) = probability of A given B
  o Rescale to find the conditional
- Independence: two events are independent if and only if P(A|B) = P(A) (which implies P(B|A) = P(B)); a worked check follows below
- Union = the union of events A and B is the event that either A or B or both occur
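The spade/king example from the notes, worked out with exact fractions; it turns out the two events are independent:

```python
# Sketch: joint, marginal, and conditional probabilities with a 52-card deck.
from fractions import Fraction

p_spade = Fraction(13, 52)          # marginal: P(A), draw a spade
p_king = Fraction(4, 52)            # marginal: P(B), draw a king
p_king_of_spades = Fraction(1, 52)  # joint: P(A and B)

p_king_given_spade = p_king_of_spades / p_spade   # conditional: P(B|A) = 1/13
print(p_king_given_spade, p_king)                 # 1/13 = 4/52: conditioning changed nothing
print(p_king_of_spades == p_spade * p_king)       # True -> A and B are independent
```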
Probability Rules
- Complement of Event A = the event that occurs when A does not occur, denoted A^C (or A')
- Complement Rule: P(A^C) = 1 - P(A)
- Multiplication Rule = the joint probability of any two events A and B is
  o P(A∩B) = P(A|B) * P(B)
- Multiplication Rule, Independent Events = if A and B are independent, the joint prob. is
  o P(A∩B) = P(A) * P(B)
- Multiplication Rule, N Independent Events = if A1, ..., AN are independent, the joint prob. is
  o P(A1∩A2∩...∩AN) = P(A1) * P(A2) * ... * P(AN)
- Addition Rule, Mutually Exclusive Events = both cannot occur
  o P(A or B) = P(A) + P(B)
- Addition Rule = the probability that event A or event B or both occur
  o P(A or B) = P(A) + P(B) - P(A∩B)

Random Variables = a random variable assigns a number to each outcome of an experiment
Types of random variables:
- Discrete Random Variable = takes a countable number of values
  o Ex. the number of cats owned by a randomly selected household
- Continuous Random Variable = takes an uncountable (infinite) number of values
  o Ex. the average number of cats owned for a random sample of households

Probability Distribution = gives the probability of every value (or range of values) of a random variable
- Can tell us how much sampling noise to expect
- A discrete distribution gives probabilities directly
- A continuous distribution gives a density; the area under the density function f(x) is the probability
- The expected value is the population mean (worked sketch below)
- Risk Aversion = preferring the expected value of a lottery for sure over playing it
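A sketch of expected value and variance for a made-up discrete lottery:

```python
# Sketch: E[X] and Var(X) for a discrete random variable
# (hypothetical lottery: win $0, $10, or $100).
values = [0, 10, 100]
probs = [0.70, 0.25, 0.05]           # exhaustive and mutually exclusive: sums to 1

ev = sum(x * p for x, p in zip(values, probs))               # E[X] = 7.5
var = sum((x - ev) ** 2 * p for x, p in zip(values, probs))  # Var(X) = E[(X - mu)^2]
print(ev, var, var ** 0.5)
# A risk-averse person would prefer $7.50 for sure over playing this lottery.
```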
Distributions
- Univariate probability distribution = distribution of a single random variable
- Bivariate/Joint probability distribution = gives the probability (the joint probability) of each unique pair of values for two random variables
Lecture 7: Discrete Probability Distributions

Bernoulli Trial
- Two possible outcomes: success or failure
- Probability of success: p
- Probability of failure: 1 - p
- Bernoulli random variable = records the outcome of a single Bernoulli trial
  o A discrete random variable; the outcomes are mutually exclusive
  o E[X] = μ = p
  o Var(X) = σ² = p*(1-p)

Binomial Experiment
- Fixed number of trials: n
- Only two outcomes per trial: success or failure
- Probability of success: p
- Probability of failure: 1 - p
- Trials are independent

Binomial Random Variable
- Binomial random variable = counts the number of successes in a Binomial experiment
- The Binomial distribution gives the probability of each possible value of the random variable X

What if n Is Big?
- P(a specific sequence w/ x successes) = p^x * (1-p)^(n-x)
  o A specific sequence corresponds to one tip of the probability tree
  o Use the multiplication rule for independent events
- Then count the # of sequences with x successes: C(n, x) = n! / (x!(n-x)!)
Binomial Probability Distribution
- The probability of x successes in a Binomial experiment with n trials and probability of success p is
  P(x) = [n! / (x!(n-x)!)] * p^x * (1-p)^(n-x)
  (NOTE: 2 parameters, n and p)
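A sketch of the Binomial formula in code (math.comb counts the sequences with x successes; the 10-trial example is made up):

```python
# Sketch: the Binomial pmf, P(x) = C(n, x) * p^x * (1-p)^(n-x).
from math import comb  # comb(n, x) = n! / (x!(n-x)!), Python 3.8+

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x successes in n independent Bernoulli trials)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: 10 trials, success probability 0.2.
print(binomial_pmf(2, n=10, p=0.2))                     # ~0.302
print(sum(binomial_pmf(x, 10, 0.2) for x in range(3)))  # cumulative: P(X <= 2) ~ 0.678
```

The last line previews the cumulative probability defined below: just add up the pmf over all values up to x.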
Bernoulli Builds Binomial
- **Review Lecture, slides 17-19**
Cumulative Probability = the probability that a random variable is less than or equal to a particular value: P(X ≤ x)

Lecture 8: Continuous Distributions

Probability in Continuous Cases
- For a discrete random variable we can find P(X = x₁)
- For a continuous random variable, the probability that X equals any specific # is 0
- Discrete vs. continuous
  o Ex. # of cars that pass a toll booth in a year → so many possible values it is treated as continuous
  o Ex. # of snow closures @ UofT → discrete

Probability Density Function: f(x)
- The area under a density function over a range of values is the probability of that range
- f(x) gives the height of the density function
  o f(x) is not itself a probability
- Requirements for a probability density function f(x):
  1. f(x) ≥ 0 for all possible values of x (can't be negative)
  2. The total area under the curve is 1

Uniform Distribution (sketch below)
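A sketch of the standard Uniform(a, b) density, f(x) = 1/(b - a) on [a, b] (assuming this is the version covered in lecture):

```python
# Sketch: for X ~ Uniform(a, b), f(x) = 1/(b - a) on [a, b], so a probability is
# just the area of a rectangle: (width of the interval) / (b - a).
def uniform_prob(x1: float, x2: float, a: float, b: float) -> float:
    """P(x1 <= X <= x2) for X ~ Uniform(a, b)."""
    lo, hi = max(x1, a), min(x2, b)      # clip the interval to the support [a, b]
    return max(hi - lo, 0) / (b - a)

print(uniform_prob(2, 5, a=0, b=10))   # 0.3: height 1/10, width 3
print(uniform_prob(3, 3, a=0, b=10))   # 0.0: P(X = exact value) is zero for continuous X
```

The second call illustrates the point above: a single value has zero width, hence zero probability under any density.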