Transforming variables

Powers and Roots

Quite often when we’re dealing with quantitative data, it turns out that for the purposes of analysis it is useful to transform one of the variables of interest. This could occur for any one of a number of reasons:

- A distribution may be skewed. This can be problematic for several reasons.
  1) Skewed distributions are difficult to examine because many observations may be piled up in one place.
  2) Unusual observations in the direction opposite from that of the skew may be hidden.
  3) Summary measures like the mean and median don’t necessarily make much sense for a skewed distribution.

- The relationship between two variables may be nonlinear. Again, this can be a problem for several reasons.
  1) Linear relationships are easier to interpret than nonlinear ones. With a linear relationship, after all, a one-unit change in the independent variable leads to the same change in the dependent variable, whatever the starting level.
  2) The statistical theory for linear models is simple and well-developed. This is what we will be learning for the rest of the quarter.
  3) Nonparametric regression is not feasible when there is more than one independent variable.

- Spread may be nonconstant. There are cases where the spread of a variable will depend on its level. In that situation, comparing two distributions of the same variable from populations that are at different levels may be difficult.

- One or both of the variables may be a proportion.

Most of the transformations we will deal with will be in the family of powers and roots: X -> (X^p - 1)/p. In the case that p is zero, we will substitute the log transformation, X -> log X. The relationships of these transformed variables to the original are summarized in the figure below:

[Figure: the transformed variables plotted against the original variable x.]
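The log is not an arbitrary substitute at p = 0: it is the limit of the power family X -> (X^p - 1)/p as p approaches zero. A short derivation, writing X^p as e^{p ln X} and expanding the exponential as a series:

```latex
\frac{X^{p} - 1}{p}
  = \frac{e^{p \ln X} - 1}{p}
  = \ln X + \frac{p\,(\ln X)^{2}}{2} + \cdots
  \;\longrightarrow\; \ln X
  \quad \text{as } p \to 0 .
```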

Note that you will not always be using the specific formula listed above. You may need to fiddle a bit, for example by adding a constant, so that all values of the variable being transformed are positive. Also, as a general rule, carrying out a transformation only makes sense when the ratio of the largest to the smallest value is quite large.
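Both points in this note can be checked before transforming. The sketch below is only an illustration: the variable v and its simulated values are hypothetical, not part of the data used in these notes, and shifting so that the minimum becomes 1 is just one arbitrary way of making a variable positive.

```stata
* Hypothetical example: v is simulated, not part of the original data
clear
set seed 12345
set obs 100
* Some values of v fall below zero
generate v = 50*runiform() - 5

* Shift v so that its smallest value becomes 1 (an arbitrary choice)
quietly summarize v
generate vpos = v - r(min) + 1

* Ratio of largest to smallest value of the shifted variable
quietly summarize vpos
display "max/min ratio = " r(max)/r(min)

* Logs (and roots) are now defined for every observation
generate lv = ln(vpos)
```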

Creating transformed variables in STATA with the generate command

An important command you will need to use in STATA to create transformed variables is the generate command. The format is

    generate new variable = expression involving existing variables

For example, the curves in the previous transparency were generated from an original variable x by the following commands:

    generate y0 = ln(x)
    generate y1 = x - 1
    generate y2 = x^2 - 1
    generate y3 = x^3 - 1
    generate ym1 = (x^-1 - 1)/-1
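To try these commands, x has to exist first. Here is a minimal sketch, assuming an evenly spaced grid is an acceptable stand-in for the original x. The grid formula and the plotting command are choices made for this sketch, not taken from the notes.

```stata
clear
set obs 1000
* Evenly spaced values 0.004, 0.008, ..., 4. Zero itself is skipped so that
* ln(x) and x^-1 are defined for every observation (an assumption of this sketch)
generate x = 4*_n/_N

generate y0  = ln(x)
generate y1  = x - 1
generate y2  = x^2 - 1
generate y3  = x^3 - 1
generate ym1 = (x^-1 - 1)/-1

* Overlay the five curves against x
twoway line y0 y1 y2 y3 ym1 x
```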

The original variable x had been created to consist of 1000 observations ranging from 0 to 4. Note that ln(x) is the natural logarithm of x, log10(x) is the log base 10 of x (in STATA, log(x) is a synonym for ln(x)), and x^3 is x cubed. The range of functions you can include in your expression is quite wide; help functions will summarize them for you. For example, normd(x) is the height of the standard normal density at x, normprob(x) is the value of the cumulative normal at x, and so on.

[Figure: four example functions plotted against x: y = normd(x), y = normprob(x), y = sin(1/x), and y = sin 2x.]
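Curves like these can be generated with the same generate command. The sketch below rests on three assumptions: that the functions are evaluated on a grid from -4 to 4, that "sin 2x" means sin(2x), and that a current Stata release is in use, where the normal density and cumulative normal are called normalden() and normal() (normd() and normprob() are names from older releases).

```stata
clear
set obs 1000
* Grid from -3.992 to 4 in steps of 0.008
generate x = -4 + 8*_n/_N

* Standard normal density and cumulative distribution function
generate ydens = normalden(x)
generate ycum  = normal(x)

* 1/x is missing at x = 0, so ysin1 is missing for that one observation
generate ysin1 = sin(1/x)
generate ysin2 = sin(2*x)

* One plot per function
twoway line ydens x, name(dens, replace)
twoway line ycum x,  name(cum, replace)
twoway line ysin1 x, name(sin1, replace)
twoway line ysin2 x, name(sin2, replace)
```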

Dealing with skewness

If you have a variable that is skewed, you may use a power/root or log transformation to try to make it look more normal, or at least more symmetric. In general, if a distribution is skewed to the right, you may need to experiment with a log or a root transformation, while if it is skewed to the left, you may need to square or cube it. Here are some examples:

First, a situation where a variable is skewed to the left. On the left is a smoothed univariate distribution (produced with STATA’s kdensity) of countries according to the percentage of males with a primary school education in 1993 (prim93m). Note that it is skewed to the left, indicating that while most countries have a fairly high proportion of males with some primary school education, some have very little. On the right is a smoothed distribution of a variable prim93m2 generated by squaring prim93m. Note that it is much more symmetric, and even roughly normal.

[Figure: kernel density estimates of prim93m, "Primary 93 M (World Bank)" (left), and of prim93m2 (right).]
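The squaring step can be tried on any left-skewed variable. The World Bank data are not reproduced here, so the sketch below simulates a stand-in for prim93m. The simulated values and the seed are assumptions of this sketch, while generate, summarize, and kdensity are standard Stata commands.

```stata
clear
set seed 1993
set obs 150
* Simulated left-skewed percentages standing in for prim93m (not the real data)
generate prim93m  = 100*rbeta(5, 1.5)
generate prim93m2 = prim93m^2

* Skewness before and after squaring; negative values indicate left skew
quietly summarize prim93m, detail
display "skewness of prim93m  = " r(skewness)
quietly summarize prim93m2, detail
display "skewness of prim93m2 = " r(skewness)

* Smoothed distributions of the original and squared variables
kdensity prim93m,  name(raw, replace)
kdensity prim93m2, name(squared, replace)
```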

Here’s a situation where transformation of a left-skewed distribution improves symmetry, but not normality. On the left is a distribution of life expectancies in 1995, on the right is a distribution of life expectancy to the 4th power:

[Figure: kernel density estimates of "e0 in 1995 (World Bank)" (left) and of e0_954, its 4th power (right).]

The smoothed distribution is the univariate equivalent of the bivariate smoothing we saw earlier.

Just as we can raise to a power to adjust for a left skew, we can log or take a root to correct for a right skew. On the left is a smoothed distribution of per capita GDP, on the right is the equivalent for logged per capita GDP.

[Figure: kernel density estimates of "United Nations per capita GDP" (left) and of lPcGDP95, its log (right).]

Below is an extreme example. On the left is a distribution of surface areas of countries. It is right skewed, indicating that while most countries have a small surface area, some have very large ones. On the right is a distribution of the tenth-root of surface areas, the result of some experimentation.

[Figure: kernel density estimates of surface area in sq. km. (left) and of AreaRt2, the tenth root of surface area (right).]
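The "experimentation" behind a choice like the tenth root can be organized as a loop over candidate roots. This is a sketch on simulated data: the variable area, its distribution, and the list of roots tried are all assumptions, not the values used for the figure. The ladder command at the end is Stata's built-in search over the ladder of powers.

```stata
clear
set seed 2024
set obs 200
* Simulated, strongly right-skewed stand-in for surface area (not the real data)
generate area = exp(rnormal(5, 2))

* Try several roots and report the skewness of each transformed variable
foreach k of numlist 2 3 5 10 {
    generate root`k' = area^(1/`k')
    quietly summarize root`k', detail
    display "root " `k' ": skewness = " r(skewness)
}

* The log is the limiting case of smaller and smaller powers
generate larea = ln(area)
quietly summarize larea, detail
display "log: skewness = " r(skewness)

* Built-in alternative: tests of normality for each rung of the ladder of powers
ladder area
```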

Transforming to deal with nonlinearity

Often we will want to transform a variable to linearize a relationship. An important rule of thumb applies when we plot the ‘dependent’ variable on the vertical axis against the ‘independent’ variable on the horizontal axis. If the plot exhibits a bulge up and to the left, we will probably need to either transform the independent variable ‘downward’ or the dependent variable ‘upward.’ If the bulge is down and to the right, we may need to transform the independent variable ‘upward’ or the dependent variable ‘downward.’ Here’s an example where on the left, male life expectancy is plotted against per capita GDP, and on the right, it is plotted against logged per capita GDP.

[Figure: "Male e0 1995-2000 from U.N." plotted against "United Nations per capita GDP" (left) and against lPcGDP95 (right).]

Here’s an example where the dependent variable is TFR.

[Figure: UN_TFR plotted against "United Nations per capita GDP" (left) and against lPcGDP95 (right).]
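The effect of logging the independent variable can be seen on simulated data. In the sketch below the variables gdp, lgdp, and e0 are hypothetical stand-ins built so that the outcome is linear in log GDP. The coefficients, the noise level, and the seed are assumptions chosen only for illustration.

```stata
clear
set seed 42
set obs 150
* Simulated per capita GDP, spread evenly on the log scale (an assumption)
generate lgdp = 3.6 + 7*runiform()
generate gdp  = exp(lgdp)

* Outcome built to be linear in log GDP, plus noise
generate e0 = 30 + 4.5*lgdp + rnormal(0, 3)

* By construction, e0 is more nearly linear in lgdp than in gdp
correlate e0 gdp lgdp

* Scatterplots against the raw and the logged variable
twoway scatter e0 gdp,  name(raw, replace)
twoway scatter e0 lgdp, name(logged, replace)
```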

Nonconstant spread

Sometimes the spread of a variable depends on the level of another variable. For example, according to the figure on the left below, the spread of the infant mortality rate varies with the level of fertility, even though the relationship between the two is linear. That can be a problem if you want to compare the distribution of the variable at different levels. Outliers may be harder to locate at one of the levels, for example. When spread increases with level, working down the ladder of powers can stabilize it. The figure on the right below plots the cube root of the infant mortality rate as a function of the TFR.

[Figure: IMR plotted against UN_TFR (left) and IMR3, the cube root of IMR, plotted against UN_TFR (right).]
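One way to check whether a transformation has evened out the spread is to compare standard deviations at low and high levels. The sketch below uses simulated stand-ins for UN_TFR and IMR. The multiplicative noise, the cutoff of 4.6, and the variable highTFR are assumptions of this sketch, not part of the original analysis.

```stata
clear
set seed 7
set obs 300
* Simulated stand-ins for the fertility and infant mortality variables
generate UN_TFR = 1.2 + 6.8*runiform()
* Multiplicative noise makes the spread of IMR grow with its level
generate IMR  = 20*UN_TFR*exp(rnormal(0, 0.4))
generate IMR3 = IMR^(1/3)

* Compare standard deviations in the lower and upper halves of the TFR range
generate byte highTFR = UN_TFR > 4.6
tabstat IMR IMR3, by(highTFR) statistics(mean sd)

* Scatterplots before and after the cube root
twoway scatter IMR UN_TFR,  name(raw, replace)
twoway scatter IMR3 UN_TFR, name(cuberoot, replace)
```

If the transformation is working, the two standard deviations reported for IMR3 should be closer to each other, relative to their means, than the two reported for IMR.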

Proportions

An important transformation when one of the variables represents a proportion is the logit transformation, ln(P/(1-P)). This is the log of the odds of P. Proportions are unusual beasts because they are bounded by 0 and 1, and their distributions often behave differently from those of unbounded variables. Often you will see examples of nonconstant spread when you are looking at proportions as a dependent variable, and the logit transformation is another way of dealing with this. Here is the IMR/TFR relationship from the previous page, followed by one where the IMR is transformed into the logit:

[Figure: IMR plotted against UN_TFR (left) and logitIMR plotted against UN_TFR (right).]
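The logit can be created with generate like any other transformation. The sketch below assumes that IMR is expressed as deaths per 1,000 births, so that IMR/1000 is the proportion P. The data are simulated stand-ins, and the intermediate variable p is introduced only for this sketch.

```stata
clear
set seed 95
set obs 150
* Simulated stand-ins for UN_TFR and IMR (deaths per 1,000 births)
generate UN_TFR = 1.2 + 6.8*runiform()
generate IMR    = 1000*invlogit(-5.5 + 0.5*UN_TFR + rnormal(0, 0.4))

* Convert to a proportion, then take the log of the odds
generate p        = IMR/1000
generate logitIMR = ln(p/(1 - p))

* Stata's built-in logit() function should agree up to rounding error
assert abs(logitIMR - logit(p)) < 1e-4

* Scatterplots before and after the logit transformation
twoway scatter IMR UN_TFR,      name(raw, replace)
twoway scatter logitIMR UN_TFR, name(logit, replace)
```

A variable stored as a percentage would be divided by 100 instead. Values of exactly 0 or 1 have no logit and come out missing.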

Here’s an example where we’re looking at a distribution of countries by the percentage of dwellings with running water. It’s right skewed, but squaring the percentage doesn’t do much that’s useful; indeed, it produces a bimodal distribution that still looks a bit skewed.

[Figure: histograms (fraction of countries) of mwater95 (left) and of its square (right).]

Here’s what happens when we look at a logistic transformation:

[Figure: histogram (fraction of countries) of lwater95, the logistic transformation of the running-water percentage.]

Notice the distribution is much better-behaved.