|
Wednesday, 22 October 2008 23:39 |
|
Here's the R code for today: ### First we fix the seed so that we get the same random numbers every time set.seed(1) N = 5 ## Generate a fairly tidy regression xreal = (runif(N)-0.5)*10 yreal = 2.2 * xreal + rnorm(N, 0, 1)
## Nice and linear plot(xreal,yreal)
## Add a high-leverage point that doesn't follow the orginal regression x = c(xreal, 10) y = c(yreal, 3.2*10)
## Note that -- when we 'zoom out' like this -- it still looks pretty linear plot(x,y)
## Here's our "real" linear fit m1 = lm(yreal~xreal) summary(m1)
## And here's our fit with the outliers; notice that it's still ## highly significant. m2 = lm(y~x) summary(m2)
## If we just look at the residuals, we don't see anything wrong. The high ## leverage of the outlier has forced the fit to conform to that outlying ## point, masking it's effect plot(x,m2$res)
## Make our predictor variable matrix... xmat = matrix(c(rep(1, N+1), x), N+1, 2) ## Transform it to a hat matrix... hat = xmat%*%solve(t(xmat)%*%xmat)%*%t(xmat) ## And use the diag command to extract just the diagonal elements; notice the ## high value for the last point diag(hat)
## Or save time and use the influence function... infs = influence(m2) infs$hat
## The diagonal matrix sums to the number of parameters in the model (2: a mean ## term and an intercept term sum(diag(hat))
## We have 12 data points, so our cutoff is... 2*2/(N+1) ## about 1/3 is our cutoff for high leverage. Our outlier qualifies.
## Let's look at standardized residuals: plot(stdres(m2)) ## This still doesn't look too outrageous
## If we plot the jackknife residuals: plot(studres(m2)) ## Now our bad point is obviously out of place.
## If you've forgotten the incredibly useful 'names' function, now's a good time ## to remind yourself of it... m2summary = summary(m2) names(m2summary) m2summary$sig
## Calculate standardized residuals by hand, just for practice r = m2$resid/(m2summary$sigma * sqrt(1 - infs$hat))
cbind(r, stdres(m2))
## Cook's distance: cd = cooks.distance(m2) plot(x,cd)
################################# ################################# ################################# ## Co-linearity N=20 x1 = (runif(N)-0.5)*10 x2 = x1 + 3+(runif(N)-0.5)*2
y = 3*x1 + x2 + rnorm(N)
## Checking the correlations we see that they're actually pretty bad... cor(cbind(y,x1,x2))
## But fit the mode anyway m3 = lm(y~x1+x2) summary(m3) # Note the large variance on each term
xmat = matrix( c(rep(1, N), x1, x2), N, 3) eigen((t(xmat)%*%xmat)) ## We have eigenvalues that differ by an order of magnitude or so
## The cheap fix: drop one or the other co-linear terms. This generally isn't the # ideal approach, but it can sometimes help. Ridge regression is better. m3a = lm(y~x1) summary(m3a)
m3b = lm(y~x2) summary(m3b)
|