Oct 22 section
Wednesday, 22 October 2008 23:39

Here's the R code for today:

### First we fix the seed so that we get the same random numbers every time
set.seed(1)
N = 5
## Generate a fairly tidy regression
xreal = (runif(N)-0.5)*10
yreal = 2.2 * xreal + rnorm(N, 0, 1)

## Nice and linear
plot(xreal,yreal)

## Add a high-leverage point that doesn't follow the orginal regression
x = c(xreal, 10)
y = c(yreal, 3.2*10)


## Note that -- when we 'zoom out' like this -- it still looks pretty linear
plot(x,y)


## Here's our "real" linear fit
m1 = lm(yreal~xreal)
summary(m1)

## And here's our fit with the outliers; notice that it's still
##   highly significant.
m2 = lm(y~x)
summary(m2)


## If we just look at the residuals, we don't see anything wrong.  The high
##  leverage of the outlier has forced the fit to conform to that outlying
##  point, masking it's effect
plot(x,m2$res)

## Make our predictor variable matrix...
xmat = matrix(c(rep(1, N+1), x), N+1, 2)
## Transform it to a hat matrix...
hat = xmat%*%solve(t(xmat)%*%xmat)%*%t(xmat)
## And use the diag command to extract just the diagonal elements; notice the
##   high value for the last point
diag(hat)

## Or save time and use the influence function...
infs = influence(m2)
infs$hat

## The diagonal matrix sums to the number of parameters in the model (2: a mean
##   term and an intercept term
sum(diag(hat))

## We have 12 data points, so our cutoff is...
2*2/(N+1)
## about 1/3 is our cutoff for high leverage.  Our outlier qualifies.

## Let's look at standardized residuals:
plot(stdres(m2))
## This still doesn't look too outrageous

## If we plot the jackknife residuals:
plot(studres(m2))
## Now our bad point is obviously out of place.

## If you've forgotten the incredibly useful 'names' function, now's a good time
##   to remind yourself of it...
m2summary = summary(m2)
names(m2summary)
m2summary$sig

## Calculate standardized residuals by hand, just for practice
r = m2$resid/(m2summary$sigma * sqrt(1 - infs$hat))

cbind(r, stdres(m2))


## Cook's distance:
cd = cooks.distance(m2)
plot(x,cd)



#################################
#################################
#################################
## Co-linearity
N=20
x1 = (runif(N)-0.5)*10
x2 = x1 + 3+(runif(N)-0.5)*2

y = 3*x1 + x2 + rnorm(N)

## Checking the correlations we see that they're actually pretty bad...
cor(cbind(y,x1,x2))

## But fit the mode anyway
m3 = lm(y~x1+x2)
summary(m3)
# Note the large variance on each term

xmat = matrix( c(rep(1, N), x1, x2), N, 3)
eigen((t(xmat)%*%xmat))
## We have eigenvalues that differ by an order of magnitude or so


## The cheap fix: drop one or the other co-linear terms.  This generally isn't the
# ideal approach, but it can sometimes help.  Ridge regression is better.
m3a = lm(y~x1)
summary(m3a)

m3b = lm(y~x2)
summary(m3b)