220A - Fall 2008
11-26 section R code
Thursday, 27 November 2008 07:59
attach(stackloss)
library(MASS)

## We won't cover doing least squares since everyone should be pretty
## comfortable with that at this point.

######################################
##  Huber method
######################################

## The MASS library contains the rlm function, ("robust linear models")
## which will let us do the Huber estimator without a lot of grief.

m_huber <- rlm(stack.loss~Air.Flow+Water.Temp+Acid.Conc., data=stackloss)
summary(m_huber)

## If we want to play with the constant c, then we need to invoke rlm
## a bit differently

# First we inspect psi.huber to see the defaults
psi.huber

# define a new weight function with a different constant, in this case 1.1
psi.huber2<-function(u)psi.huber(u, k=1.1)

# Observe the difference
psi.huber(c(1,1.1,1.11, 1.345, 1.346))
psi.huber2(c(1,1.1,1.11, 1.345, 1.346))

# refit
m_huber_2 <- rlm(stack.loss~., data=stackloss, psi=psi.huber2)

######################################
##  LAD
######################################

## We could just recycle the rlm function, since it does basic m-estimation,
## by passing it a different m function.  Recall that we can get a weight
## function of 1/u for LAD...

psi.LAD <- function(u)1/u

m_LAD <- rlm(stack.loss~., data=stackloss, psi=psi.LAD)

## But we run into a missing weights problem (division by zero).  There's two
## ways to fix this.  The 'cheap trick' way is to simply assign the weight
## function the biggest weight possible if u=0...

psi.LAD2 <- function(u, deriv=0) return(abs(pmin(.Machine$double.xmax, 1/u)))
m_LAD_hack <- rlm(stack.loss~., data=stackloss, psi=psi.LAD2)
summary(m_LAD_hack)

## A better way is probably to get the quantreg package and use the rq
## function in that.

library(quantreg)
m_LAD <- rq(stack.loss~., data=stackloss)
summary(m_LAD)

## It turns out our nasty hack wasn't actually all that bad an approximation.

###################
## Least trimmed squares
###################

## Faraway recommends the 'lqs' package for this, which has since been rolled
## into MASS, so no need to install yet another package.  You can use the
## ltsreg as in faraway, or you can use the lqs function (which does exactly
## the same thing; ltsreg is just a 'wrapper' function for lqs at this time)

## There is a catch, though.  
m_lts1 = ltsreg(stack.loss ~., data=stackloss)
m_lts2 = ltsreg(stack.loss ~., data=stackloss)
m_lts1$coeff
m_lts2$coeff

## if you have more than one predictor, it's a complex problem to get the
## minimization exact.  R does a random search to try and minimize it, with
## the net effect that you might see a small difference in results each time
## you run it.  For SMALL data sets you can add 'nsamp=exact' to the command:

m_lts_exact = ltsreg(stack.loss ~., data=stackloss, nsamp='exact')

## But the computational time goes up something link k^n, where n is the size
## of the data set, so it doesn't take much to make it totally unreasonable.

############################################################
###  Bootstrapping
############################################################
## I'm only doing these two for the linear model.  You should also think about what might
## be appropriate to do for the other models/estimation techniques.
###################
#####  Case resampling
###################

repl = 1000
coeff=matrix(0,1000,4)

for(i in 1:repl){
    temp_model <- lm(stack.loss ~ ., data = stackloss[sample(21, rep=T),])
    coeff[i,] = temp_model$coeff
}

par(mfrow=c(2,2))
plot(density(coeff[,1]), main="Intercept")
plot(density(coeff[,2]), main="Air.Flow")
plot(density(coeff[,3]), main="Water.temp")
plot(density(coeff[,4]), main="Acid.Conc.")


###################
#####  Error resampling
###################

##  We'll just use the raw residuals.  Is this the best choice?

lm_resids = lm(stack.loss ~., data=stackloss)$resi
lm_coeffs = lm(stack.loss ~., data=stackloss)$coeff


for(i in 1:repl){
    newYi = as.matrix(cbind(rep(1,21), Air.Flow, Water.Temp, Acid.Conc.)) %*% (as.matrix(lm_coeffs)) + lm_resids[sample(21, rep=T)]
    temp_model <- lm(newYi ~ Air.Flow + Water.Temp + Acid.Conc.)
    coeff[i,] = temp_model$coeff
}
par(mfrow=c(2,2))
plot(density(coeff[,1]), main="Intercept")
plot(density(coeff[,2]), main="Air.Flow")
plot(density(coeff[,3]), main="Water.temp")
plot(density(coeff[,4]), main="Acid.Conc.")

 
R code for 11/19 section
Thursday, 20 November 2008 18:39

Here's the code:

## Local regression
#
#    When we can't fit a linear model, often it makes sense to fit a piecewise
# model, where we estimate the function at a particular point by using
# information from nearby points. 
#
#    Two examples of this are spline fitting and LOESS (or LOWESS) regression.
# We already applied spline fitting to the mcycle data in MASS (and Dr. Wang teaches
# an excellent class on smoothing splines), so let's look at LOESS.
#
#########
# LOESS regression
#########
# We can approximate any function in a neighborhood about a single data point
# by it's taylor series expansion.  Denote the expansion about x at the point xi
#
#    f(xi|x) ~~ a0 + a1(xi-x) + 1/2 a2(xi - x)^2 + 1/6 a3(xi - x)^3 + ...
#
#    We will truncate this at the second term (although we can use more if we
# want) and try to find the coefficients a0, a1, a2.
#
#    We want to consider points "in the neighborhood" of x, and apply more
# weight to those points that are closer to x.  There are many weight functions
# that can be used, but we're going to use the most common one, the "tri-cube"
# function:
#
#    w(x) = (1 - abs(x)^3)^3 for |x| < 1
#         0            for |x| >= 1
#
#    Let's also introduce a "bandwidth" parameter -- h -- which governs "how
# far" in each direction we're willing to look (the size of the "neighborhood"
# around x). 
#
#    For each point xi around x, we'll assign it weight w((xi-x)/h); note
# that for |xi-x|>h, this will make the weight 0 (since w(x) = 0 for |x| > 1).

library(MASS)
data(mcycle)
attach(mcycle)
plot(mcycle)

# Let's find the local fit around 14 msec; roughly when the first 'elbow' in
# in the data is, and use a bandwidth of 1.5.

BW = 5
ctr = 15

# Here's our weight function
w <- function(x) max((1 - abs(x)^3)^3,0)

# And here's a plot of it
tx = as.matrix(seq(-1,1, len=100))
plot(apply(tx,1,w))

# generate the scaled delta-times
dt = as.matrix((times - ctr)/BW)

# and the list of weights we need for WLS
wlist = apply(dt, 1, w)

# Now make our predictors: (xi-x), (xi-x)^2
X = cbind( times-ctr, (times-ctr)^2)

# And use the 'weights' option in lm to do WLS regression
M_local = lm(accel~X, weights = wlist)
summary(M_local)
B = M_local$coeff

local_x = as.matrix(seq(ctr-BW/5, ctr+BW/5, len=5*BW))
local_fit<-function(x) B[1] + B[2]*(x-ctr) + (B[3]/2)*(x-ctr)^2

pred = apply(local_x, 1, local_fit)
plot(local_x,pred, xlim=c(min(times), max(times)), ylim=c(min(accel), max(accel)))
points(mcycle)

## Obviously, if we shift the center a bit we should expect a different fit,
# but it should be close.

ctr = 13
BW = 5

B2 = lm(accel~cbind( times-ctr, (times-ctr)^2), weights = apply(as.matrix((times - ctr)/BW), 1, w))$coeff
local_fit2<-function(x) B2[1] + B2[2]*(x-ctr) + (B2[3]/2)*(x-ctr)^2
local_x2 = as.matrix(seq(ctr-BW/5, ctr+BW/5, len=5*BW))
pred2 = apply(local_x2, 1, local_fit2)

plot(c(local_x, local_x2), c(pred, pred2), xlim=c(min(times), max(times)), ylim=c(min(accel), max(accel)))
points(mcycle, pty=2)

ctr = 16
BW = 5

B3 = lm(accel~cbind( times-ctr, (times-ctr)^2), weights = apply(as.matrix((times - ctr)/BW), 1, w))$coeff
local_fit3<-function(x) B3[1] + B3[2]*(x-ctr) + (B3[3]/2)*(x-ctr)^2
local_x3 = as.matrix(seq(ctr-BW/5, ctr+BW/5, len=5*BW))
pred3 = apply(local_x3, 1, local_fit3)

plot(c(local_x, local_x2, local_x3), c(pred, pred2,pred3), xlim=c(min(times), max(times)), ylim=c(min(accel), max(accel)))
points(mcycle, pch=46)

local_fit3(16)

## Of course, we can also just use the built-in R package.  "span" is the smoothing parameter, and in this case
# we are using 25% of the data per point, for a bandwidth of about 14
mloess = loess(accel~times, degree=2, span=0.25)
x = seq(min(times), max(times), len=200)
plot(times, mloess$fit, type='l',xlim=c(min(times), max(times)), ylim=c(min(accel), max(accel)))
points(mcycle)

summary(mloess)

#######################
###
#######################

library(assist)
ssr(accel~times, kernel = cubic(times), scale = 1)

times2 = (times-min(times))/diff(range(times))

spline = ssr(accel~times2, rk = cubic(times2), scale = 1)

plot(spline, ask=1)
2
0
points(mcycle$accel~times2)
 

 
Nov. 5 R code
Thursday, 06 November 2008 05:13

Here's the R code for today's section:

library(MASS)
attach(mcycle)
plot(accel~times)

#    This is obviously not linear, fitting a
#    linear model is inappropriate

m1 = lm(accel~times)
summary(m1)

par(mfrow=c(2,2))
plot(m1)

#    Terrible: trends in the residuals, clear
#    nonconstant variance, poor normality
#
#
#    Let's try fitting a polynomial model:
#    y ~ 1 + x + x^2 + x^3
#
#    We'll use the I() function in the model; this
#    insures that the actual timepoint squared gets fit
#    rather than the model term being squared

# lm(y ~ x*y)  <- fit y = b1 + b2*x + b3*y +b4*x*y
# lm(y ~ I(x*y))   <- fit y= b1 + b2*x*y

m2 = lm(accel ~ times + I(times^2) + I(times^3))

summary(m2)
plot(m2)

par(mfrow=c(1,1))
plot(m2$fit~times, type='l')
points(accel~times)


#    We still have something that looks significant
#    but the diagnostics don't look that good; the
#    pattern is still evident in the residuals v
#    fitted plot. 
#
#    Fit a higher polynomial?

m3 = lm(accel ~ times + I(times^2) + I(times^3) + I(times^4) + I(times^5))
par(mfrow=c(2,2))
summary(m3)
plot(m3)

par(mfrow=c(1,1))
plot(m3$fit~times, type='l')
points(accel~times)



m4 = lm(accel ~ times + I(times^2) + I(times^3) + I(times^4) + I(times^5) + I(times^6) + I(times^7) + I(times^8) + I(times^9))
par(mfrow=c(2,2))
summary(m4)
plot(m4)

par(mfrow=c(1,1))
plot(m4$fit~times, type='l')
points(accel~times)


X = as.matrix(cbind(rep(1,133), times, times^2, times^3, times^4))
eigen(solve((t(X)%*%X)))

#    Note the huge difference in eigenvalues

X2 = as.matrix(cbind(rep(1,133), times, times^2, times^3, times^4, times^5, times^6, times^7))
eigen(solve((t(X2)%*%X2)))

#     And the difference gets even worse; near singlular
#    matrix --> bad fits and high variance

#    Another problem -- we've got (obviously) highly
#    colinear predictors; note the near-zero eigenvalues
#
#    So what can we do?
#    This is a case where linear models can't help. This
#    would be more appropriately handled via time series
#    analysis or nonparametric analysis (which we won't
#    get into here, but here's a spline model just for
#    fun).

library(assist)
times2 = times/max(times)
splinemodel = ssr(accel~times2, rk=cubic(times2))
plot(splinemodel, ask=TRUE)
2
0
points(accel~times2)

#################################################################
#################################################################
#################################################################

attach(trees)

LVolume = log(Volume)
m1 = lm(LVolume ~ Girth*I(Girth^2))
m2 = lm(LVolume ~ Girth+I(Girth^2)+I(Girth^3))

summary(m1)
summary(m2)

#    Note that these work out the same -- including
#    interaction terms between two polynomial terms is
#    slightly pointless.

m3 = lm(LVolume ~ (Girth+I(Girth^2))*(Height + I(Height^2)))
summary(m3)
par(mfrow=c(2,2))
plot(m3)

drop1(m3, test='Chisq')
drop1(m3, test='F')
drop1(m3)

drop1(m3, k=log(31))

step(m3, direction="both")

 

 
10-29 R code
Thursday, 30 October 2008 01:03

Here's the R code from class today, covering two different kinds of nonparametric randomization tests.

 

A = c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)
B = c(14, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6)

Diff = A - B
Dlist = NULL

for (lc in 1:10000)
{
    flips = 2*rbinom(10, 1, 0.5) - 1
    Dlist = c(Dlist, mean(Diff*flips))
}


md = mean(Diff)

sum(Dlist<md)/10000
sum(Dlist>-md)/10000

#########

library(boot)

head(claridge)
hist(claridge$hand)

RH = mean(claridge[claridge[,2]==1, 1])
LH = mean(claridge[claridge$hand!=1, 1])

HDlist = NULL

for(i in 1:50000)
{
    hand_r = sample(claridge$hand)
    HDlist = c(HDlist, mean(claridge[hand_r == 1,1]) - mean(claridge[hand_r!=1,1]))
}

sum( (RH-LH) > HDlist)/10000
sum( -(RH-LH) < HDlist)/10000

sum( (RH-LH) > HDlist)/50000 + sum( -(RH-LH) < HDlist)/50000

 

 
Oct 22 section
Wednesday, 22 October 2008 23:39

Here's the R code for today:

### First we fix the seed so that we get the same random numbers every time
set.seed(1)
N = 5
## Generate a fairly tidy regression
xreal = (runif(N)-0.5)*10
yreal = 2.2 * xreal + rnorm(N, 0, 1)

## Nice and linear
plot(xreal,yreal)

## Add a high-leverage point that doesn't follow the orginal regression
x = c(xreal, 10)
y = c(yreal, 3.2*10)


## Note that -- when we 'zoom out' like this -- it still looks pretty linear
plot(x,y)


## Here's our "real" linear fit
m1 = lm(yreal~xreal)
summary(m1)

## And here's our fit with the outliers; notice that it's still
##   highly significant.
m2 = lm(y~x)
summary(m2)


## If we just look at the residuals, we don't see anything wrong.  The high
##  leverage of the outlier has forced the fit to conform to that outlying
##  point, masking it's effect
plot(x,m2$res)

## Make our predictor variable matrix...
xmat = matrix(c(rep(1, N+1), x), N+1, 2)
## Transform it to a hat matrix...
hat = xmat%*%solve(t(xmat)%*%xmat)%*%t(xmat)
## And use the diag command to extract just the diagonal elements; notice the
##   high value for the last point
diag(hat)

## Or save time and use the influence function...
infs = influence(m2)
infs$hat

## The diagonal matrix sums to the number of parameters in the model (2: a mean
##   term and an intercept term
sum(diag(hat))

## We have 12 data points, so our cutoff is...
2*2/(N+1)
## about 1/3 is our cutoff for high leverage.  Our outlier qualifies.

## Let's look at standardized residuals:
plot(stdres(m2))
## This still doesn't look too outrageous

## If we plot the jackknife residuals:
plot(studres(m2))
## Now our bad point is obviously out of place.

## If you've forgotten the incredibly useful 'names' function, now's a good time
##   to remind yourself of it...
m2summary = summary(m2)
names(m2summary)
m2summary$sig

## Calculate standardized residuals by hand, just for practice
r = m2$resid/(m2summary$sigma * sqrt(1 - infs$hat))

cbind(r, stdres(m2))


## Cook's distance:
cd = cooks.distance(m2)
plot(x,cd)



#################################
#################################
#################################
## Co-linearity
N=20
x1 = (runif(N)-0.5)*10
x2 = x1 + 3+(runif(N)-0.5)*2

y = 3*x1 + x2 + rnorm(N)

## Checking the correlations we see that they're actually pretty bad...
cor(cbind(y,x1,x2))

## But fit the mode anyway
m3 = lm(y~x1+x2)
summary(m3)
# Note the large variance on each term

xmat = matrix( c(rep(1, N), x1, x2), N, 3)
eigen((t(xmat)%*%xmat))
## We have eigenvalues that differ by an order of magnitude or so


## The cheap fix: drop one or the other co-linear terms.  This generally isn't the
# ideal approach, but it can sometimes help.  Ridge regression is better.
m3a = lm(y~x1)
summary(m3a)

m3b = lm(y~x2)
summary(m3b)

 

 
R code from section
Thursday, 16 October 2008 18:25

Here's the R code I used in section.  You can just copy and paste this to a script file:

library(MASS)
data(mammals)
attach(mammals)

plot(body,brain)

plot(density(body))
plot(density(brain))
##These look exponential-ish, so let's try a log transform
## and look at the densities again
plot(density(log(body)))
plot(density(log(brain)))

plot(log(body), log(brain))

lbrain=log(brain)
lbody=log(body)

linmodel <- lm(brain~body)
logmodel <- lm(lbrain~lbody)

summary(linmodel)
summary(logmodel)

qqnorm(body)
qqnorm(lbody)

qqnorm(brain)
qqnorm(lbrain)

##Look at all the additional data stored about the linear model
names(linmodel)

## We can access them with the '$' operator

linmodel$coefficients
linmodel$fitted.values
linmodel$residuals

## Some commands that can be used with lm objects

## Diagnostic plots
par(mfrow=c(2,2))
plot(linmodel)
plot(logmodel)

## ANOVA tests
anova(linmodel)
anova(logmodel)

## Confidence intervals for parameters
confint(linmodel)
confint(logmodel)

## Extract residuals through a function
resid(linmodel)
resid(logmodel)

## Standardized residuals
plot(stdres(linmodel))
plot(stdres(logmodel))

## Studentized residuals
plot(studres(linmodel))
plot(studres(logmodel))

## cook's distance of the data points
plot(cooks.distance(linmodel))
plot(cooks.distance(logmodel))

## Find prediction and confidence intervals for new data

newlindata = data.frame(body=c(1, 10, 100))
predict(linmodel, newlindata,interval="prediction")


newlogdata = data.frame(lbody=c(0.1, 0.2, 0.3, 0.4))
predict(logmodel, newlogdata,interval="prediction")

predict(logmodel, interval="confidence", level=0.99999)


#####################
#####################
#####################
rm(list=ls(all=TRUE))

data(cement)

attach(cement)

x0=rep(1,13)

x=as.matrix(cbind(x0,x1,x2,x3,x4))

(x)%*%solve(t(x)%*%x)%*%t(x)

### remember that our hat matrix x[(x'x)^-1]x' is what we use to calculate our
### regression coefficients.  This looks ok.  But...

solve(t(x)%*%x)

### There's a large difference here; nearly singular
eigen(t(x)%*%x)

summary(lm(y~x))

### Really sensitive to small changes
x2=x
x2[12, 4]=11
x2[12, 2]=9


summary(lm(y~x2))

par(mfrow=c(2,2))
plot(lm(y~x))
plot(lm(y~x2))
 
220A section
Friday, 26 September 2008 20:12

Hi, and welcome to 220A.  My office hours will be from 1:30 to 2:30 pm on Wednesdays.  I'll be using this website mostly to post R code and examples.

If you don't already have R, it's available on computers in the Rachev room, on most computers in the HSSB computer lab, and also freely available here. This will be the main environment used in the 220ABC series, so you should become familiar with it as quickly as possible.  A useful 'cheat sheet' is available here, as well as some examples here (note that some of the later examples use GTK+, which may not be installed on your computer; it's pretty large, so I recommend against installing it unless you know what it is and know you'll use it later).

If you have any questions, please email me and I'll respond as quickly as possible.