Statistical Models with R

Regression Analysis

Regression is the study of the change of the distribution of one variable, $y$, according to the value of another variable, $x$. Both continuous and discrete variables can be included in regression problems.

Formally, a relationship between $x$ and $y$ usually means that the expected value of $y$ is different for different values of $x$.

Regression model can be expressed in matrix form as:

$\vec{\mu} = E(\vec{y}) = \matrix{X} \vec{\beta} + \vec{\varepsilon}$

where:

$\matrix{X}$ is a $n \text{ x } (p+1)$ matrix, with all ones in the first columns,
$\vec{\beta}$ is the vector of $\beta_0, \beta_1, \dots, \beta_p$,
$\vec{\varepsilon}$ is a random or noise component.

Models where response variable is a linear function of parameters are linear models.

Parameters values are usually estimated via maximum likelihood.

The main assumption of linear model is that the model is linear in the parameters. Other assumptions concern the error terms $\vec{\varepsilon}$ are:

$E(\varepsilon_{i}) = 0$ ($i = 1, \dots, n$)
$\varepsilon_i \sim\; N(0,\,\sigma^2_\varepsilon)$ ($i = 1, \dots, n$)
$Corr(\varepsilon_{i},\,\varepsilon_{i-j}) = 0$ ($i = 1, \dots, n; 0 < j < i$)

The lm and aov functions are used in R to fit respectively linear regression and analysis of variance models.

The common interface to fit a statistical model in R is made of a call to the corresponding function with arguments formula and data.

Model formula general expression is $y \sim x_1 \pm x_2 \pm, \dots, \pm , x_k$.

Extensions to the above formula can be summarized as:

After fitting any statistical model, common R functions for inference are:

Drug Dosage and Reaction Time

In an experiment to investigate the effect of a depressant drug, the reaction times of ten males rats to a certain stimulus were measured after a specified dose of the drug had been administer to each rat. The results were as follows:

load("drug.Rda")
str(drug)

1 2	load("drug.Rda") str(drug)

## 'data.frame':    10 obs. of  3 variables:
##  $ rat : int  1 2 3 4 5 6 7 8 9 10
##  $ dose: num  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
##  $ time: num  0.32 0.24 0.4 0.52 0.44 0.56 0.64 0.52 0.6 0.8

## 'data.frame': 10 obs. of 3 variables:

## $ rat : int 1 2 3 4 5 6 7 8 9 10

## $ dose: num 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

## $ time: num 0.32 0.24 0.4 0.52 0.44 0.56 0.64 0.52 0.6 0.8

Basic graphical data exploration may be achieved with:

plot(time ~ dose, data = drug,
     xlab = "Dose (mg)",
     ylab =  "Reaction Time (secs)")

plot(time ~ dose, data = drug,

xlab = "Dose (mg)",

ylab = "Reaction Time (secs)")

plot of chunk drugplot

A simple model for these data might be a straight line: $E(y_i|x_i) = \beta_0 + \beta_1 x_i \text{ for } i = 1, \dots, n$

The equivalent R command is given by:

fm = lm(time ~ dose, data = drug)

1	fm = lm(time ~ dose, data = drug)

The resulting regression line can be easy superposed to the data scatterplot by:

plot(time ~ dose, data = drug,
     xlab = "Dose (mg)",
     ylab =  "Reaction Time (secs)")

abline(reg = fm)

plot(time ~ dose, data = drug,

xlab = "Dose (mg)",

ylab = "Reaction Time (secs)")

abline(reg = fm)

plot of chunk reg

Regression results can be investigated by generic function summary().

summary(fm)

1	summary(fm)

## 
## Call:
## lm(formula = time ~ dose, data = drug)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.104 -0.064  0.024  0.056  0.088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2880     0.0452    6.37  0.00022 ***
## dose          0.4800     0.0847    5.67  0.00047 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0769 on 8 degrees of freedom
## Multiple R-squared:  0.801,  Adjusted R-squared:  0.776 
## F-statistic: 32.1 on 1 and 8 DF,  p-value: 0.000472

## Call:

## lm(formula = time ~ dose, data = drug)

## Residuals:

## Min 1Q Median 3Q Max

## -0.104 -0.064 0.024 0.056 0.088

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.2880 0.0452 6.37 0.00022 ***

## dose 0.4800 0.0847 5.67 0.00047 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Residual standard error: 0.0769 on 8 degrees of freedom

## Multiple R-squared: 0.801, Adjusted R-squared: 0.776

## F-statistic: 32.1 on 1 and 8 DF, p-value: 0.000472

Small p-values for t-test on coefficients lead to consider both coefficients as significant. Adjust R-squared equal to 0.78 shows an acceptable portion of total variation explained by the regression model.

Small p-value for F-statistic confirm fitted model significance when compared with null model: $E(y_i|x_i) = \beta_0$.

Same result could be obtained by:

anova(fm, test = F)

1	anova(fm, test = F)

## Analysis of Variance Table
## 
## Response: time
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## dose       1 0.1901  0.1901    32.1 0.00047 ***
## Residuals  8 0.0474  0.0059                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Analysis of Variance Table

## Response: time

## Df Sum Sq Mean Sq F value Pr(>F)

## dose 1 0.1901 0.1901 32.1 0.00047 ***

## Residuals 8 0.0474 0.0059

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prediction for response variable at specific values of the explanatories variable can be gained by:

newdata = data.frame(dose = c(0.2, 0.4, 0.6 ))
predict (fm , newdata = newdata)

1 2	newdata = data.frame(dose = c(0.2, 0.4, 0.6 )) predict (fm , newdata = newdata)

##     1     2     3 
## 0.384 0.480 0.576

1 2	## 1 2 3 ## 0.384 0.480 0.576

Note how the newdata argument within predict() requires an object of class data.frame containing the same variables as in the explanatory part of the model.

Predict can be used for defining prediction interval for the fitted model. Prediction interval for any new observation possibly within the regression support is given by:

predict (fm , newdata = newdata, interval = "prediction")

1	predict (fm , newdata = newdata, interval = "prediction")

##     fit    lwr    upr
## 1 0.384 0.1916 0.5764
## 2 0.480 0.2937 0.6663
## 3 0.576 0.3876 0.7644

## fit lwr upr

## 1 0.384 0.1916 0.5764

## 2 0.480 0.2937 0.6663

## 3 0.576 0.3876 0.7644

Prediction interval can be plotted by:

newdata = data.frame(dose = seq(0, 0.9 , len = 100))
newdata = cbind(newdata, predict(fm, newdata = newdata,
  interval = "prediction"))
ylim = c(min(newdata$lwr), max(newdata$upr)) * c(0.99, 1.01)

plot(time ~ dose, data = drug, type = "n", ylim = ylim,
  xlab = "Dose (mg)", ylab = "Reaction Time (secs)")

with(newdata, {
  polygon(x = c(dose, rev(dose)), y = c(lwr , rev(upr)), col = "gray")
  lines(dose, fit, col = "red")
})

with(drug, points(dose, time, pch = 16, col = "darkblue"))

newdata = data.frame(dose = seq(0, 0.9 , len = 100))

newdata = cbind(newdata, predict(fm, newdata = newdata,

interval = "prediction"))

ylim = c(min(newdata$lwr), max(newdata$upr)) * c(0.99, 1.01)

plot(time ~ dose, data = drug, type = "n", ylim = ylim,

xlab = "Dose (mg)", ylab = "Reaction Time (secs)")

with(newdata, {

polygon(x = c(dose, rev(dose)), y = c(lwr , rev(upr)), col = "gray")

lines(dose, fit, col = "red")

})

with(drug, points(dose, time, pch = 16, col = "darkblue"))

plot of chunk predictionplot

Finally, the initial assumptions concerning the structure of the error term:

$E(\varepsilon_{i}) = 0$ ($i = 1, \dots, n$)
$\varepsilon_i \sim\; N(0,\,\sigma^2_\varepsilon)$ ($i = 1, \dots, n$)
$Corr(\varepsilon_{i},\,\varepsilon_{i-j}) = 0$ ($i = 1, \dots, n; 0 < j < i$)

can be verified by the visual inspection of the suite of residuals plots offered by default by R:

par(mfrow = c(2,2))
plot(fm)

1 2	par(mfrow = c(2,2)) plot(fm)

plot of chunk lmplot

The “Residuals vs Fitted” plot does not show, in this example, any particular pattern. The presence of patterns may suggest that the model is inadequate. The “Normal Q-Q” plot shows points close to the straight line. If the normal assumption of residuals is not satisfied, points are far from the straight line. The “Normal Q-Q plot” is less reliable on the distribution tails, i.e. points ought be very far from the straight line to suggest that residuals follow a non-normal distribution. The “Scale location” is similar to the residuals versus fitted values plot, but it uses the square root of the standardized residuals in order to diminish skewness. Like the first plot, there should be no discernable pattern to the plot. The “Residuals vs Leverage” shows leverage points. Leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation. Leverage points fall out the dotted lines.

Car Seat

Three quality inspectors are studying the reproducibility of a measurement method aiming to test resistance of a specific material used to cover car seats. As a result, an experiments involving 75 samples of material from the same batch is set up. Three operators: Kevin, Michelle and Rob are assigned to test 25 samples each.

Comparison of operators average measurements and within operators variations are the key points of the analysis.

load("carseat.Rda")
str(carseat)

1 2	load("carseat.Rda") str(carseat)

## 'data.frame':    75 obs. of  2 variables:
##  $ Operator: Factor w/ 3 levels "Kevin","Michelle",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Strength: num  11.3 10.6 10.4 10.2 10.4 ...

## 'data.frame': 75 obs. of 2 variables:

## $ Operator: Factor w/ 3 levels "Kevin","Michelle",..: 2 2 2 2 2 2 2 2 2 2 ...

## $ Strength: num 11.3 10.6 10.4 10.2 10.4 ...

Boxplot of Strength by Operator along with within operators averages and between averages connecting line can be display by:

boxplot(Strength~Operator, data = carseat, xlab = "Operator",
  ylab = "Resistance", col = "lightgray")
averages =  tapply(carseat$Strength, carseat$Operator, mean)
points(averages, col = "red", pch = 22, lwd = 2, type = "b")

boxplot(Strength~Operator, data = carseat, xlab = "Operator",

ylab = "Resistance", col = "lightgray")

averages = tapply(carseat$Strength, carseat$Operator, mean)

points(averages, col = "red", pch = 22, lwd = 2, type = "b")

plot of chunk carseatplot

At first glance, difference between operators is hard to detect as within variation seems to be quite large. An outstanding value is shown at Michelle.

A simple one way analysis of variance model is computed with:

fm = aov(Strength~Operator, data = carseat)

1	fm = aov(Strength~Operator, data = carseat)

Results are shown with the usual summary() method.

summary(fm)

1	summary(fm)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Operator     2    6.6    3.31    3.85  0.026 *
## Residuals   72   61.9    0.86                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Df Sum Sq Mean Sq F value Pr(>F)

## Operator 2 6.6 3.31 3.85 0.026 *

## Residuals 72 61.9 0.86

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Operators appears to be borderline significant as its F test p-value is 0.0258. That is, not a real difference exists between operators measurements methods.

The overall influence of Michelle outstanding single result can be tested by repeating the analysis after the single observation is excluded from the data.

out = max(carseat$Strength[carseat$Operator == "Michelle"])
fm1 = aov(Strength~Operator, data = carseat[-which(carseat$Strength == out),])
summary(fm1)

out = max(carseat$Strength[carseat$Operator == "Michelle"])

fm1 = aov(Strength~Operator, data = carseat[-which(carseat$Strength == out),])

summary(fm1)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Operator     2    6.3   3.147       4  0.023 *
## Residuals   71   55.8   0.787                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Df Sum Sq Mean Sq F value Pr(>F)

## Operator 2 6.3 3.147 4 0.023 *

## Residuals 71 55.8 0.787

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Operators significance is cleary increased as its p-value decreased to 0.0226. Nevertheless, practical results remain unchanged.

In mathematical notation the model can be written as: $E(y_i|x_i) = \beta_0+\beta_j$ with $j = 1,2,3$

Single coefficients can be visualized by:

summary.lm(fm)

1	summary.lm(fm)

## 
## Call:
## aov(formula = Strength ~ Operator, data = carseat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9064 -0.5482 -0.0388  0.4656  2.4100 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.729      0.185   52.46   <2e-16 ***
## OperatorMichelle    0.501      0.262    1.91   0.0600 .  
## OperatorRob         0.708      0.262    2.70   0.0087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.927 on 72 degrees of freedom
## Multiple R-squared:  0.0966, Adjusted R-squared:  0.0715 
## F-statistic: 3.85 on 2 and 72 DF,  p-value: 0.0258

## Call:

## aov(formula = Strength ~ Operator, data = carseat)

## Residuals:

## Min 1Q Median 3Q Max

## -1.9064 -0.5482 -0.0388 0.4656 2.4100

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 9.729 0.185 52.46 <2e-16 ***

## OperatorMichelle 0.501 0.262 1.91 0.0600 .

## OperatorRob 0.708 0.262 2.70 0.0087 **

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Residual standard error: 0.927 on 72 degrees of freedom

## Multiple R-squared: 0.0966, Adjusted R-squared: 0.0715

## F-statistic: 3.85 on 2 and 72 DF, p-value: 0.0258

Note that the intercept assumes the average strength value for Kevin while the other coefficients represent the difference in Strength between the corresponding operators and Kevin.

Finally, normality of residuals can be checked by:

par(mfrow = c(1,2))
plot(fm, which = 1:2)

1 2	par(mfrow = c(1,2)) plot(fm, which = 1:2)

plot of chunk residualscarseat

Despite a small departure from normality on the rigth side of the residuals distribution, model assumptions seem to confirm.

Boiling Time

In an attempt to resolve a domestic dispute about which of two pans was the quicker pan for cooking, the following data were obtained:

load("boiling.Rda")
str(boiling)

1 2	load("boiling.Rda") str(boiling)

## 'data.frame':    6 obs. of  3 variables:
##  $ time  : num  4.65 9.75 15.05 4.75 10.15 ...
##  $ pan   : Factor w/ 2 levels "A","B": 1 1 1 2 2 2
##  $ volume: num  1 3 5 1 3 5

## 'data.frame': 6 obs. of 3 variables:

## $ time : num 4.65 9.75 15.05 4.75 10.15 ...

## $ pan : Factor w/ 2 levels "A","B": 1 1 1 2 2 2

## $ volume: num 1 3 5 1 3 5

Various measured volumes (pints) of cold water were put into each pan and heated using the same setting of the cooker. The response variable was the time in minutes until the water boiled.

From a first visual investigation, it’s easy to understand that differences in boling time means difference either in intercept or slope between the two lines:

library(lattice)
graph = xyplot(time ~ volume, data = boiling, groups = pan,
  type = "b", auto.key = list(space = "right", lines = F, points = T))
print(graph)

library(lattice)

graph = xyplot(time ~ volume, data = boiling, groups = pan,

type = "b", auto.key = list(space = "right", lines = F, points = T))

print(graph)

plot of chunk boilingxyplot

A simple linear model may provide the answer to the above questions:

fm1 = lm(time ~ volume + pan + volume:pan, data = boiling)
summary(fm1)$coeff

1 2	fm1 = lm(time ~ volume + pan + volume:pan, data = boiling) summary(fm1)$coeff

##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)  2.01667    0.07187  28.0609 1.268e-03
## volume       2.60000    0.02104 123.5704 6.548e-05
## panB         0.05417    0.10164   0.5329 6.474e-01
## volume:panB  0.08750    0.02976   2.9406 9.880e-02

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 2.01667 0.07187 28.0609 1.268e-03

## volume 2.60000 0.02104 123.5704 6.548e-05

## panB 0.05417 0.10164 0.5329 6.474e-01

## volume:panB 0.08750 0.02976 2.9406 9.880e-02

Given $\alpha=0.05$ as significance level, only some coefficients result as significant. Note that PanB express the difference between the intercept, aliased with PanA, and PanB while volume:panB represent the difference in slope between PanA (volume) and PanB.

PanA has a shorter boiling time than PanB but both the difference between the intercepts and the difference between slopes are not statistically significant.

A different parameterisation of the model, possibly more intuitive but less attinent with the original problem could be given by:

fm11 = lm(time ~ pan + volume:pan - 1, data = boiling)
summary(fm11)$coeff

1 2	fm11 = lm(time ~ pan + volume:pan - 1, data = boiling) summary(fm11)$coeff

##             Estimate Std. Error t value  Pr(>|t|)
## panA           2.017    0.07187   28.06 1.268e-03
## panB           2.071    0.07187   28.81 1.202e-03
## panA:volume    2.600    0.02104  123.57 6.548e-05
## panB:volume    2.688    0.02104  127.73 6.129e-05

## Estimate Std. Error t value Pr(>|t|)

## panA 2.017 0.07187 28.06 1.268e-03

## panB 2.071 0.07187 28.81 1.202e-03

## panA:volume 2.600 0.02104 123.57 6.548e-05

## panB:volume 2.688 0.02104 127.73 6.129e-05

or alternatively by:

fm12 = lm(time ~ pan/volume - 1, data = boiling)
summary(fm12)$coeff

1 2	fm12 = lm(time ~ pan/volume - 1, data = boiling) summary(fm12)$coeff

##             Estimate Std. Error t value  Pr(>|t|)
## panA           2.017    0.07187   28.06 1.268e-03
## panB           2.071    0.07187   28.81 1.202e-03
## panA:volume    2.600    0.02104  123.57 6.548e-05
## panB:volume    2.688    0.02104  127.73 6.129e-05

## Estimate Std. Error t value Pr(>|t|)

## panA 2.017 0.07187 28.06 1.268e-03

## panB 2.071 0.07187 28.81 1.202e-03

## panA:volume 2.600 0.02104 123.57 6.548e-05

## panB:volume 2.688 0.02104 127.73 6.129e-05

Finally, the same results could be obtained by using a generalized linear model with Gaussian family and its natural identity link function:

fm2 = glm(time ~ pan * volume, data = boiling,
  family = gaussian(link = "identity"))
summary(fm12)$coeff

fm2 = glm(time ~ pan * volume, data = boiling,

family = gaussian(link = "identity"))

summary(fm12)$coeff

##             Estimate Std. Error t value  Pr(>|t|)
## panA           2.017    0.07187   28.06 1.268e-03
## panB           2.071    0.07187   28.81 1.202e-03
## panA:volume    2.600    0.02104  123.57 6.548e-05
## panB:volume    2.688    0.02104  127.73 6.129e-05

## Estimate Std. Error t value Pr(>|t|)

## panA 2.017 0.07187 28.06 1.268e-03

## panB 2.071 0.07187 28.81 1.202e-03

## panA:volume 2.600 0.02104 123.57 6.548e-05

## panB:volume 2.688 0.02104 127.73 6.129e-05

However, this model could be unsatisfactory in two respects:

boiling time cannot be negative,
the variance of boiling time might be expected to increase with its expectation.

The natural candidate for this problem is the gamma distibution as it respect the increasing relationship between mean and variance.

As a result, a generalized linear model with family gamma and link identity can be used in place of a standard liear model.

fm3 = glm(time ~ pan*volume, data = boiling,
  family = Gamma(link = "identity"))
summary(fm3)$coeff

fm3 = glm(time ~ pan*volume, data = boiling,

family = Gamma(link = "identity"))

summary(fm3)$coeff

##              Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 2.0592378    0.03944  52.21096 3.666e-04
## panB        0.0008908    0.05651   0.01576 9.889e-01
## volume      2.5839284    0.01947 132.72627 5.676e-05
## panB:volume 0.1076174    0.02799   3.84542 6.146e-02

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 2.0592378 0.03944 52.21096 3.666e-04

## panB 0.0008908 0.05651 0.01576 9.889e-01

## volume 2.5839284 0.01947 132.72627 5.676e-05

## panB:volume 0.1076174 0.02799 3.84542 6.146e-02

Models comparison can be given by the AIC statistics.

fm2$aic

fm2$aic

## [1] -13.42

1	## [1] -13.42

fm3$aic

fm3$aic

## [1] -15.09

1	## [1] -15.09

Gamma model appears to be slightly better.

Summary

In this chapter, we introduced statistical modelling with R. Through some examples on real data, we saw the lm() function to build a linear regression model, the glm() function to build a generalised linear model and anova() and aov() functions to build an ANOVA model. The summary() function is useful to investigate regression and ANOVA results. We used the plot() function both to draw the regression line in the scatter plot and to perform a residual analysis. P value were used to choose between two nested models, while AIC statistics were introduced to compare two not-nested models.

Chapter 8

Statistical Models with R

Regression Analysis

Drug Dosage and Reaction Time

Car Seat

Boiling Time

Summary

Join us!

Courses calendar

We are part of

Categories

Archives

Chapter 8

Statistical Models with R

Regression Analysis

Drug Dosage and Reaction Time

Car Seat

Boiling Time

Summary

Join us!

Courses calendar

We are part of

Categories

Archives

Tags