Statistical Models with R
Regression Analysis
Regression is the study of how the distribution of one variable, \(y\), changes according to the value of another variable, \(x\). Both continuous and discrete variables can be included in regression problems.
Formally, a relationship between \(x\) and \(y\) usually means that the expected value of \(y\) is different for different values of \(x\).
A regression model can be expressed in matrix form as:
\(\vec{y} = \mathbf{X} \vec{\beta} + \vec{\varepsilon}, \qquad \vec{\mu} = E(\vec{y}) = \mathbf{X} \vec{\beta}\)
where:
- \(\mathbf{X}\) is an \(n \times (p+1)\) matrix, with all ones in the first column,
- \(\vec{\beta}\) is the vector of coefficients \(\beta_0, \beta_1, \dots, \beta_p\),
- \(\vec{\varepsilon}\) is the vector of random errors (noise).
Models where the response variable is a linear function of the parameters are called linear models.
Parameter values are usually estimated via maximum likelihood.
The main assumption of a linear model is that it is linear in the parameters. Other assumptions, concerning the error terms \(\vec{\varepsilon}\), are:
- \(E(\varepsilon_{i}) = 0\) (\(i = 1, \dots, n\))
- \(\varepsilon_i \sim\; N(0,\,\sigma^2_\varepsilon)\) (\(i = 1, \dots, n\))
- \(Corr(\varepsilon_{i},\,\varepsilon_{i-j}) = 0\) (\(i = 1, \dots, n; 0 < j < i\))
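As a small illustration of this formulation, the design matrix and the coefficient estimates can be computed explicitly; under the normality assumption above, the maximum likelihood estimate of \(\vec{\beta}\) coincides with the least squares estimate. The sketch below uses a toy data frame created on the spot, not one of this chapter's datasets:

## toy data, created only for illustration
set.seed(1)
toy = data.frame(x1 = 1:10, x2 = runif(10))
toy$y = 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(10)

## design matrix: n x (p+1), first column all ones
X = model.matrix(~ x1 + x2, data = toy)

## least squares / ML estimate of beta: (X'X)^{-1} X'y
beta_hat = solve(t(X) %*% X, t(X) %*% toy$y)
beta_hat
coef(lm(y ~ x1 + x2, data = toy))   # same estimates via lm()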
The lm() and aov() functions are used in R to fit linear regression and analysis of variance models, respectively.
The common interface for fitting a statistical model in R is a call to the corresponding function with the arguments formula and data.
The general expression of a model formula is \(y \sim x_1 \pm x_2 \pm \dots \pm x_k\).
Extensions to the above formula can be summarized with a few additional operators, sketched below.
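The following lines list the standard R formula operators; the names y, x1 and x2 are placeholders rather than variables from this chapter's datasets:

y ~ x1 + x2      # additive effects of x1 and x2
y ~ x1 - 1       # remove the intercept
y ~ x1:x2        # interaction between x1 and x2
y ~ x1 * x2      # crossing: equivalent to x1 + x2 + x1:x2
y ~ x1 / x2      # nesting: equivalent to x1 + x1:x2
y ~ I(x1^2)      # arithmetic expression protected by I()
y ~ .            # all other variables in the data frame (when data is supplied)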
After fitting any statistical model, common R functions for inference are summary(), anova(), coef(), confint(), predict(), residuals() and plot().
Drug Dosage and Reaction Time
In an experiment to investigate the effect of a depressant drug, the reaction times of ten male rats to a certain stimulus were measured after a specified dose of the drug had been administered to each rat. The results were as follows:
load("drug.Rda")
str(drug)
## 'data.frame': 10 obs. of 3 variables:
##  $ rat : int 1 2 3 4 5 6 7 8 9 10
##  $ dose: num 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
##  $ time: num 0.32 0.24 0.4 0.52 0.44 0.56 0.64 0.52 0.6 0.8
Basic graphical data exploration may be achieved with:
plot(time ~ dose, data = drug,
     xlab = "Dose (mg)",
     ylab = "Reaction Time (secs)")
A simple model for these data might be a straight line: \(E(y_i|x_i) = \beta_0 + \beta_1 x_i \text{ for } i = 1, \dots, n\)
The equivalent R command is given by:
fm = lm(time ~ dose, data = drug)
The resulting regression line can be easily superimposed on the data scatterplot by:
plot(time ~ dose, data = drug,
     xlab = "Dose (mg)",
     ylab = "Reaction Time (secs)")
abline(reg = fm)
Regression results can be investigated with the generic summary() function.
summary(fm)
##
## Call:
## lm(formula = time ~ dose, data = drug)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -0.104 -0.064  0.024  0.056  0.088
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.2880     0.0452    6.37  0.00022 ***
## dose          0.4800     0.0847    5.67  0.00047 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0769 on 8 degrees of freedom
## Multiple R-squared:  0.801,  Adjusted R-squared:  0.776
## F-statistic: 32.1 on 1 and 8 DF,  p-value: 0.000472
Small p-values for the t-tests on the coefficients lead us to consider both coefficients as significant. An adjusted R-squared of 0.78 indicates that an acceptable portion of the total variation is explained by the regression model.
The small p-value for the F-statistic confirms the significance of the fitted model when compared with the null model \(E(y_i|x_i) = \beta_0\).
The same result can be obtained by:
anova(fm)
## Analysis of Variance Table
##
## Response: time
##           Df Sum Sq Mean Sq F value  Pr(>F)
## dose       1 0.1901  0.1901    32.1 0.00047 ***
## Residuals  8 0.0474  0.0059
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
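For completeness, the comparison with the null model can also be requested explicitly by fitting the intercept-only model and passing both fits to anova(); the object name fm0 is introduced here only for this sketch:

fm0 = lm(time ~ 1, data = drug)   # null model: intercept only
anova(fm0, fm)                    # F test comparing null and fitted model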
Predictions of the response variable at specific values of the explanatory variable can be obtained by:
newdata = data.frame(dose = c(0.2, 0.4, 0.6))
predict(fm, newdata = newdata)
##     1     2     3
## 0.384 0.480 0.576
Note how the newdata argument of predict() requires an object of class data.frame containing the same variables as in the explanatory part of the model.
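Note also that when newdata is omitted, predict() simply returns the fitted values at the observed doses, the same values given by fitted():

predict(fm)   # predictions at the observed doses
fitted(fm)    # identical values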
predict() can also be used to compute prediction intervals for the fitted model. A prediction interval for any new observation, ideally within the regression support, is given by:
predict(fm, newdata = newdata, interval = "prediction")
##     fit    lwr    upr
## 1 0.384 0.1916 0.5764
## 2 0.480 0.2937 0.6663
## 3 0.576 0.3876 0.7644
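Confidence intervals for the mean response, rather than for a single new observation, are obtained in the same way by setting interval = "confidence"; they are narrower than the prediction intervals shown above:

predict(fm, newdata = newdata, interval = "confidence")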
The prediction intervals can be plotted by:
newdata = data.frame(dose = seq(0, 0.9, len = 100))
newdata = cbind(newdata,
                predict(fm, newdata = newdata, interval = "prediction"))
ylim = c(min(newdata$lwr), max(newdata$upr)) * c(0.99, 1.01)
plot(time ~ dose, data = drug, type = "n", ylim = ylim,
     xlab = "Dose (mg)", ylab = "Reaction Time (secs)")
with(newdata, {
  polygon(x = c(dose, rev(dose)), y = c(lwr, rev(upr)), col = "gray")
  lines(dose, fit, col = "red")
})
with(drug, points(dose, time, pch = 16, col = "darkblue"))
Finally, the initial assumptions concerning the structure of the error term:
- \(E(\varepsilon_{i}) = 0\) (\(i = 1, \dots, n\))
- \(\varepsilon_i \sim\; N(0,\,\sigma^2_\varepsilon)\) (\(i = 1, \dots, n\))
- \(Corr(\varepsilon_{i},\,\varepsilon_{i-j}) = 0\) (\(i = 1, \dots, n; 0 < j < i\))
can be verified by visual inspection of the suite of residual plots offered by default by R:
par(mfrow = c(2, 2))
plot(fm)
The default diagnostic plots can be read as follows:
- The “Residuals vs Fitted” plot does not show, in this example, any particular pattern. The presence of a pattern would suggest that the model is inadequate.
- The “Normal Q-Q” plot shows points close to the straight line. If the normality assumption for the residuals is not satisfied, points lie far from the straight line. This plot is less reliable in the distribution tails, i.e. points ought to be very far from the straight line to suggest that the residuals follow a non-normal distribution.
- The “Scale-Location” plot is similar to the residuals versus fitted values plot, but it uses the square root of the standardized residuals in order to diminish skewness. As in the first plot, there should be no discernible pattern.
- The “Residuals vs Leverage” plot shows leverage points, i.e. observations, if any, made at extreme or outlying values of the independent variables, such that the lack of neighbouring observations forces the fitted regression to pass close to those observations. Points falling outside the dotted lines are potentially influential.
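The visual inspection can be complemented with a formal normality test on the residuals, for instance the Shapiro-Wilk test; this check is a suggestion, not part of the original analysis:

shapiro.test(residuals(fm))   # H0: residuals are normally distributed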
Car Seat
Three quality inspectors are studying the reproducibility of a measurement method aimed at testing the resistance of a specific material used to cover car seats. To this end, an experiment involving 75 samples of material from the same batch is set up. Three operators, Kevin, Michelle and Rob, are assigned to test 25 samples each.
Comparison of the operators' average measurements and of the within-operator variation are the key points of the analysis.
load("carseat.Rda")
str(carseat)
## 'data.frame': 75 obs. of 2 variables:
##  $ Operator: Factor w/ 3 levels "Kevin","Michelle",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Strength: num 11.3 10.6 10.4 10.2 10.4 ...
A boxplot of Strength by Operator, along with the within-operator averages and a line connecting them, can be displayed by:
boxplot(Strength ~ Operator, data = carseat,
        xlab = "Operator", ylab = "Resistance", col = "lightgray")
averages = tapply(carseat$Strength, carseat$Operator, mean)
points(averages, col = "red", pch = 22, lwd = 2, type = "b")
At first glance, a difference between operators is hard to detect, as the within-operator variation seems to be quite large. An outlying value stands out for Michelle.
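Within-operator variation can also be quantified numerically, for example with per-operator standard deviations; this is a small additional check using the same tapply() idiom as above:

tapply(carseat$Strength, carseat$Operator, sd)   # within-operator standard deviations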
A simple one-way analysis of variance model is fitted with:
fm = aov(Strength ~ Operator, data = carseat)
Results are shown with the usual summary() method.
summary(fm)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Operator     2    6.6    3.31    3.85  0.026 *
## Residuals   72   61.9    0.86
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Operator appears to be borderline significant, as its F-test p-value is 0.0258. That is, the evidence for a real difference between the operators' measurements is weak.
The overall influence of Michelle's single outlying result can be assessed by repeating the analysis after that observation is excluded from the data.
out = max(carseat$Strength[carseat$Operator == "Michelle"])
fm1 = aov(Strength ~ Operator, data = carseat[-which(carseat$Strength == out), ])
summary(fm1)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Operator     2    6.3   3.147       4  0.023 *
## Residuals   71   55.8   0.787
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The significance of Operator clearly increases, as its p-value decreases to 0.0226. Nevertheless, the practical conclusions remain unchanged.
In mathematical notation the model can be written as \(E(y_i|x_i) = \beta_0 + \beta_j\) with \(j = 1, 2, 3\), where under R's default treatment contrasts the effect of the first level (Kevin) is set to zero, i.e. \(\beta_1 = 0\).
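The parameterisation actually used by R can be inspected through the design matrix of the fit, where the first level of Operator (Kevin) is absorbed into the intercept; a quick sketch:

head(model.matrix(fm))   # columns: (Intercept), OperatorMichelle, OperatorRob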
Individual coefficients can be displayed by:
summary.lm(fm)
##
## Call:
## aov(formula = Strength ~ Operator, data = carseat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9064 -0.5482 -0.0388  0.4656  2.4100
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)         9.729      0.185   52.46   <2e-16 ***
## OperatorMichelle    0.501      0.262    1.91   0.0600 .
## OperatorRob         0.708      0.262    2.70   0.0087 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.927 on 72 degrees of freedom
## Multiple R-squared:  0.0966, Adjusted R-squared:  0.0715
## F-statistic: 3.85 on 2 and 72 DF,  p-value: 0.0258
Note that the intercept corresponds to the average Strength value for Kevin, while the other coefficients represent the differences in Strength between the corresponding operators and Kevin.
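Pairwise differences between operators, with confidence intervals adjusted for multiple comparisons, can be examined with Tukey's HSD on the aov fit; this step is an optional addition to the original analysis:

TukeyHSD(fm)   # pairwise operator comparisons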
Finally, normality of residuals can be checked by:
par(mfrow = c(1, 2))
plot(fm, which = 1:2)
Despite a small departure from normality in the right tail of the residual distribution, the model assumptions seem to be confirmed.
Boiling Time
In an attempt to resolve a domestic dispute about which of two pans was the quicker pan for cooking, the following data were obtained:
load("boiling.Rda")
str(boiling)
## 'data.frame': 6 obs. of 3 variables:
##  $ time  : num 4.65 9.75 15.05 4.75 10.15 ...
##  $ pan   : Factor w/ 2 levels "A","B": 1 1 1 2 2 2
##  $ volume: num 1 3 5 1 3 5
Various measured volumes (pints) of cold water were put into each pan and heated using the same setting of the cooker. The response variable was the time in minutes until the water boiled.
From a first visual investigation, it’s easy to see that a difference in boiling time means a difference in either intercept or slope between the two lines:
library(lattice)
graph = xyplot(time ~ volume, data = boiling, groups = pan, type = "b",
               auto.key = list(space = "right", lines = FALSE, points = TRUE))
print(graph)
A simple linear model may provide the answer to the above questions:
fm1 = lm(time ~ volume + pan + volume:pan, data = boiling)
summary(fm1)$coeff
##              Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)   2.01667    0.07187  28.0609 1.268e-03
## volume        2.60000    0.02104 123.5704 6.548e-05
## panB          0.05417    0.10164   0.5329 6.474e-01
## volume:panB   0.08750    0.02976   2.9406 9.880e-02
Given \(\alpha = 0.05\) as the significance level, only some coefficients turn out to be significant. Note that panB expresses the difference in intercept between pan B and pan A (whose intercept is aliased with the model intercept), while volume:panB represents the difference in slope between pan A (volume) and pan B.
Pan A has a shorter boiling time than pan B, but neither the difference between intercepts nor the difference between slopes is statistically significant.
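Whether the pan effect is needed at all can also be assessed by comparing fm1 with the nested model containing volume only; the object name fm0 below is introduced just for this sketch:

fm0 = lm(time ~ volume, data = boiling)   # nested model without pan terms
anova(fm0, fm1)                           # F test for the extra pan terms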
A different parameterisation of the model, possibly more intuitive but less directly tied to the original question, is given by:
fm11 = lm(time ~ pan + volume:pan - 1, data = boiling)
summary(fm11)$coeff
##             Estimate Std. Error t value  Pr(>|t|)
## panA           2.017    0.07187   28.06 1.268e-03
## panB           2.071    0.07187   28.81 1.202e-03
## panA:volume    2.600    0.02104  123.57 6.548e-05
## panB:volume    2.688    0.02104  127.73 6.129e-05
or alternatively by:
fm12 = lm(time ~ pan/volume - 1, data = boiling)
summary(fm12)$coeff
##             Estimate Std. Error t value  Pr(>|t|)
## panA           2.017    0.07187   28.06 1.268e-03
## panB           2.071    0.07187   28.81 1.202e-03
## panA:volume    2.600    0.02104  123.57 6.548e-05
## panB:volume    2.688    0.02104  127.73 6.129e-05
Finally, the same results could be obtained by using a generalized linear model with Gaussian family and its natural identity link function:
fm2 = glm(time ~ pan/volume - 1, data = boiling,
          family = gaussian(link = "identity"))
summary(fm2)$coeff
##             Estimate Std. Error t value  Pr(>|t|)
## panA           2.017    0.07187   28.06 1.268e-03
## panB           2.071    0.07187   28.81 1.202e-03
## panA:volume    2.600    0.02104  123.57 6.548e-05
## panB:volume    2.688    0.02104  127.73 6.129e-05
However, this model could be unsatisfactory in two respects:
- boiling time cannot be negative,
- the variance of boiling time might be expected to increase with its expectation.
A natural candidate for this problem is the Gamma distribution, as it respects the increasing relationship between mean and variance.
As a result, a generalized linear model with Gamma family and identity link can be used in place of a standard linear model.
fm3 = glm(time ~ pan * volume, data = boiling,
          family = Gamma(link = "identity"))
summary(fm3)$coeff
##              Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 2.0592378    0.03944  52.21096 3.666e-04
## panB        0.0008908    0.05651   0.01576 9.889e-01
## volume      2.5839284    0.01947 132.72627 5.676e-05
## panB:volume 0.1076174    0.02799   3.84542 6.146e-02
Model comparison can be carried out with the AIC statistic.
fm2$aic
## [1] -13.42
fm3$aic
## [1] -15.09
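The same comparison can be obtained in a single call with the AIC() extractor, which returns both values in one table (lower is better):

AIC(fm2, fm3)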
The Gamma model appears to be slightly better.
Summary
In this chapter, we introduced statistical modelling with R. Through some examples on real data, we saw the lm() function to build a linear regression model, the aov() function to build an ANOVA model and the glm() function to build a generalised linear model, with anova() used to produce analysis of variance tables. The summary() function is useful for investigating regression and ANOVA results. We used the plot() function both to draw the regression line on the scatterplot and to perform a residual analysis. P-values were used to choose between two nested models, while the AIC statistic was introduced to compare two non-nested models.