lm
function in base R)The models for this example come from Hoffman (2015) chapter 2. We will be examining the extent to which cognition (as measured by an information test outcome) can be predicted from age (centered at 85 years) grip strength (centered at 9 pounds), sex (with men as the reference group) and dementia status (none = 1, future = 2, and current = 3) in a sample of 550 older adults.
if (!require(psych)){
install.packages("psych")
}
library(psych)
# Import data into R from CSV file.
ImportedData = read.csv(file = "mv18epsy905_lecture02.csv", stringsAsFactors = FALSE)
# Copy initial data to a new data frame for additional formatting:
Data01 = ImportedData
# Center age at 85 years old
Data01$age85 = Data01$age - 85
# Center grip strength at 9
Data01$grip9 = Data01$grip - 9
# Create dummy coded variables for dementia group (3 categories needs 2 variables):
# 1. Start by creating variables full of zeroes:
Data01$demNF = 0 # for no dementia in the future
Data01$demNC = 0 # for no dementia currently
# 2. Change zeroes to ones for people in each respective group
# Here, the term in the bracket provides a logical variable where if true, a 1 gets put into the demNF variable
Data01$demNF[Data01$demgroup == 2] = 1
Data01$demNC[Data01$demgroup == 3] = 1
# Selecting only three quantitative variables (the c() in the brackets consists of the variables to use)
describe(Data01[c("age", "grip", "cognition")])
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 550 84.93 3.43 84.33 84.49 2.88 80.02 96.97 16.95 1.10 0.72 0.15
grip 2 550 9.11 2.98 9.00 9.11 2.97 0.00 19.00 19.00 -0.01 -0.17 0.13
cognition 3 550 24.82 10.99 25.00 25.18 11.86 0.00 44.00 44.00 -0.26 -0.62 0.47
# Creating a frequency table for categorical variables:
table(Data01[c("sexMW")])
0 1
227 323
table(Data01[c("demgroup")])
1 2 3
399 109 42
Model: \(\text{Cognition}_i = \beta_0 + e_i\),
where \(e_i \sim N(0, \sigma^2_e)\)
# Here the formula has the name of the dependent variable in the Data01 data frame
# ~ is the formula term for =
# 1 indicates we only want the intercept estimated
Model01 = lm(formula = cognition ~ 1, data = Data01)
summary(Model01)
Call:
lm(formula = cognition ~ 1, data = Data01)
Residuals:
Min 1Q Median 3Q Max
-24.8218 -7.5718 0.1782 8.1782 19.1782
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.8218 0.4686 52.97 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.99 on 549 degrees of freedom
In the output:
Call:
provides the arugments given to the lm()
functionResiduals:
is often disregarded but provides quantiles of the distribution of \(e_i\) (residuals)Coefficients:
provides the estimates of the lm()
coefficients, their standard errors, a Wald-ish test and a (possibly conditional) p-valueResidual standard error:
gives an estimate of \(\sigma_e = \sqrt{\sigma^2_e}\), the residual standard error/deviationAdditionally, we can display an ANOVA table using the summary()
function:
anova(Model01)
Analysis of Variance Table
Response: cognition
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 549 66297 120.76
Here the ANOVA table only has residuals because the only parameter estimated is the intercept. Also, note that the Mean Square Residual of 120.76 provides an estimate of \(\sigma_{e}^2\) (and \(\sqrt{120.76} = 10.99 = \sigma_{e}\) from above).
Model: \(\text{Cognition}_i = \beta_0 + \beta_1\left(\text{Age}_i - 85\right) + \beta_2 \left(\text{Grip}_i - 9 \right) + \beta_3\left(\text{SexMW}_i \right) e_i\),
where \(e_i \sim N(0, \sigma^2_e)\)
Model02 = lm(formula = cognition ~ age85 + grip9 + sexMW, data = Data01)
summary(Model02)
Call:
lm(formula = cognition ~ age85 + grip9 + sexMW, data = Data01)
Residuals:
Min 1Q Median 3Q Max
-27.758 -7.093 0.102 8.219 22.742
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.9594 0.7389 36.487 < 2e-16 ***
age85 -0.4338 0.1325 -3.275 0.00112 **
grip9 0.5460 0.1663 3.284 0.00109 **
sexMW -3.7988 0.9904 -3.836 0.00014 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.46 on 546 degrees of freedom
Multiple R-squared: 0.09917, Adjusted R-squared: 0.09422
F-statistic: 20.04 on 3 and 546 DF, p-value: 2.476e-12
We can use the anova()
function to compare Model02 to Model01. Later in the class, this won’t be an option, but it works for now.
anova(Model01, Model02)
Analysis of Variance Table
Model 1: cognition ~ 1
Model 2: cognition ~ age85 + grip9 + sexMW
Res.Df RSS Df Sum of Sq F Pr(>F)
1 549 66297
2 546 59722 3 6574.7 20.036 2.476e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model: \(\text{Cognition}_i = \beta_0 + \beta_1\left(\text{Age}_i - 85\right) + \beta_2 \left(\text{Grip}_i - 9 \right) + \beta_3\left(\text{SexMW}_i \right) + \beta_4 \text{DemNF}_i + \beta_5 \text{DemNC}_i + e_i\),
where \(e_i \sim N(0, \sigma^2_e)\)
Model03 = lm(formula = cognition ~ age85 + grip9 + sexMW + demNF + demNC, data = Data01)
summary(Model03)
Call:
lm(formula = cognition ~ age85 + grip9 + sexMW + demNF + demNC,
data = Data01)
Residuals:
Min 1Q Median 3Q Max
-28.4870 -5.9566 0.1129 6.8300 20.5467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.2643 0.6985 41.895 < 2e-16 ***
age85 -0.4057 0.1189 -3.412 0.000692 ***
grip9 0.6042 0.1498 4.034 6.26e-05 ***
sexMW -3.6574 0.8914 -4.103 4.71e-05 ***
demNF -5.7220 1.0191 -5.615 3.14e-08 ***
demNC -16.4798 1.5228 -10.822 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.385 on 544 degrees of freedom
Multiple R-squared: 0.2773, Adjusted R-squared: 0.2707
F-statistic: 41.75 on 5 and 544 DF, p-value: < 2.2e-16
We can use the model equation to calculate the dementia group means for predicted cognition. What model parameters would give the mean for the following groups?
We can determine the differences between the dementia group means as follows:
Here, we can use the parameters to get estimated differences, but we need more information for forming the standard errors. We will learn how to do this next week.
factor()
We can use the factor()
function to tell R we have a grouping variable, which will automatically code it for us:
Model04 = lm(formula = cognition ~ age85 + grip9 + sexMW + factor(demgroup), data = Data01)
summary(Model04)
Call:
lm(formula = cognition ~ age85 + grip9 + sexMW + factor(demgroup),
data = Data01)
Residuals:
Min 1Q Median 3Q Max
-28.4870 -5.9566 0.1129 6.8300 20.5467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.2643 0.6985 41.895 < 2e-16 ***
age85 -0.4057 0.1189 -3.412 0.000692 ***
grip9 0.6042 0.1498 4.034 6.26e-05 ***
sexMW -3.6574 0.8914 -4.103 4.71e-05 ***
factor(demgroup)2 -5.7220 1.0191 -5.615 3.14e-08 ***
factor(demgroup)3 -16.4798 1.5228 -10.822 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.385 on 544 degrees of freedom
Multiple R-squared: 0.2773, Adjusted R-squared: 0.2707
F-statistic: 41.75 on 5 and 544 DF, p-value: < 2.2e-16
Here, the parameters are identical to what we saw in Model03. But compare the ANOVA tables:
anova(Model03)
Analysis of Variance Table
Response: cognition
Df Sum Sq Mean Sq F value Pr(>F)
age85 1 1926 1926.2 21.871 3.683e-06 ***
grip9 1 3039 3039.2 34.508 7.398e-09 ***
sexMW 1 1609 1609.3 18.273 2.260e-05 ***
demNF 1 1496 1496.1 16.988 4.350e-05 ***
demNC 1 10315 10315.2 117.124 < 2.2e-16 ***
Residuals 544 47911 88.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(Model04)
Analysis of Variance Table
Response: cognition
Df Sum Sq Mean Sq F value Pr(>F)
age85 1 1926 1926.2 21.871 3.683e-06 ***
grip9 1 3039 3039.2 34.508 7.398e-09 ***
sexMW 1 1609 1609.3 18.273 2.260e-05 ***
factor(demgroup) 2 11811 5905.7 67.056 < 2.2e-16 ***
Residuals 544 47911 88.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What is the difference?