Simulating Data and Fitting a Multiple Linear Regression Model
In this example, we simulate data, fit a multiple linear regression model, and then manually calculate the adjusted R-squared and F-statistic to verify the values reported by the model summary.
Simulating Data
First, we set the seed for reproducibility and simulate the data:
# Set seed for reproducibility
set.seed(123)
# Simulate data
n <- 30 # Number of observations
x1 <- rnorm(n, mean = 10, sd = 2) # Continuous predictor 1
x2 <- rnorm(n, mean = 5, sd = 1) # Continuous predictor 2
x3 <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) # Categorical predictor with 3 levels
x4 <- rnorm(n, mean = 0, sd = 1) # Continuous predictor 3
# Simulate response variable (y) based on the predictors
y <- 2 + 1.5 * x1 + 0.8 * x2 + ifelse(x3 == "B", 3, ifelse(x3 == "C", 5, 0)) + 1.2 * x4 + rnorm(n, mean = 0, sd = 1)
# Combine data into a data frame
data <- data.frame(y, x1, x2, x3, x4)
# Preview
head(data)
##          y        x1       x2 x3          x4
## 1 18.49021  8.879049 5.426464  A -0.78273028
## 2 25.53604  9.539645 4.704929  C -0.77899724
## 3 32.70612 13.117417 5.895126  C -0.37480009
## 4 27.01636 10.141017 5.878133  C -0.31939381
## 5 24.88084 10.258575 5.821581  B  0.08454377
## 6 28.92555 13.430130 5.688640  B -0.76847360
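As a quick sanity check (an optional step, not part of the original output), we can confirm how the factor levels were sampled and that each column has the intended type:
# Check the distribution of the categorical predictor and the column types
table(data$x3)
str(data)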
Fitting the Multiple Linear Regression Model
Next, we fit the multiple linear regression model and compare the summary() output with the anova() table:
# Fit the multiple linear regression model
mlr_model <- lm(y ~ x1 + x2 + x3 + x4, data = data)
# Display the summary of the model
summary(mlr_model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.6234 -0.4573 -0.1174  0.6899  1.2926
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.71889    1.36357  -0.527 0.602887
## x1           1.65819    0.07951  20.854  < 2e-16 ***
## x2           0.84485    0.18675   4.524 0.000139 ***
## x3B          4.43388    0.42597  10.409 2.23e-10 ***
## x3C          6.44933    0.46137  13.979 4.99e-13 ***
## x4           1.25028    0.23694   5.277 2.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8099 on 24 degrees of freedom
## Multiple R-squared: 0.9708, Adjusted R-squared: 0.9648
## F-statistic: 159.8 on 5 and 24 DF, p-value: < 2.2e-16
anova(mlr_model)
## Analysis of Variance Table
##
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)
## x1         1 335.74  335.74 511.8012 < 2.2e-16 ***
## x2         1   2.97    2.97   4.5347   0.04367 *
## x3         2 167.31   83.66 127.5251 1.638e-13 ***
## x4         1  18.27   18.27  27.8451 2.065e-05 ***
## Residuals 24  15.74    0.66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
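The rows of this table give a sequential (Type I) partition of the regression sum of squares. As an optional check (not in the original note), summing the Sum Sq column over the predictor rows should reproduce SSR:
# Sequential sums of squares from the ANOVA table partition SSR
aov_tab <- anova(mlr_model)
SSR_from_anova <- sum(aov_tab$"Sum Sq"[-nrow(aov_tab)]) # drop the Residuals row
SSR_direct <- sum((fitted(mlr_model) - mean(data$y))^2)
all.equal(SSR_from_anova, SSR_direct) # should be TRUE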
Checking Adjusted R-squared Manually
We can manually calculate the adjusted R-squared and compare it with the value from the model summary:
# Extract necessary values
SSE <- sum(residuals(mlr_model)^2) # Sum of squared errors
SST <- sum((y - mean(y))^2) # Total sum of squares
ptrue <- length(coef(mlr_model)) # Number of parameters (including intercept)
# Number of continuous predictors
num_continuous <- sum(sapply(data, is.numeric)) - 1 # Subtract 1 to exclude the response variable (y)
# Number of levels in the categorical predictor
num_levels <- length(levels(data$x3))
# Total number of parameters (p)
p <- 1 + num_continuous + (num_levels - 1)
cat("Total number of parameters (p):", p, "like the number of mlr coefficients ", ptrue,"\n")
## Total number of parameters (p): 6 like the number of mlr coefficients 6
# Compute Adjusted R-squared manually
adjusted_r_squared <- 1 - (SSE / SST) * ((n - 1) / (n - p))
# Compare with the Adjusted R-squared from the model summary
cat("Adjusted R-squared (manual):", adjusted_r_squared, "\n")
## Adjusted R-squared (manual): 0.964773
cat("Adjusted R-squared (from model):", summary(mlr_model)$adj.r.squared, "\n")
## Adjusted R-squared (from model): 0.964773
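For comparison (a small addition beyond the original), the unadjusted R-squared uses the same SSE and SST without the degrees-of-freedom penalty:
# Compute plain R-squared manually and compare with the model summary
r_squared <- 1 - SSE / SST
cat("R-squared (manual):", r_squared, "\n")
cat("R-squared (from model):", summary(mlr_model)$r.squared, "\n")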
Observed, Mean Observed, and Fitted Values for the Sums of Squares
In regression analysis, let \( y_i \) (observed, blue) represent the actual data points, \( \hat{y}_i \) (fitted, green) denote the predicted values from the model, and \( \bar{y} \) (mean of the observed values, purple) be the average of the observed data. The reference line \( y = x \) (red) serves as a 1:1 comparison line. The total sum of squares (SST),
\[ \text{SST} = \sum_{i = 1}^{n} \left( y_{i} - \bar{y} \right)^{2}, \]
quantifies the total variation in \( y_i \). This can be decomposed into the sum of squared errors (SSE),
\[ \text{SSE} = \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}, \]
capturing the deviation of \( y_i \) from \( \hat{y}_i \), and the sum of squares due to regression (SSR),
\[ \text{SSR} = \sum_{i = 1}^{n} \left( \hat{y}_{i} - \bar{y} \right)^{2}, \]
representing the variation explained by the model, such that
\[ \text{SST} = \text{SSE} + \text{SSR}, \]
or equivalently,
\[ \sum_{i = 1}^{n} \left( y_{i} - \bar{y} \right)^{2} = \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} + \sum_{i = 1}^{n} \left( \hat{y}_{i} - \bar{y} \right)^{2}. \]
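Before visualizing these quantities, we can verify the decomposition numerically (a quick check added here; the identity holds exactly for least squares with an intercept, up to floating-point tolerance). The _chk names are just local to this check:
# Numerical check of SST = SSE + SSR
SST_chk <- sum((data$y - mean(data$y))^2)
SSE_chk <- sum((data$y - fitted(mlr_model))^2)
SSR_chk <- sum((fitted(mlr_model) - mean(data$y))^2)
all.equal(SST_chk, SSE_chk + SSR_chk) # should be TRUE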
library(ggplot2)
# Add fitted values to the data frame
data$Fitted <- fitted(mlr_model)
# Calculate the mean of the observed values
mean_y <- mean(data$y)
# Create the plot
ggplot(data, aes(x = Fitted, y = y)) +
  geom_point(aes(color = "Observed"), alpha = 0.7) + # Observed values
  geom_point(aes(y = Fitted, color = "Fitted"), alpha = 0.7) + # Fitted values
  geom_abline(aes(slope = 1, intercept = 0, color = "Reference Line (y = x)"),
              linetype = "dashed") + # Reference line (y = x)
  geom_hline(aes(yintercept = mean_y, color = "Mean of Observed Values"),
             linetype = "dotted") + # Mean of observed values
  labs(
    title = "Observed vs. Fitted Values",
    x = "Fitted Values (Predicted)",
    y = "Observed Values (Actual)",
    color = "Legend"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_color_manual(values = c(
    "Observed" = "blue",
    "Fitted" = "green",
    "Reference Line (y = x)" = "red",
    "Mean of Observed Values" = "purple"
  ))
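As an optional variation (a sketch, not part of the original figure), the residuals can be drawn as vertical segments from each observed point to the reference line; the squared lengths of these segments sum to SSE:
# Residuals as vertical segments: from (Fitted, y) down to (Fitted, Fitted)
ggplot(data, aes(x = Fitted, y = y)) +
  geom_segment(aes(xend = Fitted, yend = Fitted), color = "grey60") +
  geom_point(color = "blue", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals as Segments (SSE)",
       x = "Fitted Values (Predicted)",
       y = "Observed Values (Actual)") +
  theme_minimal()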
Checking F-statistic Manually
Finally, we manually calculate the F-statistic and compare it with the value from the model summary:
coef(mlr_model)
## (Intercept)         x1         x2        x3B        x3C         x4
##  -0.7188876  1.6581923  0.8448530  4.4338831  6.4493264  1.2502829
# Extract necessary values
SSR <- sum((fitted(mlr_model) - mean(y))^2) # Regression sum of squares
SSE <- sum(residuals(mlr_model)^2) # Residual sum of squares
n <- nrow(data) # Number of observations
p <- length(coef(mlr_model)) - 1 # Number of predictors (excluding intercept)
# Degrees of freedom
df_regression <- p
df_residual <- n - p - 1
# Calculate MSR and MSE
MSR <- SSR / df_regression
MSE <- SSE / df_residual
# Calculate F-statistic
F_statistic <- MSR / MSE
# Compare with the F-statistic from the model summary
cat("F-statistic (manual):", F_statistic, "\n")
## F-statistic (manual): 159.8462
cat("F-statistic (from model):", summary(mlr_model)$fstatistic[1], "\n")
## F-statistic (from model): 159.8462
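Finally, as a small extension beyond the original note, the p-value reported alongside the F-statistic can be recovered from the F distribution with the same degrees of freedom:
# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(F_statistic, df_regression, df_residual, lower.tail = FALSE)
cat("p-value (manual):", p_value, "\n") # the summary reports "< 2.2e-16"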