
note: multiple linear regression simulation

Simulating Data and Fitting a Multiple Linear Regression Model

In this example, we simulate data and fit a multiple linear regression model to it. We then manually calculate the adjusted R-squared and F-statistic to compare with the values provided by the model summary.

Simulating Data

First, we set the seed for reproducibility and simulate the data:

# Set seed for reproducibility
set.seed(123)

# Simulate data
n <- 30  # Number of observations
x1 <- rnorm(n, mean = 10, sd = 2)  # Continuous predictor 1
x2 <- rnorm(n, mean = 5, sd = 1)   # Continuous predictor 2
x3 <- factor(sample(c("A", "B", "C"), n, replace = TRUE))  # Categorical predictor with 3 levels
x4 <- rnorm(n, mean = 0, sd = 1)   # Continuous predictor 3

# Simulate response variable (y) based on the predictors
y <- 2 + 1.5 * x1 + 0.8 * x2 + ifelse(x3 == "B", 3, ifelse(x3 == "C", 5, 0)) + 1.2 * x4 + rnorm(n, mean = 0, sd = 1)

# Combine data into a data frame
data <- data.frame(y, x1, x2, x3, x4)

# Preview
head(data)
##          y        x1       x2 x3          x4
## 1 18.49021  8.879049 5.426464  A -0.78273028
## 2 25.53604  9.539645 4.704929  C -0.77899724
## 3 32.70612 13.117417 5.895126  C -0.37480009
## 4 27.01636 10.141017 5.878133  C -0.31939381
## 5 24.88084 10.258575 5.821581  B  0.08454377
## 6 28.92555 13.430130 5.688640  B -0.76847360
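
Because x3 is a factor, lm() will expand it into treatment-coded dummy variables, with level "A" absorbed into the intercept. As a quick sketch using base R's model.matrix(), we can inspect the design-matrix columns that the fit below estimates coefficients for:

# Inspect the design matrix lm() will build: x3 becomes the dummy
# columns x3B and x3C, with level A serving as the baseline
head(model.matrix(~ x1 + x2 + x3 + x4, data = data))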

Fitting the Multiple Linear Regression Model

Next, we fit the multiple linear regression model and compare the summary() output with the anova() output:

# Fit the multiple linear regression model
mlr_model <- lm(y ~ x1 + x2 + x3 + x4, data = data)

# Display the summary of the model
summary(mlr_model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6234 -0.4573 -0.1174  0.6899  1.2926 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.71889    1.36357  -0.527 0.602887    
## x1           1.65819    0.07951  20.854  < 2e-16 ***
## x2           0.84485    0.18675   4.524 0.000139 ***
## x3B          4.43388    0.42597  10.409 2.23e-10 ***
## x3C          6.44933    0.46137  13.979 4.99e-13 ***
## x4           1.25028    0.23694   5.277 2.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8099 on 24 degrees of freedom
## Multiple R-squared:  0.9708, Adjusted R-squared:  0.9648 
## F-statistic: 159.8 on 5 and 24 DF,  p-value: < 2.2e-16
anova(mlr_model)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 335.74  335.74 511.8012 < 2.2e-16 ***
## x2         1   2.97    2.97   4.5347   0.04367 *  
## x3         2 167.31   83.66 127.5251 1.638e-13 ***
## x4         1  18.27   18.27  27.8451 2.065e-05 ***
## Residuals 24  15.74    0.66                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
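
As a consistency check between the two outputs, the overall F-statistic reported by summary() can be recovered from the sequential ANOVA table by pooling the predictor rows; a minimal sketch:

# Pool the predictor rows of the sequential ANOVA table
aov_tab <- anova(mlr_model)
k <- nrow(aov_tab)                     # last row holds the residuals
ss_model <- sum(aov_tab$"Sum Sq"[-k])  # model sum of squares
df_model <- sum(aov_tab$Df[-k])        # 1 + 1 + 2 + 1 = 5
mse <- aov_tab$"Mean Sq"[k]            # residual mean square
(ss_model / df_model) / mse            # should match the 159.8 above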

Checking Adjusted R-squared Manually

We can manually calculate the adjusted R-squared and compare it with the value from the model summary:

# Extract necessary values
SSE <- sum(residuals(mlr_model)^2)  # Sum of squared errors
SST <- sum((y - mean(y))^2)         # Total sum of squares
ptrue <- length(coef(mlr_model))        # Number of parameters (including intercept)

# Number of continuous predictors
num_continuous <- sum(sapply(data, is.numeric)) - 1  # Subtract 1 to exclude the response variable (y)

# Number of levels in the categorical predictor
num_levels <- length(levels(data$x3))

# Total number of parameters (p)
p <- 1 + num_continuous + (num_levels - 1)

cat("Total number of parameters (p):", p, "like the number of mlr coefficients ", ptrue,"\n")
## Total number of parameters (p): 6 like the number of mlr coefficients  6
# Compute Adjusted R-squared manually
adjusted_r_squared <- 1 - (SSE / SST) * ((n - 1) / (n - p))

# Compare with the Adjusted R-squared from the model summary
cat("Adjusted R-squared (manual):", adjusted_r_squared, "\n")
## Adjusted R-squared (manual): 0.964773
cat("Adjusted R-squared (from model):", summary(mlr_model)$adj.r.squared, "\n")
## Adjusted R-squared (from model): 0.964773
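
Equivalently, since \( \text{SSE}/\text{SST} = 1 - R^2 \), the same quantity can be written in terms of the ordinary R-squared as \( R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - p} \); a quick check against the summary values:

# Recompute the adjusted R-squared from the ordinary R-squared
r_squared <- summary(mlr_model)$r.squared
1 - (1 - r_squared) * (n - 1) / (n - p)  # should again give 0.964773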

Observed, Mean Observed, and Fitted Values in the Sum-of-Squares Decomposition

In regression analysis, let \( y_i \) (observed, blue) represent the actual data points, \( \hat{y}_i \) (fitted, green) denote the predicted values from the model, and \( \bar{y} \) (mean of observed values, purple) be the average of the observed data. The reference line \( y = x \) (red) serves as a 1:1 comparison line. The total sum of squares (SST),
\[ \text{SST} = \sum_{i = 1}^{n} \left( y_{i} - \bar{y} \right)^{2}, \]
quantifies the total variation in \( y_i \). This can be decomposed into the sum of squared errors (SSE),
\[ \text{SSE} = \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}, \]
capturing the deviation of \( y_i \) from \( \hat{y}_i \), and the sum of squares due to regression (SSR),
\[ \text{SSR} = \sum_{i = 1}^{n} \left( \hat{y}_{i} - \bar{y} \right)^{2}, \]
representing the variation explained by the model, such that
\[ \text{SST} = \text{SSE} + \text{SSR}, \]
or equivalently,
\[ \sum_{i = 1}^{n} \left( y_{i} - \bar{y} \right)^{2} = \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} + \sum_{i = 1}^{n} \left( \hat{y}_{i} - \bar{y} \right)^{2}. \]
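
The decomposition is easy to verify numerically for the fitted model, reusing the SSE and SST computed above:

# Verify SST = SSE + SSR for the fitted model
SSR <- sum((fitted(mlr_model) - mean(data$y))^2)
all.equal(SST, SSE + SSR)  # TRUE, up to floating-point tolerance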

library(ggplot2)
# Add fitted values to the data frame
data$Fitted <- fitted(mlr_model)

# Calculate the mean of the observed values
mean_y <- mean(data$y)

# Create the plot
ggplot(data, aes(x = Fitted, y = y)) +
  geom_point(aes(color = "Observed"), alpha = 0.7) +  # Observed values
  geom_point(aes(y = Fitted, color = "Fitted"), alpha = 0.7) +  # Fitted values
  geom_abline(aes(slope = 1, intercept = 0, color = "Reference Line (y = x)"), linetype = "dashed") +  # Reference line (y = x)
  geom_hline(aes(yintercept = mean_y, color = "Mean of Observed Values"), linetype = "dotted") +  # Mean of observed values
  labs(
    title = "Observed vs. Fitted Values",
    x = "Fitted Values (Predicted)",
    y = "Observed Values (Actual)",
    color = "Legend"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_color_manual(values = c(
    "Observed" = "blue",
    "Fitted" = "green",
    "Reference Line (y = x)" = "red",
    "Mean of Observed Values" = "purple"
  ))
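
In the resulting plot, the vertical gap between each blue point and the red \( y = x \) line is the residual \( y_i - \hat{y}_i \) (contributing to SSE), while the gap between each green point and the purple mean line is \( \hat{y}_i - \bar{y} \) (contributing to SSR).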

Checking F-statistic Manually

Finally, we manually calculate the F-statistic and compare it with the value from the model summary:

coef(mlr_model)
## (Intercept)          x1          x2         x3B         x3C          x4 
##  -0.7188876   1.6581923   0.8448530   4.4338831   6.4493264   1.2502829
# Extract necessary values
SSR <- sum((fitted(mlr_model) - mean(y))^2)  # Regression sum of squares
SSE <- sum(residuals(mlr_model)^2)           # Residual sum of squares
n <- nrow(data)                              # Number of observations
p <- length(coef(mlr_model)) - 1             # Number of predictors (excluding intercept)

# Degrees of freedom
df_regression <- p
df_residual <- n - p - 1

# Calculate MSR and MSE
MSR <- SSR / df_regression
MSE <- SSE / df_residual

# Calculate F-statistic
F_statistic <- MSR / MSE

# Compare with the F-statistic from the model summary
cat("F-statistic (manual):", F_statistic, "\n")
## F-statistic (manual): 159.8462
cat("F-statistic (from model):", summary(mlr_model)$fstatistic[1], "\n")
## F-statistic (from model): 159.8462
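
The p-value reported alongside the F-statistic follows from the upper tail of the F distribution with \( (p, \, n - p - 1) \) degrees of freedom; a one-line check:

# Upper-tail probability of the F distribution
pf(F_statistic, df_regression, df_residual, lower.tail = FALSE)
# effectively zero; summary() reports < 2.2e-16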