Checking Assumptions and Diagnosing Problems
2026-01-01
By the end of this lecture, you should be able to:
What it means:
Why it matters:
Good news:
Residual = Observed - Predicted \[e_{ij} = y_{ij} - \hat{y}_{ij}\]
where \(\hat{y}_{ij}\) is the treatment mean for group \(i\).
What it means:
Why it matters:
What it means:
Common violations in agriculture:
How to prevent:
Quantile-Quantile (Q-Q) Plot:
Interpretation guide:
Formal hypothesis test:
# Reload data from Lecture 3
set.seed(2026)
crd_data <- data.frame(
treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
yield = c(
rnorm(6, 2400, 150), rnorm(6, 2650, 150),
rnorm(6, 2900, 150), rnorm(6, 2950, 150)
)
)
# Fit model
model_crd <- lm(yield ~ treatment, data = crd_data)
# Extract residuals
residuals <- resid(model_crd)
# Shapiro-Wilk test
shapiro.test(residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: residuals
#> W = 0.94396, p-value = 0.1998
Interpretation:
library(ggplot2)
# Create diagnostic data
diag_data <- data.frame(
fitted = fitted(model_crd),
residuals = resid(model_crd),
treatment = crd_data$treatment
)
ggplot(diag_data, aes(x = fitted, y = residuals)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_point(aes(color = treatment), size = 3, alpha = 0.7) +
scale_color_manual(values = c("N0" = "#9D2235",
"N50" = "#FF8C42",
"N100" = "#2E8B57",
"N150" = "#4A90E2")) +
labs(
title = "Residuals vs Fitted Values",
subtitle = "Random scatter = good; funnel shape = heteroscedasticity",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal(base_size = 14)Figure 4: Residual vs fitted plot to check homoscedasticity
What to look for:
diag_data <- diag_data |>
dplyr::mutate(sqrt_abs_resid = sqrt(abs(residuals)))
ggplot(diag_data, aes(x = fitted, y = sqrt_abs_resid)) +
geom_point(aes(color = treatment), size = 3, alpha = 0.7) +
geom_smooth(method = "loess", se = FALSE, color = "#FF8C42", linewidth = 1) +
scale_color_manual(values = c("N0" = "#9D2235",
"N50" = "#FF8C42",
"N100" = "#2E8B57",
"N150" = "#4A90E2")) +
labs(
title = "Scale-Location Plot",
subtitle = "Flat smooth line = constant variance",
x = "Fitted Values",
y = "√|Standardized Residuals|"
) +
theme_minimal(base_size = 14)Figure 5: Scale-location plot: Checking for heteroscedasticity
Interpretation:
Formal hypothesis test:
#> Levene's Test for Homogeneity of Variance (center = median)
#> Df F value Pr(>F)
#> group 3 0.9206 0.4489
#> 20
Interpretation:
Caution
Warning: Like Shapiro-Wilk, Levene’s test is sensitive to sample size. Use plots + test together.
Transformation Table:
| Problem | Transformation |
|---|---|
| Right skew |
log(y) or sqrt(y)
|
| Count data | sqrt(y + 0.5) |
| Proportion data | arcsin(sqrt(y)) |
| Heteroscedasticity |
log(y) often helps |
| Don’t know | Box-Cox finds optimal |
When to transform:
(?) across groupsRemember: Transform → Analyze → Interpret on original scale (back-transform means if needed)
Finds optimal power transformation: \[y^{(\lambda)} = \begin{cases} \frac{y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(y) & \text{if } \lambda = 0 \end{cases}\]
# Simulate right-skewed yield data (common in real trials)
set.seed(2026)
skewed_data <- data.frame(
treatment = rep(c("A", "B", "C"), each = 10),
yield = c(
exp(rnorm(10, 5, 0.5)), # Treatment A
exp(rnorm(10, 7, 0.5)), # Treatment B
exp(rnorm(10, 9, 0.5)) # Treatment C
)
)
# Fit model
model_skewed <- lm(yield ~ treatment, data = skewed_data)
# Check normality
shapiro.test(resid(model_skewed))#>
#> Shapiro-Wilk normality test
#>
#> data: resid(model_skewed)
#> W = 0.83621, p-value = 0.0003231
Result: p = 0.0.0003231 < 0.05 → Significant non-normality
#>
#> Shapiro-Wilk normality test
#>
#> data: resid(model_log)
#> W = 0.94608, p-value = 0.1326
Result: p = 0.1326 > 0.05 → Normality assumption now satisfied!
Re-run ANOVA:
#> Anova Table (Type II tests)
#>
#> Response: log_yield
#> Sum Sq Df F value Pr(>F)
#> treatment 104.720 2 202.49 < 2.2e-16 ***
#> Residuals 6.982 27
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Transformation fixed the violation. Now interpret results on log scale or back-transform means for reporting.
Scenario:
A food scientist tests 4 pasteurization temperatures on bacterial count (CFU/mL) in milk samples. CRD with 6 samples per temperature.
Data provided:
#> temperature cfu
#> 1 60C 113
#> 2 60C 133
#> 3 60C 101
#> 4 60C 121
#> 5 60C 138
#> 6 60C 125
#> 7 65C 68
#> 8 65C 64
Questions (3 minutes):
Key principle: Never trust ANOVA results without checking assumptions first!
Textbook:
R Packages:
car: leveneTest(), qqPlot(), Anova()
MASS: boxcox() for transformation selectionggplot2: Custom diagnostic plotsOnline Resources:
AGST 50104: Experimental Design | Spring 2026