CRD: Assumptions & Diagnostics

Checking Assumptions and Diagnosing Problems

Dr. Samuel B Fernandes

2026-01-01

Learning Objectives

By the end of this lecture, you should be able to:

  • Create and interpret diagnostic plots: Q-Q plot, residual vs fitted, scale-location
  • Conduct formal tests for assumptions: Shapiro-Wilk test, Levene’s test

ANOVA Assumptions Explained

Assumption 1: Normality

What it means:

  • Errors (residuals) follow a normal distribution
  • Not the raw data, but the residuals from the model

Why it matters:

  • F-test relies on normal distribution theory
  • Violated → p-values unreliable

Good news:

  • ANOVA is robust to mild non-normality with balanced designs
  • Central Limit Theorem helps with larger sample sizes (a common rule of thumb is n > 30 per group)

Residual = Observed - Predicted \[e_{ij} = y_{ij} - \hat{y}_{ij}\]

where \(\hat{y}_{ij}\) is the treatment mean for group \(i\).
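The residual definition can be checked by hand. The sketch below uses simulated nitrogen-rate yields (illustrative values, not real trial data) and compares residuals computed as observed minus group mean with those returned by `resid()`.

```r
# Simulated CRD: 4 nitrogen rates, 6 plots each (illustrative values)
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(rnorm(6, 2400, 150), rnorm(6, 2650, 150),
            rnorm(6, 2900, 150), rnorm(6, 2950, 150))
)

# Predicted value for each observation = mean of its treatment group
group_mean <- ave(crd_data$yield, crd_data$treatment)

# Residual = observed - predicted
e_manual <- crd_data$yield - group_mean

# The fitted one-way model returns the same residuals
model_crd <- lm(yield ~ treatment, data = crd_data)
e_model <- unname(resid(model_crd))

all.equal(e_manual, e_model)
```

Because each residual is a deviation from its own group mean, the residuals sum to zero within every treatment.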

Assumption 2: Homogeneity of Variance

What it means:

  • All treatment groups have the same error variance \(\sigma^2\)
  • Also called homoscedasticity

Why it matters:

  • F-test assumes equal variance
  • Violated → the pooled error variance misrepresents the noisier groups, so the F-test and mean comparisons can be misleading

Homoscedasticity vs Heteroscedasticity

Assumption 3: Independence

What it means:

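  • Each observation is independent: errors are uncorrelated across experimental units
  • Randomization helps justify independence, but it does not remove correlation built into the layout or the measurements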

Common violations in agriculture:

  • Spatial correlation: Adjacent plots influence each other (water runoff, pest movement)
  • Temporal correlation: Repeated measures on same unit over time
  • Cluster effects: Animals in same pen more similar than random

How to prevent:

  • Proper randomization
  • Physical separation of units
  • Blocking to account for spatial patterns
  • Use mixed models for repeated measures (later in course)
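As a minimal sketch of proper randomization in a CRD, treatment labels can be shuffled across all plots with `sample()`. The treatment names, the number of replicates, and the object name `field_plan` are hypothetical choices for illustration.

```r
set.seed(1)  # fixed only so the layout can be reproduced

treatments <- c("N0", "N50", "N100", "N150")  # hypothetical treatment labels
reps <- 6                                      # replicates per treatment

# One row per plot; the full set of labels is shuffled across all plots
field_plan <- data.frame(
  plot = seq_len(length(treatments) * reps),
  treatment = sample(rep(treatments, each = reps))
)

head(field_plan)
table(field_plan$treatment)  # each treatment still appears 'reps' times
```

Shuffling the complete label vector keeps the design balanced while leaving the assignment of treatments to plots to chance.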

Checking Normality

The Q-Q Plot

Quantile-Quantile (Q-Q) Plot:

  • Compares sample quantiles to theoretical normal quantiles
  • Good: Points fall on diagonal line
  • Bad: Points deviate systematically

Interpretation guide:

  • Light tails: Flatter than line at ends
  • Heavy tails: Steeper than line at ends
  • Skewness: Curve away from line
Figure 1: Good Q-Q plot: Points follow the line closely
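To make the comparison concrete, a Q-Q plot can be built by hand: sort the residuals and plot them against normal quantiles from `qnorm(ppoints(n))`. This sketch uses simulated CRD yields (illustrative values) and checks that the hand-built quantiles agree with `qqnorm()`.

```r
# Simulated CRD: 4 nitrogen rates, 6 plots each (illustrative values)
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(rnorm(6, 2400, 150), rnorm(6, 2650, 150),
            rnorm(6, 2900, 150), rnorm(6, 2950, 150))
)
model_crd <- lm(yield ~ treatment, data = crd_data)
res <- resid(model_crd)

n <- length(res)
sample_q <- sort(res)          # sample quantiles
theory_q <- qnorm(ppoints(n))  # theoretical normal quantiles

plot(theory_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Normal Q-Q plot (by hand)")
qqline(res)  # reference line through the first and third quartiles

# Built-in equivalent; plot.it = FALSE returns the coordinates only
qq <- qqnorm(res, plot.it = FALSE)
```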

Q-Q Plot Examples: Good vs Bad

Figure 2: Bad: Right-skewed residuals (transformation needed)
Figure 3: Bad: Heavy-tailed distribution (outliers present)
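Patterns like these can be reproduced with simulated samples. In the sketch below, the sample size, the seed, and the choice of distributions (lognormal for right skew, t with 3 degrees of freedom for heavy tails) are illustrative assumptions, not the data behind the figures.

```r
set.seed(42)  # arbitrary seed for a reproducible illustration
n <- 200

normal_res <- rnorm(n)       # well-behaved residuals
skewed_res <- exp(rnorm(n))  # right-skewed (lognormal)
heavy_res  <- rt(n, df = 3)  # heavy-tailed (t distribution, 3 df)

op <- par(mfrow = c(1, 3))
qqnorm(normal_res, main = "Normal");       qqline(normal_res)
qqnorm(skewed_res, main = "Right-skewed"); qqline(skewed_res)
qqnorm(heavy_res,  main = "Heavy-tailed"); qqline(heavy_res)
par(op)
```

The right-skewed panel should bend upward away from the line, while the heavy-tailed panel should be steeper than the line at both ends.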

Shapiro-Wilk Test for Normality

Formal hypothesis test:

  • \(H_0\): Residuals are normally distributed
  • \(H_a\): Residuals are not normally distributed
Code
# Reload data from Lecture 3
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(
    rnorm(6, 2400, 150), rnorm(6, 2650, 150),
    rnorm(6, 2900, 150), rnorm(6, 2950, 150)
  )
)

# Fit model
model_crd <- lm(yield ~ treatment, data = crd_data)

# Extract residuals
residuals <- resid(model_crd)

# Shapiro-Wilk test
shapiro.test(residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  residuals
#> W = 0.94396, p-value = 0.1998

Interpretation:

  • p = 0.1998 > 0.05 → Fail to reject \(H_0\)
  • No evidence of non-normality
  • Normality assumption looks reasonable; check equal variance and independence before relying on the ANOVA

Checking Homogeneity of Variance

Residual vs Fitted Plot

Code
library(ggplot2)

# Create diagnostic data
diag_data <- data.frame(
  fitted = fitted(model_crd),
  residuals = resid(model_crd),
  treatment = crd_data$treatment
)

ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_point(aes(color = treatment), size = 3, alpha = 0.7) +
  scale_color_manual(values = c("N0" = "#9D2235", 
                                 "N50" = "#FF8C42",
                                 "N100" = "#2E8B57", 
                                 "N150" = "#4A90E2")) +
  labs(
    title = "Residuals vs Fitted Values",
    subtitle = "Random scatter = good; funnel shape = heteroscedasticity",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 14)

Figure 4: Residual vs fitted plot to check homoscedasticity

What to look for:

  • Good: Random scatter around zero, no pattern
  • Bad: Funnel shape (variance increases), systematic trends

Scale-Location Plot

Code
# Use standardized residuals so the y-axis matches its label
diag_data <- diag_data |>
  dplyr::mutate(sqrt_abs_resid = sqrt(abs(rstandard(model_crd))))

ggplot(diag_data, aes(x = fitted, y = sqrt_abs_resid)) +
  geom_point(aes(color = treatment), size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#FF8C42", linewidth = 1) +
  scale_color_manual(values = c("N0" = "#9D2235", 
                                 "N50" = "#FF8C42",
                                 "N100" = "#2E8B57", 
                                 "N150" = "#4A90E2")) +
  labs(
    title = "Scale-Location Plot",
    subtitle = "Flat smooth line = constant variance",
    x = "Fitted Values",
    y = "√|Standardized Residuals|"
  ) +
  theme_minimal(base_size = 14)

Figure 5: Scale-location plot: Checking for heteroscedasticity

Interpretation:

  • Smooth line should be roughly horizontal
  • Upward or downward trend → heteroscedasticity
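For a quick look without ggplot2, base R's `plot()` method for `lm` objects draws the same three diagnostics. This is a sketch on simulated CRD yields (illustrative values); `which = 1:3` selects residuals vs fitted, normal Q-Q, and scale-location.

```r
# Simulated CRD: 4 nitrogen rates, 6 plots each (illustrative values)
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(rnorm(6, 2400, 150), rnorm(6, 2650, 150),
            rnorm(6, 2900, 150), rnorm(6, 2950, 150))
)
model_crd <- lm(yield ~ treatment, data = crd_data)

op <- par(mfrow = c(1, 3))
plot(model_crd, which = 1:3)  # residuals vs fitted, Q-Q, scale-location
par(op)

# The scale-location y-axis is sqrt(|standardized residual|)
sl_y <- sqrt(abs(rstandard(model_crd)))
```

In a balanced CRD every observation has the same leverage, so standardized residuals are the raw residuals divided by a common constant and the plot shape is the same either way.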

Levene’s Test for Equal Variance

Formal hypothesis test:

  • \(H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_t^2\) (Equal variances)
  • \(H_a\): At least one variance differs
Code
library(car)

# Levene's test
leveneTest(yield ~ treatment, data = crd_data)
#> Levene's Test for Homogeneity of Variance (center = median)
#>       Df F value Pr(>F)
#> group  3  0.9206 0.4489
#>       20

Interpretation:

  • p = 0.4489 > 0.05 → Fail to reject \(H_0\)
  • No evidence of unequal variances
  • Homoscedasticity assumption satisfied
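What the test does can be sketched in base R: with `center = median`, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median. The data are simulated (illustrative values), and `abs_dev` is a helper column introduced here.

```r
# Simulated CRD: 4 nitrogen rates, 6 plots each (illustrative values)
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(rnorm(6, 2400, 150), rnorm(6, 2650, 150),
            rnorm(6, 2900, 150), rnorm(6, 2950, 150))
)

# Absolute deviation of each observation from its group median
grp_median <- ave(crd_data$yield, crd_data$treatment, FUN = median)
crd_data$abs_dev <- abs(crd_data$yield - grp_median)

# One-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ treatment, data = crd_data))
levene_by_hand
```

A large F here means the typical spread around the median differs between groups, which is what unequal variances look like.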

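Caution

Shapiro-Wilk and Levene’s test are both sensitive to sample size: with few observations they have little power to detect violations, and with many they flag trivial departures. Use plots and tests together.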
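A small simulation can illustrate the sample-size issue for Shapiro-Wilk. The helper `reject_rate()`, the gamma distribution with shape 10, the sample sizes, and the number of simulations are all assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed

# Share of simulated samples in which Shapiro-Wilk rejects at alpha = 0.05
reject_rate <- function(n, rdist, n_sim = 500) {
  p_values <- replicate(n_sim, shapiro.test(rdist(n))$p.value)
  mean(p_values < 0.05)
}

# Mildly skewed errors: gamma with shape 10 (illustrative choice)
mild_skew <- function(n) rgamma(n, shape = 10)

rate_small <- reject_rate(10, mild_skew)    # small sample
rate_large <- reject_rate(1000, mild_skew)  # large sample

c(n10 = rate_small, n1000 = rate_large)
```

The same mild skewness is rejected far more often at n = 1000 than at n = 10, so a p-value alone says little about how serious a departure is.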

When Assumptions Fail: Transformations

Common Data Transformations

Transformation Table:

Problem               Transformation
--------------------  ------------------------------------
Right skew            log(y) or sqrt(y)
Count data            sqrt(y + 0.5)
Proportion data       arcsin(sqrt(y))
Heteroscedasticity    log(y) often helps
Unsure which to use   Box-Cox (estimates the optimal power λ)

When to transform:

  • Q-Q plot shows clear skewness
  • Residual plot shows funnel shape
  • Ratio of largest to smallest group variance exceeds roughly 3:1 (a rule of thumb; sources differ on the cutoff)
  • Scientific precedent (e.g., log for enzyme activity)
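The variance-ratio check in the list above takes only a few lines. This sketch uses simulated CRD yields (illustrative values); `group_var` and `var_ratio` are names introduced here.

```r
# Simulated CRD: 4 nitrogen rates, 6 plots each (illustrative values)
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(rnorm(6, 2400, 150), rnorm(6, 2650, 150),
            rnorm(6, 2900, 150), rnorm(6, 2950, 150))
)

# Sample variance within each treatment group
group_var <- tapply(crd_data$yield, crd_data$treatment, var)

# Largest group variance divided by the smallest
var_ratio <- max(group_var) / min(group_var)

round(group_var)
round(var_ratio, 2)
```

With only six observations per group, sample variances are noisy, so the ratio is best read alongside the residual plots rather than as a strict cutoff.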

Remember: Transform → Analyze → Interpret on original scale (back-transform means if needed)

Box-Cox Transformation

Finds optimal power transformation: \[y^{(\lambda)} = \begin{cases} \frac{y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(y) & \text{if } \lambda = 0 \end{cases}\]
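The formula translates directly into a small function. `box_cox()` is a helper written for this illustration (not a function from MASS), and the input values are arbitrary positive numbers.

```r
# Box-Cox power transformation for a given lambda
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # the transformation is defined for positive y only
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 4, 9, 16)

box_cox(y, 1)    # shifts y by 1: 0 3 8 15
box_cox(y, 0.5)  # rescaled square root: 0 2 4 6
box_cox(y, 0)    # natural log of y
```

For λ = 1 the result is just y - 1, which is why λ near 1 means no transformation is needed.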

Code
library(MASS)

# Box-Cox on CRD model
bc <- boxcox(model_crd, lambda = seq(-2, 2, 0.1))
Figure 6: Box-Cox transformation plot: Optimal λ near 1 (no transformation needed)
Code
# Extract optimal lambda
lambda_optimal <- bc$x[which.max(bc$y)]

cat("Optimal λ =", round(lambda_optimal, 2))
#> Optimal λ = 1.11

Interpretation:

  • λ = 1 → No transformation
  • λ = 0.5 → Square root
  • λ = 0 → Log
  • λ = -1 → Reciprocal
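The point estimate of λ is rarely exact, so it helps to look at a range of plausible values. The sketch below computes an approximate 95% interval from the profile log-likelihood returned by `boxcox()`, on simulated CRD yields (illustrative values); it assumes MASS is installed, and `lambda_hat`, `cutoff`, and `lambda_ci` are names introduced here.

```r
library(MASS)  # assumed installed; ships with standard R distributions

# Simulated CRD: 4 nitrogen rates, 6 plots each (illustrative values)
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(rnorm(6, 2400, 150), rnorm(6, 2650, 150),
            rnorm(6, 2900, 150), rnorm(6, 2950, 150))
)
model_crd <- lm(yield ~ treatment, data = crd_data)

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(model_crd, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[bc$y > cutoff])

lambda_hat
lambda_ci  # if this reaches the ends of the grid, widen the grid
```

If the interval contains 1, the data give no strong reason to transform; if it contains 0 or 0.5, the log or square root is a convenient, interpretable choice.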

Transformation Example: Right-Skewed Data

Code
# Simulate right-skewed yield data (common in real trials)
set.seed(2026)
skewed_data <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 10),
  yield = c(
    exp(rnorm(10, 5, 0.5)),  # Treatment A
    exp(rnorm(10, 7, 0.5)),  # Treatment B
    exp(rnorm(10, 9, 0.5))   # Treatment C
  )
)

# Fit model
model_skewed <- lm(yield ~ treatment, data = skewed_data)

# Check normality
shapiro.test(resid(model_skewed))
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  resid(model_skewed)
#> W = 0.83621, p-value = 0.0003231

Result: p = 0.0003231 < 0.05 → Significant non-normality

Applying Log Transformation

Code
# Log transformation
skewed_data$log_yield <- log(skewed_data$yield)

# Fit model on transformed data
model_log <- lm(log_yield ~ treatment, data = skewed_data)

# Check normality again
shapiro.test(resid(model_log))
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  resid(model_log)
#> W = 0.94608, p-value = 0.1326

Result: p = 0.1326 > 0.05 → Normality assumption now satisfied!

Re-run ANOVA:

Code
library(car)
# Anova() from car is another way to produce the ANOVA table
Anova(model_log)
#> Anova Table (Type II tests)
#> 
#> Response: log_yield
#>            Sum Sq Df F value    Pr(>F)    
#> treatment 104.720  2  202.49 < 2.2e-16 ***
#> Residuals   6.982 27                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Transformation fixed the violation. Now interpret results on log scale or back-transform means for reporting.
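Back-transforming means can be sketched as follows, using skewed data simulated the same way as in the example above. Exponentiating a mean of logs gives the geometric mean, not the arithmetic mean, so the two columns differ.

```r
# Simulated right-skewed yields, three treatments with 10 units each
set.seed(2026)
skewed_data <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 10),
  yield = c(exp(rnorm(10, 5, 0.5)),
            exp(rnorm(10, 7, 0.5)),
            exp(rnorm(10, 9, 0.5)))
)
skewed_data$log_yield <- log(skewed_data$yield)

# Treatment means on the log scale
log_means <- tapply(skewed_data$log_yield, skewed_data$treatment, mean)

# Back-transform: exp(mean of logs) = geometric mean on the original scale
back_transformed <- exp(log_means)

# Arithmetic means of the raw data, for comparison
arithmetic_means <- tapply(skewed_data$yield, skewed_data$treatment, mean)

round(cbind(back_transformed, arithmetic_means), 1)
```

When reporting, say that the back-transformed values are geometric means (estimates of the median under a lognormal model), since they will be smaller than the raw arithmetic means.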

Before vs After Transformation

Figure 7: Q-Q plot BEFORE log transformation
Figure 8: Q-Q plot AFTER log transformation

Your Turn Activity

Diagnose This Dataset

Scenario:
A food scientist tests 4 pasteurization temperatures on bacterial count (CFU/mL) in milk samples. CRD with 6 samples per temperature.

Data provided (first 8 of 24 rows shown):

#>   temperature cfu
#> 1         60C 113
#> 2         60C 133
#> 3         60C 101
#> 4         60C 121
#> 5         60C 138
#> 6         60C 125
#> 7         65C  68
#> 8         65C  64

Questions (3 minutes):

  1. Fit ANOVA model
  2. Check Q-Q plot and residual plot
  3. Is transformation needed? If so, which one?
  4. Share your diagnosis with a neighbor

Summary & Key Takeaways

What We Learned Today

  • Three ANOVA assumptions: Normality, homogeneity of variance, independence
  • Diagnostic tools: Q-Q plots, residual plots, Shapiro-Wilk, Levene’s test
  • Visual diagnostics are primary; formal tests are secondary
  • Normality violations are less serious with balanced designs and large sample sizes
  • Heteroscedasticity (unequal variances) is more serious and should always be addressed
  • Transformations fix many violations: log, sqrt, Box-Cox

Key principle: Never trust ANOVA results without checking assumptions first!

Resources

Textbook:

  • Oehlert (2010), Chapters 3 and 6: model diagnostics and transformations

R Packages:

  • car: leveneTest(), qqPlot(), Anova()
  • MASS: boxcox() for transformation selection
  • ggplot2: Custom diagnostic plots

Online Resources:

  • Q-Q plot interpretation: link