CRD: Assumptions & Diagnostics

Checking Assumptions and Diagnosing Problems

Dr. Samuel B Fernandes

2026-01-01

Learning Objectives

By the end of this lecture, you should be able to:

Create and interpret diagnostic plots: Q-Q plot, residual vs fitted, scale-location
Conduct formal tests for assumptions: Shapiro-Wilk test, Levene’s test

ANOVA Assumptions Explained

Assumption 1: Normality

What it means:

Errors (residuals) follow a normal distribution
Not the raw data, but the residuals from the model

Why it matters:

F-test relies on normal distribution theory
Violated → p-values unreliable

Good news:

ANOVA is robust to mild non-normality with balanced designs
Central Limit Theorem helps with larger sample sizes (rule of thumb: n > 30 per group)

Residual = Observed - Predicted \[e_{ij} = y_{ij} - \hat{y}_{ij}\]

where \(\hat{y}_{ij} = \bar{y}_{i\cdot}\), the mean of treatment group \(i\), is the fitted value for every observation \(j\) in that group.
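The definition can be checked by hand. This is a minimal sketch on a small made-up data set (the `toy` data frame and its values are illustrative assumptions, not course data): subtracting each treatment mean from its observations should reproduce what `resid()` returns for the one-way model.

```r
# Illustrative data (hypothetical values): 3 treatments, 4 plots each
toy <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 4),
  yield = c(10, 12, 11, 13, 15, 14, 16, 17, 20, 19, 22, 21)
)

# Fitted value for each observation = mean of its treatment group
group_mean <- ave(toy$yield, toy$treatment, FUN = mean)

# Residual = observed - fitted
e_manual <- toy$yield - group_mean

# Same residuals from the one-way model
fit <- lm(yield ~ treatment, data = toy)
all.equal(e_manual, unname(resid(fit)))
```

Because each residual is a deviation from its own group mean, the residuals sum to zero within every treatment.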

Assumption 2: Homogeneity of Variance

What it means:

All treatment groups have the same error variance \(\sigma^2\)
Also called homoscedasticity

Why it matters:

F-test assumes equal variance
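Violated → The pooled error variance is too large for some groups and too small for others, so F-tests and mean comparisons can be too liberal or too conservative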
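A quick numerical look at this assumption is to compare residual variances across groups. The sketch below uses simulated data in which one group is deliberately noisier (the `sim` data frame, seed, and group labels are assumptions made for illustration).

```r
set.seed(1)  # arbitrary seed for this illustration
sim <- data.frame(
  treatment = rep(c("T1", "T2", "T3"), each = 8),
  # T3 is given a larger standard deviation by construction
  y = c(rnorm(8, 50, 2), rnorm(8, 55, 2), rnorm(8, 60, 8))
)
fit <- lm(y ~ treatment, data = sim)

# Variance of the residuals within each treatment group
group_var <- tapply(resid(fit), sim$treatment, var)
group_var

# Ratio of largest to smallest group variance (1 = perfectly equal)
var_ratio <- max(group_var) / min(group_var)
var_ratio
```

With only a few observations per group, sample variances are themselves noisy, so the ratio is a rough guide rather than a test.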

Assumption 3: Independence

What it means:

Each observation is independent: no correlation between errors
Randomization justifies treating errors as independent, but only if the units really are separate (see the violations below)

Common violations in agriculture:

Spatial correlation: Adjacent plots influence each other (water runoff, pest movement)
Temporal correlation: Repeated measures on same unit over time
Cluster effects: Animals in same pen more similar than random

How to prevent:

Proper randomization
Physical separation of units
Blocking to account for spatial patterns
Use mixed models for repeated measures (later in course)
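Proper randomization for a CRD can be sketched in a few lines. The treatment labels follow the nitrogen example used in this lecture; the seed, the object name `plot_plan`, and the plot numbering are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed so the plan is reproducible
treatments <- c("N0", "N50", "N100", "N150")
reps <- 6

# One row per experimental unit, with treatments assigned in random order
plot_plan <- data.frame(
  plot = seq_len(length(treatments) * reps),
  treatment = sample(rep(treatments, each = reps))
)

head(plot_plan)
table(plot_plan$treatment)  # each treatment still appears 6 times
```

Shuffling the full vector of treatment labels keeps the design balanced while leaving the position of each treatment to chance.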

Checking Normality

The Q-Q Plot

Quantile-Quantile (Q-Q) Plot:

Compares sample quantiles to theoretical normal quantiles
Good: Points fall on diagonal line
Bad: Points deviate systematically

Interpretation guide:

Light tails: Flatter than line at ends
Heavy tails: Steeper than line at ends
Skewness: Points curve away from the line on one side (both ends above the line for right skew, both ends below it for left skew)

Figure 1: Good Q-Q plot: Points follow the line closely
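A plot like this can be drawn with base R. The sketch below is one possible version, using data simulated in the style of this lecture's nitrogen example (the object names `d`, `fit`, and `qq` are assumptions; the figure above may have been produced differently).

```r
set.seed(2026)
d <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = rnorm(24, mean = rep(c(2400, 2650, 2900, 2950), each = 6), sd = 150)
)
fit <- lm(yield ~ treatment, data = d)

# Q-Q plot of standardized residuals against theoretical normal quantiles
qq <- qqnorm(rstandard(fit), main = "Normal Q-Q plot of residuals")

# Reference line through the first and third quartiles
qqline(rstandard(fit), lty = 2)
```

`qqnorm()` also returns the plotted coordinates invisibly (`qq$x` for theoretical quantiles, `qq$y` for sample quantiles), which is handy for custom plots.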

Q-Q Plot Examples: Good vs Bad

Figure 2: Bad: Right-skewed residuals (transformation needed)

Figure 3: Bad: Heavy-tailed distribution (outliers present)
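To practise reading these shapes, it can help to simulate samples whose distribution is known. This sketch is illustrative only: the distributions, sample size, and seed are assumptions, not the data behind the figures above.

```r
set.seed(123)  # arbitrary seed
n <- 200
samples <- list(
  "Normal"       = rnorm(n),
  "Right-skewed" = rexp(n),        # exponential: long right tail
  "Heavy-tailed" = rt(n, df = 3)   # t with 3 df: more extreme values than normal
)

# One Q-Q panel per distribution
old_par <- par(mfrow = c(1, 3))
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm)
  qqline(samples[[nm]], lty = 2)
}
par(old_par)
```

Rerunning with a smaller `n` (say 24) shows how much wobble is normal in a small experiment, even when the data really are normal.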

Shapiro-Wilk Test for Normality

Formal hypothesis test:

\(H_0\): Residuals are normally distributed
\(H_a\): Residuals are not normally distributed

Code

# Reload data from Lecture 3
set.seed(2026)
crd_data <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = c(
    rnorm(6, 2400, 150), rnorm(6, 2650, 150),
    rnorm(6, 2900, 150), rnorm(6, 2950, 150)
  )
)

# Fit model
model_crd <- lm(yield ~ treatment, data = crd_data)

# Extract residuals
residuals <- resid(model_crd)

# Shapiro-Wilk test
shapiro.test(residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  residuals
#> W = 0.94396, p-value = 0.1998

Interpretation:

p = 0.1998 > 0.05 → Fail to reject \(H_0\)
No evidence of non-normality (with only 24 residuals the test has limited power, so also check the Q-Q plot)
Reasonable to proceed with ANOVA

Checking Homogeneity of Variance

Residual vs Fitted Plot

Code

library(ggplot2)

# Create diagnostic data
diag_data <- data.frame(
  fitted = fitted(model_crd),
  residuals = resid(model_crd),
  treatment = crd_data$treatment
)

ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_point(aes(color = treatment), size = 3, alpha = 0.7) +
  scale_color_manual(values = c("N0" = "#9D2235", 
                                 "N50" = "#FF8C42",
                                 "N100" = "#2E8B57", 
                                 "N150" = "#4A90E2")) +
  labs(
    title = "Residuals vs Fitted Values",
    subtitle = "Random scatter = good; funnel shape = heteroscedasticity",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 14)

Figure 4: Residual vs fitted plot to check homoscedasticity

What to look for:

Good: Random scatter around zero, no pattern
Bad: Funnel shape (variance increases), systematic trends

Scale-Location Plot

Code

# Square root of absolute standardized residuals (matches the y-axis label)
diag_data <- diag_data |>
  dplyr::mutate(sqrt_abs_resid = sqrt(abs(rstandard(model_crd))))

ggplot(diag_data, aes(x = fitted, y = sqrt_abs_resid)) +
  geom_point(aes(color = treatment), size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#FF8C42", linewidth = 1) +
  scale_color_manual(values = c("N0" = "#9D2235", 
                                 "N50" = "#FF8C42",
                                 "N100" = "#2E8B57", 
                                 "N150" = "#4A90E2")) +
  labs(
    title = "Scale-Location Plot",
    subtitle = "Flat smooth line = constant variance",
    x = "Fitted Values",
    y = "√|Standardized Residuals|"
  ) +
  theme_minimal(base_size = 14)

Figure 5: Scale-location plot: Checking for heteroscedasticity

Interpretation:

Smooth line should be roughly horizontal
Upward or downward trend → heteroscedasticity
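For a quick look without building custom plots, base R's `plot()` method for `lm` objects draws the same three diagnostics. A minimal sketch on data simulated in the style of the lecture example (the object names `d` and `fit` are assumptions):

```r
set.seed(2026)
d <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = rnorm(24, mean = rep(c(2400, 2650, 2900, 2950), each = 6), sd = 150)
)
fit <- lm(yield ~ treatment, data = d)

old_par <- par(mfrow = c(1, 3))
plot(fit, which = 1)  # residuals vs fitted
plot(fit, which = 2)  # normal Q-Q
plot(fit, which = 3)  # scale-location
par(old_par)
```

In a CRD the fitted values are the treatment means, so the points line up in one vertical strip per treatment; compare the spread of the strips.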

Levene’s Test for Equal Variance

Formal hypothesis test:

\(H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_t^2\) (Equal variances)
\(H_a\): At least one variance differs

Code

library(car)

# Levene's test
leveneTest(yield ~ treatment, data = crd_data)

#> Levene's Test for Homogeneity of Variance (center = median)
#>       Df F value Pr(>F)
#> group  3  0.9206 0.4489
#>       20

Interpretation:

p = 0.4489 > 0.05 → Fail to reject \(H_0\)
No evidence of unequal variances
Equal-variance assumption is reasonable (confirm with the residual plots)
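It helps to see what the test computes. With `center = median`, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on simulated data (object names and data are assumptions); its F statistic should match `leveneTest()` on the same data.

```r
set.seed(2026)
d <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = rnorm(24, mean = rep(c(2400, 2650, 2900, 2950), each = 6), sd = 150)
)

# Absolute deviation of each observation from its treatment median
d$abs_dev <- abs(d$yield - ave(d$yield, d$treatment, FUN = median))

# One-way ANOVA on the absolute deviations
lev <- anova(lm(abs_dev ~ treatment, data = d))
lev

# p-value for the treatment row
lev_p <- lev[["Pr(>F)"]][1]
lev_p
```

Groups with larger spread have larger average absolute deviations, so a significant F here points to unequal variances.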

Caution

Like Shapiro-Wilk, Levene’s test is sensitive to sample size: small samples give little power to detect real violations, while large samples flag trivial ones. Use plots and tests together.
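The sample-size effect is easy to see by simulation. In this sketch both samples come from the same moderately heavy-tailed population (a t distribution with 5 degrees of freedom); the distribution, sample sizes, and seed are assumptions chosen for illustration.

```r
set.seed(7)  # arbitrary seed

# Same non-normal population, two very different sample sizes
small <- rt(15, df = 5)
large <- rt(5000, df = 5)  # 5000 is the largest n shapiro.test() accepts

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

c(n15 = p_small, n5000 = p_large)
```

The small sample will often fail to reject even though the population is not normal, while the large sample rejects decisively. Neither p-value says how much the departure matters for the ANOVA, which is why the plots come first.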

When Assumptions Fail: Transformations

Common Data Transformations

Transformation Table:

Problem	Transformation
Right skew	`log(y)` or `sqrt(y)`
Count data	`sqrt(y + 0.5)`
Proportion data	`asin(sqrt(y))` (arcsine square root)
Heteroscedasticity	`log(y)` often helps
Unsure	Box-Cox estimates a suitable power \(\lambda\)

When to transform:

Q-Q plot shows clear skewness
Residual plot shows funnel shape
Largest-to-smallest group variance ratio above about 3:1 (a rule of thumb; cutoffs vary by source)
Scientific precedent (e.g., log for enzyme activity)

Remember: Transform → Analyze → Interpret on original scale (back-transform means if needed)

Box-Cox Transformation

Finds optimal power transformation: \[y^{(\lambda)} = \begin{cases} \frac{y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(y) & \text{if } \lambda = 0 \end{cases}\]
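The piecewise formula translates directly into a small function. This is an illustrative helper (the name `box_cox` is an assumption, not a function from MASS), and it requires positive data, as the formula does.

```r
# Box-Cox transform of a positive vector y for a given lambda
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) {
    log(y)                      # limiting case lambda = 0
  } else {
    (y^lambda - 1) / lambda     # general power transformation
  }
}

y <- c(1, 4, 9, 16)
box_cox(y, 1)    # y - 1: only a shift, shape unchanged
box_cox(y, 0.5)  # 2 * (sqrt(y) - 1): a rescaled square root
box_cox(y, 0)    # log(y)
```

Subtracting 1 and dividing by λ only shifts and rescales the data, so for ANOVA purposes λ = 0.5 is equivalent to `sqrt(y)` and λ = -1 to `1/y`.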

Code

library(MASS)

# Box-Cox on CRD model
bc <- boxcox(model_crd, lambda = seq(-2, 2, 0.1))

Figure 6: Box-Cox transformation plot: Optimal λ near 1 (no transformation needed)

Code

# Extract optimal lambda
lambda_optimal <- bc$x[which.max(bc$y)]

cat("Optimal λ =", round(lambda_optimal, 2))

#> Optimal λ = 1.11

Interpretation:

λ = 1 → No transformation
λ = 0.5 → Square root
λ = 0 → Log
λ = -1 → Reciprocal
Here λ ≈ 1.11 is close to 1 → no transformation needed
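The point estimate of λ is rarely a round number, so it is common to check which convenient values fall inside an approximate 95% interval. The sketch below reads that interval off the profile log-likelihood returned by `boxcox()`; the data are simulated in the style of the lecture example, and the names `lambda_hat`, `in_ci`, and `lambda_ci` are assumptions.

```r
library(MASS)

set.seed(2026)
d <- data.frame(
  treatment = rep(c("N0", "N50", "N100", "N150"), each = 6),
  yield = rnorm(24, mean = rep(c(2400, 2650, 2900, 2950), each = 6), sd = 150)
)
fit <- lm(yield ~ treatment, data = d)

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

lambda_hat
lambda_ci
```

If the interval contains 1, the data give no reason to transform. An interval that reaches the edge of the grid means the likelihood is flat and λ is poorly determined, which is typical when the response varies over a narrow relative range.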

Transformation Example: Right-Skewed Data

Code

# Simulate right-skewed yield data (common in real trials)
set.seed(2026)
skewed_data <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 10),
  yield = c(
    exp(rnorm(10, 5, 0.5)),  # Treatment A
    exp(rnorm(10, 7, 0.5)),  # Treatment B
    exp(rnorm(10, 9, 0.5))   # Treatment C
  )
)

# Fit model
model_skewed <- lm(yield ~ treatment, data = skewed_data)

# Check normality
shapiro.test(resid(model_skewed))

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  resid(model_skewed)
#> W = 0.83621, p-value = 0.0003231

Result: p = 0.0003231 < 0.05 → Significant non-normality

Applying Log Transformation

Code

# Log transformation
skewed_data$log_yield <- log(skewed_data$yield)

# Fit model on transformed data
model_log <- lm(log_yield ~ treatment, data = skewed_data)

# Check normality again
shapiro.test(resid(model_log))

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  resid(model_log)
#> W = 0.94608, p-value = 0.1326

Result: p = 0.1326 > 0.05 → No evidence against normality after the log transformation

Re-run ANOVA:

Code

library(car)

# Anova() from car: another way to get the ANOVA table (Type II tests)
Anova(model_log)

#> Anova Table (Type II tests)
#> 
#> Response: log_yield
#>            Sum Sq Df F value    Pr(>F)    
#> treatment 104.720  2  202.49 < 2.2e-16 ***
#> Residuals   6.982 27                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

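The transformation removed the evidence of non-normality. Interpret results on the log scale, or back-transform the means for reporting; a back-transformed log-scale mean is a geometric mean, not the arithmetic mean of the original data.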
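Back-transforming for reporting might look like the sketch below. It uses data simulated in the style of the skewed example (object names such as `sk`, `fit_log`, and `back` are assumptions) and exponentiates the log-scale means and their confidence limits.

```r
set.seed(2026)
sk <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 10),
  yield = exp(rnorm(30, mean = rep(c(5, 7, 9), each = 10), sd = 0.5))
)
fit_log <- lm(log(yield) ~ treatment, data = sk)

# Estimated means and 95% confidence limits on the log scale
new <- data.frame(treatment = c("A", "B", "C"))
log_est <- predict(fit_log, newdata = new, interval = "confidence")

# Back-transform: exp() of a log-scale mean is a geometric mean
back <- data.frame(treatment = new$treatment, exp(log_est))
back

# Check against geometric means computed directly from the data
geo_mean <- tapply(sk$yield, sk$treatment, function(v) exp(mean(log(v))))
geo_mean
```

The back-transformed limits are not symmetric around the estimate. That is expected: symmetry holds on the log scale, where the model was fitted.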

Before vs After Transformation

Figure 7: Q-Q plot BEFORE log transformation

Figure 8: Q-Q plot AFTER log transformation

Your Turn Activity

Diagnose This Dataset

Scenario:
A food scientist tests 4 pasteurization temperatures on bacterial count (CFU/mL) in milk samples. CRD with 6 samples per temperature.

Data provided (first 8 of 24 rows shown):

#>   temperature cfu
#> 1         60C 113
#> 2         60C 133
#> 3         60C 101
#> 4         60C 121
#> 5         60C 138
#> 6         60C 125
#> 7         65C  68
#> 8         65C  64

Questions (3 minutes):

Fit ANOVA model
Check Q-Q plot and residual plot
Is transformation needed? If so, which one?
Share your diagnosis with a neighbor

Summary & Key Takeaways

What We Learned Today

Three ANOVA assumptions: Normality, homogeneity of variance, independence
Diagnostic tools: Q-Q plots, residual plots, Shapiro-Wilk, Levene’s test
Visual diagnostics are primary; formal tests are secondary
Normality violations are less serious with balanced designs and large sample sizes
Heteroscedasticity (unequal variances) is more serious: always address it
Transformations fix many violations: log, sqrt, Box-Cox

Key principle: Never trust ANOVA results without checking assumptions first!

Resources

Textbook:

Oehlert (2010), Chapters 3 and 6: Model diagnostics and transformations

R Packages:

car: leveneTest(), qqPlot(), Anova()
MASS: boxcox() for transformation selection
ggplot2: Custom diagnostic plots

Online Resources:

Q-Q plot interpretation: link