Chapter 1: Mathematical Comparison: t-test vs TOST

1. t-test: Testing for Difference

In a standard two-sample t-test, we assess:

  • Null Hypothesis (H₀): μ₁ = μ₂ (no difference)
  • Alternative (H₁): μ₁ ≠ μ₂ (there is a difference)

The test statistic:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

with pooled standard deviation \(s_p\), follows a t-distribution with \(n_1 + n_2 - 2\) degrees of freedom under H₀. We reject H₀ at significance level α if \(|t| > t_{α/2,\, n_1 + n_2 - 2}\).
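As a minimal sketch of the statistic above, the following computes the pooled t statistic by hand and compares it with R's built-in `t.test()` using `var.equal = TRUE`. The data vectors `x1` and `x2` are hypothetical values chosen only for illustration.

```r
# Hypothetical example data (not taken from the text)
x1 <- c(5.1, 4.9, 5.6, 5.8, 5.0, 5.3)
x2 <- c(5.4, 5.9, 6.1, 5.7, 6.0, 5.5)
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation s_p
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Test statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_stat <- (mean(x1) - mean(x2)) / (s_p * sqrt(1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2
p_val <- 2 * pt(-abs(t_stat), df_pooled)

# Built-in pooled-variance test for comparison
ref <- t.test(x1, x2, var.equal = TRUE)
print(all.equal(t_stat, unname(ref$statistic)))
print(all.equal(p_val, ref$p.value))
```

Note that `t.test()` defaults to the Welch (unequal-variance) version; `var.equal = TRUE` is needed to reproduce the pooled formula.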

Key limitation: p > 0.05 means only that the data provide insufficient evidence of a difference; it is not evidence of similarity.


2. TOST: Testing for Equivalence

In TOST, we define equivalence bounds: \([-Δ, +Δ]\), where Δ is the maximum acceptable difference.

TOST (two one-sided tests) consists of two one-sided tests, each carried out at level α:

  1. \(H₀: μ₁ - μ₂ ≤ -Δ\) vs \(H₁: μ₁ - μ₂ > -Δ\)
  2. \(H₀: μ₁ - μ₂ ≥ +Δ\) vs \(H₁: μ₁ - μ₂ < +Δ\)

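Both null hypotheses are rejected, and equivalence is concluded, exactly when the 100(1 - 2α)% confidence interval for \(μ₁ - μ₂\) lies entirely within \([-Δ, +Δ]\). For α = 0.05 this is the 90% interval.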

Visualization: Equivalence is demonstrated if the CI ⊂ [-Δ, +Δ]
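The two routes to the TOST decision described above can be sketched as follows: once as two one-sided Welch tests, once as the confidence-interval check. The data, the margin `delta`, and the level `alpha` are hypothetical values chosen for illustration, not values from the text.

```r
# Hypothetical example data and equivalence margin (not taken from the text)
x1 <- c(5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.2, 5.5)
x2 <- c(5.4, 5.2, 5.6, 5.7, 5.0, 5.5, 5.3, 5.1)
delta <- 0.5
alpha <- 0.05

# Test 1: H0: mu1 - mu2 <= -delta  vs  H1: mu1 - mu2 > -delta
p_lower <- t.test(x1, x2, mu = -delta, alternative = "greater")$p.value
# Test 2: H0: mu1 - mu2 >= +delta  vs  H1: mu1 - mu2 < +delta
p_upper <- t.test(x1, x2, mu = delta, alternative = "less")$p.value

# Equivalence is declared only if both one-sided tests reject at level alpha
equiv_by_tests <- max(p_lower, p_upper) < alpha

# Same decision from the 100(1 - 2 * alpha)% confidence interval
ci <- t.test(x1, x2, conf.level = 1 - 2 * alpha)$conf.int
equiv_by_ci <- ci[1] > -delta && ci[2] < delta

print(c(by_tests = equiv_by_tests, by_ci = equiv_by_ci))
```

Both checks use the same standard error and degrees of freedom, so they always agree; the interval form is often easier to report and plot.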


Summary of Logic

Feature        t-test                         TOST
Goal           Detect a difference            Confirm similarity
H₀             μ₁ = μ₂                        |μ₁ - μ₂| ≥ Δ
CI threshold   (1 - α) CI excludes 0          (1 - 2α) CI fully within [-Δ, +Δ]
Conclusion     Difference detected or not     Equivalence demonstrated or not

3. Simulation: t-test vs TOST

We ran 10,000 simulated experiments at each per-group sample size from n = 2 to n = 100 to evaluate the following:

  • Random sampling from two groups whose means differ by a small amount (about 0.2) that lies well inside the equivalence margin Δ = 1.5 × SD of group 1 (about 0.48)
  • Comparison of the two sample means with each method
  • The correct conclusion in all cases: the two groups are equivalent within the margin, even though their means are not identical
  • Estimation of the probability of a correct conclusion for two statistical approaches: a t-test, where a non-significant result (p > 0.05) is read as "no difference", and TOST, where the 90% confidence interval must lie entirely within [-Δ, +Δ]

Key Results

Figure: Simulation results

  • t-test: Designed to detect differences. Its power grows with sample size, so the small true difference is flagged as significant more and more often and the share of non-significant results falls. At small n the test is usually non-significant, but only because it lacks power, which gives a false sense of equivalence. At large n it reports a difference that is too small to matter. In neither case does it address equivalence.

  • TOST: Designed to test for equivalence. As sample size increases, the confidence interval narrows and TOST becomes more likely to correctly declare equivalence, reaching near-perfect accuracy beyond n ≈ 60.

Insight: A non-significant t-test does not imply equivalence. Non-significance at small n reflects low power, and at large n the t-test detects differences that are practically irrelevant, while TOST correctly declares equivalence more often as sample size increases. Equivalence requires a dedicated test, and TOST fulfills that purpose.
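How often a two-sample t-test flags a fixed small difference at each sample size can also be computed analytically with `power.t.test()` from base R's stats package. The sketch below assumes a true difference of 0.2 and a common standard deviation of 0.38; both are rough stand-ins for the simulation settings (the simulated groups actually have unequal SDs), so the numbers are indicative only.

```r
# Assumed values: difference of about 0.2, common SD of about 0.38
# (a rough average of the two group SDs used in the simulation)
n_grid <- c(5, 10, 20, 40, 80)
power <- sapply(n_grid, function(n) {
  power.t.test(n = n, delta = 0.2, sd = 0.38, sig.level = 0.05)$power
})

# Power of the two-sided, two-sample t-test at each per-group sample size
print(data.frame(n = n_grid, power = round(power, 3)))
```

The complement, `1 - power`, is the expected share of non-significant results at each sample size.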


4. Code used for the Simulation

library(parallel)

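# Parallelized t-test simulation.
# Returns, for each per-group sample size, the percentage of replicates with a
# non-significant Welch t-test (p > 0.05).
sim_ttest <- function(target_n, g1, g2, reps = 10000, seed = 815) {
  cl <- makeCluster(max(1, detectCores() - 1, na.rm = TRUE))
  on.exit(stopCluster(cl), add = TRUE)   # release the workers even on error
  # set.seed() in the main session does not reach the workers; seed them here
  # (results are reproducible for a fixed number of workers)
  clusterSetRNGStream(cl, iseed = seed)
  clusterExport(cl, varlist = c("g1", "g2", "reps"), envir = environment())
  n_vec <- 2:target_n
  results <- parSapply(cl, n_vec, function(i) {
    mean(replicate(reps, {
      g1.s <- sample(g1, i)
      g2.s <- sample(g2, i)
      t.test(g1.s, g2.s)$p.value > 0.05
    }))
  })
  data.frame(n = n_vec, probability = results * 100)
}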

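# Parallelized TOST simulation via the 90% Welch confidence interval.
# Returns, for each per-group sample size, the percentage of replicates in which
# the interval lies entirely within [-margin, +margin] (equivalence declared).
sim_tost <- function(target_n, g1, g2, margin, reps = 10000, seed = 815) {
  cl <- makeCluster(max(1, detectCores() - 1, na.rm = TRUE))
  on.exit(stopCluster(cl), add = TRUE)   # release the workers even on error
  # set.seed() in the main session does not reach the workers; seed them here
  clusterSetRNGStream(cl, iseed = seed)
  clusterExport(cl, varlist = c("g1", "g2", "margin", "reps"), envir = environment())
  n_vec <- 2:target_n
  results <- parSapply(cl, n_vec, function(i) {
    mean(replicate(reps, {
      g1.s <- sample(g1, i)
      g2.s <- sample(g2, i)
      mean_diff <- mean(g2.s) - mean(g1.s)
      v1 <- var(g1.s) / i   # squared standard error of each sample mean
      v2 <- var(g2.s) / i
      se <- sqrt(v1 + v2)
      # Welch-Satterthwaite degrees of freedom
      df_welch <- (v1 + v2)^2 / (v1^2 / (i - 1) + v2^2 / (i - 1))
      t_crit <- qt(0.95, df_welch)   # alpha = 0.05 per one-sided test
      lci <- mean_diff - t_crit * se
      uci <- mean_diff + t_crit * se
      (lci > -margin) && (uci < margin)
    }))
  })
  data.frame(n = n_vec, probability = results * 100)
}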


# Set seed and simulate
set.seed(0815)
g1 <- rnorm(1000, 0.8987491, 0.3197588)
g2 <- rnorm(1000, 1.098018, 0.4273691)
margin <- 1.5 * sd(g1)

result_ttest <- sim_ttest(100, g1, g2)
result_tost <- sim_tost(100, g1, g2, margin)

# Combine and plot
library(ggplot2)
result_ttest$method <- "t.test"
result_tost$method <- "equivalence.test"
df_plot <- rbind(result_ttest, result_tost)

ggplot(df_plot, aes(n, probability, color = method)) +
  geom_point(alpha = 0.5) + geom_line() +
  theme_bw() + xlab("Sample size (n)") +
  ylab("Correct conclusion probability (%)") +
  ggtitle("Comparison of t-test vs Equivalence Test (TOST logic)")
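Before interpreting the curves, it can help to check where the difference between the two simulated populations sits relative to the equivalence margin. The sketch below regenerates `g1`, `g2`, and `margin` with the same settings as the script above; the name `true_diff` is introduced here for illustration only.

```r
# Same population settings as in the simulation script (0815 parses as 815)
set.seed(815)
g1 <- rnorm(1000, 0.8987491, 0.3197588)
g2 <- rnorm(1000, 1.098018, 0.4273691)
margin <- 1.5 * sd(g1)

# Difference in population means, on the same scale as the margin
true_diff <- mean(g2) - mean(g1)
print(c(true_diff = true_diff, margin = margin))

# TRUE means the populations are equivalent under the chosen margin
print(abs(true_diff) < margin)
```

If the difference lies inside the margin, a correct equivalence test should declare equivalence more often as n grows, while a t-test should reject H₀ more often as n grows.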