note: binomial power
Rejection Trade-off
When comparing statistical tests, we often face a trade-off. A test that’s better at detecting a false null hypothesis (high power) can also reject a true null hypothesis more often (a higher Type I error rate). We see this classic trade-off in the humble one-sample t-test versus the non-parametric sign test.
We’ll use simulated data from a heavy-tailed distribution (a t-distribution with 3 degrees of freedom) to see how these tests perform.
Both tests evaluate the same null hypothesis: that the population median is zero. But they use different information. The sign test only counts how many data points fall above and below zero; it’s simple and robust. The t-test uses the sample mean and standard deviation; it’s more powerful when its distributional assumptions are met, but it can be misled by outliers or heavy tails.
Because the t-test uses the actual magnitudes of the observations rather than just their signs, it is the more sensitive of the two. This sensitivity has costs and benefits.
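To make this concrete, here is a minimal sketch (not part of the original analysis; the toy vector is made up for illustration) showing the raw ingredients each test uses on a small sample with one extreme value:
# Toy sample with one extreme value
x <- c(-0.4, 0.2, 1.1, -0.3, 5.0)
# Sign test ingredient: only the count of values below zero matters,
# so the outlier counts the same as any other positive value
sum(x < 0)
# t-test ingredients: the mean and standard deviation, both of which
# the outlier pulls around
mean(x) / (sd(x) / sqrt(length(x)))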
A Single Sample Example
First, let’s see the tests in action on a single sample. The true population mean/median is zero, so the null hypothesis is true.
set.seed(123)
n <- 15
true_null_sample <- rt(n, df = 3) # Sample from a t-distribution (mean=0)
# The Sign Test
count_below_zero <- sum(true_null_sample < 0)
binom.test(count_below_zero, n, p = 0.5)
##
## Exact binomial test
##
## data: count_below_zero and n
## number of successes = 8, number of trials = 15, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.2658613 0.7873333
## sample estimates:
## probability of success
## 0.5333333
# The t-test
t.test(true_null_sample)
##
## One Sample t-test
##
## data: true_null_sample
## t = -0.69198, df = 14, p-value = 0.5003
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -1.3668742 0.7000223
## sample estimates:
## mean of x
## -0.333426
Here both tests return non-significant p-values, which is what we want when the null hypothesis is true. But a single sample tells us little about long-run behavior, so to better understand the tests we need to simulate this process thousands of times.
When the Null Hypothesis is TRUE
We’ll repeatedly sample from a population where the null is true (median = 0) and check how often each test incorrectly rejects it (Type I error rate). We expect this to be near 5% for a well-calibrated test.
set.seed(123)
N_sim <- 10000
n <- 15
pvals_t <- numeric(N_sim)
pvals_sign <- numeric(N_sim)
for (i in 1:N_sim) {
  sample_data <- rt(n, df = 3) # True mean/median is 0
  # Conduct t-test and store p-value
  pvals_t[i] <- t.test(sample_data)$p.value
  # Conduct sign test and store p-value
  count_below_zero <- sum(sample_data < 0)
  pvals_sign[i] <- binom.test(count_below_zero, n, p = 0.5)$p.value
}
# Calculate Type I Error Rate (should be ~0.05)
type1_error_t <- mean(pvals_t < 0.05)
type1_error_sign <- mean(pvals_sign < 0.05)
print(paste("t-test Type I Error:", round(type1_error_t, 3)))
## [1] "t-test Type I Error: 0.042"
print(paste("Sign Test Type I Error:", round(type1_error_sign, 3)))
## [1] "Sign Test Type I Error: 0.034"
The t-test rejects the true null hypothesis slightly more often than the sign test (about 4.2% versus 3.4% of simulations), though both rates stay below the nominal 5% level. Even with heavy-tailed data, the t-test’s Type I error rate holds up here; the sign test is more conservative, largely because the exact binomial test at n = 15 cannot attain a level of exactly 5%.
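One way to see why the sign test rejects less often here is to list the significance levels the exact binomial test can actually attain at n = 15. This quick check is mine, not part of the original simulation:
# Two-sided rejection regions for the sign test at n = 15:
# reject when the count of negatives is <= k or >= 15 - k
k <- 0:5
data.frame(k = k, attainable_alpha = 2 * pbinom(k, 15, 0.5))
# The largest attainable level not exceeding 0.05 is about 0.035 (k = 3),
# which matches the simulated rejection rate of roughly 0.034.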
When the Null Hypothesis is FALSE
Now shift the population up by 1. The true median is now 1, so the null hypothesis (median = 0) is false. A good test should reject this frequently (high power).
set.seed(123)
N_sim <- 10000
n <- 15
pvals_t <- numeric(N_sim)
pvals_sign <- numeric(N_sim)
for (i in 1:N_sim) {
  sample_data <- rt(n, df = 3) + 1 # True mean/median is 1 (H0 is false)
  # Conduct t-test and store p-value
  pvals_t[i] <- t.test(sample_data)$p.value
  # Conduct sign test and store p-value
  count_below_zero <- sum(sample_data < 0)
  pvals_sign[i] <- binom.test(count_below_zero, n, p = 0.5)$p.value
}
# Calculate Power (1 - Type II Error)
power_t <- mean(pvals_t < 0.05)
power_sign <- mean(pvals_sign < 0.05)
print(paste("t-test Power:", round(power_t, 3)))
## [1] "t-test Power: 0.684"
print(paste("Sign Test Power:", round(power_sign, 3)))
## [1] "Sign Test Power: 0.67"
The t-test has slightly higher power: it correctly rejects the false null hypothesis 68.4% of the time, compared to 67% for the sign test. The extra information it uses (the magnitudes of the observations, not just their signs) helps it detect the real effect a bit more often.
In short, the t-test is generally the more powerful test, but its error rates depend on distributional assumptions that heavy-tailed or skewed data can stretch. The sign test is less powerful but robust: it controls the Type I error rate under very weak assumptions, at the cost of missing some real effects.
The better test depends on your priorities. If controlling false positives matters, a robust test like the sign test is safer. If detecting a true effect is the goal and you can tolerate a slightly higher false positive rate, the t-test might be better.
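If you want to probe how the trade-off shifts with sample size, effect size, or tail heaviness, the whole comparison fits in one small function. This is a sketch under the same assumptions as the simulations above; the function name and defaults are mine, not part of the original note:
compare_tests <- function(n = 15, shift = 0, df = 3, N_sim = 10000, alpha = 0.05) {
  reject_t <- logical(N_sim)
  reject_sign <- logical(N_sim)
  for (i in seq_len(N_sim)) {
    x <- rt(n, df = df) + shift
    reject_t[i] <- t.test(x)$p.value < alpha
    reject_sign[i] <- binom.test(sum(x < 0), n, p = 0.5)$p.value < alpha
  }
  c(t_test = mean(reject_t), sign_test = mean(reject_sign))
}
set.seed(123)
compare_tests(shift = 0) # rejection rates under a true null (Type I error)
compare_tests(shift = 1) # rejection rates under a false null (power)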