Statistical inference (e.g., hypothesis tests and confidence intervals) requires inference, that is, reasoning about the unobserved.
p-values require probability distributions.
Randomization (or Design) + a Hypothesis + a Test Statistic Function → probability distributions representing the hypothesis (reference distributions)
Observed Values of Test Statistics + Reference Distribution → p-value.
The fundamental problem of causal inference says that we can see only one potential outcome for any given unit.
So, if a counterfactual causal effect of the treatment, $T$, for Jake occurs when $y_{\text{Jake},T=1} \neq y_{\text{Jake},T=0}$, then how can we learn about the causal effect?
One solution is the estimation of averages of causal effects (the ATE, ITT, LATE).
This is what we call Neyman’s approach.
Another solution is to make claims or guesses about the causal effects.
We could say, "I think that the effect on Jake is 5," or "This experiment had no effect on anyone." Then we ask, "How much evidence does this experiment provide about that claim?"
This evidence is summarized in a p-value.
We call this Fisher’s approach.
The hypothesis testing approach to causal inference doesn’t provide a best guess but instead tells you how much evidence or information the research design provides about a causal claim.
The estimation approach provides a best guess but doesn’t tell you how much you know about that guess.
We nearly always report both since both help us make decisions: “Our best guess of the treatment effect was 5, and we could reject the idea that the effect was 0 (p=.01).”
A hypothesis is a statement about a relationship among potential outcomes.
A test statistic summarizes the relationship between treatment and observed outcomes.
The design allows us to link the hypothesis and the test statistic: because the hypothesis specifies relationships among potential outcomes, it lets us calculate the test statistic we would have observed under any other assignment of treatment.
The design also tells us how to generate a distribution of possible test statistics implied by the hypothesis.
A p-value describes the relationship between our observed test statistic and the distribution of possible hypothesized test statistics.
Outcome ($Y$) | Treatment ($T$) | $y_{i,0}$ | ITE | $y_{i,1}$ | Y>0 |
---|---|---|---|---|---|
0 | 0 | 0 | 10 | 10 | 0 |
30 | 1 | 0 | 30 | 30 | 0 |
0 | 0 | 0 | 200 | 200 | 0 |
1 | 0 | 1 | 90 | 91 | 0 |
11 | 1 | 1 | 10 | 11 | 0 |
23 | 1 | 3 | 20 | 23 | 0 |
34 | 1 | 4 | 30 | 34 | 0 |
45 | 1 | 5 | 40 | 45 | 0 |
190 | 0 | 190 | 90 | 280 | 1 |
200 | 0 | 200 | 20 | 220 | 1 |
For example, the sharp, or strong, null hypothesis of no effects is $H_0: y_{i,1} = y_{i,0}$ for every unit $i$.
## The mean difference test statistic
meanTT <- function(ys, z) {
mean(ys[z == 1]) - mean(ys[z == 0])
}
## The difference of mean ranks test statistic
meanrankTT <- function(ys, z) {
ranky <- rank(ys)
mean(ranky[z == 1]) - mean(ranky[z == 0])
}
observedMeanTT <- meanTT(ys = Y, z = T)
observedMeanRankTT <- meanrankTT(ys = Y, z = T)
observedMeanTT
[1] -49.6
observedMeanRankTT
[1] 1
What we observe for each person $i$ ($Y_i$) is either what we would have observed in treatment ($y_{i,1}$) or what we would have observed in control ($y_{i,0}$):
$Y_i = T_i y_{i,1} + (1 - T_i) y_{i,0}$
So, if $y_{i,1} = y_{i,0}$, then $Y_i = y_{i,0}$.
Under this hypothesis, what we actually observe for every unit is what we would have observed in the control condition, no matter which treatment the unit was assigned.
We need to know how to repeat our experiment:
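Here is a minimal sketch of one way to define that repetition, assuming complete randomization of half of the units to treatment as in the example above (the module's actual repeatExperiment() may be defined differently):

repeatExperiment <- function(N) {
  ## Re-assign treatment: complete randomization of N/2 of the N units
  sample(rep(c(0, 1), each = N / 2))
}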
Then we repeat it, each time calculating the test statistic implied by the hypothesis and the design:
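As a sketch (using the repeatExperiment() function above and the Y vector from the example data; the number of repetitions is arbitrary and the module's code may differ), the reference distributions used just below could be built by re-assigning treatment many times and recalculating each test statistic on the unchanged outcomes, which is exactly what the sharp null of no effects licenses:

set.seed(12345)
possibleMeanDiffsH0 <- replicate(10000, meanTT(ys = Y, z = repeatExperiment(N = 10)))
possibleMeanRankDiffsH0 <- replicate(10000, meanrankTT(ys = Y, z = repeatExperiment(N = 10)))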
How should we interpret these p-values? (Notice that they are one-tailed.)
pMeanTT <- mean(possibleMeanDiffsH0 >= observedMeanTT)
pMeanRankTT <- mean(possibleMeanRankDiffsH0 >= observedMeanRankTT)
pMeanTT
[1] 0.7785
pMeanRankTT
[1] 0.3198
## using the coin package
library(coin)
set.seed(12345)
pMean2 <- coin::pvalue(oneway_test(Y ~ factor(T),
data = dat,
distribution = approximate(nresample = 1000), alternative = "less"
))
dat$rankY <- rank(dat$Y)
pMeanRank2 <- coin::pvalue(oneway_test(rankY ~ factor(T),
data = dat,
distribution = approximate(nresample = 1000), alternative = "less"
))
pMean2
[1] 0.783
99 percent confidence interval:
0.7476 0.8157
pMeanRank2
[1] 0.323
99 percent confidence interval:
0.2853 0.3624
First, install a development version of the RItools package:
# dev_mode() ## dont install the package globally
renv::install("markmfredrickson/RItools@randomization-distribution",
force = TRUE
)
# dev_mode()
Then use the RItest function:
# dev_mode()
library(RItools)
thedesignA <- simpleRandomSampler(total = N, z = dat$T, b = rep(1, N))
pMean4 <- RItest(
y = dat$Y, z = dat$T, samples = 1000, test.stat = meanTT,
sampler = thedesignA
)
pMeanRank4 <- RItest(
y = dat$Y, z = dat$T, samples = 1000, test.stat = meanrankTT,
sampler = thedesignA
)
pMean4
pMeanRank4
# dev_mode() ## and turn off dev_mode
How should we interpret the two-tailed p-value here?
## using the ri2 package
library(ri2)
thedesign <- declare_ra(N = N)
dat$Z <- dat$T
pMean4 <- conduct_ri(Y ~ Z,
declaration = thedesign,
sharp_hypothesis = 0, data = dat, sims = 1000
)
summary(pMean4)
term estimate two_tailed_p_value
1 Z -49.6 0.4444
pMeanRank4 <- conduct_ri(rankY ~ Z,
declaration = thedesign,
sharp_hypothesis = 0, data = dat, sims = 1000
)
summary(pMeanRank4)
term estimate two_tailed_p_value
1 Z 1 0.6349
- Testing weak null hypotheses, $H_0: \bar{y}_1 = \bar{y}_0$.
- Rejecting null hypotheses (and making false positive and/or false negative errors).
- Maintaining correct false positive error rates when testing more than one hypothesis.
- Power of hypothesis tests (see the Module on Statistical Power and Design Diagnosands).
The weak null hypothesis is a claim about aggregates, and it is nearly always stated in terms of averages: $H_0: \bar{y}_1 = \bar{y}_0$
The test statistic for this hypothesis is nearly always the simple difference of means (i.e., meanTT() above).
lm1 <- lm(Y ~ T, data = dat)
lm1P <- summary(lm1)$coef["T", "Pr(>|t|)"]
ttestP1 <- t.test(Y ~ T, data = dat)$p.value
library(estimatr)
ttestP2 <- difference_in_means(Y ~ T, data = dat)
c(lm1P = lm1P, ttestP1 = ttestP1, ttestP2 = ttestP2$p.value)
lm1P ttestP1 ttestP2.T
0.3321 0.3587 0.3587
Both the variation and the location of Y change with treatment in this simulation.
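The code below implements the usual conservative variance estimator for the estimated average treatment effect (a difference of means). In standard notation (not defined elsewhere in this module), with $s^2_1$ and $s^2_0$ the sample variances of the observed outcomes in the treated and control groups and $n_1$ and $n_0$ the group sizes:

$$\widehat{\mathrm{Var}}(\widehat{\mathrm{ATE}}) = \frac{s^2_1}{n_1} + \frac{s^2_0}{n_0}, \qquad \widehat{\mathrm{SE}} = \sqrt{\widehat{\mathrm{Var}}(\widehat{\mathrm{ATE}})}.$$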
## By hand:
varEstATE <- function(Y, T) {
  ## Conservative estimator of the variance of the estimated ATE: s^2_1/n_1 + s^2_0/n_0
  var(Y[T == 1]) / sum(T) + var(Y[T == 0]) / sum(1 - T)
}
seEstATE <- sqrt(varEstATE(dat$Y, dat$T))
obsTStat <- observedMeanTT / seEstATE
c(
observedTestStat = observedMeanTT,
stderror = seEstATE,
tstat = obsTStat,
pval = 2 * min(
pt(obsTStat, df = 8, lower.tail = TRUE),
pt(obsTStat, df = 8, lower.tail = FALSE)
)
)
observedTestStat stderror tstat pval
-49.6000 48.0448 -1.0324 0.3321
“In typical use, the level of the test [α] is a promise about the test’s performance and the size is a fact about its performance…” (Rosenbaum 2010, Glossary)
α is the probability of rejecting the null hypothesis when the null hypothesis is true.
How should we interpret p=0.78? What about p=0.32 (our tests of the sharp null)?
What does it mean to “reject” $H_0: y_{i,1} = y_{i,0}$ at α=.05?
Notice:
- The curve is centered at the hypothesized value.
- The curve represents the world of the hypothesis.
- The p-value is how rare it would be to see the observed test statistic (or a value farther away from the hypothesized value) in the world of the null.
In the picture, the observed value of the test statistic is consistent with the hypothesized distribution, but just not super consistent.
Even if p<.05 (or p<.001), the observed test statistic is still a possible value under the hypothesized distribution. This means that you can always make an error when you reject a null hypothesis.
If we say, “The experimental result is significantly different from the hypothesized value of zero (p=.001)! We reject that hypothesis!” when the truth is zero we are making a false positive error (claiming to detect something positively when there is no signal, only noise).
If we say, “We cannot distinguish this result from zero (p=.3). We cannot reject the hypothesis of zero.” when the truth is not zero we are making a false negative error (claiming inability to detect something when there is a signal, but it is overwhelmed by noise.)
A single test of a single hypothesis should rarely produce false positive errors. For example, if we set α=.05, we are saying that we are comfortable with our testing procedure making false positive errors in no more than 5% of tests of a given treatment assignment in a given experiment.
Also, a single test of a single hypothesis should detect signal when it exists; that is, it should have high statistical power. In other words, it should rarely fail to detect a signal when one exists (i.e., it should have a low false negative error rate).
If errors are unavoidable, how can we diagnose them? How do we learn whether our hypothesis-testing procedure might generate too many false positive errors?
Diagnose by simulation!
Across repetitions of the design:
The proportion of p-values below any level α should be no larger than α if the test is operating correctly.
Example with a binary outcome. Does the test work as it should? What do the p-values look like when there is no effect?
collectPValues <- function(y, trt, thedistribution = exact()) {
## Make Y and T have no relationship by re-randomizing T
new_trt <- repeatExperiment(length(y))
thedata <- data.frame(new_trt = new_trt, y = y)
thedata$ranky <- rank(y)
thedata$new_trtF <- factor(thedata$new_trt)
## The four tests
thelm <- lm(y ~ new_trt, data = thedata)
t_test_CLT <- difference_in_means(y ~ new_trt, data = thedata)
t_test_exact <- oneway_test(y ~ new_trtF,
data = thedata,
distribution = thedistribution
)
t_test_rank_exact <- oneway_test(ranky ~ new_trtF,
data = thedata,
distribution = thedistribution
)
owP <- coin::pvalue(t_test_exact)[[1]]
owRankP <- coin::pvalue(t_test_rank_exact)[[1]]
## Return the p-values
return(c(
lmp = summary(thelm)$coef["new_trt", "Pr(>|t|)"],
neyp = t_test_CLT$p.value[[1]],
rtp = owP,
rtpRank = owRankP
))
}
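The matrix of p-values summarized below, pDist, with one row per test and one column per simulated experiment, could be generated along these lines; y01 and trt01 are placeholder names for the binary outcome and the treatment in this example, and the exact call may differ from the module's code:

## 5000 simulated experiments; each column holds the four p-values from one run
pDist <- replicate(5000, collectPValues(y = y01, trt = trt01))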
When there is no effect, a test of the null hypothesis of no effects should produce a big p-value.
If the test is working well, we should see mostly big p-values and very few small p-values.
A few of the p-values for the four different tests (we did 5000 simulations, just showing 5)
[,1] [,2] [,3] [,4] [,5]
lmp 1 1 1 1 0.1411
neyp 1 1 1 1 0.1778
rtp 1 1 1 1 0.4444
rtpRank 1 1 1 1 0.4444
In fact, if there is no effect, and if we decided to reject the null hypothesis of no effects with α=.25, we would want no more than 25% of our p-values in this simulation to be less than p=.25. What do we see here? Which tests appear to have false positive rates that are too high?
## Calculate the proportion of p-values less than .25 for each row of pDist
apply(pDist, 1, function(x) {
mean(x < .25)
})
lmp neyp rtp rtpRank
0.4536 0.4536 0.0000 0.0000
Compare tests by plotting the proportion of p-values less than any given level. The “randomization inference” tests (the tests that directly permute treatment assignment, repeating the experiment) control the false positive rate.
In this design, only the direct randomization-inference-based tests control the false positive rate; the CLT-based lm and t-test do not.
In other designs, all of the tests can do a good job of controlling the false positive rate.
A good test:

- casts doubt on the truth rarely, and
- easily distinguishes signal from noise (casts doubt on falsehoods often).
We can learn whether our testing procedure controls false positive rates given our design.
When false positive rates are not controlled, what might be going wrong? (Often this has to do with reliance on asymptotic approximations.)
Even if a given testing procedure controls the false positive rate for a single test, it may not control that rate across a group of multiple tests. See 10 Things you need to know about multiple comparisons for a guide to approaches for controlling false positive rates across multiple tests.
- A 100(1−α)% confidence interval can be defined as the range of hypotheses for which all of the p-values are greater than or equal to α. This is called inverting the hypothesis test (Rosenbaum 2010). That is, a confidence interval is a collection of hypothesis tests (see the sketch after this list).
- A point estimate based on hypothesis testing is called a Hodges-Lehmann point estimate (Rosenbaum 1993; Hodges and Lehmann 1963).
- A set of hypothesis tests can be combined into one single hypothesis test (Hansen and Bowers 2008; Caughey, Dafoe, and Seawright 2017).
- In equivalence testing, one can hypothesize that two groups are equivalent (i.e., that the treatment group is the same as the control group) rather than hypothesizing only that the difference between the two groups is zero (Hartman and Hidalgo 2018).
- Since a hypothesis is a model of relationships among potential outcomes, one can use hypothesis testing to learn about complex models, such as models of spillover and propagation of treatment effects across networks (Bowers, Fredrickson, and Panagopoulos 2013; Bowers, Fredrickson, and Aronow 2016; Bowers et al. 2018).
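As an illustration of the test-inversion idea in the first item of this list, here is a minimal sketch (not the module's code) that inverts tests of constant, additive effect hypotheses $H_0: y_{i,1} = y_{i,0} + \tau$ using the meanTT() statistic defined earlier; ci_by_inversion() and the grid of τ values are placeholders, and we assume complete randomization so that permuting the observed assignment repeats the experiment:

ci_by_inversion <- function(Y, T, taus, alpha = .05, nsims = 1000) {
  pvals <- sapply(taus, function(tau) {
    y0_hyp <- Y - T * tau ## control outcomes implied by H0: effect = tau for all units
    obs <- meanTT(ys = y0_hyp, z = T)
    ref <- replicate(nsims, meanTT(ys = y0_hyp, z = sample(T)))
    mean(abs(ref) >= abs(obs)) ## two-sided p-value for this hypothesized tau
  })
  range(taus[pvals >= alpha]) ## the hypotheses we cannot reject at level alpha
}
## For example: ci_by_inversion(dat$Y, dat$T, taus = seq(-300, 300, by = 5))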
If an intervention was very effective at increasing the variability of an outcome but did not change the mean, would the p-value reported by R or Stata (if we used lm_robust() or difference_in_means() or reg or t.test) be large or small?
If an intervention caused the mean in the control group to be moderately reduced but increased a few outcomes a lot (like a 10-times effect), would the p-value from R's lm_robust() or difference_in_means() be large or small?
Does the effect of an experimental treatment differ between different groups? Could differences in treatment effect arise because of some background characteristics of experimental subjects?
Which, among several, strategies for communication were most effective on a single outcome?
Which, among several outcomes, were influenced by a single experimental intervention?
Say our probability of making a false positive error is .05 in a single test. What happens if we ask: (1) which of these 10 outcomes has a statistically significant relationship with the two arms of treatment? or (2) which of these 10 treatment arms had a statistically significant relationship with the single outcome?
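For intuition: if the 10 tests were independent and each had a false positive rate of .05, the chance of at least one false positive across the whole family would be:

1 - (1 - 0.05)^10
[1] 0.4012631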
Number of errors committed when testing m null hypotheses (Benjamini and Hochberg's (1995) Table 1). Cells are numbers of tests: R is the number of “discoveries” (rejections), V is the number of false discoveries, U is the number of correct non-rejections, S is the number of correct rejections, and T is the number of missed discoveries (false non-rejections).
| | Declared Non-Significant | Declared Significant | Total |
|---|---|---|---|
| True null hypotheses ($H_{\text{true}}=0$) | $U$ | $V$ | $m_0$ |
| Non-true null hypotheses ($H_{\text{true}} \neq 0$) | $T$ | $S$ | $m - m_0$ |
| Total | $m - R$ | $R$ | $m$ |
Family wise error rate (FWER) is P(V>0) (Probability of any false positive error).
We’d like to control this if we plan to make a decision about the results of our multiple tests. The research project is mostly confirmatory.
See, for example, the projects of the OES https://oes.gsa.gov: federal agencies will make decisions about programs depending on whether they detect results or not.
False Discovery Rate (FDR) is E(V/R|R>0) (Average proportion of false positive errors given some rejections).
We’d like to control this if we are using this experiment to plan the next experiment. We are willing to accept a higher probability of error in the interests of giving us more possibilities for discovery.
For example, one could imagine an organization (a government, an NGO) deciding to conduct a series of experiments as part of a learning agenda: no single experiment determines decision making, and there is more room for exploration.
We will focus on the FWER here, but recommend thinking about the FDR and learning agendas as a very useful way to go. (A sketch of estimating the FWER by simulation follows.)
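Here is a minimal sketch (placeholder names, not the module's code) of estimating the FWER by simulation, given a matrix of p-values computed under a design with no true effects, with one row per hypothesis and one column per simulated experiment:

## FWER estimate: proportion of simulated experiments with at least one rejection
fwer <- function(pmat, alpha = .05) {
  mean(apply(pmat < alpha, 2, any))
}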
What is the effect of one treatment on multiple outcomes?
On which outcomes (out of many) did the treatment have an effect?
The second question, in particular, can lead to the kind of uncontrolled family wise error rate problems that we referred to above.
Imagine we had five outcomes and one treatment (showing potential and observed outcomes here):
ID T Y1_T_0 Y1_T_1 Y2_T_0 Y2_T_1 Y3_T_0 Y3_T_1 Y4_T_0 Y4_T_1 Y5_T_0 Y5_T_1
1 001 0 0.19 0.19 0.366 0.366 0.546 0.546 -0.626 -0.626 -0.125 -0.125
2 002 0 -0.43 -0.43 0.931 0.931 -2.233 -2.233 1.309 1.309 1.078 1.078
3 003 0 0.91 0.91 -1.907 -1.907 0.288 0.288 -0.133 -0.133 -1.261 -1.261
4 004 0 1.79 1.79 0.052 0.052 0.544 0.544 -1.608 -1.608 -0.452 -0.452
5 005 1 1.00 1.00 -0.848 -0.848 -1.192 -1.192 -1.308 -1.308 -1.027 -1.027
6 006 0 1.11 1.11 -0.368 -0.368 -0.018 -0.018 -0.045 -0.045 0.068 0.068
ID T Y1 Y2 Y3 Y4 Y5
1 001 0 0.19 0.366 0.546 -0.626 -0.125
2 002 0 -0.43 0.931 -2.233 1.309 1.078
3 003 0 0.91 -1.907 0.288 -0.133 -1.261
4 004 0 1.79 0.052 0.544 -1.608 -0.452
5 005 1 1.00 -0.848 -1.192 -1.308 -1.027
6 006 0 1.11 -0.368 -0.018 -0.045 0.068
Can we detect an effect on outcome Y1? (i.e., does the hypothesis test produce a small enough p-value?)
[1] 0.88
## Notice that the t-test p-value is also a chi-squared test
## p-value.
coin::pvalue(independence_test(Y1 ~ factor(T),
data = dat1,
teststat = "quadratic"
))
[1] 0.88
On which of the five outcomes can we detect an effect? (i.e., does any of the five hypothesis tests produce a small enough p-value?)
p1 <- coin::pvalue(oneway_test(Y1 ~ factor(T), data = dat1))
p2 <- coin::pvalue(oneway_test(Y2 ~ factor(T), data = dat1))
p3 <- coin::pvalue(oneway_test(Y3 ~ factor(T), data = dat1))
p4 <- coin::pvalue(oneway_test(Y4 ~ factor(T), data = dat1))
p5 <- coin::pvalue(oneway_test(Y5 ~ factor(T), data = dat1))
theps <- c(p1 = p1, p2 = p2, p3 = p3, p4 = p4, p5 = p5)
sort(theps)
p5 p4 p3 p2 p1
0.27 0.30 0.43 0.59 0.88
Can we detect an effect for any of the five outcomes? (i.e., does the hypothesis test for all five outcomes at once produce a small enough p-value?)
coin::pvalue(independence_test(Y1 + Y2 + Y3 + Y4 + Y5 ~ factor(T),
data = dat1, teststat = "quadratic"
))
[1] 0.67
Which approach is likely to mislead us with too many “statistically significant” results (5 tests or 1 omnibus test)?
Let’s do a simulation to learn about these testing approaches.
We will (1) set the true causal effects to be 0, (2) repeatedly re-assign treatment, and (3) each time, carry out each of the three testing approaches above (plus a Holm-adjusted version of the five separate tests).
Since the true effect is 0, we expect most of the p-values to be large. (In fact, we'd like no more than 5% of the p-values to be less than .05 if we are using the α=.05 accept-reject criterion.)
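Here is a sketch of what one run of such a simulation could look like, assuming the dat1 data from above and complete re-randomization of the same treatment vector; onerun() and the other object names are placeholders, not the module's code:

onerun <- function() {
  thedat <- dat1
  thedat$newT <- factor(sample(thedat$T)) ## re-assign treatment: true effects are 0
  ## Five separate tests, one per outcome
  ps <- sapply(paste0("Y", 1:5), function(v) {
    coin::pvalue(oneway_test(reformulate("newT", response = v), data = thedat))
  })
  ## Omnibus test of all five outcomes at once
  omnibusP <- coin::pvalue(independence_test(Y1 + Y2 + Y3 + Y4 + Y5 ~ newT,
    data = thedat, teststat = "quadratic"
  ))
  c(
    minUnadjusted = min(ps), ## "best" of the 5 raw p-values
    minHolm = min(p.adjust(ps, method = "holm")), ## best after Holm adjustment
    omnibus = omnibusP,
    y1Only = ps[["Y1"]]
  )
}
simP <- replicate(1000, onerun())
apply(simP < .05, 1, mean) ## proportion of runs rejecting at alpha = .05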
estimator | fwer |
---|---|
t-test all | 0.22 |
t-test all holm adj | 0.04 |
t-test omnibus | 0.04 |
t-test Y1 | 0.05 |
The approach using 5 tests produces a p<.05 much too often — recall that there are no causal effects at all for any of these outcomes.
A test of a single outcome (here Y1) has p<.05 in no more than 5% of the simulations.
The omnibus test also shows a well-controlled error rate.
Using a multiple testing correction (here we use the “Holm” correction) also correctly controls the false positive rate.
FYI, here is how to use the Holm correction (Notice what happens to the p-values):
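The calls that produced the output below are presumably along these lines (the exact code is not shown here): print the unadjusted p-values in theps, then pass them to p.adjust() with method = "holm":

theps
p.adjust(theps, method = "holm")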
p1 p2 p3 p4 p5
0.88 0.59 0.43 0.30 0.27
p1 p2 p3 p4 p5
1 1 1 1 1
## To show what happens with "significant" p-values
theps_new <- sort(c(theps, newlowp = .01))
p.adjust(theps_new, method = "holm")
newlowp p5 p4 p3 p2 p1
0.06 1.00 1.00 1.00 1.00 1.00
The same kind of problem can happen when the question is about the differential effects of a multi-armed treatment.
With 5 arms, “the effect of arm 1” could mean many different things: “Is the average potential outcome under arm 1 bigger than under arm 2?”, “Are the potential outcomes under arm 1 bigger than the average of the potential outcomes under all of the other arms?”, and so on.
If we just focus on pairwise comparisons across arms, we could have $((5 \times 5) - 5)/2 = 10$ unique tests!
Here are some potential and observed outcomes with a treatment, T, that takes multiple values.
ID T Y_T_2 Y_T_3 Y_T_4 Y_T_5 Y
1 001 3 0.366 0.546 -0.626 -0.125 0.546
2 002 3 0.931 -2.233 1.309 1.078 -2.233
3 003 4 -1.907 0.288 -0.133 -1.261 -0.133
4 004 5 0.052 0.544 -1.608 -0.452 -0.452
5 005 2 -0.848 -1.192 -1.308 -1.027 -0.848
6 006 3 -0.368 -0.018 -0.045 0.068 -0.018
Here are the 10 pairwise tests with and without adjustment for multiple testing (a sketch of one way to compute such comparisons follows the table). Notice how one “significant” result (p=.01) changes with adjustment.
Comparison Stat p.value p.adjust
1 1 - 2 = 0 1.435 0.231 1.0000
2 1 - 3 = 0 0.8931 0.3447 1.0000
3 1 - 4 = 0 6.404 0.01139 0.1139
4 1 - 5 = 0 0.8216 0.3647 1.0000
5 2 - 3 = 0 0.05882 0.8084 1.0000
6 2 - 4 = 0 2.641 0.1041 0.7287
7 2 - 5 = 0 0.0437 0.8344 1.0000
8 3 - 4 = 0 3.232 0.07222 0.6500
9 3 - 5 = 0 0.0003464 0.9852 1.0000
10 4 - 5 = 0 2.899 0.08861 0.7089
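A sketch of one way such pairwise comparisons could be computed (not the module's exact code, which also reports a test statistic; datM is a placeholder name for the multi-armed data shown above, with arm indicator T and observed outcome Y):

arms <- sort(unique(datM$T))
pairsOfArms <- combn(arms, 2) ## all 10 unique pairs of arms
rawPs <- apply(pairsOfArms, 2, function(pr) {
  sub <- subset(datM, T %in% pr) ## keep only the two arms being compared
  coin::pvalue(oneway_test(Y ~ factor(T), data = sub))
})
data.frame(
  comparison = apply(pairsOfArms, 2, paste, collapse = " vs "),
  p.value = rawPs,
  p.adjust = p.adjust(rawPs, method = "holm") ## Holm adjustment across the 10 tests
)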
We illustrate four different approaches:
estimator | fwer |
---|---|
Choose best pairwise test | 0.238 |
Choose best pairwise test after adjustment | 0.028 |
Overall test | 0.034 |
t-test T1 vs all | 0.018 |
Multiple testing problems can arise from multiple outcomes or multiple treatments (or multiple moderators/interaction terms).
Procedures for hypothesis tests and confidence intervals can produce errors. Ordinary practice controls the error rate of a single test (or a single confidence interval), but multiple tests require extra work to ensure that error rates are controlled.
The loss of power arising from adjustment approaches encourages us to consider what questions we want to ask of the data. For example, if we want to know if the treatment had any effect, then a joint test or omnibus test of multiple outcomes will increase our statistical power without requiring adjustment.