Stepdown p-values for Multiple Hypothesis Testing in Julia

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/17/stepdown-p-values-for-multiple-hypothesis-testing-in-julia/

* The script to reproduce the results of this tutorial in Julia is located here.

To finish yesterday’s tutorial on hypothesis testing with non-parametric p-values in Julia, I show how to perform the bootstrap stepdown correction to p-values for multiple testing, which many economists find intimidating (including me) but is actually not so difficult.

First, I will show the simple case of Holm’s (1979) stepdown procedure. Then, I will build on the bootstrap example to apply one of the most exciting advances in econometric theory from the past decade, the bootstrap step-down procedure, developed by Romano and Wolf (2004) and extended by Chicago professor Azeem Shaikh. These methods allow one to bound the family-wise error rate, which is the probability of making at least one false discovery (i.e., rejecting a null hypothesis when the null hypothesis was true).


Motivation

The standard example to motivate this discussion is to suppose you compute p-values for, say, 10 parameters with independent estimators. If all 10 have p-value equal to 0.05 corresponding to the null hypothesis of each, then we reject the null hypothesis for any one of them with 95% confidence. However, the probability that all of them reject the null hypothesis is 1-\left(1-0.05\right)^{10} \approx 0.4 (by independence, the joint probability is the product of marginal probabilities), so we only have 60% confidence in rejecting all of them jointly. If we wish to test for all of the hypotheses jointly, we need some way to adjust the p-values to give us greater confidence in jointly rejecting them. Holm’s solution to this problem is to iteratively inflate the individual p-values, so that the smallest p-value is inflated the most, and the largest p-value is not inflated at all (unless one of the smaller p-values is inflated above the largest p-value, then the largest p-value will be inflated to preserve the order of p-values). The specifics of the procedure are below.

The Romano-Wolf-Shaikh approach is to account for the dependence among the estimators, and penalize estimators more as they become more independent. Here’s how I motivate their approach: First, consider the extreme case of two estimators that are identical. Then, if we reject the null hypothesis for the first, there is no reason to penalize the second; it would be very strange to reject the null hypothesis for the first but not the second when they are identical. Second, suppose they are almost perfectly dependent. Then, rejecting the null for one strongly suggests that we should also reject the null for the second, so we do not want to penalize one very strongly for the rejection of the other. But as they become increasingly independent, the rejection of one provides less and less affirmative information about the other, and we approach the case of perfect independence shown above, which receives the greatest penalty. The specifics of the procedure are below.


Holm’s Correction to the p-values

Suppose you have a vector of p-values, p, of length K, and a desired maximum family-wise error rate, \alpha (the most common choice is \alpha=0.05. These are all of the needed ingredients. Holm’s stepdown procedure is as follows:

  1. If the smallest element of p is greater than \frac{\alpha}{(K+1)}, reject no null hypotheses and exit (do not continue to step 2). Else, reject the null hypothesis corresponding to the smallest element of p (continue to step 2).
  2. If the second-smallest element of p is greater than \frac{\alpha}{(K+1)-1}, reject no remaining null hypotheses and exit (do not continue to step 3). Else, reject the null hypothesis corresponding to the second-smallest element of p (continue to step 3).
  3. If the third-smallest element of p is greater than \frac{\alpha}{(K+1)-2}, reject no remaining null hypotheses and exit (do not continue to step 4). Else, reject the null hypothesis corresponding to the third-smallest element of p (continue to step 4).
  4. And so on.

We could program this directly as a loop-based function that takes p,\alpha as parameters, but a simpler approach is to compute the p-values equivalent to this procedure for any \alpha, which are,

\tilde{p}_{(k)} =\min\left\{\max_{j:p_{(j)}\leq p_{(k)}} \left\{ (K+1-j)p_{(j)}\right\},1\right\},

where (k) means the k^{\mathit{th}} smallest element. This expression is equivalent to the algorithm above in the sense that, for any family-wise error rate \alpha\in (0,1), \tilde{p}_{(k)}\leq\alpha if and only if the corresponding hypothesis is rejected by the algorithm above.

The following code computes these p-values. Julia (apparently) lacks a command that would tell me the index of the rank of the p-values, so my loop below does this, including the handling of ties (when some of the p-values are the same):

function Holm(p)
    K = length(p)
    sort_index = -ones(K)
    sorted_p = sort(p)
    sorted_p_adj = sorted_p.*[K+1-i for i=1:K]
    for j=1:K
        num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
        sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
    end
    sorted_holm_p = [minimum([maximum(sorted_p_adj[1:k]),1]) for k=1:K]
    holm_p = [sorted_holm_p[sort_index[k]] for k=1:K]
    return holm_p
end

This is straight-forward except for sort_index, which I constructed such that, e.g., if the first element of sort_index is 3, then the first element of pvalues is the third smallest. Unfortunately, it arbitrarily breaks ties in favor of the parameter that appears first in the MLE array, so the second entry of sort_index is 1 and the third entry is 2, even though the two corresponding p-values are equal.

Please email me or leave a comment if you know of a better way to handle ties.


Bootstrap Stepdown p-values

Given our work yesterday, it is relatively easy to replace bootstrap marginal p-values with bootstrap stepdown p-values. We begin with samples, as created yesterday. The following code creates tNullDistribution, which is the same as nullDistribution from yesterday except as t-statistics (i.e., divided by standard error).

function stepdown(MLE,bootSamples)
    K = length(MLE)
    tMLE = MLE
    bootstrapSE = std(bootSamples,1)
    tNullDistribution = bootSamples
    p = -ones(K)
    for i=1:K
        tMLE[i] = abs(tMLE[i]/bootstrapSE[i])
        tNullDistribution[:,i] = abs((tNullDistribution[:,i]-mean(tNullDistribution[:,i]))/bootstrapSE[i])
        p[i] = mean(tMLE[i].<tNullDistribution[:,i])
    end
    sort_index = -ones(K)
    stepdown_p = -ones(K)
    sorted_p = sort(p)
    for j=1:K
        num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
        sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
    end
    for k=1:K
    	current_index = [sort_index.>=k]
        stepdown_p[sort_index[k]] = mean(maximum(tNullDistribution[:,sort_index.>=k],2).>tMLE[sort_index[k]])
    end
    return ["single_pvalues"=>p,"stepdown_pvalues"=>stepdown_p,"Holm_pvalues"=>Holm(p)]
end

The only difference between the single p-values and the stepdown p-values is the use of the maximum t-statistic in the comparison to the null distribution, and the maximum is taken over only the parameter estimates whose p-values have not yet been computed. Notice that I used a dictionary in the return so that I could output single, stepdown, and Holm p-values from the stepdown function.


Results

To test out the above corrected p-values, we return to MLE and samples from yesterday. Suppose we want to perform the two-sided tests simultaneously for the null hypotheses \beta_0=0,\beta_1=0,\beta_2=0,\beta_3=0. The results from the above code are,

julia> stepdown(MLE[1:4],samples[:,1:4])
Dict{ASCIIString,Any} with 3 entries:
  "Holm_pvalues"     => {0.766,0.766,0.18,0.036}
  "stepdown_pvalues" => [0.486,0.649,0.169,0.034]
  "single_pvalues"   => [0.486,0.383,0.06,0.009]

We see that the p-value corresponding to the null hypothesis \beta_2=0 is inflated by both the bootstrap stepdown and Holm procedures, and is no longer significant at the 0.10 level.


Bradley J. Setzler