Investigating p-values in hypothesis tests

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/01/20/pvalue.html

Introduction

Today I thought of writing a post covering some topic from statistics.
I decided to discuss some properties of p-value of hypothesis tests.

The post is written under Julia 1.8.5, Distributions.jl 0.25.80,
HypothesisTests.jl 0.10.11, UnicodePlots.jl 3.3.4.

Some theory

Let us assume that we have some hypothesis, call it H0,
that we assume to be true. Now consider that we have some statistic T,
whose distribution we can compute assuming that H0 is true.
For a given sample of data we computed the value of the statistic and denote
this value as t.

We are ready to define p-value, which is the probability of obtaining the
value of the T statistic at least as extreme as the result t actually
observed, under the assumption that H0 is correct.

Let me give a simple example. Assume that we have some data, for which we
assume that it comes from normal distribution with an unknown mean and known
variance equal to v=1. We want to test H0 hypothesis that the mean of the
distribution is equal to 0, against assumption that the mean is not equal
to 0. We check it by taking a n=9 element sample of the data and computing
its mean. Assume that this mean is m=1. In this case the value of the
statistic t is computed using the formula t=m/sqrt(v/n)=1/sqrt(1/9)=3. From
mathematical statistics we know that the distribution of the statistic T in
this case is normal with mean 0 and variance 1. Therefore we can compute that
the probability of seeing the value at least as extreme (in absolute value) as
the observed is 2*(1-F(|t|)) = 2*(1-F(3)) ≈ 0.0027. This number is our
p-value. In this case, since this probability is low we would most likely
conclude that the data does not support our hypothesis that the mean is equal
to 0.

Note that in the formula F is cumulative distribution function of standard
normal distribution, and we take absolute value of t and multiply the weight
of the tail of the distribution by two, as we assume that we equally treat
positive and negative deviations from 0 as extreme (this is called a two-sided
p-value).

We could have obtained this result conveniently in Julia using the
OneSampleZTest function from the HypothesisTests.jl package as follows:

julia> using HypothesisTests

julia> m = 1.0
1.0

julia> v = 1.0
1.0

julia> n = 9
9

julia> OneSampleZTest(m, sqrt(v), n)
One sample z-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          1.0
    95% confidence interval: (0.3467, 1.653)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0027

Details:
    number of observations:   9
    z-statistic:              3.0
    population standard error: 0.3333333333333333

Now, a crucial observation about p-value, given the definition we have
presented, is that assuming H0 is true the distribution of p-value should
be uniform on [0, 1] interval.

For example, in the case of our normally distributed variable we claim that
PV = 2*(1-F(|T|)) has uniform distribution on [0, 1] interval.

Let us make a quick computation to verify it is indeed the case. Assume
x is from [0, 1] interval:

Pr(PV ≤ x) = Pr(2*(1-F(|T|)) ≤ x) = Pr(F⁻¹(1-x/2) ≤ |T|)

Now we note that this can be rewritten as:

Pr(F⁻¹(1-x/2) ≤ T) + Pr(T ≤ -F⁻¹(1-x/2))

Note that these two probabilities are equal as normal distribution is symmetric
around 0 so we can simplify it to:

2 * (1 - Pr(T ≤ F⁻¹(1-x/2))) = 2 * (1 - F(F⁻¹(1-x/2))) = 2 * x/2 = x

So in summary we get Pr(PV ≤ x) = x for x from [0, 1] interval, so
in this case indeed p-value has uniform distribution.

Checking the theory using simulation

Above we have claimed that the distribution of p-value for OneSampleZTest
is uniform on [0, 1] interval. Let us check it using simulation. For this
we will generate 9-element sample from standard normal distribution multiple
times, and check if the distribution of p-value.

julia> using Random

julia> using Statistics

julia> using UnicodePlots

julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)), 1.0, 9))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.1327790631059239
 0.37168172133050215
 0.9877013474897027
 0.24068935009108738
 0.24685639345198623
 0.19113260388855702
 0.1806265110576506
 0.3482022041517382
 ⋮
 0.33776890104210877
 0.141776876179674
 0.38354782746729116
 0.9866784609861702
 0.9219832724792354
 0.9768708734987599
 0.44377261216567804

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤███████████████████████████████▋ 49 921
   [0.05, 0.1 ) ┤███████████████████████████████▊ 50 228
   [0.1 , 0.15) ┤███████████████████████████████▋ 49 908
   [0.15, 0.2 ) ┤████████████████████████████████  50 451
   [0.2 , 0.25) ┤███████████████████████████████▌ 49 655
   [0.25, 0.3 ) ┤███████████████████████████████▋ 49 907
   [0.3 , 0.35) ┤███████████████████████████████▋ 49 867
   [0.35, 0.4 ) ┤███████████████████████████████▊ 50 303
   [0.4 , 0.45) ┤███████████████████████████████▋ 49 827
   [0.45, 0.5 ) ┤███████████████████████████████▋ 49 980
   [0.5 , 0.55) ┤███████████████████████████████▋ 49 855
   [0.55, 0.6 ) ┤███████████████████████████████▌ 49 569
   [0.6 , 0.65) ┤███████████████████████████████▋ 49 798
   [0.65, 0.7 ) ┤███████████████████████████████▊ 50 143
   [0.7 , 0.75) ┤███████████████████████████████▉ 50 406
   [0.75, 0.8 ) ┤███████████████████████████████▋ 49 923
   [0.8 , 0.85) ┤███████████████████████████████▋ 50 048
   [0.85, 0.9 ) ┤███████████████████████████████▋ 49 835
   [0.9 , 0.95) ┤███████████████████████████████▊ 50 249
   [0.95, 1.0 ) ┤███████████████████████████████▊ 50 127
                └                                        ┘
                                 Frequency

Let us additionally check what happens if our data does not meet the assumptions
of our H0:

julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)) .+ 1, 1.0, 9))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.0002078548349486422
 7.323288709478623e-5
 0.4909823219717196
 0.12532336793818022
 0.05461673383130956
 0.001132766345973069
 0.02138767010993988
 0.4385291137658384
 ⋮
 0.06136424417006158
 0.003270513361517393
 0.0007748069139344607
 0.06833756938048707
 8.815311836419556e-5
 4.057178851779267e-5
 0.000801919865655273

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤███████████████████████████████  850 737
   [0.05, 0.1 ) ┤██▎ 61 462
   [0.1 , 0.15) ┤█▏ 28 540
   [0.15, 0.2 ) ┤▋ 16 534
   [0.2 , 0.25) ┤▍ 10 542
   [0.25, 0.3 ) ┤▍ 7 367
   [0.3 , 0.35) ┤▎ 5 366
   [0.35, 0.4 ) ┤▎ 4 022
   [0.4 , 0.45) ┤▎ 3 058
   [0.45, 0.5 ) ┤▎ 2 329
   [0.5 , 0.55) ┤▏ 1 908
   [0.55, 0.6 ) ┤▏ 1 551
   [0.6 , 0.65) ┤▏ 1 319
   [0.65, 0.7 ) ┤▏ 1 091
   [0.7 , 0.75) ┤▏ 940
   [0.75, 0.8 ) ┤▏ 753
   [0.8 , 0.85) ┤▏ 715
   [0.85, 0.9 ) ┤▏ 619
   [0.9 , 0.95) ┤▏ 601
   [0.95, 1.0 ) ┤▏ 546
                └                                        ┘
                                 Frequency

Indeed, as expected, we see that in this case we are getting low p-value most
of the time.

The fact that p-value has uniform distribution under H0 is a crucial
property. Unfortunately, not all hypothesis tests can ensure this.

Distribution p-value for discrete tests

A typical issue with p-value distribution is encountered when we work with
discrete distributions. Let me give you an example.

Now assume we sample data from a binomial distribution with sample size
n=16 and success probability p=0.5. Let us repeat the simulation that we
performed above using this distribution:

julia> using Distributions

julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.5)), 16, 0.5))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.45449829101562506
 0.8036193847656249
 0.8036193847656249
 ⋮
 0.8036193847656249
 0.21011352539062514
 0.8036193847656249
 0.21011352539062514
 0.21011352539062514
 0.45449829101562506
 0.8036193847656249

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤█▊ 21 491
   [0.05, 0.1 ) ┤████▉ 55 699
   [0.1 , 0.15) ┤  0
   [0.15, 0.2 ) ┤  0
   [0.2 , 0.25) ┤███████████▊ 133 700
   [0.25, 0.3 ) ┤  0
   [0.3 , 0.35) ┤  0
   [0.35, 0.4 ) ┤  0
   [0.4 , 0.45) ┤  0
   [0.45, 0.5 ) ┤█████████████████████▋ 244 309
   [0.5 , 0.55) ┤  0
   [0.55, 0.6 ) ┤  0
   [0.6 , 0.65) ┤  0
   [0.65, 0.7 ) ┤  0
   [0.7 , 0.75) ┤  0
   [0.75, 0.8 ) ┤  0
   [0.8 , 0.85) ┤███████████████████████████████  348 485
   [0.85, 0.9 ) ┤  0
   [0.9 , 0.95) ┤  0
   [0.95, 1.0 ) ┤  0
   [1.0 , 1.05) ┤█████████████████▌ 196 316
                └                                        ┘
                                 Frequency

Now we can see that clearly the distribution is not uniform. For example, under
the standard 0.05 cut-off threshold for p-value we would expect to have
50,000 observations in [0.0, 0.05) bin, but we got only 21,491. Is this a
problem? Well, it is. Assume that we generate the data with p=0.55, but assume
the H0 probability is 0.5. We get:

julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.55)), 16, 0.5))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.21011352539062514
 0.07681274414062504
 0.8036193847656249
 0.8036193847656249
 1.0
 0.8036193847656249
 0.8036193847656249
 0.45449829101562506
 ⋮
 0.8036193847656249
 0.45449829101562506
 0.8036193847656249
 0.8036193847656249
 0.21011352539062514
 1.0
 1.0

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤██▉ 31 600
   [0.05, 0.1 ) ┤██████▌ 68 901
   [0.1 , 0.15) ┤  0
   [0.15, 0.2 ) ┤  0
   [0.2 , 0.25) ┤█████████████▊ 145 964
   [0.25, 0.3 ) ┤  0
   [0.3 , 0.35) ┤  0
   [0.35, 0.4 ) ┤  0
   [0.4 , 0.45) ┤  0
   [0.45, 0.5 ) ┤███████████████████████▏ 243 914
   [0.5 , 0.55) ┤  0
   [0.55, 0.6 ) ┤  0
   [0.6 , 0.65) ┤  0
   [0.65, 0.7 ) ┤  0
   [0.7 , 0.75) ┤  0
   [0.75, 0.8 ) ┤  0
   [0.8 , 0.85) ┤███████████████████████████████  328 306
   [0.85, 0.9 ) ┤  0
   [0.9 , 0.95) ┤  0
   [0.95, 1.0 ) ┤  0
   [1.0 , 1.05) ┤█████████████████▎ 181 315
                └                                        ┘
                                 Frequency

So we can see that even if I generate the data from a distribution different
than assumed under H0 you would reject this hypothesis with 0.05 threshold
less frequently than even what you would expect under H0.

Conclusions

The bias in distribution of p-value can have significant consequences when
making statistical inference in practice. For example, in a paper I co-authored
some time ago, we show the property of p-value for various statistical tests
used in VaR backtesting (a very common test in financial applications). If you
would like to learn the details you can find it here.