Some lesser known properties of the coefficient of determination

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/01/27/r2.html

Introduction

Last week I posted about the p-value in hypothesis testing.
Today I decided to continue the discussion of basic statistics.

My focus will be on the computation of the Pearson correlation coefficient
and the coefficient of determination. In particular, I want to concentrate on
the fact that the standard estimators of these quantities are biased.

The post is written under Julia 1.8.5, Distributions.jl 0.25.80,
HypergeometricFunctions.jl 0.3.11, HypothesisTests.jl 0.10.11, and Plots.jl 1.38.2.

Testing for bias

In the post we will assume that we have n-element samples x and y of
two jointly normally distributed random variables.

The standard formula for the Pearson correlation coefficient between x and y
is (using Julia as pseudocode):

sum(((xi, yi),) -> (xi-mean(x))*(yi-mean(y)), zip(x, y)) /
sqrt(sum(xi -> (xi-mean(x))^2, x) * sum(yi -> (yi-mean(y))^2, y))
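
As a quick sanity check, this expression should match cor from the Statistics
standard library; a minimal sketch (the data here is arbitrary):

using Statistics

x, y = randn(20), randn(20)

r_manual = sum(((xi, yi),) -> (xi - mean(x)) * (yi - mean(y)), zip(x, y)) /
           sqrt(sum(xi -> (xi - mean(x))^2, x) * sum(yi -> (yi - mean(y))^2, y))

r_manual ≈ cor(x, y)  # true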

Now assume that we want to build a simple linear regression model
in which we explain y by x. In this case the coefficient of determination of
this regression is known to be the square of the Pearson correlation coefficient
between y and x.
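
This identity is easy to verify numerically; a minimal sketch, with the
OLS fit computed by hand on made-up data:

using Statistics

x = randn(100)
y = 2 .+ 3 .* x .+ randn(100)

# ordinary least squares fit of y on x
b = cov(x, y) / var(x)
a = mean(y) - b * mean(x)
ŷ = a .+ b .* x

# coefficient of determination: 1 - SS_residual / SS_total
R² = 1 - sum((y .- ŷ) .^ 2) / sum((y .- mean(y)) .^ 2)

R² ≈ cor(x, y)^2  # true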

An interesting feature is that both the Pearson correlation coefficient and the
coefficient of determination defined above are biased estimators. Let us check
this with a simple experiment. We will generate data for n=10 with the true
Pearson correlation between x and y set to 0.5 (note that the true coefficient
of determination is then equal to 0.25).

julia> using Distributions

julia> using Statistics

julia> using Random

julia> function sim(n, ρ)
           # draw an n-element sample from a bivariate standard normal
           # with correlation ρ and return the sample correlation
           dist = MvNormal([1.0 ρ; ρ 1.0])
           x, y = eachrow(rand(dist, n))
           return cor(x, y)
       end
sim (generic function with 1 method)

julia> Random.seed!(1);

julia> cor_sim = [sim(10, 0.5) for _ in 1:10^6];

julia> r2_sim = cor_sim .^ 2;

Now let us check that the obtained estimators are indeed biased:

julia> using HypothesisTests

julia> OneSampleTTest(cor_sim, 0.5)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.5
    point estimate:          0.478612
    95% confidence interval: (0.4781, 0.4791)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   1000000
    t-statistic:              -80.08122936481526
    degrees of freedom:       999999
    empirical standard error: 0.0002670739739448121


julia> OneSampleTTest(r2_sim, 0.25)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.25
    point estimate:          0.300398
    95% confidence interval: (0.3, 0.3008)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   1000000
    t-statistic:              231.0430912829669
    degrees of freedom:       999999
    empirical standard error: 0.00021813356867576183

We see a noticeable bias for both coefficients. An interesting feature
you might notice is that the Pearson correlation coefficient is biased
downward, while the coefficient of determination is biased upward.
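
The direction of the bias can also be read directly off the simulated means
(they equal the point estimates reported above):

mean(cor_sim)  # ≈ 0.4786, below the true value 0.5
mean(r2_sim)   # ≈ 0.3004, above the true value 0.25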

Debiasing estimators

Unbiased estimators for our case were derived over 60 years ago by
Olkin and Pratt. If we denote by r the computed Pearson correlation
coefficient, then the unbiased estimates are given by the following
functions (note that they rely on the sample size n, which we set first):

julia> using HypergeometricFunctions

julia> n = 10; # sample size; the functions below reference this global

julia> cor_unbiased(r) = r * _₂F₁(0.5, 0.5, (n-2)/2, 1-r^2)
cor_unbiased (generic function with 1 method)

julia> r2_unbiased(r) = 1 - (1 - r^2) * _₂F₁(1, 1, n/2, 1-r^2) * (n-3)/(n-2)
r2_unbiased (generic function with 1 method)
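
Before applying them, a quick endpoint sanity check: at r = ±1 the last
argument of ₂F₁ is 0, where the hypergeometric function equals 1, so a
perfect correlation is left unchanged:

cor_unbiased(1.0), cor_unbiased(-1.0)  # (1.0, -1.0)
r2_unbiased(1.0)                       # 1.0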

Let us check them on our data:

julia> OneSampleTTest(cor_unbiased.(cor_sim), 0.5)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.5
    point estimate:          0.499949
    95% confidence interval: (0.4994, 0.5005)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8529

Details:
    number of observations:   1000000
    t-statistic:              -0.18540252488698106
    degrees of freedom:       999999
    empirical standard error: 0.0002740923081266803


julia> OneSampleTTest(r2_unbiased.(cor_sim), 0.25)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.25
    point estimate:          0.249932
    95% confidence interval: (0.2494, 0.2505)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8054

Details:
    number of observations:   1000000
    t-statistic:              -0.24635861593453004
    degrees of freedom:       999999
    empirical standard error: 0.0002750802967082336

Indeed, debiasing worked.

Difference between estimators

Now let us check the direction of the difference between the estimators as a
function of the observed Pearson correlation coefficient (keeping n=10 fixed
as above):

julia> using Plots

julia> plot(r -> r - cor_unbiased(r), xlim=[-1, 1], label="r difference",
            xlab="observed r")

julia> plot!(r -> r^2 - r2_unbiased(r), label="r² difference")

You should get the following plot:

(plot: differences r - cor_unbiased(r) and r² - r2_unbiased(r) versus observed r)

As you can see, in our case debiasing always bumps the Pearson correlation
coefficient up for positive correlations and down for negative correlations.

Interestingly, the coefficient of determination is bumped down if the
absolute value of the observed correlation is less than approximately 0.6565,
and bumped up if this absolute value is larger.
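
The crossing point can be located numerically; a minimal bisection sketch
(the bracket endpoints are my choice):

# find where r^2 - r2_unbiased(r) changes sign on (0, 1)
function crossing(lo, hi; iters=60)
    for _ in 1:iters
        mid = (lo + hi) / 2
        # the difference is positive left of the crossing, negative right of it
        mid^2 - r2_unbiased(mid) > 0 ? (lo = mid) : (hi = mid)
    end
    return (lo + hi) / 2
end

crossing(0.1, 0.9)  # ≈ 0.6565 for n = 10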

This means that, for a large true Pearson correlation coefficient, we can
expect the standard formula for the coefficient of determination to
underestimate the true value on average.

To check this, let us run our simulation again with true correlation 0.9,
still keeping n=10 (the true coefficient of determination is then 0.81):

julia> r2_sim_2 = [sim(10, 0.9)^2 for _ in 1:10^6];

julia> OneSampleTTest(r2_sim_2, 0.81)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.81
    point estimate:          0.796659
    95% confidence interval: (0.7964, 0.7969)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   1000000
    t-statistic:              -100.77325436105275
    degrees of freedom:       999999
    empirical standard error: 0.00013238294244740913

julia> OneSampleTTest(r2_unbiased.(sqrt.(r2_sim_2)), 0.81)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.81
    point estimate:          0.810047
    95% confidence interval: (0.8098, 0.8103)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7251

Details:
    number of observations:   1000000
    t-statistic:              0.351680959027248
    degrees of freedom:       999999
    empirical standard error: 0.0001342944339289557

Indeed, the coefficient of determination is negatively biased in this case,
and the Olkin and Pratt method worked again.

So what are the downsides of the Olkin and Pratt estimator of the coefficient
of determination? The problem is that it gives negative values for the
coefficient for observed r close to 0. Why is this unavoidable? To see this,
assume for a moment that the true correlation is 0. In a random sample we will
see some positive observed r², just by pure chance. Therefore, to keep the
estimator unbiased (note that its expected value should be 0) we have to
allow it to take negative values.
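
For example, for n=10 the debiased value at r=0 is
1 - (7/8) * ₂F₁(1, 1; 5; 1) = 1 - (7/8) * (4/3) = -1/6 by Gauss's summation
theorem. A minimal check (assuming the z=1 edge case of _₂F₁ evaluates in
HypergeometricFunctions):

r2_unbiased(0.0)  # ≈ -0.1667, i.e. exactly -1/6 for n = 10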

Conclusions

I hope you found the presented properties of the estimators useful. I think
it is quite interesting that even basic statistical methods have quite
complex properties that are not obvious at first.