
Adjusting to Julia: Piecewise regression

By: Jan Vanhove

Re-posted from: https://janhove.github.io/posts/2023-03-07-julia-breakpoint-regression/index.html

In this fourth installment of Adjusting to Julia, I will at long last analyse some actual data. One of the first posts on this blog was Calibrating p-values in ‘flexible’ piecewise regression models. In that post, I fitted a piecewise regression to a dataset comprising the ages at which a number of language learners started learning a second language (age of acquisition, AOA) and their scores on a grammaticality judgement task (GJT) in that second language. A piecewise regression is a regression model in which the slope of the function relating the predictor (here: AOA) to the outcome (here: GJT) changes at some value of the predictor, the so-called breakpoint. The problem, however, was that I didn’t specify the breakpoint beforehand but picked the breakpoint that minimised the model’s deviance. This increased the probability that I would find that the slopes before and after the breakpoint differed, even if they were in fact the same. In the blog post I wrote almost nine years ago, I sought to recalibrate the p-value for the change in slope by running a bunch of simulations in R. In this blog post, I’ll do the same, but in Julia.
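In formula form, the piecewise model for a given breakpoint bp can be written as follows (my notation, not taken from the original post):

$$\mathrm{GJT}_i = \beta_0 + \beta_1\,\mathrm{AOA}_i + \beta_2 \max(0,\ \mathrm{AOA}_i - \mathrm{bp}) + \varepsilon_i,$$

so that the slope of AOA is $\beta_1$ up to the breakpoint and $\beta_1 + \beta_2$ beyond it; testing for a change in slope amounts to testing whether $\beta_2 = 0$.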

The data set

We’ll work with the data from the North America study conducted by DeKeyser et al. (2010). If you want to follow along, you can download this dataset here and save it to a subdirectory called data in your working directory.

We need the DataFrames, CSV and StatsPlots packages in order to read the CSV file containing the dataset into a data frame and to draw some basic graphs.

using DataFrames, CSV, StatsPlots

d = CSV.read("data/dekeyser2010.csv", DataFrame);

@df d plot(:AOA, :GJT
           , seriestype = :scatter
           , legend = :none
           , xlabel = "AOA"
           , ylabel = "GJT")

The StatsPlots package provides the @df macro, which specifies that the variables passed to the plot() function are to be found in the data frame named right after the macro (i.e., d).
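To see what the macro buys us, here is an equivalent call without @df that accesses the data frame columns directly (a sketch; it should produce the same scatterplot):

plot(d.AOA, d.GJT
     , seriestype = :scatter
     , legend = :none
     , xlabel = "AOA"
     , ylabel = "GJT")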

Two regression models

Let’s fit two regression models to this data set using the GLM package. The first model, lm1, is a simple regression model with AOA as the predictor and GJT as the outcome. The syntax should be self-explanatory:

using GLM 

lm1 = lm(@formula(GJT ~ AOA), d);
coeftable(lm1)
─────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  190.409     3.90403   48.77    <1e-57   182.63     198.188
AOA           -1.21798   0.105139  -11.58    <1e-17   -1.42747   -1.00848
─────────────────────────────────────────────────────────────────────────

We can visualise this model by plotting the data in a scatterplot and adding the model predictions to it, like so. I use a begin ... end block to force Julia to produce just a single plot.

d[!, "prediction"] = predict(lm1);

begin
@df d plot(:AOA, :GJT
           , seriestype = :scatter
           , legend = :none
           , xlabel = "AOA"
           , ylabel = "GJT");
@df d plot!(:AOA, :prediction
            , seriestype = :line)
end

Our second model will incorporate an ‘elbow’ in the regression line at a given breakpoint – a piecewise regression model. For a breakpoint bp, we need to create a variable since_bp that encodes how many years beyond this breakpoint the participants’ AOA values are. If an AOA value is lower than the breakpoint, the corresponding since_bp value is just 0. The add_breakpoint() function takes a dataset containing an AOA variable and adds a variable called since_bp to it.

function add_breakpoint(data, bp)
    # years beyond the breakpoint; 0 for AOA values at or below it
    data[!, "since_bp"] = max.(0, data[!, "AOA"] .- bp);
end;

To add the since_bp variable for a breakpoint at age 12 to our dataset d, we just run this function. Note that in Julia, arguments are not copied when they are passed to a function. That is, the add_breakpoint() function changes the dataset itself; it does not create a modified copy of the dataset like R would (see the small sketch after the printed dataset below):

# changes d!
add_breakpoint(d, 12);
print(d);
76×4 DataFrame
 Row │ AOA    GJT    prediction  since_bp 
     │ Int64  Int64  Float64     Int64    
─────┼────────────────────────────────────
   1 │    59    151     118.548        47
   2 │     9    182     179.447         0
   3 │    51    127     128.292        39
   4 │    58    113     119.766        46
   5 │    27    157     157.523        15
   6 │    11    188     177.011         0
   7 │    17    125     169.703         5
   8 │    57    138     120.984        45
   9 │    10    171     178.229         0
  10 │    14    168     173.357         2
  11 │    20    174     166.049         8
  12 │    34    149     148.997        22
  13 │    19    155     167.267         7
  14 │    54    149     124.638        42
  15 │    63    107     113.676        51
  16 │    71    104     103.932        59
  17 │    24    176     161.177        12
  18 │    16    143     170.921         4
  19 │    22    133     163.613        10
  20 │    48    113     131.946        36
  21 │    17    171     169.703         5
  22 │    20    144     166.049         8
  23 │    44    151     136.818        32
  24 │    24    182     161.177        12
  25 │    56    113     122.202        44
  26 │     5    197     184.319         0
  27 │    71    114     103.932        59
  28 │    36    170     146.561        24
  29 │    57    115     120.984        45
  30 │    45    115     135.6          33
  31 │    56    118     122.202        44
  32 │    44    118     136.818        32
  33 │    23    155     162.395        11
  34 │    18    186     168.485         6
  35 │    42    132     139.254        30
  36 │    54    116     124.638        42
  37 │    14    169     173.357         2
  38 │    47    131     133.164        35
  39 │     8    196     180.665         0
  40 │    24    122     161.177        12
  41 │    52    148     127.074        40
  42 │    27    188     157.523        15
  43 │    11    198     177.011         0
  44 │    18    174     168.485         6
  45 │    48    150     131.946        36
  46 │    31    158     152.651        19
  47 │    49    131     130.728        37
  48 │    48    131     131.946        36
  49 │    15    180     172.139         3
  50 │    49    113     130.728        37
  51 │    23    167     162.395        11
  52 │    10    193     178.229         0
  53 │    20    164     166.049         8
  54 │    24    183     161.177        12
  55 │    35    118     147.779        23
  56 │    36    136     146.561        24
  57 │    44    115     136.818        32
  58 │    49    141     130.728        37
  59 │    15    181     172.139         3
  60 │    12    193     175.793         0
  61 │    53    140     125.856        41
  62 │    16    153     170.921         4
  63 │    54    110     124.638        42
  64 │     9    163     179.447         0
  65 │    25    174     159.959        13
  66 │    27    169     157.523        15
  67 │    18    179     168.485         6
  68 │    26    143     158.741        14
  69 │    22    162     163.613        10
  70 │    50    128     129.51         38
  71 │    42    119     139.254        30
  72 │     5    197     184.319         0
  73 │    14    168     173.357         2
  74 │    39    132     142.908        27
  75 │    56    140     122.202        44
  76 │    12    182     175.793         0

Since we don’t know what the best breakpoint is, we’re going to estimate it from the data. For each integer in a given range (minbp through maxbp), we’re going to fit a piecewise regression model with that integer as the breakpoint. We’ll then pick the breakpoint that minimises the deviance of the fit (i.e., the sum of squared differences between the model fit and the actual outcome). The fit_piecewise() function takes care of this. It outputs both the best-fitting piecewise regression model and the breakpoint used for this model.

function fit_piecewise(data, minbp, maxbp)
  min_deviance = Inf
  best_model = nothing
  best_bp = 0
  current_model = nothing
  
  for bp in minbp:maxbp
    add_breakpoint(data, bp)
    current_model = lm(@formula(GJT ~ AOA + since_bp), data)
    if deviance(current_model) < min_deviance
      min_deviance = deviance(current_model)
      best_model = current_model
      best_bp = bp
    end
  end
  
  return best_model, best_bp
end;
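Incidentally, for an ordinary least-squares model the deviance is indeed just the residual sum of squares, which we can double-check on lm1 (my check, not part of the original post):

# deviance() of an OLS fit equals the sum of its squared residuals
deviance(lm1) ≈ sum(abs2, residuals(lm1))  # true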

Let’s now apply this function to our dataset. The estimated breakpoint is at age 16; the coefficients of the corresponding model can be inspected using coeftable():

lm2 = fit_piecewise(d, 6, 20);
# the first output is the model itself, the second the breakpoint used
coeftable(lm2[1])
lm2[2]
16

Let’s visualise this model by drawing a scatterplot and adding the regression fit to it. While we’re at it, we might as well add a 95% confidence band around the regression fit.

add_breakpoint(d, 16);
predictions = predict(lm2[1], d;
                      interval = :confidence,
                      level = 0.95);
d[!, "prediction"] = predictions[!, "prediction"];
d[!, "lower"] = predictions[!, "lower"];
d[!, "upper"] = predictions[!, "upper"];

begin
@df d plot(:AOA, :GJT
           , seriestype = :scatter
           , legend = :none
           , xlabel = "AOA"
           , ylabel = "GJT"
      );
@df d plot!(:AOA, :prediction
            , seriestype = :line
            , ribbon = (:prediction .- :lower, 
                        :upper .- :prediction)
      )
end

We could run an F-test for the model comparison like below, but the resulting p-value corresponds to the p-value for the since_bp term, anyway:

ftest(lm1.model, lm2[1].model);

But there’s a problem: this p-value can’t be taken at face value. By looping through different possible breakpoints and then picking the one that worked best for our dataset, we’ve increased our chances of finding some pattern in the data even if nothing is going on. So we need to recalibrate the p-value we’ve obtained.

Recalibrating the p-value

Our strategy is as follows. We will generate a fairly large number of datasets similar to d but of which we know that there isn’t any breakpoint in the GJT/AOA relationship. We will do this by simulating new GJT values from the simple regression model fitted above (lm1). We will then apply the fit_piecewise() function to each of these datasets, using the same minbp and maxbp values as before, and obtain the p-value associated with each model. Finally, we will compute the proportion of the p-values so obtained that are lower than or equal to the p-value from our original model, i.e., 0.0472.

I wasn’t able to find a Julia function similar to R’s simulate() that simulates a new outcome variable based on a linear regression model. But such a function is easy enough to put together:

using Distributions

function simulate_outcome(null_model)
  # residual distribution: normal, with the model's residual standard deviation
  resid_distr = Normal(0, dispersion(null_model.model))
  prediction = fitted(null_model)
  # fitted values plus freshly drawn residuals
  new_outcome = prediction + rand(resid_distr, length(prediction))
  return new_outcome
end;

The one_run() function generates a single new outcome vector, overwrites the GJT variable in our dataset with it, and then applies the fit_piecewise() function to the dataset, returning the p-value of the best-fitting piecewise regression model.

function one_run(data, null_model, min_bp, max_bp)
  new_outcome = simulate_outcome(null_model)
  data[!, "GJT"] = new_outcome
  best_model = fit_piecewise(data, min_bp, max_bp)
  # p-value for since_bp: 3rd coefficient row, 4th column (Pr(>|t|))
  pval = coeftable(best_model[1]).cols[4][3]
  return pval
end;

Finally, the generate_p_distr() function runs the one_run() function a large number of times and outputs the p-values generated.

function generate_p_distr(data, null_model, min_bp, max_bp, n_runs)
  pvals = [one_run(data, null_model, min_bp, max_bp) for _ in 1:n_runs]
  return pvals
end;

Our simulation will consist of 25,000 runs, and in each run, 15 regression models will be fitted (one per candidate breakpoint from 6 through 20), for a total of 375,000 models. On my machine, this takes less than 20 seconds (i.e., roughly 50 microseconds per model).

n_runs = 25_000;
pvals = generate_p_distr(d, lm1, 6, 20, n_runs);

For about 11–12% of the datasets in which no breakpoint governed the data, the fit_piecewise() procedure returned a p-value of 0.0472 or lower. So our original p-value of 0.0472 ought to be recalibrated to about 0.12.

sum(pvals .<= 0.0472) / n_runs
0.11864
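Since this recalibrated value is itself only a simulation-based estimate, we can attach a rough margin of error to it using the normal approximation to the binomial (my addition, not part of the original analysis):

# approximate 95% confidence interval for the simulated proportion
phat = sum(pvals .<= 0.0472) / n_runs;
se = sqrt(phat * (1 - phat) / n_runs);
(phat - 1.96 * se, phat + 1.96 * se)  # roughly (0.115, 0.123)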

Adjusting to Julia: Tea tasting

By: Jan Vanhove

Re-posted from: https://janhove.github.io/posts/2023-02-23-julia-tea-tasting/index.html

In this third blog post in which I try my hand at the Julia language, I’ll tackle a slight variation of an old problem – Fisher’s tea tasting lady – both analytically and using a brute-force simulation.

The lady tasting tea

In The Design of Experiments, Ronald A. Fisher explained the Fisher exact test using the following example. Imagine that a lady claims she can taste the difference between cups of tea in which the tea was poured into the cup first and then milk was added, and cups of tea in which the milk was poured first and then the tea was added. A sceptic might put the lady to the test and prepare eight cups of tea – four with tea to which milk was added, and four with milk to which tea was added. (Yuck to both, by the way.) The lady is presented with these in a random order and is asked to identify those four cups with tea to which milk was added. Now, if the lady has no discriminatory ability whatever, there is only a 1-in-70 chance she identifies all four cups correctly. This is because there are 70 ways of picking four cups out of eight, and only one of these ways is correct. In Julia:

binomial(8, 4)
70

We can now imagine a slight variation on this experiment. If the lady identifies all four cups correctly, we choose to believe she has the purported discriminatory ability. If she identifies two or fewer cups correctly, we remain sceptical. But if she identifies three out of four cups correctly, we prepare another eight cups of tea and give her another chance under the same conditions.

We can ask two questions about this new procedure:

  1. With which probability will we believe the lady if she, in fact, does not have any discriminatory ability?
  2. How many rounds of tea tasting will we need on average before the experiment terminates?

In the following, I’ll share both analytical and simulation-based answers to these questions.

Analytical solution

Under the null hypothesis of no discriminatory ability, the number of correctly identified cups in any one round ($X$) follows a hypergeometric distribution with parameters $N = 8$ (total), $K = 4$ (successes) and $n = 4$ (draws), i.e., $X \sim \mathrm{Hypergeometric}(8, 4, 4)$. In any given round, the subject fails the test if she only identifies 0, 1 or 2 cups correctly. Under the null hypothesis, the probability of this happening is given by $p = P(X \le 2)$, the value of which we can determine using the cumulative distribution function of the $\mathrm{Hypergeometric}(8, 4, 4)$ distribution. We suspend judgement on the subject’s discriminatory abilities if she identifies exactly three cups correctly, in which case she has another go. Under the null hypothesis, the probability of this happening in any given round is given by $q = P(X = 3)$, the value of which can be determined using the probability mass function of the $\mathrm{Hypergeometric}(8, 4, 4)$ distribution.

With those probabilities in hand, we can now compute the probability that the subject fails the experiment in a specific round, assuming the null hypothesis is correct. In the first round, she will fail the experiment with probability $p$. In order to fail in the second round, she needs to have advanced to the second round, which happens with probability $q$, and then to fail in that round, which happens with probability $p$. That is, she will fail in the second round with probability $qp$. To fail in the third round, she needs to advance to the third round, which happens with probability $q^2$, and then to fail in the third round, which happens with probability $p$. That is, she will fail in the third round with probability $q^2p$. Etc. etc. The probability that she will fail somewhere in the experiment if the null hypothesis is true, then, is given by

$$\sum_{i=1}^{\infty} pq^{i-1} = p\sum_{i=0}^{\infty} q^{i} = \frac{p}{1-q},$$

where the first equality is just a matter of shifting the index and the second equality holds because the expression is a geometric series (note that $0 < q < 1$).

Let’s compute the final results using Julia. The following loads the Distributions package and then defines d as the Hypergeometric(8, 4, 4) distribution. Note that in Julia, the parameters of the Hypergeometric distribution aren’t $N$ (total), $K$ (successes) and $n$ (draws), but rather $s$ (successes), $f$ (failures) and $n$ (draws); see the documentation. We then read off the values for $p$ and $q$ from the cumulative distribution function and the probability mass function, respectively:

using Distributions
d = Hypergeometric(4, 4, 4);
p = cdf(d, 2);
q = pdf(d, 3);

The probability that the subject will fail the experiment if she does indeed not have the purported discriminatory ability is now easily computed:

p / (1 - q)
0.9814814814814815
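Note that this is the probability that she fails the experiment; the probability that we end up believing her is the complement:

1 - p / (1 - q)  # ≈ 0.0185, i.e., about 1.85%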

The next question is how many rounds we expect the experiment to carry on for if the null hypothesis is true. In each round, the probability that the experiment terminates in that round is given by $1 - q$. From the geometric distribution, we know that we then on average need $\frac{1}{1-q}$ attempts before the first terminating event occurs:

1 / (1 - q)
1.2962962962962965

In sum, if the subject lacks any discriminatory ability, there is only a 1.85% chance that she will pass the test, and on average, the experiment will run for 1.3 rounds.

Brute-force solution

First, we define a function experiment() that runs the experiment once. In essence, we have an urn that contains four correct identifications (true) and four incorrect identifications (false). From this urn, we sample() four identifications without replacement.

Note, incidentally, that Julia functions can take both positional arguments and keyword arguments. In the sample() command below, both urn and 4 are passed as positional arguments, and you’d have to read the documentation to figure out which argument specifies what. The keyword arguments are separated from the positional arguments by a semi-colon and are identified with the parameter’s name.
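A contrived illustration of the difference (hypothetical function, just to show the syntax):

# x and y are positional; scale is a keyword argument
f(x, y; scale = 1) = scale * (x + y);

f(2, 3)              # 5: positional arguments only
f(2, 3; scale = 10)  # 50: keyword argument named explicitly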

Next, we count the number of trues in our pick using sum(). Depending on how many trues there are in pick, we either terminate the experiment, returning false if we remain sceptical and true if we choose to believe the lady, or we run the experiment for one more round. The number of attempts is tallied and returned as well.

using StatsBase  # provides sample(); loaded here to make the snippet self-contained

function experiment(attempt = 1)
    urn = [false, false, false, false, true, true, true, true]
    pick = sample(urn, 4; replace = false)
    number = sum(pick)
    if number <= 2
        return false, attempt  # we remain sceptical
    elseif number >= 4
        return true, attempt   # we believe the lady
    else
        return experiment(attempt + 1)  # three correct: another round
    end
end;

A single run of experiment() could produce the following output:

experiment()
(false, 1)

Next, we write a function simulate() that runs experiment() a large number of times and outputs both whether each experiment() led to us believing the lady or remaining sceptical, and how many rounds each experiment() took. These results are tabulated in a DataFrame – just because. Of note, Julia supports the list comprehensions that Python users will be familiar with. I use this feature here both to run the experiment a large number of times and to parse the output.

using DataFrames

function simulate(runs = 10000)
    results = [experiment() for _ in 1:runs]
    success = [results[i][1] for i in 1:runs]
    attempts = [results[i][2] for i in 1:runs]
    d = DataFrame(Successful = success, Attempts = attempts)
    return d
end;

Let’s swing for the fences and run this experiment a million times. Like in Python, we can make large numbers easier to parse by inserting underscores in them:

runs = 1_000_000;

Using the @time macro, we can check how long it takes for this simulation to finish.

@time d = simulate(runs)
  0.359740 seconds (4.07 M allocations: 361.334 MiB, 14.82% gc time, 35.07% compilation time)
1000000×2 DataFrame
     Row │ Successful  Attempts
         │ Bool        Int64
─────────┼──────────────────────
       1 │      false         1
       2 │      false         1
       3 │      false         1
       4 │      false         1
       5 │      false         1
       6 │      false         2
       7 │      false         3
       8 │      false         2
       9 │      false         1
      10 │      false         1
      11 │      false         1
      12 │      false         1
      13 │      false         2
       ⋮ │     ⋮          ⋮
  999989 │      false         1
  999990 │      false         1
  999991 │      false         1
  999992 │      false         4
  999993 │      false         1
  999994 │      false         1
  999995 │      false         1
  999996 │      false         1
  999997 │      false         1
  999998 │      false         1
  999999 │      false         1
 1000000 │      false         1
              999975 rows omitted

On my machine, then, running this simulation takes less than a second. Note that about a third of this time is compilation time. (Update: When migrating my blog to Quarto, I reran this code using a new Julia version (1.9.1). Now the code runs faster.) Indeed, if we run the function another time, i.e., after it’s been compiled, the run time drops to about 0.3 seconds (Update: 0.2 seconds now.):

@time d2 = simulate(runs)
  0.209087 seconds (3.89 M allocations: 348.982 MiB, 16.29% gc time)
1000000×2 DataFrame
     Row │ Successful  Attempts
         │ Bool        Int64
─────────┼──────────────────────
       1 │      false         1
       2 │      false         1
       3 │      false         1
       4 │      false         1
       5 │      false         1
       6 │      false         2
       7 │      false         2
       8 │      false         1
       9 │      false         1
      10 │      false         1
      11 │      false         2
      12 │      false         1
      13 │      false         1
       ⋮ │     ⋮          ⋮
  999989 │      false         3
  999990 │      false         3
  999991 │      false         1
  999992 │      false         1
  999993 │      false         1
  999994 │      false         1
  999995 │      false         1
  999996 │      false         1
  999997 │      false         1
  999998 │      false         2
  999999 │      false         1
 1000000 │      false         1
              999975 rows omitted

Using describe(), we see that this simulation – which doesn’t ‘know’ anything about hypergeometric or geometric distributions – produces the same answers that we arrived at by analytical means: there’s about a 1.85% chance that we end up believing the lady even if she has no discriminatory ability whatsoever. And if she doesn’t have any discriminatory ability, we’ll need 1.3 rounds on average before terminating the experiment:

describe(d)
2×7 DataFrame
 Row │ variable    mean      min      median  max      nmissing  eltype
     │ Symbol      Float64   Integer  Float64  Integer  Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────
   1 │ Successful  0.018533  false        0.0  true            0  Bool
   2 │ Attempts    1.29611   1            1.0  10              0  Int64

The slight discrepancies between the simulation-based results and the analytical ones are just due to randomness. Below is a quick way of constructing 95% confidence intervals around both of our simulation-based estimates; the analytical solutions fall within both intervals.

means = mean.(eachcol(d))
2-element Vector{Float64}:
 0.018533
 1.296105
ses = std.(eachcol(d)) / sqrt(runs)
2-element Vector{Float64}:
 0.00013486862533794173
 0.0006198767726106645
upr = means + 1.96*ses
2-element Vector{Float64}:
 0.018797342505662368
 1.297319958474317
lwr = means - 1.96*ses
2-element Vector{Float64}:
 0.018268657494337634
 1.2948900415256832