
Calibrating an Ornstein–Uhlenbeck Process

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2024/03/09/Calibrating-an-Ornstein-Uhlenbeck-Process.html

Read enough quant finance papers or books and you’ll come across the
Ornstein–Uhlenbeck (OU) process. This post explores the OU process: the
equations, how to simulate such a process, and how to estimate its parameters.


I’ve briefly touched on mean reversion and OU processes before in my
Stat Arb – An Easy Walkthrough
blog post where we modelled the spread between an asset and its
respective ETF. The whole concept of ‘mean reversion’ is something
that comes up frequently in finance and at different time scales. It
can be thought of as the first basic extension of Brownian motion:
instead of things moving purely randomly, there is now a slight
structure where the process oscillates around a constant value.

The Hudson Thames group have a similar post on OU processes (Mean-Reverting Spread Modeling: Caveats in Calibrating the OU Process) and
my post should be a nice complement with code and some extensions.

The Ornstein-Uhlenbeck Equation

As a continuous process, we write the change in \(X_t\) as an increment in time and some noise

\[\mathrm{d}X_t = \theta (\mu - X_t) \mathrm{d}t + \sigma \mathrm{d}W_t\]

The amount it changes in time depends on the previous value \(X_t\) and two free parameters, \(\mu\) and \(\theta\).

  • The \(\mu\) is the long-term drift of the process
  • The \(\theta\) is the mean reversion or momentum parameter depending on the sign.

If \(\theta\) is 0 we can see the equation collapses down to a simple random walk.

If we assume \(\mu = 0\), so the long-term average is 0, then a positive value of \(\theta\) means we see mean reversion. Large values of \(X\) mean the next change is likely to have a negative sign, leading to a smaller value in \(X\).

A negative value of \(\theta\) means the opposite: a large value of \(X\) generates a further large positive change, and the process explodes.
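
As a quick worked example, the expected change over a small step is \(\theta (\mu - X_t) \Delta t\). With \(\mu = 0\), \(X_t = 2\), and \(\Delta t = 0.1\), a mean-reverting \(\theta = 0.5\) gives an expected change of \(0.5 \times (0 - 2) \times 0.1 = -0.1\), pulling the process back towards zero, whereas \(\theta = -0.5\) gives \(+0.1\), pushing it further away.
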
If we discretise the process we can simulate some samples with different parameters to illustrate these two modes.

\[X_{t+1} - X_t = \theta (\mu - X_t) \Delta t + \sigma \sqrt{\Delta t} W_t\]

where \(W_t \sim N(0,1)\).

This is easy to write out in Julia. We can save some time by drawing the random values first and then just summing everything together.

using Distributions, Plots

function simulate_os(theta, mu, sigma, dt, maxT, initial)
    p = Array{Float64}(undef, length(0:dt:maxT))
    p[1] = initial
    # Pre-draw all the random shocks, scaled for the time step
    w = sigma * rand(Normal(), length(p)) * sqrt(dt)
    for i in 1:(length(p)-1)
        # Discretised OU update: deterministic pull towards mu plus noise
        p[i+1] = p[i] + theta*(mu-p[i])*dt + w[i]
    end
    return p
end

We have two classes of OU processes we want to simulate: a
mean-reverting version (\(\theta > 0\)) and a momentum version (\(\theta < 0\)), and
we also want to simulate a random walk (\(\theta = 0\)) at the same
time. We will assume \(\mu = 0\), which keeps the pictures simple.

maxT = 5
dt = 1/(60*60)
vol = 0.005

initial = 0.00*rand(Normal())

p1 = simulate_os(-0.5, 0, vol, dt, maxT, initial)
p2 = simulate_os(0.5, 0, vol, dt, maxT, initial)
p3 = simulate_os(0, 0, vol, dt, maxT, initial)

plot(0:dt:maxT, p1, label = "Momentum")
plot!(0:dt:maxT, p2, label = "Mean Reversion")
plot!(0:dt:maxT, p3, label = "Random Walk")

The different shapes an OU process can take

The mean reversion path (orange) hasn’t moved away from the long-term average (\(\mu=0\)) and the momentum path has diverged the furthest from the starting point, which lines up with the name. The random walk sits in between the two, as we would expect.

Now we have successfully simulated the process we want to try and
estimate the \(\theta\) parameter from the simulation. We have two
slightly different (but similar) methods to achieve this.

OLS Calibration of an OU Process

When we look at the generating equation we can simply rearrange it into a linear equation.

\[\Delta X = \theta \mu \Delta t - \theta \Delta t X_t + \epsilon\]

and the usual OLS equation

\[y = \alpha + \beta X + \epsilon\]

such that

\[\alpha = \theta \mu \Delta t\]

\[\beta = -\theta \Delta t\]

where \(\epsilon\) is the noise. So we just need a DataFrame with the difference between subsequent observations alongside the previous observation it relates to. Just a diff and a shift.

using DataFrames, DataFramesMeta
momData = DataFrame(y=p1)
momData = @transform(momData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

Then using the standard OLS process from the GLM package.

using GLM

mdl = lm(@formula(diffY ~ prevY), momData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

Which gives us \(\mu = 0.0075, \theta = -0.3989\), so close to zero
for the drift and the reversion parameter has the correct sign.

Doing the same for the mean reversion data.

revData = DataFrame(y=p2)
revData = @transform(revData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

mdl = lm(@formula(diffY ~ prevY), revData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

This time \(\mu = 0.001\) and \(\theta = 1.2797\). So a little wrong
compared to the true values, but at least the correct sign.

Does Bootstrapping Help?

It could be that we need more data, so we use the bootstrap to randomly sample from the population to give us pseudo-new draws. We use the DataFrames again and pull random rows with replacement to build out the data set. We do this sampling 1000 times.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), momData[sample(2:nrow(momData), nrow(momData), replace=true), :])
    res[i] = -first(coef(mdl)/dt)
end

bootMom = histogram(res, label = :none, title = "Momentum", color = "#7570b3")
bootMom = vline!(bootMom, [-0.5], label = "Truth", lw = 2)
bootMom = vline!(bootMom, [0.0], label = :none, color = "black")

We then do the same for the reversion data.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), revData[sample(2:nrow(revData), nrow(revData), replace=true), :])
    res[i] = first(-coef(mdl)/dt)
end

bootRev = histogram(res, label = :none, title = "Reversion", color = "#1b9e77")
bootRev = vline!(bootRev, [0.5], label = "Truth", lw = 2)
bootRev = vline!(bootRev, [0.0], label = :none, color = "black")

Then combining both the graphs into one plot.

plot(bootMom, bootRev, 
  layout=(2,1),dpi=900, size=(800, 300),
  background_color=:transparent, foreground_color=:black,
     link=:all)

Bootstrapping an OU process

The momentum bootstrap has worked and centred around the correct
value, but the same cannot be said for the reversion plot. However, it
has correctly guessed the sign.

AR(1) Calibration of an OU Process

If we continue assuming that \(\mu = 0\) then we can simplify the OLS
to a 1-parameter regression – OLS without an intercept. From the
generating process, we can see that this is an AR(1) process – each
observation depends on the previous observation by some amount.

\[\phi = \frac{\sum _i X_i X_{i-1}}{\sum _i X_{i-1}^2}\]

then, because the exact solution of the OU process over a time step \(\Delta t\) has autoregressive coefficient \(\phi = e^{-\theta \Delta t}\), the reversion parameter is calculated as

\[\theta = - \frac{\log \phi}{\Delta t}\]

This gives us a simple equation to calculate \(\theta\) now.

For the momentum sample:

phi = sum(p1[2:end] .* p1[1:(end-1)]) / sum(p1[1:(end-1)] .^2)
-log(phi)/dt

Gives \(\theta = -0.50184\), so very close to the true value.

For the reversion sample

phi = sum(p2[2:end] .* p2[1:(end-1)]) / sum(p2[1:(end-1)] .^2)
-log(phi)/dt

Gives \(\theta = 1.26\), so correct sign, but quite a way off.

Finally, for the random walk

phi = sum(p3[2:end] .* p3[1:(end-1)]) / sum(p3[1:(end-1)] .^2)
-log(phi)/dt

Produces \(\theta = -0.027\), so quite close to zero.

Again, values are similar to what we expect, so our estimation process
appears to be working.

Using Multiple Samples for Calibrating an OU Process

If you aren’t convinced I don’t blame you. Those point estimates above are nowhere near the actual values that simulated the data so it’s hard to believe the estimation method is working. Instead, what we need to do is repeat the process and generate many more price paths and estimate the parameters of each one.

To make things a bit more manageable code-wise, though, I’m going to
introduce a struct that contains the parameters and allows us to
simulate and estimate in a more contained manner.

struct OUProcess
    theta
    mu 
    sigma
    dt
    maxT
    initial
end

We now write specific functions for this object and this allows us to
simplify the code slightly.

function simulate(ou::OUProcess)
    simulate_os(ou.theta, ou.mu, ou.sigma, ou.dt, ou.maxT, ou.initial)
end

function estimate(ou::OUProcess)
   p = simulate(ou)
   phi =  sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^2)
   -log(phi)/ou.dt
end

function estimate(ou::OUProcess, N)
    res = zeros(N)
    for i in 1:N
        res[i] = estimate(ou)
    end
    res
end

We use these new functions to draw from the process 1,000 times and
sample the parameters for each one, collecting the results as an
array.

ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
revPlot = histogram(estimate(ou, 1000), label = :none, title = "Reversion")
vline!(revPlot, [0.5], label = :none);

And the same for the momentum OU process

ou = OUProcess(-0.5, 0.0, vol, dt, maxT, initial)
momPlot = histogram(estimate(ou, 1000), label = :none, title = "Momentum")
vline!(momPlot, [-0.5], label = :none);

Plotting the distribution of the results gives us a decent
understanding of how varied the samples can be.

plot(revPlot, momPlot, layout = (2,1), link=:all)

Multiple sample estimation of an OU process

We can see the heavy-tailed nature of the estimation process, but
thankfully the histograms are centred around the correct number. This
goes to show how difficult it is to estimate the mean reversion
parameter even in this simple setup. So for a real dataset, you need to
work out how to collect more samples or radically adjust how accurate
you think your estimate is.
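
To put a rough number on that spread, here is a quick sketch reusing the functions defined above (the exact values will change from run to run, so I won’t quote any here):

using Statistics

# Spread of the reversion estimates across 1,000 simulated paths
ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
thetaHats = estimate(ou, 1000)
std(thetaHats), quantile(thetaHats, [0.05, 0.5, 0.95])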

Summary

We have progressed from simulating an Ornstein-Uhlenbeck process to
estimating its parameters using various methods. We attempted to
enhance the accuracy of the estimates through bootstrapping, but we
discovered that the best approach to improve the estimation is to have
multiple samples.

So if you are trying to fit this type of process on some real world
data, be it the spread between two stocks
(Statistical Arbitrage in the U.S. Equities Market),
client flow (Unwinding Stochastic Order Flow: When to Warehouse Trades) or anything
else you believe might be mean reverting, then understand how much
data you might need to accurately model the process.

Exploring Causal Regularisation

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/12/28/Exploring-Causal-Regularisation.html

A good prediction model isn’t necessarily a good causal model. You could be missing a key variable in your dataset that is driving the underlying behavior so you end up with a good predictive model but not the correct explanation as to why things behave that way. Taking a causal approach is a tougher problem and needs an understanding of whether we have access to the right variables or we are making the right link between variables and an outcome. Causal regularisation is a method that uses machine learning techniques (regularisation!) to try and produce models that can be interpreted causally.


Regularisation is normally taught as a method to reduce overfitting: you have a big model and you make it smaller by shrinking some of the factors. Work by Janzing (papers below) argues that this can help produce better causal models too, and in this blog post I will work through two papers to try and understand the process better.

I’ll work off two main papers for causal regularisation:

  1. Causal Regularisation
  2. Detecting non-causal artifacts in multivariate linear regression models

In truth, I am working backward. I first encountered causal regularisation in Better AB testing via Causal Regularisation where it uses causal regularisation to produce better estimates by combining a biased and an unbiased dataset. I want to take a step back and understand causal regularisation from the original papers. Using free data from the UCI Machine Learning Repository we can attempt to replicate the methods from the papers and see how causal regularisation works to produce better causal models.

As ever, I’m in Julia (1.9), so fire up that notebook and follow along.

using CSV, DataFrames, DataFramesMeta
using Plots
using GLM, Statistics

Wine Tasting Data

The wine-quality dataset from the UCI repository provides measurements of the chemical properties of wine and a quality rating from someone drinking the wine. It’s a simple CSV file that you can download (winequality) and load with minimal data wrangling needed.

We will be working with the red wine data set as that’s what both Janzing papers use.

rawData = CSV.read("wine+quality/winequality-red.csv", DataFrame)
first(rawData)

APD! Always Plotting the Data to make sure the values are something you expect. Sometimes you need a visual confirmation that things line up with what you believe.

plot(scatter(rawData.alcohol, rawData.quality, title = "Alcohol", label = :none, color="#eac435"),
     scatter(rawData.pH, rawData.quality, title = "pH", label = :none, color="#345995"),
     scatter(rawData.sulphates, rawData.quality, title= "Sulphates", label = :none, color="#E40066"),
     scatter(rawData.density, rawData.quality, title = "Density", label = :none, color="#03CEA4"), ylabel = "Quality")

Wine quality variable relationships

By choosing four of the variables randomly we can see that some are correlated with quality and some are not.

A loose goal is to come up with a causal model that can explain the quality of the wine using the provided factors. We will change the data slightly to highlight how causal regularisation helps, but for now, let’s start with the simple OLS model.

In the paper they normalise the variables to be unit variance, so we divide by the standard deviation.
We then model the quality of the wine using all the available variables.

vars = names(rawData, Not(:quality))

cleanData = deepcopy(rawData)

for var in filter(!isequal("White"), vars)
    cleanData[!, var] = cleanData[!, var] ./ std(cleanData[!, var])
end

cleanData[!, :quality] .= Float64.(cleanData[!, :quality])

ols = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanData)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates + alcohol

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                           Coef.  Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           21.9652     21.1946      1.04    0.3002  -19.6071      63.5375
fixed acidity          0.043511    0.0451788   0.96    0.3357   -0.0451055    0.132127
volatile acidity      -0.194027    0.0216844  -8.95    <1e-18   -0.23656     -0.151494
citric acid           -0.0355637   0.0286701  -1.24    0.2150   -0.0917989    0.0206716
residual sugar         0.0230259   0.0211519   1.09    0.2765   -0.0184626    0.0645145
chlorides             -0.088211    0.0197337  -4.47    <1e-05   -0.126918    -0.0495041
free sulfur dioxide    0.0456202   0.0227121   2.01    0.0447    0.00107145   0.090169
total sulfur dioxide  -0.107389    0.0239718  -4.48    <1e-05   -0.154409    -0.0603698
density               -0.0337477   0.0408289  -0.83    0.4086   -0.113832     0.0463365
pH                    -0.0638624   0.02958    -2.16    0.0310   -0.121883    -0.00584239
sulphates              0.155325    0.019381    8.01    <1e-14    0.11731      0.19334
alcohol                0.294335    0.0282227  10.43    <1e-23    0.238977     0.349693
────────────────────────────────────────────────────────────────────────────────────────

The dominant factor is the alcohol amount which is the strongest variable in predicting the quality, i.e. higher quality has a higher alcohol content. We also note that 5 out of the 12 variables are deemed insignificant at the 5% level. We save these parameters and then look at the regression without the alcohol variable.

olsParams = DataFrame(Dict(zip(vars, coef(ols)[2:end])))
olsParams[!, :Model] .= "OLS"
olsParams
1×12 DataFrame
Row alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH residual sugar sulphates total sulfur dioxide volatile acidity Model
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String
1 0.294335 -0.088211 -0.0355637 -0.0337477 0.043511 0.0456202 -0.0638624 0.0230259 0.155325 -0.107389 -0.194027 OLS
cleanDataConfounded = select(cleanData, Not(:alcohol))
vars = names(cleanDataConfounded, Not(:quality))

confoundOLS = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanDataConfounded)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
                             Coef.  Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           189.679       14.2665      13.30    <1e-37  161.696       217.662
fixed acidity           0.299551     0.0391918    7.64    <1e-13    0.222678      0.376424
volatile acidity       -0.176182     0.0223382   -7.89    <1e-14   -0.219997     -0.132366
citric acid             0.00912711   0.0292941    0.31    0.7554   -0.0483321     0.0665863
residual sugar          0.133781     0.0189031    7.08    <1e-11    0.0967031     0.170858
chlorides              -0.107215     0.0203052   -5.28    <1e-06   -0.147043     -0.0673877
free sulfur dioxide     0.0394281    0.023462     1.68    0.0931   -0.00659172    0.0854479
total sulfur dioxide   -0.128248     0.0246854   -5.20    <1e-06   -0.176668     -0.0798287
density                -0.355576     0.0276265  -12.87    <1e-35   -0.409765     -0.301388
pH                      0.0965662    0.0261087    3.70    0.0002    0.0453551     0.147777
sulphates               0.213697     0.0191745   11.14    <1e-27    0.176087      0.251307
───────────────────────────────────────────────────────────────────────────────────────────

citric acid and free sulfur dioxide are now the only insignificant variables; the rest are believed to contribute to the quality. This means we are experiencing confounding: alcohol is the better explainer, but its effect is now hiding behind these other variables.

Confounding – When a variable influences other variables and the outcome at the same time, leading to an incorrect view of the correlation between the variables and the outcome.
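
As a toy illustration (a hypothetical simulation, not the wine data): let a hidden variable \(Z\) drive both a feature \(X\) and the outcome \(Y\). Regressing \(Y\) on \(X\) alone then picks up \(Z\)'s effect and overstates the direct relationship.

using DataFrames, GLM, Random

Random.seed!(1)
n = 10_000
toy = DataFrame(z = randn(n))            # hidden confounder
toy.x = 0.8 .* toy.z .+ 0.1 .* randn(n)  # x is mostly driven by z
toy.y = 0.5 .* toy.z .+ 0.2 .* randn(n)  # y is driven by z, not directly by x

coef(lm(@formula(y ~ x), toy))      # x picks up z's effect and looks strongly predictive
coef(lm(@formula(y ~ x + z), toy))  # controlling for z, x's coefficient shrinks towards zero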

This regression after dropping the alcohol variable is incorrect and provides the wrong causal conclusion. So can we do better and get closer to the true regression coefficients using some regularisation methods?

For now, we save these incorrect parameters and explore the causal regularisation methods.

olsParamsConf = DataFrame(Dict(zip(vars, coef(confoundOLS)[2:end])))
olsParamsConf[!, :Model] .= "OLS No Alcohol"
olsParamsConf[!, :alcohol] .= NaN

olsParamsConf
1×12 DataFrame
Row chlorides citric acid density fixed acidity free sulfur dioxide pH residual sugar sulphates total sulfur dioxide volatile acidity Model alcohol
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String Float64
1 -0.107215 0.00912711 -0.355576 0.299551 0.0394281 0.0965662 0.133781 0.213697 -0.128248 -0.176182 OLS No Alcohol NaN

Regularisation and Regression

Some maths. Regression is taking our variables \(X\) and finding the parameters \(a\) that get us closest to \(Y\).

\[Y = a X\]

\(X\) is a matrix, and \(a\) is a vector. When we fit this to some data, the values of \(a\) are free to converge to any value they want, so long as it gets close to the outcome variable. This means we are minimising the difference between \(Y\) and \(aX\)

\[||Y - a X|| ^2.\]

Regularisation is the act of restricting the values \(a\) can take.

For example, we can make the sum of the absolute values of the \(a\)’s equal to a constant (\(L_1\) regularisation), or the sum of the squares of the \(a\) values equal a constant (\(L_2\) regularisation).
In simpler terms, if we want to increase the coefficient of one parameter, we need to reduce the coefficient of a different term. Think of there being a finite amount of mass that we can allocate to the parameters: they can’t take on whatever value they like, but instead have to share it amongst themselves. This helps reduce overfitting as it constrains how much influence a parameter can have, and the final result should converge to a model that doesn’t overfit.

In ridge regression we minimise the \(L_2\) norm, so we restrict the sum of the squares of the \(a\)’s while at the same time minimising the original OLS objective.

\[||Y - a X|| ^2 + \lambda || a || ^2.\]

So we can see how regularisation is an additional component of OLS regression. \(\lambda\) is a hyperparameter that is just a number and controls how much restriction we place on the \(a\) values.
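
For intuition, ridge regression also has a closed form: the penalty just adds \(\lambda\) to the diagonal of the normal equations. A minimal sketch on made-up data (Xdemo and ydemo are hypothetical, and this is separate from the MLJ machinery used below):

using LinearAlgebra

# Hypothetical data: 100 observations, 3 features with known coefficients
Xdemo = randn(100, 3)
ydemo = Xdemo * [1.0, -2.0, 0.5] .+ 0.1 .* randn(100)

lambda = 0.1
a_ols   = (Xdemo' * Xdemo) \ (Xdemo' * ydemo)              # ordinary least squares
a_ridge = (Xdemo' * Xdemo + lambda * I) \ (Xdemo' * ydemo) # ridge shrinks the coefficients towards zero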

To do ridge regression in Julia I’ll be leaning on the MLJ.jl framework and using that to build out the learning machines.

using MLJ

@load RidgeRegressor pkg=MLJLinearModels

We will take the confounded dataset (so the data where the alcohol column is deleted), partition it into train and test sets, and get started with some regularisation.

y, X = unpack(cleanDataConfounded, ==(:quality); rng=123);

train, test = partition(eachindex(y), 0.7, shuffle=true)

mdl = MLJLinearModels.RidgeRegressor()
RidgeRegressor(
  lambda = 1.0, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

We can see the hyperparameter lambda is initialised to 1.

Basic Ridge Regression

We want to know the optimal \(\lambda\) value so will use cross-validation to train the model on one set of data and verify on a hold-out set before repeating. This is all simple in MLJ.jl, we define a grid of penalisations between 0 and 1 and fit the regression using cross-validation across the different lambdas. We are optimising for the best \(R^2\) value.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=train, verbosity=0)
report(lmTunedMachine).best_model
RidgeRegressor(
  lambda = 0.020100502512562814, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

The best value of \(\lambda\) is 0.0201. When we plot the \(R^2\) against the \(\lambda\) values there isn’t that much of a change, just a minor inflection around the small values.

plot(lmTunedMachine)

R2 and lambda for basic ridge regression

Let’s save those parameters. This will be our basic ridge regression result that the other technique builds off.

res = fitted_params(lmTunedMachine).best_fitted_params.coefs

ridgeParams = DataFrame(res)
ridgeParams = hcat(ridgeParams, DataFrame(Model = "Ridge", alcohol=NaN))
ridgeParams
1×12 DataFrame
Row fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates Model alcohol
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String Float64
1 0.190892 -0.157286 0.0410523 0.117846 -0.142458 0.0374597 -0.153419 -0.29919 0.0375852 0.232461 Ridge NaN

Implementing Causal Regularisation

The main result from the paper is that we first need to estimate the confounding effect \(\beta\) and then choose a penalisation factor \(\lambda\) such that the squared length of the regularised coefficients satisfies

\[|| a_{\text{ridge}} || ^2 \approx (1-\beta) || a_{\text{OLS}} || ^2\]

So the \(L_2\) norm of the ridge parameters can only be a fraction of the unregularised norm. In the 2nd paper, they estimate \(\beta\) to be 0.8. For us, we can use the above grid search, calculate the norm of the parameters, and find which ones satisfy this criterion.

So we iterate through the above results of the grid search and calculate the \(L_2\) norm of the parameters.

mdls = report(lmTunedMachine).history

l = zeros(length(mdls))
a = zeros(length(mdls))

for (i, mdl) in enumerate(mdls)
    l[i] = mdl.model.lambda
    # Refit at this lambda and record the squared L2 norm of the coefficients
    a[i] = sum(map( x-> x[2], fitted_params(fit!(machine(mdl.model, X, y))).coefs) .^2)
end

Plotting the results gives us a visual idea of how the penalisation works. Larger values of \(\lambda\) mean the model parameters are more and more restricted.

inds = sortperm(l)
l = l[inds]
a = a[inds]

mdlsSorted = report(lmTunedMachine).history[inds]

scatter(l, a, label = :none)
hline!([(1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)], label = "Target Length", xlabel = "Lambda", ylabel = "a Length")

Coefficient norm against lambda, with the target length marked

We search the lengths for the one closest to the target length and save those parameters.

targetLength = (1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)
ind = findfirst(x-> x < targetLength, a)

res = fitted_params(fit!(machine(mdlsSorted[ind].model, X, y))).coefs

finalParams = DataFrame(res)
finalParams = hcat(finalParams, DataFrame(Model = "With Beta", alcohol=NaN))
finalParams
1×12 DataFrame
Row fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates Model alcohol
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String Float64
1 0.0521908 -0.139099 0.0598797 0.0377729 -0.0786037 0.00654776 -0.0856938 -0.124057 0.00682623 0.11735 With Beta NaN

What if we don’t want to calculate the confounding effect?

Now the code to calculate \(\beta\) isn’t the easiest or most straightforward to implement (hence why I took their estimate). Instead, we could take the approach from Better AB Testing via Causal Regularisation and use the test set to optimise the penalisation parameter \(\lambda\) and then use that value when training the model on the train set.

Applying this method to the wine dataset isn’t a true replication of their paper, as their test and train data sets are instead two data sets, one with bias and one without like you might observe from an AB test. So it’s more of a demonstration of the method rather than a direct comparison to the Janzing method.

Again, MLJ makes this simple, we just fit the machine using the test rows to produce the best-fitting model.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=test, verbosity=0)
plot(lmTunedMachine)

R2 and lambda for basic ridge regression on the test set

report(lmTunedMachine).best_model
RidgeRegressor(
  lambda = 0.010050251256281407, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

Our best \(\lambda\) is 0.01 so we retrain the same machine, this time using the training rows.

res2 = fit!(machine(report(lmTunedMachine).best_model, X, y), rows=train)

Again, saving these parameters down leaves us with the three regularisation approaches plus the two OLS baselines, giving five sets of parameters in total.

finalParams2 = DataFrame(fitted_params(res2).coefs)
finalParams2 = hcat(finalParams2, DataFrame(Model = "No Beta", alcohol=NaN))

allParams = vcat([olsParams, olsParamsConf, ridgeParams, finalParams, finalParams2]...)
allParams
5×12 DataFrame
Row alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH residual sugar sulphates total sulfur dioxide volatile acidity Model
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String
1 0.294335 -0.088211 -0.0355637 -0.0337477 0.043511 0.0456202 -0.0638624 0.0230259 0.155325 -0.107389 -0.194027 OLS
2 NaN -0.107215 0.00912711 -0.355576 0.299551 0.0394281 0.0965662 0.133781 0.213697 -0.128248 -0.176182 OLS No Alcohol
3 NaN -0.142458 0.0410523 -0.29919 0.190892 0.0374597 0.0375852 0.117846 0.232461 -0.153419 -0.157286 Ridge
4 NaN -0.0786037 0.0598797 -0.124057 0.0521908 0.00654776 0.00682623 0.0377729 0.11735 -0.0856938 -0.139099 With Beta
5 NaN -0.141766 0.031528 -0.323596 0.222812 0.03869 0.048907 0.127026 0.23961 -0.153488 -0.157603 No Beta

Which method has done the best at uncovering the confounded relationship?

Relative Squared Error

We have our different estimates of the parameters of the model, and we now want to compare these to the ‘true’ unconfounded variables to see whether we have recovered the correct values. To do this we calculate the squared difference and normalise by the overall \(L_2\) norm of the fitted parameters.

In practice, this just means we are comparing how far the fitted parameters are away from the true (unconfounded) model parameters.
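
Written out, the error computed below for each model is

\[e = \frac{\sum_i \left( a_i^{\text{OLS}} - \hat{a}_i \right)^2}{\sum_i \hat{a}_i^2}\]

where the \(a^{\text{OLS}}\) are the coefficients from the full regression (with alcohol included) and the \(\hat{a}\) are the coefficients of the model being scored, with the alcohol term excluded from both sums.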

allParamsLong = stack(allParams, Not(:Model))
trueParams = select(@subset(allParamsLong, :Model .== "OLS"), Not(:Model))
rename!(trueParams, ["variable", "truth"])
allParamsLong = leftjoin(allParamsLong, trueParams, on = :variable)
errorRes = @combine(groupby(@subset(allParamsLong, :variable .!= "alcohol"), :Model), 
         :a = sum((:truth .- :value) .^2),
         :a2 = sum(:value .^ 2))
errorRes = @transform(errorRes, :e = :a ./ :a2)
sort(errorRes, :e)
5×4 DataFrame
 Row │ Model           a          a2         e
     │ String          Float64    Float64    Float64
─────┼───────────────────────────────────────────────
   1 │ OLS             0.0        0.0920729  0.0
   2 │ With Beta       0.0291038  0.0698576  0.416616
   3 │ Ridge           0.129761   0.266952   0.486085
   4 │ No Beta         0.157667   0.301286   0.523314
   5 │ OLS No Alcohol  0.213692   0.349675   0.611116

Using the \(\beta\) estimation method gives the best model (smallest \(e\)), which lines up with the paper, and the magnitude of the error is also in line with the paper (they had 0.35 and 0.45 for Lasso/ridge regression respectively).
The ridge regression and No Beta methods also improved on the naive OLS-without-alcohol approach, which indicates that there is some benefit from using these methods. The No Beta method is not a faithful reproduction of the Better AB Testing paper because it requires the ‘test’ dataset to come from an AB test scenario, which we don’t have here, so that might explain why the values don’t quite line up.

All methods improve on the naive ‘OLS No Alcohol’ parameters though, which shows this approach to causal regularisation can uncover better models if you have underlying confounding in your data.

Summary

We are always stuck with the data we are given and most of the time can’t collect more to try and uncover more relationships. Causal regularisation gives us a chance to use normal machine learning techniques to build better causal relationships by guiding what the regularisation parameters should be and using that to restrict the overall parameters. When we can estimate the expected confounding value \(\beta\) we get the best results, but regular ridge regression and the Webster-Westray method also provide an improvement on just doing a naive regression.
So whilst reducing overfitting is the main reason for doing regularisation, it also brings with it some causal benefits and lets you get closer to the true relationships between the variables.

Another Causal Post

I’ve written about causal analysis techniques before with Double Machine Learning – An Easy Introduction. This is another way of building causal models.

Free Finance Data Sets for the Quants

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/11/25/Free-Finance-Datasets-for-Quants.html

Now and then I am asked how to get started in quant finance and
my advice has always been to just get hold of some data and play about
with different models. Getting that data is the first step, and this post takes you
through several different sources, hopefully giving you the
launchpad to start poking around with financial data.


I’ve tried to cover different assets and frequencies to hopefully
inspire the various types of quant finance
out there.

High-Frequency FX Market Data

My day-to-day job is in FX so naturally, that’s where I think all the
best data can be found. TrueFX provides
tick-by-tick quotes with millisecond timestamps, so high-frequency data is available for free and across lots of different currencies.
So if you are interested in working out how to deal with large amounts
of data efficiently (1 month of EURUSD is 600MB), this source is a
good place to start.

As a demo, I’ve downloaded the USDJPY October dataset.

using CSV, DataFrames, DataFramesMeta, Dates, Statistics
using Plots

It’s a big CSV file, so this isn’t the best way to store the data;
instead, stick it into a database like QuestDB
that is made for time series data.

usdjpy = CSV.read("USDJPY-2023-10.csv", DataFrame,
                 header = ["Ccy", "Time", "Bid", "Ask"])
usdjpy.Time = DateTime.(usdjpy.Time, dateformat"yyyymmdd HH:MM:SS.sss")
first(usdjpy, 4)
4×4 DataFrame
 Row │ Ccy      Time                     Bid      Ask
     │ String7  DateTime                 Float64  Float64
─────┼────────────────────────────────────────────────────
   1 │ USD/JPY  2023-10-01T21:04:56.931  149.298  149.612
   2 │ USD/JPY  2023-10-01T21:04:56.962  149.298  149.782
   3 │ USD/JPY  2023-10-01T21:04:57.040  149.589  149.782
   4 │ USD/JPY  2023-10-01T21:04:58.201  149.608  149.782

It’s simple data, just a bid and ask price with a time stamp.

usdjpy = @transform(usdjpy, :Spread = :Ask .- :Bid, 
                            :Mid = 0.5*(:Ask .+ :Bid), 
                            :Hour = round.(:Time, Minute(10)))

usdjpyHourly = @combine(groupby(usdjpy, :Hour), :open = first(:Mid), :close = last(:Mid), :avg_spread = mean(:Spread))
usdjpyHourly.Time = Time.(usdjpyHourly.Hour)

plot(usdjpyHourly.Hour, usdjpyHourly.open, lw =1, label = :none, title = "USDJPY Price Over October")

Looking at the hourly price over the month gives you flat periods
over the weekend.

USDJPY October price chart

Let’s look at the average spread (ask – bid) throughout the day.

hourlyAvgSpread = sort(@combine(groupby(usdjpyHourly, :Time), :avg_spread = mean(:avg_spread)), :Time)

plot(hourlyAvgSpread.Time, hourlyAvgSpread.avg_spread, lw =2, title = "USDJPY Intraday Spread", label = :none)

USDJPY average intraday spread

We see a big spike at 10 pm because of the day roll, when the
secondary markets go offline briefly, which pollutes the data a
bit. Looking at just midnight to 8 pm gives a more indicative picture.

plot(hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].Time, 
     hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].avg_spread, label = :none, lw=2,
     title = "USDJPY Intraday Spread")

USDJPY average intraday spread zoomed

In October spreads have generally been wider in the later part of the
day compared to the morning.

There is much more that can be done with this data across the
different currencies though. For example:

  1. How stable are correlations across currencies at different time frequencies?
  2. Can you replicate my microstructure noise post? How does the microstructure noise change between currencies?
  3. Price updates are irregular, so what are their statistical properties? (See the sketch below for a starting point.)
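
As a quick starting point for the last idea, here is a sketch (using the usdjpy frame loaded above) of the distribution of times between consecutive quote updates:

using Dates, Plots

# Inter-arrival times between quote updates, in milliseconds
interArrival = Dates.value.(diff(usdjpy.Time))
histogram(log10.(interArrival .+ 1), label = :none,
          xlabel = "log10(ms between updates)", title = "USDJPY Inter-Update Times")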

Daily Futures Market Data

Let’s zoom out a little bit now, decrease the frequency, and widen the
asset pool. Futures cover many asset classes: oil, coal, currencies,
metals, agriculture, stocks, bonds, interest rates, and probably
something else I’ve missed. This data is daily and roll-adjusted, so
you have a continuous time series of an asset for many years. This means you can look at the classic momentum/mean reversion portfolio models and have a real stab at long-term trends.

The data is part of the Nasdaq Data Link product (formerly Quandl)
and once you sign up for an account you have access to the free
data. This futures dataset is
Wiki Continuous Futures
and after about 50 clicks, logging in, re-logging in, and a few 2FA codes
you can view the pages.

To get the data you can go through one of the API packages in
your favourite language. In Julia, this means the QuandlAccess.jl
package which keeps things simple.

using QuandlAccess

futuresMeta = CSV.read("continuous.csv", DataFrame)
futuresCodes = futuresMeta[!, "Quandl Code"] .* "1"

quandl = Quandl("QUANDL_KEY")

function get_data(code)
    futuresData = quandl(TimeSeries(code))
    futuresData.Code .= code
    futuresData
end
futureData = get_data.(rand(futuresCodes, 4));

We have an array of all the available contracts futuresCodes and
sample 4 of them randomly to see what the data looks like.

p = []
for df in futureData
    push!(p, plot(df.Date, df.Settle, label = df.Code[1]))  # one panel per contract
end

plot(p..., layout = 4)

Futures examples

  • ABY – WTI Brent Bullet – Spread between two oil futures on different
    exchanges.
  • TZ6 – Transco Zone 6 Non-N.Y. Natural Gas (Platts IFERC) Basis – Spread between
    two different natural gas contracts
  • PG – PG&E Citygate Natural Gas (Platts IFERC) Basis – Again, spread between
    two different natural gas contracts
  • FMJP – MSCI Japan Index – Index containing Japanese stocks

I’ve managed to randomly select 3 energy futures and one stock
index.

Project ideas with this data:

  1. Cross-asset momentum and mean reversion.
  2. Cross-asset correlations: does the price of oil drive some equity indexes? (See the sketch after this list.)
  3. Macro regimes, can you pick out commonalities of market factors over
    the years?
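
For the correlation idea, here is a rough sketch. It assumes the two sampled contracts overlap in dates, and it uses simple daily changes of the settle price rather than log returns, since some of these contracts are spreads that can go negative:

using DataFrames, Statistics

# Correlation of daily settle-price changes between two of the sampled contracts
a, b = futureData[1], futureData[2]
joined = dropmissing(innerjoin(select(a, :Date, :Settle => :SettleA),
                               select(b, :Date, :Settle => :SettleB), on = :Date))
sort!(joined, :Date)
cor(diff(joined.SettleA), diff(joined.SettleB))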

Equity Order Book Data

Out there in the wild is the FI2010 dataset which is essentially a
sample of the full order book for five different
stocks on the Nordic stock exchange for 10 days. You have 10 levels of
prices and volumes and so can reconstruct the order book throughout the
day. It is the benchmark dataset for limit order book prediction and you will see it referenced
in papers that are trying to implement new prediction models. For example, Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods references some basic methods on the dataset and how they perform when predicting the mid-price.

I found the dataset (as a Python package) here
https://github.com/simaki/fi2010 but it’s just stored as a CSV which
you can lift easily.

fi2010 = CSV.read(download("https://raw.githubusercontent.com/simaki/fi2010/main/data/data.csv"),DataFrame);

Update on 7/01/2024

Since posting this the above link has gone offline and the user has
deleted their GitHub account! Instead, the data set can be found here:
https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data.
I’ve not verified if it’s in the same format, so there might be some
additional work going from the raw data to how this blog post sets it
up. Thanks to the commenters below for pointing this out.

The data is wide (each column is a depth level of the price and
volume) so I turn it into a long data set and add the variable, level,
and side as new columns.

fi2010Long = stack(fi2010, 4:48, [:Column1, :STOCK, :DAY])
fi2010Long = @transform(fi2010Long, :a = collect.(eachsplit.(:variable, "_")))
fi2010Long = @transform(fi2010Long, :var = first.(:a), :level = last.(:a), :side = map(x->x[2], :a))
fi2010Long = @transform(groupby(fi2010Long, [:STOCK, :DAY]), :Time = collect(1:length(:Column1)))
first(fi2010Long, 4)

The ‘book depth’ is the sum of the liquidity available at all the
levels and indicates how easy it is to trade the stock. As a
quick example, we can take the average of each stock per day and use
that as a proxy for the ease of trading these stocks.

intraDayDepth = @combine(groupby(fi2010Long, [:STOCK, :DAY, :var]), :avgDepth = mean(:value))
intraDayDepth = @subset(intraDayDepth, :var .== "VOLUME");
plot(intraDayDepth.DAY, intraDayDepth.avgDepth, group=intraDayDepth.STOCK, 
     marker = :circle, title = "Avg Daily Book Depth - FI2010")

FI2010 Intraday book depth

Stocks 3 and 4 have the highest average depth, so they are most likely the
easiest to trade, whereas Stock 1 has the thinnest depth. Stock 2 has
an interesting switch between liquid and not liquid.

So if you want to look beyond top-of-book data, this dataset provides
the extra level information needed and is closer to what a
professional shop is using. Better than trying to predict daily Yahoo
finance mid-prices with neural nets at least.

Build Your Own Crypto Datasets

If you want to take a further step back, building the tools that take
in streaming data directly from the exchanges and save it into a
database is another way to build out your technical capabilities. This
means you have full control over what you download and save. Do you
want just the top of book every update, the full depth of the book, or
just the reported trades?
I’ve written about this before, Getting Started with High Frequency Finance using Crypto Data and Julia, and learned a lot in the
process. Doing things this way means you fully understand the data you
are saving and any additional quirks around the process.

Conclusion

Plenty to get stuck into and learn from. Getting the data and loading
it into an environment is always the first challenge, and learning how
to do that with all these different types of data should help you
understand what these types of jobs entail.