Author Archives: Dean Markwick's Blog -- Julia

Exploring Causal Regularisation

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/12/28/Exploring-Causal-Regularisation.html

A good prediction model isn’t necessarily a good causal model. You could be missing a key variable in your dataset that is driving the underlying behavior so you end up with a good predictive model but not the correct explanation as to why things behave that way. Taking a causal approach is a tougher problem and needs an understanding of whether we have access to the right variables or we are making the right link between variables and an outcome. Causal regularisation is a method that uses machine learning techniques (regularisation!) to try and produce models that can be interpreted causally.


Enjoy these types of posts? Then you should sign up for my newsletter.


Regularisation is normally taught as a method to reduce overfitting, you have a big model and you make it smaller by shrinking some of the factors. Work by Janzing (papers below) argues that this can help produce better causal models too and in this blog post I will work through two papers to try and understand the process better.

I’ll work off two main papers for causal regularisation:

  1. Causal Regularisation
  2. Detecting non-causal artifacts in multivariate linear regression models

In truth, I am working backward. I first encountered causal regularisation in Better AB testing via Causal Regularisation where it uses causal regularisation to produce better estimates by combining a biased and an unbiased dataset. I want to take a step back and understand casual regularisation from the original papers. Using free data from the UCI Machine Learning Repository we can attempt to replicate the methods from the papers and see how causal regularisation works to produce better causal models.

As ever, I’m in Julia (1.9), so fire up that notebook and follow along.

using CSV, DataFrames, DataFramesMeta
using Plots
using GLM, Statistics

Wine Tasting Data

The wine-quality dataset from the UCI repository provides measurements of the chemical properties of wine and a quality rating from someone drinking the wine. It’s a simple CSV file that you can download (winequality) and load with minimal data wrangling needed.

We will be working with the red wine data set as that’s what both Janzing papers use.

rawData = CSV.read("wine+quality/winequality-red.csv", DataFrame)
first(rawData)

APD! Always Plotting the Data to make sure the values are something you expect. Sometimes you need a visual confirmation that things line up with what you believe.

plot(scatter(rawData.alcohol, rawData.quality, title = "Alcohol", label = :none, color="#eac435"),
     scatter(rawData.pH, rawData.quality, title = "pH", label = :none, color="#345995"),
     scatter(rawData.sulphates, rawData.quality, title= "Sulphates", label = :none, color="#E40066"),
     scatter(rawData.density, rawData.quality, title = "Density", label = :none, color="#03CEA4"), ylabel = "Quality")

Wine quality variable relationships

By choosing four of the variables randomly we can see that some are correlated with quality and some are not.

A loose goal is to come up with a causal model that can explain the quality of the wine using the provided factors. We will change the data slightly to highlight how causal regularisation helps, but for now, let’s start with the simple OLS model.

In the paper they normalise the variables to be unit variance, so we divide by the standard deviation.
We then model the quality of the wine using all the available variables.

vars = names(rawData, Not(:quality))

cleanData = deepcopy(rawData)

for var in filter(!isequal("White"), vars)
    cleanData[!, var] = cleanData[!, var] ./ std(cleanData[!, var])
end

cleanData[!, :quality] .= Float64.(cleanData[!, :quality])

ols = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanData)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates + alcohol

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                           Coef.  Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           21.9652     21.1946      1.04    0.3002  -19.6071      63.5375
fixed acidity          0.043511    0.0451788   0.96    0.3357   -0.0451055    0.132127
volatile acidity      -0.194027    0.0216844  -8.95    <1e-18   -0.23656     -0.151494
citric acid           -0.0355637   0.0286701  -1.24    0.2150   -0.0917989    0.0206716
residual sugar         0.0230259   0.0211519   1.09    0.2765   -0.0184626    0.0645145
chlorides             -0.088211    0.0197337  -4.47    <1e-05   -0.126918    -0.0495041
free sulfur dioxide    0.0456202   0.0227121   2.01    0.0447    0.00107145   0.090169
total sulfur dioxide  -0.107389    0.0239718  -4.48    <1e-05   -0.154409    -0.0603698
density               -0.0337477   0.0408289  -0.83    0.4086   -0.113832     0.0463365
pH                    -0.0638624   0.02958    -2.16    0.0310   -0.121883    -0.00584239
sulphates              0.155325    0.019381    8.01    <1e-14    0.11731      0.19334
alcohol                0.294335    0.0282227  10.43    <1e-23    0.238977     0.349693
────────────────────────────────────────────────────────────────────────────────────────

The dominant factor is the alcohol amount which is the strongest variable in predicting the quality, i.e. higher quality has a higher alcohol content. We also note that 5 out of the 12 variables are deemed insignificant at the 5% level. We save these parameters and then look at the regression without the alcohol variable.

olsParams = DataFrame(Dict(zip(vars, coef(ols)[2:end])))
olsParams[!, :Model] .= "OLS"
olsParams
1×12 DataFrame
Row alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH residual sugar sulphates total sulfur dioxide volatile acidity Model
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String
1 0.294335 -0.088211 -0.0355637 -0.0337477 0.043511 0.0456202 -0.0638624 0.0230259 0.155325 -0.107389 -0.194027 OLS
cleanDataConfounded = select(cleanData, Not(:alcohol))
vars = names(cleanDataConfounded, Not(:quality))

confoundOLS = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanDataConfounded)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
                             Coef.  Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           189.679       14.2665      13.30    <1e-37  161.696       217.662
fixed acidity           0.299551     0.0391918    7.64    <1e-13    0.222678      0.376424
volatile acidity       -0.176182     0.0223382   -7.89    <1e-14   -0.219997     -0.132366
citric acid             0.00912711   0.0292941    0.31    0.7554   -0.0483321     0.0665863
residual sugar          0.133781     0.0189031    7.08    <1e-11    0.0967031     0.170858
chlorides              -0.107215     0.0203052   -5.28    <1e-06   -0.147043     -0.0673877
free sulfur dioxide     0.0394281    0.023462     1.68    0.0931   -0.00659172    0.0854479
total sulfur dioxide   -0.128248     0.0246854   -5.20    <1e-06   -0.176668     -0.0798287
density                -0.355576     0.0276265  -12.87    <1e-35   -0.409765     -0.301388
pH                      0.0965662    0.0261087    3.70    0.0002    0.0453551     0.147777
sulphates               0.213697     0.0191745   11.14    <1e-27    0.176087      0.251307
───────────────────────────────────────────────────────────────────────────────────────────

citric acid and free sulfur dioxide are now the only insignificant variables, the rest are believed to contribute to the quality. This means we are experiencing confounding as alcohol is the better explainer but the effect of alcohol is now hiding behind these other variables.

Confounding – When a variable influences other variables and the outcome at the same time leading to an incorrect view on the correlation between the variables and outcomes.

This regression after dropping the alcohol variable is incorrect and provides the wrong causal conclusion. So can we do better and get closer to the true regression coefficients using some regularisation methods?

For now, we save these incorrect parameters and explore the causal regularisation methods.

olsParamsConf = DataFrame(Dict(zip(vars, coef(confoundOLS)[2:end])))
olsParamsConf[!, :Model] .= "OLS No Alcohol"
olsParamsConf[!, :alcohol] .= NaN

olsParamsConf
1×12 DataFrame
Row chlorides citric acid density fixed acidity free sulfur dioxide pH residual sugar sulphates total sulfur dioxide volatile acidity Model alcohol
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String Float64
1 -0.107215 0.00912711 -0.355576 0.299551 0.0394281 0.0965662 0.133781 0.213697 -0.128248 -0.176182 OLS No Alcohol NaN

Regularisation and Regression

Some maths. Regression is taking our variables \(X\) and finding the parameters \(a\) that get us closest to \(Y\).

\[Y = a X\]

\(X\) is a matrix, and \(a\) is a vector. When we fit this to some data, the values of \(a\) are free to converge to any value they want, so long as it gets close to the outcome variable. This means we are minimising the difference between \(Y\) and \(X\)

\[||(Y – a X)|| ^2.\]

Regularisation is the act of restricting the values \(a\) can take.

For example, we can make the sum of all the \(a\)’s equal to a constant (L_1 regularisation), or the sum of the square of the $a$ values equal a constant (L_2 regularisation).
In simpler terms, if we want to increase the coefficient of one parameter, we need to reduce the parameter of a different term. Think of there being a finite amount of mass that we can allocate to the parameters, they can’t take on whatever value they like, but instead need to regulate amongst themselves. This helps reduce overfitting as it constrains how much influence a parameter can have and the final result should converge to a model that doesn’t overfit.

In ridge regression we are minimising the \(L_2\) norm, so restricting the sum of the square of the \(a\)’s and at the same time minimising the original OLS regression.

\[||(Y – a X)|| ^2 – \lambda || a || ^2.\]

So we can see how regularisation is an additional component of OLS regression. \(\lambda\) is a hyperparameter that is just a number and controls how much restriction we place on the \(a\) values.

To do ridge regression in Julia I’ll be leaning on the MLJ.jl framework and using that to build out the learning machines.

using MLJ

@load RidgeRegressor pkg=MLJLinearModels

We will take the confounded dataset (so the data where the alcohol column is deleted), partition it into train and test sets, and get started with some regularisation.

y, X = unpack(cleanDataConfounded, ==(:quality); rng=123);

train, test = partition(eachindex(y), 0.7, shuffle=true)

mdl = MLJLinearModels.RidgeRegressor()
RidgeRegressor(
  lambda = 1.0, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

Can see the hyperparameter lambda is initialised to 1.

Basic Ridge Regression

We want to know the optimal \(\lambda\) value so will use cross-validation to train the model on one set of data and verify on a hold-out set before repeating. This is all simple in MLJ.jl, we define a grid of penalisations between 0 and 1 and fit the regression using cross-validation across the different lambdas. We are optimising for the best \(R^2\) value.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=train, verbosity=0)
report(lmTunedMachine).best_model
RidgeRegressor(
  lambda = 0.020100502512562814, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

The best value of \(\lambda\) is 0.0201. When we plot the \(R^2\) vs the \(\lambda\) values there isn’t that much of a change just a minor inflection around the small ones.

plot(lmTunedMachine)

R2 and lambda for basic ridge regression

Let’s save those parameters. This will be our basic ridge regression result that the other technique builds off.

res = fitted_params(lmTunedMachine).best_fitted_params.coefs

ridgeParams = DataFrame(res)
ridgeParams = hcat(ridgeParams, DataFrame(Model = "Ridge", alcohol=NaN))
ridgeParams
1×12 DataFrame
Row fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates Model alcohol
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String Float64
1 0.190892 -0.157286 0.0410523 0.117846 -0.142458 0.0374597 -0.153419 -0.29919 0.0375852 0.232461 Ridge NaN

Implementing Causal Regularisation

The main result from the paper is that we first need to estimate the confounding effect \(\beta\) and then choose a penalisation factor \(\lambda\) that satisfies

\[(1-\beta) || a || ^ 2\]

So the \(L_2\) norm of the ridge parameters can only be so much. In the 2nd paper, they estimate \(\beta\) to be 0.8. For us, we can use the above grid search, calculate the norm of the parameters, and find which ones satisfy those criteria.

So iterate through the above results of the grid search, and calculate the L2 norm of the parameters.

mdls = report(lmTunedMachine).history

l = zeros(length(mdls))
a = zeros(length(mdls))

for (i, mdl) in enumerate(mdls)
    l[i] = mdl.model.lambda
    a[i] = sum(map( x-> x[2], fitted_params(fit!(machine(mdl.model, X, y))).coefs) .^2)
end

Plotting the results gives us a visual idea of how the penalisation works. Larger values of \(\lambda\) mean the model parameters are more and more restricted.

inds = sortperm(l)
l = l[inds]
a = a[inds]

mdlsSorted = report(lmTunedMachine).history[inds]

scatter(l, a, label = :none)
hline!([(1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)], label = "Target Length", xlabel = "Lambda", ylabel = "a Length")

R2 and lambda for basic ridge regression with target length

We search the lengths for the one closest to the target length and save those parameters.

targetLength = (1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)
ind = findfirst(x-> x < targetLength, a)

res = fitted_params(fit!(machine(mdlsSorted[ind].model, X, y))).coefs

finalParams = DataFrame(res)
finalParams = hcat(finalParams, DataFrame(Model = "With Beta", alcohol=NaN))
finalParams
1×12 DataFrame
Row fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates Model alcohol
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String Float64
1 0.0521908 -0.139099 0.0598797 0.0377729 -0.0786037 0.00654776 -0.0856938 -0.124057 0.00682623 0.11735 With Beta NaN

What if we don’t want to calculate the confounding effect?

Now the code to calculate \(\beta\) isn’t the easiest or straightforward to implement (hence why I took their estimate). Instead, we could take the approach from Better AB Testing via Causal Regularisation and use the test set to optimise the penalisation parameter \(\lambda\) and then use that value when training the model on the train set.

Applying this method to the wine dataset isn’t a true replication of their paper, as their test and train data sets are instead two data sets, one with bias and one without like you might observe from an AB test. So it’s more of a demonstration of the method rather than a direct comparison to the Janzing method.

Again, MLJ makes this simple, we just fit the machine using the test rows to produce the best-fitting model.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=test, verbosity=0)
plot(lmTunedMachine)

R2 and lambda for basic ridge regression on the test set

report(lmTunedMachine).best_model
RidgeRegressor(
  lambda = 0.010050251256281407, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

Our best \(\lambda\) is 0.01 so we retrain the same machine, this time using the training rows.

res2 = fit!(machine(report(lmTunedMachine).best_model, X, y), rows=train)

Again saving these parameters down leaves us with three methods and three sets of parameters.

finalParams2 = DataFrame(fitted_params(res2).coefs)
finalParams2 = hcat(finalParams2, DataFrame(Model = "No Beta", alcohol=NaN))

allParams = vcat([olsParams, olsParamsConf, ridgeParams, finalParams, finalParams2]...)
allParams
5×12 DataFrame
Row alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH residual sugar sulphates total sulfur dioxide volatile acidity Model
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String
1 0.294335 -0.088211 -0.0355637 -0.0337477 0.043511 0.0456202 -0.0638624 0.0230259 0.155325 -0.107389 -0.194027 OLS
2 NaN -0.107215 0.00912711 -0.355576 0.299551 0.0394281 0.0965662 0.133781 0.213697 -0.128248 -0.176182 OLS No Alcohol
3 NaN -0.142458 0.0410523 -0.29919 0.190892 0.0374597 0.0375852 0.117846 0.232461 -0.153419 -0.157286 Ridge
4 NaN -0.0786037 0.0598797 -0.124057 0.0521908 0.00654776 0.00682623 0.0377729 0.11735 -0.0856938 -0.139099 With Beta
5 NaN -0.141766 0.031528 -0.323596 0.222812 0.03869 0.048907 0.127026 0.23961 -0.153488 -0.157603 No Beta

What method has done the best at uncovering the confounded relationship?

Relative Squared Error

We have our different estimates of the parameters of the model, we now want to compare these to the ‘true’ unconfounded variables and see whether we have recovered the correct variables. To do this we calculate the square difference and normalise by the overall \(L_2\) norm of the parameters.

In practice, this just means we are comparing how far the fitted parameters are away from the true (unconfounded) model parameters.

allParamsLong = stack(allParams, Not(:Model))
trueParams = select(@subset(allParamsLong, :Model .== "OLS"), Not(:Model))
rename!(trueParams, ["variable", "truth"])
allParamsLong = leftjoin(allParamsLong, trueParams, on = :variable)
errorRes = @combine(groupby(@subset(allParamsLong, :variable .!= "alcohol"), :Model), 
         :a = sum((:truth .- :value) .^2),
         :a2 = sum(:value .^ 2))
errorRes = @transform(errorRes, :e = :a ./ :a2)
sort(errorRes, :e)
5×4 DataFrame
Row Model a a2 e
String Float64 Float64 Float64
1 OLS 0.0 0.0920729 0.0
2 With Beta 0.0291038 0.0698576 0.416616
3 Ridge 0.129761 0.266952 0.486085
4 No Beta 0.157667 0.301286 0.523314
5 OLS No Alcohol 0.213692 0.349675 0.611116

Using the \(\beta\) estimation method gives the best model (smallest \(e\)), which lines up with the paper and the magnitude of error is also inline with the paper (they had 0.35 and 0.45 for Lasoo/ridge regression respectively).
The ridge regression and no beta method also improved on the naive OLS approach, so that indicates that there is some improvement from using these methods. The No Beta method is not a faithful reproduction of the Better AB testing paper because it requires the ‘test’ dataset to be an AB test scenario, which we don’t have from the above, so that might explain why the values don’t quite line up.

All methods improve on the naive ‘OLS No Alcohol’ parameters though, which shows this approach to causal regularisation can uncover better models if you have underlying confounding in your data.

Summary

We are always stuck with the data we are given and most of the time can’t collect more to try and uncover more relationships. Causal regularisation gives us a chance to use normal machine learning techniques to build better causal relationships by guiding what the regularisation parameters should be and using that to restrict the overall parameters. When we can estimate the expected confounding value \(\beta\) we get the best results, but regular ridge regression and the Webster-Westray method also provide an improvement on just doing a naive regression.
So whilst overfitting is the main driver for doing regularisation it also brings with it some causal benefits and lets you understand true relationships between variables in a truer sense.

Another Causal Post

I’ve written about causal analysis techniques before with Double Machine Learning – An Easy Introduction. This is another way of building causal models.

Free Finance Data Sets for the Quants

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/11/25/Free-Finance-Datasets-for-Quants.html

Now and then I am asked how to get started in quant finance and
my advice has always been to just get hold of some data and play about
with different models. The first step is to get some data and this post takes you
through several different sources and hopefully gives you the
launchpad to start poking around with financial data.


Enjoy these types of posts? Then you should sign up for my newsletter.


I’ve tried to cover different assets and frequencies to hopefully
inspire the various types of quant finance
out there.

High-Frequency FX Market Data

My day-to-day job is in FX so naturally, that’s where I think all the
best data can be found. TrueFX provides
tick-by-tick in milliseconds, so high-frequency data is available for free and across lots of different currencies.
So if you are interested in working out how to deal with large amounts
of data (1 month of EURUSD is 600MB) efficiently, this source is a
good place to start.

As a demo, I’ve downloaded the USDJPY October dataset.

using CSV, DataFrames, DataFramesMeta, Dates, Statistics
using Plots

It’s a big CSV file, so this isn’t the best way to store the data,
instead, stick it into a database like QuestDB
that are made for time series data.

usdjpy = CSV.read("USDJPY-2023-10.csv", DataFrame,
                 header = ["Ccy", "Time", "Bid", "Ask"])
usdjpy.Time = DateTime.(usdjpy.Time, dateformat"yyyymmdd HH:MM:SS.sss")
first(usdjpy, 4)
4×4 DataFrame
Row Ccy Time Bid Ask
String7 DateTime Float64 Float64
1 USD/JPY 2023-10-01T21:04:56.931 149.298 149.612
2 USD/JPY 2023-10-01T21:04:56.962 149.298 149.782
3 USD/JPY 2023-10-01T21:04:57.040 149.589 149.782
4 USD/JPY 2023-10-01T21:04:58.201 149.608 149.782

It’s simple data, just a bid and ask price with a time stamp.

usdjpy = @transform(usdjpy, :Spread = :Ask .- :Bid, 
                            :Mid = 0.5*(:Ask .+ :Bid), 
                            :Hour = round.(:Time, Minute(10)))

usdjpyHourly = @combine(groupby(usdjpy, :Hour), :open = first(:Mid), :close = last(:Mid), :avg_spread = mean(:Spread))
usdjpyHourly.Time = Time.(usdjpyHourly.Hour)

plot(usdjpyHourly.Hour, usdjpyHourly.open, lw =1, label = :none, title = "USDJPY Price Over October")

Looking at the hourly price over the month gives you flat periods
over the weekend.

USDJPY October price chart

Let’s look at the average spread (ask – bid) throughout the day.

hourlyAvgSpread = sort(@combine(groupby(usdjpyHourly, :Time), :avg_spread = mean(:avg_spread)), :Time)

plot(hourlyAvgSpread.Time, hourlyAvgSpread.avg_spread, lw =2, title = "USDJPY Intraday Spread", label = :none)

USDJPY average intraday spread

We see a big spike at 10 pm because of the day roll and the
secondary markets go offline briefly, which pollutes the data
bit. Looking at just midnight to 8 pm gives a more indicative picture.

plot(hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].Time, 
     hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].avg_spread, label = :none, lw=2,
     title = "USDJPY Intraday Spread")

USDJPY average intraday spread zoomed

In October spreads have generally been wider in the later part of the
day compared to the morning.

There is much more that can be done with this data across the
different currencies though. For example:

  1. How stable are correlations across currencies at different time frequencies?
  2. Can you replicate my microstructure noise post? How does the microstructure noise change between currencies
  3. Price updates are irregular, what are some statistical properties?

Daily Futures Market Data

Let’s zoom out a little bit now, decrease the frequency, and widen the
asset pool. Futures cover many asset classes, oil, coal, currencies,
metals, agriculture, stocks, bonds, interest rates, and probably
something else I’ve missed. This data is daily and roll adjusted, so
you have a continuous time series of an asset for many years. This means you can look at the classic momentum/mean reversion portfolio models and have a real stab at long-term trends.

The data is part of the Nasdaq data link product (formerly Quandl)
and once you sign up for an account you have access to the free
data. This futures dataset is
Wiki Continuous Futures
and after about 50 clicks and logging in, re-logging in, 2FA codes
you can view the pages.

To get the data you can go through one of the API packages in
your favourite language. In Julia, this means the QuandlAccess.jl
package which keeps things simple.

using QuandlAccess

futuresMeta = CSV.read("continuous.csv", DataFrame)
futuresCodes = futuresMeta[!, "Quandl Code"] .* "1"

quandl = Quandl("QUANDL_KEY")

function get_data(code)
    futuresData = quandl(TimeSeries(code))
    futuresData.Code .= code
    futuresData
end
futureData = get_data.(rand(futuresCodes, 4));

We have an array of all the available contracts futuresCodes and
sample 4 of them randomly to see what the data looks like.

p = []
for df in futureData
    append!(p, plot(df.Date, df.Settle, label = df.Code[1]))
end

plot(plot.(p)..., layout = 4)

Futures examples

  • ABY – WTI Brent Bullet – Spread between two oil futures on different
    exchanges.
  • TZ6 – Transco Zone 6 Non-N.Y. Natural Gas (Platts IFERC) Basis – Spread between
    two different natural gas contracts
  • PG – PG&E Citygate Natural Gas (Platts IFERC) Basis – Again, spread between
    two different natural gas contracts
  • FMJP – MSCI Japan Index – Index containing Japanese stocks

I’ve managed to randomly select 3 energy futures and one stock
index.

Project ideas with this data:

  1. Cross-asset momentum and mean reversion.
  2. Cross-asset correlations, does the price of oil drive some equity indexes?
  3. Macro regimes, can you pick out commonalities of market factors over
    the years?

Equity Order Book Data

Out there in the wild is the FI2010 dataset which is essentially a
sample of the full order book for five different
stocks on the Nordic stock exchange for 10 days. You have 10 levels of
prices and volumes and so can reconstruct the order book throughout the
day. It is the benchmark dataset for limit order book prediction and you will see it referenced
in papers that are trying to implement new prediction models. For
example Benchmark Dataset for Mid-Price Forecasting of Limit
Order Book Data with Machine Learning Methods

references some basic methods on the dataset and how they perform when
predicting the mid-price.

I found the dataset (as a Python package) here
https://github.com/simaki/fi2010 but it’s just stored as a CSV which
you can lift easily.

fi2010 = CSV.read(download("https://raw.githubusercontent.com/simaki/fi2010/main/data/data.csv"),DataFrame);

Update on 7/01/2024

Since posting this the above link has gone offline and the user has
deleted their Github account! Instead the data set can be found here:
https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data
. I’ve not verified if its in the same format, so there might be some
additional work going from the raw data to how this blog post sets it
up. Thank’s to the commentators below pointing this out.

The data is wide (each column is a depth level of the price and
volume) so I turn each into a long data set and add the level, side
and variable as a new column.

fi2010Long = stack(fi2010, 4:48, [:Column1, :STOCK, :DAY])
fi2010Long = @transform(fi2010Long, :a = collect.(eachsplit.(:variable, "_")))
fi2010Long = @transform(fi2010Long, :var = first.(:a), :level = last.(:a), :side = map(x->x[2], :a))
fi2010Long = @transform(groupby(fi2010Long, [:STOCK, :DAY]), :Time = collect(1:length(:Column1)))
first(fi2010Long, 4)

The ‘book depth’ is the sum of the liquidity available at all the
levels and indicates how easy it is to trade the stock. As a
quick example, we can take the average of each stock per day and use
that as a proxy for the ease of trading these stocks.

intraDayDepth = @combine(groupby(fi2010Long, [:STOCK, :DAY, :var]), :avgDepth = mean(:value))
intraDayDepth = @subset(intraDayDepth, :var .== "VOLUME");
plot(intraDayDepth.DAY, intraDayDepth.avgDepth, group=intraDayDepth.STOCK, 
     marker = :circle, title = "Avg Daily Book Depth - FI2010")

FI2010 Intraday book depth

Stock 3 and 4 have the highest average depth, so most likely the
easier to trade, whereas Stock 1 has the thinnest depth. Stock 2 has
an interesting switch between liquid and not liquid.

So if you want to look beyond top-of-book data, this dataset provides
the extra level information needed and is closer to what a
professional shop is using. Better than trying to predict daily Yahoo
finance mid-prices with neural nets at least.

Build Your Own Crypto Datasets

If you want to take a further step back then being able to build the
tools that take in streaming data directly from the exchanges and
save that into a database is another way you can build out your
technical capabilities. This means you have full control over what you
download and save. Do you want just the top of book every update, the
full depth of the book, or just the reported trades?
I’ve written about this before, Getting Started with High Frequency Finance using Crypto Data and Julia, and learned a lot in the
process. Doing things this way means you have full control over the entire
process and can fully understand the data you are saving and any
additional quirks around the process.

Conclusion

Plenty to get stuck into and learn from. Being able to get the data
and loading it into an environment is always the first challenge and
learning how to do that with all these different types of data should
help you understand what these types of jobs entail.

Stat Arb – An Easy Walkthrough

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/07/15/Stat-Arb-Walkthrough.html

Statistical arbitrage (stat arb) is a pillar of quantitate trading
that relies on mean reversion to predict the future returns of an
asset. Mean reversion believes that if a stock has risen higher it’s
more likely to revert in the short term which is the opposite of a
momentum strategy that believes if a stock has been rising it will
continue to rise. This blog post will walk you the ‘the’ statistical
arbitrage paper
Statistical Arbitrage in the US Equities Market
apply it to a stock/ETF pair and then look at an intraday crypto stat
arb strategy.


Enjoy these types of posts? Then you should sign up for my newsletter.


I’m using Julia 1.9 and my AlpacaMarkets.jl package gets all the data we
need.

using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots
using RollingFunctions, Statistics
using GLM

To start with we simply want the daily prices of JPM, XLF, and SPY. JPM
is the stock we think will go through mean reversion, XLF is the
financial sector ETF and SPY is the general SPY ETF.

We this that if JPM rises higher than XLF then it will soon revert and
trade lower shortly. Likewise, if JPM falls lower than XLF
then we think it will soon trade higher. Our mean reversion is all
about JPM around XLF. We’ve chosen XLF as it represents the general
financial sector landscape, so will represent the general sector outlook
more consistently than JPM on its own.

jpm = AlpacaMarkets.stock_bars("JPM", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1]
xlf = AlpacaMarkets.stock_bars("XLF", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
spy = AlpacaMarkets.stock_bars("SPY", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];

We want to clean the data to format the date correctly and select the
close and open columns.

function parse_date(t)
   Date(string(((split(t, "T")))[1]))
end

function clean(df, x) 
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
   @select(df, :Date, :c, :o, :Ticker, :NextOpen)
end

Now we calculate the close-to-close log returns and format the data
into a column for each asset.

jpm = clean(jpm, "JPM")
xlf = clean(xlf, "XLF")
spy = clean(spy, "SPY")
allPrices = vcat(jpm, xlf, spy)
allPrices = sort(allPrices, :Date)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))], 
                      :ReturnO = [NaN; diff(log.(:o))],
                      :ReturnTC = [NaN; diff(log.(:NextOpen))]);

modelData = unstack(@select(allPrices, :Date, :Ticker, :Return), :Date, :Ticker, :Return)
modelData = modelData[2:end, :];

last(modelData, 4)

4 rows × 4 columns

Date JPM XLF SPY
Date Float64? Float64? Float64?
1 2023-06-30 0.0138731 0.00864001 0.0117316
2 2023-07-03 0.00799894 0.00562049 0.00114985
3 2023-07-05 -0.00661524 -0.00206703 -0.0014883
4 2023-07-06 -0.00993581 -0.00860923 -0.00786148

Looking at the actual returns we can see that all three move in sync

plot(modelData.Date, cumsum(modelData.JPM), label = "JPM")
plot!(modelData.Date, cumsum(modelData.XLF), label = "XLF")
plot!(modelData.Date, cumsum(modelData.SPY), label = "SPY", legend = :left)

Stock returns

The key point is that they are moving in sync with each other. Given XLF has JPM included in it, this is expected but it also presents the opportunity to trade around any dispersion between the ETF and the individual name.

The Stat Arb Modelling Process

  • https://math.stackexchange.com/questions/345773/how-the-ornstein-uhlenbeck-process-can-be-considered-as-the-continuous-time-anal

Let’s think simply about pairs trading. We have two securities that we want to trade if their prices change too much, so our variable of interest is

\[e = P_1 – P_2\]

and we will enter a trade if \(e\) becomes large enough in both the positive and negative directions.

To translate that into a statistical problem we have two steps.

  1. Work out the difference between the two securities
  2. Model how the difference changes over time.

Step 1 is a simple regression of the stock vs the ETF we are trading against. Step 2 needs a bit more thought, but is still only a simple regression.

The Macro Regression – Stock vs ETF

In our data, we have the daily returns of JPM, the XLF ETF, and the SPY ETF. To work out the interdependence, it’s just a case of simple linear regression.

regModel = lm(@formula(JPM ~ XLF + SPY), modelData)
JPM ~ 1 + XLF + SPY

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       t  Pr(>|t|)   Lower 95%     Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.000188758  0.000162973    1.16    0.2469  -0.0001309   0.000508417
XLF           1.35986      0.0203485     66.83    <1e-99   1.31995     1.39977
SPY          -0.363187     0.0260825    -13.92    <1e-41  -0.414345   -0.312028
──────────────────────────────────────────────────────────────────────────────────

From the slope of the model, we can see that JPM = 1.36XLF – 0.36SPY,
so JPM has a \(\beta\) of 1.36 to the XLF index and a \(\beta\) of
-0.36 to the SPY ETF, or general market. So each day, we can
approximate JPMs return by multiplying the XLF returns and SPY
returns.

This is our economic factor model, which describes from a
‘big picture’ kind of way how the stock trades vs the general market (SPY)
and its sector-specific market (XLF).

What we need to do next is look at what this model doesn’t explain
and try and describe that.

The Reversion Regression

Any difference around this model can be explained by the summation of
the residuals over time. In the paper the sum of the residuals
over time is called the ‘auxiliary process’ and this is the data behind
the second regression.

plot(scatter(modelData.Date, residuals(regModel), label = "Residuals"),
       plot(modelData.Date,cumsum(residuals(regModel)),
       label = "Aux Process"),
	  layout = (2,1))

Auxiliary process

We believe the auxiliary process (cumulative sum of the residuals)
can be modeled using a
Ornstein-Uhlenbeck
(OU) process.

An OU process is a type of differential equation that displays mean
reversion behaviour. If the process falls away from its average level
then it will be forced back.

\[dX = \kappa (m – X(t))dt + \sigma \mathrm{d} W\]

\(\kappa\) represents how quickly the mean reversion occurs.

To fit this type of process we need to recognise that the above differential form of an OU process can be discretised to become a simple AR(1) model where the model parameters can be transformed to get the OU parameters.

We now fit the OU process onto the cumulative sum of the residuals from the first model. If the residuals have some sort of structure/pattern then this means our original model was missing some variable that explains the difference.

X = cumsum(residuals(regModel))
xDF = DataFrame(y=X[2:end], x = X[1:end-1])
arModel = lm(@formula(y~x), xDF)
y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.   Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)  4.41618e-6  0.000162655    0.03    0.9783  -0.000314618  0.000323451
x            0.997147    0.00186733   534.00    <1e-99   0.993484     1.00081
─────────────────────────────────────────────────────────────────────────────────

We take these coefficients and transform them into the parameters from the paper.

varEta = var(residuals(arModel))
a, b = coef(arModel)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
[m, sigma_eq]
2-element Vector{Float64}:
 0.0015477568390823153
 0.08709971423424319

So \(m\) gives us the average level and \(\sigma_{\text{eq}}\) the
appropriate scale.

Now to build the mean reversion signal. We still have \(X\) as our
auxiliary process which we believe is mean reverting. We now have the
estimated parameters on the scale of this mean reversion so we can
transform the auxiliary process by these parameters and use this to see when the process is higher or lower than the model suggests it should be.

modelData.Score = (X .- m)./sigma_eq;

plot(modelData.Date, modelData.Score, label = "s")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash, legend=:topleft)

Stock signal

The red lines indicate when JPM has diverged from XLF on the negative side, i.e. we expect JPM to move higher and XLF to move lower. We enter the position if s < -1.25 (solid red line) and exit the position when s > -0.5 (dashed red line).

  • Buy to open if \(s < -s_{bo}\) (< -1.25) Buy 1 JPM, sell Beta XLF
  • Close long if \(s > -s_{c}\) (-0.5)

The purple line is the same but in the opposite direction.

  • Sell to open if \(s > s_{so}\) (>1.25) Sell 1 JPM, buy Beta XLF
  • Close short if \(s < s_{bc}\) (<0.75)

That’s the modeling part done. We model how the stock moves based on
the overall market and then any differences to this we use the OU
process to come up with the mean reversion parameters.

So, does it make money?

Backtesting the Stat Arb Strategy

To backtest this type of model we have to roll through time and
calculate both regressions to construct the signal.

A couple of new additions too

  • We shift and scale the returns when doing the macro regression.
  • The auxiliary process on the last day is always 0, which makes
    calculating the signal simple.
paramsRes = Array{DataFrame}(undef, length(90:(nrow(modelData) - 90)))

for (j, i) in enumerate(90:(nrow(modelData) - 90))
    modelDataSub = modelData[i:(i+90), :]
    modelDataSub.JPM = (modelDataSub.JPM .- mean(modelDataSub.JPM)) ./ std(modelDataSub.JPM)
    modelDataSub.XLF = (modelDataSub.XLF .- mean(modelDataSub.XLF)) ./ std(modelDataSub.XLF)
    modelDataSub.SPY = (modelDataSub.SPY .- mean(modelDataSub.SPY)) ./ std(modelDataSub.SPY)
    
    macroRegr = lm(@formula(JPM ~ XLF + SPY), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)*252
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(Date= modelDataSub.Date[end], 
                             MacroBeta_XLF = coef(macroRegr)[2], MacroBeta_SPY = coef(macroRegr)[3], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)
last(paramsRes, 4)

4 rows × 11 columns (omitted printing of 4 columns)

Date MacroBeta_XLF MacroBeta_SPY MacroAlpha VarEta OUA OUB
Date Float64 Float64 Float64 Float64 Float64 Float64
1 2023-06-30 0.974615 -0.230273 1.10933e-17 0.331745 0.175358 0.830417
2 2023-07-03 0.96943 -0.228741 -5.73883e-17 0.331222 0.198176 0.826816
3 2023-07-05 0.971319 -0.230438 2.38846e-17 0.335844 0.242754 0.841018
4 2023-07-06 0.974721 -0.232765 5.09875e-17 0.331695 0.256579 0.823822

The benefit of doing it this way also means we can see how each
\(\beta\) in the macro regression evolves.

plot(paramsRes.Date, paramsRes.MacroBeta_XLF, label = "XLF Beta")
plot!(paramsRes.Date, paramsRes.MacroBeta_SPY, label = "SPY Beta")

Stock betas

Good to see they are consistent in their signs and generally don’t
vary a great deal.

In the OU process, we are also interested in the speed of the mean
reversion as we don’t want to take a position that is very slow to
revert to the mean level.

kplot = plot(paramsRes.Date, paramsRes.OUK, label = :none)
kplot = hline!([252/45], label = "K Threshold")

In the paper, they suggest making sure the reversion happens with half
of the estimation period. As we are using 90 days, that means the
horizontal line shows when \(k\) is above this value.

svg

Plotting the score function also shows how the model wants to go
long/short the different components over time.

splot = plot(paramsRes.Date, paramsRes.Score, label = "Score")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash)

Stock position

We run through the allocation procedure and label whether we are long
(+1) or short (-\(\beta\)) an amount of either the stock or ETFs.

paramsRes.JPM_Pos .= 0.0
paramsRes.XLF_Pos .= 0.0
paramsRes.SPY_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > 252/45
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.JPM_Pos[i-1] == -1
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]    
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]   
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.JPM_Pos[i-1] == 1
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i] 
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        end
    end
        
end

To make sure we use the right price return we lead the return columns
by one so that we enter the position and get the next return.

modelData = @transform(modelData, :NextJPM= lead(:JPM, 1), 
                                   :NextXLF = lead(:XLF, 1),
                                   :NextSPY = lead(:SPY, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:Date, :NextJPM, :NextXLF, :NextSPY]], on=:Date)

portRes = @combine(groupby(paramsRes, :Date), :Return = :NextJPM .* :JPM_Pos .+ :NextXLF .* :XLF_Pos .+ :NextSPY .* :SPY_Pos);

plot(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Return")

Stock portfolio return

Sad trombone noise. This is not a great result as we’ve ended up
negative over the period. However, given the paper is 15 years old it
would be very rare to still be able to make money this way
after everyone knows how to do it. Plus, I’ve only used one stock vs
the ETF portfolio, you typically want to diversify out and use all the
stocks in the ETF to be long and short multiple single names and use
the ETF as a minimal hedge,

The good thing about it being a negative result means that we don’t have
to start considering transaction costs or other annoying things like
that.

When we break out the components of the strategy we can see that
it appears to pick out the right times to short/long JPM and
SPY, its the hedging with the XLF ETF that is bringing the portfolio
down.

plot(paramsRes.Date, cumsum(paramsRes.NextJPM .* paramsRes.JPM_Pos), label = "JPM Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextXLF .* paramsRes.XLF_Pos), label = "XLF Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextSPY .* paramsRes.SPY_Pos), label = "SPY Component")
plot!(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Portfolio")

Stock portfolio components

So whilst naively trying to trade the stat arb portfolio is probably
a loss maker, there might be some value in using the model as a signal
input or overlay to another strategy.

What about if we up the frequency and look at intraday stat arb?

Intraday Stat Arb in Crypto – ETH and BTC

Crypto markets are open 24 hours a day 7 days a week and so gives that
much more opportunity to build out a continuous trading model. We look
back since the last year and repeat the backtesting process to see if
this bares any fruit.

Once again AlpacaMarkets gives us an easy way to pull the hourly bar
data for both ETH and BTC.

btcRaw = AlpacaMarkets.crypto_bars("BTC/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1]
ethRaw = AlpacaMarkets.crypto_bars("ETH/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1];

btc = @transform(btcRaw, :ts = DateTime.(chop.(:t)), :Ticker = "BTC")
eth = @transform(ethRaw, :ts = DateTime.(chop.(:t)), :Ticker = "ETH")

btc = btc[:, [:ts, :Ticker, :c]]
eth = eth[:, [:ts, :Ticker, :c]]

allPrices = vcat(btc, eth)
allPrices = sort(allPrices, :ts)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))]);

modelData = unstack(@select(allPrices, :ts, :Ticker, :Return), :ts, :Ticker, :Return);
modelData = @subset(modelData, .! isnan.(:ETH .+ :BTC))

Plotting out the returns we can see they are loosely related just like
the stock example.

plot(modelData.ts, cumsum(modelData.BTC), label = "BTC")
plot!(modelData.ts, cumsum(modelData.ETH), label = "ETH")

Crypto returns

We will be using BTC as the ‘index’ and see how ETH is related.

regModel = lm(@formula(ETH ~ BTC), modelData)
ETH ~ 1 + BTC

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)    Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  7.72396e-6  3.64797e-5    0.21    0.8323  -6.37847e-5  7.92327e-5
BTC          1.115       0.00673766  165.49    <1e-99   1.10179     1.12821
─────────────────────────────────────────────────────────────────────────────

Fairly high beta for ETH and against BTC. We use a 90-hour rolling window now instead of a 90 day.

window = 90

paramsRes = Array{DataFrame}(undef, length(window:(nrow(modelData) - window)))

for (j, i) in enumerate(window:(nrow(modelData) - window))
    modelDataSub = modelData[i:(i+window), :]
    modelDataSub.ETH = (modelDataSub.ETH .- mean(modelDataSub.ETH)) ./ std(modelDataSub.ETH)
    modelDataSub.BTC = (modelDataSub.BTC .- mean(modelDataSub.BTC)) ./ std(modelDataSub.BTC)
    
    macroRegr = lm(@formula(ETH ~ BTC), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)/((1/24)/252)
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(ts= modelDataSub.ts[end], MacroBeta = coef(macroRegr)[2], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)

Again, looking at \(\beta\) overtime we see there has been a sudden
shift

plot(plot(paramsRes.ts, paramsRes.MacroBeta, label = "Macro Beta", legend = :left), 
     plot(paramsRes.ts, paramsRes.OUK, label = "K"), layout = (2,1))

Crypto params

Interesting that there has been a big change in \(\beta\) between ETH and BTC
recently that has suddenly reverted. Ok, onto the backtesting again.

paramsRes.ETH_Pos .= 0.0
paramsRes.BTC_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > (252/(1/24)/45)
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.ETH_Pos[i-1] == -1
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]     
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.ETH_Pos[i-1] == 1
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]     
        end
    end
        
end


modelData = @transform(modelData, :NextETH= lead(:ETH, 1), :NextBTC = lead(:BTC, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:ts, :NextETH, :NextBTC]], on=:ts)

portRes = @combine(groupby(paramsRes, :ts), :Return = :NextETH .* :ETH_Pos .+ :NextBTC .* :BTC_Pos);

plot(portRes.ts, cumsum(portRes.Return))

Crypto stat arb returns

This looks slightly better. At least it is positive at the end of the
testing period.

plot(paramsRes.ts, cumsum(paramsRes.NextETH .* paramsRes.ETH_Pos), label = "ETH Component")
plot!(paramsRes.ts, cumsum(paramsRes.NextBTC .* paramsRes.BTC_Pos), label = "BTC Component")
plot!(portRes.ts, cumsum(portRes.Return), label = "Stat Arb Portfolio", legend=:topleft)

Crypto components

Again, the components of the portfolio seem to be ok in the ETH case
but generally, this is from the overall long bias. Unlike the JPM/XLF
example, there isn’t much more diversification we can add anything that might
help. We could add in more crypto assets, or an equity/gold angle, but
it becomes more of an asset class arb than something truly
statistical.

Conclusion

The original paper is one of those that all quants get recommended to
read and statistical arbitrage is a concept that you probably
understand in theory but practically doing is another
question. Hopefully, this blog post gets you up to speed with the
basic concepts and how to implement them.
It can be boiled down to two steps.

  1. Model as much as you can with a simple regression
  2. Model what’s left over as an OU process.

It can work with both high-frequency and low-frequency data, so have a
look at different combinations or assets and see if you have more luck
then I did backtesting.

If you do end up seeing something positive, make sure you are
backtesting properly!