
Stat Arb – An Easy Walkthrough

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/07/15/Stat-Arb-Walkthrough.html

Statistical arbitrage (stat arb) is a pillar of quantitative trading
that relies on mean reversion to predict the future returns of an
asset. Mean reversion assumes that if a stock has risen it is more
likely to fall back in the short term, which is the opposite of a
momentum strategy, where a stock that has been rising is expected to
keep rising. This blog post will walk you through ‘the’ statistical
arbitrage paper,
Statistical Arbitrage in the US Equities Market,
apply it to a stock/ETF pair and then look at an intraday crypto stat
arb strategy.


Enjoy these types of posts? Then you should sign up for my newsletter.


I’m using Julia 1.9 and my AlpacaMarkets.jl package gets all the data we
need.

using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots
using RollingFunctions, Statistics
using GLM

To start with we simply want the daily prices of JPM, XLF, and SPY. JPM
is the stock we think will mean revert, XLF is the financial sector
ETF and SPY is the broad market ETF.

We think that if JPM rises further than XLF then it will soon revert
and trade lower, and likewise, if JPM falls further than XLF then we
think it will soon trade higher. Our mean reversion is all about JPM
relative to XLF. We’ve chosen XLF as it represents the general
financial sector landscape, so it will reflect the overall sector
outlook more consistently than JPM on its own.

jpm = AlpacaMarkets.stock_bars("JPM", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1]
xlf = AlpacaMarkets.stock_bars("XLF", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
spy = AlpacaMarkets.stock_bars("SPY", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];

We want to clean the data to format the date correctly and select the
close and open columns.

function parse_date(t)
    # The timestamp strings contain a "T" separator, so take the date part before it.
    Date(string(split(t, "T")[1]))
end

function clean(df, x)
    # Parse the date, tag the ticker and store the next day's open alongside each row.
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
    @select(df, :Date, :c, :o, :Ticker, :NextOpen)
end

Now we calculate the close-to-close log returns and format the data
into a column for each asset.

jpm = clean(jpm, "JPM")
xlf = clean(xlf, "XLF")
spy = clean(spy, "SPY")
allPrices = vcat(jpm, xlf, spy)
allPrices = sort(allPrices, :Date)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))], 
                      :ReturnO = [NaN; diff(log.(:o))],
                      :ReturnTC = [NaN; diff(log.(:NextOpen))]);

modelData = unstack(@select(allPrices, :Date, :Ticker, :Return), :Date, :Ticker, :Return)
modelData = modelData[2:end, :];

last(modelData, 4)

4 rows × 4 columns

Row  Date         JPM          XLF          SPY
1    2023-06-30    0.0138731    0.00864001   0.0117316
2    2023-07-03    0.00799894   0.00562049   0.00114985
3    2023-07-05   -0.00661524  -0.00206703  -0.0014883
4    2023-07-06   -0.00993581  -0.00860923  -0.00786148

Looking at the actual returns, we can see that all three move in sync.

plot(modelData.Date, cumsum(modelData.JPM), label = "JPM")
plot!(modelData.Date, cumsum(modelData.XLF), label = "XLF")
plot!(modelData.Date, cumsum(modelData.SPY), label = "SPY", legend = :left)

Stock returns

The key point is that they are moving in sync with each other. Given XLF has JPM included in it, this is expected but it also presents the opportunity to trade around any dispersion between the ETF and the individual name.

The Stat Arb Modelling Process

  • For background on how the OU process can be seen as the continuous-time analogue of an AR(1) model: https://math.stackexchange.com/questions/345773/how-the-ornstein-uhlenbeck-process-can-be-considered-as-the-continuous-time-anal

Let’s think simply about pairs trading. We have two securities that we want to trade if their prices diverge too much, so our variable of interest is

\[e = P_1 - P_2\]

and we will enter a trade if \(e\) becomes large enough in both the positive and negative directions.

To translate that into a statistical problem we have two steps.

  1. Work out the difference between the two securities
  2. Model how the difference changes over time.

Step 1 is a simple regression of the stock vs the ETF we are trading against. Step 2 needs a bit more thought, but is still only a simple regression.
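In symbols (my own shorthand rather than the paper’s exact notation), step 1 is the factor regression on daily returns and step 2 models the running sum of its residuals:

\[r^{JPM}_t = \beta_0 + \beta_{XLF} r^{XLF}_t + \beta_{SPY} r^{SPY}_t + \epsilon_t\]

\[X_t = \sum_{s \le t} \epsilon_s\]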

The Macro Regression – Stock vs ETF

In our data, we have the daily returns of JPM, the XLF ETF, and the SPY ETF. To work out the interdependence, it’s just a case of simple linear regression.

regModel = lm(@formula(JPM ~ XLF + SPY), modelData)
JPM ~ 1 + XLF + SPY

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       t  Pr(>|t|)   Lower 95%     Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.000188758  0.000162973    1.16    0.2469  -0.0001309   0.000508417
XLF           1.35986      0.0203485     66.83    <1e-99   1.31995     1.39977
SPY          -0.363187     0.0260825    -13.92    <1e-41  -0.414345   -0.312028
──────────────────────────────────────────────────────────────────────────────────

From the slope coefficients, we can see that JPM = 1.36 XLF - 0.36 SPY,
so JPM has a \(\beta\) of 1.36 to the XLF index and a \(\beta\) of
-0.36 to the SPY ETF, or general market. So each day, we can
approximate JPM’s return by multiplying the XLF and SPY returns by
their respective betas.

This is our economic factor model, which describes in a
‘big picture’ way how the stock trades vs the general market (SPY)
and its sector-specific market (XLF).

What we need to do next is look at what this model doesn’t explain
and try and describe that.

The Reversion Regression

Any deviation from this model shows up in the residuals. In the paper
the running sum of the residuals over time is called the ‘auxiliary
process’ and this is the data behind the second regression.

plot(scatter(modelData.Date, residuals(regModel), label = "Residuals"),
       plot(modelData.Date,cumsum(residuals(regModel)),
       label = "Aux Process"),
	  layout = (2,1))

Auxiliary process

We believe the auxiliary process (cumulative sum of the residuals)
can be modelled using an
Ornstein-Uhlenbeck
(OU) process.

An OU process is a type of stochastic differential equation that
displays mean-reverting behaviour. If the process moves away from its
average level then it gets pulled back towards it.

\[dX = \kappa (m - X(t))\mathrm{d}t + \sigma \mathrm{d} W\]

\(\kappa\) represents how quickly the mean reversion occurs.

To fit this type of process we need to recognise that the above differential form of an OU process can be discretised to become a simple AR(1) model where the model parameters can be transformed to get the OU parameters.
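Concretely, here is a sketch of that standard discretisation, written to match the code further down with \(\Delta t = 1/252\) for daily bars. Sampling the OU process every \(\Delta t\) gives the AR(1) model

\[X_{n+1} = a + b X_n + \eta_{n+1},\]

with

\[b = e^{-\kappa \Delta t}, \quad a = m(1 - e^{-\kappa \Delta t}), \quad \mathrm{Var}(\eta) = \sigma^2 \frac{1 - e^{-2 \kappa \Delta t}}{2 \kappa},\]

so once we have estimates of \(a\), \(b\) and \(\mathrm{Var}(\eta)\) we can invert the mapping:

\[\kappa = -\frac{\log b}{\Delta t}, \quad m = \frac{a}{1 - b}, \quad \sigma_{eq} = \sqrt{\frac{\mathrm{Var}(\eta)}{1 - b^2}}.\]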

We now fit the OU process onto the cumulative sum of the residuals from the first model. If the residuals have some sort of structure/pattern then this means our original model was missing some variable that explains the difference.

X = cumsum(residuals(regModel))
xDF = DataFrame(y=X[2:end], x = X[1:end-1])
arModel = lm(@formula(y~x), xDF)
y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.   Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)  4.41618e-6  0.000162655    0.03    0.9783  -0.000314618  0.000323451
x            0.997147    0.00186733   534.00    <1e-99   0.993484     1.00081
─────────────────────────────────────────────────────────────────────────────────

We take these coefficients and transform them into the parameters from the paper.

varEta = var(residuals(arModel))
a, b = coef(arModel)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
[m, sigma_eq]
2-element Vector{Float64}:
 0.0015477568390823153
 0.08709971423424319

So \(m\) gives us the average level and \(\sigma_{\text{eq}}\) the
appropriate scale.

Now to build the mean reversion signal. We still have \(X\) as our
auxiliary process which we believe is mean reverting. We now have the
estimated parameters on the scale of this mean reversion so we can
transform the auxiliary process by these parameters and use this to see when the process is higher or lower than the model suggests it should be.

modelData.Score = (X .- m)./sigma_eq;

plot(modelData.Date, modelData.Score, label = "s")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash, legend=:topleft)

Stock signal

The red lines indicate when JPM has diverged from XLF on the negative side, i.e. we expect JPM to move higher and XLF to move lower. We enter the position if s < -1.25 (solid red line) and exit the position when s > -0.5 (dashed red line).

  • Buy to open if \(s < -s_{bo}\) (< -1.25) Buy 1 JPM, sell Beta XLF
  • Close long if \(s > -s_{c}\) (-0.5)

The purple lines are the same but in the opposite direction; a compact sketch of both sets of rules as a single function follows the list below.

  • Sell to open if \(s > s_{so}\) (>1.25) Sell 1 JPM, buy Beta XLF
  • Close short if \(s < s_{bc}\) (<0.75)
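To make the rules concrete, here is a minimal sketch of them as a single function. This is my own helper rather than anything from the paper, and it ignores the SPY leg for brevity: given the current score, the previous JPM position and the hedge ratio it returns the target positions.

# Hypothetical helper: turn a score into (JPM, XLF) positions.
# s is the current score, prev the previous JPM position (+1, -1 or 0)
# and beta the XLF hedge ratio. Thresholds follow the plot above.
function signal_to_position(s, prev, beta)
    if s <= -1.25 || (s <= -0.5 && prev == 1)
        return (jpm = 1.0, xlf = -beta)   # long JPM, short beta units of XLF
    elseif s >= 1.25 || (s >= 0.75 && prev == -1)
        return (jpm = -1.0, xlf = beta)   # short JPM, long beta units of XLF
    else
        return (jpm = 0.0, xlf = 0.0)     # flat
    end
end

The backtest below does the same thing, but also carries the SPY hedge and a check on the speed of mean reversion.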

That’s the modelling part done. We model how the stock moves based on
the overall market, and then fit an OU process to whatever that model
doesn’t explain to get the mean reversion parameters.

So, does it make money?

Backtesting the Stat Arb Strategy

To backtest this type of model we have to roll through time and
calculate both regressions to construct the signal.

A couple of new additions compared to the earlier fit:

  • We shift and scale the returns when doing the macro regression.
  • The auxiliary process on the last day is always 0, which makes
    calculating the signal simple.
paramsRes = Array{DataFrame}(undef, length(90:(nrow(modelData) - 90)))

for (j, i) in enumerate(90:(nrow(modelData) - 90))
    # Take a rolling 90-day window and standardise the returns.
    modelDataSub = modelData[i:(i+90), :]
    modelDataSub.JPM = (modelDataSub.JPM .- mean(modelDataSub.JPM)) ./ std(modelDataSub.JPM)
    modelDataSub.XLF = (modelDataSub.XLF .- mean(modelDataSub.XLF)) ./ std(modelDataSub.XLF)
    modelDataSub.SPY = (modelDataSub.SPY .- mean(modelDataSub.SPY)) ./ std(modelDataSub.SPY)

    # Macro regression, then the AR(1) regression on the auxiliary process.
    macroRegr = lm(@formula(JPM ~ XLF + SPY), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))

    # Transform the AR(1) coefficients into the OU parameters.
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)*252
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))

    # The auxiliary process ends at zero by construction (OLS residuals sum to zero),
    # so the score on the last day is just -m/sigma_eq.
    paramsRes[j] = DataFrame(Date= modelDataSub.Date[end],
                             MacroBeta_XLF = coef(macroRegr)[2], MacroBeta_SPY = coef(macroRegr)[3], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
end

paramsRes = vcat(paramsRes...)
last(paramsRes, 4)

4 rows × 11 columns (omitted printing of 4 columns)

Row  Date        MacroBeta_XLF  MacroBeta_SPY  MacroAlpha    VarEta    OUA       OUB
1    2023-06-30  0.974615       -0.230273       1.10933e-17  0.331745  0.175358  0.830417
2    2023-07-03  0.96943        -0.228741      -5.73883e-17  0.331222  0.198176  0.826816
3    2023-07-05  0.971319       -0.230438       2.38846e-17  0.335844  0.242754  0.841018
4    2023-07-06  0.974721       -0.232765       5.09875e-17  0.331695  0.256579  0.823822

The benefit of doing it this way is that we can also see how each
\(\beta\) in the macro regression evolves.

plot(paramsRes.Date, paramsRes.MacroBeta_XLF, label = "XLF Beta")
plot!(paramsRes.Date, paramsRes.MacroBeta_SPY, label = "SPY Beta")

Stock betas

Good to see they are consistent in their signs and generally don’t
vary a great deal.

In the OU process, we are also interested in the speed of the mean
reversion as we don’t want to take a position that is very slow to
revert to the mean level.

kplot = plot(paramsRes.Date, paramsRes.OUK, label = :none)
kplot = hline!([252/45], label = "K Threshold")

In the paper, they suggest only trading when the mean reversion
happens within half of the estimation period. As we are using a 90-day
window, that means we want the characteristic reversion time to be
under 45 days, and the horizontal line shows the corresponding
threshold on \(k\).
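The arithmetic behind the threshold (just spelling out where the code’s 252/45 comes from): the characteristic reversion time of an OU process is \(1/\kappa\) in years, so

\[\frac{252}{\kappa}\ \text{business days} < 45\ \text{days} \;\Rightarrow\; \kappa > \frac{252}{45} \approx 5.6.\]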

Mean reversion speed

Plotting the score function also shows how the model wants to go
long/short the different components over time.

splot = plot(paramsRes.Date, paramsRes.Score, label = "Score")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash)

Stock position

We run through the allocation procedure and record the positions:
\(\pm 1\) unit of JPM and the offsetting \(\mp \beta\) units of each ETF.

paramsRes.JPM_Pos .= 0.0
paramsRes.XLF_Pos .= 0.0
paramsRes.SPY_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > 252/45
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.JPM_Pos[i-1] == -1
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]    
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]   
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.JPM_Pos[i-1] == 1
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i] 
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        end
    end
        
end

To make sure we use the right price return, we lead the return columns
by one so that when we enter a position we pick up the next day’s return.

modelData = @transform(modelData, :NextJPM= lead(:JPM, 1), 
                                   :NextXLF = lead(:XLF, 1),
                                   :NextSPY = lead(:SPY, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:Date, :NextJPM, :NextXLF, :NextSPY]], on=:Date)

portRes = @combine(groupby(paramsRes, :Date), :Return = :NextJPM .* :JPM_Pos .+ :NextXLF .* :XLF_Pos .+ :NextSPY .* :SPY_Pos);

plot(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Return")

Stock portfolio return

Sad trombone noise. This is not a great result as we’ve ended up
negative over the period. However, given the paper is 15 years old, it
would be very rare to still be able to make money this way
after everyone knows how to do it. Plus, I’ve only used one stock vs
the ETF portfolio; you typically want to diversify, being long and
short multiple single names from the ETF and using the ETF itself as a
minimal hedge.

The good thing about a negative result is that we don’t have
to start considering transaction costs or other annoying things like
that.

When we break out the components of the strategy we can see that
it appears to pick out the right times to short/long JPM and
SPY; it’s the hedging with the XLF ETF that is dragging the portfolio
down.

plot(paramsRes.Date, cumsum(paramsRes.NextJPM .* paramsRes.JPM_Pos), label = "JPM Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextXLF .* paramsRes.XLF_Pos), label = "XLF Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextSPY .* paramsRes.SPY_Pos), label = "SPY Component")
plot!(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Portfolio")

Stock portfolio components

So whilst naively trying to trade the stat arb portfolio is probably
a loss maker, there might be some value in using the model as a signal
input or overlay to another strategy.

What about if we up the frequency and look at intraday stat arb?

Intraday Stat Arb in Crypto – ETH and BTC

Crypto markets are open 24 hours a day, 7 days a week, which gives us
much more opportunity to build out a continuous trading model. We look
back over the last year and repeat the backtesting process to see if
this bears any fruit.

Once again AlpacaMarkets gives us an easy way to pull the hourly bar
data for both ETH and BTC.

btcRaw = AlpacaMarkets.crypto_bars("BTC/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1]
ethRaw = AlpacaMarkets.crypto_bars("ETH/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1];

btc = @transform(btcRaw, :ts = DateTime.(chop.(:t)), :Ticker = "BTC")
eth = @transform(ethRaw, :ts = DateTime.(chop.(:t)), :Ticker = "ETH")

btc = btc[:, [:ts, :Ticker, :c]]
eth = eth[:, [:ts, :Ticker, :c]]

allPrices = vcat(btc, eth)
allPrices = sort(allPrices, :ts)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))]);

modelData = unstack(@select(allPrices, :ts, :Ticker, :Return), :ts, :Ticker, :Return);
modelData = @subset(modelData, .! isnan.(:ETH .+ :BTC))

Plotting out the returns we can see they are loosely related just like
the stock example.

plot(modelData.ts, cumsum(modelData.BTC), label = "BTC")
plot!(modelData.ts, cumsum(modelData.ETH), label = "ETH")

Crypto returns

We will be using BTC as the ‘index’ and see how ETH is related.

regModel = lm(@formula(ETH ~ BTC), modelData)
ETH ~ 1 + BTC

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)    Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  7.72396e-6  3.64797e-5    0.21    0.8323  -6.37847e-5  7.92327e-5
BTC          1.115       0.00673766  165.49    <1e-99   1.10179     1.12821
─────────────────────────────────────────────────────────────────────────────

A fairly high beta for ETH against BTC. We now use a 90-hour rolling window instead of a 90-day one.

window = 90

paramsRes = Array{DataFrame}(undef, length(window:(nrow(modelData) - window)))

for (j, i) in enumerate(window:(nrow(modelData) - window))
    modelDataSub = modelData[i:(i+window), :]
    modelDataSub.ETH = (modelDataSub.ETH .- mean(modelDataSub.ETH)) ./ std(modelDataSub.ETH)
    modelDataSub.BTC = (modelDataSub.BTC .- mean(modelDataSub.BTC)) ./ std(modelDataSub.BTC)
    
    macroRegr = lm(@formula(ETH ~ BTC), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    # Delta t is now one hour, expressed in years with the same 252-day convention.
    k = -log(b)/((1/24)/252)
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(ts= modelDataSub.ts[end], MacroBeta = coef(macroRegr)[2], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)

Again, looking at \(\beta\) over time we see there has been a sudden
shift.

plot(plot(paramsRes.ts, paramsRes.MacroBeta, label = "Macro Beta", legend = :left), 
     plot(paramsRes.ts, paramsRes.OUK, label = "K"), layout = (2,1))

Crypto params

It’s interesting that there has been a big change in \(\beta\) between ETH and BTC
recently that has since reverted. Ok, on to the backtesting again.

paramsRes.ETH_Pos .= 0.0
paramsRes.BTC_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > (252/(1/24)/45)
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.ETH_Pos[i-1] == -1
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]     
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.ETH_Pos[i-1] == 1
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]     
        end
    end
        
end


modelData = @transform(modelData, :NextETH= lead(:ETH, 1), :NextBTC = lead(:BTC, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:ts, :NextETH, :NextBTC]], on=:ts)

portRes = @combine(groupby(paramsRes, :ts), :Return = :NextETH .* :ETH_Pos .+ :NextBTC .* :BTC_Pos);

plot(portRes.ts, cumsum(portRes.Return))

Crypto stat arb returns

This looks slightly better. At least it is positive at the end of the
testing period.

plot(paramsRes.ts, cumsum(paramsRes.NextETH .* paramsRes.ETH_Pos), label = "ETH Component")
plot!(paramsRes.ts, cumsum(paramsRes.NextBTC .* paramsRes.BTC_Pos), label = "BTC Component")
plot!(portRes.ts, cumsum(portRes.Return), label = "Stat Arb Portfolio", legend=:topleft)

Crypto components

Again, the components of the portfolio seem to be ok in the ETH case,
but generally this comes from the overall long bias. Unlike the JPM/XLF
example, there isn’t much more diversification we can add that might
help. We could add in more crypto assets, or an equity/gold angle, but
it becomes more of an asset class arb than something truly
statistical.

Conclusion

The original paper is one of those that all quants get recommended to
read, and statistical arbitrage is a concept that you probably
understand in theory, but actually doing it is another
question. Hopefully, this blog post gets you up to speed with the
basic concepts and how to implement them.
It can be boiled down to two steps.

  1. Model as much as you can with a simple regression
  2. Model what’s left over as an OU process.

It can work with both high-frequency and low-frequency data, so have a
look at different combinations of assets and see if you have more luck
than I did backtesting.

If you do end up seeing something positive, make sure you are
backtesting properly!

Predicting a Successful Mt Everest Climb

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/04/27/Predicting-a-Successful-Mt-Everest-Climb.html

Climbing Mount Everest is a true test of human endurance with a real risk of death. The Himalayan Database is a freely available data repository that records various details about the peaks, people, and expeditions involved in climbing the Nepalese Himalayan mountains, and it provides the data for this analysis. In this blog post, I’ll show you how to load the database and explore some of its features before building a model that tries to predict a successful climb of Mount Everest.


Enjoy these types of posts? Then you should sign up for my newsletter.


Over the past few months, I’ve been training for a marathon and trying to understand the best way to train and maximise my performance. This means extensive research and reading to get an idea of what the science says. Endure by Alex Hutchinson is a book I recommend: it looks at how the human body functions over long distances and extreme tasks, such as climbing Mount Everest with no oxygen or running ultra marathons, with an overarching reference to the Breaking2 project by Nike.

In one section the book references something called the Himalayan Database which is a database of expeditions to Mount Everest and other mountains in the Himalayas. As a data lover, this piqued my interest as an interesting data source and something a bit different from my usual data explorations around finance/sports. So I downloaded the database, worked out how to load it, and had a poke around the data.

If you go to the website, himalayandatabase, you can download the data yourself and follow along.

The database is distributed in the DBF format and the website itself is a bit of a blast from the past. It expects you to download a custom data viewer program to look at the data, but thankfully there are people in the R world who have demonstrated how to load the raw DBF files. I’ve taken inspiration from this, downloaded the DBF files, pulled in DBFTables.jl and loaded the data into Julia.

using DataFrames, DataFramesMeta
using Plots, StatsPlots
using Statistics
using Dates

I hit a roadblock straight away and had to patch DBFTables.jl with a new datatype that the Himalayan database uses that isn’t in the original spec. Pull request here if you are interested: DBFTables.jl – Add M datatype. Another feather to my open-source contributions hat!

using DBFTables

There are 6 tables in the database but we are only interested in 3 of them:

  • exped details the expeditions, so each trip up a mountain by one or more people.
  • peaks has the details of the mountains in the Himalayas.
  • members has information on each person who has attempted to climb one of the mountains.
function load_dbf(fn)
    dbf = DBFTables.Table(fn)
    DataFrame(dbf)
end

exped = load_dbf("exped.DBF")
peaks = load_dbf("peaks.DBF")
members = load_dbf("members.DBF");

Taking a look at the mountains with the most entries.

first(sort(@combine(groupby(exped, :PEAKID), :N = length(:YEAR)),
       :N, rev=true), 3)
Row  PEAKID  N
1    EVER    2191
2    AMAD    1456
3    CHOY    1325

Unsurprisingly Mount Everest is the most attempted mountain with Ama Dablam in second and Cho Oyu in third place.

Exploring the Himalayas Data

We start with some basic groupings to look at how the data is distributed per year.

expSummary = @combine(groupby(@subset(members, :CALCAGE .> 0), :EXPID),  
         :N = length(:CALCAGE),
         :YoungestAge=minimum(:CALCAGE), 
         :AvgAge = mean(:CALCAGE),
         :NFemale = sum(:SEX .== "F"))

expSummary = leftjoin(expSummary, 
              @select(exped, :EXPID, :PEAKID, :BCDATE, :SMTDATE, :MDEATHS, :HDEATHS, :SUCCESS1), on = :EXPID)
expSummary = leftjoin(expSummary, @select(peaks, :PEAKID, :PKNAME), on = :PEAKID)

everest = dropmissing(@subset(expSummary, :PKNAME .== "Everest"))
everest = @transform(everest, :DeathRate = (:MDEATHS .+ :HDEATHS) ./ :N, :Year = floor.(:BCDATE, Dates.Year))
everestYearly = @combine(groupby(everest, :Year), :N = sum(:N), 
                          :Deaths = sum(:MDEATHS + :HDEATHS),
                        :Success = sum(:SUCCESS1))
everestYearly = @transform(everestYearly, :DeathRate = :Deaths ./ :N, :SuccessRate = :Success ./ :N)
everestYearly = @transform(everestYearly, 
                            :DeathRateErr = sqrt.(:DeathRate .* (1 .- :DeathRate)./:N),
                            :SuccessRateErr = sqrt.(:SuccessRate .* (1 .- :SuccessRate)./:N));

What is the average age of those who climb Mount Everest?

scatter(everest.SMTDATE, everest.AvgAge, label = "Average Age of Attempting Everest")

Average age of climbing Mount Everest

By eye, it looks like the average age has been steadily increasing, and your expedition’s average age generally needs to be at least 30. Given it costs a small fortune to climb Everest, this is probably more about needing the money than a reflection of the overall fitness of a 30-year-old.

When we look at the number of attempts yearly and the annual death rate:

plot(bar(everestYearly.Year, everestYearly.N, label = "Number of Attempts in a Year"),
    scatter(everestYearly.Year, everestYearly.DeathRate, yerr=everestYearly.DeathRateErr, 
            label = "Yearly Death Rate"),
     layout = (2,1))

Yearly death rate on Mount Everest

scatter(everestYearly[everestYearly.Year .> Date("2000-01-01"), :].Year, 
        everestYearly[everestYearly.Year .> Date("2000-01-01"), :].DeathRate, 
        yerr=everestYearly[everestYearly.Year .> Date("2000-01-01"), :].DeathRateErr, 
        label = "Yearly Death Rate")

21st century death rate

But how ‘easy’ has it been to conquer Everest over the years? Looking at the success rate, at best 10% of attempted expeditions are completed, which highlights how tough it is. Given some of the photos of people queueing to reach the summit, you’d think it would be much easier, but out of the 400 expeditions, less than 100 will make it.

scatter(everestYearly.Year, everestYearly.SuccessRate, yerr=everestYearly.SuccessRateErr, 
        label = "Mt. Everest Success Rate")

Mount Everest success rate

A couple of interesting points from this graph:

  • 2014 was an outlier due to an avalanche that led to Mount Everest being closed from April for the rest of the year.
  • No one successfully climbed Mt Everest in 2015 because of the earthquake.
  • Only 1 success in 2020 before the pandemic closed everything.

So a decent amount of variation in what can happen in a given year on Mt Everest.

Predicting Success

The data has some interesting quirks and we now turn to our next step: trying to build a model. Endure was about what it takes to complete impressive human feats, so let’s do that here: can we use the database to predict and explain what leads to success?

We will be using the MLJ.jl package again to fit some machine learning models easily.

using MLJ, LossFunctions

To start with we are going to pull out the relevant factors that we think will help predict a successful climb, restricting the data to Mount Everest expeditions.

modelData = members[:, ["MSUCCESS", "PEAKID","MYEAR", 
                        "MSEASON", "SEX", "CALCAGE", "CITIZEN", "STATUS", 
                         "MROUTE1", "MO2USED"]]
modelData = @subset(modelData, :PEAKID .== "EVER")
modelData.MROUTE1 = modelData.PEAKID .* "_" .* string.(modelData.MROUTE1)
modelData = dropmissing(modelData)
modelData.MYEAR = parse.(Int, modelData.MYEAR)
modelData = @subset(modelData, :CALCAGE .> 0)

print(size(modelData))
(22583, 10)
first(modelData, 4)
4×10 DataFrame

Row  MSUCCESS  PEAKID  MYEAR  MSEASON  SEX  CALCAGE  CITIZEN  STATUS   MROUTE1  MO2USED
1    false     EVER    1963   1        M    36       USA      Climber  EVER_2   true
2    true      EVER    1963   1        M    31       USA      Climber  EVER_1   true
3    false     EVER    1963   1        M    27       USA      Climber  EVER_1   false
4    false     EVER    1963   1        M    26       USA      Climber  EVER_2   true

Just over 22k rows and 10 columns, so plenty of data to sink our teeth into. MLJ needs us to define the Multiclass type of the factor variables, and we also want to split out the response and the predictors before splitting into the train/test sets.

modelData2 = coerce(modelData,
                    :MSUCCESS => OrderedFactor,
                    :MSEASON => Multiclass,
                    :SEX => Multiclass,
                    :CITIZEN => Multiclass,
                    :STATUS => Multiclass, 
                    :MROUTE1 => Multiclass,
                    :MO2USED => OrderedFactor);
y, X = unpack(modelData2, ==(:MSUCCESS), colname -> true; rng=123);

train, test = partition(eachindex(y), 0.7, shuffle=true);

All these multi-class features need to be one-hot encoded, so we use the continuous encoder. The workflow is:

  • Create the encoder/standardizer.
  • Train it on the data.
  • Transform the data.

This gives confidence that you aren’t leaking information from the test set into the training process.

encoder = ContinuousEncoder()
encMach = machine(encoder, X) |> fit!
X_encoded = MLJ.transform(encMach, X);

X_encoded.MO2USED = X_encoded.MO2USED .- 1;
standardizer = @load Standardizer pkg=MLJModels
stanMach = fit!(machine(
                 standardizer(features = [:CALCAGE]),X_encoded); 
                rows=train)
X_trans = MLJ.transform(stanMach, X_encoded);
X_trans.MYEAR = X_trans.MYEAR .- minimum(X_trans.MYEAR);
plot(
    histogram(X_trans.CALCAGE, label = "Age"),
    histogram(X_trans.MYEAR, label = "Year"),
    histogram(X_trans.MO2USED, label = "02 Used")
    )

Variable distribution

Looking at the distribution of the transformed data gives a good indication of how these variables are spread post-transformation.

Model Fitting using MLJ.jl

I’ll now explore some different models using the MLJ.jl workflow, similar to my previous post on Machine Learning Property Loans for Fun and Profit. MLJ.jl gives you a common interface to fit a variety of different models and evaluate their performance all from one package, which is handy here when we want to look at both a simple linear model and an XGBoost model.

Let’s start with our null model to get the baseline.

constantModel = @load ConstantClassifier pkg=MLJModels

constMachine = machine(constantModel(), X_trans, y)

evaluate!(constMachine,
        rows=train,
         resampling=CV(shuffle=true),
         operation = predict_mode,
         measures=[accuracy, balanced_accuracy, kappa],
         verbosity=0)
Model  Accuracy  Kappa
Null   0.512     0.0

For classification tasks, the null model is essentially tossing a coin, so the accuracy will be around 50% and the \(\kappa\) is zero.
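For reference, the \(\kappa\) reported here is Cohen’s kappa (the standard definition, not spelled out in the post): it compares the observed accuracy \(p_o\) with the accuracy \(p_e\) expected by chance,

\[\kappa = \frac{p_o - p_e}{1 - p_e},\]

so a model that does no better than guessing scores \(\kappa \approx 0\) and a perfect model scores 1.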

Next we move on to the simple linear model using all the features.

logisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

lmMachine = machine(logisticClassifier(lambda=0), X_trans, y)

fit!(lmMachine, rows=train, verbosity=0)

evaluate!(lmMachine, 
          rows=train, 
          resampling=CV(shuffle=true), 
          operation = predict_mode,
          measures=[accuracy, balanced_accuracy, kappa], verbosity = 0)
Model              Accuracy  Kappa
Null               0.512     0.0
Linear Regression  0.884     0.769

This gives a good improvement over the null model, which indicates our included features contain some information that is useful in predicting success.

Inspecting the parameters indicates how strong each variable is. Route 0 leads to a large reduction in the probability of success whereas using oxygen increases the probability of success. Climbing in the Autumn or Winter also looks like it reduces your chance of success.

params = mapreduce(x-> DataFrame(Param=collect(x)[1], Value = collect(x)[2]), 
                   vcat, fitted_params(lmMachine).coefs)
params = sort(params, :Value)

vcat(first(params, 5), last(params, 5))
10×2 DataFrame

Row  Param             Value
1    MROUTE1__EVER_0   -4.87433
2    SEX__F            -1.97957
3    SEX__M            -1.94353
4    MSEASON__3        -1.39251
5    MSEASON__4        -1.1516
6    MROUTE1__EVER_2    0.334305
7    CITIZEN__USSR      0.43336
8    CITIZEN__Russia    0.518197
9    MROUTE1__EVER_1    0.697601
10   MO2USED            3.85578
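A quick way to read these numbers (my own back-of-the-envelope, using the standard logistic regression interpretation): each coefficient acts on the log-odds of success, so it multiplies the odds by \(e^{\beta_j}\). For the oxygen flag that is

\[e^{3.86} \approx 47,\]

i.e. using oxygen multiplies the odds of a successful summit by roughly 47, holding everything else fixed.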

XGBoost Time

What’s a model comparison if we’ve not tried xgboost to squeeze the most performance out of all the data? It is easy to fit using MLJ and without having to do any heavy lifting.

xgboostModel = @load XGBoostClassifier pkg=XGBoost verbosity = 0

xgboostmodel = xgboostModel()

xgbMachine = machine(xgboostmodel, X_trans, y)

evaluate!(xgbMachine,
        rows=train,
         resampling=CV(nfolds = 6, shuffle=true),
         measures=[accuracy,balanced_accuracy, kappa],
         verbosity=0)
Model              Accuracy  Kappa
Null               0.512     0.0
Linear Regression  0.884     0.769
XGBoost            0.889     0.778

We get 88.9% accuracy compared to the linear regression’s 88.4%, and a \(\kappa\) increase too, so it is looking like a good model.

How Do I Succeed in Climbing Mount Everest?

The whole point of these models is to try and work out what combination of these parameters gets us the highest probability of success on a mountain. We want some idea of feature importance that can direct us to the optimal approach to a mountain. Should I be an Austrian Doctor or is there an easier route that should be taken?

With xgboost we can use the feature_importances function to do exactly what it says on the tin and look at what features are most important in the model.

fi = feature_importances(xgbMachine)
fi = mapreduce(x-> DataFrame(Param=collect(x)[1], Value = collect(x)[2]), 
                   vcat, fi)
first(fi, 5)
5×2 DataFrame

Row  Param               Value
1    MO2USED             388.585
2    MROUTE1__EVER_0      46.3129
3    STATUS__H-A Worker   15.0079
4    CITIZEN__Nepal       11.6299
5    CITIZEN__UK           4.25651

So using oxygen, taking the 0th route up, being an H-A worker, and being either from Nepal or the UK appear to have the greatest impact on success. Using oxygen is an obvious benefit that can hardly be avoided, and I don’t think anyone believes their chance of success would be higher without oxygen. Being Nepalese is the one I would struggle with.

How does the model perform on the hold-out set? We’ve got 30% of the data that hasn’t been used in the fitting that can also validate how well the model performs.

modelNames = ["Null", "LM", "XGBoost"]
modelMachines = [constMachine, 
               lmMachine, 
               xgbMachine]


aucRes = DataFrame(Model = modelNames,
    AUC = map(x->auc(MLJ.predict(x,rows=test), y[test]), 
                modelMachines))
kappaRes = DataFrame(Kappa = map(x->kappa(MLJ.predict_mode(x,rows=test), y[test]), modelMachines),
                     Accuracy = map(x->accuracy(MLJ.predict_mode(x,rows=test), y[test]), modelMachines),
          Model = modelNames)
evalRes = leftjoin(aucRes, kappaRes, on =:Model)
3×4 DataFrame

Row  Model    AUC       Kappa     Accuracy
1    Null     0.5       0.0       0.512768
2    LM       0.937092  0.768528  0.883838
3    XGBoost  0.939845  0.775408  0.88738

On the test set, the XGBoost model is only slightly better than the linear model in terms of \(\kappa\) and accuracy, and only marginally better on AUC, so alarm bells should be ringing that the model isn’t quite there yet.

How Does Oxygen Change the Probability of Success?

X_trans2 = copy(X_trans[1:2, :])
X_trans2.MO2USED  = 1 .- X_trans2.MO2USED


predict(xgbMachine, vcat(X_trans[1:2, :], X_trans2))
4-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, Bool, UInt32, Float32}:
 UnivariateFinite{OrderedFactor{2}}(false=>0.245, true=>0.755)
 UnivariateFinite{OrderedFactor{2}}(false=>1.0, true=>0.000227)
 UnivariateFinite{OrderedFactor{2}}(false=>0.901, true=>0.0989)
 UnivariateFinite{OrderedFactor{2}}(false=>1.0, true=>0.000401)

By taking the first two entries and switching whether they used oxygen or not we can see how the outputted probability of success changes. In each case, it provides a dramatic shift in the probabilities. Again, from the feature importance output, we know this is the most important variable but it does seem to be a bit dominating in terms of what happens with and without oxygen.

Probability Calibration

Finally, let’s look at the calibration of the models.

using CategoricalArrays

modelData.Prediction = pdf.(predict(xgbMachine, X_trans), 1)

lData = @transform(modelData, :prob = cut(:Prediction, (0:0.1:1.1)))
gData = groupby(lData, :prob)
calibData = @combine(gData, :N = length(:MSUCCESS), 
                            :SuccessRate = mean(:MSUCCESS), 
                            :PredictedProb = mean(:Prediction))

calibData = @transform(calibData, :Err = 1.96 .* sqrt.((:PredictedProb .* (1 .- :PredictedProb)) ./ :N))




p = plot(calibData[:, :PredictedProb], 
         calibData[:, :SuccessRate], 
         yerr = calibData[:, :Err],
    seriestype=:scatter, label = "XGBoost Calibration")

p = plot!(p, 0:0.1:1, 0:0.1:1, label = :none)
h = histogram(modelData.Prediction, normalize=:pdf, label = "Prediction Distribution")

plot(p, h, layout = (2,1))

Calibration plot

To say the model is poorly calibrated is an understatement. There is no association between an increased success rate and an increase in the model probability, and from the distribution of predictions we can see the output is quite binary rather than evenly spread.
So whilst the evaluation metrics look better than a null model, the reality is that the model isn’t doing much. With all the different factors in the model matrix, there is likely some degeneracy in the data, such that a single occurrence of a variable ends up predicting success or not. There is also a potential issue with using the members table instead of the expedition table, as a successful expedition will lead to multiple members being recorded as successful.

Conclusion

Overall it’s an interesting data set even if it does take a little work to get it loaded into Julia. There is a wealth of different features in the data that lead to some nice graphs, but using these features to predict whether you will be successful or not in climbing Mount Everest doesn’t lead to a useful model.


More Extreme Moves in a Downtrending Market?

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/03/01/Downtrend-Extremes.html

I’m exploring whether there are more extreme moves in the equity
market when we are in a downtrend. I think anecdotally we
all notice the big red numbers when the market has been grinding
lower over a period, which is the classic loss aversion of human
psychology. This is loosely related to my Ph.D. in point processes and
also a blog post from last year when I investigated trend following –
Trend Following with ETFs. I’m
going to take two approaches: a simple binomial model and a Hawkes
process. For the data, we will be pulling the daily data from Alpaca Markets using my AlpacaMarkets.jl package.


Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!






A few packages to get started and I’m running Julia 1.8 for this
project.

using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots, PlotThemes, StatsPlots
using RollingFunctions, Statistics, StatsBase
using GLM

All good data analysis starts with the data. I’m downloading the
daily statistics of SPY, the S&P 500 ETF, which will represent
the overall stock market.

function parse_date(t)
   Date(string(split(t, "T")[1]))
end

function clean(df, x) 
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
   @select(df, :Date, :Ticker, :c, :o, :NextOpen)
end

spyPrices = stock_bars("SPY", "1Day"; startTime = now() - Year(10), limit = 10000, adjustment = "all")[1]
spyPrices = clean(spyPrices, "SPY")
last(spyPrices, 3)
3×5 DataFrame

Row  Date        Ticker  c       o       NextOpen
1    2023-02-22  SPY     398.54  399.52  401.56
2    2023-02-23  SPY     400.66  401.56  395.42
3    2023-02-24  SPY     396.38  395.42  NaN

I’m doing the usual close-to-close returns and then taking the 100-day moving average as my trend signal.

spyPrices = @transform(spyPrices, :Return = [missing; diff(log.(:c))])
spyPrices = @transform(spyPrices, :Avg = lag(runmean(:Return, 100), 1))
spyPrices = @transform(spyPrices, :BigMove = abs.(:Return) .>= 0.025)
dropmissing!(spyPrices);
sp = scatter(spyPrices[spyPrices.BigMove, :].Date, spyPrices[spyPrices.BigMove, :].Return, legend = :none)
sp = scatter!(sp, spyPrices[.!spyPrices.BigMove, :].Date, spyPrices[.!spyPrices.BigMove, :].Return)

plot(sp, plot(spyPrices.Date, spyPrices.Avg), layout = (2,1), legend=:none)

Daily big moves and trend signal

By calling a ‘big move’ anything greater than \(\pm\) 0.025 (in log terms)
we can see that they, the blue dots, are slightly clustered around
common periods. In the bottom panel, the 100-day rolling average of
the returns, our trend signal, also appears to be slightly correlated with these big returns.

scatter(spyPrices.Avg, abs.(spyPrices.Return), label = :none,
            xlabel = "Trend Signal", ylabel = "Daily Return")

Trend signal vs daily return

Here we have the 100-day rolling average on the x-axis and the
absolute return on that day on the y-axis. If we squint a little we
can imagine there is a slight quadratic pattern, or at the least, these
down trends appear to correspond with the more extreme day moves. We
want to try and understand if this is a significant effect.

A Binomial Model

We will start by looking at the probability that each day might
have a ‘large move’. We first split into a train/test split of 70/30.

trainData = spyPrices[1:Int(floor(nrow(spyPrices)*0.7)), :]
testData = spyPrices[Int(ceil(nrow(spyPrices)*0.7)):end, :];

The GLM.jl package lets you write out the formula and fit a wide
variety of linear models. We have two models, the proper one that uses
the Avg column (our trend signal) as our features and a null model
that just fits an intercept.

binomialModel = glm(@formula(BigMove ~ Avg + Avg^2), trainData, Binomial())
nullModel = glm(@formula(BigMove ~ 1), trainData, Binomial())

spyPrices[!, :Binomial] = predict(binomialModel, spyPrices);

To look at the model we can plot the output of the model relative to
the signal at the time.

plot(scatter(spyPrices.Avg, spyPrices[!, :Binomial], label ="Response Function"), 
    plot(spyPrices.Date, spyPrices[!, :Binomial], label = "Probability of a Large Move"), layout = (2,1))

From the top graph, we see the higher probability of an extreme move comes from when the moving average is a large negative number. The probability then flatlines beyond zero, which suggests there isn’t that much of an effect for large moves when the momentum in the market is positive.

We also plot the daily probability of a large move and see that it
has been pretty elevated in the past few months, with lots of big moves!

Binomial Intensity

We need to check if the model is any good though. We will just check
the basic accuracy.

using Metrics
binary_accuracy(predict(binomialModel, testData), testData.BigMove) # fitted model
binary_accuracy(predict(nullModel)[1] .* ones(nrow(testData)), testData.BigMove) # null model
0.93

0.95

So the null model has an accuracy of 95% on the test set, but the
fitted model has an accuracy of 93%. Not good; it looks like the trend
signal isn’t adding anything. We might be able to salvage the model
with a robust windowed fit-and-test procedure or by looking at a
single stock name, but overall I think it’s more of a testament to how
hard this data is to model rather than anything too specific.

We could also consider the self-exciting nature of these large moves. If one happens, is there a higher probability of another happening? Given my Ph.D. was in Hawkes processes, I have done lots of writing around them before and this is just another example of how they can be applied.

Hawkes Processes

Hawkes processes! The bane of my life for four years. Still, I am
forever linked with them now so I might as well put that Ph.D. to use. If
you haven’t come across Hawkes processes before, they are self-exciting
point processes where the occurrence of one event can lead to further
events. In our case, this means one extreme move can cause further
extreme moves, which is the same behaviour we are trying to use the
downtrend to predict. With the Hawkes process, we are checking whether
the events are simply self-correlated.

I’ve built the HawkesProcesses.jl package to make it easy to work
with Hawkes processes.

using HawkesProcesses, Distributions

Firstly, we get the data into the right shape by pulling out the number of
days since the start of the data for each big move.

startDate = minimum(spyPrices.Date)
allEvents = getfield.(spyPrices[spyPrices.BigMove, :Date] .- startDate, :value);
allDatesNorm = getfield.(spyPrices.Date .- startDate, :value);
maxT = getfield.(maximum(spyPrices[spyPrices.BigMove, :Date]) .- startDate, :value)

We then fit the Hawkes process using the standard Bayesian method for
5,000 iterations.

bgSamps1, kappaSamps1, kernSamps1 = HawkesProcesses.fit(allEvents .+ rand(length(allEvents)), maxT, 5000)
bgSamps2, kappaSamps2, kernSamps2 = HawkesProcesses.fit(allEvents .+ rand(length(allEvents)), maxT, 5000)

bgEst = mean(bgSamps1[2500:end])
kappaEst = mean(kappaSamps1[2500:end])
kernEst = mean(kernSamps1[2500:end])

intens = HawkesProcesses.intensity(allDatesNorm, allEvents, bgEst, kappaEst, Exponential(1/kernEst));
spyPrices[!, :Intensity] = intens;

We get three parameters out of the Hawkes process: the background rate
\(\mu\), the self-exciting parameter \(\kappa\) and an exponential
parameter \(\beta\) that describes how long each event has an impact on the
probability of another event.
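Written out (using the standard branching-ratio form of a Hawkes process, which is how I read the Exponential(1/\(\beta\)) kernel passed in the code above), the intensity is

\[\lambda (t) = \mu + \kappa \sum _{t_i < t} \beta e^{-\beta (t - t_i)},\]

so each past event bumps the intensity up by \(\kappa \beta\) and that bump decays away at rate \(\beta\).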

(bgEst, kappaEst, kernEst)
(0.005, 0.84, 0.067)

We get \(\kappa = 0.84\) and \(\beta = 0.07\), which we can interpret
as a high probability that another large move follows, with the effect
taking around 14 business days to decay. So with each large move,
expect another large move within about 3 weeks.
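Spelling out the decay arithmetic, which is just \(1/\beta\) from the fitted value above:

\[\frac{1}{\beta} = \frac{1}{0.067} \approx 14.9\ \text{business days},\]

i.e. roughly three trading weeks.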

When we compare the Hawkes intensity to the previous binomial
intensity we get a similar shape between both models.

plot(spyPrices.Date, spyPrices.Binomial, label = "Binomial")
plot!(spyPrices.Date, intens, label = "Hawkes")

Binomial and Hawkes Intensity

They line up quite well, which is encouraging and shows they are on a similar path. Let’s zoom in specifically on 2022.

plot(spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Date, 
     spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Binomial, label = "Binomial")
plot!(spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Date, 
      spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Intensity, label = "Hawkes")

2022 Intensity

Here we can see the binomial intensity stays higher for longer whereas
the Hawkes process goes through quicker bursts of intensity. This is
intuitive as the binomial model is using a 100-day moving average
under the hood, whereas the Hawkes process is much more reactive to
the underlying events.

To check whether the Hawkes process is any good we compare its
likelihood to a null likelihood of a constant Poisson process.

We first fit the null point process model by optimising the
null_likelihood across the events.
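For a constant-rate Poisson process the log-likelihood has the closed form implemented in null_likelihood below, and a standard result is that its maximiser is simply the number of events divided by the length of the observation window, so the optimiser should land very close to \(N/T\):

\[\mathcal{L}_0 = N \log \lambda - \lambda T, \qquad \hat{\lambda} = \frac{N}{T}.\]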

using Optim
null_likelihood(events, lambda, maxT) = length(events)*log(lambda) - lambda*maxT

opt = optimize(x-> -1*null_likelihood(allEvents, x[1], maxT),  0, 10)
Optim.minimizer(opt)
0.031146179404103084

Which gives a likelihood of:

null_likelihood(allEvents, Optim.minimizer(opt), maxT)
-335.1797769669301

Whereas the Hawkes process has a likelihood of:

likelihood(allEvents, bgEst, kappaEst, Exponential(1/kernEst), maxT)
-266.63091365640366

A substantial improvement, so all in the Hawkes process looks pretty
good.

Overall, the Hawkes model subdues quite quickly, but the binomial model can remain elevated. They are covering two different behaviours. The Hawkes model can describe what happens after one of these large moves happens. The binomial model is mapping the momentum onto a probability of a large event.

How do we combine both the binomial and the Hawkes process model?

Point Process Model

To start with, we need to consider a point process with variable intensity. This is known as an inhomogeneous point process. In our case, these events depend on the value of the trend signal.

\[\lambda (t) \propto \hat{r} (t)\]

\[\lambda (t) = \exp \left( \beta _0 + \beta _1 \hat{r} (t) + \beta_2 \hat{r} ^2 (t) \right)\]

Like the binomial model, we will use a quadratic combination of the values. Then, given we know how to write the likelihood for a point process, we can do some maximum likelihood estimation to find the appropriate parameters.

Our rhat function needs to return the trend signal at a given time.

function rhat(t, spy)
    dt = minimum(spy.Date) + Day(Int(floor(t)))
    spy[spy.Date .<= dt, :Avg][end]
end

And our intensity function \(\lambda\), which uses the rhat function, plus a method to make it compatible with arrays.

function lambda(t, params, spy)
   # Quadratic function of the trend signal, exponentiated to keep the intensity positive.
   exp(params[1] + params[2] * rhat(t, spy) + params[3] * rhat(t, spy) * rhat(t, spy))
end

lambda(t::Array{<:Number}, params::Array{<:Number}, spy::DataFrame) = map(x-> lambda(x, params, spy), t)

The likelihood of a point process is

\[\mathcal{L} = \sum _{t_i} \log(\lambda (t_i)) - \int _0 ^T \lambda (t) \mathrm{d}t\]

We have to use numerical integration to do the second half of the equation which is where the QuadGK.jl package comes in. We pass it a function and it will do the integration for us. Job done!
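As a quick illustration of the call signature (a toy example, not from the post), quadgk(f, a, b) returns the integral estimate along with an error bound:

using QuadGK
# Integrate sin from 0 to pi; the true answer is 2.
integral, err = quadgk(sin, 0, pi)

Here integral is 2.0 up to numerical precision and err is the estimated absolute error.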

function likelihood(params, rate, events, maxT, spy)
    sum(log.(rate(events, params, spy))) - quadgk(t-> rate(t, params, spy), 0, maxT)[1]
end

With all the functions ready we can optimise and find the correct parameters.

using Optim, QuadGK
opt = optimize(x-> -1*likelihood(x, lambda, allEvents, maxT, spyPrices), rand(3))
Optim.minimizer(opt)
3-element Vector{Float64}:
 -3.4684622926014783
  1.6204408269570916
  2.902098418452392

This also has a maximum likelihood of -334, which, if you scroll up,
isn’t much better than the null model. So warning bells should
be ringing that this isn’t a good model.

plot(minimum(spyPrices.Date) + Day.(Int.(collect(0:maxT))), 
     lambda(collect(0:maxT), Optim.minimizer(opt), spyPrices), label = :none,
     title = "Poisson Intensity")

Poisson Intensity

The intensity isn’t showing too much structure over time.

To check the fit of this model we simulate some events with the same
intensity pattern.

# Simulate from the fitted inhomogeneous Poisson process by thinning: draw from a
# dominating homogeneous process at rate lambdaMax, then accept each point with
# probability lambda(t)/lambdaMax.
lambdaMax = maximum(lambda(collect(0:0.1:maxT), Optim.minimizer(opt), spyPrices)) * 1.1
rawEvents = rand(Poisson(lambdaMax * maxT), 1)[1]
unthinnedEvents = sort(rand(Uniform(0, maxT), rawEvents))
acceptProb = lambda(unthinnedEvents, Optim.minimizer(opt), spyPrices) / lambdaMax
events = unthinnedEvents[rand(length(unthinnedEvents)) .< acceptProb];
histogram(events,label= "Simulated", bins = 100)
histogram!(allEvents, label = "True", bins = 100)

Simulated Events

It’s not a great model as the simulated events don’t line up with the
true events. Looking back at the intensity function we can see it
doesn’t vary much around 0.03, so whilst the intensity function looks
varied, zooming out shows it is quite flat.

Next Steps

I wanted to integrate the variable background into the Hawkes process
so we could combine both models. As my Hawkes sampling is Bayesian I
have an old blog post to turn the above from an MLE to full Bayesian
estimation, but that code doesn’t work anymore. You need to use the
LogDensityProblems.jl package to get it working, so I’m going to
have to invest some time in learning that. I’ll be honest, I’m not
sure how bothered I can be, I’ve got a long list of other things I
want to explore and learning some abstract interface doesn’t feel like
it’s a good use of my time. Frustrating because the whole point of
Julia is composability, I could write a pure Julia function and
use HCMC on it, but now I’ve got to get another package involved. I’m sure
there is good reason and the LogDensityProblems package solves some issues but it feels a bit like the Javascript ecosystem where everything changes and the way to do something is outdated the minute it is pushed to main.

Conclusion

So overall we’ve shown that large moves don’t happen more often in
down-trending markets, at least in the broad S&P 500 view of the market.
Both the binomial and the point process model showed no improvement on a
null model for predicting these extreme days, whereas the Hawkes model shows that they are potentially self-exciting.