Category Archives: Julia

My first Twitch live streaming session

Re-posted from: https://bkamins.github.io/julialang/2022/05/12/twitch.html

Introduction

In 24 hours I will have my first Twitch live streaming session. It will begin on
Friday, May 13, 7 PM EDT on ManningPublications channel.

In this post I want to share the source material I am going to present so that
everyone interested can easily follow it.

The codes are a shortened version of contents of chapters 8 and 9 of my upcoming
Julia for Data Analysis book.

Environment setup

I will run the codes under Julia 1.7.2. You will need to install the following
packages (I show you the versions of the packages I use):

CSV.jl 0.10.4
CodecBzip2.jl 0.7.2
DataFrames.jl 1.3.4
Loess.jl 0.5.4
Plots.jl 1.28.1

The problem

In the session I will analyze Lichess puzzles database. It contains
information about over 2,000,000 puzzles, covering such data as number of times
a given puzzle was played, how hard the puzzle is, how much Lichess users like
the puzzle, or what chess themes the puzzle features. My goal is to check the
relationship between the puzzle hardness and how much users like it.

Source codes

Here are the source codes that I am going to present and explain during the
session.

I will start with fetching the data from the internet, unpacking it, and reading
it into a data frame:

import Downloads
Downloads.download("https://database.lichess.org/lichess_db_puzzle.csv.bz2",
                   "puzzles.csv.bz2")

using CodecBzip2
compressed = read("puzzles.csv.bz2")
plain = transcode(Bzip2Decompressor, compressed)

using CSV
using DataFrames
puzzles = CSV.read(plain, DataFrame;
                   header=["PuzzleId", "FEN", "Moves", "Rating","RatingDeviation",
                           "Popularity", "NbPlays", "Themes","GameUrl"])

describe(puzzles)

Next, I will perform exploratory data analysis of the data base and subset it
to only keep the puzzles that I will later want to analyze:

using Plots
plot([histogram(puzzles[!, col]; label=col) for
      col in ["Rating", "RatingDeviation", "Popularity", "NbPlays"]]...)

using Statistics
plays_lo = median(puzzles.NbPlays)
rating_lo = 1500
rating_hi = quantile(puzzles.Rating, 0.99)
row_selector = (puzzles.NbPlays .> plays_lo) .&&
               (rating_lo .< puzzles.Rating .< rating_hi)

sum(row_selector)
count(row_selector)

good = puzzles[row_selector, ["Rating", "Popularity"]]

plot(histogram(good.Rating; label="Rating"),
     histogram(good.Popularity; label="Popularity"))

describe(good)

Finally I will perform some aggregation data of the data stored in the Lichess
database and analyze the relationship between puzzle difficulty and popularity:

grouped_good = groupby(good, :Rating, sort=true)
agg_good = combine(grouped_good, :Popularity => mean)
scatter(agg_good.Rating, agg_good.Popularity_mean;
        xlabel="rating", ylabel="mean popularity", legend=false)

using Loess
model = loess(agg_good.Rating, agg_good.Popularity_mean)
agg_good.pred = predict(model, float.(agg_good.Rating))
plot!(agg_good.Rating, agg_good.pred; width=5)

Conclusions

I invite everyone to join me during the Twitch live streaming session.
If you would have any questions please do not hesitate to ask them in chat and I
will try to answer them live. I hope you will enjoy it!

Lissajous Curve Matrix in Julia

By: Jonathan Carroll

Re-posted from: https://jcarroll.com.au/2022/05/12/lissajous-curve-matrix-in-julia/

Another ‘small learning project’ for me as I continue to learn Julia. I’ve said many
times that small projects with a defined goal are one of the best ways to learn, at
least for me. This one was inspired by yet another Reddit post

<br />

These are at least reminiscent of Lissajous curves but they
primarily just looked pretty cool – that animation is very nicely put together.

That graphic was made using Typescript which
is itself neat to begin with, but it looked like something that Julia might be well-suited to,
at least the parts I’ve learned so far. It seems to involve interpolating between points and animation,
both of which I recently covered on my mini blog.

Better yet, it appeared that matrix operations might be a useful component, for which Julia seems particularly well-suited.

The first thing I needed to do was to get a polygon plotted in Julia. This already challenged my existing
knowledge, but that’s where the learning happens. I dabbled with the Shape class and didn’t get very far. I found
some other implementations that plotted shapes, but none (at least that I understood) that produced a set of
points I could interpolate between.

I ended up defining my own function that calculates the vertices of an n-sided polygon with a bit of math. There’s
very likely already something that does it, but it failed the discoverability aspect. The function I came up with is

"""
    vertices(center, R, n[, closed])

# Arguments
- `center::Point`: center of polygon
- `R::Real`: circumradius
- `n::Int`: number of sides
- `closed::Bool`: should the first point be repeated?

Polygon has a flat bottom and points progress counterclockwise 
starting at the right end of the base

The final point is the starting point when closed = true
"""
function vertices(center::Point, R::Real, n::Int, closed::Bool=true) 
    X = center[1] .+ R * cos.(π/n .* (1 .+ 2 .* (0:n-1)) .- π/2)
    Y = center[2] .+ R * sin.(π/n .* (1 .+ 2 .* (0:n-1)) .- π/2)

    res = permutedims([X Y])
    ## append the start point if closed
    if closed
        res = hcat(res, res[:,1])
    end
    return res
end

This is where playing around with code and data where you know what you want but
not how to produce it is the most useful. Coming from R, I was at risk of trying to
create a data.frame of x and y points, but Arrays make more sense here. Getting
the points in the right structure was the biggest learning experience for me – combining
Arrays of Points doesn’t quite work in the way I expect coming from R, but I think this works.

I did play with the idea of making my own struct for this group of Points, but even though
(I think) it inherited from AbstractArray, none of the Array methods seemed to work for it – more
to learn for next time!

I wanted to make sure that the points I generated here seem to make sense, so I can plot them. Getting
the plots to work requires using Plots, and Point comes from GeometryBasics, so

using Plots
import GeometryBasics: Point

then plotting the vertices of a polygon is as easy as

a = vertices(Point(0,0), 1, 5, true);
plot(a[1,:], a[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio = 1)
scatter!(a[1,:], a[2,:])

And just by changing the number of vertices

a = vertices(Point(0,0), 1, 6, true);
plot(a[1,:], a[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio = 1)
scatter!(a[1,:], a[2,:])

I find it somewhat odd that plot doesn’t have an Array method and I need to explicitly slice out the
x and y arguments, but perhaps I’m “holding it wrong”?

Next I wanted to interpolate points between these vertices. I played with interpolation
in Julia in my last mini blog post so
I knew that function was

interpolate(a, b) = t -> ((1.0 - t) * a + t * b)

Interpolating between vertices meant interpolating between any two vertices, then repeating that over pairs. Taking
the case of two vertices first

"""
    _interPoints(pts, steps, slice)

# Arguments
- `pts::Array`: Array of `Point`s representing a polygon
- `steps::Int`: number of points to interpolate
- `slice::Int`: which polygon vertex to begin with; points will  be interpolated to the next vertex

This is an internal function to interpolate points between 
    two vertices of a polygon. It is intended to be used 
    in a `map` across slices of a polygon.
"""
function _interPoints(pts::Array, steps::Int, slice::Int) 
    int = interpolate(pts[:,slice], pts[:,slice+1])
    explode = [int(t) for t in range(0,1,length=steps)]
    return hcat(explode...)
end

which I can test with

a = vertices(Point(0,0), 1, 5, true);
b = _interPoints(a, 10, 1);
plot(a[1,:], a[2,:], ratio = 1)
scatter!(b[1,:], b[2,:])

Then, mapping across pairs of points is just

"""
    interPoints(pts, steps)

# Arguments
- `pts::Array`: Array of `Point`s representing a polygon
- `steps::Int`: number of points to interpolate between each pair of vertices

This takes an `Array` of `Point`s representing polygon vertices and interpolates between the vertices
"""
function interPoints(pts::Array, steps::Int) 
    res = map(s -> _interPoints(pts, steps, s), 1:(size(pts,2)-1))
    return hcat(res...)
end

Plotting all these points

a = vertices(Point(0,0), 1, 5, true);
b = interPoints(a, 10);
plot(b[1,:], b[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio = 1)
scatter!(b[1,:], b[2,:])

Animating these points is as simple as

anim = @animate for t in 1:size(b,2)
    plot(b[1,:], b[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio=1)
    scatter!([b[1,t]], [b[2,t]], markersize=8)
end

gif(anim, fps = 12)

and I think that’s pretty great progress towards what I want to make. Now I just need to run more of these
at different speeds, and find the intersections of them.

Taking the intersection problem first, I just want to create two polygons and extract the x values from one and
the y values from the other. Simple enough

"""
Find the intersection of two Arrays (representing polygons)

# Arguments
- `a::Array`: first polygon (for x values)
- `b::Array`: second polygon (for y values)

Take the x values from a and the y values from b
"""
function intersection(a::Array, b::Array) 
    permutedims(hcat([(a[1, :])...], [(b[2, :])...]))
end

permutedims was the big win for me here – I naively expected to be able to transpose
an Array but that ends up with some LinearAlgebra.Adjoint mess and I got confused

[1 2; 3 4]

## 2×2 Array{Int64,2}:
##  1  2
##  3  4


[1 2; 3 4]'

## 2×2 Adjoint{Int64,Array{Int64,2}}:
##  1  3
##  2  4

Anyway, this appears to be able to take the intersection of two Arrays. Let’s plot it!

t1 = interPoints(vertices(Point(2,8), 0.5, 5), 10);
t2 = interPoints(vertices(Point(1,7), 0.5, 5), 10);
tx = intersection(t1, t2);

plot(t1[1,:], t1[2,:], xlim = (0,3.5), ylim = (6,9), ratio = 1)
plot!(t2[1,:], t2[2,:])
plot!(tx[1,:], tx[2,:])

Perfect! Now I just need to do it a bunch more times (at different ‘speeds’) and animate it.

I originally worked out the array math by hand and found a suitable number of points to plot
for any given polygon and which multiplicative factors I could use, then I worked backwards to
formalise it into a function

"""

    speed_factor(poly, speed)

# Arguments
- `poly::Array`: Array of `Point`s representing a polygon
- `speed::Real`: mulitiplicative factor representing how the number of times a polygon should be traversed
"""
function speed_factor(poly::Array, speed::Real)
    if (speed % 1 == 0)
        res = repeat(poly, outer = (1,Int(speed)))
    else 
        n = Int(floor(speed / 1))
        res = repeat(poly, outer=(1,n))
        n_rem = Int(speed*size(poly,2)-size(res,2))
        res = hcat(res, poly[:,1:n_rem])
    end
    res
end

If I create a polygon of 72 interpolated points, I can create another with the same number of
points but with larger gaps between them. This means the ‘faster’ polygon will loop around some n>1 number
of times.

r = 0.4; # circumradius for the polygon
d = 3;   # number of vertices

# Both produce a 2x72 Array{Float64,2}
tx1 = interPoints(vertices(Point(2,6), r, d), 24)
tx2 = speed_factor(interPoints(vertices(Point(3,6), r, d), 16) , 1.5)

I can create a series of these, say, at speeds of 1, 1.5, 2, 2.4, and 3. These are just nice
numbers which are all integer divisors of the largest number of points (72)

## n = 3
r = 0.4;
d = 3;

tx1 = interPoints(vertices(Point(2,6), r, d), 24)
tx2 = speed_factor(interPoints(vertices(Point(3,6), r, d), 16), 1.5) 
tx3 = speed_factor(interPoints(vertices(Point(4,6), r, d), 12), 2)
tx4 = speed_factor(interPoints(vertices(Point(5,6), r, d), 10), 2.4)
tx5 = speed_factor(interPoints(vertices(Point(6,6), r, d), 8), 3)

ty1 = interPoints(vertices(Point(1,5), r, d), 24)
ty2 = speed_factor(interPoints(vertices(Point(1,4), r, d), 16), 1.5)
ty3 = speed_factor(interPoints(vertices(Point(1,3), r, d), 12), 2)
ty4 = speed_factor(interPoints(vertices(Point(1,2), r, d), 10), 2.4)
ty5 = speed_factor(interPoints(vertices(Point(1,1), r, d), 8), 3)

The variable name is arbitrary, but these are a sequence of polygons along the x and y axes of
some plot area.

One thing that I really like about Julia is that anything can be in an Array (similar to lists in R) so
I can combine these groups of points into an Array of Arrays

allx = [tx1, tx2, tx3, tx4, tx5]
ally = [ty1, ty2, ty3, ty4, ty5]

Now, how to calculate all the intersections? Julia of course does “broadcasting” where we can take some
operation and (in R parlance) “vectorize it”. That initially led me to

intersection.(allx, ally)

which does indeed do that – it produces a 5-element Array{Array{Float64,2},1} but that’s not what I wanted…
this only calculates the ‘diagonal’ of intersection(tx1, ty1), intersection(tx2, ty2), …

Thankfully, Julia also has list comprehensions, so the full ‘matrix’ of intersections is actually

allint = [intersection(x, y) for x in allx, y in ally]

which produces a 5×5 Array{Array{Float64,2},2} – the full matrix! With that in place, we now have all the
pieces we need, so we just need to plot them.

The following sets up a plot on every ‘timestep’ (one per point in the interpolation) where it redraws the canvas,
with the progressive drawing of each polygon and the intersections, plus some tracking lines along the x and y
extractions. One of the very neat things I entirely failed to appreciate earlier was the concept of
enumerated objects – Julia knows that if I ask for x in obj I want to iterate over all the elements

bbox = Point(6.5,6.5);

anim3 = @animate for t in 1:size(tx1,2)
    plot(xlim=(0,bbox[2]), ylim=(0,bbox[2]), 
        legend=false, ratio=1, axis=nothing, border=:none, 
        background_color="black", size=(1200,1200))
    for p in 1:size(allx,1)
        plot!(allx[p][1,1:t], allx[p][2,1:t], color=p, linewidth=6)
        plot!(ally[p][1,1:t], ally[p][2,1:t], color=p, linewidth=6)
        
        plot!([allx[p][1,t], allx[p][1,t]], [0.5, allx[p][2,t]], color="grey", alpha=0.5, linewidth=5)
        plot!([ally[p][1,t], bbox[2]], [ally[p][2,t], ally[p][2,t]], color="grey", alpha=0.5, linewidth=5)
    end
    for p in allint
        plot!(p[1,1:t], p[2,1:t], color="blue", linewidth=5)
    end
end

gif(anim3, "n3.gif", fps=12)

And, finally, the result

With all that in place, it’s reasonably straightforward to adapt this to other polygons. For n=4

r = 0.4;
d = 4;

tx1 = interPoints(vertices(Point(2,6), r, d), 24)
tx2 = speed_factor(interPoints(vertices(Point(3,6), r, d), 16), 1.5)
tx3 = speed_factor(interPoints(vertices(Point(4,6), r, d), 12), 2)
tx4 = speed_factor(interPoints(vertices(Point(5,6), r, d), 10), 2.4)
tx5 = speed_factor(interPoints(vertices(Point(6,6), r, d), 8), 3)

ty1 = interPoints(vertices(Point(1,5), r, d), 24)
ty2 = speed_factor(interPoints(vertices(Point(1,4), r, d), 16), 1.5)
ty3 = speed_factor(interPoints(vertices(Point(1,3), r, d), 12), 2)
ty4 = speed_factor(interPoints(vertices(Point(1,2), r, d), 10), 2.4)
ty5 = speed_factor(interPoints(vertices(Point(1,1), r, d), 8), 3)

allx = [tx1, tx2, tx3, tx4, tx5]
ally = [ty1, ty2, ty3, ty4, ty5]
allint = [intersection(x, y) for x in allx, y in ally]

bbox = Point(6.5,6.5);

anim4 = @animate for t in 1:size(tx1,2)
    plot(xlim=(0,bbox[2]), ylim=(0,bbox[2]), 
        legend=false, ratio=1, axis=nothing, border=:none, 
        background_color="black", size=(1200,1200))
    for p in 1:size(allx,1)
        plot!(allx[p][1,1:t], allx[p][2,1:t], color=p, linewidth=6)
        plot!(ally[p][1,1:t], ally[p][2,1:t], color=p, linewidth=6)
        
        plot!([allx[p][1,t], allx[p][1,t]], [0.5, allx[p][2,t]], color="grey", alpha=0.5, linewidth=5)
        plot!([ally[p][1,t], bbox[2]], [ally[p][2,t], ally[p][2,t]], color="grey", alpha=0.5, linewidth=5)
    end
    for p in allint
        plot!(p[1,1:t], p[2,1:t], color="blue", linewidth=5)
    end
end

gif(anim4, "n4.gif", fps=12)

n=5 (very similar code)

and n=6

I was extremely happy to see these come together, and I’m genuinely surprised by how little code it took. I could
certainly imagine trying to do the same in R, but I have doubts that it would come together quite so cleanly.

This is definitely still part of my journey towards learning Julia, so if there’s something in here you can spot
that I could have done better, I do encourage you to let me know! Either here in the comments or on Twitter.

The code for generating all of this can be found here.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 21.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2022-05-12                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  blogdown      1.8     2022-02-16 [1] CRAN (R 4.1.2)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.2)
##  brio          1.1.1   2021-01-20 [3] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.2)
##  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
##  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
##  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
##  desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
##  devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
##  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
##  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.1)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
##  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite      1.7.2   2020-12-09 [3] CRAN (R 4.0.3)
##  JuliaCall     0.17.4  2021-05-16 [1] CRAN (R 4.1.2)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr      2.0.1   2020-11-17 [3] CRAN (R 4.0.3)
##  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
##  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
##  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [3] CRAN (R 4.0.1)
##  R6            2.5.0   2020-10-28 [3] CRAN (R 4.0.2)
##  Rcpp          1.0.6   2021-01-15 [3] CRAN (R 4.0.3)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
##  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.2)
##  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.1)
##  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.1)
##  testthat      3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.2)
##  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.1)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Modelling Microstructure Noise Using Hawkes Processes

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2022/05/11/modelling-microstructure-noise-using-hawkes-processes.html

Microstructure noise is where the
price we observe isn’t the ‘true’ price of the underlying asset. The
observed price doesn’t diffuse as we assume in your typical
derivative pricing models, but instead, we see
some quirks in the underlying data. For example, there is an explosion of
realised variance as we use finer and finer time subsampling periods.

Last month I wrote about
calculating realised volatility
and now I’ll be taking it a step further. I’ll show you how this
microstructure noise manifests itself in futures trade data and how I
use a Hawkes process to come up with a price formation model that fits
the actual data.

The original work was all done in
Modeling microstructure noise with mutually exciting point processes. I’ll
be explaining the maths behind it, showing you how to fit the models
in Julia and hopefully educating you on this high-frequency finance
topic.

Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!

A Bit of Background

My Ph.D. was all about using Hawkes processes to model when things
happen and how these things are self-exciting. You may have read my
work on Hawkes process before, either through my Julia package
HawkesProcess.jl,
or examples of me
using Hawkes processes to model terror attack data or just
how to calculate the deviance information criteria of a Hawkes process. The
real hardcore might have read my Ph.D. thesis.

But how can we use a Hawkes process to model the price of some
financial asset? There is a vast amount of research and work about how Hawkes
processes can be used in price formation processes for financial
instruments and high-frequency types of problems and this post will
act as a primer to anyone interested in using Hawkes processes for
such problems.

The High-Frequency Problem

At short timescales, a price moves randomly rather than with any
trend. Amazon stock might trade thousands of times in a minute but
that’s supply and demand changing rather than people thinking
Amazon’s future is going to change from one minute to the next. So we need
a different way of thinking about how prices move at short timescales
compared to longer timescales.

We can build nice mathematical models guessing how a price might move;
it might move like a random walk or maybe a random walk with jumps in
the price now and then. But, no matter what model we use, it must
match up with what is observed in the real world. One of
these observations is a phenomenon called ‘microstructure noise’ and we
only see it in high-frequency data.

Microstructure noise is a catch-all term for different things happening
in the market. This includes, bid-ask bounce, people buy and sell at
two different prices, so looks like the price is moving, but in
reality, just oscillating around the mid-price. The discreteness of
prices at these time scales also plays a part. There is a minimum
increment level that prices change buy and this can have real effects
on how prices move. Exchanges need to pay attention to their tick
sizes, as it can help or hinder liquidity if they are set
incorrectly. This then has a real effect that we can observe when we
calculate a realised volatility.

Realised volatility is measuring how much the price moved in a
period. These high-frequency effects are going to give the impression
of more movement than what the ‘true’ volatility is. So we end up
seeing our measurement of volatility explode as the time scale we use
gets smaller and smaller. Calculating the volatility using 1-second
intervals gives a larger value than if we used 1 minute
intervals. This means that our volatility estimation depends on the
time scale used, so what is the ‘real’ volatility?

We aren’t interested in working out what the real volatility
is. Instead, we want to build a model for a price that displays this
volatility scaling effect.

The Hawkes Process Model for a Price

How do we use Hawkes processes to build a model that will have this microstructure noise?

Lets call the price at time $t$, $X(t)$ and guess that it moves by
summing up the positive jumps $N_1(t)$ and negative jumps $N_2(t)$
that also happen at time $t$.

\[X(t) = N_1(t) – N_2(t)\]

When do these jumps occur? How are these jumps distributed? This is
where we use the Hawkes process.

A Hawkes process is a self-exciting point process. When something
happens it increases the probability of another event happening. This
is the self-exciting behaviour we want. Every time there is a positive
jump, there is an increase in the probability of a negative jump
happening and, likewise, when there is a negative jump there is a greater
probability of a positive jump.

Hawkes demonstration
Each jump causes the probability of the other jump happening like in
this picture

When someone buys and pushes the price higher by removing that
liquidity, it’s more likely that someone will now sell at the new
higher price introducing some downward pressure. This is mean
reversion where prices move higher and then outside forces
push it lower and vice versa.

With Hawkes processes there are three parameters:

The background rate or when the jumps randomly occur. This is common to both positive and negative jumps.
$\kappa$ – the ‘force’ that pushes and increases the probability of the other jump happening.
The kernel $g(t)$ dictates how long the force lasts. This is an exponential decay with parameter $\beta$.

Hawkes parameters demonstration

We will fit a Hawkes process to a price series
to infer the 3 parameters that describe how the jumps behave. This
model will hopefully replicate the ‘microstructure noise’ effects we
see in practise.

Futures Trade Data

In the early days of my Ph.D., I answered an email that was advertising
for early grad students to do some prop trading. As part of the
interview, they gave me some data and asked me to write a simple
moving cross-over strategy. I failed miserably as I never
heard back from them. But I did get some nice data, which now that I’m
older and wiser, recognise as reported trades from a futures
exchange. This is the data we will be using today to calculate the
mode and luckily it’s similar to the data they use in the original
paper.

using CSV
using DataFrames, DataFramesMeta
using Plots
using Dates
using Statistics

All the usual packages when working with data in Julia.

rawData = CSV.read("fgbl__BNH14_clean.csv", DataFrame, header=false)
rename!(rawData, [:UnixTime, :Price, :Volume, :DateTime]);
first(rawData, 5)

5 rows × 4 columns

	UnixTime	Price	Volume	DateTime
	Int64	Float64	Int64	String
1	1378908794086	136.9	1	09/11/201315:13:14.086
2	1378974046854	137.25	5	09/12/201309:20:46.854
3	1378990110771	137.55	1	09/12/201313:48:30.771
4	1378998136894	137.7	1	09/12/201316:02:16.894
5	1378999992561	137.55	1	09/12/201316:33:12.561

To clean the data we convert the Unix timestamp to an actual DateTime object and pull out the hour and the date of the trade.

cleanData = @transform(rawData, DateTimeClean = DateTime.(:DateTime, dateformat"mm/dd/yyyyHH:MM:SS.sss"), 
                                DateTimeUnix = unix2datetime.(:UnixTime ./ 1000) )
cleanData = @transform(cleanData, Date = Date.(:DateTimeUnix),
                                  Hour = hour.(:DateTimeUnix));
first(cleanData[:,[:UnixTime, :DateTimeClean, :DateTimeUnix]], 5)

5 rows × 3 columns

	UnixTime	DateTimeClean	DateTimeUnix
	Int64	DateTime	DateTime
1	1378908794086	2013-09-11T15:13:14.086	2013-09-11T14:13:14.086
2	1378974046854	2013-09-12T09:20:46.854	2013-09-12T08:20:46.854
3	1378990110771	2013-09-12T13:48:30.771	2013-09-12T12:48:30.771
4	1378998136894	2013-09-12T16:02:16.894	2013-09-12T15:02:16.894
5	1378999992561	2013-09-12T16:33:12.561	2013-09-12T15:33:12.561

To get an idea of the data we are looking at I aggregate the total
number of trades and total volume of the trades over each day and plot
that as a time series.

dayData = groupby(cleanData, :Date)
dailyVolumes = @combine(dayData, TotalVolume = sum(:Volume),
                                  TotalTrades = length(:Volume),
                                  FirstTradeTime = minimum(:DateTimeUnix))

xticks = minimum(dailyVolumes.Date):Month(2):maximum(dailyVolumes.Date)
xticks_labels = Dates.format.(xticks, "yyyy-mm")

p1 = plot(dailyVolumes.Date, dailyVolumes.TotalVolume, seriestype=:scatter, label="Daily Volume", legend = :topleft, xticks = (xticks, xticks_labels))
p2 = plot(dailyVolumes.Date, dailyVolumes.TotalTrades, seriestype=:scatter, label= "Daily Number of Trades", legend = :topleft, xticks = (xticks, xticks_labels))
plot(p1, p2, fmt=:png)

Daily futures volume

It takes a while for the trading to take off in this future
contract. This is where it slowly becomes the front-month contract and
then is the most active.

Also, because trading doesn’t cross over the daylight saving dates, we don’t have to worry about timezones. Always a bonus!

What about if we look at what hour is the most active?

hourDataG = groupby(cleanData, [:Date, :Hour])
hourDataS = @combine(hourDataG, TotalHourVolume = sum(:Volume),
                                  TotalHourTrades = length(:Volume))
hourDataS = leftjoin(hourDataS, dailyVolumes, on=:Date)

hourDataS = @transform(hourDataS, FracVolume = :TotalHourVolume ./ :TotalVolume)

hourDataG = groupby(hourDataS, :Hour)
hourDataS = @combine(hourDataG, MeanFracVolume = mean(:FracVolume),
                                 MedianFracVolume = median(:FracVolume))
sort!(hourDataS, :Hour)
bar(hourDataS.Hour, hourDataS.MedianFracVolume * 100, title = "Fraction of Daily Volume", label=:none, fmt=:png)

$Hourly volume fraction of a futures contract.$

Early in the morning (just after the exchange opens) and late afternoon (when the Americans start trading) is where there is the most activity.

For our analysis, we are going to be focused on the hours 14, 15, 16
to make sure that we have the most active period and this is the same
as what the original paper did, took a subset of the day.

How to Calculate the Volatility Signature

Let $X(t)$ be the price of the future at time $t$. The signature is the quadratic variation over a window of $[0, T]$, which is more commonly known as the realised volatility:

\[C(\tau) = \frac{1}{T} \Sigma _{n=0} ^{T/\tau} \mid X((n+1) \tau) – X(n \tau) \mid ^2 .\]

$\tau$ is our sampling frequency, say every minute, etc.

To calculate the volatility across the trades we have to pay particular attention to the fact that these trades are irregularly spaced, so we need to fill forward the price for every $t$ value.

function get_price(t::Number, prices, times)
    ind = min(searchsortedfirst(times, t), length(times))
    sp = ind == 0 ? 0 : prices[ind]
end

function get_price(t::Array{<:Number}, prices, times)
    res = Array{Float64}(undef, length(t))
    for i in eachindex(t)
        res[i] = get_price(t[i], prices, times)
    end
    res
end

Our get_price function here will return the last price before time $t$.

To calculate the signature value we chose a $\tau$ value, generate
the indexes between 0 and $T$ using a $\tau$ step size. Pull the
price at those times and calculate the quadratic variation. Again, we
add a method to calculate the signature for different $\tau$’s.

function signature(tau::Number, x, t, maxT)
    inds = collect(0:tau:maxT)
    prices = get_price(inds, x, t)
    
    rets = prices[2:end] .- prices[1:(end-1)]
    (1/maxT) * sum(abs.(rets) .^ 2)
end

function signature(tau::Array{<:Number}, x, t, maxT)
   res = Array{Float64}(undef, length(tau))
    for i in eachindex(res)
        res[i] = signature(tau[i], x, t, maxT)
    end
    res
end

Now let’s apply this to the data. We are only interested when the
future was actively trading and in the hours between 14:00 and 16:00.
We convert the times into seconds since 15:59:59 and calculate the
signature for all the dates, before taking the final average.

We are taking the log price of the last trade to represent our actual
$X(t)$. We just look at 2014 dates too as that is when the future is
active.

cleanData2014 = @subset(cleanData, :Date .>= Date("2014-01-01"))

uniqueDates = unique(cleanData2014.Date)

eventList = Array{Array{Float64}}(undef, length(uniqueDates))
priceList = Array{Array{Float64}}(undef, length(uniqueDates))

signatureList = Array{Array{Float64}}(undef, length(uniqueDates))
avgSignature = zeros(length(1:1:200))

for (i, dt) in enumerate(uniqueDates)
   
    subData = @subset(cleanData2014, :Date .== dt, :Hour .<= 16, :Hour .>= 14)
    eventList[i] = getfield.(subData.DateTimeClean .- (DateTime(dt) + Hour(14) - Second(1)), :value) ./ 1e3
    priceList[i] = subData.Price
    
    signatureList[i] = signature(collect(1:1:200), log.(priceList[i]), eventList[i] .+ rand(length(eventList[i])), 3*60*60 + 1)
    avgSignature .+= signatureList[i]
end

avgSignature = avgSignature ./ length(eventList);

To plot the signature we take the average across all the dates and
then normalised by the $\tau = 60$ value.

plot(avgSignature / avgSignature[60], seriestype=:scatter, 
    label = "Average Signature", xlabel = "Tau", ylabel = "Realised Volatility (normalised)", fmt=:png)

Realised Volatility Signature

This is an interesting plot with big consequences in high-frequency finance.

This explosion in realised volatility at small timescales ($\tau
\rightarrow 0$) comes from microstructure noise. If prices evolved
as Brownian motion, the above plot would be flat for all timescales so
the above result contradicts lots of classical finance assumptions.

Practically, this is a pain if we are trying to measure the currently
volatility, it depends on the timescale we are looking at, there isn’t
one true volatility using the normal methods. Instead, we need to be
aware of these microstructure effects as we use a finer $\tau$ over
which to calculate the volatility.

This is where the Hawkes model comes in. If we assume the price,
$X(t)$ moves as stated in Equation (), can we produce a similar signature plot?

The Theoretical Signature Under a Hawkes Process

After doing some maths (you can read the paper for the full details), we arrive at the following equation for the theoretical signature.

If both $N_1$ and $N_2$ are realisations of Hawkes processes with
parameters $\mu, \kappa$ and $g(t) = \beta \exp (-\beta t)$ then their intensity can be written as

\[C(\tau) = \Lambda \left( k ^2 + (1 – k ^2) \frac{1 – e ^{-\gamma \tau}}{ \gamma \tau} \right),\]

where

$\Lambda = \frac{2 \mu}{1 – \kappa}, k = \frac{1}{1 + \kappa}, \gamma = \beta (\kappa + 1)$.

These are from the paper and adjusted based on my parameterisation of the Hawkes process. This gives us our theo_signature function.

function theo_signature(tau, bg, kappa, kern)
    Lambda = 2*bg/(1-kappa)
    k = 1/(1 + kappa)
    gamma = kern*(kappa + 1)
    @. Lambda * (k^2 + (1-k^2) * (1 - exp(-gamma * tau)) / (gamma*tau))
end

Calibrating the Hawkes Process Model

We now move on to fitting the Hawkes process to the data. I’ll be using
a new method that takes a different approach to my package HawkesProcesses.jl.

We have a theoretical volatility signature from a Hawkes process
(theo_signature) and the above plot of what the actual signature
looks like it. Therefore, it is just a case of optimising over a loss
function to find the best fitting parameters. I’ll use root mean
square error as my loss function and simply use the Optim.jl package
to perform the minimisation.

signatureRMSE(x, sig) = sqrt(
    mean(
        (sig .- theo_signature(1:200, x[1], x[2], x[3])).^2
        )
)

using Optim
optRes = optimize(x->signatureRMSE(x, avgSignature/avgSignature[10]), rand(3))

paramEst = Optim.minimizer(optRes)

3-element Vector{Float64}:
24402236592012655
7417867072115396
19569169443936185

These are the three parameters of the Hawkes process which appear
sensible.

plot(avgSignature, label="Observed", seriestype=:scatter)
plot!(avgSignature[10]*theo_signature(1:200, paramEst[1], paramEst[2], paramEst[3]), label="Theoretical", lw=3, xlabel = "Tau", ylabel = "Realised Variance", fmt=:png)

Theoretical vs Observed Signature

So a nice match-up between the theoretical signature and what we
observed. This gives some weight to the Hawkes model as a
representation of the price process.

Interpreting the Hawkes Parameters

Our $\kappa$ value of 0.75 shows there is a large amount of
excitement with each price jump. a $\beta$ value of 0.2 shows that this
mean reversion lasts for about 5 seconds. So if we see an uptick in
the price, we expect a downtick with a rough half-life of five
seconds. The opposite is also true, a downtick likely leads to an
uptick 5 seconds later.

Conclusion

Microstructure noise shows up when we start calculating volatility on
a high-frequency timescale. We have shown that it is a real effect
using some futures data and then built a Hawkes model to try and
reproduce this effect. We managed to get the right shape of the
volatility signature and found that was quite a bit of mean reversion
between the up and downticks that lasted around 5 seconds.

What’s next? In a future post, I will introduce another dimension and
show and there can also be correlation across assets under a similar
method to reproduce another high-frequency phenomenon. This will be
based on the same paper and show you how we can start looking at the
correlation between two assets and how that changes at high-frequency
time scales.

juliabloggers.com

A Julia Language Blog Aggregator