Author Archives: Joshua Miller

Quick Baseball Series Simulations With Julia

Listening to game three of the National League Divisional Series
between the Washington Nationals and Los Angeles Dodgers today on
the way to the airport, I heard a startingly weighty statistic: The
team that wins game three of a five-game series wins 77% of series!
That's one pivotal game!

After mulling over that for a second, I thought to myself, "Hmm, I
wonder if that's significantly different from what you'd expect even
with each game being a coin flip. After all, it's not like the winner
of any game in a series is only 50% to win the whole best-of-five."
And with Gogo internet on the flight in typical fine form, I had
plenty of extra time to jump into a quick Julia REPL and find out.

We can simulate a 5-game series between teams 1 and 0 like:

julia> (round(Int, rand(1,5)))
1x5 Array{Int64,2}:
 1  1  1  1  0

In real life they don't play games 4 and 5, but there's less effort
making the machine calculate those dead rubber games than real
major league playoff games.

Here's a function to see whether the winner of game three won
the series:

julia> function gameThreeWonSeries(series)
            winner = sum(series) > 2 ? 1 : 0
            series[3] == winner
       end

So now let's just run that simulation 100,000 times and see what happens.

julia> wonGameThree = Array{Int}(10000)
100000-element Array{Int64,1}:
...

julia> for i = 1:100000
            wonGameThree[i] = gameThreeWonSeries(round(Int, rand(1,5))) ? 1 : 0
       end

julia> mean(wonGameThree)
0.68857

So even if every game were a simple coin flip, the team that won
game three would win about 69% of series. And of course, the games
aren't exactly coin flips — one of the teams is probably a
little bit better than the other, skewing that number even higher.

A nice thought experiment realized while flying without internet.
What I failed to figure out is how my laptop could chat with a
ground-based customer service rep to troubleshoot the broken
airborne Wi-Fi.

Multivariate Gaussian Distributions in Julia

I recently worked my way through the latest incarnation of Andrew
Ng's Machine Learning Course on
Coursera
.
One particular method stood out to me as having some potential
real-life applications for me: anomaly detection with multivariate
Gaussian
distributions
.

The class is conducted in Octave, which I find a little annoying to
deal with as a language, so to play around with some data, I wanted to
convert the procedure to Julia. Not having
found anything prebuilt from quick Googling, I translated it myself,
and leave it here for anyone who'd like to refer to it later. The
mathematical formulas I'm working from are in the above-referenced
Wikipedia article, at "Non-degenerate
case"
.
The set of data I'll represent as X.

First we calculate our μ, just the mean of each column in X:

mu = mean(X, 1)

Having found our mean, we can now calculate the covariance matrix:

function covariance(x, mu)
  n = length(x[:,1])
  (x .- mu)' * (x .- mu) ./ n
end

sigma = covariance(X, mu)

And having both μ and Σ, we can calculate the probability for any
vector x against our distribution:

function probability(mu, sigma, x)
  n = length(mu)
  exp(-0.5 * sum((x .- mu)' .* inv(sigma) .* (x .- mu))) *
  (2 * pi) ^ (-0.5 * n) *
  det(sigma) ^ -0.5
end

A Tiny Julia Webservice

One benefit of using Julia for playing with
statistics is that it's a modern, usable language for general purposes
as well. Earlier I built a k-Nearest Neighbor classifier for
predicting crime from the data on
hbg-crime.org, so I thought it would be fun to
build a web service in Julia that could return those results.

using Morsel
using Datetime
import JSON
require("knn_classification")

app = Morsel.app()

Morsel is a small
Sinatra-like web framework for Julia that we're going to be using.

function jsonResponse(res, data)
    res.headers["Content-Type"] = "application/json"
    JSON.json(data)
end

Here's a quick function to handle building a JSON response out of a
hash; we just set the Content-Type header on the response in place,
and then write out our response data as JSON.

function parseNearestParams(params)
    lat = parsefloat(Float64, get(params, "lat", "0"))
    lon = parsefloat(Float64, get(params, "lon", "0"))
    k = parseint(get(params, "k", "7"))
    timestring = get(params, "time", nothing)
    time = isdefined(:timestring) ? Datetime.datetime("yyyy-MM-ddTHH:mm:ss", timestring) : Datetime.now()
    lat, lon, k, time
end

This is how we're going to parse the params for a request, which will
look like this: /nearest?lat=40.27258&lon=-76.87438 For lat and
lon we just call parsefloat on the string we receive; for k (how
many nearest neighbors to use) we use parseint, and for time we
either parse a string we receive or use the current time.

get(app, "/nearest") do req, res
    lat, lon, k, time = parseNearestParams(req.state[:url_params])
    if (lat == 0 || lon == 0)
        jsonResponse(res, {:error => "You must supply lat and lon parameters."})
    else
        result = nearest(lat, lon, time, k)
        jsonResponse(res, {"beware-of" => result})
    end
end

Here's our only route. We are pulling the params out of the
req.state hash, and either rendering an error if lat and lon
aren't specified, or returning the result of our prediction.

start(app, 8000)

And then we start the app; once the kNN engine parses the reports data
and is ready to go, you'll see a ready message from Morsel and you can
start making requests to http://localhost:8000/nearest.