k-Nearest Neighbor Classification for Harrisburg Crimes

Update:

The general form of this code remains correct, but several things have
changed with Julia 0.3+ and DataFrames 0.5+; most notably that DataFrames
are now indexed by symbols instead of strings and that the fancy DataFrames
metaprogramming functions have been pulled out into their own very-alpha
package. Updated code is available on this experiment's github repo.

One thing I've wanted to be able to do with the
hbg-crime.org data set is to start emulating
those IBM "Smarter Planet" commercials where someone with an English
accent tells you they can help police departments predict where crime
happens. Two months' worth of data (give or take) is probably not
going to give us the confidence we would want to dispatch squad cars,
but it's a fun experiment.

I'm going to use the k-Nearest Neighbors algorithm to take a given
location, time of day, and day of the week, find the reported crimes that
most closely match those criteria, and pick a winner from among them. The
'k' is the number of neighbors we're going to pick from the sorted list.
If you want to go over how it works, I found that the Data Mining Map had
the clearest explanation of k-Nearest Neighbor, for me at least.
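To make those steps concrete before wiring them into the DataFrame, here's
a minimal, self-contained sketch of the idea on plain vectors (toy data and
names, nothing from the real data set): compute a distance to every row,
sort, keep the k closest, and let them vote.

# Toy k-NN sketch: rows is a collection of feature vectors, labels the
# matching classes, query a feature vector, and k how many neighbors vote.
function knn_sketch(rows, labels, query, k)
    dists = [sum((row .- query).^2) for row in rows]  # squared distance to each row
    order = sortperm(dists)                           # indices from nearest to farthest
    counts = Dict{Any,Int}()
    for i in order[1:k]                               # tally the labels of the k nearest
        counts[labels[i]] = get(counts, labels[i], 0) + 1
    end
    winner, best = labels[order[1]], 0
    for (label, c) in counts
        if c > best
            winner, best = label, c
        end
    end
    winner
end

Calling it with a handful of toy rows where the two nearest neighbors are
both thefts returns "theft", which is all we'll be doing below, just with
real columns.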

Let's start by loading up the libraries we need and importing the
data:

using DataFrames
using Datetime

reports = readtable("reports.csv")

We don't need the start time or neighborhood, and let's only work with
data where we've successfully geocoded the report:

delete!(reports, ["Start", "Neighborhood"])
complete_cases!(reports)

Since we are going to be comparing distances between columns with
different units (degrees of latitude/longitude; day of the week;
time), we need to standardize each column on a 0.0-1.0 scale.

function standardize(x::Float64, max, min)
    (x - min) / (max - min)
end

function standardize(m::DataArray{Float64,1}, max, min)
    (m .- min) ./ (max - min)
end

Note that we have two versions of the standardize function, one that
applies to just a scalar value, and one that works on a DataArray,
the underlying structure in a DataFrame. When we call standardize
on a vector, it'll use Julia's vectorized arithmetic and apply to the
whole vector. Neat.
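Just to illustrate the dispatch (the bounds here are made-up numbers, not
values from the data):

standardize(40.27, 40.33, 40.22)          # scalar method: one Float64 in, one Float64 out
standardize(reports["Lat"], 40.33, 40.22) # DataArray method: rescales the whole column at once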

So we need the max and min latitudes and longitudes for the
standardize function:

max_lat = maximum(reports["Lat"])
min_lat = minimum(reports["Lat"])
max_lon = maximum(reports["Lon"])
min_lon = minimum(reports["Lon"])

If we hadn't done complete_cases! earlier, we would need to do
maximum(removeNA(reports["Lat"])) or we'd have just gotten NA (the
null result) back.
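In other words, if the column still had missing values in it, the
difference would look like this:

maximum(reports["Lat"])            # NA propagates, so this would just be NA
maximum(removeNA(reports["Lat"]))  # drop the NAs first, then take the maximum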

Now we'll create standard latitude and longitude columns:

@transform(reports, StandardLat => standardize(Lat, max_lat, min_lat))
@transform(reports, StandardLon => standardize(Lon, max_lon, min_lon))

Check the results:

julia> reports["StandardLat"]
1130-element DataArray{Float64,1}:
 0.194104
 0.170785
 0.279784
 0.103708
 0.213908
 0.188516
 0.163927
 ⋮
 0.153612
 0.165426
 0.24092 
 0.159025

That's what we wanted.

So next we'll convert the column of strings representing report dates
and times into a column of DateTimes:

function parsedate(string::String)
    Datetime.datetime("yyyy-MM-ddTHH:mm:ss", string)
end
@vectorize_1arg String parsedate
@transform(reports, End => parsedate(End))

The @vectorize_1arg macro is pretty cool: it takes any one-argument
function and a type to operate on, and adds a method that accepts a vector
of that type and applies the function to each element of the vector.
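Roughly speaking, it saves us from hand-writing a vectorized method
ourselves; something like this (a simplification, not the macro's actual
expansion):

# More or less what @vectorize_1arg gives us: a method that maps
# parsedate over every element of the array.
function parsedate(v::AbstractArray)
    [parsedate(s) for s in v]
end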

Next we need to do the same standardization process for the time of day.
Rather than going from 0.0 at midnight to 1.0 at the next midnight, which
would make 11:59 pm and 12:01 am as far apart from each other as possible,
I standardized on "distance from noon," so that noon is 0.0, and 11:59 pm
and 12:01 am are both about 1.0.

function standardize_minutes(d::DateTime)
    total_min = (Datetime.hour(d) * 60) + Datetime.minutes(d)
    dist_to_noon = abs(720 - total_min)
    dist_to_noon / 720
end
@vectorize_1arg Any standardize_minutes
@transform(reports, StandardMin => standardize_minutes(End))
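To sanity-check that scaling without building any DateTimes, here's the
same arithmetic on plain hours and minutes (a hypothetical helper, just
for illustration):

# Mirrors standardize_minutes on raw hour/minute values.
noon_distance(hour, minute) = abs(720 - (hour * 60 + minute)) / 720

noon_distance(12, 0)    # noon     -> 0.0
noon_distance(6, 0)     # 6:00 am  -> 0.5
noon_distance(0, 1)     # 12:01 am -> ~0.999
noon_distance(23, 59)   # 11:59 pm -> ~0.999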

Only one more column to standardize — day of the week:

function standardize_day(d::DateTime)
    (Datetime.dayofweek(d) - 1) / 6
end
@vectorize_1arg Any standardize_day
@transform(reports, StandardDay => standardize_day(End))
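Since the code subtracts 1 and divides by 6, it's assuming dayofweek runs
from 1 through 7, which maps the week onto seven evenly spaced values:

# Assuming Datetime.dayofweek returns 1 through 7, the scaled values
# step evenly from 0.0 to 1.0 across the week.
[(d - 1) / 6 for d in 1:7]   # 0.0, 0.167, 0.333, 0.5, 0.667, 0.833, 1.0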

One thing to notice in this and the previous function is that I had to
use Any as the type for @vectorize_1arg. I believe this is because
the column in the DataFrame is untyped, despite having nothing but
DateTimes in it. I think I should be able to set a type on a column,
but no luck so far.

[UPDATE: This will be fixed in a future version of Julia, see this gist]

So the function for computing the (squared) Euclidean distance between one
point and another is simple enough; we can skip the square root, since it
doesn't change which neighbors end up closest:

function eucl_dist(lat1, lon1, min1, day1, lat2, lon2, min2, day2)
    (lat2 - lat1)^2 +
    (lon2 - lon1)^2 +
    (min2 - min1)^2 +
    (day2 - day1)^2
end

I'd like to express that in terms of two general vectors in such a way
that it would work in the following function, but I couldn't quite wrap my
head around it. Speaking of which, the following function is the last one:
take a current latitude, longitude, time, and k (i.e., how many neighbors
to compare), and return an estimate of the likeliest crime to occur:

function nearest(lat, lon, time, k)
    std_lat = standardize(lat, max_lat, min_lat)
    std_lon = standardize(lon, max_lon, min_lon)
    std_min = standardize_minutes(time)
    std_day = standardize_day(time)
    reports["Distances"] = [eucl_dist(std_lat, std_lon, std_min, std_day, lat, lon, min, day) for (lat,lon,min,day)=zip(reports["StandardLat"], reports["StandardLon"], reports["StandardMin"], reports["StandardDay"])]
    sorted = sortby(reports, "Distances")
    last(sortby(by(sorted[1:k, ["Description"]], "Description", nrow), "x1")[:1])
end

First we standardize the location and time arguments on the same scale
as the data. Then we create a column "Distances" in the DataFrame
where each element is the distance between our arguments and that row.
To do that, I'm using a list comprehension where the arguments come from
zipping the four standardized columns into a sequence of 4-tuples. I'd
like to get the eucl_dist function to take vectors in such a way that I
could @transform the DataFrame like the standardization functions above,
but I haven't figured that out yet.
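That said, one direction that seems plausible is to lean on the same
elementwise operators the standardize function uses, so the whole
Distances column comes out in one shot. This is only a sketch (not wired
into @transform, and not tested against the real columns):

# Sketch: an elementwise eucl_dist over whole columns. The std_* arguments
# are scalars, the rest are the standardized DataArray columns, and the
# result is a column of squared distances.
function eucl_dist_cols(std_lat, std_lon, std_min, std_day, lats, lons, mins, days)
    (lats .- std_lat).^2 .+
    (lons .- std_lon).^2 .+
    (mins .- std_min).^2 .+
    (days .- std_day).^2
end

# which would replace the comprehension inside nearest with something like:
# reports["Distances"] = eucl_dist_cols(std_lat, std_lon, std_min, std_day,
#     reports["StandardLat"], reports["StandardLon"],
#     reports["StandardMin"], reports["StandardDay"])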

Last, we sort the DataFrame by distances, take the closest k rows,
group by "Description" and count, then take the winner, all in one
run-on line of code.
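For readability's sake, that run-on line unrolls into the following named
steps (essentially the same calls, and the same k as inside nearest,
nothing new):

sorted  = sortby(reports, "Distances")       # rank every report by distance to the query
closest = sorted[1:k, ["Description"]]       # keep only the k nearest rows
tallied = by(closest, "Description", nrow)   # count rows per description (the count lands in "x1")
ranked  = sortby(tallied, "x1")              # sort ascending by that count
last(ranked["Description"])                  # the most common description wins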

The results:

julia> println(nearest(40.2821445, -76.8804254, Datetime.now(), 7))
BURGLARY-ENTERS STRUCTURE W/NO PERSON PRESENT

Lock your doors!