By: Joshua Miller

Re-posted from: http://increasinglyfunctional.com/2014/01/08/knn-classification-for-harrisburg-crimes/

## Update:

The general form of this code remains correct, but several things have changed with Julia 0.3+ and DataFrames 0.5+; most notably, DataFrames are now indexed by symbols instead of strings, and the fancy DataFrames metaprogramming functions have been pulled out into their own very-alpha package. Updated code is available on this experiment's github repo.

One thing I've wanted to be able to do with the hbg-crime.org data set is to start emulating those IBM "Smarter Planet" commercials where someone with an English accent tells you they can help police departments predict where crime happens. Two months' worth of data (give or take) is probably not going to give us the confidence we would want to dispatch squad cars, but it's a fun experiment.

I'm going to use the k-Nearest Neighbors algorithm to take a given location, time of day, and day of the week, find the reported crimes that most closely match those criteria, and pick a winner from among them. The 'k' is the number of neighbors we draw from the sorted list. If you want to go over how it works, I found that the Data Mining Map had the clearest explanation of k-Nearest Neighbors, for me at least.
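Before diving into the Julia, here's the whole algorithm in miniature, as a Python sketch with made-up toy data (the names and points are mine, purely for illustration): compute a distance from the query to every labeled point, sort, keep the closest k, and let them vote.

```python
from collections import Counter

def knn_predict(query, points, labels, k):
    """Classify `query` by majority vote among its k nearest points.

    `points` are feature tuples on a common scale; distance is squared
    Euclidean, which preserves the nearest-neighbor ordering."""
    dists = [(sum((q - p) ** 2 for q, p in zip(query, pt)), label)
             for pt, label in zip(points, labels)]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters of "crimes" in a 2-D feature space.
points = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9), (0.8, 0.85)]
labels = ["THEFT", "THEFT", "BURGLARY", "BURGLARY", "BURGLARY"]
print(knn_predict((0.87, 0.82), points, labels, 3))  # -> BURGLARY
```

The rest of the post builds exactly these pieces (standardized features, a distance function, sort-and-vote) on the real data.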

Let's start by loading up the libraries we need and importing the data:

```
using DataFrames
using Datetime
reports = readtable("reports.csv")
```

We don't need the start time or neighborhood, and let's only work with data where we've successfully geocoded the report:

```
delete!(reports, ["Start", "Neighborhood"])
complete_cases!(reports)
```

Since we are going to be comparing distances between columns with different units (degrees of latitude/longitude, day of the week, time), we need to standardize each column on a 0.0-1.0 scale.

```
function standardize(x::Float64, max, min)
    (x - min) / (max - min)
end

function standardize(m::DataArray{Float64,1}, max, min)
    (m .- min) ./ (max - min)
end
```

Note that we have two versions of the `standardize` function: one that applies to just a scalar value, and one that works on a `DataArray`, the underlying structure in a `DataFrame`. When we call `standardize` on a vector, it'll use Julia's vectorized arithmetic and apply to the whole vector. Neat.
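Python has no built-in multiple dispatch, but `functools.singledispatch` (which dispatches on the first argument's type) can mimic the scalar/vector pair, just to illustrate the idea; the names here are mine:

```python
from functools import singledispatch

@singledispatch
def standardize(x, max_, min_):
    # Scalar case: map x onto a 0.0-1.0 scale.
    return (x - min_) / (max_ - min_)

@standardize.register(list)
def _(xs, max_, min_):
    # Vector case: apply the same formula elementwise.
    return [(x - min_) / (max_ - min_) for x in xs]

print(standardize(5.0, 10.0, 0.0))               # -> 0.5
print(standardize([0.0, 5.0, 10.0], 10.0, 0.0))  # -> [0.0, 0.5, 1.0]
```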

So we need the max and min latitudes and longitudes for the `standardize` function:

```
max_lat = maximum(reports["Lat"])
min_lat = minimum(reports["Lat"])
max_lon = maximum(reports["Lon"])
min_lon = minimum(reports["Lon"])
```

If we hadn't done `complete_cases!` earlier, we would need to do `maximum(removeNA(reports["Lat"]))` or we'd have just gotten `NA` (the null result) back.

Now we'll create standard latitude and longitude columns:

```
@transform(reports, StandardLat => standardize(Lat, max_lat, min_lat))
@transform(reports, StandardLon => standardize(Lon, max_lon, min_lon))
```

Check the results:

```
julia> reports["StandardLat"]
1130-element DataArray{Float64,1}:
0.194104
0.170785
0.279784
0.103708
0.213908
0.188516
0.163927
⋮
0.153612
0.165426
0.24092
0.159025
```

That's what we wanted.

So next we'll convert the column of strings representing report dates and times into a column of `DateTime`s:

```
function parsedate(string::String)
    Datetime.datetime("yyyy-MM-ddTHH:mm:ss", string)
end
@vectorize_1arg String parsedate
@transform(reports, End => parsedate(End))
```

The `@vectorize_1arg` macro is pretty cool: it takes any function of one argument, plus a type to operate on, and creates a new function that takes a vector of that type and applies the original function to each element in the vector.
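In spirit, the macro generates something like this Python sketch (the helper and example names are mine, not the macro's actual output):

```python
def vectorize_1arg(f):
    """Given a one-argument function, return a version that applies it
    to each element of a sequence -- roughly what @vectorize_1arg
    generates in Julia."""
    def vectorized(xs):
        return [f(x) for x in xs]
    return vectorized

# Example: lift a scalar string-parser to work over a whole column.
parse_year = vectorize_1arg(lambda s: int(s[:4]))
print(parse_year(["2014-01-08T12:00:00", "2013-12-31T23:59:00"]))  # -> [2014, 2013]
```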

Next we need to do the same standardization process for time of day. Rather than go 0.0 for midnight to 1.0 for the next midnight, which would put 11:59 pm and 12:01 am as far apart from each other as possible, I standardized on "distance from noon," so that noon is 0.0 and both 11:59 pm and 12:01 am are about 1.0.

```
function standardize_minutes(d::DateTime)
    total_min = (Datetime.hour(d) * 60) + Datetime.minutes(d)
    dist_to_noon = abs(720 - total_min)
    dist_to_noon / 720
end
@vectorize_1arg Any standardize_minutes
@transform(reports, StandardMin => standardize_minutes(End))
```
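A quick sanity check of that arithmetic, translated to Python (my translation, same formula: noon is minute 720 of 1440):

```python
def standardize_minutes(hour, minute):
    # Distance from noon (minute 720), scaled onto 0.0-1.0.
    total_min = hour * 60 + minute
    return abs(720 - total_min) / 720

print(standardize_minutes(12, 0))   # noon     -> 0.0
print(standardize_minutes(0, 1))    # 12:01 am -> 0.9986...
print(standardize_minutes(23, 59))  # 11:59 pm -> 0.9986...
```

As promised, 12:01 am and 11:59 pm land on the same value (719/720) instead of opposite ends of the scale.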

Only one more column to standardize — day of the week:

```
function standardize_day(d::DateTime)
    (Datetime.dayofweek(d) - 1) / 6
end
@vectorize_1arg Any standardize_day
@transform(reports, StandardDay => standardize_day(End))
```

One thing to notice in this and the previous function is that I had to use `Any` as the type for `@vectorize_1arg`. I believe this is because the column in the DataFrame is untyped, despite having nothing but `DateTime`s in it. I think I should be able to set a type on a column, but no luck so far.

[UPDATE: This will be fixed in a future version of Julia; see this gist.]

So the function for computing the Euclidean distance between one point and another is simple enough:

```
# Note: this is the squared Euclidean distance; skipping the square
# root is fine since we only use it to rank neighbors.
function eucl_dist(lat1, lon1, min1, day1, lat2, lon2, min2, day2)
    (lat2 - lat1)^2 +
    (lon2 - lon1)^2 +
    (min2 - min1)^2 +
    (day2 - day1)^2
end
```

I'd like to express that generally in terms of two vectors in such a way that it will work in the following function, but I couldn't quite wrap my head around it. Speaking of which, the following function is the last function: take a current latitude, longitude, time, and k (i.e., how many neighbors to compare), and return an estimate for the likeliest crime to occur:

```
function nearest(lat, lon, time, k)
    std_lat = standardize(lat, max_lat, min_lat)
    std_lon = standardize(lon, max_lon, min_lon)
    std_min = standardize_minutes(time)
    std_day = standardize_day(time)
    reports["Distances"] =
        [eucl_dist(std_lat, std_lon, std_min, std_day, lat, lon, min, day)
         for (lat, lon, min, day) = zip(reports["StandardLat"],
                                        reports["StandardLon"],
                                        reports["StandardMin"],
                                        reports["StandardDay"])]
    sorted = sortby(reports, "Distances")
    last(sortby(by(sorted[1:k, ["Description"]], "Description", nrow), "x1")[:1])
end
```

First we standardize the location and time arguments on the same scale as the data. Then we create a column "Distances" in the DataFrame where each element is the distance between our arguments and that row. To do that, I'm using a list comprehension where the arguments are a zipped-up 4-by-N matrix of the standardized columns. I'd like to get the `eucl_dist` function to take vectors in such a way that I could `@transform` the DataFrame like the standardization functions above, but I haven't figured that out yet.

Last, we sort the DataFrame by distances, take the closest k rows, group by "Description" and count, then take the winner, all in one run-on line of code.
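That run-on line, unpacked step by step as a Python sketch (hypothetical names and toy data; the Julia above is the real thing):

```python
from collections import Counter

def likeliest_crime(distances, descriptions, k):
    """Sort reports by distance, keep the closest k, and return the
    most common description among them -- the 'winner' step above."""
    ranked = sorted(zip(distances, descriptions), key=lambda pair: pair[0])
    top_k = [desc for _, desc in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]

distances = [0.4, 0.1, 0.3, 0.05, 0.2]
descriptions = ["THEFT", "BURGLARY", "BURGLARY", "BURGLARY", "THEFT"]
print(likeliest_crime(distances, descriptions, 3))  # -> BURGLARY
```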

The results:

```
julia> println(nearest(40.2821445, -76.8804254, Datetime.now(), 7))
BURGLARY-ENTERS STRUCTURE W/NO PERSON PRESENT
```

Lock your doors!