
#MonthOfJulia Day 32: Classification


Yesterday we had a look at Julia’s regression model capabilities. A natural counterpart to those is the family of models which perform classification. We’ll be looking at the GLM and DecisionTree packages. But before I move on to those, I should mention the MLBase package, which provides a load of functionality for data preprocessing, performance evaluation, cross-validation and model tuning.
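
As a small taste of MLBase, here’s a minimal sketch (using made-up labels, so purely illustrative) of two of those utilities: measuring classification accuracy and generating cross-validation folds.

julia> using MLBase
julia> gt = [1, 1, 2, 2, 2];                  # hypothetical ground truth labels
julia> pred = [1, 2, 2, 2, 2];                # hypothetical predicted labels
julia> correctrate(gt, pred)                  # proportion of correct predictions: 4 out of 5
0.8
julia> collect(Kfold(length(gt), 5));         # training index sets for 5-fold cross-validation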

Logistic Regression

Logistic regression lies on the border between the regression techniques we considered yesterday and the classification techniques we’re looking at today. In effect though it’s really a classification technique. We’ll use some data generated in yesterday’s post to illustrate. Specifically we’ll look at the relationship between the Boolean field valid and the three numeric fields.

julia> head(points)
6x4 DataFrame
| Row | x        | y       | z       | valid |
|-----|----------|---------|---------|-------|
| 1   | 0.867859 | 3.08688 | 6.03142 | false |
| 2   | 9.92178  | 33.4759 | 2.14742 | true  |
| 3   | 8.54372  | 32.2662 | 8.86289 | true  |
| 4   | 9.69646  | 35.5689 | 8.83644 | true  |
| 5   | 4.02686  | 12.4154 | 2.75854 | false |
| 6   | 6.89605  | 27.1884 | 6.10983 | true  |

To further refresh your memory, the plot below shows the relationship between valid and the variables x and y. We’re going to attempt to capture this relationship in our model.

[Figure: regression-synthetic-data]

Logistic regression is also carried out with the glm() function from the GLM package. The call looks very similar to the one used for linear regression, except that the error family is now Binomial() and we’re using a logit link function.

julia> model = glm(valid ~ x + y + z, points, Binomial(), LogitLink())
DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial,LogitLink},
                         DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
              Estimate Std.Error   z value Pr(>|z|)
(Intercept)   -23.1457   3.74348  -6.18295    <1e-9
x            -0.260122  0.269059 -0.966786   0.3337
y              1.36143  0.244123    5.5768    <1e-7
z             0.723107   0.14739   4.90606    <1e-6

According to the model there is a significant relationship between valid and both y and z but not x. Looking at the plot above we can see that x does have an influence on valid (there is a gradual transition from false to true with increasing x), but that this effect is rather “fuzzy”, hence the large p-value. By comparison there is a very clear and abrupt change in valid at y values of around 15. The effect of y is also about twice as strong as that of z. All of this makes sense in light of the way that the data were constructed.
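
We can also turn the fitted model into class predictions. Here’s a minimal sketch (in-sample only, and assuming that predict() on the fitted model returns the fitted probabilities, which we then threshold at 0.5):

julia> probabilities = predict(model);            # fitted probabilities from the logistic model
julia> predicted = probabilities .> 0.5;          # classify as valid where the probability exceeds 0.5
julia> mean(predicted .== points[:valid])         # in-sample proportion classified correctly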

Decision Trees

Now we’ll look at another classification technique: decision trees. First load the required packages and then grab the iris data.

julia> using MLBase, DecisionTree
julia> using RDatasets, Distributions
julia> iris = dataset("datasets", "iris");
julia> iris[1:5,:]
5x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species  |
|-----|-------------|------------|-------------|------------|----------|
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | "setosa" |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | "setosa" |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | "setosa" |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | "setosa" |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | "setosa" |

We’ll also define a Boolean variable to split the data into training and testing sets.

julia> train = rand(Bernoulli(0.75), nrow(iris)) .== 1;

We split the data into features and labels and then feed those into build_tree(). In this case we are building a classifier to identify whether or not a particular iris is of the versicolor variety.

julia> features = array(iris[:,1:4]);
julia> labels = [n == "versicolor" ? 1 : 0 for n in iris[:Species]];
julia> model = build_tree(labels[train], features[train,:]);

Let’s have a look at the product of our labours.

julia> print_tree(model)
Feature 3, Threshold 3.0
L-> 0 : 36/36
R-> Feature 3, Threshold 4.8
    L-> Feature 4, Threshold 1.7
        L-> 1 : 38/38
        R-> 0 : 1/1
    R-> Feature 3, Threshold 5.1
        L-> Feature 1, Threshold 6.7
            L-> Feature 2, Threshold 3.2
                L-> Feature 4, Threshold 1.8
                    L-> Feature 1, Threshold 6.3
                        L-> 0 : 1/1
                        R-> 1 : 1/1
                    R-> 0 : 5/5
                R-> 1 : 1/1
            R-> 1 : 2/2
        R-> 0 : 29/29

The textual representation of the tree above breaks the decision process down into a series of branches: at each node the model moves to the left (L) or right (R) branch according to whether the value of a given feature lies below or above a threshold. So, for example, on the third line of the output we must decide whether to move to the left or the right depending on whether feature 3 (PetalLength) is less than or greater than 4.8.
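
To make that concrete, here’s a sketch of pushing a single (hypothetical) set of measurements through the tree above: a PetalLength of 4.7 goes right at the first split (not less than 3.0), left at the second (less than 4.8), and a PetalWidth of 1.4 goes left at the third (less than 1.7), ending at the leaf labelled 1 (versicolor).

julia> apply_tree(model, [6.0, 2.9, 4.7, 1.4])    # SepalLength, SepalWidth, PetalLength, PetalWidth
1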

We can then apply the decision tree model to the testing data and see how well it performs using standard metrics.

julia> predictions = apply_tree(model, features[!train,:]);
julia> ROC = roc(labels[!train], convert(Array{Int32,1}, predictions))
ROCNums{Int64}
  p = 8
  n = 28
  tp = 7
  tn = 28
  fp = 0
  fn = 1
julia> precision(ROC)
1.0
julia> recall(ROC)
0.875

A recall (true positive rate) of tp / (tp + fn) = 7/8 = 87.5% and a true negative rate of tn / (tn + fp) = 28/28 = 100% are not too bad at all!

You can find a more extensive introduction to using decision trees with Julia here. The DecisionTree package also implements random forest and boosted (AdaBoost) models, and there are a number of other related packages which are definitely worth checking out if you have the time; a quick sketch of the forest and boosting interfaces follows.
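
The exact argument lists have shifted between versions of DecisionTree, so treat this as a sketch rather than gospel: build_forest() takes the number of features to consider at each split and the number of trees, while build_adaboost_stumps() takes the number of boosting iterations.

julia> forest = build_forest(labels[train], features[train,:], 2, 50);       # 2 features per split, 50 trees
julia> forest_predictions = apply_forest(forest, features[!train,:]);
julia> stumps, coefficients = build_adaboost_stumps(labels[train], features[train,:], 10);   # 10 boosting rounds
julia> boost_predictions = apply_adaboost_stumps(stumps, coefficients, features[!train,:]);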

My time is up though. Come back soon to hear about what Julia provides for evolutionary programming.


#MonthOfJulia Day 31: Regression


Today we’ll be looking at two packages for regression analyses in Julia: GLM and GLMNet. Let’s get both of those loaded so that we can begin.

julia> using GLM, GLMNet

Next we’ll create a synthetic data set which we’ll use for illustration purposes.

julia> using Distributions, DataFrames
julia> points = DataFrame();
julia> points[:x] = rand(Uniform(0.0, 10.0), 500);
julia> points[:y] = 2 + 3 * points[:x] + rand(Normal(1.0, 3.0) , 500);
julia> points[:z] = rand(Uniform(0.0, 10.0), 500);
julia> points[:valid] = 2 * points[:y] + points[:z] + rand(Normal(0.0, 3.0), 500) .> 35;
julia> head(points)
6x4 DataFrame
| Row | x        | y       | z       | valid |
|-----|----------|---------|---------|-------|
| 1   | 0.867859 | 3.08688 | 6.03142 | false |
| 2   | 9.92178  | 33.4759 | 2.14742 | true  |
| 3   | 8.54372  | 32.2662 | 8.86289 | true  |
| 4   | 9.69646  | 35.5689 | 8.83644 | true  |
| 5   | 4.02686  | 12.4154 | 2.75854 | false |
| 6   | 6.89605  | 27.1884 | 6.10983 | true  |

By design there is a linear relationship between the x and y fields. We can extract that relationship from the data using glm().

julia> model = glm(y ~ x, points, Normal(), IdentityLink())
DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal,IdentityLink},
                         DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   2.69863  0.265687 10.1572   <1e-23
x             2.99845 0.0474285 63.2204   <1e-99

The third and fourth arguments to glm() stipulate that we are applying simple linear regression where we expect the residuals to have a Normal distribution. The parameter estimates are close to what was expected, taking into account the additive noise introduced into the data. The call to glm() seems rather verbose for something as simple as linear regression and, consequently, there is a shortcut lm() which gets the same result with less fuss.
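
For example, here’s a minimal sketch of the shortcut, which should produce the same coefficient estimates as the glm() call above:

julia> model = lm(y ~ x, points);            # equivalent simple linear regression via the lm() shortcut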

Using the result of glm() we can directly access the estimated coefficients along with their standard errors and the associated covariance matrix.

julia> coef(model)
2-element Array{Float64,1}:
 2.69863
 2.99845
julia> stderr(model)
2-element Array{Float64,1}:
 0.265687 
 0.0474285
julia> vcov(model)
2x2 Array{Float64,2}:
  0.0705897  -0.0107768 
 -0.0107768   0.00224947

The data along with the linear regression fit are shown below.
[Figure: regression-synthetic-data]

Let’s move on to the GLMNet package, which implements linear models fitted by penalised maximum likelihood. We’ll use the Boston housing data from R’s MASS package for illustration.

julia> using RDatasets
julia> boston = dataset("MASS", "Boston");
julia> X = array(boston[:,1:13]);
julia> y = array(boston[:,14]);			# Median value of houses in units of $1000

Running glmnet() fits models for a range of values of the regularisation parameter, λ.

julia> path = glmnet(X, y);

The result is a set of 76 different models. We’ll have a look at the intercepts and coefficients for the first ten models (which correspond to the largest values of λ). The coefficients are held in the betas field, which is an array with a column for each model and a row for each coefficient. Since the first few models are strongly penalised, each has only a few non-zero coefficients.

julia> path.a0[1:10]
10-element Array{Float64,1}:
 22.5328
 23.6007
 23.6726
 21.4465
 19.4206
 17.5746
 15.8927
 14.3602
 12.9638
 12.5562
julia> path.betas[:,1:10]
13x10 Array{Float64,2}:
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.127841  0.569442  0.971462  1.33777   1.67153   1.97564   2.25274  2.47954  
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0     -0.040168
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0 -0.0843998 -0.153581 -0.196981 -0.236547 -0.272599 -0.305447 -0.335377 -0.36264 -0.384493 
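
The values of λ themselves are stored alongside the coefficients (assuming here that the field is called lambda in your version of GLMNet):

julia> path.lambda[1:10]             # the (decreasing) regularisation parameters for the first ten models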

Now that we’ve got a bundle of models, how do we choose among them? Cross-validation, of course!

julia> path = glmnetcv(X, y)
Least Squares GLMNet Cross Validation
76 models for 13 predictors in 10 folds
Best λ 0.028 (mean loss 24.161, std 3.019)

We find that the best results (on the basis of loss) were achieved when λ had a value of 0.028, which is relatively weak regularisation. We’ll put the parameters of the corresponding model neatly in a data frame.

julia> DataFrame(variable = names(boston)[1:13],
                 beta = path.path.betas[:,indmin(path.meanloss)])
13x2 DataFrame
| Row | variable | beta       |
|-----|----------|------------|
| 1   | Crim     | -0.0983463 |
| 2   | Zn       | 0.0414416  |
| 3   | Indus    | 0.0        |
| 4   | Chas     | 2.68519    |
| 5   | NOx      | -16.3066   |
| 6   | Rm       | 3.86694    |
| 7   | Age      | 0.0        |
| 8   | Dis      | -1.39602   |
| 9   | Rad      | 0.252687   |
| 10  | Tax      | -0.0098268 |
| 11  | PTRatio  | -0.929989  |
| 12  | Black    | 0.00902588 |
| 13  | LStat    | -0.5225    |

From the fit coefficients we can conclude, for example, that average house value increases with the number of rooms in the house (Rm) but decreases with nitrogen oxides concentration (NOx), which is a proxy for traffic intensity.
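
To generate fitted values from the selected model we can apply its intercept and coefficients directly. Here’s a minimal sketch using the same a0 and betas fields we looked at earlier (the package also provides prediction functionality, but the manual version makes the linear model explicit):

julia> best = indmin(path.meanloss);                                  # index of the model with the lowest CV loss
julia> fitted = path.path.a0[best] .+ X * path.path.betas[:,best];    # fitted median house values
julia> cor(fitted, y)                                                 # how closely the fitted values track the data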

Whew! That was exhilarating but exhausting. As a footnote, please have a look at the thesis “RegTools: A Julia Package for Assisting Regression Analysis” by Muzhou Liang. The RegTools package is available here. As always, the full code for today is available on github. Next time we’ll look at classification models. Below are a couple of pertinent videos to keep you busy in the meantime.



#MonthOfJulia Day 30: Clustering


Today we’re going to look at the Clustering package, the documentation for which can be found here. As usual, the first step is loading the package.

julia> using Clustering

We’ll use the RDatasets package to select the xclara data and rename the columns in the resulting data frame.

julia> using RDatasets
julia> xclara = dataset("cluster", "xclara");
julia> names!(xclara, [symbol(i) for i in ["x", "y"]]);

Using Gadfly to generate a plot we can clearly see that there are three well defined clusters in the data.
[Figure: xclara-clusters]

Next we need to transform the data into an Array and then transpose it so that each point lies in a separate column (the Clustering and Distances packages treat each column as a separate observation when calculating distances).

julia> xclara = convert(Array, xclara);
julia> xclara = xclara';

Before we can run the clustering algorithm we need to identify seed points which act as the starting locations for clusters. There are a number of options for doing this. We’re simply going to choose three points in the data at random. How did we arrive at three starting points (as opposed to, say, six)? Well, in this case it was simply visual inspection: there appear to be three clear clusters in the data. When the data are more complicated (or have higher dimensionality) then choosing the number of clusters becomes a little more tricky.

julia> initseeds(:rand, xclara, 3)
3-element Array{Int64,1}:
 2858
  980
 2800

Now we’re ready to run the clustering algorithm. We’ll start with k-means clustering.

julia> xclara_kmeans = kmeans(xclara, 3);

A quick plot will confirm that it has recognised the three clusters that we intuitively identified in the data.
[Figure: xclara-clusters-colour]
We can have a look at the cluster centers, the number of points assigned to each cluster and (a subset of) the cluster assignments.

julia> xclara_kmeans.centers
2x3 Array{Float64,2}:
  9.47805   69.9242  40.6836
 10.6861   -10.1196  59.7159
julia> xclara_kmeans.counts
3-element Array{Int64,1}:
  899
  952
 1149
julia> xclara_kmeans.assignments[1:10]
10-element Array{Int64,1}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
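
Coming back to the question of how many clusters to use: when the answer isn’t obvious from a plot, a common heuristic is the “elbow” method, where we run k-means for a range of cluster counts and look for the point at which the total within-cluster cost stops dropping sharply. A rough sketch (assuming the result exposes a totalcost field):

julia> [kmeans(xclara, k).totalcost for k in 2:6]        # total within-cluster cost for k = 2, ..., 6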

The k-means algorithm is limited to using the Euclidean metric to calculate the distance between points. An alternative, k-medoids clustering, is also supported in the Clustering package. The kmedoids() function accepts a distance matrix (from an arbitrary metric) as its first argument, allowing for a far greater degree of flexibility.
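
As a sketch of what that flexibility looks like (the choice of metric and the result field shown here are illustrative), we could cluster on Manhattan rather than Euclidean distances:

julia> using Distances
julia> dmanhattan = pairwise(Cityblock(), xclara);       # pairwise Manhattan distances between points
julia> xclara_kmedoids = kmedoids(dmanhattan, 3);
julia> xclara_kmedoids.medoids                           # indices of the points chosen as cluster centres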

The final algorithm implemented by Clustering is DBSCAN, which is a density based clustering algorithm. In addition to a distance matrix, dbscan() also requires a neighbourhood radius and the minimum number of neighbouring points needed to form a dense region.

julia> using Distances
julia> dclara = pairwise(SqEuclidean(), xclara);
julia> xclara_dbscan = dbscan(dclara, 10, 40);

As is apparent from the plot below, DBSCAN results in a dramatically different set of clusters. The “loosely” packed points on the periphery of each of the three clusters we identified before have now been lumped together into a single cluster. Only the high density cores of these clusters are now separately identified.
[Figure: xclara-clusters-dbscan]
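
We can also confirm this numerically by inspecting the result object itself (a sketch, assuming your version of the package exposes a counts field on the DBSCAN result):

julia> length(xclara_dbscan.counts)          # number of clusters identified by DBSCAN
julia> xclara_dbscan.counts                  # number of points assigned to each of those clusters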

That’s it for the moment about clusters. The full code for today can be found on github. Tomorrow we’ll take a look at regression. In the meantime, take a few minutes to watch the video below about using Julia’s clustering capabilities for climate classification.
