
#MonthOfJulia Day 29: Distances


Today we’ll be looking at the Distances package, which implements a range of distance metrics. This might seem a rather obscure topic, but distance calculation is at the core of all clustering techniques (which are next on the agenda), so it’s prudent to know a little about how they work.

Note that there is a Distance package as well (singular!), which was deprecated in favour of the Distances package. So please install and load the latter.
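
If you haven't installed it yet, that's a one-liner (the standard package manager call for the Julia versions used throughout this post):

julia> Pkg.add("Distances")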

julia> using Distances

We’ll start by finding the distance between a pair of vectors.

julia> x = [1., 2., 3.];
julia> y = [-1., 3., 5.];

A simple application of Pythagoras' Theorem will tell you that the Euclidean distance between the tips of those vectors is 3. We can confirm our maths with Julia though. The general form of a distance calculation uses evaluate(), where the first argument is a distance type. Common distance metrics (like the Euclidean distance) also come with convenience functions.

julia> evaluate(Euclidean(), x, y)
3.0
julia> euclidean(x, y)
3.0

We can just as easily calculate other metrics like the city block (or Manhattan), cosine or Chebyshev distances.

julia> evaluate(Cityblock(), x, y)
5.0
julia> cityblock(x, y)
5.0
julia> evaluate(CosineDist(), x, y)
0.09649209709474871
julia> evaluate(Chebyshev(), x, y)
2.0

Moving on to distances between the columns of matrices, we'll again define a pair of matrices for illustration.

julia> X = [0 1; 0 2; 0 3];
julia> Y = [1 -1; 1 3; 1 5];

With colwise(), distances are calculated between corresponding columns in the two matrices. If one of the arguments has only a single column (see the Chebyshev() example below), then the distance is calculated between that column and every column in the other matrix.

julia> colwise(Euclidean(), X, Y)
2-element Array{Float64,1}:
 1.73205
 3.0    
julia> colwise(Hamming(), X, Y)
2-element Array{Int64,1}:
 3
 3
julia> colwise(Chebyshev(), X[:,1], Y)
2-element Array{Float64,1}:
 1.0
 5.0

We also have the option of using pairwise(), which gives the distances between all pairs of columns from the two matrices. This is precisely the distance matrix that we would use for a cluster analysis.

julia> pairwise(Euclidean(), X, Y)
2x2 Array{Float64,2}:
 1.73205  5.91608
 2.23607  3.0    
julia> pairwise(Euclidean(), X)
2x2 Array{Float64,2}:
 0.0      3.74166
 3.74166  0.0    
julia> pairwise(Mahalanobis(eye(3)), X, Y)     # Effectively just the Euclidean metric
2x2 Array{Float64,2}:
 1.73205  5.91608
 2.23607  3.0    
julia> pairwise(WeightedEuclidean([1.0, 2.0, 3.0]), X, Y)
2x2 Array{Float64,2}:
 2.44949  9.69536
 3.74166  4.24264

As you might have observed from the last example above, it’s also possible to calculate weighted versions of some of the metrics.
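
For example, the convenience function weuclidean() accepts a weight vector with one entry per dimension. This snippet isn't from the original post, but it reproduces the bottom-right entry of the weighted pairwise result above, since x and y are the second columns of X and Y:

julia> weuclidean(x, y, [1.0, 2.0, 3.0])   # sqrt(1*4 + 2*1 + 3*4)
4.242640687119285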

Finally a less contrived example. We’ll look at the distances between observations in the iris data set. We first need to extract only the numeric component of each record and then transpose the resulting matrix so that observations become columns (rather than rows).

julia> using RDatasets
julia> iris = dataset("datasets", "iris");
julia> iris = convert(Array, iris[:,1:4]);
julia> iris = transpose(iris);
julia> dist_iris = pairwise(Euclidean(), iris);
julia> dist_iris[1:5,1:5]
5x5 Array{Float64,2}:
 0.0       0.538516  0.509902  0.648074  0.141421
 0.538516  0.0       0.3       0.331662  0.608276
 0.509902  0.3       0.0       0.244949  0.509902
 0.648074  0.331662  0.244949  0.0       0.648074
 0.141421  0.608276  0.509902  0.648074  0.0   

The full distance matrix is illustrated below as a heatmap generated using Plotly. Note the clearly defined blocks for each of the iris species: setosa, versicolor and virginica.

Distance Matrix for Iris Data
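
The Plotly code isn't reproduced here, but if you just want a quick look at the same matrix, Gadfly's spy() function (assuming you have Gadfly installed) will give you a serviceable heatmap:

julia> using Gadfly
julia> spy(dist_iris)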

Tomorrow we’ll be back to look at clustering in Julia.


#MonthOfJulia Day 28: Hypothesis Tests


It’s all very well generating myriad statistics characterising your data. How do you know whether or not those statistics are telling you something interesting? Hypothesis Tests. To that end, we’ll be looking at the HypothesisTests package today.

The first (small) hurdle is loading the package.

julia> using HypothesisTests

That wasn’t too bad. Next we’ll assemble some synthetic data.

julia> using Distributions
julia> srand(357)
julia> x1 = rand(Normal(), 1000);
julia> x2 = rand(Normal(0.5, 1), 1000);
julia> x3 = rand(Binomial(100, 0.25), 1000);   # 25% success rate on samples of size 100
julia> x4 = rand(Binomial(50, 0.50), 1000);    # 50% success rate on samples of size 50
julia> x5 = rand(Bernoulli(0.25), 100) .== 1;

We'll apply a one-sample t-test to x1 and x2. The output below indicates that x2 has a mean which differs significantly from zero, while x1 does not. This is consistent with our expectations based on the way that these data were generated. I'm impressed by the level of detail in the output from OneSampleTTest(): different aspects of the test are neatly broken down into sections (population details, test summary and details) and there is an automated high-level interpretation of the test results.

julia> t1 = OneSampleTTest(x1)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -0.013027816861268473
    95% confidence interval: (-0.07587776077157478,0.04982212704903784)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.6842692696393744 (not significant)

Details:
    number of observations:   1000
    t-statistic:              -0.40676289562651996
    degrees of freedom:       999
    empirical standard error: 0.03202803648352013
julia> t2 = OneSampleTTest(x2)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.5078522467069418
    95% confidence interval: (0.44682036100064954,0.5688841324132342)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           2.6256160116367554e-53 (extremely significant)

Details:
    number of observations:   1000
    t-statistic:              16.328833826939398
    degrees of freedom:       999
    empirical standard error: 0.031101562554276502

Using pvalue() we can further interrogate the p-values generated by these tests. The values reported in the output above are for the two-sided test, but we can look specifically at values associated with either the left- or right tails of the distribution. This makes the outcome of the test a lot more specific.

julia> pvalue(t1)
0.6842692696393744
julia> pvalue(t2)
2.6256160116367554e-53
julia> pvalue(t2, tail = :left)            # Not significant.
1.0
julia> pvalue(t2, tail = :right)           # Very significant indeed!
1.3128080058183777e-53

The associated confidence intervals are also readily accessible. We can choose between two-sided or left/right one-sided intervals as well as change the significance level.

julia> ci(t2, tail = :both)                # Two-sided 95% confidence interval by default
(0.44682036100064954,0.5688841324132342)
julia> ci(t2, tail = :left)                # One-sided 95% confidence interval (left)
(-Inf,0.5590572480083876)
julia> ci(t2, 0.01, tail = :right)         # One-sided 99% confidence interval (right)
(0.43538291818831604,Inf)

As a second (and final) example we'll look at BinomialTest(). There are various ways to call this function. First, without looking at any particular data, we'll check whether 25 successes in 100 trials is inconsistent with a 25% success rate (obviously it isn't and, as a result, we fail to reject this hypothesis).

julia> BinomialTest(25, 100, 0.25)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.25
    point estimate:          0.25
    95% confidence interval: (0.16877973809934183,0.3465524957588082)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           1.0 (not significant)

Details:
    number of observations: 100
    number of successes:    25

Next we’ll see whether the Bernoulli samples in x5 provide contradictory evidence to an assumed 50% success rate (based on the way that x5 was generated we are not surprised to find an infinitesimal p-value and the hypothesis is soundly rejected).

julia> BinomialTest(x5, 0.5)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.18
    95% confidence interval: (0.11031122915326055,0.26947708596681197)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           6.147806615048005e-11 (extremely significant)

Details:
    number of observations: 100
    number of successes:    18

There are a number of other tests available in this package, including a range of non-parametric tests which I have not even mentioned above (though one is sketched below). HypothesisTests should cover most of the bases for statistical inference. For more information, read the extensive documentation and check out the sample code on github for further examples.
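
As an illustration (this snippet isn't from the original post, but MannWhitneyUTest() is part of the same package), a Mann-Whitney U test comparing x1 and x2 emphatically confirms that they differ in location:

julia> pvalue(MannWhitneyUTest(x1, x2)) < 0.05
true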



#MonthOfJulia Day 27: Distributions


Today I’m looking at the Distributions package. Let’s get things rolling by loading it up.

julia> using Distributions

There’s some overlap between the functionality in Distributions and what we saw yesterday in the StatsFuns package. So, instead of looking at functions to evaluate various aspects of PDFs and CDFs, we’ll focus on sampling from distributions and calculating summary statistics.

Julia has native support for sampling from a uniform distribution. We’ve seen this before, but here’s a reminder.

julia> srand(359)                          # Set random number seed.
julia> rand()                              # Random number on [0, 1)
0.4770241944535658

What if you need to generate samples from a more exotic distribution? The Normal distribution, although not particularly exotic, seems like a natural place to start. The Distributions package exposes a type for each supported distribution. For the Normal distribution the type is appropriately named Normal. It’s derived from Distribution with characteristics Univariate and Continuous.

julia> super(Normal)
Distribution{Univariate,Continuous}
julia> names(Normal)
2-element Array{Symbol,1}:
 :μ
 :σ

The constructor accepts two parameters: mean (μ) and standard deviation (σ). We’ll instantiate a Normal object with mean 1 and standard deviation 3.

julia> d1 = Normal(1.0, 3.0)
Normal(μ=1.0, σ=3.0)
julia> params(d1)
(1.0,3.0)
julia> d1.μ
1.0
julia> d1.σ
3.0

Thanks to the wonders of multiple dispatch we are then able to generate samples from this object with the rand() method.

julia> x = rand(d1, 1000);

We’ll use Gadfly to generate a histogram to validate that the samples are reasonable. They look pretty good.
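
The plotting code didn't appear in the original post, but a minimal Gadfly incantation along these lines does the job:

julia> using Gadfly
julia> plot(x = x, Geom.histogram)
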
Histogram of Samples from Normal(μ=1, σ=3)

There are functions like pdf(), cdf(), logpdf() and logcdf() which allow the density and cumulative distribution functions of our distribution object to be evaluated at particular points.
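
For example (these calls aren't in the original post, but they're standard Distributions methods):

julia> pdf(d1, 1.0)                        # Density at the mean
0.13298076013381092
julia> cdf(d1, 1.0)                        # Half of the mass lies below the mean
0.5

We're moving on to truncating a portion of the distribution, leaving a Truncated distribution object.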

julia> d2 = Truncated(d1, -4.0, 6.0);

Again we can use Gadfly to get an idea of what this looks like. This time we’ll plot the actual PDF rather than a histogram of samples.
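
The original plotting code isn't shown, but evaluating pdf() over a range is all it takes; something like this sketch, using Gadfly's function-plotting form:

julia> plot(x -> pdf(d2, x), -6.0, 8.0)
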
PDF of the Truncated Normal Distribution

The Distributions package implements an extensive selection of other continuous distributions, like Exponential, Poisson, Gamma and Weibull. The basic interface for each of these is consistent with what we’ve seen for Normal above, although there are some methods which are specific to some distributions.
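
The summary statistics mentioned at the start of this post are part of that common interface too. These calls aren't in the original post, but mean(), std() and quantile() all have methods for distribution objects:

julia> mean(d1), std(d1)
(1.0,3.0)
julia> quantile(d1, 0.975)                 # Upper limit of a central 95% interval
6.879891953620162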

Let’s look at a discrete distribution, using a Bernoulli distribution with success rate of 25% as an example.

julia> d4 = Bernoulli(0.25)
Bernoulli(p=0.25)
julia> rand(d4, 10)
10-element Array{Int64,1}:
 0
 1
 0
 1
 1
 0
 0
 0
 1
 0

What about a Binomial distribution? Suppose that we have a success rate of 25% per trial and want to sample the number of successes in a batch of 100 trials.

julia> d5 = Binomial(100, 0.25)
Binomial(n=100, p=0.25)
julia> rand(d5, 10)
10-element Array{Int64,1}:
 22
 21
 30
 23
 28
 25
 26
 26
 28
 21

Finally let’s look at an example of fitting a distribution to a collection of samples using Maximum Likelihood.

julia> x = rand(d1, 10000);
julia> fit(Normal, x)
Normal(μ=1.0015796782177036, σ=3.033914550184868)

Yup, those values are in pretty good agreement with the mean and standard deviation we specified for our Normal object originally.

That's it for today. There's more to the Distributions package though. Check out my github repository to see other examples which didn't make it into today's post.

