#MonthOfJulia Day 26: Statistics

By: Andrew Collier

Re-posted from: http://www.exegetic.biz/blog/2015/10/monthofjulia-day-26-statistics/


JuliaStats is a meta-project which consolidates various packages related to statistics and machine learning in Julia. It’s well worth taking a look if you plan on working in this domain.

julia> x = rand(10);
julia> mean(x)
0.5287191472784906
julia> std(x)
0.2885446536178459

Julia already has some built-in support for statistical operations, so additional packages are not strictly necessary. However, they do increase the scope and ease of possible operations (as we’ll see below). Let’s kick off by loading all the packages that we’ll be looking at today.

julia> using StatsBase, StatsFuns, StreamStats

StatsBase

As the package name implies, StatsBase provides support for basic statistical operations in Julia. Its documentation is available online.

High level summary statistics are generated by summarystats().

julia> summarystats(x)
Summary Stats:
Mean:         0.528719
Minimum:      0.064803
1st Quartile: 0.317819
Median:       0.529662
3rd Quartile: 0.649787
Maximum:      0.974760

Weighted versions of the mean, variance and standard deviation are implemented. There are also geometric and harmonic means.

julia> w = WeightVec(rand(1:10, 10));      # A weight vector.
julia> mean(x, w)                          # Weighted mean.
0.48819933297961043
julia> var(x, w)                           # Weighted variance.
0.08303843715334995
julia> std(x, w)                           # Weighted standard deviation.
0.2881639067498738
julia> skewness(x, w)
0.11688162715805048
julia> kurtosis(x, w)
-0.9210456851144664
julia> mean_and_std(x, w)
(0.48819933297961043,0.2881639067498738)
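
To make the weighted statistics a little less magical, here is a minimal sketch in plain Julia. It is not how StatsBase implements them: the helper names (weighted_mean, weighted_var, weighted_std) are made up for this example, and the variance is normalised by the total weight with no bias correction, which is only one of several conventions.

```julia
# Plain-Julia sketch of weighted summary statistics (not StatsBase's implementation).
# The helper names below are hypothetical, chosen for this example only.

# Weighted mean: sum of w[i] * x[i] divided by the total weight.
weighted_mean(x, w) = sum(w .* x) / sum(w)

# Weighted variance about the weighted mean, normalised by the total weight.
# Assumption: no bias correction is applied; other conventions exist.
function weighted_var(x, w)
    μ = weighted_mean(x, w)
    return sum(w .* (x .- μ) .^ 2) / sum(w)
end

# Weighted standard deviation is the square root of the weighted variance.
weighted_std(x, w) = sqrt(weighted_var(x, w))

x = [1.0, 2.0, 3.0, 4.0]
w = [1, 1, 1, 5]                 # the last observation counts five times

println(weighted_mean(x, w))     # pulled towards 4.0 by the heavy weight
println(weighted_std(x, w))
```

With equal weights these helpers reduce to the ordinary mean and the uncorrected variance.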

There’s a weighted median as well as functions for calculating quantiles.

julia> median(x)                           # Median.
0.5296622773635412
julia> median(x, w)                        # Weighted median.
0.5729104703595038
julia> quantile(x)
5-element Array{Float64,1}:
 0.0648032
 0.317819 
 0.529662 
 0.649787 
 0.97476  
julia> nquantile(x, 8)
9-element Array{Float64,1}:
 0.0648032
 0.256172 
 0.317819 
 0.465001 
 0.529662 
 0.60472  
 0.649787 
 0.893513 
 0.97476  
julia> iqr(x)                              # Inter-quartile range.
0.3319677541313941
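
The quantile helpers above can all be expressed in terms of quantile() evaluated at chosen probabilities. The sketch below assumes a recent Julia 1.x, where quantile() lives in the Statistics standard library. The data and variable names are invented for illustration.

```julia
using Statistics   # standard library (assumes Julia 1.x)

# Illustrative data: nine evenly spaced values in shuffled order.
x = [0.1, 0.4, 0.2, 0.9, 0.7, 0.5, 0.3, 0.8, 0.6]

# Five-number summary: minimum, quartiles and maximum.
five = quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])

# Octiles: quantiles at n + 1 equally spaced probabilities, here with n = 8.
n = 8
octiles = quantile(x, (0:n) ./ n)

# Inter-quartile range: distance between the third and first quartiles.
spread = quantile(x, 0.75) - quantile(x, 0.25)

println(five)
println(octiles)
println(spread)
```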

Sampling from a population is also catered for, with a range of algorithms which can be applied to the sampling procedure.

julia> sample(collect('a':'z'), 5)         # Sampling (with replacement).
5-element Array{Char,1}:
 'w'
 'x'
 'e'
 'e'
 'o'
julia> wsample(['T', 'F'], [5, 1], 10)     # Weighted sampling (with replacement).
10-element Array{Char,1}:
 'F'
 'T'
 'T'
 'T'
 'F'
 'T'
 'T'
 'T'
 'T'
 'T'
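
One simple way to think about weighted sampling is to walk along the cumulative weights until a uniform random number is passed. The sketch below does exactly that in plain Julia. The helper weighted_draw is hypothetical, and this is not necessarily the algorithm that StatsBase uses.

```julia
# Hypothetical helper: draw one item with probability proportional to its weight.
# A sketch of cumulative-weight sampling, not StatsBase's implementation.
function weighted_draw(items, weights)
    u = rand() * sum(weights)      # uniform point on [0, total weight)
    acc = zero(u)
    for (item, wt) in zip(items, weights)
        acc += wt                  # running cumulative weight
        u < acc && return item
    end
    return last(items)             # guard against floating-point round-off
end

# Ten draws with replacement; 'T' is five times as likely as 'F'.
draws = [weighted_draw(['T', 'F'], [5, 1]) for _ in 1:10]
println(draws)
```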

There’s also functionality for empirical estimation of distributions from histograms and a range of other interesting and useful goodies.

StatsFuns

The StatsFuns package provides constants and functions for statistical computing. The constants are by no means essential but certainly very handy. Take, for example, twoπ and sqrt2.

There are some mildly exotic mathematical functions available like logistic, logit and softmax.

julia> logistic(-5)
0.0066928509242848554
julia> logistic(5)
0.9933071490757153
julia> logit(0.25)
-1.0986122886681098
julia> logit(0.75)
1.0986122886681096
julia> softmax([1, 3, 2, 5, 3])
5-element Array{Float64,1}:
 0.0136809
 0.101089 
 0.0371886
 0.746952 
 0.101089 
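
If these functions are unfamiliar, their textbook definitions are short enough to write out by hand. The versions below are illustrative only (the my_ prefix marks them as hypothetical); StatsFuns ships its own, more careful implementations.

```julia
# Hypothetical re-implementations for illustration; prefer the StatsFuns versions.

# Logistic function: maps the real line onto the interval (0, 1).
my_logistic(x) = 1 / (1 + exp(-x))

# Logit: the inverse of the logistic function (log-odds of a probability).
my_logit(p) = log(p / (1 - p))

# Softmax: exponentiate and normalise so that the result sums to 1.
# Subtracting the maximum first avoids overflow without changing the result.
function my_softmax(v)
    e = exp.(v .- maximum(v))
    return e ./ sum(e)
end

println(my_logistic(0))
println(my_logit(my_logistic(2.0)))     # round trip, up to floating-point error
println(my_softmax([1, 3, 2, 5, 3]))
```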

StatsFuns also provides a suite of functions relating to various statistical distributions. The functions for the Normal distribution are illustrated below, but there are equivalent functions for the Beta, Binomial, Gamma and Hypergeometric distributions and many others. The function naming convention is consistent across all distributions.

julia> normpdf(0);                         # PDF
julia> normlogpdf(0);                      # log PDF
julia> normcdf(0);                         # CDF
julia> normccdf(0);                        # Complementary CDF
julia> normlogcdf(0);                      # log CDF
julia> normlogccdf(0);                     # log Complementary CDF
julia> norminvcdf(0.5);                    # inverse-CDF
julia> norminvccdf(0.99);                  # inverse-Complementary CDF
julia> norminvlogcdf(-0.693147180559945);  # inverse-log CDF
julia> norminvlogccdf(-0.693147180559945); # inverse-log Complementary CDF
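
Why bother with separate log variants? A quick sketch for the standard Normal density suggests the reason: far out in the tails the density underflows to zero, whereas the log density can be computed directly. The two functions below are hand-written from the textbook formula and their names are hypothetical. They are not the StatsFuns implementations.

```julia
# Hypothetical plain-Julia versions of the standard Normal density and its log.
std_normpdf(z) = exp(-z^2 / 2) / sqrt(2π)
std_normlogpdf(z) = -(z^2 + log(2π)) / 2

println(std_normpdf(0.0))          # density at the mode
println(log(std_normpdf(40.0)))    # density underflows to 0.0, so the log is -Inf
println(std_normlogpdf(40.0))      # finite, computed directly on the log scale
```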

StreamStats

Finally, the StreamStats package supports calculating online statistics for a stream of data which is being continuously updated.

julia> average = StreamStats.Mean()
Online Mean
 * Mean: 0.000000
 * N:    0
julia> variance = StreamStats.Var()
Online Variance
 * Variance: NaN
 * N:        0
julia> for x in rand(10)
           update!(average, x)
           update!(variance, x)
           @printf("x = %.6f: mean = %.3f | variance = %.3f\n", x, state(average),
                                                                   state(variance))
       end
x = 0.928564: mean = 0.929 | variance = NaN
x = 0.087779: mean = 0.508 | variance = 0.353
x = 0.253300: mean = 0.423 | variance = 0.198
x = 0.778306: mean = 0.512 | variance = 0.164
x = 0.566764: mean = 0.523 | variance = 0.123
x = 0.812629: mean = 0.571 | variance = 0.113
x = 0.760074: mean = 0.598 | variance = 0.099
x = 0.328495: mean = 0.564 | variance = 0.094
x = 0.303542: mean = 0.535 | variance = 0.090
x = 0.492716: mean = 0.531 | variance = 0.080

In addition to the mean and variance illustrated above, the package also supports online versions of min() and max(), and can be used to generate incremental confidence intervals for Bernoulli and Poisson processes.
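
To get a feel for how online statistics can work, here is a minimal sketch of Welford’s algorithm, a common single-pass method for a running mean and variance. I am not claiming this is what StreamStats does internally. The RunningStats type and its update! and variance functions are hypothetical names, and the code assumes Julia 1.x.

```julia
using Printf   # standard library (assumes Julia 1.x)

# Hypothetical running-statistics type (not the StreamStats implementation).
mutable struct RunningStats
    n::Int          # number of observations seen so far
    mean::Float64   # running mean
    m2::Float64     # running sum of squared deviations from the mean
end

# Start with no observations.
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's update: fold one observation into the running mean and m2.
function update!(s::RunningStats, x::Real)
    s.n += 1
    δ = x - s.mean
    s.mean += δ / s.n
    s.m2 += δ * (x - s.mean)
    return s
end

# Sample variance; NaN until at least two observations have arrived.
variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningStats()
for x in [0.9, 0.1, 0.3, 0.8]
    update!(s, x)
    @printf("x = %.1f: mean = %.3f | variance = %.3f\n", x, s.mean, variance(s))
end
```

Each update touches only three numbers, so the data never need to be stored, which is the whole point of an online statistic.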

That’s it for today. Check out the full code on GitHub.

The post #MonthOfJulia Day 26: Statistics appeared first on Exegetic Analytics.