By: Andrew Collier

Re-posted from: http://www.exegetic.biz/blog/2015/10/monthofjulia-day-26-statistics/

JuliaStats is a meta-project which consolidates various packages related to statistics and machine learning in Julia. Well worth taking a look if you plan on working in this domain.

    julia> x = rand(10);

    julia> mean(x)
    0.5287191472784906

    julia> std(x)
    0.2885446536178459

Julia already has some built-in support for statistical operations, so additional packages are not strictly necessary. However, they do increase the scope and ease of possible operations (as we’ll see below). Let’s kick off by loading all the packages that we’ll be looking at today.

    julia> using StatsBase, StatsFuns, StreamStats

## StatsBase

StatsBase has its own documentation. As the package name implies, it provides support for basic statistical operations in Julia.

High-level summary statistics are generated by `summarystats()`.

    julia> summarystats(x)
    Summary Stats:
    Mean:         0.528719
    Minimum:      0.064803
    1st Quartile: 0.317819
    Median:       0.529662
    3rd Quartile: 0.649787
    Maximum:      0.974760

Weighted versions of the mean, variance and standard deviation are implemented. There are also geometric and harmonic means.

    julia> w = WeightVec(rand(1:10, 10));      # A weight vector.

    julia> mean(x, w)                          # Weighted mean.
    0.48819933297961043

    julia> var(x, w)                           # Weighted variance.
    0.08303843715334995

    julia> std(x, w)                           # Weighted standard deviation.
    0.2881639067498738

    julia> skewness(x, w)
    0.11688162715805048

    julia> kurtosis(x, w)
    -0.9210456851144664

    julia> mean_and_std(x, w)
    (0.48819933297961043,0.2881639067498738)
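The geometric and harmonic means are not shown in the session above, so here is a minimal sketch. It assumes a recent StatsBase release, where they are exposed as `geomean()` and `harmmean()`; the vector `v` is made up for illustration, and the last two lines just cross-check the results against the textbook definitions.

```julia
using Statistics, StatsBase

v = [1.0, 2.0, 4.0]            # illustrative data, not from the session above

g = geomean(v)                 # geometric mean
h = harmmean(v)                # harmonic mean

# Cross-check against the definitions.
g_def = exp(mean(log.(v)))     # exponential of the mean log
h_def = 1 / mean(1 ./ v)       # reciprocal of the mean reciprocal
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.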

There’s a weighted median as well as functions for calculating quantiles.

    julia> median(x)                           # Median.
    0.5296622773635412

    julia> median(x, w)                        # Weighted median.
    0.5729104703595038

    julia> quantile(x)
    5-element Array{Float64,1}:
     0.0648032
     0.317819
     0.529662
     0.649787
     0.97476

    julia> nquantile(x, 8)
    9-element Array{Float64,1}:
     0.0648032
     0.256172
     0.317819
     0.465001
     0.529662
     0.60472
     0.649787
     0.893513
     0.97476

    julia> iqr(x)                              # Inter-quartile range.
    0.3319677541313941

Sampling from a population is also catered for, with a range of algorithms which can be applied to the sampling procedure.

    julia> sample(collect('a':'z'), 5)         # Sampling (with replacement).
    5-element Array{Char,1}:
     'w'
     'x'
     'e'
     'e'
     'o'

    julia> wsample(['T', 'F'], [5, 1], 10)     # Weighted sampling (with replacement).
    10-element Array{Char,1}:
     'F'
     'T'
     'T'
     'T'
     'F'
     'T'
     'T'
     'T'
     'T'
     'T'
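Both calls above sample with replacement. As a rough sketch of the other options, and assuming a recent StatsBase release, the `replace` and `ordered` keyword arguments of `sample()` switch to sampling without replacement and preserve the population order. The variable names are just for illustration.

```julia
using StatsBase

letters = collect('a':'z')

# Without replacement: no letter can be drawn twice.
picked = sample(letters, 5, replace=false)

# Without replacement, keeping the order of the population.
in_order = sample(letters, 5, replace=false, ordered=true)

# Weighted sampling via a weight vector instead of wsample().
flips = sample(['T', 'F'], weights([5, 1]), 10)
```

The results are random, so no output is shown here.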

There’s also functionality for empirical estimation of distributions from histograms and a range of other interesting and useful goodies.
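As a small, hedged illustration of the empirical estimation tools, the sketch below bins some uniform random data with `fit(Histogram, ...)` and builds an empirical CDF with `ecdf()`. It assumes a recent StatsBase release; the data and the bin edges are arbitrary choices.

```julia
using StatsBase

data = rand(1000)                     # uniform on [0, 1)

# Histogram with five equal-width bins; bins are closed on the left by default.
h = fit(Histogram, data, 0:0.2:1)
counts = h.weights                    # number of observations in each bin

# Empirical CDF: a callable object giving the fraction of data <= its argument.
F = ecdf(data)
half = F(0.5)                         # should be close to 0.5 for uniform data
```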

## StatsFuns

The StatsFuns package provides constants and functions for statistical computing. The constants are by no means essential but certainly very handy. Take, for example, `twoπ` and `sqrt2`.

There are some mildly exotic mathematical functions available like logistic, logit and softmax.

    julia> logistic(-5)
    0.0066928509242848554

    julia> logistic(5)
    0.9933071490757153

    julia> logit(0.25)
    -1.0986122886681098

    julia> logit(0.75)
    1.0986122886681096

    julia> softmax([1, 3, 2, 5, 3])
    5-element Array{Float64,1}:
     0.0136809
     0.101089
     0.0371886
     0.746952
     0.101089
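To make it clear what these functions compute, here is a plain Julia sketch of the textbook definitions. The `my_` names are hypothetical and only for illustration; StatsFuns ships its own implementations, which may differ in how they handle numerical edge cases.

```julia
# Hypothetical re-implementations, not the StatsFuns code.
my_logistic(x) = 1 / (1 + exp(-x))        # maps the real line onto (0, 1)
my_logit(p) = log(p / (1 - p))            # inverse of the logistic function

function my_softmax(v)
    e = exp.(v .- maximum(v))             # subtract the maximum for numerical stability
    return e ./ sum(e)                    # normalise so the result sums to 1
end

p = my_logistic(-5)
q = my_logit(0.25)
s = my_softmax([1, 3, 2, 5, 3])
```

Up to rounding, these should reproduce the values in the session above.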

Finally, there is a suite of functions relating to various statistical distributions. The functions for the Normal distribution are illustrated below, but there are functions for the Beta, Binomial, Gamma and Hypergeometric distributions and many others. The function naming convention is consistent across all distributions.

    julia> normpdf(0);                             # PDF

    julia> normlogpdf(0);                          # log PDF

    julia> normcdf(0);                             # CDF

    julia> normccdf(0);                            # Complementary CDF

    julia> normlogcdf(0);                          # log CDF

    julia> normlogccdf(0);                         # log Complementary CDF

    julia> norminvcdf(0.5);                        # inverse-CDF

    julia> norminvccdf(0.99);                      # inverse-Complementary CDF

    julia> norminvlogcdf(-0.693147180559945);      # inverse-log CDF

    julia> norminvlogccdf(-0.693147180559945);     # inverse-log Complementary CDF
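To illustrate the naming convention, the sketch below calls the same kind of function for a few other distributions. It assumes a recent StatsFuns release, where the distribution parameters come first and the evaluation point last; the parameter values are arbitrary.

```julia
using StatsFuns

# Normal: the CDF and complementary CDF sum to one, and the inverse CDF undoes the CDF.
total = normcdf(0.3) + normccdf(0.3)
back = norminvcdf(normcdf(1.3))

# Same suffixes, different prefixes.
b = binompdf(10, 0.5, 5)         # Binomial(n = 10, p = 0.5) at k = 5
u = betacdf(1.0, 1.0, 0.3)       # Beta(1, 1) is uniform on [0, 1]
g = gammapdf(1.0, 1.0, 1.0)      # Gamma(shape = 1, scale = 1) is Exponential(1)
```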

## StreamStats

Finally, the StreamStats package supports calculating online statistics for a stream of data which is being continuously updated.

    julia> average = StreamStats.Mean()
    Online Mean
     * Mean: 0.000000
     * N:    0

    julia> variance = StreamStats.Var()
    Online Variance
     * Variance: NaN
     * N:        0

    julia> for x in rand(10)
               update!(average, x)
               update!(variance, x)
               @printf("x = %f: mean = %.3f | variance = %.3f\n", x, state(average), state(variance))
           end
    x = 0.928564: mean = 0.929 | variance = NaN
    x = 0.087779: mean = 0.508 | variance = 0.353
    x = 0.253300: mean = 0.423 | variance = 0.198
    x = 0.778306: mean = 0.512 | variance = 0.164
    x = 0.566764: mean = 0.523 | variance = 0.123
    x = 0.812629: mean = 0.571 | variance = 0.113
    x = 0.760074: mean = 0.598 | variance = 0.099
    x = 0.328495: mean = 0.564 | variance = 0.094
    x = 0.303542: mean = 0.535 | variance = 0.090
    x = 0.492716: mean = 0.531 | variance = 0.080
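For a feel of how an online mean and variance can be maintained one observation at a time, here is a plain Julia sketch of Welford's algorithm. This is not necessarily how StreamStats implements it; the `RunningStats` type and the `observe!`, `running_mean` and `running_var` functions are hypothetical names made up for this example.

```julia
using Statistics

# Hypothetical type for illustration; not part of StreamStats.
mutable struct RunningStats
    n::Int            # number of observations seen so far
    mean::Float64     # current mean
    m2::Float64       # sum of squared deviations from the current mean
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's update: constant memory, one pass over the data.
function observe!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

running_mean(s::RunningStats) = s.mean
# The sample variance is undefined for fewer than two observations.
running_var(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

data = rand(1000)
s = RunningStats()
for x in data
    observe!(s, x)
end
```

After the loop, `running_mean(s)` and `running_var(s)` should agree with `mean(data)` and `var(data)` up to floating point error, without the data ever being stored by the accumulator.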

In addition to the mean and variance illustrated above, the package also supports online versions of min() and max(), and can be used to generate incremental confidence intervals for Bernoulli and Poisson processes.
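To give an idea of what an incremental confidence interval for a Bernoulli process might look like, here is a plain Julia sketch using the normal approximation (Wald) interval. This is an assumption chosen for simplicity and may well differ from the method StreamStats uses; the `BernoulliTracker` type and its functions are hypothetical.

```julia
# Hypothetical helper for illustration; not the StreamStats API.
mutable struct BernoulliTracker
    n::Int            # number of trials seen so far
    successes::Int    # number of successes seen so far
end
BernoulliTracker() = BernoulliTracker(0, 0)

function observe!(t::BernoulliTracker, outcome::Bool)
    t.n += 1
    t.successes += outcome
    return t
end

# Wald interval for the success probability; z = 1.96 gives roughly 95% coverage.
function interval(t::BernoulliTracker; z = 1.96)
    p = t.successes / t.n
    half = z * sqrt(p * (1 - p) / t.n)
    return (max(p - half, 0.0), min(p + half, 1.0))
end

t = BernoulliTracker()
for outcome in rand(500) .< 0.3       # simulated trials with success probability 0.3
    observe!(t, outcome)
end
lo, hi = interval(t)
```

The interval can be recomputed after every observation and narrows as more trials arrive.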

That’s it for today. Check out the full code on GitHub.

The post #MonthOfJulia Day 26: Statistics appeared first on Exegetic Analytics.