Tag Archives: Programming

Slowly moving to the Julia language from R

Julia is a new computing language that’s gotten a lot of attention lately (e.g., this Wired piece) and that I’ve ignored until recently. But I checked it out a few days ago and, holy crap, it’s a nice language. I’m rewriting the code in my book to use Julia instead of R and I’m almost certainly going to use it instead of R in my PhD class next fall.

So, why Julia and why not R? (And, I suppose why not Python/Matlab/other languages?)

  • Multiple dispatch. So you can define a function
    function TrickyAlgorithm(aThing, bThing)

    differently to depend on whether aThing is a matrix of real numbers and bThing is is a vector, or aThing is a vector and bThing is a matrix, or any other combinations of data types. And you can do this without lots of tedious, potentially slow, and confusing (to people reading and maintaining the code) argument checks and conversion within the function.

    Note that this is kind of similar to Object Oriented Programming, but in OOP TrickyAlgorithm would need to be a method of aThing or bThing. Also note that this is present in R as well.

  • Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of, because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to appreciate them (I’ve started playing around with Clojure and like it a lot too…). And because stuff like this is amazing:
    @devec r = exp(-abs(x-y))

    which devectorizes x and y (both vectors) and evaluates as

    for i = 1:length(x)
        r[i] = exp(-abs(x[i]-y[i]))
    end
    

    (this example and code is from Dahua Lin’s blog post, Fast Numeric Computation in Julia). Note that “evaluates as” does not mean “gives the same answer as,” it means that the code r = exp(-abs(x-y)) is replaced with the loop by @devec and then the loop is what’s run.

  • Decent speed. Definitely faster than well written R; I don’t have a great feel for how well it compares to highly optimized R (using inline C++, for example), but I write one-off simulation programs and don’t write highly optimized R.And the language encourages loops, which is a relief. R discourages loops and encourages “vectorized” operations that operate on entire objects at once (which are then converted to fast loops in C…). But I use loops all the time anyway, because avoiding loops in time series applications is impossible. R’s poor support for recursion doesn’t help either.And, more to the point, I teach econometrics to graduate students. Many of them haven’t programmed before. Most of them are not going to write parts of their analysis in C++.
  • The syntax is fine and unthreatening, which will help for teaching. It basically looks like Matlab done right. Matlab’s not a bad language because its programs look like they’re built out of Legos, it’s a bad language because of its horrendous implementation of functions, anonymous functions, objects, etc. Compared to R, Matlab and Julia look downright friendly. Compared to Clojure… I can’t even imagine asking first year PhD students (some with no programming experience at all) to work with a Lisp.
  • The last point that’s always mentioned in these language comparisons. What about all of the R packages? There are thousands and thousands of statistical packages coded up for R, and you’re giving that up by moving to a different language.This is apparently a big concern for a lot of people, but… have you looked at the source code for these packages? Most of them are terrible! But some are good, and it might take some time to port them to Julia. Not that much time, I think, because most high-performance popular R packages are a thin layer of interoperability over a fast implementation in C or C++, so the port is just a matter of wrapping it up for Julia. And most of the well designed packages are tools for other package developers.That’s not quite true of R’s statistical graphics, though. They’re really great and could be hard to port. And that’s more or less the only thing that I’m sure that I’ll miss in Julia. (But hopefully not for too long.)
  • Lastly, and this is important: the same massive quantity of packages for R is a big constraint on its future development. Breaking backwards compatibility is a big deal but avoiding it too much imposes costs.

Anyway, since I converted some R code to Julia I thought it would be fun to compare speeds. The first example is used to show the sampling distribution of an average of uniform(0,1) random variables. In R, we have

rstats <- function(rerror, nobs, nsims = 500) {
  replicate(nsims, mean(rerror(nobs)))}

which is (I think) pretty idiomatic R (and is vectorized, so it’s supposed to be fast). Calling it gives

R> system.time(rstats(runif, 500))

[out]:  user  system elapsed 
       0.341   0.002   0.377

For comparison to the Julia results, we’re going to care about the “elapsed” result of 0.377 seconds; the “system” column isn’t relevant here.  Calling it for more observations and more simulations (50,000 of each) gives

R> system.time(rstats(runif, 50000, 50000))

[out]:   user  system elapsed 
      204.184   0.217 215.526

so 216 seconds overall. And, just to preempt criticism, I ran these simulations a few times each and these results are representative; and I ran a byte-compiled version that got (unexpectedly) slightly worse performance.

Equivalent Julia code is

function rmeans(dist, nobs; nsims = 500)
    means = Array(Float64,nsims)
    for i in 1:nsims
        means[i] = mean(rand(dist, nobs))
    end
    return means
end

which is pretty easy to read, but I have no idea if it’s idiomatic. This is my first code in Julia. If you like to minimize lines of code and preallocation of arrays, Julia has list comprehensions and you can write the stylish one line definition (that gave similar times)

rmeans_pretty(dist, nobs; nsims = 500) =
    [ mean(rand(dist, nobs)) for i = 1:nsims ]

We can time  (after loading the Distributions packages):

julia> @elapsed rmeans(Uniform(), 500)

[out]: 0.093662961

so 0.09 seconds, or about a quarter the time as R. But (I forgot to mention earlier), Julia uses a Just In Time compiler, so the 0.09 seconds includes compilation and execution. Running it a second time gives

julia>  @elapsed rmeans(Uniform(), 500)

[out]: 0.004334132

which is half the time again. (Update on 3/17: as Jules pointed out in the comments, 0.004 is 1/20th of 0.09, so this is substantially faster than I’d initially thought. So we are getting into the ~100 times faster range. That’s actually a pretty exciting speed increase, but I’ll need to look into it some more. Well, that was embarrassing.)

Running the larger simulation, we have

julia> @elapsed rmeans_loop(Uniform(), 50000, nsims = 50000)

[out]: 77.318591953

so the R code is a little less than three times slower here. (The compilation step doesn’t make a meaningful difference.) So, Julia isn’t hundreds of times faster, but it is noticeably faster than R, which is nice.

But speed in this sort of test isn’t the main factor. I’m really excited about multiple dispatch — it’s one of the few things in R that I really, really liked from a language standpoint. I really like what I’ve read about Julia’s support for parallelism (but need to learn more). And I like metaprogramming, even if I can’t do it myself yet. So Julia’s trying to be a fast, easy to learn, and elegantly designed language. That’s awesome. I want it to work.

ps: and it’s open source! Can’t forget that.

Filed under: Blog Tagged: julia, programming, R

The Relationship between Vectorized and Devectorized Code

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2013/12/22/the-relationship-between-vectorized-and-devectorized-code/

Introduction

Some people have come to believe that Julia’s vectorized code is unusably slow. To correct this misconception, I outline a naive benchmark below that suggests that Julia’s vectorized code is, in fact, noticeably faster than R’s vectorized code. When experienced Julia programmers suggest that newcomers should consider devectorizing code, we’re not trying to beat R’s speed — our vectorized code does that already. Instead, we’re trying to match C’s speed.

As the examples below indicate, a little bit of devectorization goes a long way towards this loftier goal. In the specific examples I show, I find that:

  • Julia’s vectorized code is 2x faster than R’s vectorized code
  • Julia’s devectorized code is 140x faster than R’s vectorized code
  • Julia’s devectorized code is 1350x faster than R’s devectorized code

Examples of Vectorized and Devectorized Code in R

Let’s start by contrasting two pieces of R code: a vectorized and a devectorized implementation of a trivial snippet of code that does repeated vector addition.

First, we consider an example of idiomatic, vectorized R code:

vectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        x <- a + b
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        vectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This code takes, on average, 0.49 seconds per iteration to compute 1,000,000 vector additions.

Having considered the vectorized implementation, we can then consider an unidiomatic devectorized implementation of the same operation in R:

devectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        for (index in 1:2)
        {
            x[index] <- a[index] + b[index]
        }
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        devectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This takes, on average, 4.72 seconds per iteration to compute 1,000,000 vector additions.

Examples of Vectorized and Devectorized Code in Julia

Let’s now consider two Julia implementations of this same snippet of code. We’ll start with a vectorized implementation:

function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        x = a + b
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized()

    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.236 seconds per iteration to compute 1,000,000 vector additions.

Next, let’s consider a devectorized implementation of this same snippet:

function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized()

    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.0035 seconds per iteration to compute 1,000,000 vector additions.

Comparing Performance in R and Julia

We can summarize the results of the four examples above in a single table:

Approach Language Average Time
Vectorized R 0.49
Devectorized R 4.72
Vectorized Julia 0.24
Devectorized Julia 0.0035

All of these examples were timed on my 2.9 GHz Intel Core i7 MacBook Pro. The results are quite striking: Julia is uniformly faster than R. And a very small bit of devectorization produces huge performance improvements. Of course, it would be nice if Julia’s compiler could optimize vectorized code as well as it optimizes devectorized code. But doing so requires a substantial amount of work.

Why is Optimizing Vectorized Code Hard?

What makes automatic devectorization tricky to get right is that even minor variants of the snippet shown above have profoundly different optimization strategies. Consider, for example, the following two snippets of code:

function vectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        x += a + b
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized2()

    for itr in 1:N
        timings[itr] = @elapsed vectorized2()
    end

    return timings
end

mean(time(10))

This first snippet takes 1.29 seconds on average.

function devectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        for dim in 1:2
            x[dim] += a[dim] + b[dim]
        end
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized2()

    for itr in 1:N
        timings[itr] = @elapsed devectorized2()
    end

    return timings
end

mean(time(10))

This second snippet takes, on average, 0.27 seconds.

The gap between vectorized and devectorized code is much smaller here because this second set of code snippets uses memory in a very different way than our original snippets did. In the first set of snippets, it was possible to entirely avoid allocating any memory for storing changes to x. The devectorized code for the first set of snippets explicitly made clear to the compiler that no memory needed to be allocated. The vectorized code did not make this clear. Making it clear that no memory needed to be allocated led to a 75x speedup. Explicitly telling the compiler what it can avoid spending time on goes a long way.

In contrast, in the second set of snippets, a new chunk of memory has to be allocated for every x vector that gets created. And the result is that even the devectorized variant of our second snippet cannot offer much of a performance boost over its vectorized analogue. The devectorized variant is slightly faster because it avoids allocating any memory during the steps in which x has a and b added to it, but this makes less of a difference when there is still a lot of other work being done that cannot be avoided by devectorizing operations.

This reflects a more general statement: the vectorization/devectorization contrast is only correlated, not causally related, with the actual performance characteristics of code. What matters for computations that take place on modern computers is the efficient utilization of processor cycles and memory. In many real examples of vectorized code, it is memory management, rather than vectorization per se, that is the core causal factor responsible for performance.

The Reversed Role of Vectorization in R and Julia

Part of what makes it difficult to have a straightforward discussion about vectorization is that vectorization in R conflates issues that are logically unrelated. In R, vectorization is often done for both (a) readability and (b) performance. In Julia, vectorization is only used for readability; it is devectorization that offers superior performance.

This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its mgic.

In other words, R is doing exactly what Julia is doing to get better performance. R’s vectorized code is simply a thin wrapper around completely devectorized C code. If you don’t believe me, go read the C code for something like R’s distance function, which involves calls to functions like the following:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
    if(both_non_NA(x[i1], x[i2])) {
        dev = (x[i1] - x[i2]);
        if(!ISNAN(dev)) {
        dist += dev * dev;
        count++;
        }
    }
    i1 += nr;
    i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

It is important to keep this sort of thing in mind: the term vectorization in R actually refers to a step in which you write devectorized code in C. Vectorization, per se, is a red herring when reasoning about performance.

To finish this last point, let’s summarize the performance hierarchy for R and Julia code in a simple table:

Worst Case Typical Case Best Case
Julia Vectorized Code Julia Devectorized Code
R Devectorized Code R Vectorized Code C Devectorized Code

It is the complete absence of one column for Julia that makes it difficult to compare vectorization across the two languages. Nothing in Julia is as bad as R’s devectorized code. On the other end of the spectrum, the performance of Julia’s devectorized code simply has no point of comparison in pure R: it is more similar to the C code used to power R behind the scenes.

Conclusion

Julia aims to (and typically does) provide vectorized code that is efficient as the vectorized code available in other high-level languages. What sets Julia apart is the possibility of writing, in pure Julia, high performance code that uses CPU and memory resources as effectively as can be done in C.

In particular, vectorization and devectorization stand in the opposite relationship to one another in Julia as they do in R. In R, devectorization makes code unusably slow: R code must be vectorized to perform at an acceptable level. In contrast, Julia programmers view vectorized code as a convenient prototype that can be modified with some clever devectorization to produce production-performance code. Of course, we would like prototype code to perform better. But no popular language offers that kind of functionality. What Julia offers isn’t the requirement for devectorization, but the possibility of doing it in Julia itself, rather than in C.

The State of Statistics in Julia

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2012/12/02/the-state-of-statistics-in-julia/

Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.

Introduction

Over the last few weeks, the Julia core team has rolled out a demo version of Julia’s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia’s growing community of developers is heading.

To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.

Getting Julia Installed

If you’d like to try out Julia for yourself, you’ll first need to clone the current Julia repo from GitHub and then build Julia from source as described in the Julia README. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you’ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing julia at the command line.

Installing Packages

Once Julia’s REPL is running, you can use the following commands to start installing packages:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
julia> require("pkg")
 
julia> Pkg.init()
Initialized empty Git repository in /Users/johnmyleswhite/.julia/.git/
Cloning into 'METADATA'...
remote: Counting objects: 443, done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 443 (delta 53), reused 423 (delta 33)
Receiving objects: 100% (443/443), 38.98 KiB, done.
Resolving deltas: 100% (53/53), done.
[master (root-commit) dbd486e] empty package repo
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 METADATA
 create mode 100644 REQUIRE
 
julia> Pkg.add("DataFrames", "Distributions", "MCMC", "Optim", "NHST", "Clustering")
Installing DataFrames: v0.0.0
Cloning into 'DataFrames'...
remote: Counting objects: 1340, done.
remote: Compressing objects: 100% (562/562), done.
remote: Total 1340 (delta 760), reused 1229 (delta 655)
Receiving objects: 100% (1340/1340), 494.79 KiB, done.
Resolving deltas: 100% (760/760), done.
Installing Distributions: v0.0.0
Cloning into 'Distributions'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 49 (delta 8), reused 49 (delta 8)
Receiving objects: 100% (49/49), 17.29 KiB, done.
Resolving deltas: 100% (8/8), done.
Installing MCMC: v0.0.0
Cloning into 'MCMC'...
warning: no common commits
remote: Counting objects: 155, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 155 (delta 66), reused 140 (delta 51)
Receiving objects: 100% (155/155), 256.68 KiB, done.
Resolving deltas: 100% (66/66), done.
Installing NHST: v0.0.0
Cloning into 'NHST'...
remote: Counting objects: 20, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 2), reused 19 (delta 1)
Receiving objects: 100% (20/20), 4.31 KiB, done.
Resolving deltas: 100% (2/2), done.
Installing Optim: v0.0.0
Cloning into 'Optim'...
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (191/191), done.
remote: Total 497 (delta 318), reused 476 (delta 297)
Receiving objects: 100% (497/497), 79.68 KiB, done.
Resolving deltas: 100% (318/318), done.
Installing Options: v0.0.0
Cloning into 'Options'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 6 (delta 0)
Receiving objects: 100% (10/10), done.
Resolving deltas: 100% (1/1), done.
Installing Clustering: v0.0.0
Cloning into 'Clustering'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 38 (delta 7), reused 38 (delta 7)
Receiving objects: 100% (38/38), 7.77 KiB, done.
Resolving deltas: 100% (7/7), done.

That will get you started with some of the core tools for doing statistical programming in Julia. You’ll probably also want to install another package called “RDatasets”, which provides access to 570 of the classic data sets available in R. This package has a much larger file size than the others, which is why I recommend installing it after you’ve first installed the other packages:

1
2
3
4
5
6
7
8
9
10
require("pkg")
 
julia> Pkg.add("RDatasets")
Installing RDatasets: v0.0.0
Cloning into 'RDatasets'...
remote: Counting objects: 609, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 609 (delta 21), reused 605 (delta 17)
Receiving objects: 100% (609/609), 10.56 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (21/21), done.

Assuming that you’ve gotten everything working, you can then type the following to load Fisher’s classic Iris data set:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
julia> load("RDatasets")
Warning: redefinition of constant NARule ignored.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}).
         Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any).
         Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition replace!(PooledDataVec{S},NAtype,T) is ambiguous with replace!(PooledDataVec{S},T,NAtype).
         Make sure replace!(PooledDataVec{S},NAtype,NAtype) is defined first.
Warning: New definition promote_rule(Type{AbstractDataVec{T}},Type{T}) is ambiguous with promote_rule(Type{AbstractDataVec{S}},Type{T}).
         Make sure promote_rule(Type{AbstractDataVec{T}},Type{T}) is defined first.
Warning: New definition ^(NAtype,T<:Union(String,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(NAtype,_<:Integer) is defined first.
Warning: New definition ^(DataVec{T},Number) is ambiguous with ^(Any,Integer).
         Make sure ^(DataVec{T},Integer) is defined first.
Warning: New definition ^(DataFrame,Union(NAtype,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(DataFrame,Integer) is defined first.
 
julia> using DataFrames
 
julia> using RDatasets
 
julia> iris = data("datasets", "iris")
DataFrame  (150,6)
              Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]        1          5.1         3.5          1.4         0.2    "setosa"
[2,]        2          4.9         3.0          1.4         0.2    "setosa"
[3,]        3          4.7         3.2          1.3         0.2    "setosa"
[4,]        4          4.6         3.1          1.5         0.2    "setosa"
[5,]        5          5.0         3.6          1.4         0.2    "setosa"
[6,]        6          5.4         3.9          1.7         0.4    "setosa"
[7,]        7          4.6         3.4          1.4         0.3    "setosa"
[8,]        8          5.0         3.4          1.5         0.2    "setosa"
[9,]        9          4.4         2.9          1.4         0.2    "setosa"
[10,]      10          4.9         3.1          1.5         0.1    "setosa"
[11,]      11          5.4         3.7          1.5         0.2    "setosa"
[12,]      12          4.8         3.4          1.6         0.2    "setosa"
[13,]      13          4.8         3.0          1.4         0.1    "setosa"
[14,]      14          4.3         3.0          1.1         0.1    "setosa"
[15,]      15          5.8         4.0          1.2         0.2    "setosa"
[16,]      16          5.7         4.4          1.5         0.4    "setosa"
[17,]      17          5.4         3.9          1.3         0.4    "setosa"
[18,]      18          5.1         3.5          1.4         0.3    "setosa"
[19,]      19          5.7         3.8          1.7         0.3    "setosa"
[20,]      20          5.1         3.8          1.5         0.3    "setosa"
  :
[131,]    131          7.4         2.8          6.1         1.9 "virginica"
[132,]    132          7.9         3.8          6.4         2.0 "virginica"
[133,]    133          6.4         2.8          5.6         2.2 "virginica"
[134,]    134          6.3         2.8          5.1         1.5 "virginica"
[135,]    135          6.1         2.6          5.6         1.4 "virginica"
[136,]    136          7.7         3.0          6.1         2.3 "virginica"
[137,]    137          6.3         3.4          5.6         2.4 "virginica"
[138,]    138          6.4         3.1          5.5         1.8 "virginica"
[139,]    139          6.0         3.0          4.8         1.8 "virginica"
[140,]    140          6.9         3.1          5.4         2.1 "virginica"
[141,]    141          6.7         3.1          5.6         2.4 "virginica"
[142,]    142          6.9         3.1          5.1         2.3 "virginica"
[143,]    143          5.8         2.7          5.1         1.9 "virginica"
[144,]    144          6.8         3.2          5.9         2.3 "virginica"
[145,]    145          6.7         3.3          5.7         2.5 "virginica"
[146,]    146          6.7         3.0          5.2         2.3 "virginica"
[147,]    147          6.3         2.5          5.0         1.9 "virginica"
[148,]    148          6.5         3.0          5.2         2.0 "virginica"
[149,]    149          6.2         3.4          5.4         2.3 "virginica"
[150,]    150          5.9         3.0          5.1         1.8 "virginica"
 
julia> head(iris)
DataFrame  (6,6)
          Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
[1,]    1          5.1         3.5          1.4         0.2 "setosa"
[2,]    2          4.9         3.0          1.4         0.2 "setosa"
[3,]    3          4.7         3.2          1.3         0.2 "setosa"
[4,]    4          4.6         3.1          1.5         0.2 "setosa"
[5,]    5          5.0         3.6          1.4         0.2 "setosa"
[6,]    6          5.4         3.9          1.7         0.4 "setosa"
 
julia> tail(iris)
DataFrame  (6,6)
            Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]    145          6.7         3.3          5.7         2.5 "virginica"
[2,]    146          6.7         3.0          5.2         2.3 "virginica"
[3,]    147          6.3         2.5          5.0         1.9 "virginica"
[4,]    148          6.5         3.0          5.2         2.0 "virginica"
[5,]    149          6.2         3.4          5.4         2.3 "virginica"
[6,]    150          5.9         3.0          5.1         1.8 "virginica"

Now that you can see that Julia can handle complex data sets, let’s talk a little bit about the packages that make statistical analysis in Julia possible.

The DataFrames Package

The DataFrames package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns of missing data, which are called DataVec‘s. A collection of DataVec‘s allows one to build up a DataFrame, which provides a tabular data structure like that used by R’s data.frame type.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
julia> load("DataFrames")
 
julia> using DataFrames
 
julia> data = {"Value" => [1, 2, 3], "Label" => ["A", "B", "C"]}
Warning: imported binding for data overwritten in module Main
{"Label"=>["A", "B", "C"],"Value"=>[1, 2, 3]}
 
julia> df = DataFrame(data)
DataFrame  (3,2)
        Label Value
[1,]      "A"     1
[2,]      "B"     2
[3,]      "C"     3
 
julia> df["Value"]
3-element DataVec{Int64}
 
[1,2,3]
 
julia> df[1, "Value"] = NA
NA
 
 
julia> head(df)
DataFrame  (3,2)
        Label Value
[1,]      "A"    NA
[2,]      "B"     2
[3,]      "C"     3

Distributions

The Distributions package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia’s large type hierarchy, which means that quite generic names like rand can be used to sample from complex distributions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
julia> load("Distributions")
julia> using Distributions
 
julia> x = rand(Normal(11.0, 3.0), 10_000)
10000-element Float64 Array:
  6.87693
 13.3676 
  7.25008
  8.82833
 10.6911 
  7.1004 
 13.7449 
  5.96412
  8.57957
 15.27374.89007
 15.1509 
  6.32376
  7.83847
 14.4476 
 14.2974 
  9.74783
  9.67398
 14.4992 
 
julia> mean(x)
11.00366217730023
 
julia> var(x)
Warning: Possible conflict in library symbol ddot_
9.288938550823996

Optim

The Optim package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, optimize, which works a bit like R’s optim function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
julia> load("Optim")
julia> using Optim
 
julia> f = v -> (10.9 - v[1])^2 + (7.3 - v[2])^2
#<function>
 
julia> initial_guess = [0.0, 0.0]
2-element Float64 Array:
 0.0
 0.0
 
julia> results = optimize(f, initial_guess)
Warning: Possible conflict in library symbol dcopy_
OptimizationResults("Nelder-Mead",[0.333333, 0.333333],[10.9, 7.29994],3.2848148720460163e-9,38,true)
 
julia> results.minimum
2-element Float64 Array:
 10.9    
  7.29994

MCMC

The MCMC package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like slice_sampler, which allows one to sample from a (potentially unnormalized) density function using Radford Neal’s slice sampling algorithm.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
julia> load("MCMC")
 
julia> using MCMC
 
julia> d = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> f = x -> logpdf(d, x)
#<function>
 
julia> [slice_sampler(0.0, f) for i in 1:100]
100-element (Float64,Float64) Array:
 (2.7589100475626323,-106.49522613611775) 
 (22.840595204318323,-16.323492094305458) 
 (0.11800384424353683,-148.35766451986206)
 (25.507580447082677,-34.68325273534245)  
 (25.794565860846134,-37.08275877393945)  
 (25.898128716394307,-37.96887853221083)  
 (9.309878825853284,-32.76010551023705)   
 (30.824102772255355,-92.50490745818972)  
 (9.108789186504177,-34.38504372063516)   
 (25.547686903330494,-35.01363502992266)(5.795001414731885,-66.98643477086263)   
 (15.50115292212293,-2.518925467219337)   
 (12.046429369881345,-14.666455009726143) 
 (17.25455052645699,-0.919566865791911)   
 (25.494698549206657,-34.57747767488159)  
 (1.8340810959111111,-120.36165311809079) 
 (2.7112428736526177,-107.18901820771696) 
 (9.21203292192012,-33.54571459047587)    
 (19.12274407701784,-2.5984139591266584)

NHST

The NHST package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
julia> load("Distributions")
 
julia> using Distributions
 
julia> load("NHST")
 
julia> using NHST
 
julia> d1 = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> d2 = Normal(0.0, 1.0)
Normal(0.0,1.0)
 
julia> x = rand(d1, 1_000)
1000-element Float64 Array:
 15.7085
 18.585 
 16.6036
 18.962 
 17.8715
 16.6814
 17.9676
 16.8924
 16.6022
 17.981317.1339
 17.3964
 18.6184
 16.7238
 18.5003
 16.1618
 17.9198
 17.4928
 18.715 
 
julia> y = rand(d2, 1_000)
1000-element Float64 Array:
  0.664885 
  0.147182 
  0.96265  
  0.24282  
  1.881    
 -0.632478 
  0.539297 
  0.996562 
 -0.483302 
  0.5146292.06249  
 -0.549444 
  0.857575 
 -1.47464  
 -2.33243  
  0.510751 
 -0.381069 
 -1.49165  
  0.0521203
 
julia> t_test(x, y)
HypothesisTest("t-Test",{"t"=>392.2838409538002},{"df"=>1989.732411290855},0.0,[17.1535, 17.3293],{"mean of x"=>17.24357323225425,"mean of y"=>0.0021786523177457794},0.0,"two-sided","Welch Two Sample t-test","x and y",1989.732411290855)

Clustering

The Clustering package provides tools for doing simple k-means style clustering.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
julia> load("Clustering")
 
julia> using Clustering
 
julia> srand(1)
 
julia> n = 100
100
 
julia> x = vcat(randn(n, 2), randn(n, 2) .+ 10)
200x2 Float64 Array:
  0.0575636  -0.112322 
 -1.8329     -0.101326 
  0.370699   -0.956183 
  1.31816    -1.44351  
  0.787598    0.148386 
  0.712214   -1.293    
 -1.8578     -1.06208  
 -0.746303   -0.0439182
  1.12082    -2.00616  
  0.364646   -1.0933110.1974     10.5583   
 11.0832      8.92082  
 11.5414     11.6022   
  9.0453     11.5093   
  8.86714    10.4233   
 10.7336     10.7201   
  8.60415     9.13942  
  8.62482     8.51701  
 10.5044     10.3841   
 
julia> true_assignments = vcat(zeros(n), ones(n))
200-element Float64 Array:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.01.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 
julia> results = k_means(x, 2)
Warning: Possible conflict in library symbol dgesdd_
Warning: Possible conflict in library symbol dsyrk_
Warning: Possible conflict in library symbol dgemm_
KMeansOutput([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1  ...  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],2x2 Float64 Array:
 -0.0166203  -0.248904
 10.0418     10.0074  ,3,422.9820560670007,true)
 
julia> results.assignments
200-element Int64 Array:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 12
 2
 2
 2
 2
 2
 2
 2
 2

While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package systems had made me more excited than ever before about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like that provided by R’s ggplot2 package. Hopefully something will come into being within the next few months.