Author Archives: John Myles White

Why Julia’s DataFrames are Still Slow

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2015/11/28/why-julias-dataframes-are-still-slow/

Introduction

Although I’ve recently decided to take a break from working on OSS for a little while, I’m still as excited as ever about Julia as a language.

That said, I’m still unhappy with the performance of Julia’s core data analysis infrastructure. The performance of code that deals with missing values has been substantially improved thanks to the beta release of the NullableArrays package, which David Gold developed during this past Julia Summer of Code. But the DataFrames package is still a source of performance problems.

The goal of this post is to explain why Julia’s DataFrames are still unacceptably slow in many important use cases — and will remain slow even after the current dependency on the DataArrays package is replaced with a dependency on NullableArrays.

Problematic Interactions with Julia’s Compiler

The core problem with the DataFrames library is that a DataFrame is, at its core, a black-box container that could, in theory, contain objects of arbitrary types. In practice, a DataFrame contains highly constrained objects, but those constraints are (a) hard to express to the compiler and (b) still too weak to allow the compiler to produce the most efficient machine code.

The use of any black-box container creates the potential for performance problems in Julia because of the way that Julia’s compiler works. In particular, Julia’s compiler is able to execute code quickly because it can generate custom machine code for every function call — and this custom machine code is specialized for the specific run-time types of the function’s arguments.

This run-time generation of custom machine code is called specialization. When working with black-box containers, Julia’s approach to specialization is not used to full effect because machine code specialization based on run-time types only occurs at function call sites. If you access objects from a black-box container and then perform extended computations on the results, those computations will not be fully specialized because there is no function call between (a) the moment at which type uncertainty about the contents of the black-box container is removed and (b) the moment at which code that could benefit from type information is executed.

A Minimal Example

To see this concern in practice, consider the following minimal example of a hot loop being executed on values that are extracted from a black-box container:

function g1(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end
 
function hot_loop(x, y)
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end
 
function g2(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    hot_loop(x, y)
end
 
container = Any[randn(10_000_000), randn(10_000_000)];
 
@time g1(container)
# 2.258571 seconds (70.00 M allocations: 1.192 GB, 5.03% gc time)
 
@time g2(container)
# 0.015286 seconds (5 allocations: 176 bytes)

g1 is approximately 150x slower than g2 on my machine. But g2 is, at a certain level of abstraction, exactly equivalent to g1 — the only difference is that the hot loop in g1 has been put inside of a function call. To convince yourself that the function call boundary is the only important difference between these two functions, consider the following variation of g2 and hot_loop:

@inline function hot_loop_alternative(x, y)
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end
 
function g3(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    hot_loop_alternative(x, y)
end
 
@time g1(container)
# 2.290116 seconds (70.00 M allocations: 1.192 GB, 4.90% gc time)
 
@time g2(container)
# 0.017835 seconds (5 allocations: 176 bytes)
 
@time g3(container)
# 2.250301 seconds (70.00 M allocations: 1.192 GB, 5.08% gc time)

On my system, forcing the hot loop code to be inlined erases the speedup we saw with g2: g3 is just as slow as g1. Somewhat ironically, by inlining the hot loop, we’ve prevented the compiler from generating machine code that’s specialized on the types of the x and y values we pull out of our black_box_container. Inlining removes a function call site, and function call sites are the only points at which machine code can be specialized based on run-time type information.

This problem is the core issue that needs to be resolved to make Julia’s DataFrames as efficient as they should be. Below I outline three potential solutions to this problem. I do not claim that these three are the only solutions; I offer them only to illustrate important issues that need to be addressed.

Potential Solutions to the Under-Specialization Problem

One possible solution to the problem of under-specialization is to change Julia’s compiler. I think that work on that front could be very effective, but the introduction of specialization strategies beyond Julia’s current "specialize at function call sites" would make Julia’s compiler much more complex — and could, in theory, make some code slower if the compiler were to spend more time performing compilation and less time performing the actual computations that a user wants to perform.

A second possible solution is to generate custom DataFrame types for every distinct DataFrame object. This could convert DataFrames from black-box containers that contain objects of arbitrary type into fully typed containers that can only contain objects of types that are fully known to the compiler.

The danger with this strategy is that you could generate an excessively large number of different specializations — which would again run the risk of spending more time inside the compiler than inside of the code you actually want to execute. It could also create excessive memory pressure as an increasing number of specialized code paths are stored in memory. Despite these concerns, a more aggressively typed DataFrame might be a powerful tool for doing data analysis.
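As a rough sketch of what this second approach could look like (the TypedTable type below is hypothetical, written in Julia 0.4-era syntax, and is not part of the DataFrames package), making the column types part of the container's type lets the compiler see concrete types at every access site:

# Hypothetical sketch: column types are type parameters, so field access
# yields concretely typed columns without any function-call barrier.
immutable TypedTable{X, Y}
    x::X
    y::Y
end

function dot_columns(t::TypedTable)
    x, y = t.x, t.y              # types are known from the parameters X and Y
    s = 0.0
    for i in 1:length(x)
        s += x[i] * y[i]         # fully specialized machine code
    end
    s
end

t = TypedTable(randn(10_000_000), randn(10_000_000))
@time dot_columns(t)             # comparable to g2 above, even though the loop
                                 # lives in the same function as the column access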

The last possible solution I know of is the introduction of a high-level API that ensures that operations on DataFrames always reduce down to operations on objects whose types are known when hot loops execute. This is essentially the computational model used in traditional databases: take in a SQL specification of a computation, make use of knowledge about the data actually stored in existing tables to formulate an optimized plan for performing that computation, and then perform that optimized computation.

I think this third option is the best because it will also solve another problem Julia’s data infrastructure will hit eventually: the creation of code that is insufficiently generic and not portable to other backends. If people learn to write code that only works efficiently for a specific implementation of DataFrames, then their code will likely not work when they try to apply it to data stored in alternative backends (e.g. traditional databases). This would trap users into data structures that may not suit their needs. The introduction of a layer of appropriate abstractions (as in dplyr) would resolve both issues at once.
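As a very rough sketch of the mechanics (the reduce_columns function below is hypothetical, not an existing API), a high-level operation can extract the columns from the untyped container and hand them to a kernel through a function call, so the kernel is compiled against the concrete column types:

# Hypothetical sketch of a high-level entry point: users never write hot loops
# against the black-box container; the kernel is reached through a call site,
# so it gets specialized on the run-time types of the extracted columns.
function reduce_columns(kernel::Function, black_box_container, i, j)
    x, y = black_box_container[i], black_box_container[j]
    kernel(x, y)                 # function-call barrier
end

@time reduce_columns(hot_loop, container, 1, 2)   # reuses hot_loop and container
                                                  # from the examples above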

Take-Aways

  • Making Julia’s DataFrames better is still a work-in-progress.
  • The core issue is still the use of data structures that are not amenable to Julia’s type inference machinery. One of the two main problems (the representation of missing values) is now resolved; the other (under-specialization of code that operates on a DataFrame’s columns) must still be addressed before things function smoothly.
  • Several solutions to this remaining problem are possible; we will probably see one or more of them gain traction in the near future.

What’s Wrong with Statistics in Julia?

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2014/11/29/whats-wrong-with-statistics-in-julia/

Introduction

Several months ago, I promised to write an updated version of my old post, “The State of Statistics in Julia”, that would describe how Julia’s support for statistical computing has evolved since December 2012.

I’ve kept putting off writing that post for several reasons, but the most important reason is that all of my attention for the last few months has been focused on what’s wrong with how Julia handles statistical computing. As such, the post I’ve decided to write isn’t a review of what’s already been done in Julia, but a summary of what’s being done right now to improve Julia’s support for statistical computing.

In particular, this post focuses on several big changes to the core data structures that are used in Julia to represent statistical data. These changes should all ship when Julia 0.4 is released.

What’s Wrong with Statistics in Julia Today?

The primary problem with statistical computing in Julia is that the current tools were all designed to emulate R. Unfortunately, R’s approach to statistical computing isn’t amenable to the kinds of static analysis techniques that Julia uses to produce efficient machine code.

In particular, the following differences between R and Julia have repeatedly created problems for developers:

  • In Julia, computations involving scalars are at least as important as computations involving vectors. In particular, iterative computations are first-class citizens in Julia. This implies that statistical libraries must allow developers to write efficient code that iterates over the elements of a vector in pure Julia. Because Julia’s compiler can only produce efficient machine code for computations that are type-stable, the representations of missing values, categorical values and ordinal values in Julia programs must all be type-stable. Whether a value is missing or not, its type must remain the same.
  • In Julia, almost all end-users will end up creating their own types. As such, any tools for statistical computing must be generic enough that they can be extended to arbitrary types with little to no effort. In contrast to R, which can heavily optimize its algorithms for a very small number of primitive types, Julia developers must ensure that their libraries are both highly performant and highly abstract.
  • Julia, like most mainstream languages, eagerly evaluates the arguments passed to functions. This implies that idioms from R which depend upon non-standard evaluation are not appropriate for Julia, although it is possible to emulate some forms of non-standard evaluation using macros. In addition, Julia doesn’t allow programmers to reify scope. This implies that idioms from R that require access to the caller’s scope are not appropriate for Julia.

The most important way in which these issues came up in the first generation of statistical libraries was in the representation of a single scalar missing value. In Julia 0.3, this concept is represented by the value NA, but that representation will be replaced when 0.4 is released. Most of this post will focus on the problems created by NA.

In addition to problems involving NA, there were also problems with how expressions were being passed to some functions. These problems have been resolved by removing the function signatures for statistical functions that involved passing expressions as arguments to those functions. A prototype package called DataFramesMeta, which uses macros to emulate some kinds of non-standard evaluation, is being developed by Tom Short.

Representing Missing Values

In Julia 0.3, missing values are represented by a singleton object, NA, of type NAtype. Thus, a variable x, which might be either a Float64 value or a missing value encoded as NA, will end up with type Union(Float64, NAtype). This Union type is a source of performance problems because it defeats Julia’s compiler’s attempts to assign a unique concrete type to every variable.
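As a small illustration (this assumes the DataArrays package, which defines NA and NAtype, is loaded):

using DataArrays

# The result is either a Float64 or NA, so the inferred return type is
# Union(Float64, NAtype): the compiler cannot assign one concrete type to the
# result, and downstream loops that consume it cannot be fully specialized.
safe_inv(x::Float64) = x == 0.0 ? NA : 1.0 / x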

We could remove this type-instability by ensuring that every type has a specific value, such as NaN, that signals missingness. This is the approach that both R and pandas take. It offers acceptable performance, but does so at the expense of generic handling of non-primitive types. Given Julia’s rampant usage of custom types, the sentinel values approach is not viable.

As such, we’re going to represent missing values in Julia 0.4 by borrowing some ideas from functional languages. In particular, we’ll be replacing the singleton object NA with a new parametric type Nullable{T}. Unlike NA, a Nullable object isn’t a direct scalar value. Rather, a Nullable object is a specialized container type that either contains one value or zero values. An empty Nullable container is taken to represent a missing value.
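As a brief sketch of the planned interface (the names below reflect the Nullable support slated for Julia 0.4 and may still change before release):

x = Nullable(1.0)           # a container holding the value 1.0
y = Nullable{Float64}()     # an empty container: a missing Float64

typeof(x) === typeof(y)     # true -- both are Nullable{Float64}, so code that
                            # handles them is type-stable whether or not the
                            # value is missing

isnull(x)                   # false
isnull(y)                   # true
get(x)                      # 1.0
get(y, 0.0)                 # 0.0 -- a default supplied for the missing case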

The Nullable approach to representing a missing scalar value offers two distinct improvements:

  • Nullable{T} provides radically better performance than Union(T, NAtype). In some benchmarks, I find that iterative constructs can be as much as 100x faster when using Nullable{Float64} instead of Union(Float64, NAtype). I’ve also found that Nullable{Float64} is about 60% slower than using NaN to represent missing values, but it is a generic approach that trivially extends to arbitrary new types, including integers, dates, complex numbers, quaternions, etc.
  • Nullable{T} provides more type safety by requiring that all attempts to interact with potentially missing values explicitly indicate how missing values should be treated.

In a future blog post, I’ll describe how Nullable works in greater detail.

Categorical Values

In addition to revising the representation of missing values, I’ve also been working on revising our representation of categorical values. Working with categorical data in Julia has always been a little strange, because the main tool for representing categorical data, the PooledDataArray, has always occupied an awkward intermediate position between two incompatible objectives:

  • A container that keeps track of the unique values present in the container and uses this information to efficiently represent values as pointers to a pool of unique values.
  • A container that contains values of a categorical variable drawn from a well-defined universe of possible values. The universe can include values that are not currently present in the container.

These two goals come into severe tension when considering subsets of a PooledDataArray. The uniqueness constraint suggests that the pool should shrink, whereas the categorical variable definition suggests that the pool should be maintained without change. In Julia 0.4, we’re going to commit completely to the latter behavior and leave the problem of efficiently representing highly compressible data for another data structure.
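A small example of the tension (using the PooledDataArray type from the DataArrays package; the behavior of subsetting is precisely the design question at issue):

using DataArrays

pda = PooledDataArray(["a", "b", "a", "c"])
levels(pda)      # the pool of unique values: "a", "b", "c"

sub = pda[1:2]   # the subset contains only "a" and "b"
# Should levels(sub) shrink to just "a" and "b", or should the full pool be
# preserved? Julia 0.4's categorical types will commit to preserving it.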

We’ll also begin representing scalar values of categorical variables using custom types. The new CategoricalVariable and OrdinalVariable types that will ship with Julia 0.4 will further the efforts to put scalar computations on an equal footing with vector computations. This will be particularly notable for dealing with ordinal variables, which are not supported at all in Julia 0.3.

Metaprogramming

Many R functions employ non-standard evaluation as a mechanism for augmenting the current scope with the column names of a data.frame. In Julia, it’s often possible to emulate this behavior using macros. The in-progress DataFramesMeta package explores this alternative to non-standard evaluation. We will also be exploring other alternatives to non-standard evaluation in the future.
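For a flavor of the macro-based approach (the @with macro shown here comes from the in-progress DataFramesMeta package, so its surface syntax should be treated as provisional):

using DataFrames, DataFramesMeta

df = DataFrame(x = [1, 2, 3], y = [4, 5, 6])

# The macro rewrites the column symbols into ordinary indexing expressions at
# parse time, so no non-standard evaluation is needed at run time.
@with(df, :x + :y)    # roughly equivalent to df[:x] + df[:y]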

What’s Next

In the long-term future, I’m hoping to improve several other parts of Julia’s core statistical infrastructure. In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables. I’ll write another post about those issues later.

Values vs. Bindings: The Map is Not the Territory

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2014/09/06/values-vs-bindings-the-map-is-not-the-territory/

Many newcomers to Julia are confused by the seemingly dissimilar behaviors of the following two functions:

julia> a = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3
 
julia> function foo!(a)
           a[1] = 10
           return
       end
foo! (generic function with 1 method)
 
julia> foo!(a)
 
julia> a
3-element Array{Int64,1}:
 10
  2
  3
 
julia> function bar!(a)
           a = [1, 2]
           return
       end
bar! (generic function with 1 method)
 
julia> bar!(a)
 
julia> a
3-element Array{Int64,1}:
 10
  2
  3

Why does the first function successfully alter the global variable a, while the second function does not?

To answer that question, we need to explain the distinction between values and bindings. We’ll start with a particularly simple example of a value and a binding.

In Julia, the number 1 is a value:

julia> 1
1

In contrast to operating on a value, the Julia assignment operation shown below creates a binding:

julia> a = 1
1

This newly created binding is an association between the symbolic name a and the value 1. In general, a binding operation always associates a specific value with a specific name. In Julia, the valid names that can be used to create bindings are symbols, because it is important that the names be parseable without ambiguity. For example, the string "a = 1" is not an acceptable name for a binding, because it would be ambiguous with the code that binds the value 1 to the name a.

This first example of values vs. bindings might lead one to believe that values and bindings are very easy to both recognize and distinguish. Unfortunately, the values of many common objects are not obvious to many newcomers.

What, for example, is the value of the following array?

julia> [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

To answer this question, note that the value of this array is not defined by the contents of the array. You can confirm this by checking whether Julia considers two objects to be exactly identical using the === operator:

julia> 1 === 1
true
 
julia> [1, 2, 3] === [1, 2, 3]
false

The general rule is simple, but potentially non-intuitive: two arrays with identical contents are not the same array. To motivate this, think of arrays as if they were cardboard boxes. If I have two cardboard boxes, each of which contains a single ream of paper, I would not claim that the two boxes are the exact same box just because they have the same contents. Our intuitive notion of object identity is rich enough to distinguish between two containers with the same contents, but it takes some time for newcomers to programming languages to extend this notion to their understanding of arrays.

Because every container is distinct regardless of what it contains, every array is distinct because every array is its own independent container. An array’s identity is not defined by what it contains. As such, its value is not equivalent to its contents. Instead, an array’s value is a unique identifier that allows one to reliably distinguish each array from every other array. Think of arrays like numbered cardboard boxes. The value of an array is its identifier: thus the value of [1, 2, 3] is something like the identifier “Box 1”. Right now, “Box 1” happens to contain the values 1, 2 and 3, but it will continue to be “Box 1” even after its contents have changed.

Hopefully that clarifies what the value of an array is. Starting from that understanding, we need to re-examine bindings because bindings themselves behave like containers.

A binding can be thought of as a named box that contains either zero or one values. Thus, when a new Julia session is launched, the name a has no value associated with it: it is an empty container. But after executing the line a = 1, the name has a value: the container now has one element in it. Being a container, the name is distinct from its contents. As such, the name can be rebound by a later operation: the line a = 2 changes the contents of the box called a so that it refers to the value 2.

The fact that bindings behave like containers becomes a source of confusion when the value of a binding is itself a container:

a = [1, 2, 3]

In this case, the value associated with the name a is the identifier of an array that happens to have the values 1, 2, and 3 in it. But if the contents of that array are changed, the name a will still refer to the same array — because the value associated with a is not the contents of the array, but the identifier of the array.

As such, there is a very large difference between the following two operations:

a[1] = 10
a = [1, 2]

  • In the first case, we are changing the contents of the array that a refers to.
  • In the second case, we are changing which array a refers to.

In this second case, we are actually creating a brand new container as an intermediate step to changing the binding of a. This new container has, as its initial contents, the values 1 and 2. After creating this new container, the name a is changed to refer to the value that is the identifier of this new container.

This is why the two functions at the start of this post behave so differently: one mutates the contents of an array, while the other mutates which array a name refers to. Because variable names in functions are local, changing bindings inside of a function does not change the bindings outside of that function. Thus, the function bar! does not behave as some would hope. To change the contents of an array wholesale, you must not change bindings: you must change the contents of the array. To do that, bar! should be written as:

function bar!(a)
    a[:] = [1, 2]    # the replacement must have the same number of elements as a
    return
end

The notation a[:] allows one to talk about the contents of an array, rather than its identifier. In general, you should not expect that you can change the contents of any container without employing some indexing syntax that allows you to talk about the contents of the container, rather than the container itself.
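To see the corrected bar! in action, here is a minimal REPL session (using a fresh two-element array so that the replacement [1, 2] has the same number of elements as a):

julia> a = [10, 20];

julia> bar!(a)

julia> a
2-element Array{Int64,1}:
 1
 2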