
Julia 0.5 Highlights

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/aGSpKgdJc2Y/julia-0.5-highlights

To follow along with the examples in this blog post and run them live, you can go to JuliaBox, create a free login, and open the “Julia 0.5 Highlights” notebook under “What’s New in 0.5”. The notebook can also be downloaded from here.

Julia 0.5 is a pivotal release.
It introduces more transformative features than any release since the first official version.
Moreover, several of these features set the stage for even more to come in the lead-up to Julia 1.0.
In this post, we’ll go through some of the major changes in 0.5, including improvements to functional programming, comprehensions, generators, arrays, strings, and more.

Functions

Julia has always supported functional programming features:

  • higher-order functions such as map and filter;
  • anonymous functions (“lambdas”), such as x->2x;
  • closures, which capture variables from their enclosing scope.

Before this release, however, these features all came with a significant performance cost.
In a language that targets high-performance technical computing, that’s a serious limitation.
So the Julia standard library and ecosystem have been rife with work-arounds to get the expressiveness of functional programming without the performance problems.
But the right solution, of course, is to make functional programming fast – ideally just as fast as the optimal hand-written version of your code would be.
In Julia 0.5, it is.
And that changes everything.

This change is so important that there will be a separate blog post about it in the coming weeks, explaining how higher-order functions, closures and lambdas have been made so efficient, as well as detailing the kinds of zero-cost abstractions these changes enable.
But for now, I’ll just tease with a little timing comparison.
First, some definitions – they’re the same in both 0.4 and 0.5:

v = rand(10^7);                   # 10 million random numbers
double_it_vec(v) = 2v             # vectorized doubling of input
double_it_map(v) = map(x->2x, v)  # map a lambda over input
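
For reference, here is roughly what the “optimal hand-written version” looks like as an explicit loop – the baseline both definitions above are competing against (double_it_loop is an illustrative name):

function double_it_loop(v)
    out = similar(v)             # preallocate the output array
    for i in eachindex(v)
        out[i] = 2v[i]           # double each element in place
    end
    return out
end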

Here is a timing comparison in Julia 0.4:

julia> VERSION
v"0.4.7"

julia> mean([@elapsed(double_it_vec(v)) for _=1:100])
0.024444888209999998

julia> mean([@elapsed(double_it_map(v)) for _=1:100])
0.5515606454499999

On 0.4, the functional version using map is 22 times slower than the vectorized version, which uses specialized generated code for maximal speed.
Now, the same comparison in Julia 0.5:

julia> VERSION
v"0.5.0"

julia> mean([@elapsed(double_it_vec(v)) for _=1:100])
0.024549842180000003

julia> mean([@elapsed(double_it_map(v)) for _=1:100])
0.023871925960000002

The version using map is as fast as the vectorized one in 0.5.
In this case, writing 2v happens to be more convenient than writing map(x->2x, v), so we may choose not to use map here, but there are many cases where functional constructs are clearer, more general, and more convenient.
Now, they are also fast.

Ambiguous methods

One design decision that any multiple dispatch language must make is how to handle dispatch ambiguities: cases where none of the methods applicable to a given set of arguments is more specific than the rest.
Suppose, for example, that a generic function, f, has the following methods:

f(a::Int, b::Real) = 1
f(a::Real, b::Int) = 2

In Julia 0.4 and earlier, the second method definition causes an ambiguity warning:

WARNING: New definition
    f(Real, Int64) at none:1
is ambiguous with:
    f(Int64, Real) at none:1.
To fix, define
    f(Int64, Int64)
before the new definition.

This warning is clear and gets right to the point: the case f(a,b) where a and b are of type Int (aka Int64 on 64-bit systems) is ambiguous.
Evaluating f(3,4) calls the first method of f – but this behavior is undefined.
Giving a warning whenever methods could be ambiguous is a fairly conservative choice: it urges people to define a method covering the ambiguous intersection before even defining the methods that overlap.
When we decided to give warnings for potentially ambiguous methods, we hoped that people would avoid ambiguities and all would be well in the world.

Warning about method ambiguities turns out to be both too strict and too lenient.
It’s far too easy for ambiguities to arise when shared generic functions serve as extension points across unrelated packages.
When many packages extend the same generic functions, it’s common for the methods added to have some ambiguous overlap.
This happens even when each package has no ambiguities on its own.
Worse still, slight changes to one package can introduce ambiguities elsewhere, resulting in the least fun game of whack-a-mole ever.
At the same time, the fact that ambiguities only cause warnings means that people learn to ignore them, which is annoying at best, and dangerous at worst: it’s far too easy for a real problem to be hidden by a barrage of insignificant ambiguity warnings.
In particular, on 0.4 and earlier if an ambiguous method is actually called, no error occurs.
Instead, one of the possible methods is called, based on the order in which methods were defined – which is essentially arbitrary when they come from different packages.
Usually the method works – it does apply, after all – but this is clearly not the right thing to do.

The solution is simple: in Julia 0.5 the existence of potential ambiguities is fine, but actually calling an ambiguous method is an immediate error.
The above method definitions for f, which previously triggered a warning, are now silent, but calling f with two Int arguments is a method dispatch error:

julia> f(3,4)
ERROR: MethodError: f(::Int64, ::Int64) is ambiguous. Candidates:
  f(a::Real, b::Int64) at REPL[2]:1
  f(a::Int64, b::Real) at REPL[1]:1
 in eval(::Module, ::Any) at ./boot.jl:231
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46
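
As the old warning suggested, the fix is the same as ever: define a method covering the ambiguous intersection, after which the call dispatches unambiguously. For example:

julia> f(a::Int, b::Int) = 3
f (generic function with 3 methods)

julia> f(3,4)
3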

This improves the experience of using the Julia package ecosystem considerably, while also making Julia safer and more reliable.
No more torrent of insignificant ambiguity warnings.
No more playing ambiguity whack-a-mole when someone else refactors their code and accidentally introduces ambiguities in yours.
No more risk that a method call could be silently broken because of warnings that we’ve all learned to ignore.

Return type annotations

A long-requested feature has been the ability to annotate method definitions with an explicit return type.
This aids the clarity of code, serves as self-documentation, helps the compiler reason about code, and ensures that return types are what programmers intend them to be.
In 0.5, you can annotate method definitions with a return type like so:

function clip{T<:Real}(x::T, lo::Real, hi::Real)::T
    if x < lo
        return lo
    elseif x > hi
        return hi
    else
        return x
    end
end

This function is similar to the built-in clamp function, but let’s consider this definition for the sake of example.
The return annotation on clip has the effect of inserting implicit calls to x->convert(T, x) at each return point of the method.
It has no effect on any other method of clip, only the one where the annotation occurs.
In this case, the annotation ensures that this method always returns a value of the same type as x, regardless of the types of lo and hi:

julia> clip(0.5, 1, 2) # convert(T, lo)
1.0

julia> clip(1.5, 1, 2) # convert(T, x)
1.5

julia> clip(2.5, 1, 2) # convert(T, hi)
2.0
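
Under the hood, this is a fairly mechanical transformation: each return value is passed through convert to the annotated type. Roughly speaking, the annotated method behaves like this explicit version (clip_desugared is an illustrative name, and this is a sketch rather than the exact lowering):

function clip_desugared{T<:Real}(x::T, lo::Real, hi::Real)
    if x < lo
        return convert(T, lo)
    elseif x > hi
        return convert(T, hi)
    else
        return convert(T, x)
    end
end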

You’ll note that the annotated return type here is T, which is a type parameter of the clip method.
Not only is that allowed, but the return type can be an arbitrary expression of argument values, type parameters, and values from outer scopes.
For example, here is a variation that promotes its arguments:

function clip2(x::Real, lo::Real, hi::Real)::promote_type(typeof(x), typeof(lo), typeof(hi))
    if x < lo
        return lo
    elseif x > hi
        return hi
    else
        return x
    end
end

julia> clip2(2, 1, 3)
2

julia> clip2(2, 1, 13//5)
2//1

julia> clip2(2.5, 1, 13//5)
2.5

Return type annotations are a fairly simple syntactic transformation, but they make it easier to write methods with consistent and predictable return types.
If different branches of your code can lead to slightly different types, the fix is now as simple as putting a single type annotation on the entire method.

Vectorized function calls

Julia 0.5 introduces the syntax f.(A1, A2, ...) for vectorized function calls.
This syntax translates to broadcast(f, A1, A2, ...), where broadcast is a higher-order function (introduced in 0.2), which generically implements the kind of broadcasting behavior found in Julia’s “dotted operators” such as .+, .-, .*, and ./.
Since higher-order functions are now efficient, writing broadcast(f,v,w) and f.(v,w) are both about as fast as loops specialized for the operation f and the shapes of v and w.
This syntax lets you vectorize your scalar functions the way built-in vectorized functions like log, exp, and atan2 work.
In fact, in the future, this syntax will likely replace the pre-vectorized methods of functions like exp and log, so that users will write exp.(v) to exponentiate a vector of values.
This may seem a little uglier, but it’s more consistent than choosing an essentially arbitrary set of functions to pre-vectorize, and as I’ll explain below, this approach can also have significant performance benefits.
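
For example, any scalar function – including one you’ve just defined – can be applied elementwise with a single dot (g here is just for illustration):

julia> g(x, y) = 3x + 4y;

julia> g.([1, 2, 3], 10)  # same as broadcast(g, [1, 2, 3], 10)
3-element Array{Int64,1}:
 43
 46
 49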

To give a more concrete sense of what this syntax can be used for, consider the clip function defined above for real arguments.
This scalar function can be applied to vectors using vectorized call syntax without any further method definitions:

julia> v = randn(10)
10-element Array{Float64,1}:
 -0.868996
  1.79301
 -0.309632
  1.16802
 -1.57178
 -0.223385
 -0.608423
 -1.54862
 -1.33672
  0.864448

julia> clip(v, -1, 1)
ERROR: MethodError: no method matching clip(::Array{Float64,1}, ::Int64, ::Int64)
Closest candidates are:
  clip{T<:Real}(::T<:Real, ::Real, ::Real) at REPL[2]:2

julia> clip.(v, -1, 1)
10-element Array{Float64,1}:
 -0.868996
  1.0
 -0.309632
  1.0
 -1.0
 -0.223385
 -0.608423
 -1.0
 -1.0
  0.864448

The second and third arguments don’t need to be scalars – as with dotted operators, they can be vectors as well, and the clip operation will be applied to each corresponding triple of values:

julia> clip.(v, repmat([-1,0.5],5), repmat([-0.5,1],5))
10-element Array{Float64,1}:
 -0.868996
  1.0
 -0.5
  1.0
 -1.0
  0.5
 -0.608423
  0.5
 -1.0
  0.864448

From these examples, it may be unclear why this operation is called “broadcast”.
The function gets its name from the following behavior:
wherever one of its arguments has a singleton dimension (i.e. dimension of size 1), it “broadcasts” that value along the corresponding dimension of the other arguments when applying the operator.
Broadcasting allows dotted operations to easily do handy tricks like mean-centering the columns of a matrix:

julia> A = rand(3,4);

julia> B = A .- mean(A,1)
3×4 Array{Float64,2}:
  0.343976   0.427378  -0.503356  -0.00448691
 -0.210096  -0.531489   0.168928  -0.128212
 -0.13388    0.104111   0.334428   0.132699

julia> mean(B,1)
1×4 Array{Float64,2}:
 0.0  0.0  0.0  0.0

The matrix A is 3×4 and mean(A,1) is 1×4 so the .- operator broadcasts the subtraction of each mean value along the corresponding column of A, thereby mean-centering each column.
Combining this broadcasting behavior with vectorized call syntax lets us write some fairly fancy custom array operations very concisely:

julia> clip.(B, [-0.3, -0.2, -0.1], [0.4, 0.3, 0.2, 0.1]')
3×4 Array{Float64,2}:
  0.343976   0.3       -0.3       -0.00448691
 -0.2       -0.2        0.168928  -0.128212
 -0.1        0.104111   0.2        0.1

This expression clips each element of B with its own specific (lo,hi) pair from this matrix:

julia> [(lo,hi) for lo=[-0.3, -0.2, -0.1], hi=[0.4, 0.3, 0.2, 0.1]]
3×4 Array{Tuple{Float64,Float64},2}:
 (-0.3,0.4)  (-0.3,0.3)  (-0.3,0.2)  (-0.3,0.1)
 (-0.2,0.4)  (-0.2,0.3)  (-0.2,0.2)  (-0.2,0.1)
 (-0.1,0.4)  (-0.1,0.3)  (-0.1,0.2)  (-0.1,0.1)

Vectorized call syntax avoids ever materializing this array of pairs, however, and the messy code to apply clip to each element of B with the corresponding lo and hi values doesn’t have to be written.
When B is larger than a toy example, not constructing a temporary matrix of (lo,hi) pairs can be a big efficiency win.

There is a bit more to the story about vectorized call syntax.
It’s common to write expressions applying multiple vectorized functions to some arrays.
For example, one might write something like:

max(abs(X), abs(Y))

This computes the absolute values of each element of X and Y and takes the larger of the corresponding elements from abs(X) and abs(Y).
In this traditional vectorized form, the code allocates two temporary intermediate arrays – one to store each of abs(X) and abs(Y).
If we use the new vectorized function call syntax, however, these calls are syntactically fused into a single call to broadcast with an anonymous function.
In other words, we write this:

max.(abs.(X), abs.(Y))

which internally becomes this:

broadcast((x, y)->max(abs(x), abs(y)), X, Y)

This version of the computation avoids allocating any intermediate arrays and performs the entire vectorized computation all at once, directly into the result array.
We can see this difference in memory usage and speed when we benchmark these expressions:

julia> using BenchmarkTools

julia> X, Y = rand(1000,1000), rand(1000,1000);

julia> @benchmark max(abs(X), abs(Y))
BenchmarkTools.Trial:
  memory estimate:  22.89 mb
  minimum time:     13.95 ms (1.77% GC)
  median time:      14.17 ms (1.76% GC)
  mean time:        14.32 ms (1.78% GC)
  maximum time:     17.15 ms (3.47% GC)

julia> @benchmark max.(abs.(X), abs.(Y))
BenchmarkTools.Trial:
  memory estimate:  7.63 mb
  minimum time:     2.84 ms (0.00% GC)
  median time:      2.98 ms (0.00% GC)
  mean time:        3.27 ms (18.26% GC)
  maximum time:     5.96 ms (65.68% GC)

julia> 22.89/7.63, 14.32/3.27
(3.0,4.379204892966361)

I’m using the BenchmarkTools package here instead of hand-rolled timing loops. BenchmarkTools has been carefully designed to avoid many of the common pitfalls of benchmarking code and to provide sound statistical estimates of how much time and memory your code uses.
For the sake of brevity, I’m omitting some of the less relevant output from @benchmark.

As you can see, the dotted form uses 3 times less memory and is about 4.4 times faster.
These improvements come from avoiding temporary allocations and performing the entire computation in a single pass over the arrays.
Even greater reduction in allocation can occur when we use the new .= operator to also do vectorized assignment:

julia> Z = zeros(X); # matrix of zeros similar to X

julia> @benchmark Z .= max.(abs.(X), abs.(Y))
BenchmarkTools.Trial:
  memory estimate:  96.00 bytes
  minimum time:     1.76 ms (0.00% GC)
  median time:      1.82 ms (0.00% GC)
  mean time:        1.89 ms (0.00% GC)
  maximum time:     4.24 ms (0.00% GC)

With in-place vectorized assignment, we can fill the pre-allocated array, Z, without doing any allocation (the 96 bytes is an artifact), and do so roughly 7.6 times faster than the old-style vectorized computation.
This can be a big win in situations where we can reuse the same output array for multiple computations.

The last major missing piece of vectorized call syntax is yet to come – it will be implemented in the next version of Julia.
Dotted operators like .+ and .* will cease to be their own independent operators and simply become the vectorized forms of the corresponding scalar operators, + and *.
In other words, instead of .+ being a function as it is now, with its own behavior independent of +, when you write X .+ Y it will mean broadcast(+, X, Y).
Furthermore, dotted operators will participate in the same syntax-level fusion as other vectorized calls, so an expression like exp.(log.(X) .+ log.(Y)) will translate into a single call to broadcast:

broadcast((x, y)->exp(log(x) + log(y)), X, Y)

This change will complete the transition to a generalized approach to vectorized function application (including syntax-level loop fusion), which will make Julia’s story for writing allocation-free array code much stronger.

Comprehensions

Julia’s array comprehensions have always supported some advanced features such as iterating with several variables to produce multidimensional arrays.
This release rounds out the functionality of comprehensions with two additional features: nested generation with multiple for clauses, and filtering with a trailing if clause.
To demonstrate these features, consider making a dollar (100¢) using quarters (25¢), dimes (10¢), nickels (5¢) and pennies (1¢).
We can generate an array of tuples of total values in each kind of coin by using a comprehension with nested for clauses:

julia> change = [(q,d,n,p) for q=0:25:100 for d=0:10:100-q for n=0:5:100-q-d for p=100-q-d-n]
242-element Array{NTuple{4,Int64},1}:
 (0,0,0,100)
 (0,0,5,95)
 (0,0,10,90)
 (0,0,15,85)
 (0,0,20,80)
 (0,0,25,75)
 ⋮
 (75,10,5,10)
 (75,10,10,5)
 (75,10,15,0)
 (75,20,0,5)
 (75,20,5,0)
 (100,0,0,0)

There are a few notable differences from the multidimensional array syntax:

  • Each iteration is a new for clause, rather than a single compound iteration separated by commas;
  • Each successive for clause can refer to variables from the previous clauses;
  • The result is a single flat vector regardless of how many nested for clauses there are.

The tuple (q,d,n,p) in the comprehension body is a breakdown of monetary value into quarters, dimes, nickels and pennies.
Note that the iteration range for p isn’t a range at all, it’s a single value, 100-q-d-n, the unique number guaranteeing that each tuple adds up to a dollar.
(This relies on the fact that a number behaves like an immutable zero-dimensional container, holding only itself, a behavior which is sometimes convenient but which has been the subject of significant debate.
As of 0.5 it still works.)
We can verify that each tuple adds up to 100:

julia> extrema([sum(t) for t in change])
(100,100)

Since 100 is both the minimum and maximum of all the tuple sums, we know they are all exactly 100.
So, there are 242 ways to make a dollar with common coins.
But suppose we want to ensure that the value in pennies is less than the value in nickels, and so forth.
By adding a filter clause, we can do this easily too:

julia> [(q,d,n,p) for q=0:25:100 for d=0:10:100-q for n=0:5:100-q-d for p=100-q-d-n if p < n < d < q]
4-element Array{NTuple{4,Int64},1}:
 (50,30,15,5)
 (50,30,20,0)
 (50,40,10,0)
 (75,20,5,0)

The only difference here is the if p < n < d < q clause at the end of the comprehension, which has the effect that the result only contains cases where this predicate holds true.
There are exactly four ways to make a dollar with strictly increasing value from pennies to nickels to dimes to quarters.

Nested and filtered comprehensions aren’t earth-shattering features – everything you can do with them can be done in a variety of other ways – but they are expressive and convenient, found in other languages, and they allow you to try more things with your data quickly and easily, with less pointless refactoring.

Generators

In the previous section we used an array comprehension to take the sum of each tuple, save the sums as an array, and then pass that array of sums to the extrema function to find the largest and smallest sum (they’re all 100):

julia> @time extrema([sum(t) for t in change])
  0.000072 seconds (8 allocations: 2.203 KB)
(100,100)

Wrapping this in the @time macro shows that this expression allocates 2.2 KB of memory – mostly for the array of sums, which is thrown away after the computation.
But allocating an array just to find its extrema is unnecessary:
the minimum and maximum can be computed over streamed data by keeping the largest and smallest values seen so far.
In other words, this calculation could be expressed with constant memory overhead by interleaving the production of values with computation of extrema.
Previously, expressing this interleaved computation required some amount of refactoring, and many approaches were considerably less efficient.
In 0.5, if you simply omit the square brackets around an array comprehension, you get a generator expression, which instead of producing an array of values, can be iterated over, yielding one value at a time.
Since extrema works with arbitrary iterable objects – including generators – expressing an interleaved calculation using constant memory is now as simple as deleting [ and ]:

julia> @time extrema(sum(t) for t in change)
  0.000066 seconds (6 allocations: 208 bytes)
(100,100)

This avoids allocating a temporary array of sums entirely, instead computing the next tuple’s sum only when the extrema function is ready to accept a new value.
Using a generator reduces the memory overhead to 208 bytes – the size of the return value.
More importantly, the memory usage doesn’t depend on the size of the change array anymore – it will always be just 208 bytes, even if change holds a trillion tuples.
It’s not hard to imagine situations where such a reduction in asymptotic memory usage is crucial.
The similar syntax between array comprehensions and generator expressions makes it trivial to move back and forth between the two styles of computation as needed.
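
To see why a generator-based extrema needs only constant memory, here is a minimal sketch of a streaming implementation over any iterable, written against 0.5’s start/next/done iteration protocol (my_extrema is an illustrative name, not Base’s actual implementation):

function my_extrema(itr)
    state = start(itr)
    v, state = next(itr, state)  # assumes itr is non-empty
    lo = hi = v                  # smallest and largest values seen so far
    while !done(itr, state)
        v, state = next(itr, state)
        lo = min(lo, v)
        hi = max(hi, v)
    end
    return (lo, hi)
end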

Initializing collections

The new generator syntax dovetails particularly nicely with Julia’s convention for constructing collections – to make a new collection, you call the constructor with a single iterable argument, which yields the values you want in the new collection.
In its simplest form, this looks something like:

julia> IntSet([1, 4, 9, 16, 25, 36, 49, 64])
IntSet([1, 4, 9, 16, 25, 36, 49, 64])

In this expression, an array of integers is passed to the IntSet constructor to create an object representing that set, which in this case happen to be small squares.
Once constructed, the IntSet object no longer refers to the original array of integers.
Instead, it uses a bitmask to efficiently store and operate on sets.
It displays itself as you would construct it from an array, but that’s merely for convenience – there’s no actual array anymore.

Now, I’m a human (no blogbots here) and I find typing out even short sequences of perfect squares tedious and error prone – despite a math degree, I’m awful at arithmetic.
It would be much easier to generate squares with an array comprehension:

julia> IntSet([k^2 for k = 1:8])
IntSet([1, 4, 9, 16, 25, 36, 49, 64])

This comprehension produces the same array of integers that I typed manually above.
As before, creating this array object is unnecessary – it would be even better to generate the desired squares as they are inserted into the new IntSet.
Which, of course, is precisely what generator expressions allow:

julia> IntSet(k^2 for k = 1:8)
IntSet([1, 4, 9, 16, 25, 36, 49, 64])

Using a generator here is just as clear, more concise, and significantly more efficient:

julia> using BenchmarkTools

julia> @benchmark IntSet([k^2 for k = 1:8])
BenchmarkTools.Trial:
  memory estimate:  320.00 bytes
  minimum time:     163.00 ns (0.00% GC)
  median time:      199.00 ns (0.00% GC)
  mean time:        245.18 ns (12.95% GC)
  maximum time:     5.36 μs (92.47% GC)

julia> @benchmark IntSet(k^2 for k = 1:8)
BenchmarkTools.Trial:
  memory estimate:  160.00 bytes
  minimum time:     114.00 ns (0.00% GC)
  median time:      139.00 ns (0.00% GC)
  mean time:        165.74 ns (11.48% GC)
  maximum time:     4.82 μs (93.20% GC)

As you can see from this benchmark, the version with an array comprehension uses twice as much memory and is 50% slower than constructing the same IntSet using a generator expression.

Constructing dictionaries

Generators can be used to construct dictionaries too, and this use case deserves some special attention since it completes a multi-release process of putting user-defined dictionary types on an equal footing with the built-in Dict type.
In Julia 0.3, the => operator only existed as part of syntax for constructing Dict objects:
[k₁ => v₁, k₂ => v₂] and [k(i) => v(i) for i = c].
This design was based on other dynamic languages where dictionaries are among a small set of built-in types with special syntax that are deeply integrated into the language.
As Julia’s ecosystem has matured, however, it has become apparent that Julia is actually more like Java or C++ in this respect than it is like Python or Lua: the Dict type isn’t that special – it happens to be defined in the standard library, but is otherwise quite ordinary.
Many programs use other dictionary implementations: for example, the tree-based SortedDict type, which sorts values by key, or OrderedDict, which maintains keys in the order they are inserted.
Having special syntax only for Dict makes using other dictionary implementations problematic.
In 0.3, there was no good syntax for constructing values of these dictionaries – the best one could do was to invoke a constructor with an array of two-tuples:

SortedDict([(k₁, v₁), (k₂, v₂)])        # fixed-size dictionaries
SortedDict([(k(i), v(i)) for i in c])   # dictionary comprehensions

Not only are these constructions inconvenient and ugly, they’re also inefficient since they create temporary heap-allocated arrays of heap-allocated tuples of key-value pairs.
With much relief, we can now instead write:

SortedDict(k₁ => v₁, k₂ => v₂)          # fixed-size dictionaries, since 0.4
SortedDict(k(i) => v(i) for i = c)      # dictionary comprehensions, since 0.5

This last syntax combines two orthogonal features introduced in 0.4 and 0.5, respectively:

  • k => v as a standalone syntax for a Pair object, and
  • generator expressions, particularly to initialize collections.

The Dict type is now constructed in exactly the same way:

julia> Dict("foo" => 1, "bar" => 2)
Dict{String,Int64} with 2 entries:
  "bar" => 2
  "foo" => 1

julia> Dict("*"^k => k for k = 1:10)
Dict{String,Int64} with 10 entries:
  "**********" => 10
  "***"        => 3
  "*******"    => 7
  "********"   => 8
  "*"          => 1
  "**"         => 2
  "****"       => 4
  "*********"  => 9
  "*****"      => 5
  "******"     => 6

This generalization makes the syntax for constructing a Dict slightly longer, but we feel that the increased consistency, ability to change dictionary implementations with a simple search-and-replace, and putting user-defined dictionary-like types on the same level as the built-in Dict type make this change well worthwhile.

Arrays

The 0.5 release was originally intended to include a large number of disruptive array changes, collectively dubbed “Arraymageddon”.
After much discussion, experimentation and benchmarking, this set of breaking changes was significantly reduced for a variety of reasons:

  • Some changes were deemed not to be good ideas after all;
  • Others were of unclear benefit, so it was decided to reconsider them in the future once there is more information to support a decision;
  • A few didn’t get implemented due to lack of developer time, including some cases where everyone agrees there’s a problem but there is not yet any complete design for a solution.

Although not many breaking changes happened in 0.5, this was a major release for Julia’s array infrastructure.
The code to implement various complex polymorphic indexing operations for generic arrays and array-like structures was majorly refactored, and in the process it shrank by 40% while becoming more complete, more general, and faster.
You can read more about the very cool things you can now do with array-like types in an excellent pair of blog posts published here earlier in the year: Multidimensional algorithms and iteration and Generalizing AbstractArrays.
In the next two subsections, I’ll go over some of the array changes that did happen in 0.5.

Dimension sum slices

The most significant breaking change in the 0.5 cycle affects multidimensional array slicing.
To explain it we’ll need a little terminology.
A singleton dimension of a multidimensional array is a dimension whose size is 1.
For example, a 5×1 matrix has a trailing singleton dimension and may be called a “column matrix”, and a 1×5 matrix has a leading singleton dimension and may be called a “row matrix”.
A scalar slice refers to a dimension in a multidimensional slice expression where the index is a scalar integer (considered to be zero-dimensional), rather than a 1-dimensional range or vector, or some higher-dimensional collection of indices.
For example, in A[1,:] the first slice is scalar, the second is not; in A[:,2] the second slice is scalar, the first is not; in A[3,4] both slices are scalar.

All previous versions of Julia have dropped trailing scalar slices when performing multidimensional array slicing.
That is, when an array was sliced with multiple indices, the resulting array had the number of dimensions of the original array minus the number of trailing scalar slices.
So when you sliced a column out of a matrix the result was a 1-dimensional vector, but when you sliced a row the result was a 2-dimensional row matrix:

julia> VERSION
v"0.4.7"

julia> M = [i+j^2 for i=1:3, j=1:4]
3x4 Array{Int64,2}:
 2  5  10  17
 3  6  11  18
 4  7  12  19

julia> M[:,1] # vector
3-element Array{Int64,1}:
 2
 3
 4

julia> M[3,:] # row matrix
1x4 Array{Int64,2}:
 4  7  12  19

This rule is handy for linear algebra since row and column slices have distinct types and different orientations, but its complexity, asymmetry, and lack of generality make it less than ideal for arrays as general purpose containers.
With more dimensions, the asymmetry of this behavior can be seen even in a single slice operation:

julia> VERSION
v"0.4.7"

julia> T = [i+j^2+k^3 for i=1:3, j=1:4, k=1:2]
3x4x2 Array{Int64,3}:
[:, :, 1] =
 3  6  11  18
 4  7  12  19
 5  8  13  20

[:, :, 2] =
 10  13  18  25
 11  14  19  26
 12  15  20  27

julia> T[2,:,2]
1x4 Array{Int64,2}:
 11  14  19  26

The leading dimension of this slice is retained while the trailing dimension is discarded – even though both are scalar slices.
The result array is neither 3-dimensional like the original, nor 1-dimensional like the collective indexes (0 + 1 + 0); instead, it’s 2-dimensional – apropos of nothing.
Here, in another fairly similar slice, all dimensions are kept:

julia> T[:,4,:]
3x1x2 Array{Int64,3}:
[:, :, 1] =
 18
 19
 20

[:, :, 2] =
 25
 26
 27

By comparison, the new slicing behavior in 0.5 is simple, systematic, and symmetrical.
(And not original by any means – APL pioneered this array slicing scheme in the 1960s.)
In Julia 0.5, when an array is sliced, the dimension of the result is the sum of the dimensions of the slices, and the dimension sizes of the result are the concatenation of the sizes of the slices.
Thus, row slices and column slices both produce vectors:

julia> VERSION
v"0.5.0"

julia> M[:,1] # vector: 1 + 0 = 1
3-element Array{Int64,1}:
 2
 3
 4

julia> M[1,:] # vector: 0 + 1 = 1
4-element Array{Int64,1}:
  2
  5
 10
 17

Similarly, slicing a 3-dimensional array with scalars in all but one dimension also produces a vector:

julia> T[2,:,2] # vector: 0 + 1 + 0 = 1
4-element Array{Int64,1}:
 11
 14
 19
 26

The only example from above that doesn’t produce a vector is the last one:

julia> T[:,4,:] # matrix: 1 + 0 + 1 = 2
3×2 Array{Int64,2}:
 18  25
 19  26
 20  27

The result is a matrix since the leading and trailing slices are ranges, and the middle slice disappears since it is scalar, leaving a matrix.
The 0.5 slicing behavior naturally generalizes to higher dimensional slices:

julia> I = [1 2 1; 1 3 2]
2×3 Array{Int64,2}:
 1  2  1
 1  3  2

julia> J = [4 2 1 3]
1×4 Array{Int64,2}:
 4  2  1  3

julia> M[I,J]
2×3×1×4 Array{Int64,4}:
[:, :, 1, 1] =
 17  18  17
 17  19  18

[:, :, 1, 2] =
 5  6  5
 5  7  6

[:, :, 1, 3] =
 2  3  2
 2  4  3

[:, :, 1, 4] =
 10  11  10
 10  12  11

Here we have a natural identity on dimensions, which we can verify directly in the REPL:

julia> size(M[I,J]) == (size(I,1), size(I,2), size(J,1), size(J,2))
true

In addition to being more systematic and symmetrical, this new behavior allows many complex indexing operations to be expressed concisely.

Although the change to multidimensional slicing behavior is a significant breaking change, it has caused surprisingly little havoc in the Julia package ecosystem.
It tends to primarily affect linear algebra code, and when code does break, it’s usually fairly clear what is broken and what needs to be done to fix it.
When updating your code, if you need to keep a dimension that is dropped under the new indexing behavior, you can write M[1:1,:]:

julia> M[1:1,:]
1×4 Array{Int64,2}:
 2  5  10  17

Since integer range construction can be eliminated by Julia’s compiler, writing this is free but has the effect of keeping a dimension which would otherwise be dropped under the new rules.
Unfortunately, there’s no way to make this change without breaking some code – we apologize in advance for the inconvenience, and we hope you find the improvement to be worthwhile.

Array views

One of the major news items of 0.5 is a non-change:
array slices still create copies of array data.
There was a lot of discussion about changing the default behavior to creating views – the motivation being drastically better performance in a variety of slow cases – but after many experiments and benchmarks, we ended up deciding against this change and keeping the old behavior.
The conversation about this decision is long, so I’ll summarize the major points:

  • Slicing should either consistently produce views or copies.
    Unpredictably doing one or the other depending on types – or worse still, on runtime values – would be a disaster for writing reliable, generic code.

  • Guaranteeing view semantics for all abstract arrays – especially sparse and custom array types – is hard and can be quite slow and/or expensive in general cases.

  • Even in the case of dense arrays with cheap array views, it’s not clear that views are always a performance win.
    In some cases they definitely are, but in others the fact that a copied slice is contiguous and has optimal memory ordering for iteration overwhelms the benefit of not copying.

  • Copied slices are easier to reason about and less likely to lead to subtle bugs than views.
    Views can lead to situations where someone modifies the view, not realizing that it’s a view, thereby unintentionally modifying the original array.
    These kinds of bugs are hard to track down and even harder to notice.

  • There is no clear transition or deprecation strategy.
    Changing from copying slices to views would be a major compatibility issue.
    We generally give programmers deprecation warnings when some behavior is going to break or change in the next release.
    Sometimes we can’t do that so we just bite the bullet and break code with an error. But changing slices to views wouldn’t break code with an error, it would just silently cause code to produce different, incorrect results.
    There’s no clear way to make this transition safely.

Taken together this makes a compelling case against changing the default slicing behavior to returning views.
That said, even if they’re not the default, views are a crucial tool for performance in some situations.
Accordingly, a huge amount of work went into improving the ergonomics of views in 0.5, including:

  • Renaming the function for view construction from “sub” to “view”, which seems like a much better name.

  • Array views now support all forms of indexing supported by arrays.
    Previously, views did not support some of the more complex forms of array indexing.

  • The @view macro was introduced, allowing the use of natural slicing syntax for views.
    In other words you can now write @view A[1:end-1,2:end] instead of view(A, 1:size(A,1)-1, 2:size(A,2)).
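
To illustrate the last point, here is a small example of view semantics (the values are arbitrary):

julia> A = collect(reshape(1:8, 2, 4));

julia> V = @view A[:, 2:3];  # no copy – V shares memory with A

julia> V[1,1] = 100;         # mutating the view mutates A as well

julia> A
2×4 Array{Int64,2}:
 1  100  5  7
 2    4  6  8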

Since views are such an important tool both for performance and for expressing complex mutating operations on arrays (especially with higher-order functions), we may introduce a special syntax for view slices in the future.
In particular, the syntax A@[I...] had a fair amount of popular support.
Stay tuned!

And more…

This is far from the full extent of the improvements introduced in Julia 0.5, but this blog post is already getting quite long, so I’ll just summarize a few of the other big ticket items:

  • The set of string types and operations has been significantly simplified and streamlined.
    The ASCIIString and UTF8String types have been merged into a single String type, and the UTF16String and UTF32String and related functions have been moved into the LegacyStrings package, which keeps the same implementations as 0.4.
    In the future, better support for different string encodings will be developed under the StringEncodings package.

  • Most functionality related to prime generation, primality checking, and combinatorics has been moved into two external packages: Primes and Combinatorics.
    To use these functions, you’ll need to install these packages and do using Primes or using Combinatorics as necessary.

  • Julia’s LLVM version was upgraded from 3.3 to 3.7.1.
    This may not seem like a big deal, but the transition required herculean effort by many core Julia contributors.
    For a series of different and impossibly annoying reasons, LLVM versions 3.4, 3.5 and 3.6 were not usable for Julia, so we’re very happy to be back to using current versions of our favorite compiler framework.

  • Support for compiling and running on ARM chips is much improved since 0.4.
    Julia 0.5 also introduced initial support for Power systems, a development which has been supported and driven by IBM.
    We will be expanding and improving support for many architectures going forward.
    With support for ARM and Power, Julia is already a productive platform for technical computing from embedded systems to big iron.

  • The 0.5 release has experimental multithreading support.
    This isn’t ready for production usage, but it’s fun to play around with and you can already get impressive performance gains – scalability is a key focus.
    Julia’s threading provides true concurrent execution like C++, Go or Java:
    different threads can do work at the same time, up to the number of physical cores available.

  • Interactive debugging support has been a weak spot in the Julia ecosystem for some time, but not any more.
    On a vanilla build of Julia 0.5, you can install the Gallium package to get a full-fledged, high-performance debugger:
    set breakpoints, step through code, examine variables, and inspect stack frames.

I hope you’ve enjoyed this overview of highlights from the new release of Julia, and that you enjoy the release itself even more.
Julia 0.5 is easily the strongest release to date, but of course the next one will be even better 🙂

Happy coding!

Julia 0.5 Release Announcement

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/tyDIHJxQsDY/julia-0.5-release

After over a year of development, the Julia community is proud to announce
the release of version 0.5 of the Julia language and standard library.
This release contains major language refinements and numerous standard library improvements.
A long list of changes is available in the NEWS log found in our main repository, with a summary reproduced below.
A separate blog post detailing some of the highlights of the new release has also been posted.

We’ll be releasing regular bugfix backports from the 0.5.x line, which is recommended for users requiring a stable language and API.
Major feature work is ongoing on master for 0.6-dev.

The Julia ecosystem continues to grow, and there are now over one thousand registered packages!
The third annual JuliaCon took place in Cambridge, MA in the summer of 2016, with an exciting line up of talks and keynotes.
Most of them are available to view.

Binaries are available from the main download page or visit JuliaBox to try this release from the comfort of your browser. Happy Coding!

Notable compiler and language changes:

  • The major focus of this release has been the ability to write fast functional code, removing the earlier performance penalty for anonymous functions and closures.
    This has been achieved by making each function and closure its own type, with the captured variables of a closure stored as fields of that type.
    All functions, including anonymous functions, are now generic and support all features.

  • Experimental support for multithreading.

  • All dimensions indexed by scalars are now dropped, whereas previously only trailing scalar dimensions would be omitted from the result.
    This is a major breaking change, but it makes the indexing rules much more consistent.

  • Generator expressions can now create iterators whose values are computed only on demand.

  • Experimental support for arrays whose indexing starts from values other than 1. Standard Julia arrays are still 1-based, but external packages can implement array types with indexing from arbitrary indices.

  • Major simplification of the string types, unifying ASCIIString and UTF8String as String, as well as moving types and functions related to different encodings out of the standard library.

  • Package operations now use the libgit2 library rather than shelling out to command line git. This makes these calls to package related functions much faster, and more reliable, especially on Windows.

  • And many many more changes and improvements…

Ports

Julia now runs on the ARM and Power architectures, making it possible to use it on the widest variety of hardware, from the smallest embedded machines to the largest HPC systems. Porting a language to a new architecture is never easy, so special thanks to the people who made it possible. Part of the work to create the Power port was supported by IBM, for which we are grateful.

Developing with Julia

The Julia debugger, Gallium, is now ready to use. It allows for a full, multi language debug experience, debugging Julia and C code with ease. The debugger is also integrated with Juno, the Julia IDE that is now fully featured and ready to use.

StructuredQueries.jl – A generic data manipulation framework

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/vKarZhlIyk4/StructuredQueries

This post describes my work conducted this summer at the Julia Lab to develop StructuredQueries.jl, a generic data manipulation framework for Julia.

Our initial vision for this work was much inspired by Hadley Wickham’s dplyr R package, which provides data manipulation verbs that are generic over in-memory R tabular data structures and SQL databases, and DataFramesMeta (begun by Tom Short), which provides metaprogramming facilities for working with Julia DataFrames.

While a generic querying interface is a worthwhile end in itself (and has been discussed elsewhere), it may also be useful for solving problems specific to in-memory Julia tabular data structures. We will discuss how a query interface suggests solutions to two important problems facing the development of tabular data structures in Julia: the column-indexing and nullable semantics problems. So, the present post will describe the progress of my work and discuss a wider scope of issues concerning support for tabular data structures in Julia. I will provide some context for these issues; the reader should feel free to skip over any uninteresting details.


The primary shortcoming of DataArrays.jl is that it does not allow for type-inferable indexing. That is, the means by which missing values are represented in DataArrays – i.e. with a token NA::NAtype object – entails that the most specific return type inferable from Base.getindex(df::DataArray{T}, i) is Union{T, NAtype}. This means that until Julia’s compiler can better handle small Union types, code that naively indexes into a DataArray will perform unnecessarily poorly.

NullableArrays.jl remedied this shortcoming by representing both missing and present values of type T as objects of type Nullable{T}. However, this solution has limitations in other respects. First, use of NullableArrays does nothing to support type inference in column-indexing of DataFrames. That is, the return type of Base.getindex(df::DataFrame, field::Symbol) is not straightforwardly inferable, even if DataFrames are built over NullableArrays. Call this first problem the column-indexing problem. Second, NullableArrays introduces certain difficulties centered around the Nullable type. Call this second problem the nullable semantics problem.

The column-indexing problem is well-documented. To see the difficulty, consider the following function

function f(df::DataFrame)
    A = df[:A]
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end

where df[:A] retrieves the column named :A from df. A user might reasonably expect the above to be idiomatic Julia: the work is written in a for loop that is wrapped inside a function. However, this code will not be (ahead-of-time) compiled to efficient machine instructions because the type of the object that df[:A] returns cannot be inferred during static analysis. This is because there is nothing the DataFrame type can do to communicate the eltypes of its columns to the compiler.
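
A common workaround is a “function barrier”: pass the column to an inner function, so that dispatch recovers the concrete column type and the loop compiles efficiently. A sketch (f_barrier and sum_column are illustrative names):

function f_barrier(df::DataFrame)
    A = df[:A]            # the type of A is not inferable here...
    return sum_column(A)  # ...but dynamic dispatch recovers it at this call
end

function sum_column(A)    # compiled separately for each concrete type of A
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end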

The nullable semantics problem is described throughout a dispersed series of GitHub issues (the interested reader can start here and here) and at least one mailing list post. To my knowledge, a self-contained treatment has not been given (I don’t necessarily claim to be giving one now). The problem has two parts, which I’ll call the “easy question” and the “hard question”, respectively:

  1. What should the semantics of f(x::Nullable{T}) be given a definition of f(x::T)?

  2. How should we implement these semantics in a sufficiently general and user-friendly way?

In most cases, the answer to the easy question is clear: f(x::Nullable{T}) should return an empty Nullable{U} if x is null and Nullable(f(x.value)) if x is not null. There is a question of how to choose the type parameter U, but a solution involving Julia’s type inference facilities seems to be about right. (The discussion of 0.5-style comprehensions and one or two discussions about the return type of map over an empty array, were all influential on this matter.) We will refer to these semantics as the standard lifting semantics. It is worth noting that there is at least one considerable alternative to standard lifting semantics, at least in the realm of binary operators on Nullable{Bool} arguments: three-valued logic. But whether to use three-valued logic or standard lifting semantics is usually clear from the context of the program and the intention of the programmer.
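
To make the standard lifting semantics concrete, here is a minimal sketch for unary functions, using type inference to choose the parameter U as described (lift is an illustrative name, not an API provided by these packages):

function lift{T}(f, x::Nullable{T})
    U = Core.Inference.return_type(f, Tuple{T})  # infer the result type
    return isnull(x) ? Nullable{U}() : Nullable{U}(f(get(x)))
end

lift(sqrt, Nullable(4.0))        # Nullable{Float64}(2.0)
lift(sqrt, Nullable{Float64}())  # empty (null) Nullable{Float64}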

On the other hand, the hard question is still unresolved. There are a number of possible solutions, and it’s difficult to know how to weigh their costs and benefits.

We’ll return to the column-indexing problem and the hard question of nullable semantics after we’ve described the present query interface. Before we dive in, I want to emphasize that this blog post is a status update, not a release notice (though StructuredQueries is registered so that you can play with it if you like). StructuredQueries (SQ) is a work in progress, and it will likely remain that way for some time. I hope to convince the reader that SQ nonetheless represents an interesting and worthwhile direction for the development of tabular data facilities in Julia.

The query framework

The StructuredQueries package provides a framework for representing the structure of a query without assuming any specific corresponding semantics. By the structure of a query, we mean the series of particular manipulation verbs invoked and the respective arguments passed to these verbs. By the semantics of a query, we mean the actual behavior of executing a query with a particular structure against a particular data source. A query semantics thus depends both on the structure of the query and on the type of the data source against which the query is executed. We will refer to the implementation of a particular query semantics as a collection machinery.

Decoupling the representation of a query’s structure from the collection machinery helps to make the present query framework

  • generic – the framework should be able to support multiple backends.
  • modular – the framework should encourage modularity of collection machinery.
  • extensible – the framework should be easily extensible to represent (relatively) arbitrary manipulations.

These desiderata are interrelated. For instance, modularity of collection machinery allows the latter to be re-used in support for different data backends, thereby supporting generality as well.

In this section we’ll describe how SQ represents query structures. In the following sections we’ll see how SQ’s query representation framework suggests solutions to the column-indexing and nullable semantics problems described above.

To express a query in SQ, one uses the @query macro:

@query qry

where qry is Julia code that follows a certain structure that we will describe below. qry is parsed according to what we’ll call a query context. By a context we mean a general semantics for Julia code that may differ from the semantics of the standard Julia environment. That is to say: though qry must be valid Julia syntax, the code is not run as it would were it executed outside of the @query macro. Rather, code such as qry that occurs inside of a query context is subject to a number of transformations before it is run. @query uses these transformations to produce a graphical representation of the structure of qry. An @query qry invocation returns a Query object, which wraps the query graph produced as a result of processing qry.

We said above that SQ represents queries in terms of their structure but does not itself guarantee any particular semantics. This allows packages to implement their own semantics for a given query structure. To demonstrate this design, I’ve put together (i) an abstract tabular data type, AbstractTable; (ii) an interface to support a collection machinery against what I call column-indexable types T <: AbstractTable; and (iii) a concrete tabular data type, Table <: AbstractTable that satisfies the column-indexable interface and therefore inherits a collection machinery to support SQ queries.

The following behavior mimics what one would expect from querying against a DataFrame. The main reason for putting together a demonstration using Tables and not DataFrames has to do with ease of experimentation: I can modify the AbstractTable/Table types and interfaces more easily than I can the DataFrame type and interface. Indeed, this project has become just as much about designing an in-memory Julia tabular data type that is most compatible with a Julia query framework as it is about designing a query framework compatible with an in-memory Julia tabular data type. Fortunately, the implementation of backend support for Tables will be straightforward to port to support for DataFrames once we decide where such support should live.

Let’s dive into the query interface by considering examples using the iris data set. (Though the package TablesDemo.jl is intended solely as a demonstration, it is registered so that readers can easily install it with Pkg.add("TablesDemo") and follow along.)

julia> iris = Table(CSV.Source(joinpath(Pkg.dir("Tables"), "csv/iris.csv")))
Tables.Table
│ Row │ sepal_length │ sepal_width │ petal_length │ petal_width │ species  │
├─────┼──────────────┼─────────────┼──────────────┼─────────────┼──────────┤
│ 1   │ 5.1          │ 3.5         │ 1.4          │ 0.2         │ "setosa" │
│ 2   │ 4.9          │ 3.0         │ 1.4          │ 0.2         │ "setosa" │
│ 3   │ 4.7          │ 3.2         │ 1.3          │ 0.2         │ "setosa" │
│ 4   │ 4.6          │ 3.1         │ 1.5          │ 0.2         │ "setosa" │
│ 5   │ 5.0          │ 3.6         │ 1.4          │ 0.2         │ "setosa" │
│ 6   │ 5.4          │ 3.9         │ 1.7          │ 0.4         │ "setosa" │
│ 7   │ 4.6          │ 3.4         │ 1.4          │ 0.3         │ "setosa" │
│ 8   │ 5.0          │ 3.4         │ 1.5          │ 0.2         │ "setosa" │
│ 9   │ 4.4          │ 2.9         │ 1.4          │ 0.2         │ "setosa" │
│ 10  │ 4.9          │ 3.1         │ 1.5          │ 0.1         │ "setosa" │
⋮
with 140 more rows.

We can then use @query to express a query against this data set – say, filtering rows according to a condition on sepal_length:

julia> q = @query filter(iris, sepal_length > 5.0)
Query with Tables.Table source

This produces a Query{S} object, where S is the type of the data source:

julia> typeof(q)
StructuredQueries.Query{Tables.Table}

The structure of the query passed to @query consists of a manipulation verb (e.g. filter) that in turn takes a data source (e.g. iris) for its first argument and any number of query arguments (e.g. sepal_length > 5.0) for its latter arguments. These are the three different “parts” of a query: (1) data sources (or just “sources”), (2) manipulation verbs (or just “verbs”), and (3) query arguments.

Each part of a query induces its own context in which code is evaluated. The most significant aspect of such contexts is name resolution. That is to say, names resolve differently depending on which part of a query they appear in and in what capacity they appear:

  1. In a data source specification context – e.g., as the first argument to a verb such as filter above – names are evaluated in the enclosing scope of the @query invocation. Thus, iris in the query used to define q above refers precisely to the Table object to which the name is bound in the top level of Main.

  2. Names of manipulation verbs are not resolved to objects but rather merely signal how to construct the graphical representation of the query. (Indeed, in what follows there is no such function filter that is ever invoked in the execution of a query involving a filter clause.)

  3. Names of functions called within a query argument context, such as > in sepal_length > 5.0 are evaluated in the enclosing scope of the @query invocation.

  4. Names that appear as arguments to function calls within a query argument context, such as sepal_length in sepal_length > 5.0 are not resolved to objects but are rather parsed as “attributes” of the data source (in this case, iris). When the data source is a tabular data structure, such attributes are taken to be column names, but such behavior is just a feature of a particular query semantics (see below in the section “Roadmap and open questions”.) The attributes that are passed as arguments to a given function call in a query argument are stored as data in the graphical query representation.

One can pipe arguments to verbs inside an @query context. For instance, the Query above is equivalent to that produced by

@query iris |> filter(sepal_length > 5.0)

In this case, the first argument to the verb filter is not a data source specification – the source (iris) is instead the first argument to |> – but rather a query argument (sepal_length > 5.0).

Query objects represent the structure of a query composed of the three building blocks above. To see how, let’s take a look at the internals of a Query:

julia> fieldnames(q)
2-element Array{Symbol,1}:
 :source
 :graph

The first field, :source, just contains the data source specified in the query – in this case, the Table object that was bound to the name iris when the query was specified. The second field, :graph contains a(n admittedly not very interesting) graphical representation of the query structure:

julia> q.graph
FilterNode
  arguments:
      1)  sepal_length > 5.0
  inputs:
      1)  DataNode
            source:  unset source

The filter verb from the original qry expression passed to @query is represented in the graph by a FilterNode object, and the data source is represented by a DataNode object. Both FilterNode and DataNode are leaf subtypes of the abstract QueryNode type. The FilterNode is connected to the DataNode via the :input field of the former. In general, these connections constitute directed acyclic graphs. We may refer to such graphs as QueryNode graphs or query graphs.

SQ currently recognizes the following verbs out of the box – that is, it properly incorporates them into a QueryNode graph:

  • select
  • filter
  • groupby
  • summarize
  • orderby
  • innerjoin (or just join)
  • leftjoin
  • outerjoin
  • crossjoin

One uses collect(q::Query) to materialize q as a concrete results set – hence the term “collection machinery”. Note that the set of verbs that receive support from the column-indexable interface – that is, the verbs that may be collected against a column-indexable data source – currently only includes the first four: select, filter, groupby, and summarize. This is what such support currently looks like:

julia> q = @query iris |>
           filter(sepal_length > 5.0) |>
           groupby(species, log(petal_length) > .5) |>
           summarize(avg = mean(digamma(petal_width)))
Query with Tables.Table source

julia> q.graph
SummarizeNode
  arguments:
      1)  avg=mean(digamma(petal_width))
  inputs:
      1)  GroupbyNode
            arguments:
                1)  species
                2)  log(petal_length) > 0.5
            inputs:
                1)  FilterNode
                      arguments:
                          1)  sepal_length > 5.0
                      inputs:
                          1)  DataNode
                                source:  unset source


julia> collect(q)
Grouped Tables.Table
Groupings by:
    species
    log(petal_length) > 0.5 (with alias :pred_1)

Source: Tables.Table
│ Row │ species      │ pred_1 │ avg       │
├─────┼──────────────┼────────┼───────────┤
│ 1   │ "virginica"  │ true   │ 0.428644  │
│ 2   │ "setosa"     │ true   │ -3.17557  │
│ 3   │ "versicolor" │ true   │ -0.136551 │
│ 4   │ "setosa"     │ false  │ -4.7391   │

We hope to include support for the other verbs in the near future.

Again we emphasize that this collection machinery is provided by the AbstractTables package, not StructuredQueries. As we see above, the latter provides a framework for representing a query structure, whereas packages such as AbstractTables (i) decide what it means to execute a query with a particular structure against a particular backend, and (ii) provide the implementation of the behavior in (i).

We provide a convenience macro, @collect(qry), which is equivalent to collect(@query(qry)), for when one wishes to query and collect in the same command:

julia> @collect iris |>
           filter(erf(petal_length) / petal_length > log(sepal_width) / 1.5) |>
           summarize(sum = sum(ifelse(rand() > .5, sin(petal_width), 0.0)))
Tables.Table
│ Row │ sum       │
├─────┼───────────┤
│ 1   │ 0.0998334 │

Again, note the patterns of name resolution: names of functions (e.g. erf) invoked within the context of a query argument are evaluated within the enclosing scope of the @query invocation, whereas names in the arguments of such functions (e.g. petal_length) are taken to be attributes of the data source (i.e., iris).

Dummy sources

We saw above that there are three parts to a query structure: verbs, sources, and query arguments. A Query object represents the verbs and query arguments together in the QueryNode graph and wraps the data source separately. This suggests that one ought to be able to generate query graphs using @query even if one does not specify a particular data source. One can do precisely this by using dummy sources, which are essentially placeholders that can be “filled in” with particular data sources later, when one calls collect. To indicate a source as a dummy source, simply prepend it with a :. For instance:

julia> q = @query select(:src, twice_sepal_length = 2 * sepal_length)
Query with dummy source src

julia> collect(q, src = iris)
Tables.Table
│ Row │ twice_sepal_length │
├─────┼────────────────────┤
│ 1   │ 10.2               │
│ 2   │ 9.8                │
│ 3   │ 9.4                │
│ 4   │ 9.2                │
│ 5   │ 10.0               │
│ 6   │ 10.8               │
│ 7   │ 9.2                │
│ 8   │ 10.0               │
│ 9   │ 8.8                │
│ 10  │ 9.8                │
⋮
with 140 more rows.

The keyword argument passed to collect must match the name of the dummy source (minus the :) used in the query; otherwise, the method will fail:

julia> collect(q, tbl = iris)
ERROR: ArgumentError: Undefined source: tbl. Check spelling in query.
 in #collect#5(::Array{Any,1}, ::Function, ::StructuredQueries.Query{Symbol}) at /Users/David/.julia/v0.6/StructuredQueries/src/query/collect.jl:23
 in (::Base.#kw##collect)(::Array{Any,1}, ::Base.#collect, ::StructuredQueries.Query{Symbol}) at ./<missing>:0

The two problems

Now that we’ve seen what the SQ query framework itself consists of, we can discuss how such a framework may help to solve the column-indexing and nullable semantics problems.

Type-inferability

Recall that the column-indexing problem consists in the inability of type inference to determine the return type of

function f(df::DataFrame)
    A = df[:A]               # the eltype of A is opaque to type inference here
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end

What would make f amenable to type inference is to pass A = df[:A] to an inner function that executes the loop, for instance:

function f_inner(A)
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end

As long as f_inner does not get inlined, type inference will run “at” the point at which the body of f calls f_inner and will have access to the eltype of df[:A], since the latter is passed as an argument to f_inner.
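One quick way to see the difference is to inspect both functions with @code_warntype (a sketch, assuming DataFrames is loaded and df has a numeric column :A):

julia> using DataFrames

julia> df = DataFrame(A = rand(10));

julia> @code_warntype f(df)            # x shows up as Any: eltype(A) is opaque

julia> @code_warntype f_inner(df[:A])  # x is inferred as Float64 at the barrier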

This strategy of introducing a function barrier also works when one requires multiple columns. For instance, suppose I wanted to generate a new column C where C[i] = g(A[i], B[i]). The following solution is type-inferable since the type parameters of the zipped iterator zip(A, B) reflect the eltypes of A and B:

function f(g, df)
    A, B = df[:A], df[:B]
    C = similar(A)
    f_inner!(C, g, zip(A, B))
    return DataFrame(C = C)
end

function f_inner!(C, g, itr) # bang because it mutates C
    for (i, (a, b)) in enumerate(itr)
        C[i] = g(a, b)
    end
    return C
end

In other words: if one intends to iterate over the rows of some subset of columns of a DataFrame, then at some point there must be a function barrier through which is passed an argument whose type reflects the eltypes of the relevant columns.

The manipulation described above could be expressed for a column-indexable table (e.g. a Table object) as

@query select(tbl, C = A * B)

The collection machinery that supports this query against, say, a Table source essentially follows the above pattern of f and f_inner. That is, an outer function passes a “scalar kernel” (here, row -> row[1] * row[2]) that reflects the structure of A * B, along with a “row iterator” (here, zip(tbl[:A], tbl[:B])), to an inner function that applies the scalar kernel to each “row” obtained by iterating over the row iterator. (Note that the argument to the scalar kernel is assumed to be a Tuple whose elements stand in for the named attributes (such as A and B) in the body of the “value expression” (here, A * B) from which the scalar kernel is generated.)
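Schematically, and with illustrative names (apply_kernel is not the actual internal function), the collection step looks something like this:

kernel = row -> row[1] * row[2]    # scalar kernel generated from A * B
itr = zip(tbl[:A], tbl[:B])        # row iterator over the argument columns

# function barrier: inside, the eltypes of the zipped columns are known
function apply_kernel(kernel, itr)
    return [ kernel(row) for row in itr ]
end

C = apply_kernel(kernel, itr)      # the materialized result column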

The scalar kernel and the information about which columns to extract from tbl and zip together are all stored in the QueryNode graph produced by @query. Much of the work in producing such a graph consists in extracting such information from the qry expression (here, select(tbl, C = A * B)) and processing it to produce (i) a lambda that captures the form of the transformation (A * B), (ii) a Symbol that names the resultant column (C), and (iii) a Vector{Symbol} that lists the relevant argument column names ([:A, :B]) in the order they are encountered during the production of the lambda.

Note that these data (a scalar kernel and result and argument fields) are not necessary to generate SQL code from a raw query argument, say the Expr object :( C = A * B ). Thus, one might argue that it is somewhat wasteful to compute such data and store it in the QueryNode graph when one might instead compute the data upon run-time dispatch of collect on a Query{S}, where S is a type that satisfies the column-indexable interface. This is a good point, but there are two considerations to take into account. The first is that computing the scalar kernel and extracting the result and argument fields from the query argument is probably not prohibitively expensive. The second is that generating the scalar kernel at run-time (i) involves use of eval, which is to be avoided, and (ii) may involve a lot of work to re-incorporate the module information of names appearing in the expression to be eval‘d into a scalar kernel. For now, it is easiest to generate scalar kernels at macroexpand-time and let them come along for the ride in the QueryNode graph, even if the latter is to be collected against a data source (e.g. a SQL connection) that doesn’t need such data.

The use of metaprogramming to work around the limits of type inference is not a new strategy. Indeed, it is the basis for the DataFramesMeta manipulation framework. The interested reader is referred here and here for more on the history and motivation for these endeavors.

The hard question of nullable semantics

Recall that the hard question of nullable semantics involves implementing a given lifting semantics – that is, a given behavior for f(x::Nullable{T}) given a defined method f(x::T) – in a “general” way.

One solution – perhaps the most obvious, and which I have previously endorsed – involves defining the method f(x::Nullable{T}) as something like

function f{T}(x::Nullable{T})
    if isnull(x)
        U = Core.Inference.return_type(f, (T,))  # the result type of f(::T)
        return Nullable{U}()
    else
        return Nullable(f(x.value))
    end
end

with natural analogues for methods with n-ary arguments. This process is a bit cumbersome, but it would not be difficult to automate with a macro with which one could annotate the original definition f(x::T). Call this approach the “method extension lifting” approach.

The method extension lifting approach is very flexible. However, it does face some difficulties. One must somehow decide which functions should be lifted in this manner, and it’s not clear where the line between lifted and non-lifted functions ought to be drawn. And if one cannot edit the definition of a function, then a macro is of no use; one must manually introduce the lifted variant.

There is a further problem. If one wants to support lifting over arguments with “mixed” signatures – i.e. signatures in which some argument types are Nullable and some are not – then one has either to extend the promotion machinery or to define methods for mixed signatures, e.g. +{T}(x, y::Nullable{T}). That may end up being a lot of methods. Even if their definition can be automated with metaprogramming, the compilation costs associated with method proliferation may be considerable (but I haven’t tested this).
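For a sense of what this entails, here is a sketch of just one such mixed-signature method under standard lifting semantics (many more would be needed):

import Base: +

# lift + over a Nullable right argument; the null branch preserves the
# promoted element type
+{T}(x::Number, y::Nullable{T}) =
    isnull(y) ? Nullable{promote_type(typeof(x), T)}() :
                Nullable(x + y.value)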

Finally, there is the problem described in NullableArrays.jl#148. I won’t repeat the entire argument here. The summary of this problem is: if one is going to rely on a minimal set of lifted operators to support generic lifting of user-defined functions, those user-defined functions essentially have to give up much of multiple dispatch.

The difficulties associated with method extension lifting are not insurmountable, but the solution – namely, keeping a repository of lifted methods – requires an undetermined amount of maintenance and coordination.

Another way to implement standard lifting semantics is by means of a higher-order function – an approach made practical by Julia 0.5, in which higher-order functions are performant. Such a function – call it lift – might look like the following:

hasvalue(x::Nullable) = !isnull(x)

function lift{T}(f, x::Nullable{T})
    if hasvalue(x)
        return Nullable(f(x.value))
    else
        U = Core.Inference.return_type(f, (T,))  # the result type of f(::T)
        return Nullable{U}()
    end
end

This definition can naturally be extended to methods with more than one argument. The primary advantage of this approach over method extension lifting is its generality: one need only define a single higher-order lift method for each arity to support lifting of all functions of that arity, as opposed to having to define a lifted version of each such function. Note that as long as hasvalue has some generic fallback method for non-Nullable arguments, such lift functions cover both standard and mixed-signature lifting. (Ideally one would ensure that the code is optimized for when types are non-Nullable; in particular, one would ensure that the dead branch is removed – cf. julia#18484.) Call this approach the “higher-order lifting” approach.
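For instance, a two-argument lift covering mixed signatures might look like the following sketch (value and vtype are illustrative helper names; the fallback methods are the generic fallbacks mentioned above):

hasvalue(x) = true                   # generic fallback for non-Nullable arguments
value(x::Nullable) = x.value         # unwrap a Nullable
value(x) = x                         # non-Nullables pass through unchanged
vtype{T}(::Type{Nullable{T}}) = T    # type of the wrapped value
vtype(T::Type) = T                   # generic fallback

function lift(f, x, y)
    if hasvalue(x) && hasvalue(y)
        return Nullable(f(value(x), value(y)))
    else
        U = Core.Inference.return_type(f, (vtype(typeof(x)), vtype(typeof(y))))
        return Nullable{U}()
    end
end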

So, with the higher-order lifting approach we might better avoid method proliferation and generality worries, which is nice. However, now we require users to invoke lift everywhere. In particular, to lift f(g(x)) over a Nullable argument x, one needs to write lift(f, lift(g, x)). The least we could do in this case is provide an @lift macro that, say, traverses the AST of f(g(x)) and replaces each function call f(...) by an invocation of lift(f, ...). That might be reasonable, but it’s still an artifact of implementation details of support for missing values, and ideally it would not be exposed to users.
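Such a macro might be sketched as follows (lift_calls is an illustrative helper; a real implementation would need to handle more expression forms):

# recursively replace every call f(args...) with lift(f, args...)
function lift_calls(ex)
    if isa(ex, Expr) && ex.head == :call
        return Expr(:call, :lift, ex.args[1], map(lift_calls, ex.args[2:end])...)
    else
        return ex
    end
end

macro lift(ex)
    return esc(lift_calls(ex))
end

# @lift f(g(x)) then expands to lift(f, lift(g, x))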

Recall that the present query framework extracts the “value expression” of a query argument (for instance, A * B in the query argument C = A * B) and generates a lambda that mimics the former’s structure (in this case, row -> row[1] * row[2]). A proposed modification (see AbstractTables#2) to this process is to modify the AST of the value expression (A * B) by appropriately inserting calls to lift, e.g.

row -> lift(*, row[1], row[2])

While this is a simple way to achieve standard lifting semantics, the approach (which is currently employed by the column-indexing collection machinery) does not easily support non-standard lifting semantics such as three-valued logic.

The higher-order lifting approach is not without its own drawbacks. Most notably, non-standard lifting semantics, such as three-valued logic, are more difficult to implement and are subject to restrictions that do not apply to the method extension lifting approach. The details of this difficulty are properly the subject of another blog post. The summary of the problem is: higher-order lifting (via code transformation, such as within @query) can only give non-standard lifting semantics to methods called explicitly within the expression passed to @query. That is,

@query filter(tbl, A | B)

can be given, say, three-valued logic semantics via higher-order lifting, but

f(x, y) = x | y
@query filter(tbl, f(A, B))

cannot.

Which approach to solving the hard question of Nullable semantics is better? It really is not clear. Right now, the Julia statistics community is trying out both solutions. I am hopeful time and experimentation will yield new insights.

SQL backends

Above we have seen (i) how the implementation of a generic querying interface suggested a solution to the column-indexing and the Nullable semantics problems and (ii) how these latter solutions may be implemented in a manner generic over so-called column-indexable in-memory Julia tabular data structures. But we haven’t said anything about how the interface is generic over tables other than in-memory Julia objects. In particular, we desire that the above framework be applicable to SQL database connections as well.

Yeesian Ng, who provided invaluable feedback and ideas during the development of SQ, also began to develop such an extension in a package called SQLQuery. We are working to further integrate it with StructuredQueries in SQLQuery.jl#2, and we encourage the reader to stay tuned for updates concerning this endeavor.

Roadmap and open questions

There is a general roadmap available at StructuredQueries.jl#19. I’ll briefly describe some of what I believe are the most pressing and interesting open questions.

Interpolation syntax and implementation are both significant open questions. Suppose I wish to refer to a name in the enclosing scope of an @query invocation. A straightforward syntax would be to prepend the interpolated variable with $, as in

c = .5
q = @query filter(tbl, A > $c)

How should this be implemented? For full generality, we would like to be able to “capture” c from the enclosing scope and store it in q. One way to do so is to include c in the closure of a lambda () -> c that we store in q. However, there is the question of how to deal with problems of type-inferability. Solving this problem may either require or strongly suggest some sort of “parametrized queries” API, by which one can designate a name inside a query argument context as a parameter that can then be bound after the @query invocation, e.g. specified as kwargs to collect or to a function like bind!(q::Query[; kwargs...]).
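A minimal sketch of the capture idea (purely illustrative of the mechanism, not of SQ’s actual behavior):

c = .5
thunk = () -> c    # the closure captures c from the enclosing scope

# a Query could store thunk and call it at collect time to recover the value
thunk()            # 0.5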

We are also still deciding what the general syntax within a query context should look like. A big part of this decision concerns how aliasing and related functionality ought to work. See StructuredQueries.jl#21 for more details. This issue is similar to that of interpolation syntax insofar as both involve name resolution within different query contexts (e.g. in a data source specification context vs. a query argument context).

Finally, extensibility not only of collect but also of the graph generation facilities is an important issue, about which we hope to say more in a later post.

As mentioned above, DataFramesMeta is a pioneering approach to enhancing tabular data support in Julia via metaprogramming. Another exciting (and slightly more mature) endeavor in the realm of generic data manipulation facilities support is Query.jl by David Anthoff. Query.jl and SQ are very similar in their objectives, though different in important respects. A comparison of these packages is the proper topic of a separate blog post.

Conclusion

The foregoing post has described a work in progress. Not just the StructuredQueries package, but also the Julia statistical ecosystem. Though it will likely take a while for this ecosystem to mature, the general trend I’ve observed over the past two years is encouraging. It’s also worth noting that much of what is described above would have been difficult to conceive without recent developments in the Julia language. In particular, performant higher-order functions and type-inferable map have both allowed us to explore solutions that were previously made difficult by the amount of metaprogramming required to ensure type-inferability. It will be interesting to see what we can come up with given the improvements to Julia in 0.6 and beyond.

I’m very grateful to John Myles White for his guidance on this project, to Yeesian Ng at MIT for his collaboration, to Viral Shah and Alan Edelman for arranging this opportunity, and to many others at Julia Central and elsewhere for their help and insight.