Multidimensional algorithms and iteration

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/9gVAJhFyvuY/iteration

Starting with release 0.4, Julia makes it easy to write elegant and
efficient multidimensional algorithms. The new capabilities rest on
two foundations: a new type of iterator, called CartesianRange, and
sophisticated array indexing mechanisms. Before I explain, let me
emphasize that developing these capabilities was a collaborative
effort, with the bulk of the work done by Matt Bauman (@mbauman),
Jutho Haegeman (@Jutho), and myself (@timholy).

These new iterators are deceptively simple, so much so that I’ve never
been entirely convinced that this blog post is necessary: once you
learn a few principles, there’s almost nothing to it. However, like
many simple concepts, the implications can take a while to sink in.
There also seems to be some widespread confusion about the
relationship between these iterators and
Base.Cartesian,
which is a completely different (and much more painful) approach to
solving the same problem. There are still a few occasions where
Base.Cartesian is necessary, but for many problems these new
capabilities represent a vastly simplified approach.

Let’s introduce these iterators with an extension of an example taken
from the manual.

eachindex, CartesianIndex, and CartesianRange

You may already know that, in Julia 0.4, there are two recommended
ways to iterate over the elements in an AbstractArray: if you don’t
need an index associated with each element, then you can use

 for a in A    # A is an AbstractArray
     # Code that does something with the element a
 end

If instead you also need the index, then use

for i in eachindex(A)
    # Code that does something with i and/or A[i]
end

In some cases, the first line of this loop expands to for i =
1:length(A), and i is just an integer. However, in other cases, this
will expand to the equivalent of

 for i in CartesianRange(size(A))
     # i is now a CartesianIndex
     # Code that does something with i and/or A[i]
 end

Let’s see what these objects are:

  julia> A = rand(3,2);

  julia> for i in CartesianRange(size(A))
            @show i
         end
  i = CartesianIndex{2}((1,1))
  i = CartesianIndex{2}((2,1))
  i = CartesianIndex{2}((3,1))
  i = CartesianIndex{2}((1,2))
  i = CartesianIndex{2}((2,2))
  i = CartesianIndex{2}((3,2))

A CartesianIndex{N} represents an N-dimensional index.
CartesianIndexes are based on tuples, and indeed you can access the
underlying tuple with i.I. However, they also support certain
arithmetic operations, treating their contents like a fixed-size
Vector{Int}. Since the length is fixed, julia/LLVM can generate
very efficient code (without introducing loops) for operations with
N-dimensional CartesianIndexes.
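
For example, you can add or subtract them, and take their elementwise
min and max (these are exactly the operations the examples below rely
on):

  julia> I = CartesianIndex((2,3))
  CartesianIndex{2}((2,3))

  julia> I + CartesianIndex((1,1))
  CartesianIndex{2}((3,4))

  julia> max(I, CartesianIndex((3,1)))
  CartesianIndex{2}((3,3))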

A CartesianRange is just a pair of CartesianIndexes, encoding the
start and stop values along each dimension, respectively:

  julia> CartesianRange(size(A))
  CartesianRange{CartesianIndex{2}}(CartesianIndex{2}((1,1)),CartesianIndex{2}((3,2)))

You can construct these manually: for example,

julia> CartesianRange(CartesianIndex((-7,0)), CartesianIndex((7,15)))
CartesianRange{CartesianIndex{2}}(CartesianIndex{2}((-7,0)),CartesianIndex{2}((7,15)))

constructs a range that will loop over -7:7 along the first
dimension and 0:15 along the second.
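
As with arrays, the first dimension is the fastest-varying when you
iterate over such a range:

  julia> for i in CartesianRange(CartesianIndex((-1,0)), CartesianIndex((0,1)))
            @show i
         end
  i = CartesianIndex{2}((-1,0))
  i = CartesianIndex{2}((0,0))
  i = CartesianIndex{2}((-1,1))
  i = CartesianIndex{2}((0,1))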

One reason that eachindex is recommended over for i = 1:length(A)
is that some AbstractArrays cannot be indexed efficiently with a
linear index; in contrast, a much wider class of objects can be
efficiently indexed with a multidimensional iterator. (SubArrays are,
generally speaking, a prime example.)
eachindex is designed to pick the most efficient iterator for the
given array type. You can even use

  for i in eachindex(A, B)
      ...
  end

to increase the likelihood that i will be efficient for accessing
both A and B.

As we’ll see below, these iterators have another purpose: independent
of whether the underlying arrays have efficient linear indexing,
multidimensional iteration can be a powerful ally when writing
algorithms. The rest of this blog post will focus on this
latter application.

Writing multidimensional algorithms with CartesianIndex iterators

A multidimensional boxcar filter

Let’s suppose we have a multidimensional array A, and we want to
compute the “moving average” over a 3-by-3-by-… block around each
element. From any given index position,
we’ll want to sum over a region offset by -1:1 along each dimension.
Edge positions have to be treated specially, of course, to avoid going
beyond the bounds of the array.

In many languages, writing a general (N-dimensional) implementation of
this conceptually-simple algorithm is somewhat painful, but in Julia
it’s a piece of cake:

 function boxcar3(A::AbstractArray)
     out = similar(A)
     R = CartesianRange(size(A))
     I1, Iend = first(R), last(R)
     for I in R
         n, s = 0, zero(eltype(out))
         for J in CartesianRange(max(I1, I-I1), min(Iend, I+I1))
             s += A[J]
             n += 1
         end
         out[I] = s/n
     end
     out
 end

Let’s walk through this line by line:

  • out = similar(A) allocates the output. In a “real” implementation,
    you’d want to be a little more careful about the element type of the
    output (what if the input array element type is Int?), but
    we’re cutting a few corners here for simplicity.

  • R = CartesianRange(size(A)) creates the iterator for the array,
    ranging from CartesianIndex((1, 1, 1, ...)) to
    CartesianIndex((size(A,1), size(A,2), size(A,3), ...)). We don’t
    use eachindex, because we can’t be sure whether that will return a
    CartesianRange iterator, and here we explicitly need one.

  • I1 = first(R) and Iend = last(R) return the lower
    (CartesianIndex((1, 1, 1, ...))) and upper
    (CartesianIndex((size(A,1), size(A,2), size(A,3), ...))) bounds
    of the iteration range, respectively. We’ll use these to ensure
    that we never access out-of-bounds elements of A.

    Conveniently, I1 can also be used to compute the offset range.

  • for I in R: here we loop over each entry of A.

  • n = 0 and s = zero(eltype(out)) initialize the accumulators. s
    will hold the sum of neighboring values. n will hold the number of
    neighbors used; in most cases, after the loop we’ll have n == 3^N,
    but for edge points the number of valid neighbors will be smaller.

  • for J in CartesianRange(max(I1, I-I1), min(Iend, I+I1)) is
    probably the most “clever” line in the algorithm. I-I1 is a
    CartesianIndex that is lower by 1 along each dimension, and I+I1
    is higher by 1. Therefore, for interior points, this constructs a
    range that extends by an offset of 1 in either direction along
    each dimension.

    However, when I represents an edge point, either I-I1 or I+I1
    (or both) might be out-of-bounds. max(I-I1, I1) ensures that each
    coordinate of J is 1 or larger, while min(I+I1, Iend) ensures
    that J[d] <= size(A,d).

  • The inner loop accumulates the sum in s and the number of visited
    neighbors in n.

  • Finally, we store the average value in out[I].

Not only is this implementation simple, but it is surprisingly robust:
for edge points it computes the average of whatever nearest-neighbors
it has available. It even works if size(A, d) < 3 for some
dimension d; we don’t need any error checking on the size of A.
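
As a quick sanity check (a sketch; a corner element should equal the
mean of its 2-by-2 neighborhood):

A = rand(3,2)
B = boxcar3(A)
isapprox(B[1,1], mean(A[1:2,1:2]))   # true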

Computing a reduction

For a second example, consider the implementation of multidimensional
reductions. A reduction takes an input array, and returns an array
(or scalar) of smaller size. A classic example would be summing along
particular dimensions of an array: given a three-dimensional array,
you might want to compute the sum along dimension 2, leaving
dimensions 1 and 3 intact.

The core algorithm

An efficient way to write this algorithm requires that the output
array, B, is pre-allocated by the caller (later we’ll see how one
might go about allocating B programmatically). For example, if the
input A is of size (l,m,n), then when summing along just dimension
2 the output B would have size (l,1,n).

Given this setup, the implementation is shockingly simple:

function sumalongdims!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing
    fill!(B, 0)
    Bmax = CartesianIndex(size(B))
    for I in CartesianRange(size(A))
        B[min(Bmax,I)] += A[I]
    end
    B
end

The key idea behind this algorithm is encapsulated in the single
statement B[min(Bmax,I)]. For our three-dimensional example where
A is of size (l,m,n) and B is of size (l,1,n), the inner loop
is essentially equivalent to

B[i,1,k] += A[i,j,k]

because min(1,j) = 1.
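
To try it out, we can compare against Base’s sum (a sketch):

A = rand(4,3,2)
B = zeros(4,1,2)
sumalongdims!(B, A)
isapprox(B, sum(A, 2))   # true, up to floating-point roundoff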

The wrapper, and handling type-instability using function barriers

As a user, you might prefer an interface more like sumalongdims(A, dims),
where dims specifies the dimensions you want to sum along.
dims might be a single integer, like 2 in our example above, or
(should you want to sum along multiple dimensions at once) a tuple or
Vector{Int}. This is indeed the interface used in sum(A, dims);
here we want to write our own (somewhat simpler) implementation.

A bare-bones implementation of the wrapper is straightforward:

function sumalongdims(A, dims)
    sz = [size(A)...]
    sz[[dims...]] = 1
    B = Array(eltype(A), sz...)
    sumalongdims!(B, A)
end

Obviously, this simple implementation skips all relevant error
checking. However, here the main point I wish to explore is that the
allocation of B turns out to be
type-unstable:
sz is a Vector{Int}, the length (number of elements) of a specific
Vector{Int} is not encoded by the type itself, and therefore the
dimensionality of B cannot be inferred.

Now, we could fix that in several ways, for example by annotating the
result:

B = Array(eltype(A), sz...)::typeof(A)

However, this isn’t really necessary: in the remainder of this
function, B is not used for any performance-critical operations.
B simply gets passed to sumalongdims!, and it’s the job of the
compiler to ensure that, given the type of B, an efficient version
of sumalongdims! gets generated. In other words, the type
instability of B’s allocation is prevented from “spreading” by the
fact that B is henceforth used only as an argument in a function
call. This trick, using a function call to separate a
performance-critical step from a potentially type-unstable precursor,
is sometimes referred to as introducing a function barrier.

As a general rule, when writing multidimensional code you should
ensure that the main iteration is in a separate function from
type-unstable precursors. Even when you take appropriate precautions,
there’s a potential “gotcha”: if your inner loop is small, Julia’s
ability to inline code might eliminate the intended function barrier,
and you get dreadful performance. For this reason, it’s recommended
that you annotate function-barrier callees with @noinline:

@noinline function sumalongdims!(B, A)
    ...
end

Of course, in this example there’s a second motivation for making this
a standalone function: if this calculation is one you’re going to
repeat many times, re-using the same output array can reduce the
amount of memory allocation in your code.

Filtering along a specified dimension (exploiting multiple indexes)

One final example illustrates an important new point: when you index
an array, you can freely mix CartesianIndexes and
integers. To illustrate this, we’ll write an exponential smoothing
filter. An
efficient way to implement such filters is to have the smoothed output
value s[i] depend on a combination of the current input x[i] and
the previous filtered value s[i-1]; in one dimension, you can write
this as

function expfilt1!(s, x, α)
    0 < α <= 1 || error("α must be between 0 and 1")
    s[1] = x[1]
    for i = 2:length(x)
        s[i] = α*x[i] + (1-α)*s[i-1]
    end
    s
end

This would result in an approximately-exponential decay with timescale 1/α.
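
A quick usage sketch of the one-dimensional version:

x = cumsum(randn(100))   # a noisy “signal”
s = similar(x)
expfilt1!(s, x, 0.2)     # smoothed copy, with a timescale of roughly 5 samples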

Here, we want to implement this algorithm so that it can be used to
exponentially filter an array along any chosen dimension. Once again,
the implementation is surprisingly simple:

function expfiltdim(x, dim::Integer, α)
    s = similar(x)
    Rpre = CartesianRange(size(x)[1:dim-1])
    Rpost = CartesianRange(size(x)[dim+1:end])
    _expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end

@noinline function _expfilt!(s, x, α, Rpre, n, Rpost)
    for Ipost in Rpost
        # Initialize the first value along the filtered dimension
        for Ipre in Rpre
            s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
        end
        # Handle all other entries
        for i = 2:n
            for Ipre in Rpre
                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
            end
        end
    end
    s
end

Note once again the use of the function barrier technique. In the
core algorithm (_expfilt!), our strategy is to use two
CartesianIndex iterators, Ipre and Ipost, where the first covers
dimensions 1:dim-1 and the second dim+1:ndims(x); the filtering
dimension dim is handled separately by an integer-index i.
Because the filtering dimension is specified by an integer input,
there is no way to infer how many entries will be within each
index-tuple Ipre and Ipost. Hence, we compute the CartesianRanges in
the type-unstable portion of the algorithm, and then pass them as
arguments to the core routine _expfilt!.
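
Usage is then simply:

x = rand(4,5,6)
s = expfiltdim(x, 2, 0.3)   # exponentially filter along dimension 2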

What makes this implementation possible is the fact that we can index
x as x[Ipre, i, Ipost]. Note that the total number of indexes
supplied is (dim-1) + 1 + (ndims(x)-dim), which is just ndims(x).
In general, you can supply any combination of integer and
CartesianIndex indexes when indexing an AbstractArray in Julia.
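
For instance:

A = rand(4,5,6)
I = CartesianIndex((2,3))
A[I, 4] == A[2,3,4]   # a 2-dimensional index plus 1 integer: 3 indexes in total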

The AxisAlgorithms
package makes heavy use of tricks such as these, and in turn provides
core support for high-performance packages like
Interpolations that
require multidimensional computation.

Additional issues

It’s worth noting one point that has thus far remained unstated: all
of the examples here are relatively cache efficient. This is a key
property to observe when writing efficient code. In particular,
Julia arrays are stored in first-to-last dimension order
(for matrices, “column-major” order), and hence you should nest
iterations from last-to-first dimensions. For example, in the
filtering example above we were careful to iterate in the order

for Ipost ...
    for i ...
        for Ipre ...
            x[Ipre, i, Ipost] ...

so that x would be traversed in memory-order.

Summary

As is hopefully clear by now, much of the pain of writing generic
multidimensional algorithms is eliminated by Julia’s elegant
iterators. The examples here just scratch the surface, but the
underlying principles are very simple; it is hoped that these
examples will make it easier to write your own algorithms.

A Million Text Files And A Single Laptop

By: Randy Zwitch

Re-posted from: http://randyzwitch.com/gnu-parallel-medium-data/

[Image: GNU Parallel Cat Unix]

Wait…What? Why?

More often than I would like, I receive datasets where the data has only been partially cleaned, such as the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough where using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampling transactions (with replacement), and the Python Faker (fake-factory) package to generate fake customer profiles and create the 1MM+ text files.

The contents of the data aren’t important for this blog post, but the data generation code is posted as a GitHub gist should you want to run these commands yourself.

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder, concatenate them together…one big file. But something funny happens once you get enough files…

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, many Unix tools will only accept about 10,000 arguments; the asterisk in the `cat` command is expanded by the shell before the command runs, so the above statement tries to pass 1,234,567 arguments to `cat` and you get an error message.
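
(The ceiling is actually a limit on the total bytes in the argument list rather than a fixed count of files; on most systems you can inspect it with `getconf ARG_MAX`.)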

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The `$f` argument in the code is to highlight that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown below (graph code, Julia):

Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say exactly where the parallelism gets maxed out, but at roughly 10 concurrent jobs, there’s no additional benefit. It’s also interesting to point out what the `-m` argument represents; by specifying `-m`, you allow multiple arguments (i.e. multiple text files) to be passed as inputs into parallel. This alone leads to an 8x speedup over the naive loop solution.

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1. Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the `tr` (transliterate) utility, we can map our data to one product per row as we stream over the file:
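
Such a pipeline might look like this (a sketch; the file name and exact invocation are assumptions on my part):

cat transactions.csv | tr ',' '\n' | sort -u | wc -l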

The trick here is that we swap the comma-delimited transactions with the newline character; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using `sort -u` to de-dup the list and `wc -l` to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!
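
The parallel variant might look something like this, with `--pipe` splitting stdin into chunks that are handed to each `tr` job:

cat transactions.csv | parallel --pipe tr ',' '\n' | sort -u | wc -l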

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL `Group By` on the date and sum the rows:
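
Assuming the transaction date is the first comma-delimited field (adjust the field number to your layout), the serial version is a one-liner:

cut -d',' -f1 transactions.csv | sort | uniq -c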

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up at this blog post).
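
A sketch of that chunk-then-reduce pattern (field position and block size are assumptions):

cat transactions.csv | parallel --pipe --block 100M 'cut -d, -f1 | sort | uniq -c' | awk '{counts[$2] += $1} END {for (d in counts) print counts[d], d}'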

Q3. Total Items Per Store, Per Month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.

It may be possible that one of you out there knows how to do this correctly, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup, but saving seven minutes per run isn’t worth spending hours trying to find the optimal settings.
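
Purely for illustration, if each row were laid out as `date,store_id,item1,item2,…` (an assumed layout), a serial awk version might look like:

awk -F',' '{
    month = substr($1, 1, 7)     # assumes ISO dates, e.g. 2015-09-30 -> 2015-09
    key = $2 "," month           # store id plus month
    items[key] += NF - 2         # every field after date and store is an item
} END {
    for (k in items) print k, items[k]
}' transactions.csv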

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.

Julia IDE work in Atom

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/1sR1OITGxfI/atom-work

A PL designer used to be able to design some syntax and semantics for their language, implement a compiler, and then call it a day. – Sean McDirmid

In the few years since its initial release, the Julia language has made wonderful progress. Over four hundred contributors – and counting – have donated their time developing exciting and modern language features like channels for concurrency, a native documentation system, staged functions, compiled packages, threading, and tons more. In the lead up to 1.0 we have a faster and more stable runtime, a more comprehensive standard library, and a more enthusiastic community than ever before.

However, a programming language isn’t just a compiler or spec in a vacuum. More and more, the ecosystem around a language – the packages, tooling, and community that support you – are a huge determining factor in where a language can be used, and who it can be used by. Making Julia accessible to everybody means facing these issues head-on. In particular, we’ll be putting a lot of effort into building a comprehensive IDE, Juno, which supports users with features like smart autocompletion, plotting and data handling, interactive live coding and debugging, and more.

Julia users aren’t just programmers – they’re engineers, scientists, data mungers, financiers, statisticians, researchers, and many other things, so it’s vital that our IDE is flexible and extensible enough to support all their different workflows fluidly. At the same time, we want to avoid reinventing well-oiled wheels, and don’t want to compromise on the robust and powerful core editing experience that people have come to expect. Luckily enough, we think we can have our cake and eat it too by building on top of the excellent Atom editor.

The Atom community has done an amazing job of building an editor that’s powerful and flexible without sacrificing a pleasant and intuitive experience. Web technologies not only make hacking on the editor extremely accessible for new contributors, but also make it easy for us to experiment with exciting and modern features like live coding, making it a really promising option for our work.

Our immediate priorities will be to get basic interactive usage working really well, including strong multimedia support for display and graphics. Before long we’ll have a comprehensive IDE bundle which includes Juno, Julia, and a bunch of useful packages for things like plotting – with the aim that anyone can get going productively with Julia within a few minutes. Once the basics are in place, we’ll integrate the documentation system and the up-and-coming debugger, implement performance linting, and make sure that there are help and tutorials in place so that it’s easy for everyone to get started.

Juno is implemented as a large collection of independent modules and plugins; although this adds some development overhead, we think it’s well worthwhile to make sure that other projects can benefit from our work. For example, our collection of IDE components for Atom, Ink, is completely language-agnostic and should be reusable by other languages.

New contributions are always welcome, so if you’re interested in helping to push this exciting project forward, check out the developer install instructions and send us a PR!