Generalizing AbstractArrays: opportunities and challenges

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/gsnx6zBGG04/arrays-iteration

Introduction: generic algorithms with AbstractArrays

Somewhat unusually, this blog post is future-looking: it mostly
focuses on things that don’t yet exist. Its purpose is to lay out the
background for community discussion about possible changes to the core
API for AbstractArrays, and serves as background reading and
reference material for a more focused “julep” (a julia enhancement
proposal). Here, often I’ll use the shorthand “array” to mean
AbstractArray, and use Array if I explicitly mean julia’s concrete
Array type.

As the reader is likely aware, in julia it’s possible to write
algorithms for which one or more inputs are only assumed to be
AbstractArrays. This is “generic” code, meaning it should work
(i.e., produce a correct result) on any specific concrete array type.
In an ideal world—which julia approaches rather well in many
cases—generality of code should not have a negative impact on its
performance: a generic implementation should be approximately as fast
as one restricted to specific array type(s). This implies that
generic algorithms should be written using lower-level operations that
give good performance across a wide variety of array types.

Providing efficient low-level operations is a different kind of design
challenge than one experiences with programming languages that
“vectorize” everything. When successful, it promotes much greater
reuse of code, because efficient, generic low-level parts allow you to
write a wide variety of efficient, generic higher-level functions.

Naturally, as the diversity of array types grows, the more careful we
have to be about our abstractions for these low-level operations.

Examples of arrays

In discussing general operations on arrays, it’s useful to have a
diverse collection of concrete arrays in mind.

In core julia, some types we support fairly well are:

  • Array: the prototype for all arrays

  • Ranges: a good example of what I often consider a “computed”
    array, where essentially none of the values are stored in
    memory. Since there is no storage, these are immutable containers:
    you can’t set values in individual slots.

  • BitArrays: arrays that can only store 0 or 1 (false or true),
    and for which the internal storage is packed so that each entry
    requires only one bit.

  • SubArrays: the problems this type introduced, and the resolution
    we adopted, probably serves as the best model for the
    generalizations considered here. Therefore, this case is discussed
    in greater detail below.

Another important class of array types in Base are sparse arrays:
SparseMatrixCSC and SparseVector, as well as other sparse
representations like Diagonal, Bidiagonal, and Tridiagonal.
These are good examples of array types where access patterns deserve
careful thought. Notably, despite many commonalities in “strategy”
among the 5 or so sparse parametrizations we have, implementations of
core algorithms (e.g., matrix multiplication) are specialized for each
sparse-like type—in other words, these mimic the “high level
vectorized functions” strategy common to other languages. What we lack
is a “sparse iteration API” that lets you write the main algorithms of
sparse linear algebra efficiently in a generic way. Our current model
is probably fine for SparseLike*Dense operations, but gets to be
harder to manage if you want to efficiently compute, e.g., Bidiagonal*SparseMatrixCSC: the number of possible combinations you have to
support grows rapidly with more sparse types, and thus represents a
powerful incentive for developing efficient, generic low-level
operations.

Outside of Base, there are some other mind-stretching examples of
arrays, including:

  • DataFrames: indexing arrays with symbols rather than
    integers. Other related types include NamedArrays, AxisArrays.

  • Interpolations: indexing arrays with non-integer floating-point
    numbers

  • DistributedArrays: another great example of a case in which you
    need to think through access patterns carefully

SubArrays: a case study

For arrays of fixed dimension, one can write algorithms that index
arrays as A[i,j,k,...] (good examples can be found in our linear
algebra code, where everything is a vector or matrix). For algorithms
that have to support arbitrary dimensionality, for a long time our
fallback was linear indexing, A[i] for integer i. However, in
general SubArrays cannot be efficiently accessed by a linear index
because it results in call(s) to div, and div is slow. This is a
CPU problem, not a Julia-specific problem. The slowness of div is
still true despite the recent addition of
infrastructure
to make
it much faster—now one can make it merely “really bad” rather than
“Terrible, Horrible, No Good, and Very
Bad”
.

The way we (largely) resolved this problem was to make it possible to
do cartesian indexing, A[i,j,k,...], for arrays of arbitrary
dimensionality (the CartesianIndex type). To leverage this in
practical code, we also had to extend our iterators with the for I in
eachindex(A)
construct. This allows one to select an iterator that
optimizes the efficiency of access to elements of A. In generic
algorithms, the performance gains were not small, sometimes on the
scale of ten- to fifty-fold. These types were described in a
previous blog post.

To my knowledge, this approach has given Julia one of the most
flexible yet efficient “array view” types in any programming language.
Many languages base views on array strides, meaning situations in
which the memory offset is regular along each dimension. Among other
things, this requires that the underlying array is dense. In
contrast, in Julia we can easily handle non-strided arrays (e.g.,
sampling at [1,3,17,428,...] along one dimension, or creating a view
of a SparseMatrixCSC). We can also handle arrays for which there is
no underlying storage (e.g., Ranges). Being able to do this with a
common infrastructure is part of what makes different optimized array
types useful in generic programming.

It’s also worth pointing out some problems:

  • Most importantly, it requires that one adopt a slightly different
    programming style. Despite being well into another release cycle,
    this transition is still not complete, even in Base.

  • For algorithms that involve two or more arrays, there’s a
    possibility that their “best” iterators will be of different
    types. In principle, this is a big problem. Consider matrix-vector
    multiplication, A[i,j]*v[j], where j needs to be in-sync for
    both A and v, yet you’d also like all of these accesses to be
    maximally-efficient. In practice, right now this isn’t a burning
    problem: even if our arrays don’t all have efficient linear
    indexing, to my knowledge all of our (dense) array types have
    efficient cartesian indexing. Since indexing by N integers (where
    N is equal to the dimensionality of the array) is always
    performant, this serves as a reliable default for generic code.
    (It’s worth noting that this isn’t true for sparse arrays, and the
    lack of a corresponding generic solution is probably the main reason
    we lack a generic API for writing sparse algorithms.)

Unfortunately, I suspect that if we want to add support for certain
new operations or types (specific examples below), it will force us to
set the latter problem on fire.

Challenging examples

Some possible new AbstractArray types pose novel challenges.

ReshapedArrays (#15449)

These are the front-and-center motivation for this post. These are
motivated by a desire to ensure that reshape(A, dims) always returns
a “view” of A rather than allocating a copy of A. (Much of the
urgency of this julep goes away if we decide to abandon this goal, in
which case for consistency we should always return a copy of A.)
It’s worth noting that besides an explicit reshape, we have some
mechanisms for reshaping that currently cause a copy to be created,
notably A[:] or A[:, :] applied to a 3D array.

Similar to SubArrays, the main challenge for ReshapedArrays is
getting good performance. If A is a 3D array, and you reshape it to
a 2D array B, then B[i,j] must be expanded to A[k,l,m]. The
problem is that computing the correct k,l,m might result in a call
to div. So ReshapedArrays violate a crutch of our current ecosystem,
in that indexing with N integers might not be the fastest way to
access elements of B. From a performance perspective, this problem
is substantial (see #15449, about five- to ten-fold).

In simple cases, there’s an easy way to circumvent this performance
problem: define a new iterator type that (internally) iterates over
the parent A’s indexes directly. In other words, create an iterator
so that B[I] immediately expands to A[I'], and so that the latter
has “ideal” performance.

Unfortunately, this strategy runs into a lot of trouble when you need
to keep two arrays in sync: if you want to adopt this strategy, you
simply can’t write B[i,j]*v[j] for matrix-vector multiplication
anymore. A potential way around this problem is to define a new class
of iterators that operate on specific dimensions of an array (#15459),
writing B[ii,jj]*v[j]. jj (whatever that is) and j need to be
in-sync, but they don’t necessarily need to both be integers. Using
this kind of strategy, matrix-vector multiplication

for j = 1:size(B, 2)
    vj = v[j]
    for i = 1:size(B, 1)
        dest[i] += B[i,j] * vj
    end
end

might be written in a more performant manner like this:

for (jj, vj) in zip(eachindex(B, Dimension{2}), v)
    for (i, ii) in zip(eachindex(dest), eachindex(B, (:, jj)))
        dest[i] += B[ii,jj]*vj
    end
end

It’s not too hard to figure out what eachindex(B, Dimension{2}) and
eachindex(B, (:, jj)) should do: ii, for example, could be a
CartesianInnerIndex (a type that does not yet exist) that for a
particular column of B iterates from A[3,7,4] to A[5,8,4], where
the dth index component wraps around at size(A, d). The
big performance advantage of this strategy is that you only have to
compute a div to set the bounds of the iterator on each column; the
inner loop doesn’t require a div on each element access. No doubt,
given suitable definition of jj one could be even more clever and
avoid calculating div altogether. To the author, this strategy
seems promising as a way to resolve the majority of the performance
concerns about ReshapedArrays—only if you needed “random access”
would you require slow (integer-based) operations.

However, a big problem is that compared to the “naive” implementation,
this is rather ugly.

Row-major matrices, PermutedDimensionArrays, and “taking transposes seriously”

Julia’s Array type stores its entries in column-major order, meaning
that A[i,j] and A[i+1,j] are in adjacent memory locations. For
certain applications—or for interfacing with certain external code
bases—it might be convenient to support row-major arrays, where
instead A[i,j] and A[i,j+1] are in adjacent memory locations. More
fundamentally, this is partially related to one of the most
commented-on issues in all of julia’s development history, known as
“taking transposes seriously” aka #4774. There have been at least two
attempts at implementation, #6837 and the mb/transpose branch, and
for the latter a summary of benefits and challenges was posted.

One of the biggest challenges mentioned was the huge explosion of
methods that one would need to support. Can generic code come to the
rescue here? There are two related concerns. The first is linear
indexing: oftentimes this is conflated with “storage order,” i.e.,
given two linear indexes i and j for the same array, the offset in
memory is proportional to i-j. For row-major arrays, this
notion is not viable, because otherwise a loop

function copy!(dest, src)
    for i = 1:length(src)
        dest[i] = src[i]  # trouble if `i` means "memory offset"
    end
    dest
end

would end up taking a transpose if src and dest don’t use the same
storage order. Consequently, a linear index has to be defined in
terms of the corresponding cartesian (full-dimensionality) index.
This isn’t much of a real problem, because it’s one we know how to
solve: use ind2sub (which is slow) when you have to, but for
efficiency make row major arrays belong to the category (LinearSlow)
of arrays that defaults to iteration with cartesian indexes. Doing so
will ensure that if one uses generic constructs like eachindex(src)
rather than 1:length(src), then the loop above can be fast.

The far more challenging problem concerns cache-efficiency: it’s much
slower to access elements of an array in anything other than
storage-order. Some
reasonably fast ways to write matrix-vector multiplication are

for j = 1:size(B, 2)
    vj = v[j]
    for i = 1:size(B, 1)
        dest[i] += B[i,j] * vj
    end
end

for a column-major matrix B, and

for i = 1:size(B, 1)
    for j = 1:size(B, 2)
        dest[i] += B[i,j] * v[j]
    end
end

for a row-major matrix. (One can do even better than this by using a
scalar temporary accumulator, but let’s not worry about that here.)
The key point to note is that the order of the loops has been
switched.

One could generalize this by defining a RowMajorRange iterator
that’s a lot like our CartesianRange iterator, but traverses the
array in row-major order. eachindex claims to return an “efficient
iterator,” and without a doubt the RowMajorRange is a (much) more
efficient iterator than a CartesianRange iterator for row-major
arrays. So let’s imagine that eachindex does what it says, and
returns a RowMajorRange iterator. Using this strategy, the two
algorithms above can be combined into a single generic implementation:

for I in eachindex(B)
    dest[I[1]] += B[I]*v[I[2]]
end

Yay! Score one for efficient generic implementations.

But our triumph is short-lived. Let’s return to the example of
copy! above, and realize that dest and src might be two
different array types, and therefore might be most-efficiently indexed
with different iterator types. We’re tempted to write this as

function copy!(dest, src)
    for (idest, isrc) in zip(eachindex(dest), eachindex(src))
        dest[idest] = src[isrc]
    end
    dest
end

Up until we introduced our RowMajorRange return-type for
eachindex, this implementation would have been fine. But we just
broke it, because now this will incorrectly take a transpose in
certain situations.

In other words, without careful design the goals of
“maximally-efficient iteration” and “keeping accesses in-sync” are in
conflict.

OffsetArrays and the meaning of AbstractArray

Julia’s arrays are indexed starting at 1, whereas some other languages
start numbering at 0. If you take comments on various blog posts at
face value, there are vast armies of programmers out there eagerly
poised to adopt julia, but who won’t even try it because of this
difference in indexing. Since recruiting those armies will lead to
world domination, this is clearly a problem of the utmost urgency.

More seriously, there are algorithms which simplify if you can index
outside of the range from 1:size(A,d). In my own lab’s internal
code, we’ve long been using a CenterIndexedArray type, in which such
arrays (all of which have odd sizes) are indexed over the range -n:n
and for which 0 refers to the “center” element. One package which
generalizes this notion is OffsetArrays. Unfortunately, in practice
both of these array types produce segfaults (due to built-in
assumptions about when @inbounds is appropriate) for many of julia’s
core functions; over time my lab has had to write implementations
specialized for CenterIndexedArrays for quite a few julia functions.

OffsetArrays illustrates another conceptual challenge, which can
easily be demonstrated by copy!. When dest is a 1-dimensional
OffsetArray and src is a standard Vector, what should copy!
do? In particular, where does src[1] go? Does it go in the first
element of dest, or does it get stored in dest[1] (which may not
be the first element).

Such examples force us to think a little more deeply about what an
array really is. There seem to be two potential conceptions. One is
that arrays are lists, and multidimensional arrays are
lists-of-lists-of-lists-of… In such a world view, the right thing
to do is to put src[1] into the first slot of dest, because 1 is
just a synonym for first. However, this world view doesn’t really
endow any kind of “meaning” to the index-tuple of an array, and in
that sense doesn’t even include the distinction conveyed by an
OffsetArray. In other words, in this world an OffsetArray is
simply nonsensical, and shouldn’t exist.

If instead one thinks OffsetArrays should exist, this essentially
forces one to adopt a different world view: arrays are effectively
associative containers, where each index-tuple is the “key” by which
one retrieves a value. With this mode of thinking, src[1] should be
stored in dest[1].

Formalizing AbstractArray

These examples suggest a formalization of AbstractArray:

  • AbstractArrays are specialized associative containers, in that the
    allowable “keys” may be restricted by more than just their julia
    type. Specifically, the allowable keys must be representable as a
    cartesian product of one-dimensional lists of values. The allowed
    keys may depend not just on the array type but also the specific
    array (e.g., its size). Attempted access by keys that cannot be
    converted to one of the allowed keys, for that specific array,
    result in BoundsErrors.

  • For any given array, one must be able to generate a
    finite-dimensional parametrization of the full domain of valid keys
    from the array itself. This might only require knowledge of the
    array size, or the keys might depend on some internal storage (think
    DataFrames and OffsetArrays). In some cases, just the array
    type might be sufficient (e.g., FixedSizeArrays). By this
    definition, note that a Dict{ASCII5,Int}, where ASCII5 is a type
    that means an ASCII string with 5 characters, would qualify as a
    5-dimensional (sparse) array, but that a Dict{ASCIIString,Int}
    would not (because there is no length limit to an ASCIIString, and
    hence no finite dimensionality).

  • An array may be indexed by more than one key type (i.e., keys may
    have multiple parametrizations). Different key parametrizations are
    equivalent when they refer to the same element of a given
    array. Linear indexes and cartesian indexes are simple examples of
    interconvertable representations, but specialized iterators can
    produce other key types as well.

  • Arrays may support multiple iterators that produce non-equivalent
    key sequences. In other words, a row-major matrix may support both
    CartesianRange and RowMajorRange iterators that access elements
    in different orders.

Finding a way forward

Resolving these conflicting demands is not easy. One approach might be
to decree that some of these array types simply can’t be supported
with generic code. It is possible that this is the right
strategy. Alternatively, one can attept to devise an array API that
handles all of these types (and hopefully more).

In GitHub issue
#15648, we are
discussing APIs that may resolve these challenges. Readers are
encouraged to contribute to this discussion.