Author Archives: Blog by Bogumił Kamiński

Does Julia use 1-based indexing?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/26/onebased.html

Introduction

Julia is a powerful language giving a lot of flexibility to developers.
This flexibility was one of the factors that convinced me to use Julia
as my go-to language in my work.

Expert advice

(source: https://www.facebook.com/nixcraft/photos/a.431194973560553/3484925554854131)

However, with great power comes great responsibility. New Julia users are
often overwhelmed by how much there is to learn: type system, multiple dispatch,
memory management, multi-threading, advanced indexing, …

This post is aimed at new users of Julia. Its goal is to present my personal
perspective how I think people should try to learn and understand Julia. In
short, I recommend you do not try to follow all advice from Julia experts on
idiomatic Julia code as it could quickly get overly complicated. Instead write
code that you can easily understand that follows few fundamental principles
(like avoiding global variables or writing type stable code). The advanced tips
related to writing fully generic code are mostly relevant if you want to start
writing packages in Julia, which is not what new users start with.

I will try to explain you this issue by discussing how I approach 1-based
indexing in Julia.

The post was written under Julia 1.8.0 and OffsetArrays.jl 1.12.7.

What is 1-based indexing?

If you read the Julia manual you learn that:

In Julia, indexing of arrays, strings, etc. is 1-based not 0-based.

This means that if you have some collection, like for example a Vector or a
String its first element has index of 1. Here is an example:

julia> v = [5, 7, 13]
3-element Vector{Int64}:
  5
  7
 13

julia> v[1]
5

julia> firstindex(v)
1

julia> s = "abc"
"abc"

julia> s[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> firstindex(s)
1

The firstindex function in Julia informs us what is a firs index in a
collection. In both cases it returned 1.

If you are a new Julia user then I advise you that you assume that all
standard collections in Julia use 1 as a starting index. So, for example,
the following code for calculation of the sum of some vector is in my
opinion acceptable (assuming we want to accept only non-empty collections):

julia> function mysum(v::AbstractVector)
           n = length(v)
           n == 0 && throw(ArgumentError("the collection is empty"))
           s = v[1]
           for i in 2:n
               s += v[i]
           end
           return s
       end
mysum (generic function with 1 method)

julia> mysum(v)
25

Does Julia use 1-based indexing

However, when you show such a code to an experienced Julia developer you will
very quickly get a feedback that it is incorrect as vectors in Julia do not have
to use 1-based indexing. Instead you might get a recommendation of the code that
looks, for example, like this:

julia> function mysum2(v::AbstractVector)
           n = length(v)
           n == 0 && throw(ArgumentError("the collection is empty"))
           s = v[begin]
           for i in firstindex(v)+1:lastindex(v)
               s += v[i]
           end
           return s
       end
mysum2 (generic function with 1 method)

julia> mysum2(v)
25

This code will work with any AbstractVector and correctly handle non 1-based
cases. Let us check it by creating a non 1-based vector:

julia> using OffsetArrays

julia> ov = OffsetArray([5, 7, 13], -2)
3-element OffsetArray(::Vector{Int64}, -1:1) with eltype Int64 with indices -1:1:
  5
  7
 13

julia> firstindex(ov)
-1

julia> lastindex(ov)
1

julia> ov[-1]
5

Now check both functions we have defined above:

julia> mysum(ov)
ERROR: BoundsError: attempt to access 3-element
OffsetArray(::Vector{Int64}, -1:1) with eltype Int64
with indices -1:1 at index [2]

julia> mysum2(ov)
25

Indeed, as expected mysum does not work correctly, but mysum2 is OK.

So why do I say that the mysum function is acceptable. There are the following
reasons:

  • In practice you are very unlikely to encounter OffsetArrays.jl, and if you
    will, you will know it; all standard collections in Julia are 1-based.
  • mysum function is much easier to write and reason about than mysum2. To
    write the mysum function you need to have only a minimal knowledge of Julia.
    To write mysum2 you need to know much more.
  • Even if you pass a vector that is not 1-based to mysum then in the worst
    case what you will get is an error, so in this case you will not get a wrong
    result silently. (of course in general there could be a case that you silently
    obtained an incorrect result, but this is not the case when you want to iterate
    all elements of a vector, which is the most frequent use-case).
  • Descriptions of many algorithms in books assume 1-based indexing, so it is
    easy to port them to code if we assume a fixed first index. Of course it is
    usually relatively easy to change the algorithm to support other indexing
    base, but it requires mental effort.

Now, if you want extra safety you can add a call to the
Base.require_one_based_indexing function on top of your function like this:

julia> function mysum3(v::AbstractVector)
           Base.require_one_based_indexing(v)
           n = length(v)
           n == 0 && throw(ArgumentError("the collection is empty"))
           s = v[1]
           for i in 2:n
               s += v[i]
           end
           return s
       end
mysum3 (generic function with 2 methods)

julia> mysum3(ov)
ERROR: ArgumentError: offset arrays are not supported
but got an array with index other than 1

Now you are sure that you only work with vectors supporting 1-based indexing
and using other vectors throw an informative error.

Practical examples of handling of 1-based indexing

Let me give you two practical examples what is done in some popular packages.

First let us look at the shuffle! function implementation in the Random
module:

function shuffle!(r::AbstractRNG, a::AbstractArray)
    require_one_based_indexing(a)
    n = length(a)
    n <= 1 && return a # nextpow below won't work with n == 0
    @assert n <= Int64(2)^52
    mask = nextpow(2, n) - 1
    for i = n:-1:2
        (mask >> 1) == i && (mask >>= 1)
        j = 1 + rand(r, ltm52(i, mask))
        a[i], a[j] = a[j], a[i]
    end
    return a
end

Note two things. First, although in general it would be possible to re-write
this code to accept non 1-based indices Julia developers decided not to do so (I
assume this is to make the code easier to read). However, there is more to it.
The second point is that I have never heard anyone complain that shuffle!
implementation should be changed. In general, of course it could, but the
reality is that during the last 10 years of Julia usage it seems that this
functionality gap was not hurting anyone enough to warrant fixing this design
flaw.

The second example is from DataFrames.jl. Some time ago we got a request to
correctly handle non 1-based indexed vectors as columns of a DataFrame.
We could have done one of two things:

  1. allow non 1-based vectors and adjust all code to support this change;
  2. disallow non 1-based vectors and throw an error if the user tries to put
    such a vector in a DataFrame.

Again, we could have went the more flexible path of allowing non 1-based vectors,
but I preferred not to do so. The reasons are:

  1. We would need to increase the complexity of the API of DataFrames.jl and
    additionally our users would constantly need to think and check if some
    data frame uses 1-based or some other indexing;
  2. Internal code would be more complex. Any change in the DataFrames.jl package
    would need to be designed in a way that correctly handles non 1-based
    indexing;
  3. All tests in DataFrames.jl would need to be adjusted to add coverage of non
    1-based indexed case for all functions.

I thought it was too much effort for too little benefit. It is much easier, both
for users and for developers to assume and enforce that we only accept as
columns vectors that use 1-based indexing.

Conclusions

When learning Julia I recommend you try to keep it simple. Learn the
functionality of Base Julia and how fundamental types defined in it work.

Do not try to write overly general code or try to increase the speed of code too
early (e.g. by using the @inbounds macro – usually using it does not give much
benefit and at the same time introduces the risk of bugs in your code).

What I recommend is writing simple code but making sure it is correct for the
cases you expect to encounter. In particular make sure you write tests of your
code (it is really easy in Julia given the functionality of the Test module).

After you get comfortable with using Julia you will learn that most of the
things you assume how it works can be changed. Then you will be ready to
go to the next level and write a fully generic code if you decide you need to.

In summary, assuming that Julia uses 1-based indexing is a safe enough bet that
is not likely to hurt you in practice (even if it is not 100% correct in
general). Therefore, when you learn Julia I recommend you make this assumption
and just enjoy the language. You will have time to learn advanced design
patterns later, after you get comfortable with the language.

Subtypes of concrete types in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/19/union.html

Introduction

Today my post is about subtypes of concrete types. It is mostly academic, but I
hope it will be useful for readers wanting to get a better understanding of
corner cases Julia’s type system.

The post was written under Julia 1.8.0 (with special thanks for all people
who contributed to this release!).

What is a concrete type in Julia?

In Julia a type is concrete it can have a direct instance, that is, some type
T is concrete if there exists at least one value v such that
typeof(v) === T.
For every type you can check whether it is concrete using the
isconcretetype function.

Today I want to discuss the following sentence from the section on
Types from the Julia Manual in relation to concrete types:

One particularly distinctive feature of Julia’s type system is that concrete
types may not subtype each other: all concrete types are final and may only
have abstract types as their supertypes.

From this sentence some readers conclude that concrete types cannot have
subtypes. However, it is not the case. Concrete types in Julia can have subtypes
as long as these subtypes are not concrete.

You might ask does this ever happen in practice? The answer is that it happens
and here are the examples when it does.

The first one is Union{} type. This type is not concrete and has no values.
However, it is a subtype of all types, including concrete types, for example:

julia> Union{} <: Int
true

julia> Union{} <: Vector{Missing}
true

The other case is Type{T} type, where T is a DataType (i.e. if T has
type DataType). All common concrete types are subtypes of DataType, e.g.
integers or vectors. So types like Int or Vector{Missing} have DataType
type:

julia> typeof(Int)
DataType

julia> typeof(Vector{Missing})
DataType

which means that DataType must be concrete, and indeed it is:

julia> isconcretetype(DataType)
true

Although DataType is concrete, it has Type{Int} and Type{Vector{Missing}}
as its subtypes (and these types must not, and are not, concrete as we
discussed above):

julia> Type{Int} <: DataType
true

julia> Type{Vector{Missing}} <: DataType
true

julia> isconcretetype(Type{Int})
false

julia> isconcretetype(Type{Vector{Int}})
false

Why these subtyping considerations matter?

The most important lesson learned here is that in your code you should
not assume that concrete type cannot have subtypes, as it can (although these
subtypes cannot be concrete themselves). This observation is mostly relevant
for package developers, who need to write generic code.

However, there are some practical situations when one can be affected by these
subtyping rules. The most common is when one is working with missing values.

Assume that I generate some random matrix containing either 1 or missing:

julia> using Random

julia> Random.seed!(1234);

julia> mat = rand([1, missing], 10, 3)
10×3 Matrix{Union{Missing, Int64}}:
 1          missing  1
  missing   missing  1
 1         1          missing
  missing  1         1
 1         1          missing
 1          missing  1
  missing   missing  1
  missing   missing   missing
 1          missing   missing
  missing   missing   missing

Now, I want to compute the sums of its rows, while skipping missing values.
Here is how you can do it:

julia> [sum(skipmissing(row)) for row in eachrow(mat)]
10-element Vector{Int64}:
 2
 1
 2
 2
 2
 2
 1
 0
 1
 0

However, a very similar codes that follow do not work:

julia> [sum(skipmissing(identity.(row))) for row in eachrow(mat)]
ERROR: ArgumentError: reducing with add_sum over an empty collection of element type Union{} is not allowed.

julia> [sum(skipmissing([x for x in row])) for row in eachrow(mat)]
ERROR: ArgumentError: reducing with add_sum over an empty collection of element type Union{} is not allowed.

What is the reason for the difference? In the skipmissing(row) case row
is a view and retains information about element type of the whole array, which
is Union{Missing, Int64}, so it is able to properly compute sum even in the
case when all values in a row are missing.

On the other hand both identity.(row) and [x for x in row] materialize the
row and perform type narrowing. This type narrowing means that in rows that
only contain missing values the information about Int64 is lost and we get
an error. Let us see it step by step:

julia> row = last(eachrow(mat))
3-element view(::Matrix{Union{Missing, Int64}}, 10, :) with eltype Union{Missing, Int64}:
 missing
 missing
 missing

julia> x = identity.(row)
3-element Vector{Missing}:
 missing
 missing
 missing

julia> eltype(skipmissing(x))
Union{}

As you can see, since skipmissing strips the Missing part from the source
vector element type, we are left with Union{}.

Unfortunately, such errors happen from time to time when one works with
data having missing values. For such cases in Julia many (but not all) common
reduction functions support the init keyword, so you can do:

julia> [sum(skipmissing(identity.(row)), init=0) for row in eachrow(mat)]
10-element Vector{Int64}:
 2
 1
 2
 2
 2
 2
 1
 0
 1
 0

and all is good even if type inference produces Union{}.

Conclusions

The post today was less practical than usual. However, I hope you will find it
useful when Julia tries to take you into a deep dark type system forest where
2+2=5, and the path leading out is only wide enough for one (Mikhail Tal).

How to safely use the vec and reshape functions in Julia?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/12/vec.html

Introduction

Julia users often want to squeeze-out maximum performance from their programs.
In the search for efficiency, they soon discover the vec and reshape
functions that allow for changing of the shape of the input array without
copying data. In this post I want do discuss how these functions work
and share with you the rules I use when deciding if I want to use them.

The post was written under Julia 1.7.2.

The contract

When you first learn some function you must look up its contract in its
docstring. Let us check vec and reshape (I abbreviated the docstrings to
focus on the key parts):

help?> vec
  vec(a::AbstractArray) -> AbstractVector

  Reshape the array a as a one-dimensional column vector.
  Return a if it is already an AbstractVector.
  The resulting array shares the same underlying data as a,
  so it will only be mutable if a is mutable,
  in which case modifying one will also modify the other.

help?> reshape
search: reshape promote_shape

  reshape(A, dims...) -> AbstractArray

  Return an array with the same data as A,
  but with different dimension sizes or number of dimensions.
  The two arrays share the same underlying data, so that the result is mutable
  if and only if A is mutable,
  and setting elements of one alters the values of the other.

In short, both functions allow you to change the shape of some array without
copying of the data. vec always returns a vector, while reshape is more
flexible and allows you to produce an array of any dimension.

Let me show you some use cases of these functions. First, assume I want to
produce a cartesian product of two collections:

julia> collect(Iterators.product('a':'b', 1:3))
2×3 Matrix{Tuple{Char, Int64}}:
 ('a', 1)  ('a', 2)  ('a', 3)
 ('b', 1)  ('b', 2)  ('b', 3)

By default the collect function produced me a matrix. If for some reason
I needed a vector instead I could write:

julia> vec(collect(Iterators.product('a':'b', 1:3)))
6-element Vector{Tuple{Char, Int64}}:
 ('a', 1)
 ('b', 1)
 ('a', 2)
 ('b', 2)
 ('a', 3)
 ('b', 3)

The important benefit of this operation is that vec is non-copying so adding
this step is efficient. Let me give you another example, this time using
broadcasting:

julia> string.(['a' 'b'], 1:3)
3×2 Matrix{String}:
 "a1"  "b1"
 "a2"  "b2"
 "a3"  "b3"

julia> vec(string.(['a' 'b'], 1:3))
6-element Vector{String}:
 "a1"
 "a2"
 "a3"
 "b1"
 "b2"
 "b3"

Now let us have a look at reshape:

julia> reshape(1:6, 2, 3)
2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:
 1  3  5
 2  4  6

Why reshape would be useful? Consider for example a simple function changing
pairs of consecutive elements of a vector into a tuple. One of the ways
(for sure not the only way) to implement this would be:

julia> totuples(x::AbstractVector) = Tuple.(eachcol(reshape(x, 2, :)))
totuples (generic function with 1 method)

julia> totuples(1:6)
3-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (3, 4)
 (5, 6)

The dangers

While the vec and reshape functions can be useful there are some risks
of using them. Let me discuss some common pitfalls.

The first is that when you reshape a collection you may leave a permanent mark
in the source that it was used in reshape (even though reshape has no !
as its suffix). This can lead to hard to catch bugs. Let us check the following
code:

julia> x = [1, 2, 3, 4]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> totuples(x)
2-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (3, 4)

julia> push!(x, 5)
ERROR: cannot resize array with shared data

As you can see, although the use of reshape was done in the totuples
function and the reshaped matrix we created there with reshape(x, 2, :)
is already out of scope the fact that we used reshape on x permanently
disallows its resizing.

The second risk is that vec and reshape may, or may not, create a new
object, as they might just return a source object. Let us check the following
code that extends the original totuples function to accept any
AbstractArray. In the code I write y = vec(x), but the same behavior
would be present with y = reshape(x, :).

julia> function totuples2(x::AbstractArray)
           y = vec(x)
           isodd(length(x)) && push!(y, last(y))
           return totuples(y)
       end
totuples2 (generic function with 1 method)

julia> x = [1;;]
1×1 Matrix{Int64}:
 1

julia> totuples2(x)
1-element Vector{Tuple{Int64, Int64}}:
 (1, 1)

julia> x
1×1 Matrix{Int64}:
 1

julia> x = [1]
1-element Vector{Int64}:
 1

julia> totuples2(x)
1-element Vector{Tuple{Int64, Int64}}:
 (1, 1)

julia> x
2-element Vector{Int64}:
 1
 1

As you can see our totuples2 function left [1;;] unchanged, since it is
a matrix, but [1] was updated. The reason is that vec(x) in this case just
returned its argument.

Finally, as an another application of the same rule, note that that the vec
function (and similarly reshape when reshaping to a vector), may or may not
produce a vector that can be resized:

julia> totuples2([1 2; 3 4])
2-element Vector{Tuple{Int64, Int64}}:
 (1, 3)
 (2, 4)

julia> totuples2(reshape(1:4, 2, 2))
2-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (3, 4)

julia> totuples2([1;;])
1-element Vector{Tuple{Int64, Int64}}:
 (1, 1)

julia> totuples2(reshape(1:1, 1, 1))
ERROR: MethodError: no method matching resize!(::UnitRange{Int64}, ::Int64)

As you can see, this is tricky, as the function you use, like totuples2 in our
case, might throw an error only in some cases, but work in other cases. In the
totuples2 case the reason is that we use push! only if the length of the
collection is odd.

The point of all these examples is that code using vec or reshape can lead
to hard-to-diagnose errors. The reason is that you might notice the problems
caused by using them much later in the code than when you used them.

Conclusions

The vec and reshape functions are nice utilities and I use them quite often.
However, to safely use them I always follow the following two rules:

  • when writing a function never use reshape on an array that is an argument
    of the function; the reason is, that in some cases reshape will silently
    leave the “no resize” mark on the source vector; if I use reshape I make
    sure that its source is always some short lived object from a local scope;
  • do not resize the vector produced by vec/reshape as such resizing may,
    or may not affect the source and keeping a mental record if this is the case
    is hard (and most likely readers of such code will not be able to easily
    know it); as a softer rule I generally avoid any mutation of the output of
    vec/reshape as it is not guaranteed that it will be mutable.

In short: the best uses for vec and reshape are situations when their source
is a short lived object and you do not want to mutate their output in any way.