Author Archives: Blog by Bogumił Kamiński

Some special cases of method dispatch in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/15/dispatch.html

Introduction

As I have recently written in this post the
Julia for Data Analysis
book that I am preparing is already available in MEAP preview.
Manning is releasing its chapters sequentially (approximately one chapter per
several weeks). I thought that when some chapter gets released in MEAP I will
write on the blog about related things that are not included in the book.

This week chapter 3 was released. It includes discussion of multiple dispatch
in Julia. Therefore, in this post I thought that I will comment on some specific
cases of method dispatch that I found important to understand.

This post was tested under Julia 1.7.2.

Dispatch with Unions

In the Julia Manual you can read in the Methods section that:

When a function is applied to a particular tuple of arguments,
the most specific method applicable to those arguments is applied.

It is important to remember that this rule also applies to Unions. Why is this
relevant?

Consider that you want to add a method to a function that already has the
following methods defined:

fun(x, y) = "slow"
fun(x::Int, y::Number) = "faster"

Your additional method is very fast, but expects y to be Int and x to be
either Int or Float64. The first approach would be to add the following
definition:

fun(x::Union{Int, Float64}, y::Int) = "fastest"

However, it will lead to dispatch ambiguity:

julia> fun(1, 1)
ERROR: MethodError: fun(::Int64, ::Int64) is ambiguous. Candidates:
  fun(x::Int64, y::Number) in Main at REPL[2]:1
  fun(x::Union{Float64, Int64}, y::Int64) in Main at REPL[3]:1
Possible fix, define
  fun(::Int64, ::Int64)

The reason is that we have two conflicting methods that could be applied
to this set of arguments and neither of them is more specific. This is an
expected behavior from a dispatch perspective. However, it means
that you need to duplicate code and instead of
fun(x::Union{Int, Float64}, y::Int) = "fastest"
write two definitions:

fun(x::Float64, y::Int) = "fastest"
fun(x::Int, y::Int) = "fastest"

and now all will work as expected. The point is that using Union is not the
same as writing several separate definitions “unrolling” the Union from a
dispatch perspective.

However, there is one slight problem. In our case the definition of fun was
quite short, but what if it were 100 LOC? Doing copy-paste would lead to a lot
of code duplication (which is not recommended). One of the solutions in such
cases is to use the @eval macro and do the required definitions like this:

for T in (Int, Float64)
    @eval fun(x::$T, y::Int) = "fastest"
end

Dispatch with keyword arguments

In the Julia Manual you can read in the Note on Optional and keyword Arguments section that:

Keyword arguments behave quite differently from ordinary positional arguments.
In particular, they do not participate in method dispatch. Methods are
dispatched based only on positional arguments, with keyword arguments
processed after the matching method is identified.

The reality is that in some cases keyword arguments are taken into account
when deciding about dispatch. This situation happens if you have some methods
that define keyword arguments and some other methods that do not define them.

Again, let me illustrate it with an example. I define the following methods:

julia> g(x::Int) = "int"
g (generic function with 1 method)

julia> g(x; kwarg) = "any"
g (generic function with 2 methods)

julia> g(1)
"int"

julia> g(1; kwarg=nothing)
"any"

In this case the fact that one method had no keyword arguments and
the other had a keyword argument influenced dispatch.

However, consider the following scenario, which is similar:

julia> h(x::Int; kwarg2) = "int"
h (generic function with 3 methods)

julia> h(x; kwarg) = "any"
h (generic function with 4 methods)

julia> h(1; kwarg=nothing)
ERROR: UndefKeywordError: keyword argument kwarg2 not assigned

This time, since the first method defined some keyword argument dispatch
selected it, and next produced an error since the name of the keyword argument
was incorrect.

Let me give you one more example that is related to methods design with
keyword arguments. It happens in practice so it is worth remembering.

julia> f(args...) = "positional"
f (generic function with 1 method)

julia> f(; kwargs...) = "keyword"
f (generic function with 2 methods)

julia> methods(f)
# 2 methods for generic function "f":
[1] f(; kwargs...) in Main at REPL[2]:1
[2] f(args...) in Main at REPL[1]:1

As you can see these definitions indeed result in two separate methods for f.
Now, a question to you is what will happen if you write f()?

As you might have predicted the keyword argument method is invoked because
it is more specific in terms of positional arguments:

julia> f()
"keyword"

Conclusions

The examples I have chosen are not artificial. All of them were relevant
(and lead to bugs in development) when I was working on DataFrames.jl.

I hope you will find them useful. I omit them in my book as I feel
they would distract readers’ attention, but once one goes deeper into Julia
they can be encountered in practice.

Some special cases of method dispatch in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/15/dispatch.html

Introduction

As I have recently written in this post the
Julia for Data Analysis
book that I am preparing is already available in MEAP preview.
Manning is releasing its chapters sequentially (approximately one chapter per
several weeks). I thought that when some chapter gets released in MEAP I will
write on the blog about related things that are not included in the book.

This week chapter 3 was released. It includes discussion of multiple dispatch
in Julia. Therefore, in this post I thought that I will comment on some specific
cases of method dispatch that I found important to understand.

This post was tested under Julia 1.7.2.

Dispatch with Unions

In the Julia Manual you can read in the Methods section that:

When a function is applied to a particular tuple of arguments,
the most specific method applicable to those arguments is applied.

It is important to remember that this rule also applies to Unions. Why is this
relevant?

Consider that you want to add a method to a function that already has the
following methods defined:

fun(x, y) = "slow"
fun(x::Int, y::Number) = "faster"

Your additional method is very fast, but expects y to be Int and x to be
either Int or Float64. The first approach would be to add the following
definition:

fun(x::Union{Int, Float64}, y::Int) = "fastest"

However, it will lead to dispatch ambiguity:

julia> fun(1, 1)
ERROR: MethodError: fun(::Int64, ::Int64) is ambiguous. Candidates:
  fun(x::Int64, y::Number) in Main at REPL[2]:1
  fun(x::Union{Float64, Int64}, y::Int64) in Main at REPL[3]:1
Possible fix, define
  fun(::Int64, ::Int64)

The reason is that we have two conflicting methods that could be applied
to this set of arguments and neither of them is more specific. This is an
expected behavior from a dispatch perspective. However, it means
that you need to duplicate code and instead of
fun(x::Union{Int, Float64}, y::Int) = "fastest"
write two definitions:

fun(x::Float64, y::Int) = "fastest"
fun(x::Int, y::Int) = "fastest"

and now all will work as expected. The point is that using Union is not the
same as writing several separate definitions “unrolling” the Union from a
dispatch perspective.

However, there is one slight problem. In our case the definition of fun was
quite short, but what if it were 100 LOC? Doing copy-paste would lead to a lot
of code duplication (which is not recommended). One of the solutions in such
cases is to use the @eval macro and do the required definitions like this:

for T in (Int, Float64)
    @eval fun(x::$T, y::Int) = "fastest"
end

Dispatch with keyword arguments

In the Julia Manual you can read in the Note on Optional and keyword Arguments section that:

Keyword arguments behave quite differently from ordinary positional arguments.
In particular, they do not participate in method dispatch. Methods are
dispatched based only on positional arguments, with keyword arguments
processed after the matching method is identified.

The reality is that in some cases keyword arguments are taken into account
when deciding about dispatch. This situation happens if you have some methods
that define keyword arguments and some other methods that do not define them.

Again, let me illustrate it with an example. I define the following methods:

julia> g(x::Int) = "int"
g (generic function with 1 method)

julia> g(x; kwarg) = "any"
g (generic function with 2 methods)

julia> g(1)
"int"

julia> g(1; kwarg=nothing)
"any"

In this case the fact that one method had no keyword arguments and
the other had a keyword argument influenced dispatch.

However, consider the following scenario, which is similar:

julia> h(x::Int; kwarg2) = "int"
h (generic function with 3 methods)

julia> h(x; kwarg) = "any"
h (generic function with 4 methods)

julia> h(1; kwarg=nothing)
ERROR: UndefKeywordError: keyword argument kwarg2 not assigned

This time, since the first method defined some keyword argument dispatch
selected it, and next produced an error since the name of the keyword argument
was incorrect.

Let me give you one more example that is related to methods design with
keyword arguments. It happens in practice so it is worth remembering.

julia> f(args...) = "positional"
f (generic function with 1 method)

julia> f(; kwargs...) = "keyword"
f (generic function with 2 methods)

julia> methods(f)
# 2 methods for generic function "f":
[1] f(; kwargs...) in Main at REPL[2]:1
[2] f(args...) in Main at REPL[1]:1

As you can see these definitions indeed result in two separate methods for f.
Now, a question to you is what will happen if you write f()?

As you might have predicted the keyword argument method is invoked because
it is more specific in terms of positional arguments:

julia> f()
"keyword"

Conclusions

The examples I have chosen are not artificial. All of them were relevant
(and lead to bugs in development) when I was working on DataFrames.jl.

I hope you will find them useful. I omit them in my book as I feel
they would distract readers’ attention, but once one goes deeper into Julia
they can be encountered in practice.

Set vs vector lookup in Julia: a closer look

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/08/fast2.html

Introduction

It was April 1st, when I was writing my last post, but my friends, after
having read it, told me that it was not very funny. So today I get back to boring
style.

In the previous post I was comparing lookup speed in vector vs set in Julia and
considered a scenario when lookup in vector was faster. This is not something
that normally we should expect, but clearly I must have chosen the benchmark
settings in a way that lead to such a result.

Today let me go back to that example and discuss it in more detail.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.2,
BenchmarkTools.jl 1.3.1, and Plots.jl 1.27.1.

The experiment

Let me remind you the experiment setting. I tested the lookup performance for
collections from size 10 to 40. I used random strings and then performed a check
if every value stored in the collection is indeed present in it.

Here is the test code I used with one twist. In my original post I have measured
total time of lookup. This time I measure time of lookup per single element
(the difference is in the plot command where now I divide everything by df.i).

using DataFrames
using BenchmarkTools
using Random
using Plots

f(x, r) = all(in(x), r) # function testing lookup speed

df = DataFrame()
Random.seed!(1234)

for i in 10:40
    v = [randstring(rand(1:1000)) for _ in 1:i] # randomly generate a Vector
    s = Set(v) # turn it to Set for comparison
    @assert length(s) == i # make sure we had no duplicates
    t1 = @belapsed f($s, $v)
    t2 = @belapsed f($v, $v)
    # display the intermetiate results so that I can monitor the progress
    @show (i, t1, t2)
    push!(df, (;i, t1, t2))
end

plot(df.i, [df.t1 df.t2] ./ df.i;
     labels=["Set" "Vector"],
     legend=:topleft,
     xlabel="collection size",
     ylabel="time per lookup in seconds")

The code produces the following plot:

Benchmarks plot 1

We can see that the lookup time of one element in Set is roughly constant, but
for Vector it grows linearly. This is expected. Amortized time of random
lookup in Set is O(1) while for Vector it is O(n), where n is
collection size.

Last week I have computed total time, which is roughly linear for Set while it
is quadratic for Vector. However, I have scaled the plot in a way that it was
not very visible that the curve for Vector is a parabola.

Now we can clearly see that we can expect that for larger collections Set will
be faster. Let us pick i = 1000 as an example (not very large number, but big
enough):

julia> i = 1000;

julia> v = [randstring(rand(1:1000)) for _ in 1:i];

julia> s = Set(v);

julia> @assert length(s) == i

julia> @belapsed f($s, $v)
8.8383e-5

julia> @belapsed f($v, $v)
0.001030348

As you can see now Vector is much slower.

But why was Vector faster than Set for the original data?

Understanding lookup process in Set and in Vector

When you search for a value in a Set Julia roughly does the following steps:

  1. compute hash of the value you look for;
  2. check if Set contains some values that have the same hash (there might be
    more than one such value due to collisions);
  3. if yes then check if any of them are equal to the value we have provided.

When you search for a value in a Vector the process is simpler: Julia checks
sequentially checks elements of a vector if they are equal to the value you
passed. So as you can see the extra cost of using Set is that you need to
perform hashing.

In my example I have generated the data using the following expression:
[randstring(rand(1:1000)) for _ in 1:i]. As you can see these are random
strings that in general can be quite long (up to 1000 characters). Hashing such
strings is expensive. The additional trick I did was to generate the strings so
that with high probability they have a different length. I did this to make
sure that string comparison is fast. As you can check in the implementation of
isequal in Julia it falls back to == in this case, which for String is:

function ==(a::String, b::String)
    pointer_from_objref(a) == pointer_from_objref(b) && return true
    al = sizeof(a)
    return al == sizeof(b) && 0 == _memcmp(a, b, al)
end

We can observe that if strings have unequal lengths they are compared very fast.

In summary, I have baked the example so that hashing is expensive but comparison
is cheap. If I used shorter strings, e.g. of length 2, then even for i = 10
on the average Set lookup would be faster than Vector lookup:

julia> i = 10;

julia> v = [randstring(2) for _ in 1:i];

julia> s = Set(v);

julia> @assert length(s) == i

julia> @belapsed f($s, $v)
1.47515625e-7

julia> @belapsed f($v, $v)
1.8996439628482973e-7

Of course for very short collections, like e.g. length 1 Set will be slower
even in this case as it still would compute hash.

Understanding volatility of Set timing

An additional issue that we have noticed last week is smoothness of Vector
plot and volatility of Set plot. What is the reason for this? There are two
factors here in play:

  1. For each i I use random strings of different lengths. This means that
    hashing cost differs per experiment. On the other hand for Vector we do
    not need to do hashing and, as I commented above, we can immediately decide
    that strings are different just by looking at their length.
  2. For different i Set has different percentage of occupied slots, which
    means that number of hash collisions that can happen differs.

Conclusions

My takeaways from today’s post are:

  • benchmarking is useful, but it is important to understand the data structures
    and algorithms that one uses;
  • when doing visualization it is really important to think what measures to use
    so that the plots are informative;
  • and finally: Set’s not dead (although sometimes indeed it is not an optimal
    data structure; in general I recommend you to have a look at the options
    available in DataStructures.jl).