Is vector pooling considered harmful?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/11/17/pooling.html

Introduction

PooledArrays.jl is a package providing a custom array type that can be used for compression of your data if it has a few unique elements.
In this post I want to explain the design of PooledArray object and discuss how it affects its performance.

The post was written under Julia 1.9.2, PooledArrays.jl 1.4.3, and DataFrames.jl 1.6.1.

How pooling works

The definition of the PooledArray type is the following:

mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N}
    refs::RA
    pool::Vector{T}
    invpool::Dict{T,R}
    # refcount[] is 1 if only one PooledArray holds a reference to pool and invpool
    refcount::Threads.Atomic{Int}
end

It represents a N-dimensional array with element type T.
The internal representation of data is that each unique value in a PooledArray gets an integer
representation (reference) of type R. Then for each element of the array the refs field is an array
that is AbstractArray{R, N} and keeps track of what reference value is stored for each entry of an array.

To be able to efficiently work with these reference numbers PooledArray stores two fields:

  • pool that gives information how reference numbers are mapped to actual values of type T stored in the array;
  • invpool that gives an inverse information on what is a reference number of some value of type T.

Since many operations on pooled arrays do not change pool and invpool the PooledArray has an extra optimization
that automatically ensures that two arrays that can be proven to have the same pool and invpool share them.
The refcount field is used to keep track how many such arrays exist. This is used in case when we modify
pool and invpool of some array and want to make sure that we do not modify another PooledArray by accident.

Let us check these properties in practice.

Test driving pooled arrays

Let us create a simple PooledArray first:

julia> using PooledArrays

julia> pa1 = PooledArray(["a", "b", "a", "b", "a", "b"])
6-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "a"
 "b"
 "a"
 "b"

julia> pa1.refs
6-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000001
 0x00000002
 0x00000001
 0x00000002

julia> pa1.pool
2-element Vector{String}:
 "a"
 "b"

julia> pa1.invpool
Dict{String, UInt32} with 2 entries:
  "b" => 0x00000002
  "a" => 0x00000001

julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)

We can see that, by default, each entry is recorded as 4-byte UInt32.
Additionally the pa1.refcount tells us that only this pooled array
uses the pool and refpool objects that it references to.

Let us first check what happens when we operate on this array:

julia> pa2 = pa1[[1, 2, 3]]
3-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "a"

julia> pa2.refcount
Base.Threads.Atomic{Int64}(2)

julia> pa1.refcount
Base.Threads.Atomic{Int64}(2)

julia> pa2.pool === pa1.pool
true

julia> pa2.invpool === pa1.invpool
true

As you can see, since pa2 subsets pa1 we knew that they can share their pool and invpool.
The refcount field tells us that two objects reuse them.

Let us now modify the pool of pa2:

julia> pa2[1] = "c"
"c"

julia> pa2.refcount
Base.Threads.Atomic{Int64}(1)

julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)

julia> pa2.pool
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> pa1.pool
2-element Vector{String}:
 "a"
 "b"

As you can see the pools got automatically decoupled and refcount is adjusted accordingly.

In summary, the benefit of pool-sharing is that we can very fast subset PooledArrays without having to re-create pool and invpool.
This makes working with PooledArray fast as long as we do not change the set of values we store in them.

The second important design aspect of PooledArray is the R type. As I have said, by default it is UInt32. However,
for small pools this is inefficient. Therefore you can write:

julia> pa3 = PooledArray(pa1, compress=true)
6-element PooledVector{String, UInt8, Vector{UInt8}}:
 "a"
 "b"
 "a"
 "b"
 "a"
 "b"

julia> Base.summarysize(pa1)
570

julia> Base.summarysize(pa3)
504

As you can see, you can use the compress=true keyword argument to automatically pick the minimal size of R type that is able to keep the pool at hand.
In our case it is UInt8, which would save a lot of memory in case of large arrays.
Why do we use UInt32 by default then?
The reason is that this type is typically efficient enough memory-wise and at the same time it ensures a pool that is large enough in most scenarios.
For example, the limitation of the UInt8 pool is that it can store up to 255 values only:

julia> pa4 = PooledArray(1:255, compress=true);

julia> pa4[1]=256
ERROR: You're using a PooledArray with ref type UInt8, which can only hold 255 values,
and you just tried to add the 256th reference.  Please change the ref type
to a larger int type, or use the default ref type (UInt32).

So you have a tradeoff here. If you are sure you will not change your pool then compress=true is a safe option.
If you know you might need to change the pool you need to pick the R type more carefully.

What are the benefits of pooled arrays?

There are two types of benefits of PooledArray. The first is memory footprint, the second is performance.
Let me explain them by example.

First create two large vectors storing strings:

julia> v1 = ["x$(isodd(i))" for i in 1:10^6];

julia> v2 = PooledArray(v1, compress=true);

julia> Base.summarysize(v1)
21500040

julia> Base.summarysize(v2)
1000507

As you can see there is a significant compression gain in our example by using a PooledArray.

The second benefit is performance, especially in combination with DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(; v1, v2)
1000000×2 DataFrame
     Row │ v1      v2
         │ String  String
─────────┼────────────────
       1 │ xtrue   xtrue
       2 │ xfalse  xfalse
       3 │ xtrue   xtrue
    ⋮    │   ⋮       ⋮
  999998 │ xfalse  xfalse
  999999 │ xtrue   xtrue
 1000000 │ xfalse  xfalse
       999994 rows omitted

Now let us perform two example operations (I am measuring the second timing to avoid counting compilation time).

The first is aggregation:

julia> combine(groupby(df, :v1), nrow);

julia> @time combine(groupby(df, :v1), nrow);
  0.025122 seconds (201 allocations: 31.271 MiB)

julia> combine(groupby(df, :v2), nrow);

julia> @time combine(groupby(df, :v2), nrow);
  0.002766 seconds (227 allocations: 7.643 MiB)

As you can see doing aggregation when grouping by PooledArray is much faster.

The second example is innerjoin:

julia> df_ref = df[1:2, :]
2×2 DataFrame
 Row │ v1      v2
     │ String  String
─────┼────────────────
   1 │ xtrue   xtrue
   2 │ xfalse  xfalse

julia> df_ref.val = 1:2
1:2

julia> innerjoin(df, df_ref, on=:v1, makeunique=true);

julia> @time innerjoin(df, df_ref, on=:v1, makeunique=true);
  0.057885 seconds (248 allocations: 36.741 MiB, 23.00% gc time)

julia> innerjoin(df, df_ref, on=:v2, makeunique=true);

julia> @time innerjoin(df, df_ref, on=:v2, makeunique=true);
  0.024692 seconds (265 allocations: 43.416 MiB)

And again we see that joining on :v2 is faster than on :v1.

When using pooled arrays might not be a good idea?

There are three cases when you might not see benefits from using PooledArray.

The first is when you have many unique values in your data. Then you have to pay the price
of storing refs, pool, and invpool objects and all of them will be large.

The second is if you have a value that you store that has a small memory footprint, e.g. Bool
and you did not use compress=true. In such a case refs will take more memory than original data would.

The third case is when you create multiple copies of a PooledArray and modify its pool. In such a case
the cost of copying of the pool and invpool fields might be non-negligible. Let me show you a practical
example of the third situation:

julia> df2 = DataFrame(id=1:10^6, v=PooledArray(repeat(["x$i" for i in 1:1000], 1000)))
1000000×2 DataFrame
     Row │ id       v
         │ Int64    String
─────────┼─────────────────
       1 │       1  x1
       2 │       2  x2
       3 │       3  x3
    ⋮    │    ⋮       ⋮
  999998 │  999998  x998
  999999 │  999999  x999
 1000000 │ 1000000  x1000
        999994 rows omitted

julia> df3 = DataFrame(id=1:10^6, v=repeat(["x$i" for i in 1:1000], 1000));

Note that df2 and df3 are identical except that in df2 the :v column is pooled and in df3 it is not.

Now let us test outerjoin on this data:

julia> outerjoin(df2, df2, on=:id, makeunique=true);

julia> @time outerjoin(df2, df2, on=:id, makeunique=true);
  0.065559 seconds (326 allocations: 30.951 MiB)

julia> outerjoin(df3, df3, on=:id, makeunique=true);

julia> @time outerjoin(df3, df3, on=:id, makeunique=true);
  0.036927 seconds (274 allocations: 38.400 MiB)

Note that working with non-pooled data is faster. If we check innerjoin this is not the case:

julia> innerjoin(df2, df2, on=:id, makeunique=true);

julia> @time innerjoin(df2, df2, on=:id, makeunique=true);
  0.018188 seconds (210 allocations: 30.528 MiB)

julia> innerjoin(df3, df3, on=:id, makeunique=true);

julia> @time innerjoin(df3, df3, on=:id, makeunique=true);
  0.029364 seconds (206 allocations: 38.157 MiB)

What is going on here? Let us look at the output:

julia> outerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
     Row │ id       v        v_1
         │ Int64    String?  String?
─────────┼───────────────────────────
       1 │       1  x1       x1
       2 │       2  x2       x2
       3 │       3  x3       x3
    ⋮    │    ⋮        ⋮        ⋮
  999998 │  999998  x998     x998
  999999 │  999999  x999     x999
 1000000 │ 1000000  x1000    x1000
                  999994 rows omitted

julia> innerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
     Row │ id       v       v_1
         │ Int64    String  String
─────────┼─────────────────────────
       1 │       1  x1      x1
       2 │       2  x2      x2
       3 │       3  x3      x3
    ⋮    │    ⋮       ⋮       ⋮
  999998 │  999998  x998    x998
  999999 │  999999  x999    x999
 1000000 │ 1000000  x1000   x1000
                999994 rows omitted

Note that outerjoin changes the element type of v and v_1 columns, so pool and invpool need to be re-created twice,
which takes time and memory.
In innerjoin the element type is not changed so pool and invpool in the output are reused from the source data frame.

Conclusions

In summary:

  • PooledArray is useful if you have data that has many duplicates.
  • The benefits of using PooledArray are lower memory footprint and the fact that some operations on it can be faster (e.g. groupby in DataFrames.jl).
  • The biggest benefit of PooledArray is when you do not change its pool of values. In such a case the pool and invpool objects are created only once
    and are reused in arrays derived from the source array.
  • Remember to carefully choose the type of the reference used in PooledArray. By default it is UInt32, but you can pick a smaller type to get even better compression
    at the expense of smaller number of unique values that your pooled array can store.

I hope you find the tips shared in this post useful in your data analysis processes.

Harnessing the Power of Multiple Dispatch in Julia Programming

By: Steven Whitaker

Re-posted from: https://glcs.hashnode.dev/multiple-dispatch

Julia is a relatively new,free, and open-source programming language.It has a syntaxsimilar to that of other popular programming languagessuch as MATLAB and Python,but it boasts being able to achieve C-like speeds.

One of the core design philosophies of Juliathat distinguishes itfrom a lot of other languagesis multiple dispatch.In essence,multiple dispatch allowswriting multiple implementationsof a single functionbased on different input types.But unlike single dispatchwhich relies on the static, compile-time typesof all but one of the inputs,multiple dispatch chooses what implementation to callusing the runtime types.

It makes writing generic code a cinch,allowing packages to work together seamlesslyand enabling a high degree of composability.

In this post,we will learn aboutmultiple dispatchin Julia.We will discuss how to utilize itand how it compares to single dispatch,typically usedin object-oriented languages.

This post assumes you already havea basic understandingof Julia’s type system.If you haven’t yet,check out our earlierpost to learn more.

Functions and Methods

To understand multiple dispatch,we first need to understandthe difference between functions and methods.

In Julia,a function essentially is just a name.This name is used to refer to the functionand to call the function.What a function does when calledis determined by the methodsassociated with the function.A method is an implementationof a functionfor a given set of input types.

Let’s illustrate this concept with code.

First, we will define a functionwith no methods,just to show that it can be done:

julia> function no_methods endno_methods (generic function with 0 methods)

When declaring a function, however,we typically also define an associated method:

julia> function func(x, y) # Method 1           x + y       endfunc (generic function with 1 method)

Here,we created the function funcand defined a methodthat adds the two function inputs.Call this method 1.

Now let’s add another method,method 2.To do so,we need to provide type annotationsto the input variables:

julia> function func(x::String, y::String) # Method 2           x * y       endfunc (generic function with 2 methods)

Notice that we didn’t createa new function with one method.Because we used the same function name,we added a method to the already existing function.

Let’s add one more method,method 3:

julia> function func(x::String, y::Integer) # Method 3           x^y       endfunc (generic function with 3 methods)

Methods can also have different numbers of arguments.The previous methods all had two arguments,so let’s create a method with three arguments,method 4:

julia> function func(x::String, y::Integer, z::Integer) # Method 4           func(x, y)[1:z]       endfunc (generic function with 4 methods)

Now let’s see multiple dispatch in action.

Multiple Dispatch

When func is called,Julia uses the types of the input argumentsto determine which method to dispatch to.In particular,the input types of multiple arguments(actually, all of them)are considered,hence the name multiple dispatch.If the types of the inputsmatch one of the method signatures,that method will be called.

Multiple dispatch

But what about func("hi", "there")?In this case,both inputs are Strings,so that matches method 2.But String is a subtype of Any,so the inputs also match method 1!In cases like thiswhere multiple methods match,the most specific method will be called.So, in this case,func("hi", "there") will dispatch to method 2because String is a more specific type than Any.

If no methods matchor if there is not one single most specific method,a MethodError will be thrown.

So,the methods we defined for funcwill be called in the following cases:

  • Method 1:When two inputs are givenand no more specific methods exist.Examples:

    julia> func(1, 2)3julia> func([3, 4], [5, 6])2-element Vector{Int64}:  8 10
  • Method 2:When two String inputs are given.Example:
    julia> func("hi ", "there")"hi there"
  • Method 3:When two inputs are given:first a String and then an Integer.(Remember, though, that abstract types,like Integer,cannot be instantiated,so when we say the second input is an Integer,we really mean it is of a typethat is a subtype of Integer.In other words,y isa Integer must evaluate to true.)Examples:

    julia> func("eat ", 3)"eat eat eat "julia> func("speak ", UInt32(2))"speak speak "
  • Method 4:When three inputs are given:first a String, then an Integer,and then another Integer.(Note that y and zdo not have to be of the same type,i.e., y could be an Int64,while z could be a Bool.)Examples:

    julia> func("repeat", 5, 6)"repeat"julia> func("first letter", 1, true)"f"

Now, you may be thinkingthat this just looks like method overloading.And you would be correctthat the above exampleshaven’t really demonstratedthe power of multiple dispatch.So,let’s compare multiple dispatchto single dispatch.

Multiple Dispatch vs Single Dispatch

Having many method definitionsis also possiblewith single dispatch.However,with single dispatch,the compile-time types of the inputs(except the first)are used to decide which method to call.

Single dispatch uses the runtime typeof just the first function argument,while multiple dispatch uses the runtime typesof all the function arguments.

Consider the following Java example:

abstract public class A{    public void func(A a)    {        System.out.println("Method AA");    }}public class B extends A{    public void func(A a)    {        System.out.println("Method BA");    }    public void func(B b)    {        System.out.println("Method BB");    }}public class Main{    public static void main(String[] args)    {        A ab = new B();        B b = new B();        ab.func(ab); // Method BA        b.func(b); // Method BB    }}

In this example,even though both ab and bare instantiated as B objects,b.func(ab) prints Method BA,whereas b.func(b) prints Method BB.This difference is becauseab and b have different compile-time types(also called declared or static types):ab is of type A,whereas b is of type B.

Now let’s compare this to Julia:

# From package PkgA.jlabstract type A endfunc(a1::A, a2::A) = println("Method AA")# From package PkgB.jlimport PkgA: A, funcstruct B <: A end # `B <: A` means `B` is a subtype of `A`func(b::B, a::A) = println("Method BA")func(b1::B, b2::B) = println("Method BB")# In the REPL or in a .jl fileusing PkgA, PkgBab::A = B()b::B = B()func(ab, ab) # Method BBfunc(b, b) # Method BB

In this case,both func(ab, ab) and func(b, b) print Method BB,despite annotating ab to be of type A,because both ab and bare of type B at runtime.

Again,multiple dispatch uses the runtime typesof all input argumentsto determine the method to call.

One of the main resulting benefitsis that composability is easy in Julia.

Composability

Composability is the abilityto use different pieces of code togetherto provide new functionality.

Puzzle pieces

First,we will see an instancewhere composability is difficult to achieve.

Let’s add to our Java example.We will add a class Cthat depends on Abut is independent of B:

public class C{    public static void run(A a1, A a2)    {        a1.func(a2);    }}

And we will add the following lines to main:

C.run(ab, ab); // Method BAC.run(b, b); // Method BA - `C.run` cannot call Method BB!

Notice that,even though b has a runtime type of B,C.run(b, b) sill prints out Method BA.This is because in class C,the static typeof the second input to run is A, not B.There are two ways to get run to print out Method BB:

  1. The author of class C changes their codeto add better support for using objects of type B,e.g., by adding additional methods to run.
  2. The author of class Main reimplements runto take into account the existence of class B.

In either case,code needs to be modifiedfor class B and class Cto compose well.That’s not ideal for composability.

Now let’s see how this works out in Julia.Suppose someone writes a package PkgC.jlthat contains the following:

using PkgArun(a1::A, a2::A) = func(a1, a2)

Then we can update our example:

# In the REPL or in a .jl fileusing PkgA, PkgB, PkgCab::A = B()b::B = B()PkgC.run(ab, ab) # Method BBPkgC.run(b, b) # Method BB

Notice that PkgC.run(b, b) prints Method BB,even though PkgC.jlknows nothing about PkgB.jl!No code changes were necessaryfor PkgB.jl and PkgC.jl to compose.That’s the power of multiple dispatch!

More Tips, Tricks, and General Advice

Now we will discuss various additional aspectsof multiple dispatch in Julia.

More Fine-Tuned Dispatching

Previously,we saw that the methodfunc(x::String, y::Integer, z::Integer)allowed y and z to be of different types(as long as both were Integers).

What if we want a methodwhere y and z need to beof the same type?We can do soby introducing a type parameter:

julia> function func(x::String, y::T, z::T) where T<:Integer # Method 5           println(x, ": y and z are both of type ", T)       endfunc (generic function with 5 methods)julia> func("welcome", 1, true) # Calls method 4"w"julia> func("welcome", 1, 1) # Calls method 5welcome: y and z are both of type Int64

This new method will be calledwhen y and z are both Integersof the same type.

Type parameters are also usefulfor constraining the typescontained in container types.For example,suppose we want a methodthat operates on an Arraythat contains only real numbers.The following method signatureaccomplishes this:

another_func(a::Array{T}) where T<:Real

If T isn’t used anywhere else,the following method signatureworks as well:

another_func(a::Array{<:Real})

Method Ambiguities

When adding different methods,it’s possible to introducemethod ambiguities,where, for a given set of input types,there is no single most specific method.

Recall that method 3 of funchad the following method signature:

func(x::String, y::Integer)

Adding the following methodwill introduce a method ambiguitywhen calling funcwith a String and an Int:

julia> function func(x, y::Int)           println(x, ": ", y)       endfunc (generic function with 6 methods)julia> func("hello", 2)ERROR: MethodError: func(::String, ::Int64) is ambiguous.Candidates:  func(x::String, y::Integer)    @ Main REPL[4]:1  func(x, y::Int64)    @ Main REPL[16]:1Possible fix, define  func(::String, ::Int64)

Notice the list of candidate methodsin the error message;neither is more specific than the other.

  • For the first argument,String is more specific than Any,so the first method is more specific.
  • For the second argument,Int64 is more specific than Integer,so the second method is more specific.

Method ambiguity

To overcome this issue,either remove one of the offending methodsor add the method suggestedin the error message.

Beware of Type Piracy

Another thing to be careful aboutwhen defining methodsis type piracy.Type piracy occurs whenyou add a method to a functionthat isn’t yours(i.e., that you didn’t originally define)using types that aren’t yours.

For example,defining simple_func(x::String)is fine because simple_funcis a new function,so it’s fine to use String,a type that we didn’t define,in the method signature.

Similarly,defining Base.sum(x::MyNewType)is fine because we defined(in this hypothetical scenario)MyNewType ourselves,so it’s fine to add this methodto Julia’s sum function.

But what’s wrong with adding a method likeBase.sum(x::Array)?The problem occurs when someone elsetries to use your code.If they call sum on any Arrays,your new method will be calledinstead of Julia’s built-in sum!That’s a nasty bug waiting to happen.

(If you want to live on the edge,try definingBase.:+(a::Int, b::Int) = error("pirate invasion")and then watch as Julia crashes fantastically.It turns out that being ableto correctly add two integersis necessary for Julia to work properly!)

Summary

In this post,we learned aboutmultiple dispatchin Julia.We discussed how to utilize itand how it compares to single dispatch,typically usedin object-oriented languages.

Do you have any further questionsabout multiple dispatch?Let us know in the comments below!

Want to learn more about Julia?Move on to thenext post to learn about modules and variable scope!Or,feel free to take a lookat out other Julia tutorial posts!

Additional Links

Harnessing the Power of Multiple Dispatch in Julia Programming

By: Great Lakes Consulting

Re-posted from: https://blog.glcs.io/multiple-dispatch

Julia is a relatively new,
free, and open-source programming language.
It has a syntax
similar to that of other popular programming languages
such as MATLAB and Python,
but it boasts being able to achieve C-like speeds.

One of the core design philosophies of Julia
that distinguishes it
from a lot of other languages
is multiple dispatch.
In essence,
multiple dispatch allows
writing multiple implementations
of a single function
based on different input types.
But unlike single dispatch
which relies on the static, compile-time types
of all but one of the inputs,
multiple dispatch chooses what implementation to call
using the runtime types.

It makes writing generic code a cinch,
allowing packages to work together seamlessly
and enabling a high degree of composability.

In this post,
we will learn about
multiple dispatch
in Julia.
We will discuss how to utilize it
and how it compares to single dispatch,
typically used
in object-oriented languages.

This post assumes you already have
a basic understanding
of Julia’s type system.
If you haven’t yet,
check out our earlier
post to learn more.

Functions and Methods

To understand multiple dispatch,
we first need to understand
the difference between functions and methods.

In Julia,
a function essentially is just a name.
This name is used to refer to the function
and to call the function.
What a function does when called
is determined by the methods
associated with the function.
A method is an implementation
of a function
for a given set of input types.

Let’s illustrate this concept with code.

First, we will define a function
with no methods,
just to show that it can be done:

julia> function no_methods end
no_methods (generic function with 0 methods)

When declaring a function, however,
we typically also define an associated method:

julia> function func(x, y) # Method 1
           x + y
       end
func (generic function with 1 method)

Here,
we created the function func
and defined a method
that adds the two function inputs.
Call this method 1.

Now let’s add another method,
method 2.
To do so,
we need to provide type annotations
to the input variables:

julia> function func(x::String, y::String) # Method 2
           x * y
       end
func (generic function with 2 methods)

Notice that we didn’t create
a new function with one method.
Because we used the same function name,
we added a method to the already existing function.

Let’s add one more method,
method 3:

julia> function func(x::String, y::Integer) # Method 3
           x^y
       end
func (generic function with 3 methods)

Methods can also have different numbers of arguments.
The previous methods all had two arguments,
so let’s create a method with three arguments,
method 4:

julia> function func(x::String, y::Integer, z::Integer) # Method 4
           func(x, y)[1:z]
       end
func (generic function with 4 methods)

Now let’s see multiple dispatch in action.

Multiple Dispatch

When func is called,
Julia uses the types of the input arguments
to determine which method to dispatch to.
In particular,
the input types of multiple arguments
(actually, all of them)
are considered,
hence the name multiple dispatch.
If the types of the inputs
match one of the method signatures,
that method will be called.

Multiple dispatch

But what about func("hi", "there")?
In this case,
both inputs are Strings,
so that matches method 2.
But String is a subtype of Any,
so the inputs also match method 1!
In cases like this
where multiple methods match,
the most specific method will be called.
So, in this case,
func("hi", "there") will dispatch to method 2
because String is a more specific type than Any.

If no methods match
or if there is not one single most specific method,
a MethodError will be thrown.

So,
the methods we defined for func
will be called in the following cases:

  • Method 1:
    When two inputs are given
    and no more specific methods exist.
    Examples:

    julia> func(1, 2)
    3
    
    julia> func([3, 4], [5, 6])
    2-element Vector{Int64}:
      8
     10
    
  • Method 2:
    When two String inputs are given.
    Example:

    julia> func("hi ", "there")
    "hi there"
    
  • Method 3:
    When two inputs are given:
    first a String and then an Integer.
    (Remember, though, that abstract types,
    like Integer,
    cannot be instantiated,
    so when we say the second input is an Integer,
    we really mean it is of a type
    that is a subtype of Integer.
    In other words,
    y isa Integer must evaluate to true.)
    Examples:

    julia> func("eat ", 3)
    "eat eat eat "
    
    julia> func("speak ", UInt32(2))
    "speak speak "
    
  • Method 4:
    When three inputs are given:
    first a String, then an Integer,
    and then another Integer.
    (Note that y and z
    do not have to be of the same type,
    i.e., y could be an Int64,
    while z could be a Bool.)
    Examples:

    julia> func("repeat", 5, 6)
    "repeat"
    
    julia> func("first letter", 1, true)
    "f"
    

Now, you may be thinking
that this just looks like method overloading.
And you would be correct
that the above examples
haven’t really demonstrated
the power of multiple dispatch.
So,
let’s compare multiple dispatch
to single dispatch.

Multiple Dispatch vs Single Dispatch

Having many method definitions
is also possible
with single dispatch.
However,
with single dispatch,
the compile-time types of the inputs
(except the first)
are used to decide which method to call.

Single dispatch uses the runtime type
of just the first function argument,
while multiple dispatch uses the runtime types
of all the function arguments.

Consider the following Java example:

abstract public class A
{
    public void func(A a)
    {
        System.out.println("Method AA");
    }
}

public class B extends A
{
    public void func(A a)
    {
        System.out.println("Method BA");
    }

    public void func(B b)
    {
        System.out.println("Method BB");
    }
}

public class Main
{
    public static void main(String[] args)
    {
        A ab = new B();
        B b = new B();

        ab.func(ab); // Method BA
        b.func(b); // Method BB
    }
}

In this example,
even though both ab and b
are instantiated as B objects,
b.func(ab) prints Method BA,
whereas b.func(b) prints Method BB.
This difference is because
ab and b have different compile-time types
(also called declared or static types):
ab is of type A,
whereas b is of type B.

Now let’s compare this to Julia:

# From package PkgA.jl
abstract type A end
func(a1::A, a2::A) = println("Method AA")

# From package PkgB.jl
import PkgA: A, func
struct B <: A end # `B <: A` means `B` is a subtype of `A`
func(b::B, a::A) = println("Method BA")
func(b1::B, b2::B) = println("Method BB")

# In the REPL or in a .jl file
using PkgA, PkgB

ab::A = B()
b::B = B()

func(ab, ab) # Method BB
func(b, b) # Method BB

In this case,
both func(ab, ab) and func(b, b) print Method BB,
despite annotating ab to be of type A,
because both ab and b
are of type B at runtime.

Again,
multiple dispatch uses the runtime types
of all input arguments
to determine the method to call.

One of the main resulting benefits
is that composability is easy in Julia.

Composability

Composability is the ability
to use different pieces of code together
to provide new functionality.

Puzzle pieces

First,
we will see an instance
where composability is difficult to achieve.

Let’s add to our Java example.
We will add a class C
that depends on A
but is independent of B:

public class C
{
    public static void run(A a1, A a2)
    {
        a1.func(a2);
    }
}

And we will add the following lines to main:

C.run(ab, ab); // Method BA
C.run(b, b); // Method BA - `C.run` cannot call Method BB!

Notice that,
even though b has a runtime type of B,
C.run(b, b) sill prints out Method BA.
This is because in class C,
the static type
of the second input to run is A, not B.
There are two ways to get run to print out Method BB:

  1. The author of class C changes their code
    to add better support for using objects of type B,
    e.g., by adding additional methods to run.
  2. The author of class Main reimplements run
    to take into account the existence of class B.

In either case,
code needs to be modified
for class B and class C
to compose well.
That’s not ideal for composability.

Now let’s see how this works out in Julia.
Suppose someone writes a package PkgC.jl
that contains the following:

using PkgA
run(a1::A, a2::A) = func(a1, a2)

Then we can update our example:

# In the REPL or in a .jl file
using PkgA, PkgB, PkgC

ab::A = B()
b::B = B()

PkgC.run(ab, ab) # Method BB
PkgC.run(b, b) # Method BB

Notice that PkgC.run(b, b) prints Method BB,
even though PkgC.jl
knows nothing about PkgB.jl!
No code changes were necessary
for PkgB.jl and PkgC.jl to compose.
That’s the power of multiple dispatch!

More Tips, Tricks, and General Advice

Now we will discuss various additional aspects
of multiple dispatch in Julia.

More Fine-Tuned Dispatching

Previously,
we saw that the method
func(x::String, y::Integer, z::Integer)
allowed y and z to be of different types
(as long as both were Integers).

What if we want a method
where y and z need to be
of the same type?
We can do so
by introducing a type parameter:

julia> function func(x::String, y::T, z::T) where T<:Integer # Method 5
           println(x, ": y and z are both of type ", T)
       end
func (generic function with 5 methods)

julia> func("welcome", 1, true) # Calls method 4
"w"

julia> func("welcome", 1, 1) # Calls method 5
welcome: y and z are both of type Int64

This new method will be called
when y and z are both Integers
of the same type.

Type parameters are also useful
for constraining the types
contained in container types.
For example,
suppose we want a method
that operates on an Array
that contains only real numbers.
The following method signature
accomplishes this:

another_func(a::Array{T}) where T<:Real

If T isn’t used anywhere else,
the following method signature
works as well:

another_func(a::Array{<:Real})

Method Ambiguities

When adding different methods,
it’s possible to introduce
method ambiguities,
where, for a given set of input types,
there is no single most specific method.

Recall that method 3 of func
had the following method signature:

func(x::String, y::Integer)

Adding the following method
will introduce a method ambiguity
when calling func
with a String and an Int:

julia> function func(x, y::Int)
           println(x, ": ", y)
       end
func (generic function with 6 methods)

julia> func("hello", 2)
ERROR: MethodError: func(::String, ::Int64) is ambiguous.

Candidates:
  func(x::String, y::Integer)
    @ Main REPL[4]:1
  func(x, y::Int64)
    @ Main REPL[16]:1

Possible fix, define
  func(::String, ::Int64)

Notice the list of candidate methods
in the error message;
neither is more specific than the other.

  • For the first argument,
    String is more specific than Any,
    so the first method is more specific.
  • For the second argument,
    Int64 is more specific than Integer,
    so the second method is more specific.

Method ambiguity

To overcome this issue,
either remove one of the offending methods
or add the method suggested
in the error message.

Beware of Type Piracy

Another thing to be careful about
when defining methods
is type piracy.
Type piracy occurs when
you add a method to a function
that isn’t yours
(i.e., that you didn’t originally define)
using types that aren’t yours.

For example,
defining simple_func(x::String)
is fine because simple_func
is a new function,
so it’s fine to use String,
a type that we didn’t define,
in the method signature.

Similarly,
defining Base.sum(x::MyNewType)
is fine because we defined
(in this hypothetical scenario)
MyNewType ourselves,
so it’s fine to add this method
to Julia’s sum function.

But what’s wrong with adding a method like
Base.sum(x::Array)?
The problem occurs when someone else
tries to use your code.
If they call sum on any Arrays,
your new method will be called
instead of Julia’s built-in sum!
That’s a nasty bug waiting to happen.

(If you want to live on the edge,
try defining
Base.:+(a::Int, b::Int) = error("pirate invasion")
and then watch as Julia crashes fantastically.
It turns out that being able
to correctly add two integers
is necessary for Julia to work properly!)

Summary

In this post,
we learned about
multiple dispatch
in Julia.
We discussed how to utilize it
and how it compares to single dispatch,
typically used
in object-oriented languages.

Do you have any further questions
about multiple dispatch?
Let us know in the comments below!

Additional Links