Author Archives: Blog by Bogumił Kamiński

Astonishing performance of lookup in vectors in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/01/fast.html

Introduction

Several years ago I switched to Julia because of its speed. However, every day I
am positively surprised how fast it is.

Today I was writing a simulation where I needed to check if some value was
contained in some collection. Traditionally programmers are recommended to use
the Set structure to perform such operation. However, since I needed to
squeeze out every last bit of performance in my code I investigated all
available options. Julia developers positively surprised me as it seems that
they managed to implement an extremely fast lookup in a standard Vector
type.

Therefore in my post today I decided to share some benchmark results showing
how fast lookup in Vector container is against Set lookup.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.2,
BenchmarkTools.jl 1.3.1, and Plots.jl 1.27.1.

The experiment

In order to test scalability of lookup I decided to test the performance for
collections from size 10 to 40 to see if some trend can be observed. I wanted
to store in them random strings and then perform a check if every value stored
in the collection is indeed present in it.

To ensure fair evaluation of the results I used the @belapsed macro
from the BenchmarkTools.jl package. Here is the test code:

using DataFrames
using BenchmarkTools
using Random
using Plots

f(x, r) = all(in(x), r) # function testing lookup speed

df = DataFrame()
Random.seed!(1234)

for i in 10:40
    v = [randstring(rand(1:1000)) for _ in 1:i] # randomly generate a Vector
    s = Set(v) # turn it to Set for comparison
    @assert length(s) == i # make sure we had no duplicates
    t1 = @belapsed f($s, $v)
    t2 = @belapsed f($v, $v)
    # display the intermetiate results so that I can monitor the progress
    @show (i, t1, t2)
    push!(df, (;i, t1, t2))
end

plot(df.i, [df.t1 df.t2];
     labels=["Set" "Vector"],
     legend=:topleft,
     xlabel="collection size",
     ylabel="time in seconds")

The code produces the following plot:

Benchmarks plot 1

In the plot it seems that performance of function f doing lookup for both
collections scales approximately linearly in the tested range. The lookup in
Set is slower than the lookup in Vector. Additionally, which is positive,
lookup time in Vector has a more stable timing (Set timings are strangely
jagged). Finally, it seems that the difference in the timing increases with
collection size. Let us visualize this:

plot(df.i, df.t1-df.t2;
     labels="Set-Vector",
     legend=:topleft,
     xlabel="collection size",
     ylabel="time difference in seconds")

Benchmarks plot 2

Indeed it seems that the timing difference increased in the tested range of
collection sizes.

Conclusions

It is getting late today, and I do not want to delay my weekly posts. Therefore,
I will do some more testing in the coming week and report you the results in my
next post.

From today’s experiments we can see that in the test setting I used
Vector lookup was significantly faster than Set lookup.

As a side note, it was super easy to implement these benchmarks in Julia.

Back to the basics: array literals in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/03/25/arrays.html

Introduction

Array literals are one of the basic constructs in the Julia language
that essentially every developer learns during the first session.
However, the exact mechanism of how these literals work are complex if one
wants to understand them in full.

In this post I want to give several examples how the array literals in Julia
work and highlight upcoming changes in the Julia language that might affect
your code, if you happen to use them.

Expect that the material I present today is more advanced than usual,
but I think the topics I cover are relevant for any Julia developer writing
production code.

This post was written using Julia 1.7.0 and Julia 1.9.0-DEV.245.

The rules

In the Julia Manual section on array literals you can read that:

Arrays can also be directly constructed with square braces;
the syntax [A, B, C, ...] creates a one dimensional array (i.e., a vector)
containing the comma-separated arguments as its elements.
The element type (eltype) of the resulting array is automatically determined
by the types of the arguments inside the braces.
If all the arguments are the same type, then that is its eltype.
If they all have a common promotion type then they get converted to that type
using convert and that type is the array’s eltype.
Otherwise, a heterogeneous array that can hold anything — a Vector{Any}
is constructed; this includes the literal [] where no arguments are given.

In short this rule means that:

  • when you write [A, B, C] Julia checks types of A, B, and C;
  • if these types are equal then a Vector having this type is created;
  • if these types are not equal but can be promoted to a common type then
    this common type is used as element type of an array;
  • otherwise an Any element type is used.

Let us see these rules in action.

The first case is when all passed elements have the same element type:

julia> [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

Here we passed three integers, so element type of created vector is Int64.

Now let us see a second rule at work. We pass some values that have a common
promotion type:

julia> [true, 2.0, 3]
3-element Vector{Float64}:
 1.0
 2.0
 3.0

We have passed a Bool, Float64, and Int64 value and they all got converted
to Float64, because it is their common promotion type.

Sometimes the result can have a type that is not even any of the types of passed
elements:

julia> [big(1), 2.0, 3.0]
3-element Vector{BigFloat}:
 1.0
 2.0
 3.0

This time we get a conversion to BigFloat as this is a common promotion type
of BigInt and Float64.

Finally sometimes a common promotion type is not concrete in which case
no conversion takes place:

julia> [[1, 2], ["a", "b"]]
2-element Vector{Vector}:
 [1, 2]
 ["a", "b"]

In this case Vector{Int64} and Vector{String} have a common promotion type
Vector which is UnionAll so no conversion of elements occurred.

The last case in our set of rules was that there is no common promotion type
for the elements of created vector. In this case the result has Any element
type:

julia> [1, "2", '3']
3-element Vector{Any}:
 1
  "2"
  '3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)

In this example Int64, String, and Char do not have a common promotion
type.

Upcoming changes in how array literals work

The codes given above work the same way under Julia 1.7 and Julia nightly.
Now we are getting to a territory when differences will be present.

Start a Julia 1.7 session in your terminal and run the following code:

julia> [1:2, [1, 2]]
2-element Vector{AbstractVector{Int64}}:
 1:2
 [1, 2]

julia> [1:2, ["a", "b"]]
2-element Vector{AbstractVector}:
 1:2
 ["a", "b"]

As you can see the resulting array has an abstract element type.
In particular no conversion of the elements of the array literal happened.
The reason of this behavior is the return value of the promote_type function:

julia> promote_type(typeof(1:2), typeof([1, 2]))
AbstractVector{Int64} (alias for AbstractArray{Int64, 1})

julia> promote_type(typeof(1:2), typeof(["a", "b"]))
AbstractVector (alias for AbstractArray{T, 1} where T)

As you can see the result of the pomote_type is used as an element type of the
created vectors.

Now switch to current Julia nightly build (I used Julia 1.9.0-DEV.245) and
run the same code:

julia> [1:2, [1, 2]]
2-element Vector{Vector{Int64}}:
 [1, 2]
 [1, 2]

julia> [1:2, ["a", "b"]]
2-element Vector{Vector{Any}}:
 [1, 2]
 ["a", "b"]

As you can see the result is different. Now elements of array literals are
converted to concrete Vector{Int64} and Vector{Any} types respectively.

Let us check the result of promote_type:

julia> promote_type(typeof(1:2), typeof([1, 2]))
Vector{Int64} (alias for Array{Int64, 1})

julia> promote_type(typeof(1:2), typeof(["a", "b"]))
Vector{Any} (alias for Array{Any, 1})

It looks that promotion rules have hanged since Julia 1.7. Indeed they were
updated in this PR. The PR introduces a rule that states that if no
promote_rule is defined for two arrays then typejoin on their eltypes is run
instead and an Array having this element type as the container is returned.
Here is the crucial line of code that was added in this PR:

promote_result(::Type{<:AbstractArray{T,n}}, ::Type{<:AbstractArray{S,n}},
               ::Type{Bottom}, ::Type{Bottom}) where {T,S,n} =
    (@inline; Array{promote_type(T,S),n})

However, let me recall you the earlier example we have given
(run on Julia nightly):

julia> [[1, 2], ["a", "b"]]
2-element Vector{Vector}:
 [1, 2]
 ["a", "b"]

Since this time a common promotion type exists, and is Vector
no conversion is done and the resulting container does not have a concrete
element type (as opposed to e.g. [1:2, ["a", "b"]] shown above).

Why is this relevant? As you can see, when you migrate from Julia 1.7 to
any Julia release of Julia that has this PR incorporated you must be aware
that your old code might stop working the same way it did if it used
array literals that contained arrays inside.

Conclusions

There is one key takeaway from my examples.

If in your current code you use array literals to store arrays inside them
please review them before upgrading to latest Julia.

In particular if you do not want an implicit conversion use an element type
prefix in front of the array literal for example like this:

julia> Any[[1, 2], ["a", "b"]]
2-element Vector{Any}:
 [1, 2]
 ["a", "b"]

julia> AbstractVector[1:2, [1, 2]]
2-element Vector{AbstractVector}:
 1:2
 [1, 2]

Passing an explicit element type prefix is a practice that I have currently
adopted in all of my production codes as this is the safest way to make sure my
programs run exactly as I want.

Julia for Data Analysis book preview

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/03/18/manning.html

Introduction

One of the projects I am currently working on is writing the
Julia for Data Analysis book. In this post I want to give you a brief
overview of the contents of the book and its current status.

What is the objective of the book?

The book is aimed at data scientists with some programming experience wanting
to learn how to do data analysis in Julia. Some previous experience with Julia
would be a plus, but it is not strictly required.

Since the book is aimed as an entry-level introduction to data analysis in Julia
I have divided it into two parts:

  • Part 1 is teaching you the essential Julia skills you need to learn to
    confidently use the language.
  • Part 2 is focused on data analysis.

I do not assume that you know the Julia language and in Part 1 explain the basic
components of the language from the very beginning. However, I assume that you
have some experience with programming (e.g. in R or Python).

Part 2 focuses on data analysis in Julia. In it I concentrate a lot on
the functionalities of the DataFrames.jl package.

It is impossible to cover every aspect of data analysis in Julia in a single
book. Therefore, my goal was to discuss all essential material that will
allow you to later confidently learn more advanced things on your own, while
being convenient that you have a firm grasp of the fundamental concepts.

What is the scope of the book?

Most of the chapters in the book are project oriented. I introduce new concepts
and functionalities of the Julia ecosystem by showing how they can be used to
solve practical problems.

Since the scope of the book is quite wide let me here list selected technical
topics that I discuss in it which, I think, are useful even for people who
already know Julia:

  • working with source data in various formats: CSV, Apache Arrow, SQLite, JSON,
    serialization, parsing data that has non-standard format;
  • getting data from external sources (like: downloading data, unpacking
    compressed data, getting data using HTTP requests, finishing with writing
    a web service on your own);
  • solving typical issues encountered when pre-processing data: efficient
    handling of strings, using categorical values,
    working with time-series, understanding missing values.
  • coverage of key functionalities of the DataFrames.jl package and
    DataFramesMeta.jl package (creating data frames, transforming them,
    split-apply-combine strategy, joining, sorting, subsetting, reshaping);
  • creating plots that visualize the results of your analysis using the Plots.jl
    package;
  • building simple predictive models (linear regression, logistic regression,
    LOESS) and assessment of the model performance;
  • integrating Julia code with R and Python.

Notably, I have not covered in this book more advanced topics on machine
learning. The reason is that there are too many options available, so
this would be a material for another book. However, as I have already mentioned,
the book is written in a way so that after reading it you should be able to
easily learn the functionalities of the concrete packages that provide such
functionalities, like e.g. MLJ.jl, yourself.

For example, after reading this book you might want to check out my
Hands-on Data Science with Julia live project that gives you an
introduction to machine learning with Julia.

Where to get the book?

The book will be published by Manning. This week its first two chapters
have been made available in MEAP (Manning Early Access Platform).

You can find a preview of the book under this link.

If you are interested in the topic I encourage you to visit the website,
as MEAP program gives the readers an opportunity to give feedback about the
material I have prepared before the final version of the book is published.

What is the status of the book?

All core chapters have been already written. Now we are going through a review
and publishing process. I will write another post when the book is finalized.

However, since all the technical material is already prepared, you can get the
source codes of all chapters from
GitHub.
Codes are shared there are under MIT license so you can freely reuse them.

Conclusions

Today, as a conclusion, let me pick one example code from the book (codes are
also available on GitHub).

The example code produces a plot that compares speed and code size of Julia,
Python, Java, and C on 10 standard problems taken from the
The Computer Language Benchmarks Game website.

The example is self-contained and was tested under Julia 1.7.0,
DataFrames.jl 1.3.2, CSV.jl 0.10.3, and Plots.jl 1.27.1.

data = """
problem,language,time,size
n-body,c,2.13,1633
mandelbrot,c,1.3,1135
spectral norm,c,0.41,1197
fannkuch-redux,c,7.58,910
fasta,c,0.78,1463
k-nucleotide,c,3.96,1506
binary-trees,c,1.58,809
reverse-complement,c,0.41,1965
pidigits,c,0.56,1090
regex-redux,c,0.8,1397
n-body,Java,6.77,1489
mandelbrot,Java,4.1,796
spectral norm,Java,1.55,756
fannkuch-redux,Java,10.48,1282
fasta,Java,1.2,2543
k-nucleotide,Java,4.83,1812
binary-trees,Java,2.51,835
reverse-complement,Java,1.57,2183
pidigits,Java,0.79,764
regex-redux,Java,5.34,929
n-body,Python,541.34,1196
mandelbrot,Python,177.35,688
spectral norm,Python,112.97,407
fannkuch-redux,Python,341.45,950
fasta,Python,36.9,1947
k-nucleotide,Python,46.31,1967
binary-trees,Python,44.7,660
reverse-complement,Python,6.62,814
pidigits,Python,1.16,567
regex-redux,Python,1.34,1403
n-body,Julia,4.21,1111
mandelbrot,Julia,1.42,619
spectral norm,Julia,1.11,429
fannkuch-redux,Julia,7.83,1067
fasta,Julia,1.13,1082
k-nucleotide,Julia,4.94,951
binary-trees,Julia,7.28,634
reverse-complement,Julia,1.44,522
pidigits,Julia,0.97,506
regex-redux,Julia,1.74,759
"""

using CSV
using DataFrames
using Plots

df = CSV.read(IOBuffer(data), DataFrame)

plot(map([:time, :size],
         ["execution time (relative to C)",
          "code size (relative to C)"]) do col, title
    df_plot = unstack(df, :problem, :language, col)
    df_plot[!, Not(:problem)] ./= df_plot.c
    select!(df_plot, Not(:c))
    scatter(df_plot.problem, Matrix(select(df_plot, Not(:problem)));
            labels=permutedims(names(df_plot, Not(:problem))),
            ylabel=title,
            yaxis = col == :time ? :log : :none,
            xrotation=20,
            markershape=[:rect :diamond :circle],
            markersize=[4 5 5],
            markercolor=[:lightgray :lightgray :gold],
            xtickfontsize=7, ytickfontsize=7,
            legendfontsize=7, ylabelfontsize=7)
            hline!([1.0]; color="orange", labels="C")
end...)

(the code is not easy, but after reading the whole book you should be able to
confidently read it and create a similar implementation yourself)

The figure produced by this code looks as follows.

Benchmarks plot

If you would like to read a complete interpretation of the plot please check
Chapter 1 in Julia for Data Analysis book. Here let me just summarize
that Julia runs the code fast (left pane) and at the same time is convenient to
use (as measured by the code size; right pane).