Astonishing performance of lookup in vectors in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/01/fast.html

Introduction

Several years ago I switched to Julia because of its speed. However, every day I
am positively surprised how fast it is.

Today I was writing a simulation where I needed to check if some value was
contained in some collection. Traditionally programmers are recommended to use
the Set structure to perform such operation. However, since I needed to
squeeze out every last bit of performance in my code I investigated all
available options. Julia developers positively surprised me as it seems that
they managed to implement an extremely fast lookup in a standard Vector
type.

Therefore in my post today I decided to share some benchmark results showing
how fast lookup in Vector container is against Set lookup.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.2,
BenchmarkTools.jl 1.3.1, and Plots.jl 1.27.1.

The experiment

In order to test scalability of lookup I decided to test the performance for
collections from size 10 to 40 to see if some trend can be observed. I wanted
to store in them random strings and then perform a check if every value stored
in the collection is indeed present in it.

To ensure fair evaluation of the results I used the @belapsed macro
from the BenchmarkTools.jl package. Here is the test code:

using DataFrames
using BenchmarkTools
using Random
using Plots

f(x, r) = all(in(x), r) # function testing lookup speed

df = DataFrame()
Random.seed!(1234)

for i in 10:40
    v = [randstring(rand(1:1000)) for _ in 1:i] # randomly generate a Vector
    s = Set(v) # turn it to Set for comparison
    @assert length(s) == i # make sure we had no duplicates
    t1 = @belapsed f($s, $v)
    t2 = @belapsed f($v, $v)
    # display the intermetiate results so that I can monitor the progress
    @show (i, t1, t2)
    push!(df, (;i, t1, t2))
end

plot(df.i, [df.t1 df.t2];
     labels=["Set" "Vector"],
     legend=:topleft,
     xlabel="collection size",
     ylabel="time in seconds")

The code produces the following plot:

Benchmarks plot 1

In the plot it seems that performance of function f doing lookup for both
collections scales approximately linearly in the tested range. The lookup in
Set is slower than the lookup in Vector. Additionally, which is positive,
lookup time in Vector has a more stable timing (Set timings are strangely
jagged). Finally, it seems that the difference in the timing increases with
collection size. Let us visualize this:

plot(df.i, df.t1-df.t2;
     labels="Set-Vector",
     legend=:topleft,
     xlabel="collection size",
     ylabel="time difference in seconds")

Benchmarks plot 2

Indeed it seems that the timing difference increased in the tested range of
collection sizes.

Conclusions

It is getting late today, and I do not want to delay my weekly posts. Therefore,
I will do some more testing in the coming week and report you the results in my
next post.

From today’s experiments we can see that in the test setting I used
Vector lookup was significantly faster than Set lookup.

As a side note, it was super easy to implement these benchmarks in Julia.