Author Archives: Blog by Bogumił Kamiński

Best kept secret of modern CPU technology

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/09/16/benchmarking.html

Introduction

When writing complex Julia code we usually want it to run fast.
Here the BenchmarkTools.jl package is one of the useful tools that
helps developers in writing, running, and comparing benchmark results.

In this post I want to discuss how modern CPUs behave when you benchmark
your code by running the same task many times.

In the post I use Julia 1.8.1, BenchmarkTools.jl 1.3.1, and Plots.jl 1.32.1.

Modern CPUs learn the task they are asked to execute

First experiment

Assume we want to benchmark the sum function. Let us run a simple benchmark
executing it ten times on a vector having one million elements:

julia> using Plots

julia> using Random

julia> sum(rand(10^6)); # make sure all is compiled

julia> timings_sum(x) = [@elapsed(sum(x)) for i in 1:10]
timings (generic function with 1 method)

julia> plot(timings_sum(rand(10^6));
            legend=false, xlab="run", ylab="time (sec.)")

The result of this test is shown in this plot:

Single run

To our surprise the first run of the code is slowest, while consecutive runs
are increasingly faster. Things stabilize around fifth run, where code execution
time is over two times faster than on the first run.

Is this pattern consistent? Let us run the experiment ten times:

julia> plot([timings_sum(rand(10^6)) for _ in 1:10];
            legend=false, xlab="run", ylab="time (sec.)")

The result is here:

Ten runs new allocation

There is some instability, but the general trend is consistent. If we run
the sum function repeatedly on the same data it starts to run faster.
Allocating a new vector “resets” the performance.

Let us check if the effect is because of new data or new location of the data.
This time we keep the same vector, but change data stored in it:

julia> refx = rand(10^6);

julia> plot([timings_sum(rand!(refx)) for _ in 1:10];
            legend=false, xlab="run", ylab="time (sec.)")

The result is here:

Ten runs new data

Again there is some instability, but we see that in general it is data location
not values of the data that play a role here.

How does it affect the results of using BenchmarkTools.jl? Let us see:

julia> using BenchmarkTools

julia> @btime sum($refx);
  130.500 μs (0 allocations: 0 bytes)

As expected – we get the timing that reflects the performance on old data, not
on new data. This is something that we might want. For example, if in our
production code we are smart enough not to allocate a new vector every time, but
reuse the existing vectors this is a kind of timing that is interesting for us.

However, often we are more interested in timings on new data. This is the way
how you can run such benchmark:

julia> b = @benchmarkable sum(benchx) setup=(benchx=rand(10^6))
Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(b)
BenchmarkTools.Trial: 2221 samples with 1 evaluation.
 Range (min … max):  357.500 μs …   2.022 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     372.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   442.877 μs ± 214.816 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▄▁▁  ▂                                                 ▂▁
  █████████▆█▇▇▆▆▅▆▃▄▅▃▃▆▃▃▄▄▁▁▁▁▁▁▃▁▁▁▁▁▁▁▅▄▁▃▃▁▄▄▁▄▃▃▅▅▅███▇▅ █
  358 μs        Histogram: log(frequency) by time       1.22 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

In this setup note two things. First we use setup to create new data for
each sample. Second we do one evaluation per sample. It is important to keep it
if we used more evaluations per sample then we would reuse the same vector
many times distorting the results:

julia> b.params.evals = 10
10

julia> run(b)
BenchmarkTools.Trial: 1232 samples with 10 evaluations.
 Range (min … max):  187.810 μs … 632.760 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     210.325 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   226.189 μs ±  55.923 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▆██▇▆▄▄▃▃▂▁▁                                                 ▁
  █████████████▇▇▆▇▆▇██▆▄▁▁▆▄▄▅▁▁▄▁▅▄▅▁▅▅▅▆▄▆▁▁▄▄▆▄▆▅▄▁▄▁▄▁▄▁▁▅ █
  188 μs        Histogram: log(frequency) by time        532 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

One additional comment is that using 1 evaluation per sample makes sense only if
the operation you do is expensive enough to get a proper measurement of its run
time (in practice I usually try to benchmark code that is large enough and meets
this condition).

Second experiment

Now let us benchmark the findall function on a vector having one million
Bool values.

Start with 50% of true and 50% of false:

julia> using Random

julia> findall(trues(10)); # make sure all is compiled

julia> timings_findall(v) = [@elapsed(findall(v)) for i in 1:10]
timings_findall (generic function with 1 method)

julia> x = iseven.(1:1_000_000);

julia> Random.seed!(1234);

julia> y = shuffle(x);

julia> timings_findall(x)
10-element Vector{Float64}:
 0.000711
 0.0007065
 0.0009511
 0.0007604
 0.0007714
 0.0008445
 0.000692
 0.0006092
 0.0078415
 0.0006179

julia> timings_findall(y)
10-element Vector{Float64}:
 0.0007134
 0.0007109
 0.0007077
 0.0006935
 0.0006867
 0.0007165
 0.0007212
 0.0007061
 0.0007018
 0.0007109

julia> @btime findall($x);
  559.900 μs (2 allocations: 3.81 MiB)

julia> @btime findall($y);
  618.900 μs (2 allocations: 3.81 MiB)

We use two vectors. Vector x has true and false in respectively
even and odd locations. While vector y is shuffled.

Running timings_findall does not show a strong downward trend like we saw in
sum. However, there is some other strange thing happening. Benchmark for
vector x runs faster than benchmark for vector y.

Let us re-run the benchmark for the case where we have 1% of true values.
Again vector x will have a pattern, while vector y is shuffled:

julia> x = rem.(1:1_000_000, 100) .== 0;

julia> Random.seed!(1234);

julia> y = shuffle(x);

julia> @benchmark findall($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.600 μs …  4.553 ms  ┊ GC (min … max):  0.00% … 99.08%
 Time  (median):     19.050 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   30.971 μs ± 96.975 μs  ┊ GC (mean ± σ):  10.26% ±  3.41%

  ▂▆██▆▄▃          ▂    ▂▄▄▆▆▇▆▄▃▂           ▁▁               ▃
  ████████▆▅▃▁▁▃▁▅███▇▇█████████████▆▇▆▇▇▆▆▆██████▇▇█████▇▇▆▆ █
  14.6 μs      Histogram: log(frequency) by time      64.1 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

julia> @benchmark findall($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  58.800 μs …  4.784 ms  ┊ GC (min … max): 0.00% … 97.67%
 Time  (median):     66.100 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   75.761 μs ± 87.842 μs  ┊ GC (mean ± σ):  3.45% ±  3.18%

  ▆██▇█▆▅▂▁    ▂▅▆▆▇▆▆▅▄▂▁    ▁ ▁▁                            ▃
  ██████████▇██████████████████████▇██▇█▆▇▃▆▅▆▆▆▆▆▅▃▆▅▆▄▆▄▄▅▅ █
  58.8 μs      Histogram: log(frequency) by time       134 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

Now the performance difference is even bigger.

What would happen if we allocated a fresh x vector each time?

julia> b2 = @benchmarkable findall(benchx) setup=(benchx=rem.(1:1_000_000, 100) .== 0)
Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(b2)
BenchmarkTools.Trial: 5943 samples with 1 evaluation.
 Range (min … max):  14.400 μs …  4.913 ms  ┊ GC (min … max): 0.00% … 99.12%
 Time  (median):     17.800 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.622 μs ± 86.793 μs  ┊ GC (mean ± σ):  8.59% ±  2.86%

  ▂▅▇█▇▅▃▂              ▁▄▄▅▅▅▄▃▂▁                            ▂
  ███████████▇███▆▆▆▃▅▆▇██████████████▆▇▆▆▆▆▆▄▆▅▆▆▃▄▆▄▅▆▄▆▆▆▆ █
  14.4 μs      Histogram: log(frequency) by time        59 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

We can see that the performance now is comparable. So it is not the issue of new
memory location. Now the issue is in the data. My CPU seems to be able to learn
that in the vector x the true and false values form a pattern and takes
advantage of this when executing my code.

Conclusions

CPUs these days are very smart in how they execute the code. They can do things
like RAM caching and branch prediction to improve the run time of
your programs. If you would like to read more about such topics then this
and this are nice places to check out.

The conclusions from the tests we run today are:

  • if your code can be affected by RAM caching make sure to use a proper
    benchmarking approach that matches the use case in your production code
    (i.e. either reuse the same data or use setup option and @benchmarkable)
  • if your code can be affected by branch prediction make sure that the test data
    matches your production data (e.g. if in production you expect data that has
    some structure use a similar structure for testing)

To wrap up I would like to thank Jakob, Kristoffer, Valentin, and Jeffrey
for a discussion on this topic on #becnchmarking channel in Julia Slack.

Julia tutorial during Social Simulation Conference 2022

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/09/09/ssc2022.html

Introduction

During SSC2022 Conference together with Rajith Vidanaarachchi and
Przemysław Szufel I am going to give
An introduction to agent-based simulations in the Julia language
tutorial.

During the tutorial tutorial agent-based models implemented in Julia will be
discussed. Today I thought to post an example of a simple agent-based model
(not covered during SSC2022) that shows one of the challenges of designing
agent-based models.

The post was written under Julia 1.8.0, DataFrames.jl 1.3.5, and Plots.jl 1.32.0.

The problem

I want to create a simulation of genetic drift in Julia.

The idea of the model is as follows. On a rectangular grid we initially randomly
put agents of two types (I will use true and false to denote them). In one
step we pick a random agent and change its type to the type of one of its
neighbors. We will want to check, using simulation, if in the long run all
agents end up having the same color and how long it takes, on the average,
to reach this state.

A crucial design decision when modeling this problem is how we define neighbors
of an agent. Some classical definitions are Moore or von Neumann
neighborhoods on a torus. However, in the experiment today I assume
that if I have an agent in location (x,y) it has four neighbors on diagonals:
(x+1, y+1), (x-1, y+1), (x+1, y-1), (x-1, y-1) (we assume the simulation
is run on a torus, so left and right edge, and top and bottom edges of the grid
are glued together). I want to see what consequence this definition of the
neighborhood has on the results of the simulation.

An initial test

Let us start with an implementation on a grid of size 10 times 10 and plot the
dynamics of the system. Here is the code running 30,000 steps of the simulation
and want to see if we end up with all cells having the same type:

using Random
using Plots
Random.seed!(1234)
m = rand(Bool, 10, 10)
anim = @animate for tick in 1:30000
    x, y = rand(axes(m, 1)), rand(axes(m, 2))
    nx = mod1(x + rand((1, -1)), size(m, 1))
    ny = mod1(y + rand((1, -1)), size(m, 2))
    m[x, y] = m[nx, ny]
    heatmap(m, legend=false, size=(400, 400), title=tick)
end every 100
gif(anim, "anim_fps15.gif", fps=15)

The code saves the result as an animated GIF file that you can see below:

Genetic drift 10x10

Surprisingly, we reach a steady state that consists of 50% of true and 50%
of false cells. This teaches us that one should be careful when designing
ABM models, as small changes in the code can lead to significant changes it
the behavior of the model. The point is that our neighborhood consisting of
four diagonal neighbors defines two disjoint sets of cells on a square having even
size. In von Neumann neighborhood, which also consists of four negihbors
(but top-down and left-right) we would not have such a problem.

Let us investigate our flawed model for different sizes of the grid.

Checking different sizes of a grid

We want to check how often different steady states are reached for different
sizes of the grid and how long it takes to reach these steady states.

Before we can do it we first need to think of some rule that detects that
the simulation has reached a steady state and at the same time is cheap to check.

There are many ways to do it. Let me propose the following reasoning here. If we
have a square toroidal grid whose side is d we have 2d^2 pairs of nodes that
are neighbors. To see this note that we have d^2 agents and each node has 4
neighbors; in other words the grid is a 4-regular graph on d^2 nodes, so it
has 2d^2 edges.

Notice that if we in a continuous sequence visit every edge of this graph at
least once and we always see the same color on both ends of the edge we have
reached a steady state. Additionally, each edge of the graph is visited with
the same probability in each tick. Now assume that we run this process for t
ticks. The probability that we have never picked some concrete edge during this
time at is (1-1/(2d^2))^t. Since we have 2d^2 such edges, by union bound,
the probability that we have not picked at least one of the edges is less than
2d^2(1-1/(2d^2))^t. We want t to be large enough so that this probability is
small. Assume that we want this probability to be less than 1e-6. We thus
need to find a minimum t that satisfies 2d^2(1-1/(2d^2))^t<1e-6 as a
function of d. Using the approximation log(1+x) ≈ x for x close to 0 we
get that t > 2d^2*log(500_000d^2). For example for d=10 we can compute that
t must be at least 3546. In other words if for 3546 steps we do not make
any swaps of agent’s state we are very likely to have reached a steady state.

Using this derivation let us put our simulation in a function:

function runsim(d)
    m = rand(Bool, d, d)
    tick = 0
    lastchange = 0
    t = 2d^2*log(500_000d^2)
    while lastchange < t
        tick += 1
        x, y = rand(axes(m, 1)), rand(axes(m, 2))
        nx = mod1(x + rand((1, -1)), size(m, 1))
        ny = mod1(y + rand((1, -1)), size(m, 2))
        old = m[x, y]
        new = m[nx, ny]
        if old == new
            lastchange += 1
        else
            lastchange = 0
        end
        m[x, y] = new
    end
    class=sum(m)
    if iseven(d)
        @assert allequal(m[i, j] for i in 1:d, j in 1:d if iseven(i+j))
        @assert allequal(m[i, j] for i in 1:d, j in 1:d if isodd(i+j))
    else
        @assert allequal(m)
    end
    return (;d, tick, class)
end

Notice, that I have additionally added @assert checks that make sure
that indeed we have reached a steady state. However, we make this check only
once as it is expensive.

In the output the class value is number of true cells in the grid.

Let us now run this simulation for d ranging from 4 to 11 and 2048 times
for each size of the grid and keep the results in a data frame:

julia> using DataFrames

julia> using Statistics

julia> df = DataFrame([runsim(d) for d in 4:11 for _ in 1:2048])
16384×3 DataFrame
   Row │ d      tick   class
       │ Int64  Int64  Int64
───────┼─────────────────────
     1 │     4    595      8
     2 │     4    556      8
     3 │     4    570      8
   ⋮   │   ⋮      ⋮      ⋮
 16382 │    11  16112      0
 16383 │    11  17230    121
 16384 │    11  20914      0
           16378 rows omitted

First, we want to analyze if indeed for even grid size we have three different
possible outcomes and for odd sized grid we have two possible steady states:

julia> combine(groupby(df, [:d, :class], sort=true), nrow, :tick .=> [mean, std])
20×5 DataFrame
 Row │ d      class  nrow   tick_mean  tick_std
     │ Int64  Int64  Int64  Float64    Float64
─────┼────────────────────────────────────────────
   1 │     4      0    504    604.919     68.1656
   2 │     4      8   1017    605.345     65.5917
   3 │     4     16    527    607.355     69.8784
   4 │     5      0   1015   1309.11     388.342
   5 │     5     25   1033   1341.45     415.868
   6 │     6      0    528   1853.26     425.983
   7 │     6     18   1038   1844.34     409.991
   8 │     6     36    482   1879.57     458.587
   9 │     7      0   1002   4077.81    2059.13
  10 │     7     49   1046   3901.26    1777.57
  11 │     8      0    521   4739.05    1562.39
  12 │     8     32   1027   4620.13    1481.39
  13 │     8     64    500   4717.03    1626.72
  14 │     9      0   1025   9825.73    5347.35
  15 │     9     81   1023  10069.5     5665.56
  16 │    10      0    491  10331.5     4139.39
  17 │    10     50   1075  10233.4     4008.2
  18 │    10    100    482  10469.8     4337.41
  19 │    11      0   1035  21127.0    12885.3
  20 │    11    121   1013  21374.3    13331.7

Indeed this is a case. Moreover we note that:

  • for even-sized grids having a mixed steady state is as likely as having a
    homogenous steady state (this is expected);
  • increasing grid size increases the time to reach the steady state (this is
    expected);
  • time to reach each steady state, for a given d has approximately the same
    distribution (we have calculated mean and standard deviation of this time to
    confirm this; again this is expected).

Finally let us calculate the average time to reach steady state by d and plot
it:

julia> res = combine(groupby(df, :d, sort=true), :tick => mean)
8×2 DataFrame
 Row │ d      tick_mean
     │ Int64  Float64
─────┼──────────────────
   1 │     4    605.757
   2 │     5   1325.42
   3 │     6   1854.93
   4 │     7   3987.64
   5 │     8   4674.04
   6 │     9   9947.51
   7 │    10  10312.6
   8 │    11  21249.3

julia> scatter(res.d, res.tick_mean, legend=nothing, xlab="d", ylab="mean ticks")

The resulting plot is:

mean ticks as a function of d

What is interesting is that there is much bigger increase of the time when we
move from even to odd d than when moving from odd to even d. What is the
reason for this? The reason is that for odd d the agents’ graph is connected.
There is a single component having d^2 elements. While for even d we have
two connected components, each having d^2/2 size. We have just learned that it
is much faster to independently reach a steady state for component of size
d^2/2 twice than to reach a steady state of a single component of size d^2.

Conclusions

I hope you enjoyed the post both from ABM and Julia language perspectives. I
tried to avoid going into too much detail, both from mathematical and
implementation side of the model. If you have any questions please contact me
on Julia Slack or Julia Discourse.

Simple is best: do you really need your code to be fast?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/09/02/speed.html

Introduction

In my post from last week I tried to convince novice Julia users
to focus on writing simple code, and only when they learn more to start
writing fully generic code.

This week I continue my simple is best miniseries of posts. This time
I want to argue that you do not need to try writing code that
is maximally fast. Usually it is enough that your code is just fast
(fortunately in Julia, if you do not do some serious mistake your code
is not likely to be slow). Instead, I recommend that you try your code
to be simple and follow common patterns. In the long run it will be
much easier to read and update such code.

For the post I use DataFrames.jl as an example. It was tested under Julia 1.8.0,
DataFrames.jl 1.3.4, ShiftedArrays.jl 1.0.0, and BenchmarkTools.jl 1.3.1.

An example task

Recently one of the users asked the following question. Given a data frame
with column col add 5 columns to it with lags of column col ranging from
1 to 5.

Here is an example how you can do this task:

julia> using DataFrames

julia> using ShiftedArrays

julia> using BenchmarkTools

julia> df = DataFrame(a=1:10)
10×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5
   6 │     6
   7 │     7
   8 │     8
   9 │     9
  10 │    10

julia> function add_lags(df)
           out_df = copy(df)
           for i in 1:5
               out_df[:, "a$i"] = lag(df.a, i)
           end
           return out_df
       end
add_lags (generic function with 1 method)

julia> add_lags(df)
10×6 DataFrame
 Row │ a      a1       a2       a3       a4       a5
     │ Int64  Int64?   Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────────────────────
   1 │     1  missing  missing  missing  missing  missing
   2 │     2        1  missing  missing  missing  missing
   3 │     3        2        1  missing  missing  missing
   4 │     4        3        2        1  missing  missing
   5 │     5        4        3        2        1  missing
   6 │     6        5        4        3        2        1
   7 │     7        6        5        4        3        2
   8 │     8        7        6        5        4        3
   9 │     9        8        7        6        5        4
  10 │    10        9        8        7        6        5

julia> @btime add_lags($df);
  2.900 μs (50 allocations: 3.26 KiB)

I have additionally added timing of this operation for a reference.
The add_lags function is meant to be an example of “fast code”
(it could still be made faster, or I could have chosen a more complex example
but I wanted to pick something that is easy to understand).

Adding lags using the high-level API function

The functions combine, select[!], transform[!], and subset[!] support
the operation specification syntax
and are called the high-level API in DataFrames.jl.

A natural code for adding lags using the transform function is:

julia> transform(df, ["a" => (x -> lag(x, i)) => "a$i" for i in 1:5])
10×6 DataFrame
 Row │ a      a1       a2       a3       a4       a5
     │ Int64  Int64?   Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────────────────────
   1 │     1  missing  missing  missing  missing  missing
   2 │     2        1  missing  missing  missing  missing
   3 │     3        2        1  missing  missing  missing
   4 │     4        3        2        1  missing  missing
   5 │     5        4        3        2        1  missing
   6 │     6        5        4        3        2        1
   7 │     7        6        5        4        3        2
   8 │     8        7        6        5        4        3
   9 │     9        8        7        6        5        4
  10 │    10        9        8        7        6        5

The problem with this solution that users often raise is that it is slower than
the previous one:

julia> @btime transform($df, ["a" => (x -> lag(x, i)) => "a$i" for i in 1:5]);
  61.600 μs (523 allocations: 27.46 KiB)

There are two questions that one naturally asks. The first is what is the reason
of this timing difference and the second is do we care.

So first let me explain the reason for the slowdown. The transform function is
flexible and allows for many different operations. Therefore it does,
apart from doing core transformations, a lot of extra pre and post processing
that is needed to provide this flexibility.

Now let us try to answer the question if we care. First check a bigger data frame:

julia> df2 = DataFrame(a=1:10^8);

julia> @btime add_lags($df2);
  1.354 s (62 allocations: 4.94 GiB)

julia> @btime transform($df2, ["a" => (x -> lag(x, i)) => "a$i" for i in 1:5]);
  1.318 s (539 allocations: 4.94 GiB)

The timings are roughly the same.
As you can see – the pre and post processing time in transform does not grow
with size of data. Therefore we get the following conclusions:

  • if you have few small data frames it does not matter what you use; things will
    be fast;
  • if you have a large data frame it does not matter what you use; things will
    have a comparable speed;
  • if you have a lot of small data frames – then you should be careful as
    pre and post processing time of transform might end up being a significant
    portion of total run time.

You might think that the code:

["a" => (x -> lag(x, i)) => "a$i" for i in 1:5]

is not so easy to read. This might be true for a newcomer, as this is something
new you need to learn. However, my experience is that after some practice with
DataFrames.jl operation specification language it becomes quite easy to write
and read. You just need to embrace the pattern:

[input column] => [transformation function] => [output column name]

and the rest is a natural consequence (one of the benefits of this pattern
is that it is clear visually as => nicely separates its parts).

What is the benefit of using the high-level API?

The benefit of the high-level API is that most likely the lagging operation we
have considered needs to be done in groups. That is, you have a grouping
variable, call it id, and you want to perform lagging per-id value
(e.g. you have different stocks and their quotations for different days
that you want to lag).

With transform this is a breeze. You just need to add groupby to your code:

julia> df3 = DataFrame(id=repeat(1:2, inner=8), a=1:16)
16×2 DataFrame
 Row │ id     a
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     1      5
   6 │     1      6
   7 │     1      7
   8 │     1      8
   9 │     2      9
  10 │     2     10
  11 │     2     11
  12 │     2     12
  13 │     2     13
  14 │     2     14
  15 │     2     15
  16 │     2     16

julia> transform(groupby(df3, :id),
                 ["a" => (x -> lag(x, i)) => "a$i" for i in 1:5])
16×7 DataFrame
 Row │ id     a      a1       a2       a3       a4       a5
     │ Int64  Int64  Int64?   Int64?   Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────────────────────
   1 │     1      1  missing  missing  missing  missing  missing
   2 │     1      2        1  missing  missing  missing  missing
   3 │     1      3        2        1  missing  missing  missing
   4 │     1      4        3        2        1  missing  missing
   5 │     1      5        4        3        2        1  missing
   6 │     1      6        5        4        3        2        1
   7 │     1      7        6        5        4        3        2
   8 │     1      8        7        6        5        4        3
   9 │     2      9  missing  missing  missing  missing  missing
  10 │     2     10        9  missing  missing  missing  missing
  11 │     2     11       10        9  missing  missing  missing
  12 │     2     12       11       10        9  missing  missing
  13 │     2     13       12       11       10        9  missing
  14 │     2     14       13       12       11       10        9
  15 │     2     15       14       13       12       11       10
  16 │     2     16       15       14       13       12       11

The add_lags function does not obviously allow you to do what you want.
After some thinking (and if you know DataFrames.jl well enough) you come to the
conclusion that the following will work:

julia> transform(groupby(df3, :id), add_lags)
16×7 DataFrame
 Row │ id     a      a1       a2       a3       a4       a5
     │ Int64  Int64  Int64?   Int64?   Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────────────────────
   1 │     1      1  missing  missing  missing  missing  missing
   2 │     1      2        1  missing  missing  missing  missing
   3 │     1      3        2        1  missing  missing  missing
   4 │     1      4        3        2        1  missing  missing
   5 │     1      5        4        3        2        1  missing
   6 │     1      6        5        4        3        2        1
   7 │     1      7        6        5        4        3        2
   8 │     1      8        7        6        5        4        3
   9 │     2      9  missing  missing  missing  missing  missing
  10 │     2     10        9  missing  missing  missing  missing
  11 │     2     11       10        9  missing  missing  missing
  12 │     2     12       11       10        9  missing  missing
  13 │     2     13       12       11       10        9  missing
  14 │     2     14       13       12       11       10        9
  15 │     2     15       14       13       12       11       10
  16 │     2     16       15       14       13       12       11

so things are still easy to code (but need transform).
Let us compare the performance of both operations on a larger data frame:

julia> df4 = DataFrame(id=repeat(1:10, inner=10^7), a=1:10^8);

julia> @btime transform(groupby($df4, :id),
                        ["a" => (x -> lag(x, i)) => "a$i" for i in 1:5]);
  10.124 s (1731 allocations: 19.35 GiB)

julia> @btime transform(groupby($df4, :id), add_lags);
  11.747 s (1552 allocations: 24.59 GiB)

We see that using add_lags in this case is slightly slower. Of course we could
write a custom add_lags function that would work by groups, even without using
groupby but this would be not so easy (I recommend you try it if you are not
convinced).

Here we see the benefit of general design of transform: it handles grouped
data in the same way as it handles a data frame.

Conclusions

Let me summarize my thinking about code performance in Julia in general,
and in DataFrames.jl in particular:

  • If you have small data and small problem – it does not matter how fast your
    code is as it will be fast enough anyway, so you do not have to think about
    speed.
  • If you have large data, then make sure that the solution you use is fast in
    the core of the processing (in DataFrames.jl there are things like: grouping
    data, sorting, joining, reshaping); the pre and post processing that some
    functions (like transform in our case) do will be negligible anyway (this
    logic is something that you most likely already know as every Julia user
    learns that compilation time is negligible when your code runs for several
    hours); it is usually not worth to optimize every part of your code; optimize
    only the parts that are expensive when you scale your computations.
  • For DataFrames.jl there is, in my experience, only one case when you need to
    carefully think how to write your code and which functions to use. This case
    is when you have a lot of (like millions) of small data frames. The reason is
    that then the cost of pre and post processing in functions from the high-level
    API of DataFrames.jl indeed might be an important part of the total runtime of
    your code.