Author Archives: Blog by Bogumił Kamiński

Comparing dplyr vs DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/07/03/dplyr-vs-df.html

Introduction

This time the post is inspired by the proposal of Andrey Oskin
(thank you for submitting it, below I have adapted business problem description
and dplyr source codes that Andrey provided).

Andrey shared with me typical tasks that he is faced with when doing logs
analysis. To make things concrete, assume that you have a site and you collect
users’ clicks. In the output of this process you get a table with two fields:
time of click (ts column below, measured in seconds) and the user identifier
(user_id column below).

Given such data there are natural business questions, that we can ask, like:

  1. How many sessions an average user has?
  2. How many users have exactly two sessions?
  3. Find top 10 users, ordered by the descending number of sessions?
  4. What is the average time between sessions start?

Session in these questions is a more or less arbitrary thing, usually,
it has a meaning of sequence of events that come together as there is
a short time difference between consecutive events. In the examples we
assume that if a user has not clicked on our site for 900 seconds after the last
click the session is over.

What I do in this post is take a toy data set that has this structure and
dplyr codes that Andrey shared with me that answer the business questions
presented above and rewrite them to DataFrames.jl.

The objective of this post is to compare the syntaxes of dplyr and
DataFrames.jl. Therefore neither dplyr nor DataFrames.jl codes were tuned
to be optimal. Rather I have just taken what Andrey proposed in dplyr and
translated it to DataFrames.jl in a way that first came to my mind (but trying
to use piping). However, in the last part of the post I out of curiosity I
decided compare the performance of the codes.

All codes were tested under R version 4.0.2 and dplyr 1.0.0.
For Julia I used version 1.5.0-rc1.0 and packages: DataFrames.jl 0.21.4,
Pipe.jl 1.3.0, and ShiftedArrays 1.0.0. If you do not have much experience
with setting-up Julia project environments, in this post I give
a simple recipe how you can do it easily while ensuring you use exactly the same
versions of the packages as I do.

Setting up the stage

In the first step we load the required packages, create a data frame that
will be used later and sort it by the ts column.

In all examples in this post I first present R code, and then Julia code.
The expected output is shown in a comment. After each step I briefly comment
on the Julia code.

dplyr

library(dplyr)

df <- data.frame(ts = c(1,10,20,1000,1010,1200,2200,2220,30,500,1500,1600),
                 user_id = c(1,1,1,1,1,1,1,1,2,2,2,2)) %>%
    arrange(ts)
df
#      ts user_id
# 1     1       1
# 2    10       1
# 3    20       1
# 4    30       2
# 5   500       2
# 6  1000       1
# 7  1010       1
# 8  1200       1
# 9  1500       2
# 10 1600       2
# 11 2200       1
# 12 2220       1

DataFrames.jl

using DataFrames
using Pipe
using ShiftedArrays
using Statistics

df = @pipe DataFrame!(ts = [1,10,20,1000,1010,1200,2200,2220,30,500,1500,1600],
                      user_id = [1,1,1,1,1,1,1,1,2,2,2,2]) |>
    sort(_, :ts)
# 12×2 DataFrame
# │ Row │ ts    │ user_id │
# │     │ Int64 │ Int64   │
# ├─────┼───────┼─────────┤
# │ 1   │ 1     │ 1       │
# │ 2   │ 10    │ 1       │
# │ 3   │ 20    │ 1       │
# │ 4   │ 30    │ 2       │
# │ 5   │ 500   │ 2       │
# │ 6   │ 1000  │ 1       │
# │ 7   │ 1010  │ 1       │
# │ 8   │ 1200  │ 1       │
# │ 9   │ 1500  │ 2       │
# │ 10  │ 1600  │ 2       │
# │ 11  │ 2200  │ 1       │
# │ 12  │ 2220  │ 1       │

In this step I used two things that are worth learning:

  • A @pipe macro from the Pipes.jl package allows to pass result of the left
    hand side of |> to the right hand side in the position where _ is placed.
    In this case _ is a first argument to sort.
  • I used DataFrame! constructor; the ! in this case means that columns
    passed to a freshly constructed data frame are not copied (by default
    DataFrame constructor copies passed columns for safety).

Compute session identifier for each row of data

So the first task is to identify sessions in our data. For each user
a session_id column gives a number of session for this user, starting from
zero. Remember, that we assume that a fresh session starts for some user, if
two consecutive events for this user are separated by at least 900 seconds.

dplyr

session_df <- df %>%
    group_by(user_id) %>%
    mutate(prev_ts = lag(ts)) %>%
    mutate(diff_ts = ts - prev_ts) %>%
    mutate(diff_ts = ifelse(is.na(diff_ts), 0, diff_ts)) %>%
    mutate(session_start = ifelse(diff_ts >= 900, 1, 0)) %>%
    mutate(session_id = cumsum(session_start)) %>%
    select(-prev_ts, -diff_ts, -session_start)
session_df
# # A tibble: 12 x 3
# # Groups:   user_id [2]
#       ts user_id session_id
#    <dbl>   <dbl>      <dbl>
#  1     1       1          0
#  2    10       1          0
#  3    20       1          0
#  4    30       2          0
#  5   500       2          0
#  6  1000       1          1
#  7  1010       1          1
#  8  1200       1          1
#  9  1500       2          1
# 10  1600       2          1
# 11  2200       1          2
# 12  2220       1          2

DataFrames.jl

session_df = @pipe df |>
    groupby(_, :user_id) |>
    combine(:ts => ts -> begin
        prev_ts = lag(ts)
        diff_ts = ts .- prev_ts
        diff_ts = coalesce.(diff_ts, 0)
        session_start = diff_ts .> 900
        session_id = cumsum(session_start)
        return (ts=ts, session_id=session_id)
    end, _, ungroup=false)
# GroupedDataFrame with 2 groups based on key: user_id
# First Group (8 rows): user_id = 1
# │ Row │ user_id │ ts    │ session_id │
# │     │ Int64   │ Int64 │ Int64      │
# ├─────┼─────────┼───────┼────────────┤
# │ 1   │ 1       │ 1     │ 0          │
# │ 2   │ 1       │ 10    │ 0          │
# │ 3   │ 1       │ 20    │ 0          │
# │ 4   │ 1       │ 1000  │ 1          │
# │ 5   │ 1       │ 1010  │ 1          │
# │ 6   │ 1       │ 1200  │ 1          │
# │ 7   │ 1       │ 2200  │ 2          │
# │ 8   │ 1       │ 2220  │ 2          │
# ⋮
# Last Group (4 rows): user_id = 2
# │ Row │ user_id │ ts    │ session_id │
# │     │ Int64   │ Int64 │ Int64      │
# ├─────┼─────────┼───────┼────────────┤
# │ 1   │ 2       │ 30    │ 0          │
# │ 2   │ 2       │ 500   │ 0          │
# │ 3   │ 2       │ 1500  │ 1          │
# │ 4   │ 2       │ 1600  │ 1          │

Now I could have rewritten the dpyr code to DataFrames.jl in many ways, but
a most natural thing to do it was for me to use the following syntax:

combine(source_column => transformation_function, grouped_data_frame)

With this approach I can conveniently define an anonymous function
within a beginend block and return a (ts=ts, session_id=session_id) value
that is a NamedTuple and will get expanded into two columns of a data frame.

I use ungroup=false syntax to keep the result a GroupedDataFrame to match
what we get in dplyr.

Also, in the code of the function I use the lag function from ShiftedArrays.jl.

Now we have all information to answer our business questions.

How many sessions an average user has?

dplyr

session_df %>%
    summarize(session_num = max(session_id) + 1) %>%
    ungroup() %>%
    summarize(avg_sessions_per_user = sum(session_num) / nrow(.))
# # A tibble: 1 x 1
#   avg_sessions_per_user
#                   <dbl>
# 1                   2.5

DataFrames.jl

@pipe session_df |>
    combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
    combine(_, :session_num => mean => :avg_sessions_per_user)
# 1×1 DataFrame
# │ Row │ avg_sessions_per_user │
# │     │ Float64               │
# ├─────┼───────────────────────┤
# │ 1   │ 2.5                   │

Observe, that in the DataFrames.jl code the first combine is applied
to GroupedDataFrame while the second combine is applied to a DataFrame.

How many users have exactly two sessions?

dplyr

session_df %>%
    summarize(session_num = max(session_id) + 1) %>%
    filter(session_num == 2) %>%
    summarize(number_of_two_session_users = nrow(.))
# # A tibble: 1 x 1
#   number_of_two_session_users
                        <int>
1                           1

DataFrames.jl

@pipe session_df |>
    combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
    filter(:session_num => ==(2), _) |>
    DataFrame(number_of_two_session_users = nrow(_))
# 1×1 DataFrame
# │ Row │ number_of_two_session_users │
# │     │ Int64                       │
# ├─────┼─────────────────────────────┤
# │ 1   │ 1                           │

In this code observe that :session_num => ==(2) syntax means that in the
filter function we pass each element of :session_num column to ==(2)
function, which is a curried version of a standard x == 2 comparison.

Find top 10 users, ordered by the descending number of sessions?

dplyr

session_df %>%
    summarize(session_num = max(session_id) + 1) %>%
    ungroup() %>%
    arrange(desc(session_num)) %>%
    mutate(rn = row_number()) %>%
    filter(rn <= 10) %>%
    select(-rn)
# # A tibble: 2 x 2
#   user_id session_num
#     <dbl>       <dbl>
# 1       1           3
# 2       2           2

DataFrames.jl

@pipe session_df |>
    combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
    sort(_, :session_num, rev=true) |>
    first(_, 10)
# 2×2 DataFrame
# │ Row │ user_id │ session_num │
# │     │ Int64   │ Int64       │
# ├─────┼─────────┼─────────────┤
# │ 1   │ 1       │ 3           │
# │ 2   │ 2       │ 2           │

Here note that in the :session_id => (x -> maximum(x) + 1) => :session_num
expression we have to wrap x -> maximum(x) + 1 in parentheses to get the
correct result (if you would omit it => :session_num would be treated as a
part of an anonymous function definition).

What is the average time between sessions start?

dplyr

session_df %>%
    arrange(ts) %>%
    group_by(user_id, session_id) %>%
    mutate(rn = row_number()) %>%
    filter(rn == 1) %>%
    group_by(user_id) %>%
    mutate(prev_start = lag(ts)) %>%
    filter(!is.na(prev_start)) %>%
    mutate(sess_diff = ts - prev_start) %>%
    ungroup() %>%
    summarize(avg_session_starts = mean(sess_diff))
# # A tibble: 1 x 1
#   avg_session_starts
#                <dbl>
# 1               1223

DataFrames.jl

@pipe session_df |>
    parent |>
    sort(_, :ts) |>
    groupby(_, [:user_id, :session_id]) |>
    combine(first, _) |>
    groupby(_, :user_id) |>
    combine(_, :ts => diff => :sess_diff) |>
    combine(_, :sess_diff => mean => :avg_session_starts)
# 1×1 DataFrame
# │ Row │ avg_session_starts │
# │     │ Float64            │
# ├─────┼────────────────────┤
# │ 1   │ 1223.0             │

Here we show that you can use the parent function to get access to the data
frame of which GroupedDataFrame is a view. Also a common pattern is
combine(first, _) to extract the first row of each group in a
GroupedDataFrame.

Scaling the computations to a larger input data set

In this part I want to check the performance of dplyr and DataFrames.jl codes
that we have just discussed. For this I want to replicate df 5,000,000 times.
In order to avoid having only users number 1 and 2 (and thus to have more groups
to analyze), we will want to create new user ids in an arbitrary fashion.

We will want to have a look what is timing of the considered operations.

dplyr

time_test <- function(df) {
    n <- 5*10^6
    print(system.time(df <- df %>%
        slice(rep(row_number(), n)) %>%
        mutate(user_id = user_id + rep(1:n, each=nrow(df))) %>%
        arrange(ts)
    ))
    print(system.time(
    session_df <- df %>% group_by(user_id) %>% mutate(prev_ts = lag(ts)) %>%
        mutate(diff_ts = ts - prev_ts) %>%
        mutate(diff_ts = ifelse(is.na(diff_ts), 0, diff_ts)) %>%
        mutate(session_start = ifelse(diff_ts >= 900, 1, 0)) %>%
        mutate(session_id = cumsum(session_start)) %>%
        select(-prev_ts, -diff_ts, -session_start)
    ))
    print(system.time(
    session_df %>%
        summarize(session_num = max(session_id) + 1) %>%
        ungroup() %>%
        summarize(avg_sessions_per_user = sum(session_num)/nrow(.))
    ))
    print(system.time(
    session_df %>%
        summarize(session_num = max(session_id) + 1) %>%
        filter(session_num == 1) %>%
        summarize(number_of_one_session_users = nrow(.))
    ))
    print(system.time(
    session_df %>%
        summarize(session_num = max(session_id) + 1) %>%
        ungroup() %>%
        arrange(desc(session_num)) %>%
        mutate(rn = row_number()) %>%
        filter(rn <= 10) %>%
        select(-rn)
    ))
    print(system.time(
    session_df %>%
        arrange(ts) %>%
        group_by(user_id, session_id) %>%
        mutate(rn = row_number()) %>%
        filter(rn == 1) %>%
        group_by(user_id) %>%
        mutate(prev_start = lag(ts)) %>%
        filter(!is.na(prev_start)) %>%
        mutate(sess_diff = ts - prev_start) %>%
        ungroup() %>%
        summarize(avg_session_starts = mean(sess_diff))
    ))
}
time_test(df)
#    user  system elapsed
#   6.909   2.099   9.009
#    user  system elapsed
# 222.398   2.159 224.567
# `summarise()` ungrouping output (override with `.groups` argument)
#    user  system elapsed
#   6.115   0.000   6.116
# `summarise()` ungrouping output (override with `.groups` argument)
#    user  system elapsed
#   6.842   0.000   6.843
# `summarise()` ungrouping output (override with `.groups` argument)
#    user  system elapsed
#   7.409   0.000   7.409
#    user  system elapsed
# 273.404   2.156 275.569

DataFrames.jl

function time_test(df)
    n = 5*10^6
    @time df = @pipe repeat(df, n) |>
        setindex!(_, _.user_id + repeat(1:n, inner=nrow(df)), !, :user_id) |>
        sort(_, :ts)
    @time session_df = @pipe df |>
        groupby(_, :user_id) |>
        combine(:ts => ts -> begin
            prev_ts = lag(ts)
            diff_ts = ts .- prev_ts
            diff_ts = coalesce.(diff_ts, 0)
            session_start = diff_ts .> 900
            session_id = cumsum(session_start)
            return (ts=ts, session_id=session_id)
        end, _, ungroup=false)
    @time @pipe session_df |>
        combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
        combine(_, :session_num => mean => :avg_sessions_per_user)
    @time @pipe session_df |>
        combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
        filter(:session_num => ==(1), _) |>
        DataFrame(avg_sessions_per_user = nrow(_))
    @time @pipe session_df |>
        combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
        sort(_, :session_num, rev=true) |>
        first(_, 10)
    @time @pipe session_df |>
        parent |>
        sort(_, :ts) |>
        groupby(_, [:user_id, :session_id]) |>
        combine(first, _) |>
        groupby(_, :user_id) |>
        combine(_, :ts => diff => :sess_diff) |>
        combine(_, :sess_diff => mean => :avg_session_starts)
    return nothing
end

time_test(df)
#   9.301248 seconds (37.16 M allocations: 17.919 GiB, 10.01% gc time)
#  35.576141 seconds (485.70 M allocations: 24.391 GiB, 8.58% gc time)
#   2.170619 seconds (40.30 M allocations: 1.990 GiB, 21.22% gc time)
#   1.507405 seconds (40.30 M allocations: 1.580 GiB, 8.43% gc time)
#   1.852281 seconds (40.67 M allocations: 1.634 GiB, 7.21% gc time)
#  20.191882 seconds (166.55 M allocations: 22.924 GiB, 8.62% gc time)

As you can see, in the example queries we have investigated DataFrames.jl turns
out to be competitive with dplyr in terms of timing of the operations (let me
stress again that my objective here was not to write the fastest possible codes
in R and Julia that yield the desired results, therefore these values should be
treated lightly).

How to embed project environment setup in your Julia script

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/06/28/automatic-project-environments.html

Project dependency management in Julia

I have already written several posts about project dependency management
practices in Julia. You can find these materials here, here,
and here.

In all these posts I emphasize that one should use project-specific Project.toml
and Manifest.toml files. But the question is what should one do if one wants
to distribute some Julia code as a single .jl file without Project.toml
and Manifest.toml files attached?

Now, you might ask why someone would have a problem with adding Project.toml
and Manifest.toml to ones files when distributing them?
There are several potential reasons:

  • It is easy to forget to attach them to your source code.
  • By accident you might update your Project.toml and Manifest.toml files
    (and unless you use version control of your repository you have a problem;
    sometimes undo command in the Package Manager mode will help you, but not
    in all cases).
  • You store many projects in the same directory (so it gets hard to manage
    Project.toml and Manifest.toml files then) — this scenario happens when
    you have e.g. a collection of small scripts for data preprocessing that you
    reuse across different tasks.
  • If an inexperienced user of Julia gets your source code along with
    Project.toml and Manifest.toml, one might have a problem understanding how to
    properly activate the project environment and (especially when working on
    a fresh installation) run into problems by not running instantiate command
    in the Package Manager.

If any of these scenarios is your use case in this post I explain how to
automatically set up a proper project environment within a Julia script.

The codes were run under Julia 1.4.2.

Embedding project environment setup in your Julia script

I will use here a simple example of a script that depends on DataFrames.jl
and CSV.jl, creates a random DataFrame and writes it to a disk.
So the operation we want to do is the following:

using CSV, DataFrames

CSV.write("random_file.csv", DataFrame(rand(100, 5)))

Assume that I want to use CSV.jl version 0.6.1 and DataFrames.jl version 0.20.1
(note that I picked some old versions of the packages to make sure that indeed
we test that we get what we want).

How to make sure they are properly loaded by the script above? Just add the
following header to it:

using Pkg

cd(mktempdir()) do
    Pkg.activate(".")
    Pkg.add(PackageSpec(name="CSV", version="0.6.1"))
    Pkg.add(PackageSpec(name="DataFrames", version="0.20.1"),
            preserve=PRESERVE_DIRECT)
    Pkg.status()
end

What are important elements of this process:

  • Using mktempdir we create a temporary directory that will be deleted after
    our script terminates (so each time we run the code a new random directory
    will be created and instantiated).
  • As we add packages one-by one and want their specific versions in all but the
    first Pkg.add commands I use preserve=PRESERVE_DIRECT to ensure that the
    Package Manager does not change the version of an already added package.
  • I run Pkg.status to visually make sure that all is installed correctly.
  • Note that Julia normally uses a federated repository of packages; this means
    that when such a script is run several times Julia will just reuse the
    already downloaded packages (so it does not have to fetch packages from the
    Internet every time and the process is relatively fast)
  • Finally, in this way you ensure that it is clear within the script what are
    versions of its dependencies.

So here is a full code of our example (if you would want to copy-paste it
for testing):

using Pkg

cd(mktempdir()) do
    Pkg.activate(".")
    Pkg.add(PackageSpec(name="CSV", version="0.6.1"))
    Pkg.add(PackageSpec(name="DataFrames", version="0.20.1"),
            preserve=PRESERVE_DIRECT)
    Pkg.status()
end

using CSV, DataFrames

CSV.write("random_file.csv", DataFrame(rand(100, 5)))

Here is the output I got when running it (I show everything that is printed
as in this case it is relevant):

julia> using Pkg

julia> cd(mktempdir()) do
           Pkg.activate(".")
           Pkg.add(PackageSpec(name="CSV", version="0.6.1"))
           Pkg.add(PackageSpec(name="DataFrames", version="0.20.1"),
                   preserve=PRESERVE_DIRECT)
           Pkg.status()
       end
 Activating new environment at `/tmp/jl_Mit7P2/Project.toml`
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `/tmp/jl_Mit7P2/Project.toml`
  [336ed68f] + CSV v0.6.1
   Updating `/tmp/jl_Mit7P2/Manifest.toml`
  [336ed68f] + CSV v0.6.1
  [324d7699] + CategoricalArrays v0.7.7
  [34da2185] + Compat v3.12.0
  [9a962f9c] + DataAPI v1.3.0
  [a93c6f00] + DataFrames v0.20.2
  [864edb3b] + DataStructures v0.17.19
  [e2d170a0] + DataValueInterfaces v1.0.0
  [48062228] + FilePathsBase v0.7.0
  [41ab1584] + InvertedIndices v1.0.0
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [682c06a0] + JSON v0.21.0
  [e1d29d7a] + Missings v0.4.3
  [bac558e1] + OrderedCollections v1.2.0
  [69de0a69] + Parsers v1.0.6
  [2dfb63ee] + PooledArrays v0.5.3
  [189a3867] + Reexport v0.2.0
  [a2af1166] + SortingAlgorithms v0.3.1
  [3783bdb8] + TableTraits v1.0.0
  [bd369af6] + Tables v1.0.4
  [ea10d353] + WeakRefStrings v0.6.2
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [9fa8497b] + Future
  [b77e0a4c] + InteractiveUtils
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [1a1011a3] + SharedArrays
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  Resolving package versions...
   Updating `/tmp/jl_Mit7P2/Project.toml`
  [a93c6f00] + DataFrames v0.20.1
   Updating `/tmp/jl_Mit7P2/Manifest.toml`
  [a93c6f00] ↓ DataFrames v0.20.2 ⇒ v0.20.1
Status `/tmp/jl_Mit7P2/Project.toml`
  [336ed68f] CSV v0.6.1
  [a93c6f00] DataFrames v0.20.1

julia> using CSV, DataFrames

julia> CSV.write("random_file.csv", DataFrame(rand(100, 5)))
"random_file.csv"

What are important things to note here:

  • Before you terminate the Julia session the temporary directory will exist
    (in my case its name is /tmp/jl_Mit7P2 as you can see in the output; you can
    check yourself that it is the case on your system).
  • When you exit Julia the directory gets deleted (again — you can make sure
    that this is the case).
  • Initially DataFrames.jl was added in version v0.20.2, and later we downgraded
    its version (so as you can see sequential adding of packages influences the
    versions of packages already added, that is why preserve=PRESERVE_DIRECT
    is an important switch to make sure that the package you added are in correct
    versions).
  • The Package Manager updates registry by looking up
    https://github.com/JuliaRegistries/General.git, so it has some cost, but later
    in my case the packages themselves were not downloaded as they were already
    present in the federated package repository (so the process is quite fast,
    yet still has a non-negligible cost; the additional benefit of doing this
    is that you do not have to run instantiate as adding packages explicitly
    ensures that they get instantiated).

What are the limitations of the proposed approach

There are three details that one should keep in mind when using the method
that I have presented in this post:

  • Sharing Project.toml and Manifest.toml makes Julia recreate the project
    environment exactly. In the process I have described above we only ensure
    that the packages we install have a fixed version. Versions of their recursive
    dependencies will be dynamically resolved by the Package Manager (and thus
    might differ from versions used when the script was created; however, in
    practice this is not a problem unless some of these packages do not correctly
    follow Semantic Versioning).
  • The process that dynamically creates the project environment costs a bit —
    it is of course better to just ship Project.toml and Manifest.toml along your
    files so that you do not have to pay extra start-up time (but if your script
    does some heavy computations this is probably negligible).

Writing an AR(1) process simulator as a module in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/06/22/ar1.html

ABC of modules in Julia

Modules in Julia allow you to create separate variable workspaces.
You can find all the details needed to work with modules in
the Julia Manual.

In this post let me focus on one of the reasons to create modules in your
programs: you can control which names defined in the module get exposed outside
the module.

In order to show this let us create a simple AR(1) process simulator, to see
some story — not just dry definitions. For completeness we will first
introduce a bit of theory of such processes, next we will discuss how you can
define your own module, and finally we will show how we can use it.

This post was prepared under Julia 1.4.2 and is mostly targeted at people who
want to start using Julia for doing computation-heavy operations (therefore I
allow myself to drift away and comment on many things that are are not a core
of the example we give, but are a natural questions that arise in practice when
writing similar codes).

AR(1) model definition

The AR(1) model is a random process , where , and the
support of random variables is , that follows the equation:

where are independent random variables that are normally
distributed with mean and variance .

For the AR(1) process to be stationary we assume .
Also clearly we have , as standard deviation must be non-negative.

We will want to create a simulator of this process in Julia.

In order to implement it we need to derive the distribution of the unconditional
mean and variance of . It is simplest to get them assuming they exist and
are independent from (they do as we take ; I omit some
technical details here).

For expected value we get:

but as the process is stationary so after substituting
.

For variance we get:

where we have used the assumption that is independent from
.
Again, as the process is stationary so after
substituting .

The last question is what is the unconditional distribution of . We know
its mean and variance (and they are both finite), so it is easiest to unroll
the AR(1) process definition recursively times:

Since we know that has a finite variance we see that if we let
then converges in probability to a normal distribution
(here we use the fact that are normally distributed and a sum
of normally distributed variables is also normally distributed). In conclusion
we get that:

Definiing a module with a AR(1) simulator

First let me show a complete way how I would normally define such a simulator.
Below I discuss some of practices that are present in the code.

module AR1

using Random

export ar1!, ar1

const AR1_COMMON_DOC = """
elements following the AR(1)
process where `xₜ₊₁ = c + ϕxₜ + εₜ` and `εₜ ∼ N(0, σ²)`.

If `start` keyword argument is passed then this is the first element of the
returned vector. Otherwise it is initiated randomly to follow the stationary
distribution of the process.


You can pass `rng` keyword argument to specify a custom random number generator
that should be used when generating data.

It is required that |ϕ| < 1 so that the process is stationary.
"""

sample_start(c::Real, ϕ::Real, σ::Real, rng::AbstractRNG) =
    c / (1 - ϕ) + randn(rng) * σ / sqrt(1 - ϕ^2)

"""
    ar1!(v, c, ϕ, σ; rng, start)

Update vector `v` in place by filling it with
$AR1_COMMON_DOC
"""
function ar1!(v::AbstractVector{<:AbstractFloat}, c::Real, ϕ::Real, σ::Real;
              rng::AbstractRNG=Random.default_rng(),
              start::Real=sample_start(c, ϕ, σ, rng))
    σ < 0 && throw(ArgumentError("σ is not allowed to be negative"))
    abs(ϕ) < 1 || throw(ArgumentError("|ϕ| < 1 is required"))
    x = start
    n = length(v)
    if n > 0
        v[1] = x
        for i in 2:n
            x = c + x * ϕ + randn(rng) * σ
            v[i] = x
        end
    end
    return v
end

"""
    ar1(n, c, ϕ, σ; rng, start)

Return a vector containing `n`
$AR1_COMMON_DOC
"""
function ar1(n::Integer, c::Real, ϕ::Real, σ::Real;
             rng::AbstractRNG=Random.default_rng(),
             start::Real=sample_start(c, ϕ, σ, rng))
    return ar1!(Vector{Float64}(undef, n), c, ϕ, σ, rng=rng, start=start)
end

end # module

First let us start with a discussion why we want to use a module for this code
at all. The reason is that inside the module we define a constant
AR1_COMMON_DOC and three functions: sample_start, ar1! and ar1.
However, I do not want AR1_COMMON_DOC nor sample_start to be exposed to
users. They are just an internal implementation detail, therefore only
ar1! and ar1 are exported from the module.

Also the module loads a standard module Random. Though unlikely, some codes
that will want to use my AR1 module might not to have Random module loaded.
Since using Random is wrapped inside the module I can use Random inside
the AR1 module but it is not visible outside the module (unless you explicitly
import it of course).

Now let me comment on some other things that are present in this code (and maybe
you will find these practices useful):

  • I define AR1_COMMON_DOC as a constant string that gets interpolated into
    docstings of ar1! and ar1. This is the part of the documentation that is
    common for the two methods. I have found that if you need to maintain some code
    in the long term it is really useful to do this. Otherwise, especially if
    your code base is large, it is really easy to have a situation that the
    documentation gets out of sync.
  • I define the sample_start function as it is called both in ar1! and ar1
    to set the default value of start parameter. Again, having it extruded makes
    it simpler to maintain the code in the long term (you need to change the code
    only in one place).
  • In ar1! and ar1 I allow the users to pass their own PRNG, which
    is often needed when one wants to use some more advanced techniques of
    stochastic simulation (like e.g. variance reduction). However, I still
    want the user not to have to pass it by default, but use the generator that
    is provided by default by the Random module. Note that by using
    Random.default_rng() on Julia 1.4.2 I am sure that my code is thread safe.
  • I implement ar! function that works in-place and an ar1 function that
    allocates a fresh output vector on each call. This is a common pattern in
    Julia and we will investigate its performance consequences later in this post.
  • Note that in all functions I give the constraints on the allowed types and
    values of their parameters. However, in ar1 I create the initial vector
    as Vector{Float64}(undef, n) (i.e. it has Float64 elements) because
    randn in Julia produces Float64 output by default. Still in ar1!
    I allow v to be AbstractVector{<:AbstractFloat} as in some applications
    the user might want to pass a vector of other type (e.g. located in GPU memory
    and containing Float32 elements; note though that in such a scenario
    probably one could consider replacing randn(rng) calls by its more efficient
    version for performance reasons).
  • In operations in hot loops like v[i] = x in the ar1! function I often see
    people use @inbounds. I personally try to avoid this. The performance gain is
    usually not that big, and it is very common that one thinks that using
    @inbounds is safe, while in practice there is some bug in the code and it is
    not (you can always pass --check-bounds=no switch when starting your Julia
    session to disable bounds checking later).
  • I opted to make rng and start keyword arguments. The reason is that most
    likely the user will want to omit specifying them. Note that the rng is
    defined before start keyword argument, because we use rng to calculate the
    default value of start if it would not be provided by the user.
  • I always use return explicitly in functions defined with function keyword
    argument as in the long term it is more readable.
  • Finally, at least for me, it is very nice that it is really easy to use variable
    names like ϕ or σ in the code (and also type them in the editor or REPL).
    Not only it makes functions match my math formulas, but also the lines are shorter.

Uff, the list was longer than the code :).

Before we start

Before we start make sure that you have the packages that we will installed
in the right versions. We will use: StatsBase.jl v0.32.2 and PyPlot.jl v2.9.0.

First start your Julia REPL with the command, in Linux:

$ JULIA_NUM_THREADS=2 julia

and in Windows

C:\> set JULIA_NUM_THREADS=2
C:\> julia

(what we do here is informing Julia that it can use two threads for
computations — we will soon use it; I have chosen two threads as I assume most
current computers have at least two cores so that everyone should be able to
follow this setting)

You can check if you have the packages in the right versions by switching to package
manager mode in the Julia REPL (press ] to do it) and writing: status
command. If you do not have them or the versions are incorrect while still in the
package manager mode write the following commands (I assume that you do not
have Project.toml and Manifest.toml files in your current working directory):

(@v1.4) pkg> activate .
 Activating new environment at `~/Project.toml`

(bkamins) pkg> add [email protected]
...

(bkamins) pkg> add [email protected]
...

(bkamins) pkg> status
Status `~/Project.toml`
  [d330b81b] PyPlot v2.9.0
  [2913bbd2] StatsBase v0.32.2

(I have removed the output that some commands produce as it is not required
for our purposes).

Now we are all set to test our code. Exit the package manager mode by pressing
a backspace.

Using AR(1) simulator

First copy paste the code containing the AR1 module definition to your Julia REPL.

In order to use the module write:

julia> using .AR1

Note that it is crucial to write a . before the module name as the module
is defined in the current global scope of so called Main module (you should
have seen Main.AR1 printed when you pasted the definition of the module in
the Julia REPL).

You were probably using statements like using Random (like we did in the code
of the AR1 module). This way of loading modules searches for them in standard
library or installed packages. In our case we have a custom module defined only
in this session, so we need to add . to allow Julia to locate it.

After making the using statement we can use ar1 and ar1! functions. Let
us first check the help for ar1 (to make sure it correctly incorporates the
AR1_COMMON_DOC string). Press ? in the Julia REPL, write ar1 and press
enter to get:

help?> ar1
search: ar1 ar1! AR1 @macroexpand1

  ar1(n, c, ϕ, σ; rng, start)

  Return a vector containing n elements following the AR(1) process where
  xₜ₊₁ = c + ϕxₜ + εₜ and εₜ ∼ N(0, σ²).

  If start keyword argument is passed then this is the first element of the
  returned vector. Otherwise it is initiated randomly to follow the
  stationary distribution of the process.

  You can pass rng keyword argument to specify a custom random number
  generator that should be used when generating data.

  It is required that |ϕ| < 1 so that the process is stationary.

We conclude that all looks as expected.

As a fist task let us plot one sample of our simulator output:

julia> using PyPlot

julia> plot(ar1(100, 1, 0.9, 1))

You should get a plot similar to (it will not be identical, as we did not set
the seed of the random number generator; in all the results that follow
you should expect the same situation — they should be similar but probably
not identical to what I show):

AR(1) process sample

Next let us check if indeed for the process we considered. For
this let us generate a longer data set to have the results more accurate:

julia> using StatsBase

julia> autocor(ar1(10^6, 1, 0.9, 1), 1:1)
1-element Array{Float64,1}:
 0.9003360673839191

and we get a value very close to 0.9 as expected.

Now let us discuss why we have cared so much about deriving the formula
for the stationary distribution of . Note that we use this formula
to set the default value of start variable. The consequence should be that
if we generate a random AR(1) vector many times all its elements should have a
mean of around and variance of around
.

Let us check it for a default start:

using Statistics

julia> sim = [ar1(100, 1, 0.9, 1) for i in 1:10^6];

julia> extrema(mean.([getindex.(sim, i) for i in 1:100]))
(9.996782972052698, 10.003866043483425)

julia> extrema(var.([getindex.(sim, i) for i in 1:100]))
(5.2518154979875575, 5.2742608590894475)

and indeed we see that the results are correct.

If we would set start to be equal to the expected mean, but non-random,
then we will get a visible bias in the variance of the first few observations
(but the mean will be correct):

julia> sim = [ar1(100, 1, 0.9, 1, start=10) for i in 1:10^6];

julia> extrema(mean.([getindex.(sim, i) for i in 1:100]))
(9.996503785236944, 10.00565659110953)

julia> extrema(var.([getindex.(sim, i) for i in 1:100]))
(0.0, 5.2751198890482005)

julia> plot(var.([getindex.(sim, i) for i in 1:100]))

and the generated plot of variance looks like this:
AR(1) process sample

Setting start to be biased will similarly bias the first few expectations
in the generated process.

Performance considerations

As a performance test assume that we were interested in running the mean test
but not for but repetitions to get more accurate results.
The problem is that in that case we would need memory roughly of order
bytes, which is 80GiB (not something I can store in
RAM of my laptop).

A basic code that does such a check, while not having to store all the generated
data, could look like this:

function f1(n::Integer)
    s = zeros(100)
    for i in 1:n
        s .+= ar1(100, 1, 0.9, 1)
    end
    return extrema(s) ./ n
end

Let us test it (the first run is supposed to compile the function to give
more accurate results for the second run):

julia> @time f1(100)
  0.000087 seconds (101 allocations: 88.375 KiB)
(9.508060461092601, 10.437892560883078)

julia> @time f1(10^8)
107.395185 seconds (100.00 M allocations: 83.447 GiB, 10.97% gc time)
(9.99928784783692, 10.000313949053089)

So we we need around 100 seconds to finish, and indeed a bit over 80GiB of
data got allocated. Can we do better?

In the first step let us try to avoid some allocations by using ar1! instead
of ar1:

function f2(n::Integer)
    s = zeros(100)
    v = zeros(100)
    for i in 1:n
        s .+= ar1!(v, 1, 0.9, 1)
    end
    return extrema(s) ./ n
end

and test shows the following:

julia> @time f2(100)
  0.000081 seconds (2 allocations: 1.750 KiB)
(9.54394702371765, 10.480846071057933)

julia> @time f2(10^8)
 62.640395 seconds (2 allocations: 1.750 KiB)
(9.999306112039276, 10.000679734746692)

Indeed — we have almost halved the run time.

Let us finally turn to threading (the JULIA_NUM_THREADS=2 option that was
quietly waiting for being used from the beginning of our session):

function f3(n::Integer)
    s = [zeros(100) for _ in 1:Threads.nthreads()]
    v = [zeros(100) for _ in 1:Threads.nthreads()]
    Threads.@threads for i in 1:n
        tid = Threads.threadid()
        s[tid] .+= ar1!(v[tid], 1, 0.9, 1)
    end
    return extrema(sum(s)) ./ n
end

Let us check if all went as we expected:

julia> Threads.nthreads()
2

julia> @time f3(100)
  0.000137 seconds (21 allocations: 6.078 KiB)
(9.452653560199144, 10.31887937207082)

julia> @time f3(10^8)
 36.220213 seconds (21 allocations: 6.078 KiB)
(9.999318657341107, 10.000619565457022)

Using 2 threads we have almost halved the computation time.
Note that this time we have allocated separate vectors for each thread and
accessed them using the tid variable to avoid race condition issues.

Most common problem when using modules

Before we finish let us comment on a thing that people often find confusing
then working with modules. Here we will just use the modules that are present
by default in a standard Julia REPL session: Main (where all your work happens)
and Base (containing the definitions of standard functions you can normally
use without importing anything). In particular functions sin and cos are
defined in Base module. Now consider the following example:

julia> sin(1)
0.8414709848078965

julia> sin = 2
ERROR: cannot assign a value to variable Base.sin from module Main
Stacktrace:
 [1] top-level scope at REPL[5]:1

julia> cos = 2
2

julia> cos(1)
ERROR: MethodError: objects of type Int64 are not callable
Stacktrace:
 [1] top-level scope at REPL[7]:1

julia> Base.cos(1)
0.5403023058681398

In the case of sin you first call sin(1) which makes Julia associate name
sin in module Main with the function sin defined in module Base.
So later when trying to set a value of 2 to sin you get an error (name sin
is now bound to a function and you are not allowed to change its meaning).

For the other case for cos you give it a value in module Main first. From
that moment cos is bound to 2 (of course you could change this binding
later), which overshadows the function with name cos from module Base. As
you can see in such a case to call cos from Base you have to qualify its
name and write Base.cos(1).

In short — when importing names from other modules be careful about using
duplicate names (it is best just not to use the names you decided to import
from other modules).


 

If you are still with me — thank you for reading the post and I hope you
enjoyed it!