Back to the basics: don’t lag behind with your DataFrames.jl skills

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/09/23/lagging.html

Introduction

Several of my recent posts were a bit more advanced, so today I thought to write
something for beginners. A common task that one needs to do in data analysis
pipelines is lagging a variable in time. Today I will show you several patterns
how this can be done using DataFrames.jl and ShiftedArrays.jl. In particular, I
was prompted to write this post by a 2.0 release of the ShiftedArrays.jl package
and it reflects the current state of this package.

In the post I use Julia 1.8.1, DataFrames.jl 1.3.6, and ShiftedArrays.jl 2.0.

Setting up the scene

Let us load the packages that we will use in this post and create some starting
data frame with which we will work:

julia> using DataFrames

julia> using Dates

julia> using ShiftedArrays: lag

julia> df = DataFrame(day=Date.(2020, 9, [1, 2, 4, 5, 1, 2, 3, 6]),
                      id=repeat(1:2, inner=4),
                      v=1:8)
8×3 DataFrame
 Row │ day         id     v
     │ Date        Int64  Int64
─────┼──────────────────────────
   1 │ 2020-09-01      1      1
   2 │ 2020-09-02      1      2
   3 │ 2020-09-04      1      3
   4 │ 2020-09-05      1      4
   5 │ 2020-09-01      2      5
   6 │ 2020-09-02      2      6
   7 │ 2020-09-03      2      7
   8 │ 2020-09-06      2      8

Here an important pattern is using ShiftedArrays: lag. Since 2.0 release
the ShiftedArrays.jl package does not export the lag function. Therefore
to be able to use it without having to write ShiftedArrays.lag you need
to explicitly list it when loading the ShiftedArrays.jl package.

Simple lagging

Assume we want to lag variable v in the whole data frame. for this we can
use the transform function like this:

julia> transform(df, :v => lag)
8×4 DataFrame
 Row │ day         id     v      v_lag
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2        1
   3 │ 2020-09-04      1      3        2
   4 │ 2020-09-05      1      4        3
   5 │ 2020-09-01      2      5        4
   6 │ 2020-09-02      2      6        5
   7 │ 2020-09-03      2      7        6
   8 │ 2020-09-06      2      8        7

If we wanted to lag it by more than one period we could write:

julia> transform(df, :v => (x -> lag(x, 2)) => :v_lag2)
8×4 DataFrame
 Row │ day         id     v      v_lag2
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2  missing
   3 │ 2020-09-04      1      3        1
   4 │ 2020-09-05      1      4        2
   5 │ 2020-09-01      2      5        3
   6 │ 2020-09-02      2      6        4
   7 │ 2020-09-03      2      7        5
   8 │ 2020-09-06      2      8        6

As an exercise, here is an example how to do lagging by different periods
using a comprehension:

julia> transform(df, [:v => (x -> lag(x, i)) => "v_lag$i" for i in 1:3])
8×6 DataFrame
 Row │ day         id     v      v_lag1   v_lag2   v_lag3
     │ Date        Int64  Int64  Int64?   Int64?   Int64?
─────┼─────────────────────────────────────────────────────
   1 │ 2020-09-01      1      1  missing  missing  missing
   2 │ 2020-09-02      1      2        1  missing  missing
   3 │ 2020-09-04      1      3        2        1  missing
   4 │ 2020-09-05      1      4        3        2        1
   5 │ 2020-09-01      2      5        4        3        2
   6 │ 2020-09-02      2      6        5        4        3
   7 │ 2020-09-03      2      7        6        5        4
   8 │ 2020-09-06      2      8        7        6        5

Lagging by group

Simple lagging in our example seems not very useful. The reason is that it looks
that we should lag the :v variable in groups defined by the :id variable.
In DataFrames.jl this is simple, the only thing you need to change in the
above codes is passing groupby(df, :id) to transform:

julia> transform(groupby(df, :id), :v => lag)
8×4 DataFrame
 Row │ day         id     v      v_lag
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2        1
   3 │ 2020-09-04      1      3        2
   4 │ 2020-09-05      1      4        3
   5 │ 2020-09-01      2      5  missing
   6 │ 2020-09-02      2      6        5
   7 │ 2020-09-03      2      7        6
   8 │ 2020-09-06      2      8        7

(the remaining examples can be updated using the same pattern)

Calendar lagging by group

The above example was better, but assumed that we want to lag data by
observations (so we shift a vector disregarding dates). What if we wanted to lag
the data by one calendar day. Here are two ways how you can do it (they slightly
differ in the produced result, so the choice of the approach depends on what you
would want in practice). In both approaches I assume that the passed dates are
unique by group.

In the first approach we want to keep all observations after lagging of the
:v variable and record only days that match some observations (before or after
lagging).

julia> combine(groupby(df, :id)) do sdf
           sort!(outerjoin(select(sdf, Not(:id)),
                           select(sdf,
                                  :day => ByRow(x -> x + Day(1)) => :day,
                                  :v => :v_lag),
                           on=:day), :day)
       end
12×4 DataFrame
 Row │ id     day         v        v_lag
     │ Int64  Date        Int64?   Int64?
─────┼─────────────────────────────────────
   1 │     1  2020-09-01        1  missing
   2 │     1  2020-09-02        2        1
   3 │     1  2020-09-03  missing        2
   4 │     1  2020-09-04        3  missing
   5 │     1  2020-09-05        4        3
   6 │     1  2020-09-06  missing        4
   7 │     2  2020-09-01        5  missing
   8 │     2  2020-09-02        6        5
   9 │     2  2020-09-03        7        6
  10 │     2  2020-09-04  missing        7
  11 │     2  2020-09-06        8  missing
  12 │     2  2020-09-07  missing        8

In this approach in each group defined by the :id column we create a helper
data frame with :day column incremented by one and outer join it with the
original data frame (with :id column dropped), next we sort the result on
:day.

In the second approach we keep only observations in the originally observed
period for a given group, but fill all the potential dates in the considered
period.

julia> combine(groupby(df, :id)) do sdf
           ref_df = DataFrame(day=minimum(sdf.day):Day(1):maximum(sdf.day))
           transform!(leftjoin!(ref_df, select(sdf, Not(:id)), on=:day),
                      :v => lag)
       end
11×4 DataFrame
 Row │ id     day         v        v_lag
     │ Int64  Date        Int64?   Int64?
─────┼─────────────────────────────────────
   1 │     1  2020-09-01        1  missing
   2 │     1  2020-09-02        2        1
   3 │     1  2020-09-03  missing        2
   4 │     1  2020-09-04        3  missing
   5 │     1  2020-09-05        4        3
   6 │     2  2020-09-01        5  missing
   7 │     2  2020-09-02        6        5
   8 │     2  2020-09-03        7        6
   9 │     2  2020-09-04  missing        7
  10 │     2  2020-09-05  missing  missing
  11 │     2  2020-09-06        8  missing

In this approach, again by :id group, we first create a reference data frame
with all dates in the observed period, next we leftjoin! the source data frame
into it (dropping :id as it is not needed), and finally just use the lag
function. Note that in this solution, in comparison to the previous one, rows
for dates 2020-09-06 (group 1) and 2020-09-07 (group 2) are missing, but we
have an extra row 2020-09-05 (group 2) filled with missing.

Conclusions

Lagging of variables is quite often needed in practice, especially in panel
econometrics context. I hope you will find the patterns I have presented in this
post useful in your data analysis projects. Additionally, I have chosen the
examples so that several functions from DataFrames.jl are employed in
combination.