Author Archives: Blog by Bogumił Kamiński

Is DataFrames.jl Hamiltonian?

Re-posted from: https://bkamins.github.io/julialang/2022/09/30/metadata.html

Introduction

Hamilton is an interesting general purpose micro-framework for
creating dataflows from python functions. It has been developed by Stitch Fix
to help managing complex DAGs, where the resulting data frames are wide
(1000+ columns).

While DataFrames.jl is not a framework for DAG execution it is a natural tool
for performing single steps in such processes. When I was reading the
explanation of motivation and design of Hamilton I was struck
by the fact that it shares similarities with some concepts in DataFrames.jl.

In this post I want to discuss some of these principles. Additionally, I want to
highlight how I believe that having metadata support in DataFrames.jl nicely
combines with them.

In the post I use Julia 1.8.2, and DataFrames.jl 1.4.1.

The principles

I do not list here all design choices that Hamilton makes. Instead I
want to discuss ideas that are similar between this framework and DataFrames.jl.

These concepts are simple:

if you use some column name it should have a single meaning in your pipeline;
every transformation should be a function with a name.

Such an approach helps with understanding of the code, its maintenance,
and documentation of the pipeline.

Let me discuss these two concepts one by one in the context of DataFrames.jl
design.

Single column name has a single meaning

This rule seems pretty intuitive but is often violated in practice. For example
users often transform a column (e.g. by taking log) and still keep its name.
The general recommendation is that such practices should be avoided. If you
significantly transform a column a recommended thing to do is to give it a new
name. In this way you are sure that you will not be confused later.

As an additional benefit of following this rule you can safely annotate your
columns with metadata to keep important information about their meaning. In
DataFrames.jl column-level metadata design works as follows:

you can add a key–value pair as metadata to any column of a data frame;
you can choose one of two styles of metadata propagation; it is either :note
(signaling it is safe to retain metadata under certain transformations) and :default
(signaling that metadata should be stripped after any transformation).

Let us see it in action. You can use colmetadata! to set column-level metadata
and colmetadata to retrieve it:

julia> using DataFrames

julia> df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])
3×2 DataFrame
 Row │ sales    year
     │ Float64  Int64
─────┼────────────────
   1 │     1.2   2001
   2 │     3.4   2002
   3 │     2.1   2003

julia> colmetadata!(df, :sales, "label", "Yearly sales in USD", style=:note);

julia> colmetadata!(df, :year, "label", "Fiscal year (ending on June 30)", style=:note);

julia> colmetadata!(df, :sales, "sum", sum(df.sales), style=:default);

julia> colmetadata(df, :sales, "label")
"Yearly sales in USD"

julia> colmetadata(df, :year, "label")
"Fiscal year (ending on June 30)"

julia> colmetadata(df, :sales, "sum")
6.699999999999999

You might ask what is the use of :note and :default metadata distinction.

Most of the time metadata of :default style are things that are specific to
only a given instance of a data frame. An example of such metadata is column
creation time. In some scenarios having such information might be used, e.g.
for auditing purposes. However, such information should not be propagated under
transformations.

On the other hand :note meatadata is kept when data frame is transformed.
The rules, in short, are:

if the column is not changed under transformation (data it contains remains
unchanged) :note-style metadata is kept;
if you change column data, but decide to keep the original column name
metadata is also kept.

What is the rationale behind these rules? We could be super strict and say that
metadata is propagated only if the column data does not change and its name and
does not change. However, such a rule in practice seems to be overly strict.
When you get some source data you usually might want to e.g. rename the column
to some more descriptive name or perform data cleaning (e.g. dropping rows with
missing values). In such cases users usually feel that the contents of the colum
conceptually remains the same. However, some information about data in the
column might change, like for example its length. Therefore :note-style
metadata should not be used to store information that is strictly attached to a
concrete instance of a column (:default-style metadata should be used
instead).

Let us now look back at our example code.
A typical :note-style metadata is descriptive label of a column. We used
"label" key for it. Now we made a "sum" metadata to have style :default.
Someone might have wanted to store the sum information for easy access later.
However, such metadata should not be propagated under any transformation of
a data frame as it might potentially be invalidated, so we made its style
:default.

If you ask why such metadata might be useful have a look at this example:

julia> show(df, header=colmetadata.(Ref(df), names(df), "label"))
3×2 DataFrame
 Row │ Yearly sales in USD  Fiscal year (ending on June 30)
─────┼──────────────────────────────────────────────────────
   1 │                 1.2                             2001
   2 │                 3.4                             2002
   3 │                 2.1                             2003

Note that we have substituted column names (which might be not clear for end
users) with descriptive labels.

The metadata propagation rules work nice and will help you documenting your
data frame assuming that you follow the first principle: do not use a single
column name to store different information.

Every transformation should be a function with a name

The principle to define named functions for transformations, combined with
the principle of single name for single meaning, is a core of
operation specification syntax in DataFrames.jl.

Consider the following example code:

julia> usd2¢(x) = 100x
usd2¢ (generic function with 1 method)

julia> transform(df, :sales => usd2¢)
3×3 DataFrame
 Row │ sales    year   sales_usd2¢
     │ Float64  Int64  Float64
─────┼─────────────────────────────
   1 │     1.2   2001        120.0
   2 │     3.4   2002        340.0
   3 │     2.1   2003        210.0

Note that the auto-generated column name sales_usd2¢ gives us information
about a source column and transformation function applied to it. Of course you
could use a custom output name. However, using automatically generated column
name, assuming that you pass a names function as a transformation, has a big
benefit that it helps in automatic visual documentation of the transformation
applied to data (and looking up the code of the transformation function used).
This is a big feature if you work with hundreds or thousands of columns.

As with the previous principle this works if you follow a simple rule –
define transformations you want to apply as functions with a name.

Conclusions

In this post I have highlighted two principles that are useful when implementing
complex data transformation jobs:

if you use some column name it should have a single meaning;
every transformation should be a function with a name.

DataFrames.jl was designed to work nicely if you want to follow these rules, as
I have tried to explain in this post. However, nothing is carved in stone in
DataFrames.jl. You can write your code whatever way you want and nothing is
forced on the user. This is especially important in quick-and-dirty one time
interactive sessions, where users want flexibility.

If you would like to learn more about the design of operation specification
syntax I recommend you start with JuliaCon 2022 tutorial.

Similarly, we tried to design handling of metadata in a way that is easy to
understand and convenient to use. You can find more details about metadata
support in DataFrames.jl in Metadata section of the DataFrames.jl
documentation. I also plan to write a separate post in which I will give a more
in-depth example how metadata can be used in DataFrames.jl. Also there is a plan
to release TableMetadataTools.jl package, which will add even more
convenience to most common operations.

Knight covering puzzle

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/09/30/knights.html

Introduction

Recently David Amos started posting interesting puzzles on Twitter. One of them
can be found here and asks to find the smallest number of knights that
you need to place on a chess board so that every square is either occupied by a
knight or attacked by a knight.

It looks as an interesting combinatorial problem. However, if one is in a rush
(e.g. I am usually quite busy with fixing bugs DataFrames.jl :)),
one wants to find a solution without having to write a lot of code and
implementing complex algorithms.

We notice that the posed problem is from optimization domain, so using
JuMP.jl to solve it immediately comes to ones mind.
Let us give it a shot!

In the post I use Julia 1.8.1, Cbc.jl 1.0.1, and JuMP.jl v1.3.1.

Solution

In the solution I will use 1 to 8 coordinates for both dimensions of the board
(the original Tweet asked for standard chess coordinates, like e.g. b3 or e5,
but they will not be needed in the solution so I skip this requirement).

First let us load the packages we need:

julia> using Cbc

julia> using JuMP

Next, as required in the Tweet, we define the attacked_by function. It takes
horizontal and vertical coordinate of a square and returns a vector of cells by
which a given cell is attacked via a knight move:

julia> attacked_by(i, j) =
           [(p, q) for p in 1:8, q in 1:8
            if abs((p - i) * (q - j)) == 2]
attacked_by (generic function with 1 method)

In the definition we use the fact that the only possibility that the absolute
value of the product of two integers is equal to 2 is that one of them must have
absolute value equal to 1 and other absolute value equal to 2.

(If you think that the function does not have an optimal performance I agree
with you, but performance does not matter as the problem we have is small.)

Let us check if it works correctly:

julia> attacked_by(1, 1)
2-element Vector{Tuple{Int64, Int64}}:
 (3, 2)
 (2, 3)

julia> attacked_by(2, 2)
4-element Vector{Tuple{Int64, Int64}}:
 (4, 1)
 (4, 3)
 (1, 4)
 (3, 4)

julia> attacked_by(4, 4)
8-element Vector{Tuple{Int64, Int64}}:
 (3, 2)
 (5, 2)
 (2, 3)
 (6, 3)
 (2, 5)
 (6, 5)
 (3, 6)
 (5, 6)

It seems all is good.

Now the fun stuff begins. First define our optimization problem:

julia> model = Model(Cbc.Optimizer)
A JuMP Model
Feasibility problem with:
Variables: 0
Model mode: AUTOMATIC
CachingOptimizer state: EMPTY_OPTIMIZER
Solver name: COIN Branch-and-Cut (Cbc)

julia> @variable(model, x[1:8, 1:8], Bin)
8×8 Matrix{VariableRef}:
 x[1,1]  x[1,2]  x[1,3]  x[1,4]  x[1,5]  x[1,6]  x[1,7]  x[1,8]
 x[2,1]  x[2,2]  x[2,3]  x[2,4]  x[2,5]  x[2,6]  x[2,7]  x[2,8]
 x[3,1]  x[3,2]  x[3,3]  x[3,4]  x[3,5]  x[3,6]  x[3,7]  x[3,8]
 x[4,1]  x[4,2]  x[4,3]  x[4,4]  x[4,5]  x[4,6]  x[4,7]  x[4,8]
 x[5,1]  x[5,2]  x[5,3]  x[5,4]  x[5,5]  x[5,6]  x[5,7]  x[5,8]
 x[6,1]  x[6,2]  x[6,3]  x[6,4]  x[6,5]  x[6,6]  x[6,7]  x[6,8]
 x[7,1]  x[7,2]  x[7,3]  x[7,4]  x[7,5]  x[7,6]  x[7,7]  x[7,8]
 x[8,1]  x[8,2]  x[8,3]  x[8,4]  x[8,5]  x[8,6]  x[8,7]  x[8,8]

julia> @objective(model, Min, sum(x))
x[1,1] + x[2,1] + x[3,1] + x[4,1] + x[5,1] + x[6,1] + x[7,1] + x[8,1] +
x[1,2] + x[2,2] + x[3,2] + x[4,2] + x[5,2] + x[6,2] + x[7,2] + x[8,2] +
x[1,3] + x[2,3] + x[3,3] + x[4,3] + x[5,3] + x[6,3] + x[7,3] + x[8,3] +
x[1,4] + x[2,4] + x[3,4] + x[4,4] + x[5,4] + x[6,4] + x[7,4] + x[8,4] +
x[1,5] + x[2,5] + x[3,5] + x[4,5] + x[5,5] + x[6,5] + x[7,5] + x[8,5] +
x[1,6] + x[2,6] + x[3,6] + x[4,6] + x[5,6] + x[6,6] + x[7,6] + x[8,6] +
x[1,7] + x[2,7] + x[3,7] + x[4,7] + x[5,7] + x[6,7] + x[7,7] + x[8,7] +
x[1,8] + x[2,8] + x[3,8] + x[4,8] + x[5,8] + x[6,8] + x[7,8] + x[8,8]

julia> @constraint(model, c[i=1:8, j=1:8],
                   x[i, j] + sum(x[p...] for p in attacked_by(i, j)) >= 1)
8×8 Matrix{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.GreaterThan{Float64}}, ScalarShape}}:
 c[1,1] : x[1,1] + x[3,2] + x[2,3] >= 1.0                    …  c[1,8] : x[2,6] + x[3,7] + x[1,8] >= 1.0
 c[2,1] : x[2,1] + x[4,2] + x[1,3] + x[3,3] >= 1.0              c[2,8] : x[1,6] + x[3,6] + x[4,7] + x[2,8] >= 1.0
 c[3,1] : x[3,1] + x[1,2] + x[5,2] + x[2,3] + x[4,3] >= 1.0     c[3,8] : x[2,6] + x[4,6] + x[1,7] + x[5,7] + x[3,8] >= 1.0
 c[4,1] : x[4,1] + x[2,2] + x[6,2] + x[3,3] + x[5,3] >= 1.0     c[4,8] : x[3,6] + x[5,6] + x[2,7] + x[6,7] + x[4,8] >= 1.0
 c[5,1] : x[5,1] + x[3,2] + x[7,2] + x[4,3] + x[6,3] >= 1.0     c[5,8] : x[4,6] + x[6,6] + x[3,7] + x[7,7] + x[5,8] >= 1.0
 c[6,1] : x[6,1] + x[4,2] + x[8,2] + x[5,3] + x[7,3] >= 1.0  …  c[6,8] : x[5,6] + x[7,6] + x[4,7] + x[8,7] + x[6,8] >= 1.0
 c[7,1] : x[7,1] + x[5,2] + x[6,3] + x[8,3] >= 1.0              c[7,8] : x[6,6] + x[8,6] + x[5,7] + x[7,8] >= 1.0
 c[8,1] : x[8,1] + x[6,2] + x[7,3] >= 1.0                       c[8,8] : x[7,6] + x[6,7] + x[8,8] >= 1.0

julia> model
A JuMP Model
Minimization problem with:
Variables: 64
Objective function type: AffExpr
`AffExpr`-in-`MathOptInterface.GreaterThan{Float64}`: 64 constraints
`VariableRef`-in-`MathOptInterface.ZeroOne`: 64 constraints
Model mode: AUTOMATIC
CachingOptimizer state: EMPTY_OPTIMIZER
Solver name: COIN Branch-and-Cut (Cbc)
Names registered in the model: c, x

As the variable in the model we set a matrix called x, which will hold 1
in cells where a knight is placed and 0 in cells where no knight is placed.
As an objective we set that we want to minimize the number of used knights.
Finally we add 64 constraints saying that all cells must be attacked or covered
at least once.

We are ready for the final blow:

julia> optimize!(model)
Welcome to the CBC MILP Solver
Version: 2.10.5
Build Date: Jan  1 1970

command line - Cbc_C_Interface -solve -quit (default strategy 1)
Continuous objective value is 12 - 0.00 seconds
Cgl0004I processed model has 64 rows, 64 columns (64 integer (64 of which binary)) and 400 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 12
Cbc0038I Before mini branch and bound, 64 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.02 seconds)
Cbc0038I After 0.02 seconds - Feasibility pump exiting with objective of 12 - took 0.01 seconds
Cbc0012I Integer solution of 12 found by feasibility pump after 0 iterations and 0 nodes (0.03 seconds)
Cbc0001I Search completed - best objective 12, took 0 iterations and 0 nodes (0.03 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 12 to 12
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
ZeroHalf was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value:                12.00000000
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.10
Time (Wallclock seconds):       0.10

Total time (CPU seconds):       0.11   (Wallclock seconds):       0.11

Bang! We are done. We get a lot of output. It looks like the solver found
the optimal solution to the problem that has value 12, so we need at least 12
knights. The report that we see indicates that the problem was relatively easy
to solve.

We know that we need 12 knights, but where to put them? Let us check:

julia> Bool.(value.(x))
8×8 BitMatrix:
0  0  0  0  0  0  0
0  0  0  0  1  0  0
1  1  0  1  1  0  0
0  1  0  0  0  0  0
0  0  0  0  1  0  0
0  1  1  0  1  1  0
0  1  0  0  0  0  0
0  0  0  0  0  0  0

The optimal solution forms a nice symmetric pattern.

Conclusions

Removing line breaks, that I added to make the code better fit the screen, our
solution has 6 lines (plus 3 lines for lading packages and printing the result).

JuMP.jl is amazingly expressive and easy to use. At the same time notice how
nicely it prints the elements of the optimization model (this is best
appreciated in the terminal in our case as the generated expressions are long).
If you never used it I recommend you check it out.

I can’t wait for next challenges from David Amos.

Back to the basics: don’t lag behind with your DataFrames.jl skills

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/09/23/lagging.html

Introduction

Several of my recent posts were a bit more advanced, so today I thought to write
something for beginners. A common task that one needs to do in data analysis
pipelines is lagging a variable in time. Today I will show you several patterns
how this can be done using DataFrames.jl and ShiftedArrays.jl. In particular, I
was prompted to write this post by a 2.0 release of the ShiftedArrays.jl package
and it reflects the current state of this package.

In the post I use Julia 1.8.1, DataFrames.jl 1.3.6, and ShiftedArrays.jl 2.0.

Setting up the scene

Let us load the packages that we will use in this post and create some starting
data frame with which we will work:

julia> using DataFrames

julia> using Dates

julia> using ShiftedArrays: lag

julia> df = DataFrame(day=Date.(2020, 9, [1, 2, 4, 5, 1, 2, 3, 6]),
                      id=repeat(1:2, inner=4),
                      v=1:8)
8×3 DataFrame
 Row │ day         id     v
     │ Date        Int64  Int64
─────┼──────────────────────────
   1 │ 2020-09-01      1      1
   2 │ 2020-09-02      1      2
   3 │ 2020-09-04      1      3
   4 │ 2020-09-05      1      4
   5 │ 2020-09-01      2      5
   6 │ 2020-09-02      2      6
   7 │ 2020-09-03      2      7
   8 │ 2020-09-06      2      8

Here an important pattern is using ShiftedArrays: lag. Since 2.0 release
the ShiftedArrays.jl package does not export the lag function. Therefore
to be able to use it without having to write ShiftedArrays.lag you need
to explicitly list it when loading the ShiftedArrays.jl package.

Simple lagging

Assume we want to lag variable v in the whole data frame. for this we can
use the transform function like this:

julia> transform(df, :v => lag)
8×4 DataFrame
 Row │ day         id     v      v_lag
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2        1
   3 │ 2020-09-04      1      3        2
   4 │ 2020-09-05      1      4        3
   5 │ 2020-09-01      2      5        4
   6 │ 2020-09-02      2      6        5
   7 │ 2020-09-03      2      7        6
   8 │ 2020-09-06      2      8        7

If we wanted to lag it by more than one period we could write:

julia> transform(df, :v => (x -> lag(x, 2)) => :v_lag2)
8×4 DataFrame
 Row │ day         id     v      v_lag2
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2  missing
   3 │ 2020-09-04      1      3        1
   4 │ 2020-09-05      1      4        2
   5 │ 2020-09-01      2      5        3
   6 │ 2020-09-02      2      6        4
   7 │ 2020-09-03      2      7        5
   8 │ 2020-09-06      2      8        6

As an exercise, here is an example how to do lagging by different periods
using a comprehension:

julia> transform(df, [:v => (x -> lag(x, i)) => "v_lag$i" for i in 1:3])
8×6 DataFrame
 Row │ day         id     v      v_lag1   v_lag2   v_lag3
     │ Date        Int64  Int64  Int64?   Int64?   Int64?
─────┼─────────────────────────────────────────────────────
   1 │ 2020-09-01      1      1  missing  missing  missing
   2 │ 2020-09-02      1      2        1  missing  missing
   3 │ 2020-09-04      1      3        2        1  missing
   4 │ 2020-09-05      1      4        3        2        1
   5 │ 2020-09-01      2      5        4        3        2
   6 │ 2020-09-02      2      6        5        4        3
   7 │ 2020-09-03      2      7        6        5        4
   8 │ 2020-09-06      2      8        7        6        5

Lagging by group

Simple lagging in our example seems not very useful. The reason is that it looks
that we should lag the :v variable in groups defined by the :id variable.
In DataFrames.jl this is simple, the only thing you need to change in the
above codes is passing groupby(df, :id) to transform:

julia> transform(groupby(df, :id), :v => lag)
8×4 DataFrame
 Row │ day         id     v      v_lag
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2        1
   3 │ 2020-09-04      1      3        2
   4 │ 2020-09-05      1      4        3
   5 │ 2020-09-01      2      5  missing
   6 │ 2020-09-02      2      6        5
   7 │ 2020-09-03      2      7        6
   8 │ 2020-09-06      2      8        7

(the remaining examples can be updated using the same pattern)

Calendar lagging by group

The above example was better, but assumed that we want to lag data by
observations (so we shift a vector disregarding dates). What if we wanted to lag
the data by one calendar day. Here are two ways how you can do it (they slightly
differ in the produced result, so the choice of the approach depends on what you
would want in practice). In both approaches I assume that the passed dates are
unique by group.

In the first approach we want to keep all observations after lagging of the
:v variable and record only days that match some observations (before or after
lagging).

julia> combine(groupby(df, :id)) do sdf
           sort!(outerjoin(select(sdf, Not(:id)),
                           select(sdf,
                                  :day => ByRow(x -> x + Day(1)) => :day,
                                  :v => :v_lag),
                           on=:day), :day)
       end
12×4 DataFrame
 Row │ id     day         v        v_lag
     │ Int64  Date        Int64?   Int64?
─────┼─────────────────────────────────────
   1 │     1  2020-09-01        1  missing
   2 │     1  2020-09-02        2        1
   3 │     1  2020-09-03  missing        2
   4 │     1  2020-09-04        3  missing
   5 │     1  2020-09-05        4        3
   6 │     1  2020-09-06  missing        4
   7 │     2  2020-09-01        5  missing
   8 │     2  2020-09-02        6        5
   9 │     2  2020-09-03        7        6
  10 │     2  2020-09-04  missing        7
  11 │     2  2020-09-06        8  missing
  12 │     2  2020-09-07  missing        8

In this approach in each group defined by the :id column we create a helper
data frame with :day column incremented by one and outer join it with the
original data frame (with :id column dropped), next we sort the result on
:day.

In the second approach we keep only observations in the originally observed
period for a given group, but fill all the potential dates in the considered
period.

julia> combine(groupby(df, :id)) do sdf
           ref_df = DataFrame(day=minimum(sdf.day):Day(1):maximum(sdf.day))
           transform!(leftjoin!(ref_df, select(sdf, Not(:id)), on=:day),
                      :v => lag)
       end
11×4 DataFrame
 Row │ id     day         v        v_lag
     │ Int64  Date        Int64?   Int64?
─────┼─────────────────────────────────────
   1 │     1  2020-09-01        1  missing
   2 │     1  2020-09-02        2        1
   3 │     1  2020-09-03  missing        2
   4 │     1  2020-09-04        3  missing
   5 │     1  2020-09-05        4        3
   6 │     2  2020-09-01        5  missing
   7 │     2  2020-09-02        6        5
   8 │     2  2020-09-03        7        6
   9 │     2  2020-09-04  missing        7
  10 │     2  2020-09-05  missing  missing
  11 │     2  2020-09-06        8  missing

In this approach, again by :id group, we first create a reference data frame
with all dates in the observed period, next we leftjoin! the source data frame
into it (dropping :id as it is not needed), and finally just use the lag
function. Note that in this solution, in comparison to the previous one, rows
for dates 2020-09-06 (group 1) and 2020-09-07 (group 2) are missing, but we
have an extra row 2020-09-05 (group 2) filled with missing.

Conclusions

Lagging of variables is quite often needed in practice, especially in panel
econometrics context. I hope you will find the patterns I have presented in this
post useful in your data analysis projects. Additionally, I have chosen the
examples so that several functions from DataFrames.jl are employed in
combination.

juliabloggers.com

A Julia Language Blog Aggregator

Author Archives: Blog by Bogumił Kamiński

Is DataFrames.jl Hamiltonian?

Introduction

The principles

Single column name has a single meaning

Every transformation should be a function with a name

Conclusions

Knight covering puzzle

Introduction

Solution

Conclusions

Back to the basics: don’t lag behind with your DataFrames.jl skills

Introduction

Setting up the scene

Simple lagging

Lagging by group

Calendar lagging by group

Conclusions