Does DataFrames.jl copy or not copy, that is the question

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/15/copying.html

Introduction

Some time ago I have written a post about my thoughts on copying of data when working with it in Julia.

Today I want to focus on a related, but more narrow topic related to DataFrames.jl.
People starting to work with this package are sometimes confused when columns
get copied and when they are not copied. I want to discuss the most common cases in this post.

Spoiler! The post is a bit long. If you want a simple advice – you can skip to the section with conclusions.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

Getting a column from a data frame

Let us start with a simpler case. When does copying happen if we get a column form a data frame?

First we set up some initial data:

julia> using DataFrames

julia> df = DataFrame(a=1:10^6)
1000000×1 DataFrame
     Row │ a
         │ Int64
─────────┼─────────
       1 │       1
       2 │       2
       3 │       3
       4 │       4
       5 │       5
    ⋮    │    ⋮
  999997 │  999997
  999998 │  999998
  999999 │  999999
 1000000 │ 1000000
999991 rows omitted

There are three ways to get the :a column from this data frame: df.a, df[:, :a] and df[!, :a].
Let us check them one by one. Start with df.a:

julia> df.a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df.a
0

df.a extracts the column without copying data. You can see it by the fact that there are no allocations performed in this operation.

Now check df[:, :a], which uses a standard row index : that is also used in arrays:

julia> df[:, :a]
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df[:, :a]
8000048

df[:, :a] copies data, we see a lot of memory allocated this time. This is an identical behavior to how : works for arrays.

Finally check df[!, :a], which uses a non-standard ! row index:

julia> df[!, :a]
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df[!, :a]
0

We can see that df[!, :a] does not allocate. It is equivalent to df.a, just with a bit different syntax
(the indexing syntax with ! is handy if we wanted to select multiple columns from a data frame, which is not possible with df.a syntax).

This part was relatively easy. Now let us turn to a harder case of setting a column of a data frame.

Case 1: setting a column in a data frame using assignment

First store the :a column in a temporary variable a (without copying it):

julia> a = df.a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

Now let us check various options of creation of a column that will store a.
Begin with creating of a new column.

julia> df.b = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.b === a
true

We can see that if we put df.b on the left hand side the operation does not copy the passed data.
You probably already can guess that the same happens with df[!, :c] on left hand side. Indeed
it is the case:

julia> df[!, :c] = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.c === a
true

What about df[:, :d]? Let us see:

julia> df[:, :d] = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.d === a
false

So we see a first difference. When creating a new column the data was copied.
But what would happen if some column already existed in a data frame?

Well for df.b and df[!, :c] syntaxes nothing would change, as they just put
a right hand side vector into a data frame without copying it.
But for df[:, :d] the situation is different. Let us check:

julia> d = df.d;

julia> df[:, :d] = a;

julia> df.d === a
false

julia> df.d === d
true

We can see that if we use the df[:, :d] syntax on left hand side the operation is in-place,
that is the vector already present in df is reused and the data is stored in a column
already present in a data frame. This means that we cannot use df[:, :d] = ... to change
element type of column :d. Let us see:

julia> df[:, :d] = a .+ 0.5;
ERROR: InexactError: Int64(1.5)

Indeed a .+ 0.5 contains floating point values, and the :d column allowed only integers.
Note that with df.b = ... or df[!, :c] = ... we would not have this issue as they
replace columns with what is passed on a right hand side:

julia> df.b = a .+ 0.5
1000000-element Vector{Float64}:
      1.5
      2.5
      3.5
      4.5
      5.5
      6.5
      7.5
      ⋮
 999995.5
 999996.5
 999997.5
 999998.5
 999999.5
      1.0000005e6

There is one more twist to this story. It is related to ranges.
The issue is that DataFrame object always materializes ranges
stored in it.
Therefore the following operation allocates data:

julia> df.b = 1:10^6
1:1000000

julia> df.b
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

The issue is that generally df.b = ... does not allocate, but since we disallow storing
ranges as columns of a data frame (in our case the 1:10^6 range) the allocation still takes place.
You would have the same behavior with df[!, :c] = 1:10^6.

Case 2: setting a column in a data frame using broadcasted assignment

Julia is famous for its powerful broadcasting capabilities. Let us thus investigate what happens when we
replace = with .= in our experiments. We will reproduce all the examples we gave above from scratch.

Start with df.b .= a:

julia> df = DataFrame(a=1:10^6);

julia> a = df.a;

julia> df.b .= a;

julia> df.b === a
false

We now see a difference. The :b column is freshly allocated.

Let us check the two other options of creation of a new column:

julia> df[!, :c] .= a;

julia> df.c === a
false

julia> df[:, :d] .= a;

julia> df.d === a
false

They have the same effect: a new column gets allocated.

In the case of an existing column df.b .= ... and df[!, :c] .= ...
would again create a new copied column:

julia> df.b .= a .+ 0.5
1000000-element Vector{Float64}:
      1.5
      2.5
      3.5
      4.5
      5.5
      6.5
      7.5
      ⋮
 999995.5
 999996.5
 999997.5
 999998.5
 999999.5
      1.0000005e6

The difference is with df[:, :d] .= ...:

julia> d = df.d;

julia> df[:, :d] .= a;

julia> df.d === a
false

julia> df.d === d
true

julia> df[:, :d] .= a .+ 0.5
ERROR: InexactError: Int64(1.5)

So we see that we have here an in-place operation just like with df[:, :d] = ....

Conclusions

As a summary let me discuss a common anti-pattern:

df.a = df.b

Given the examples I presented we know that after this operation the :a and :b columns
of the df data frame are aliased, i.e. df.a === df.b produces true. Usually this is not
a desired situation as many operations assume that columns of a data frame do not share memory.

Fortunately, we also already learnt an easy fix to the aliasing problem. You can just write:

df.a .= df.b

To get a copy of :b stored in column :a.

I hope the examples I gave in my post today will be useful for your work with DataFrames.jl.