On the bang row selector in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/01/30/bang.html

Introduction

I have recently seen that DataFrames.jl users use ! as a row selector for a
data frame a lot.

Over a year ago, when we took data frame indexing seriously, there was a
very big debate about whether ! should be allowed in expressions like
df[!, :a] to get an :a column without copying. The conclusion was that we need
to have it, but our intention was that it would be reserved for advanced uses
only, while in normal circumstances a user would not even need to know that it
exists.

In this post let me review the use-cases of ! and comment on its alternatives.

This post was written under Julia 1.5.3 and DataFrames 0.22.4.

First we set up the environment:

julia> using DataFrames

julia> df = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

Reading a single column from a data frame

If you want to get a single column :col1 from a data frame df you have the
following options:

  • df[!, :col1], df[!, "col1"], df.col1, and df."col1": give you the column
    without copying;
  • df[:, :col1] and df[:, "col1"]: give you a copy of the column.

As you can see, to get a single column without copying it is usually much
easier to write df.col1 than e.g. df[!, :col1], and the operation has exactly
the same result.
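
One way to check this is to compare object identity with ===. The snippet below is a minimal sketch that rebuilds the df from the setup above; the names same_object, copied and is_copy are only illustrative.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

# Both forms return the very same vector that is stored in df (no copy).
same_object = df[!, :col1] === df.col1

# The : row selector returns a new vector with equal contents.
copied = df[:, :col1]
is_copy = copied !== df.col1 && copied == df.col1
```

Both same_object and is_copy should be true.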

The only case when df[!, :col1] is more convenient is when you have a column
name stored in a variable. Then the following are equivalent:

julia> v = :col1
:col1

julia> df[!, v]
3-element Array{Int64,1}:
 1
 2
 3

julia> getproperty(df, v)
3-element Array{Int64,1}:
 1
 2
 3

and indeed using ! is a bit more convenient in this case, as you cannot pass
the variable v to an expression like df.col1.

Reading multiple columns from a data frame

If you want to get two columns [:col1, :col2] from a data frame df you
have the following options (I am leaving out the string version and other
column selectors we support for simplicity):

  • df[!, [:col1, :col2]] and select(df, [:col1, :col2], copycols=false):
    create a new data frame (a fresh wrapper object is allocated) but the
    columns of the new data frame are taken from df;
  • df[:, [:col1, :col2]] and select(df, [:col1, :col2]): give you a new data
    frame with columns copied.

Note that for multiple column selection you can alternatively use the select
function. The difference between select and indexing is that select returns
a data frame even if a single column is selected, e.g. like this:

julia> select(df, 1)
3×1 DataFrame
 Row │ col1
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

while as we have explained above we have:

julia> df[!, 1]
3-element Array{Int64,1}:
 1
 2
 3

Note that since the df[!, [:col1, :col2]] syntax does not copy columns, this
operation is generally not recommended. Using such a data frame often leads to
very hard-to-find bugs, because when you modify the contents of the columns of
the newly created data frame the source is mutated as well.
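
To make the risk concrete, here is a minimal sketch of the aliasing. It rebuilds the example df; df_nocopy and df_copy are illustrative names that do not appear elsewhere in this post.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

df_nocopy = df[!, [:col1, :col2]]   # new wrapper, columns shared with df
df_copy = df[:, [:col1, :col2]]     # new wrapper, columns copied

df_copy.col1[1] = 100    # does not touch df
df_nocopy.col1[1] = -1   # also changes df.col1[1]

shared = df_nocopy.col1 === df.col1
source_first = df.col1[1]
```

After running this, shared should be true and source_first should be -1, even though df itself was never assigned to.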

Making a view of a data frame

In this case we have:

julia> view(df, !, :col1)
3-element view(::Array{Int64,1}, :) with eltype Int64:
 1
 2
 3

julia> view(df, !, [:col1, :col2])
3×2 SubDataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

and the views are exactly the same as if we used view(df, :, :col1) and
view(df, :, [:col1, :col2]) respectively.
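
A small sketch of that equivalence, again rebuilding the example df. The names v_bang, v_colon and same_parent are illustrative; parent is the standard Base function returning the array that a view wraps.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

v_bang = view(df, !, :col1)
v_colon = view(df, :, :col1)

# Both views wrap the column stored in df.
same_parent = parent(v_bang) === df.col1 && parent(v_colon) === df.col1

# Writing through either view changes the source data frame.
v_bang[1] = 10
v_colon[2] = 20
```

Here same_parent should be true, and df.col1 should hold 10, 20, 3 afterwards.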

In this case ! is supported mainly to allow an easy annotation of whole
expressions using data frame indexing with @views, e.g. imagine you have
the following code:

julia> x = [1, 2, 3, 4]
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> df[!, 1] + x[1:3]
3-element Array{Int64,1}:
 2
 4
 6

and in order to avoid copying x you want to annotate the whole expression with
@views. Thanks to the fact that ! is supported with view you can just write:

julia> @views df[!, 1] + x[1:3]
3-element Array{Int64,1}:
 2
 4
 6

Assigning to a single column

The difference between df[!, :col1] = 11:13 and df[:, :col1] = 11:13 is that
using ! puts the new column passed on the right-hand side into the data frame
without copying it (no matter if the column exists or not in the data frame),
while : assigns to an existing column in-place.

Therefore df[!, :col1] = 11:13 is equivalent to df.col1 = 11:13. On the other
hand df[:, :col1] = 11:13 is equivalent to df.col1[:] = 11:13, if the column
:col1 is present in the data frame.

Here is an example:

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> col1 = df2.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2[!, :col1] = 11:13
11:13

julia> col1
3-element Array{Int64,1}:
 1
 2
 3

vs.

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> col1 = df2.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2[:, :col1] = 11:13
11:13

julia> col1
3-element Array{Int64,1}:
 11
 12
 13

You might have noticed that when I described : I added a condition that
it is equivalent to the getproperty syntax only when the column is present in
the data frame. The reason is that if the column is not present in a data frame
then we have:

julia> df
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> newcol = [11, 12, 13]
3-element Array{Int64,1}:
 11
 12
 13

julia> df[:, :newcol] = newcol
3-element Array{Int64,1}:
 11
 12
 13

julia> df
3×3 DataFrame
 Row │ col1   col2  newcol
     │ Int64  Char  Int64
─────┼─────────────────────
   1 │     1  a         11
   2 │     2  b         12
   3 │     3  c         13

julia> df.newcol === newcol
false

So instead of an in-place operation (which is not possible as the column is not
present in the data frame), we get a copy operation.

On the other hand:

julia> df.newcol2[:] = newcol
ERROR: ArgumentError: column name :newcol2 not found in the data frame; existing most similar names are: :newcol

just fails as there is no column to index into.

The other special case is SubDataFrame, where using ! for assignment is not
allowed, just like for getproperty syntax:

julia> dfv = view(df, :, :)
3×3 SubDataFrame
 Row │ col1   col2  newcol
     │ Int64  Char  Int64
─────┼─────────────────────
   1 │     1  a         11
   2 │     2  b         12
   3 │     3  c         13

julia> dfv[!, :col1] = 1:3
ERROR: ArgumentError: setting index of SubDataFrame using ! as row selector is not allowed

julia> dfv.col1 = 1:3
ERROR: ArgumentError: Replacing or adding of columns of a SubDataFrame is not allowed. Instead use `df[:, col_ind] = v` or `df[:, col_ind] .= v` to perform an in-place assignment.

Assigning to multiple columns

This case is a bit simpler than the single-column assignment case above. The
reason is that we do not allow creating new columns when multiple columns are
selected. Therefore the rule is: df[!, [:col1, :col2]] = new_values replaces
columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values
updates them in-place.

Note that new_values must be either a data frame or a matrix, and for ! the
columns in df will always be freshly allocated.
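
The rule can be sketched as follows. This is an illustration only: it uses a fresh data frame with two integer columns, and new_values is a hypothetical data frame whose column names match the selected columns.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2=4:6)
old_col1 = df.col1

# new_values is a hypothetical data frame with matching column names.
new_values = DataFrame(col1=[10, 20, 30], col2=[40, 50, 60])

# : writes the values into the existing column vectors.
df[:, [:col1, :col2]] = new_values
updated_in_place = df.col1 === old_col1 && old_col1 == [10, 20, 30]

# ! replaces the columns, so the element type may change.
df[!, [:col1, :col2]] = DataFrame(col1=[1.5, 2.5, 3.5], col2=["x", "y", "z"])
replaced = df.col1 !== old_col1 && eltype(df.col1) == Float64
```

Both updated_in_place and replaced should be true, and old_col1 keeps the values written by the in-place step.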

Broadcasting assignment to a single column

This is the point where a bit of complexity is introduced, as now the
getproperty syntax (i.e. df.col) behaves similarly to : indexing and not to
! indexing.

The rules are the following:

  • df[!, :col] .= v allocates a new column and replaces the old one, or if
    :col is not present in df allocates and adds it;
  • df[:, :col] .= v updates the column in-place, or if :col is not present
    in df allocates and adds it;
  • df.col .= v is only allowed if col is present in df and operates in-place.

Note that if :col is not present in df then using ! and : are equivalent.

Also note that in a SubDataFrame it is not allowed to add new columns and the
! syntax is not allowed.
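
A minimal sketch of the two indexing forms on a plain DataFrame, using a fresh single-column data frame. The names old, in_place and replaced, and the new columns :a and :b, are illustrative.

```julia
using DataFrames

df = DataFrame(col=[1, 2, 3])
old = df.col

# : broadcasts into the existing vector.
df[:, :col] .= 10
in_place = df.col === old && old == [10, 10, 10]

# ! allocates a new vector and stores it in df; old is left untouched.
df[!, :col] .= 20
replaced = df.col !== old && df.col == [20, 20, 20] && old == [10, 10, 10]

# For a missing column both selectors allocate and add it.
df[!, :a] .= 1
df[:, :b] .= 2
```

Both in_place and replaced should be true, and df ends up with the columns col, a and b.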

Broadcasting assignment to multiple columns

Again this case is simpler than the single-column broadcasting assignment case
above. The reason is that we do not allow creating new columns when multiple
columns are selected. Therefore the rule is: df[!, [:col1, :col2]] .= new_values
replaces columns :col1 and :col2 in df, while df[:, [:col1, :col2]] .= new_values
updates them in-place.
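
As a hedged illustration of this rule, the sketch below broadcasts a scalar into two integer columns of a fresh data frame; old_col1, in_place and replaced are illustrative names.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2=4:6)
old_col1 = df.col1

# : broadcasts into the existing columns, so the value has to fit their type.
df[:, [:col1, :col2]] .= 0
in_place = df.col1 === old_col1 && old_col1 == [0, 0, 0]

# ! allocates fresh columns, so the element type can change.
df[!, [:col1, :col2]] .= 0.5
replaced = df.col1 !== old_col1 && eltype(df.col1) == Float64
```

Both in_place and replaced should be true. Broadcasting 0.5 with : instead would fail here, as 0.5 cannot be stored in an integer column.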

Summary of the cases

Wrapping up the cases we see that ! means the following:

  • in selection context: get me a column or a data frame without copying columns;
  • in views: make me a view (the same as : row selector);
  • in assignment to a single column: replace or add the column to a data frame
    without copying;
  • in assignment to multiple columns: replace the columns in a data frame
    with copying;
  • in broadcasting assignment: allocate a new column and store it (and in the case
    of a single column selector add it if it is missing).

And : means the following:

  • in selection context: get me a column or a data frame with copying of columns;
  • in views: make me a view (the same as ! row selector);
  • in assignment to a single column: change the column in-place or add the column
    to a data frame with copying;
  • in assignment to multiple columns: change the columns in-place in a data frame;
  • in broadcasting assignment: perform an in-place update of columns (and in the
    case of a single column selector allocate and add it if it is missing).

Finally getproperty (the df.col style) means the following:

  • in selection context: get me a column without copying;
  • in assignment: replace or add the column to a data frame without copying;
  • in broadcasting assignment: update an existing column in-place.

In short (simplifying a bit):

  • ! gets you columns without copying and when setting columns it replaces them;
  • : gets you columns with copying and when setting columns it does this in-place;
  • getproperty gets you columns without copying and when setting columns it
    replaces them, except for broadcasting assignment, when it updates them
    in-place.

From a practical perspective the major difference between in-place and replace
operations is that replacing columns is needed if new values have a different
type than the old ones.

For instance here ! works and : fails:

julia> df
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> df[:, :col1] .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[!, :col1] .= "a"
3-element Array{String,1}:
 "a"
 "a"
 "a"

julia> df
3×2 DataFrame
 Row │ col1    col2
     │ String  Char
─────┼──────────────
   1 │ a       a
   2 │ a       b
   3 │ a       c

Another practical limitation is that broadcasting assignment like df.col .= v
is not allowed when :col is not present in a data frame (there is a chance that
in the future it will be allowed, see here).

Conclusions

As you can see there are cases when the ! row selector is needed to cover all
potential use-cases. However, the most common operations are done on a single
column and in this case:

  • for getting a column or assigning to a column, instead of df[!, :col] and
    df[!, :col] = v it is usually better to just write df.col and
    df.col = v respectively, as it is the same and simpler to type and read;
  • currently the case where ! is really needed is the broadcasting assignment
    context, where df[!, :col] .= v is the only relatively nice way to freshly
    allocate a column with v broadcasted into it (but when I look at the code
    of DataFrames.jl users this pattern is used much less frequently than we
    expected when we designed the ecosystem).

I hope this post was helpful. If you are interested in a definitive
specification of all the indexing rules in DataFrames.jl you can find them
here.