Re-posted from: https://bkamins.github.io/julialang/2021/01/30/bang.html
Introduction
I have recently noticed that DataFrames.jl users use ! as a row selector for a data
frame a lot.
Over a year ago, when we took data frame indexing seriously, there was a
very big debate whether ! should be allowed in expressions like df[!, :a] to get
an :a column without copying. The conclusion was that we need to have it, but
our intention was that it would be reserved for advanced uses only, while
in normal circumstances a user would not need to even know that it exists.
In this post let me review the use-cases of ! and comment on its alternatives.
This post was written under Julia 1.5.3 and DataFrames 0.22.4.
First we set up the environment:
julia> using DataFrames
julia> df = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
Reading a single column from a data frame
If you want to get a single column :col1 from a data frame df you have the
following options:
- df[!, :col1], df[!, "col1"], df.col1, and df."col1": get you the column
  without copying;
- df[:, :col1] and df[:, "col1"]: get you a copy of the column.
As you can see, to get a single column without copying it is usually much easier to
write df.col1 than e.g. df[!, :col1], and the operation has exactly the same
result.
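A minimal sketch of how to check the copying behavior yourself, using the === identity test. The data frame is rebuilt so the snippet is self-contained, and the variable names (same_object, col_copy, and so on) are only illustrative:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

# `!` and getproperty return the very vector stored in the data frame
same_object = df[!, :col1] === df.col1      # true: no copy is made

# `:` returns a freshly allocated vector with equal contents
col_copy = df[:, :col1]
equal_contents = col_copy == df.col1        # true
is_alias = col_copy === df.col1             # false: it is a copy
```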
The only case when df[!, :col1] is more convenient is when you have a column
name stored in a variable. Then the following are equivalent:
julia> v = :col1
:col1
julia> df[!, v]
3-element Array{Int64,1}:
1
2
3
julia> getproperty(df, v)
3-element Array{Int64,1}:
1
2
3
and indeed using ! is a bit more convenient in this case, as you cannot pass
variable v to an expression like df.col1.
Reading multiple columns from a data frame
If you want to get two columns [:col1, :col2] from a data frame df you
have the following options (I am leaving out the string version and other column
selectors we support for simplicity):
- df[!, [:col1, :col2]] and select(df, [:col1, :col2], copycols=false):
  create a new data frame (a fresh wrapper object is allocated) but the
  columns of the new data frame are taken from df;
- df[:, [:col1, :col2]] and select(df, [:col1, :col2]): get you a new data
  frame with columns copied.
Note that for multiple column selection you can alternatively use the select
function. The difference between select and indexing is that select returns
a data frame even if a single column is selected, e.g. like this:
julia> select(df, 1)
3×1 DataFrame
Row │ col1
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
while as we have explained above we have:
julia> df[!, 1]
3-element Array{Int64,1}:
1
2
3
Note that since the df[!, [:col1, :col2]] syntax does not copy columns,
this operation is generally not recommended. Using such a data frame often
leads to very hard-to-find bugs, because when you modify the contents of the
columns of the newly created data frame the source is also mutated.
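To make the aliasing risk concrete, here is a small sketch. The names alias and safe are hypothetical and chosen only for illustration:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

alias = df[!, [:col1, :col2]]   # new wrapper, shared column vectors
safe = df[:, [:col1, :col2]]    # new wrapper, copied column vectors

alias.col1[1] = 100             # also changes df, since the vector is shared
safe.col1[2] = 200              # leaves df untouched

df.col1   # expected: [100, 2, 3]
```

Only the write through alias is visible in df. This is exactly the kind of silent mutation that is hard to track down later.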
Making a view of a data frame
In this case we have:
julia> view(df, !, :col1)
3-element view(::Array{Int64,1}, :) with eltype Int64:
1
2
3
julia> view(df, !, [:col1, :col2])
3×2 SubDataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
and the views are exactly the same as if we used view(df, :, :col1) and
view(df, :, [:col1, :col2]) respectively.
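One way to convince yourself of this equivalence is to inspect the parent of each view. This is a sketch and the variable names are illustrative. It assumes that parent returns the wrapped column vector for a single-column view and the source data frame for a SubDataFrame, which is what the output above suggests:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

v_bang = view(df, !, :col1)
v_colon = view(df, :, :col1)

# both views wrap the column stored in df, so no data is copied
bang_wraps_column = parent(v_bang) === df.col1
colon_wraps_column = parent(v_colon) === df.col1

# for multiple columns both calls produce a SubDataFrame over df
sdf_bang = view(df, !, [:col1, :col2])
sdf_colon = view(df, :, [:col1, :col2])
same_parent = parent(sdf_bang) === df && parent(sdf_colon) === df
```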
In this case ! is supported mainly to allow an easy annotation of whole
expressions using data frame indexing with @views, e.g. imagine you have
the following code:
julia> x = [1, 2, 3, 4]
4-element Array{Int64,1}:
1
2
3
4
julia> df[!, 1] + x[1:3]
3-element Array{Int64,1}:
2
4
6
and in order to avoid copying x you want to annotate the whole expression with
@views. Thanks to the fact that ! is supported with view you can just write:
julia> @views df[!, 1] + x[1:3]
3-element Array{Int64,1}:
2
4
6
Assigning to a single column
The difference between df[!, :col1] = 11:13 and df[:, :col1] = 11:13 is that
using ! puts a new column passed on the right hand side to the data frame
without copying it (no matter if the column exists or not in the data frame),
while : assigns to an existing column in-place.
Therefore df[!, :col1] = 11:13 is equivalent to df.col1 = 11:13. On the other
hand df[:, :col1] = 11:13 is equivalent to df.col1[:] = 11:13, if the column
:col1 is present in the data frame.
Here is an example:
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> col1 = df2.col1
3-element Array{Int64,1}:
1
2
3
julia> df2[!, :col1] = 11:13
11:13
julia> col1
3-element Array{Int64,1}:
1
2
3
vs.
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> col1 = df2.col1
3-element Array{Int64,1}:
1
2
3
julia> df2[:, :col1] = 11:13
11:13
julia> col1
3-element Array{Int64,1}:
11
12
13
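The same contrast can be checked with === when the right hand side is a Vector (a range such as 11:13 has to be materialized into a vector first, so identity cannot be tested with it). A minimal sketch with illustrative variable names:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')
old = df.col1
new = [11, 12, 13]

df[!, :col1] = new                      # the vector `new` itself becomes the column
stored_without_copy = df.col1 === new   # true
old_untouched = old == [1, 2, 3]        # true

df[:, :col1] = [21, 22, 23]             # values are written into the existing column
still_same_vector = df.col1 === new     # true
new   # expected: [21, 22, 23]
```

Note that after the : assignment the vector new has changed as well, because it is the column stored in df.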
You might have noticed that when I described : I added a condition that
it is equivalent to the getproperty syntax only when the column is present in the
data frame. The reason is that if the column is not present in a data frame
then we have:
julia> df
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> newcol = [11, 12, 13]
3-element Array{Int64,1}:
11
12
13
julia> df[:, :newcol] = newcol
3-element Array{Int64,1}:
11
12
13
julia> df
3×3 DataFrame
Row │ col1 col2 newcol
│ Int64 Char Int64
─────┼─────────────────────
1 │ 1 a 11
2 │ 2 b 12
3 │ 3 c 13
julia> df.newcol === newcol
false
So instead of an in-place operation (which is not possible as the column is not
present in the data frame), we get a copy operation.
On the other hand:
julia> df.newcol2[:] = newcol
ERROR: ArgumentError: column name :newcol2 not found in the data frame; existing most similar names are: :newcol
just fails as there is no column to index into.
The other special case is SubDataFrame, where using ! for assignment is not
allowed, just like for getproperty syntax:
julia> dfv = view(df, :, :)
3×3 SubDataFrame
Row │ col1 col2 newcol
│ Int64 Char Int64
─────┼─────────────────────
1 │ 1 a 11
2 │ 2 b 12
3 │ 3 c 13
julia> dfv[!, :col1] = 1:3
ERROR: ArgumentError: setting index of SubDataFrame using ! as row selector is not allowed
julia> dfv.col1 = 1:3
ERROR: ArgumentError: Replacing or adding of columns of a SubDataFrame is not allowed. Instead use `df[:, col_ind] = v` or `df[:, col_ind] .= v` to perform an in-place assignment.
Assigning to multiple columns
This case is a bit simpler than the single column assignment case above. The
reason is that we do not allow creating new columns when multiple columns are
selected. Therefore the rule is: df[!, [:col1, :col2]] = new_values replaces
columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values
updates them in-place.
Note that new_values must be either a data frame or a matrix, and for ! the
columns in df will always be freshly allocated.
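A short sketch of this rule, assuming a source data frame src (a hypothetical name) whose column names match the selected columns:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2=4:6)
old = df.col1
src = DataFrame(col1=[11, 12, 13], col2=[14, 15, 16])

df[:, [:col1, :col2]] = src     # in-place: the stored vectors are reused
updated_in_place = df.col1 === old          # true

df[!, [:col1, :col2]] = src     # replace: df gets fresh vectors
replaced = df.col1 !== old                  # true
not_aliased = df.col1 !== src.col1          # true: columns of src are copied
```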
Broadcasting assignment to a single column
This is the point where a bit of complexity is introduced, as now the getproperty
syntax (i.e. df.col) behaves similarly to : indexing and not to ! indexing.
The rules are the following:
- df[!, :col] .= v allocates a new column and replaces the old one or, if :col
  is not present in df, allocates and adds it;
- df[:, :col] .= v updates the column in-place or, if :col
  is not present in df, allocates and adds it;
- df.col .= v is only allowed if col is present in df and operates in-place.
Note that if :col is not present in df then using ! and : are equivalent.
Also note that for a SubDataFrame adding new columns is not allowed and the !
syntax is not allowed.
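The first two rules can be sketched as follows. This assumes the behavior described in this post; variable names are illustrative, and df.col .= v is left out on purpose:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')
old = df.col1

df[:, :col1] .= 0          # in-place: zeros are written into `old`
in_place = df.col1 === old                  # true

df[!, :col1] .= 1.5        # a new Float64 column replaces the old one
replaced = df.col1 !== old                  # true
new_eltype = eltype(df.col1)                # Float64

# for a column that does not exist yet both forms allocate and add it
df[!, :col3] .= "x"
df[:, :col4] .= "y"
names(df)   # expected: ["col1", "col2", "col3", "col4"]
```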
Broadcasting assignment to multiple columns
Again this case is simpler than the single column broadcasting assignment case above.
The reason is that we do not allow creating new columns when multiple columns are
selected. Therefore the rule is: df[!, [:col1, :col2]] .= new_values replaces
columns :col1 and :col2 in df, while df[:, [:col1, :col2]] .= new_values
updates them in-place.
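As a rough sketch of the difference, broadcasting a scalar into two integer columns (names are illustrative, and the behavior is assumed to follow the rule stated above):

```julia
using DataFrames

df = DataFrame(col1=1:3, col2=4:6)
old = df.col1

df[:, [:col1, :col2]] .= 0      # in-place: the existing vectors are overwritten
in_place = df.col1 === old                  # true

df[!, [:col1, :col2]] .= 0.5    # replace: fresh Float64 vectors are stored
replaced = df.col1 !== old                  # true
new_eltype = eltype(df.col2)                # Float64
```

With : the value 0.5 could not be stored, as it cannot be converted to the Int64 element type of the existing columns.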
Summary of the cases
Wrapping up the cases we see that ! means the following:
- in selection context: get me a column or a data frame without copying columns;
- in views: make me a view (the same as : row selector);
- in assignment to a single column: replace or add the column to a data frame
  without copying;
- in assignment to multiple columns: replace the columns in a data frame
  with copying;
- in broadcasting assignment: allocate a new column and store it (and in the case
  of a single column selector optionally add it if it is missing).
And : means the following:
- in selection context: get me a column or a data frame with copying of columns;
- in views: make me a view (the same as ! row selector);
- in assignment to a single column: change the column in-place or add the column
  to a data frame with copying;
- in assignment to multiple columns: change the columns in-place in a data frame;
- in broadcasting assignment: perform in-place update of columns (and in the case
  of a single column selector optionally allocate and add it if it is missing).
Finally getproperty (the df.col style) means the following:
- in selection context: get me a column without copying.
- in assignment: replace or add the column to a data frame without copying;
- in broadcasting assignment: update an existing column in-place.
In short (simplifying a bit):
- ! gets you columns without copying and when setting columns it replaces them;
- : gets you columns with copying and when setting columns it does this in-place;
- getproperty gets you columns without copying and when setting columns it replaces
  them, except for broadcasting assignment, when it updates them in-place.
From a practical perspective the major difference between in-place and replace
operations is that replacing columns is needed if new values have a different
type than the old ones.
For instance here ! works and : fails:
julia> df
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> df[:, :col1] .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
julia> df[!, :col1] .= "a"
3-element Array{String,1}:
"a"
"a"
"a"
julia> df
3×2 DataFrame
Row │ col1 col2
│ String Char
─────┼──────────────
1 │ a a
2 │ a b
3 │ a c
Another practical limitation is that broadcasting assignment like df.col .= v
is not allowed when :col is not present in a data frame (there is a chance that
in the future it will be allowed, see here).
Conclusions
As you can see there are cases when the ! row selector is needed to cover all
potential use-cases. However, most common operations are done on a single
column and in this case:
- for getting a column or assigning to a column, instead of df[!, :col] and
  df[!, :col] = v it is usually better to just write df.col and
  df.col = v respectively, as it is the same and simpler to type and read;
- currently the case where ! is really needed is the broadcasting assignment
  context, where df[!, :col] .= v is the only relatively nice way to freshly
  allocate a column with v broadcasted into it (but when I look at the code of
  DataFrames.jl users this pattern is used much less frequently than we
  expected when we designed the ecosystem).
I hope this post was helpful. If you are interested in a definitive
specification of all the indexing rules in DataFrames.jl you can find them
here.