Author Archives: Blog by Bogumił Kamiński

On the bang row selector in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/01/30/bang.html

Introduction

I have recently noticed that DataFrames.jl users use ! as a row selector for
a data frame a lot.

Over a year ago, when we took data frame indexing seriously, there was a
very big debate whether ! should be allowed in expressions like df[!, :a] to get
an :a column without copying. The conclusion was that we need to have it, but
our intention was that it would be reserved for advanced uses only, while
in normal circumstances a user would not need to even know that it exists.

In this post let me review the use-cases of ! and comment on its alternatives.

This post was written under Julia 1.5.3 and DataFrames 0.22.4.

First we set up the environment:

julia> using DataFrames

julia> df = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

Reading a single column from a data frame

If you want to get a single column :col1 from a data frame df you have the
following options:

  • df[!, :col1], df[!, "col1"], df.col1, and df."col1": get you the column
    without copying;
  • df[:, :col1] and df[:, "col1"]: get you a copy of the column.

As you can see, to get a single column without copying it is usually much
easier to write df.col1 than e.g. df[!, :col1], and the operation has exactly
the same result.

The only case when df[!, :col1] is more convenient is when you have a column
name stored in a variable. Then the following are equivalent:

julia> v = :col1
:col1

julia> df[!, v]
3-element Array{Int64,1}:
 1
 2
 3

julia> getproperty(df, v)
3-element Array{Int64,1}:
 1
 2
 3

and indeed using ! is a bit more convenient in this case, as you cannot pass
variable v to an expression like df.col1.
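
The copy versus no-copy distinction can be checked directly with ===, which tests object identity. The following is a minimal sketch using the same df as above; the variable names same_object, is_copy, and same_via_getproperty are only illustrative:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

# `!` and the df.col1 syntax both return the vector stored in df itself
same_object = df[!, :col1] === df.col1

# `:` allocates a new vector with equal contents
copied = df[:, :col1]
is_copy = copied == df.col1 && copied !== df.col1

# getproperty with a variable holding the column name also avoids copying
v = :col1
same_via_getproperty = getproperty(df, v) === df[!, v]
```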

Reading multiple columns from a data frame

If you want to get two columns [:col1, :col2] from a data frame df you
have the following options (I am leaving out the string version and other column
selectors we support for simplicity):

  • df[!, [:col1, :col2]] and select(df, [:col1, :col2], copycols=false):
    create a new data frame (a fresh wrapper object is allocated) but the
    columns of the new data frame are taken from df;
  • df[:, [:col1, :col2]] and select(df, [:col1, :col2]): get you a new data
    frame with columns copied.

Note that for multiple column selection you can alternatively use the select
function. The difference between select and indexing is that select returns
a data frame even if a single column is selected, e.g. like this:

julia> select(df, 1)
3×1 DataFrame
 Row │ col1
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

while as we have explained above we have:

julia> df[!, 1]
3-element Array{Int64,1}:
 1
 2
 3

Note that because the df[!, [:col1, :col2]] syntax does not copy the columns,
this operation is generally not recommended. Using such a data frame often
leads to very hard-to-find bugs, because when you modify the contents of the
columns of the newly created data frame the source is mutated as well.
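
Here is a small sketch of how such aliasing shows up in practice. It uses a df with the same columns as above; the names shared and independent are made up for this illustration:

```julia
using DataFrames

df = DataFrame(col1=[1, 2, 3], col2=['a', 'b', 'c'])

# No copy: the new data frame shares its column vectors with df
shared = df[!, [:col1, :col2]]
shared.col1[1] = 100
source_changed = df.col1[1] == 100   # the source was mutated as well

# Copy: the new data frame owns independent vectors
independent = df[:, [:col1, :col2]]
independent.col1[1] = -1
source_intact = df.col1[1] == 100    # the source is unaffected
```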

Making a view of a data frame

In this case we have:

julia> view(df, !, :col1)
3-element view(::Array{Int64,1}, :) with eltype Int64:
 1
 2
 3

julia> view(df, !, [:col1, :col2])
3×2 SubDataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

and the views are exactly the same as if we used view(df, :, :col1) and
view(df, :, [:col1, :col2]) respectively.

In this case ! is supported mainly to allow an easy annotation of whole
expressions using data frame indexing with @views, e.g. imagine you have
the following code:

julia> x = [1, 2, 3, 4]
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> df[!, 1] + x[1:3]
3-element Array{Int64,1}:
 2
 4
 6

and in order to avoid copying x you want to annotate the whole expression with
@views. Thanks to the fact that ! is supported with view you can just write:

julia> @views df[!, 1] + x[1:3]
3-element Array{Int64,1}:
 2
 4
 6
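
To see that both row selectors give a view onto the stored column, one can inspect the parent of the view. This is only a sketch under the behavior shown above; v_bang and v_colon are illustrative names:

```julia
using DataFrames

df = DataFrame(col1=[1, 2, 3], col2=['a', 'b', 'c'])
x = [1, 2, 3, 4]

v_bang = view(df, !, :col1)
v_colon = view(df, :, :col1)

# Both selectors produce a view whose parent is the column stored in df
same_parent = parent(v_bang) === df.col1 && parent(v_colon) === df.col1

# @views turns both indexing expressions into views before adding them
total = @views df[!, 1] + x[1:3]

# Writing through the view changes the data frame
v_bang[1] = 100
df_changed = df.col1[1] == 100
```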

Assigning to a single column

The difference between df[!, :col1] = 11:13 and df[:, :col1] = 11:13 is that
using ! puts a new column passed on the right hand side to the data frame
without copying it (no matter if the column exists or not in the data frame),
while : assigns to an existing column in-place.

Therefore df[!, :col1] = 11:13 is equivalent to df.col1 = 11:13. On the other
hand df[:, :col1] = 11:13 is equivalent to df.col1[:] = 11:13, if the column
:col1 is present in the data frame.

Here is an example:

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> col1 = df2.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2[!, :col1] = 11:13
11:13

julia> col1
3-element Array{Int64,1}:
 1
 2
 3

vs.

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> col1 = df2.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2[:, :col1] = 11:13
11:13

julia> col1
3-element Array{Int64,1}:
 11
 12
 13

You might have noticed that when I described : I added a condition that
it is equivalent to the df.col1[:] syntax only when the column is present in
the data frame. The reason is that if the column is not present in a data frame
then we have:

julia> df
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> newcol = [11, 12, 13]
3-element Array{Int64,1}:
 11
 12
 13

julia> df[:, :newcol] = newcol
3-element Array{Int64,1}:
 11
 12
 13

julia> df
3×3 DataFrame
 Row │ col1   col2  newcol
     │ Int64  Char  Int64
─────┼─────────────────────
   1 │     1  a         11
   2 │     2  b         12
   3 │     3  c         13

julia> df.newcol === newcol
false

So instead of an in-place operation (which is not possible as the column is not
present in the data frame), we get a copy operation.

On the other hand:

julia> df.newcol2[:] = newcol
ERROR: ArgumentError: column name :newcol2 not found in the data frame; existing most similar names are: :newcol

just fails as there is no column to index into.

The other special case is SubDataFrame, where using ! for assignment is not
allowed, just like for getproperty syntax:

julia> dfv = view(df, :, :)
3×3 SubDataFrame
 Row │ col1   col2  newcol
     │ Int64  Char  Int64
─────┼─────────────────────
   1 │     1  a         11
   2 │     2  b         12
   3 │     3  c         13

julia> dfv[!, :col1] = 1:3
ERROR: ArgumentError: setting index of SubDataFrame using ! as row selector is not allowed

julia> dfv.col1 = 1:3
ERROR: ArgumentError: Replacing or adding of columns of a SubDataFrame is not allowed. Instead use `df[:, col_ind] = v` or `df[:, col_ind] .= v` to perform an in-place assignment.

Assigning to multiple columns

This case is a bit simpler than the single column assignment case above. The
reason is that creating new columns is not allowed when multiple columns are
selected. Therefore the rule is: df[!, [:col1, :col2]] = new_values replaces
columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values
updates them in-place.

Note that new_values must be either a data frame or a matrix, and for ! the
columns in df will always be freshly allocated.
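
A minimal sketch of the two behaviors with a matrix on the right hand side. The data frame and the matrices here are made up for the illustration:

```julia
using DataFrames

df = DataFrame(col1=[1, 2, 3], col2=[4, 5, 6])
old_col1 = df.col1

# `:` writes the new values into the existing vectors
df[:, [:col1, :col2]] = [10 40; 20 50; 30 60]
updated_in_place = df.col1 === old_col1 && old_col1 == [10, 20, 30]

# `!` replaces the columns with freshly allocated vectors,
# so the element type is free to change
df[!, [:col1, :col2]] = [1.5 4.5; 2.5 5.5; 3.5 6.5]
replaced = df.col1 !== old_col1 && eltype(df.col1) == Float64
```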

Broadcasting assignment to a single column

This is the point where a bit of complexity is introduced, as now getproperty
syntax (i.e. df.col) behaves similarly to : indexing and not to ! indexing.

The rules are the following:

  • df[!, :col] .= v allocates a new column and replaces the old one or, if :col
    is not present in df, allocates and adds it;
  • df[:, :col] .= v updates the column in-place or, if :col is not present
    in df, allocates and adds it;
  • df.col .= v is only allowed if col is present in df and operates in-place.

Note that if :col is not present in df then using ! and : are equivalent.

Also note that in SubDataFrame it is not allowed to add new columns and !
syntax is not allowed.
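
The first two rules can be sketched as follows. The df.col .= v form is left out on purpose, and the column names new1 and new2 are only illustrative:

```julia
using DataFrames

df = DataFrame(col=[1, 2, 3])
old = df.col

# `:` broadcasts into the existing vector
df[:, :col] .= 10
in_place = df.col === old && old == [10, 10, 10]

# `!` allocates a new vector and swaps it in; the old one is untouched
df[!, :col] .= 20
replaced = df.col !== old && df.col == [20, 20, 20] && old == [10, 10, 10]

# For a missing column both selectors allocate and add it
df[!, :new1] .= 0
df[:, :new2] .= 0
added = names(df) == ["col", "new1", "new2"]
```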

Broadcasting assignment to multiple columns

Again this case is simpler than the single column broadcasting assignment case
above. The reason is that creating new columns is not allowed when multiple
columns are selected. Therefore the rule is: df[!, [:col1, :col2]] .= new_values
replaces columns :col1 and :col2 in df, while df[:, [:col1, :col2]] .= new_values
updates them in-place.
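
A short sketch of this rule, broadcasting a scalar into two existing columns of a made-up data frame:

```julia
using DataFrames

df = DataFrame(col1=[1, 2, 3], col2=[4, 5, 6])
old_col1 = df.col1

# `:` broadcasts into the existing vectors
df[:, [:col1, :col2]] .= 0
in_place = df.col1 === old_col1 && old_col1 == [0, 0, 0]

# `!` replaces the columns with new vectors, so the element type may change
df[!, [:col1, :col2]] .= 1.5
replaced = df.col1 !== old_col1 && eltype(df.col1) == Float64
```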

Summary of the cases

Wrapping up the cases we see that ! means the following:

  • in selection context: get me a column or a data frame without copying columns;
  • in views: make me a view (the same as : row selector);
  • in assignment to a single column: replace or add the column to a data frame
    without copying;
  • in assignment to multiple columns: replace the columns in a data frame
    with copying;
  • in broadcasting assignment: allocate a new column and store it (and in the case
    of a single column selector optionally add it if it is missing).

And : means the following:

  • in selection context: get me a column or a data frame with copying of columns;
  • in views: make me a view (the same as ! row selector);
  • in assignment to a single column: change the column in-place or add the column
    to a data frame with copying;
  • in assignment to multiple columns: change the columns in-place in a data frame;
  • in broadcasting assignment: perform in-place update of columns (and in the case
    of a single column selector optionally allocate and add it if it is missing).

Finally getproperty (the df.col style) means the following:

  • in selection context: get me a column without copying.
  • in assignment: replace or add the column to a data frame without copying;
  • in broadcasting assignment: update an existing column in-place.

In short (simplifying a bit):

  • ! gets you columns without copying and when setting columns it replaces them;
  • : gets you columns with copying and when setting columns it does this in-place;
  • getproperty gets you columns without copying and when setting columns it
    replaces them, except for broadcasting assignment, when it updates them in-place.

From a practical perspective the major difference between in-place and replace
operations is that replacing columns is needed if new values have a different
type than the old ones.

For instance here ! works and : fails:

julia> df
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> df[:, :col1] .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[!, :col1] .= "a"
3-element Array{String,1}:
 "a"
 "a"
 "a"

julia> df
3×2 DataFrame
 Row │ col1    col2
     │ String  Char
─────┼──────────────
   1 │ a       a
   2 │ a       b
   3 │ a       c

Another practical limitation is that broadcasting assignment like df.col .= v
is not allowed when :col is not present in a data frame (there is a chance that
in the future it will be allowed, see here).

Conclusions

As you can see there are cases when ! row selector is needed to cover all
potential use-cases. However, most common operations are done on a single
column and in this case:

  • for getting a column or assigning to a column instead of df[!, :col] and
    df[!, :col] = v it is usually better to just write df.col and
    df.col = v respectively as it is the same and simpler to type and read;
  • currently the case where ! is really needed is the broadcasting assignment
    context, where df[!, :col] .= v is the only relatively nice way to freshly
    allocate a column with v broadcasted into it (but when I look at the code of
    DataFrames.jl users this pattern is used much less frequently than we
    expected when we designed the ecosystem).

I hope this post was helpful. If you are interested in a definitive
specification of all the indexing rules in DataFrames.jl you can find them
here.

Mass transformations of data frames how-to

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/01/22/transforming.html

Introduction

A very common question related to the usage of DataFrames.jl is how
to perform mass transformations of data frames. Typically users want to apply
the same function to all columns, rows, or individual cells of a data frame.

In this post I want to summarize basic patterns that allow you to perform these tasks.
I split the examples by the type of task performed and the requested type of the
output of the operation.

The code was tested under Julia 1.5.3 and DataFrames 0.22.2.

In the post we will consider the following source data frame:

julia> using DataFrames

julia> df = DataFrame(reshape(1:24, 6, 4), :auto)
6×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      7     13     19
   2 │     2      8     14     20
   3 │     3      9     15     21
   4 │     4     10     16     22
   5 │     5     11     17     23
   6 │     6     12     18     24

Note that all columns of this data frame have the same type. This matters
because mass transformations applied to different columns usually require it.
It is not a strict rule, but I have found that e.g. trying to apply a function
that works on floats to strings is one of the most common sources of confusion
for users.

Each column to a vector

If you want to apply a transformation to each column and get a vector as a
result then use the eachcol iterator for this. Here are some options you might
find useful:

julia> sum.(eachcol(df))
4-element Array{Int64,1}:
  21
  57
  93
 129

julia> map(sum, eachcol(df))
4-element Array{Int64,1}:
  21
  57
  93
 129

julia> [sum(x) for x in eachcol(df)]
4-element Array{Int64,1}:
  21
  57
  93
 129

julia> [name => sum(x) for (name, x) in pairs(eachcol(df))]
4-element Array{Pair{Symbol,Int64},1}:
 :x1 => 21
 :x2 => 57
 :x3 => 93
 :x4 => 129
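
The same patterns work with any function of a column, not only sum. As a small sketch (the names ranges and maxima are made up here), an anonymous function can compute the range of each column, and pairs can be used to keep the column names next to the results:

```julia
using DataFrames

df = DataFrame(reshape(1:24, 6, 4), :auto)

# range (maximum minus minimum) of each column via an anonymous function
ranges = map(x -> maximum(x) - minimum(x), eachcol(df))

# keep the column names alongside the results
maxima = Dict(name => maximum(x) for (name, x) in pairs(eachcol(df)))
```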

Each column to a data frame

If you want to produce a data frame as a result of applying a function to all
columns you can either use mapcols or combine:

julia> combine(df, names(df) .=> sum)
1×4 DataFrame
 Row │ x1_sum  x2_sum  x3_sum  x4_sum
     │ Int64   Int64   Int64   Int64
─────┼────────────────────────────────
   1 │     21      57      93     129

julia> combine(df, names(df) .=> sum, renamecols=false)
1×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │    21     57     93    129

julia> combine(df, names(df) .=> sum .=> names(df), renamecols=false)
1×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │    21     57     93    129

julia> mapcols(sum, df)
1×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │    21     57     93    129

In general, as you can see, mapcols was designed to handle this scenario, while
combine can be used when you want to perform several different
transformations of the passed data frame (at the cost of being more verbose).
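
As an illustration of the more verbose but more flexible combine route, here is a sketch that applies a different function to each of three columns in a single call. The choice of functions is arbitrary:

```julia
using DataFrames

df = DataFrame(reshape(1:24, 6, 4), :auto)

# a different function per column, all in one combine call
res = combine(df, :x1 => sum, :x2 => maximum, :x3 => minimum)
```

The output column names are generated from the source column name and the function name, as in the first combine example above.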

Each row to a vector

In this case you have two major options. The basic one is to use eachrow:

julia> sum.(eachrow(df))
6-element Array{Int64,1}:
 40
 44
 48
 52
 56
 60

julia> map(sum, eachrow(df))
6-element Array{Int64,1}:
 40
 44
 48
 52
 56
 60

julia> [sum(x) for x in eachrow(df)]
6-element Array{Int64,1}:
 40
 44
 48
 52
 56
 60

This should be OK for most cases. The problem with this approach is that
eachrow is not type stable. So when you have very many rows or need column
type information in the values passed to the aggregation function use
Tables.namedtupleiterator:

julia> sum.(Tables.namedtupleiterator(df))
6-element Array{Int64,1}:
 40
 44
 48
 52
 56
 60

julia> map(sum, Tables.namedtupleiterator(df))
6-element Array{Int64,1}:
 40
 44
 48
 52
 56
 60

julia> [sum(x) for x in Tables.namedtupleiterator(df)]
6-element Array{Int64,1}:
 40
 44
 48
 52
 56
 60

which will be faster and type stable (but at the cost of having to be compiled,
which can be problematic if you have a lot of columns in your data frame as I
have recently explained in this post).

You might ask when one wants type stability in the context of small tables. Here
is an example:

julia> df2 = DataFrame(x1=[1, 2, missing], x2 = [1, missing, missing])
3×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64?
─────┼──────────────────
   1 │       1        1
   2 │       2  missing
   3 │ missing  missing

julia> (sum∘skipmissing).(Tables.namedtupleiterator(df2))
3-element Array{Int64,1}:
 2
 2
 0

julia> (sum∘skipmissing).(eachrow(df2))
ERROR: ArgumentError: reducing over an empty collection is not allowed

As you can see in the last row of df2 we have only missing values. If we are
in a type stable context, sum knows that it should produce an integer 0,
while in a type unstable context we get an error as it is impossible to tell
what should be the type of 0 that should be produced.

Each row to a data frame

This case is typically handled by using the combine or the select functions
(which in the considered scenario produce the same output) along with the
ByRow wrapper. Here are two examples differing in whether we pass rows as
consecutive positional arguments or as a NamedTuple to an aggregation
function:

julia> combine(df, names(df) => ByRow(+) => :sum)
6×1 DataFrame
 Row │ sum
     │ Int64
─────┼───────
   1 │    40
   2 │    44
   3 │    48
   4 │    52
   5 │    56
   6 │    60

julia> combine(df, AsTable(names(df)) => sum => :sum)
6×1 DataFrame
 Row │ sum
     │ Int64
─────┼───────
   1 │    40
   2 │    44
   3 │    48
   4 │    52
   5 │    56
   6 │    60

Note that in the NamedTuple passing option we are type stable so the following
code works as in the example from the previous section:

julia> combine(df2, AsTable(names(df2)) => ByRow(sum∘skipmissing) => :sum)
3×1 DataFrame
 Row │ sum
     │ Int64
─────┼───────
   1 │     2
   2 │     2
   3 │     0
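
The two argument-passing styles can also be compared side by side with a custom function. This is a sketch on the df data frame from the introduction; the anonymous functions and the :prod column name are made up for the example:

```julia
using DataFrames

df = DataFrame(reshape(1:24, 6, 4), :auto)

# positional arguments: the function receives one value per selected column
by_args = select(df, [:x1, :x2] => ByRow((a, b) -> a * b) => :prod)

# NamedTuple argument: the function receives the whole row as one object
by_nt = select(df, AsTable([:x1, :x2]) => ByRow(r -> r.x1 * r.x2) => :prod)
```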

Each cell to a matrix

In order to transform each cell and store the result in a matrix you have the
following basic options:

julia> Matrix(df) .^ 2
6×4 Array{Int64,2}:
  1   49  169  361
  4   64  196  400
  9   81  225  441
 16  100  256  484
 25  121  289  529
 36  144  324  576

julia> Matrix(df .^ 2)
6×4 Array{Int64,2}:
  1   49  169  361
  4   64  196  400
  9   81  225  441
 16  100  256  484
 25  121  289  529
 36  144  324  576

julia> [df[i, j]^2 for i in axes(df, 1), j in axes(df, 2)]
6×4 Array{Int64,2}:
  1   49  169  361
  4   64  196  400
  9   81  225  441
 16  100  256  484
 25  121  289  529
 36  144  324  576

In general the first of them (conversion to a Matrix and then working with it)
should be the fastest.

Each cell to a data frame

In this case one can use the same pattern as the second one above. Just write:

julia> df .^ 2
6×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     49    169    361
   2 │     4     64    196    400
   3 │     9     81    225    441
   4 │    16    100    256    484
   5 │    25    121    289    529
   6 │    36    144    324    576
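
Broadcasting is not limited to operators like .^ and works with named functions too. A brief sketch, with iseven and sqrt chosen only as examples:

```julia
using DataFrames

df = DataFrame(reshape(1:24, 6, 4), :auto)

# broadcasting a function over a data frame keeps the column names
flags = iseven.(df)

# the element type of the result follows the function's return type
roots = sqrt.(df)
```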

Conclusions

Now you should have a basic understanding of the different ways a data frame
can be transformed by-row, by-column, or by-cell. I have skipped the discussion
of analogous operations for GroupedDataFrame. If you want to perform
them per-group then the combine or select examples given above will
also just work with a GroupedDataFrame as a source of data.

Playing with Chain.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/01/15/chain.html

Introduction

DataFrames.jl was designed to support chaining of operations well.
For a long time my favorite package that helped with this was Pipe.jl.
It is very easy to understand how it works and is clear visually.

There are many alternative packages that support chaining, but they all required
much higher mental effort from the developer to master them. However, in
November 2020 Chain.jl was created and it really is as simple as
Pipe.jl, but at the same time more powerful. In this post I briefly investigate
what it has to offer.

This post was written with Julia 1.5.3, Chain.jl 0.4.2, and Combinatorics 1.0.2.

Experimenting with Chain.jl

The Chain.jl README.md does a really great job of explaining the why and
how of the package, so I refer you to the website to read the details. In short
it introduces:

  • a macro @chain,
  • an @aside annotation that can be used inside a @chain block to produce
    side effects,
  • _ that is used to signal where the value of the previous expression should be
    inserted (unless it is the first argument, in which case _ can be omitted).

Let me give one example usage of the @chain macro. Assume we have a 6-element
set and want to get all permutations of its 4-element subsets (if you
ever try to implement a Mastermind solver you might need it).
Here is how you can generate them using Chain.jl:

julia> using Chain

julia> using Combinatorics

julia> @chain 1:6 begin
                 combinations(4)
                 @aside println("# of combinations: ", length(_))
                 collect # we could skip this step
                 @. permutations
                 mapreduce(collect, vcat, _)
             end
# of combinations: 15
360-element Array{Array{Int64,1},1}:
 [1, 2, 3, 4]
 [1, 2, 4, 3]
 [1, 3, 2, 4]
 ⋮
 [6, 4, 5, 3]
 [6, 5, 3, 4]
 [6, 5, 4, 3]

Note that in the first call combinations(4) Chain.jl has put _ implicitly
as the first argument of the combinations function, so the actual call is
combinations(1:6, 4).

In the collect line the result of combinations(1:6, 4) is passed as a single
argument to collect (you do not need to write parentheses). Similarly in line
@. permutations we use the same pattern but this time we broadcast the
permutations function over a collection passed from the previous step of the
chain. If we want to pass other than the first argument then _ is used as
shown in the mapreduce(collect, vcat, _) line.

In the second line @aside is executed but is ignored in the pipeline. Note
that it would be tempting to write

@aside println("# of combinations: ", length)

instead of

@aside println("# of combinations: ", length(_))

The reason is that length takes only one argument. However, in this case a
call to length is nested so you have to pass _ explicitly.
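
The rules discussed so far can be sketched on a smaller example that does not need Combinatorics. The input vector and the lengths container are made up for the illustration:

```julia
using Chain

lengths = Int[]   # filled by the @aside step below

result = @chain [3, 1, 2] begin
    sort                              # bare name: the previous value is its only argument
    @aside push!(lengths, length(_))  # executed, but its value is not passed on
    map(x -> x^2, _)                  # _ marks a position other than the first argument
    sum
end
```

Here the @aside step records the length of the sorted vector, while map still receives the sorted vector itself.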

You can see that Chain.jl has two key features:

  • everything is wrapped in a begin-end block,
  • there is no visual separator (like standard |> in e.g. Pipe.jl)
    signaling an end of the expression.

Many people will find that it exactly fits their needs, but here is an
alternative syntax that I have found to be potentially usable with
Chain.jl:

@chain 1:6 (
    combinations(4);
    @aside println("# of combinations: ", length(_));
    collect; # we could skip it
    @. permutations;
    mapreduce(collect, vcat, _);
)

which produces the same result.

The difference here is that I replace the begin-end block with ( and ), so
it is a bit less typing. In this case one has to separate the expressions with
;. I also added ; at the end of the last expression, though it is not
strictly necessary, as in this way you can safely add/remove lines in @chain
without changing the remaining lines.

Whether having to add ; in this style is good or bad is a matter of taste. On one
hand it adds typing, but on the other hand it clearly shows the end of each
expression. In the begin-end style this is not explicit, which might sometimes
be confusing if someone needs to add a line break in an expression, as then
e.g. indentation has to be used to signal line continuation.

Also the () style has an additional benefit that in some editors it is easy to
select the code block enclosed in the parentheses if you need to
copy-paste the contents of @chain.

Conclusions

I think Chain.jl is excellent. If you like chaining function calls in
your code I really recommend you to check it out.