Author Archives: Blog by Bogumił Kamiński

Transforming multiple columns with multiple functions in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/09/multicol.html

Introduction

A common question asked in relation to mutation of data frame objects in
DataFrames.jl is how to apply multiple transformation functions to
multiple columns of a data frame. In this post I want to show several possible
approaches to this task.

The codes were run under Julia 1.6.1 and DataFrames.jl 1.2.0.

The basic approach

First create a data frame we will work with:

julia> using DataFrames

julia> df = DataFrame(id=repeat(1:2, 3), c1=1:6, c2=11:16)
6×3 DataFrame
 Row │ id     c1     c2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     2      2     12
   3 │     1      3     13
   4 │     2      4     14
   5 │     1      5     15
   6 │     2      6     16

Now assume we want to calculate minimum, maximum and sum of columns
:c1 and :c2. You can do it like this:

julia> combine(df, [:c1, :c2] .=> [minimum maximum sum])
1×6 DataFrame
 Row │ c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64       Int64       Int64       Int64       Int64   Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1          11           6          16      21      81

or like this:

julia> combine(df, [:c1 :c2] .=> [minimum, maximum, sum])
1×6 DataFrame
 Row │ c1_minimum  c1_maximum  c1_sum  c2_minimum  c2_maximum  c2_sum
     │ Int64       Int64       Int64   Int64       Int64       Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1           6      21          11          16      81

depending on what order of output columns you prefer.

Let us try to understand what is going on here. The combine function accepts
vectors or matrices of transformation specifications given as pairs using the
=> operator. If we check the internal broadcasting operation we see:

julia> [:c1 :c2] .=> [minimum, maximum, sum]
3×2 Matrix{Pair{Symbol, _A} where _A}:
 :c1=>minimum  :c2=>minimum
 :c1=>maximum  :c2=>maximum
 :c1=>sum      :c2=>sum

julia> [:c1, :c2] .=> [minimum maximum sum]
2×3 Matrix{Pair{Symbol, _A} where _A}:
 :c1=>minimum  :c1=>maximum  :c1=>sum
 :c2=>minimum  :c2=>maximum  :c2=>sum

In both cases we create a matrix of transformations. Note that the crucial
trick here is that we broadcast a vector againsst a 1-row matrix to get a final
matrix. For instance in [:c1 :c2] .=> [minimum, maximum, sum], the [:c1 :c2]
part is a 1-row matrix (notice that a space is separating its elements) and
[minimum, maximum, sum] is a vector (notice , separating its elements).

Exactly the same pattern can be applied to grouped data frames as you can see here:

julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (3 rows): id = 1
 Row │ id     c1     c2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     1      3     13
   3 │     1      5     15
⋮
Last Group (3 rows): id = 2
 Row │ id     c1     c2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2     12
   2 │     2      4     14
   3 │     2      6     16

julia> combine(gdf, [:c1, :c2] .=> [minimum maximum sum])
2×7 DataFrame
 Row │ id     c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64  Int64       Int64       Int64       Int64       Int64   Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1           1          11           5          15       9      39
   2 │     2           2          12           6          16      12      42

julia> combine(gdf, [:c1 :c2] .=> [minimum, maximum, sum])
2×7 DataFrame
 Row │ id     c1_minimum  c1_maximum  c1_sum  c2_minimum  c2_maximum  c2_sum
     │ Int64  Int64       Int64       Int64   Int64       Int64       Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1           1           5       9          11          15      39
   2 │     2           2           6      12          12          16      42

Programmatic specification of column names

Sometimes instead of writing down [:c1, :c2] manually one would want to select
the columns programmatically. Here are some approaches we currently support:

The most general one is by using the names function with an appropriate column
selector:

julia> combine(df, names(df, r"c") .=> [minimum maximum sum])
1×6 DataFrame
 Row │ c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64       Int64       Int64       Int64       Int64   Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1          11           6          16      21      81

or

julia> combine(df, names(df, Not(:id)) .=> [minimum maximum sum])
1×6 DataFrame
 Row │ c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64       Int64       Int64       Int64       Int64   Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1          11           6          16      21      81

With grouped data frames an option that is often useful is to use the
valuecols funcion that picks all non-grouping columns:

julia> combine(gdf, valuecols(gdf) .=> [minimum maximum sum])
2×7 DataFrame
 Row │ id     c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64  Int64       Int64       Int64       Int64       Int64   Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1           1          11           5          15       9      39
   2 │     2           2          12           6          16      12      42

As above, in all the cases it is useful to investigate the arguments
passed to the enclosing functions:

julia> names(df, r"c")
2-element Vector{String}:
 "c1"
 "c2"

julia> names(df, Not(:id))
2-element Vector{String}:
 "c1"
 "c2"

julia> valuecols(gdf)
2-element Vector{Symbol}:
 :c1
 :c2

Nested columns approach

Another approach one could use to apply multiple functions to multiple colums
is to create nested columns like this:

julia> combine(df,
               [:c1, :c2] .=> x -> [(min=minimum(x), max=maximum(x), sum=sum(x))],
               renamecols=false)
1×2 DataFrame
 Row │ c1                            c2
     │ NamedTup…                     NamedTup…
─────┼──────────────────────────────────────────────────────────────
   1 │ (min = 1, max = 6, sum = 21)  (min = 11, max = 16, sum = 81)

or

julia> combine(df,
               [:c1, :c2] .=> x -> Ref((min=minimum(x), max=maximum(x), sum=sum(x))),
               renamecols=false)
1×2 DataFrame
 Row │ c1                            c2
     │ NamedTup…                     NamedTup…
─────┼──────────────────────────────────────────────────────────────
   1 │ (min = 1, max = 6, sum = 21)  (min = 11, max = 16, sum = 81)

Notice that the anonymous function we use produces a NamedTuple, so we need
to protect it against being expanded by wrapping it with a vector or Ref object.
This trick similar to what is done in broadcasting in Julia Base. Let me give
a simple example:

julia> [1 2] .=> (a=1, b=2)
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

julia> [1 2] .=> [(a=1, b=2)]
1×2 Matrix{Pair{Int64, NamedTuple{(:a, :b), Tuple{Int64, Int64}}}}:
 1=>(a = 1, b = 2)  2=>(a = 1, b = 2)

julia> [1 2] .=> Ref((a=1, b=2))
1×2 Matrix{Pair{Int64, NamedTuple{(:a, :b), Tuple{Int64, Int64}}}}:
 1=>(a = 1, b = 2)  2=>(a = 1, b = 2)

and as you can see wrapping a NamedTuple in a vector or Ref made broadcasting
use it “as is” (essentially we had to create a 1-element container that was
broadcasted over). As a side note: Ref is usually a technically preferred way
to make such a protection, but vector is easier to type and in most cases it
produces exactly the same result, so I often use it.

Going back to our aggregation case, the nested column is most useful with
grouped data frame:

julia> df_agg = combine(gdf,
                        [:c1, :c2] .=> x->[(min=minimum(x), max=maximum(x), sum=sum(x))],
                        renamecols=false)
2×3 DataFrame
 Row │ id     c1                            c2
     │ Int64  NamedTup…                     NamedTup…
─────┼─────────────────────────────────────────────────────────────────────
   1 │     1  (min = 1, max = 5, sum = 9)   (min = 11, max = 15, sum = 39)
   2 │     2  (min = 2, max = 6, sum = 12)  (min = 12, max = 16, sum = 42)

Finally you might ask how to un-nest the columns. If you want to do it for a
single column you can just write:

julia> select(df_agg, :id, :c1 => AsTable)
2×4 DataFrame
 Row │ id     min    max    sum
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      1      5      9
   2 │     2      2      6     12

However, for multiple columns this will fail:

julia> select(df_agg, :id, [:c1, :c2] .=> AsTable)
ERROR: ArgumentError: Duplicate column name(s) returned: :min, :max, :sum

The problem is that, as you can see, we would try to create columns with names
:min, :max, and :sum twice, which is disallowed.

What you can do instead is to provide explicit column names:

julia> select(df_agg, :id, [:c1, :c2] .=> [c .* ["min", "max", "sum"] for c in ["c1_", "c2_"]])
2×7 DataFrame
 Row │ id     c1_min  c1_max  c1_sum  c2_min  c2_max  c2_sum
     │ Int64  Int64   Int64   Int64   Int64   Int64   Int64
─────┼───────────────────────────────────────────────────────
   1 │     1       1       5       9      11      15      39
   2 │     2       2       6      12      12      16      42

As usual, it is worth to investigate the argument passed to the combine function
to make sure we understand exactly the syntax:

julia> [:c1, :c2] .=> [c .* ["min", "max", "sum"] for c in ["c1_", "c2_"]]
2-element Vector{Pair{Symbol, Vector{String}}}:
 :c1 => ["c1_min", "c1_max", "c1_sum"]
 :c2 => ["c2_min", "c2_max", "c2_sum"]

Conclusions

In conclusion let me highlight one important feature of the syntax using the
=> operator that we have exploited in this post.

The trick is that => is a callable (technically it calls a Pair
constructor). Therefore it is very easy to use it in programmatic scenarios
when one might want to specify column names or applied functions at run time
e.g. by storing them in some variables.

What is new in DataFrames.jl 1.2.0?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/02/dataframes-1.2.0.html

Introduction

DataFrames.jl version 1.2.0 has just been released. In this post I want
to discuss the main new user visible features we have introduced.

The codes were run under Julia 1.6.1 and DataFrames.jl 1.2.0.

New functionalities

There are three major new functionalities introduced by the 1.2.0 release. Let
me explain them one by one.

matchmissing=:notequal keyword argument in joins

Before 1.2.0 release missing values in on-columns in joins either were
considered to be equal (when matchmissing=:equal was passed) or produced an
error (when matchmissing=:error, this is a default behavior). Now you can
also pass matchmissing=:notequal in which case missing values are considered
as not matching. Here is a simple example comparing the three options:

julia> using DataFrames

julia> df1 = DataFrame(id=[1, missing, 3], left=1:3)
3×2 DataFrame
 Row │ id       left
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │       3      3

julia> df2 = DataFrame(id=[1, missing, missing], right=1:3)
3×2 DataFrame
 Row │ id       right
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │ missing      3

julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error

julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
3×3 DataFrame
 Row │ id       left   right
     │ Int64?   Int64  Int64
─────┼───────────────────────
   1 │       1      1      1
   2 │ missing      2      2
   3 │ missing      2      3

julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
 Row │ id      left   right
     │ Int64?  Int64  Int64
─────┼──────────────────────
   1 │      1      1      1

A new syntax for column expansion in transformation functions

Users often store nested data structures in columns of a data frame.
In such cases, a frequent request is to unnest such a column.

Before 1.2.0 release one had to perform this operation like this:

julia> df = DataFrame(col=[Dict("a"=>1, "b"=>2), Dict("a"=>3, "b"=>4)])
2×1 DataFrame
 Row │ col
     │ Dict…
─────┼──────────────────────
   1 │ Dict("b"=>2, "a"=>1)
   2 │ Dict("b"=>4, "a"=>3)

julia> transform(df, :col => identity => AsTable)
2×3 DataFrame
 Row │ col                   b      a
     │ Dict…                 Int64  Int64
─────┼────────────────────────────────────
   1 │ Dict("b"=>2, "a"=>1)      2      1
   2 │ Dict("b"=>4, "a"=>3)      4      3

Now, a simpler syntax is allowed, that does not require the user to write
identity part of the transformation specification (just like in column
renaming syntax), so the following code works

julia> transform(df, :col => AsTable)
2×3 DataFrame
 Row │ col                   b      a
     │ Dict…                 Int64  Int64
─────┼────────────────────────────────────
   1 │ Dict("b"=>2, "a"=>1)      2      1
   2 │ Dict("b"=>4, "a"=>3)      4      3

and produces the same result.

subset! now correctly updates passed GroupedDataFrame

The subset! function was a new addition in 1.0.0 release. Therefore,
given the user feedback, we are adding some polishing touches to it.

Before 1.2.0 passing a GroupedDataFrame to subset! produced a correct
result, but could potentially corrupt the passed GroupedDataFrame (a proper
information about this was given in the documentation; such a design
was chosen to improve performance). However, such a behavior was found to be
error prone. Therefore in 1.2.0 release an efficient algorithm updating not
only the parent data frame but also GroupedDataFrame itself was implemented.

Here is an example of the current behavior:

julia> using Statistics

julia> df = DataFrame(id=repeat([1, 2], 4), x=1:8)
8×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8

julia> gd = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      3
   3 │     1      5
   4 │     1      7
⋮
Last Group (4 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      4
   3 │     2      6
   4 │     2      8

julia> subset!(gd, :x => x -> x .> mean(x)) # pick rows with :x above group mean
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6
   3 │     1      7
   4 │     2      8

julia> gd
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     1      7
⋮
Last Group (2 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      6
   2 │     2      8

In the above operation both df and gd get properly updated in-place
(previously only df was changed, but gd was left unchanged, and thus
it was corrupted).

Deprecated functionality

In the beginning of development of DataFrames.jl the design of DataFrame
was very close to a matrix. Over the years the consensus was reached that
we should rather treat it as Tables.jl table. However, the legacy thinking
was still reflected in the design of hcat function, which allowed horizontal
concatenation of a data frame with a vector, just like it is allowed for
matrices. Unfortunately this approach conflicts with the fact that currently
many vectors are supporting Tables.jl table interface and when doing hcat
users would prefer them to be treated as such.

I think the issue is easiest explained with an example. The Julia session
shown below was started with --depwarn=yes flag:

julia> using DataFrames

julia> df = DataFrame(col1='a':'c')
3×1 DataFrame
 Row │ col1
     │ Char
─────┼──────
   1 │ a
   2 │ b
   3 │ c

julia> hcat(df, [(x=i, y=10+i) for i in 1:3])
┌ Warning: horizontal concatenation of data frame with a vector is deprecated. Pass DataFrame(x1=x) instead.
│   caller = ip:0x0
└ @ Core :-1
3×2 DataFrame
 Row │ col1  x1
     │ Char  NamedTup…
─────┼───────────────────────
   1 │ a     (x = 1, y = 11)
   2 │ b     (x = 2, y = 12)
   3 │ c     (x = 3, y = 13)

As you can see a vector of NamedTuples, although it is a Tables.jl table,
is just horizontally concatenated to df as a new column with auto generated
name :x1.

However, most likely the user expected the following result (but without having
to use DataFrame constructor):

julia> hcat(df, DataFrame([(x=i, y=10+i) for i in 1:3]))
3×3 DataFrame
 Row │ col1  x      y
     │ Char  Int64  Int64
─────┼────────────────────
   1 │ a         1     11
   2 │ b         2     12
   3 │ c         3     13

In order to allow this behavior in the future, as you see above, passing a
vector to hcat when the other argument is a data frame is currently deprecated.

Conclusions

I hope you will enjoy the new features we have shipped in the 1.2.0 release of
the DataFrames.jl.

Apart from the changes discussed above several minor ones, mostly in the areas
of performance, display, and documentation have been made. You can find a more
detailed list of things changed in the 1.2.0 release notes.

Also remember that the NEWS.md file in the project repository is maintained
to give synthetic information of the most important changes introduced in
the releases of DataFrames.jl.

Benchmarking push! in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/06/25/push.html

Introduction

I really like the fact that push! is a convenient and fast way to add rows
to a DataFrame in DataFrames.jl. Recently I have noticed that users
also employ it quite often. Therefore I thought to post some benchmarks of
timing of push! operation for different types of rows pushed.

The tests were run under Julia 1.6.1, DataFrames.jl 1.1.1,
and BenchmarkTools.jl 1.0.0.

The benchmarks

In the benchmark I thought of testing one of the most common scenarios:

  • pushed columns have a homogeneous type;
  • we use the default settings of keyword arguments: cols=:setequal
    and promote=false.

In the test we check the time to push 10,000 rows to a data frame and consider
the following supported types of rows pushed:

  • DataFrameRow,
  • NamedTuple,
  • Tuple,
  • Vector,
  • Dict.

I want to see which option is the fastest and how does the push! speed depend
on the number of columns pushed.

Here is the test code:

julia> using BenchmarkTools

julia> using DataFrames

julia> function test_push(v)
           df = DataFrame(Symbol.("x", 1:length(v[1])) .=> Ref(Int[]))
           foreach(r -> push!(df, r), v)
           return df
       end
test_push (generic function with 1 method)

julia> function run_test(n)
           rows_dfr = [DataFrame(Symbol.("x", 1:n) .=> i)[1, :] for i in 1:10000];
           rows_nt = NamedTuple.(rows_dfr)
           rows_tup = Tuple.(rows_dfr)
           rows_vec = Vector.(rows_dfr)
           rows_dict = Dict.(pairs.(rows_dfr))
           return (n=n,
                   dfr=@belapsed(test_push($rows_dfr)),
                   nt=@belapsed(test_push($rows_nt)),
                   tup=@belapsed(test_push($rows_tup)),
                   vec=@belapsed(test_push($rows_vec)),
                   dict=@belapsed(test_push($rows_dict)))
       end
run_test (generic function with 1 method)

julia> @time res = run_test.(2 .^ (0:7)) |> DataFrame;
491.489886 seconds (10.05 G allocations: 318.857 GiB, 15.44% gc time, 1.90% compilation time)

As you can see the test is quite time consuming (and that is why the results
are worth posting).

Before I move forward let me comment on one line in the code, namely:

df = DataFrame(Symbol.("x", 1:length(v[1])) .=> Ref(Int[]))

Many Julia users will immediately spot that the
Symbol.("x", 1:length(v[1])) .=> Ref(Int[])
pattern will lead to Int[] vector being reused in all Pairs in the produced
vector, e.g.

julia> tmp = Symbol.("x", 1:2) .=> Ref(Int[])
2-element Vector{Pair{Symbol, Vector{Int64}}}:
 :x1 => []
 :x2 => []

julia> tmp[1].second === tmp[2].second
true

Normally creation of such collections is strongly discouraged as it can lead
to hard-to-catch bugs. Instead a typical recommendation is to use comprehension
which guarantees that Int[] vector is freshly allocated in each iteration, e.g.:

julia> tmp = [Symbol("x", i) => Int[] for i in 1:2]
2-element Vector{Pair{Symbol, Vector{Int64}}}:
 :x1 => []
 :x2 => []

julia> tmp[1].second === tmp[2].second
false

So why it was OK to use it in my code? The reason is that DataFrame constructor
copies all columns by default so aliased columns do not get stored in the
produced data frame object.

Going back to our main thread, here are the collected timings:

julia> res
8×6 DataFrame
 Row │ n      dfr         nt          tup         vec         dict
     │ Int64  Float64     Float64     Float64     Float64     Float64
─────┼───────────────────────────────────────────────────────────────────
   1 │     1  0.00716675  0.00171069  0.00121046  0.00125374  0.00170707
   2 │     2  0.0093752   0.00292054  0.00194945  0.0019941   0.00287937
   3 │     4  0.0129853   0.00545446  0.00347197  0.0036109   0.00484835
   4 │     8  0.0237338   0.0103292   0.0064206   0.0066019   0.00848298
   5 │    16  0.0372345   0.0200449   0.0120538   0.0123195   0.0150318
   6 │    32  0.0717086   0.0667088   0.0226685   0.0229196   0.0291269
   7 │    64  0.13517     0.192137    0.045016    0.0456765   0.0610755
   8 │   128  0.284317    0.727081    0.0974844   0.0962173   0.126006

To simplify assessment of the results let me plot them.
In order to improve the readability of the comparisons below I plot the time
per one column (which means that if the push! operation scales well with number
of columns the plots should be slightly decreasing). Here is the code producing the plot:

julia> using Plots

julia> plot(Matrix(res)[:, 2:end] ./ res.n,
            labels=permutedims(names(res)[2:end]),
            xticks=(axes(res, 1), res.n), legend=:top)

And the plot itself:

push! timing

Conclusions

Here are the take-aways:

  • pushing Tuple or Vector is fastest; the reason is that in this case
    we do not care about column names;
  • Dict is quite fast; the overhead is mostly due to column name lookup;
  • DataFrameRow is slow; this is not surprising, as it is type unstable;
  • NamedTuple is comparable to Dict for small number of columns but it
    does not scale well and for large number of columns it is very slow.

A simple take-away is: if you generate a lot of data (e.g. in a simulation) then
prefer to produce a Vector or a Tuple for push!ing into a DataFrame as
this is fastest.