Author Archives: Blog by Bogumił Kamiński

Nesting and unnesting columns in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/03/11/unnesting.html

Introduction

Today I want to discuss ways to nest and unnest columns of a data frame.

We say that we nest several columns, when we take them together and turn
into one column, usually containing NamedTuples.

Unnesting is a reverse process, we take a column storing e.g. NamedTuples,
and create several columns out of it.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.2, and Tables.jl
1.7.0.

Column nesting

Column nesting is relatively simple in DataFrames.jl. You just need to use
ByRow(identity) transformation on AsTable source. Here is an example where
we nest all columns from a source data frame:

julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> transform(df, AsTable(:) => ByRow(identity) => :nested)
3×4 DataFrame
 Row │ a      b      c      nested
     │ Int64  Int64  Int64  NamedTup…
─────┼────────────────────────────────────────────
   1 │     1      4      7  (a = 1, b = 4, c = 7)
   2 │     2      5      8  (a = 2, b = 5, c = 8)
   3 │     3      6      9  (a = 3, b = 6, c = 9)

This works because AsTable passes NamedTuple objects to the function,
so we just need to apply identity row-wise to get the desired result.

Basic column unnesting

If you want to perform a reverse process things are also relatively simple, you
just pass the nested column name as source and AsTable as target column name:

julia> df2 = select(df, AsTable(:) => ByRow(identity) => :nested)
3×1 DataFrame
 Row │ nested
     │ NamedTup…
─────┼───────────────────────
   1 │ (a = 1, b = 4, c = 7)
   2 │ (a = 2, b = 5, c = 8)
   3 │ (a = 3, b = 6, c = 9)

julia> transform(df2, :nested => AsTable)
3×4 DataFrame
 Row │ nested                 a      b      c
     │ NamedTup…              Int64  Int64  Int64
─────┼────────────────────────────────────────────
   1 │ (a = 1, b = 4, c = 7)      1      4      7
   2 │ (a = 2, b = 5, c = 8)      2      5      8
   3 │ (a = 3, b = 6, c = 9)      3      6      9

Complex column unnesting

Sometimes you might have a situation where you have a nested column
that has heterogeneous contents (i.e. has different column names in different rows).
In such a scenario basic unnesting pattern does not work as it requires all rows
to have the same schema:

julia> df3 = DataFrame(nested = [(a=1, b=2), (b=3, c=4), (a=5, c=6)])
3×1 DataFrame
 Row │ nested
     │ NamedTup…
─────┼────────────────
   1 │ (a = 1, b = 2)
   2 │ (b = 3, c = 4)
   3 │ (a = 5, c = 6)

julia> transform(df3, :nested => AsTable)
ERROR: ArgumentError: keys of the returned elements must be identical

If you have such a situation you can use Tables.dictcolumntable as a
transformation function:

julia> transform(df3, :nested => Tables.dictcolumntable => AsTable)
3×4 DataFrame
 Row │ nested          a        b        c
     │ NamedTup…       Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │ (a = 1, b = 2)        1        2  missing
   2 │ (b = 3, c = 4)  missing        3        4
   3 │ (a = 5, c = 6)        5  missing        6

As you can see the Tables.dictcolumntable has “column unioning” behavior.
When some row does not have a column that is present in other rows it gets
a missing value instead.

Conclusions

Column nesting and unnesting is needed when you work with data that has
hierarchical structure. A common example of such a scenario is JSON data. I
hope you will find the patterns I have discussed in this post useful in your
work.

Confused by Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/03/04/wat.html

Introduction

Recently, I was confused by how Julia parser works and complained on Julia Slack
about it (in a moment I will explain what confused me). Then I learned Miguel
Raz Guzmán Macedo has a very nice post about surprising behaviors of
Julia, so today I thought to promote Miguel’s blog :).

In my post, not to steal all the fun you will have when reading
Miguel’s blog, I will write about Julia’s behavior that surprised me
and three behaviors, related to operator precedence, that commonly lead to bugs.

The post was written under Julia 1.7.

What surprised me

The behavior of Julia that caught me off guard is:

julia> -a = 10
- (generic function with 1 method)

As you can see, by accident instead of writing -a == 10 I have written
-a = 10.
In consequence instead of doing an equality test we have defined a new function
for the - operator in module Main overshadowing the - definition from the
Base module, as you can see here:

julia> -(50)
10

julia> 1 - 2
ERROR: MethodError: no method matching -(::Int64, ::Int64)
You may have intended to import Base.:-
Closest candidates are:
  -(::Any) at REPL[1]:1

The reason of this behavior is that for Julia’s parser writing -a = 10 means
the same as writing -(a) = 10, which, can be recognized as a one-line
function definition syntax.

Why is this behavior problematic? Once you have defined a new function for -
in Main you have two options. Either restart your REPL or do - = Base.:- to
bind Base.:- with - defined in Main (I would recommend restarting
REPL instead of doing the work-around).

Operator precedence corner cases

Here are three cases of operator precedence surprises in Julia.

Scenario 1: & and |

When you write:

julia> 1 == 3 & 1 == 1
true

instead of expected false you get true. The reason is that you probably
thought that the parser will interpret your expression as:

julia> (1 == 3) & (1 == 1)
false

While Julia interprets it as:

julia> 1 == (3 & 1) == 1
true

Scenario 2: ranges

When you write:

julia> 1:2 .+ 3
1:5

you might have expected:

julia> (1:2) .+ 3
4:5

but actually this is interpreted as:

julia> 1:(2 + 3)
1:5

Scenario 3: pairs and anonymous functions

When you write:

julia> :a => x -> x => :b
:a => var"#1#2"()

you probably expect:

julia> :a => (x -> x) => :b
:a => (var"#3#4"() => :b)

but in reality you get:

julia> :a => (x -> x => :b)
:a => var"#5#6"()

The last scenario is relevant in DataFrames.jl, where we often use syntax:

source_column => (x -> some_anonymous_function_body) => target_column_name

Conclusions

As any programming language Julia has some syntax corner cases that can be
surprising. The problems with operator precedence I have listed in this post
have a simple practical solution: if you are unsure about operator precedence
be explicit and use parentheses to clearly signal how you want your expression
to be evaluated.

Any and all reduction of rows in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/25/anyall.html

Introduction

Recently on Stack Overflow there was an interesting question about
row-wise aggregation of boolean columns in DataFrames.jl.

I think that the solution is an interesting exercise in functionalities of
DataFrames.jl.

The post was written under Julia 1.7 and DataFrames.jl 1.3.2.

The question

Assume we have the following data frame:

julia> using DataFrames, Random

julia> Random.seed!(1234);

julia> df = DataFrame(rand(Bool, 10, 3), :auto)
10×3 DataFrame
 Row │ x1     x2     x3
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │  true  false   true
   2 │ false  false   true
   3 │ false  false  false
   4 │  true  false  false
   5 │  true   true  false
   6 │  true   true   true
   7 │  true  false   true
   8 │  true  false  false
   9 │ false  false  false
  10 │  true  false   true

We now would like to apply all and any functions to this data frame
row-wise. The basic approach to this task would be to do the following solution:

julia> transform(df, AsTable(r"x") .=> ByRow.([all, any]))
10×5 DataFrame
 Row │ x1     x2     x3     x1_x2_x3_all  x1_x2_x3_any
     │ Bool   Bool   Bool   Bool          Bool
─────┼─────────────────────────────────────────────────
   1 │  true  false   true         false          true
   2 │ false  false   true         false          true
   3 │ false  false  false         false         false
   4 │  true  false  false         false          true
   5 │  true   true  false         false          true
   6 │  true   true   true          true          true
   7 │  true  false   true         false          true
   8 │  true  false  false         false          true
   9 │ false  false  false         false         false
  10 │  true  false   true         false          true

The problem with this approach is that it does not scale well when you have
many columns in your data frame:

julia> df2 = DataFrame(rand(Bool, 10_000, 1000), :auto);

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
 12.026759 seconds (1.13 M allocations: 730.530 MiB, 0.39% gc time, 42.79% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
  6.746910 seconds (145.42 k allocations: 678.763 MiB, 0.29% gc time)

As you can see it has both high compilation cost and high execution time.

Speeding up the computation

The basic way to resolve this issue is to use the composition with collect, as
the DataFrames.table_transformation documentation suggests us. In this way we
avoid constructing a very wide NamedTuple, which is problematic in this
case.

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
  0.616105 seconds (893.18 k allocations: 69.745 MiB, 3.26% gc time, 80.07% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
  0.129053 seconds (26.39 k allocations: 21.293 MiB, 6.65% gc time)

This is much better. However, we can further notice that any function is
essentially the same as maximum, and all the same as minimum for boolean
data. In this case we do not have to use the collect composition:

julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
  0.393608 seconds (783.29 k allocations: 42.571 MiB, 2.45% gc time, 96.49% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
  0.002593 seconds (5.37 k allocations: 503.359 KiB)

This time things are very fast, as row-wise aggregation for maximum
and minimum is optimized.

Conclusions

There are the following lessons from the examples we have given today:

  • when performing row-wise aggregation the default AsTable timings can be bad,
    because they construct an intermediate NamedTuple, which is expensive
    (of course sometimes it is useful, if you want to refer to column names in
    transformations – that is why this form is supported);
  • to avoid construction of NamedTuple use composition with collect, which
    uses a faster path that passes a vector to the aggregation function directly;
  • an even faster option is available for selected list of most common aggregation
    functions, which includes: sum, length, mean, var, std, median,
    minimum, and maximum (all potentially composed with skipmissing).