The curious case of subset condition

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/01/28/subset.html

Introduction

Recently on Julia Slack there was a question about using the subset function
to drop whole groups from GroupedDataFrame in DataFrames.jl.
I thought that indeed this case is tricky enough to be worth a post.

The examples were tested under Julia 1.7.0 and DataFrames.jl 1.3.2.

Standard use cases of the subset function

Let us start with creating some sample data:

julia> using DataFrames

julia> df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
6×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     2      5
   6 │     2      6

julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
⋮
Last Group (2 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     2      6

Assume we want to keep rows having value of :x less than the mean of this
column from df. This can be achieved with:

julia> using Statistics

julia> subset(df, :x => x -> x .< mean(x))
3×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3

The same operation can be easily done groupwise. Now we keep rows that have the
value of :x less than the mean of this column per group defined by :id:

julia> subset(gdf, :x => x -> x .< mean(x))
3×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     2      5

The limitation of the subset contract

The subset function requires that the return value of the passed condition
is a vector. Therefore the following operation fails:

julia> subset(df, :x => x -> true)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

although we might expect that broadcasting would be applied to the result of
the function and all rows would be kept. For a reference e.g. select would
perform such broadcasting automatically:

julia> select(df, All(), :x => x -> true)
6×3 DataFrame
 Row │ id     x      x_function
     │ Int64  Int64  Bool
─────┼──────────────────────────
   1 │     1      1        true
   2 │     1      2        true
   3 │     1      3        true
   4 │     1      4        true
   5 │     2      5        true
   6 │     2      6        true

You might wonder why this restriction is made. Initially we allowed non-vector
return values, but they turned to be confusing for the users so we disallowed
them.

Let me give an example. If the user wants to keep all rows for which the :id
column is equal to 1 one should write:

julia> subset(df, :id => ByRow(==(1)))
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

However, it turned out that users frequently were forgetting to add ByRow
wrapper and instead used:

julia> subset(df, :id => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

Now it throws an error, but if we have not imposed the restriction that we require
a vector to be returned we would get the following result:

julia> subset(df, :id => x -> fill(x == 1, length(x)))
0×2 DataFrame

as the whole column :id would be compared to 1 and the result of this
comparison is false.

Dropping whole groups from a GroupedDataFrame

The requirement that the condition must return a vector was added for safety
reasons. However, there is one case when it is a bit problematic.

Assume we want to keep from the gdf GroupedDataFrame all groups for which
the mean of :x column is less than 3. The problem is that the following
condition fails:

julia> subset(gdf, :x => x -> mean(x) < 3)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

since the comparing the mean of the :x column to 3 produces a scalar Bool
value.

The solution is to manually expand the result of the condition to match the
number of rows in the group:

julia> subset(gdf, :x => x -> fill(mean(x) < 3, length(x)))
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

This is unfortunately a bit inconvenient.

An alternative approach would be to use the filter function which applied
to GroupedDataFrame always works on whole groups:

julia> filter(:x => x -> mean(x) < 3, gdf) |> DataFrame
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

(we had to pass the result of filter to DataFrame constructor, as otherwise
we would get a filtered GroupedDataFrame)

Conclusions

The design of subset I discussed in this post shows one of the challenges we
face when defining APIs in DataFrames.jl. There often is a tension between
developer convenience and safety. In this example allowing only vectors as
results of conditions in the subset function is safer since it allows to
catch some common bugs in the users code. The cost is that in some cases
(most notably dropping whole groups from a GroupedDataFrame) it is a bit
inconvenient.