Handling summary statistics of empty collections

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/17/empty.html

Introduction

When designing solutions for data analysis one is often faced with a tough
choice between doing the correct thing and the convenient thing.

One particular case of such a situation is computing summary statistics of
empty collections. The reason is that such statistics are often not properly
defined for empty data (so Julia throws an error), while a data scientist
would instead want to get some flag value. Today I want to discuss typical
cases of such situations and possible solutions.

The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.

Sum and product

In this case the situation is the least problematic. Typically you will get
what you expect (i.e. the zero or the one, respectively, of the domain of
values you aggregate):

julia> sum(Int[])
0

julia> prod(Float64[])
1.0

julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0

julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0

The only problematic case is when you work with an empty container whose
element type is too wide:

julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})

A standard solution in this case, since both sum and prod are reductions, is
to provide an initialization value manually:

julia> sum([], init=0)
0

julia> prod([], init=1.0)
1.0
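
The init keyword argument should also help in the, admittedly less common,
case when the element type stays too wide even after skipping missing values.
Here is a small example:

julia> sum(skipmissing(Any[missing, missing]), init=0)
0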

Minimum and maximum

When computing minimum and maximum by default you get an error with empty
collections:

julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer

As you can see, we get an error and are prompted to pass the init value for
the reduction, so the situation is less convenient.

Indeed, if such an init value can be reasonably passed, this is a good
solution:

julia> minimum(Float64[], init=Inf)
Inf

julia> maximum(Int[], init=typemin(Int))
-9223372036854775808

Alternatively, if we know we are working with values that must fall in some
range, we can provide the bounds of this range. A common case is probabilities:

julia> minimum([], init=1.0)
1.0

julia> maximum([], init=0.0)
0.0

Sometimes, however, we might want to have a special signal value. In this case
you have two options: one is to check if the collection is empty, the other is
to catch the exception:

julia> x = Int[]
Int64[]

julia> isempty(x) ? missing : minimum(x)
missing

julia> try
           minimum(x)
       catch e
           isa(e, MethodError) ? missing : rethrow(e)
       end
missing

You could wrap both solutions in a function for convenience if you use them
often in your code. Their downside is that they add a bit of computational
overhead. The isempty check is in some cases not an O(1) operation; the most
common such case is skipmissing. The try/catch approach introduces the cost of
handling the exception.
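
For example, with skipmissing the isempty check has to scan the data for the
first non-missing element, but it still gives us the signal value we want (a
small sketch):

julia> x = skipmissing(Union{Int, Missing}[missing, missing]);

julia> isempty(x) ? missing : minimum(x)
missing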

Extrema

In the case of the extrema function the situation is analogous to minimum and
maximum. The only difference is that you pass a tuple of two values to init if
you want to use this approach. Here is an example assuming we are processing
data that are probabilities:

julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)

Note that in this case the minimum is greater than the maximum, so we can
immediately see that the passed collection was empty.
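
If we also want a signal value for extrema, the isempty check from the
previous section works the same way; a small sketch returning a tuple of
missing values could look like this:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? (missing, missing) : extrema(x)
(missing, missing)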

Mean, variance, and standard deviation

When computing mean, var, or std, we get NaN when working with an empty
collection:

julia> using Statistics

julia> mean(Int[])
NaN

julia> var(Float64[])
NaN

julia> std(Float64[])
NaN

This is expected, as we are performing a division by zero in their computation.
Also, similarly to sum and prod, when the collection has an element type that
is too wide we get an error:

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

If we do not like this default behavior and want to handle an empty collection
as a special case, checking if the container is empty is a standard solution:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : var(x)
missing
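
Note that, as far as I can tell, the same NaN (and not an error) is produced
when all values in a collection are missing and we skip them, as the division
by zero happens just the same:

julia> mean(skipmissing(Union{Int, Missing}[missing, missing]))
NaN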

Median and quantile

Computing quantiles, and the median in particular, is the least convenient
case, as for them we always get an error and cannot use an init value (they
are not reductions):

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector

Here, currently, the only solution is to either check if the collection is empty
or catch the exception:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : median(x)
missing

julia> try
           quantile(x, 0.1)
       catch e
           isa(e, ArgumentError) ? missing : rethrow(e)
       end
missing

Conclusions

My post today was meant to be a quick reference for Julia users who sometimes
hit these issues when working with their data. My experience is that the most
common scenario of this kind is connected with missing data. Here is a typical
problematic case:

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df = DataFrame(id=rand(1:10^6, 10^6),
                      value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
     Row │ id      value
         │ Int64   Int64?
─────────┼─────────────────
       1 │ 325977        4
       2 │ 549052        9
       3 │ 218587        9
       4 │ 894246        8
       5 │ 353112        1
       6 │ 394256       10
       7 │ 953125  missing
       8 │ 795547        5
       9 │ 494250        1
    ⋮    │   ⋮        ⋮
  999993 │ 967428        9
  999994 │ 557085        1
  999995 │ 353965        5
  999996 │ 590548       10
  999997 │ 657727        2
  999998 │ 928733        3
  999999 │ 884126  missing
 1000000 │ 587503        2
        999983 rows omitted

I generated the data on purpose in a way that produces quite a few missing
values:

julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
 Row │ value    proprow
     │ Int64?   Float64
─────┼───────────────────
   1 │       1  0.091109
   2 │       2  0.091387
   3 │       3  0.091394
   4 │       4  0.090954
   5 │       5  0.090504
   6 │       6  0.091412
   7 │       7  0.090809
   8 │       8  0.090844
   9 │       9  0.090254
  10 │      10  0.090795
  11 │ missing  0.090538

Now notice that some groups will have only missing values (in the output below
the group with :id equal to 12 in row 7 is such a case):

julia> combine(groupby(df, :id),
               :value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
    Row │ id       propmissing
        │ Int64    Float64
────────┼──────────────────────
      1 │       2     0.0
      2 │       3     0.25
      3 │       4     0.0
      4 │       6     0.0
      5 │       8     0.0
      6 │       9     0.333333
      7 │      12     1.0
      8 │      15     0.5
      9 │      16     0.0
   ⋮    │    ⋮          ⋮
 632159 │  999990     0.0
 632160 │  999991     0.0
 632161 │  999992     0.0
 632162 │  999993     0.0
 632163 │  999994     0.0
 632164 │  999996     0.0
 632165 │  999997     0.0
 632166 │ 1000000     0.0
            632149 rows omitted

If we now try to compute e.g. the median value per group while skipping
missing values, we fail:

julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

The solution, as we discussed in this post, is to handle the case of an empty
collection in a special way. I typically prefer the isempty check.

So first define a helper function:

julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)

and now we can write:

julia> combine(groupby(df, :id),
               :value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
    Row │ id       value_function_skipmissing
        │ Int64    Union{Missing, Float64}
────────┼─────────────────────────────────────
      1 │       2                         3.5
      2 │       3                         5.0
      3 │       4                         8.0
      4 │       6                         4.0
      5 │       8                         1.0
      6 │       9                         9.0
      7 │      12                   missing
      8 │      15                         3.0
      9 │      16                         7.0
   ⋮    │    ⋮                 ⋮
 632159 │  999990                         4.0
 632160 │  999991                         3.0
 632161 │  999992                         8.0
 632162 │  999993                         2.0
 632163 │  999994                         7.0
 632164 │  999996                         4.0
 632165 │  999997                         7.0
 632166 │ 1000000                         7.0
                           632149 rows omitted

As you can see, we get missing in row 7 for the group with :id value equal to
12, as expected.
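
The withempty helper is not tied to median; it should work with any statistic
that errors on empty input, for example quantile. Here is a quick check on toy
vectors (q25 is just a name I picked for the wrapped function):

julia> q25 = withempty(x -> quantile(x, 0.25), missing);

julia> q25(skipmissing(Union{Int, Missing}[missing, missing]))
missing

julia> q25(skipmissing(Union{Int, Missing}[1, missing, 3]))
1.5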

If you have some thoughts about the pros and cons of the approaches I
discussed today, please check out this issue and comment there. Thank you!