Sorting data with missing values

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/12/sorting.html

Introduction

Sorting is one of the most common operations one wants to do with collections.
In this post I discuss how one can sort data that contain missing values.

The post was written under Julia 1.10.1 and Missings.jl 1.2.0.

General rules of comparison with missing values

By default missing is considered as greater than any other different value it is compared with:

julia> isless(Inf, missing)
true

julia> isless("abc", missing)
true

julia> isless(r"abc", missing)
true

Note, in particular, the last case. Although Regex does not support comparisons it can be compared to missing.
The reason is that isless has a general catch-all definition when one of the arguments is missing. Let us see it:

isless(::Missing, ::Missing) = false
isless(::Missing, ::Any) = false
isless(::Any, ::Missing) = true

The rule that missing is greater than all else has an important consequence when sorting.

Default sorting with missing values

Let us create a simple vector containing missing values:

julia> x = [missing, 3, 1, missing, 2, 4, missing]
7-element Vector{Union{Missing, Int64}}:
  missing
 3
 1
  missing
 2
 4
  missing

If we sort it missing values end up at the end of the produced vector
because, by default, sorting is done in ascending order:

julia> sort(x)
7-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
  missing
  missing
  missing

If we want to get values in descending order missing values come first:

julia> sort(x, rev=true)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 4
 3
 2
 1

But what if we wanted to have values sorted in descending order, but put missing at the end?

Supplementary sorting order

Users often wanted a functionality that would allow them to sort values, but treat missing
as the smallest. This means that if you sort your data in a descending order missing would be put at the end.
Similarly, if you want to sort your data in ascending order missing would be put at the beginning.

With Missings.jl release 1.2 this functionality is supported with the missingsmallest function:

julia> sort(x, lt=missingsmallest)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 1
 2
 3
 4

julia> sort(x, lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, Int64}}:
 4
 3
 2
 1
  missing
  missing
  missing

By default missingsmallest uses the isless comparison.

More advanced cases of treating missing as smallest

Assume that you have the following vector that you want to sort by
the length of the string:

julia> s = [missing, "abc", "x", missing, "bcde", "pq", missing]
7-element Vector{Union{Missing, String}}:
 missing
 "abc"
 "x"
 missing
 "bcde"
 "pq"
 missing

If you try a simple way to do it you get an error:

julia> sort(s, by=length)
ERROR: MethodError: no method matching length(::Missing)

We need to wrap length in passmissing to get what we want:

julia> sort(s, by=passmissing(length))
7-element Vector{Union{Missing, String}}:
 "x"
 "pq"
 "abc"
 "bcde"
 missing
 missing
 missing

julia> sort(s, by=passmissing(length), rev=true)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "bcde"
 "abc"
 "pq"
 "x"

But what if we wanted to treat missing values as smallest?

The first approach is the one we already know:

julia> sort(s, by=passmissing(length), lt=missingsmallest)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, by=passmissing(length), lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

However, there is an alternative. You can define a comparison function that works on strings:

julia> isshorter(s1::AbstractString, s2::AbstractString) = length(s1) < length(s2)
isshorter (generic function with 1 method)

Then you can pass the isshorter function to missingsmallest
as a single argument to generate a comparison function
that automatically treats missing values as smallest:

julia> sort(s, lt=missingsmallest(isshorter))
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, lt=missingsmallest(isshorter), rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

Conclusions

The missingsmallest functionality was added in Missings.jl 1.2.
I hope you will find it useful when working with your data!