Author Archives: Blog by Bogumił Kamiński

DataFrames.jl training: implementing cross validation

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/22/cv2.html

Introduction

Today I decided to show how cross validation can be implemented using the
functions provided by DataFrames.jl. The objective is to discuss common
patterns used when working with data frames.

In this the post I use Julia 1.6.3, DataFrames.jl 1.2.2, and GLM.jl 1.5.1.

Preparing the data and initial analysis

Let us start by creating some synthetic data set we will work with.

julia> using DataFrames

julia> using GLM

julia> using Random

julia> using Statistics

julia> Random.seed!(1234);

julia> df = DataFrame(randn(50, 9), :auto);

julia> df.y = sum(eachcol(df)) .+ 1.0 .+ randn(50);

julia> df
50×10 DataFrame
 Row │ x1         x2          x3         x4          x5          x6          x7         x8         x9          y
     │ Float64    Float64     Float64    Float64     Float64     Float64     Float64    Float64    Float64     Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  0.867347  -1.22672     0.183976  -1.87215    -0.205782    0.295222    0.183203   0.358659  -0.833507   -2.20367
   2 │ -0.901744  -0.541716   -1.27635   -0.668331   -1.22338    -0.338215    0.975083   0.488578   0.0827196  -3.31337
  ⋮  │     ⋮          ⋮           ⋮          ⋮           ⋮           ⋮           ⋮          ⋮          ⋮           ⋮
  49 │ -1.00978   -1.66323     0.797165  -0.644069    0.127747    0.742054    0.196089  -1.05575    0.771387   -2.364
  50 │ -0.543805  -0.521229    0.103145  -1.37931     0.143105   -0.0665944  -2.34887   -1.37548    0.590872   -4.76852
                                                                                                           46 rows omitted

As you can see we have created data so that the target variable :y is a sum
of features :x1 to :x9 and an intercept equal to 1. The error term in the
model has a normal distribution with mean 0 and variance 1. Similarly all
features also have a normal distribution with mean 0 and variance 1.

Now we create a linear model on the whole data set to have a baseline:

julia> formula = Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:9])
FormulaTerm
Response:
  y(unknown)
Predictors:
  x1(unknown)
  x2(unknown)
  x3(unknown)
  x4(unknown)
  x5(unknown)
  x6(unknown)
  x7(unknown)
  x8(unknown)
  x9(unknown)

julia> model = lm(formula, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  1.01641     0.140685   7.22    <1e-08   0.732078    1.30075
x1           0.879802    0.141336   6.22    <1e-06   0.594151    1.16545
x2           1.14157     0.160897   7.10    <1e-07   0.81639     1.46676
x3           0.820527    0.137644   5.96    <1e-06   0.542338    1.09872
x4           1.28829     0.125645  10.25    <1e-12   1.03435     1.54223
x5           1.02118     0.156195   6.54    <1e-07   0.705503    1.33687
x6           0.844008    0.146142   5.78    <1e-06   0.548643    1.13937
x7           0.70256     0.15646    4.49    <1e-04   0.386343    1.01878
x8           0.925448    0.15797    5.86    <1e-06   0.60618     1.24472
x9           0.889749    0.167878   5.30    <1e-05   0.550456    1.22904
────────────────────────────────────────────────────────────────────────

We calculate mean square error of the model on the training data set:

julia> deviance(model) / nrow(df)
0.6689313350544223

julia> mse(model, df) = sum(x->x^2, predict(model, df) - df.y) / nrow(df);

julia> mse(model, df)
0.6689313350544223

It seems (and indeed is) a bit low given we know that the standard deviation of
the error term in our data generation process was 1. We will come back to
it in a second. However, first let us perfom 10-fold cross validation.

Doing cross validation

We start with generating a column holding the fold number for each row.
In the example we assume we do a 10-fold cross validation and will number folds
by integers starting with 0 and ending with 9. We need to make sure that
folds should be assigned to rows in a random order so we need to shuffle!
the initially generated fold numbers:

julia> df.fold = shuffle!((1:nrow(df)) .% 10)
50-element Vector{Int64}:
 4
 6
 5
 7
 ⋮
 8
 6
 7

Now we can quite easily extract out training and test data sets for each
individual fold using the following function:

get_fold_data(df, fold) =
    (train=view(df, df.fold .!= fold, :),
     test=view(df, df.fold .== fold, :))

Note that in this code we use views to avoid excessive copying of data.

Now we are ready to calculate the cross validation mean squared error (in the
code we take advantage of the fact that each fold has exactly the same size
in our example):

julia> mean(0:9) do fold
           train, test = get_fold_data(df, fold)
           model_cv = lm(formula, train)
           return mse(model_cv, test)
       end
0.9500293257502437

Indeed it is higher than the MSE we have calculated above as expected.

Let us calculate the expected square error for the new observation fed to
our model (I am using here an analytic formula that is suitable for our specific
way of input data generation):

julia> mset(model) = 1 + sum(x -> (1 - x) ^ 2, coef(model));

julia> mset(model)
1.2810475573651952

We can see it is greater than 1 which manes sense as this error is a sum of
two errors: random term in our data generation process plus the estimation error
of parameters of our model.

If you are not convinced that the proper formula is used to calculate this
expectation then it is easy to check it using simulation:

julia> dfn = DataFrame(randn(10^6, 9), :auto);

julia> dfn.y = sum(eachcol(dfn)) .+ 1.0 .+ randn(10^6);

julia> mse(model, dfn)
1.2817533442813103

And we get a roughly similar result.

It is crucial to note that we can calculate mse_t here accurately because we
know the data generation process. In practice we would not know it but would
want get some measure that would as accurately as possible order the competing
models in terms of their true, unobserved mse_t.

Now you might wonder if the cross validation mean squared error should be
expected to be lower than the actual prediction expected squared error as in
our single run and if it is a better predictor of mse_t than mean squared
error calculated on a training data set. We check this in the next section.

Testing the procedure using simulation

Let us wrap our calculations in the function. We will collect three mean
squared errors:

  • mse_whole: calculated on training data set;
  • mse_cv: calculated using cross validation;
  • mse_t: expected prediction squared error.

Here is the function definition:

using DataFrames
using GLM
using Random
mse(model, df) = sum(x->x^2, predict(model, df) - df.y) / nrow(df);
mset(model) = 1 + sum(x -> (1 - x) ^ 2, coef(model));
get_fold_data(df, fold) =
    (train=view(df, df.fold .!= fold, :),
     test=view(df, df.fold .== fold, :))
function runtest(id)
    df = DataFrame(randn(50, 9), :auto)
    df.y = sum(eachcol(df)[1:5]) .+ 1.0 .+ randn(50)
    formulas = [Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:n]) for n in 1:9]
    models = [lm(f, df) for f in formulas]
    mse_wholes = [mse(m, df) for m in models]

    mse_ts = [mset(m) for m in models]

    df.fold = shuffle!((1:nrow(df)) .% 10)

    mse_cvs = map(formulas) do f
        return mean(0:9) do fold
            train, test = get_fold_data(df, fold)
            model_cv = lm(f, train)
            return mse(model_cv, test)
        end
    end

    return DataFrame(id=id, vars=1:9, mse_whole=mse_wholes, mse_cv=mse_cvs, mse_t=mse_ts)
end

Let us now run 10,000 independent repetitions of our experiment and analyze
the results:

julia> Random.seed!(12);

julia> res = DataFrame([runtest() for _ in 1:10_000])
10000×3 DataFrame
   Row │ mse_whole  mse_cv    mse_t
       │ Float64    Float64   Float64
───────┼──────────────────────────────
     1 │  0.767821  1.20796   1.43191
     2 │  0.595431  1.11465   1.2233
     3 │  0.801025  1.41942   1.25682
   ⋮   │     ⋮         ⋮         ⋮
  9998 │  0.589998  0.930069  1.24527
  9999 │  0.444815  0.685732  1.36184
 10000 │  1.05747   1.68496   1.23118
                     9994 rows omitted

julia> describe(res, :all)
3×13 DataFrame
 Row │ variable   mean      std       min       q25       median    q75       max      nunique  nmissing  first     last     eltype
     │ Symbol     Float64   Float64   Float64   Float64   Float64   Float64   Float64  Nothing  Int64     Float64   Float64  DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ mse_whole  0.799776  0.177712  0.286417  0.672772  0.785994  0.912926  1.71692                  0  0.767821  1.05747  Float64
   2 │ mse_cv     1.29295   0.305371  0.419517  1.07337   1.26767   1.48104   3.0517                   0  1.20796   1.68496  Float64
   3 │ mse_t      1.25542   0.132023  1.01737   1.16157   1.22982   1.32126   2.07322                  0  1.43191   1.23118  Float64

julia> cor(Matrix(res))
3×3 Matrix{Float64}:
  1.0          0.947       -0.00555373
  0.947        1.0         -0.00844714
 -0.00555373  -0.00844714   1.0

What have we learned from our specific experiment setup:

  • mse_whole is significantly biased downwards;
  • mse_cv is slightly biased upwards on the average when multiple experiments
    are considered, but its variance is much higher than the one of mse_t;
  • mse_cv and mse_whole are significantly correlated and they both do not
    exhibit correlation with mse_t.

Conclusions

I hope the examples I gave of using DataFrames.jl in the context of some typical
data science workflow will be useful. A next step – that I leave as an
exercise – would be to e.g. check how model selection using cross validation
works.

DataFrames.jl training: implementing cross validation

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/15/cv.html

Introduction

Today I decided to show how cross validation can be implemented using the
functions provided by DataFrames.jl. The objective is to discuss common
patterns used when working with data frames.

In this the post I use Julia 1.6.3, DataFrames.jl 1.2.2, and GLM.jl 1.5.1.

Preparing the data and initial analysis

Let us start by creating some synthetic data set we will work with.

julia> using DataFrames

julia> using GLM

julia> using Random

julia> using Statistics

julia> Random.seed!(1234);

julia> df = DataFrame(randn(50, 9), :auto);

julia> df.y = sum(eachcol(df)) .+ 1.0 .+ randn(50);

julia> df
50×10 DataFrame
 Row │ x1         x2          x3         x4          x5          x6          x7         x8         x9          y
     │ Float64    Float64     Float64    Float64     Float64     Float64     Float64    Float64    Float64     Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  0.867347  -1.22672     0.183976  -1.87215    -0.205782    0.295222    0.183203   0.358659  -0.833507   -2.20367
   2 │ -0.901744  -0.541716   -1.27635   -0.668331   -1.22338    -0.338215    0.975083   0.488578   0.0827196  -3.31337
  ⋮  │     ⋮          ⋮           ⋮          ⋮           ⋮           ⋮           ⋮          ⋮          ⋮           ⋮
  49 │ -1.00978   -1.66323     0.797165  -0.644069    0.127747    0.742054    0.196089  -1.05575    0.771387   -2.364
  50 │ -0.543805  -0.521229    0.103145  -1.37931     0.143105   -0.0665944  -2.34887   -1.37548    0.590872   -4.76852
                                                                                                           46 rows omitted

As you can see we have created data so that the target variable :y is a sum
of features :x1 to :x9 and an intercept equal to 1. The error term in the
model has a normal distribution with mean 0 and variance 1. Similarly all
features also have a normal distribution with mean 0 and variance 1.

Now we create a linear model on the whole data set to have a baseline:

julia> formula = Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:9])
FormulaTerm
Response:
  y(unknown)
Predictors:
  x1(unknown)
  x2(unknown)
  x3(unknown)
  x4(unknown)
  x5(unknown)
  x6(unknown)
  x7(unknown)
  x8(unknown)
  x9(unknown)

julia> model = lm(formula, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  1.01641     0.140685   7.22    <1e-08   0.732078    1.30075
x1           0.879802    0.141336   6.22    <1e-06   0.594151    1.16545
x2           1.14157     0.160897   7.10    <1e-07   0.81639     1.46676
x3           0.820527    0.137644   5.96    <1e-06   0.542338    1.09872
x4           1.28829     0.125645  10.25    <1e-12   1.03435     1.54223
x5           1.02118     0.156195   6.54    <1e-07   0.705503    1.33687
x6           0.844008    0.146142   5.78    <1e-06   0.548643    1.13937
x7           0.70256     0.15646    4.49    <1e-04   0.386343    1.01878
x8           0.925448    0.15797    5.86    <1e-06   0.60618     1.24472
x9           0.889749    0.167878   5.30    <1e-05   0.550456    1.22904
────────────────────────────────────────────────────────────────────────

We calculate mean square error of the model on the training data set:

julia> deviance(model) / nrow(df)
0.6689313350544223

julia> mse(model, df) = sum(x->x^2, predict(model, df) - df.y) / nrow(df);

julia> mse(model, df)
0.6689313350544223

It seems (and indeed is) a bit low given we know that the standard deviation of
the error term in our data generation process was 1. We will come back to
it in a second. However, first let us perfom 10-fold cross validation.

Doing cross validation

We start with generating a column holding the fold number for each row.
In the example we assume we do a 10-fold cross validation and will number folds
by integers starting with 0 and ending with 9. We need to make sure that
folds should be assigned to rows in a random order so we need to shuffle!
the initially generated fold numbers:

julia> df.fold = shuffle!((1:nrow(df)) .% 10)
50-element Vector{Int64}:
 4
 6
 5
 7
 ⋮
 8
 6
 7

Now we can quite easily extract out training and test data sets for each
individual fold using the following function:

get_fold_data(df, fold) =
    (train=view(df, df.fold .!= fold, :),
     test=view(df, df.fold .== fold, :))

Note that in this code we use views to avoid excessive copying of data.

Now we are ready to calculate the cross validation mean squared error (in the
code we take advantage of the fact that each fold has exactly the same size
in our example):

julia> mean(0:9) do fold
           train, test = get_fold_data(df, fold)
           model_cv = lm(formula, train)
           return mse(model_cv, test)
       end
0.9500293257502437

Indeed it is higher than the MSE we have calculated above as expected.

Let us calculate the expected square error for the new observation fed to
our model (I am using here an analytic formula that is suitable for our specific
way of input data generation):

julia> mset(model) = 1 + sum(x -> (1 - x) ^ 2, coef(model));

julia> mset(model)
1.2810475573651952

We can see it is greater than 1 which manes sense as this error is a sum of
two errors: random term in our data generation process plus the estimation error
of parameters of our model.

If you are not convinced that the proper formula is used to calculate this
expectation then it is easy to check it using simulation:

julia> dfn = DataFrame(randn(10^6, 9), :auto);

julia> dfn.y = sum(eachcol(dfn)) .+ 1.0 .+ randn(10^6);

julia> mse(model, dfn)
1.2817533442813103

And we get a roughly similar result.

It is crucial to note that we can calculate mse_t here accurately because we
know the data generation process. In practice we would not know it but would
want get some measure that would as accurately as possible order the competing
models in terms of their true, unobserved mse_t.

Now you might wonder if the cross validation mean squared error should be
expected to be lower than the actual prediction expected squared error as in
our single run and if it is a better predictor of mse_t than mean squared
error calculated on a training data set. We check this in the next section.

Testing the procedure using simulation

Let us wrap our calculations in the function. We will collect three mean
squared errors:

  • mse_whole: calculated on training data set;
  • mse_cv: calculated using cross validation;
  • mse_t: expected prediction squared error.

Here is the function definition:

function runtest()
    df = DataFrame(randn(50, 9), :auto)
    df.y = sum.(eachrow(df)) .+ 1.0 .+ randn(50)
    model = lm(formula, df)
    mse_whole = mse(model, df)

    mse_t = mset(model)

    df.fold = shuffle!((1:nrow(df)) .% 10)
    mse_cv = mean(0:9) do fold
        train, test = get_fold_data(df, fold)
        model_cv = lm(formula, train)
        return mse(model_cv, test)
    end

    return (mse_whole=mse_whole, mse_cv=mse_cv, mse_t=mse_t)
end

Let us now run 10,000 independent repetitions of our experiment and analyze
the results:

julia> Random.seed!(12);

julia> res = DataFrame([runtest() for _ in 1:10_000])
10000×3 DataFrame
   Row │ mse_whole  mse_cv    mse_t
       │ Float64    Float64   Float64
───────┼──────────────────────────────
     1 │  0.767821  1.20796   1.43191
     2 │  0.595431  1.11465   1.2233
     3 │  0.801025  1.41942   1.25682
   ⋮   │     ⋮         ⋮         ⋮
  9998 │  0.589998  0.930069  1.24527
  9999 │  0.444815  0.685732  1.36184
 10000 │  1.05747   1.68496   1.23118
                     9994 rows omitted

julia> describe(res, :all)
3×13 DataFrame
 Row │ variable   mean      std       min       q25       median    q75       max      nunique  nmissing  first     last     eltype
     │ Symbol     Float64   Float64   Float64   Float64   Float64   Float64   Float64  Nothing  Int64     Float64   Float64  DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ mse_whole  0.799776  0.177712  0.286417  0.672772  0.785994  0.912926  1.71692                  0  0.767821  1.05747  Float64
   2 │ mse_cv     1.29295   0.305371  0.419517  1.07337   1.26767   1.48104   3.0517                   0  1.20796   1.68496  Float64
   3 │ mse_t      1.25542   0.132023  1.01737   1.16157   1.22982   1.32126   2.07322                  0  1.43191   1.23118  Float64

julia> cor(Matrix(res))
3×3 Matrix{Float64}:
  1.0          0.947       -0.00555373
  0.947        1.0         -0.00844714
 -0.00555373  -0.00844714   1.0

What have we learned from our specific experiment setup:

  • mse_whole is significantly biased downwards;
  • mse_cv is slightly biased upwards on the average when multiple experiments
    are considered, but its variance is much higher than the one of mse_t;
  • mse_cv and mse_whole are significantly correlated and they both do not
    exhibit correlation with mse_t.

Conclusions

I hope the examples I gave of using DataFrames.jl in the context of some typical
data science workflow will be useful. A next step – that I leave as an
exercise – would be to e.g. check how model selection using cross validation
works.

Julia beginner’s corner: mastering comparison operators

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/08/comparisons.html

Introduction

In every programming language performing comparisons is one of the most
fundamental operations. Many people starting to learn Julia find it surprising
that it provides two sets of comparison operators. Today I want to summarize
how each of them works and discuss the practical consequences.

In this the post I use Julia 1.6.3.

The standard comparison operators

Normally one uses == and != to test for equality, and <, >, <=, and
>= to test for ordering of values.

Here we can see how it works:

julia> 1 == 2
false

julia> "ab" < "cd"
true

julia> (1, "ab") > (1, "cd")
false

julia> (1 => 2) < (3 => 4)
true

An important distinction is that == and != are always defined for values of
any type, while the ordering comparisons are defined only when the type designer
decided that such comparisons make sense.

Another general rule that is worth remembering, and it was shown in the examples
above, is that comparisons can be applied to collections (like e.g. arrays or
tuples) and they normally are implemented by recursively comparing elements
contained in the collection using the lexicographic ordering.

The standard operators are typically used in practice and are, in particular,
easy to type and read. However, they exhibit behavior that might not be
desirable in certain cases. The three most common situations are as follows.

Case 1: numeric -0.0 and 0.0 are considered equal:

julia> -0.0 == 0.0
true

julia> -0.0 < 0.0
false

Case 2: comparisons with NaN always produce false:

julia> NaN == NaN
false

julia> NaN < NaN
false

julia> NaN > NaN
false

julia> NaN == 1.0
false

julia> NaN < 1.0
false

julia> NaN > 1.0
false

Case 3: comparisons with missing always produce missing:

julia> missing == missing
missing

julia> missing < missing
missing

julia> missing > missing
missing

julia> missing == 1
missing

julia> missing < 1
missing

julia> missing > 1
missing

These properties make sense in certain situations, but when e.g. we want
to sort values or store them in a dictionary or set they are not desirable.
Therefore Julia introduces another set of comparison operators.

The special comparison operators

There are two special comparison functions: isequal and isless. The major
difference here is that the user can expect that these comparisons always return
a Bool value. Additionally isequal distinguishes 0.0 and -0.0 and
considers all NaN values as equal. Therefore:

julia> isequal(NaN, NaN)
true

julia> isless(NaN, NaN)
false

julia> isequal(-0.0, 0.0)
false

julia> isless(-0.0, 0.0)
true

julia> isequal(missing, missing)
true

julia> isless(missing, missing)
false

Here is an example of isequal at work:

julia> unique([-1.0, -0.0, 0.0, missing, NaN, NaN])
5-element Vector{Union{Missing, Float64}}:
  -1.0
  -0.0
   0.0
    missing
 NaN

The unique function uses isequal to test for equality thus it de-duplicated
NaNs, but retained both -0.0 and 0.0. Also, even though we had missing
in the vector it was not a problem and the function worked without an error.

Similarly, sort uses isless by default so the following works:

julia> sort([-1.0, -0.0, 0.0, missing, NaN, NaN])
6-element Vector{Union{Missing, Float64}}:
  -1.0
  -0.0
   0.0
 NaN
 NaN
    missing

If we switched the comparison operator to < we would get an error:

julia> sort([-1.0, -0.0, 0.0, missing, NaN, NaN], lt=<)
ERROR: TypeError: non-boolean (Missing) used in boolean context

or have an undefined result in corner cases:

julia> sort([-0.0, 0.0], lt=<)
2-element Vector{Float64}:
 -0.0
  0.0

julia> sort([0.0, -0.0], lt=<)
2-element Vector{Float64}:
  0.0
 -0.0

In practice the most common use of isequal and isless is in cases when
we want to avoid missing result from a comparison.

The egal equality

Before we move forward it is worth to know that in Julia there is yet a third
notion of equality. It is invoked using the === comparison. This comparison
always returns a Bool value and tests if the passed arguments are identical,
in the sense that no program could distinguish them.

This distinction is most relevant for mutable types:

julia> x = [1]
1-element Vector{Int64}:
 1

julia> y = [1]
1-element Vector{Int64}:
 1

julia> x === x
true

julia> x === y
false

julia> x == x
true

julia> x == y
true

As you can see above x and y vectors are considered equal by == as they
have the same contents, but are not equal by === as they have a different
location in memory.

It is important to note here that immutable types are compared with === by
their contents on bit level, so we have the following:

julia> x = (1,)
(1,)

julia> y = (1,)
(1,)

julia> x === y
true

Rules for designing custom types

In Julia it is very easy to define custom types. Therefore it is crucially
important, when doing so, to understand what are the default implementations
of the comparison operators. Here are the rules:

  • == falls back to === by default;
  • isequal falls back to == by default and additionally requires that the
    hash function is consistently defined;
  • < falls back to isless.

As you can see, maybe somewhat surprisingly, the fallback implementations work
in different ways for == and isequal vs < and isless pairs. Also,
although == is not directly linked with hash it is indirectly linked to it
because isequal falls back to it.

The simple practical rule then is the following. If you define a new type then:

  • always design ==, isequal and hash functions jointly if you implement
    them (if you do not implement any you are safe as the default fallbacks for
    === and hash are designed in a consistent way);
  • if you want your type to support ordering, always design < and isless
    jointly, and then also define the equality operators discussed in the bullet
    above.

The in case study

As a special application of the above examples let me discuss the in function
here. It is very useful for testing if some value is found in some collection.
I am mentioning it because the implementation of in is quite tricky in Julia
1.6. Normally it uses == to test for equality except for certain collections,
like Set or Dict, which use isequal instead. In consequence we have the
following:

julia> 1 in [1, missing]
true

julia> 1 in [2, missing]
missing

julia> 1 in Set([1, missing])
true

julia> 1 in Set([2, missing])
false

julia> NaN in [NaN]
false

julia> NaN in Set([NaN])
true

julia> 0.0 in [-0.0]
true

julia> 0.0 in Set([-0.0])
false

These differences can be surprising so it is important to remember them. Let me
note that this is a quite relevant practical consideration because using of a
Set wrapper is a very common pattern for improving the performance of the
in test:

julia> x = rand(Int, 10^5);

julia> y = rand(Int, 10^5);

julia> in.(x, Ref(y)); # precompile

julia> @time in.(x, Ref(y)); # this is slow
  5.528883 seconds (6 allocations: 16.672 KiB)

julia> in.(x, Ref(Set(y))); # precompile

julia> @time in.(x, Ref(Set(y))); # this is fast
  0.006156 seconds (14 allocations: 2.267 MiB)

The isapprox case study

Sometimes when comparing numeric values we are interested in checking if
they are approximately equal. The reason is that in some cases due to round-off
errors the sharp == equality is not what one might expect:

julia> 0.1 + 0.2 == 0.3
false

In such cases, when we are interested in testing of approximate equality the
isapprox function can be used:

julia> isapprox(0.1 + 0.2, 0.3)
true

You might ask how the approximate equality is defined. The rules are a bit
involved so I refer you to the documentation for the details. Here
let me just note that you can control absolute tolerance, relative tolerance
and how NaN values are handled via atol, reltol, and nans keyword
arguments respectively.

Conclusions

Designing comparison operators properly is one of the hardest tasks in every
programming language. In Julia the design covers a wide range of possible
scenarios that the user might want in practice. The cost of this flexible
design is that it takes some time to master it. I hope that after reading this
post you have enough understanding of the details to be able to confidently
work with comparison operators in Julia.