How is equality checked in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/04/14/isequal.html

Introduction

Today I want to discuss how values are tested for being equal in
functions provided by DataFrames.jl.

I have already discussed equality testing in two earlier posts, and I
explain it extensively in chapter 7 of my Julia for Data Analysis book.
However, users still raise the issue often, so I thought it would be useful
to return to it one more time.

The post was written under Julia 1.9.0-rc1, CategoricalArrays.jl 0.10.7 and DataFrames.jl 1.5.0.

Why is equality testing hard?

When users learn Julia they are typically taught that == is the operator
that should be used for testing for equality. Here is a basic example:

julia> 1 == 2
false

julia> 1 == 1
true

However, the following aspects of == make it unintuitive
in some scenarios.

The first is that == is not guaranteed to return a Bool value. If one
of its arguments is missing, the result is also missing:

julia> 1 == missing
missing

julia> missing == missing
missing

Clearly this is not desirable when we expect the operation to return a Bool (e.g. when filtering data). Other comparison operators, such as >, propagate missing in the same way:

julia> x = [1, 2, missing, 4, 5]
5-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 4
 5

julia> x[x .> 2.5]
ERROR: ArgumentError: unable to check bounds for indices of type Missing

In such cases you should use the coalesce function to decide whether to keep or drop missing values:

julia> x[coalesce.(x .> 2.5, false)]
2-element Vector{Union{Missing, Int64}}:
 4
 5

julia> x[coalesce.(x .> 2.5, true)]
3-element Vector{Union{Missing, Int64}}:
  missing
 4
 5
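
When dropping missing values is what you want, there are alternatives to coalesce that use only Base functions. The following is a minimal sketch rather than the only idiomatic option; the variable names are chosen for illustration:

```julia
x = [1, 2, missing, 4, 5]

# skipmissing iterates over the non-missing values only, so the
# comparison inside the comprehension always returns a Bool.
kept = [v for v in skipmissing(x) if v > 2.5]

# isequal never returns missing, so comparing the broadcasted result
# with true produces a plain Bool mask (missing is treated as "drop").
mask = isequal.(x .> 2.5, true)
kept_by_mask = x[mask]
```

Both approaches drop the missing entry; the first also narrows the element type because skipmissing never yields missing.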

The second issue is that == follows IEEE semantics for floating-point numbers:

julia> NaN == NaN
false

julia> 0.0 == -0.0
true

First, you see that NaN is not considered to be equal to NaN. This can be quite surprising:

julia> x = [1, NaN, 2]
3-element Vector{Float64}:
   1.0
 NaN
   2.0

julia> x[x .== NaN]
Float64[]

Instead, the isnan function can be used:

julia> x[isnan.(x)]
1-element Vector{Float64}:
 NaN

The 0.0 and -0.0 case is even trickier. These are technically two distinct values;
however, in some applications a user might want them to be treated as equal, while in others
as not equal. The IEEE standard specifies that they are considered equal when compared
using ==.

In summary, the major problem with == is that it does not define a proper equivalence
relation. First, some values are not comparable (missing is returned); second,
it is not reflexive (for NaN).
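
Both failures can be checked directly. The sketch below uses a small helper, is_reflexive, which is a name introduced here for illustration and not a Base function; it reports whether comparing a value with itself yields exactly true:

```julia
# Hypothetical helper: true only if op(v, v) is the Bool value true
# (a missing result and a false result both count as a failure).
is_reflexive(op, v) = op(v, v) === true

values_to_check = Any[1, "a", 0.0, -0.0, NaN, missing]

# For each value, record whether == is reflexive on it.
report = [(v, is_reflexive(==, v)) for v in values_to_check]

for (v, ok) in report
    println(rpad(repr(v), 8), " == itself: ", ok)
end
```

Only NaN and missing fail the check, which matches the two problems listed above.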

An alternative way to compare values

In many cases you need an equality operator that defines an equivalence relation.
In Julia this is provided by the isequal function. As you can read in its documentation:

isequal treats all floating-point NaN values as equal to each other,
treats -0.0 as unequal to 0.0, and missing as equal to missing.
Always returns a Bool value.

Let us check this:

julia> isequal(1, missing)
false

julia> isequal(missing, missing)
true

julia> isequal(NaN, NaN)
true

julia> isequal(0.0, -0.0)
false

In Julia, functions that partition values into equivalence classes use
isequal to test for equality. In Base Julia, examples include Dict and Set
operations and the unique function:

julia> Set([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
Set{Union{Missing, Float64}} with 4 elements:
  0.0
  NaN
  -0.0
  missing

julia> unique([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
4-element Vector{Union{Missing, Float64}}:
   0.0
  -0.0
 NaN
    missing
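
The same distinction shows up when looking values up in collections. As a small sketch using only Base functions: membership in a vector is tested with ==, while Set and Dict lookups go through isequal and hash:

```julia
# `in` on a vector compares with ==, so NaN is never found
# and a missing element makes the answer missing.
nan_in_vector = NaN in [NaN]
missing_in_vector = missing in [missing]

# Set and Dict use isequal (together with hash) for lookups.
nan_in_set = NaN in Set([NaN])

d = Dict(0.0 => "positive zero", -0.0 => "negative zero", NaN => "not a number")
label_for_nan = d[NaN]
label_for_negzero = d[-0.0]

# isequal ignores the numeric type of finite values and hash agrees with it,
# so an Int key can be looked up with an equal Float64.
int_keyed = Dict(1 => "one")
found = int_keyed[1.0]
```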

The same rules carry over to DataFrames.jl.

Testing for equality in DataFrames.jl

There are the following functionalities of DataFrames.jl that rely on the isequal equality test:

  • deduplication with unique and related functions;
  • grouping with groupby;
  • joins (innerjoin etc.).

Let us see them in action one by one. We start with the deduplication:

julia> using DataFrames

julia> df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
8×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     2        0.0
   3 │     3       -0.0
   4 │     4       -0.0
   5 │     5      NaN
   6 │     6      NaN
   7 │     7  missing
   8 │     8  missing

julia> unique(df, :x)
4×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     3       -0.0
   3 │     5      NaN
   4 │     7  missing

Indeed, we see that 0.0 and -0.0 are considered as not equal,
while NaN and missing are deduplicated.

Now let us turn to grouping:

julia> show(groupby(df, :x), allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     1       0.0
   2 │     2       0.0
Group 2 (2 rows): x = -0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     3      -0.0
   2 │     4      -0.0
Group 3 (2 rows): x = NaN
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     5       NaN
   2 │     6       NaN
Group 4 (2 rows): x = missing
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     7   missing
   2 │     8   missing

As you can see, we get the same result again. As a side note, let me
comment that unique internally uses the same mechanism as groupby
to identify duplicates.

Finally let us check joins:

julia> df_ref = DataFrame(x=[0.0, missing], val=1:2)
2×2 DataFrame
 Row │ x          val
     │ Float64?   Int64
─────┼──────────────────
   1 │       0.0      1
   2 │ missing        2

julia> outerjoin(df, df_ref, on=:x)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error

Here we hit the first problem. Joins detect that a missing value is present in a key column,
and by default they throw an error in such a case. We can change this using the matchmissing keyword argument.
Let us assume that we want missing values to be treated as equal and try the following join:

julia> outerjoin(df, df_ref, on=:x, matchmissing=:equal)
ERROR: ArgumentError: currently for numeric values NaN and `-0.0` in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

We still get an error. In joins, for safety, an error is thrown if NaN or -0.0 is encountered in a key column. This can be fixed by transforming the :x column to categorical,
in which case -0.0 and 0.0 are considered to be different:

julia> using CategoricalArrays

julia> outerjoin(transform(df, :x => categorical => :x), df_ref, on=:x, matchmissing=:equal)
8×3 DataFrame
 Row │ id      x          val
     │ Int64?  Float64?   Int64?
─────┼────────────────────────────
   1 │      1        0.0        1
   2 │      2        0.0        1
   3 │      7  missing          2
   4 │      8  missing          2
   5 │      3       -0.0  missing
   6 │      4       -0.0  missing
   7 │      5      NaN    missing
   8 │      6      NaN    missing

Let us check the categorical vector in more detail:

julia> levels(categorical(df.x))
3-element Vector{Float64}:
  -0.0
   0.0
 NaN

As you can see 0.0 and -0.0 are considered to be separate levels in a categorical vector.

Conclusions

I hope the examples given today were useful for understanding how == and isequal work in Julia.

As a final comment, let me add that throwing an error for joins on -0.0 was a decision made
for safety reasons. However, if users give us feedback that other options for handling -0.0
would be useful (e.g. treating it as equal or not equal to 0.0), then we could consider adding this feature
in future releases of DataFrames.jl.
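
Until such an option exists, a user who wants -0.0 and 0.0 to match in a join can normalize the key column beforehand. The sketch below is one possible approach using only Base Julia; normalize_zero is a hypothetical helper name, not part of DataFrames.jl, and it deliberately leaves NaN untouched, so NaN in a key column would still need separate handling:

```julia
# Hypothetical helper: map both signed zeros to +0.0 and pass
# every other value (including NaN and missing) through unchanged.
normalize_zero(v::Missing) = missing
normalize_zero(v::Real) = iszero(v) ? zero(v) : v

keys_before = [0.0, -0.0, 1.5, NaN, missing]
keys_after = normalize_zero.(keys_before)

# After normalization isequal no longer distinguishes the two zeros.
zeros_match = isequal(keys_after[1], keys_after[2])
```

With DataFrames.jl loaded, such a helper could be applied to a key column before the join, for example with transform(df, :x => ByRow(normalize_zero) => :x).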

Element type surprises when processing collections in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/04/07/narrowing.html

Introduction

Today I want to write about a quite tricky
element of the design of Julia. The issue is that it is sometimes hard
to predict the element type of the output collection produced by
an operation that transforms an input collection.

The description above looks complicated, but the problem is
encountered in practice, so let me explain it by example.

The post was written under Julia 1.9.0-rc1.

A basic example

Assume you have an input collection [1, 2, 3] and you
want to compute the square root of each of its elements.

Let us consider three standard ways to do it:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> sqrt.(x)
3-element Vector{Float64}:
 1.0
 1.4142135623730951
 1.7320508075688772

julia> map(sqrt, x)
3-element Vector{Float64}:
 1.0
 1.4142135623730951
 1.7320508075688772

julia> [sqrt(v) for v in x]
3-element Vector{Float64}:
 1.0
 1.4142135623730951
 1.7320508075688772

As you can see, in each case a proper element type, that is
Float64, was determined for the returned collection.

This behavior is useful, as the user does not have to think
about specifying the output element type. In fact,
in combination with the identity function,
this behavior can be used to conveniently narrow
down the element type of a collection:

julia> y = Any[1, 2, 3]
3-element Vector{Any}:
 1
 2
 3

julia> identity.(y)
3-element Vector{Int64}:
 1
 2
 3

julia> map(identity, y)
3-element Vector{Int64}:
 1
 2
 3

julia> [v for v in y]
3-element Vector{Int64}:
 1
 2
 3

This pattern comes in handy if we have input data that does not have a known
element type, but later we want to narrow the element type
when processing it (one of the major benefits of such narrowing
is that processing vectors of Any values is slow, so we typically want to avoid it).
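
Narrowing does not always produce a concrete type. As a sketch of what to expect when the stored values are heterogeneous (the variable names are illustrative):

```julia
# All elements are Int, so the element type narrows to Int.
all_ints = identity.(Any[1, 2, 3])

# Int and Float64 values are kept as they are, so the result
# is narrowed only to their common supertype Real.
ints_and_floats = identity.(Any[1, 2.0])

# Mixing with missing yields a small Union.
ints_and_missing = identity.(Any[1, missing])

# Unrelated types cannot be narrowed, so the element type stays Any.
unrelated = identity.(Any[1, "a"])

println(eltype.((all_ints, ints_and_floats, ints_and_missing, unrelated)))
```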

The hard case

Automatic output element type detection works nicely most of the time.
Unfortunately, when we work with empty collections, the result becomes hard to predict.
Here is a simple example:

julia> String.([])
Any[]

julia> map(String, [])
String[]

julia> [String(v) for v in []]
Any[]

julia> string.([])
AbstractString[]

julia> map(string, [])
Any[]

julia> [string(v) for v in []]
Any[]

As you can see, broadcasting, map, and comprehensions use different sets of
rules to automatically determine the produced element type. These rules of course
exist and could be learned, but the point is that the issue is non-trivial.

The problem is that when you are writing production code
(e.g. you are developing a package) you want to be sure
what the element type of the collection you produce will be, as often
you cannot know upfront if the input collection the user is going to provide
will be empty or not.

The solution I use

In situations when it matters what the element type of the collection
produced by some transformation is going to be, I use comprehensions
with an output element type annotation:

julia> [string(v) for v in []]
Any[]

julia> String[string(v) for v in []]
String[]

Such an annotation has the additional consequence that it converts
the produced elements to the target type if needed:

julia> using Test

julia> s = ["a", GenericString("a")]
2-element Vector{AbstractString}:
 "a"
 "a"

julia> [string(v) for v in s]
2-element Vector{AbstractString}:
 "a"
 "a"

julia> typeof.([string(v) for v in s])
2-element Vector{DataType}:
 String
 GenericString

julia> String[string(v) for v in s]
2-element Vector{String}:
 "a"
 "a"

julia> typeof.(String[string(v) for v in s])
2-element Vector{DataType}:
 String
 String

Note that in this example prefixing the comprehension with
String ensured that the result of the operation has the
String element type and that all produced values have this type.
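
In package code this pattern is easy to wrap in a function whose output element type does not depend on the input. A minimal sketch, with to_strings as a hypothetical function name:

```julia
# Hypothetical helper: always returns a Vector{String},
# whether the input is empty, homogeneous, or mixed.
to_strings(xs) = String[string(v) for v in xs]

from_empty = to_strings([])
from_ints = to_strings([1, 2, 3])
from_mixed = to_strings(Any[1, "a", 2.5])
```

Because the element type is fixed in the comprehension, callers get the same container type for an empty input as for a non-empty one.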

Element type widening

Let me comment on one common related operation. What if we
want to initialize a container with a given value but
we want its element type to be wider? This is not an artificial
case: it often happens with missing (where we initialize
a container with this value only to later replace missing with proper
values).

Using the fill function is the first thing we might try:

julia> fill(missing, 3)
3-element Vector{Missing}:
 missing
 missing
 missing

However, the produced container has the Missing element type, which
is not useful if, for example, we later want to store integers in it.

One can use a comprehension annotated with a proper output element type
instead:

julia> Union{Int, Missing}[missing for _ in 1:3]
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing

The pattern with missing is needed often enough that we have
a custom function in the Missings.jl package that can be used
to get the desired result more conveniently:

julia> using Missings

julia> missings(Int, 3)
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing
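
If adding a dependency is not desired, Base Julia also offers an array constructor that takes missing as the initial value. The sketch below shows it next to the fill result for comparison (the variable names are illustrative):

```julia
# Base constructor: the element type is given explicitly and
# every entry starts as missing.
v = Vector{Union{Int, Missing}}(missing, 3)
v[1] = 10          # works, Int is allowed by the element type

# The vector produced by fill can only hold missing, so storing
# an integer in it fails.
w = fill(missing, 3)
assignment_failed = try
    w[1] = 10
    false
catch
    true
end
```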

Conclusions

Fortunately, in interactive use the problem of setting
the proper element type for a collection does not
occur often. However, when I write production programs I make
sure to always consider whether I need a comprehension with an
element type specification to ensure type stability of my code.

Broadcast fusion in Julia: all you need to know to avoid pitfalls

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/31/broadcast.html

Introduction

Broadcasting is a powerful feature of Julia and quickly becomes a tool of choice
for developers because it is convenient to use.

However, in more complicated scenarios it can be tricky. Because
of its power, broadcasting has a complex design, and even the Julia Manual covers it in four
separate places, each discussing a different aspect of the functionality.

For this reason, some time ago I wrote a post about @. to
clarify its usage. A recent discussion on Julia Discourse prompted me to write
another post about this topic.

I will cover two things related to broadcast fusion today:
broadcasting of containers having different shapes
and aliasing in broadcasted assignment.

The presented examples were tested under Julia 1.9.0-rc1.

Broadcasting of containers having different shapes

Let me start with an example:

julia> x = string.([1, 2, 3], ",", ["a" "b"], ":")
3×2 Matrix{String}:
 "1,a:"  "1,b:"
 "2,a:"  "2,b:"
 "3,a:"  "3,b:"

julia> y = rand.(Int8, 1:3)
3-element Vector{Vector{Int8}}:
 [-95]
 [65, -119]
 [-77, -78, -5]

julia> string.(x, y)
3×2 Matrix{String}:
 "1,a:Int8[-95]"           "1,b:Int8[-95]"
 "2,a:Int8[65, -119]"      "2,b:Int8[65, -119]"
 "3,a:Int8[-77, -78, -5]"  "3,b:Int8[-77, -78, -5]"

julia> string.([1, 2, 3], ",", ["a" "b"], ":", rand.(Int8, 1:3))
3×2 Matrix{String}:
 "1,a:Int8[-127]"            "1,b:Int8[-93]"
 "2,a:Int8[92, -91]"         "2,b:Int8[-88, 118]"
 "3,a:Int8[-104, -29, -38]"  "3,b:Int8[23, 109, 76]"

What we can see here is that the x = string.([1, 2, 3], ",", ["a" "b"], ":") operation creates a 3×2 matrix,
while y = rand.(Int8, 1:3) creates a 3-element vector. Since Julia treats vectors as columns,
the matrix and the vector have compatible dimensions and can be used together in broadcasting.
The call string.(x, y) reuses each element of y across its whole row. As a result, within each row
of the resulting matrix every string has the same suffix.

Therefore, you might be surprised that when you combine the expressions into one call
string.([1, 2, 3], ",", ["a" "b"], ":", rand.(Int8, 1:3)) you get a different result.
Now each suffix is different (I used random numbers to show that the suffixes indeed differ).

What is the reason for this behavior? As the Julia Manual explains,
Julia performs broadcast fusion. This means that it behaves as if it created a single loop over the two
dimensions of the output matrix and evaluated the expression
string(p, ",", q, ":", rand(Int8, r)) for values of p, q, and r appropriately determined
from the source data [1, 2, 3], ["a" "b"], and 1:3. The result of rand is not cached when the
1:3 vector is expanded over the second dimension, so we get a different suffix in each cell.
Sometimes this is indeed desired; in other cases it can be surprising and unwanted.
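
To make this concrete, the fused call behaves roughly like the explicit loop below. This is only an illustration of the semantics for the 3×2 case, not how Julia implements broadcasting internally; the names a, b, r, and out are introduced here:

```julia
a = [1, 2, 3]     # varies along rows
b = ["a" "b"]     # 1×2 matrix, varies along columns
r = 1:3           # varies along rows, reused for every column

out = Matrix{String}(undef, length(a), size(b, 2))
for j in axes(out, 2), i in axes(out, 1)
    # rand is called once per cell, so each cell gets fresh random numbers.
    out[i, j] = string(a[i], ",", b[1, j], ":", rand(Int8, r[i]))
end
```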

First, let me explain how to resolve this issue. You can use the identity function (non-broadcasted)
to break broadcast fusion. Here is how you can do it:

julia> string.([1, 2, 3], ",", ["a" "b"], ":", identity(rand.(Int8, 1:3)))
3×2 Matrix{String}:
 "1,a:Int8[45]"           "1,b:Int8[45]"
 "2,a:Int8[-121, 60]"     "2,b:Int8[-121, 60]"
 "3,a:Int8[47, -25, 42]"  "3,b:Int8[47, -25, 42]"

The part of the expression wrapped in identity gets evaluated first and is then fed into the enclosing
broadcasting expression.

In our example this changes the result of the operation, because we generated random numbers.
However, even when the result is not affected, fusion can affect performance significantly.
Have a look at this example (timings are after compilation):

julia> @time sin.(1:1000) .+ cos.((1:1000)');
  0.032546 seconds (2 allocations: 7.629 MiB)

julia> @time identity(sin.(1:1000)) .+ identity(cos.((1:1000)'));
  0.005688 seconds (4 allocations: 7.645 MiB)

What is the reason for the difference? In the first case both sin and cos
are evaluated 1,000,000 times (for each cell separately).
In the second example we have only 1000 calls each of sin and cos.
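
The number of calls can be verified by counting them on a smaller input. A small sketch, where counted_sin and sin_calls are hypothetical names introduced only for this illustration:

```julia
# Hypothetical wrapper that counts how many times sin is evaluated.
sin_calls = Ref(0)
counted_sin(v) = (sin_calls[] += 1; sin(v))

rows = 1:3
cols = (1:4)'

sin_calls[] = 0
fused = counted_sin.(rows) .+ cos.(cols)
calls_fused = sin_calls[]        # one call per cell of the 3×4 output

sin_calls[] = 0
unfused = identity(counted_sin.(rows)) .+ cos.(cols)
calls_unfused = sin_calls[]      # one call per element of rows
```

Here the fused version evaluates the wrapper 12 times and the version with identity only 3 times, while both produce the same matrix.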

You might ask when the default behavior is desirable. It is useful, for example, when
you want to avoid aliasing. Take a look:

julia> m1 = tuple.([1 2], vcat.(1:3, 4:6))
3×2 Matrix{Tuple{Int64, Vector{Int64}}}:
 (1, [1, 4])  (2, [1, 4])
 (1, [2, 5])  (2, [2, 5])
 (1, [3, 6])  (2, [3, 6])

julia> push!(m1[1, 1][2], 100)
3-element Vector{Int64}:
   1
   4
 100

julia> m1
3×2 Matrix{Tuple{Int64, Vector{Int64}}}:
 (1, [1, 4, 100])  (2, [1, 4])
 (1, [2, 5])       (2, [2, 5])
 (1, [3, 6])       (2, [3, 6])

julia> m2 = tuple.([1 2], identity(vcat.(1:3, 4:6)))
3×2 Matrix{Tuple{Int64, Vector{Int64}}}:
 (1, [1, 4])  (2, [1, 4])
 (1, [2, 5])  (2, [2, 5])
 (1, [3, 6])  (2, [3, 6])

julia> push!(m2[1, 1][2], 100)
3-element Vector{Int64}:
   1
   4
 100

julia> m2
3×2 Matrix{Tuple{Int64, Vector{Int64}}}:
 (1, [1, 4, 100])  (2, [1, 4, 100])
 (1, [2, 5])       (2, [2, 5])
 (1, [3, 6])       (2, [3, 6])

As you can see, in this case we typically would want vcat to be called separately for every cell.
When we broke broadcast fusion with identity(vcat.(1:3, 4:6)), we got the same vector in every cell of a single row,
which could lead to hard-to-catch bugs.

Another question is when broadcast fusion is useful from the performance perspective.
The answer is that in simple calls like the following (timings are after compilation):

julia> @time cot.(sin.(cos.(tan.(1:10^6))));
  0.057595 seconds (2 allocations: 7.629 MiB)

we avoid unnecessary allocation of intermediate objects. We can simulate non-fused performance
by injecting identity to see the difference:

julia> @time cot.(identity(sin.(identity(cos.(identity(tan.(1:10^6)))))));
  0.085532 seconds (8 allocations: 30.518 MiB)

Aliasing in broadcasted assignment

Another potential issue is aliasing in broadcasted assignment .=. Have a look at this example:

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x .= sum.(Ref(x))
2×2 Matrix{Int64}:
 10  35
 19  68

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x .= identity(sum.(Ref(x)))
2×2 Matrix{Int64}:
 10  10
 10  10

In the first case, x .= sum.(Ref(x)), as we already discussed, sum.(Ref(x))
gets executed for each cell of the matrix x. Since we use the broadcasted assignment .=,
the operation happens in place, which means that x gets updated during the process
and consecutive sum.(Ref(x)) calls use the changed x. Again, breaking broadcast
fusion with identity(sum.(Ref(x))) forces Julia to materialize the sum before
doing the outer broadcasted assignment, and we get 10 in every cell.
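
The in-place update behaves like the explicit loop below, which makes the source of the surprising values visible. The second half shows two other ways to get the non-aliased result; this is a sketch, and the variable names are illustrative:

```julia
# What the fused in-place assignment effectively does: each cell is
# overwritten in column-major order, and later sums see earlier updates.
x = [1 2; 3 4]
for i in eachindex(x)
    x[i] = sum(x)
end
aliased = copy(x)

# Option 1: compute the scalar once, then assign it everywhere.
y = [1 2; 3 4]
total = sum(y)
y .= total

# Option 2: let the right-hand side read from a copy that is never modified.
z = [1 2; 3 4]
z .= sum.(Ref(copy(z)))
```

The loop reproduces the matrix with 10, 19, 35, and 68, while both options leave 10 in every cell.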

To give another example, let us look at how we can fill a vector with consecutive
powers of 2 (of course there are better ways to do it):

julia> x = [1, 0, 0, 0, 0, 0, 0]
7-element Vector{Int64}:
 1
 0
 0
 0
 0
 0
 0

julia> x .= sum.(Ref(x))
7-element Vector{Int64}:
  1
  1
  2
  4
  8
 16
 32
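
For comparison, one more direct way to build the same vector without relying on aliasing is sketched below:

```julia
# 2 .^ (0:5) gives 1, 2, 4, 8, 16, 32; the extra leading 1 matches
# the initial element of the aliasing-based example.
powers = vcat(1, 2 .^ (0:5))
```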

Conclusions

In summary, it is important to keep in mind that Julia performs broadcast fusion when
operating on several broadcasted function calls that are chained together.

This broadcast fusion in general improves performance and reduces allocations, but in some
cases it is not desirable. The most common scenarios are:

  • when we broadcast operations over containers of different dimensions
    (it can degrade performance or lead to different results);
  • when we perform broadcasted assignment to a container that is also used on the right-hand side of
    the expression (it can lead to unexpectedly incorrect results).

As I have shown, in such cases one of the ways to fix the problem is to break broadcast fusion
by injecting a non-broadcasted function call forcing materialization of intermediate results
of computation. The identity function can be used to achieve this effect.