
Julia beginner’s corner: mastering comparison operators

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/08/comparisons.html

Introduction

In every programming language performing comparisons is one of the most
fundamental operations. Many people starting to learn Julia find it surprising
that it provides two sets of comparison operators. Today I want to summarize
how each of them works and discuss the practical consequences.

In this post I use Julia 1.6.3.

The standard comparison operators

Normally one uses == and != to test for equality, and <, >, <=, and
>= to test for ordering of values.

Here we can see how it works:

julia> 1 == 2
false

julia> "ab" < "cd"
true

julia> (1, "ab") > (1, "cd")
false

julia> (1 => 2) < (3 => 4)
true

An important distinction is that == and != are always defined for values of
any type, while the ordering comparisons are defined only when the type designer
decided that such comparisons make sense.

Another general rule worth remembering, which the examples above illustrate, is
that comparisons can be applied to collections (e.g. arrays or tuples). They
are normally implemented by recursively comparing the elements of the
collection using the lexicographic ordering.
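
To make the lexicographic rule concrete, here is a small sketch. The values are
arbitrary and chosen only for illustration. The first position where the
elements differ decides the result, and a proper prefix sorts before the longer
collection.

```julia
# Tuples are compared element by element; the first differing position decides.
t1 = (1, "ab") < (1, "cd")   # first elements tie, so "ab" < "cd" decides
t2 = (2, "ab") < (1, "cd")   # 2 < 1 is false, so the strings are never consulted

# Vectors follow the same rule; a proper prefix sorts before the longer vector.
v1 = [1, 2] < [1, 3]
v2 = [1, 2] < [1, 2, 0]

# cmp summarizes the three-way comparison as -1, 0, or 1.
c = cmp([1, 2, 3], [1, 2, 3])

println((t1, t2, v1, v2, c))   # (true, false, true, true, 0)
```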

The standard operators are typically used in practice and are, in particular,
easy to type and read. However, they exhibit behavior that might not be
desirable in certain cases. The three most common situations are as follows.

Case 1: numeric -0.0 and 0.0 are considered equal:

julia> -0.0 == 0.0
true

julia> -0.0 < 0.0
false

Case 2: comparisons with NaN always produce false:

julia> NaN == NaN
false

julia> NaN < NaN
false

julia> NaN > NaN
false

julia> NaN == 1.0
false

julia> NaN < 1.0
false

julia> NaN > 1.0
false

Case 3: comparisons with missing always produce missing:

julia> missing == missing
missing

julia> missing < missing
missing

julia> missing > missing
missing

julia> missing == 1
missing

julia> missing < 1
missing

julia> missing > 1
missing

These properties make sense in certain situations, but they are not desirable
when, for example, we want to sort values or store them in a dictionary or set.
Therefore Julia introduces another set of comparison operators.

The special comparison operators

There are two special comparison functions: isequal and isless. The major
difference here is that the user can expect that these comparisons always return
a Bool value. Additionally isequal distinguishes 0.0 and -0.0 and
considers all NaN values as equal. Therefore:

julia> isequal(NaN, NaN)
true

julia> isless(NaN, NaN)
false

julia> isequal(-0.0, 0.0)
false

julia> isless(-0.0, 0.0)
true

julia> isequal(missing, missing)
true

julia> isless(missing, missing)
false

Here is an example of isequal at work:

julia> unique([-1.0, -0.0, 0.0, missing, NaN, NaN])
5-element Vector{Union{Missing, Float64}}:
  -1.0
  -0.0
   0.0
    missing
 NaN

The unique function uses isequal to test for equality, so it de-duplicated the
NaNs but retained both -0.0 and 0.0. Also, the missing in the vector was not a
problem: the function worked without an error.
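
Dictionary keys are matched with isequal (together with hash) in the same way.
The following sketch, with made-up keys and values, shows the consequences: NaN
behaves as a single key, the two zeros are separate keys, and missing is a
valid key.

```julia
d = Dict{Union{Missing,Float64},String}()

d[NaN] = "first NaN"
d[NaN] = "second NaN"       # same key: isequal(NaN, NaN) is true
d[0.0] = "positive zero"
d[-0.0] = "negative zero"   # different key: isequal(0.0, -0.0) is false
d[missing] = "missing"      # missing can be used as a key

println(length(d))   # 4
println(d[NaN])      # second NaN
```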

Similarly, sort uses isless by default so the following works:

julia> sort([-1.0, -0.0, 0.0, missing, NaN, NaN])
6-element Vector{Union{Missing, Float64}}:
  -1.0
  -0.0
   0.0
 NaN
 NaN
    missing

If we switched the comparison operator to < we would get an error:

julia> sort([-1.0, -0.0, 0.0, missing, NaN, NaN], lt=<)
ERROR: TypeError: non-boolean (Missing) used in boolean context

or get a result that depends on the order of the input in corner cases, because
< treats -0.0 and 0.0 as equal:

julia> sort([-0.0, 0.0], lt=<)
2-element Vector{Float64}:
 -0.0
  0.0

julia> sort([0.0, -0.0], lt=<)
2-element Vector{Float64}:
  0.0
 -0.0

In practice, the most common use of isequal and isless is in cases where
we want to avoid getting missing as the result of a comparison.
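
A typical situation of this kind is a predicate passed to filter. The sketch
below uses a made-up data vector. With isequal the predicate always returns a
Bool, while with == it returns missing for the missing element, and filter
cannot use that as a condition.

```julia
data = [1, missing, 2, 1]

# isequal always returns a Bool, so it is safe to use as a predicate.
ones_only = filter(v -> isequal(v, 1), data)

# == returns missing for the missing element, so this call throws an error.
eq_failed = try
    filter(v -> v == 1, data)
    false
catch
    true
end

println(ones_only)
println(eq_failed)   # true
```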

The egal equality

Before we move forward, it is worth knowing that Julia has yet a third
notion of equality. It is invoked using the === comparison. This comparison
always returns a Bool value and tests if the passed arguments are identical,
in the sense that no program could distinguish them.

This distinction is most relevant for mutable types:

julia> x = [1]
1-element Vector{Int64}:
 1

julia> y = [1]
1-element Vector{Int64}:
 1

julia> x === x
true

julia> x === y
false

julia> x == x
true

julia> x == y
true

As you can see above, the x and y vectors are considered equal by == because
they have the same contents, but they are not equal by === because they are
stored in different locations in memory.

It is important to note here that values of immutable types are compared with
=== by their contents at the bit level, so we have the following:

julia> x = (1,)
(1,)

julia> y = (1,)
(1,)

julia> x === y
true

Rules for designing custom types

In Julia it is very easy to define custom types. Therefore it is crucially
important, when doing so, to understand what the default implementations
of the comparison operators are. Here are the rules:

  • == falls back to === by default;
  • isequal falls back to == by default and additionally requires that the
    hash function is consistently defined;
  • < falls back to isless.

As you can see, maybe somewhat surprisingly, the fallback implementations work
in different ways for == and isequal vs < and isless pairs. Also,
although == is not directly linked with hash it is indirectly linked to it
because isequal falls back to it.

The simple practical rule then is the following. If you define a new type then:

  • always design ==, isequal and hash functions jointly if you implement
    them (if you do not implement any you are safe as the default fallbacks for
    === and hash are designed in a consistent way);
  • if you want your type to support ordering, always design < and isless
    jointly, and then also define the equality operators discussed in the bullet
    above.
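
As a sketch of these rules, consider two hypothetical types (RawPlaylist and
Playlist are invented for this illustration). The first relies on the
fallbacks only. The second defines ==, hash, and isless jointly and lets
isequal and < fall back to them, which is one possible design and not the only
valid one.

```julia
# Hypothetical types, used only for illustration.
struct RawPlaylist            # no custom methods: relies on the fallbacks
    tracks::Vector{String}
end

struct Playlist               # equality, hashing, and ordering defined jointly
    name::String
    tracks::Vector{String}
end

# == compares contents; isequal falls back to it.
Base.:(==)(a::Playlist, b::Playlist) = a.name == b.name && a.tracks == b.tracks

# hash must agree with isequal: equal playlists need equal hashes.
Base.hash(p::Playlist, h::UInt) = hash(p.tracks, hash(p.name, hash(:Playlist, h)))

# isless gives the ordering; < falls back to it.
Base.isless(a::Playlist, b::Playlist) =
    isless((a.name, a.tracks), (b.name, b.tracks))

# The default == is ===, and the two vectors are distinct objects.
raw_equal = RawPlaylist(["x"]) == RawPlaylist(["x"])

a = Playlist("mix", ["x", "y"])
b = Playlist("mix", ["x", "y"])
c = Playlist("rock", ["z"])

results = (raw_equal, a == b, isequal(a, b), hash(a) == hash(b),
           length(unique([a, b, c])), a < c)
println(results)   # (false, true, true, true, 2, true)
```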

The in case study

As a special application of the above examples let me discuss the in function
here. It is very useful for testing if some value is found in some collection.
I am mentioning it because the implementation of in is quite tricky in Julia
1.6. Normally it uses == to test for equality except for certain collections,
like Set or Dict, which use isequal instead. In consequence we have the
following:

julia> 1 in [1, missing]
true

julia> 1 in [2, missing]
missing

julia> 1 in Set([1, missing])
true

julia> 1 in Set([2, missing])
false

julia> NaN in [NaN]
false

julia> NaN in Set([NaN])
true

julia> 0.0 in [-0.0]
true

julia> 0.0 in Set([-0.0])
false

These differences can be surprising, so it is important to remember them. Let me
note that this is quite a relevant practical consideration, because using a
Set wrapper is a very common pattern for improving the performance of the
in test:

julia> x = rand(Int, 10^5);

julia> y = rand(Int, 10^5);

julia> in.(x, Ref(y)); # precompile

julia> @time in.(x, Ref(y)); # this is slow
  5.528883 seconds (6 allocations: 16.672 KiB)

julia> in.(x, Ref(Set(y))); # precompile

julia> @time in.(x, Ref(Set(y))); # this is fast
  0.006156 seconds (14 allocations: 2.267 MiB)
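
If you want the isequal semantics without building a Set, one option is
any(isequal(x), collection). This is a suggestion of this sketch and not
something prescribed above. It still scans the whole vector, so it does not
give the speed-up of a Set. The vector v is made up for illustration.

```julia
v = [2.0, NaN, missing]

with_in = NaN in v                  # uses ==, so the result here is missing
with_any = any(isequal(NaN), v)     # uses isequal, so the result is a Bool
with_set = NaN in Set(v)            # Set lookups use isequal as well

absent_in = 1.0 in v                # missing: comparing with missing is inconclusive
absent_any = any(isequal(1.0), v)   # false

println((with_in, with_any, with_set, absent_in, absent_any))
```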

The isapprox case study

Sometimes when comparing numeric values we are interested in checking if
they are approximately equal. The reason is that in some cases due to round-off
errors the sharp == equality is not what one might expect:

julia> 0.1 + 0.2 == 0.3
false

In such cases, when we are interested in testing approximate equality, the
isapprox function can be used:

julia> isapprox(0.1 + 0.2, 0.3)
true

You might ask how the approximate equality is defined. The rules are a bit
involved so I refer you to the documentation for the details. Here
let me just note that you can control absolute tolerance, relative tolerance
and how NaN values are handled via the atol, rtol, and nans keyword
arguments respectively.
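
Here is a short sketch of the three keyword arguments of isapprox in Base
(atol, rtol, and nans). The numbers are chosen only for illustration. Note in
particular that the default absolute tolerance is zero, so comparing a tiny
number against 0.0 fails unless atol is set.

```julia
a = isapprox(0.1 + 0.2, 0.3)              # default relative tolerance
b = 0.1 + 0.2 ≈ 0.3                       # ≈ is the infix form of isapprox
c = isapprox(1.0e-10, 0.0)                # default atol is 0, so this is false
d = isapprox(1.0e-10, 0.0; atol=1.0e-8)   # absolute tolerance makes it true
e = isapprox(100.0, 101.0; rtol=0.02)     # difference 1.0 is within 2% of 101.0
f = isapprox(NaN, NaN)                    # NaN values are not equal by default
g = isapprox(NaN, NaN; nans=true)         # unless nans=true is passed

println((a, b, c, d, e, f, g))   # (true, true, false, true, true, false, true)
```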

Conclusions

Designing comparison operators properly is one of the hardest tasks in every
programming language. In Julia the design covers a wide range of possible
scenarios that the user might want in practice. The cost of this flexible
design is that it takes some time to master it. I hope that after reading this
post you have enough understanding of the details to be able to confidently
work with comparison operators in Julia.

On Alan Edelman’s knife-edge condition in computational social science

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/01/socialscience.html

Introduction

Last week Alan Edelman and Viral Shah gave an excellent keynote talk during
the Social Simulation Conference 2021. If you would like to learn
about one of the models that was discussed, please read on.

My objective with this post is to make a permanent record of the model as,
if I understood the comment of Alan Edelman correctly, it has not been published.

Of course all errors, extensions or omissions in the outline are mine (and I
will gladly correct the post if there are any).

In the computational part of the post I use Julia 1.6.3, Roots.jl 1.3.5,
and Plots.jl 1.22.3.

The problem statement

The question we want to answer is an attempt to address the following
empirical phenomenon:

Imagine there is a road accident. Assume it happened in a remote place where
only one person is able to observe it. In such a situation it is highly likely
that this person will try to help.

Now assume that the same accident happens in a densely populated area.
Empirical data shows that each individual observer is much less likely
to help, and it is even possible that the probability of any help drops
in comparison to the first scenario, where there is only one observer.

The model

The model of this situation that Alan Edelman presented is the following. Assume
that we have \(n\) spectators of the event. Next we assume that the baseline
(i.e. when there is only one spectator) probability of helping is \(r\).
Finally we define the probability multiplier \(f(n)>0\) and assume
that if there are \(n\) spectators each of them independently decides to help
with probability \(f(n)r\). The assumptions about \(f\) are that \(f(1)=1\) and
that \(f\) is decreasing.

Under these assumptions we can observe that the probability that any help is
given is \(1-(1-f(n)r)^n\). We want to analyze the properties of this number.
Now we fully switch from the social scientist’s hat to the mathematician’s hat,
by assuming \(n\to+\infty\).
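
Before going to the limit, the formula can be evaluated directly for a finite
number of spectators. In the sketch below the function name help_probability
and the multiplier \(f(n)=1/n\) are assumptions made for illustration only;
the model itself does not fix a particular \(f\).

```julia
# Probability that at least one of n spectators helps, where each of them
# helps independently with probability f(n) * r.
help_probability(n, r, f) = 1 - (1 - f(n) * r)^n

# Hypothetical multiplier (an assumption of this sketch): f(1) = 1 and decreasing.
f_example(n) = 1 / n

r = 0.5
alone = help_probability(1, r, f_example)    # one spectator: equals r
crowd = help_probability(10, r, f_example)   # ten spectators

println((alone, crowd))
```

With this particular multiplier the probability of any help for ten spectators
is lower than for a single spectator, which matches the phenomenon described in
the problem statement.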

Knife-edge condition

When we analyze the asymptotic properties of this model it is useful to define
\(g(n)=f(n)n\). Then the formula for our probability becomes \(1-(1-g(n)r/n)^n\).

We now can notice that if \(g(n)\to0\) then the probability that help is given
tends to \(0\). On the other hand if \(g(n)\to+\infty\) then the probability
tends to \(1\).
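
The three regimes can be checked numerically. As a sketch, assume the family
\(f(n)=n^{-a}\), so that \(g(n)=n^{1-a}\). This family, the exponents, and the
function name help_probability are my choices for illustration and are not
part of the model.

```julia
# Probability that at least one of n spectators helps.
help_probability(n, r, f) = 1 - (1 - f(n) * r)^n

r = 0.5
n = 10^6

slow = help_probability(n, r, k -> k^(-0.5))   # g(n) = sqrt(n), grows without bound
edge = help_probability(n, r, k -> k^(-1.0))   # g(n) = 1, the knife-edge case
fast = help_probability(n, r, k -> k^(-2.0))   # g(n) = 1/n, tends to 0

println((slow, edge, fast))
println(1 - exp(-r))   # limit for the knife-edge case with g = 1
```

For large \(n\) the first value is practically \(1\), the last one is
practically \(0\), and only the middle one stays strictly between them.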

So, as we can see, we get a sharp asymptotic knife-edge condition for the model
to be interesting. The asymptotic probability is in \(]0,1[\) (assuming
\(g(n)\) has a limit) only if \(g(n)\) has a positive and finite limit. In other
words, asymptotically the probability of reaction of a spectator has to be
proportional to \(1/n\). Assume that the limit is \(g\), i.e. \(g(n)\to g\).
Then the asymptotic probability is \(1-\exp(-gr)\), because
\((1-gr/n)^n\to\exp(-gr)\). So we might ask for which \(g\) the probability of
reacting does not change. This is a solution to the equation
\(r = 1-\exp(-gr)\). Rearranging gives \(\exp(-gr)=1-r\), so
\(g=-\log(1-r)/r\).
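
The closed-form solution is easy to verify numerically. In this sketch the name
g_star and the tested values of \(r\) are chosen for illustration, and log1p is
used only because it is more accurate than log(1 - r) for small \(r\).

```julia
# Closed-form solution g = -log(1 - r) / r of the equation r = 1 - exp(-g * r).
g_star(r) = -log1p(-r) / r

rs = (0.1, 0.5, 0.9)

# Plugging g_star(r) back into the right-hand side should give r again.
recovered = map(r -> 1 - exp(-g_star(r) * r), rs)

println(map(g_star, rs))
println(recovered)
```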

Now we are ready to jump into Julia to do some plotting.

Computational support

Let us now try plotting the solution of the equation \(r = 1-\exp(-gr)\) for
\(g\) as a function of \(r\).

Here is the code that produces the requested plot:

julia> using Roots

julia> using Plots

julia> plot(r -> find_zero(g -> 1 - exp(-r * g) - r, 1.0),
            xlim=[0.01,0.99], xlabel="r", ylabel="g", legend=nothing)

and we get the following relationship:

[Figure: plot of g as a function of r]

In the code above, instead of using the analytical solution, I have shown
how one can dynamically find the root of our equation as a function of a
parameter.

As you can see, the higher the initial \(r\), the higher \(g\) needs to be, and
in general it has to be greater than \(1\).

Conclusions

I hope you enjoyed the post even though it was not that much about Julia. We
have learned that the higher the baseline individual probability of reaction to
some bad event the harder it becomes to keep the reaction probability at the
given level as the number of spectators increases.

Hands-on Data Science with Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/24/manning.html

Introduction

Warning! The post includes (self)promotion.

Recently, together with Łukasz Kraiński, I published with Manning
the Hands-on Data Science with Julia liveProject.

In this post I want to discuss the idea behind this format and what you can
expect inside.

Why liveProject format?

If you are reading this post most likely you know that I write a lot on Julia
Slack, Julia Discourse, StackOverflow [julia] tag, give various tutorials,
write package documentation, and finally I do weekly updates to this blog.

So how is the liveProject format different, such that I decided to give it a
try instead of doing e.g. a new JuliaAcademy course?

First, the content is project oriented. This means that you get a business
description of the challenge and should try to finish the task yourself. If
something is challenging to finish then you have three levels of support:
hints, a partial solution, and finally a full solution for every task. It might
not seem very significant, but I believe that trying to do something on one's
own (and getting only as much help as really required) is superior to just
reading through some example code.

The second valuable option is that on the platform you can compare your
implementation with other solutions to the same problem and also discuss it
with other coders or mentors.

All this means that although the projects are marked as intermediate level
content (since such experience is required to do the tasks on your own) even if
one is a beginner the material is still useful, with a twist that in this case
you probably will have to rely more on the provided help to be able to finish
the tasks.

I will see how it works out and maybe write a post with conclusions after some
time of giving the liveProject format a try.

What is inside?

The Hands-on Data Science with Julia liveProject is divided into five
parts. Here I will focus on the Julia packages that are featured in
them (all packages are used in their latest versions as of the time of writing
this post; I list them incrementally as they are introduced in consecutive
projects):

  1. data preprocessing: Arrow.jl, Chain.jl, CSV.jl, DataFrames.jl, FreqTables.jl,
    Plots.jl, StatsBase.jl;
  2. clustering (k-means, DBSCAN): Clustering.jl, Distances.jl;
  3. dimensionality reduction (PCA, t-SNE, UMAP): Conda.jl, MultivariateStats.jl,
    PyCall.jl;
  4. predictive modelling – regression problems (random forest, GLM):
    DecisionTree.jl, GLM.jl, HypothesisTests.jl;
  5. predictive modelling – classification problems (XGBoost): ROCAnalysis.jl,
    XGBoost.jl.

Conclusions

As opposed to standard teaching materials, liveProjects are paid content, and
that is why I have issued a (self)promotion warning at the beginning of this
post. However, a good thing is that the first project on
data preprocessing is available for free so it is easy to check if you
like the content that we have prepared.