Hunting for bugs in Julia for Data Analysis

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/03/errata.html

Introduction

A few months ago my Julia for Data Analysis book was released.
I tried very hard to make it correct. Regrettably, I failed.

Fortunately my readers are more careful than me and they found several issues
in my book. I keep their log in the Errata section of the GitHub
repository of the book.

In this post I want to share you my experience about the kinds of bugs
slipped into the book and comment how I think they could be avoided.

My experience is that there are three important classes of issues:

  • various misprints and typographical errors;
  • consistency of code execution across Julia versions;
  • factual errors.

Let me go through these classes in sequence.

Misprints and typographical errors

Such errors sneak in for many reasons. Let me highlight some that have bitten
me and could be avoided.

First is printing of non-standard UTF-8 characters. These are nasty. In my
version of the book things look nice, but when it went through printing
preparation process, somehow on the way some characters got messed up.
For example in Chapter 6, section 6.4.1, page 132
codeunits("ε") is printed as codeunits("?").

Takeaway: before your book goes to press always carefully check these parts of
the text where you use non-standard UTF-8 characters (a common case in Julia).

Second is the side effect of using multi-selection or auto-replace in your
editor. For example in Chapter 3, section 3.2.3, page 59 I have
an issue that I write sort(v::AbstractVector; kwthe.) instead of
sort(v::AbstractVector; kws...). It is clear that I must have used some
pattern matching and replaced s.. with the. I do not even remember why.

Takeaway: multi-selection and auto-replace are nice, but never do them
globally; it is best to review every change before it is applied (do not make
them automatically in one shot; it is hart to resist, but this is what you
really should not do).

Consistency across versions of Julia

There are many flavors of this issue. It can mostly be resolved by strict use
of Project.toml and Manifest.toml files, but sometimes unexpected things happen.

For example in Chapter 1, section 1.2.1, page 7 I show the following snippet:

julia> function sum_n(n)
           s = 0
           for i in 1:n
               s += i
           end
           return s
       end
sum_n (generic function with 1 method)

julia> @time sum_n(1_000_000_000)
  0.000001 seconds
500000000500000000

This timing is surprisingly fast (and the reason is explained in the book).
The issue is that this is the situation under Julia 1.7 as this is the version
of the language I used in the book.

Under Julia 1.8 and Julia 1.9 running the same code takes longer
(tested under Julia 1.9-beta4):

julia> @time sum_n(1_000_000_000)
  2.265569 seconds
500000000500000000

The reason for this inconsistency is a bug in the @time macro introduced in
Julia 1.8 release. The sum_n(1_000_000_000) call (without @time) is executed
fast. Here is a simplified benchmark (run under Julia 1.9-beta4) showing this:

julia> let
           start = time_ns()
           v = sum_n(1_000_000_000)
           stop=time_ns()
           v, Int(stop - start)
       end
(500000000500000000, 1000)

Unfortunately there is an issue with the @time macro used in global scope
that causes the timing to be inaccurate. This bug needs to be resolved in
Base Julia. See this issue for details.

Takeaway: things can change as software evolves; always make sure to
explicitly tell your readers which version and configuration of software they
should use to run your code.

Factual errors

This one is most problematic, as I really feel bad, when I find that I have
written something that is not fully correct. Let me give you an example
from Chapter 2, section 2.3.1, page 30.

The Julia Manual in the Short Circuit Evaluation section states:

Instead of if <cond> <statement> end, one can write <cond> && <statement>
(which could be read as: <cond> and then <statement>).
Similarly, instead of if if ! <cond> <statement> end, one can write
<cond> || <statement> (which could be read as: <cond> or else <statement>).

Similarly, in my book I have considered the following expressions:

x > 0 && println(x)

and

if x > 0
    println(x)
end

where x = -7.

I write in the book that Julia interprets them both in the same way.

Indeed it is true that the same expressions get evaluated in both cases.
However, in general if statement and doing short-circuit evaluation are
not equivalent.

What is the difference? If the condition would be false then the value of if
statement (without else) is nothing and the value of the expression using
short-circuiting is false. Here is an example:

julia> x = -7
-7

julia> show(x > 0 && println(x))
false
julia> show(if x > 0
           println(x)
       end)
nothing

The difference is subtle, but in cases when you would use the produced
value later in your code it could become important.

Takeaway: always carefully consider all aspects of code that you present.

Conclusions

As a conclusion I would like to ask all readers of my book to share with me
your feedback. I will try to incorporate it in the next release of the book
in the best way I can.

Hunting for bugs in Julia for Data Analysis

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/03/errata.html

Introduction

A few months ago my Julia for Data Analysis book was released.
I tried very hard to make it correct. Regrettably, I failed.

Fortunately my readers are more careful than me and they found several issues
in my book. I keep their log in the Errata section of the GitHub
repository of the book.

In this post I want to share you my experience about the kinds of bugs
slipped into the book and comment how I think they could be avoided.

My experience is that there are three important classes of issues:

  • various misprints and typographical errors;
  • consistency of code execution across Julia versions;
  • factual errors.

Let me go through these classes in sequence.

Misprints and typographical errors

Such errors sneak in for many reasons. Let me highlight some that have bitten
me and could be avoided.

First is printing of non-standard UTF-8 characters. These are nasty. In my
version of the book things look nice, but when it went through printing
preparation process, somehow on the way some characters got messed up.
For example in Chapter 6, section 6.4.1, page 132
codeunits("ε") is printed as codeunits("?").

Takeaway: before your book goes to press always carefully check these parts of
the text where you use non-standard UTF-8 characters (a common case in Julia).

Second is the side effect of using multi-selection or auto-replace in your
editor. For example in Chapter 3, section 3.2.3, page 59 I have
an issue that I write sort(v::AbstractVector; kwthe.) instead of
sort(v::AbstractVector; kws...). It is clear that I must have used some
pattern matching and replaced s.. with the. I do not even remember why.

Takeaway: multi-selection and auto-replace are nice, but never do them
globally; it is best to review every change before it is applied (do not make
them automatically in one shot; it is hart to resist, but this is what you
really should not do).

Consistency across versions of Julia

There are many flavors of this issue. It can mostly be resolved by strict use
of Project.toml and Manifest.toml files, but sometimes unexpected things happen.

For example in Chapter 1, section 1.2.1, page 7 I show the following snippet:

julia> function sum_n(n)
           s = 0
           for i in 1:n
               s += i
           end
           return s
       end
sum_n (generic function with 1 method)

julia> @time sum_n(1_000_000_000)
  0.000001 seconds
500000000500000000

This timing is surprisingly fast (and the reason is explained in the book).
The issue is that this is the situation under Julia 1.7 as this is the version
of the language I used in the book.

Under Julia 1.8 and Julia 1.9 running the same code takes longer
(tested under Julia 1.9-beta4):

julia> @time sum_n(1_000_000_000)
  2.265569 seconds
500000000500000000

The reason for this inconsistency is a bug in the @time macro introduced in
Julia 1.8 release. The sum_n(1_000_000_000) call (without @time) is executed
fast. Here is a simplified benchmark (run under Julia 1.9-beta4) showing this:

julia> let
           start = time_ns()
           v = sum_n(1_000_000_000)
           stop=time_ns()
           v, Int(stop - start)
       end
(500000000500000000, 1000)

Unfortunately there is an issue with the @time macro used in global scope
that causes the timing to be inaccurate. This bug needs to be resolved in
Base Julia. See this issue for details.

Takeaway: things can change as software evolves; always make sure to
explicitly tell your readers which version and configuration of software they
should use to run your code.

Factual errors

This one is most problematic, as I really feel bad, when I find that I have
written something that is not fully correct. Let me give you an example
from Chapter 2, section 2.3.1, page 30.

The Julia Manual in the Short Circuit Evaluation section states:

Instead of if <cond> <statement> end, one can write <cond> && <statement>
(which could be read as: <cond> and then <statement>).
Similarly, instead of if if ! <cond> <statement> end, one can write
<cond> || <statement> (which could be read as: <cond> or else <statement>).

Similarly, in my book I have considered the following expressions:

x > 0 && println(x)

and

if x > 0
    println(x)
end

where x = -7.

I write in the book that Julia interprets them both in the same way.

Indeed it is true that the same expressions get evaluated in both cases.
However, in general if statement and doing short-circuit evaluation are
not equivalent.

What is the difference? If the condition would be false then the value of if
statement (without else) is nothing and the value of the expression using
short-circuiting is false. Here is an example:

julia> x = -7
-7

julia> show(x > 0 && println(x))
false
julia> show(if x > 0
           println(x)
       end)
nothing

The difference is subtle, but in cases when you would use the produced
value later in your code it could become important.

Takeaway: always carefully consider all aspects of code that you present.

Conclusions

As a conclusion I would like to ask all readers of my book to share with me
your feedback. I will try to incorporate it in the next release of the book
in the best way I can.