Hunting for bugs in Julia for Data Analysis

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/03/errata.html

Introduction

A few months ago my Julia for Data Analysis book was released.
I tried very hard to make it correct. Regrettably, I failed.

Fortunately my readers are more careful than I am, and they found several issues
in my book. I keep a log of them in the Errata section of the GitHub
repository of the book.

In this post I want to share my experience of the kinds of bugs that
slipped into the book and comment on how I think they could be avoided.

My experience is that there are three important classes of issues:

  • various misprints and typographical errors;
  • consistency of code execution across Julia versions;
  • factual errors.

Let me go through these classes in sequence.

Misprints and typographical errors

Such errors sneak in for many reasons. Let me highlight some that have bitten
me and could be avoided.

First is the printing of non-ASCII characters. These are nasty. In my
version of the book things look nice, but when it went through the printing
preparation process, some characters got messed up along the way.
For example, in Chapter 6, section 6.4.1, page 132,
codeunits("ε") is printed as codeunits("?").

Takeaway: before your book goes to press, always carefully check those parts of
the text where you use non-ASCII characters (a common case in Julia).
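
One way to find the places that deserve a second look is to list every line
of the manuscript that contains a non-ASCII character. This is only a minimal
sketch, not the process used for the book; the helper name nonascii_lines and
the sample text are made up for illustration.

```julia
# Return the numbers of the lines in `text` that contain
# at least one non-ASCII character.
function nonascii_lines(text::AbstractString)
    return [i for (i, line) in enumerate(split(text, '\n')) if !isascii(line)]
end

# hypothetical manuscript fragment
sample = """
x = 1
codeunits("ε")
y = 2
"""

nonascii_lines(sample)  # only line 2 is flagged

# ε takes two bytes in UTF-8 while ? takes one, so a garbled
# character also changes what the printed code returns
length(codeunits("ε")), length(codeunits("?"))
```

The flagged lines are the ones worth comparing character by character against
the proofs.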

Second is the side effect of using multi-selection or auto-replace in your
editor. For example, in Chapter 3, section 3.2.3, page 59, I
wrote sort(v::AbstractVector; kwthe.) instead of
sort(v::AbstractVector; kws...). It is clear that I must have used some
pattern matching and replaced s.. with the. I do not even remember why.

Takeaway: multi-selection and auto-replace are nice, but never apply them
globally; it is best to review every change before it is applied (do not make
the changes automatically in one shot; it is hard to resist, but this is what
you really should not do).

Consistency across versions of Julia

There are many flavors of this issue. It can mostly be resolved by strict use
of Project.toml and Manifest.toml files, but sometimes unexpected things happen.

For example in Chapter 1, section 1.2.1, page 7 I show the following snippet:

julia> function sum_n(n)
           s = 0
           for i in 1:n
               s += i
           end
           return s
       end
sum_n (generic function with 1 method)

julia> @time sum_n(1_000_000_000)
  0.000001 seconds
500000000500000000

This timing is surprisingly fast (and the reason is explained in the book).
The issue is that this is the result under Julia 1.7, which is the version
of the language I used in the book.

Under Julia 1.8 and Julia 1.9 running the same code takes longer
(tested under Julia 1.9-beta4):

julia> @time sum_n(1_000_000_000)
  2.265569 seconds
500000000500000000

The reason for this inconsistency is a bug in the @time macro introduced in
the Julia 1.8 release. The sum_n(1_000_000_000) call (without @time) is executed
fast. Here is a simplified benchmark (run under Julia 1.9-beta4) showing this:

julia> let
           start = time_ns()
           v = sum_n(1_000_000_000)
           stop = time_ns()
           v, Int(stop - start)
       end
(500000000500000000, 1000)

Unfortunately there is an issue with the @time macro used in global scope
that causes the timing to be inaccurate. This bug needs to be resolved in
Base Julia. See this issue for details.

Takeaway: things can change as software evolves; always make sure to
explicitly tell your readers which version and configuration of software they
should use to run your code.
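
A small check at the top of the accompanying code can make the tested version
explicit. The following is a sketch under assumptions: the names tested_julia
and same_minor are hypothetical, and v"1.7" is taken from the text above.

```julia
# Version of Julia the examples were written for (v"1.7" follows the
# text above; the variable name is hypothetical).
tested_julia = v"1.7"

# true when both versions share the same major and minor number
same_minor(a::VersionNumber, b::VersionNumber) =
    (a.major, a.minor) == (b.major, b.minor)

# VERSION is the version of the running Julia session
if !same_minor(VERSION, tested_julia)
    @warn "Examples were tested under Julia $(tested_julia.major).$(tested_julia.minor); results under Julia $VERSION may differ"
end
```

Package versions are covered separately by the Project.toml and Manifest.toml
files mentioned earlier.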

Factual errors

This class is the most problematic, as I feel really bad when I find that I
have written something that is not fully correct. Let me give you an example
from Chapter 2, section 2.3.1, page 30.

The Julia Manual in the Short Circuit Evaluation section states:

Instead of if <cond> <statement> end, one can write <cond> && <statement>
(which could be read as: <cond> and then <statement>).
Similarly, instead of if ! <cond> <statement> end, one can write
<cond> || <statement> (which could be read as: <cond> or else <statement>).

Similarly, in my book I have considered the following expressions:

x > 0 && println(x)

and

if x > 0
    println(x)
end

where x = -7.

I write in the book that Julia interprets them both in the same way.

Indeed, it is true that the same expressions get evaluated in both cases.
However, in general an if statement and short-circuit evaluation are
not equivalent.

What is the difference? If the condition is false, then the value of the if
statement (without else) is nothing, while the value of the expression using
short-circuiting is false. Here is an example:

julia> x = -7
-7

julia> show(x > 0 && println(x))
false
julia> show(if x > 0
           println(x)
       end)
nothing

The difference is subtle, but in cases when you would use the produced
value later in your code it could become important.
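
To illustrate when this matters, here is a small sketch in which the produced
value is stored and inspected later. The variable names r1 and r2 are
introduced only for this illustration.

```julia
x = -7

# value of the short-circuit expression when the condition is false
r1 = x > 0 && abs(x)

# value of the if expression (without else) when the condition is false
r2 = if x > 0
    abs(x)
end

r1 === false    # true
r2 === nothing  # true

# code that tests the result for nothing treats the two forms differently
isnothing(r1)   # false
isnothing(r2)   # true
```

The same applies to ||: when the condition is true, <cond> || <statement>
produces true, while the corresponding if ! <cond> <statement> end produces
nothing.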

Takeaway: always carefully consider all aspects of code that you present.

Conclusions

To conclude, I would like to ask all readers of my book to share their
feedback with me. I will try to incorporate it in the next release of the book
in the best way I can.

What does it mean that a data frame is a collection of rows?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/02/24/dfrows.html

Introduction

In my recent post I discussed what interfaces Julia defines for
working with containers. Today I want to take a closer look at data frame
objects that are defined in DataFrames.jl.

Before I move forward I want to make a small announcement. On my blog I have
recently added a Learning section, where I collect a list of learning
materials that I find useful for doing data science with Julia. If you would
like to have an item added to this list, please contact me.

This post was written under Julia 1.9.0-beta4 and DataFrames.jl 1.5.0.

Interfaces refresher

Let me start by recalling the discussion we had in this post about
data frame design:

  1. A data frame is not iterable. Instead, if you want to iterate its rows
    use the eachrow wrapper, and if you want to iterate its columns use the
    eachcol wrapper.
  2. You can index a data frame like a matrix, but you are always required to
    pass both row and column indices (in other words: linear indexing is not
    supported).
  3. In broadcasting a data frame behaves as a matrix (a two-dimensional
    container).
  4. You can get columns of a data frame by their name using property access.

For this reason users are often surprised when they read that a data frame
is considered to be a collection of rows. From the way it supports the
standard interfaces it does not seem to be the case.
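
The four design points listed above can be checked on a tiny example. This is
an illustrative sketch assuming DataFrames.jl 1.5, the version this post was
written under; the data and variable names are made up.

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=[3, 4])

# 1. iteration goes through the eachrow and eachcol wrappers
row_sums = [sum(row) for row in eachrow(df)]  # [4, 6]
col_sums = [sum(col) for col in eachcol(df)]  # [3, 7]

# 2. indexing requires both a row index and a column index
value = df[2, :b]                             # 4

# 3. broadcasting treats the data frame as a two-dimensional container
shifted = df .+ 10                            # a data frame with 11, 12, 13, 14

# 4. columns are available by name through property access
column = df.a                                 # [1, 2]
```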

What is missing from this picture is that the four interfaces above are
related to only a limited number of functions:

  1. Iteration: iterate.
  2. Indexing: getindex, setindex!, firstindex, lastindex, size.
  3. Broadcasting: axes, broadcastable, BroadcastStyle, similar,
    copy, copyto!.
  4. Property access: getproperty and setproperty!.

In general, DataFrames.jl exposes over 120 functions that work on data frame
objects, and 38 of them are methods that extend functions that are defined
in Base Julia and work on collections. All these functions
consider a data frame to be a collection of rows.

In what follows I will go through all of them so that DataFrames.jl users have
an easy reference to them in one place.

Row sorting and reordering

We support sort, sort!, sortperm, issorted, permute!, invpermute!,
reverse!, reverse, shuffle!, and shuffle functions that work on data
frame rows.

Let me remark that the shuffling functions in particular are often quite handy
when preparing data to be passed to various statistical models.
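
A short sketch of a few of these functions follows, assuming DataFrames.jl 1.5
and the Random standard library. The data is made up and the seed 1234 is
arbitrary.

```julia
using DataFrames
using Random

df = DataFrame(id=[3, 1, 2], x=["c", "a", "b"])

sorted = sort(df, :id)       # new data frame with rows ordered by id
issorted(sorted, :id)        # true
perm = sortperm(df, :id)     # permutation that sorts the rows: [2, 3, 1]
reversed = reverse(df)       # new data frame with rows in reverse order

# shuffle rows with an explicitly seeded generator for reproducibility
shuffled = shuffle(MersenneTwister(1234), df)
```

The variants ending with ! perform the same operations in place.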

Dropping rows

We support deleteat!, keepat!, empty!, empty, filter!, filter,
first, last, and resize!.

Let me mention that resize! allows you not only to drop rows from a data frame
but also to add them (although it is not often used).
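
Here is a sketch of several of these functions, assuming DataFrames.jl 1.5.
The data is made up, and resize! is shown only for shrinking a data frame.

```julia
using DataFrames

df = DataFrame(a=1:5)

head = first(df, 2)              # new data frame with the first two rows
tail = last(df, 2)               # new data frame with the last two rows
large = filter(:a => >(2), df)   # new data frame with rows where a > 2

deleteat!(df, 1)                 # drop row 1 in place; a is now 2, 3, 4, 5
keepat!(df, [1, 2, 3])           # keep rows 1 to 3 in place; a is now 2, 3, 4
resize!(df, 2)                   # shrink to two rows in place; a is now 2, 3
```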

In addition there is an isempty function that checks if a data frame has zero
rows. It is important to remember that an empty data frame may still have
columns:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2)
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> isempty(df)
false

julia> empty!(df)
0×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┴──────────────

julia> isempty(df)
true

Remember that if the DataFrames.jl documentation says that a data frame is
empty, it means that it has zero rows (but it does not say anything about the
number of columns).

Adding rows

You can add single rows to a data frame using push!, pushfirst!, and
insert!, or collections of rows (in general Tables.jl tables) using
append! and prepend!.

Related functions are repeat! and repeat, which repeat rows in a data frame.
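
The following sketch exercises these functions on made-up data, assuming
DataFrames.jl 1.5. Rows are passed as named tuples here, which is just one of
the accepted forms.

```julia
using DataFrames

df = DataFrame(a=[2], b=["y"])

push!(df, (a=3, b="z"))                    # add a row at the end
pushfirst!(df, (a=0, b="w"))               # add a row at the beginning
insert!(df, 2, (a=1, b="x"))               # add a row at position 2

append!(df, DataFrame(a=[4], b=["q"]))     # add a table at the end
prepend!(df, DataFrame(a=[-1], b=["v"]))   # add a table at the beginning

df.a                                       # [-1, 0, 1, 2, 3, 4]

doubled = repeat(df, 2)                    # new data frame, all rows twice
```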

Row extraction

There are four functions that allow you to extract one row from a data frame:
only, pop!, popat!, and popfirst!.

I often find the only function useful when I want to explicitly verify
the contract that some operation returned a data frame with exactly one row.
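
A sketch of how these four functions behave, assuming DataFrames.jl 1.5. The
data and the variable names are made up for illustration.

```julia
using DataFrames

df = DataFrame(id=[1, 2, 3, 4])

last_row = pop!(df)          # removes and returns the last row (id = 4)
first_row = popfirst!(df)    # removes and returns the first row (id = 1)
middle_row = popat!(df, 2)   # removes and returns the row at position 2 (id = 3)

# df now has exactly one row, so only succeeds
row = only(df)
row.id                       # 2

# only throws an error when the number of rows is different from one
only_failed = try
    only(DataFrame(id=[1, 2]))
    false
catch
    true
end
```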

Identification of missing values in rows

You can use completecases to find which rows do not contain missing values
and dropmissing! and dropmissing to drop the rows that do contain them.
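
For example, in the following sketch (made-up data, assuming DataFrames.jl
1.5) only the first row is complete:

```julia
using DataFrames

df = DataFrame(a=[1, missing, 3], b=["x", "y", missing])

complete = completecases(df)     # true only for row 1
clean = dropmissing(df)          # new data frame with row 1 only
clean_a = dropmissing(df, :a)    # check column a only: rows 1 and 3 are kept
```

Passing a column selector, as in the last call, restricts the check to the
selected columns.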

Identification of unique rows

The unique and unique! functions keep only the unique rows of a source data
frame (unique returns a new data frame, unique! works in place). To get an
indicator vector showing which rows are non-unique use
the nonunique function; allunique checks if all rows in a data frame
are unique.

Let me show an example of the nonunique functionality that allows you to
choose which duplicates are highlighted (it was added in the 1.5 release):

julia> df = DataFrame(a=[1, 2, 1, 3, 1])
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     1
   4 │     3
   5 │     1

julia> nonunique(df) # by default first duplicate is kept
5-element Vector{Bool}:
 0
 0
 1
 0
 1

julia> nonunique(df, keep=:last) # keep last duplicate
5-element Vector{Bool}:
 1
 0
 1
 0
 0

julia> nonunique(df, keep=:noduplicates) # do not keep any duplicates
5-element Vector{Bool}:
 1
 0
 1
 0
 1

The keep keyword argument name is used because false in the returned
vector is often meant to indicate which rows should later be kept in a data
frame (the same keyword argument name is consistently used in unique and
unique!).
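
As a sketch of that consistency, here is the same data frame passed to unique
with each keep option. It assumes DataFrames.jl 1.5, where the keep keyword
argument is available.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 1, 3, 1])

unique(df).a                        # [1, 2, 3]
unique(df, keep=:last).a            # [2, 3, 1]
unique(df, keep=:noduplicates).a    # [2, 3]

# the same rows selected by hand with the indicator vector
kept = df[.!nonunique(df, keep=:last), :]

allunique(df)                       # false
```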

Conclusions

As you could see in this post there are many functions in Base Julia that
support working with collections. In DataFrames.jl we wanted users to be able
to reuse these functions when working with data frames. Therefore all of them
are supported and they consider data frames to be collections of rows.

Sometimes it is useful to have an iterable and indexable collection of the
rows or of the columns of a data frame. For this reason we provide the
eachrow and eachcol wrappers. As a
consequence, for clarity and to minimize the risk of error on the user’s side,
a data frame that is not wrapped is not iterable and behaves like a matrix in
indexing and broadcasting.