Author Archives: Blog by Bogumił Kamiński

Why do I use main function in my Julia scripts?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/15/main.html

Introduction

Many users are attracted to Julia because it offers good performance of code
execution. It is possible, because Julia programs are compiled to native
machine code before they are executed. At the same time Julia is a convenient
tool in interactive data science sessions.

However, there is a fine line between doing data analysis in an interactive
session and writing a script that is meant to be executed many times after
its creation. The problem is that often interactive session turns into
a script eventually, and there is a temptation to use the code originally
written verbatim in that script.

In this post I will explain why I personally never do it. Instead when
I realize that I will need some code many times I change my mindset
how it should be written.

The points I rise in my post today are commonly known so most likely what I
discuss is not new for many of the readers. However, I decided to write it as
keeping them in mind is especially important when coding in Julia, as opposed
to, for example, R or Python, as I will explain in this post.

The post was tested under Julia 1.7.2 and Julia 1.9.0-DEV.575.

How do I define interactive and script coding style?

There are many aspects in which a well written program is different from a log
of an interactive session. However, in my opinion, the most important one is
that in interactive session one heavily uses global variables, while in programs
preferably no global variables are used. Instead, I typically define main
function taking no arguments and I implement all top-level logic in this
function.

The three most important reasons why this is a desirable practice are:

  • performance;
  • “forgotten variable” problem;
  • Julia’s variable scoping rules.

Let me give examples of these three potential problems.

Are loops in Julia fast?

One of the selling points of Julia is that for loops are fast, as opposed
to R or Python. However, you might ask if this is really always the case.

Let us check (start a fresh Julia session):

julia> s = 0
0

julia> @time for i in 1:10^8
           s += i
       end
  5.766783 seconds (200.00 M allocations: 2.980 GiB, 6.07% gc time)

julia> s
5000000050000000

This is not fast. The reason is clear for any relatively experienced Julia
user. In the loop we refer to s variable that is global. It is one of the
first things one learns when one starts using Julia.

The problem is, in my experience, that in practice it is very easy to forget
to fix such issue when switching from interactive mode to script mode if the
code base is more than a few dozen lines of code.

How would the code with top-level main function look in this case?
(start a fresh Julia session)

julia> function main()
           s = 0
           for i in 1:10^8
               s += i
           end
           return s
       end
main (generic function with 1 method)

julia> @time main()
  0.000001 seconds
5000000050000000

There are other options to make this code run fast in top level scope, here are
some of them.

First is to use let block (start a fresh Julia session):

julia> s = 0
0

julia> @time s = let s = s
           s = 0
           for i in 1:10^8
               s += i
           end
           s
       end

  0.000001 seconds (1 allocation: 16 bytes)
5000000050000000

The other is to add type assertion to global variable s (this requires Julia 1.8,
I used Julia 1.9.0-DEV.575 in the test):

julia> s::Int = 0
0

julia> @time for i in 1:10^8
           s += i
       end
  1.226593 seconds (100.00 M allocations: 1.490 GiB, 10.49% gc time)

julia> s
5000000050000000

As you can see type assertion makes the code run faster, but it is slower
than the other options.

However, even if these options are available I find the main function
approach cleanest.

Why does my program produce strange results?

Here is an example program that can produce surprising results (start a fresh
Julia session):

julia> l = -100
-100

julia> function f()
       x = 1
       for i in l:3
           x += i
       end
       return x
       end
f (generic function with 1 method)

julia> f()
-5043

The problem I show here is standard trap, I have replaced 1 with l in the
for loop specification.

The issue is that if you have more than a few hundreds of lines of code it is
easy to forget what variable names you have already defined in global scope.
Then, if you mistype something in other part of the code, instead of an error
you silently get a wrong result. I have, unfortunately, fallen victim of this
trap so many times that I started to call it “forgotten variable” problem.

The risk of such problem is significantly lower if you wrap all your code in
functions (and preferably the body of the function should be short). In such
cases you most likely will remember all variables that are defined in each
function.

Why does my program change its behavior when I run it as a script?

Save the code we have used above for analyzing for loop performance in
the example1.jl file. The contents of the file should be:

s = 0
@time for i in 1:10^8
    s += i
end
println(s)

Now I try running this code in a script:

$ julia example1.jl
┌ Warning: Assignment to `s` in soft scope is ambiguous because a global
variable by the same name exists: `s` will be treated as a new local.
Disambiguate by using `local s` to suppress this warning or `global s`
to assign to the existing global variable.
└ @ example1.jl:3
ERROR: LoadError: UndefVarError: s not defined

As you can see, because of Julia’s scoping rules, the code that worked in
interactive mode stopped working in script mode. This is bad, but at least
your code threw an error.

Let us try the following code:

x = 0
for i in 1:10
    x = i
end
@show x

First run it in a fresh interactive session:

julia> x = 0
0

julia> for i in 1:10
           x = i
       end

julia> @show x
x = 10
10

Now save this code as example2.jl file and run it as a script:

$ julia example2.jl
┌ Warning: Assignment to `x` in soft scope is ambiguous because a global
variable by the same name exists: `x` will be treated as a new local.
Disambiguate by using `local x` to suppress this warning or `global x`
to assign to the existing global variable.
└ @ example2.jl:3
x = 0

Now the code gave a warning (which is easy to ignore since it is complex, and
even might get lost in a flood of different outputs that the script produced),
but produced a result. The problem is that the result is different than the
one produced in an interactive session.

Conclusions

The problems I described in my post today are well known and most developers
remember about them when writing programs. However, in data science projects one
starts ones work in an interactive session, and only later turns the code into a
script. In such cases it is tempting to keep the code written in invractive mode
unchanged. The problems I have described today show why you should resist this
temptation and refactor your code to use functions in such a case.

What is important to keep in mind that while “forgotten variable” problem is
present in many programming languages, the performance and variable scoping
traps are Julia specific. The reason why performance trap is not so relevant
for R or Python users is that these languages are not compiled so for loops
are slow even when defined inside functions. Similarly, the scoping rule
of how global variables are treated in local soft scope is only an issue
in Julia. (If you are unsure what local soft scope is please refer to the
section on Local Scope in the Julia Manual.)

In summary, when working with Julia, you need to switch your mindset much
earlier from interactive mode to script mode than you would have to in R or
Python.

Efficiency of data frame row iteration

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/08/iteration.html

Introduction

Last week I have received several questions about efficiency of iteration
over rows of data frames in DataFrames.jl. In this post I summarize the
most important recommendations on this topic.

The code I use was written under Julia 1.7.2, DataFrames.jl 1.3.4,
DataFramesMeta.jl 0.11.0.

A basic approach

Assume we have a data frame that has two numeric columns :a and :b and we
want to check if for all rows the value in column :a is less than the value in
column :b. We want to compare several approaches to this task and check their
performance. (I have chosen an easy task on purpose to concentrate on the issue
of iteration.)

Here is a basic approach to this problem (in all the examples I will run
the operation twice and show the @time output to capture the compilation
time and reflect interactive experience of the user):

julia> using DataFrames

julia> df = DataFrame(a=1:100_000_000, b=2:100_000_001)
100000000×2 DataFrame
       Row │ a          b
           │ Int64      Int64
───────────┼──────────────────────
         1 │         1          2
         2 │         2          3
     ⋮     │     ⋮          ⋮
  99999999 │  99999999  100000000
 100000000 │ 100000000  100000001
             99999996 rows omitted

julia> @time all(df.a .< df.b)
  0.204820 seconds (326.44 k allocations: 28.022 MiB, 53.94% compilation time)
true

julia> @time all(df.a .< df.b)
  0.093742 seconds (6 allocations: 11.925 MiB)
true

I have used broadcasting to present a reference performance of the operation.

Using data frame row iteration

The first take on our problem is to use the eachrow iterator:

julia> function f1(df)
           for row in eachrow(df)
               row.a < row.b || return false
           end
           return true
       end
f1 (generic function with 1 method)

julia> @time f1(df)
 19.280229 seconds (700.15 M allocations: 10.438 GiB, 6.00% gc time, 0.20% compilation time)
true

julia> @time f1(df)
 19.465430 seconds (700.00 M allocations: 10.431 GiB, 5.46% gc time)
true

As you can see using eachrow is slow. This approach is easy, but
should be used only for data frames that have few rows. The reason why it
is slow is that it is not type stable.

Using named tuples

Here is the approach that is type stable:

julia> function f2(nti)
           for row in nti
               row.a < row.b || return false
           end
           return true
       end
f2 (generic function with 1 method)

julia> @time f2(Tables.namedtupleiterator(df))
  0.104822 seconds (7.64 k allocations: 438.955 KiB, 7.41% compilation time)
true

julia> @time f2(Tables.namedtupleiterator(df))
  0.090318 seconds (9 allocations: 336 bytes)
true

This time the operation is fast. Note two things though:

  • using Tables.namedtupleiterator can be slow if data frame has many columns
    (it can have high compilation cost);
  • we need to pass Tables.namedtupleiterator(df) as an argument to f2 to
    make this function type stable.

Using vectors

The next approach that is typically used is to pass the vectors that
we want to compare to the function:

julia> function f3(a, b)
           for i in eachindex(a, b)
               @inbounds a[i] < b[i] || return false
           end
           return true
       end
f3 (generic function with 1 method)

julia> @time f3(df.a, df.b)
  0.113009 seconds (6.14 k allocations: 323.976 KiB, 6.03% compilation time)
true

julia> @time f3(df.a, df.b)
  0.082265 seconds
true

As you can see the operation is fast this time. I know I can safely use
@inbounds because the i index is taken from eachindex(a, b) that
guarantees that only valid indices are passed.

Using DataFramesMeta.jl

The last option we consider is using the @eachrow! macro from DataFramesMeta.jl:

julia> function f4(df)
           flag = true
           @eachrow! df begin
               if !(:a < :b)
                   flag = false
               end
           end
           return flag
       end
f4 (generic function with 1 method)

julia> @time f4(df)
  0.133970 seconds (80.46 k allocations: 4.634 MiB, 16.30% compilation time)
true

julia> @time f4(df)
  0.090125 seconds (27 allocations: 1.578 KiB)
true

The operation is also fast. Here it is worth to note two things:

  • we use @eachrow! not @eachrow as the latter would copy df which in our
    case is not needed;
  • we need to use the flag helper variable and we cannot use break to stop
    the iteration early (in the example I use this does not affect the result
    since :a is always less than :b so we iterate all rows anyway, but for
    other data it could matter).

Conclusions

The take-aways from these examples are as follows:

  • eachrow is easy to use, but it will be slow if you work with data frame
    that has many rows;
  • you can use Tables.namedtupleiterator wrapper instead of eachrow; it will
    be fast but it can have large compilation time for wide tables (note, however,
    that you can always pass only a narrow data frame to it if not all source
    columns are needed in your operation, for example
    Tables.namedtupleiterator(df[!, [:a, :b]]) in the code we used in this post);
  • you can directly pass columns you want to work with to a function – this
    approach will be fast and gives you most control (at the expense of having to
    write a more low-level code);
  • There is the @eachrow! macro (and @eachrow if you want to copy data) in
    DataFramesMeta.jl that will be also fast. When you use it remember that it
    is designed to always iterate all rows of a data frame.

Efficiency of data frame row iteration

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/08/iteration.html

Introduction

Last week I have received several questions about efficiency of iteration
over rows of data frames in DataFrames.jl. In this post I summarize the
most important recommendations on this topic.

The code I use was written under Julia 1.7.2, DataFrames.jl 1.3.4,
DataFramesMeta.jl 0.11.0.

A basic approach

Assume we have a data frame that has two numeric columns :a and :b and we
want to check if for all rows the value in column :a is less than the value in
column :b. We want to compare several approaches to this task and check their
performance. (I have chosen an easy task on purpose to concentrate on the issue
of iteration.)

Here is a basic approach to this problem (in all the examples I will run
the operation twice and show the @time output to capture the compilation
time and reflect interactive experience of the user):

julia> using DataFrames

julia> df = DataFrame(a=1:100_000_000, b=2:100_000_001)
100000000×2 DataFrame
       Row │ a          b
           │ Int64      Int64
───────────┼──────────────────────
         1 │         1          2
         2 │         2          3
     ⋮     │     ⋮          ⋮
  99999999 │  99999999  100000000
 100000000 │ 100000000  100000001
             99999996 rows omitted

julia> @time all(df.a .< df.b)
  0.204820 seconds (326.44 k allocations: 28.022 MiB, 53.94% compilation time)
true

julia> @time all(df.a .< df.b)
  0.093742 seconds (6 allocations: 11.925 MiB)
true

I have used broadcasting to present a reference performance of the operation.

Using data frame row iteration

The first take on our problem is to use the eachrow iterator:

julia> function f1(df)
           for row in eachrow(df)
               row.a < row.b || return false
           end
           return true
       end
f1 (generic function with 1 method)

julia> @time f1(df)
 19.280229 seconds (700.15 M allocations: 10.438 GiB, 6.00% gc time, 0.20% compilation time)
true

julia> @time f1(df)
 19.465430 seconds (700.00 M allocations: 10.431 GiB, 5.46% gc time)
true

As you can see using eachrow is slow. This approach is easy, but
should be used only for data frames that have few rows. The reason why it
is slow is that it is not type stable.

Using named tuples

Here is the approach that is type stable:

julia> function f2(nti)
           for row in nti
               row.a < row.b || return false
           end
           return true
       end
f2 (generic function with 1 method)

julia> @time f2(Tables.namedtupleiterator(df))
  0.104822 seconds (7.64 k allocations: 438.955 KiB, 7.41% compilation time)
true

julia> @time f2(Tables.namedtupleiterator(df))
  0.090318 seconds (9 allocations: 336 bytes)
true

This time the operation is fast. Note two things though:

  • using Tables.namedtupleiterator can be slow if data frame has many columns
    (it can have high compilation cost);
  • we need to pass Tables.namedtupleiterator(df) as an argument to f2 to
    make this function type stable.

Using vectors

The next approach that is typically used is to pass the vectors that
we want to compare to the function:

julia> function f3(a, b)
           for i in eachindex(a, b)
               @inbounds a[i] < b[i] || return false
           end
           return true
       end
f3 (generic function with 1 method)

julia> @time f3(df.a, df.b)
  0.113009 seconds (6.14 k allocations: 323.976 KiB, 6.03% compilation time)
true

julia> @time f3(df.a, df.b)
  0.082265 seconds
true

As you can see the operation is fast this time. I know I can safely use
@inbounds because the i index is taken from eachindex(a, b) that
guarantees that only valid indices are passed.

Using DataFramesMeta.jl

The last option we consider is using the @eachrow! macro from DataFramesMeta.jl:

julia> function f4(df)
           flag = true
           @eachrow! df begin
               if !(:a < :b)
                   flag = false
               end
           end
           return flag
       end
f4 (generic function with 1 method)

julia> @time f4(df)
  0.133970 seconds (80.46 k allocations: 4.634 MiB, 16.30% compilation time)
true

julia> @time f4(df)
  0.090125 seconds (27 allocations: 1.578 KiB)
true

The operation is also fast. Here it is worth to note two things:

  • we use @eachrow! not @eachrow as the latter would copy df which in our
    case is not needed;
  • we need to use the flag helper variable and we cannot use break to stop
    the iteration early (in the example I use this does not affect the result
    since :a is always less than :b so we iterate all rows anyway, but for
    other data it could matter).

Conclusions

The take-aways from these examples are as follows:

  • eachrow is easy to use, but it will be slow if you work with data frame
    that has many rows;
  • you can use Tables.namedtupleiterator wrapper instead of eachrow; it will
    be fast but it can have large compilation time for wide tables (note, however,
    that you can always pass only a narrow data frame to it if not all source
    columns are needed in your operation, for example
    Tables.namedtupleiterator(df[!, [:a, :b]]) in the code we used in this post);
  • you can directly pass columns you want to work with to a function – this
    approach will be fast and gives you most control (at the expense of having to
    write a more low-level code);
  • There is the @eachrow! macro (and @eachrow if you want to copy data) in
    DataFramesMeta.jl that will be also fast. When you use it remember that it
    is designed to always iterate all rows of a data frame.