Author Archives: Blog by Bogumił Kamiński

An exercise in DataFrames.jl transformation minilanguage

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/05/06/minilanguage.html

Introduction

Recently I answered an interesting question about transforming a data
frame. I thought that the problem and its solution were instructive enough to
warrant a blog post.

This post was written under Julia 1.7.0 and DataFrames.jl 1.3.4.

The problem

Assume you are given the following data frame:

julia> using DataFrames

julia> df1 = DataFrame(reshape(1:30, 5, 6), vec(string.(["x", "y"], [1 2 3])))
5×6 DataFrame
 Row │ x1     y1     x2     y2     x3     y3
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      6     11     16     21     26
   2 │     2      7     12     17     22     27
   3 │     3      8     13     18     23     28
   4 │     4      9     14     19     24     29
   5 │     5     10     15     20     25     30

Before we move forward let me comment a bit about this code.
The reshape(1:30, 5, 6) part creates a 5×6 matrix filled with integers
ranging from 1 to 30:

julia> reshape(1:30, 5, 6)
5×6 reshape(::UnitRange{Int64}, 5, 6) with eltype Int64:
 1   6  11  16  21  26
 2   7  12  17  22  27
 3   8  13  18  23  28
 4   9  14  19  24  29
 5  10  15  20  25  30

Next the string.(["x", "y"], [1 2 3]) part creates a matrix of column names:

julia> string.(["x", "y"], [1 2 3])
2×3 Matrix{String}:
 "x1"  "x2"  "x3"
 "y1"  "y2"  "y3"

I use vec on it since the DataFrame constructor requires column names to
be passed as a vector.
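To see the order in which vec lays out the names, here is a small sketch using only Base Julia. The variable names name_matrix and names_vec are mine, chosen for illustration.

```julia
# Broadcasting a 2-element vector against a 1×3 row matrix gives a 2×3 matrix.
name_matrix = string.(["x", "y"], [1 2 3])

# vec flattens the matrix column by column (Julia arrays are column-major),
# which matches the left-to-right column order of df1.
names_vec = vec(name_matrix)

println(names_vec)
```

The column-major flattening is why the x and y columns alternate in df1.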

We want to create a new data frame having the following four columns:

  • x_minimum: for each row, the minimum of the values stored in
    the columns containing "x" in their name;
  • x_maximum: for each row, the maximum of the values stored in
    the columns containing "x" in their name;
  • y_minimum: for each row, the minimum of the values stored in
    the columns containing "y" in their name;
  • y_maximum: for each row, the maximum of the values stored in
    the columns containing "y" in their name.

The question is how to do it in DataFrames.jl. Below I will discuss several
options you can consider.

Using a loop

Here is a simple approach for performing this operation which relies
on knowledge of Base Julia:

julia> df2 = DataFrame()
0×0 DataFrame

julia> for n in ["x", "y"]
           mat = Matrix(df1[:, Regex(n)])
           for fun in [minimum, maximum]
               df2[:, string(n, "_", fun)] = fun.(eachrow(mat))
           end
       end

julia> df2
5×4 DataFrame
 Row │ x_minimum  x_maximum  y_minimum  y_maximum
     │ Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────
   1 │         1         21          6         26
   2 │         2         22          7         27
   3 │         3         23          8         28
   4 │         4         24          9         29
   5 │         5         25         10         30

What we do in this code can be explained as follows. First we create an empty
target data frame df2. Next we iteratively add columns to it. To be able to
use Base Julia functionality we select from the data frame the columns
respectively having "x" or "y" in their name using a regular expression and
convert the result to a Matrix. Finally we apply the minimum or maximum
function to rows of this matrix with the fun.(eachrow(mat)) expression and
assign the result to a new column in the df2 data frame.
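The row-wise reduction can also be tried without DataFrames.jl. The sketch below is only an illustration: mat is a stand-in I build by hand for Matrix(df1[:, r"x"]), and row_min, row_max and row_min_dims are hypothetical names.

```julia
# Stand-in for Matrix(df1[:, r"x"]): columns x1, x2, x3 of the example data.
mat = reshape(1:30, 5, 6)[:, [1, 3, 5]]

# eachrow yields one row at a time; broadcasting applies the function to each row.
row_min = minimum.(eachrow(mat))
row_max = maximum.(eachrow(mat))

# The same reduction written with the dims keyword (a 5×1 matrix, hence vec).
row_min_dims = vec(minimum(mat, dims=2))

println(row_min, " ", row_max)
```

Both forms give the same values; the eachrow version is the one that generalizes to any function taking a collection.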

Using broadcasting in transformation minilanguage

Now let us turn to the DataFrames.jl transformation minilanguage
(if you do not have much experience with it, I recommend that you first read
this post before proceeding):

julia> df2 = select(df1, AsTable.([r"x" r"y"]) .=>
                         ByRow.([minimum, maximum]) .=>
                         string.(["x_" "y_"], [minimum, maximum]))
5×4 DataFrame
 Row │ x_minimum  x_maximum  y_minimum  y_maximum
     │ Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────
   1 │         1         21          6         26
   2 │         2         22          7         27
   3 │         3         23          8         28
   4 │         4         24          9         29
   5 │         5         25         10         30

To understand what is going on in this expression we first need to inspect
what is passed to select as a second argument:

julia> AsTable.([r"x" r"y"]) .=>
       ByRow.([minimum, maximum]) .=>
       string.(["x_" "y_"], [minimum, maximum])
2×2 Matrix{Pair{AsTable}}:
 AsTable(r"x")=>(ByRow{typeof(minimum)}(minimum)=>"x_minimum")  AsTable(r"y")=>(ByRow{typeof(minimum)}(minimum)=>"y_minimum")
 AsTable(r"x")=>(ByRow{typeof(maximum)}(maximum)=>"x_maximum")  AsTable(r"y")=>(ByRow{typeof(maximum)}(maximum)=>"y_maximum")

As you can see, the Julia broadcasting mechanism magically created four
operation specification expressions. The trick is that, since I wanted the
function names to change faster, I passed them as a vector [minimum, maximum]
twice, while I wanted the column names to change slower, so I passed them to
broadcasting as one-row matrices with the [r"x" r"y"] and ["x_" "y_"]
expressions respectively.
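The shape rule can be checked on the target column names alone, using only Base Julia. This is a minimal sketch; target_names and flat_names are names I introduce for illustration.

```julia
# A 2-element vector broadcast against a one-row (1×2) matrix expands to a
# 2×2 matrix: functions vary down the rows, prefixes vary across the columns.
target_names = string.(["x_" "y_"], [minimum, maximum])

# Flattening column by column gives the order in which the names appear
# in the resulting data frame shown above.
flat_names = vec(target_names)

println(flat_names)
```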

The second part is understanding what each of the operation specification
expressions means. Let us concentrate on the first one:

AsTable(r"x")=>(ByRow{typeof(minimum)}(minimum)=>"x_minimum")

The decomposition is:

  • AsTable(r"x") means: select all columns whose names contain "x" and pass
    them to the transformation function as a single positional argument (this
    is what AsTable is for here);
  • the ByRow(minimum) part means that we want to apply the minimum function
    to each row of the data passed to it;
  • finally, the "x_minimum" part means that we want to store the result in the
    column having this name.
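To make the "single positional argument" point concrete, here is a sketch in Base Julia of what the row-wise function works with. I assume here that under AsTable each row reaches the function as a NamedTuple; the rows vector below is a hand-built stand-in, not something DataFrames.jl exposes under that name.

```julia
# Hypothetical stand-in for the rows seen under AsTable(r"x"):
# one NamedTuple per row, with the selected column names as field names.
rows = [(x1 = i, x2 = i + 10, x3 = i + 20) for i in 1:5]

# minimum and maximum accept any iterable, so each NamedTuple is one argument.
x_minimum = minimum.(rows)
x_maximum = maximum.(rows)

println(x_minimum, " ", x_maximum)
```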

An alternative way to write this transformation would be to replace minimum
and maximum with min and max. The difference is that min and max
take multiple positional arguments. It means that we would need to drop the
AsTable part in the transformation specification like this:

julia> df2 = select(df1, [r"x" r"y"] .=>
                         ByRow.([min, max]) .=>
                         string.(["x_" "y_"], [minimum, maximum]))
5×4 DataFrame
 Row │ x_minimum  x_maximum  y_minimum  y_maximum
     │ Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────
   1 │         1         21          6         26
   2 │         2         22          7         27
   3 │         3         23          8         28
   4 │         4         24          9         29
   5 │         5         25         10         30

As you can see we get the same result. You might ask why I did not use this
style initially. The reason is that r"x" could potentially select
thousands of columns from the source data frame. In Julia, in general, it is
not a good idea to pass very many positional arguments to functions, as in some
cases it might put too much strain on the Julia compiler. In such cases the
AsTable wrapper is preferred, as it guarantees that a single argument
(a collection of the passed columns) is passed to the function.
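The difference between the two calling styles can be seen on a single row in Base Julia. This is only a sketch; row, via_splat and via_collection are illustrative names.

```julia
row = (x1 = 1, x2 = 11, x3 = 21)

# min takes each value as its own positional argument, so the row is splatted;
# with thousands of columns this would mean thousands of arguments.
via_splat = min(row...)

# minimum takes the whole collection as one argument, whatever its length.
via_collection = minimum(row)

println(via_splat == via_collection)
```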

Using a comprehension in transformation minilanguage

Above I have shown you how to use broadcasting to achieve the desired result.
Let me show below that it is equally easy to use a comprehension to achieve
the same result:

julia> df2 = select(df1, [AsTable(Regex(n)) => ByRow(fun) => string(n, "_", fun)
                    for n in ["x", "y"] for fun in [minimum, maximum]])
5×4 DataFrame
 Row │ x_minimum  x_maximum  y_minimum  y_maximum
     │ Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────
   1 │         1         21          6         26
   2 │         2         22          7         27
   3 │         3         23          8         28
   4 │         4         24          9         29
   5 │         5         25         10         30

The choice between using broadcasting and a comprehension is mostly a personal
preference.
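As a quick check that the two styles enumerate the combinations in the same order, here is a Base Julia sketch on the target names only (from_comprehension and from_broadcast are names I made up for this illustration):

```julia
# Comprehension with two `for` clauses: the rightmost loop (fun) changes fastest.
from_comprehension = [string(n, "_", fun)
                      for n in ["x", "y"] for fun in [minimum, maximum]]

# Broadcasting: after vec, the vector [minimum, maximum] changes fastest.
from_broadcast = vec(string.(["x_" "y_"], [minimum, maximum]))

println(from_comprehension == from_broadcast)
```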

Conclusions

I hope you will find the presented examples useful to better understand how to
write complex transformations in DataFrames.jl.

The code might look scary to you at first glance. However, in my experience,
after some practice with broadcasting or writing comprehensions in
Julia it becomes natural.

Values’ mutability in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/29/mutability.html

Introduction

This week chapter 4 of my upcoming Julia for Data Analysis book has been
released by Manning in MEAP. Therefore, following the plan I have announced
in this post, I will discuss a topic related to the material I cover
in chapter 4, but that was not included in the book.

In chapter 4 I give an introduction to working with collections in Julia.
A fundamental topic related to this subject is understanding
that values in Julia can be either mutable (like Vector or Dict) or
immutable (like Tuple or NamedTuple).

In this post I want to present an example showing the relevance of this
distinction.

The code I use was tested under Julia 1.7.2 and BenchmarkTools.jl 1.3.1.

How to check if some value is mutable?

If you have some value you can check if it is mutable using the ismutable
function. Let us check it on an example:

julia> x = big(1)
1

julia> typeof(x)
BigInt

julia> ismutable(x)
true

As you can see the value having BigInt type is mutable. This means that
it can be changed in-place. Therefore, you now know that you must be
careful when passing BigInt values to functions as you cannot assume
that such functions will not change the passed value.

Let us check that this is indeed the case:

julia> x
1

julia> Base.GMP.flipsign!(x, -1)
-1

julia> x
-1

In this case the Base.GMP.flipsign! function mutated the value bound to the
x variable name.
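The same check and the same effect can be reproduced with standard collections. The following is a minimal sketch; checks, v and w are illustrative names.

```julia
# Mutable containers report true, immutable values report false.
checks = (
    vector = ismutable([1, 2, 3]),
    dict = ismutable(Dict("a" => 1)),
    tuple = ismutable((1, 2, 3)),
    namedtuple = ismutable((a = 1, b = 2)),
    int = ismutable(1),
)

# Two names bound to the same mutable value observe the same in-place change.
v = [1, 2, 3]
w = v          # no copy is made
push!(w, 4)

println(checks)
println(v)
```

Since assignment never copies, any function that receives v could change it in the same way push! does here.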

Is mutability of BigInt useful?

You might ask why BigInt values are made mutable. This is a bit surprising,
as other standard numeric types like Int64, Bool, or Float64 are
immutable. The reason is that in certain cases it allows operations on BigInt
values to be performed faster. Here is an example of computing the sum
of the elements of an array storing BigInt values:

julia> using BenchmarkTools

julia> v = collect(big(1):big(1_000_000));

julia> function sum1(v::AbstractArray{BigInt})
           s = big(0)
           for x in v
               s += x
           end
           return s
       end
sum1 (generic function with 1 method)

julia> function sum2(v::AbstractArray{BigInt})
           s = big(0)
           for x in v
               Base.GMP.MPZ.add!(s, x)
           end
           return s
       end
sum2 (generic function with 1 method)

julia> @btime sum1($v)
  97.111 ms (2000002 allocations: 45.78 MiB)
500000500000

julia> @btime sum2($v)
  27.560 ms (3 allocations: 48 bytes)
500000500000

julia> @btime sum($v)
  27.557 ms (3 allocations: 48 bytes)
500000500000

The difference between sum1 and sum2 is that the former uses + to perform
addition and the latter uses Base.GMP.MPZ.add!, which updates its first
argument in-place. As you can see, sum2 allocates much less memory and is
faster. Additionally, I have shown you that sum has essentially the same
performance. This suggests that sum has a special method for performing
summation of BigInt values. Indeed this is the case; the implementation of
this method can be found in the base/gmp.jl file and is as follows:

sum(arr::Union{AbstractArray{BigInt}, Tuple{BigInt, Vararg{BigInt}}}) =
    foldl(MPZ.add!, arr; init=BigInt(0))

We can see that it uses an in-place add! function.

Note that in both the sum2 and sum functions it was crucial to use a new
BigInt value as an accumulator. The point is that since add! mutates
it in-place we must not use any value stored in the source array for this
purpose.
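To illustrate what would go wrong otherwise, here is a small sketch. Note that Base.GMP.MPZ.add! is an internal, unexported function, used here only because it is the one discussed above; the variable names are mine.

```julia
# Base.GMP.MPZ.add! is internal to Julia; it is used for illustration only.
v = [big(1), big(2), big(3)]

# Wrong: s_bad is the very same object as v[1], so v[1] gets overwritten.
s_bad = v[1]
for i in 2:length(v)
    Base.GMP.MPZ.add!(s_bad, v[i])
end
first_after_bad = v[1]   # the input array has been corrupted

# Right: a fresh BigInt accumulator leaves the input untouched.
u = [big(1), big(2), big(3)]
s_ok = big(0)
for x in u
    Base.GMP.MPZ.add!(s_ok, x)
end

println(first_after_bad, " ", s_ok, " ", u)
```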

Conclusions

I hope that you will find the presented example useful. As a final comment let
me highlight one convention.

In our code the add! function has a ! suffix. This is a convention that
signals that it might mutate its arguments. This is indeed what happens to the
accumulator value in both the sum2 and sum functions. However, the sum2 and sum
functions do not need a ! in their names, as in their implementation the passed
array is not mutated (only the accumulator value that is created inside these
functions is mutated).
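The same convention is visible in a pair of functions from Base, sort and sort!. A minimal sketch (variable names are illustrative):

```julia
v = [3, 1, 2]

# sort returns a new sorted vector and leaves its argument alone.
sorted_copy = sort(v)
v_after_sort = copy(v)

# sort! reorders its argument in place, which the ! suffix signals.
sort!(v)

println(sorted_copy, " ", v_after_sort, " ", v)
```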

Testing Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/22/testing.html

Introduction

Recently I was invited by Talk Julia to take part in the podcast.
Today it has been released and you can watch it on YouTube.
In the discussion I share my thoughts on DataFrames.jl design principles
and discuss several examples from my upcoming Julia for Data Analysis
book.

One of the things I discuss in the podcast is that in the DataFrames.jl
development process we put a lot of emphasis on tests, so I thought that it
is worth expanding on this topic a bit more in this post.

The code I use was tested under Julia 1.7.2 and InlineStrings.jl 1.2.2.

Testing and reproducibility of tests

When one develops software, ensuring proper test coverage is one of the
key practices that help maintain good code quality.

My experience with DataFrames.jl is that taking care of proper test coverage
serves three important goals:

  1. Making sure the functionality we provide follows the contract specified in
    its documentation. This is particularly important when testing corner case
    situations, since they happen rarely in practice (so there is lower
    probability that users spot problems and report them) and also allow us to
    make sure our design is logically consistent.
  2. Making sure that as we add new functionalities to the package we do not
    accidentally break something.
  3. Making sure that upgrading dependencies of DataFrames.jl to newer versions
    does not break something (and the biggest such dependency is Base Julia).

A crucial part of writing tests is ensuring their reproducibility. What I mean
by this is that when you run your test suite twice you should get the same
results. The aim is to avoid situations in which you run your tests and get a
bug report, and then you run the tests again and the bug is not present. Such
a situation is unwanted, as it is later hard to locate the root cause of the
reported bug.
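For tests that use random numbers, one way to get this property is to reseed the random number generator. A minimal sketch using the Random standard library (first_run and second_run are illustrative names):

```julia
using Random

# Reseeding the global RNG with the same value replays the same random stream
# (on a given Julia version).
Random.seed!(1234)
first_run = rand(1:10, 5)

Random.seed!(1234)
second_run = rand(1:10, 5)

println(first_run == second_run)
```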

Randomized tests

When one implements complex algorithms it is hard to cover all possible
hard testing scenarios by writing them down by hand. In such situations,
one of the possible testing methods is to use randomized tests.

An example of an old and already resolved problem that could potentially have
been caught by randomized tests is an issue related to pasting Unicode
characters in the Julia REPL. Let me present here a minimal working example of
code having a similar problem.

Assume I want to write code that strips the last character from a string. Here
is an attempt to implement it:

julia> mychop(s::AbstractString) = isempty(s) ? s : s[1:end-1]
mychop (generic function with 1 method)

Let us write a simple test set for this function:

julia> using Test

julia> @testset "basic test" begin
           @test mychop("") == ""
           @test mychop("a") == ""
           @test mychop("abc") == "ab"
       end
Test Summary: | Pass  Total
basic test    |    3      3
Test.DefaultTestSet("basic test", Any[], 3, false, false)

All looks good so far. However, let us run some more advanced testing
of mychop using randomized tests:

julia> using Random

julia> @testset "advanced tests" begin
           Random.seed!(1234)
           for _ in 1:40
               len = rand(1:10)
               input = rand(UInt8, len) |> String
               output = join(collect(input)[1:end-1])
               @test output == mychop(input)
           end
       end
advanced tests: Error During Test at REPL[83]:7
  Test threw exception
  Expression: output == mychop(input)
  StringIndexError: invalid index [9], valid nearby indices [8]=>'˄', [10]=>'�'
Test Summary:  | Pass  Error  Total
advanced tests |   39      1     40
ERROR: Some tests did not pass: 39 passed, 0 failed, 1 errored, 0 broken.

What I do in this test is generate an input string from random bytes and then
use a slow method to get the desired output string by going through the
individual characters of the original string. I ensure reproducibility of this
test on a given version of Julia by setting the seed of the random number
generator using Random.seed!(1234).

From the bug report we can see that something is not right with indexing. The
problem occurs if the second character from the end of the string is not ASCII.
Let us check it:

julia> mychop("∀a")
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'∀', [4]=>'a'

The point of randomized tests is that guessing the test scenario "the second
character from the end of the string is not ASCII" is not easy.
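The reason index 3 is invalid is that String indices count bytes, not characters. Here is a short sketch with Base functions only (the variable names are mine):

```julia
s = "∀a"

# '∀' occupies 3 bytes in UTF-8, so the valid character indices are 1 and 4.
valid_indices = collect(eachindex(s))

# lastindex(s) - 1 lands inside '∀' and is not a valid index.
middle_is_valid = isvalid(s, lastindex(s) - 1)

# prevind steps back one whole character instead of one byte.
before_last = prevind(s, lastindex(s))

println(valid_indices, " ", middle_is_valid, " ", before_last)
```

So arithmetic like end-1 is only safe when all characters involved are one byte long, while prevind and nextind are safe for any string.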

Let us fix mychop to resolve this problem:

julia> mychop(s::AbstractString) = isempty(s) ? s : s[1:prevind(s, end)]
mychop (generic function with 1 method)

julia> @testset "basic test" begin
           @test mychop("") == ""
           @test mychop("a") == ""
           @test mychop("abc") == "ab"
       end
Test Summary: | Pass  Total
basic test    |    3      3
Test.DefaultTestSet("basic test", Any[], 3, false, false)

julia> @testset "advanced tests" begin
           Random.seed!(1234)
           for _ in 1:32768 # pick some round larger number to be sure all works well
               len = rand(1:10)
               input = rand(UInt8, len) |> String
               output = join(collect(input)[1:end-1])
               @test output == mychop(input)
           end
       end
Test Summary:  |  Pass  Total
advanced tests | 32768  32768
Test.DefaultTestSet("advanced tests", Any[], 32768, false, false)

Are we done now? Not yet. We should check our function with some other
AbstractString than String. Let me use InlineStrings.jl:

julia> using InlineStrings

julia> @testset "InlineStrings.jl test" begin
           @test "ab" == @inferred mychop(InlineString("abc"))
       end
InlineStrings.jl test: Error During Test at REPL[110]:2
  Test threw exception
  Expression: "ab" == #= REPL[110]:2 =# @inferred(mychop(InlineString("abc")))
  return type SubString{String3} does not match inferred return type Union{SubString{String3}, String3}
Test Summary:         | Error  Total
InlineStrings.jl test |     1      1
ERROR: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.

As you can see there is still some more work to be done, as our mychop
function is not type stable, which is checked by the @inferred macro. The
problem is that indexing into a String3, which happens if the passed string is
not empty, produces a SubString, while an empty string is returned unchanged.
Let us fix it by making sure that in every branch of the code we apply the same
operation to our source string (in this code, as in the code above, we use the
fact that AbstractString is guaranteed to use 1-based indexing):

julia> mychop(s::AbstractString) = s[1:prevind(s, max(1, end))]
mychop (generic function with 1 method)

julia> @testset "advanced tests" begin
           Random.seed!(1234)
           for _ in 1:32768 # pick some round larger number to be sure all works well
               len = rand(0:10) # make sure to cover 0-length strings
               input = rand(UInt8, len) |> String
               output = join(collect(input)[1:end-1])
               @test output == @inferred mychop(input)
               @test output == @inferred mychop(InlineString(input))
           end
       end
Test Summary:  |  Pass  Total
advanced tests | 65536  65536
Test.DefaultTestSet("advanced tests", Any[], 65536, false, false)

Now all works as expected and we are done.
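The type-stability issue can also be reproduced without InlineStrings.jl. The sketch below is an assumption-laden illustration, not the code from above: chop_unstable and chop_stable are hypothetical helpers, and I use SubString on a plain String as a stand-in for "indexing returns a type different from the input type".

```julia
using Test

# Hypothetical helpers for illustration only.
# Two branches, two return types: String when empty, SubString{String} otherwise.
chop_unstable(s::String) =
    isempty(s) ? s : SubString(s, 1, prevind(s, lastindex(s)))

# One code path for every input: the result is always a SubString{String}.
chop_stable(s::String) = SubString(s, 1, prevind(s, max(1, lastindex(s))))

# @inferred throws when the actual return type differs from the inferred one.
unstable_detected = try
    @inferred chop_unstable("abc")
    false
catch
    true
end

stable_result = @inferred chop_stable("abc")

println(unstable_detected, " ", stable_result)
```

The design choice is the same as in mychop: instead of special-casing the empty string, the index computation is arranged so that a single expression handles it.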

Conclusions

I hope you found my thoughts on writing tests useful. As usual, I have some
additional comments regarding the presented code:

  • I used Random.seed! explicitly in my tests. However, the @testset macro,
    before the execution of its body, makes a call to Random.seed!(seed) where
    seed is the current seed of the global RNG. Therefore, if you design a
    larger test suite you do not have to set the seed in every @testset. It is
    enough to set it once per all tests you run.
  • As I have already commented, the presented code is reproducible on a given
    version of Julia. If you want it to be reproducible across different
    versions of Julia I recommend using, for example, the
    StableRNGs.jl package as a source of randomness in your tests.
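A pattern that makes it easy to swap the source of randomness is to pass the random number generator explicitly. Below is a minimal sketch using MersenneTwister from the Random standard library; random_input is a hypothetical helper, and my assumption is that a generator from StableRNGs.jl could be passed in its place, since the helper only requires an AbstractRNG.

```julia
using Random

# Hypothetical generator of random test inputs that takes its RNG explicitly.
random_input(rng::AbstractRNG) = String(rand(rng, UInt8, rand(rng, 0:10)))

# Two independent RNG objects with the same seed produce the same inputs,
# regardless of what happens to the global RNG in between.
rng_a = MersenneTwister(1234)
inputs_a = [random_input(rng_a) for _ in 1:20]

rand(5)  # unrelated use of the global RNG

rng_b = MersenneTwister(1234)
inputs_b = [random_input(rng_b) for _ in 1:20]

println(inputs_a == inputs_b)
```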