Author Archives: Blog by Bogumił Kamiński

Swift as an Arrow.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/11/06/arrow.html

Introduction

The Julia language is renowned for being swift (yet it is clearly not Swift).

Recently its data reading and writing capabilities are ultra swift mainly
thanks to Jacob Quinn building on earlier efforts of ExpandingMan.

The game changer is Arrow.jl package which is not only fast, but also
it is an implementation of Apache Arrow format. This means that we have
a great format for in-memory and on-disk data exchange with C, C++, C#, Go, Java,
JavaScript, MATLAB, Python, R, Ruby, and Rust.

Recently a very nice blog post that presents how Arrow.jl is used
in practice was written by Jacob Zelko.

In this blog I want to do some performance benchmarking and give recommendations
for people working with DataFrames.jl.

The post is written under Linux, Julia 1.5.2 and the following package setup:

(@v1.5) pkg> status Arrow CSV DataFrames
Status `~/.julia/environments/v1.5/Project.toml`
  [69666777] Arrow v0.4.1
  [336ed68f] CSV v0.7.7
  [a93c6f00] DataFrames v0.21.8

Arrow.jl test drive

First we create some data frame we will work with:

julia> using Arrow, DataFrames

julia> df = DataFrame(["x$i" => i:10^7+i for i in 1:10])
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
⋮
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

And benchmark the writing-to-disk speed (I run all timings twice to capture both
time of the first run, and time after things are compiled, as both are relevant
in practice):

julia> using Arrow, DataFrames

julia> @time Arrow.write("test.arrow", df)
  2.792842 seconds (7.61 M allocations: 386.438 MiB)
"test.arrow"

julia> @time Arrow.write("test.arrow", df)
  2.222653 seconds (436 allocations: 35.953 KiB)
"test.arrow"

julia> stat("test.arrow").size / 10^6
800.001826

The performance looks really good given the final file has 800 MB.
Also we see that compilation latency is low.

Now we read the file back:

julia> @time df2 = Arrow.Table("test.arrow") |> DataFrame;
  0.831660 seconds (2.44 M allocations: 123.515 MiB)

julia> @time df2 = Arrow.Table("test.arrow") |> DataFrame
  0.000320 seconds (649 allocations: 40.047 KiB)
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
⋮
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

and here the magic happens. When Arrow.Table reads the file from the disk
it does memory mapping so reading it is almost instant (again, as when writing
the file the compilation latency is low, which is nice).

There is one cost of this speed, Arrow.jl uses its own custom vector type and it
is additionally read only when memory mapped, as we can see here:

julia> typeof(df2.x1)
Arrow.Primitive{Int64,Array{Int64,1}}

julia> df2.x1[1] = 100
ERROR: ReadOnlyMemoryError()

Fortunately, it is easily fixed by making a copy of a DataFrame:

julia> @time df3 = copy(df2);
  0.522758 seconds (287.94 k allocations: 777.914 MiB, 9.34% gc time)

julia> @time df3 = copy(df2);
  0.559576 seconds (44 allocations: 762.943 MiB, 36.20% gc time)

julia> typeof(df3.x1)
Array{Int64,1}

And it costs around 0.5 second in our case, which is not that bad I think.

If we want to avoid memory mapping, we can read a file from an IO like this:

julia> @time df2 = open("test.arrow") do io
           return Arrow.Table(io) |> DataFrame
       end;
  0.338193 seconds (42.84 k allocations: 765.156 MiB)

julia> @time df2 = open("test.arrow") do io
           return Arrow.Table(io) |> DataFrame
       end
  0.323562 seconds (8.50 k allocations: 763.373 MiB)
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
⋮
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

julia> typeof(df2.x1)
Arrow.Primitive{Int64,Array{Int64,1}}

julia> df2.x1[1] = 100
100

julia> push!(df2, 1:10)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64,Array{Int64,1}}, ::Int64)

As you can see this time the vectors are mutable, but not re-sizable. Having
mutability costs around 0.3 second of extra read time over memory mapping.

To get a relative feeling about these timings let us try CSV.jl
(we use a single thread):

julia> using CSV

julia> @time CSV.write("test.csv", df)
 14.644919 seconds (501.00 M allocations: 11.977 GiB, 8.92% gc time)
"test.csv"

julia> @time CSV.write("test.csv", df)
 15.076190 seconds (499.98 M allocations: 11.925 GiB, 8.93% gc time)
"test.csv"

julia> stat("test.csv").size / 10^6
788.889406

julia> @time df2 = CSV.read("test.csv");
  7.391509 seconds (6.01 M allocations: 2.909 GiB, 3.92% gc time)

julia> @time df2 = CSV.read("test.csv");
  3.292178 seconds (2.94 k allocations: 2.612 GiB, 6.21% gc time)

So we see that indeed both reading and writing is much faster (although the size
of the file on disk is comparable in both approaches).

Concluding remarks

I did not want to go into the details of Arrow.jl usage in this post,
but rather show a high level view how it works and present how to use it with
DataFrames.jl (depending on what level of mutability one wants).

Maybe as a final comment let me just highlight the three options you have to
create the Arrow.Table table object. You can use data from:

  • from a file on a disk, in which case you have an advantage of being able to use
    memory mapping;
  • from an IO (so this means you can ingest data from any external source);
  • from a Vector{UInt8} (so you can easily process data passed as a pointer,
    or e.g. byes read from a HTTP request).

I think it is really great as it covers virtually any use case one might
encounter in practice.

Train your brain with the Julia language

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/10/30/train-your-brain.html

Introduction

Together with Paweł Prałat we have writen Train Your Brain
book, that contains mathematical problems that are challenging (at least for us)
yet only require elementary mathematics.

In general we solve the problems using pen-and-paper only, but often we use
codes in the Julia programming language to support the reasoning. We mostly use
it to build the intuition that helps to find the proof.
In A Julia language companion to the book we provide a brief introduction
to the Julia language and discuss in detail the source codes we use in the main
book.

One of the problems from the book I especially like is the following:

Every point on a circle is painted with one of three colors: red, green, and
blue. Prove that there are three points on the circle that have the same color
and form an isosceles triangle.

In this post I will discuss how computer support can be used in the process
of solving of this problem.

Warm up

Let us start with a simpler problem:

Paint every point on a circle using one of two colors red and green in such a
way that there do not exist three points on the circle that have the same
color and form an equilateral triangle.

To solve it pick any diameter of the circle and denote its endpoints by \(A\)
and \(B\). Now start in point \(A\) and moving clockwise paint the arc from
\(A\) (inclusive) to \(B\) (exclusive) red and arc from \(B\) (inclusive) to
\(A\) (exclusive) green. We will argue that this coloring meets the requirements
of the problem.

Consider points \(P\), \(Q\), and \(R\) located on a circle that are all painted
with the same color. This means that the triangle \(PQR\) formed by them does
not contain the center of the circle, and thus it is obtuse, so it is not
equilateral. This finishes the proof.

The actual problem

Solving the actual problem is more challenging. Here, I discuss how computer
support can be used to attack it.

The first observation is that in order to use the computer we probably should
reformulate the problem. Instead of considering the circle we will consider
a regular \(n\)-gon that is inscribed in the circle. We choose such a
simplification as, intuitively, a regular \(n\)-gon contains many isosceles
triangles. Now observe that in such polygon at least \(k=\lceil n / 3 \rceil\)
points are painted using one color, by the pigeonhole principle.
This means that it is enough to find such \(n\) for which any \(k\)-element
subset of its vertices forms at least one isosceles triangle.

From this we see that the most promising values of \(n\) should be of the form
\(3k-2\).

Clearly picking \(k=3\) is not enough as then \(n=7\) and we can pick
three vertices from a \(7\)-gon that do not form an isosceles triangle.

The next guess is \(k=4\). Then we have \(n=10\). However, also in this case
we notice that if we pick four vertices that form a rectangle (for example
verices number 1, 2, 6, and 7 if we number vertices clockwise) they do not
form an isosceles triangle.

We thus turn to \(k=5\) and \(n=13\). It is not that clear what happens in this
case. Let us use the Julia language to verify this. Again number vertices
clockwise, this time from 1 to 13.

Let us start with a simpler task given three vertex numbers a, b, and c
how to check if they form an isosceles triangle. Here is a simple code that
verifies this (I did not try to make it super efficient, so probably I would not
get an approval for it if I wanted to merge it to DataFrames.jl, but this
is my blog post so I can approve the PRs myself):

function isisosceles(a::Int, b::Int, c::Int)
    # sorting ensures we can calculate the distance easily
    a, b, c = sort!([a, b, c])

    # it is always safe to make sure we got what we expected
    @assert 0 < a < b < c < 14

    d1 = b - a # distance from point a to b
    d2 = c - b # distance from point b to c
    d3 = 13 - d1 - d2 # distance from point c to a
    # if any of the distances are equal as then we have an isosceles triangle
    return d1 == d2 || d2 == d3 || d3 == d1
end

(note in this code that the tricky case is when d1, d2, or d3 is greater
than 6 but in this case clearly it is only possible that the triangle is
isosceles if two remaining values are equal)

Let us quickly check if our function produces the correct results on two sample
triangles:

julia> isisosceles(1, 2, 3)
true

julia> isisosceles(1, 2, 4)
false

Indeed it seems to work correctly (of course this is not the kind of tests one
should write when developing a package).

Let us move to checking all five element subsets of the set
\(\{1,2,\ldots,12,13\}\) to make sure each of them contains at least one
isosceles triangle. Here is the code that does the job (the code was tested
under Julia 1.5.2 and Combinatorics.jl 1.0.2):

julia> using Combinatorics

julia> for p5 in combinations(1:13, 5)
           if !any(isisosceles(p3...) for p3 in combinations(p5, 3))
               @error "vertex set $p5 does not form any isosceles triangles"
           end
       end

julia>

As you can see no error message was printed, so indeed we have a support for the
claim that if one takes a regular \(13\)-gon and paints its vertices using three
colors then there will be always such three vertices that form an isosceles
triangle and have the same color.

Concluding remarks

Before I finish let me make two remarks.

Firstly, in the book we show how to prove this claim without computer
support (as it is always possible that there is some bug in: Julia Base,
Combinatorics.jl, or my code).

Secondly, you might want to check out the Julia companion in which
I additionally discuss how to find the coloring of the regular \(13\)-gon that
generates the lowest number of isosceles triangles.

The power of lookup functions in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/10/23/argmax.html

Introduction

Julia provides four very useful general look-up functions in Base:
argmin, argmax, findmin, and findmax.

Normally they are used with arrays, like this:

julia> a = [2, 1, 5, 0]
4-element Array{Int64,1}:
 2
 1
 5
 0

julia> findmax(a)
(5, 3)

julia> argmax(a)
3

julia> findmin(a)
(0, 4)

julia> argmin(a)
4

However, the beauty of their design is that they work with arbitrary collections
of values that support comparisons. Below I give some examples.

This post was written using Julia 1.5.2, StatsBase.jl v0.32.2 and
DataFrames.jl 0.21.8.

Dictionaries

A quite useful pattern for use of these functions is with dictionaries. Here
is an example:

julia> findmax(d)
(5, 'c')

julia> argmax(d)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> findmin(d)
(0, 'd')

julia> argmin(d)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)

Now you might ask when this is useful? Consider that we have some set of
nominal values and want to find the most frequent one. Here is an easy way to
do it using StatsBase.jl:

julia> using Random, StatsBase

julia> Random.seed!(1234);

julia> r = rand(1:10, 1000);

julia> m = countmap(r)
Dict{Int64,Int64} with 10 entries:
  7  => 100
  4  => 91
  9  => 81
  10 => 110
  2  => 107
  3  => 99
  5  => 103
  8  => 107
  6  => 107
  1  => 95

julia> findmax(m) # find the frequency and value of the most frequent item
(110, 10)

julia> findall(==(findmax(m)[1]), m) # make sure it is unique
1-element Array{Int64,1}:
 10

Data frames

When working with data frames we often store data that can be lexicographically
compared. Let us consider the following simple data frame:

julia> using DataFrames

julia> Random.seed!(1234);

julia> df = DataFrame(c=rand('a':'d', 10), n=rand(10))
10×2 DataFrame
│ Row │ c    │ n        │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'a'  │ 0.372846 │
│ 2   │ 'b'  │ 0.263121 │
│ 3   │ 'c'  │ 0.98869  │
│ 4   │ 'a'  │ 0.489858 │
│ 5   │ 'd'  │ 0.425211 │
│ 6   │ 'a'  │ 0.379765 │
│ 7   │ 'd'  │ 0.286955 │
│ 8   │ 'c'  │ 0.118085 │
│ 9   │ 'd'  │ 0.739658 │
│ 10  │ 'c'  │ 0.97138  │

Assume, as stated above, that we want to find its largest row lexicographically:

julia> findmax(eachrow(df))
(DataFrameRow
│ Row │ c    │ n        │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 9   │ 'd'  │ 0.739658 │, 9)

Also because eachrow(df) returns an AbstractArray object we can also safely
use maximum function to get the following:

julia> maximum(eachrow(df))
DataFrameRow
│ Row │ c    │ n        │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 9   │ 'd'  │ 0.739658 │

All this is nice and clean in a fully generic way.

Conclusions

This time conclusions will be a warning. One should clearly understand how these
functions work, as in some cases their behavior might be surprising. Here
is an example:

julia> d = Dict('a':'d' .=> 4:-1:1)
Dict{Char,Int64} with 4 entries:
  'a' => 4
  'c' => 2
  'd' => 1
  'b' => 3

julia> maximum(d)
'd' => 1

julia> findmax(d)
(4, 'a')

And we see that maximum finds a largest key-value pair while findmax
locates the largest value and returns a tuple containing this value
and a key corresponding to it.

Note though that e.g. for NamedTuple the behavior is different:

julia> nt = (a=4, b=3, c=2, d=1)
(a = 4, b = 3, c = 2, d = 1)

julia> maximum(nt)
4

julia> findmax(nt)
(4, :a)

and in this case we consistently get the largest value in both cases.