Author Archives: Blog by Bogumił Kamiński

Why DataFrame is not type stable and when it matters

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/01/08/typestable.html

Introduction

One of the most frequent performance questions related to DataFrames.jl
are caused by the fact that the DataFrame object is not type stable.
Here is a recent question on Stack Overflow that originated from this
issue. Experienced Julia users are aware of the trade-offs I discuss here, but
they are often surprising for people starting to use DataFrames.jl.

In this post I will want to cover the following issues: a) what does it mean
that DataFrame is not type stable, b) what positive consequences it has, c)
when type instability hits the user most and how to handle it.

This post was written under Julia 1.5.3, DataFrames.jl 0.22.2, and Tables 1.2.2.

What does it mean that DataFrame is not type stable?

The reason here is relatively simple. Here is a stripped definition of DataFrame:

struct DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end

The columns field stores the vectors that constitute the data frame, and the
colindex field maintains a mapping between column numbers and column names.

We can see two crucial things in this definition:

  • the bad: columns has element type AbstractVector, breaking the most
    fundamental rule from the Julia Manual about writing high-performance
    Julia code; this design means that when we extract a column from a DataFrame
    the Julia compiler is not able to infer its type (and in consequence is not
    able to produce the most efficient code);
  • the good: the DataFrame type does not have parameters; this means that if
    you run some function once on a data frame it does not have to be recompiled
    later for any DataFrame you might pass to it (no matter what columns it
    would contain); it also means that it is possible to efficiently precompile
    many of the functions in the package with known signatures to reduce
    the time to first result; finally — you are free to add or remove columns
    from a DataFrame in-place.

The simplest type-stable type that can work similarly to a DataFrame is a
NamedTuple of vectors. However, it should be stressed that this type-stability
is only available when which columns are extracted from such NamedTuple can be
inferred at compile time. Let me give two simple examples of such type
instability for the case of type-stable container like NamedTuple:

Example #1:

julia> nt = (a=[1, 2], b=[3.0, 4.0], c=[true, false])
(a = [1, 2], b = [3.0, 4.0], c = Bool[1, 0])

julia> f(nt, i) = nt[i]
f (generic function with 1 method)

julia> g(nt) = f(nt, 1)
g (generic function with 1 method)

julia> @code_warntype f(nt, 1)
Variables
  #self#::Core.Compiler.Const(f, false)
  nt::NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{Float64,1},Array{Bool,1}}}
  i::Int64

Body::Array
1 ─ %1 = Base.getindex(nt, i)::Array
└──      return %1

julia> @code_warntype g(nt)
Variables
  #self#::Core.Compiler.Const(g, false)
  nt::NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{Float64,1},Array{Bool,1}}}

Body::Array{Int64,1}
1 ─ %1 = Main.f(nt, 1)::Array{Int64,1}
└──      return %1

Here you can see that you achieve type stability of the result only if the
compiler is able to perform constant propagation (as in the case of the function
g). Otherwise, even if, in theory NamedTuple is type stable, the type of the
vector returned by function f is not known by the compiler.

This leads us to

Example #2:

julia> function h()
           nt = (a=[1, 2], b=[3.0, 4.0], c=[true, false])
           return sum(nt)
       end
h (generic function with 1 method)

julia> @code_warntype h()
Variables
  #self#::Core.Compiler.Const(h, false)
  nt::NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{Float64,1},Array{Bool,1}}}

Body::Any
1 ─ %1 = (:a, :b, :c)::Core.Compiler.Const((:a, :b, :c), false)
│   %2 = Core.apply_type(Core.NamedTuple, %1)::Core.Compiler.Const(NamedTuple{(:a, :b, :c),T} where T<:Tuple, false)
│   %3 = Base.vect(1, 2)::Array{Int64,1}
│   %4 = Base.vect(3.0, 4.0)::Array{Float64,1}
│   %5 = Base.vect(true, false)::Array{Bool,1}
│   %6 = Core.tuple(%3, %4, %5)::Tuple{Array{Int64,1},Array{Float64,1},Array{Bool,1}}
│        (nt = (%2)(%6))
│   %8 = Main.sum(nt)::Any
└──      return %8

We can see that even if the type of nt is known at compile-time the compiler
is not able to determine the return type of the sum function as this function
iterates over elements of nt (which is not type stable in general; some
functions for small number of iterations might perform loop unrolling in which
case the result would be type stable).

So the first take-away is that it is not enough to have a type-stable data
structure but also you have to write your code in a way that is type stable.

Why making DataFrame type stable would be a problem?

In order to understand why type-stability might be a problem consider the
following examples.

Example #3:

Here we will show what happens if we have many data frames to process and they
have a different schema (start a new Julia session):

julia> using DataFrames

julia> @time dfs = [DataFrame("a$i" => [isodd(i) ? 1 : 1.0]) for i in 1:10000];
  0.333070 seconds (792.20 k allocations: 49.514 MiB, 6.14% gc time)

julia> @time dfs = [DataFrame("a$i" => [isodd(i) ? 1 : 1.0]) for i in 1:10000];
  0.077577 seconds (298.23 k allocations: 23.319 MiB, 14.77% gc time)

julia> f(df) = df[1, 1]
f (generic function with 1 method)

julia> @time f.(dfs);
  0.109797 seconds (206.24 k allocations: 10.630 MiB)

julia> @time f.(dfs);
  0.001640 seconds (19.50 k allocations: 461.141 KiB)

now start a new Julia session again:

julia> @time nts = [(;Symbol("a$i") => [isodd(i) ? 1 : 1.0]) for i in 1:10000];
    280.058361 seconds (14.82 M allocations: 886.891 MiB, 0.12% gc time)

julia> @time nts = [(;Symbol("a$i") => [isodd(i) ? 1 : 1.0]) for i in 1:10000];
 18.461082 seconds (193.24 k allocations: 8.826 MiB)

julia> f(nt) = nt[1][1]
f (generic function with 1 method)

julia> @time f.(nts);
 41.306839 seconds (18.02 M allocations: 1.085 GiB, 1.36% gc time)

julia> @time f.(nts);
  0.006529 seconds (19.50 k allocations: 461.141 KiB)

As you can see having DataFrame non-parametric is much more compiler friendly.
Of course the above examples are artificial, but they show what happens if you
would need to work with data frames constantly changing schemas (e.g. having
repeatedly added/removed columns).

Example #4:

Now we switch to consider a single data frame, that is wide (again start a fresh
Julia session):

julia> using DataFrames

julia> @time DataFrame(["a$i" => [isodd(i) ? 1 : 1.0] for i in 1:10000]);
  0.495980 seconds (1.22 M allocations: 63.541 MiB, 4.83% gc time)

julia> @time DataFrame(["a$i" => [isodd(i) ? 1 : 1.0] for i in 1:10000]);
  0.077374 seconds (205.31 k allocations: 11.107 MiB)

julia> @time (;[Symbol("a$i") => [isodd(i) ? 1 : 1.0] for i in 1:10000]...);
 23.039182 seconds (2.09 M allocations: 82.427 MiB, 0.12% gc time)

julia> @time (;[Symbol("a$i") => [isodd(i) ? 1 : 1.0] for i in 1:10000]...);
  0.070873 seconds (176.05 k allocations: 9.829 MiB)

As you can see we have a situation in which the compiler is very severely
burdened, this time by the fact that our NamedTuple is very wide.

In conclusion — because DataFrames.jl is intended to be a general purpose
package it uses a type-unstable design. In this way we are sure that we are not
going to be hit by compilation-related issues even when having very many tables
having different schemas or very wide tables (and both scenarios are encountered
in practice).

There is one additional consideration that favors making the DataFrame type
type-unstable, and it is related to allowing to change the schema of the data
frame in-place. What does changing schema mean? Well — many things users
typically want to do in-place:

  • renaming columns of a data frame;
  • adding/replacing/removing columns of a data frame;
  • using a long list of functions ending with ! in DataFrames.jl (that
    essentially do one or both of the operations listed above).

When it really matters that DataFrame is not type stable?

Maybe let me start with commenting when it does not matter that DataFrame is
not type stable. In general (except for extreme situations) when you are working
on whole columns of a data frame you are not going to be severely punished by
type instability. The only cost you have to pay is the cost of one dynamic
dispatch to the function to which you are passing the extracted column (this
strategy is used in DataFrames.jl e.g. in the select function).

Let us have a look at some simple example:

julia> using Statistics

julia> using DataFrames

julia> mx = rand(10000, 100);

julia> df = DataFrame(mx, :auto);

julia> cor_df(df) = [cor(x, y) for x in eachcol(df), y in eachcol(df)]
cor_df (generic function with 1 method)

julia> cor(mx); @time cor(mx);
  0.007407 seconds (12 allocations: 7.708 MiB)

julia> cor_df(df); @time cor_df(df);
  0.067229 seconds (60.10 k allocations: 1.911 MiB)

As you can see the performance is pretty similar.

However if you really need to go type-stable starting from a data frame it is
very easy just use the Tables.columntable function:

julia> using DataFrames

julia> df = DataFrame(a=[1,2], b=[3,4])
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> tbl = Tables.columntable(df)
(a = [1, 2], b = [3, 4])

and you move to a type-stable NamedTuple (no copying of columns is performed),
and assuming the table is not very wide this conversion is cheap. Of course it
is equally easy to go back:

julia> DataFrame(tbl, copycols=false)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

I used copycols=false to avoid copying of the columns — so again the
process is cheap; note, however, that by default the DataFrame constructor
copies columns passed to it as it is a safer approach.

So when does type stability matter most? The answer is when you need to iterate
rows of a data frame (especially as in a typical data frame there are many times
more rows than columns). Note that iterating a data frame cell by cell (i.e.
extracting single values from it) is essentially the same case.

Why it is a problem? Well — the issue is that when you extract a row from a
DataFrame it is still type unstable DataFrameRow object (the reason is
similar to what we have discussed above — if you would have a very wide data
frame you would have huge compilation cost if you wanted to make a row of a data
frame type-stable). However, this time we are paying type-instability cost for
each row of a data frame — and this can indeed get expensive.

Here is an example:

julia> using DataFrames

julia> df = DataFrame(rand(10^6, 2), :auto);

julia> @time [row[1] for row in eachrow(df)];
  0.227021 seconds (4.19 M allocations: 78.453 MiB, 24.12% gc time)

julia> @time [row[1] for row in eachrow(df)];
  0.279041 seconds (4.11 M allocations: 73.963 MiB, 49.28% gc time)

Is this timing bad? Well, compare it to:

julia> tbl = Tables.columntable(df);

julia> @time [row[1] for row in Tables.rows(tbl)];
  0.049375 seconds (66.81 k allocations: 11.309 MiB, 21.53% gc time)

julia> @time [row[1] for row in Tables.rows(tbl)];
  0.029195 seconds (44.86 k allocations: 9.987 MiB)

so indeed it looks bad. Actually there is a shorthand for the operation we just
did:

julia> @time [row[1] for row in Tables.namedtupleiterator(df)];
  0.232059 seconds (1.25 M allocations: 68.209 MiB, 6.59% gc time)

julia> @time [row[1] for row in Tables.namedtupleiterator(df)];
  0.042947 seconds (45.12 k allocations: 9.925 MiB)

So the sort conclusion is — you should expect to be most significantly
influenced by the performance degradation due to type-instability of DataFrame
if you want to iterate rows (or access single cells). However, in such cases you
can use e.g. Tables.columntable or Tables.namedtupleiterator to easily
switch to type-stable mode (there is also @eachrow macro in
DataFramesMeta.jl).

As you saw above this was the approach that allowed to resolve the
question asked on Stack Overflow.

Conclusions

The crucial take-away from this discussion is that it is easy to switch between
type-stable and type-unstable modes when using DataFrames.jl.

Therefore, especially in generic code that is designed to work with arbitrary
data we do not know in advance (that e.g. can be wide or can constantly change
its schema), a practical pattern is to:

  • work with DataFrame most of the time (and accept paying a small cost of its
    type instability)
  • when you really need performance (and especially when you need to iterate rows
    of the data frame) temporarily switch to type-stable mode (this is what
    select does internally when you use the ByRow wrapper on a function)

Dissecting RANDU

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/31/randu.html

Introduction

Today finally the Train Your Brain book by Paweł Prałat and me is finally
available. In the book we argue that using computer support can help solve
mathematical problems.

In this post I thought of doing the reverse — consider a classical computing
problem that can be better understood using mathematics. Here I concentrate on
the computational part, but all the theory needed to rigorously study what I
discuss is presented in the Train Your Brain book.

The topic I want to tackle today is why devising a good pseudo random number
generator is hard. I will use the classical RANDU generator to
highlight potential problems that one might encounter in practice.

This post was tested under Julia 1.5.3, DataFrames.jl 0.22.2, FreqTables.jl
0.4.2, HypothesisTests.jl 0.10.2, Pipe.jl 1.3.0, and PyPlot.jl 2.9.0.

As a warning note that some of the computations we do in this post can take up
to several minutes and require at least several GB of RAM (I used @time to
highlight the most expensive steps), so if you run my code you should take this
into account.

Let us get started!

Setting up the scene

The RANDU generator produces floating point values from the \((0,1)\)
interval using the following rule. Let \(S_i\) be the state of the generator
at time \(i\). Then is next state is defined as:

\[S_{i+1}=65539S_i \mod 2^{31},\]

and the produced float value is \(S_{i+1} / 2^{31}\).

We start with loading the libraries we will use in this Julia session:

using DataFrames
using FreqTables
using HypothesisTests
using Pipe
using PyPlot
using Random
using Statistics

Now implement the RANDU generator:

const RANDU_CONST = 0x00010003

mutable struct Randu <: AbstractRNG
    state::UInt32
end

Random.seed!(r::Randu, state::UInt32) = (r.state = state; return r)
Base.copy(r::Randu) = Randu(r.state)

function Random.rand(r::Randu,
                     ::Random.SamplerTrivial{Random.CloseOpen01{Float64}})
    s = r.state
    s = (s * RANDU_CONST) & 0x7fffffff
    r.state = s
    return s / 0x80000000
end

Observe that we use UInt32 for all computations and as \(2^{31}\) is a divisor
of \(2^{32}\) we can safely ignore the overflow – we simply look at the lowest
31 bits of the generated numbers, which is done by performing the & 0x7fffffff
operation.

Now we are ready to get started with using this generator to produce Float64
pseudo random values from the \((0,1)\) range.

Not enough randomness

First we want to produce random permutations using the generator. Here is
the code that counts how many distinct permutations we get if the length
of the permutation is 64 and we produce 10,000,000 of them.

count_permutations(rng::AbstractRNG) =
    mapreduce(push!, 1:10_000_000, init=Set{Vector{Int}}()) do _
        sortperm(rand(rng, 64))
    end |> length

Let us start with using MersenneTwister that is currently used as a default
generator in Julia:

julia> @time count_permutations(MersenneTwister(11))
 64.585667 seconds (30.11 M allocations: 12.058 GiB, 34.31% gc time)
10000000

We can see that we got no duplicates. Now let us test our RANDU generator:

julia> @time count_permutations(Randu(11))
 61.817341 seconds (30.00 M allocations: 12.052 GiB, 31.82% gc time)
8388608

julia> @time count_permutations(Randu(12))
 39.189380 seconds (30.00 M allocations: 11.841 GiB, 19.51% gc time)
2097152

Now we got many duplicates, and we see that 12 produced many more than 11.

Let us start with trying to understand what we should expect in this case. We
have \(64!\) permutations (buckets) and \(10^7\) values so the probability
of not seeing any duplicate is:

\[1 – \prod_{i=0}^{10^7-1}\left(1-\frac{i}{64!}\right)\]

This value can be approximated from above by (where the above formula comes from
and how to do such approximations is explained in our book):

\[1 – \exp\left(-\frac{10^7\cdot (10^7-1)}{2\cdot 64!-10^7}\right)\]

This can be easily computed using Julia:

julia> Float64(1 - exp(-10_000_000*(10_000_000-1)/
                       (2*(factorial(big(64))-10_000_000))))
3.9726375353434445e-76

As you can see the probability of seeing any duplicate is extremely low,
so there must be some flaw in the RANDU generator.

In this case the problem is that it does not produce enough randomness. We can
see that the generator has at most \(2^{31}\) states (actually it has less as we
will find out soon) and it seems not enough for our purposes.

Let us investigate it in more detail. Let us first check for what smallest
positive \(c\) we have:

\[65539^c \equiv 1 \mod (2^{31})\]

as then we will have that

\[S_1 \equiv S_{c+1} \mod (2^{31})\]

and thus get a cycle. Here is the code:

julia> function find_min_power()
         m = RANDU_CONST
         i = one(UInt32)
         while m & 0x7fffffff != 1
             m *= RANDU_CONST
             i += one(UInt32)
         end
         return i
       end
find_min_power (generic function with 1 method)

julia> find_min_power()
0x20000000

julia> RANDU_CONST^0x20000000 & 0x7fffffff
0x00000001

and we see that \(c=2^{29}\) (again why it is \(2^{29}\) and how it could have
been computed using pen and paper is explained in our book). This means
that given any starting seed \(S_1\) we cannot generate more than \(2^{29}\)
distinct values. This is not much indeed for current computing standards.
However, let us investigate why seed 12 was much worse than seed 11. Here is the
code:

julia> function get_cycles()
           cycle = zeros(UInt8, 0x80000000)
           c = 0x1
           for k in 1:2^31-1
               i = k
               cycle[i] != 0x0 && continue
               while cycle[i] == 0x0
                   cycle[i] = c
                   i = (i * RANDU_CONST) & 0x7fffffff
               end
               @assert i == k # we should get a cycle
               c += 0x1
               @assert c != 0x0 # make sure we do not have more than 255 cycles
           end
           return cycle
       end
get_cycles (generic function with 1 method)

julia> @time cycle = get_cycles()
127.491600 seconds (2 allocations: 2.000 GiB, 0.29% gc time)
2147483648-element Array{UInt8,1}:
 0x01
 0x02
 0x01
 0x03
 0x04
 0x02
 0x04
 0x05
 0x01
 0x06
    ⋮
 0x0a
 0x01
 0x06
 0x01
 0x08
 0x04
 0x06
 0x04
 0x00

julia> freq_cycle = sort(freqtable(cycle), rev=true)
61-element Named Array{Int64,1}
Dim1  │
──────┼──────────
0x01  │ 536870912
0x04  │ 536870912
0x02  │ 268435456
0x06  │ 268435456
0x03  │ 134217728
0x08  │ 134217728
0x05  │  67108864
0x0a  │  67108864
0x07  │  33554432
⋮               ⋮
0x33  │         8
0x38  │         8
0x35  │         4
0x3a  │         4
0x37  │         2
0x39  │         2
0x3c  │         2
0x00  │         1
0x3b  │         1

julia> freqtable(freq_cycle)
30-element Named Array{Int64,1}
Dim1      │
──────────┼──
1         │ 2
2         │ 3
4         │ 2
8         │ 2
16        │ 2
32        │ 2
64        │ 2
128       │ 2
256       │ 2
⋮           ⋮
2097152   │ 2
4194304   │ 2
8388608   │ 2
16777216  │ 2
33554432  │ 2
67108864  │ 2
134217728 │ 2
268435456 │ 2
536870912 │ 2

And we see that we have two cycles having \(2^{29}\) length and many shorter
cycles. The shortest one has length 1, let us check if this is indeed the case:

julia> x = UInt32(only(findall(==(0x3b), cycle)))
0x40000000

julia> x * RANDU_CONST & 0x7fffffff
0x40000000

So if we are unlucky with the choice of the initial seed we will get the cycle
of length 1. This phenomenon is called a bad seed problem in pseudo random
number generators. However, a cursory look at cycle variable suggests that
it is the case that all odd numbers are good seeds (producing cycle of length
\(2^{29}\)). Let us check this:

julia> foreach(enumerate(cycle)) do (i, v)
           @assert iseven(i) ⊻ (v == 0x1 || v == 0x4)
       end

julia>

We got no error so indeed this the case. Actually the two longest cycles contain
two groups of odd numbers that respectively are congruent to 1 and 3, or 5 and 7
modulo 8 as we can see here:

julia> ([1, 3, 5, 7] .* RANDU_CONST) .% 8
4-element Array{Int64,1}:
 3
 1
 7
 5

This, in particular, shows us that low bits of the generated numbers show a
pattern so RANDU is not well suited for generation of integers. But maybe it
is good when generating floats?

Before checking this let me just summarize that we have up to this point learned
that if we use a good (i.e. odd) seed we have a generator that has a cycle of
\(2^{29}\). This means that it is essentially only useful if we generate very few
random numbers by todays standards. Above we saw that this causes a problem when
generating permutations. Before we move on let us check out another scenario of
generating binary numbers:

julia> function gen2(rng::AbstractRNG)
           means = Float64[]
           for _ in 1:1000
               s = 0
               for _ in 1:10^8
                   s += rand(rng) < 0.5
               end
               push!(means, s / 10^8)
           end
           return var(means), quantile([var(rand(means, 1000)) for _ in 1:10000],
                                       [0.025, 0.975])
       end
gen2 (generic function with 1 method)

julia> @time gen2(Randu(11))
124.834623 seconds (10.02 k allocations: 77.684 MiB)
(3.806128771508955e-9, [3.540403887824998e-9, 4.062688436887708e-9])

julia> 0.25 / 10^8 # we should get approximately this value
2.5e-9

and we can see that the variance produced for \(10^8\) trials is much to large
(note that doing 1000 repetitions of the experiment ensured that the differences
are significant, which we checked using bootstrapping). Conversely, we could
have easily made the variance too small if we use e.g. \(2^{29}\) trials (as you
can see below we got exactly the same value 100 times — for sure not producing
enough variance):

julia> function gen2_v2(rng::AbstractRNG)
           means = Float64[]
           for _ in 1:100
               s = 0
               for _ in 1:2^29
                   s += rand(rng) < 0.5
               end
               push!(means, s / 2^29)
           end
           return freqtable(means)
       end
gen2_v2 (generic function with 1 method)

julia> @time gen2_v2(Randu(11))
 62.797411 seconds (42 allocations: 5.156 KiB)
1-element Named Array{Int64,1}
Dim1  │
──────┼────
0.5   │ 100

Linearity in higher dimensions

Have a look two plots that were generated using this code:

function plot_square(rng::AbstractRNG)
    v = Tuple{Float64, Float64}[]
    for i in 1:2^28
        x, y = rand(rng), rand(rng)
        if 0.4 < x < 0.401 && 0.5 < y < 0.501
            push!(v, (x,y))
        end
    end
    scatter(getindex.(v, 1), getindex.(v, 2), s=1)
end

Here are the plots:

plot_square(Randu(11))

Randu plot

plot_square(MersenneTwister(11))

MersenneTwister plot

The RANDU plot looks oddly regular. The points seem to lie on parallel lines.

It is clear that \(S_{i+1} / 2^{31} – 65539\cdot S_{i} / 2^{31}\) is an integer
as \(S_{i+1}=65539S_i \mod 2^{31}\). However, if we had only this relationship
would not get the pattern we see above. Let us look for other lines of this type
(we limit ourselves to small values of pairs defining the lines):

julia> function find_2d()
           for a in 1:RANDU_CONST
               b = a * RANDU_CONST & 0x7fffffff
               a + b <= RANDU_CONST + 1 && @show a, -b
               b = 0x80000000 - b
               a + b <= RANDU_CONST + 1 && @show a, b
           end
       end
find_2d (generic function with 1 method)

julia> find_2d()
(a, -b) = (1, -65539)
(a, b) = (32766, 32774)
(a, -b) = (32767, -32765)

Now, of the two pairs (32766, 32774) and (32767, -32765) which one can be
expected to produce bigger spacing between the lines? Note that both produce
values in the similar width of the range. However, (32766, 32774) produces
only odd values (as we assume to use odd seed value), while (32767, -32765)
produces both even and odd values because:

\[a\frac{S_{i+1}}{2^{31}}+b\frac{S_i}{2^{31}} =
\frac{a (65539 S_i \mod 2^{31})}{2^{31}}+\frac{b S_i}{2^{31}} =
\frac{a (65539 S_i + 2^{31}p)}{2^{31}}+\frac{b S_i}{2^{31}}\]

where \(p\) is an integer, and after rearranging we get:

\[\frac{(65539a+b) S_i + 2^{31}ap}{2^{31}}\]

Now for both our \((a, b)\) pairs we have that \(65539a+b=2^{31}\) so it reduces
to:

\[S_i+ap.\]

Since \(S_i\) is odd, we see that for \(a=32766\) this is always an odd number,
while for \(a=32767\) it can be either even or odd number as discussed.

So for (32766, 32774) we expect to get \((32766+32774)/2= 32770\) distinct
integer values. Let us check this:

julia> let ru = Randu(11), s = Set{Int}()
           for _ in 1:2^28
               v = Int(32774*rand(ru)+32766*rand(ru))
               @assert isodd(v)
               push!(s, v)
           end
           length(s)
       end
32770

Having 32700 parallel lines on which the points lie is bad (we can see this
above), but probably not prohibitive. The real problem starts when we look
at the triplets of points and check if we can put them on a few planes
(again we limit ourselves to only small values of pairs defining the lines):

julia> function find_3d()
           a = 1
           for b in 0:RANDU_CONST, signb in [-1, 1]
               c = (a * RANDU_CONST^2 + signb * b * RANDU_CONST) & 0x7fffffff
               a + b + c <= RANDU_CONST && @show a, signb*b, -c
               c = 0x80000000 - c
               a + b + c <= RANDU_CONST && @show a, signb*b, c
           end
       end
find_3d (generic function with 1 method)

julia> find_3d()
(a, signb * b, -c) = (1, -5, -65530)
(a, signb * b, c) = (1, -6, 9)
(a, signb * b, -c) = (1, 32761, -32756)
(a, signb * b, -c) = (1, -32772, -32765)

Bam – we got a very bad (1, -6, 9) tuple! Now we understand the reason for
the famous RANDU plot that you can see in Wikipedia artice I linked
above:

3D RANDU

Before we move on let us check the same with our code:

julia> let ru = Randu(11), s = Set{Int}()
           for _ in 1:ceil(Int, 2^29 / 3)
               push!(s, 9*rand(ru) - 6*rand(ru) + rand(ru))
           end
           length(s)
       end
15

But why it is a problem? Let us illustrate it by splitting each of three
dimensions into \(16\) bins (so we get \(16^3\) bins in total). Finally we use
the DataFrames.jl package (which I assume everyone was waiting for :), as it was
announced in the introduction and every story teller tries to follow Чеховское
ружьё
principle):

julia> @time res3d = let
           df = DataFrame(x1=UInt8[], x2=UInt8[], x3=UInt8[])
           ru = Randu(11)
           for _ in 1:ceil(Int, 2^29 / 3)
               push!(df, trunc.(UInt8, 16 .* (rand(ru), rand(ru), rand(ru))))
           end
           df
       end
 52.469085 seconds (1.25 G allocations: 18.675 GiB, 3.29% gc time)
178956971×3 DataFrame
       Row │ x1     x2     x3
           │ UInt8  UInt8  UInt8
───────────┼─────────────────────
         1 │     0      0      0
         2 │     0      2      7
         3 │    11     13     13
         4 │     1      0      9
         5 │     3     14      3
         6 │     8      2     14
         7 │     2     15      5
     ⋮     │   ⋮      ⋮      ⋮
 178956965 │     4      5      3
 178956966 │     4     12      0
 178956967 │    15      8     11
 178956968 │     6     11     11
 178956969 │    15     10     15
 178956970 │    15      3      6
 178956971 │     7      0      0
           178956957 rows omitted

julia> @time agg3d = @pipe res3d |>
                     groupby(_, [:x1, :x2, :x3]) |>
                     combine(_, nrow) |>
                     sort(_, :nrow, rev=true)
  5.939132 seconds (216 allocations: 4.667 GiB, 0.78% gc time)
3840×4 DataFrame
  Row │ x1     x2     x3     nrow
      │ UInt8  UInt8  UInt8  Int64
──────┼────────────────────────────
    1 │     5      1      7  78360
    2 │     6      7      3  78300
    3 │     3     12     12  78268
    4 │     7     13     12  78215
    5 │     2      5     11  78201
    6 │     0      2      9  78201
    7 │     5      2     13  78199
  ⋮   │   ⋮      ⋮      ⋮      ⋮
 3834 │    10      2      9   6339
 3835 │    12     15      3   6336
 3836 │    10      0     13   6332
 3837 │    11      3      4   6331
 3838 │     7      3      8   6329
 3839 │     6      3      3   6313
 3840 │     3      0     12   6299
                  3826 rows omitted

julia> nrow(agg3d), 16^3
(3840, 4096)

julia> nrow(res3d) / 16^3
43690.666748046875

What do we see here. We used a bit over \(2^{29}\) random numbers (so we went
through the whole cycle of the generator). We got 3840 non-empty bins out of a
total of 4096 possible bins. Also we can see that as the average bin size is
around 43691 we have bins that are almost twice as populated and bins that
contain very small number of values. An extremely unlikely situation as we can
see with this test:

julia> bins = zeros(Int, 16^3);

julia> copyto!(bins, agg3d.nrow);

julia> pvalue(ChisqTest(bins))
0.0

The reason is obvious – we have points lying on fifteen planes and using
\(16\times16\times16\) grid we are very often hitting the holes between the
planes (which also means that to balance this sometimes we are hitting overly
dense regions of points by other bins). This clearly shows that using RANDU to
perform e.g. integration using MonteCarlo simulation in higher dimensions is not
advisable.

Let us give a simple example of this. We want to integrate a function:

\[f(\mathbf(x)) = 20^n \exp\left(-20\sum_{i=1}^nx_i\right),\]

where \(n\) is a number of dimensions and will range from \(1\) to \(4\).
We want to calculate its integral over \([0,1]^n\). Analytically we know it is
equal to \((1-\exp(-20))^n\), so it is almost \(1\).

Let us check the results:

julia> let ru = Randu(111)
       for n in 1:4
           int = mean(f(rand(ru, n)) for _ in 1:10^8)
           @show n, int
       end
       end
(n, int) = (1, 0.9999974000942274)
(n, int) = (2, 1.0002307988280261)
(n, int) = (3, 1.3654740670592842)
(n, int) = (4, 2.355245785641749)

Note how bad it becomes in dimension higher than 2 (I omit testing statistical
significance of the deviation for \(10^8\) samples and encourage you to perform
it as an exercise).

We can compare it to MersenneTwister results:

julia> let mt = MersenneTwister(111)
       for n in 1:4
           int = mean(f(rand(mt, n)) for _ in 1:10^8)
           @show n, int
       end
       end
(n, int) = (1, 0.9994916961823186)
(n, int) = (2, 1.000145832437786)
(n, int) = (3, 0.9932430835430406)
(n, int) = (4, 0.9993585894631343)

which produces following what would normally be expected.

Concluding remarks

I hope this post highlighted typical problems the pseudo random number
generators may posses, and the practical consequences of these issues. I also
tried to show how nicely such an analysis can be performed using the Julia
language and the packages from the JuliaData ecosystem.

I encourage you to play around with different pseudo random number generators
and check if they pass or fail the tests I have presented. If you want to learn
more about testing pseudo random number generators then this article
is a good place to start.

Happy New Year!

DataFrames.jl minilanguage explained

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/24/minilanguage.html

Introduction

As it is end-of-year I thought of writing a longer post that would be a useful
reference to DataFrames.jl users.

In DataFrames.jl we have five functions that can be used to perform
transformations of columns of a data frame:

  • combine: create a new DataFrame populated with columns that are results of
    transformations applied to the source data frame columns;
  • select: create a new DataFrame that has the same number of rows as the
    source data frame populated with columns that are results of transformations
    applied to the source data frame columns; (the exception to the above number
    of rows invariant is select(df) which produces an empty data frame);
  • select!: the same as select but updates the passed data frame in place;
  • transform: the same as select but keeps the columns that were already
    present in the data frame (note though that these columns can be potentially
    modified by the transformations passed to transform);
  • transform!: the same as transform but updates the passed data frame in
    place.

The same functions also work with GroupedDataFrame with the difference
that the transformations are applied to groups and then combined. Here, an
important distinction is that combine again allows transformations to produce
any number of rows and they are combined in the order of groups in the
GroupedDataFrame. On the other hand select, select!, transform and
transform! require transformations to produce the same number of rows for each
group as the source group and produce a result that has the same row order
as the parent data frame of GroupedDataFrame passed. This rule has two
important implications:

  • it is not allowed to perform select, select!, transform and
    transform! operations on a GroupedDataFrame whose groups do not cover all
    rows of the parent data frame;
  • select, select!, transform and transform!, contrary to combine,
    ignore the order of groups in the GroupedDataFrame.

In this post I want to explain what I mean when I write transformations.
These transformations follow a so-called DataFrames.jl minilanguage that is
largely based on using => operator. You can find a specification of this
minilanguage here. However, because the rules have to be precise they
are relatively hard to read. Therefore in this post I will introduce them by
example. I will give all the examples with data frames as a source, but as noted
above they naturally extend to GroupedDataFrame case so I will add the
examples for GroupedDataFrame only in cases when it is especially relevant (in
order to read these parts of the post please earlier check out how groupby
function works in DataFrames.jl).

This post is written under Julia 1.5.3 and DataFrames.jl 0.22.2.

The data frame that we are going to use in our examples is defined follows
(I sampled some random data to populate it, with the exception of one name
to celebrate this recent masterpiece):

julia> df = DataFrame(id = 1:6,
                      name = ["Aaron Aardvark", "Belen Barboza",
                              "春 陈", "Даниил Дубов",
                              "Elżbieta Elbląg", "Felipe Fittipaldi"],
                      age = [50, 45, 40, 35, 30, 25],
                      eye = ["blue", "brown", "hazel", "blue", "green", "brown"],
                      grade_1 = [95, 90, 85, 90, 95, 90],
                      grade_2 = [75, 80, 65, 90, 75, 95],
                      grade_3 = [85, 85, 90, 85, 80, 85])
6×7 DataFrame
 Row │ id     name               age    eye     grade_1  grade_2  grade_3
     │ Int64  String             Int64  String  Int64    Int64    Int64
─────┼────────────────────────────────────────────────────────────────────
   1 │     1  Aaron Aardvark        50  blue         95       75       85
   2 │     2  Belen Barboza         45  brown        90       80       85
   3 │     3  春 陈                 40  hazel        85       65       90
   4 │     4  Даниил Дубов          35  blue         90       90       85
   5 │     5  Elżbieta Elbląg       30  green        95       75       80
   6 │     6  Felipe Fittipaldi     25  brown        90       95       85

Column selection

These are the most simple operations that the discussed functions can perform.
In this case the result of combine and select are the same (using
transform is not very useful here, as it retains all the columns anyway).

The allowed column selectors are: column number, column name, a regular
expression, Not, Between, Cols, All(), and :. In order to select a
column or a set of columns you just pass them as arguments. Here is an example
of a simple column selection:

julia> select(df, r"grade", :)
6×7 DataFrame
 Row │ grade_1  grade_2  grade_3  id     name               age    eye
     │ Int64    Int64    Int64    Int64  String             Int64  String
─────┼────────────────────────────────────────────────────────────────────
   1 │      95       75       85      1  Aaron Aardvark        50  blue
   2 │      90       80       85      2  Belen Barboza         45  brown
   3 │      85       65       90      3  春 陈                 40  hazel
   4 │      90       90       85      4  Даниил Дубов          35  blue
   5 │      95       75       80      5  Elżbieta Elbląg       30  green
   6 │      90       95       85      6  Felipe Fittipaldi     25  brown

Above, we use a common pattern showing how one can move columns to the front of
the data frame. The r"grade" regular expression picks all columns that contain
"grade" string in their name, then : picks all the remaining columns.

An important rule here is that if we pass several column selectors that pick
multiple columns, as you can see above, it is allowed that they select the same
columns and they are included in the target data frame using first-in-first-out
policy. However, it is not allowed to select the same column twice using single
column selectors, so this works:

julia> select(df, "name", [2])
6×1 DataFrame
 Row │ name
     │ String
─────┼───────────────────
   1 │ Aaron Aardvark
   2 │ Belen Barboza
   3 │ 春 陈
   4 │ Даниил Дубов
   5 │ Elżbieta Elbląg
   6 │ Felipe Fittipaldi

but this fails:

julia> select(df, "name", 2)
ERROR: ArgumentError: duplicate output column name: :name

The same rule applies also to more advanced transformations that I cover below
(i.e. duplicates are allowed only in plain column selectors picking multiple
columns).

Basic transformations

The simplest way to specify a transformation is

source_column => transformation => target_column_name

In this scenario the source_column is passed as an argument to
transformation and stored in target_column_name column.

Here is an example:

julia> using Statistics

julia> select(df, :age => mean => :meanage)
6×1 DataFrame
 Row │ meanage
     │ Float64
─────┼─────────
   1 │    37.5
   2 │    37.5
   3 │    37.5
   4 │    37.5
   5 │    37.5
   6 │    37.5

julia> combine(df, :age => mean => :meanage)
1×1 DataFrame
 Row │ meanage
     │ Float64
─────┼─────────
   1 │    37.5

Observe that because select produces as many rows in the produced data frame
as there are rows in the source data frame, a single value is repeated
accordingly. This is not the case for combine. However, if other columns
in combine would produce multiple rows the repetition also happens:

julia> combine(df, :age => mean => :meanage, :eye => unique => :eye)
4×2 DataFrame
 Row │ meanage  eye
     │ Float64  String
─────┼─────────────────
   1 │    37.5  blue
   2 │    37.5  brown
   3 │    37.5  hazel
   4 │    37.5  green

Note, however, that it is not allowed to return vectors of different lengths in
different transformations:

julia> combine(df, :age, :eye => unique => :eye)
ERROR: ArgumentError: New columns must have the same length as old columns

Let us discuss one more example, this case using GroupedDataFrame. As you
can see below vectors get expanded into multiple columns by default, e.g.:

julia> combine(groupby(df, :eye),
               :name => (x -> identity(x)) => :name,
               :age => mean => :age)
6×3 DataFrame
 Row │ eye     name               age
     │ String  String             Float64
─────┼────────────────────────────────────
   1 │ blue    Aaron Aardvark        42.5
   2 │ blue    Даниил Дубов          42.5
   3 │ brown   Belen Barboza         35.0
   4 │ brown   Felipe Fittipaldi     35.0
   5 │ hazel   春 陈                 40.0
   6 │ green   Elżbieta Elbląg       30.0

(as a side note observe that I used (x -> identity(x)) transformation although
in this case it would be enough just to write :name; the reason is that I
wanted to highlight that if you use an anonymous function in the transformation
definition you have to wrap it in ( and ) as shown above; otherwise the
expression is parsed by Julia in an unexpected way due to the => operator
precedence level)

However, in some cases it would be more natural not to expand the :name
column into multiple rows. You can easily achieve it by protecting the value
returned by the transformation using Ref (just as in broadcasting):

julia> combine(groupby(df, :eye),
                      :name => (x -> Ref(identity(x))) => :name,
                      :age => mean => :age)
4×3 DataFrame
 Row │ eye     name                               age
     │ String  SubArray…                          Float64
─────┼────────────────────────────────────────────────────
   1 │ blue    ["Aaron Aardvark", "Даниил Дубов…     42.5
   2 │ brown   ["Belen Barboza", "Felipe Fittip…     35.0
   3 │ hazel   ["春 陈"]                             40.0
   4 │ green   ["Elżbieta Elbląg"]                   30.0

In this case the vectors produced by our transformation are kept in a single row
of the produced data frame. Note that in this case we also could have just
written:

julia> combine(groupby(df, :eye), :name => Ref => :name, :age => mean => :age)
4×3 DataFrame
 Row │ eye     name                               age
     │ String  SubArray…                          Float64
─────┼────────────────────────────────────────────────────
   1 │ blue    ["Aaron Aardvark", "Даниил Дубов…     42.5
   2 │ brown   ["Belen Barboza", "Felipe Fittip…     35.0
   3 │ hazel   ["春 陈"]                             40.0
   4 │ green   ["Elżbieta Elbląg"]                   30.0

as we were applying identity transformation to our data, which can be skipped.

Often you want to apply some function not to the whole column of a data frame,
but rather to its individual elements. Normally one would achieve this using
broadcasting like this:

julia> select(df, :name => (x -> uppercase.(x)) => :NAME)
6×1 DataFrame
 Row │ NAME
     │ String
─────┼───────────────────
   1 │ AARON AARDVARK
   2 │ BELEN BARBOZA
   3 │ 春 陈
   4 │ ДАНИИЛ ДУБОВ
   5 │ ELŻBIETA ELBLĄG
   6 │ FELIPE FITTIPALDI

This pattern is encountered very often in practice, therefore there is a ByRow
convenience wrapper for a function that creates its broadcasted variant:

julia> select(df, :name => ByRow(uppercase) => :NAME)
6×1 DataFrame
 Row │ NAME
     │ String
─────┼───────────────────
   1 │ AARON AARDVARK
   2 │ BELEN BARBOZA
   3 │ 春 陈
   4 │ ДАНИИЛ ДУБОВ
   5 │ ELŻBIETA ELBLĄG
   6 │ FELIPE FITTIPALDI

One can skip specifying a target column name, in which case it is generated
automatically by suffixing source column name by function name that is applied
to it, e.g.:

julia> select(df, :name => ByRow(uppercase))
6×1 DataFrame
 Row │ name_uppercase
     │ String
─────┼───────────────────
   1 │ AARON AARDVARK
   2 │ BELEN BARBOZA
   3 │ 春 陈
   4 │ ДАНИИЛ ДУБОВ
   5 │ ELŻBIETA ELBLĄG
   6 │ FELIPE FITTIPALDI

If you want to avoid this then pass renemecols=false keyword argument (this
is mostly useful if you perform transformations):

julia> select(df,
              :grade_1 => ByRow(x -> x / 100),
              :grade_2 => ByRow(x -> x / 100),
              :grade_3 => ByRow(x -> x / 100),
              renamecols=false)
6×3 DataFrame
 Row │ grade_1  grade_2  grade_3
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │    0.95     0.75     0.85
   2 │    0.9      0.8      0.85
   3 │    0.85     0.65     0.9
   4 │    0.9      0.9      0.85
   5 │    0.95     0.75     0.8
   6 │    0.9      0.95     0.85

Similarly to skipping target column name you can skip the transformation part of
the pattern we discussed, in which case it is assumed to be the identity
function. In effect we get a way to rename columns in transformations:

julia> select(df, r"grade", :grade_1 => :g1, :grade_2 => :g2, :grade_3 => :g3)
6×6 DataFrame
 Row │ grade_1  grade_2  grade_3  g1     g2     g3
     │ Int64    Int64    Int64    Int64  Int64  Int64
─────┼────────────────────────────────────────────────
   1 │      95       75       85     95     75     85
   2 │      90       80       85     90     80     85
   3 │      85       65       90     85     65     90
   4 │      90       90       85     90     90     85
   5 │      95       75       80     95     75     80
   6 │      90       95       85     90     95     85

If you want to perform multiple column transformations that have a similar
structure (as in the example above) you can pass a vector of transformations:

julia> select(df,
              [x => ByRow(x -> x / 100) for x in [:grade_1, :grade_2, :grade_3]],
              renamecols=false)
6×3 DataFrame
 Row │ grade_1  grade_2  grade_3
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │    0.95     0.75     0.85
   2 │    0.9      0.8      0.85
   3 │    0.85     0.65     0.9
   4 │    0.9      0.9      0.85
   5 │    0.95     0.75     0.8
   6 │    0.9      0.95     0.85

or even shorter using broadcasting and taking advantage of names function:

julia> select(df, names(df, r"grade") .=> ByRow(x -> x / 100), renamecols=false)
6×3 DataFrame
 Row │ grade_1  grade_2  grade_3
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │    0.95     0.75     0.85
   2 │    0.9      0.8      0.85
   3 │    0.85     0.65     0.9
   4 │    0.9      0.9      0.85
   5 │    0.95     0.75     0.8
   6 │    0.9      0.95     0.85

You now know almost all about single column selection. The only thing left to
learn is that nrow function has a special treatment. It does not require
passing source column name (but allows specifying of target column name; the
default name is :nrow) and produces number of rows in the passed data frame.
Here is an example of both:

julia> combine(groupby(df, :eye), nrow, nrow => :count)
4×3 DataFrame
 Row │ eye     nrow   count
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ blue        2      2
   2 │ brown       2      2
   3 │ hazel       1      1
   4 │ green       1      1

That was a lot of information. Probably this is a good moment to take a short
break.

Next we move on to the cases when there are multiple source columns or multiple
columns are produced as a result of the transformation.

Multiple source columns

The simplest way to pass multiple columns to a transformation is to pass their
list in a vector. In this case these columns are passed as consecutive
positional arguments to the function. Here is an example (also observe how
automatic column naming is done in this case):

julia> combine(df,
               [:age, :grade_1] => cor,
               [:age, :grade_2] => cor,
               [:age, :grade_3] => cor)
1×3 DataFrame
 Row │ age_grade_1_cor  age_grade_2_cor  age_grade_3_cor
     │ Float64          Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │       0.0710072        -0.536745         0.338062

Alternatively you can pass the columns as a single argument being a
NamedTuple, in order to achieve this wrap the list of passed columns in
AsTable. Here is a simple example:

julia> combine(df, :grade_1, :grade_2,
               AsTable([:grade_1, :grade_2]) => ByRow(x -> x.grade_1 > x.grade_2))
6×3 DataFrame
 Row │ grade_1  grade_2  grade_1_grade_2_function
     │ Int64    Int64    Bool
─────┼────────────────────────────────────────────
   1 │      95       75                      true
   2 │      90       80                      true
   3 │      85       65                      true
   4 │      90       90                     false
   5 │      95       75                      true
   6 │      90       95                     false

Let us see these two rules at play in a common task of finding a sum of several
columns:

julia> select(df, names(df, r"grade") => +, AsTable(names(df, r"grade")) => sum)
6×2 DataFrame
 Row │ grade_1_grade_2_grade_3_+  grade_1_grade_2_grade_3_sum
     │ Int64                      Int64
─────┼────────────────────────────────────────────────────────
   1 │                       255                          255
   2 │                       255                          255
   3 │                       240                          240
   4 │                       265                          265
   5 │                       250                          250
   6 │                       270                          270

Multiple target columns

Normally all the transformation functions assume that a single column is
returned from a transformation function. Here is a more advanced example:

julia> select(df,
              :name => ByRow(x -> (; ([:firsname, :lastname] .=> split(x))...)))
6×1 DataFrame
 Row │ name_function
     │ NamedTuple…
─────┼───────────────────────────────────
   1 │ (firsname = "Aaron", lastname = …
   2 │ (firsname = "Belen", lastname = …
   3 │ (firsname = "春", lastname = "陈…
   4 │ (firsname = "Даниил", lastname =…
   5 │ (firsname = "Elżbieta", lastname…
   6 │ (firsname = "Felipe", lastname =…

In the above code we have used the following pattern that is a convenient way
of programmatic creation of NamedTuples:

julia> (; :a => 1, :b => 2)
(a = 1, b = 2)

julia> (; [:a => 1, :b => 2]...)
(a = 1, b = 2)

However, it is natural to ask if it is possible to produce the separate
:firstname and :lastname columns in the data frame. You can do it by
adding AsTable as target column names specifier:

julia> select(df, :name =>
                  ByRow(x -> (; ([:firsname, :lastname] .=> split(x))...)) =>
                  AsTable)
6×2 DataFrame
 Row │ firsname   lastname
     │ SubStrin…  SubStrin…
─────┼───────────────────────
   1 │ Aaron      Aardvark
   2 │ Belen      Barboza
   3 │ 春         陈
   4 │ Даниил     Дубов
   5 │ Elżbieta   Elbląg
   6 │ Felipe     Fittipaldi

In general it is not required that one has to produce a vector of NamedTuples
to get this result. It is enough to have any standard iterable (e.g. a vector
or a Tuple):

julia> select(df, :name => ByRow(split) => AsTable)
6×2 DataFrame
 Row │ x1         x2
     │ SubStrin…  SubStrin…
─────┼───────────────────────
   1 │ Aaron      Aardvark
   2 │ Belen      Barboza
   3 │ 春         陈
   4 │ Даниил     Дубов
   5 │ Elżbieta   Elbląg
   6 │ Felipe     Fittipaldi

Note that as split produces a vector (without column names) the names are
generated automatically as :x1 and :x2. You can override this behavior by
specifying the target column names explicitly:

julia> select(df, :name => ByRow(split) => [:firsname, :lastname])
6×2 DataFrame
 Row │ firsname   lastname
     │ SubStrin…  SubStrin…
─────┼───────────────────────
   1 │ Aaron      Aardvark
   2 │ Belen      Barboza
   3 │ 春         陈
   4 │ Даниил     Дубов
   5 │ Elżbieta   Elbląg
   6 │ Felipe     Fittipaldi

Several values that can be returned by a transformation are treated to produce
multiple columns by default. Therefore they are not allowed to be returned
from a function unless AsTable or multiple target column names are specified.

Here is an example:

julia> combine(df, :age => x -> (age=x, age2 = x.^2))
ERROR: ArgumentError: Table returned but a single output column was expected

julia> combine(df, :age => (x -> (age=x, age2 = x.^2)) => AsTable)
6×2 DataFrame
 Row │ age    age2
     │ Int64  Int64
─────┼──────────────
   1 │    50   2500
   2 │    45   2025
   3 │    40   1600
   4 │    35   1225
   5 │    30    900
   6 │    25    625

The list of return value types treated in this way consists of four elements:

  • AbstractDataFrame,
  • DataFrameRow,
  • AbstractMatrix,
  • NamedTuple.

In the example above we returned a NamedTuple. Note, however, that returning
a NamedTuple is not the same as returning a vector of NamedTuples (as we
did in the topmost example in this section) as this is allowed.

For the purposes of detection of column names conflicts multiple columns
returned are treated as a series of single returned columns (so duplicates are
not allowed):

julia> select(df, :name => ByRow(split) => [:name, :lastname], :name)
ERROR: ArgumentError: duplicate output column name: :name

Just for completeness of discussion note that this works:

julia> select(df, :name => ByRow(split) => [:name, :lastname], [:name])
6×2 DataFrame
 Row │ name       lastname
     │ SubStrin…  SubStrin…
─────┼───────────────────────
   1 │ Aaron      Aardvark
   2 │ Belen      Barboza
   3 │ 春         陈
   4 │ Даниил     Дубов
   5 │ Elżbieta   Elbląg
   6 │ Felipe     Fittipaldi

as [:names] is considered to be a multiple column selector (thus it is not
producing a duplicate and is just ignored in this case).

Traditional transformation style

The traditional transformation style in the combine function was to pass a
transformation as a function (without using the => minilanguage). This style
is also allowed currently in all transformation functions. Additionally if you
want to use this approach you can pass the function as a first argument of the
function (which is often convenient in combination with do block style).

In this traditional style we are not allowed to specify source columns.
Therefore such a function always takes a data frame (it is most useful with
GroupedDataFrame where the data frame is representing a given group of the
source data frame)

Similarly traditional style does not allow specifying target column names.
Therefore the following defaults are taken:

  • if AbstractDataFrame, DataFrameRow, AbstractMatrix, or NamedTuple is
    returned then it is assumed to produce multiple columns (in the case of
    NamedTuple it must contain only vectors or only single values, as if they
    are mixed an error is thrown).
  • all other return values are treated as a single column (where returning a
    vector produces multiple rows and returning any other value produces one row).

Here are examples of this style:

julia> combine(groupby(df, :eye)) do sdf
           return mean(sdf.age)
       end
4×2 DataFrame
 Row │ eye     x1
     │ String  Float64
─────┼─────────────────
   1 │ blue       42.5
   2 │ brown      35.0
   3 │ hazel      40.0
   4 │ green      30.0

julia> combine(groupby(df, :eye), sdf -> mean(sdf.age))
4×2 DataFrame
 Row │ eye     x1
     │ String  Float64
─────┼─────────────────
   1 │ blue       42.5
   2 │ brown      35.0
   3 │ hazel      40.0
   4 │ green      30.0

julia> combine(groupby(df, :eye),
               sdf -> (count = nrow(sdf),
                       mean_age = mean(sdf.age),
                       std_age = std(sdf.age)))
4×4 DataFrame
 Row │ eye     count  mean_age  std_age
     │ String  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ blue        2      42.5   10.6066
   2 │ brown       2      35.0   14.1421
   3 │ hazel       1      40.0  NaN
   4 │ green       1      30.0  NaN

Special treatment of grouping columns

As a special rule for processing of GroupedDataFrame object is that
it is not allow to change the values stored in the grouping columns as they
are kept by default.

Here is a simple example:

julia> combine(groupby(df, :eye), :eye => ByRow(uppercase) => :eye)
ERROR: ArgumentError: column :eye in returned data frame is not equal to grouping key :eye

However, you are allowed to do this if you drop the grouping columns by passing
keepkeys=false keyword argument:

julia> transform(groupby(df, :eye),
                 :eye => ByRow(uppercase) => :eye,
                 keepkeys=false)
6×7 DataFrame
 Row │ id     name               age    eye     grade_1  grade_2  grade_3
     │ Int64  String             Int64  String  Int64    Int64    Int64
─────┼────────────────────────────────────────────────────────────────────
   1 │     1  Aaron Aardvark        50  BLUE         95       75       85
   2 │     2  Belen Barboza         45  BROWN        90       80       85
   3 │     3  春 陈                 40  HAZEL        85       65       90
   4 │     4  Даниил Дубов          35  BLUE         90       90       85
   5 │     5  Elżbieta Elbląg       30  GREEN        95       75       80
   6 │     6  Felipe Fittipaldi     25  BROWN        90       95       85

(note that for transform the key columns would be retained in the produced
data frame even with keepkeys=false; in this case this keyword argument only
influences the fact if we check that key columns have not changed in this case)

Conclusions

This was long. As a conclusion let me comment that all the above-mentioned
styles can be freely combined in a single transformation call:

julia> combine(groupby(df, :eye),
               r"grade", names(df, r"grade") => (+) => :total_grade,
               sdf -> nrow(sdf), nrow, :age => :AGE)
6×8 DataFrame
 Row │ eye     grade_1  grade_2  grade_3  total_grade  x1     nrow   AGE
     │ String  Int64    Int64    Int64    Int64        Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────
   1 │ blue         95       75       85          255      2      2     50
   2 │ blue         90       90       85          265      2      2     35
   3 │ brown        90       80       85          255      2      2     45
   4 │ brown        90       95       85          270      2      2     25
   5 │ hazel        85       65       90          240      1      1     40
   6 │ green        95       75       80          250      1      1     30

If you got here then congratulations – you have mastered the DataFrames.jl
transformation minilanguage.