An update of my tutorial on DataFrames

By: Bogumił Kamiński

Re-posted from: http://juliasnippets.blogspot.com/2017/12/an-update-of-my-tutorial-on-dataframes.html

Since my last post about my tutorial on DataFrames I have switched to using them a lot more in my daily Julia workflow. Based on this experience I have added three sections to it:

  1. performance recommendations;
  2. possible pitfalls when using DataFrames;
  3. useful packages, currently FreqTables and DataFramesMeta.
All three sections would benefit from the experience of the community, so if you have a comment please open an issue or PR at https://github.com/bkamins/Julia-DataFrames-Tutorial.
As for useful packages, I had tried to stick to raw DataFrames to reduce the dependencies of my code, but FreqTables and DataFramesMeta helped me a lot and did not add much mental overhead.
I would especially like to recommend FreqTables – a small but really useful package. I would say it deserves much more attention from the community than it gets (judging by its number of stars). So let me write a bit about it.
There are three reasons I like it:
  1. I simply make contingency tables almost all the time; previously I used countmap from StatsBase a lot, but it returns a dictionary, which is not very handy; freqtable returns a much nicer result (e.g. it is sorted if possible) and supports more than one dimension;
  2. with freqtable I can work on vectors or data frames; it nicely handles missings and allows for weighting;
  3. freqtable is faster than countmap (I was surprised to learn this; maybe not critical, but a nice plus).
So how does the output from freqtable look? Here is a sampler:

julia> using DataFrames, FreqTables

julia> srand(1); df = DataFrame(rand(1:3, 10, 2))
10×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
├─────┼────┼────┤
│ 1   │ 3  │ 2  │
│ 2   │ 3  │ 1  │
│ 3   │ 3  │ 1  │
│ 4   │ 3  │ 2  │
│ 5   │ 1  │ 2  │
│ 6   │ 1  │ 2  │
│ 7   │ 1  │ 3  │
│ 8   │ 2  │ 3  │
│ 9   │ 1  │ 1  │
│ 10  │ 1  │ 2  │

julia> freqtable(df, :x1, :x2)
3×3 Named Array{Int64,2}
x1 ╲ x2 │ 1  2  3
────────┼────────
1       │ 1  3  1
2       │ 0  0  1
3       │ 2  2  0
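To make the missing handling and weighting mentioned above concrete, here is a minimal sketch; the data and weights are made up for illustration, and the keyword usage reflects my understanding of the package's API:

```julia
using FreqTables

x = ["a", "b", missing, "a", "b", "b"]

# missings get their own level in the resulting table
freqtable(x)

# observation weights are passed via the weights keyword argument
w = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
freqtable(x, weights=w)
```

The weighted variant sums the weights of observations per level instead of counting them, which is handy for survey-style data.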

And now a simple benchmark against countmap:

julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools

julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);

julia> @benchmark freqtable($x)
BenchmarkTools.Trial:
  memory estimate:  25.89 KiB
  allocs estimate:  83
  --------------
  minimum time:     24.246 ms (0.00% GC)
  median time:      24.672 ms (0.00% GC)
  mean time:        25.425 ms (0.00% GC)
  maximum time:     39.739 ms (0.00% GC)
  --------------
  samples:          197
  evals/sample:     1

julia> @benchmark countmap($x)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     42.230 ms (0.00% GC)
  median time:      42.813 ms (0.00% GC)
  mean time:        43.110 ms (0.00% GC)
  maximum time:     46.244 ms (0.00% GC)
  --------------
  samples:          116
  evals/sample:     1

julia> @benchmark freqtable($y)
BenchmarkTools.Trial:
  memory estimate:  10.16 KiB
  allocs estimate:  76
  --------------
  minimum time:     1.064 ms (0.00% GC)
  median time:      1.112 ms (0.00% GC)
  mean time:        1.129 ms (0.09% GC)
  maximum time:     3.485 ms (66.72% GC)
  --------------
  samples:          4403
  evals/sample:     1

julia> @benchmark countmap($y)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     87.141 ms (0.00% GC)
  median time:      88.167 ms (0.00% GC)
  mean time:        88.510 ms (0.00% GC)
  maximum time:     92.177 ms (0.00% GC)
  --------------
  samples:          57
  evals/sample:     1

julia> @benchmark freqtable($z)
BenchmarkTools.Trial:
  memory estimate:  45.81 MiB
  allocs estimate:  2000285
  --------------
  minimum time:     75.712 ms (3.94% GC)
  median time:      77.057 ms (3.94% GC)
  mean time:        77.346 ms (4.16% GC)
  maximum time:     83.298 ms (3.35% GC)
  --------------
  samples:          65
  evals/sample:     1

julia> @benchmark countmap($z)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     81.931 ms (0.00% GC)
  median time:      83.128 ms (0.00% GC)
  mean time:        83.472 ms (0.00% GC)
  maximum time:     89.977 ms (0.00% GC)
  --------------
  samples:          60
  evals/sample:     1

As you can see, freqtable does a really good job on different types of inputs.
There is also a third way to do something similar using by from DataFrames, which is also quite fast but messier. freqtable is more specialized – it does one job, but does it well.
Here are the benchmarks of by:

julia> @benchmark by(DataFrame(x = $x), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  --------------
  minimum time:     28.946 ms (1.56% GC)
  median time:      34.440 ms (14.82% GC)
  mean time:        34.291 ms (14.80% GC)
  maximum time:     41.079 ms (20.70% GC)
  --------------
  samples:          146
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $y), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.92 MiB
  allocs estimate:  6198
  --------------
  minimum time:     44.810 ms (3.52% GC)
  median time:      50.244 ms (10.53% GC)
  mean time:        49.715 ms (10.38% GC)
  maximum time:     56.052 ms (17.67% GC)
  --------------
  samples:          101
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $z), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  --------------
  minimum time:     46.891 ms (0.93% GC)
  median time:      53.657 ms (10.12% GC)
  mean time:        52.539 ms (9.68% GC)
  maximum time:     60.736 ms (16.24% GC)
  --------------
  samples:          96
  evals/sample:     1

A Tutorial on DataFrames in Julia

By: Bogumił Kamiński

Re-posted from: http://juliasnippets.blogspot.com/2017/12/tutorial-on-dataframes-in-julia.html

This time the blog post is more of an announcement :).

After the release of DataFrames 0.11 in Julia I decided that the ecosystem has stabilized enough to add it to my teaching materials.

Some issues and PRs were submitted in the process. Finally, I managed to compile basic notes about using DataFrames. If you are interested, I have published them on GitHub here.

Basics of generating random numbers in Julia

By: Bogumił Kamiński

Re-posted from: http://juliasnippets.blogspot.com/2017/11/basics-of-generating-random-numbers-in.html

Recently there were two questions regarding random number generation in Julia: one in a discussion on Discourse and the other on Stack Overflow.
In this post I have tried to collect the basic information related to those two topics.

How does rand() work?

The default random number generator in Julia uses the Mersenne Twister algorithm. By default a global instance of this generator is used; it can be accessed directly via the variable Base.GLOBAL_RNG. Therefore calling rand() and rand(Base.GLOBAL_RNG) produces identical results.
You can seed the random number generator using the srand function. There are several options for this, but the two most popular are:
  1. Calling srand(), which makes Julia use system entropy to obtain a seed (you can think of this as a random seed);
  2. Calling srand(i) where i is an integer. In this case you will get a reproducible sequence of random numbers.
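A quick sketch of the second option (this uses the srand API of the Julia version current at the time of writing; in later versions it was renamed Random.seed!):

```julia
# seeding with the same integer twice reproduces the same stream
srand(1)
a = rand(5)
srand(1)           # reset the generator to the same state
b = rand(5)
a == b             # true: the two draws are identical
```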

An important detail of rand() is that it produces a value in the interval [0, 1[ using exactly 52 bits of randomness – it can produce 2⁵² distinct values. To understand why, you must know that internally Julia generates a value in the interval [1, 2[ and then subtracts 1 from it. The reason for this approach is that Float64 has a 52-bit significand, and in the range [1, 2[ all representable floating point values are equally spaced; see e.g. Wikipedia. This ensures that values generated by rand() are uniformly distributed.
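The [1, 2[ trick can be illustrated directly with bit manipulation. This is a hypothetical reconstruction of the idea (rand_from_bits is a made-up helper name), not Julia's actual implementation:

```julia
# Build a Float64 in [1, 2) by placing 52 random bits in the significand
# of 1.0, then subtract 1 to land in [0, 1).
function rand_from_bits(bits::UInt64)
    # OR the low 52 bits into the bit pattern of 1.0 (exponent of 1.0 kept)
    u = reinterpret(UInt64, 1.0) | (bits & 0x000fffffffffffff)
    return reinterpret(Float64, u) - 1.0
end

rand_from_bits(UInt64(0))              # 0.0, the smallest possible value
rand_from_bits(0x000fffffffffffff)     # 1 - 2^-52, the largest possible value
```

Every output is of the form k/2⁵² for an integer 0 ≤ k < 2⁵², which is exactly the equal spacing the paragraph above describes.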

Generating something other than one float from the [0, 1[ interval

There are many options here; please refer to the Julia Manual for details. What I want to highlight are the consequences of the fact that the underlying random number generator produces only 52 bits of randomness at a time. Below we measure the time to obtain one random sample of Int32 and Int64. Here is the benchmark:

julia> using BenchmarkTools

julia> @btime rand()
  7.464 ns (0 allocations: 0 bytes)
0.5055246442408914

julia> @btime rand(Int32)
  7.464 ns (0 allocations: 0 bytes)
37355051

julia> @btime rand(Int64)
  10.730 ns (0 allocations: 0 bytes)
2952626120715896373

The reason for the difference is that in order to get an Int64 we need to sample twice, because 64 > 52. You might ask whether this matters in practical code. Actually it does, because on most machines running Julia, Int is Int64. Here is an example of sampling a value from a range (my machine is 64-bit):

julia> @btime rand(1:10)
  37.789 ns (0 allocations: 0 bytes)
5

julia> @btime Int(rand(Int32(1):Int32(10)))
  26.126 ns (0 allocations: 0 bytes)
4

If you can accept a bit of inaccuracy in the distribution, you can get more speed when generating from a 1:n range with small n using the following code (I use n=10 as above):

julia> @btime ceil(Int, 10rand())
  15.395 ns (0 allocations: 0 bytes)
6

The discussion of why it is inaccurate can be found on Discourse. Let me give you an extreme example here (the proper value of the evaluated expression is 0.5 in both cases):

julia> mean(ceil(Int64, 2^60 * rand()) % Bool for i in 1:10^6)
0.0

julia> mean(rand(1:2^60) % Bool for i in 1:10^6)
0.500582

We know why the first approach fails – Float64 does not have enough precision and we will always get an even number if we multiply such a value by 2⁶⁰.
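This can be checked without any randomness at all: every value rand() can return has the form k/2⁵², so multiplying by 2⁶⁰ only shifts the exponent and always yields k·2⁸, an even integer:

```julia
# every rand() value is k / 2^52 for an integer 0 <= k < 2^52;
# scaling by 2^60 only changes the exponent, giving exactly k * 2^8
for k in (1, 3, 5, 2^52 - 1)
    v = 2^60 * (k / 2^52)
    @assert v == k * 2^8        # exact: a multiple of 256, hence even
    @assert iseven(Int64(v))
end
```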

Here is a less extreme example, multiplying by something smaller than 2⁵² (the correct result is 1):

julia> mean(rand(1:(2^50+2^49)) % 3 for i in 1:10^7)
1.0000001

julia> mean(ceil(Int64, (2^50+2^49) * rand()) % 3 for i in 1:10^7)
0.9586021

And again we see a bias (although not as large).

Using multiple random number generators

Sometimes a single source of pseudorandomness is enough; in such cases Base.GLOBAL_RNG suffices. However, in many settings (like the one discussed on Stack Overflow) we need more than one independent random number generator. The usual approach is to generate seeds for consecutive random number generators in a way that ensures the sequences they produce do not overlap. A detailed discussion of the topic can be found in the paper Efficient Jump Ahead for F₂-Linear Random Number Generators.

In Julia you can create such independent generators using the randjump function. You give it the number of generators you require, and they are guaranteed to use seeds corresponding to random number generator states separated by 10²⁰ steps.

There is one good practice if you decide to use randjump: call it only once to produce the required generators. Consecutive uses of randjump based on the same random number generator can give surprising results. To understand the reason, consider the following example:

julia> srand(1)
MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[1749029653, 1072851681, 1610647787, 1072862326, 1841712345, 1073426746, -198061126, 1073322060, -156153802, 1073567984  …  1977574422, 1073209915, 278919868, 1072835605, 1290372147, 18858467, 1815133874, -1716870370, 382, 0]), [1.23603, 1.34652, 1.31271, 1.00791, 1.48861, 1.21097, 1.95192, 1.9999, 1.25166, 1.98667  …  1.1489, 1.77623, 1.16774, 1.00214, 1.46738, 1.03244, 1.36938, 1.75271, 1.43027, 1.71293], 382)

julia> rand(100);

julia> x = randjump(Base.GLOBAL_RNG, 2)
2-element Array{MersenneTwister,1}:
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[-423222098, 1072940746, 1823958146, 1073056597, 94617959, 1073021145, 2081944769, 1072701541, -1344696523, 1073205595  …  -1005791883, 1073144418, 24484970, 1073440808, 1370926729, 1336278534, -1527371338, -19485865, 382, 0]), [1.23603, 1.34652, 1.31271, 1.00791, 1.48861, 1.21097, 1.95192, 1.9999, 1.25166, 1.98667  …
  1.1489, 1.77623, 1.16774, 1.00214, 1.46738, 1.03244, 1.36938, 1.75271, 1.43027, 1.71293], 100)
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[2095040610, 1073104849, 93118390, 1073625566, 1967429533, 1073003175, -889983042, 1073278562, 29166119, 1073602896  …  1102161927, 1072737217, -1901932835, 1073559995, 1843649418, -1848103721, -1079494630, -1219397333, 382, 0]), [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 382)

julia> rand();

julia> y = randjump(Base.GLOBAL_RNG, 2)
2-element Array{MersenneTwister,1}:
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[-423222098, 1072940746, 1823958146, 1073056597, 94617959, 1073021145, 2081944769, 1072701541, -1344696523, 1073205595  …  -1005791883, 1073144418, 24484970, 1073440808, 1370926729, 1336278534, -1527371338, -19485865, 382, 0]), [1.23603, 1.34652, 1.31271, 1.00791, 1.48861, 1.21097, 1.95192, 1.9999, 1.25166, 1.98667  …
  1.1489, 1.77623, 1.16774, 1.00214, 1.46738, 1.03244, 1.36938, 1.75271, 1.43027, 1.71293], 101)
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[2095040610, 1073104849, 93118390, 1073625566, 1967429533, 1073003175, -889983042, 1073278562, 29166119, 1073602896  …  1102161927, 1072737217, -1901932835, 1073559995, 1843649418, -1848103721, -1079494630, -1219397333, 382, 0]), [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 382)

julia> x[1] === y[1]
true

julia> x[2] === y[2]
false

julia> x[2] == y[2]
true

What do we learn from it:

  • The first element of the vector of random number generators returned by randjump is simply its first argument – we can see this because x[1] === y[1] is true. No copying or jumping ahead is done.
  • Even though we produced one random sample after creating x and before creating y (so the state of Base.GLOBAL_RNG was not identical), the second elements of x and y have the same state, as can be seen by testing x[2] == y[2] (although they are not the same generator object, since x[2] === y[2] is false). The reason is that rand in Julia generates random numbers in batches of size 382, and calling rand() once in our case did not trigger the generation of the next batch. So even though the state of Base.GLOBAL_RNG differed when x and y were generated, the internal state of the Mersenne Twister generator was the same (this can be checked using Base.GLOBAL_RNG.state).
In short – it is safest not to call randjump several times in the same code.
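The recommended pattern can be sketched as follows (using the randjump API of the Julia version used in this post; later versions moved and changed this function):

```julia
# Call randjump exactly once, up front, and hand each piece of work
# its own generator from the resulting vector.
rngs = randjump(MersenneTwister(1), 4)   # 4 generators, 10^20 steps apart
samples = [rand(rng) for rng in rngs]    # each stream is independent
```

Each subsequent rand(rngs[i]) call advances only that generator, so no two streams interfere with each other.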