An update of my tutorial on DataFrames

By: Bogumił Kamiński

Re-posted from: http://juliasnippets.blogspot.com/2017/12/an-update-of-my-tutorial-on-dataframes.html

Since my last post about my tutorial to DataFrames I had switched to use them a lot more in my daily Julia workflow. Based on the experience I have added three sections to it:

  1. performance recommendations;
  2. possible pitfalls when using DataFrames;
  3. useful packages, currently FreqTables and DataFramesMeta.
Actually all three sections would benefit from the experience of the community, so if you have a comment please make an issue or PR on https://github.com/bkamins/Julia-DataFrames-Tutorial.
As for useful packages I have tried to use raw DataFrames to reduce dependencies of my code, but actually FreqTables and DataFramesMeta helped me a lot and did not give too much mental overhead of things to remember.
I would like to especially recommend FreqTables – a small package, but really useful. I would say that it deserves much more attention from the community than it gets (looking at the number of stars). So let me write a bit about it.
There are three reasons I like it:
  1. simply I make contingency tables almost all the time; previously I have used countmap from StatsBase a lot, but it returns a dictionary which is not very handy; freqtable returns a much nicer result (e.g. if possible it is sorted) and allows for more than one dimension;
  2. with freqtable I can use vectors or work on data frames, it nicely handles missings and allows for weighting;
  3. freqtable is faster than countmap (I was surprised when I learned this, maybe not a critical thing but a nice plus).
So how does the output from freqtable  look? Here is a sampler:

julia> using DataFrames, FreqTables

julia> srand(1); df = DataFrame(rand(1:3, 10, 2))
10×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
├─────┼────┼────┤
│ 1   │ 3  │ 2  │
│ 2   │ 3  │ 1  │
│ 3   │ 3  │ 1  │
│ 4   │ 3  │ 2  │
│ 5   │ 1  │ 2  │
│ 6   │ 1  │ 2  │
│ 7   │ 1  │ 3  │
│ 8   │ 2  │ 3  │
│ 9   │ 1  │ 1  │
│ 10  │ 1  │ 2  │

julia> freqtable(df, :x1, :x2)
3×3 Named Array{Int64,2}
x1 ╲ x2 │ 1  2  3
────────┼────────
1       │ 1  3  1
2       │ 0  0  1
3       │ 2  2  0

And now a simple benchmark against countmap :

julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools

julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);

julia> @benchmark freqtable($x)
BenchmarkTools.Trial:
  memory estimate:  25.89 KiB
  allocs estimate:  83
  ————–
  minimum time:     24.246 ms (0.00% GC)
  median time:      24.672 ms (0.00% GC)
  mean time:        25.425 ms (0.00% GC)
  maximum time:     39.739 ms (0.00% GC)
  ————–
  samples:          197
  evals/sample:     1

julia> @benchmark countmap($x)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  ————–
  minimum time:     42.230 ms (0.00% GC)
  median time:      42.813 ms (0.00% GC)
  mean time:        43.110 ms (0.00% GC)
  maximum time:     46.244 ms (0.00% GC)
  ————–
  samples:          116
  evals/sample:     1

julia> @benchmark freqtable($y)
BenchmarkTools.Trial:
  memory estimate:  10.16 KiB
  allocs estimate:  76
  ————–
  minimum time:     1.064 ms (0.00% GC)
  median time:      1.112 ms (0.00% GC)
  mean time:        1.129 ms (0.09% GC)
  maximum time:     3.485 ms (66.72% GC)
  ————–
  samples:          4403
  evals/sample:     1

julia> @benchmark countmap($y)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  ————–
  minimum time:     87.141 ms (0.00% GC)
  median time:      88.167 ms (0.00% GC)
  mean time:        88.510 ms (0.00% GC)
  maximum time:     92.177 ms (0.00% GC)
  ————–
  samples:          57
  evals/sample:     1

julia> @benchmark freqtable($z)
BenchmarkTools.Trial:
  memory estimate:  45.81 MiB
  allocs estimate:  2000285
  ————–
  minimum time:     75.712 ms (3.94% GC)
  median time:      77.057 ms (3.94% GC)
  mean time:        77.346 ms (4.16% GC)
  maximum time:     83.298 ms (3.35% GC)
  ————–
  samples:          65
  evals/sample:     1

julia> @benchmark countmap($z)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  ————–
  minimum time:     81.931 ms (0.00% GC)
  median time:      83.128 ms (0.00% GC)
  mean time:        83.472 ms (0.00% GC)
  maximum time:     89.977 ms (0.00% GC)
  ————–
  samples:          60
  evals/sample:     1

As you can see freqtable does really a good job on different types of inputs.
Actually there is a third way to do a similar using by form DataFrames  which is also quite fast but it is more messy. freqtable is more specialized – does one job, but does it well.
Here are the benchmarks of by:

julia> @benchmark by(DataFrame(x = $x), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  ————–
  minimum time:     28.946 ms (1.56% GC)
  median time:      34.440 ms (14.82% GC)
  mean time:        34.291 ms (14.80% GC)
  maximum time:     41.079 ms (20.70% GC)
  ————–
  samples:          146
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $y), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.92 MiB
  allocs estimate:  6198
  ————–
  minimum time:     44.810 ms (3.52% GC)
  median time:      50.244 ms (10.53% GC)
  mean time:        49.715 ms (10.38% GC)
  maximum time:     56.052 ms (17.67% GC)
  ————–
  samples:          101
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $z), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  ————–
  minimum time:     46.891 ms (0.93% GC)
  median time:      53.657 ms (10.12% GC)
  mean time:        52.539 ms (9.68% GC)
  maximum time:     60.736 ms (16.24% GC)
  ————–
  samples:          96
  evals/sample:     1