An update of my tutorial on DataFrames

Re-posted from: http://juliasnippets.blogspot.com/2017/12/an-update-of-my-tutorial-on-dataframes.html

Since my last post about my tutorial to DataFrames I had switched to use them a lot more in my daily Julia workflow. Based on the experience I have added three sections to it:

performance recommendations;
possible pitfalls when using DataFrames;
useful packages, currently FreqTables and DataFramesMeta.

Actually all three sections would benefit from the experience of the community, so if you have a comment please make an issue or PR on https://github.com/bkamins/Julia-DataFrames-Tutorial.

As for useful packages I have tried to use raw DataFrames to reduce dependencies of my code, but actually FreqTables and DataFramesMeta helped me a lot and did not give too much mental overhead of things to remember.

I would like to especially recommend FreqTables – a small package, but really useful. I would say that it deserves much more attention from the community than it gets (looking at the number of stars). So let me write a bit about it.

There are three reasons I like it:

simply I make contingency tables almost all the time; previously I have used countmap from StatsBase a lot, but it returns a dictionary which is not very handy; freqtable returns a much nicer result (e.g. if possible it is sorted) and allows for more than one dimension;
with freqtable I can use vectors or work on data frames, it nicely handles missings and allows for weighting;
freqtable is faster than countmap (I was surprised when I learned this, maybe not a critical thing but a nice plus).

So how does the output from freqtable look? Here is a sampler:

julia> using DataFrames, FreqTables

julia> srand(1); df = DataFrame(rand(1:3, 10, 2))

10×2 DataFrames.DataFrame

│ Row │ x1 │ x2 │

├─────┼────┼────┤

│ 1 │ 3 │ 2 │

│ 2 │ 3 │ 1 │

│ 3 │ 3 │ 1 │

│ 4 │ 3 │ 2 │

│ 5 │ 1 │ 2 │

│ 6 │ 1 │ 2 │

│ 7 │ 1 │ 3 │

│ 8 │ 2 │ 3 │

│ 9 │ 1 │ 1 │

│ 10 │ 1 │ 2 │

julia> freqtable(df, :x1, :x2)

3×3 Named Array{Int64,2}

x1 ╲ x2 │ 1 2 3

────────┼────────

1 │ 1 3 1

2 │ 0 0 1

3 │ 2 2 0

And now a simple benchmark against countmap :

julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools

julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);

julia> @benchmark freqtable($x)

BenchmarkTools.Trial:

memory estimate: 25.89 KiB

allocs estimate: 83

————–

minimum time: 24.246 ms (0.00% GC)

median time: 24.672 ms (0.00% GC)

mean time: 25.425 ms (0.00% GC)

maximum time: 39.739 ms (0.00% GC)

————–

samples: 197

evals/sample: 1

julia> @benchmark countmap($x)

BenchmarkTools.Trial:

memory estimate: 6.61 KiB

allocs estimate: 10

————–

minimum time: 42.230 ms (0.00% GC)

median time: 42.813 ms (0.00% GC)

mean time: 43.110 ms (0.00% GC)

maximum time: 46.244 ms (0.00% GC)

————–

samples: 116

evals/sample: 1

julia> @benchmark freqtable($y)

BenchmarkTools.Trial:

memory estimate: 10.16 KiB

allocs estimate: 76

————–

minimum time: 1.064 ms (0.00% GC)

median time: 1.112 ms (0.00% GC)

mean time: 1.129 ms (0.09% GC)

maximum time: 3.485 ms (66.72% GC)

————–

samples: 4403

evals/sample: 1

julia> @benchmark countmap($y)

BenchmarkTools.Trial:

memory estimate: 6.61 KiB

allocs estimate: 10

————–

minimum time: 87.141 ms (0.00% GC)

median time: 88.167 ms (0.00% GC)

mean time: 88.510 ms (0.00% GC)

maximum time: 92.177 ms (0.00% GC)

————–

samples: 57

evals/sample: 1

julia> @benchmark freqtable($z)

BenchmarkTools.Trial:

memory estimate: 45.81 MiB

allocs estimate: 2000285

————–

minimum time: 75.712 ms (3.94% GC)

median time: 77.057 ms (3.94% GC)

mean time: 77.346 ms (4.16% GC)

maximum time: 83.298 ms (3.35% GC)

————–

samples: 65

evals/sample: 1

julia> @benchmark countmap($z)

BenchmarkTools.Trial:

memory estimate: 6.61 KiB

allocs estimate: 10

————–

minimum time: 81.931 ms (0.00% GC)

median time: 83.128 ms (0.00% GC)

mean time: 83.472 ms (0.00% GC)

maximum time: 89.977 ms (0.00% GC)

————–

samples: 60

evals/sample: 1

As you can see freqtable does really a good job on different types of inputs.

Actually there is a third way to do a similar using by form DataFrames which is also quite fast but it is more messy. freqtable is more specialized – does one job, but does it well.

Here are the benchmarks of by:

julia> @benchmark by(DataFrame(x = $x), :x, nrow)

BenchmarkTools.Trial:

memory estimate: 38.91 MiB

allocs estimate: 5986

————–

minimum time: 28.946 ms (1.56% GC)

median time: 34.440 ms (14.82% GC)

mean time: 34.291 ms (14.80% GC)

maximum time: 41.079 ms (20.70% GC)

————–

samples: 146

evals/sample: 1

julia> @benchmark by(DataFrame(x = $y), :x, nrow)

BenchmarkTools.Trial:

memory estimate: 38.92 MiB

allocs estimate: 6198

————–

minimum time: 44.810 ms (3.52% GC)

median time: 50.244 ms (10.53% GC)

mean time: 49.715 ms (10.38% GC)

maximum time: 56.052 ms (17.67% GC)

————–

samples: 101

evals/sample: 1

julia> @benchmark by(DataFrame(x = $z), :x, nrow)

BenchmarkTools.Trial:

memory estimate: 38.91 MiB

allocs estimate: 5986

————–

minimum time: 46.891 ms (0.93% GC)

median time: 53.657 ms (10.12% GC)

mean time: 52.539 ms (9.68% GC)

maximum time: 60.736 ms (16.24% GC)

————–

samples: 96

evals/sample: 1

juliabloggers.com

A Julia Language Blog Aggregator

An update of my tutorial on DataFrames

Related