Author Archives: Blog by Bogumił Kamiński

Benchmarking split-apply-combine: DataFrames.jl vs Pandas

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/05/27/strings.html

Introduction

Recently chapters 5 and 6 of my Julia for Data Analysis book were
published in MEAP. Finally we are getting to some more fun stuff like parametric
types or discussion of options for working with strings in Julia.

In the comments from readers who opted in for early access to the book,
I was asked how the performance of DataFrames.jl compares to Pandas.
Therefore, today I decided to run a small benchmark.

The code was run under:

  • Julia 1.7.2, DataFrames.jl 1.3.4, InlineStrings.jl 1.1.2,
    PooledArrays.jl 1.4.2, CategoricalArrays.jl 0.10.6,
    and BenchmarkTools.jl 1.3.1;
  • Python 3.9.12, Pandas 1.4.2.

The test scenario

I want to:

  1. generate 1,000,000 random strings of length 8 consisting of letters (the
    Julia version below also includes digits);
  2. put them in a data frame;
  3. group the data frame by the string column and calculate the number of rows
    in every group.

In the tests I want to compare the performance of different options for storing
strings.
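
With 1,000,000 draws from such a large space of possible strings, duplicates are
very unlikely, so the benchmark effectively creates about one group per row (the
Pandas output further down reports 1,000,000 groups). Here is a rough
birthday-problem estimate, sketched in Python. It assumes the strings are drawn
uniformly and independently from an alphabet of 52 letters (Python's
string.ascii_letters) or 62 letters and digits (the default of Julia's
randstring()). The function names are illustrative.

```python
from math import comb, exp


def expected_duplicate_pairs(n_strings: int, alphabet_size: int, length: int) -> float:
    """Expected number of pairs of equal strings among n uniform random draws."""
    n_possible = alphabet_size ** length
    return comb(n_strings, 2) / n_possible


def prob_all_distinct(n_strings: int, alphabet_size: int, length: int) -> float:
    """Poisson approximation of the probability that no two strings coincide."""
    return exp(-expected_duplicate_pairs(n_strings, alphabet_size, length))


for name, size in [("letters only", 52), ("letters and digits", 62)]:
    pairs = expected_duplicate_pairs(10**6, size, 8)
    p_distinct = prob_all_distinct(10**6, size, 8)
    print(f"{name}: expected duplicate pairs = {pairs:.4f}, "
          f"P(all distinct) ~ {p_distinct:.4f}")
```

In both cases the expected number of duplicate pairs is below 0.01, so the
timings below describe grouping into roughly as many groups as there are rows.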

Julia code

In the Julia tests I check the following string storage modes: Vector{String},
Vector{String15} (inline string), Vector{Symbol}, PooledVector{String},
and CategoricalVector{String}. Here is the benchmark:

julia> using DataFrames

julia> using Random

julia> using InlineStrings

julia> using BenchmarkTools

julia> using PooledArrays

julia> using CategoricalArrays

julia> Random.seed!(1234);

julia> df = transform!(DataFrame(str=[randstring() for _ in 1:10^6]),
                       :str .=>
                       [inlinestrings, ByRow(Symbol),
                        PooledArray, CategoricalArray] .=>
                       [:istr, :sym, :pstr, :cstr])
1000000×5 DataFrame
     Row │ str       istr      sym       pstr      cstr
         │ String    String15  Symbol    String    Cat…
─────────┼──────────────────────────────────────────────────
       1 │ KYDtLOxn  KYDtLOxn  KYDtLOxn  KYDtLOxn  KYDtLOxn
       2 │ UkZj0CRg  UkZj0CRg  UkZj0CRg  UkZj0CRg  UkZj0CRg
    ⋮    │    ⋮         ⋮         ⋮         ⋮         ⋮
  999999 │ CZL6fcG5  CZL6fcG5  CZL6fcG5  CZL6fcG5  CZL6fcG5
 1000000 │ 1MqvPpVb  1MqvPpVb  1MqvPpVb  1MqvPpVb  1MqvPpVb
                                         999996 rows omitted

julia> @belapsed combine(groupby($df, :str), nrow)
0.0486384

julia> @belapsed combine(groupby($df, :istr), nrow)
0.0461695

julia> @belapsed combine(groupby($df, :sym), nrow)
0.0422065

julia> @belapsed combine(groupby($df, :pstr), nrow)
0.0096023

julia> @belapsed combine(groupby($df, :cstr), nrow)
0.0304362

As you can see, PooledVector is by far the fastest. CategoricalVector comes
next, but it is about three times slower. The performance of String, String15,
and Symbol is comparable; String is the slowest and Symbol the fastest of the
three.

Python code

In Python I compared Series containing str and Categorical. Here is the
code:

import pandas as pd

import random

import string

random.seed(1234)

s = [''.join(random.choice(string.ascii_letters) for _ in range(8)) for _ in range(10**6)]

cs = pd.Categorical(s)

df = pd.DataFrame({'str': s, 'cstr': cs})

%time df.groupby(['str']).size()
CPU times: total: 1.56 s
Wall time: 1.56 s
Out[8]:
str
AAAGINxY    1
AAAIavpP    1
AAASsiaU    1
AAAfnfqH    1
AAAxQZTv    1
           ..
zzzNaWBc    1
zzzTCZdT    1
zzzmmTcP    1
zzzoySQG    1
zzzwzAsA    1
Length: 1000000, dtype: int64

%time df.groupby(['cstr']).size()
CPU times: total: 109 ms
Wall time: 105 ms
Out[9]:
cstr
AAAGINxY    1
AAAIavpP    1
AAASsiaU    1
AAAfnfqH    1
AAAxQZTv    1
           ..
zzzNaWBc    1
zzzTCZdT    1
zzzmmTcP    1
zzzoySQG    1
zzzwzAsA    1
Length: 1000000, dtype: int64

As you can see, using Categorical is much faster (about 105 ms versus 1.56 s).
However, both options are visibly slower than any of the Julia variants we have
considered.
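
Note that the two measurements are not taken in the same way: @belapsed from
BenchmarkTools.jl reports the minimum time over many samples, while %time
measures a single run. A minimal sketch of a closer Python counterpart, using
only the standard library, could look as follows. The helper name best_of and
the toy workload are illustrative assumptions and not part of the benchmark
above.

```python
import time
from collections import Counter


def best_of(fn, repeats=5):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Toy workload standing in for the group-and-count step (made-up data).
words = [str(i % 1000) for i in range(100_000)]
elapsed = best_of(lambda: Counter(words))
print(f"best of 5 runs: {elapsed:.6f} s")
```

With the data frame defined above, the same helper could be called as
best_of(lambda: df.groupby(['str']).size()). The standard library's timeit
module offers similar functionality.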

Conclusions

First, I would like to thank all reviewers and readers of my book in MEAP. They
really help to improve it and I appreciate the feedback I receive a lot.

Now let me turn to a comment on the benchmark results. I want to remark that
I usually avoid running such comparisons, as it is really hard to do them
comprehensively and objectively. Therefore, I recommend that you treat my
conclusions as general guidance inferred from the tests I have performed:

  • DataFrames.jl was consistently faster than Pandas.
  • In both ecosystems it is better to “pool” data before running
    a split-apply-combine operation; this is especially relevant if you expect
    to run such an operation many times.
  • In Julia PooledArrays.jl is visibly faster than CategoricalArrays.jl.
    Therefore, if you just need compression (as offered by PooledArrays.jl),
    I do not recommend using CategoricalArrays.jl (use it if you need a
    first-class container for categorical data). Note that we see such a big
    difference because we had very many levels in our data. If we had only
    a few levels (a typical use case for CategoricalArrays.jl) its performance
    would be much closer to PooledArrays.jl.
  • In Julia, if you use a non-pooled vector, the choice of string storage type
    does not affect the considered timings a lot. If you would like to learn
    when this decision is important, I recommend that you read this post.
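
The idea behind pooling can be sketched in a few lines of plain Python. This is
a simplified picture of the technique (each distinct string is stored once and
each row holds a small integer code), not the actual implementation of
PooledArrays.jl, CategoricalArrays.jl, or the Pandas Categorical. The function
names and the toy data are illustrative.

```python
from collections import Counter


def pool(values):
    """Return (levels, codes): the distinct values and one integer code per row."""
    code_of = {}   # value -> integer code
    levels = []    # code -> value
    codes = []     # one code per input row
    for v in values:
        code = code_of.get(v)
        if code is None:
            code = len(levels)
            code_of[v] = code
            levels.append(v)
        codes.append(code)
    return levels, codes


def group_sizes(levels, codes):
    """Count rows per group using only the integer codes."""
    counts = [0] * len(levels)
    for code in codes:
        counts[code] += 1
    return dict(zip(levels, counts))


data = ["b", "a", "b", "c", "a", "b"]   # made-up toy data
levels, codes = pool(data)              # the strings are hashed once, here
sizes = group_sizes(levels, codes)      # later group-bys touch only integers
assert sizes == Counter(data)
print(levels, codes, sizes)
```

In this simplified picture the cost of hashing the strings is paid once, when
the pool is built, which is why pooling pays off most when the grouping is
repeated.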

I hope you will find the presented benchmarks useful in your daily data
wrangling tasks.

Why do I use Julia?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/05/20/whyjulia.html

Introduction

It has been two years since I started writing this blog. Therefore, I
thought of writing a more high-level post today to share with you
my thoughts on why I use Julia.

This post does not aim to give a comprehensive review of strengths and
weaknesses of Julia in general. Instead, I want to collect some notes of my
story of working with this language (and I think it is important that I am an
economist by training, as people with, e.g., computer science background might
have different thoughts or expectations).

Starting with Julia

My starting point with Julia was around the time when I was finishing my
habilitation degree (this is roughly equivalent to getting tenure in the US).
At that time I was implementing a lot of agent-based simulations. Studying such
models usually requires quite a lot of computation. The reason is that not only
does a single simulation run take time, but you also usually need to run these
models many times under different parameterizations.

If you would like to learn more about how such models are designed and used you
can have a look, for example, at this paper where you can find
a description of a model and experiments run using it and a link to a GitHub
repository with Julia source code.

So what requirements did I have for a programming language? Here they are:

  • it should be fast, and support multi-threading and distributed computing;
  • it should be easy to code with, so that the code is short, easy to maintain,
    and one does not have to constantly use, e.g., Valgrind to debug
    it;
  • it should support interactive development (Julia has an excellent REPL).

Julia fitted these requirements perfectly.

In short: I started using Julia because it solved an important problem I had
in my work.

Staying with Julia

Over the years I have started using Julia in numerous projects. Since most of
the problems I needed to solve involved some numerical computing, I rarely found
that Julia was missing important features. Sometimes Julia does lack some
functionality, but, fortunately, it is then really easy to integrate it with
C, Python, or R.

Another nice aspect of Julia is that it allowed me over the years to easily
collaborate with other non-programmers. You might be surprised by this
statement, but there are the following aspects of Julia that make it possible:

  • Well-written Julia code is easy to reason about (and mathematicians like
    that it uses 1-based indexing, so the formulas they use in papers can be
    directly translated to code). This allows readers both to visually verify
    the correctness of the code and to make small changes in its logic
    if needed.
  • If I wanted someone else (non-technical) to run my code I just needed to ask
    this person to install Julia (this is easy). Next, I shared Project.toml and
    Manifest.toml files along with my code and Julia took care of the proper setup
    of its working environment. This saved me hours of work that I would otherwise
    spend on helping my collaborators to configure things properly (or teaching them
    how to, e.g., set up and use Docker).

After using Julia for a while I decided to start contributing to packages. This
is another area where I enjoy Julia a lot. In my early days the selling point
was that I found it extremely easy to learn how to create a package, register
and release it, set up CI on GitHub, find advice on good coding style or writing
tests. The package I ended up contributing to most is DataFrames.jl. I
have found the community of people involved with this package extremely
welcoming, engaged, and knowledgeable. I have learned a lot from them and at the
same time I made a lot of friends.

This friendly social aspect of the Julia community is extremely important in the
long run. Please do not get me wrong – it does not mean that any PR or issue is
just accepted. I would say it is the opposite. Many times the discussions take
months of going back and forth on the design. The point is that if you feel
people reacting to your thoughts are welcoming it is much easier to go through
all the obstacles you encounter along the way. In short – in the Julia community
even though I often felt like this I have never felt like this.

Conclusions

Do I think Julia is one language to rule them all? Certainly not.
Different languages have different target audiences, where they shine.
However, if you are looking for a language that:

  • is easy to use,
  • provides a decent interactive environment,
  • has a best-in-class package manager,
  • is fast (including multi-threading and distributed computing),
  • allows you to become a member of a friendly and responsive community
    from which you can learn,

you may consider giving Julia a try like I have done.

Will you encounter problems or missing features along the way? For sure you will
(as most likely you would with any programming language). If you want to see a
record of PRs and issues I opened for Julia and how they got solved it is easy
to do it here.

My first Twitch live streaming session

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/05/12/twitch.html

Introduction

In 24 hours I will have my first Twitch live streaming session. It will begin on
Friday, May 13, 7 PM EDT on the ManningPublications channel.

In this post I want to share the source material I am going to present so that
everyone interested can easily follow it.

The code is a shortened version of the contents of chapters 8 and 9 of my
upcoming Julia for Data Analysis book.

Environment setup

I will run the code under Julia 1.7.2. You will need to install the following
packages (I show you the versions of the packages I use):

  • CSV.jl 0.10.4
  • CodecBzip2.jl 0.7.2
  • DataFrames.jl 1.3.4
  • Loess.jl 0.5.4
  • Plots.jl 1.28.1

The problem

In the session I will analyze the Lichess puzzles database. It contains
information about over 2,000,000 puzzles, covering such data as the number of
times a given puzzle was played, how hard the puzzle is, how much Lichess users
like the puzzle, or what chess themes the puzzle features. My goal is to check
the relationship between the puzzle hardness and how much users like it.

Source code

Here is the source code that I am going to present and explain during the
session.

I will start with fetching the data from the internet, unpacking it, and reading
it into a data frame:

import Downloads
Downloads.download("https://database.lichess.org/lichess_db_puzzle.csv.bz2",
                   "puzzles.csv.bz2")

using CodecBzip2
compressed = read("puzzles.csv.bz2")
plain = transcode(Bzip2Decompressor, compressed)

using CSV
using DataFrames
puzzles = CSV.read(plain, DataFrame;
                   header=["PuzzleId", "FEN", "Moves", "Rating","RatingDeviation",
                           "Popularity", "NbPlays", "Themes","GameUrl"])

describe(puzzles)

Next, I will perform exploratory data analysis of the database and subset it
to keep only the puzzles that I will later want to analyze:

using Plots
plot([histogram(puzzles[!, col]; label=col) for
      col in ["Rating", "RatingDeviation", "Popularity", "NbPlays"]]...)

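using Statistics
# thresholds: more plays than the median, rating between 1500 and the 99th percentile
plays_lo = median(puzzles.NbPlays)
rating_lo = 1500
rating_hi = quantile(puzzles.Rating, 0.99)
row_selector = (puzzles.NbPlays .> plays_lo) .&&
               (rating_lo .< puzzles.Rating .< rating_hi)

# both calls return the number of selected rows
sum(row_selector)
count(row_selector)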

good = puzzles[row_selector, ["Rating", "Popularity"]]

plot(histogram(good.Rating; label="Rating"),
     histogram(good.Popularity; label="Popularity"))

describe(good)

Finally, I will aggregate the data stored in the Lichess database and analyze
the relationship between puzzle difficulty and popularity:

grouped_good = groupby(good, :Rating, sort=true)
agg_good = combine(grouped_good, :Popularity => mean)
scatter(agg_good.Rating, agg_good.Popularity_mean;
        xlabel="rating", ylabel="mean popularity", legend=false)

using Loess
model = loess(agg_good.Rating, agg_good.Popularity_mean)
agg_good.pred = predict(model, float.(agg_good.Rating))
plot!(agg_good.Rating, agg_good.pred; width=5)

Conclusions

I invite everyone to join me during the Twitch live streaming session.
If you have any questions, please do not hesitate to ask them in the chat, and I
will try to answer them live. I hope you will enjoy it!