Author Archives: Blog by Bogumił Kamiński

Advanced reshaping in DataFrames.jl

Re-posted from: https://bkamins.github.io/julialang/2021/05/28/pivot.html

In DataFrames.jl currently you have two functions that you can use
for reshaping your data: stack and unstack. Their design goals are very
simple:

stack allows you to go from wide to long data format;
unstack works the other way around: it takes data in long format and produces
a wide table.

In this post I want to focus on the unstack function, starting with its basic
usage and then covering three common, more complex scenarios.

This post was written under Julia 1.6.1 and DataFrames.jl 1.1.1.

Preparing the data

Consider the following data
(building it is also a chance to practice some basic data transformation skills):

julia> using DataFrames

julia> sales = DataFrame(year=repeat(2001:2003, inner=4),
                         quarter=repeat(1:4, outer=3),
                         north=1:12, south=21:32)
12×4 DataFrame
 Row │ year   quarter  north  south
     │ Int64  Int64    Int64  Int64
─────┼──────────────────────────────
   1 │  2001        1      1     21
   2 │  2001        2      2     22
   3 │  2001        3      3     23
   4 │  2001        4      4     24
   5 │  2002        1      5     25
   6 │  2002        2      6     26
   7 │  2002        3      7     27
   8 │  2002        4      8     28
   9 │  2003        1      9     29
  10 │  2003        2     10     30
  11 │  2003        3     11     31
  12 │  2003        4     12     32

julia> costs = select(sales, :year, :quarter, [:north, :south] .=> x -> x/2,
                      renamecols=false)
12×4 DataFrame
 Row │ year   quarter  north    south
     │ Int64  Int64    Float64  Float64
─────┼──────────────────────────────────
   1 │  2001        1      0.5     10.5
   2 │  2001        2      1.0     11.0
   3 │  2001        3      1.5     11.5
   4 │  2001        4      2.0     12.0
   5 │  2002        1      2.5     12.5
   6 │  2002        2      3.0     13.0
   7 │  2002        3      3.5     13.5
   8 │  2002        4      4.0     14.0
   9 │  2003        1      4.5     14.5
  10 │  2003        2      5.0     15.0
  11 │  2003        3      5.5     15.5
  12 │  2003        4      6.0     16.0

julia> long_sales = stack(sales, [:north, :south], [:year, :quarter],
                          variable_name=:region, value_name=:sales)
24×4 DataFrame
 Row │ year   quarter  region  sales
     │ Int64  Int64    String  Int64
─────┼───────────────────────────────
   1 │  2001        1  north       1
   2 │  2001        2  north       2
   3 │  2001        3  north       3
   4 │  2001        4  north       4
   5 │  2002        1  north       5
   6 │  2002        2  north       6
   7 │  2002        3  north       7
   8 │  2002        4  north       8
   9 │  2003        1  north       9
  10 │  2003        2  north      10
  11 │  2003        3  north      11
  12 │  2003        4  north      12
  13 │  2001        1  south      21
  14 │  2001        2  south      22
  15 │  2001        3  south      23
  16 │  2001        4  south      24
  17 │  2002        1  south      25
  18 │  2002        2  south      26
  19 │  2002        3  south      27
  20 │  2002        4  south      28
  21 │  2003        1  south      29
  22 │  2003        2  south      30
  23 │  2003        3  south      31
  24 │  2003        4  south      32

julia> long_costs = stack(costs, [:north, :south], [:year, :quarter],
                          variable_name=:region, value_name=:costs)
24×4 DataFrame
 Row │ year   quarter  region  costs
     │ Int64  Int64    String  Float64
─────┼─────────────────────────────────
   1 │  2001        1  north       0.5
   2 │  2001        2  north       1.0
   3 │  2001        3  north       1.5
   4 │  2001        4  north       2.0
   5 │  2002        1  north       2.5
   6 │  2002        2  north       3.0
   7 │  2002        3  north       3.5
   8 │  2002        4  north       4.0
   9 │  2003        1  north       4.5
  10 │  2003        2  north       5.0
  11 │  2003        3  north       5.5
  12 │  2003        4  north       6.0
  13 │  2001        1  south      10.5
  14 │  2001        2  south      11.0
  15 │  2001        3  south      11.5
  16 │  2001        4  south      12.0
  17 │  2002        1  south      12.5
  18 │  2002        2  south      13.0
  19 │  2002        3  south      13.5
  20 │  2002        4  south      14.0
  21 │  2003        1  south      14.5
  22 │  2003        2  south      15.0
  23 │  2003        3  south      15.5
  24 │  2003        4  south      16.0

julia> long = innerjoin(long_sales, long_costs, on=[:year, :quarter, :region])
24×5 DataFrame
 Row │ year   quarter  region  sales  costs
     │ Int64  Int64    String  Int64  Float64
─────┼────────────────────────────────────────
   1 │  2001        1  north       1      0.5
   2 │  2001        2  north       2      1.0
   3 │  2001        3  north       3      1.5
   4 │  2001        4  north       4      2.0
   5 │  2002        1  north       5      2.5
   6 │  2002        2  north       6      3.0
   7 │  2002        3  north       7      3.5
   8 │  2002        4  north       8      4.0
   9 │  2003        1  north       9      4.5
  10 │  2003        2  north      10      5.0
  11 │  2003        3  north      11      5.5
  12 │  2003        4  north      12      6.0
  13 │  2001        1  south      21     10.5
  14 │  2001        2  south      22     11.0
  15 │  2001        3  south      23     11.5
  16 │  2001        4  south      24     12.0
  17 │  2002        1  south      25     12.5
  18 │  2002        2  south      26     13.0
  19 │  2002        3  south      27     13.5
  20 │  2002        4  south      28     14.0
  21 │  2003        1  south      29     14.5
  22 │  2003        2  south      30     15.0
  23 │  2003        3  south      31     15.5
  24 │  2003        4  south      32     16.0

The basics of `unstack`

Assume we want to get the sales table back. We need to unstack our long
table putting :year and :quarter in rows and :region in columns, while
taking :sales as values:

julia> unstack(long, [:year, :quarter], :region, :sales)
12×4 DataFrame
 Row │ year   quarter  north   south
     │ Int64  Int64    Int64?  Int64?
─────┼────────────────────────────────
   1 │  2001        1       1      21
   2 │  2001        2       2      22
   3 │  2001        3       3      23
   4 │  2001        4       4      24
   5 │  2002        1       5      25
   6 │  2002        2       6      26
   7 │  2002        3       7      27
   8 │  2002        4       8      28
   9 │  2003        1       9      29
  10 │  2003        2      10      30
  11 │  2003        3      11      31
  12 │  2003        4      12      32

We also check that we have recovered what we wanted:

julia> unstack(long, [:year, :quarter], :region, :sales) == sales
true

However, now let us put only :year in rows. If we drop :quarter, we get:

julia> unstack(long, :year, :region, :sales)
ERROR: ArgumentError: Duplicate entries in unstack at row 2 for key (2001,) and variable north. Pass allowduplicates=true to allow them.

julia> unstack(long, :year, :region, :sales, allowduplicates=true)
3×3 DataFrame
 Row │ year   north   south
     │ Int64  Int64?  Int64?
─────┼───────────────────────
   1 │  2001       4      24
   2 │  2002       8      28
   3 │  2003      12      32

Clearly, even if we pass allowduplicates=true we do not get what we most likely
wanted: only the last value for each year and region (the fourth quarter) is
kept. This leads us to the first case.

Pivot tables with `unstack`

Most likely we want to aggregate sales per year using the sum function. This
is a classic pivot table task. In DataFrames.jl currently one does it in two
steps: first aggregate, then reshape. Here is how you can do it (I am showing
two separate steps, but you could use e.g. Chain.jl to streamline
the processing):

julia> tmp1 = combine(groupby(long, [:year, :region]), :sales => sum => :sales)
6×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2001  north      10
   2 │  2001  south      90
   3 │  2002  north      26
   4 │  2002  south     106
   5 │  2003  north      42
   6 │  2003  south     122

julia> unstack(tmp1, :year, :region, :sales)
3×3 DataFrame
 Row │ year   north   south
     │ Int64  Int64?  Int64?
─────┼───────────────────────
   1 │  2001      10      90
   2 │  2002      26     106
   3 │  2003      42     122
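
The two steps above can be wrapped in a small helper. The sketch below is only an illustration: the pivot function and the toy data frame are hypothetical names introduced here, not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): aggregate first, then reshape.
function pivot(df, rowkey, colkey, value, fun)
    agg = combine(groupby(df, [rowkey, colkey]), value => fun => value)
    return unstack(agg, rowkey, colkey, value)
end

# A small long table in the spirit of the `long` data frame used above.
toy = DataFrame(year=[2001, 2001, 2001, 2001, 2002, 2002],
                region=["north", "north", "south", "south", "north", "south"],
                sales=[1, 2, 21, 22, 5, 25])

# One row per year, one column per region, values summed within each cell.
pivot_sum = pivot(toy, :year, :region, :sales, sum)
```

Passing a different function (for example maximum) as the last argument changes the aggregation without touching the reshaping step.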

Multiple variables put in columns

What if we wanted to put only :year in rows, but both :quarter and :region
in columns?

In this case we need to create a temporary column in which we combine
:quarter and :region. Here is a simple example:

julia> tmp2 = transform(long, [:region, :quarter] => ByRow(string) => :rq)
24×6 DataFrame
 Row │ year   quarter  region  sales  costs    rq
     │ Int64  Int64    String  Int64  Float64  String
─────┼────────────────────────────────────────────────
   1 │  2001        1  north       1      0.5  north1
   2 │  2001        2  north       2      1.0  north2
   3 │  2001        3  north       3      1.5  north3
   4 │  2001        4  north       4      2.0  north4
   5 │  2002        1  north       5      2.5  north1
   6 │  2002        2  north       6      3.0  north2
   7 │  2002        3  north       7      3.5  north3
   8 │  2002        4  north       8      4.0  north4
   9 │  2003        1  north       9      4.5  north1
  10 │  2003        2  north      10      5.0  north2
  11 │  2003        3  north      11      5.5  north3
  12 │  2003        4  north      12      6.0  north4
  13 │  2001        1  south      21     10.5  south1
  14 │  2001        2  south      22     11.0  south2
  15 │  2001        3  south      23     11.5  south3
  16 │  2001        4  south      24     12.0  south4
  17 │  2002        1  south      25     12.5  south1
  18 │  2002        2  south      26     13.0  south2
  19 │  2002        3  south      27     13.5  south3
  20 │  2002        4  south      28     14.0  south4
  21 │  2003        1  south      29     14.5  south1
  22 │  2003        2  south      30     15.0  south2
  23 │  2003        3  south      31     15.5  south3
  24 │  2003        4  south      32     16.0  south4

julia> unstack(tmp2, :year, :rq, :sales)
3×9 DataFrame
 Row │ year   north1  north2  north3  north4  south1  south2  south3  south4
     │ Int64  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────────
   1 │  2001       1       2       3       4      21      22      23      24
   2 │  2002       5       6       7       8      25      26      27      28
   3 │  2003       9      10      11      12      29      30      31      32

Note that this additional step is only required for columns; for rows,
unstack accepts multiple columns, as shown above.
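
When the combined values could be ambiguous after plain concatenation, an explicit separator in the temporary column helps. Here is a minimal sketch on a toy data frame; the "_Q" separator and the toy, keyed and wide names are illustrative choices only.

```julia
using DataFrames

toy = DataFrame(year=[2001, 2001, 2001, 2001],
                quarter=[1, 2, 1, 2],
                region=["north", "north", "south", "south"],
                sales=[1, 2, 21, 22])

# Combine :region and :quarter into one key column with an explicit separator.
keyed = transform(toy, [:region, :quarter] =>
                       ByRow((r, q) -> string(r, "_Q", q)) => :rq)

# Each distinct value of :rq becomes a column of the wide table.
wide = unstack(keyed, :year, :rq, :sales)
```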

Multiple value variables

Now we get to my favorite Chekhov’s gun element of our story. Why do we
have a :costs column in our long table? The reason is that now we will
discuss how one can unstack a data frame on multiple value columns.

Here you have three options for how to store the values:

use a nested field;
stack them vertically;
merge them horizontally.

Let me now discuss the three options. Nesting the field can be done e.g.
in the following way:

julia> tmp3 = transform(long, AsTable([:sales, :costs]) =>
                              ByRow(identity) =>
                              :indicators)
24×6 DataFrame
 Row │ year   quarter  region  sales  costs    indicators
     │ Int64  Int64    String  Int64  Float64  NamedTupl…
─────┼────────────────────────────────────────────────────────────────────
   1 │  2001        1  north       1      0.5  (sales = 1, costs = 0.5)
   2 │  2001        2  north       2      1.0  (sales = 2, costs = 1.0)
   3 │  2001        3  north       3      1.5  (sales = 3, costs = 1.5)
   4 │  2001        4  north       4      2.0  (sales = 4, costs = 2.0)
   5 │  2002        1  north       5      2.5  (sales = 5, costs = 2.5)
   6 │  2002        2  north       6      3.0  (sales = 6, costs = 3.0)
   7 │  2002        3  north       7      3.5  (sales = 7, costs = 3.5)
   8 │  2002        4  north       8      4.0  (sales = 8, costs = 4.0)
   9 │  2003        1  north       9      4.5  (sales = 9, costs = 4.5)
  10 │  2003        2  north      10      5.0  (sales = 10, costs = 5.0)
  11 │  2003        3  north      11      5.5  (sales = 11, costs = 5.5)
  12 │  2003        4  north      12      6.0  (sales = 12, costs = 6.0)
  13 │  2001        1  south      21     10.5  (sales = 21, costs = 10.5)
  14 │  2001        2  south      22     11.0  (sales = 22, costs = 11.0)
  15 │  2001        3  south      23     11.5  (sales = 23, costs = 11.5)
  16 │  2001        4  south      24     12.0  (sales = 24, costs = 12.0)
  17 │  2002        1  south      25     12.5  (sales = 25, costs = 12.5)
  18 │  2002        2  south      26     13.0  (sales = 26, costs = 13.0)
  19 │  2002        3  south      27     13.5  (sales = 27, costs = 13.5)
  20 │  2002        4  south      28     14.0  (sales = 28, costs = 14.0)
  21 │  2003        1  south      29     14.5  (sales = 29, costs = 14.5)
  22 │  2003        2  south      30     15.0  (sales = 30, costs = 15.0)
  23 │  2003        3  south      31     15.5  (sales = 31, costs = 15.5)
  24 │  2003        4  south      32     16.0  (sales = 32, costs = 16.0)

julia> unstack(tmp3, [:year, :quarter], :region, :indicators)
12×4 DataFrame
 Row │ year   quarter  north                      south
     │ Int64  Int64    NamedTup…?                 NamedTup…?
─────┼───────────────────────────────────────────────────────────────────────
   1 │  2001        1  (sales = 1, costs = 0.5)   (sales = 21, costs = 10.5)
   2 │  2001        2  (sales = 2, costs = 1.0)   (sales = 22, costs = 11.0)
   3 │  2001        3  (sales = 3, costs = 1.5)   (sales = 23, costs = 11.5)
   4 │  2001        4  (sales = 4, costs = 2.0)   (sales = 24, costs = 12.0)
   5 │  2002        1  (sales = 5, costs = 2.5)   (sales = 25, costs = 12.5)
   6 │  2002        2  (sales = 6, costs = 3.0)   (sales = 26, costs = 13.0)
   7 │  2002        3  (sales = 7, costs = 3.5)   (sales = 27, costs = 13.5)
   8 │  2002        4  (sales = 8, costs = 4.0)   (sales = 28, costs = 14.0)
   9 │  2003        1  (sales = 9, costs = 4.5)   (sales = 29, costs = 14.5)
  10 │  2003        2  (sales = 10, costs = 5.0)  (sales = 30, costs = 15.0)
  11 │  2003        3  (sales = 11, costs = 5.5)  (sales = 31, costs = 15.5)
  12 │  2003        4  (sales = 12, costs = 6.0)  (sales = 32, costs = 16.0)

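The second option is vertical stacking. It can be done in two ways: by
concatenating two unstacked tables, or by stacking first and then unstacking: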

julia> vcat(unstack(long, [:year, :quarter], :region, :sales),
            unstack(long, [:year, :quarter], :region, :costs),
            source=:indicator=>["sales", "costs"])
24×5 DataFrame
 Row │ year   quarter  north     south     indicator
     │ Int64  Int64    Float64?  Float64?  String
─────┼───────────────────────────────────────────────
   1 │  2001        1       1.0      21.0  sales
   2 │  2001        2       2.0      22.0  sales
   3 │  2001        3       3.0      23.0  sales
   4 │  2001        4       4.0      24.0  sales
   5 │  2002        1       5.0      25.0  sales
   6 │  2002        2       6.0      26.0  sales
   7 │  2002        3       7.0      27.0  sales
   8 │  2002        4       8.0      28.0  sales
   9 │  2003        1       9.0      29.0  sales
  10 │  2003        2      10.0      30.0  sales
  11 │  2003        3      11.0      31.0  sales
  12 │  2003        4      12.0      32.0  sales
  13 │  2001        1       0.5      10.5  costs
  14 │  2001        2       1.0      11.0  costs
  15 │  2001        3       1.5      11.5  costs
  16 │  2001        4       2.0      12.0  costs
  17 │  2002        1       2.5      12.5  costs
  18 │  2002        2       3.0      13.0  costs
  19 │  2002        3       3.5      13.5  costs
  20 │  2002        4       4.0      14.0  costs
  21 │  2003        1       4.5      14.5  costs
  22 │  2003        2       5.0      15.0  costs
  23 │  2003        3       5.5      15.5  costs
  24 │  2003        4       6.0      16.0  costs

julia> unstack(stack(long, [:sales, :costs], [:year, :quarter, :region],
                     variable_name=:indicator),
               [:year, :quarter, :indicator], :region, :value)
24×5 DataFrame
 Row │ year   quarter  indicator  north     south
     │ Int64  Int64    String     Float64?  Float64?
─────┼───────────────────────────────────────────────
   1 │  2001        1  sales           1.0      21.0
   2 │  2001        2  sales           2.0      22.0
   3 │  2001        3  sales           3.0      23.0
   4 │  2001        4  sales           4.0      24.0
   5 │  2002        1  sales           5.0      25.0
   6 │  2002        2  sales           6.0      26.0
   7 │  2002        3  sales           7.0      27.0
   8 │  2002        4  sales           8.0      28.0
   9 │  2003        1  sales           9.0      29.0
  10 │  2003        2  sales          10.0      30.0
  11 │  2003        3  sales          11.0      31.0
  12 │  2003        4  sales          12.0      32.0
  13 │  2001        1  costs           0.5      10.5
  14 │  2001        2  costs           1.0      11.0
  15 │  2001        3  costs           1.5      11.5
  16 │  2001        4  costs           2.0      12.0
  17 │  2002        1  costs           2.5      12.5
  18 │  2002        2  costs           3.0      13.0
  19 │  2002        3  costs           3.5      13.5
  20 │  2002        4  costs           4.0      14.0
  21 │  2003        1  costs           4.5      14.5
  22 │  2003        2  costs           5.0      15.0
  23 │  2003        3  costs           5.5      15.5
  24 │  2003        4  costs           6.0      16.0

Finally, we might want to perform horizontal merging, which can be done e.g. like
this:

julia> outerjoin(unstack(long, [:year, :quarter], :region, :sales),
                 unstack(long, [:year, :quarter], :region, :costs),
                 on=[:year, :quarter], renamecols="_sales" => "_costs")
12×6 DataFrame
 Row │ year   quarter  north_sales  south_sales  north_costs  south_costs
     │ Int64  Int64    Int64?       Int64?       Float64?     Float64?
─────┼────────────────────────────────────────────────────────────────────
   1 │  2001        1            1           21          0.5         10.5
   2 │  2001        2            2           22          1.0         11.0
   3 │  2001        3            3           23          1.5         11.5
   4 │  2001        4            4           24          2.0         12.0
   5 │  2002        1            5           25          2.5         12.5
   6 │  2002        2            6           26          3.0         13.0
   7 │  2002        3            7           27          3.5         13.5
   8 │  2002        4            8           28          4.0         14.0
   9 │  2003        1            9           29          4.5         14.5
  10 │  2003        2           10           30          5.0         15.0
  11 │  2003        3           11           31          5.5         15.5
  12 │  2003        4           12           32          6.0         16.0
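
A similar horizontal layout can also be sketched without a join, by reusing the temporary-column trick from the previous section. This is only one possible variant, shown on a toy data frame; the toy, tall and wide names and the key column are made up for the illustration. Note that stack promotes :sales and :costs to a common element type, so all value columns end up as floating point here.

```julia
using DataFrames

toy = DataFrame(year=[2001, 2001, 2001, 2001],
                quarter=[1, 2, 1, 2],
                region=["north", "north", "south", "south"],
                sales=[1, 2, 21, 22],
                costs=[0.5, 1.0, 10.5, 11.0])

# Go fully long first: one row per (year, quarter, region, indicator).
tall = stack(toy, [:sales, :costs], [:year, :quarter, :region],
             variable_name=:indicator)

# Build the column key from both :region and :indicator, then reshape once.
tall.key = string.(tall.region, "_", tall.indicator)
wide = unstack(tall, [:year, :quarter], :key, :value)
```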

Concluding remarks

Today I have focused mostly on the unstack function, and only mentioned
stack in a few places.

However, it is also worth knowing that there are two other functions that are
very often handy and easy to forget about. One is good old permutedims
(transposing a data frame) and the other is flatten (flattening nested
columns). If you want to widen your DataFrames.jl-related arsenal of tricks, I
recommend checking out their documentation.
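
To give a flavor of these two functions, here is a minimal sketch on made-up toy data (the data frames and their column names are assumptions for the example):

```julia
using DataFrames

wide = DataFrame(region=["north", "south"], q1=[1, 21], q2=[2, 22])

# permutedims transposes the table: the values of :region become column names
# and the old column names q1, q2 become the values of the first column.
transposed = permutedims(wide, :region)

nested = DataFrame(year=[2001, 2002], sales=[[1, 2], [5]])

# flatten expands every element of a nested column into its own row,
# repeating the values of the remaining columns.
flat = flatten(nested, :sales)
```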

A new tutorial on DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/05/20/odsc.html

This time my post is going to be shorter than usual (but hopefully as exciting :)).

I am going to give a tutorial on DataFrames.jl during
the ODSC Europe 2021 conference.

You can find the link to the outline of my workshop
“DataFrames.jl: a Perfect Sidekick for Your Next Data Science Project”
here.

The materials that are going to be presented during the tutorial can be found in
this GitHub repository. This is the first workshop where I decided not to
use Jupyter Notebook, but just the classic Julia REPL, so I hope participants will
also get convinced by it. Personally, I prefer the combination of the Julia REPL
and a plain old text editor in all my work over any other currently available option.

Finally, I have written a small pre-conference post that was published on the ODSC
blog here. It was meant to attract the attention of people who do not
use Julia yet. And to be clear: I did not spend days hand-picking a scenario
favorable to DataFrames.jl for the benchmarks shown there. It was the first
thing that came to my mind to check. Actually, initially I did not even consider
benchmarking against Polars as it is not that popular yet. However, later
I wanted to check for myself how it compares. And indeed Polars turns out to
have a really decent implementation.

I hope you will find the shared materials useful!

The hardest part of DataFrames.jl development process

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/05/14/nrow.html

Introduction

I have spent several years now helping to develop DataFrames.jl.
There are many issues to consider when working on such a big package:

providing new functionalities;
avoiding and fixing bugs;
performance;
integration of the functionality with the rest of the data ecosystem;
handling of conflicting expectations of the users;
getting the reviews done (super hard for complex PRs);
managing the release process and synchronization with dependencies;
working consistently on different versions of Julia that should be supported;
fixing bugs uncovered in other packages/Julia;
ensuring proper documentation and tutorials;
managing deprecated functionality;
…

These are the issues that instantly come to my mind and there are many more.
A natural question is then: what is the hardest task?

From my experience it is deciding what API to provide (function names, their
positional and keyword arguments, their return values), and the starting point
in this area is deciding which functions should be made available to the user.
The discussions about what should be the names of functions we export were among
the longest and hardest, because this is a social process that is hugely
affected by the past experience of the contributors.

As this topic is very wide I decided to comment on three selected
decisions in this scope:

why we decided not to provide head and tail functions but use first
and last instead;
why we decided to provide nrow and ncol functions, while size function
gives the same information;
why we provide both filter and subset functions that serve the same purpose.

I hope this will shed some light on the mental process we go through when making
such decisions.

This post refers to the state of the DataFrames.jl package in its 1.1.1 release.

Why `head` and `tail` are not defined

head and tail are commonly used in other ecosystems (e.g. in R) to get the
first/last few rows of a data frame. This gives us a first criterion:

Criterion 1: try to use function names that are natural for users to guess
without having to learn them.

However, there are first and last functions in Julia Base that serve the same
purpose. This gives us the following new criteria:

Criterion 2: stay consistent with Julia Base and try to add methods for
functions already defined there (as users are likely to know them).

Criterion 3: minimize number of verbs (function names) that are introduced
by the package, as this makes the functionality easier to learn and maintain.

Criterion 4: avoid defining common and short names. Such names are very likely
to conflict with names defined in users' code, leading to problems.

Criterion 5: if we want to add a method to a function defined in Julia Base,
it must do the same thing (we do not want to change the contract established
in Julia Base) and must not cause type piracy.

In this case we have the first and last functions in Julia Base that are already
defined to pick the first/last elements of a collection. Additionally,
Julia Base defines Base.tail, which is not exported currently, but there is
always a risk that this would change in the future (and it does a slightly
different thing). Finally, head and tail are pretty common names that were likely
to be already in use in users' code. Here a crucial consideration is that if we
had claimed some common name many years ago it would be less of a problem. However,
some users have thousands of lines of code using DataFrames.jl. In such a case,
introducing a common name might cause a code base that worked previously to start
failing.

All in all – we stick to the first/last combo, although it does not conform to
Criterion 1 for some of the users (this is subjective though).
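
As a small illustration of the first/last combo, here is a sketch on a toy data frame (the df name and its columns are made up for this example):

```julia
using DataFrames

df = DataFrame(id=1:10, x=11:20)

# With a row count, first and last return a data frame,
# which is what head/tail users typically expect.
head3 = first(df, 3)
tail2 = last(df, 2)

# Without a count, they return a single row, just like first/last
# return a single element for any other collection.
row1 = first(df)
```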

Why `nrow` and `ncol` are defined

In this case clearly we followed Criterion 1. Let us analyze why other criteria
did not get that much weight. We are clearly breaking Criteria 2 and 3.
Fortunately most likely we are not breaking Criterion 4 (names are short,
but not likely to be commonly used). Criterion 5 is not applicable.

Let us dig into Criteria 2 and 3 a bit. Instead of writing nrow(df) you
can alternatively write size(df, 1) or size(df)[1]. There are three reasons
why this is not optimal:

it is a bit more to type;
you actually have two styles to get the number of rows (and I know from Stack Overflow
that choosing between them was confusing; we do not want such a common
operation to have two similar but slightly different styles);
nrow does not require you to define an anonymous function if you want to
pass it to some higher order function; compare:
```
  combine(groupby(df, :col), nrow)
```
vs
```
  combine(groupby(df, :col), x -> size(x, 1))
```

Not only is the former much easier to read, but it also has to be compiled only
once, while the latter is recompiled every time if you are in global scope.

For these reasons defining nrow and ncol was accepted.
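
Here is a runnable version of this comparison on a toy data frame (the df name and its columns are assumptions for the example):

```julia
using DataFrames

df = DataFrame(col=["a", "b", "a", "a"], x=1:4)

# Three spellings of the same number of rows.
same_rows = nrow(df) == size(df, 1) == size(df)[1]
same_cols = ncol(df) == size(df, 2)

# nrow can be passed directly; the result column is then named :nrow.
counts = combine(groupby(df, :col), nrow)

# The anonymous-function variant gives the same counts,
# but needs an explicit target column name to match.
counts_anon = combine(groupby(df, :col), x -> size(x, 1))
```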

Why both `filter` and `subset` are provided

Clearly there is a filter function in Julia Base, so why do we need
a subset function? I have discussed the differences between them in
my last post, so here I will only note that they do not do the same thing.
Here a crucial consideration was following Criterion 5. Methods for the filter
function defined in DataFrames.jl should follow the contract for filter defined
in Julia Base. However, users wanted a function doing a similar thing, but with
a different contract (e.g. different order of arguments, whole column passed to
the predicate function, option to skip missing values). Therefore we decided to
keep filter consistent with Julia Base and add a new function subset that
would follow what users wanted.
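
The difference in contracts can be seen in a minimal sketch on a toy data frame with a missing value (the df, f and s names are made up for this example):

```julia
using DataFrames

df = DataFrame(x=[1, 2, missing, 4])

# filter: the predicate comes first and is called row by row;
# missing values have to be handled by hand inside the predicate.
f = filter(:x => v -> !ismissing(v) && v > 1, df)

# subset: the data frame comes first and whole columns are passed
# (hence ByRow); skipmissing=true drops rows where the condition is missing.
s = subset(df, :x => ByRow(>(1)), skipmissing=true)
```

Both calls keep the rows where x is 2 and 4.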

Conclusions

Before I finish let me add one more comment. What if we have a function name,
like describe, that is not defined in Julia Base, but it is likely that
several packages might want to add methods to it? In this case we need to have
some umbrella package that only defines this function (possibly with a default
implementation). In the data science ecosystem in Julia we have two such
packages: DataAPI.jl and StatsAPI.jl.
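
The umbrella pattern itself can be sketched in plain Julia. All module and type names below (ToyAPI, ToyTables, ToyModels) are hypothetical stand-ins used only for illustration; this is not the actual code of DataAPI.jl or StatsAPI.jl.

```julia
# Hypothetical stand-in for an umbrella package:
# it only declares the function and defines no methods.
module ToyAPI
    function describe end
end

# Two independent "packages" add methods for types they own,
# so they share one function name without type piracy.
module ToyTables
    import ..ToyAPI
    struct Table
        nrows::Int
    end
    ToyAPI.describe(t::Table) = "table with $(t.nrows) rows"
end

module ToyModels
    import ..ToyAPI
    struct Model
        name::String
    end
    ToyAPI.describe(m::Model) = "model $(m.name)"
end

table_info = ToyAPI.describe(ToyTables.Table(3))
model_info = ToyAPI.describe(ToyModels.Model("ols"))
```

Because both packages extend the same function object, user code can call describe on either type without name clashes between the packages.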

juliabloggers.com

A Julia Language Blog Aggregator
