Author Archives: Blog by Bogumił Kamiński

The hardest part of DataFrames.jl development process

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/05/14/nrow.html

Introduction

I have spent several years now helping to develop DataFrames.jl.
There are many issues to consider when working on such a big package:

  • providing new functionalities;
  • avoiding and fixing bugs;
  • performance;
  • integration of the functionality with the rest of the data ecosystem;
  • handling of conflicting expectations of the users;
  • getting the reviews done (super hard for complex PRs);
  • managing release process and synchronization with dependencies;
  • working consistently on different versions of Julia that should be supported;
  • fixing bugs uncovered in other packages/Julia;
  • ensuring proper documentation and tutorials;
  • managing deprecated functionality.

These are the issues that instantly come to my mind and there are many more.
A natural question is then — what is the hardest task?

From my experience it is deciding what API to provide (function names, their
positional and keyword arguments, their return values), and the starting point
in this area is deciding which functions should be made available to the user.
The discussions about which function names we should export were among the
longest and hardest, because this is a social process that is hugely
affected by the past experience of the contributors.

As this topic is very wide I decided to comment on three selected
decisions in this scope:

  • why we decided not to provide head and tail functions but use first
    and last instead;
  • why we decided to provide nrow and ncol functions, while size function
    gives the same information;
  • why we provide both filter and subset functions that serve the same purpose.

I hope this will shed some light on the mental process we go through when making
such decisions.

This post refers to the state of the DataFrames.jl package in its 1.1.1 release.

Why head and tail are not defined

head and tail are commonly used in other ecosystems (e.g. in R) to get the
first/last few rows of a data frame. This gives us a first criterion:

Criterion 1: try to use function names that are natural for users to guess
without having to learn them.

However, there are first and last functions in Julia Base that serve the same
purpose. This gives us the following new criteria:

Criterion 2: stay consistent with Julia Base and try to add methods for
functions already defined there (as users are likely to know them).

Criterion 3: minimize number of verbs (function names) that are introduced
by the package, as this makes the functionality easier to learn and maintain.

Criterion 4: avoid defining common and short names. Such names are very likely
to conflict with names defined in users' code, leading to problems.

Criterion 5: if we add a method to a function defined in Julia Base, it must
do the same thing (we do not want to change the contract established
in Julia Base) and must not cause type piracy.

In this case the first and last functions in Julia Base are already defined
to pick the first/last elements of a collection. Additionally,
Julia Base defines Base.tail, which is currently not exported, but there is
always a risk that this will change in the future (and it does a slightly
different thing). Finally, head and tail are pretty common names that were
likely to be already in use in users' code. A crucial consideration here is
that if we had claimed some common name many years ago it would be less of a
problem. However, some users have thousands of lines of code using
DataFrames.jl. In such a case introducing a common name might cause a code base
that worked previously to start failing.

All in all – we stick to first/last combo although it does not conform to
Criterion 1 for some of the users (this is subjective though).
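
As a small illustration of what this choice looks like in practice, here is a
minimal sketch (the data frame df and its column a are made up for this
example) of using first and last where users coming from R might reach for
head and tail:

```julia
using DataFrames

df = DataFrame(a=1:10)

# first/last with a row count return a new data frame,
# playing the role of head/tail from other ecosystems
top3 = first(df, 3)
bottom2 = last(df, 2)

# without a count they return a single row (a DataFrameRow),
# just like first/last return a single element of a vector
r = first(df)

println(top3.a)     # [1, 2, 3]
println(bottom2.a)  # [9, 10]
println(r.a)        # 1
```

The methods follow the Julia Base contract for first and last, which is the
point of Criteria 2 and 5.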

Why nrow and ncol are defined

In this case clearly we followed Criterion 1. Let us analyze why other criteria
did not get that much weight. We are clearly breaking Criteria 2 and 3.
Fortunately most likely we are not breaking Criterion 4 (names are short,
but not likely to be commonly used). Criterion 5 is not applicable.

Let us dig into Criteria 2 and 3 a bit. Instead of writing nrow(df) you
can alternatively write size(df, 1) or size(df)[1]. There are three reasons
why this is not optimal:

  • it is a bit more to type;
  • you actually have two styles to get the number of rows (and I know from
    Stack Overflow that choosing between them was confusing; we do not want
    such a common operation to have two similar, but slightly different, styles);
  • nrow does not require you to define an anonymous function if you want to
    pass it to some higher order function; compare:

      combine(groupby(df, :col), nrow)
    

    vs

      combine(groupby(df, :col), x -> size(x, 1))
    

The former is not only much easier to read, but it also has to be compiled only
once, while the latter is recompiled every time if you are in global scope
(each evaluation of x -> size(x, 1) there creates a new anonymous function).

For these reasons defining nrow and ncol was accepted.
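
A short sketch of the points above, assuming a toy data frame with made-up
column names col and val. It checks that nrow and ncol agree with size and
shows nrow passed directly to combine:

```julia
using DataFrames

df = DataFrame(col=[1, 1, 2], val=[10, 20, 30])

# nrow/ncol report the same numbers as size
same_rows = nrow(df) == size(df, 1)
same_cols = ncol(df) == size(df, 2)

# nrow can be passed directly as a transformation;
# the resulting column is named nrow by default
counts = combine(groupby(df, :col), nrow)

# the target column name can be changed with a Pair
counts2 = combine(groupby(df, :col), nrow => :n)

println(same_rows && same_cols)  # true
println(counts.nrow)             # [2, 1]
println(names(counts2))          # ["col", "n"]
```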

Why both filter and subset are provided

Clearly there is a filter function in Julia Base, so why do we need
a subset function? I have discussed the differences between them in
my last post, so it is clear that they do not do the same thing. Here a crucial
consideration was following Criterion 5. Methods for the filter function defined in
DataFrames.jl should follow the contract for filter defined in Julia Base.
However, users wanted a function doing a similar thing, but with a different
contract (e.g. different order of arguments, whole column passed to the
predicate function, option to skip missing values). Therefore we decided to
keep filter consistent with Julia Base and add a new function subset that
would follow what users wanted.

Conclusions

Before I finish let me add one more comment. What if we have a function name,
like describe, that is not defined in Julia Base, but it is likely that
several packages might want to add methods to it? In this case we need to have
some umbrella package that only defines this function (possibly with a default
implementation). In the data science ecosystem in Julia we have two such
packages: DataAPI.jl and StatsAPI.jl.
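
The umbrella package pattern can be sketched with plain modules. Everything
below is hypothetical (TinyAPI, PkgA, PkgB and summarize are names invented
for this illustration and are not the actual contents of DataAPI.jl or
StatsAPI.jl); it only shows the mechanism of one module owning a function name
while other modules add methods for their own types:

```julia
# Hypothetical umbrella module: it owns the function name
# but defines no methods itself
module TinyAPI
export summarize
function summarize end
end

# Two independent "packages" extend the same function for types
# they own, so neither of them commits type piracy
module PkgA
import ..TinyAPI
struct TableA
    nrows::Int
end
TinyAPI.summarize(t::TableA) = "TableA with $(t.nrows) rows"
end

module PkgB
import ..TinyAPI
struct SeriesB
    len::Int
end
TinyAPI.summarize(s::SeriesB) = "SeriesB of length $(s.len)"
end

using .TinyAPI

println(summarize(PkgA.TableA(3)))   # TableA with 3 rows
println(summarize(PkgB.SeriesB(5)))  # SeriesB of length 5
```

Because both packages extend the same function object, a user who loads both
gets one summarize with two methods instead of two conflicting names.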

DataFrames.jl: why do we have both subset and filter functions?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/05/07/subset.html

Introduction

Before I start let me comment that this blog was started exactly one year ago.
I hope to keep posting weekly updates on the Julia language, and
especially its ecosystem for data science, so:

Happy birthday

Now let us go back to business.

The 1.1 release of the DataFrames.jl package introduced a small fix of how
the subset function works. Today I will discuss its design and compare it
to the filter function.

In this post I am using Julia 1.6.1 and DataFrames.jl 1.1.0.

The design of filter

The filter function is defined in Julia Base. Therefore in DataFrames.jl we
add methods to it. Let us start with the contract for filter(f, a) then:

Return a copy of collection a, removing elements for which f is false.
The function f is passed one argument.

How do we translate this into DataFrames.jl realm? We have two cases.

If a is an AbstractDataFrame then we treat it as a collection of rows.
Therefore f will get one row of data and we expect it to return a Bool value.
As a result of the operation we produce a DataFrame (unless view keyword
argument is true in which case we return a SubDataFrame).

Here is a basic example:

julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> filter(row -> row.a != 2, df)
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     3

A more efficient (faster to execute) way to express the same is:

julia> filter(:a => !=(2), df)
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     3

As you can see the style is that you pass a Pair of a column name and a
predicate function (i.e. a function that produces Bool). This has two
benefits. Firstly, the operation is type stable (thus faster). Secondly, with
row -> row.a != 2 we define a new anonymous function with each call of
filter, which causes compilation (unless the operation is wrapped in a
function or we predefine the predicate function).
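
Here is a minimal sketch of these options, assuming a toy data frame with
made-up columns a and b and a hypothetical named predicate not_two. It shows a
predefined predicate, a Pair with several source columns, and the view keyword
argument mentioned earlier:

```julia
using DataFrames

df = DataFrame(a=1:3, b=[3, 2, 1])

# a named predicate is compiled once and can be reused across calls
not_two(x) = x != 2
res1 = filter(:a => not_two, df)

# with several source columns the predicate gets one positional
# argument per column, still evaluated row by row
res2 = filter([:a, :b] => ((a, b) -> a < b), df)

# view=true returns a SubDataFrame instead of copying the data
res3 = filter(:a => not_two, df, view=true)

println(res1.a)                  # [1, 3]
println(res2.a)                  # [1]
println(res3 isa SubDataFrame)   # true
```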

The second case is when a is a GroupedDataFrame. In this case f will get
one group and should return a Bool value again. The result will be a
GroupedDataFrame with groups appropriately removed:

julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
⋮
Last Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3

julia> filter(sdf -> sdf.a != [2], gdf)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
⋮
Last Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3

A Pair version is also supported:

julia> filter(:a => !=([2]), gdf)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
⋮
Last Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3

A crucial thing to note is that this time the predicate gets a data frame (or
its column/columns).

In summary — the filter function (apart from the view keyword argument and
a special Pair syntax that improves the performance) works exactly like
the Julia Base contract requires.

Before we move forward you might notice that the Pair syntax for the
AbstractDataFrame case is different than the same syntax for select,
transform, and combine functions, where always a whole column is passed.
Indeed there is a small inconsistency. It was left for user convenience
and consistency with Julia Base.
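
The difference can be checked directly. In the sketch below (the data frame
and the column name got_vector are made up for illustration) the function in
the Pair tests what kind of argument it receives:

```julia
using DataFrames

df = DataFrame(a=1:3)

# filter: the predicate sees one scalar per row,
# so every row passes this check
kept = filter(:a => (x -> x isa Integer), df)

# combine (like select and transform): the function
# sees the whole column at once
info = combine(df, :a => (x -> x isa AbstractVector) => :got_vector)

println(nrow(kept))        # 3
println(info.got_vector)   # Bool[1]
```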

On the other hand subset is fully consistent with the rest of DataFrames.jl
ecosystem, so let us move to it now.

The design of subset

The subset function is designed for filtering of rows in a way consistent
with the select, transform, and combine functions. The contract for
the subset(df, args...) function is:

Return a copy of data frame df containing only rows for which all values
produced by transformation(s) args for a given row are true.

If instead of a df data frame you pass a GroupedDataFrame the rules are
the same, but the difference is that they apply to the parent of the
GroupedDataFrame. This leads us to a list of differences from filter.
In subset:

  • the AbstractDataFrame/GroupedDataFrame argument goes first;
  • you are allowed to pass multiple conditions on which you want to perform row selection;
  • always works on whole columns;
  • always filters rows;
  • the transformation is expected to return a vector (not a scalar Bool — remember
    we are filtering rows so the length of the vector must match the number of rows);
  • by default always produces a data frame.

The additional differences follow the available keyword arguments:

  • all transformations must produce vectors containing true or false; however,
    optionally missing is allowed if skipmissing=true (this option is not available in filter);
  • for GroupedDataFrame case if ungroup=false the resulting data frame is
    re-grouped based on the same grouping columns as the source GroupedDataFrame
    (but by default a data frame is returned).

The view keyword argument works like in filter and allows you to produce
a SubDataFrame instead of a DataFrame.
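
A minimal sketch of the ungroup and view keyword arguments, assuming a small
data frame with made-up columns a and b and a hypothetical helper is_group_max
that marks the largest value within each group:

```julia
using DataFrames

df2 = DataFrame(a=repeat(1:3, 2), b=1:6)
gdf2 = groupby(df2, :a)

# hypothetical within-group condition: keep the largest b in each group
is_group_max(x) = x .== maximum(x)

# default: a plain data frame is returned even for grouped input
res_df = subset(gdf2, :b => is_group_max)

# ungroup=false: the result is grouped again by the same columns
res_gdf = subset(gdf2, :b => is_group_max, ungroup=false)

# view=true: a SubDataFrame pointing into the source data frame
res_view = subset(df2, :b => ByRow(isodd), view=true)

println(res_df isa DataFrame)            # true
println(res_gdf isa GroupedDataFrame)    # true
println(res_view isa SubDataFrame)       # true
```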

Enough theory, let us get to the examples:

julia> df2 = DataFrame(a=repeat(1:3, 2), b=1:6)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      4
   5 │     2      5
   6 │     3      6

julia> subset(df2, :a => ByRow(==(1)), :b => ByRow(isodd))
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1

Here you can see that we had to wrap predicates in ByRow to make sure
that a vector of Bool is produced by the filtering conditions. Otherwise
you would get an error:

julia> subset(df2, :a => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

(By the way: this is what changed in the DataFrames.jl 1.1 release;
previously returning a scalar Bool was unintentionally allowed, which was error
prone, as the comparison was made against a whole vector and not its elements.)

The second key thing to remember is that subset always filters rows,
also in the GroupedDataFrame case:

julia> gdf2 = groupby(df2, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      4
⋮
Last Group (2 rows): a = 3
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      6

julia> subset(gdf2, :b => (x -> x .== maximum(x)))
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

This is often very useful if we want to filter rows by some within-group condition,
like in the example above.

Finally, let me show the skipmissing keyword argument at work:

julia> df3 = DataFrame(a=[1, missing, 3, 4])
4×1 DataFrame
 Row │ a
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing
   3 │       3
   4 │       4

julia> subset(df3, :a => ByRow(isodd))
ERROR: ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values

julia> subset(df3, :a => ByRow(isodd), skipmissing=true)
2×1 DataFrame
 Row │ a
     │ Int64?
─────┼────────
   1 │      1
   2 │      3
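
If you prefer to make the handling of missing explicit in the condition itself
rather than through the keyword argument, one option (a sketch, not the only
way to do it) is to map missing to false with coalesce from Julia Base:

```julia
using DataFrames

df3 = DataFrame(a=[1, missing, 3, 4])

# isodd(missing) is missing; coalesce turns it into false,
# so the condition returns only true or false
res = subset(df3, :a => ByRow(x -> coalesce(isodd(x), false)))

# the same rows are kept as with skipmissing=true
same = isequal(res, subset(df3, :a => ByRow(isodd), skipmissing=true))

println(res.a)   # Union{Missing, Int64}[1, 3]
println(same)    # true
```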

Conclusions

In summary both filter and subset are useful, but in
different contexts. The basic rules are:

  • if you have multiple conditions to apply use subset;
  • if you want to easily handle missing values use subset;
  • if you have a single predicate that takes a single row (or a scalar)
    and returns Bool and want to filter a data frame use filter
    (this saves you typing ByRow in subset);
  • if you have a single predicate that returns Bool and want to filter
    whole groups of a GroupedDataFrame (as opposed to rows) use filter.

Things are unfortunately a bit complex, but we provide both functions for user
convenience, as each of them is useful in different contexts.

Before I finish let me highlight that there are also in-place filter! and
subset! variants of these functions.
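
A minimal sketch of the in-place variants on a made-up data frame; both
functions modify df and follow the same conventions as their copying
counterparts:

```julia
using DataFrames

df = DataFrame(a=1:6)

# filter! removes rows from df itself; the predicate gets a scalar
filter!(:a => isodd, df)
after_filter = copy(df.a)

# subset! does the same with subset's conventions (note ByRow)
subset!(df, :a => ByRow(>(1)))
after_subset = copy(df.a)

println(after_filter)  # [1, 3, 5]
println(after_subset)  # [3, 5]
```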

The order of join and grouping result in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/04/30/roworder.html

Introduction

Today I want to focus on an issue that is often not noticed by users when
working with DataFrames.jl, but in some cases it might be relevant.

The subject is the row order of the results of join and grouping operations in
DataFrames.jl. The key point of the post is that this order depends on several
factors, so it is simplest to assume that it is undefined.
I am not going to list all cases in my examples, but just focus on showing
this fact, as the order might change in the future.

In this post I am using Julia 1.6.1 and DataFrames.jl 1.0.1.

Joins

Consider the following example of innerjoin:

julia> using DataFrames

julia> df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
4×2 DataFrame
 Row │ x      id1
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     3      2
   3 │     1      3
   4 │     4      4

julia> df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)
5×2 DataFrame
 Row │ x      id2
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     3      2
   3 │     2      3
   4 │     5      4
   5 │     6      5

julia> innerjoin(df1, df2, on=:x)
3×3 DataFrame
 Row │ x      id1    id2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      1
   2 │     3      2      2
   3 │     2      1      3

julia> innerjoin(df2, df1, on=:x)
3×3 DataFrame
 Row │ x      id2    id1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      3
   2 │     3      2      2
   3 │     2      3      1

As you can see currently the row order in the result of innerjoin is taken
from the longer table.

Now consider outerjoin (similar results are for leftjoin and rightjoin):

julia> outerjoin(df1, df2, on=:x)
6×3 DataFrame
 Row │ x      id1      id2
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        3        1
   2 │     3        2        2
   3 │     2        1        3
   4 │     4        4  missing
   5 │     5  missing        4
   6 │     6  missing        5

julia> outerjoin(df2, df1, on=:x)
6×3 DataFrame
 Row │ x      id2      id1
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        3
   2 │     3        2        2
   3 │     2        3        1
   4 │     5        4  missing
   5 │     6        5  missing
   6 │     4  missing        4

Now we have the following parts of the table: first comes the chunk
matching what innerjoin produces, then we have a non-matching part from the
left table, and finally we have a non-matching part from the right table.
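
If your code needs a specific row order, a safe approach is to impose it
explicitly after the join instead of relying on the order shown above. A
minimal sketch, reusing the df1 and df2 tables from this section and assuming
that the id1 column encodes the order of the left table:

```julia
using DataFrames

df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)

# do not rely on the order returned by the join;
# sort on the column that encodes the order you need
by_left = sort(innerjoin(df1, df2, on=:x), :id1)
by_key = sort(innerjoin(df1, df2, on=:x), :x)

println(by_left.x)  # [2, 3, 1]
println(by_key.x)   # [1, 2, 3]
```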

While we did not guarantee the row order in joins before the 1.0 release, the
actual order has changed in DataFrames.jl 1.0. The reason was performance
considerations. Consider the following examples of joins and their timing:

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.246627 seconds (176 allocations: 13.797 KiB)

julia> @time innerjoin(df2, df1, on=:x);
  0.237981 seconds (175 allocations: 13.781 KiB)

(I am showing you the timings after compilation; I use Vector{String} to join
on as this case is the slowest scenario under DataFrames.jl 1.0).

Now switch to DataFrames.jl 0.22.7 for a while (you need a fresh session and a
fresh project environment to test this; timings are again after compilation):

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.350317 seconds (177 allocations: 152.602 MiB)

julia> @time innerjoin(df2, df1, on=:x);
  1.140921 seconds (183 allocations: 662.071 MiB)

As you can see the current algorithm not only uses much less memory, but also
it is faster in general and not affected by the argument order (the last thing
was a major bane of joins before 1.0 release of DataFrames.jl).

For a reference check what data.table in R offers in this case in terms of
performance (I am adding it as the performance against data.table is a hot
topic recently):

> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> dt1 <- data.table(x=as.character(1:10^7))
> dt2 <- data.table(x=as.character(1:10))
> system.time(merge(dt1, dt2, all=FALSE))
   user  system elapsed
  7.445   0.153   3.544
> system.time(merge(dt1, dt2, all=FALSE, sort=FALSE))
   user  system elapsed
  6.735   0.128   2.827

(note that I have used non-pooled vectors in both cases, as this was the scenario
that allowed me to compare DataFrames.jl 1.0 and 0.22.7 best; clearly if we
joined on pooled vectors the timings would be much better)

Grouping

In the groupby operation the rules of ordering of the GroupedDataFrame object
depend on the type of the column you group by (I am assuming you are not passing
the sort=true keyword argument, as then groups are sorted). The two cases are:

  • if you group by columns that are pooled (like PooledVector or
    CategoricalVector) and the number of possible groups is not huge
    then you get your result in the order of levels in the pool;
  • otherwise the group ordering is their order of appearance in the source vector.

Here a particular corner case are integer columns, which are treated as
pooled (so the groups are sorted), unless the range of the integers is huge
(as then we fall back to the order of appearance). Here is an example:

julia> df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     3    300
   2 │     1      1
   3 │     2      2

julia> keys(groupby(df, :x))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (x = 1,)
 GroupKey: (x = 2,)
 GroupKey: (x = 3,)

julia> keys(groupby(df, :y))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (y = 300,)
 GroupKey: (y = 1,)
 GroupKey: (y = 2,)

What is considered to be huge is left undefined (as it might change in the
future), but in general if there are fewer levels than rows in the data frame
(this is the typical case in practice) then we do not consider the number of
levels to be huge.
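
If you need the groups in key order regardless of these rules, the sort=true
keyword argument mentioned above gives that guarantee. A minimal sketch on the
same df as in the example:

```julia
using DataFrames

df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])

# sort=true orders groups by key, independently of
# how the grouping column is stored
gdf = groupby(df, :y, sort=true)
ordered_keys = [k.y for k in keys(gdf)]

println(ordered_keys)  # [1, 2, 300]
```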

Again – let us do some benchmarking against the 0.22.7 release of DataFrames.jl.
First the results for DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.055407 seconds (64 allocations: 85.834 MiB)

julia> @time groupby(df, :y);
  0.895716 seconds (50 allocations: 280.591 MiB)

and now under 0.22.7 release:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.890674 seconds (31 allocations: 280.590 MiB)

julia> @time groupby(df, :y);
  0.884177 seconds (31 allocations: 280.590 MiB)

As you can see, in the case of grouping integer columns we are much faster
than before if the integer range is not huge.

Let us have a comparison with data.table again (we need to also perform some
aggregation to match apples to apples in terms of timing).

First DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time combine(groupby(df, :x), nrow);
  0.180592 seconds (261 allocations: 324.266 MiB)

julia> @time combine(groupby(df, :y), nrow);
  1.006619 seconds (247 allocations: 519.023 MiB)

and now data.table:

> df <- data.table(x=1:(10^7+1), y=c(1:10^7, 10^10))
> system.time(df[, .N, by = x])
   user  system elapsed
  0.644   0.088   0.266
> system.time(df[, .N, by = y])
   user  system elapsed
  0.991   0.096   0.404

This time DataFrames.jl is slower for the huge range (note that data.table
is using four threads, which is great, and I tested my code on a single thread
in Julia, as in DataFrames.jl we do not support multi-threading in this
particular case yet).

Conclusions

In summary: although there are precise rules that determine the order of join
and grouping results, it is simplest to assume that it is undefined (like in
databases). The reason for this is performance considerations (so the
rules are complex and might change in the future).

However, based on user feedback, we might in the future consider adding some
keyword arguments to joins or groupby that would guarantee some particular
order. Therefore if you have any thoughts on it please open an issue in
the DataFrames.jl repository on GitHub.