The hardest part of the DataFrames.jl development process

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/05/14/nrow.html

Introduction

I have spent several years now helping to develop DataFrames.jl.
There are many issues to consider when working on such a big package:

  • providing new functionalities;
  • avoiding and fixing bugs;
  • performance;
  • integration of the functionality with the rest of the data ecosystem;
  • handling of conflicting expectations of the users;
  • getting the reviews done (super hard for complex PRs);
  • managing release process and synchronization with dependencies;
  • working consistently on different versions of Julia that should be supported;
  • fixing bugs uncovered in other packages/Julia;
  • ensuring proper documentation and tutorials;
  • managing deprecated functionality.

These are the issues that instantly come to my mind, and there are many more.
A natural question is then: what is the hardest task?

From my experience it is deciding what API to provide (function names, their
positional and keyword arguments, their return values), and the starting point
in this area is deciding which functions should be made available to the user.
The discussions about the names of the functions we export were among the
longest and hardest, because this is a social process that is hugely
affected by the past experience of the contributors.

As this topic is very wide I decided to comment on three selected
decisions in this scope:

  • why we decided not to provide head and tail functions but use first
    and last instead;
  • why we decided to provide nrow and ncol functions, while size function
    gives the same information;
  • why we provide both filter and subset functions that serve the same purpose.

I hope this will shed some light on the thought process we go through when
making such decisions.

This post refers to the state of the DataFrames.jl package in its 1.1.1 release.

Why head and tail are not defined

head and tail are commonly used in other ecosystems (e.g. in R) to get the
first/last few rows of a data frame. This gives us a first criterion:

Criterion 1: try to use function names that are natural for users to guess
without having to learn them.

However, there are first and last functions in Julia Base that serve the same
purpose. This gives us the following new criteria:

Criterion 2: stay consistent with Julia Base and try to add methods for
functions already defined there (as users are likely to know them).

Criterion 3: minimize the number of verbs (function names) that are introduced
by the package, as this makes the functionality easier to learn and maintain.

Criterion 4: avoid defining common and short names. Such names are very likely
to conflict with names defined in users' code, leading to problems.

Criterion 5: if we add a method to a function defined in Julia Base, it must
do the same thing (we do not want to change the contract established in Julia
Base) and must not cause type piracy.
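To make Criterion 5 concrete, here is a minimal sketch in plain Julia. The ToyTable type is hypothetical and only stands in for a type owned by a package; it is not how DataFrames.jl is implemented.

```julia
# Hypothetical toy type, standing in for a table type owned by our package.
struct ToyTable
    values::Vector{Int}
end

# Not type piracy: we own ToyTable, and both methods keep the Base contract
# ("return the first/last n elements of the collection").
Base.first(t::ToyTable, n::Integer) = ToyTable(first(t.values, n))
Base.last(t::ToyTable, n::Integer) = ToyTable(last(t.values, n))

# Type piracy would be adding a method to a Base function for argument types
# we do not own, e.g. redefining Base.first for Vector{Int}.

t = ToyTable([10, 20, 30, 40])
head_part = first(t, 2)   # ToyTable holding 10 and 20
tail_part = last(t, 2)    # ToyTable holding 30 and 40
```

The two-argument first and last methods for vectors used here require Julia 1.6 or later.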

In this case we have first and last functions in Julia Base that are already
defined to pick the first/last elements of a collection. Additionally,
Julia Base defines Base.tail, which is not exported currently, but there is
always a risk that this would change in the future (and it does a slightly
different thing). Finally, head and tail are pretty common names that were
likely to be already in use in users' code. Here a crucial consideration is
that if we had claimed some common name many years ago it would be less of a
problem. However, some users have thousands of lines of code using
DataFrames.jl. In such a case introducing a common name might cause a code base
that worked previously to start failing.

All in all, we stick to the first/last combo, although it does not conform to
Criterion 1 for some of the users (this is subjective though).
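As a short illustration of the outcome, the sketch below uses first and last on a small data frame and contrasts them with Base.tail. The data frame and its column names are made up for the example.

```julia
using DataFrames

# Made-up example data.
df = DataFrame(id = 1:5, name = ["a", "b", "c", "d", "e"])

# Base verbs with DataFrame methods: the first/last n rows as a data frame.
top = first(df, 2)       # rows 1 and 2
bottom = last(df, 2)     # rows 4 and 5

# Without a count a single row is returned, mirroring what first does
# for any other collection.
row = first(df)

# Base.tail does a different thing: it drops the first element of a tuple.
rest = Base.tail((1, 2, 3))
```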

Why nrow and ncol are defined

In this case clearly we followed Criterion 1. Let us analyze why other criteria
did not get that much weight. We are clearly breaking Criteria 2 and 3.
Fortunately most likely we are not breaking Criterion 4 (names are short,
but not likely to be commonly used). Criterion 5 is not applicable.

Let us dig into Criteria 2 and 3 a bit. Instead of writing nrow(df) you
can alternatively write size(df, 1) or size(df)[1]. There are three reasons
why this is not optimal:

  • it is a bit more to type;
  • you actually have two styles to get the number of rows (and I know from
    Stack Overflow that choosing between them was confusing; we do not want
    such a common operation to have two similar, but slightly different, styles);
  • nrow does not require you to define an anonymous function if you want to
    pass it to some higher order function; compare:

      combine(groupby(df, :col), nrow)
    

    vs

      combine(groupby(df, :col), x -> size(x, 1))
    

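The former is not only much easier to read, but it also has to be compiled only
once, while the latter is recompiled every time you run it in global scope
(each time the expression is entered at top level, a new anonymous function
with a new type is created).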

For these reasons defining nrow and ncol was accepted.
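The comparison can be tried on a small made-up data frame. The sketch below only assumes the column names col and val, which are chosen for illustration.

```julia
using DataFrames

# Made-up example data.
df = DataFrame(col = ["a", "b", "a", "a"], val = 1:4)

# Three spellings of the same number of rows.
n1 = nrow(df)
n2 = size(df, 1)
n3 = size(df)[1]

gdf = groupby(df, :col)

# Passing the named function directly; the result column is called nrow.
by_name = combine(gdf, nrow)

# The anonymous-function form; the result column gets an auto-generated name.
by_lambda = combine(gdf, x -> size(x, 1))
```

Both calls produce the same counts per group; they differ only in readability, in the name of the resulting column, and in compilation cost.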

Why both filter and subset are provided

Clearly there is a filter function in Julia Base, so why do we need
a subset function? As I discussed in my last post, they do not do the same
thing. Here a crucial consideration was following Criterion 5. Methods for the
filter function defined in DataFrames.jl should follow the contract for filter
defined in Julia Base. However, users wanted a function doing a similar thing,
but with a different contract (e.g. different order of arguments, whole column
passed to the predicate function, option to skip missing values). Therefore we
decided to keep filter consistent with Julia Base and add a new function,
subset, that would follow what users wanted.
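The difference in contracts can be seen in a small sketch. The data frame below is made up for the example, and the coalesce call is just one possible way to handle missing values in a filter predicate.

```julia
using DataFrames

# Made-up example data with a missing value.
df = DataFrame(x = [1, 2, missing, 4])

# filter follows the Base contract: predicate first, collection second,
# and the predicate must return true or false for every row,
# so missing has to be handled explicitly.
kept_filter = filter(:x => x -> coalesce(x > 1, false), df)

# subset takes the data frame first and passes whole columns to the function
# (hence ByRow for a row-wise predicate); skipmissing=true treats a missing
# result as false.
kept_subset = subset(df, :x => ByRow(x -> x > 1), skipmissing = true)
```

Both calls keep the rows where x is 2 and 4.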

Conclusions

Before I finish let me add one more comment. What if we have a function name,
like describe, that is not defined in Julia Base, but it is likely that
several packages might want to add methods to it? In this case we need to have
some umbrella package that only defines this function (possibly with a default
implementation). In the data science ecosystem in Julia we have two such
packages: DataAPI.jl and StatsAPI.jl.
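The umbrella pattern can be sketched with plain modules. All names below (ToyAPI, ToyFrames, ToySeries and their types) are hypothetical stand-ins for separate packages, not the actual contents of DataAPI.jl or StatsAPI.jl.

```julia
# Hypothetical umbrella "package": it only declares the function name.
module ToyAPI
function describe end
end

# Two hypothetical "packages" that add methods for types they own,
# so neither of them has to depend on the other.
module ToyFrames
import ..ToyAPI
struct Frame
    ncols::Int
end
ToyAPI.describe(f::Frame) = "Frame with $(f.ncols) columns"
end

module ToySeries
import ..ToyAPI
struct Series
    len::Int
end
ToyAPI.describe(s::Series) = "Series of length $(s.len)"
end

# The user calls one shared verb; dispatch picks the right method.
frame_summary = ToyAPI.describe(ToyFrames.Frame(3))
series_summary = ToyAPI.describe(ToySeries.Series(10))
```

Because both packages extend the same function object, a user can load them together without any name conflict.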