Author Archives: Blog by Bogumił Kamiński

New features in DataFrames.jl 1.3: part 2

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/12/17/selectors.html

Introduction

This post continues the presentation of new features added in DataFrames.jl 1.3.0.
Last week I discussed the changes that improve the performance of reduction
operations on wide data (e.g. taking an average of 10,000 columns). This week I
will focus on convenience improvements to the data transformation mini-language.

The post was written under Julia 1.7.0 and DataFrames.jl 1.3.0.

The data transformation mini-language

The select[!], transform[!], combine, and subset[!] functions in
DataFrames.jl accept specifications of column transformations using a so-called
data transformation mini-language. It has the general form:

[input column names] => [transformation function] => [output columns]
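
To make the general form concrete, here is a minimal sketch that spells out all three parts. The toy data frame and the output name :total2019 are made up for this illustration only.

```julia
using DataFrames

# hypothetical toy data, used only for this illustration
toy = DataFrame(name='A':'E', year2019=1:5)

# input column => transformation function => output column
res = combine(toy, :year2019 => sum => :total2019)
```

The result is a one-row data frame with a single column named total2019.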

A full specification of the allowed forms can be found in the DataFrames.jl
documentation. However, you might find it a bit technical. This is unfortunately
unavoidable, as the mini-language was designed to allow maximum flexibility, so
that packages like DataFramesMeta.jl or DataFrameMacros.jl can rely on it and
provide a nice user-facing syntax. Therefore, in an earlier post I presented
several introductory examples of its usage.

New features

One of the common advanced use-cases of the mini-language is performing the same
transformation on multiple columns of a data frame. Imagine that you have
the following data frame:

julia> using DataFrames

julia> df = DataFrame(name='A':'E', year2019=1:5, year2020=2:6, year2021=3:7)
5×4 DataFrame
 Row │ name  year2019  year2020  year2021
     │ Char  Int64     Int64     Int64
─────┼────────────────────────────────────
   1 │ A            1         2         3
   2 │ B            2         3         4
   3 │ C            3         4         5
   4 │ D            4         5         6
   5 │ E            5         6         7

Now assume that we want to calculate the sum of each of the columns :year2019,
:year2020, and :year2021. The simplest way to achieve this is the
following:

julia> combine(df, :year2019 => sum, :year2020 => sum, :year2021 => sum)
1×3 DataFrame
 Row │ year2019_sum  year2020_sum  year2021_sum
     │ Int64         Int64         Int64
─────┼──────────────────────────────────────────
   1 │           15            20            25

(Note that in the call I have omitted the output column name part, so
DataFrames.jl automatically generated column names consisting of the source
column name and the name of the transformation function that was applied to it.)
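
If the automatically generated names are not what you want, you can either give the output name explicitly or pass the renamecols=false keyword argument. The sketch below uses a made-up toy data frame and a made-up target name :total to compare the three variants.

```julia
using DataFrames

# hypothetical toy data, used only for this illustration
toy = DataFrame(year2019=1:5, year2020=2:6)

# default: the output name is generated from the source name and the function name
auto = combine(toy, :year2019 => sum)

# renamecols=false keeps the source column name
kept = combine(toy, :year2019 => sum; renamecols=false)

# explicit output column name (:total is an arbitrary choice)
named = combine(toy, :year2019 => sum => :total)
```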

However, you might consider the above call to the combine function a bit
redundant. You can write the same using broadcasting like this:

julia> combine(df, [:year2019, :year2020, :year2021] .=> sum)
1×3 DataFrame
 Row │ year2019_sum  year2020_sum  year2021_sum
     │ Int64         Int64         Int64
─────┼──────────────────────────────────────────
   1 │           15            20            25

Note how the [:year2019, :year2020, :year2021] .=> sum is being handled by
Julia before it is passed to the combine function:

julia> [:year2019, :year2020, :year2021] .=> sum
3-element Vector{Pair{Symbol, typeof(sum)}}:
 :year2019 => sum
 :year2020 => sum
 :year2021 => sum

Now you might ask, what if I did not have three columns to process but 100 of
them? It is easy to select their names using the names function. Here I show
you how to select all columns in the data frame except the :name column:

julia> names(df, Not(:name))
3-element Vector{String}:
 "year2019"
 "year2020"
 "year2021"
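
The names function accepts other column selectors as well. As a small sketch on the same kind of data, here are three alternative ways to pick the year columns: a regular expression, an element type, and a Between range.

```julia
using DataFrames

toy = DataFrame(name='A':'E', year2019=1:5, year2020=2:6, year2021=3:7)

# columns whose name matches a regular expression
by_regex = names(toy, r"^year")

# columns whose element type is a subtype of Integer
by_type = names(toy, Integer)

# columns located between two given columns (inclusive)
by_range = names(toy, Between(:year2019, :year2021))
```

All three calls return the same vector of column names as names(df, Not(:name)) does for this data.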

Therefore the call to combine above can be rewritten as:

julia> combine(df, names(df, Not(:name)) .=> sum)
1×3 DataFrame
 Row │ year2019_sum  year2020_sum  year2021_sum
     │ Int64         Int64         Int64
─────┼──────────────────────────────────────────
   1 │           15            20            25

This already looks quite powerful, but there is one annoying thing. Why do we
need to call the names function? It should be obvious that Not(:name)
applies to the df data frame. Let us check if this would work:

julia> combine(df, Not(:name) .=> sum)
1×3 DataFrame
 Row │ year2019_sum  year2020_sum  year2021_sum
     │ Int64         Int64         Int64
─────┼──────────────────────────────────────────
   1 │           15            20            25

Yes it does! And this is the new feature in DataFrames.jl 1.3 I wanted to talk
about today.

When the select[!], transform[!], combine, and subset[!] functions get any of
the selectors Not, Between, Cols, or All in a broadcasting expression, they are
now able to resolve them in the context of the data frame that they are
processing.
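
The same mechanism applies to the other selectors. Below is a minimal sketch, assuming DataFrames.jl 1.3 or later, that broadcasts Between, Cols (with a regular expression), and All on a toy data frame; the choice of reduction functions is arbitrary.

```julia
using DataFrames

toy = DataFrame(name='A':'E', year2019=1:5, year2020=2:6, year2021=3:7)

# Between is resolved against toy to the two columns it spans
r1 = combine(toy, Between(:year2019, :year2020) .=> sum)

# Cols with a regex picks all columns whose name starts with "year"
r2 = combine(toy, Cols(r"^year") .=> maximum)

# All() picks every column, including :name
r3 = combine(toy, All() .=> length)
```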

Let me give two more examples of this feature to show you how it works:

julia> combine(df, Not(:name) .=> [minimum maximum])
1×6 DataFrame
 Row │ year2019_minimum  year2020_minimum  year2021_minimum  year2019_maximum  year2020_maximum  year2021_maximum
     │ Int64             Int64             Int64             Int64             Int64             Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │                1                 2                 3                 5                 6                 7

julia> combine(df, Not(:name) .=> sum .=> Not(:name))
1×3 DataFrame
 Row │ year2019  year2020  year2021
     │ Int64     Int64     Int64
─────┼──────────────────────────────
   1 │       15        20        25

In the first one you can see that broadcasting is properly applied even in the
two-dimensional case (note that [minimum maximum] is a Matrix).

In the second example you can see that broadcasting is properly handled both
for source and for target column names.

Behind the scenes

The way things work is, in my opinion, intuitive and expected. However, let me
show you that they are not as easy as you might think. The reason is that
broadcasting is resolved before the data transformation mini-language
expression is passed to combine (or the other transformation functions I have
listed). Let us check how the expressions I have used above get resolved
before they are passed to combine:

julia> Not(:name) .=> sum
InvertedIndices.BroadcastedInvertedIndex(InvertedIndex{Symbol}(:name)) => sum

julia> Not(:name) .=> [minimum maximum]
1×2 Matrix{Pair{InvertedIndices.BroadcastedInvertedIndex}}:
 BroadcastedInvertedIndex(InvertedIndex{Symbol}(:name))=>minimum  BroadcastedInvertedIndex(InvertedIndex{Symbol}(:name))=>maximum

julia> Not(:name) .=> sum .=> Not(:name)
InvertedIndices.BroadcastedInvertedIndex(InvertedIndex{Symbol}(:name)) => (sum => InvertedIndices.BroadcastedInvertedIndex(InvertedIndex{Symbol}(:name)))

They look quite messy. What is the problem? None of these three data
transformation mini-language expressions includes df. Therefore, when Julia
executes the broadcasting operation it is unaware of the df context. The
workaround is to create a special BroadcastedInvertedIndex object (in the
case of the Not operation; for Cols, Between, and All a special wrapper object
is also created) that signals to combine that broadcasting was used on the
Not(:name) selector. Then combine uses its own internal broadcasting machinery,
which matches the Julia Base broadcasting rules, to resolve the expression
within the df context as required.
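
To illustrate the idea, here is a rough sketch of what this resolution amounts to for the simplest case. It is not the actual DataFrames.jl implementation; it only checks that the broadcasted pair carries a wrapper instead of column names, and that resolving the selector by hand with names gives the same result as letting combine do it.

```julia
using DataFrames

toy = DataFrame(name='A':'E', year2019=1:5, year2020=2:6, year2021=3:7)

# broadcasting happens here, without any knowledge of toy
p = Not(:name) .=> sum

# the source part is a wrapper object, not a vector of column names
wrapper_name = nameof(typeof(p.first))

# manual resolution in the context of toy; conceptually this is
# what combine has to arrive at before it can do the actual work
resolved = names(toy, Not(:name)) .=> p.second

same_result = combine(toy, p) == combine(toy, resolved)
```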

As you can see, things that seem simple end up quite complex. In particular
this means that DataFrames.jl must closely monitor changes in the Julia Base
broadcasting implementation to make sure it matches its rules.

Conclusions

I have two conclusions for today.

The first one is user-facing. In DataFrames.jl 1.3 we have added a
long-requested convenience functionality of broadcasting Not, Cols, Between,
and All calls in the data transformation mini-language within the context of
the data frame that they apply to. Therefore, hopefully, our users will be
happier now.

The second is for DataFrames.jl maintenance. Some users might have noticed
that JuliaData members always ask for a strong justification before new
features are added. The reason is twofold. Firstly, having ever more
features makes learning DataFrames.jl harder. Secondly, as you can see in
the example given today, adding new features makes the code base of
DataFrames.jl quite complex and implicitly strongly linked to Julia Base
design. This means that it becomes increasingly hard for new contributors to
get involved in the package development (and we would love to see more of them,
so we prefer to keep things simple if possible).

Finally, in the coming weeks I will continue the discussion of the new features
in DataFrames.jl 1.3, so stay tuned.

New features in DataFrames.jl 1.3: part 1

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/12/10/rowreduce.html

Introduction

A few days ago we released DataFrames.jl 1.3.0.
In the coming weeks I will discuss the new features introduced in this release.
Each post will be devoted to a single topic.

Today I start with performance improvement of aggregation of rows of a data
frame since recently a related interesting question was asked on Slack.

The post was written under Julia 1.7.0 and DataFrames.jl 1.3.0.

A typical scenario of row aggregation

Let us start with a square data frame that has 10,000 rows and columns.
In the following examples I will omit printing the computed output to reduce
the length of the post.

julia> using DataFrames

julia> df = DataFrame(rand(10_000, 10_000), :auto);

Assume you want to compute the sum of values in each row of the data frame.

Here is a simple, but inefficient way to do it (I will run the @time macro
twice to show the times including compilation and after compilation):

julia> @time sum.(eachrow(df));
 29.653202 seconds (784.98 M allocations: 13.199 GiB, 5.93% gc time, 0.42% compilation time)

julia> @time sum.(eachrow(df));
 30.152483 seconds (784.69 M allocations: 13.183 GiB, 5.69% gc time)

Now let us use the select function to achieve the same:

julia> @time select(df, AsTable(:) => ByRow(sum));
  1.424650 seconds (4.62 M allocations: 274.349 MiB, 4.04% gc time, 96.05% compilation time)

julia> @time select(df, AsTable(:) => ByRow(sum));
  0.064406 seconds (19.66 k allocations: 1.140 MiB)

The performance improvement is significant.

Here is a list of functions for
which the fast path of aggregation is implemented for the form
AsTable(cols) => fun [=> destination] of the DataFrames.jl mini-language:

  • sum, ByRow(sum), ByRow(sum∘skipmissing);
  • length, ByRow(length), ByRow(length∘skipmissing);
  • mean, ByRow(mean), ByRow(mean∘skipmissing);
  • ByRow(var), ByRow(var∘skipmissing);
  • ByRow(std), ByRow(std∘skipmissing);
  • ByRow(median), ByRow(median∘skipmissing);
  • minimum, ByRow(minimum), ByRow(minimum∘skipmissing);
  • maximum, ByRow(maximum), ByRow(maximum∘skipmissing);
  • f∘collect and ByRow(f∘collect), where f is any function.

You might be curious about the last form. The optimization is that if you have
any function f that takes a vector and reduces it, it will be efficiently
executed when it is composed with collect. Let us use the extrema function
as an example:

julia> @time extrema.(eachrow(df));
 30.513559 seconds (884.87 M allocations: 14.683 GiB, 6.22% gc time, 0.27% compilation time)

julia> @time extrema.(eachrow(df));
 30.887099 seconds (884.69 M allocations: 14.673 GiB, 6.25% gc time)

julia> @time select(df, AsTable(:) => ByRow(extrema∘collect));
  1.984904 seconds (925.93 k allocations: 812.528 MiB, 1.19% gc time, 18.85% compilation time)

julia> @time select(df, AsTable(:) => ByRow(extrema∘collect));
  1.515307 seconds (49.17 k allocations: 764.989 MiB, 0.78% gc time)

Note that the collect part is important. If we had not used it the result
would be as follows:

julia> @time select(df, AsTable(:) => ByRow(extrema));
 58.041291 seconds (297.92 M allocations: 9.731 GiB, 0.95% gc time, 78.49% compilation time)

julia> @time select(df, AsTable(:) => ByRow(extrema));
 12.495241 seconds (295.00 M allocations: 9.617 GiB, 4.51% gc time)

What is the difference? With collect we are processing a vector, while without it
a NamedTuple is passed to extrema. The result gets computed, but, as you
can see in the output of @time, the compilation time for the first call is huge,
and even after compilation the collect version is faster.
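
Since the examples above omit the computed output, here is a minimal sketch on a tiny, made-up data frame that shows what these row reductions return. The destination names :rowsum and :ext are arbitrary choices for this illustration.

```julia
using DataFrames

# hypothetical toy data; small enough to verify by hand
toy = DataFrame(a=[1, 2], b=[3, 4], c=[5, 6])

# row-wise sum using the AsTable(cols) => ByRow(sum) form
sums = select(toy, AsTable(:) => ByRow(sum) => :rowsum)

# custom reduction composed with collect; each row is reduced as a vector
exts = select(toy, AsTable(:) => ByRow(extrema∘collect) => :ext)
```

Here sums.rowsum holds one sum per row, and exts.ext holds one (minimum, maximum) tuple per row.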

A use-case from practice

The task recently asked about on Slack is the following. We again have a data
frame that has 10,000 rows and columns, but this time 50% of its values are
missing and randomly scattered. What we want to do is to fill the missing values
in each row with the row mean of the non-missing values.

Let us first generate the data frame (I start a fresh session again):

julia> using DataFrames

julia> using Statistics

julia> df = DataFrame(rand([1.0, missing], 10_000, 10_000), :auto) .* (1:10_000);

I first show you how I would have done this operation before the DataFrames.jl
1.3 release if I wanted to avoid excessive memory allocation. First I make a
vector of columns of this data frame without copying them:

julia> cols = identity.(eachcol(df));

Note that I broadcast identity to make the element type of the cols vector concrete.
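
A minimal sketch of what this achieves, on a made-up two-column data frame: the element type of eachcol is abstract, broadcasting identity narrows it to a concrete vector type, and the columns themselves are not copied.

```julia
using DataFrames

# hypothetical toy data; both columns have the same concrete type
toy = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# eachcol has an abstract element type, which hinders type inference
abstract_eltype = !isconcretetype(eltype(eachcol(toy)))

# broadcasting identity collects the columns into a vector
# whose element type is narrowed to a concrete one
cols = identity.(eachcol(toy))
concrete_eltype = isconcretetype(eltype(cols))

# the vectors stored in cols are the original columns, not copies
no_copy = cols[1] === toy.a
```

Note that the narrowing to a concrete type relies on all columns having the same type, as is the case in the example from this post.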

Now we compute a vector of fill values for each row:

julia> @time fill_vals = [(mean(skipmissing(v[i] for v in cols))) for i in 1:nrow(df)];
  2.167842 seconds (178.27 k allocations: 8.515 MiB, 5.18% compilation time)

julia> @time fill_vals = [(mean(skipmissing(v[i] for v in cols))) for i in 1:nrow(df)];
  2.212677 seconds (160.38 k allocations: 7.513 MiB, 4.38% compilation time)

As you can see this step is quite fast. If we skipped the identity call things
would be slower:

julia> @time fill_vals = [(mean∘skipmissing)(v[i] for v in eachcol(df)) for i in 1:nrow(df)];
 16.641022 seconds (395.08 M allocations: 6.638 GiB, 6.65% gc time, 0.77% compilation time)

julia> @time fill_vals = [(mean∘skipmissing)(v[i] for v in eachcol(df)) for i in 1:nrow(df)];
 15.949596 seconds (395.07 M allocations: 6.637 GiB, 5.16% gc time, 0.78% compilation time)

If we just iterated over rows, things would be even slower:

julia> @time fill_vals = (mean∘skipmissing).(eachrow(df));
 25.953399 seconds (634.99 M allocations: 10.218 GiB, 5.18% gc time, 0.51% compilation time)

julia> @time fill_vals = (mean∘skipmissing).(eachrow(df));
 25.739528 seconds (634.72 M allocations: 10.203 GiB, 4.91% gc time)

Now let us check how fast the select machinery we have just learned works:

julia> @time fill_vals = select(df, AsTable(:) => ByRow(mean∘skipmissing) => :fill_vals).fill_vals;
  1.773402 seconds (4.42 M allocations: 260.296 MiB, 3.73% gc time, 69.85% compilation time)

julia> @time fill_vals = select(df, AsTable(:) => ByRow(mean∘skipmissing) => :fill_vals).fill_vals;
  0.549109 seconds (19.66 k allocations: 1.217 MiB)

As a final step let us check the performance if we kept the data in a matrix instead
(I am not counting the cost of conversion of a data frame to a matrix here):

julia> mat = Matrix(df);

julia> @time (mean∘skipmissing).(eachrow(mat));
  1.581645 seconds (7 allocations: 468.906 KiB)

julia> @time (mean∘skipmissing).(eachrow(mat));
  1.547717 seconds (7 allocations: 468.906 KiB)

julia> mat2 = permutedims(mat);

julia> @time (mean∘skipmissing).(eachcol(mat2));
  0.689801 seconds (344.41 k allocations: 19.172 MiB, 19.95% compilation time)

julia> @time (mean∘skipmissing).(eachcol(mat2));
  0.553592 seconds (7 allocations: 468.906 KiB)

As you can see the performance of select is comparable to the performance
on a matrix when we work on columns (which means we perform the operation using
the fact that Julia uses column major storage of matrices).

To conclude the task we produce a new table with imputed values:

julia> @time coalesce.(df, fill_vals);
  0.989387 seconds (370.37 k allocations: 780.629 MiB, 5.10% gc time, 14.19% compilation time)

julia> @time coalesce.(df, fill_vals);
  0.842541 seconds (149.03 k allocations: 768.343 MiB, 4.01% gc time)
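
To see what the two steps produce, here is the same workflow as a minimal sketch on a tiny, made-up data frame where the result can be verified by hand.

```julia
using DataFrames
using Statistics

# hypothetical toy data with one missing value in each row
toy = DataFrame(x1=[1.0, missing], x2=[3.0, 4.0], x3=[missing, 6.0])

# step 1: the mean of non-missing values in each row
fill_vals = select(toy, AsTable(:) => ByRow(mean∘skipmissing) => :fill_vals).fill_vals

# step 2: broadcasting aligns fill_vals with rows, so each missing
# value is replaced by the fill value of its own row
filled = coalesce.(toy, fill_vals)
```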

Before I finish, let me comment on how you could fill missing values in
columns with the means of the columns:

julia> @time select(df, names(df) .=> (x -> coalesce.(x, mean(skipmissing(x)))), renamecols=false);
  1.192647 seconds (1.89 M allocations: 861.187 MiB, 7.53% gc time, 40.42% compilation time)

julia> @time select(df, names(df) .=> (x -> coalesce.(x, mean(skipmissing(x)))), renamecols=false);
  0.971668 seconds (1.29 M allocations: 826.755 MiB, 1.91% gc time, 21.57% compilation time)
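
Again, a minimal sketch of the same call on a tiny, made-up data frame shows the effect:

```julia
using DataFrames
using Statistics

# hypothetical toy data with one missing value in each column
toy = DataFrame(a=[1.0, missing, 3.0], b=[missing, 4.0, 6.0])

# each column is transformed independently; renamecols=false keeps the names
filled = select(toy,
                names(toy) .=> (x -> coalesce.(x, mean(skipmissing(x))));
                renamecols=false)
```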

Next week I will discuss how you could replace the names(df) selector in the
last expression with the All() selector (and how this is implemented).

Conclusions

In this post you have learned how to perform fast reductions over rows of
a data frame by using the new features of the mini-language implementation.
I have shown you that the implementation we have is quite efficient. Also since
we support the f∘collect composition it is quite extensible to any custom
reduction function that accepts vectors.

Choosing how to store your strings

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/12/03/strings.html

Introduction

When you work with strings in Julia you have several options for how to store
them. In this post I discuss the most common usage scenarios and the recommended
choices.

The string storage decision tree

The choices I discuss below are related to performance and memory consumption.
Therefore in what follows I assume you work with large data sets relative to
available RAM on your machine.

Let me start with the decision flowchart and then explain it:

[Figure: String decision guideline flowchart]

The first decision you need to make is whether you want only performance
optimization or you want your strings to be treated as ordered or unordered
categorical data in the statistical sense. If you need your data to be
categorical then the choice is simple. The only option is to use the
CategoricalArrays.jl package, where the underlying data can be stored as String
values.

On the other hand, if your goal is only performance and saving memory then the
first question is whether the number of unique string values in your data is
low. If this is the case then the recommended package is PooledArrays.jl, where
again you should be fine with storing String values.
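
As a minimal sketch of the pooling idea, assuming the PooledArrays.jl package is installed, the example below stores 3,000 labels that take only three distinct values. The labels themselves are made up.

```julia
using PooledArrays

# hypothetical data: many entries, but only three unique values
labels = repeat(["low", "medium", "high"], 1000)

# each unique string is stored once in the pool;
# the array itself keeps only small integer references to it
pooled = PooledArray(labels)

n_unique = length(pooled.pool)
```

The pooled array behaves like a regular vector of strings, so pooled[2] returns "medium" as it would for labels.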

We are down to the scenario when you have a lot of strings that have very many
unique values. In such a case the question is whether all of these strings are
relatively short and have a similar size. If this is the case then you can use
the InlineStrings.jl package. It provides several types called String1,
String3, String7, String15 etc., where the number indicates the maximum string
size in bytes that a given type can store. The benefit of these values is that
they are not heap allocated. This means that they are fast to work with and they
do not burden the Julia Garbage Collector.

Finally we are left with many strings, that have many unique values and that
have varying and possibly large size. In this case what Julia Base offers is a
sensible choice. Normally you should just use the String type stored in
standard collections like Vector. However, there is one special case when
you could consider using the Symbol type instead of a String type.
You can use Symbol instead of String if all the following conditions are
met:

  • your strings are just labels that you only need to compare against each other;
    in particular it assumes that you do not need to perform any transformations
    on them; the reason is that Symbol is not an AbstractString so it cannot
    be passed to functions that only accept strings (as a benefit comparing
    Symbols is faster than comparing Strings);
  • you are OK with the fact that once a Symbol is created the memory it uses up
    will never be reclaimed by the Julia Garbage Collector until the end of the
    session (however, the benefit is that if you have several identical Symbols
    they share the same memory).
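
The two conditions above can be illustrated with a short sketch that uses only Julia Base. The label text is made up.

```julia
# two Symbols created from the same text are the very same object
a = Symbol("label_1")
b = Symbol("label_1")
same_object = a === b

# a Symbol is not an AbstractString, so functions that
# only accept strings will not take it directly
is_string = a isa AbstractString

# to transform the label you need to convert it to a String first
upper = uppercase(String(a))
```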

Conclusions

Choosing an appropriate type to store your strings is often quite a hard
decision. I hope that after reading this post you have a better overview of
the available options and when each of them is appropriate.

It is also recommended to convert the data to an appropriate format immediately
when you read it in. Therefore, I recommend that you check out the
documentation of the CSV.jl package to learn how to specify what you
want to get when reading CSV files (the most important keyword arguments
for handling these choices are pool and stringtype).