Tag Archives: julialang

The order of join and grouping result in DataFrames.jl

Re-posted from: https://bkamins.github.io/julialang/2021/04/30/roworder.html

Introduction

Today I want to focus on an issue that users often do not notice when
working with DataFrames.jl, but that might be relevant in some cases.

The subject is the row order of the results of join and grouping operations in
DataFrames.jl. The key point of the post is that this order depends on several
factors, so it is simplest to assume that it is undefined.
I am not going to list all cases in my examples, but just focus on showing
that the order is not fixed, as it might change in the future.

In this post I am using Julia 1.6.1 and DataFrames.jl 1.0.1.

Joins

Consider the following example of innerjoin:

julia> using DataFrames

julia> df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
4×2 DataFrame
 Row │ x      id1
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     3      2
   3 │     1      3
   4 │     4      4

julia> df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)
5×2 DataFrame
 Row │ x      id2
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     3      2
   3 │     2      3
   4 │     5      4
   5 │     6      5

julia> innerjoin(df1, df2, on=:x)
3×3 DataFrame
 Row │ x      id1    id2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      1
   2 │     3      2      2
   3 │     2      1      3

julia> innerjoin(df2, df1, on=:x)
3×3 DataFrame
 Row │ x      id2    id1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      3
   2 │     3      2      2
   3 │     2      3      1

As you can see, currently the row order in the result of innerjoin is taken
from the longer table (here df2), whichever position it is passed in.
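
If your code needs a specific row order after a join, one option is to impose it explicitly instead of relying on the current behavior. The sketch below reuses df1 and df2 from the example above; the names by_key and by_left are only illustrative.

```julia
using DataFrames

df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)

# Option 1: order the result by the join key.
by_key = sort(innerjoin(df1, df2, on=:x), :x)

# Option 2: keep the row order of df1 by sorting on its row identifier id1.
by_left = sort(innerjoin(df1, df2, on=:x), :id1)
```

Sorting adds some cost, so it is worth doing only when the order actually matters downstream.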

Now consider outerjoin (similar results are for leftjoin and rightjoin):

julia> outerjoin(df1, df2, on=:x)
6×3 DataFrame
 Row │ x      id1      id2
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        3        1
   2 │     3        2        2
   3 │     2        1        3
   4 │     4        4  missing
   5 │     5  missing        4
   6 │     6  missing        5

julia> outerjoin(df2, df1, on=:x)
6×3 DataFrame
 Row │ x      id2      id1
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        3
   2 │     3        2        2
   3 │     2        3        1
   4 │     5        4  missing
   5 │     6        5  missing
   6 │     4  missing        4

Now we have the following parts of the table: first comes the chunk
matching what innerjoin produces, then we have a non-matching part from the
left table, and finally we have a non-matching part from the right table.
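
To check that two join results contain the same rows regardless of their order, you can normalize the column order and the row order before comparing. This is a minimal sketch using the same df1 and df2; the variable names are only illustrative.

```julia
using DataFrames

df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)

# Bring both results to the same column order and row order before comparing.
a = sort(select(outerjoin(df1, df2, on=:x), :x, :id1, :id2), :x)
b = sort(select(outerjoin(df2, df1, on=:x), :x, :id1, :id2), :x)

# isequal treats missing values as equal to each other, unlike ==.
same_rows = isequal(a, b)
```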

Before the 1.0 release we did not guarantee the row order in joins either, but
the actual order has changed in DataFrames.jl 1.0. The reason was performance.
Consider the following examples of joins and their timing:

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.246627 seconds (176 allocations: 13.797 KiB)

julia> @time innerjoin(df2, df1, on=:x);
  0.237981 seconds (175 allocations: 13.781 KiB)

(I am showing you the timings after compilation; I use Vector{String} to join
on as this case is the slowest scenario under DataFrames.jl 1.0).
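
A minimal sketch of how timings after compilation can be taken: run each call once to trigger compilation, then measure a later call. The data here is much smaller than in the post so that the sketch runs quickly, and the names big and small are only illustrative.

```julia
using DataFrames

# Smaller inputs than in the post, so that the sketch runs in a moment.
big = DataFrame(x=string.(1:10^5))
small = DataFrame(x=string.(1:10))

# The first call of each method includes compilation time, so it is discarded.
innerjoin(big, small, on=:x)
innerjoin(small, big, on=:x)

# Later calls measure the join itself (time in seconds).
t1 = @elapsed innerjoin(big, small, on=:x)
t2 = @elapsed innerjoin(small, big, on=:x)
```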

Now switch to DataFrames.jl 0.22.7 for a while (you need a fresh session and a
fresh project environment to test this; timings are again after compilation):

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.350317 seconds (177 allocations: 152.602 MiB)

julia> @time innerjoin(df2, df1, on=:x);
  1.140921 seconds (183 allocations: 662.071 MiB)

As you can see, the current algorithm not only uses much less memory, but is
also faster in general and is not affected by the argument order (the latter
was a major bane of joins before the 1.0 release of DataFrames.jl).

For reference, check what data.table in R offers in this case in terms of
performance (I am adding it as the performance against data.table is a hot
topic recently):

> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> dt1 <- data.table(x=as.character(1:10^7))
> dt2 <- data.table(x=as.character(1:10))
> system.time(merge(dt1, dt2, all=FALSE))
   user  system elapsed
  7.445   0.153   3.544
> system.time(merge(dt1, dt2, all=FALSE, sort=FALSE))
   user  system elapsed
  6.735   0.128   2.827

(note that I have used non-pooled vectors in both cases, as this was the scenario
that allowed me to compare DataFrames.jl 1.0 and 0.22.7 best; clearly if we
joined on pooled vectors the timings would be much better)

Grouping

In the groupby operation the rules of ordering of the GroupedDataFrame object
depend on the type of the column you group on (I am assuming you are not passing
the sort=true keyword argument, as then groups are sorted). The two cases are:

if you group on columns that are pooled (like PooledVector or
CategoricalVector) and the number of possible groups is not huge
then you get your result in the order of levels in the pool;
otherwise the group ordering is their order of appearance in the source vector.

A particular corner case here is integer columns, which are treated as
pooled (so the groups are sorted), unless the range of the integers is huge
(as then we fall back to the order of appearance). Here is an example:

julia> df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     3    300
   2 │     1      1
   3 │     2      2

julia> keys(groupby(df, :x))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (x = 1,)
 GroupKey: (x = 2,)
 GroupKey: (x = 3,)

julia> keys(groupby(df, :y))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (y = 300,)
 GroupKey: (y = 1,)
 GroupKey: (y = 2,)

What is considered to be huge is left undefined (as it might change in the
future), but in general if there are fewer levels than rows in
the data frame (this is a typical case in practice) then we do not consider
it a huge number.
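
If your code depends on the order of groups, you can ask for it explicitly rather than rely on the rules above. The sketch below uses the sort=true keyword argument mentioned earlier and, as an alternative, sorts the aggregated result; the variable names are only illustrative.

```julia
using DataFrames

df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])

# With sort=true the groups are ordered by the grouping key,
# whichever internal grouping algorithm is used.
gdf = groupby(df, :y, sort=true)
group_values = [key.y for key in keys(gdf)]

# Alternatively, sort the aggregated result after combine.
counts = sort(combine(groupby(df, :y), nrow), :y)
```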

Again – let us do some benchmarking against the 0.22.7 release of DataFrames.jl.
First the results for DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.055407 seconds (64 allocations: 85.834 MiB)

julia> @time groupby(df, :y);
  0.895716 seconds (50 allocations: 280.591 MiB)

and now under 0.22.7 release:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.890674 seconds (31 allocations: 280.590 MiB)

julia> @time groupby(df, :y);
  0.884177 seconds (31 allocations: 280.590 MiB)

As you can see, in the case of grouping integer columns we are much faster
than before if the integer range is not huge.

Let us compare with data.table again (we also need to perform some
aggregation to compare apples to apples in terms of timing).

First DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time combine(groupby(df, :x), nrow);
  0.180592 seconds (261 allocations: 324.266 MiB)

julia> @time combine(groupby(df, :y), nrow);
  1.006619 seconds (247 allocations: 519.023 MiB)

And now data.table:

> df <- data.table(x=1:(10^7+1), y=c(1:10^7, 10^10))
> system.time(df[, .N, by = x])
   user  system elapsed
  0.644   0.088   0.266
> system.time(df[, .N, by = y])
   user  system elapsed
  0.991   0.096   0.404

This time DataFrames.jl is slower for the huge range (note that data.table
is using four threads, which is great, and I tested my code on a single thread
in Julia, as in DataFrames.jl we do not support multi-threading in this
particular case yet).

Conclusions

In summary: although there are precise rules that determine the order of join
and grouping results, it is simplest to assume that it is undefined (like in
databases). The reason for this is performance (so the
rules are complex and might change in the future).

However, based on user feedback, we might in the future consider adding some
keyword arguments to joins or groupby that would guarantee some particular
order. Therefore if you have any thoughts on it please open an issue in
the DataFrames.jl repository on GitHub.

Some comments on DataFrames 1.0 release

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/04/24/dataframes.html

Introduction

DataFrames.jl has just got a 1.0 release.
The major question we should answer following this is:

What consequences for users does this have?

The answer is pretty boring (but significant): you can expect that we will
not introduce any breaking changes until the 2.0 release. The point is that we judge
that the package is mature enough that the 2.0 release will not happen soon.
In consequence it is safe to use DataFrames.jl in production code that is expected
not to be updated over longer periods.

The other aspect of calling DataFrames.jl mature enough is that we believe that
the package and its ecosystem have a good performance, so you should be able
to safely use it in your data science project without getting sudden timing
hiccups.

In consequence in this post I want to cover two things. First is what is
deprecated in DataFrames.jl 1.0 release (which means it might get removed or
changed in some 1.x release). The second is some simple performance comparisons.

This post was written under Julia 1.6 and DataFrames.jl 1.0.1.

Deprecations

`indicator` keyword argument in joins

We have a new functionality of vcat in DataFrames.jl 1.0.
Start with an example:

julia> using DataFrames

julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df2 = DataFrame(a=4:6)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     4
   2 │     5
   3 │     6

julia> vcat(df1, df2, source=:source)
6×2 DataFrame
 Row │ a      source
     │ Int64  Int64
─────┼───────────────
   1 │     1       1
   2 │     2       1
   3 │     3       1
   4 │     4       2
   5 │     5       2
   6 │     6       2

julia> vcat(df1, df2, source=:source => ["df1", "df2"])
6×2 DataFrame
 Row │ a      source
     │ Int64  String
─────┼───────────────
   1 │     1  df1
   2 │     2  df1
   3 │     3  df1
   4 │     4  df2
   5 │     5  df2
   6 │     6  df2

As you can see you can pass source keyword argument to get an identifier of
source data frame for every row in the resulting data frame.

We have a similar functionality for joins. Here is an example:

julia> outerjoin(df1, df2, on=:a, source=:source)
6×2 DataFrame
 Row │ a      source
     │ Int64  String
─────┼───────────────────
   1 │     1  left_only
   2 │     2  left_only
   3 │     3  left_only
   4 │     4  right_only
   5 │     5  right_only
   6 │     6  right_only

As you can see, the keyword argument used here is source. In the past this
keyword argument was called indicator, which was not very discoverable and now
would not be consistent with vcat either. Therefore the indicator keyword
argument is deprecated in favor of source. It will only be removed in the 2.0 release,
as keeping both keyword arguments is mostly harmless (so you have a lot of time
to clean up your code).
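
As a small migration sketch: a join that used to be written with indicator=:origin is written with source=:origin under 1.0. The column name origin and the two data frames below are only illustrative.

```julia
using DataFrames

left = DataFrame(a=1:3)
right = DataFrame(a=3:5)

# The 1.0 spelling of the keyword argument is source (formerly indicator).
joined = sort(outerjoin(left, right, on=:a, source=:origin), :a)

# Keep only the rows that had no match in the right table.
left_only = filter(:origin => ==("left_only"), joined)
```

The source column takes the values "left_only", "right_only" and "both", so it can be used directly for this kind of filtering.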

Broadcasting assignment behavior

This is a super tricky deprecation as it is currently not printed
(because it cannot be under Julia 1.6).

Here is an example of deprecated functionality:

~$ julia --depwarn=yes
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0 (2021-03-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.a .= 'x'
3-element Vector{Int64}:
 120
 120
 120

julia> df
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │   120
   2 │   120
   3 │   120

In short, as you can see, the df.col .= value syntax works in-place under Julia 1.6.
In another post I have discussed in detail why we decided to change it.
One of the major reasons is to allow the following operation, which currently
throws an error:

julia> df.b .= 1
ERROR: ArgumentError: column name :b not found in the data frame; existing most similar names are: :a

Let us switch to Julia nightly for a second:

$ ./julia --depwarn=yes
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.0-DEV.999 (2021-04-24)
 _/ |\__'_|_|_|\__'_|  |  Commit 1474566ffc* (0 days old master)
|__/                   |


julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.a .= 'x'
┌ Warning: In the future this operation will allocate a new column instead of performing an in-place assignment.
│   caller = top-level scope at REPL[3]:1
└ @ Core REPL[3]:1
3-element Vector{Int64}:
 120
 120
 120

julia> df.b .= 1
3-element Vector{Int64}:
 1
 1
 1

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   120      1
   2 │   120      1
   3 │   120      1

As you can see df.a .= 'x' is deprecated but df.b .= 1 is allowed.

The situation is very unfortunate, as we are unable to print deprecation
warnings under Julia 1.6. There are two deprecated cases:

df.a .= value into an existing column of a DataFrame currently is in-place,
and will replace a column in the future;
dfv.a .= value will be disallowed if dfv is a SubDataFrame.

The good thing is that the first change (replace instead of in-place) will
affect almost no user workflows (except for the case I have given above, when
by doing df.a .= 'x' you got 120 in the column; in the future you will get
'x', as most likely you wanted). The second change (error for SubDataFrame)
will be easy to fix, as it will throw an error.

In the future (when Julia 1.7 is commonly used), either in 1.x or 2.0 release
of DataFrames.jl these deprecations will be turned into the target functionality.
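
Until then, if you want to be explicit about which behavior you get, you can use the indexing syntax, which does not depend on the Julia version. This is a sketch of the two variants; the variable names inplace_result and replaced_eltype are only illustrative.

```julia
using DataFrames

df = DataFrame(a=1:3)

# df[:, :a] .= value writes into the existing column, so 'x' is converted to Int64.
df[:, :a] .= 'x'
inplace_result = copy(df.a)

# df[!, :a] .= value allocates a new column, so the element type can change.
df[!, :a] .= 'x'
replaced_eltype = eltype(df.a)

# df[!, :b] .= value also creates a column that did not exist before.
df[!, :b] .= 1
```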

Performance

To show performance changes I decided to look for some examples of data.table
performance tests prepared by Jan Gorecki (Jan is a data.table guru and
data.table is a gold standard for in-memory data wrangling) that I had not
checked earlier when developing the 1.0 release (to have a fair comparison).

For this I have found this post that gave a link to these two Stack
Overflow questions answered by Jan using data.table: question 1 and
question 2 (one is for split-apply-combine, and the other is for joins).
I have changed the size of the data a bit to make a single run of the test
on data.table be around 1 second. Both for data.table and DataFrames.jl
I use 4 threads.
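
On the Julia side the number of threads is fixed when the session starts. A minimal sketch of how to start Julia with four threads and confirm the setting:

```julia
# Start Julia with four threads from the shell, for example:
#   julia --threads 4
# (or set the JULIA_NUM_THREADS environment variable before starting Julia).

# Inside the session, confirm how many threads are available.
println("Julia threads: ", Threads.nthreads())
```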

I decided to rewrite these two examples under DataFrames.jl and compare timings
of:

DataFrames.jl 1.0.1
data.table 1.14.0 (to see a competitive comparison)
DataFrames.jl 0.22.7 (to see the difference in the performance)

(we are on R 4.0.5 and back on Julia 1.6)

Below I share reproducible details but the short conclusion is:

we have improved;
we are competitive with data.table (at least on the tests that Jan Gorecki
used in the Stack Overflow examples).

(still we have a lot of work to do to improve performance post 1.0 release)

Split-apply-combine tests

We start with data.table timing:

> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(microbenchmark)
> n = 5e7
> k = 5e5
> x = runif(n)
> grp = sample(k, n, TRUE)
> dt = setnames(setDT(list(x, grp)), c("x","grp"))
> microbenchmark(dt[, .(sum(x), .N), grp], times=10)
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max
 dt[, .(sum(x), .N), grp] 928.9709 976.8374 1087.029 1095.204 1199.951 1220.086
 neval
    10
>

Now DataFrames.jl 1.0.1:

julia> using DataFrames

julia> using BenchmarkTools

julia> n = 50_000_000;

julia> k = 500_000;

julia> x = rand(n);

julia> grp = rand(1:k, n);

julia> df = DataFrame(x=x, grp=grp);

julia> @benchmark combine(groupby($df, :grp), :x => sum, nrow)
BenchmarkTools.Trial:
  memory estimate:  397.73 MiB
  allocs estimate:  634
  --------------
  minimum time:     593.295 ms (3.19% GC)
  median time:      609.667 ms (2.59% GC)
  mean time:        622.208 ms (2.04% GC)
  maximum time:     668.141 ms (6.77% GC)
  --------------
  samples:          9
  evals/sample:     1

Finally DataFrames.jl 0.22.7:

julia> using DataFrames

julia> using BenchmarkTools

julia> n = 50_000_000;

julia> k = 500_000;

julia> x = rand(n);

julia> grp = rand(1:k, n);

julia> df = DataFrame(x=x, grp=grp);

julia> @benchmark combine(groupby($df, :grp), :x => sum, nrow)
BenchmarkTools.Trial:
  memory estimate:  1.26 GiB
  allocs estimate:  225
  --------------
  minimum time:     7.936 s (0.00% GC)
  median time:      7.936 s (0.00% GC)
  mean time:        7.936 s (0.00% GC)
  maximum time:     7.936 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

Join tests

First goes data.table:

> library(microbenchmark)
> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> n = 5e6
> df1 = data.frame(x=sample(n,n), y1=rnorm(n))
> df2 = data.frame(x=sample(n,n), y2=rnorm(n))
> dt1 = as.data.table(df1)
> dt2 = as.data.table(df2)
> microbenchmark(dt1[dt2, nomatch=NULL, on = "x"], times=10)
Unit: milliseconds
                               expr      min       lq    mean   median       uq
 dt1[dt2, nomatch = NULL, on = "x"] 893.9817 905.4534 924.142 914.2142 940.4294
      max neval
 993.6209    10
> microbenchmark(dt2[dt1, on = "x"], times=10)
Unit: milliseconds
               expr      min      lq     mean   median      uq      max neval
 dt2[dt1, on = "x"] 845.7094 884.172 885.5079 889.8894 891.272 903.6629    10
> microbenchmark(dt1[dt2, on = "x"], times=10)
Unit: milliseconds
               expr      min      lq     mean   median       uq      max neval
 dt1[dt2, on = "x"] 848.3348 874.032 878.4387 880.2858 885.1598 894.9579    10
> microbenchmark(merge(dt1, dt2, by = "x", all = TRUE), times=10)
Unit: seconds
                                  expr      min       lq     mean   median
 merge(dt1, dt2, by = "x", all = TRUE) 1.924328 1.931986 1.957051 1.954956
       uq      max neval
 1.981632 2.002862    10

Now DataFrames.jl 1.0.1:

julia> using DataFrames, Random

julia> using BenchmarkTools

julia> n = 5_000_000;

julia> df1 = DataFrame(x=shuffle(1:n), y1=randn(n));

julia> df2 = DataFrame(x=shuffle(1:n), y2=randn(n));

julia> @benchmark innerjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  228.90 MiB
  allocs estimate:  248
  --------------
  minimum time:     957.133 ms (0.00% GC)
  median time:      974.750 ms (0.13% GC)
  mean time:        975.144 ms (0.65% GC)
  maximum time:     992.020 ms (2.63% GC)
  --------------
  samples:          6
  evals/sample:     1

julia> @benchmark leftjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  234.26 MiB
  allocs estimate:  312
  --------------
  minimum time:     1.061 s (0.00% GC)
  median time:      1.071 s (0.00% GC)
  mean time:        1.073 s (0.41% GC)
  maximum time:     1.084 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1

julia> @benchmark rightjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  234.26 MiB
  allocs estimate:  312
  --------------
  minimum time:     980.800 ms (0.00% GC)
  median time:      1.003 s (0.36% GC)
  mean time:        1.001 s (0.80% GC)
  maximum time:     1.013 s (2.63% GC)
  --------------
  samples:          6
  evals/sample:     1

julia> @benchmark outerjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  239.63 MiB
  allocs estimate:  332
  --------------
  minimum time:     1.081 s (0.00% GC)
  median time:      1.096 s (0.00% GC)
  mean time:        1.097 s (0.41% GC)
  maximum time:     1.106 s (0.66% GC)
  --------------
  samples:          5
  evals/sample:     1

Finally DataFrames.jl 0.22.7:

julia> using DataFrames, Random

julia> using BenchmarkTools

julia> n = 5_000_000;

julia> df1 = DataFrame(x=shuffle(1:n), y1=randn(n));

julia> df2 = DataFrame(x=shuffle(1:n), y2=randn(n));

julia> @benchmark innerjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  674.36 MiB
  allocs estimate:  218
  --------------
  minimum time:     4.788 s (0.55% GC)
  median time:      4.795 s (0.44% GC)
  mean time:        4.795 s (0.44% GC)
  maximum time:     4.803 s (0.32% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> @benchmark leftjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  755.43 MiB
  allocs estimate:  223
  --------------
  minimum time:     4.975 s (0.36% GC)
  median time:      4.976 s (0.27% GC)
  mean time:        4.976 s (0.27% GC)
  maximum time:     4.978 s (0.18% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> @benchmark rightjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  760.20 MiB
  allocs estimate:  237
  --------------
  minimum time:     5.077 s (0.39% GC)
  median time:      5.077 s (0.39% GC)
  mean time:        5.077 s (0.39% GC)
  maximum time:     5.077 s (0.39% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> @benchmark outerjoin($df1, $df2, on=:x)
BenchmarkTools.Trial:
  memory estimate:  769.73 MiB
  allocs estimate:  228
  --------------
  minimum time:     5.148 s (0.18% GC)
  median time:      5.148 s (0.18% GC)
  mean time:        5.148 s (0.18% GC)
  maximum time:     5.148 s (0.18% GC)
  --------------
  samples:          1
  evals/sample:     1

Conclusions

In summary let me highlight some of the development objectives after 1.0 release:

API improvements (e.g. better stack/unstack functionality, having in-place
joins, extensions of transformation minilanguage).
Review of the whole package for performance bottlenecks and using
multi-threading in more places.
Resolving ecosystem integration performance bottlenecks (especially against
CSV.jl and Arrow.jl). Here one big challenge is thinking of something that
would reduce GC strain of having millions of strings stored in the DataFrame
(which is a typical situation in many data science workflows).
Documentation improvements.

Happy data wrangling with DataFrames.jl 1.0!

Working with matrices in DataFrames.jl 1.0

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/04/16/arrays.html

Introduction

DataFrames.jl will have a 1.0 release in a few days. In this post I
want to comment on an issue that I expect might cause the most legacy code
breakage with this release.

The post is tested under Julia 1.6, OrdinaryDiffEq 5.52.3, and
DataFrames.jl 0.22.7 (but I also discuss what changes in 1.0 release).

The past

In the past all matrices could be converted to a DataFrame like this:

~$ julia --banner=no
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using DataFrames

julia> mat = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> DataFrame(mat)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

julia> exit()

However, unfortunately this output is deceptive. Try this:

~$ julia --banner=no --depwarn=error
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using DataFrames

julia> mat = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> DataFrame(mat)
ERROR: `DataFrame(columns::AbstractMatrix)` is deprecated, use `DataFrame(columns, :auto)` instead.

julia> exit()

As you can see, DataFrame(mat) is deprecated. The problem is that
Julia 1.6 hides deprecation warnings by default. Under
DataFrames.jl 1.0 DataFrame(mat) will always throw an error.

Why is `DataFrame(mat)` not allowed?

The reason why we decided to disallow DataFrame constructor with a single
Matrix argument is that DataFrames.jl now follows the rule:

If DataFrame constructor is passed a single positional argument this argument
must be a table that is following Tables.jl API.

The point is that Matrix does not support this API.

A person knowing the DataFrames.jl API more thoroughly might point out that the
statement above is not true, and DataFrames.jl allows for some exceptions where
a non-table is accepted in a constructor. This is indeed the case. Here are
the offending cases:

~$ julia --banner=no --depwarn=error
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using DataFrames

julia> DataFrame(:a => 1) # a pair
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> df = DataFrame([:a => 1]) # a vector of pairs
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> dfr = df[1, :]
DataFrameRow
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> DataFrame(dfr) # DataFrameRow
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> gdf = groupby(df, :a)
GroupedDataFrame with 1 group based on key: a
First Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> DataFrame(gdf) # GroupedDataFrame
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

These four cases are deliberately left for convenience because there is a very
low risk that they would cause confusion. Out of these four cases a vector of
pairs is most problematic, as in Tables.jl it would get the following treatment:

julia> DataFrame(Tables.columntable([:a => 1]))
1×2 DataFrame
 Row │ first   second
     │ Symbol  Int64
─────┼────────────────
   1 │ a            1

However, we have decided that it is extremely unlikely that someone might want
to get this type of a data frame (and in case you wanted it the code above
shows you how to get it reliably).

So why have we not made an exception for matrices? The reason is that there
are many cases where an AbstractMatrix actually supports the Tables.jl interface
and requires a special way of converting it to a DataFrame.

To give some specific example have a look at this issue related
to differential equations solving (I have adapted it a bit):

~/Desktop/Dev/DF_dev$ julia --banner=no
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using OrdinaryDiffEq, DataFrames

julia> function parameterized_lorenz(du, u, p, t)
        du[1] = p[1] * (u[2] - u[1])
        du[2] = u[1] * (p[2] - u[3]) - u[2]
        du[3] = u[1] * u[2] - p[3] * u[3]
       end
parameterized_lorenz (generic function with 1 method)

julia> u0 = [1.0, 0.0, 0.0];

julia> tspan = (0.0, 1.0);

julia> p = [10.0, 28.0, 8/3];

julia> prob = ODEProblem(parameterized_lorenz, u0, tspan, p);

julia> sol1 = solve(prob, Rosenbrock23());

julia> DataFrame(sol1)
3×61 DataFrame
 Row │ x1       x2           x3          x4          x5          x6           ⋯
     │ Float64  Float64      Float64     Float64     Float64     Float64      ⋯
─────┼─────────────────────────────────────────────────────────────────────────
   1 │     1.0  0.999925     0.999178    0.995241    0.989338    0.977794     ⋯
   2 │     0.0  0.000209532  0.00230391  0.0134134   0.0302987   0.0641965
   3 │     0.0  7.83999e-10  9.47873e-8  3.21298e-6  1.63912e-5  7.35689e-5
                                                             55 columns omitted

julia> DataFrame(Tables.columntable(sol1))
61×4 DataFrame
 Row │ timestamp    value1     value2        value3
     │ Float64      Float64    Float64       Float64
─────┼────────────────────────────────────────────────────
   1 │ 0.0           1.0        0.0           0.0
   2 │ 7.48361e-6    0.999925   0.000209532   7.83999e-10
   3 │ 8.23197e-5    0.999178   0.00230391    9.47873e-8
   4 │ 0.000480311   0.995241   0.0134134     3.21298e-6
   5 │ 0.00108852    0.989338   0.0302987     1.63912e-5
   6 │ 0.00232154    0.977794   0.0641965     7.35689e-5
   7 │ 0.00391981    0.963652   0.107499      0.000206214
  ⋮  │      ⋮           ⋮           ⋮             ⋮
  55 │ 0.735315     -7.14414   -8.67351      25.5598
  56 │ 0.771575     -7.66582   -9.02351      25.4699
  57 │ 0.812974     -8.20321   -9.44232      25.6821
  58 │ 0.859509     -8.74036   -9.79982      26.2593
  59 │ 0.908783     -9.18149   -9.8967       27.1128
  60 │ 0.960609     -9.41649   -9.59708      28.0144
  61 │ 1.0          -9.39804   -9.12529      28.5183
                                           47 rows omitted

Now we can see that DataFrame(sol1) produces a wrong result because sol1
is an AbstractMatrix as you can check here:

julia> sol1 isa AbstractMatrix
true

Let us switch to the main branch of DataFrames.jl for the remainder of this post
to test the behavior under DataFrames.jl 1.0 that will be released soon:

~/Desktop/Dev/DF_dev$ julia --banner=no
(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_43Ofes/Project.toml`

(jl_43Ofes) pkg> add DataFrames#main

(jl_43Ofes) pkg> st DataFrames
      Status `/tmp/jl_43Ofes/Project.toml`
  [a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`

julia> using OrdinaryDiffEq, DataFrames
[ Info: Precompiling OrdinaryDiffEq [1dea7af3-3e70-54e6-95c3-0bf5283fa5ed]

julia> function parameterized_lorenz(du, u, p, t)
        du[1] = p[1] * (u[2] - u[1])
        du[2] = u[1] * (p[2] - u[3]) - u[2]
        du[3] = u[1] * u[2] - p[3] * u[3]
       end
parameterized_lorenz (generic function with 1 method)

julia> u0 = [1.0, 0.0, 0.0];

julia> tspan = (0.0, 1.0);

julia> p = [10.0, 28.0, 8/3];

julia> prob = ODEProblem(parameterized_lorenz, u0, tspan, p);

julia> sol1 = solve(prob, Rosenbrock23());

julia> DataFrame(sol1)
61×4 DataFrame
 Row │ timestamp    value1     value2        value3
     │ Float64      Float64    Float64       Float64
─────┼────────────────────────────────────────────────────
   1 │ 0.0           1.0        0.0           0.0
   2 │ 7.48361e-6    0.999925   0.000209532   7.83999e-10
   3 │ 8.23197e-5    0.999178   0.00230391    9.47873e-8
   4 │ 0.000480311   0.995241   0.0134134     3.21298e-6
   5 │ 0.00108852    0.989338   0.0302987     1.63912e-5
   6 │ 0.00232154    0.977794   0.0641965     7.35689e-5
   7 │ 0.00391981    0.963652   0.107499      0.000206214
  ⋮  │      ⋮           ⋮           ⋮             ⋮
  56 │ 0.771575     -7.66582   -9.02351      25.4699
  57 │ 0.812974     -8.20321   -9.44232      25.6821
  58 │ 0.859509     -8.74036   -9.79982      26.2593
  59 │ 0.908783     -9.18149   -9.8967       27.1128
  60 │ 0.960609     -9.41649   -9.59708      28.0144
  61 │ 1.0          -9.39804   -9.12529      28.5183
                                           48 rows omitted

And as you can see this time all worked as expected.

How to move from a matrix to a data frame the old way?

The question is whether you can easily get the old behavior under DataFrames.jl 1.0.
The answer is yes. It is enough to pass :auto as a second positional argument
to treat any AbstractMatrix the old way. The key point here is that :auto
adds a second argument to the constructor, which disambiguates the call
and makes sure we do not try the Tables.jl fallback. So continuing our last example
we have:

julia> DataFrame(sol1, :auto)
3×61 DataFrame
 Row │ x1       x2           x3          x4          x5          x6         ⋯
     │ Float64  Float64      Float64     Float64     Float64     Float64    ⋯
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1.0  0.999925     0.999178    0.995241    0.989338    0.977794   ⋯
   2 │     0.0  0.000209532  0.00230391  0.0134134   0.0302987   0.0641965
   3 │     0.0  7.83999e-10  9.47873e-8  3.21298e-6  1.63912e-5  7.35689e-5
                                                           55 columns omitted

Here are some more examples (still on main):

julia> DataFrame([1 2; 3 4])
ERROR: ArgumentError: `DataFrame` constructor from a `Matrix` requires passing :auto as a second argument to automatically generate column names: `DataFrame(matrix, :auto)`

julia> DataFrame([1 2; 3 4], :auto) # auto generated column names
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

julia> DataFrame([1 2; 3 4], [:c1, :c2]) # passing column names explicitly
2×2 DataFrame
 Row │ c1     c2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

So as you can see the fix is easy.
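
As a quick sanity check, the sketch below converts a matrix to a data frame with :auto and back again with the Matrix constructor; the names mat, df and roundtrip are only illustrative.

```julia
using DataFrames

mat = [1 2; 3 4]

# :auto generates the column names x1, x2, ...
df = DataFrame(mat, :auto)

# Matrix(df) converts the data frame back to a matrix.
roundtrip = Matrix(df)
```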

Finally let me comment that another common similar case is a vector of vectors
being passed to a DataFrame constructor. It follows the same rules:

julia> DataFrame([1:2, 3:4])
ERROR: ArgumentError: `DataFrame` constructor from a `Vector` of vectors requires passing :auto as a second argument to automatically generate column names: `DataFrame(vecs, :auto)`

julia> DataFrame([1:2, 3:4], :auto)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> DataFrame([1:2, 3:4], [:c1, :c2])
2×2 DataFrame
 Row │ c1     c2
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

Again – using Tables.jl default behavior would get you something unexpected
(unless you are really deep into Tables.jl mechanics):

julia> DataFrame(Tables.columntable([1:2, 3:4]))
2×2 DataFrame
 Row │ start  stop
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

Conclusions

We have tried very hard to make things in DataFrames.jl 1.0 maximally consistent
with the whole Julia package ecosystem while allowing a relatively easy handling
of common data processing tasks.

Conversion from a matrix to a DataFrame is one of the common hard corner cases
affected. I hope this post explains the rationale behind the design
decisions taken in the DataFrames.jl 1.0 release in this area and how the
DataFrame constructor should be used to give you the desired results.

juliabloggers.com

A Julia Language Blog Aggregator
