Tag Archives: julialang

Any and all reduction of rows in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/25/anyall.html

Introduction

Recently on Stack Overflow there was an interesting question about
row-wise aggregation of boolean columns in DataFrames.jl.

I think that the solution is an interesting exercise in functionalities of
DataFrames.jl.

The post was written under Julia 1.7 and DataFrames.jl 1.3.2.

The question

Assume we have the following data frame:

julia> using DataFrames, Random

julia> Random.seed!(1234);

julia> df = DataFrame(rand(Bool, 10, 3), :auto)
10×3 DataFrame
 Row │ x1     x2     x3
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │  true  false   true
   2 │ false  false   true
   3 │ false  false  false
   4 │  true  false  false
   5 │  true   true  false
   6 │  true   true   true
   7 │  true  false   true
   8 │  true  false  false
   9 │ false  false  false
  10 │  true  false   true

We now would like to apply all and any functions to this data frame
row-wise. The basic approach to this task would be to do the following solution:

julia> transform(df, AsTable(r"x") .=> ByRow.([all, any]))
10×5 DataFrame
 Row │ x1     x2     x3     x1_x2_x3_all  x1_x2_x3_any
     │ Bool   Bool   Bool   Bool          Bool
─────┼─────────────────────────────────────────────────
   1 │  true  false   true         false          true
   2 │ false  false   true         false          true
   3 │ false  false  false         false         false
   4 │  true  false  false         false          true
   5 │  true   true  false         false          true
   6 │  true   true   true          true          true
   7 │  true  false   true         false          true
   8 │  true  false  false         false          true
   9 │ false  false  false         false         false
  10 │  true  false   true         false          true

The problem with this approach is that it does not scale well when you have
many columns in your data frame:

julia> df2 = DataFrame(rand(Bool, 10_000, 1000), :auto);

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
 12.026759 seconds (1.13 M allocations: 730.530 MiB, 0.39% gc time, 42.79% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
  6.746910 seconds (145.42 k allocations: 678.763 MiB, 0.29% gc time)

As you can see it has both high compilation cost and high execution time.

Speeding up the computation

The basic way to resolve this issue is to use the composition with collect, as
the DataFrames.table_transformation documentation suggests us. In this way we
avoid constructing a very wide NamedTuple, which is problematic in this
case.

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
  0.616105 seconds (893.18 k allocations: 69.745 MiB, 3.26% gc time, 80.07% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
  0.129053 seconds (26.39 k allocations: 21.293 MiB, 6.65% gc time)

This is much better. However, we can further notice that any function is
essentially the same as maximum, and all the same as minimum for boolean
data. In this case we do not have to use the collect composition:

julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
  0.393608 seconds (783.29 k allocations: 42.571 MiB, 2.45% gc time, 96.49% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
  0.002593 seconds (5.37 k allocations: 503.359 KiB)

This time things are very fast, as row-wise aggregation for maximum
and minimum is optimized.

Conclusions

There are the following lessons from the examples we have given today:

  • when performing row-wise aggregation the default AsTable timings can be bad,
    because they construct an intermediate NamedTuple, which is expensive
    (of course sometimes it is useful, if you want to refer to column names in
    transformations – that is why this form is supported);
  • to avoid construction of NamedTuple use composition with collect, which
    uses a faster path that passes a vector to the aggregation function directly;
  • an even faster option is available for selected list of most common aggregation
    functions, which includes: sum, length, mean, var, std, median,
    minimum, and maximum (all potentially composed with skipmissing).

Why you should attend JuliaCon 2022 (as a Data Scientist) ‍

By: Logan Kilpatrick

Re-posted from: https://towardsdatascience.com/why-you-should-attend-juliacon-2022-as-a-data-scientist-c05c946d7e66?source=rss-2c8aac9051d3------2

The annual JuliaCon conference is right around the corner (July 27th — 29th 2022 online). Get your ticket today!

Upcoming DataFrames.jl features

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/18/upcoming.html

Introduction

In this post I want to introduce two features of DataFrames.jl that will be
added in 1.4 release that are already available on main branch. The new
additions are:

  • extensions of the transformation mini-language;
  • pivot table functionality of the unstack function.

The post was written under Julia 1.7.0 and DataFrames.jl current main branch,
and DataFramesMeta.jl 0.10.

Environment setup

Since we want to work on a main branch we set it up first:

(@v1.7) pkg> add DataFrames#main
    Updating git-repo `https://github.com/JuliaData/DataFrames.jl.git`
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
    Updating `~/.julia/environments/v1.7/Project.toml`
  [a93c6f00] ~ DataFrames v1.3.2 ⇒ v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main`
    Updating `~/.julia/environments/v1.7/Manifest.toml`
  [a93c6f00] ~ DataFrames v1.3.2 ⇒ v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main`
Precompiling project...
  2 dependencies successfully precompiled in 20 seconds (47 already precompiled)

julia> using DataFrames

Extensions of the transformation mini-language

In DataFrames.jl 1.3 the nrow function has a special syntax as it does not
require passing source column (but optionally allows specifying output column
name) and returns the number of rows in a data frame or grouped data frame.
Here is an example:

julia> df = DataFrame(year=repeat([2020, 2021]; outer=3),
                      region=repeat(["north", "center", "south"]; inner=2),
                      sales=[500, 600, 700, 550, 630, 680])
6×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2020  north     500
   2 │  2021  north     600
   3 │  2020  center    700
   4 │  2021  center    550
   5 │  2020  south     630
   6 │  2021  south     680

julia> gdf = groupby(df, :year)
GroupedDataFrame with 2 groups based on key: year
First Group (3 rows): year = 2020
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2020  north     500
   2 │  2020  center    700
   3 │  2020  south     630
⋮
Last Group (3 rows): year = 2021
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2021  north     600
   2 │  2021  center    550
   3 │  2021  south     680

julia> combine(df, nrow)
1×1 DataFrame
 Row │ nrow
     │ Int64
─────┼───────
   1 │     6

julia> combine(gdf, nrow => :region_count)
2×2 DataFrame
 Row │ year   region_count
     │ Int64  Int64
─────┼─────────────────────
   1 │  2020             3
   2 │  2021             3

In DataFrames.jl 1.4 three more functions having the same special handling are
going to be added. They are:

  • eachindex: returning row number in a data frame or in group of grouped data
    frame;
  • proprow: returning proportion of rows in group
    (only supported for grouped data frame);
  • groupindices: returning group number (only supported for grouped data frame);

As you can see none of these functions require passing a concrete source column
to know how to produce their result and this is the reason why they follow the
non-standard syntax.

Let me show you how they work by example. Start with eachindex:

julia> transform(df, eachindex)
6×4 DataFrame
 Row │ year   region  sales  eachindex
     │ Int64  String  Int64  Int64
─────┼─────────────────────────────────
   1 │  2020  north     500          1
   2 │  2021  north     600          2
   3 │  2020  center    700          3
   4 │  2021  center    550          4
   5 │  2020  south     630          5
   6 │  2021  south     680          6

julia> transform(gdf, eachindex => :row_number)
6×4 DataFrame
 Row │ year   region  sales  row_number
     │ Int64  String  Int64  Int64
─────┼──────────────────────────────────
   1 │  2020  north     500           1
   2 │  2021  north     600           1
   3 │  2020  center    700           2
   4 │  2021  center    550           2
   5 │  2020  south     630           3
   6 │  2021  south     680           3

Note, that in the second example the returned row indices are within-group.

Now let us see how proprow and groupindices work. As I have noted above
they are only supported for GroupedDataFrame:

julia> combine(gdf, groupindices => :group_number, proprow)
2×3 DataFrame
 Row │ year   group_number  proprow
     │ Int64  Int64         Float64
─────┼──────────────────────────────
   1 │  2020             1      0.5
   2 │  2021             2      0.5

Let us use the proprow function to check the distribution of sum of two
tosses using standard six-sided dice additionally using DataFramesMeta.jl:

julia> using Random

julia> using DataFramesMeta

julia> Random.seed!(1234);

julia> @chain DataFrame(res=rand(1:6, 10^8) + rand(1:6, 10^8)) begin
           groupby(:res)
           @combine(:empirical = $proprow)
           @rtransform(:theoretical = (6 - abs(7 - :res)) / 36)
       end
11×3 DataFrame
 Row │ res    empirical  theoretical
     │ Int64  Float64    Float64
─────┼───────────────────────────────
   1 │     2  0.0277863    0.0277778
   2 │     3  0.0555783    0.0555556
   3 │     4  0.0832717    0.0833333
   4 │     5  0.111112     0.111111
   5 │     6  0.138934     0.138889
   6 │     7  0.166655     0.166667
   7 │     8  0.138885     0.138889
   8 │     9  0.111094     0.111111
   9 │    10  0.0833185    0.0833333
  10 │    11  0.0555788    0.0555556
  11 │    12  0.0277863    0.0277778

Pivot table functionality of the unstack function

The second upcoming functionality is addition of valuestransform keyword
argument to unstack. It allows you to create pivot tables. Start with
generating some table we might want to aggregate:

julia> Random.seed!(1234);

julia> sales_df = DataFrame(year=rand(2020:2021, 100),
                            region=rand(["north", "center", "south"], 100),
                            sales=rand(100))
100×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Float64
─────┼──────────────────────────
   1 │  2020  center  0.558861
   2 │  2021  center  0.122447
   3 │  2020  north   0.20889
  ⋮  │   ⋮      ⋮         ⋮
  98 │  2020  center  0.983299
  99 │  2020  north   0.882375
 100 │  2021  south   0.232082
                 94 rows omitted

julia> transform!(groupby(sales_df, [:year, :region]), eachindex => :store_id)
100×4 DataFrame
 Row │ year   region  sales      store_id
     │ Int64  String  Float64    Int64
─────┼────────────────────────────────────
   1 │  2020  center  0.558861          1
   2 │  2021  center  0.122447          1
   3 │  2020  north   0.20889           1
  ⋮  │   ⋮      ⋮         ⋮         ⋮
  98 │  2020  center  0.983299         13
  99 │  2020  north   0.882375         12
 100 │  2021  south   0.232082         19
                           94 rows omitted

I have generated the :store_id column that gives us store identifier per year
and region as an exercise of eachindex usage.

What we want to get is a pivot table showing us total sales by year and region.

Traditionally you would do it like this:

julia> combine(groupby(sales_df, [:year, :region]), :sales => sum => :total_sales)
6×3 DataFrame
 Row │ year   region  total_sales
     │ Int64  String  Float64
─────┼────────────────────────────
   1 │  2020  center      7.25926
   2 │  2021  center      6.00744
   3 │  2020  north       7.10255
   4 │  2021  south       7.53942
   5 │  2020  south       9.35856
   6 │  2021  north      10.7248

and later use unstack on the result like this:

julia> @chain sales_df begin
           groupby([:year, :region])
           combine(:sales => sum => :total_sales)
           unstack(:region, :year, :total_sales)
       end
3×3 DataFrame
 Row │ region  2020      2021
     │ String  Float64?  Float64?
─────┼────────────────────────────
   1 │ center   7.25926   6.00744
   2 │ north    7.10255  10.7248
   3 │ south    9.35856   7.53942

Now you can do it in one step with valuestransform keyword argument to
unstack which allows you to pass aggregation function you want to apply
to the unstacked data:

julia> unstack(sales_df, :region, :year, :sales; valuestransform=sum)
3×3 DataFrame
 Row │ region  2020      2021
     │ String  Float64?  Float64?
─────┼────────────────────────────
   1 │ center   7.25926   6.00744
   2 │ north    7.10255  10.7248
   3 │ south    9.35856   7.53942

Let us now perform a bit more complex transformation:

julia> @chain sales_df begin
           @rtransform(:region_year = string(:region, "_", :year))
           unstack(:store_id, :region_year, :sales; valuestransform=sum, fill=0.0)
       end
20×7 DataFrame
 Row │ store_id  center_2020  center_2021  north_2020  south_2021  south_2020  north_2021
     │ Int64     Float64      Float64      Float64     Float64     Float64     Float64
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │        1    0.558861     0.122447     0.20889    0.143436    0.329521   0.866798
   2 │        2    0.0380815    0.309235     0.034361   0.552426    0.310385   0.959957
   3 │        3    0.516981     0.261204     0.95434    0.413821    0.104218   0.199257
   4 │        4    0.0618067    0.075524     0.099118   0.706231    0.217979   0.370522
   5 │        5    0.0507147    0.275146     0.328642   0.975947    0.677177   0.00706048
   6 │        6    0.732012     0.0207132    0.991149   0.620299    0.814166   0.531297
   7 │        7    0.204139     0.931495     0.832146   0.201035    0.554558   0.759341
   8 │        8    0.799132     0.194907     0.731654   0.234099    0.67543    0.194442
   9 │        9    0.831656     0.894577     0.694614   0.0841972   0.339449   0.164982
  10 │       10    0.979482     0.057445     0.826054   0.177784    0.858287   0.529903
  11 │       11    0.989703     0.568018     0.519209   0.185454    0.0682582  0.315405
  12 │       12    0.513391     0.573917     0.882375   0.294397    0.465337   0.426687
  13 │       13    0.983299     0.0552395    0.0        0.145345    0.55148    0.630227
  14 │       14    0.0          0.6342       0.0        0.627799    0.0697159  0.83729
  15 │       15    0.0          0.435403     0.0        0.743466    0.756869   0.940755
  16 │       16    0.0          0.102038     0.0        0.641528    0.867504   0.978431
  17 │       17    0.0          0.495932     0.0        0.236       0.436487   0.937941
  18 │       18    0.0          0.0          0.0        0.324075    0.955728   0.241034
  19 │       19    0.0          0.0          0.0        0.232082    0.248491   0.833433
  20 │       20    0.0          0.0          0.0        0.0         0.0575235  0.0

Here, I have shown you the fill keyword argument that is used to provide
value that should be used in cases we do not have any data for some combination
of row and column keys (by default such cells are filled with missing).

Conclusions

I hope you will find these upcoming functionalities useful. If you have any
questions or comments regarding them please open an issue on DataFrames.jl
repository.