Pivot tables in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/05/19/pivot.html

Introduction

Creation of pivot tables is a common operation in exploratory data analysis.
Today I want to show you one example how this can be done in DataFrames.jl
that was prompted by a recent discussion on Julia Slack.

The post was written under Julia 1.9.0, Chain 0.5.0, DataFrames.jl 1.5.0, and RDatasets 0.7.7.

The data

We will work with the classical diamonds dataset. Let us load it first:

julia> using DataFrames

julia> using Chain

julia> using RDatasets

julia> diamonds = RDatasets.dataset("ggplot2", "diamonds")
53940×10 DataFrame
   Row │ Carat    Cut        Color  Clarity  Depth    Table    Price  X        Y        Z
       │ Float64  Cat…       Cat…   Cat…     Float64  Float64  Int32  Float64  Float64  Float64
───────┼────────────────────────────────────────────────────────────────────────────────────────
     1 │    0.23  Ideal      E      SI2         61.5     55.0    326     3.95     3.98     2.43
     2 │    0.21  Premium    E      SI1         59.8     61.0    326     3.89     3.84     2.31
     3 │    0.23  Good       E      VS1         56.9     65.0    327     4.05     4.07     2.31
     4 │    0.29  Premium    I      VS2         62.4     58.0    334     4.2      4.23     2.63
     5 │    0.31  Good       J      SI2         63.3     58.0    335     4.34     4.35     2.75
     6 │    0.24  Very Good  J      VVS2        62.8     57.0    336     3.94     3.96     2.48
     7 │    0.24  Very Good  I      VVS1        62.3     57.0    336     3.95     3.98     2.47
     8 │    0.26  Very Good  H      SI1         61.9     55.0    337     4.07     4.11     2.53
     9 │    0.22  Fair       E      VS2         65.1     61.0    337     3.87     3.78     2.49
    10 │    0.23  Very Good  H      VS1         59.4     61.0    338     4.0      4.05     2.39
    11 │    0.3   Good       J      SI1         64.0     55.0    339     4.25     4.28     2.73
    12 │    0.23  Ideal      J      VS1         62.8     56.0    340     3.93     3.9      2.46
    13 │    0.22  Premium    F      SI1         60.4     61.0    342     3.88     3.84     2.33
    14 │    0.31  Ideal      J      SI2         62.2     54.0    344     4.35     4.37     2.71
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮        ⋮       ⋮       ⋮        ⋮        ⋮
 53928 │    0.79  Good       F      SI1         58.1     59.0   2756     6.06     6.13     3.54
 53929 │    0.79  Premium    E      SI2         61.4     58.0   2756     6.03     5.96     3.68
 53930 │    0.71  Ideal      G      VS1         61.4     56.0   2756     5.76     5.73     3.53
 53931 │    0.71  Premium    E      SI1         60.5     55.0   2756     5.79     5.74     3.49
 53932 │    0.71  Premium    F      SI1         59.8     62.0   2756     5.74     5.73     3.43
 53933 │    0.7   Very Good  E      VS2         60.5     59.0   2757     5.71     5.76     3.47
 53934 │    0.7   Very Good  E      VS2         61.2     59.0   2757     5.69     5.72     3.49
 53935 │    0.72  Premium    D      SI1         62.7     59.0   2757     5.69     5.73     3.58
 53936 │    0.72  Ideal      D      SI1         60.8     57.0   2757     5.75     5.76     3.5
 53937 │    0.72  Good       D      SI1         63.1     55.0   2757     5.69     5.75     3.61
 53938 │    0.7   Very Good  D      SI1         62.8     60.0   2757     5.66     5.68     3.56
 53939 │    0.86  Premium    H      SI2         61.0     58.0   2757     6.15     6.12     3.74
 53940 │    0.75  Ideal      D      SI2         62.2     55.0   2757     5.83     5.87     3.64
                                                                              53913 rows omitted

The task we want to do is to analyze the distribution of :Cut column by :Color.

Note that these columns are Categorical (as indicated by the Cat… information above).
This allows us to check levels of :Cut and :Color to verify their ordering.

julia> levels(diamonds.Cut)
5-element Vector{String}:
 "Fair"
 "Good"
 "Very Good"
 "Premium"
 "Ideal"

julia> levels(diamonds.Color)
7-element Vector{String}:
 "D"
 "E"
 "F"
 "G"
 "H"
 "I"
 "J"

Now we are ready to start analyzing the data.

Simple pivot table

A simple pivot table would be to calculate number of observations of each
:Cut, and :Color combination. You can do it as follows:

julia> unstack(diamonds, :Cut, :Color, :Cut, combine=length)
5×8 DataFrame
 Row │ Cut        E       I       J       H       F       G       D
     │ Cat…       Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ Ideal        3903    2093     896    3115    3826    4884    2834
   2 │ Premium      2337    1428     808    2360    2331    2924    1603
   3 │ Good          933     522     307     702     909     871     662
   4 │ Very Good    2400    1204     678    1824    2164    2299    1513
   5 │ Fair          224     175     119     303     312     314     163

We see that we put the second positional argument of unstack (:Cut in our case) as rows,
and the third (:Color) as columns. The fourth positional argument is what we put in the cells
of the pivot table. Since we want to get the number of observations (combine=length) then
it does not matter which column we pass so I used :Cut.

Fixing the order

The table looks nice, but there is one problem with it. The rows and columns are not ordered nicely.
The reason is that currently unstack in DataFrames.jl orders them by the order of their appearance
in the source data frame.

We can fix it by sorting. The order of columns is set by pre-sorting the source data frame,
and the order of rows is set by post-sorting of the data frame returned by unstack.
Note that I start using @chain macro from Chain.jl for clarity of the code:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
       end
5×8 DataFrame
 Row │ Cut        D       E       F       G       H       I       J
     │ Cat…       Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ Fair          163     224     312     314     303     175     119
   2 │ Good          662     933     909     871     702     522     307
   3 │ Very Good    1513    2400    2164    2299    1824    1204     678
   4 │ Premium      1603    2337    2331    2924    2360    1428     808
   5 │ Ideal        2834    3903    3826    4884    3115    2093     896

Now the table is nicely ordered. Notice that both sort and sort! functions are aware of
categorical nature of data and properly sort it.

We almost have what we wanted. The problem is that seeing counts does not allow us to easily
assess the distributions by diamond color. This can be easily added by transforming the columns
of our data frame.

Getting proportions

Let us turn the data from counts to proportions. We can do it using the transform! function:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform(Not(:Cut) .=> x -> x / sum(x), renamecols=false)
       end
5×8 DataFrame
 Row │ Cut        D          E          F          G          H          I          J
     │ Cat…       Float64    Float64    Float64    Float64    Float64    Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ Fair       0.024059   0.0228641  0.0326975  0.0278073  0.0364884  0.0322759  0.0423789
   2 │ Good       0.0977122  0.0952332  0.095263   0.0771343  0.0845376  0.0962744  0.10933
   3 │ Very Good  0.223321   0.244973   0.226787   0.203595   0.219653   0.222058   0.241453
   4 │ Premium    0.236605   0.238542   0.244288   0.258944   0.2842     0.263371   0.287749
   5 │ Ideal      0.418303   0.398387   0.400964   0.432519   0.37512    0.38602    0.319088

Indeed J diamonds seem to have worst :Cut, and the best are G diamonds.

Digging deeper into the data

To formally assess the order of columns by :Cut quality let us turn the data from distribution
to a cumulative distribution first:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
       end
5×8 DataFrame
 Row │ Cut        D         E          F          G          H          I          J
     │ Cat…       Float64   Float64    Float64    Float64    Float64    Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────
   1 │ Fair       0.024059  0.0228641  0.0326975  0.0278073  0.0364884  0.0322759  0.0423789
   2 │ Good       0.121771  0.118097   0.127961   0.104942   0.121026   0.12855    0.151709
   3 │ Very Good  0.345092  0.36307    0.354747   0.308537   0.340679   0.350609   0.393162
   4 │ Premium    0.581697  0.601613   0.599036   0.567481   0.62488    0.61398    0.680912
   5 │ Ideal      1.0       1.0        1.0        1.0        1.0        1.0        1.0

We would like to order columns by the first-order stochastic dominance relation.
Since DataFrames.jl makes it easier to sort rows, let us permute the dimensions of our data frame first:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
           permutedims(:Cut, :Color)
       end
7×6 DataFrame
 Row │ Color   Fair       Good      Very Good  Premium   Ideal
     │ String  Float64    Float64   Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────
   1 │ D       0.024059   0.121771   0.345092  0.581697      1.0
   2 │ E       0.0228641  0.118097   0.36307   0.601613      1.0
   3 │ F       0.0326975  0.127961   0.354747  0.599036      1.0
   4 │ G       0.0278073  0.104942   0.308537  0.567481      1.0
   5 │ H       0.0364884  0.121026   0.340679  0.62488       1.0
   6 │ I       0.0322759  0.12855    0.350609  0.61398       1.0
   7 │ J       0.0423789  0.151709   0.393162  0.680912      1.0

Now we are ready to sort our data frame by all columns except :Color:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
           permutedims(:Cut, :Color)
           sort!(Not(:Color))
       end
7×6 DataFrame
 Row │ Color   Fair       Good      Very Good  Premium   Ideal
     │ String  Float64    Float64   Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────
   1 │ E       0.0228641  0.118097   0.36307   0.601613      1.0
   2 │ D       0.024059   0.121771   0.345092  0.581697      1.0
   3 │ G       0.0278073  0.104942   0.308537  0.567481      1.0
   4 │ I       0.0322759  0.12855    0.350609  0.61398       1.0
   5 │ F       0.0326975  0.127961   0.354747  0.599036      1.0
   6 │ H       0.0364884  0.121026   0.340679  0.62488       1.0
   7 │ J       0.0423789  0.151709   0.393162  0.680912      1.0

The sorting exercise did not work this time.
First-order stochastic dominance does not render the best and the worst option.
Note that e.g. options E and G do not dominate each other.
However we see that option J is clearly worst as it has maximum values in all levels.

To see this more clearly let us do min-max scaling of all columns except :Ideal (as it is constant):

julia> scale(x) = (x .- minimum(x)) / (maximum(x) - minimum(x))
scale (generic function with 1 method)

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
           permutedims(:Cut, :Color)
           sort!(Not(:Color))
           transform!(Not([:Color, :Ideal]) .=> scale .=> identity)
       end
7×6 DataFrame
 Row │ Color   Fair       Good      Very Good  Premium   Ideal
     │ String  Float64    Float64   Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────
   1 │ E       0.0        0.281301   0.644408  0.300901      1.0
   2 │ D       0.0612305  0.359855   0.431965  0.125328      1.0
   3 │ G       0.253303   0.0        0.0       0.0           1.0
   4 │ I       0.482289   0.504808   0.497151  0.409932      1.0
   5 │ F       0.503895   0.492198   0.546059  0.278184      1.0
   6 │ H       0.698153   0.343921   0.379817  0.506022      1.0
   7 │ J       1.0        1.0        1.0       1.0           1.0

Indeed J color has 1.0 values in all columns.

We also now can more clearly see that if we ignore the :Fair level
the G color dominates all other colors.

Note that in the last step I have shown an alternative to renamecols=false
of how one can keep the column names unchanged under transformation.
What you can do is pass the identity function as target column name.

The reason is that if you pass a function as target column name then this function is applied
to source column name (and identity keeps things as they were).

Conclusions

I hope you found this post useful for exploring some of the functionalities
of DataFrames.jl.

I also tried to show how nicely @chain can be used to gradually build
an analysis. This is especially convenient in Julia REPL, since when you
go back in command history it allows you to get whole last command
(and not just a single line like in some other REPLs)
in the prompt with one up arrow key press and edit it.