Getting ready for JuliaCon 2022

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/22/juliacon2022.html

Introduction

During JuliaCon 2022 I will run a
tutorial on DataFrames.jl.
In the tutorial I will focus on ways you can write transformation operations
using the select/transform/combine functions and the operation
specification syntax.

In this post I want to give you a preview of the topics I am going to cover
in my tutorial.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
and DataFramesMeta.jl 0.12.

What is operation specification syntax?

If you are new to DataFrames.jl then you probably wonder what
operation specification syntax is. Fortunately it is quite easy.

If you want to transform data using one of the select/transform/combine
functions you can specify the transformation you want to perform using the
following general syntax:

[source columns] => [function] => [output columns]

For example, if you write :a => mean => :b you compute the mean of column :a
and store it in column :b of the output data frame (visually: input column :a
is passed to the mean function, whose output is passed to output column :b).
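
It may help to see that such a specification is not special syntax, but an ordinary nested Pair built with the => operator. The following is a minimal sketch, assuming only Base Julia and the Statistics standard library; the variable names spec, source, fun, and target are my own and are used for illustration only:

```julia
using Statistics

# The operation specification is a plain value: a nested Pair.
spec = :a => mean => :b

# `=>` is right-associative, so the line above means :a => (mean => :b).
source = first(spec)       # the source column name, :a
fun, target = last(spec)   # the function (mean) and the output column name (:b)

# The middle element is just a function that can be called on a vector.
fun([1, 2, 3])
```

Because the specification is a regular value, it can be stored in a variable or built programmatically before it is passed to select, transform, or combine.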

Additionally, if you prefer an assignment style of specifying operations, you
can use the DataFramesMeta.jl package, which lets you write the same operation
as just :b = mean(:a). To use DataFramesMeta.jl you need to prefix the
appropriate function name with @ (to turn it into a macro).

Let me show you a minimal working example of such a transformation.
We compute the mean of column :val by groups defined by the :id column:

julia> using DataFramesMeta

julia> using Statistics

julia> df = DataFrame(id=[1, 2, 1, 2, 1, 2], val=1:6)
6×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6

julia> combine(groupby(df, :id), :val => mean => :mean_val)
2×2 DataFrame
 Row │ id     mean_val
     │ Int64  Float64
─────┼─────────────────
   1 │     1       3.0
   2 │     2       4.0

julia> @combine(groupby(df, :id), :mean_val = mean(:val))
2×2 DataFrame
 Row │ id     mean_val
     │ Int64  Float64
─────┼─────────────────
   1 │     1       3.0
   2 │     2       4.0
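
The same specification can be passed to any of the select/transform/combine functions, and the difference lies in the shape of the result. Below is a small sketch on the same df, assuming only DataFrames.jl and Statistics; the variable names on the left-hand sides are hypothetical:

```julia
using DataFrames, Statistics

df = DataFrame(id=[1, 2, 1, 2, 1, 2], val=1:6)

# combine keeps only the rows returned by the function: here a single row.
combined = combine(df, :val => mean => :mean_val)

# select keeps as many rows as df has, but only the requested output column;
# the scalar mean is repeated in every row.
selected = select(df, :val => mean => :mean_val)

# transform works like select, but additionally keeps all columns of df.
transformed = transform(df, :val => mean => :mean_val)

# On grouped data transform computes the mean per group
# and keeps the original row order.
by_group = transform(groupby(df, :id), :val => mean => :mean_val)
```

Here combined has 1 row, selected and transformed have 6 rows each, and by_group stores the group means 3.0 and 4.0 next to the matching :id values.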

This is a simple example of what operation specification syntax
can do. In this post let me give you a more complex example (I explain
all the details of how it works in my upcoming tutorial).

The question from StackOverflow

In this StackOverflow question the user wanted to analyze the iris data
set and get the 25% and 75% quantiles of the Sepal.Length column.

The R code that the StackOverflow user provided was:

> library(dplyr)

> iris %>%
+        group_by(Species) %>%
+        summarise(
+           quantile(Sepal.Length, c(.25, .75)) %>%
+              matrix(nrow = 1) %>%
+              as.data.frame() %>%
+              setNames(paste0("Sepal.Length", c(.25, .75)))
+     )
# A tibble: 3 x 3
  Species    Sepal.Length0.25 Sepal.Length0.75
  <fct>                 <dbl>            <dbl>
1 setosa                 4.8               5.2
2 versicolor             5.6               6.3
3 virginica              6.22              6.9

The question was how to achieve the same with DataFrames.jl and
DataFramesMeta.jl.

Here is a solution. First we need to load the iris dataset (in the code I take
advantage of the fact that this dataset is bundled into DataFrames.jl
installation folders):

julia> using CSV

julia> iris = CSV.read(joinpath(dirname(pathof(DataFrames)),
                                "..", "docs", "src", "assets", "iris.csv"),
                       DataFrame)
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String15
─────┼──────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  Iris-setosa
   2 │         4.9         3.0          1.4         0.2  Iris-setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
 149 │         6.2         3.4          5.4         2.3  Iris-virginica
 150 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                        146 rows omitted

Now we are ready to use the combine function and the operation specification
syntax:

julia> combine(groupby(iris, :Species),
               :SepalLength =>
               (x -> [quantile(x, [0.25, 0.75])]) =>
               string.("SepalLength", [0.25, 0.75]))
3×3 DataFrame
 Row │ Species          SepalLength0.25  SepalLength0.75
     │ String15         Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │ Iris-setosa                4.8                5.2
   2 │ Iris-versicolor            5.6                6.3
   3 │ Iris-virginica             6.225              6.9

With DataFramesMeta.jl you would write this using the @combine macro as
follows (I additionally show here how to use operation chaining with @chain):

julia> @chain iris begin
           groupby(:Species)
           @combine($["SepalLength0.25", "SepalLength0.75"] = [quantile(:SepalLength, [0.25, 0.75])])
       end
3×3 DataFrame
 Row │ Species          SepalLength0.25  SepalLength0.75
     │ String15         Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │ Iris-setosa                4.8                5.2
   2 │ Iris-versicolor            5.6                6.3
   3 │ Iris-virginica             6.225              6.9
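
For comparison, here is one more way to produce several output columns from a single function, sketched on a small made-up data frame so that it runs without the iris file. It is not the solution from the StackOverflow answer, only an alternative formulation: the function returns a NamedTuple and AsTable is used as the target, so the field names become the column names. The data, the helper q, and the names q25 and q75 are hypothetical:

```julia
using DataFrames, Statistics

# Toy data (an assumption made for this sketch, not the iris data set).
df = DataFrame(g=[1, 1, 1, 2, 2, 2], x=[1.0, 2.0, 3.0, 10.0, 20.0, 30.0])

# Hypothetical helper: returns both quantiles as a NamedTuple.
q(v) = (q25=quantile(v, 0.25), q75=quantile(v, 0.75))

# AsTable tells combine to expand the NamedTuple fields into separate columns.
res = combine(groupby(df, :g), :x => q => AsTable)
```

The result has one row per group and the columns g, q25, and q75. The trade-off is that the column names are fixed inside the function instead of being passed as the last element of the specification.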

Conclusions

The operation specification syntax was designed to make simple
transformations easy, but at the same time to also support quite complex
operations, like the one we did on the iris data frame.

If you would like to hear a detailed explanation of how to write such code
please join me during the upcoming workshop.

You can find all the examples that I will use during the workshop in the
accompanying GitHub repository.