Author Archives: Blog by Bogumił Kamiński

Is case_when needed in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/18/casewhen.html

Introduction

Recently I received a very interesting question regarding transforming data
using the DataFrames.jl. One of the users wanted to know if we have
a functionality similar to the case_when function in dplyr. When trying
to answer it I came to the conclusion that we do not need it that we can
reproduce it using the ⋅ ? ⋅ : ⋅ ternary operator.

In this post I will be reproducing selected examples from the documentation of
case_when.

Reproducing dplyr examples

In the examples I will first show R code and then show an Julia code.

I R examples I assume that dplyr is loaded. The Julia examples were tested under
Julia 1.5.3, DataFrames.jl 0.22.2, DataFramesMeta.jl v0.6.0, HTTP.jl v0.8.19,
JSON3.jl v1.5.1, and Pipe.jl v1.3.0.

Example 1

This is the most basic case_when usage scenario.

R code:

> library(dplyr)
> x <- 1:10
> case_when(
+   x %% 35 == 0 ~ "fizz buzz",
+   x %% 5 == 0 ~ "fizz",
+   x %% 7 == 0 ~ "buzz",
+   TRUE ~ as.character(x)
+ )
 [1] "1"    "2"    "3"    "4"    "fizz" "6"    "buzz" "8"    "9"    "fizz"

Julia code:

julia> x = 1:10
1:10

julia> (x -> x % 6 == 0 ? "fizz buzz" :
             x % 2 == 0 ? "fizz" :
             x % 3 == 0 ? "buzz" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "fizz"
 "buzz"
 "fizz"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

In this basic example note the following things:

  • in the Julia code we do not need to load any package; we are using the
    functionality in built into the language;
  • we create an anonymous function that is then broadcasted over an input vector
    using the . operator;
  • both codes look almost the same, apart from a bit different punctuation.

In summary – in my opinion the basic use case shows that the ternary operator
is as convenient as case_when from dplyr.

Example 2

In this example missing values are introduced. We are reusing the vector created
in the previous exercise.

R code:

> x[2:4] <- NA_real_
> case_when(
+   x %% 35 == 0 ~ "fizz buzz",
+   x %% 5 == 0 ~ "fizz",
+   x %% 7 == 0 ~ "buzz",
+   is.na(x) ~ "nope",
+   TRUE ~ as.character(x)
+ )
 [1] "1"    "nope" "nope" "nope" "fizz" "6"    "buzz" "8"    "9"    "fizz"

Julia code (two variants):

julia> x = [2 <= i <= 4 ? missing : x[i] for i in axes(x, 1)]

10-element Array{Union{Missing, Int64},1}:
  1
   missing
   missing
   missing
  5
  6
  7
  8
  9
 10

julia> (x -> isequal(x % 6, 0) ? "fizz buzz" :
             isequal(x % 2, 0) ? "fizz" :
             isequal(x % 3, 0) ? "buzz" :
             ismissing(x) ? "nope" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "nope"
 "nope"
 "nope"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

julia> (x -> coalesce(x % 6 == 0, false) ? "fizz buzz" :
             coalesce(x % 2 == 0, false) ? "fizz" :
             coalesce(x % 3 == 0, false) ? "buzz" :
             ismissing(x) ? "nope" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "nope"
 "nope"
 "nope"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

additionally note that the code can be simplified if we put the ismissing
condition first:

julia> (x -> ismissing(x) ? "nope" :
             x % 6 == 0 ? "fizz buzz" :
             x % 2 == 0 ? "fizz" :
             x % 3 == 0 ? "buzz" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "nope"
 "nope"
 "nope"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

Note the following patterns in this example:

  • we had to materialize the vector in Julia in a bit more complex way as the
    initial x vector was a 1:10 range which was read only;
  • in R comparison to missing is treated as failing by default; on the other
    hand Julia is strict about boolean tests and one has to use either the
    isequal or the coalesce functions to handle missing values (or move the
    ismissing test to the top); this strictness introduces a bit more verbosity
    in the code at the benefit of allowing the user to catch potential bugs in the
    logic of the code more easily.

Example 3

In this example we use the starwars dataset that is shipped with dplyr. So we
first have to fetch it from the Internet in Julia. Here is the code that does
the trick:

julia> using DataFrames

julia> using HTTP

julia> using JSON3

julia> using Pipe

julia> starwars = @pipe HTTP.get("https://swapi.dev/api/people/").body |>
                  JSON3.read |> _.results |> DataFrame |>
                  transform(_,
                            :species => ByRow(x -> isempty(x) ? "Human" : "Droid"),
                            [:height, :mass] .=> ByRow(x -> parse(Int, x)),
                            renamecols=false)
10×16 DataFrame
 Row │ name                height  mass   hair_color     skin_color   eye_color ⋯
     │ String              Int64   Int64  String         String       String    ⋯
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172     77  blond          fair         blue      ⋯
   2 │ C-3PO                  167     75  n/a            gold         yellow
   3 │ R2-D2                   96     32  n/a            white, blue  red
   4 │ Darth Vader            202    136  none           white        yellow
   5 │ Leia Organa            150     49  brown          light        brown     ⋯
   6 │ Owen Lars              178    120  brown, grey    light        blue
   7 │ Beru Whitesun lars     165     75  brown          light        blue
   8 │ R5-D4                   97     32  n/a            white, red   red
   9 │ Biggs Darklighter      183     84  black          light        brown     ⋯
  10 │ Obi-Wan Kenobi         182     77  auburn, white  fair         blue-gray
                                                               10 columns omitted

We have fetched only 10 rows of data for the analysis (this is the number of
observations that the exposed API produces), but it is enough for our purposes.

As a side note – observe how easy it is in JuliaData ecosystem to fetch a JSON
file from the Internet, parse it, populate a DataFrame, and finally do some
column preprocessing to get the right column types for data that we are
interested it analyzing later.

Let us move to the example. In this case we want to process more than one column
using the case_when function within a data transformation pipeline.

R code

> starwars %>%
+   select(name:mass, gender, species) %>%
+   mutate(
+     type = case_when(
+       height > 200 | mass > 200 ~ "large",
+       species == "Droid"        ~ "robot",
+       TRUE                      ~ "other"
+     )
+   )
# A tibble: 87 x 6
   name               height  mass gender    species type
   <chr>               <int> <dbl> <chr>     <chr>   <chr>
 1 Luke Skywalker        172    77 masculine Human   other
 2 C-3PO                 167    75 masculine Droid   robot
 3 R2-D2                  96    32 masculine Droid   robot
 4 Darth Vader           202   136 masculine Human   large
 5 Leia Organa           150    49 feminine  Human   other
 6 Owen Lars             178   120 masculine Human   other
 7 Beru Whitesun lars    165    75 feminine  Human   other
 8 R5-D4                  97    32 masculine Droid   robot
 9 Biggs Darklighter     183    84 masculine Human   other
10 Obi-Wan Kenobi        182    77 masculine Human   other
# … with 77 more rows

Julia code

julia> @pipe starwars |>
             select(_, Between(:name, :mass), :gender, :species) |>
             transform(_, [:height, :mass, :species] =>
                          ByRow((height, mass, species) ->
                                height > 200 || mass > 200 ? "large" :
                                species == "Droid" ? "robot" :
                                "other") =>
                          :type)
10×6 DataFrame
 Row │ name                height  mass   gender  species  type
     │ String              Int64   Int64  String  String   String
─────┼────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172     77  male    Human    other
   2 │ C-3PO                  167     75  n/a     Droid    robot
   3 │ R2-D2                   96     32  n/a     Droid    robot
   4 │ Darth Vader            202    136  male    Human    large
   5 │ Leia Organa            150     49  female  Human    other
   6 │ Owen Lars              178    120  male    Human    other
   7 │ Beru Whitesun lars     165     75  female  Human    other
   8 │ R5-D4                   97     32  n/a     Droid    robot
   9 │ Biggs Darklighter      183     84  male    Human    other
  10 │ Obi-Wan Kenobi         182     77  male    Human    other

or if you like using DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @pipe starwars |>
             select(_, Between(:name, :mass), :gender, :species) |>
             @eachrow _ begin
                 @newcol type::Vector{String}
                 :type = :height > 200 || :mass > 200 ? "large" :
                         :species == "Droid" ? "robot" :
                         "other"
             end
10×6 DataFrame
 Row │ name                height  mass   gender  species  type
     │ String              Int64   Int64  String  String   String
─────┼────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172     77  male    Human    other
   2 │ C-3PO                  167     75  n/a     Droid    robot
   3 │ R2-D2                   96     32  n/a     Droid    robot
   4 │ Darth Vader            202    136  male    Human    large
   5 │ Leia Organa            150     49  female  Human    other
   6 │ Owen Lars              178    120  male    Human    other
   7 │ Beru Whitesun lars     165     75  female  Human    other
   8 │ R5-D4                   97     32  n/a     Droid    robot
   9 │ Biggs Darklighter      183     84  male    Human    other
  10 │ Obi-Wan Kenobi         182     77  male    Human    other

As you can see it is easy to use the ternary operator also in the case of
several variables. Using DataFrames.jl requires a bit of boilerplate syntax.
This limitation can be conveniently overcome using DataFramesMeta.jl, in the
above example I decided to use the @eachrow macro.

Conclusions

As you can see using the ternary operator in Julia gives us a very similar
functionality and syntax in comparison to the case_when function from dplyr.
Apart from the differences how missing values are handled I have discussed above
there are two features that make the solution in Julia more convenient in my
opinion:

  • in case_when all values on right hand side have to have the same type, while
    in Julia there is no such restriction;
  • case_when evaluates all right hand side expressions, while the ternary
    operator evaluates only what has to be evaluated to determine the result
    of the operation (this is often preferred when some operations may
    throw an error for certain values of their arguments).

Binning your data with Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/11/binning.html

Introduction

Cutting data into groups (binning) is one of the most common data preprocessing
tasks.

You can easily do binning into groups of equal sizes using the cut function
from CategoricalArrays.jl like this (here we bin a vector of values from 1 to 10
into 2 groups):

julia> using CategoricalArrays

julia> cut(1:10, 2)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"

However, the issue becomes more challenging when the number of bins is not a
divisor of vector length or if you have duplicates in the data.

The post is tested under Julia 1.5.3, DataFrames.jl 0.22.1, CategoricalArrays.jl
0.9.0, and FreqTables.jl 0.4.2.

The corner cases of binning

Let us first highlight some potential issues when binning data.

The first problem is when the number of groups is not a divisor of the vector
length. Let us check it out on some examples:

julia> cut(1:10, 3)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 4.0)"
 "Q1: [1.0, 4.0)"
 "Q1: [1.0, 4.0)"
 "Q2: [4.0, 7.0)"
 "Q2: [4.0, 7.0)"
 "Q2: [4.0, 7.0)"
 "Q3: [7.0, 10.0]"
 "Q3: [7.0, 10.0]"
 "Q3: [7.0, 10.0]"
 "Q3: [7.0, 10.0]"

julia> cut(1:10, 4)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 3.25)"
 "Q1: [1.0, 3.25)"
 "Q1: [1.0, 3.25)"
 "Q2: [3.25, 5.5)"
 "Q2: [3.25, 5.5)"
 "Q3: [5.5, 7.75)"
 "Q3: [5.5, 7.75)"
 "Q4: [7.75, 10.0]"
 "Q4: [7.75, 10.0]"
 "Q4: [7.75, 10.0]"

The cut function is deterministic and it uses the quantile function to find
the bin endpoints. This means that in the first example cut(1:10, 3) the third
bin will be always larger than the first and second bin. Similarly in cut(1:10,
4)
the first and the fourth bins are going to be larger deterministically.

The other problem is duplicates in data. Consider the following scenario:

julia> cut([1; fill(2, 8); 3], 2)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 2.0)"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"

We want two bins. Ideally both should have five elements, but since we have
duplicates in our data the first bin has size one and the second size nine.

In some cases you will like what cut produces, but in other cases
one might want to avoid these two problems, that is:

  • always make the bins of equal size and if it is not possible to do so, make
    the decision which bin should be larger and which smaller randomly;
  • allow duplicates to be split between two or more bins (this is then
    unavoidable in some cases), but in such a way that each duplicate has the same
    chance to fall into each bin.

Random binning

Here is the function that performs the binning that has the properties I have
described above:

using DataFrames
using FreqTables
using Random

function binvec(x::AbstractVector, n::Int,
                rng::AbstractRNG=Random.default_rng())
    n > 0 || throw(ArgumentError("number of bins must be positive"))
    l = length(x)

    # find bin sizes
    d, r = divrem(l, n)
    lens = fill(d, n)
    lens[1:r] .+= 1
    # randomly decide which bins should be larger
    shuffle!(rng, lens)

    # ensure that we have data sorted by x, but ties are ordered randomly
    df = DataFrame(id=axes(x, 1), x=x, r=rand(rng, l))
    sort!(df, [:x, :r])

    # assign bin ids to rows
    binids = reduce(vcat, [fill(i, v) for (i, v) in enumerate(lens)])
    df.binids = binids

    # recover original row order
    sort!(df, :id)
    return df.binids
end

Let us now test the binning on the following vector:

julia> Random.seed!(1234);

julia> x = repeat('a':'c', 3)
9-element Array{Char,1}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> binvec(x, 2)
9-element Array{Int64,1}:
 1
 2
 2
 1
 2
 2
 1
 1
 2

As you can see 'b's are split betwen bin 1 and 2 to make them have almost
equal size.

Let us make sure that binvec does the right job in deciding on bin sizes
and splitting 'b's between both bins.

julia> df = reduce(vcat, [DataFrame(x=x, run_id=i, row_id=axes(x, 1),
                                    group_id=binvec(x, 2)) for i in 1:10_000]);

julia> freqtable(df, :group_id, :row_id, :x)
2×9×3 Named Array{Int64,3}

[:, :, x='a'] =
group_id ╲ row_id │     1      2      3      4      5      6      7      8      9
──────────────────┼──────────────────────────────────────────────────────────────
1                 │ 10000      0      0  10000      0      0  10000      0      0
2                 │     0      0      0      0      0      0      0      0      0

[:, :, x='b'] =
group_id ╲ row_id │    1     2     3     4     5     6     7     8     9
──────────────────┼─────────────────────────────────────────────────────
1                 │    0  4941     0     0  5005     0     0  5027     0
2                 │    0  5059     0     0  4995     0     0  4973     0

[:, :, x='c'] =
group_id ╲ row_id │     1      2      3      4      5      6      7      8      9
──────────────────┼──────────────────────────────────────────────────────────────
1                 │     0      0      0      0      0      0      0      0      0
2                 │     0      0  10000      0      0  10000      0      0  10000

Indeed we see that each 'b' falls to group 1 and group 2 with 50%
probability. Also the expected size of group 1 and group 2 is 4.5 as
desired. (both calculations are approximate because we used simulation.)

Conclusions

First – let me comment in what scenario the binning I described is desirable.
Assume you have a set of patients you want to vaccinate against COVID-19. Now
let each of them have assigned a discrete urgency level (typically there will be
3 or 4 such urgency levels). You then have to split them into several batches
of equal size so that each batch gets a vaccine in a different period. If you
want to be fair in assigning people to batches you get exactly the setting I
have described.

Second – as usual I wanted to showcase some features of JuliaData ecosystem. In
particular you have seen reducevcat combo twice (for vectors and for data
frames) and integration of FreqTables.jl with DataFrames.jl that I am very fond
of. Of course this is not the fastest way to get the desired results. I
encourage you to write a faster function that does the same what the binvec
function does.

DataFrames 0.22 @ JuliaAcademy

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/05/juliaacademy.html

Introduction

This time the blog post will be shorter than usual.

Logan Kilpatrick has just released a new version of JuliaAcademy
course for DataFrames.jl that was updated to its 0.22 release and also contains
some new material.

You can find course materials on GitHub here, while the videos will
be released in the coming days; first two
1. Environment Setup and
2. First Steps With Data Frames are already
available for watching.

Before you go

Not to leave you with just bare links let me present a short example how you can
process dates in DataFrames.jl while taking into account the possibility that
there might be missing values in the data (this is a question I was recently
asked how to do it).

I am using Julia 1.5.3, DataFrames 0.22.1, and Missings 0.4.4.

First we prepare a sample data frame:

julia> using Dates, Missings, DataFrames

julia> df1 = DataFrame(date = Date.(2020, 1, 1:10));

julia> allowmissing!(df1);

julia> df1.date[5] = missing;

julia> df1
10×1 DataFrame
 Row │ date
     │ Date?
─────┼────────────
   1 │ 2020-01-01
   2 │ 2020-01-02
   3 │ 2020-01-03
   4 │ 2020-01-04
   5 │ missing
   6 │ 2020-01-06
   7 │ 2020-01-07
   8 │ 2020-01-08
   9 │ 2020-01-09
  10 │ 2020-01-10

We now want to split :date column into three columns that will contain year,
month and day respectively. Here is the way how you can achieve it:

julia> df2 = transform(df1, @. :date =>
                               ByRow(passmissing([year, month, day])) =>
                               [:year, :month, :day])
10×4 DataFrame
 Row │ date        year     month    day
     │ Date?       Int64?   Int64?   Int64?
─────┼───────────────────────────────────────
   1 │ 2020-01-01     2020        1        1
   2 │ 2020-01-02     2020        1        2
   3 │ 2020-01-03     2020        1        3
   4 │ 2020-01-04     2020        1        4
   5 │ missing     missing  missing  missing
   6 │ 2020-01-06     2020        1        6
   7 │ 2020-01-07     2020        1        7
   8 │ 2020-01-08     2020        1        8
   9 │ 2020-01-09     2020        1        9
  10 │ 2020-01-10     2020        1       10

Note that we are using here a common pattern that you can use broadcasting to
easily specify multiple operations on the same object (in this case this is the
same source column).

Finally we go back and collect the :year, :month and :day columns into one
column that contains the original Date values:

julia> df3 = transform(df2, [:year, :month, :day] =>
                            ByRow(passmissing(Date)) =>
                            :date2)
10×5 DataFrame
 Row │ date        year     month    day      date2
     │ Date?       Int64?   Int64?   Int64?   Date?
─────┼───────────────────────────────────────────────────
   1 │ 2020-01-01     2020        1        1  2020-01-01
   2 │ 2020-01-02     2020        1        2  2020-01-02
   3 │ 2020-01-03     2020        1        3  2020-01-03
   4 │ 2020-01-04     2020        1        4  2020-01-04
   5 │ missing     missing  missing  missing  missing
   6 │ 2020-01-06     2020        1        6  2020-01-06
   7 │ 2020-01-07     2020        1        7  2020-01-07
   8 │ 2020-01-08     2020        1        8  2020-01-08
   9 │ 2020-01-09     2020        1        9  2020-01-09
  10 │ 2020-01-10     2020        1       10  2020-01-10

This time we took advantage of the fact that Date takes three positional
arguments and this is the default behavior of transformation specifications
in DataFrames.jl in which multiple source columns are provided.

This is all for today. Bye!