Category Archives: Julia

Reflecting on Macros

By: Jonathan Carroll

Re-posted from: https://jcarroll.com.au/2023/06/10/reflecting-on-macros/

I’ve been following the drama of the RustConf Keynote Fiasco (RKNF, per @fasterthanlime)
from a great distance – I’m not involved in that community beyond starting to learn
the language. But the controversial topic itself Compile-Time Reflection seemed like something interesting I could learn something about.

A good start is usually a Wikipedia page, and I found one called “Reflective programming” under the “MetaProgramming”
category, where it defines

reflection is the ability of a process to examine, introspect, and modify its own structure and behavior

That sounds somewhat familiar from what metaprogramming I’ve read about. One of the
great features of R is the ability to inspect and rewrite functions, for example,
the body of the sd() function (calculating the standard deviation of the input) looks
like

sd
## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x55a797b52960>
## <environment: namespace:stats>

Trying to extract a “component” of that function results in the ever-classic error

sd[1]
## Error in sd[1]: object of type 'closure' is not subsettable

However, using body() we can get to the components of the function

body(sd)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
body(sd)[1]
## sqrt()

and we can even mess with it (meaninglessly, in this case)

vals <- c(1, 3, 5, 7)
sd(vals)
## [1] 2.581989
my_sd <- sd
body(my_sd)[1] <- call("log")
my_sd # note that the function now (wrongly) uses log() instead of sqrt()
## function (x, na.rm = FALSE) 
## log(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <environment: namespace:stats>
my_sd(vals)
## [1] 1.89712

The Wikipedia page lists the following example of reflection in R

# Without reflection, assuming foo() returns an S3-type object that has method "hello"
obj <- foo()
hello(obj)

# With reflection
class_name <- "foo"
generic_having_foo_method <- "hello"
obj <- do.call(class_name, list())
do.call(generic_having_foo_method, alist(obj))

Using a more concrete data object and class, e.g. tibble::tibble and summary might be
clearer

library(tibble) # do.call doesn't like pkg::fun as a string

# Without reflection
obj <- tibble(a = 1:2, b = 3:4)
summary(obj)
##        a              b       
##  Min.   :1.00   Min.   :3.00  
##  1st Qu.:1.25   1st Qu.:3.25  
##  Median :1.50   Median :3.50  
##  Mean   :1.50   Mean   :3.50  
##  3rd Qu.:1.75   3rd Qu.:3.75  
##  Max.   :2.00   Max.   :4.00
# With reflection
class_name <- "tibble"
generic_having_foo_method <- "summary"
obj <- do.call(class_name, list(a = 1:2, b = 3:4))
obj
## # A tibble: 2 × 2
##       a     b
##   <int> <int>
## 1     1     3
## 2     2     4
do.call(generic_having_foo_method, alist(obj))
##        a              b       
##  Min.   :1.00   Min.   :3.00  
##  1st Qu.:1.25   1st Qu.:3.25  
##  Median :1.50   Median :3.50  
##  Mean   :1.50   Mean   :3.50  
##  3rd Qu.:1.75   3rd Qu.:3.75  
##  Max.   :2.00   Max.   :4.00

So, maybe it’s more to do with being able to use a string containing the “name” of
a function and go and find that function, or just the ability to generate functions
on-demand based on non-function objects (?). Please, let me know if there’s a more
enlightening explanation.

I still don’t think I understand that at all (more time required) but I did note in
some additional research that “reflection” and “macros” are two very similar concepts. Now
macros are something I’ve heard of at least, so I was off to do some more research.

Unfortunately, web searches for the terms “reflection” and “macro” turn up a lot of
macro-lens photography results.

I’ve heard of macros in Julia where they’re used to “rewrite” an expression. This is a nice rundown
of the process, as are the official docs. These are
used in many places. One up-and-coming place is the new Tidier.jl which implements the tidyverse (at least the most common dplyr and purrr parts)
using macros (denoted with a @ prefix)

using Tidier
using RDatasets

movies = dataset("ggplot2", "movies");

@chain movies begin
    @mutate(Budget = Budget / 1_000_000)
    @filter(Budget >= mean(skipmissing(Budget)))
    @select(Title, Budget)
    @slice(1:5)
end

Rust uses macros for printing (amongst other things); println!() is a macro,
apparently at least in part because it needs to be able to take an arbitrary
number of args, so one can write

>> println!("a = {}, b = {}, c = {}", 1, 2, 3)
a = 1, b = 2, c = 3

Rust has a shorthand macro for creating a new vector vec!()

>> let v = vec![2, 3, 4];

and also has the “debug macro” dbg!()
which is super handy – it prints out the expression you wrap, plus the value, so
you can inspect the current state with e.g.

>> dbg!(&v);
[src/lib.rs:109] &v = [
    2,
    3,
    4,
]

This last one would be great to have in R… as a side note, we could construct a
simple version with {rlang}

dbg <- function(x) {
  ex <- rlang::f_text(rlang::enquos(x)[[1]])
  ret <- rlang::eval_bare(x)
  message(glue::glue("DEBUG: {ex} = {ret}"))
  ret
}

a <- 1
b <- 3
x <- dbg(a + b)
## DEBUG: a + b = 4
y <- dbg(2*x + 3)
## DEBUG: 2 * x + 3 = 11
z <- 10 + dbg(y*2)
## DEBUG: y * 2 = 22

In all of these examples of macros, the code that is run is different to the code you write
because the macro makes some changes before executing.

In R there isn’t a “proper” way to do this but we do have ways to manipulate code
and we do have ways to retrieve “unparsed” input, e.g. substitute(). A quick look
for “macros in R” turned up a function in a package that is more than 20 years old (I was
only starting University when this came out and knew approximately 0 programming) and
comes with a journal article; gtools::defmacro() by Thomas Lumley
has a construction for writing something that behaves like a macro.

That article is from 2001 when R 1.3.1 was being released. The example code made me do a double-take

library(gtools)

####
# macro for replacing a specified missing value indicator with NA
# within a dataframe
###
setNA <- defmacro(df, var, values,
  expr = {
    df$var[df$var %in% values] <- NA
  }
)

# create example data using 999 as a missing value indicator
d <- data.frame(
  Grp = c("Trt", "Ctl", "Ctl", "Trt", "Ctl", "Ctl", "Trt", "Ctl", "Trt", "Ctl"),
  V1 = c(1, 2, 3, 4, 5, 6, 999, 8, 9, 10),
  V2 = c(1, 1, 1, 1, 1, 2, 999, 2, 999, 999),
  stringsAsFactors = TRUE
)
d
##    Grp  V1  V2
## 1  Trt   1   1
## 2  Ctl   2   1
## 3  Ctl   3   1
## 4  Trt   4   1
## 5  Ctl   5   1
## 6  Ctl   6   2
## 7  Trt 999 999
## 8  Ctl   8   2
## 9  Trt   9 999
## 10 Ctl  10 999
# Try it out
setNA(d, V1, 999)
setNA(d, V2, 999)
d
##    Grp V1 V2
## 1  Trt  1  1
## 2  Ctl  2  1
## 3  Ctl  3  1
## 4  Trt  4  1
## 5  Ctl  5  1
## 6  Ctl  6  2
## 7  Trt NA NA
## 8  Ctl  8  2
## 9  Trt  9 NA
## 10 Ctl 10 NA

Wait – I thought… there’s no assignment in those last lines, but the data is
being modified!?! Sure enough, the internals of defmacro make it clear that this
is the case, but it seemed like magic. Essentially, this identifies what needs to
happen, what it needs to happen to (via substitute()), and makes it happen in the parent.frame(). Neat! So, what else can we do with this?

I thought about it for a while and realised what could be a [te|ho]rrific one…

Just a couple of weeks ago, Danielle Navarro made a wish

not for the first time I find myself wishing that push() and pop() were S3 generics in #rstats

Now, if you’re not familiar with those, pop(x) removes the first element of a structure x (e.g. a vector) and returns that first value, leaving the original object x containing only the remaining elements, whereas push(x, y) inserts the value y as the first element of x, moving the remaining elements down the line. These show up more in object-oriented languages, but they
don’t exist in R.

If we define a vector a containing some values

a <- c(3, 1, 4, 1, 5, 9)

and we wish to extract the first value, we can certainly do so with

a[1]
## [1] 3

but, due to the nature of R, the vector a is unchanged

a
## [1] 3 1 4 1 5 9

Instead, we could remove the first value of a with

a[-1]
## [1] 1 4 1 5 9

but again, a remains unchanged – in order to modify a we must redefine it as e.g.

a <- a[-1]
a
## [1] 1 4 1 5 9

If we wanted to build a pop() function, we could use substitute() to figure out
what the passed input object was, perform the extraction of the first element, and so on…

But as we’ve just seen, there’s a better way to define that – a macro!

r_pop <- gtools::defmacro(x, expr = {
  ret <- x[1]
  x <- x[-1]
  ret
})

Now, if we use that on a vector

a <- c(3, 1, 4, 1, 5, 9)
r_pop(a)
## [1] 3
a
## [1] 1 4 1 5 9

It works!!!

Danielle wanted a Generic, though, so we can easily make pop() a Generic and add methods for
some classes (which can be further extended).

To that end, I present a brand new package; {weasel}

pop() goes the {weasel}

pop() goes the {weasel}

This defines pop() and push() as Generics with methods defined for vectors, lists, and data.frames

a <- list(x = c(2, 3), y = c("foo", "bar"), z = c(3.1, 4.2, 6.9))
a
## $x
## [1] 2 3
## 
## $y
## [1] "foo" "bar"
## 
## $z
## [1] 3.1 4.2 6.9
x <- pop(a)
a
## $y
## [1] "foo" "bar"
## 
## $z
## [1] 3.1 4.2 6.9
x
## [1] 2 3
a <- data.frame(x = c(2, 3, 4), y = c("foo", "bar", "baz"), z = c(3.1, 4.2, 6.9))
a
##   x   y   z
## 1 2 foo 3.1
## 2 3 bar 4.2
## 3 4 baz 6.9
x <- pop(a)
a
##   x   y   z
## 2 3 bar 4.2
## 3 4 baz 6.9
x
##   x   y   z
## 1 2 foo 3.1
a <- c(1, 4, 1, 5, 9)
a
## [1] 1 4 1 5 9
push(a, 3)
a
## [1] 3 1 4 1 5 9
a <- data.frame(y = c("foo", "bar", "baz"), z = c(3.1, 4.2, 6.9))
a
##     y   z
## 1 foo 3.1
## 2 bar 4.2
## 3 baz 6.9
push(a, data.frame(y = 99, z = 77))
a
##     y    z
## 1  99 77.0
## 2 foo  3.1
## 3 bar  4.2
## 4 baz  6.9

I wrote this (simple) package as a bit of an exercise – I really don’t think you
should actually use it for anything. The “looks like it modifies in-place but actually
doesn’t”
is really non-idiomatic for R. Nonetheless, I was really interested to see
that defmacro can be used as a function definition that the dispatch machinery will respect. The only catch I’ve found so far is that I can’t use ellipses (...) in the function signature.

I noticed that Dirk Schumacher built a similar defmacro package more recently, but that appears
to be more aimed at building macros to be expanded on package load (funnily enough, “compile-time macros” – we’ve come full circle). This seems like a great opportunity for “inlining”
some functions. I’ll definitely be digging deeper into that one.

Let me know if you have a better explanation of any of the concepts I’ve (badly) described here;
I’m absolutely just learning and following Julia Evans’ advice about blogging.

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-06-10
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi         1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  gtools      * 3.9.4   2022-11-27 [1] CRAN (R 4.1.2)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pillar        1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.1.2)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble      * 3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  weasel      * 0.1.0   2023-06-09 [1] local
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Refreshing Mining Complex Networks book

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/06/09/graphs.html

Introduction

Two years ago together with Paweł Prałat and François Théberge
we have written Mining Complex Networks book.
Since it has received a positive response from the community
we decided to start planning for its second edition, where
we want to add material that covers recent advances in the field.

This decision prompted me to write something about graph analysis
that is at the same time a nice application of the Julia language.

The post was written under Julia 1.9.0 and Graphs.jl 1.8.0.

The problem

Let me start with a business scenario.
Assume you have a set of products in a store and know which of them
are bought together. You represent this data as a graph where
nodes are products and edges are representing co-purchase of products.

You want to find products that are not bought together, but that
are fall into similar baskets. For example, assume you have two types
of milk. Most likely they are not bought together, but they are both
bought with e.g. bread. Such products could be called substitutes.

In the book we discuss some advanced methods that can be used for this task.
In this post let me concentrate on a simple way to detect such products.

Assume that I have four products i, j, k, and l. If in the
graph we have edges i-j, j-k, k-l, and l-i, but do not have
edges i-k and j-l then we can say that i and k are substitutes
and j and l are substitutes (so this four nodes could be called
double-substitutes).

From a graph theory perspective the set of nodes {i, j, k, l} form a
cycle of length 4 and they do not form any sorter cycle
(so this cycle has a hole inside it). Let us call such cycles minimal.
We can visualize this situation for example as follows:

  i
 / \
l   j
 \ /
  k

Our task for today is the following. Assume we have a random graph
with n nodes. In this graph each edge is included with probability
p, independently from every other edge. We want to check how
many minimal 4-cycles it contains.

Analytical solution

For a random graph it is possible to compute the expected number of
minimal 4-cycles relatively easily.

Using the additivity of expectation we can concentrate on four-element
subsets of sets of nodes. For each such subset {i, j, k, l}
we have three possibilities that they can form a minimal 4-cycle:

  i        i        i
 / \      / \      / \
l   j    l   k    k   j
 \ /      \ /      \ /
  k        j        l

Note that the probability of observing any of them is p^4*(1-p)^2
(we see 4 edges and do not see 2 edges).

Therefore we can easily write a function that computes the expected
number of such motifs as:

expected_count_empty_4cycle(n, p) = p^4*(1-p)^2 * 3 * binomial(n, 4)

Let us calculate the expected number of such motifs for n equal to 10
and 100 and p equal to 0.1 and 0.2:

julia> expected_count_empty_4cycle.([10, 100], [0.1 0.2 0.3])
2×3 Matrix{Float64}:
   0.05103      0.64512      2.50047
 952.858    12046.0      46690.0

The function works fast and nice, but what if we made a mistake when defining it?
With Julia it is easy to run a simulation checking the result.

Simulation solution, part 1

Let us start with a solution that reproduces the thinking we used to derive
our analytical result:

function naive_count_empty_4cycle(g)
    empty4cycle = 0
    for i in 1:nv(g), j in i+1:nv(g), k in j+1:nv(g), l in k+1:nv(g)
        empty4cycle += has_edge(g, i, j) && has_edge(g, j, k) &&
                       has_edge(g, k, l) && has_edge(g, l, i) &&
                       !has_edge(g, i, k) && !has_edge(g, j, l)
        empty4cycle += has_edge(g, i, k) && has_edge(g, k, j) &&
                       has_edge(g, j, l) && has_edge(g, l, i) &&
                       !has_edge(g, i, j) && !has_edge(g, k, l)
        empty4cycle += has_edge(g, i, j) && has_edge(g, j, l) &&
                       has_edge(g, l, k) && has_edge(g, k, i) &&
                       !has_edge(g, i, l) && !has_edge(g, j, k)
    end
    return empty4cycle
end

The function expects to get a Graphs.jl graph g and traverses all
subsets of its nodes.

Let us check it:

julia> using Graphs

julia> using Random

julia> using Statistics

julia> Random.seed!(1234);

julia> [mean(naive_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
 0.051
 0.598
 2.455

So the results look good for n=10. But what about n=100?

Let us check timing of a single run:

julia> @time naive_count_empty_4cycle(erdos_renyi(100, 0.1));
  0.142410 seconds (274 allocations: 38.594 KiB)

We can see that the function is slow. If we wanted to run it 1000 times it would take us
over 2 minutes. Let us think of a faster approach.

Simulation solution, part 2

The idea we can use is the following. Consider two nodes i and j and assume they
are not connected. Then it is enough to find common neighbors of i and j
and check how many pairs of them are not connected.

Here is an example implementation of this idea:

function find_common_sorted!(common, ni, nj)
    empty!(common)
    iidx = 1
    jidx = 1
    while iidx <= length(ni) && jidx <= length(nj)
        if ni[iidx] < nj[jidx]
            iidx += 1
        elseif ni[iidx] > nj[jidx]
            jidx += 1
        else
            push!(common, ni[iidx])
            iidx += 1
            jidx += 1
        end
    end
    nothing
end

function fast_count_empty_4cycle(g)
    common = Int[]
    empty4cycle = 0

    for i in 1:nv(g)
        ni = neighbors(g, i)
        for j in i+1:nv(g)
            has_edge(g, i, j) && continue
            nj = neighbors(g, j)
            find_common_sorted!(common, ni, nj)
            for a in 1:length(common), b in a+1:length(common)
                empty4cycle += !has_edge(g, common[a], common[b])
            end
        end
    end
    @assert iseven(empty4cycle)
    return empty4cycle ÷ 2
end

Note that in the code we divide the number of empty cycles found by two as
assuming that nodes {i, j, k, l} form an empty cycle we count it twice
(once starting with {i, k} pair and once starting with {j, l} pair).

Let us check our function:

julia> @time fast_count_empty_4cycle(erdos_renyi(100, 0.1));
  0.001491 seconds (280 allocations: 40.047 KiB)

Indeed it is significantly faster. Therefore we can use it to also check the n=100 case:

julia> [mean(fast_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
 0.051
 0.598
 2.455

julia> [mean(fast_count_empty_4cycle(erdos_renyi(100, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
   954.364
 12017.067
 46574.827

The obtained results confirm our analytical solution, so we have some more confidence
that indeed we have derived it correctly.

Conclusions

I hope you found the problem of finding the number of minimal 4-cycles interesting.
Personally, really enjoyed coding it in Julia. In particular note that in Julia
with our faster version of code we almost do not do any allocations:

julia> gr = erdos_renyi(1000, 0.1); @time fast_count_empty_4cycle(gr)
  2.534875 seconds (4 allocations: 496 bytes)
10169170

julia> expected_count_empty_4cycle.(1000, 0.1)
1.0064361314250002e7

and the code runs quite fast.

Filtering grouped data frames in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/06/02/filter.html

Introduction

In DataFrames.jl a data frame can be grouped by columns.
In this way its view is created. The type of this view
is GroupedDataFrame and it allows for performing groupwise
operations on the data stored in the original data frame.

If you use SQL then you can think of GroupedDataFrame
as an object with is obtained by using GROUP BY command
but without performing an aggregation step. This is often
useful as after grouping data you might want to perform
several different aggregation operations on it without
having to group it each time.

In this post I want to discuss how filtering of
GroupedDataFrame works in DataFrames.jl.
In general there are two ways how you might want to filter it:

  • by dropping whole groups;
  • by dropping rows within group.

I will discuss both today.

The post was written under Julia 1.9.0 and DataFrames.jl 1.5.0.

Preparing the data

Start by creating a sample GroupedDataFrame:

julia> using DataFrames

julia> df = DataFrame(x = repeat(1:4, 2), id=1:8)
8×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     1      5
   6 │     2      6
   7 │     3      7
   8 │     4      8

julia> gdf = groupby(df, :x, sort=true);

julia> show(gdf, allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
Group 2 (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
Group 3 (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7
Group 4 (2 rows): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4
   2 │     4      8

Dropping whole groups

The first type of operation is when you want to drop whole groups
from your grouped data frame.

First, assume that you want to drop
the first and last group from gdf. You can achieve this using indexing:

julia> gdf[[2, 3]]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

julia> gdf[Not(begin, end)]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

Now assume that you want to drop groups in which the value of the grouping
variable :x is even. You could still use indexing like this:

julia> gdf[isodd.(getproperty.(keys(gdf), :x))]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

However, a more general approach is to use the filter function:

julia> filter(sdf -> isodd(first(sdf.x)), gdf)
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

If you wanted to drop grouping of the result you can pass ungroup=true in filter:

julia> filter(sdf -> isodd(first(sdf.x)), gdf, ungroup=true)
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
   3 │     3      3
   4 │     3      7

Dropping rows within group

Now that you know how to drop whole groups you might wonder how to drop individual rows
within group. Assume that we want to drop from each group all rows except
the row with minimal value of :id within group. You can achieve this using the
subset function:

julia> subset(gdf, :id => x -> x .== minimum(x))
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4

Note that by default subset ungroups data. If you want to keep it grouped
pass ungroup=false:

julia> show(subset(gdf, :id => x -> x .== minimum(x), ungroup=false),
            allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (1 row): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
Group 2 (1 row): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
Group 3 (1 row): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
Group 4 (1 row): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4

Conclusions

I hope you found the examples using indexing, filter, and subset
useful and they improved your understanding of row filtering options
available in DataFrames.jl.

In particular remember that filter and subset have a different
default value of the ungoup keyword argument. This difference was
made to reflect the fact that when doing filter typically we do not
want to ungroup data, while when doing subset typically we want
to drop grouping.