Working with a grouped data frame, part 2

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/08/gdf.html

Introduction

This is a follow up to the post from last week. We will continue
discussing how one can work with GroupedDataFrame objects in DataFrames.jl.
Today we focus on indexing of grouped data frames.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Warm-up: getting group indices

First create some grouped data frame:

julia> using DataFrames

julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
                      str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> gdf = groupby(df, :str, sort=true)
GroupedDataFrame with 3 groups based on key: str
First Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
⋮
Last Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

It is sometimes useful to learn what is a group number of each row of the source data frame df in a grouped data frame gdf.
You can easily get this information with groupindices:

julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
 1
 1
 3
 3
 2
 2

Extracting a single group

A basic operation when indexing a GroupedDataFrame is to pick a group by its number. Here is an example:

julia> gdf[1]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[2]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[3]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

Note, that gdf behaves similarly to a vector. You can even use begin and end in indexing:

julia> gdf[begin]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[end]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

Often you might want to extract a group not by its position in gdf, but by the value of the grouping
variable or variables. In this case you can use GroupKey, dictionary, tuple, or named tuple to achieve this.

Let us check how it works. Start with dictionary, tuple, and named tuple:

julia> gdf[Dict("str" => "b")] # dictionary
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[("b",)] # tuple
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[(; str="b")] # named tuple
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

With GroupKey we first need to get it from keys, but everything else works the same:

julia> key = keys(gdf)[1]
GroupKey: (str = "a",)

julia> gdf[key]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

You might ask why we require passing grouping variable in a container (dictionary, tuple, named tuple, GroupKey)
and not directly pass the required value when indexing? The reason is that if you grouped your data by integer column
the result would be ambiguous. Here is an example showing that under the defined rules there is no such ambiguity:

julia> gdf2 = groupby(df, :int, sort=false)
GroupedDataFrame with 3 groups based on key: int
First Group (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
⋮
Last Group (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> gdf2[3] # third group
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> gdf2[(3, )] # group with value of the grouping variable equal to 3
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b

Extracting multiple groups

You now know how to pick a single group, so selecting multiple groups is a natural next step.
You can use a collection of any of the selectors we have already discussed. Here are some examples:

julia> gdf[[3, 1]] # selection by group number
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
⋮
Last Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[[("c",), ("a",)]] # selection by grouping variable value
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
⋮
Last Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

Note that indexing allows both for reordering and for dropping groups, which often comes handy when analyzing data.
Also note that groupindices is aware of such changes:

julia> groupindices(gdf[[3, 1]])
6-element Vector{Union{Missing, Int64}}:
 2
 2
 1
 1
  missing
  missing

Here group with "c" is first, with "a" is second and with "b" is dropped, so missing is returned in the produced vector.

It is also worth to remember that subset and filter can be used with GroupedDataFrames. This topic is discussed in this post.

Key lookup

Sometimes we do not want to index into a grouped data frame, but just check if it contains some key. This is easily achievable with the haskey function:

julia> haskey(gdf, ("a",))
true

julia> haskey(gdf, ("z",))
false

Conclusions

In this post we discussed indexing of GroupedDataFrames. This concludes the basic tutorial of working with these data structures.
I hope you will find the functionalities I have covered useful in your work.

Working with a grouped data frame, part 1

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/01/gdf.html

Introduction

One of the features of DataFrames.jl that I often find useful is that when you group
a data frame by some of its columns the resulting GroupedDataFrame is an object
that gains new and useful functionalities.

Some time ago I have discussed how GroupedDataFrame can be filtered. You can find
this post here. In this post and the following one that I plan to write next
week I thought that it would be useful to review other key functionalities of
a GroupedDataFrame.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Creating a grouped data frame

You can create a GroupedDataFrame using the groupby function.

Here are some examples:

julia> using DataFrames

julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
                      str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> show(groupby(df, :int), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
Group 3 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
julia> show(groupby(df, :int; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
Group 3 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
julia> show(groupby(df, :int; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
Group 3 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
julia> show(groupby(df, :str), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
Group 3 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b
julia> show(groupby(df, :str; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b
Group 3 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
julia> show(groupby(df, :str; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
Group 3 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

What this example shows is that the key thing you need to remember
to decide about a grouped data frame is the order of groups.

There are two options here:

  • groups sorted by the grouping column value, when you pass sort=true;
  • groups sorted by the order of appearance of values in the source, when you pass sort=true.

You might ask what happens if you do not pass the sort keyword argument?
In this case either of the options is used depending on which one is faster.
Therefore, omitting sort, can be thought of as an information that the user does not
care about the order of groups but wants the grouping operation to be as fast as possible.

When does the order of groups not matter?

In some cases the order of groups is irrelevant (so you can safely skip passing it).
The most important scenario of this kind is when you use the select or transform function
with a GroupedDataFrame. The reason is that these functions anyway always keep the order of
rows from the source data frame (no matter how the groups are rearranged in a GroupedDataFrame).
However, it is not the case with combine, as it respects the order of groups in a GroupedDataFrame.

Let us see an example highlighting the difference between these cases:

julia> select(groupby(df, :int, sort=true), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> combine(groupby(df, :int, sort=true), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
   3 │     2  c
   4 │     2  b
   5 │     3  a
   6 │     3  b

julia> select(groupby(df, :int, sort=false), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> combine(groupby(df, :int, sort=false), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
   3 │     3  a
   4 │     3  b
   5 │     2  c
   6 │     2  b

As you can see select kept the rows in the order in which they are present in df no matter if we
passed sort=true or sort=false. On the other hand combine returns rows grouped by the groups and
the order of groups corresponds to their order in GroupedDataFrame, so passing sort=true or
sort=false in general changes.

Special operation specification syntax for working with grouped data frames

When discussing select or combine in conjunction with GroupedDataFrame it is important to mention
that there are four special cases of operation specification syntax designed specifically for working with
them. They are:

  • nrow to compute the number of rows in each group.
  • proprow to compute the proportion of rows in each group.
  • eachindex to return a vector holding the number of each row within each group.
  • groupindices to return the group number.

Each of them optionally allows you to specify the name of the target column by => syntax.
Here are some examples:

julia> combine(groupby(df, :int, sort=false), nrow)
3×2 DataFrame
 Row │ int    nrow
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      2
   3 │     2      2

julia> combine(groupby(df, :int, sort=false), proprow => "row %")
3×2 DataFrame
 Row │ int    row %
     │ Int64  Float64
─────┼─────────────────
   1 │     1  0.333333
   2 │     3  0.333333
   3 │     2  0.333333

julia> combine(groupby(df, :int, sort=false), eachindex)
6×2 DataFrame
 Row │ int    eachindex
     │ Int64  Int64
─────┼──────────────────
   1 │     1          1
   2 │     1          2
   3 │     3          1
   4 │     3          2
   5 │     2          1
   6 │     2          2

julia> combine(groupby(df, :int, sort=false), groupindices => "group #")
3×2 DataFrame
 Row │ int    group #
     │ Int64  Int64
─────┼────────────────
   1 │     1        1
   2 │     3        2
   3 │     2        3

Iterating a grouped data frame

Apart from using functions such as select or combine on a GroupedDataFrame it is useful to know
that it supports iteration. Therefore you can use a GroupedDataFrame in a loop or in a comprehension.
When iterated GroupedDataFrame returns data frames corresponding to the groups. Let us see:

julia> for v in groupby(df, :int, sort=false)
           println(v)
       end
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> [v for v in groupby(df, :int, sort=false)]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> collect(groupby(df, :int, sort=false))
3-element Vector{Any}:
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

The last example has shown you that you can pass a GroupedDataFrame to a function expecting an iterable, in this case the collect function. The one exception to this rule is that you cannot use GroupedDataFrame with the map function directly:

julia> map(identity, groupby(df, :int, sort=false))
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved

The reason is that it was not clear if such operation should produce a vector or a data frame, and it is easy enough to achieve both results with other means. If you want e vector use e.g. a comprehension. If you want a data frame use e.g. combine or select.

Advanced iteration

Sometimes, when iterating a GroupedDataFrame we might be interested not only in a data frame per group, but also in a value of grouping variable. This is easily achieved with the keys and pairs functions (depending on whether you only want grouping values or both grouping values and data frames):

julia> map(identity, keys(groupby(df, :int, sort=false)))
3-element Vector{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}}:
 GroupKey: (int = 1,)
 GroupKey: (int = 3,)
 GroupKey: (int = 2,)

julia> map(identity, pairs(groupby(df, :int, sort=false)))
3-element Vector{Pair{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}, SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}}:
 GroupKey: (int = 1,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 GroupKey: (int = 3,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 GroupKey: (int = 2,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

I used the map function to show you that it is only reserved to use it with plain GroupedDataFrame.

Working with group keys

As you can see in this example each group in a GroupedDataFrame is associated with a GroupKey. To get all
keys use the keys function:

julia> keys(groupby(df, :int, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (int = 1,)
 GroupKey: (int = 3,)
 GroupKey: (int = 2,)

Let us, as an example extract the last key so see how one can work with it:

julia> key = last(keys(groupby(df, :int, sort=false)))
GroupKey: (int = 2,)

You can get a value of the key by property access or indexing:

julia> key.int
2

julia> key[1]
2

julia> key["int"]
2

julia> key[:int]
2

It is also easy co convert GroupKey to a dictionary, vector, Tuple or NamedTuple if you would need it:

julia> Dict(key)
Dict{Symbol, Int64} with 1 entry:
  :int => 2

julia> collect(key)
1-element Vector{Int64}:
 2

julia> Tuple(key)
(2,)

julia> NamedTuple(key)
(int = 2,)

Note that, in general, you can group a data frame by multiple columns so you could query value of any grouping column
in the examples above. If you needed to get a list of grouping columns use the groupcols function:

julia> groupcols(groupby(df, :int, sort=false))
1-element Vector{Symbol}:
 :int

Conclusions

In this post we have learned how one can create a grouped data frame and how to choose the order of groups in it.
As a follow-up we have shown how GroupedDataFrame interacts with functions like select or combine.
Next we discussed iterator interface support by GroupedDataFrame and how to get and use information about
values of grouping columns for each group. I hope you found these examples useful.

In the post next week we will discuss how GroupedDataFrame supports the indexing interface.

GSoC in LLVM 2024

By: Miguel Raz Guzmán Macedo

Re-posted from: https://miguelraz.github.io/blog/gsoc2024/index.html

I'm trying to get a GSoC 2024 in LLVM

and I will be documenting my work with this ongoing blogpost in reverse chronological order.


If you want to see more posts like this, consider chucking a buck or two on my GitHub sponsors, or, you know, hire me as a grad student.


29/02/2024

"hazlo cobarde"

Add the 3 way comparison instruction <=> to LLVM.

I like this GSoC in particular because

  • I will learn a wide swath of LLVM

  • I'll be working with a lot of optimization passes

  • I'll get to bring cool perf to C++/Rust and Julia

  • I was dared by the other, more talented Miguel to actually help improve LLVM

Next task

Add a new intrinsicLangref, then Intrinsics.td, then maybe the pass verifier.

I've already put up a sample PR and got redirected on what looks like the proper working path for this endeavour.