Tag Archives: julialang

Transforming multiple columns in DataFrames.jl

Re-posted from: https://bkamins.github.io/julialang/2024/03/15/transforms.html

Introduction

Today I want to comment on a recurring topic that DataFrames.jl users raise.
The question is how one should transform multiple columns of a data frame using
operation specification syntax.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

What is operation specification syntax?

In DataFrames.jl the combine, select, and transform functions allow
users for passing the requests for data transformation using operation
specification syntax. This syntax is feature-rich, and you can find its
description for example here. Today I want to focus on its principal concept.

In a general form each request for making an operation on data has the (E)xtract-(T)ransform-(L)oad form.
That means that we need to specify:

source columns to get data from (the extract part);;
the operation to apply to these columns (the transform part);
the target columns where we want to store the result of the operation (the load part).

These tree parts are syntactically expressed using the following form:

[source columns specification] => [transformation function] => [target columns specification]

Let me give an example. Assume you have the following data:

julia> using DataFrames

julia> df = DataFrame(reshape(1:15, 5, 3), :auto)
5×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      6     11
   2 │     2      7     12
   3 │     3      8     13
   4 │     4      9     14
   5 │     5     10     15

We want to compute the sum of column "x1" and store it in column names "x1_sum"
Since the sum function performs the addition operation the syntax specification should be:

"x1" => sum => "x1_sum"

Let us check it with the combine function:

julia> combine(df, "x1" => sum => "x1_sum")
1×1 DataFrame
 Row │ x1_sum
     │ Int64
─────┼────────
   1 │     15

In this syntax it is important to note two things:

the "x1" column as a whole was passed to the sum function (as we want to compute its sum);
the "x1" column is a single positional argument passed to the sum function.

Two natural questions that arise are the following:

What if I do not want to perform an operation on a whole column, but on its elements (a.k.a. vectorization of operation)?
What if I want to pass multiple columns as a source for computations?

We will now investigate these two dimensions.

Vectorization of operations

Vectorization in DataFrames.jl is easy. Just wrap the function you use in the ByRow object. Here is an example:

julia> combine(df, "x1" => string => "x1_str")
1×1 DataFrame
 Row │ x1_str
     │ String
─────┼─────────────────
   1 │ [1, 2, 3, 4, 5]

julia> combine(df, "x1" => ByRow(string) => "x1_strs")
5×1 DataFrame
 Row │ x1_strs
     │ String
─────┼─────────
   1 │ 1
   2 │ 2
   3 │ 3
   4 │ 4
   5 │ 5

Note that "x1" => string => "x1_str" passed the whole "x1" column to the string function so we got a single "[1, 2, 3, 4, 5]"
string in the output.

While writing "x1" => ByRow(string) => "x1_strs" passed each element of "x1" column to the string function individually,
so in the result we got a vector of five string representations of numbers of the numbers from the source.

Passing multiple columns

Now let us have a look at passing multiple columns. There are two ways you can do it.

The first is when your function accepts multiple positional arguments. An example of such function is string see:

julia> string(df.x1, df.x2)
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"

If we pass a collection of columns as a source in operation specification syntax we get this behavior:

julia> combine(df, ["x1", "x2"] => string => "x1_x2_str")
1×1 DataFrame
 Row │ x1_x2_str
     │ String
─────┼─────────────────────────────────
   1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]

Naturally, the above combines with vectorization. Therefore since:

julia> string.(df.x1, df.x2)
5-element Vector{String}:
 "16"
 "27"
 "38"
 "49"
 "510"

we also have:

julia> combine(df, ["x1", "x2"] => ByRow(string) => "x1_x2_strs")
5×1 DataFrame
 Row │ x1_x2_strs
     │ String
─────┼────────────
   1 │ 16
   2 │ 27
   3 │ 38
   4 │ 49
   5 │ 510

However, there are cases when we have a function that expects multiple columns to be passed as a single positional argument.
This is handled in DataFrames.jl with the AsTable wrapper, which you can apply to the source columns.
If you use it then instead of getting multiple positional arguments the function will get a single positional argument
that will be a NamedTuple holding the source columns.

To convince ourselves that this is indeed what happens let us create a helper function:

julia> function helper(x)
           @show x
           return string(x.x1, x.x2)
       end
helper (generic function with 1 method)

This helper function first prints us its only argument x and next assumes that it has x1 and x2 fields and applies the string function to them.
Let us first check it in practice:

julia> helper((x1=[1, 2, 3, 4, 5], x2=[6, 7, 8, 9, 10]))
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"

Now let us use the helper function with combine:

julia> combine(df, AsTable(["x1", "x2"]) => helper => "x1_x2_str")
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
1×1 DataFrame
 Row │ x1_x2_str
     │ String
─────┼─────────────────────────────────
   1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]

Indeed, we see that helper got a named tuple holding two columns of the source data frame.

Again, this syntax plays well with ByRow:

julia> combine(df, AsTable(["x1", "x2"]) => ByRow(helper) => "x1_x2_strs")
x = (x1 = 1, x2 = 6)
x = (x1 = 2, x2 = 7)
x = (x1 = 3, x2 = 8)
x = (x1 = 4, x2 = 9)
x = (x1 = 5, x2 = 10)
5×1 DataFrame
 Row │ x1_x2_strs
     │ String
─────┼────────────
   1 │ 16
   2 │ 27
   3 │ 38
   4 │ 49
   5 │ 510

We see that this time helper got a separate named tuple for each row of source data frame.

Conclusions

In summary today we discussed two special operations in DataFrames.jl operation specification syntax:

the ByRow which vectorizes the function passed to it;
the AsTable which allows us to pass source columns as a single named tuple to the transformation function
(instead of passing them as consecutive positional arguments, which is the default).

I hope these examples were useful in helping you understand the design of operation specification syntax.

Working with a grouped data frame, part 2

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/08/gdf.html

Introduction

This is a follow up to the post from last week. We will continue
discussing how one can work with GroupedDataFrame objects in DataFrames.jl.
Today we focus on indexing of grouped data frames.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Warm-up: getting group indices

First create some grouped data frame:

julia> using DataFrames

julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
                      str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> gdf = groupby(df, :str, sort=true)
GroupedDataFrame with 3 groups based on key: str
First Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
⋮
Last Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

It is sometimes useful to learn what is a group number of each row of the source data frame df in a grouped data frame gdf.
You can easily get this information with groupindices:

julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
 1
 1
 3
 3
 2
 2

Extracting a single group

A basic operation when indexing a GroupedDataFrame is to pick a group by its number. Here is an example:

julia> gdf[1]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[2]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[3]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

Note, that gdf behaves similarly to a vector. You can even use begin and end in indexing:

julia> gdf[begin]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[end]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

Often you might want to extract a group not by its position in gdf, but by the value of the grouping
variable or variables. In this case you can use GroupKey, dictionary, tuple, or named tuple to achieve this.

Let us check how it works. Start with dictionary, tuple, and named tuple:

julia> gdf[Dict("str" => "b")] # dictionary
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[("b",)] # tuple
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[(; str="b")] # named tuple
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

With GroupKey we first need to get it from keys, but everything else works the same:

julia> key = keys(gdf)[1]
GroupKey: (str = "a",)

julia> gdf[key]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

You might ask why we require passing grouping variable in a container (dictionary, tuple, named tuple, GroupKey)
and not directly pass the required value when indexing? The reason is that if you grouped your data by integer column
the result would be ambiguous. Here is an example showing that under the defined rules there is no such ambiguity:

julia> gdf2 = groupby(df, :int, sort=false)
GroupedDataFrame with 3 groups based on key: int
First Group (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
⋮
Last Group (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> gdf2[3] # third group
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> gdf2[(3, )] # group with value of the grouping variable equal to 3
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b

Extracting multiple groups

You now know how to pick a single group, so selecting multiple groups is a natural next step.
You can use a collection of any of the selectors we have already discussed. Here are some examples:

julia> gdf[[3, 1]] # selection by group number
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
⋮
Last Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[[("c",), ("a",)]] # selection by grouping variable value
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
⋮
Last Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

Note that indexing allows both for reordering and for dropping groups, which often comes handy when analyzing data.
Also note that groupindices is aware of such changes:

julia> groupindices(gdf[[3, 1]])
6-element Vector{Union{Missing, Int64}}:
 2
 2
 1
 1
  missing
  missing

Here group with "c" is first, with "a" is second and with "b" is dropped, so missing is returned in the produced vector.

It is also worth to remember that subset and filter can be used with GroupedDataFrames. This topic is discussed in this post.

Key lookup

Sometimes we do not want to index into a grouped data frame, but just check if it contains some key. This is easily achievable with the haskey function:

julia> haskey(gdf, ("a",))
true

julia> haskey(gdf, ("z",))
false

Conclusions

In this post we discussed indexing of GroupedDataFrames. This concludes the basic tutorial of working with these data structures.
I hope you will find the functionalities I have covered useful in your work.

Working with a grouped data frame, part 1

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/01/gdf.html

Introduction

One of the features of DataFrames.jl that I often find useful is that when you group
a data frame by some of its columns the resulting GroupedDataFrame is an object
that gains new and useful functionalities.

Some time ago I have discussed how GroupedDataFrame can be filtered. You can find
this post here. In this post and the following one that I plan to write next
week I thought that it would be useful to review other key functionalities of
a GroupedDataFrame.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Creating a grouped data frame

You can create a GroupedDataFrame using the groupby function.

Here are some examples:

julia> using DataFrames

julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
                      str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> show(groupby(df, :int), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
Group 3 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
julia> show(groupby(df, :int; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
Group 3 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
julia> show(groupby(df, :int; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
Group 3 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
julia> show(groupby(df, :str), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
Group 3 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b
julia> show(groupby(df, :str; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b
Group 3 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
julia> show(groupby(df, :str; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
Group 3 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

What this example shows is that the key thing you need to remember
to decide about a grouped data frame is the order of groups.

There are two options here:

groups sorted by the grouping column value, when you pass sort=true;
groups sorted by the order of appearance of values in the source, when you pass sort=true.

You might ask what happens if you do not pass the sort keyword argument?
In this case either of the options is used depending on which one is faster.
Therefore, omitting sort, can be thought of as an information that the user does not
care about the order of groups but wants the grouping operation to be as fast as possible.

When does the order of groups not matter?

In some cases the order of groups is irrelevant (so you can safely skip passing it).
The most important scenario of this kind is when you use the select or transform function
with a GroupedDataFrame. The reason is that these functions anyway always keep the order of
rows from the source data frame (no matter how the groups are rearranged in a GroupedDataFrame).
However, it is not the case with combine, as it respects the order of groups in a GroupedDataFrame.

Let us see an example highlighting the difference between these cases:

julia> select(groupby(df, :int, sort=true), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> combine(groupby(df, :int, sort=true), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
   3 │     2  c
   4 │     2  b
   5 │     3  a
   6 │     3  b

julia> select(groupby(df, :int, sort=false), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> combine(groupby(df, :int, sort=false), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
   3 │     3  a
   4 │     3  b
   5 │     2  c
   6 │     2  b

As you can see select kept the rows in the order in which they are present in df no matter if we
passed sort=true or sort=false. On the other hand combine returns rows grouped by the groups and
the order of groups corresponds to their order in GroupedDataFrame, so passing sort=true or
sort=false in general changes.

Special operation specification syntax for working with grouped data frames

When discussing select or combine in conjunction with GroupedDataFrame it is important to mention
that there are four special cases of operation specification syntax designed specifically for working with
them. They are:

nrow to compute the number of rows in each group.
proprow to compute the proportion of rows in each group.
eachindex to return a vector holding the number of each row within each group.
groupindices to return the group number.

Each of them optionally allows you to specify the name of the target column by => syntax.
Here are some examples:

julia> combine(groupby(df, :int, sort=false), nrow)
3×2 DataFrame
 Row │ int    nrow
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      2
   3 │     2      2

julia> combine(groupby(df, :int, sort=false), proprow => "row %")
3×2 DataFrame
 Row │ int    row %
     │ Int64  Float64
─────┼─────────────────
   1 │     1  0.333333
   2 │     3  0.333333
   3 │     2  0.333333

julia> combine(groupby(df, :int, sort=false), eachindex)
6×2 DataFrame
 Row │ int    eachindex
     │ Int64  Int64
─────┼──────────────────
   1 │     1          1
   2 │     1          2
   3 │     3          1
   4 │     3          2
   5 │     2          1
   6 │     2          2

julia> combine(groupby(df, :int, sort=false), groupindices => "group #")
3×2 DataFrame
 Row │ int    group #
     │ Int64  Int64
─────┼────────────────
   1 │     1        1
   2 │     3        2
   3 │     2        3

Iterating a grouped data frame

Apart from using functions such as select or combine on a GroupedDataFrame it is useful to know
that it supports iteration. Therefore you can use a GroupedDataFrame in a loop or in a comprehension.
When iterated GroupedDataFrame returns data frames corresponding to the groups. Let us see:

julia> for v in groupby(df, :int, sort=false)
           println(v)
       end
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> [v for v in groupby(df, :int, sort=false)]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> collect(groupby(df, :int, sort=false))
3-element Vector{Any}:
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

The last example has shown you that you can pass a GroupedDataFrame to a function expecting an iterable, in this case the collect function. The one exception to this rule is that you cannot use GroupedDataFrame with the map function directly:

julia> map(identity, groupby(df, :int, sort=false))
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved

The reason is that it was not clear if such operation should produce a vector or a data frame, and it is easy enough to achieve both results with other means. If you want e vector use e.g. a comprehension. If you want a data frame use e.g. combine or select.

Advanced iteration

Sometimes, when iterating a GroupedDataFrame we might be interested not only in a data frame per group, but also in a value of grouping variable. This is easily achieved with the keys and pairs functions (depending on whether you only want grouping values or both grouping values and data frames):

julia> map(identity, keys(groupby(df, :int, sort=false)))
3-element Vector{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}}:
 GroupKey: (int = 1,)
 GroupKey: (int = 3,)
 GroupKey: (int = 2,)

julia> map(identity, pairs(groupby(df, :int, sort=false)))
3-element Vector{Pair{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}, SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}}:
 GroupKey: (int = 1,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 GroupKey: (int = 3,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 GroupKey: (int = 2,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

I used the map function to show you that it is only reserved to use it with plain GroupedDataFrame.

Working with group keys

As you can see in this example each group in a GroupedDataFrame is associated with a GroupKey. To get all
keys use the keys function:

julia> keys(groupby(df, :int, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (int = 1,)
 GroupKey: (int = 3,)
 GroupKey: (int = 2,)

Let us, as an example extract the last key so see how one can work with it:

julia> key = last(keys(groupby(df, :int, sort=false)))
GroupKey: (int = 2,)

You can get a value of the key by property access or indexing:

julia> key.int
2

julia> key[1]
2

julia> key["int"]
2

julia> key[:int]
2

It is also easy co convert GroupKey to a dictionary, vector, Tuple or NamedTuple if you would need it:

julia> Dict(key)
Dict{Symbol, Int64} with 1 entry:
  :int => 2

julia> collect(key)
1-element Vector{Int64}:
 2

julia> Tuple(key)
(2,)

julia> NamedTuple(key)
(int = 2,)

Note that, in general, you can group a data frame by multiple columns so you could query value of any grouping column
in the examples above. If you needed to get a list of grouping columns use the groupcols function:

julia> groupcols(groupby(df, :int, sort=false))
1-element Vector{Symbol}:
 :int

Conclusions

In this post we have learned how one can create a grouped data frame and how to choose the order of groups in it.
As a follow-up we have shown how GroupedDataFrame interacts with functions like select or combine.
Next we discussed iterator interface support by GroupedDataFrame and how to get and use information about
values of grouping columns for each group. I hope you found these examples useful.

In the post next week we will discuss how GroupedDataFrame supports the indexing interface.

juliabloggers.com

A Julia Language Blog Aggregator

Tag Archives: julialang

Transforming multiple columns in DataFrames.jl

Introduction

What is operation specification syntax?

Vectorization of operations

Passing multiple columns

Conclusions

Working with a grouped data frame, part 2

Introduction

Warm-up: getting group indices

Extracting a single group

Extracting multiple groups

Key lookup

Conclusions

Working with a grouped data frame, part 1

Introduction

Creating a grouped data frame

When does the order of groups not matter?

Special operation specification syntax for working with grouped data frames

Iterating a grouped data frame

Advanced iteration

Working with group keys

Conclusions