Tag Archives: julialang

DataFrames.jl at the Journal of Statistical Software

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/29/dataframes.html

Introduction

This week, I have reached a small personal milestone as
the Journal of Statistical Software has just published
the DataFrames.jl: Flexible and Fast Tabular Data in Julia
paper that I co-authored with Milan Bouchet-Valat.

Therefore, in this post, I thought I would summarize the various resources
I have been working on over the years that can help you master DataFrames.jl.

Types of documentation

It is hard to document a package properly. The reason is that different users
have different expectations. In this post there is an excellent summary of
four typical kinds of documentation:

  • learning-oriented tutorials;
  • goal-oriented how-to guides;
  • understanding-oriented discussions;
  • information-oriented reference material.

Each kind of documentation requires a slightly different approach, and
having it all in a single place is hard. In the following sections, I will go through
various materials I have prepared over the years and explain my intention
behind them.

The Journal of Statistical Software paper

The objective of the DataFrames.jl: Flexible and Fast Tabular Data in Julia paper
is understanding-oriented. Together with Milan, we tried to explain in it how the
DataFrames.jl package was designed and what the motivations behind these decisions were.

For this reason, you most likely cannot learn how to work with DataFrames.jl after reading
the paper. However, you will get an intuition for the basic building blocks
of the package.

It is similar to learning languages. I have recently decided to learn French.
As a part of this process, I watched the ALL THE RULES OF FRENCH IN 20 MINUTES video on YouTube.
I did not directly learn French from it, but it helped a lot in understanding the “design” of the French language.

Julia for Data Analysis book

Another major resource I have created is my Julia for Data Analysis book.
This resource is learning-oriented. I start it from the basics of the Julia language
and gradually add more complex elements so that eventually, the reader should be able to:

  • read and write data in various formats;
  • work with tabular data, including subsetting, grouping, and transforming;
  • visualize data;
  • build predictive models;
  • create data processing pipelines;

and more.

As you can probably guess, the DataFrames.jl package is the backbone of this material.
The book was prepared as a textbook that can be used in a 1-semester introductory course on data analysis using Julia
and is accompanied by numerous extra materials that can be found here.

Tutorials

Over the years, I have created a lot of goal-oriented how-to guides.
You can find their list here.

Also, since 2020 my blog has brought you, each week, some practical information on working with Julia
(quite often DataFrames.jl oriented).

Finally, introductory tutorials to DataFrames.jl are available as a part of the DataFrames.jl documentation here and here.

Since the volume of the tutorials is large, it might sometimes be a bit hard to navigate them, but this is probably unavoidable, as there is a substantial variety of questions that users might have.

Reference

I strongly prefer implementing the functionality of DataFrames.jl following the contract specified in the documentation
of provided functions. Therefore, I believe that we have quite a strong collection of reference materials that make it precise
how DataFrames.jl functionality is implemented. It is divided into four major parts:

  • specification of how types exposed by DataFrames.jl are designed is given here;
  • reference on provided functions can be found here;
  • a complete description of how indexing works in DataFrames.jl is available here;
  • information on how data frames handle table and column metadata is given here.

It is essential to highlight that these materials aim to be complete and precise. Unfortunately, this means
that they are verbose and sometimes hard for new users to digest. I think this cannot be helped,
and that is why we provide other kinds of documentation to make it easier to get started with DataFrames.jl.

Conclusions

I hope that this post can serve DataFrames.jl users as a helpful guide to different resources I have co-authored
that are provided to make learning and using the package easy and fun. Enjoy!

Working with rows of Tables.jl tables

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/22/tables.html

Introduction

Three weeks ago I wrote a post about getting a schema of Tables.jl tables.
Today, to complement it, I thought I would discuss how one can get rows of such tables.

The post was written using Julia 1.9.2, Tables.jl 1.11.0, DataAPI.jl 1.15.0, and DataFrames.jl 1.6.1.

Why is getting rows of a table needed?

Many Julia users are happy with using DataFrames.jl to work with their tables.
However, this is only one of the available options.
This means that package creators, especially, prefer not to hardcode DataFrame
as the specific type that their package supports, but to allow for generic Tables.jl tables.

An example of such a need is a function that takes a generic table and
split it into train-validation-test subsets. To achieve this you need to be able
to take a subset of its rows.

How is row sub-setting supported in Tables.jl?

There are two functions that, in combination, can be used to generically subset a Tables.jl table:

  • the DataAPI.nrow function that returns the number of rows in a table;
  • the Tables.subset function that allows you to get a subset of rows of a table.

Before I show you how they work, let me highlight one issue. Most Tables.jl tables
support these functions. However, their support is not guaranteed. The reason is that some tables
are never materialized in memory, e.g. they are only a stream of rows that can be read only once.
In such a case we do not know the number of rows in the table up front (as it is dynamic) and, similarly,
to get a subset of its rows you would need to scan the whole stream anyway.

Using the row sub-setting interface of Tables.jl

The DataAPI.nrow function is easy to understand. You pass it a table and in return you get the number of its rows.
Let us see it in practice:

julia> using DataAPI

julia> using Tables

julia> table = (a=1:10, b=11:20, c=21:30)
(a = 1:10, b = 11:20, c = 21:30)

julia> DataAPI.nrow(table)
10

The Tables.subset function accepts two positional arguments. The first is a table, and the second
specifies the 1-based row indices that should be picked. You have two options for passing indices.
You can pass a single integer index like this:

julia> Tables.subset(table, 2)
(a = 2, b = 12, c = 22)

In this case you get a single row of the table.
The other option is to pass a collection of indices, in which case you get a table (not a single row):

julia> Tables.subset(table, 2:3)
(a = 2:3, b = 12:13, c = 22:23)

To see that indeed it works for other tables, let us check a DataFrame from DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(table)
10×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     2     12     22
   3 │     3     13     23
   4 │     4     14     24
   5 │     5     15     25
   6 │     6     16     26
   7 │     7     17     27
   8 │     8     18     28
   9 │     9     19     29
  10 │    10     20     30

julia> nrow(df)
10

julia> Tables.subset(df, 2)
DataFrameRow
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   2 │     2     12     22

julia> Tables.subset(df, 2:3)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

Again, note that Tables.subset(df, 2) returned a DataFrameRow (a single row of a table),
while Tables.subset(df, 2:3) returned a DataFrame (a table).
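
As a minimal sketch of how the two functions combine, the helper below returns the first k rows of any table that supports both of them. The name first_rows is hypothetical and is not part of Tables.jl or DataAPI.jl.

```julia
using DataAPI
using Tables

# Hypothetical helper (not part of Tables.jl): return the first `k` rows
# of a table, or the whole table if it has fewer than `k` rows.
# Assumes `table` supports both DataAPI.nrow and Tables.subset.
function first_rows(table, k::Integer)
    n = DataAPI.nrow(table)
    # Passing a range (not a single integer) makes Tables.subset return a table.
    return Tables.subset(table, 1:min(k, n))
end

table = (a=1:10, b=11:20, c=21:30)
head3 = first_rows(table, 3)       # a table with rows 1, 2, and 3
all_rows = first_rows(table, 100)  # the table has only 10 rows, so all are kept
```

Because the helper never mentions a concrete table type, the same call should also work for a DataFrame or any other table implementing this interface.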

Advanced sub-setting options

If you work with large tables you often need to consider performance and memory consumption.
For Tables.subset this comes down to the question of whether the function copies data
or just makes a view of the source table. This choice is controlled by the viewhint keyword argument.

Let us first see how it works:

julia> Tables.subset(df, 2:3, viewhint=true)
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

julia> Tables.subset(df, 2:3, viewhint=false)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

As you can see viewhint=true returned a view (a SubDataFrame), while viewhint=false produced a copy.

Let us see another example:

julia> table2 = Tables.rowtable(df)
10-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}:
 (a = 1, b = 11, c = 21)
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)
 (a = 4, b = 14, c = 24)
 (a = 5, b = 15, c = 25)
 (a = 6, b = 16, c = 26)
 (a = 7, b = 17, c = 27)
 (a = 8, b = 18, c = 28)
 (a = 9, b = 19, c = 29)
 (a = 10, b = 20, c = 30)

julia> Tables.subset(table2, 2:3, viewhint=true)
2-element view(::Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}, 2:3) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}:
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)

julia> Tables.subset(table2, 2:3, viewhint=false)
2-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}:
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)

As you can see viewhint=true produced a view of a vector, while viewhint=false made a copy of source data.

Now you might ask why the keyword argument is called viewhint. The reason is that not all Tables.jl tables
can flexibly return either a view or a copy. Therefore the rules are as follows:

  • if viewhint is not passed then the table decides on its own whether it returns a copy or a view (depending on what is possible);
  • if viewhint=true then the table should return a view, but if this is not possible it can return a copy;
  • if viewhint=false then the table should return a copy, but if this is not possible it can return a view.

In other words, viewhint should be considered a performance hint only.
There is no guarantee that you get what you ask for (as for some tables satisfying this request might be impossible).
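
The practical consequence of getting a view or a copy can be sketched as follows. This example assumes a DataFrame source, for which (as shown above) both kinds of results are available; other table types may behave differently.

```julia
using DataFrames
using Tables

df = DataFrame(a=[1, 2, 3, 4, 5])

v = Tables.subset(df, 2:3, viewhint=true)   # a SubDataFrame (view) for a DataFrame
c = Tables.subset(df, 2:3, viewhint=false)  # an independent DataFrame (copy)

# Writing through the view changes the source data frame...
v.a[1] = 100
# ...while writing to the copy leaves the source untouched.
c.a[1] = -1

source_after = copy(df.a)   # [1, 100, 3, 4, 5]
copy_after = copy(c.a)      # [-1, 3]
```

Since viewhint is only a hint, generic code that mutates the result should not rely on either behavior without checking the concrete table type.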

Conclusions

To summarize: if you want to write a generic function that subsets a Tables.jl table, you can use:

  • the DataAPI.nrow function to learn how many rows it has;
  • the Tables.subset function to get a subset of its rows using 1-based indexing.
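
Coming back to the train-validation-test example mentioned earlier, a generic split could be sketched like this. The function name split_table, its keyword arguments, and the default fractions are assumptions made for illustration; they are not part of any package.

```julia
using DataAPI
using Tables
using Random

# Hypothetical helper: randomly split a table into train/validation/test parts.
# Assumes `table` supports DataAPI.nrow and Tables.subset.
function split_table(table; fractions=(0.6, 0.2), rng=Random.default_rng())
    n = DataAPI.nrow(table)
    idx = randperm(rng, n)                  # random permutation of 1-based row indices
    n_train = floor(Int, fractions[1] * n)
    n_valid = floor(Int, fractions[2] * n)
    train = Tables.subset(table, idx[1:n_train])
    valid = Tables.subset(table, idx[n_train+1:n_train+n_valid])
    test = Tables.subset(table, idx[n_train+n_valid+1:end])  # all remaining rows
    return (; train, valid, test)
end

table = (a=1:10, b=11:20)
parts = split_table(table; rng=MersenneTwister(1))
```

Here the test part simply receives whatever rows remain after the train and validation parts are taken, so no row is lost to rounding.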

I hope these examples are useful for your work.

Does DataFrames.jl copy or not copy, that is the question

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/15/copying.html

Introduction

Some time ago I wrote a post about my thoughts on copying of data when working with it in Julia.

Today I want to focus on a related but narrower topic that is specific to DataFrames.jl.
People starting to work with this package are sometimes confused about when columns
get copied and when they are not. I want to discuss the most common cases in this post.

Spoiler! The post is a bit long. If you want simple advice, you can skip to the section with conclusions.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

Getting a column from a data frame

Let us start with a simpler case. When does copying happen if we get a column from a data frame?

First we set up some initial data:

julia> using DataFrames

julia> df = DataFrame(a=1:10^6)
1000000×1 DataFrame
     Row │ a
         │ Int64
─────────┼─────────
       1 │       1
       2 │       2
       3 │       3
       4 │       4
       5 │       5
    ⋮    │    ⋮
  999997 │  999997
  999998 │  999998
  999999 │  999999
 1000000 │ 1000000
999991 rows omitted

There are three ways to get the :a column from this data frame: df.a, df[:, :a] and df[!, :a].
Let us check them one by one. Start with df.a:

julia> df.a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df.a
0

df.a extracts the column without copying data. You can see it by the fact that there are no allocations performed in this operation.

Now check df[:, :a], which uses a standard row index : that is also used in arrays:

julia> df[:, :a]
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df[:, :a]
8000048

df[:, :a] copies data: we see a lot of memory allocated this time. This is identical to how : works for arrays.

Finally check df[!, :a], which uses a non-standard ! row index:

julia> df[!, :a]
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df[!, :a]
0

We can see that df[!, :a] does not allocate. It is equivalent to df.a, just with a slightly different syntax
(the indexing syntax with ! is handy if we want to select multiple columns from a data frame, which is not possible with the df.a syntax).
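
Instead of measuring allocations, the same three cases can be checked with ===, which is true only when both sides are the very same object. The following is a small sketch on a tiny data frame; the variable names are chosen only for illustration.

```julia
using DataFrames

df = DataFrame(a=1:5)

# df.a and df[!, :a] both return the vector stored in the data frame.
same_object = df.a === df[!, :a]

# df[:, :a] returns a freshly allocated copy with equal contents.
is_copy = df[:, :a] !== df.a
equal_contents = df[:, :a] == df.a

# With `!` several columns can be selected at once; the columns of the
# resulting data frame are not copied.
sub = df[!, [:a]]
shares_column = sub.a === df.a
```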

This part was relatively easy. Now let us turn to a harder case of setting a column of a data frame.

Case 1: setting a column in a data frame using assignment

First store the :a column in a temporary variable a (without copying it):

julia> a = df.a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

Now let us check the various options for creating a column that will store a.
Begin with creating a new column.

julia> df.b = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.b === a
true

We can see that if we put df.b on the left hand side the operation does not copy the passed data.
You can probably already guess that the same happens with df[!, :c] on the left hand side. Indeed,
this is the case:

julia> df[!, :c] = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.c === a
true

What about df[:, :d]? Let us see:

julia> df[:, :d] = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.d === a
false

So we see the first difference. When creating a new column with df[:, :d] the data was copied.
But what would happen if the column already existed in the data frame?

Well, for the df.b and df[!, :c] syntaxes nothing would change, as they just put
the right hand side vector into the data frame without copying it.
But for df[:, :d] the situation is different. Let us check:

julia> d = df.d;

julia> df[:, :d] = a;

julia> df.d === a
false

julia> df.d === d
true

We can see that if we use the df[:, :d] syntax on the left hand side the operation is in-place,
that is, the vector already present in df is reused and the data is written into the column
already present in the data frame. This means that we cannot use df[:, :d] = ... to change
the element type of column :d. Let us see:

julia> df[:, :d] = a .+ 0.5;
ERROR: InexactError: Int64(1.5)

Indeed, a .+ 0.5 contains floating point values, and the :d column allows only integers.
Note that with df.b = ... or df[!, :c] = ... we would not have this issue as they
replace columns with what is passed on the right hand side:

julia> df.b = a .+ 0.5
1000000-element Vector{Float64}:
      1.5
      2.5
      3.5
      4.5
      5.5
      6.5
      7.5
      ⋮
 999995.5
 999996.5
 999997.5
 999998.5
 999999.5
      1.0000005e6

There is one more twist to this story. It is related to ranges.
The issue is that a DataFrame object always materializes ranges
stored in it.
Therefore the following operation allocates data:

julia> df.b = 1:10^6
1:1000000

julia> df.b
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

The issue is that generally df.b = ... does not allocate, but since we disallow storing
ranges as columns of a data frame (in our case the 1:10^6 range) the allocation still takes place.
You would have the same behavior with df[!, :c] = 1:10^6.
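
The difference between assigning a vector and assigning a range can be sketched on a small data frame (the variable names v and r are used only for illustration):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
v = [10, 20, 30]
r = 1:3

df.b = v   # a Vector is stored as-is, without copying
df.c = r   # a range is materialized into a newly allocated Vector

vector_is_aliased = df.b === v
range_column_type = typeof(df.c)
```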

Case 2: setting a column in a data frame using broadcasted assignment

Julia is famous for its powerful broadcasting capabilities. Let us thus investigate what happens when we
replace = with .= in our experiments. We will reproduce all the examples we gave above from scratch.

Start with df.b .= a:

julia> df = DataFrame(a=1:10^6);

julia> a = df.a;

julia> df.b .= a;

julia> df.b === a
false

We now see a difference. The :b column is freshly allocated.

Let us check the two other options for creating a new column:

julia> df[!, :c] .= a;

julia> df.c === a
false

julia> df[:, :d] .= a;

julia> df.d === a
false

They have the same effect: a new column gets allocated.

In the case of an existing column, df.b .= ... and df[!, :c] .= ...
would again allocate a new column:

julia> df.b .= a .+ 0.5
1000000-element Vector{Float64}:
      1.5
      2.5
      3.5
      4.5
      5.5
      6.5
      7.5
      ⋮
 999995.5
 999996.5
 999997.5
 999998.5
 999999.5
      1.0000005e6

The difference is with df[:, :d] .= ...:

julia> d = df.d;

julia> df[:, :d] .= a;

julia> df.d === a
false

julia> df.d === d
true

julia> df[:, :d] .= a .+ 0.5
ERROR: InexactError: Int64(1.5)

So we see that here we have an in-place operation, just like with df[:, :d] = ....
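
All the cases discussed so far can be condensed into one short sketch. It assumes the package versions used in this post (creating a new column with df.e .= ... requires a sufficiently recent Julia version), and the column names are arbitrary.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
a = df.a

# Creating new columns with plain assignment.
df.b = a         # stores `a` itself
df[!, :c] = a    # stores `a` itself
df[:, :d] = a    # stores a copy of `a`

# Creating new columns with broadcasted assignment: always a fresh vector.
df.e .= a
df[!, :f] .= a
df[:, :g] .= a

# Which of the new columns are the very same object as `a`?
aliased = [df[!, n] === a for n in [:b, :c, :d, :e, :f, :g]]

# Writing to an existing column with `:` is in-place: the stored vector is reused.
d = df.d
df[:, :d] = [7, 8, 9]
reused = df.d === d

# In-place writes cannot change the element type of the column.
failed = try
    df[:, :d] = a .+ 0.5
    false
catch err
    err isa InexactError
end
```

Only df.b = ... and df[!, :c] = ... store the right hand side vector itself; every other form either allocates a new column or writes into the existing one.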

Conclusions

As a summary let me discuss a common anti-pattern:

df.a = df.b

Given the examples I presented we know that after this operation the :a and :b columns
of the df data frame are aliased, i.e. df.a === df.b produces true. Usually this is not
a desired situation as many operations assume that columns of a data frame do not share memory.

Fortunately, we also already learnt an easy fix to the aliasing problem. You can just write:

df.a .= df.b

to get a copy of :b stored in column :a.
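
To see why aliasing matters, consider this small sketch. It assumes the DataFrames.jl version used in this post, where df.a .= ... allocates a new column even when :a already exists.

```julia
using DataFrames

df = DataFrame(b=[1, 2, 3])

# Anti-pattern: both columns now refer to the same vector.
df.a = df.b
aliased_before = df.a === df.b
df.a[1] = 100                 # this also changes df.b[1]
b_changed = df.b[1] == 100

# Fix: broadcasted assignment stores a freshly allocated column in :a.
df.a .= df.b
aliased_after = df.a === df.b
df.a[2] = -1                  # df.b is left untouched this time
b_untouched = df.b[2] == 2
```

Writing df.a = copy(df.b) is another way to achieve the same result.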

I hope the examples I gave in my post today will be useful for your work with DataFrames.jl.