Expand your DataFrames.jl toolbox: the flatten function

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/20/flatten.html

Introduction

Recently I have commented on an interesting question on StackOveflow.

The problem was stated as follows. Given this input table:

2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

produce the following output table:

5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

As you can see the task is to repeat each row of the source data frame as many
times as column :Duration tells you but additionally increment the
Start_Time column by one minute in each consecutive row.

This question caught my attention, because it referenced to a similar
question
using Pandas. However, I found it quite hard to immediately
understand what is going on in that code, while in DataFrames.jl the solution
seemed to be relatively simple.

This post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

The solution using flatten

We start with creating the source data frame:

julia> using DataFrames, Dates

julia> df = DataFrame(Name=["John", "Joseph"],
                      Channel=["A", "B"],
                      Duration=[2,3],
                      Start_Time=Time.(["16:00:00", "15:05:00"]))
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

Now in order to solve the task one needs to remember that data frame can store
columns having any element type. Therefore a first natural step is to transform
the :Start_Time column from a vector holding only a starting time to a vector
holding a range of times as defined by :Duration and :Start_Time columns.

This is easy to achieve using the transform function:

julia> df2 = transform(df, [:Start_Time, :Duration] =>
                           ByRow((x,y) -> x .+ Minute.(0:y-1)) =>
                           :Start_Time)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

alternatively one could create the df2 data frame e.g. like this:

julia> df2 = copy(df)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

julia> df2.Start_Time = [x .+ Minute.(0:y-1) for
                         (x, y) in zip(df2.Start_Time, df2.Duration)]
2-element Vector{Vector{Time}}:
 [Time(16), Time(16, 1)]
 [Time(15, 5), Time(15, 6), Time(15, 7)]

julia> df2
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

a small benefit of transform is that is is easier to put this operation in a
chain of transformations as it takes and returns a data frame.

Once you have a df2 data frame then you need to flatten the :Start_Time
column into multiple rows. This is easily done using the flatten function like
this:

julia> flatten(df2, :Start_Time)
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

and you are done!

For sure I know DataFrames.jl much better than Pandas. However, what I feel
(and I am for sure biased here) is that it is much easier to reason about what
DataFrames.jl code does.

The solution using iteration

Another approach that could be used to handle this task would be to construct
the resulting data frame incrementally. In this case it is a bit more complex
than the flatten solution, but it is very often quite convenient so I thought
to show it. Here is the code:

julia> df3 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df)
           chunk = repeat(DataFrame(row), row.Duration)
           chunk.Start_Time .+= Minute.(0:row.Duration-1)
           append!(df3, chunk)
       end

julia> df3
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

or

julia> df4 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df), i in 0:row.Duration-1
           push!(df4, row)
           df4.Start_Time[end] += Minute(i)
       end

julia> df4
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

The point of these examples is that append! and push! are quite fast in
DataFrames.jl and I find them easy to reason about.

Conclusion

I hope that you found these examples useful. In particular functions like
flatten are easy to forget about while they often are very handy, especially in
combination with the fact that data frame can store objects of any type in its
columns.

In particular, you can store a vector of vectors or a vector of structs as a
data frame column. This is a type of storage that users of such data bases as
BigQuery or Snowflake tend to like. An especially notable feature
of this functionality is that such data frames can be easily written to and read
back from a file using e.g. Arrow.jl.

If you would like to check out another example of using a vector of vectors as
a column of a data frame you can have a look at notebook 5 of the
JuliaAcademy DataFrames.jl tutorial.