Author Archives: Blog by Bogumił Kamiński

Poor man’s guide to despecialization

Re-posted from: https://bkamins.github.io/julialang/2021/04/02/despecialization.html

Introduction

We are just about to release DataFrames.jl 1.0. It will bring many new
features we are excited about. One of the major additions is support for
multithreading in many core operations. These changes promise performance
improvements when processing large data frames. The cost is that code complexity
has also grown significantly.

Clearly code-base size is an issue for maintainers of the package, but should
end-users care? Unfortunately, yes: more complex code takes longer to
compile.

In this post I summarize the experience we gained while trying to reduce
compile-time latency.

TLDR: if you have functions that are expensive to compile but are not
performance critical, perform argument standardization.

In this post I am using Julia 1.6, MethodAnalysis v0.4.4, SnoopCompile v2.6.0,
and DataFrames.jl main branch state at this commit (in this case it
is relevant as changes in the code base will likely affect the results).

Before we start our experiments, disable precompilation (to have a clean ground
for comparisons and to simplify things; the conclusions I present also hold when
proper precompilation statements are added). To do so, comment out lines 152 and
153 in the src/DataFrames.jl file:

#include("other/precompile.jl")
#precompile()

The context

In DataFrames.jl, when one performs a transformation of data, the following two
areas cause challenges related to run-time compilation:

users can pass arbitrary functions as transformations;
the output of these functions can be arbitrary values (and the logic of
processing them depends on their type).

Let us have a look at a simple example:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.440215 seconds (8.09 M allocations: 469.718 MiB, 8.16% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.083986 seconds (550.13 k allocations: 33.481 MiB, 99.73% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.364469 seconds (996.17 k allocations: 61.213 MiB, 4.25% gc time, 99.74% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.337301 seconds (983.31 k allocations: 60.455 MiB, 1.91% gc time, 99.72% compilation time)

We can see that in each call we pass a new anonymous function to combine.
Additionally, the return value of this function is a NamedTuple that has a
fresh type each time, as the name of the column changes.

As you can see the compilation time in these examples is quite high.
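
The claim about fresh types can be checked without DataFrames.jl. The following
is a minimal sketch using only Base Julia; the variable names are made up for
illustration:

```julia
# Two anonymous functions with identical bodies still have distinct types.
f1 = x -> (a=1,)
f2 = x -> (a=1,)
same_function_type = typeof(f1) === typeof(f2)          # false

# NamedTuples with different field names also have distinct types.
nt_a = (a=1,)
nt_b = (b=1,)
same_namedtuple_type = typeof(nt_a) === typeof(nt_b)    # false

# Both anonymous functions are still subtypes of Function.
both_are_functions = f1 isa Function && f2 isa Function # true
```

Since Julia specializes compiled code on concrete argument types, every such
new type can trigger new compilation.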

Let us check how to reduce it. We will concentrate on only one method
_combine_multicol that is defined in line 7 of
src/groupeddataframe/complextransforms.jl. Its signature is:

_combine_multicol(firstres, fun::Base.Callable, gd::GroupedDataFrame,
                  incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})

We start by checking how many method instances were generated for it in our code:

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
9-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#1#2", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#3#4", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::var"#5#6", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::var"#7#8", ::GroupedDataFrame{DataFrame}, ::Nothing)

As you can see each call like combine(gdf, x -> (c=1,)) generates two new method instances.

Before we move forward let us check how this code runs in the 0.22 release
(precompilation is turned on here so we are not comparing apples to apples,
but if we disabled precompilation the conclusion would be similar):

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  1.412728 seconds (2.69 M allocations: 167.339 MiB, 3.50% gc time, 45.87% compilation time)

julia> @time combine(gdf, x -> (a=1,));
  0.041718 seconds (190.24 k allocations: 11.522 MiB, 20.74% gc time, 99.03% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.209238 seconds (481.99 k allocations: 30.126 MiB, 99.58% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.194280 seconds (468.45 k allocations: 29.367 MiB, 5.33% gc time, 99.53% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
13-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::DataFrame, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})
 MethodInstance for _combine_multicol(::DataFrame, ::Type, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::NamedTuple)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::NamedTuple)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Tuple)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple)
 MethodInstance for _combine_multicol(::NamedTuple, ::Type, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})
 MethodInstance for _combine_multicol(::NamedTuple, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})

So we can see that DataFrames.jl 0.22 produced even more instances, but since the
code was simpler the compilation time was lower.

Despecialization

The first advice one gets in such cases is to use @nospecialize on function
arguments to avoid excessive specialization. In our case the problematic
arguments are firstres and fun. Now exit Julia and change the signature of
the method to:

_combine_multicol(@nospecialize(firstres), @nospecialize(fun::Base.Callable),
                  gd::GroupedDataFrame,
                  incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})
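
As a standalone illustration of the syntax, here is a minimal sketch of
@nospecialize on a toy function. The function name field_count is hypothetical
and is not part of DataFrames.jl:

```julia
# Toy function: @nospecialize asks the compiler not to generate a separate
# specialization of this method for every concrete type of x.
function field_count(@nospecialize(x))
    return nfields(x)
end

n1 = field_count((a=1,))        # 1
n2 = field_count((b=1, c=2))    # 2
```

Note that @nospecialize is only a hint to the compiler, so it is worth
measuring its effect rather than assuming it.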

Now start your Julia again and run the code we have checked above:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.238896 seconds (8.10 M allocations: 470.113 MiB, 5.34% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.095581 seconds (550.16 k allocations: 33.482 MiB, 9.94% gc time, 99.75% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.304849 seconds (982.79 k allocations: 60.359 MiB, 2.95% gc time, 99.71% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.297039 seconds (969.93 k allocations: 59.620 MiB, 6.12% gc time, 99.72% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
7-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#1#2", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#3#4", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::var"#5#6", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::var"#7#8", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Union{Function, Type}, ::GroupedDataFrame{DataFrame}, ::Nothing)

As you can see things got slightly better. GC time is distorting the comparison
a bit, but correcting for it we see an improvement. Also we have reduced the
number of method instances from 9 to 7. But wait: we wanted to disable
specialization, and we still get 7 method instances to be compiled. How can we
reduce this number?

Poor man’s despecialization

The approach I found to work reliably is to aggressively perform
argument standardization. A standard trick to be sure that we break
method specialization for some value is to wrap it in Ref{Any}, and then
immediately unwrap it.
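
To see why this works, here is a minimal sketch in plain Julia. The function
describe is a hypothetical example and not DataFrames.jl code. Whatever value
is wrapped, the wrapper itself always has the same concrete type, so the method
only ever sees one combination of argument types:

```julia
# The argument is destructured in the signature, which unwraps the Ref.
# Inside the body the compiler only knows that x is of type Any.
describe((x,)::Ref{Any}) = string(typeof(x))

r1 = Ref{Any}(1)
r2 = Ref{Any}((a=1,))

# Both wrappers have the same concrete type regardless of their contents.
wrapper_types_equal = typeof(r1) === typeof(r2)   # true

d1 = describe(r1)   # "Int64" on a 64-bit machine
d2 = describe(r2)
```

The price is that the code working on the unwrapped value uses dynamic
dispatch, which is why this fits functions that are not performance critical.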

Let us try it. Now we change our method signature to:

_combine_multicol((firstres,)::Ref{Any}, (fun,)::Ref{Any},
                  gd::GroupedDataFrame,
                  incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})

We also need to change the two lines of code where _combine_multicol is called,
namely line 268:

idx, outcols, nms = _combine_multicol(Ref{Any}(firstres), Ref{Any}(cs_i), gd, nothing)

and 395:

idx, outcols, nms = _combine_multicol(Ref{Any}(firstres), Ref{Any}(fun), gd, incols)

in file src/groupeddataframe/splitapplycombine.jl.

Apply the changes and run our code in a fresh Julia session again:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.262996 seconds (7.70 M allocations: 445.018 MiB, 8.53% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.047010 seconds (299.33 k allocations: 18.038 MiB, 99.52% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.278649 seconds (733.47 k allocations: 45.020 MiB, 2.44% gc time, 99.62% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.245550 seconds (720.62 k allocations: 44.284 MiB, 2.73% gc time, 99.61% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
1-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::Base.RefValue{Any}, ::Base.RefValue{Any}, ::GroupedDataFrame{DataFrame}, ::Nothing)

Now that is much better. We compiled only one method instance for
_combine_multicol and the timings have significantly improved.

Conclusions

Let us now check what timings the above code has after applying such techniques
to all relevant functions in source code, as proposed in this PR.
I still keep precompilation disabled so the comparison is again a bit unfair
in reference to 0.22 release:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.478645 seconds (8.43 M allocations: 489.011 MiB, 9.96% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.005444 seconds (7.43 k allocations: 486.211 KiB, 96.44% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.199044 seconds (371.68 k allocations: 22.773 MiB, 99.39% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.173116 seconds (358.82 k allocations: 22.033 MiB, 4.05% gc time, 99.50% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
1-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::Base.RefValue{Any}, ::Base.RefValue{Any}, ::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any})

And now we are below the 0.22 timings. Also note that the whole compilation cost
is now related to the generation of methods for the new type of the return
value (the NamedTuple issue). Let us check:

julia> using SnoopCompile

julia> t = @snoopi_deep combine(gdf, x -> (d=1,));

julia> sort(SnoopCompile.flatten(t), by = x -> x.exclusive_time)
271-element Vector{SnoopCompileCore.InferenceTiming}:
 InferenceTiming: 0.000019/0.000019 on InferenceFrameInfo for convert(::Type{Int64}, 1::Int64)
 InferenceTiming: 0.000019/0.000019 on InferenceFrameInfo for convert(::Type{Int64}, 0::Int64)
 InferenceTiming: 0.000020/0.000020 on InferenceFrameInfo for convert(::Type{Int64}, 2::Int64)
 ⋮
 InferenceTiming: 0.007164/0.022407 on InferenceFrameInfo for DataFrames._combine_rows_with_first!(::NamedTuple{(:d,), Tuple{Int64}}, ::Tuple{Vector{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, nothing::Nothing, ::Tuple{Symbol}, Val{true}()::Val{true})
 InferenceTiming: 0.096098/0.096305 on InferenceFrameInfo for map(::Type{Tuple}, ::Tuple{NamedTuple{(:d,), Tuple{Int64}}})
 InferenceTiming: 0.136345/0.279824 on InferenceFrameInfo for Core.Compiler.Timings.ROOT()

and we see that the biggest cost is incurred by the map function applied to a
NamedTuple, which is a function defined in Julia Base.

In summary the approaches I discuss here are most useful when you expect to get
values of very heterogeneous types as arguments to your functions. In DataFrames.jl
the two most common cases of these situations are:

anonymous transformation functions;
NamedTuples as produced values from transformations.

If you have any comments on the best strategies to avoid code
specialization please contact me, as reducing compilation latency is one
of the priorities of the DataFrames.jl 1.0 release.

Before I finish let me add one thing. If you are interested in true
performance-guided tips to reduce compilation latency (not just the poor man’s
ones I have given in this post) I highly recommend that you check out this,
this, and this post.

Construction vs conversion in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/26/conversion.html

Introduction

In the recent 0.22.6 release of the DataFrames.jl package we have
deprecated a bunch of convert methods for objects defined in it.

In this post I want to comment on when convert is used and when it is
appropriate to define a convert method for custom types.

The history

DataFrames.jl is an old package, with its 0.0.0 release made on Feb 14, 2013.

During this time the Julia language has evolved. As late as Julia 0.6 the
convert method was a default fallback for construction, as you can see here:

However, in some cases you could consider adding methods to Base.convert
instead of defining a constructor, because Julia falls back to calling
convert() if no matching constructor is found. For example, if no constructor
T(args...) = ... exists Base.convert(::Type{T}, args...) = ... is called.

This is a relic of the pre-1.0 Julia design (if you are interested in some more
history you can start with this issue). Things have changed now,
but until the DataFrames.jl 0.22.6 release we had some convert methods that were
inspired by this rule, and we even had constructors that fell back to convert
explicitly.

So what is the current rule for using convert?

The present

Fortunately, as of Julia 1.6 the manual is quite clear that
convert should be only defined in cases when it is safe to perform a
conversion even when the user does not ask for it explicitly. Before we move
forward let us see it in action:

julia> struct Example
       a::Int
       b::Complex{Int}
       end

julia> Example(1.0, 1.0)
Example(1, 1 + 0im)

You can see that 1.0 (of type Float64) got implicitly converted to 1 (of
type Int). Similarly 1.0 got converted to 1 + 0im (Complex{Int}).
These conversions are performed implicitly to ensure programmer convenience.

However, the following fails:

julia> example(a::Int, b::Complex{Int}) = (a, b)
example (generic function with 1 method)

julia> example(1.0, 1.0)
ERROR: MethodError: no method matching example(::Float64, ::Float64)

So when does the implicit conversion happen? Again the manual explains it here:

Assigning to an array converts to the array’s element type.

Assigning to a field of an object converts to the declared type of the field.

Constructing an object with new converts to the object’s declared field types.

Assigning to a variable with a declared type (e.g. local x::T) converts to that type.

A function with a declared return type converts its return value to that type.

Passing a value to ccall converts it to the corresponding argument type.

So a short, simplifying rule is that the conversion happens when you assign
a value to a location that has a pre-specified type.
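
Some of these cases are easy to verify directly. The sketch below uses a
hypothetical mutable struct Box and a hypothetical function four; neither comes
from the manual:

```julia
# Assigning to an array converts to the array's element type.
v = Int[0]
v[1] = 2.0            # stored as 2

# Assigning to a field converts to the declared type of the field.
mutable struct Box
    x::Int
end
b = Box(1)
b.x = 3.0             # stored as 3

# A declared return type converts the returned value.
four()::Int = 4.0     # four() returns 4, an Int
```

A value that cannot be represented exactly, such as 2.5, would throw an
InexactError in each of these cases instead of being rounded.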

Let us see it at work:

julia> d1 = Set{Int}()
Set{Int64}()

julia> push!(d1, 1.0)
Set{Int64} with 1 element:
  1

julia> d2 = BitSet()
BitSet([])

julia> push!(d2, 1.0)
ERROR: MethodError: no method matching push!(::BitSet, ::Float64)

You can push 1.0 to Set{Int}, because its push! method makes no
restriction on the type of value that is pushed to the Set{Int}. Then,
internally, Set{Int} converts the passed value to Int before storing it. You
can make sure that this happens by writing:

julia> push!(d1, "a")
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...
Stacktrace:
 [1] setindex!(h::Dict{Int64, Nothing}, v0::Nothing, key0::String)
   @ Base ./dict.jl:374
 [2] push!(s::Set{Int64}, x::String)
   @ Base ./set.jl:57
 [3] top-level scope
   @ REPL[32]:1

and if you check the setindex! implementation you will see the conversion:

function setindex!(h::Dict{K,V}, v0, key0) where V where K
    key = convert(K, key0)
    if !isequal(key, key0)
        throw(ArgumentError("$(limitrepr(key0)) is not a valid key for type $K"))
    end
    setindex!(h, v0, key)
end

So why are you not allowed to push! a float to a BitSet? The reason is that
its push! method is defined more restrictively, like this:

@inline push!(s::BitSet, n::Integer) = _setint!(s, _check_bitset_bounds(n), true)

and since no conversion happens when passing parameters to functions an error is
thrown.
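
The same contrast can be reproduced with a toy container. In the sketch below
IntBag, add_loose!, and add_strict! are hypothetical names chosen for
illustration:

```julia
struct IntBag
    items::Vector{Int}
end

# No restriction on x: the conversion to Int happens inside push!
# when the value is stored in the Vector{Int}.
add_loose!(b::IntBag, x) = (push!(b.items, x); b)

# Restricted signature: a Float64 does not match, and no conversion
# is attempted when dispatching.
add_strict!(b::IntBag, x::Integer) = (push!(b.items, x); b)

bag = IntBag(Int[])
add_loose!(bag, 1.0)                  # works, stores 1

strict_error = try
    add_strict!(bag, 1.0)
    nothing
catch err
    err                               # a MethodError
end
```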

Conclusion

To wrap up let us go back to the manual here:

since convert can be called implicitly, its methods are restricted to cases
that are considered “safe” or “unsurprising”. convert will only convert between
types that represent the same basic kind of thing (e.g. different
representations of numbers, or different string encodings). It is also usually
lossless; converting a value to a different type and back again should result in
the exact same value.

In essence this means that, unless you are developing a low level infrastructure
package (for example, one defining a new numeric type), you are not likely to
need to define convert methods at all. Just define proper constructors for your
types.
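
As a rough sketch of this advice, consider two hypothetical unit types, Feet
and Meters, invented here for illustration. Defining a constructor makes the
conversion available on request, while leaving convert undefined keeps it from
happening implicitly:

```julia
struct Feet
    value::Float64
end

struct Meters
    value::Float64
end

# An explicit constructor: the user has to ask for the conversion.
Meters(f::Feet) = Meters(f.value * 0.3048)

m = Meters(Feet(10.0))

# No convert method is defined, so implicit conversion is rejected.
lengths = Meters[]
implicit_error = try
    push!(lengths, Feet(1.0))
    nothing
catch err
    err                               # a MethodError raised by convert
end
```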

In DataFrames.jl we left only three conversions.

The first is from SubDataFrame to DataFrame. It is clearly a lossless
conversion that sometimes might be useful (e.g. you have a container of
DataFrame objects and want to be able to implicitly add a SubDataFrame to it).

The second and third conversions are from DataFrameRow and GroupKey to
NamedTuple. The reason behind allowing them is that we try to make both these
types work like NamedTuple (except for the fact that they are views).
Again in this case the conversion is lossless.

As a final note, I think the Julia 1.6 manual is really a great resource
(halfway through writing this post I considered stopping, as the manual
is so clear about the rules).

Expand your DataFrames.jl toolbox: the flatten function

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/20/flatten.html

Introduction

Recently I have commented on an interesting question on Stack Overflow.

The problem was stated as follows. Given this input table:

2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

produce the following output table:

5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

As you can see the task is to repeat each row of the source data frame as many
times as column :Duration tells you but additionally increment the
Start_Time column by one minute in each consecutive row.

This question caught my attention, because it referenced a similar
question using Pandas. However, I found it quite hard to immediately
understand what is going on in that code, while in DataFrames.jl the solution
seemed to be relatively simple.

This post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

The solution using `flatten`

We start with creating the source data frame:

julia> using DataFrames, Dates

julia> df = DataFrame(Name=["John", "Joseph"],
                      Channel=["A", "B"],
                      Duration=[2,3],
                      Start_Time=Time.(["16:00:00", "15:05:00"]))
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

Now in order to solve the task one needs to remember that a data frame can store
columns having any element type. Therefore a natural first step is to transform
the :Start_Time column from a vector holding only a starting time to a vector
holding a range of times as defined by the :Duration and :Start_Time columns.

This is easy to achieve using the transform function:

julia> df2 = transform(df, [:Start_Time, :Duration] =>
                           ByRow((x,y) -> x .+ Minute.(0:y-1)) =>
                           :Start_Time)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

alternatively one could create the df2 data frame e.g. like this:

julia> df2 = copy(df)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

julia> df2.Start_Time = [x .+ Minute.(0:y-1) for
                         (x, y) in zip(df2.Start_Time, df2.Duration)]
2-element Vector{Vector{Time}}:
 [Time(16), Time(16, 1)]
 [Time(15, 5), Time(15, 6), Time(15, 7)]

julia> df2
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

A small benefit of transform is that it is easier to put this operation in a
chain of transformations as it takes and returns a data frame.

Once you have a df2 data frame then you need to flatten the :Start_Time
column into multiple rows. This is easily done using the flatten function like
this:

julia> flatten(df2, :Start_Time)
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

and you are done!

For sure I know DataFrames.jl much better than Pandas. However, what I feel
(and I am for sure biased here) is that it is much easier to reason about what
DataFrames.jl code does.

The solution using iteration

Another approach that could be used to handle this task would be to construct
the resulting data frame incrementally. In this case it is a bit more complex
than the flatten solution, but it is very often quite convenient so I thought
to show it. Here is the code:

julia> df3 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df)
           chunk = repeat(DataFrame(row), row.Duration)
           chunk.Start_Time .+= Minute.(0:row.Duration-1)
           append!(df3, chunk)
       end

julia> df3
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

julia> df4 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df), i in 0:row.Duration-1
           push!(df4, row)
           df4.Start_Time[end] += Minute(i)
       end

julia> df4
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

The point of these examples is that append! and push! are quite fast in
DataFrames.jl and I find them easy to reason about.
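
To make explicit what all of these solutions compute, here is a rough sketch of
the same row expansion using only Base Julia and the Dates standard library,
with a vector of NamedTuples standing in for the data frame. It is an
illustration of the logic only and says nothing about how DataFrames.jl
implements flatten:

```julia
using Dates

rows = [(Name="John", Channel="A", Duration=2, Start_Time=Time(16, 0)),
        (Name="Joseph", Channel="B", Duration=3, Start_Time=Time(15, 5))]

# Repeat each row Duration times, shifting Start_Time by i minutes.
expanded = [merge(row, (Start_Time = row.Start_Time + Minute(i),))
            for row in rows for i in 0:row.Duration-1]

n_rows = length(expanded)                   # 5
last_start = expanded[end].Start_Time       # 15:07:00
```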

Conclusion

I hope that you found these examples useful. In particular functions like
flatten are easy to forget about while they often are very handy, especially in
combination with the fact that a data frame can store objects of any type in its
columns.

In particular, you can store a vector of vectors or a vector of structs as a
data frame column. This is a type of storage that users of such databases as
BigQuery or Snowflake tend to like. An especially notable feature
of this functionality is that such data frames can be easily written to and read
back from a file using e.g. Arrow.jl.

If you would like to check out another example of using a vector of vectors as
a column of a data frame you can have a look at notebook 5 of the
JuliaAcademy DataFrames.jl tutorial.

juliabloggers.com

A Julia Language Blog Aggregator
