Author Archives: Blog by Bogumił Kamiński

Fixed width strings in CSV.jl

Re-posted from: https://bkamins.github.io/julialang/2021/09/10/stringx.html

Introduction

In my recent post I described how using inline strings can help to increase
join performance. Since the 0.9 release of CSV.jl these strings are used by
default when reading CSV files, so I want to focus solely on this topic today.

The post was written under Julia 1.6.1, DataFrames.jl 1.2.2,
WeakRefStrings.jl 1.3.0, BenchmarkTools.jl 1.1.4, and CSV.jl 0.9.1.

Introducing inline strings

WeakRefStrings.jl defines 8 inline string types:

julia> using WeakRefStrings

julia> st = subtypes(InlineString)
8-element Vector{Any}:
 String1
 String127
 String15
 String255
 String3
 String31
 String63
 String7

What is important is that all these types are bits types, as you can see here:

julia> using DataFrames

julia> sort(DataFrame(st=st,
                      isbits = isbitstype.(st),
                      sizeof = sizeof.(st)), :sizeof)
8×3 DataFrame
 Row │ st         isbits  sizeof
     │ Any        Bool    Int64
─────┼───────────────────────────
   1 │ String1      true       1
   2 │ String3      true       4
   3 │ String7      true       8
   4 │ String15     true      16
   5 │ String31     true      32
   6 │ String63     true      64
   7 │ String127    true     128
   8 │ String255    true     256
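
For contrast, here is a minimal check in Base Julia (no packages assumed)
showing that the standard String is not a bits type, while fixed-size types
such as integers or tuples of bytes are. It is only a sketch to illustrate
what the isbits column above means.

```julia
# A bits type is immutable and holds no references to other objects,
# so it can be stored inline, e.g. directly inside an array.
string_is_bits = isbitstype(String)             # String data lives on the heap
int_is_bits = isbitstype(UInt64)                # plain 8-byte value
tuple_is_bits = isbitstype(NTuple{4, UInt8})    # four bytes stored inline

println((string_is_bits, int_is_bits, tuple_is_bits))
```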

The suffix of the name of the specific string type indicates the maximum size of
the string it can hold. So for example String3 can hold strings that have
maximum size 3 as returned by sizeof. Here is an example:

julia> String3("123")
"123"

julia> String3("1234")
ERROR: ArgumentError: string too large (4) to convert to String3

julia> String3("∀")
"∀"

julia> String3("∀1")
ERROR: ArgumentError: string too large (4) to convert to String3

As you can see here it is important to remember that some characters (like ∀)
in UTF-8 encoding take more than one code unit.
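
The difference between characters and code units can be checked in Base Julia
alone. The sketch below reuses the strings from the example above; the variable
names are only illustrative.

```julia
# Characters versus UTF-8 code units (bytes) for the strings used above.
s = "∀1"
n_chars = length(s)        # number of characters
n_bytes = sizeof(s)        # number of bytes, which is what inline strings limit
units = codeunits("∀")     # raw bytes of the single character ∀

println((n_chars, n_bytes, length(units)))
```

This is why "∀1" does not fit into String3 although it has only two characters.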

You can use the InlineString function to create an inline string of
automatically selected minimal size:

julia> typeof(InlineString("∀"))
String3

julia> typeof(InlineString("∀1"))
String7

Finally, as we can see the maximum size of inline string is 255, so:

julia> InlineString("a"^256)
ERROR: ArgumentError: string too large (256) to convert to InlineString

In summary, as you can see, the String[N] types are similar to the CHAR(N)
types that are provided by many databases. The limitation is that N
can take only several fixed values.

Why and when to use inline strings?

There are two benefits of inline strings.

The first is that they can be faster. This is especially visible in equality
comparisons, which are quite common in practice.

julia> using Random, BenchmarkTools

julia> Random.seed!(1234);

julia> s1 = [randstring(3) for i in 1:1000];

julia> s2 = String3.(s1);

julia> @benchmark $s1 .== permutedims($s1)
BenchmarkTools.Trial: 1064 samples with 1 evaluation.
 Range (min … max):  3.928 ms …   6.435 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.910 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.697 ms ± 404.737 μs  ┊ GC (mean ± σ):  0.06% ± 0.91%

  ▅▃                             ▆ ▃  ▃█▄▂
  ██▇▇▅▇▅▅▄▆▄▄▄▅▆▅▅▁▄▆▅▅▇▅▄▁▅█▄▄▁█▇█▇▇█████▇▅▄▅▅▄▁▄▁▄▄▅▄▁▁▅▅▄ █
  3.93 ms      Histogram: log(frequency) by time      5.49 ms <

 Memory estimate: 126.53 KiB, allocs estimate: 6.

julia> @benchmark $s2 .== permutedims($s2)
BenchmarkTools.Trial: 4879 samples with 1 evaluation.
 Range (min … max):  888.150 μs …  2.037 ms  ┊ GC (min … max): 0.00% … 47.53%
 Time  (median):       1.043 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.023 ms ± 77.610 μs  ┊ GC (mean ± σ):  0.23% ±  2.38%

  ▅     ▃ ▁        ▁    ▅▃   █▆                                ▁
  █▅▃▃█████▅▃▁█▆▅▅▁█▇▃▁▃██▅▁▄███▆▅▄▆▇▅▇▅▅▇▅▅▄▄▃▅▅▅▅▃▁▃▅▄▃▁▃▁▃▄ █
  888 μs        Histogram: log(frequency) by time      1.22 ms <

 Memory estimate: 126.53 KiB, allocs estimate: 6.

The second is that they are not heap allocated so they do not lead to
significant GC latency. This topic was presented in the post on join
performance.

So what are the potential shortcomings? There are several:

they have a limited capacity, so one cannot always rely on the conversion
to an inline string being possible;
they are not efficient when we work with strings of highly variable length;
mixing inline strings of different types in a collection can lead to type instabilities.

I have already discussed the first topic. Now, let us handle the second and
third one in consecutive sections.

Memory footprint of inline strings

Consider the following collection:

julia> x = [String127("a"^i) for i in 1:100]
100-element Vector{String127}:
 "a"
 "aa"
 "aaa"
 ⋮
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

Let us check how expensive it is to copy it:

julia> @benchmark copy($x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  677.100 ns … 131.135 μs  ┊ GC (min … max):  0.00% … 95.08%
 Time  (median):       1.224 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.464 μs ±   4.174 μs  ┊ GC (mean ± σ):  13.84% ±  4.88%

                    ▁▅███▇▆▄▂
  ▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▆██████████▇▅▅▅▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▂▂▂ ▃
  677 ns           Histogram: frequency by time         2.11 μs <

 Memory estimate: 12.62 KiB, allocs estimate: 1.

Now we convert it to a standard String:

julia> y = String.(x)
100-element Vector{String}:
 "a"
 "aa"
 "aaa"
 ⋮
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

julia> @benchmark copy($y)
BenchmarkTools.Trial: 10000 samples with 973 evaluations.
 Range (min … max):  64.867 ns …   1.592 μs  ┊ GC (min … max):  0.00% … 82.37%
 Time  (median):     80.387 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   98.944 ns ± 111.194 ns  ┊ GC (mean ± σ):  13.24% ± 10.98%

  ▆█▄▂                                                         ▁
  ██████▇▆▆▆▄▅▄▃▄▃▁▃▁▁▁▁▁▁▁▃▅▆▅▄▅▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▅▅▄▅▅▅▅▇▆▆▅▅ █
  64.9 ns       Histogram: log(frequency) by time       830 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

As you can see the operation on String is much faster. The reason is that x
stores 100 entries of 128 bytes each, while y stores 100 pointers to strings,
and each pointer takes only 8 bytes. See:

julia> sizeof(x)
12800

julia> sizeof(y)
800

So much less data is moved when only the pointers are copied.

The conclusion is that it is probably better to use String type if we expect
to work with collections of strings that have a highly variable length.
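
The cost of fixed-width storage can be estimated with a little arithmetic.
Below is a minimal sketch in Base Julia; slot_utilization is a hypothetical
helper written for this illustration (it is not part of WeakRefStrings.jl),
and the slot size of 128 bytes is taken from the sizeof column shown earlier
for String127.

```julia
# Hypothetical helper: fraction of fixed-width slot storage that actually
# holds string data.
function slot_utilization(strs, slot_bytes::Integer)
    used = sum(sizeof, strs)               # bytes of real string content
    reserved = slot_bytes * length(strs)   # bytes reserved by fixed-width slots
    return used / reserved
end

strs = ["a"^i for i in 1:100]
u = slot_utilization(strs, 128)   # 128 bytes per String127 slot

println(u)
```

For this collection well under half of the reserved bytes carry data, which is
the situation where a plain String is likely the better choice.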

Collections of inline strings

The second potential issue is working with collections of inline strings.

When you read in string data from a file CSV.jl will automatically create
columns of appropriate widths:

julia> using CSV

julia> str = """
       w1,w2,w3,w4
       a,a,a,a
       b,bb,bb,bb
       c,cc,ccc,ccc
       d,dd,ddd,dddd
       """
"w1,w2,w3,w4\na,a,a,a\nb,bb,bb,bb\nc,cc,ccc,ccc\nd,dd,ddd,dddd\n"

julia> df = CSV.read(IOBuffer(str), DataFrame)
4×4 DataFrame
 Row │ w1       w2       w3       w4
     │ String1  String3  String3  String7
─────┼────────────────────────────────────
   1 │ a        a        a        a
   2 │ b        bb       bb       bb
   3 │ c        cc       ccc      ccc
   4 │ d        dd       ddd      dddd

However, one has to be careful when converting to inline string manually. Have a
look:

julia> strs = ["a"^i for i in 1:8];

julia> strs2 = InlineString.(strs)
8-element Vector{InlineString}:
 "a"
 "aa"
 "aaa"
 "aaaa"
 "aaaaa"
 "aaaaaa"
 "aaaaaaa"
 "aaaaaaaa"

julia> typeof.(strs2)
8-element Vector{DataType}:
 String1
 String3
 String3
 String7
 String7
 String7
 String7
 String15

Sometimes this is indeed what one would want (the narrowest possible
representation at the cost of having a collection of abstract element type).
But currently, if you want automatic type promotion and a concrete element
type, you have to do it manually, e.g.:

julia> String15.(strs)
8-element Vector{String15}:
 "a"
 "aa"
 "aaa"
 "aaaa"
 "aaaaa"
 "aaaaaa"
 "aaaaaaa"
 "aaaaaaaa"

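The manual promotion can be wrapped in a small function. The sketch below
assumes WeakRefStrings.jl 1.3.x as used in this post (in later releases the
same types are provided by InlineStrings.jl); to_common_inline is a
hypothetical helper name, not a function of the package.

```julia
using WeakRefStrings  # assumption: 1.3.x; later these types live in InlineStrings.jl

# Hypothetical helper: convert all strings to the narrowest inline string
# type that fits the longest element, so the result has a concrete eltype.
function to_common_inline(strs::AbstractVector{<:AbstractString})
    longest = strs[argmax(sizeof.(strs))]   # element with the most bytes
    T = typeof(InlineString(longest))       # narrowest type that can hold it
    return T.(strs)                         # convert every element to T
end

v = to_common_inline(["a"^i for i in 1:8])
println(eltype(v))
```

The helper throws for an empty vector and for strings longer than 255 bytes,
so it is only a starting point.
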
Operations on inline strings

Additionally one should know that most operations transforming inline strings
will currently produce a String by default, e.g.:

julia> s = String3("a")
"a"

julia> typeof(s^2)
String

julia> typeof(uppercase(s))
String

which also is best kept in mind when working with them.
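
If the inline type should be kept after a transformation, the result can be
converted back explicitly. This is a minimal sketch, again assuming
WeakRefStrings.jl 1.3.x as used in this post.

```julia
using WeakRefStrings  # assumption: 1.3.x; later these types live in InlineStrings.jl

s = String3("a")
u = String3(uppercase(s))   # uppercase returns a String; convert it back

println(typeof(u))
```

The conversion fails if the transformed string no longer fits, e.g. String3(s^4).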

Conclusions

Inline strings are an excellent addition to Julia, however, one should know the
type of task they were designed to help with.

As you could see in my examples, inline strings are ideal for situations where
we have millions of relatively small strings that have a relatively homogeneous
size and are not mutated.

This use case might seem restrictive, but actually in practice it is quite
common as e.g. all sorts of customer or product identifiers have exactly this
nature.

Finally, some of the limitations I listed in this post might be lifted in the
future. If you need some functionality please do not hesitate to open an issue
in the WeakRefStrings.jl GitHub repository.

ABC of handling missing values in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/03/missing.html

Introduction

When working with real data one often encounters missing values. This is an
introductory level post aiming to explain the corner cases of working with such
data in the Julia language. It is intended to complement the section on
Missing Values of the Julia Manual. I highly recommend that everyone interested
in the subject read it, and therefore I will skip many topics that
are covered in detail there.

The post was written under Julia 1.6.1 and Missings.jl 1.0.1.

Introducing `missing`

Missing values are represented in Julia using missing that has type Missing.

As is explained in the section on Missing Values of the Julia Manual:

Julia provides support for representing missing values in the statistical
sense, that is for situations where no value is available for a variable in
an observation, but a valid value theoretically exists.

It is useful to contrast this contract with the intended use of the nothing
value, which has type Nothing and should be used when some value objectively
does not exist.

For example findfirst(==(1), 2:3) returns nothing value as there does not
exist an index in the 2:3 range for which the value is equal to 1. On the
other hand, if we have some empirical data collected, e.g. about patients in a
clinical trial, and for one of the patients we have not recorded the subject's
age, then it should be represented as missing (the patient objectively has some
age but we just do not know it).
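
The two cases can be put side by side in Base Julia. The ages vector below is
made-up data used only for illustration.

```julia
# nothing: the value objectively does not exist.
idx = findfirst(==(1), 2:3)     # no index of 2:3 holds the value 1

# missing: the value exists but was not recorded.
ages = [34, missing, 51]        # made-up data; the second age is unknown

println((isnothing(idx), ages[2] === missing))
```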

If we work with data it is convenient to check if some value is missing using
the ismissing function. For example here is a way to drop missing values
from a vector:

julia> x = [1, missing, 3, missing]
4-element Vector{Union{Missing, Int64}}:
 1
  missing
 3
  missing

julia> filter(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
 1
 3

In this example we have used the !ismissing expression which produces a
function opposite to ismissing, i.e. returning true if the value is not
missing.

Typical problems with missing values

Since missing values follow three-valued logic, the following fails:

julia> findall(==(1), [1, missing, 1, 2])
ERROR: TypeError: non-boolean (Missing) used in boolean context

The reason is that:

julia> 1 == missing
missing

and we can see that the comparison does not produce a valid Bool value.

There are two ways to work around this problem. The first one is to use the
isequal function:

julia> findall(isequal(1), [1, missing, 1, 2])
2-element Vector{Int64}:
 1
 3

The other is to use the coalesce function:

julia> findall(x -> coalesce(x == 1, false), [1, missing, 1, 2])
2-element Vector{Int64}:
 1
 3

It is important to remember that these are not equivalent approaches.
They can differ most notably when working with floating point numbers.
Here is an example:

julia> findall(isequal(NaN), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
 1

julia> findall(x -> coalesce(x == NaN, false), [NaN, missing, -0.0, 0.0, 1.0])
Int64[]

julia> findall(isequal(0.0), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
 4

julia> findall(x -> coalesce(x == 0.0, false), [NaN, missing, -0.0, 0.0, 1.0])
2-element Vector{Int64}:
 3
 4

Of course one should use the method that is appropriate in the application area.
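
The three-valued logic mentioned at the start of this section can be seen
directly on logical operators and comparisons. This is a minimal sketch in
Base Julia; the variable names are only illustrative.

```julia
# The result is missing only when the unknown operand could change it.
a = true & missing     # unknown: depends on the missing operand
b = false & missing    # false regardless of the missing operand
c = true | missing     # true regardless of the missing operand
d = false | missing    # unknown: depends on the missing operand

# == propagates missing, isequal always returns a Bool.
e = missing == missing
f = isequal(missing, missing)

println((a, b, c, d, e, f))
```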

Corner cases of `skipmissing`

Typically aggregation functions produce missing when they are passed a collection
holding missing values:

julia> sum([1, missing, 2])
missing

A workaround for this issue is to use the skipmissing wrapper, which is a lazy
iterator skipping missing values in the passed collection, so the following
works:

julia> sum(skipmissing([1, missing, 2]))
3

It is important to know a corner case of skipmissing when the collection after
skipping missing values is empty. In such a case skipmissing tries to strip
the Missing part from the eltype of the collection, and if it is specific
enough it can be used to produce a proper result of the aggregation. However,
if the type is not specific enough an error is raised, as you can see here:

julia> sum(skipmissing(Union{Int, Missing}[missing, missing, missing]))
0

julia> sum(skipmissing([missing, missing, missing]))
ERROR: ArgumentError: reducing over an empty collection is not allowed

julia> sum(skipmissing(Any[missing, missing, missing]))
ERROR: MethodError: no method matching zero(::Type{Any})

The conclusion is that one should try to use collections of Union{Missing, T},
where T is a concrete type.
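
The three outcomes above follow from the element type that skipmissing reports
for each collection, which can be inspected in Base Julia. The sketch below
also shows one possible guard for the empty case; sum_or is a hypothetical
helper name introduced for this illustration.

```julia
# Element type after stripping Missing from the eltype of the collection.
et_union = eltype(skipmissing(Union{Int, Missing}[missing]))   # specific enough
et_missing = eltype(skipmissing([missing]))                    # nothing is left
et_any = eltype(skipmissing(Any[missing]))                     # too general

# Hypothetical helper: fall back to a default when nothing is left to sum.
sum_or(itr, default) = isempty(itr) ? default : sum(itr)

total = sum_or(skipmissing([missing, missing]), 0)

println((et_union, et_missing, et_any, total))
```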

In the example above we were affected by one important design decision behind
the Missing type. It is a singleton type that is not parametric. The missing
value does not carry information about the type of the value that is missing;
it could be of any type. It is worth contrasting this with e.g. R, where we have
NA, NA_integer_, NA_real_, NA_character_, and NA_complex_ constants
that cover selected, most common R types, so you have the following (run under
R 4.1.1):

> sin(NA)
[1] NA
> sin(NA_character_)
Error in sin(NA_character_) :
  non-numeric argument to mathematical function
> sum(NA, na.rm=T)
[1] 0
> sum(NA_complex_, na.rm=T)
[1] 0+0i
> sum(NA_character_, na.rm=T)
Error in sum(NA_character_, na.rm = T) :
  invalid 'type' (character) of argument

The decision to avoid such differences was deliberate and was meant to simplify
the design of functions working with missing values (at the cost of not
carrying the type information, which has to be managed on user’s side).

You might ask how one can extract T from Union{Missing, T} type. It is
easy using the nonmissingtype function:

julia> nonmissingtype(Float64)
Float64

julia> nonmissingtype(Union{Missing, Float64})
Float64

julia> nonmissingtype(Any)
Any

julia> nonmissingtype(Union{Missing, Any})
Any

Functions not supporting missing values

As we have seen above, many functions produce missing when they are passed
missing as an argument. The rationale is as follows: the passed argument is
objectively present but just unknown, so the result of the operation should also
be present, but it is just unknown. Here are some examples of this behavior:

julia> 1 < missing
missing

julia> sin(missing)
missing

However, not all functions follow this rule. Take e.g. the Int constructor:

julia> Int(missing)
ERROR: MethodError: no method matching Int64(::Missing)

So what should we do if we want to convert a vector of integer-valued floats or
missing values into a vector with the Union{Int, Missing} element type?
Note that the following fails:

julia> Int.([1.0, 2.0, missing])
ERROR: MethodError: no method matching Int64(::Missing)

You can do one of two things. Either handle the case of missing manually
like this:

julia> [ismissing(x) ? missing : Int(x) for x in [1.0, 2.0, missing]]
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing

or use the passmissing wrapper that is defined in the Missings.jl package:

julia> using Missings

julia> passmissing(Int).([1.0, 2.0, missing])
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
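
The idea behind such a wrapper can be sketched in a single line of Base Julia.
Note that skip_if_missing below is a hypothetical stand-in written for this
illustration; it is not the Missings.jl implementation of passmissing.

```julia
# Hypothetical stand-in: return missing if any argument is missing,
# otherwise call the wrapped function.
skip_if_missing(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

v = skip_if_missing(Int).([1.0, 2.0, missing])

println(v)
```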

Changing element type of the collections

Very often we have a collection of data whose element type is not what we
want it to be and we need to perform an appropriate transformation that keeps
the data but just changes the element type. In the context of missing values
there are two such operations.

The first is when we have a collection that does not allow missing values, but
we want another collection that holds the same data, but allows them (e.g.
because later we might want to store missing in such a collection). In such
a case use allowmissing from Missings.jl:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> x[1] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...
Stacktrace:
 [1] setindex!(A::Vector{Int64}, x::Missing, i1::Int64)
   @ Base ./array.jl:839
 [2] top-level scope
   @ REPL[63]:1

julia> y = allowmissing(x)
3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

julia> y[1] = missing
missing

julia> y
3-element Vector{Union{Missing, Int64}}:
  missing
 2
 3

An opposite scenario is when we started with a collection allowing missing
values, which were removed from it and now we want it to have a narrower element
type. Here disallowmissing comes to our aid:

julia> x = [1, missing, 2]
3-element Vector{Union{Missing, Int64}}:
 1
  missing
 2

julia> filter!(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
 1
 2

julia> disallowmissing(x)
2-element Vector{Int64}:
 1
 2

Finally sometimes we might want to create a collection initially filled with
missing values, but allowing additionally some specific type of values.
Unfortunately the fill function will not help us here:

julia> fill(missing, 2)
2-element Vector{Missing}:
 missing
 missing

and we can see that the element type of the produced collection is Missing and
it is too narrow for practical use.

We should use the missings function from Missings.jl instead. For instance:

julia> z = missings(Int, 3)
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing

julia> z[1] = 100
100

julia> z
3-element Vector{Union{Missing, Int64}}:
 100
    missing
    missing

The distinction between collections allowing and not allowing missing values
is quite important in practice, so it is worth remembering the allowmissing,
disallowmissing, and missings functions are available. To wrap up let us
contrast this design decision with R, where storing missing values in typical
scenarios (not in all scenarios though) is supported by default and cannot be
opted-out from.
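
For readers who want to see what these helpers amount to, the same element type
changes can be approximated in Base Julia with convert and an array
constructor. This is only a sketch under that assumption; the Missings.jl
functions may be implemented differently and handle more cases.

```julia
x = [1, 2, 3]

# Roughly what allowmissing(x) gives: same data, wider element type.
y = convert(Vector{Union{Missing, Int}}, x)
y[1] = missing

# Roughly what disallowmissing gives once the missing values are removed.
z = convert(Vector{Int}, filter(!ismissing, y))

# Roughly what missings(Int, 3) gives: a vector prefilled with missing.
m = Vector{Union{Missing, Int}}(missing, 3)

println((y, z, m))
```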

Conclusions

Most of the topics I have discussed here are standard. However, hopefully for
people starting to work with missing values in the Julia language these examples
can serve as useful additional information on top of what is written in
the section on Missing Values of the Julia Manual.

Handling vectors of vectors in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/08/27/nested.html

Introduction

In this post I want to explore a topic that is a corner case of the
transformation mini-language in DataFrames.jl: transformations that produce
vectors as values. This question arises when users start to do more advanced
operations, so I decided that it deserves a deeper treatment (as in the last
post, I have chosen a topic following a question I recently got from a user).

This post was written under Julia 1.6.1, Arrow 1.6.2, and DataFrames.jl 1.2.2.

The standard behavior

By default if a transformation operation returns a vector it gets expanded
into multiple rows, as this is something that typically users expect.
Here is a basic example:

julia> using DataFrames

julia> df = DataFrame(x=[1, 1, 2, 3, 3])
5×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     1
   3 │     2
   4 │     3
   5 │     3

julia> combine(df, :x => reverse)
5×1 DataFrame
 Row │ x_reverse
     │ Int64
─────┼───────────
   1 │         3
   2 │         3
   3 │         2
   4 │         1
   5 │         1

julia> combine(df, :x => unique)
3×1 DataFrame
 Row │ x_unique
     │ Int64
─────┼──────────
   1 │        1
   2 │        2
   3 │        3

As you can see in the unique example the number of rows produced is flexible
with combine. This would not be the case for select or transform which
require the number of rows in the result to match the source, so we get:

julia> select(df, :x => reverse)
5×1 DataFrame
 Row │ x_reverse
     │ Int64
─────┼───────────
   1 │         3
   2 │         3
   3 │         2
   4 │         1
   5 │         1

julia> select(df, :x => unique)
ERROR: ArgumentError: length 3 of vector returned from function unique is different from number of rows 5 of the source data frame.

Similar rules apply when split-apply-combine strategy is used:

julia> gdf = groupby(DataFrame(id=[1, 1, 1, 2, 2, 2], x=[1, 1, 2, 1, 3, 3]), :id)
GroupedDataFrame with 2 groups based on key: id
First Group (3 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      1
   3 │     1      2
⋮
Last Group (3 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     2      3
   3 │     2      3

julia> combine(gdf, :x => reverse)
6×2 DataFrame
 Row │ id     x_reverse
     │ Int64  Int64
─────┼──────────────────
   1 │     1          2
   2 │     1          1
   3 │     1          1
   4 │     2          3
   5 │     2          3
   6 │     2          1

julia> combine(gdf, :x => unique)
4×2 DataFrame
 Row │ id     x_unique
     │ Int64  Int64
─────┼─────────────────
   1 │     1         1
   2 │     1         2
   3 │     2         1
   4 │     2         3

julia> select(gdf, :x => reverse)
6×2 DataFrame
 Row │ id     x_reverse
     │ Int64  Int64
─────┼──────────────────
   1 │     1          2
   2 │     1          1
   3 │     1          1
   4 │     2          3
   5 │     2          3
   6 │     2          1

julia> select(gdf, :x => unique)
ERROR: ArgumentError: all functions must return vectors with as many values as rows in each group

Putting a vector into a single row

Sometimes one wants to put a vector into a single row of the resulting data frame.
In such a case the recommended way to achieve the desired result is to wrap the
result in Ref, just like in broadcasting:

julia> combine(df, :x => Ref∘unique)
1×1 DataFrame
 Row │ x_Ref_unique
     │ Array…
─────┼──────────────
   1 │ [1, 2, 3]

julia> combine(gdf, :x => Ref∘unique)
2×2 DataFrame
 Row │ id     x_Ref_unique
     │ Int64  Array…
─────┼─────────────────────
   1 │     1  [1, 2]
   2 │     2  [1, 3]

This pattern is typically most useful when working with grouped data frames.
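
The analogy with broadcasting mentioned above can be seen in Base Julia, where
Ref protects a vector from being iterated element by element. This is a minimal
sketch; the names allowed and flags are only illustrative.

```julia
allowed = [1, 2, 3]

# Without Ref the two vectors would be paired element-wise (and their lengths
# would have to match); Ref makes `allowed` act as a single value.
flags = in.([1, 2, 4], Ref(allowed))

println(flags)
```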

It is worth mentioning that this wrapping is not required if we are
performing a ByRow operation, as ByRow automatically wraps everything our
transformation function produces in an additional vector as a container.

Here is an example:

julia> select(df, :x => ByRow(x -> fill(x, x)))
5×1 DataFrame
 Row │ x_function
     │ Array…
─────┼────────────
   1 │ [1]
   2 │ [1]
   3 │ [2, 2]
   4 │ [3, 3, 3]
   5 │ [3, 3, 3]

If you want to expand this result into multiple rows I recommend using the
flatten function in post-processing, as I discuss in this post:

julia> flatten(select(df, :x => ByRow(x -> fill(x, x))), 1)
10×1 DataFrame
 Row │ x_function
     │ Int64
─────┼────────────
   1 │          1
   2 │          1
   3 │          2
   4 │          2
   5 │          3
   6 │          3
   7 │          3
   8 │          3
   9 │          3
  10 │          3

Producing a table as an output from a transformation

The patterns above are standard. However, as users get more advanced they
start doing complex transformations that produce multiple columns in their code.

Here is a simple example to start with:

julia> select(df, :x => ByRow(x -> (a=x, b=fill(x, x))) => AsTable)
5×2 DataFrame
 Row │ a      b
     │ Int64  Array…
─────┼──────────────────
   1 │     1  [1]
   2 │     1  [1]
   3 │     2  [2, 2]
   4 │     3  [3, 3, 3]
   5 │     3  [3, 3, 3]

This worked nicely because ByRow(x -> (a=x, b=fill(x, x))) produced a vector
of NamedTuples that was cleanly handled by AsTable.

However, the following fails when aggregating GroupedDataFrame:

julia> combine(gdf, :x => (x -> (a=sum(x), b=x)) => AsTable)
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed

The problem is that the NamedTuple the transformation produces mixes a scalar
(in the column :a) and a vector (in the column :b). This is disallowed as it
would not be clear if the user wants the scalar value of :a to be broadcast or
the vector stored in :b to be put into a single row of output.

Here are the ways to achieve both behaviors. If you want to broadcast :a into
multiple rows to match the length of :b the simplest approach is to use
a DataFrame instead of a NamedTuple:

julia> combine(gdf, :x => (x -> DataFrame(a=sum(x), b=x)) => AsTable)
6×3 DataFrame
 Row │ id     a      b
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      1
   2 │     1      4      1
   3 │     1      4      2
   4 │     2      7      1
   5 │     2      7      3
   6 │     2      7      3

Here the trick is that the DataFrame constructor performs pseudo-broadcasting
automatically.

On the other hand if you want the vector stored in :b to be put into a single row
do one of the following:

julia> combine(gdf, :x => (x -> (a=[sum(x)], b=[x])) => AsTable)
2×3 DataFrame
 Row │ id     a      b
     │ Int64  Int64  SubArray…
─────┼─────────────────────────
   1 │     1      4  [1, 1, 2]
   2 │     2      7  [1, 3, 3]

julia> combine(gdf, :x => (x -> DataFrame(a=sum(x), b=Ref(x))) => AsTable)
2×3 DataFrame
 Row │ id     a      b
     │ Int64  Int64  SubArray…
─────┼─────────────────────────
   1 │     1      4  [1, 1, 2]
   2 │     2      7  [1, 3, 3]

In the first approach we wrapped both :a and :b in a vector, and in the second
we used the pseudo-broadcasting supported by the DataFrame constructor again.

Why using nested vectors might be beneficial?

There are two kinds of benefits. One is readability. The other is performance.

Regarding readability: consider that you have a homogeneous set of values. Then
you might prefer to store them in one column to keep them together.
Here is an example of such data:

julia> df1 = DataFrame(x = [fill(i, 1000) for i in 1:10000])
10000×1 DataFrame
   Row │ x
       │ Array…
───────┼───────────────────────────────────
     1 │ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1  ……
     2 │ [2, 2, 2, 2, 2, 2, 2, 2, 2, 2  ……
     3 │ [3, 3, 3, 3, 3, 3, 3, 3, 3, 3  ……
   ⋮   │                 ⋮
  9999 │ [9999, 9999, 9999, 9999, 9999, 9…
 10000 │ [10000, 10000, 10000, 10000, 100…
                          9995 rows omitted

julia> df2 = select(df1, :x => AsTable)
10000×1000 DataFrame
   Row │ x1     x2     x3     x4     x5     x6     x7     x8  ⋯
       │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int ⋯
───────┼───────────────────────────────────────────────────────
     1 │     1      1      1      1      1      1      1      ⋯
     2 │     2      2      2      2      2      2      2
     3 │     3      3      3      3      3      3      3
   ⋮   │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮ ⋱
  9999 │  9999   9999   9999   9999   9999   9999   9999   99
 10000 │ 10000  10000  10000  10000  10000  10000  10000  100 ⋯
                              993 columns and 9995 rows omitted

Of course it is subjective, but provided that it makes sense to keep the nested
data together (e.g. it could be a time series), I would prefer df1 over df2.
Clearly, in this case one could argue for using a narrow rather than a wide
table format, but in practical cases we would typically have many additional
columns with e.g. metadata that would have a constant value for the whole
series, and then I often find it more convenient not to use the narrow
format.

Now let us handle the performance issue. Assume we would want to transform the
data by summing it. Here are the timings of operations in both cases
(the timings are for a second run of each operation):

julia> @time select(df1, :x => ByRow(sum));
  0.006237 seconds (102 allocations: 83.953 KiB)

julia> @time select(df2, x -> sum.(eachrow(x)));
  1.829907 seconds (68.27 M allocations: 1.167 GiB, 4.75% gc time, 1.30% compilation time)

Note that in the case of df2 one could consider writing something like
select(df2, names(df2, r"x") => ByRow(+)) to make the operation type-stable,
but the performance of this will be very bad. In this discussion you
can read about the future plans for improving simple row aggregations (like
sum here), but I could have used some more complex transformation operation
which, even after these changes, would be much faster on a nested column.

In short, a nested column, assuming its eltype is concrete, solves the tension
between type instability of DataFrame vs potentially extremely long compilation
times when you want to switch to a type stable mode via e.g. a Tuple or
a NamedTuple.

Storage of data frames with nested columns

One drawback of nested columns is that they cannot be easily persistently stored
in CSV files. However, they are easy enough to work with using Arrow.jl:

julia> using Arrow

julia> Arrow.write("test.arrow", df1)
"test.arrow"

julia> df1_read = DataFrame(Arrow.Table("test.arrow"));

julia> df1_read == df1
true

Here, you just need to remember that df1_read created this way is read-only
(for the benefit of performance). You would need to write
DataFrame(Arrow.Table("test.arrow"), copycols=true) to get mutable columns.

Conclusions

Vectors of vectors are not used very commonly in other data processing
ecosystems. However, DataFrames.jl is designed to allow easy processing
of such data and I find them convenient surprisingly often.
