Tag Archives: julialang

Getting full factorial design in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/29/ffd.html

Introduction

Often when working with data we need to get all possible combinations of some
input factors in a data frame. In the field of design of experiments
this is called a full factorial design. In this post I will discuss two functions
that DataFrames.jl provides to help you generate such designs when
you need them.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

What is a full factorial design and how to create it?

Assume we are a cardboard box producer and have three factors describing a box: its width, height, and depth.
Each of them has a finite set of possible values (due to production process limitations).
Let us create some sample data of this kind:

julia> height = [10, 12]
2-element Vector{Int64}:
 10
 12

julia> width = [8, 10, 15]
3-element Vector{Int64}:
  8
 10
 15

julia> depth = [5, 6]
2-element Vector{Int64}:
 5
 6

Our task is to compute the volume of all possible boxes that can be created by our factory.
The list of all possible cardboard box configurations is a full factorial design.
You can get an iterator of these values by using the Iterators.product function:

julia> Iterators.product(height, width, depth)
Base.Iterators.ProductIterator{Tuple{Vector{Int64}, Vector{Int64}, Vector{Int64}}}(([10, 12], [8, 10, 15], [5, 6]))

This function is lazy. To see the result we need to materialize its return value, for example using collect:

julia> collect(Iterators.product(height, width, depth))
2×3×2 Array{Tuple{Int64, Int64, Int64}, 3}:
[:, :, 1] =
 (10, 8, 5)  (10, 10, 5)  (10, 15, 5)
 (12, 8, 5)  (12, 10, 5)  (12, 15, 5)

[:, :, 2] =
 (10, 8, 6)  (10, 10, 6)  (10, 15, 6)
 (12, 8, 6)  (12, 10, 6)  (12, 15, 6)

We can see that we get an array of tuples of all possible combinations of dimensions. Let us now compute the volumes:

julia> prod.(collect(Iterators.product(height, width, depth)))
2×3×2 Array{Int64, 3}:
[:, :, 1] =
 400  500  750
 480  600  900

[:, :, 2] =
 480  600   900
 576  720  1080

The results are nice and efficient. However, sometimes it is more convenient to have this data in a data frame.
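If only the volumes or an aggregate of them are needed, the lazy iterator can also be consumed directly, without calling collect first. The following is a minimal sketch under that assumption; the names volumes and largest are illustrative and do not appear in the original post.

```julia
height = [10, 12]
width = [8, 10, 15]
depth = [5, 6]

# A comprehension over the lazy product keeps the 2×3×2 shape
volumes = [h * w * d for (h, w, d) in Iterators.product(height, width, depth)]

# A reduction can consume the iterator directly, without materializing it
largest = maximum(prod, Iterators.product(height, width, depth))
```

The comprehension gives the same array as broadcasting prod over the collected product, while the reduction never allocates the intermediate array of tuples.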

Full factorial design in DataFrames.jl

Let us repeat the exercise using DataFrames.jl:

julia> using DataFrames

julia> df = allcombinations(DataFrame; height, width, depth)
12×3 DataFrame
 Row │ height  width  depth
     │ Int64   Int64  Int64
─────┼──────────────────────
   1 │     10      8      5
   2 │     12      8      5
   3 │     10     10      5
   4 │     12     10      5
   5 │     10     15      5
   6 │     12     15      5
   7 │     10      8      6
   8 │     12      8      6
   9 │     10     10      6
  10 │     12     10      6
  11 │     10     15      6
  12 │     12     15      6

Note that we passed height, width, and depth as keyword arguments to allcombinations. Here we take advantage of a convenient Julia feature: when a keyword argument has the same name as the variable holding its value, writing just height after the semicolon is equivalent to writing height=height.
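To make the keyword shorthand concrete, here is a small sketch comparing three ways of making the same call. The Pair form is the other calling convention that allcombinations accepts; the names df_short, df_long, and df_pairs are illustrative.

```julia
using DataFrames

height = [10, 12]
width = [8, 10, 15]
depth = [5, 6]

# Shorthand: a bare name after `;` is passed as name=name
df_short = allcombinations(DataFrame; height, width, depth)

# Explicit keyword form
df_long = allcombinations(DataFrame; height=height, width=width, depth=depth)

# Pair form, useful when column names are not valid Julia identifiers
df_pairs = allcombinations(DataFrame, "height" => height,
                           "width" => width, "depth" => depth)
```

All three calls should produce the same 12-row data frame.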

Now we can add a volume column:

julia> transform!(df, All() => ByRow(*) => "volume")
12×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64
─────┼──────────────────────────────
   1 │     10      8      5     400
   2 │     12      8      5     480
   3 │     10     10      5     500
   4 │     12     10      5     600
   5 │     10     15      5     750
   6 │     12     15      5     900
   7 │     10      8      6     480
   8 │     12      8      6     576
   9 │     10     10      6     600
  10 │     12     10      6     720
  11 │     10     15      6     900
  12 │     12     15      6    1080

We have added the "volume" column in place to df. Note that we used * as it can take any number of positional arguments and returns their product. The ByRow wrapper signals that we want to perform this operation row-wise.
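As a cross-check, the row-wise product can be compared with plain broadcasting over the three columns. This sketch uses the non-mutating transform so that the source data frame stays unchanged; the names df_vol and volume_broadcast are illustrative.

```julia
using DataFrames

df = allcombinations(DataFrame; height=[10, 12], width=[8, 10, 15], depth=[5, 6])

# transform (without !) returns a new data frame and leaves df untouched
df_vol = transform(df, All() => ByRow(*) => "volume")

# The same numbers computed by broadcasting * over the columns
volume_broadcast = df.height .* df.width .* df.depth
```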

Compared to the array of tuples shown before, many users find a full factorial design presented as a data frame easier to work with.

What if we have a fractional factorial design?

Sometimes your data is incomplete, and some level combinations are missing. Let us start by creating such a data frame:

julia> df2 = df[Not(2, 5, 9), :]
9×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64
─────┼──────────────────────────────
   1 │     10      8      5     400
   2 │     10     10      5     500
   3 │     12     10      5     600
   4 │     12     15      5     900
   5 │     10      8      6     480
   6 │     12      8      6     576
   7 │     12     10      6     720
   8 │     10     15      6     900
   9 │     12     15      6    1080

Now one might want to complete this design by adding back the missing level combinations. This can be done with the fillcombinations function. Let us see it at work:

julia> fillcombinations(df2, Not("volume"))
12×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64?
─────┼───────────────────────────────
   1 │     10      8      5      400
   2 │     12      8      5  missing
   3 │     10     10      5      500
   4 │     12     10      5      600
   5 │     10     15      5  missing
   6 │     12     15      5      900
   7 │     10      8      6      480
   8 │     12      8      6      576
   9 │     10     10      6  missing
  10 │     12     10      6      720
  11 │     10     15      6      900
  12 │     12     15      6     1080

Observe that calling this function created a new data frame with the missing rows added. The "volume" column is filled by default with missing for the rows that were added. The Not("volume") argument means that we want all combinations of values in all columns except "volume".
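The default filler can be changed with the fill keyword argument of fillcombinations. The sketch below shows that option and, as an alternative, recomputes the volume from the factor columns once the design is complete. The names df_zero and df_full are illustrative.

```julia
using DataFrames

df = allcombinations(DataFrame; height=[10, 12], width=[8, 10, 15], depth=[5, 6])
transform!(df, All() => ByRow(*) => "volume")
df2 = df[Not(2, 5, 9), :]

# Use 0 instead of missing for the rows that get added
df_zero = fillcombinations(df2, Not("volume"); fill=0)

# Or restore the full design and recompute the volume from the factors
df_full = fillcombinations(df2, Not("volume"))
transform!(df_full, ["height", "width", "depth"] => ByRow(*) => "volume")
```

Recomputing is only possible here because the volume is a known function of the factors; for measured responses a filler value such as missing is usually the honest choice.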

Conclusions

Today we worked with two functions: allcombinations and fillcombinations. You will find them useful if you ever need to create all combinations of levels of some factors. This functionality seems niche, but it is needed in practice surprisingly often.

Storing vectors of vectors in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/22/minicontainers.html

Introduction

The beauty of DataFrames.jl design is that you can store any data
as columns of a data frame.
However, this leads to one tricky issue – what if we want to store
a vector as a single cell of a data frame? Today I will explain
what exactly the problem is and how to solve it.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Basic transformations of columns in DataFrames.jl

Let us start with a simple example:

julia> using DataFrames

julia> df = DataFrame(id=repeat(1:2, 5), x=1:10)
10×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10

We want to group the df data frame by "id" and then store the "x" column unchanged in the result.

This can be done by writing:

julia> combine(groupby(df, "id", sort=true), "x")
10×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      3
   3 │     1      5
   4 │     1      7
   5 │     1      9
   6 │     2      2
   7 │     2      4
   8 │     2      6
   9 │     2      8
  10 │     2     10

Note that the column "x" is expanded into multiple rows by combine. The rule that is applied here states that if some transformation of data returns a vector it gets expanded into multiple rows. The reason for such a behavior is that this is what we want most of the time.
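The expansion rule can be seen by comparing a transformation that returns a scalar with one that returns a vector. This is a minimal sketch using the same df; the column names x_sum and x10 and the anonymous function are illustrative.

```julia
using DataFrames

df = DataFrame(id=repeat(1:2, 5), x=1:10)
gdf = groupby(df, "id", sort=true)

# A scalar result gives one row per group
per_group = combine(gdf, "x" => sum => "x_sum")

# A vector result is expanded into one row per element
expanded = combine(gdf, "x" => (x -> x .* 10) => "x10")
```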

However, what if we wanted the vectors to be kept as they are, without expanding them?
This can be achieved by writing:

julia> combine(groupby(df, "id", sort=true), "x" => Ref => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

We see that we got what we wanted, but the question is why does it work?
Let me explain.

Containers holding one element in Julia

What we just did with Ref is that we wrapped some value in a container that held exactly one element.
There are three basic ways to create such a container in Julia.
The first is to wrap a vector within another vector:

julia> [[1,2,3]]
1-element Vector{Vector{Int64}}:
 [1, 2, 3]

Above you have a vector that has one element, which is a [1, 2, 3] vector.

The second method is to create a 0-dimensional array with fill:

julia> fill([1,2,3])
0-dimensional Array{Vector{Int64}, 0}:
[1, 2, 3]

The key point here is that 0-dimensional arrays are guaranteed to hold exactly one element (as opposed to a vector presented above).

The third approach is to use Ref:

julia> Ref([1,2,3])
Base.RefValue{Vector{Int64}}([1, 2, 3])

Wrapping an object with Ref also creates a 0-dimensional container. The difference between Ref and fill is that fill creates an array, while Ref is just a container (but not an array).
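To illustrate that all three containers hold the very same object, the sketch below wraps one vector in each of them and reads it back. The variable names are illustrative.

```julia
v = [1, 2, 3]

# Each wrapper holds v itself; no copy of v is made
in_vector = [v]
in_array0 = fill(v)
in_ref = Ref(v)

# only() extracts the element of a collection with exactly one element;
# empty indexing [] reads the element of a 0-dimensional array or a Ref
a = only(in_vector)
b = in_array0[]
c = in_ref[]
```

In each case the value read back is identical (in the sense of ===) to v.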

How to use 1-element containers in DataFrames.jl as wrappers

All three methods described above can be used to ensure that we protect a vector from being expanded into multiple rows. Therefore the following operations give the same output:

julia> combine(groupby(df, "id", sort=true), "x" => (x -> [x]) => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

julia> combine(groupby(df, "id", sort=true), "x" => fill => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

julia> combine(groupby(df, "id", sort=true), "x" => Ref => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

The point is that combine unwraps the outer container (vector, 0-dimensional array, and Ref respectively) and stores its contents as a cell of a data frame.

Now, you might ask why I initially recommended Ref. The reason is that it is the method that has the smallest memory footprint:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> @allocated [x]
64

julia> @allocated fill(x)
64

julia> @allocated Ref(x)
16

This difference is important if you have a huge data frame that has millions of groups.

Also writing Ref is simpler than writing (x -> [x]) 😄.

Aliasing trap

You might have noticed that in the above examples the resulting "x" column held SubArrays. Why is this the case?
To improve performance combine did not copy the inner vectors from the source df data frame, but instead made their views. This is faster and more memory efficient, but results in creating an alias between the source data frame and the result. In many cases this is not a problem.

However, in some cases you might want to avoid it. The most common case is when you later want to mutate df in place, but do not want the result of combine to reflect this change. If you want to de-alias the data you need to copy it in the produced columns. Therefore you should do:

julia> combine(groupby(df, "id", sort=true), "x" => Ref∘copy => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  Array…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

Notice that now the "x" column stores Array (which indicates that the copy was made). The Ref∘copy expression signals function composition. We first apply the copy function to the source data and then pass the result to Ref.
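The effect of aliasing can be observed by mutating the source column after calling combine. This is a small sketch under the assumption that df.x is modified in place; the names aliased and copied are illustrative.

```julia
using DataFrames

df = DataFrame(id=repeat(1:2, 5), x=1:10)
gdf = groupby(df, "id", sort=true)

# Cells of "x" are views into df.x
aliased = combine(gdf, "x" => Ref => "x")

# Cells of "x" are independent vectors
copied = combine(gdf, "x" => Ref∘copy => "x")

# Mutate the source column in place
df.x[1] = 100

first_aliased = aliased.x[1][1]  # reflects the change made to df
first_copied = copied.x[1][1]    # keeps the original value
```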

An alternative

Sometimes we want to keep the groups as columns, not as rows, of a data frame. In this case you can use unstack to achieve the desired result. Here is an example of how to do it:

julia> unstack(df, :id, :x, combine=identity)
1×2 DataFrame
 Row │ 1                2
     │ SubArray…?       SubArray…?
─────┼───────────────────────────────────
   1 │ [1, 3, 5, 7, 9]  [2, 4, 6, 8, 10]

and a version copying the underlying data:

julia> unstack(df, :id, :x, combine=copy)
1×2 DataFrame
 Row │ 1                2
     │ Array…?          Array…?
─────┼───────────────────────────────────
   1 │ [1, 3, 5, 7, 9]  [2, 4, 6, 8, 10]

Conclusions

Having read this post you should be comfortable with protecting vectors from being expanded into multiple rows when processing data frames in DataFrames.jl. Enjoy!

Transforming multiple columns in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/15/transforms.html

Introduction

Today I want to comment on a recurring topic that DataFrames.jl users raise.
The question is how one should transform multiple columns of a data frame using
operation specification syntax.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

What is operation specification syntax?

In DataFrames.jl the combine, select, and transform functions allow
users to pass requests for data transformation using operation
specification syntax. This syntax is feature-rich, and its description
can be found, for example, in the DataFrames.jl documentation. Today I want to focus on its principal concept.

In a general form each request for making an operation on data has the (E)xtract-(T)ransform-(L)oad form.
That means that we need to specify:

  • source columns to get data from (the extract part);
  • the operation to apply to these columns (the transform part);
  • the target columns where we want to store the result of the operation (the load part).

These three parts are syntactically expressed using the following form:

[source columns specification] => [transformation function] => [target columns specification]

Let me give an example. Assume you have the following data:

julia> using DataFrames

julia> df = DataFrame(reshape(1:15, 5, 3), :auto)
5×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      6     11
   2 │     2      7     12
   3 │     3      8     13
   4 │     4      9     14
   5 │     5     10     15

We want to compute the sum of column "x1" and store it in a column named "x1_sum".
Since the sum function performs the addition operation, the syntax specification should be:

"x1" => sum => "x1_sum"

Let us check it with the combine function:

julia> combine(df, "x1" => sum => "x1_sum")
1×1 DataFrame
 Row │ x1_sum
     │ Int64
─────┼────────
   1 │     15

In this syntax it is important to note two things:

  • the "x1" column as a whole was passed to the sum function (as we want to compute its sum);
  • the "x1" column is a single positional argument passed to the sum function.

Two natural questions that arise are the following:

  • What if I do not want to perform an operation on a whole column, but on its elements (a.k.a. vectorization of operation)?
  • What if I want to pass multiple columns as a source for computations?

We will now investigate these two dimensions.

Vectorization of operations

Vectorization in DataFrames.jl is easy. Just wrap the function you use in the ByRow object. Here is an example:

julia> combine(df, "x1" => string => "x1_str")
1×1 DataFrame
 Row │ x1_str
     │ String
─────┼─────────────────
   1 │ [1, 2, 3, 4, 5]

julia> combine(df, "x1" => ByRow(string) => "x1_strs")
5×1 DataFrame
 Row │ x1_strs
     │ String
─────┼─────────
   1 │ 1
   2 │ 2
   3 │ 3
   4 │ 4
   5 │ 5

Note that "x1" => string => "x1_str" passed the whole "x1" column to the string function so we got a single "[1, 2, 3, 4, 5]"
string in the output.

In contrast, writing "x1" => ByRow(string) => "x1_strs" passed each element of the "x1" column to the string function individually,
so in the result we got a vector of five string representations of the numbers from the source.
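One way to build intuition for ByRow is to note that the wrapped function can also be called directly on vectors, where it behaves like broadcasting. The sketch below compares the three routes; the names by_row, broadcasted, and res are illustrative.

```julia
using DataFrames

df = DataFrame(reshape(1:15, 5, 3), :auto)

# ByRow(f) is itself callable: applied to a vector it maps f over its elements
by_row = ByRow(string)(df.x1)

# Broadcasting produces the same values
broadcasted = string.(df.x1)

# The data frame version stores these values in a column
res = combine(df, "x1" => ByRow(string) => "x1_strs")
```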

Passing multiple columns

Now let us have a look at passing multiple columns. There are two ways you can do it.

The first is when your function accepts multiple positional arguments. An example of such a function is string:

julia> string(df.x1, df.x2)
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"

If we pass a collection of columns as a source in operation specification syntax we get this behavior:

julia> combine(df, ["x1", "x2"] => string => "x1_x2_str")
1×1 DataFrame
 Row │ x1_x2_str
     │ String
─────┼─────────────────────────────────
   1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]

Naturally, the above combines with vectorization. Therefore since:

julia> string.(df.x1, df.x2)
5-element Vector{String}:
 "16"
 "27"
 "38"
 "49"
 "510"

we also have:

julia> combine(df, ["x1", "x2"] => ByRow(string) => "x1_x2_strs")
5×1 DataFrame
 Row │ x1_x2_strs
     │ String
─────┼────────────
   1 │ 16
   2 │ 27
   3 │ 38
   4 │ 49
   5 │ 510

However, there are cases when we have a function that expects multiple columns to be passed as a single positional argument.
This is handled in DataFrames.jl with the AsTable wrapper, which you can apply to the source columns.
If you use it then instead of getting multiple positional arguments the function will get a single positional argument
that will be a NamedTuple holding the source columns.

To convince ourselves that this is indeed what happens let us create a helper function:

julia> function helper(x)
           @show x
           return string(x.x1, x.x2)
       end
helper (generic function with 1 method)

This helper function first prints its only argument x, then assumes that x has x1 and x2 fields and applies the string function to them.
Let us first check it in practice:

julia> helper((x1=[1, 2, 3, 4, 5], x2=[6, 7, 8, 9, 10]))
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"

Now let us use the helper function with combine:

julia> combine(df, AsTable(["x1", "x2"]) => helper => "x1_x2_str")
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
1×1 DataFrame
 Row │ x1_x2_str
     │ String
─────┼─────────────────────────────────
   1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]

Indeed, we see that helper got a named tuple holding two columns of the source data frame.

Again, this syntax plays well with ByRow:

julia> combine(df, AsTable(["x1", "x2"]) => ByRow(helper) => "x1_x2_strs")
x = (x1 = 1, x2 = 6)
x = (x1 = 2, x2 = 7)
x = (x1 = 3, x2 = 8)
x = (x1 = 4, x2 = 9)
x = (x1 = 5, x2 = 10)
5×1 DataFrame
 Row │ x1_x2_strs
     │ String
─────┼────────────
   1 │ 16
   2 │ 27
   3 │ 38
   4 │ 49
   5 │ 510

We see that this time helper got a separate named tuple for each row of the source data frame.
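A common practical use of combining AsTable with ByRow is a row-wise aggregation over several columns. The following sketch computes the sum of each row; the target column name row_sum is illustrative.

```julia
using DataFrames

df = DataFrame(reshape(1:15, 5, 3), :auto)

# Each row is passed as a NamedTuple such as (x1 = 1, x2 = 6, x3 = 11);
# sum adds up the values it holds
res = combine(df, AsTable(:) => ByRow(sum) => "row_sum")
```

Because sum takes a single collection as its argument, AsTable is the natural fit here, whereas a function like + that takes several positional arguments would work without it.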

Conclusions

In summary, today we discussed two special operations in DataFrames.jl operation specification syntax:

  • the ByRow which vectorizes the function passed to it;
  • the AsTable which allows us to pass source columns as a single named tuple to the transformation function
    (instead of passing them as consecutive positional arguments, which is the default).

I hope these examples were useful in helping you understand the design of operation specification syntax.