Working with vectors using DataFrames.jl minilanguage

By: Blog by Bogumił Kamiński

I have written in the past about the DataFrames.jl operation specification syntax
(also called the minilanguage) in earlier posts.

Today I want to discuss one design decision made in this minilanguage and its consequences.
It is related to how vectors are handled when they are returned from a transformation function.

The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.

A basic example

Consider the following example, where we want to compute a profit from some sales data:

julia> using DataFrames

julia> df = DataFrame(name=["A", "B", "C"],
                      revenue=[10, 20, 30],
                      cost=[5, 12, 18])
3×3 DataFrame
 Row │ name    revenue  cost
     │ String  Int64    Int64
   1 │ A            10      5
   2 │ B            20     12
   3 │ C            30     18

julia> combine(df, All(), ["revenue", "cost"] => (-) => "profit")
3×4 DataFrame
 Row │ name    revenue  cost   profit
     │ String  Int64    Int64  Int64
   1 │ A            10      5       5
   2 │ B            20     12       8
   3 │ C            30     18      12

The crucial point to understand here is that the - function takes
two columns "revenue" and "cost" and returns a vector.
Users typically expect, as in this example, that this vector should
be spread across several rows.
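To see the spreading rule on data with more than one row per group, here is a small sketch. The sales data frame and its values are made up for illustration and are not part of the example above. A function that returns a scalar (sum) produces one row per group, while a function that returns a vector (cumsum) produces one row per element of that vector:

```julia
using DataFrames

# Made-up data: product "A" has two rows, product "B" has one.
sales = DataFrame(name=["A", "A", "B"], revenue=[10, 20, 30])
gdf = groupby(sales, :name)

# sum returns a scalar, so each group contributes exactly one row.
totals = combine(gdf, :revenue => sum => :total)

# cumsum returns a vector, so it is spread over as many rows as it has elements.
running = combine(gdf, :revenue => cumsum => :running)
```

Here totals has two rows and running has three, one for each element returned by cumsum.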

When is vector spreading not desirable?

However, there are cases when we might not want to spread a vector
into multiple rows. Consider, for example, a transformation in which
we want to put the "revenue" and "cost" values in a 2-element vector
per product. Intuitively, we could write something like:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
ERROR: ArgumentError: all functions must return vectors of the same length

Unfortunately, we get an error. We will soon understand why, but before
I proceed, let me comment on the [only(x), only(y)] part of the definition.
The only function makes sure that we have exactly one row per product.
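As a quick illustration of what only does (a sketch using plain vectors in place of data frame columns), it returns the element of a one-element collection and throws an error for any other length:

```julia
# only returns the single element of a one-element collection.
single = only([10])

# For any other length it throws an ArgumentError instead of silently picking a value.
threw = try
    only([10, 20])
    false
catch err
    err isa ArgumentError
end
```

This is why it works as a guard here: if some product had two rows, the transformation would fail loudly rather than produce a misleading vector.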

To diagnose the issue let us drop the All() part in our call:

julia> combine(groupby(df, :name),
               ["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
6×2 DataFrame
 Row │ name    vec
     │ String  Int64
   1 │ A          10
   2 │ A           5
   3 │ B          20
   4 │ B          12
   5 │ C          30
   6 │ C          18

Now we understand the problem. Because our function returns a vector, it gets
spread over several rows (which led to the error in the previous call, as the other
columns have a different length).

Solving the vector-spreading issue

As I have said above, most of the time vector spreading is a desired feature,
but in the example we have just studied it is not wanted.
For such cases DataFrames.jl allows you to protect vectors from being spread.
What you need to do is to call the Ref function on the returned value.
This will protect the result from being spread:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> Ref([only(x), only(y)])) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

Now, as we wanted, the entries of the "vec" column are vectors. Wrapping the return
value of our function with Ref protected the vectors from being spread.

The alternative function that you could use to get the same effect is fill:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> fill([only(x), only(y)])) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

or you could wrap the return value with another pair of [...]:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> [[only(x), only(y)]]) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

What is going on here? In all three cases (Ref, fill, and [...]) we are wrapping a vector in another object that works like an outer vector.
In the case of [...] it is just a vector, fill produces a 0-dimensional array, and Ref creates a wrapper that behaves like a 0-dimensional array.
In all cases DataFrames.jl treats this outer wrapper as a 1-element vector and just stores its contents in a single row (because there is one element to store).
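The three wrappers can be inspected without DataFrames.jl. The following is a minimal sketch in plain Julia (the variable names are mine, chosen for illustration) showing that each wrapper holds exactly one element, and that this element is the original vector:

```julia
v = [10, 5]

as_ref = Ref(v)     # behaves like a 0-dimensional container
as_fill = fill(v)   # a true 0-dimensional array
as_vec = [v]        # a 1-element vector of vectors

# Each wrapper holds exactly one element.
lengths = (length(as_ref), length(as_fill), length(as_vec))

# That element is the original vector itself (no copy is made).
same_contents = as_ref[] === v && as_fill[] === v && only(as_vec) === v

# fill(v) has zero dimensions, while [v] has one.
dims = (ndims(as_fill), ndims(as_vec))
```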


I hope that you will find the example I gave today useful when transforming vectors using DataFrames.jl.

My understanding of object property access in Julia

By: Blog by Bogumił Kamiński

Today I wanted to discuss a conceptual aspect of Julia programming.
It is related to the question of how you should query an object for its properties.
The topic is especially relevant if you want to write code that is expected to be stable
in the longer term, that is, easy to maintain as versions of its dependencies change.

The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.

The internals

Composite types are a fundamental element of Julia's design. An object of such a type
is a collection of named fields. Each of these fields can hold some value.

To make things concrete, let us have a look at the SubDataFrame type from DataFrames.jl.
First create an instance of such an object:

julia> using DataFrames

julia> df = DataFrame(x=1:3, y=11:13, z=111:113)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
   1 │     1     11    111
   2 │     2     12    112
   3 │     3     13    113

julia> sdf = @view df[1:2, 1:2]
2×2 SubDataFrame
 Row │ x      y
     │ Int64  Int64
   1 │     1     11
   2 │     2     12

To check what fields SubDataFrame contains you can use the fieldnames function:

julia> fieldnames(SubDataFrame)
(:parent, :colindex, :rows)

Note that we pass a type to fieldnames. It is important – the list of fields is fixed for every
instance of an object of a given type.

In this case we learned that SubDataFrame has three fields. The three functions associated with
fieldnames are: fieldcount returning the number of fields of a type,
fieldtypes returning their declared types, and hasfield allowing you
to query if a specific field is present. Here is an example:

julia> fieldcount(SubDataFrame)
3

julia> fieldtypes(SubDataFrame)
(AbstractDataFrame, DataFrames.AbstractIndex, AbstractVector{Int64})

julia> hasfield(SubDataFrame, :parent)
true

julia> hasfield(SubDataFrame, :parentx)
false

For a given instance of a type you can query the field with getfield and set it with setfield!.
For example, let us get the field :parent of our sdf object (a source data frame in this case):

julia> getfield(sdf, :parent)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
   1 │     1     11    111
   2 │     2     12    112
   3 │     3     13    113

Having learned all these methods you might ask yourself when to use them. The short answer is:

Never directly access fields of a type. They might be changed
between versions of code you use without warning.

The longer answer is that you should assume that direct field access is typically considered internal.
The list of fields and their types are an implementation detail, and as a user of this type you should
not rely on them. The use of field access is restricted to the designers of a type to allow them to
manipulate its inner physical representation.

So how should we work with composite types then?

The composite type interface

Julia introduces the concept of a property, which is a logical representation of data stored in a given object.
You can query for properties of an object with the propertynames function. You also have the hasproperty,
getproperty and setproperty! functions, analogous to those for fields.

In the case of our sdf SubDataFrame we have the following logical representation:

julia> propertynames(sdf)
2-element Vector{Symbol}:
 :x
 :y

julia> hasproperty(sdf, :x)
true

julia> getproperty(sdf, :x)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 1
 2

julia> setproperty!(sdf, :x, [1001, 1002])
2-element Vector{Int64}:
 1001
 1002

julia> sdf
2×2 SubDataFrame
 Row │ x      y
     │ Int64  Int64
   1 │  1001     11
   2 │  1002     12

We immediately see a significant difference. The sdf properties in this case are columns of our data frame.
We do not care how they are mapped to the physical representation of SubDataFrame; this is taken care of
by the designers of the DataFrames.jl package.

There are the following important aspects of properties.

The first is that property access is typically considered a public API.
Designers of a type should make sure that the way you access properties
of an object remains stable; a change in this area would be breaking, so:

You should access properties of objects in your code (not fields).

The second is that properties are bound to an object, not to a type.
This means that different objects of the same type may have different sets of properties.
It is quite useful, e.g. each data frame can have a different set of columns.
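To illustrate this point outside of DataFrames.jl, here is a minimal sketch of a type whose properties differ between instances. The Record type and the storage helper are hypothetical names I made up for this illustration; this is not how DataFrames.jl is implemented:

```julia
# Hypothetical type: a thin wrapper storing named values in a dictionary.
struct Record
    data::Dict{Symbol, Any}
end

# Internal accessor: getfield bypasses the getproperty overload defined below.
storage(r::Record) = getfield(r, :data)

# Properties are the keys of the dictionary, so they depend on the instance.
Base.propertynames(r::Record, private::Bool=false) = sort!(collect(keys(storage(r))))
Base.getproperty(r::Record, name::Symbol) = storage(r)[name]
Base.setproperty!(r::Record, name::Symbol, value) = (storage(r)[name] = value)

r1 = Record(Dict{Symbol, Any}(:a => 1))
r2 = Record(Dict{Symbol, Any}(:a => 1, :b => 2))

# Same type and same single field, but different sets of properties.
fields = fieldnames(Record)
props1 = propertynames(r1)
props2 = propertynames(r2)
```

Both objects have the single field data, yet r2 exposes the property b while r1 does not.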

The third, practical point is that by default properties fall back to fields,
as described in the Julia Manual.

The next aspect is convenient syntax.
You do not need to call the getproperty and setproperty! functions explicitly.
The call getproperty(a, :b) is equivalent to a.b, and setproperty!(a, :b, v) is the same as a.b = v.
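Both the default fallback to fields and the dot syntax can be checked on a small type. The Counter type below is a hypothetical example that does not overload any property functions:

```julia
# Hypothetical mutable type with no custom property methods.
mutable struct Counter
    value::Int
end

c = Counter(0)

# By default the properties of an object are the fields of its type.
default_props = propertynames(c)

# Dot syntax and the explicit function calls are interchangeable.
c.value = 5                          # same as setproperty!(c, :value, 5)
via_function = getproperty(c, :value)
via_dot = c.value

setproperty!(c, :value, 7)           # same as c.value = 7
after_set = c.value
```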

Finally, note that the propertynames function optionally takes a second positional argument
of type Bool. If it is passed and set to true you get a list of all properties of an object.
By default the second argument is false and you get a list of the public properties of an object
(and in practice you should use the default mode).
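A type designer decides what is reported in each mode. As a sketch, assume a hypothetical Account type whose audit_log field is meant to be private; a two-argument propertynames method could then look like this:

```julia
# Hypothetical type: audit_log is treated as a private detail by its designer.
struct Account
    owner::String
    balance::Float64
    audit_log::Vector{String}
end

# Report only public properties unless private=true is requested.
function Base.propertynames(a::Account, private::Bool=false)
    public = (:owner, :balance)
    return private ? (public..., :audit_log) : public
end

acc = Account("Alice", 100.0, String[])

public_props = propertynames(acc)
all_props = propertynames(acc, true)
```

Note that hasproperty relies on the default (public) list, so hasproperty(acc, :audit_log) is false in this sketch even though the field exists.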


Today I have a short conclusion.

Fields represent the physical layout of a type.
Properties represent a logical view of an object.

In your code use object properties and not their fields.
Field access is considered internal and should typically be done only by the developers of the
package providing a given object.