Author Archives: Blog by Bogumił Kamiński

Strings vs symbols in DataFrames.jl column indexing

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/05/symbol.html

Introduction

In DataFrames.jl you can use both symbols and strings for column indexing. Which
to choose is one of the topics that new users ask about most frequently. In this
post I will explain why both options are supported and what the difference
between them is. Note that this is an entry-level post, so I will omit many
details of the discussed topic and focus on the most important aspects only.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
DataFramesMeta.jl 0.12.0, BenchmarkTools.jl 1.3.1.

What are strings and symbols?

In Julia a string allows users to store sequences of characters. The simplest
way to create a string is to write some text between double quotation marks:

julia> "an example string"
"an example string"

Symbols are objects used in Julia to create identifiers. You can think of them
as labels. Symbols are normally created by prefixing some label with : like
this:

julia> :label
:label

In this way you can create symbols that are valid variable names.
So, for example, you cannot create a symbol that contains a space using the : prefix:

julia> :my label
ERROR: syntax: extra token "label" after end of expression

Instead, in such cases, you need to call Symbol passing it a string as
an argument:

julia> Symbol("my label")
Symbol("my label")
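The conversion works in both directions. Here is a minimal sketch using only Base Julia (the variable names are illustrative):

```julia
# Build a symbol from a string, and get the string back.
sym = Symbol("my label")
str = String(sym)                    # "my label"

# Symbol also accepts several arguments and concatenates their
# string representations, which is handy for generated names.
s1 = Symbol("s", 1)                  # :s1

# Two symbols with the same name are the same object.
same = Symbol("label") === :label    # true
```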

How are strings and symbols different?

To understand the difference between symbols and strings it is easiest to
think of them as follows:

  • symbols are labels;
  • strings are sequences of characters.

So symbols are indivisible – they are always considered as a whole,
while strings consist of multiple characters. The most important consequences of
this distinction are the following:

  • symbols are faster than strings when you compare them for equality using ==;
  • you can manipulate strings (e.g. uppercase, chop, perform substring matching etc.)
    while no such operations are supported for symbols.

Let us have a look at these two characteristics by example. First we check
comparison speed. We create 1000-element vectors with unique values and compare
all pairs of their entries, so we make 1 million comparisons and expect 1000
matches.

julia> using BenchmarkTools

julia> string_vec = string.("s", 1:1000)
1000-element Vector{String}:
 "s1"
 "s2"
 "s3"
 "s4"
 ⋮
 "s997"
 "s998"
 "s999"
 "s1000"

julia> symbol_vec = Symbol.("s", 1:1000)
1000-element Vector{Symbol}:
 :s1
 :s2
 :s3
 :s4
 ⋮
 :s997
 :s998
 :s999
 :s1000

julia> test_cmp(v) = count(x == y for x in v, y in v)
test_cmp (generic function with 1 method)

julia> @btime test_cmp($string_vec)
  3.038 ms (0 allocations: 0 bytes)
1000

julia> @btime test_cmp($symbol_vec)
  635.400 μs (0 allocations: 0 bytes)
1000

Indeed symbol comparison is faster.

Now let us look at manipulation:

julia> str = "example"
"example"

julia> uppercase(str)
"EXAMPLE"

julia> chop(str)
"exampl"

julia> match(r"ex", str)
RegexMatch("ex")

julia> sym = :example
:example

julia> uppercase(sym)
ERROR: MethodError: no method matching uppercase(::Symbol)

julia> chop(sym)
ERROR: MethodError: no method matching chop(::Symbol)

julia> match(r"ex", sym)
ERROR: MethodError: no method matching match(::Regex, ::Symbol)
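If you do need to transform a symbol, one option is to round-trip through a string. This is a sketch using only Base functions; the helper name upper_symbol is made up for this illustration:

```julia
# Convert the symbol to a string, manipulate it, and convert back.
upper_symbol(s::Symbol) = Symbol(uppercase(String(s)))

upper_symbol(:example)               # :EXAMPLE

# The same pattern works for other string functions.
Symbol(chop(String(:example)))       # :exampl
occursin(r"ex", String(:example))    # true
```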

So in summary we could conclude that:

  • you can use a symbol if the value stored in it is not manipulated
    (i.e. is treated as a label); symbols are faster in comparisons than strings
    and a bit easier to type (only the : prefix is needed) provided that they do
    not contain characters like spaces (in which case they are not convenient
    to type);
  • strings support manipulation as opposed to symbols; the cost is that
    comparing them is slower than comparing symbols.

Let us now discuss how these considerations translate to the DataFrames.jl realm.

Strings vs symbols in DataFrames.jl

Column names in a DataFrame are labels. For this reason both symbols and
strings are allowed to be used when referencing them without introducing
an ambiguity. Here is an example. We start with strings:

julia> using DataFrames

julia> df = DataFrame("col1" => 1, "col 2" => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df."col1"
1-element Vector{Int64}:
 1

julia> df."col 2"
1-element Vector{Int64}:
 2

julia> df[:, "col1"]
1-element Vector{Int64}:
 1

julia> df[:, "col 2"]
1-element Vector{Int64}:
 2

Now we try the same with symbols:

julia> df = DataFrame(:col1 => 1, Symbol("col 2") => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df.col1
1-element Vector{Int64}:
 1

julia> getproperty(df, Symbol("col 2"))
1-element Vector{Int64}:
 2

julia> df[:, :col1]
1-element Vector{Int64}:
 1

julia> df[:, Symbol("col 2")]
1-element Vector{Int64}:
 2

We now see the first difference, which we have already discussed. If all column
names are valid variable names, symbols are more convenient. If they are not
(e.g. they contain spaces), strings are more convenient.
As an extreme case, note that the convenience syntax for getproperty using the
. accessor does not work for symbols containing spaces and we need to make
an explicit getproperty call.

The second important aspect is that all functions that manipulate column
names in DataFrames.jl work with strings. This is natural, as symbol
manipulation is not supported by Julia. Here is a combo showing this in action:

julia> select(df, Cols(startswith("c")) .=> identity .=> uppercase)
1×2 DataFrame
 Row │ COL1   COL 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The Cols(startswith("c")) .=> identity .=> uppercase operation specification
syntax means that we want to pick all columns whose name starts with "c"
(note that the startswith function expects a string as an input), keep them
unchanged (the identity function) and uppercase their names in the output
(note that uppercase expects a string as an input).
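To see why strings are the natural input here, it can help to apply the same two functions to a plain vector of names outside of any data frame. The sketch below uses only Base Julia; col_names is an illustrative stand-in for names(df):

```julia
col_names = ["col1", "col 2", "other"]

# startswith("c") returns a one-argument predicate that expects a string.
pred = startswith("c")
selected = filter(pred, col_names)    # ["col1", "col 2"]

# uppercase maps each selected name to its new name.
renamed = uppercase.(selected)        # ["COL1", "COL 2"]
```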

Finally, you might ask about comparison of speed of column lookup using strings
vs symbols. Here is a simple test:

julia> @btime $df.col1
  7.500 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

julia> @btime $df."col1"
  38.446 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

As you can see there is a noticeable performance difference. However, please
note that both these operations are very fast. Therefore, in practice,
column lookup is almost never a performance bottleneck in operations on
data frames (usually what you do with the column picked from a data frame
is more expensive by several orders of magnitude). So a practical recommendation
is that, most of the time, performance should not be a reason for choosing
symbols over strings.

If you really need speed then column lookup using an integer index is fastest:

julia> @btime $df[!, 1]
  4.100 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

However, this way of picking columns is not recommended and you should use it
only if you are sure what column is stored under a given number in a data frame.
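If you do want integer indexing, a safer pattern is to look the position up by name once and then reuse it. This is a sketch that rebuilds the df from above; columnindex is a DataFrames.jl function that returns 0 when the column is not found:

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# Resolve the name to a position once ...
idx = columnindex(df, :col1)    # 1
idx == 0 && error("column :col1 not found")

# ... and then reuse the integer where lookup speed matters.
col = df[!, idx]
```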

Additional practical considerations of using strings and symbols in DataFrames.jl

The first tip is that you can get a list of column names of a data frame as
strings and as symbols in DataFrames.jl using the names and propertynames
functions respectively:

julia> names(df)
2-element Vector{String}:
 "col1"
 "col 2"

julia> propertynames(df)
2-element Vector{Symbol}:
 :col1
 Symbol("col 2")

The second important consideration is that in DataFramesMeta.jl only symbols are
considered to be column identifiers in operations by default.
Therefore you can write:

julia> using DataFramesMeta

julia> @rselect(df, :out = :col1 + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

If you want to use strings instead you have to escape them with $:

julia> @rselect(df, $"out" = $"col1" + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

Conclusions

The post today was long, but the conclusion is simple. In DataFrames.jl
you can use both symbols and strings to get access to a column of a data frame.
The main consideration when picking one or the other should be your
convenience.

Getting ready for JuliaCon 2022

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/22/juliacon2022.html

Introduction

During JuliaCon 2022 I will run a
tutorial on DataFrames.jl.
In the tutorial I will focus on ways you can write transformation operations
using the select/transform/combine functions and the operation
specification syntax.

In this post I want to give you a preview of the topics I am going to cover
in my tutorial.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
and DataFramesMeta.jl 0.12.

What is operation specification syntax?

If you are new to DataFrames.jl then you probably wonder what
operation specification syntax is. Fortunately it is quite easy.

If you want to transform data using one of the select/transform/combine
functions you can specify the transformation you want to perform using the
following general syntax:

[source columns] => [function] => [output columns]

For example, if you write :a => mean => :b you will get the mean of column :a
and store it in the output data frame in column :b (visually: input column :a is
passed to the mean function whose output is passed to output column :b).
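Such an expression is just a pair of pairs built with Julia's => operator, which associates to the right. Here is a small sketch in Base Julia (plus the Statistics standard library) that takes one apart; the variable names are illustrative:

```julia
using Statistics

spec = :a => mean => :b

# => is right-associative, so this is :a => (mean => :b).
source = spec.first            # :a
fun    = spec.second.first     # mean
target = spec.second.second    # :b

# Applying the pieces by hand to a plain vector:
a = [1, 2, 3, 4]
result = fun(a)                # 2.5
```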

Additionally, if you prefer an assignment style of specifying operations you can
use the DataFramesMeta.jl package, which allows you to write the same as just
:b = mean(:a). To use DataFramesMeta.jl you need to prefix the appropriate
function name with @ (to turn it into a macro).

Let me show you a minimal working example of such a transformation.
We compute the mean of column :val by groups defined by the :id column:

julia> using DataFramesMeta

julia> using Statistics

julia> df = DataFrame(id=[1, 2, 1, 2, 1, 2], val=1:6)
6×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6

julia> combine(groupby(df, :id), :val => mean => :mean_val)
2×2 DataFrame
 Row │ id     mean_val
     │ Int64  Float64
─────┼─────────────────
   1 │     1       3.0
   2 │     2       4.0

julia> @combine(groupby(df, :id), :mean_val = mean(:val))
2×2 DataFrame
 Row │ id     mean_val
     │ Int64  Float64
─────┼─────────────────
   1 │     1       3.0
   2 │     2       4.0

This is a simple example of what operation specification syntax
can do. In this post let me give you a more complex example (I explain
all the details of how it works in my upcoming tutorial).

The question from StackOverflow

In this StackOverflow question the user wanted to analyze the iris data
set and get the 25% and 75% quantiles of the Sepal.Length column.

The R code that the StackOverflow user provided was:

> library(dplyr)

> iris %>%
+        group_by(Species) %>%
+        summarise(
+           quantile(Sepal.Length, c(.25, .75)) %>%
+              matrix(nrow = 1) %>%
+              as.data.frame() %>%
+              setNames(paste0("Sepal.Length", c(.25, .75)))
+     )
# A tibble: 3 x 3
  Species    Sepal.Length0.25 Sepal.Length0.75
  <fct>                 <dbl>            <dbl>
1 setosa                 4.8               5.2
2 versicolor             5.6               6.3
3 virginica              6.22              6.9

The question was how to achieve the same with DataFrames.jl and
DataFramesMeta.jl?

Here is a solution. First we need to load the iris dataset (in the code I take
advantage of the fact that this dataset is bundled into DataFrames.jl
installation folders):

julia> using CSV

julia> iris = CSV.read(joinpath(dirname(pathof(DataFrames)),
                                "..", "docs", "src", "assets", "iris.csv"),
                       DataFrame)
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String15
─────┼──────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  Iris-setosa
   2 │         4.9         3.0          1.4         0.2  Iris-setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
 149 │         6.2         3.4          5.4         2.3  Iris-virginica
 150 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                        146 rows omitted

Now we are ready to use the combine function and the operation specification
syntax:

julia> combine(groupby(iris, :Species),
               :SepalLength =>
               (x -> [quantile(x, [0.25, 0.75])]) =>
               string.("SepalLength", [0.25, 0.75]))
3×3 DataFrame
 Row │ Species          SepalLength0.25  SepalLength0.75
     │ String15         Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │ Iris-setosa                4.8                5.2
   2 │ Iris-versicolor            5.6                6.3
   3 │ Iris-virginica             6.225              6.9
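The two less obvious pieces of this call can be checked in isolation. The sketch below uses a small made-up vector rather than the iris data. My reading of the extra [...] wrapping is that the function then returns a one-element collection, that is, one output row per group whose entries are spread over the two named columns:

```julia
using Statistics

x = [1, 2, 3, 4, 5]              # made-up data, not from iris

# quantile with a vector of probabilities returns a vector.
q = quantile(x, [0.25, 0.75])    # [2.0, 4.0]

# Wrapping it gives a 1-element vector holding that 2-element vector.
wrapped = [q]
length(wrapped)                  # 1

# Broadcasting string builds the output column names.
out_names = string.("SepalLength", [0.25, 0.75])
# ["SepalLength0.25", "SepalLength0.75"]
```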

With DataFramesMeta.jl you would write this using the @combine macro as
follows (I additionally show here how to use operation chaining with @chain):

julia> @chain iris begin
           groupby(:Species)
           @combine($["SepalLength0.25", "SepalLength0.75"] = [quantile(:SepalLength, [0.25, 0.75])])
       end
3×3 DataFrame
 Row │ Species          SepalLength0.25  SepalLength0.75
     │ String15         Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │ Iris-setosa                4.8                5.2
   2 │ Iris-versicolor            5.6                6.3
   3 │ Iris-virginica             6.225              6.9

Conclusions

The operation specification syntax was designed to allow doing simple
transformations in an easy way, but at the same time to also support quite
complex operations, like the one we did on the iris data frame.

If you would like to hear a detailed explanation of how to write such code
please join me during the upcoming workshop.

You can find all the examples that I will use during the workshop in the
accompanying GitHub repository.

How DataFrames.jl helps fighting piracy

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/22/regex.html

Introduction

During JuliaCon 2022 I gave a tutorial on DataFrames.jl.
You can find its recording on YouTube and all source code on GitHub.

This post is a follow up to one of the questions that I got during
the workshop. The topic of the discussion was applying the same function to
many columns of a data frame. Since the question is quite technical I will
first give you a brief introduction to the topic and next dive deep into the
issue.

I hope that this will be useful material even for people who do not use
DataFrames.jl, as we will explore the consequences of the
avoid type piracy rule that the Julia Manual recommends.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4.

How can one apply a function to multiple columns in DataFrames.jl?

Let us create a sample data frame first:

julia> using DataFrames

julia> df = DataFrame(a1=1:2, b1=3:4, a2=5:6)
2×3 DataFrame
 Row │ a1     b1     a2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      5
   2 │     2      4      6

Next I define a simple function that allows us to inspect what arguments
it received:

julia> inspect(x...) = Ref(x)
inspect (generic function with 1 method)

In this function I wrap a tuple of arguments in Ref as in DataFrames.jl
Ref protects the wrapped value against being expanded.
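To see what inspect returns outside of DataFrames.jl, you can call it directly and unwrap the result with []. Here is a minimal sketch in Base Julia:

```julia
inspect(x...) = Ref(x)

r = inspect([1, 2], [5, 6])

# r is a Ref holding the tuple of all positional arguments.
r[]            # ([1, 2], [5, 6])
length(r[])    # 2, one entry per argument
```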

Let us check this function in action on some simple examples of combine
transformation (if you do not know the operation specification syntax please
check my tutorial on DataFrames.jl I have linked above for an
introduction):

julia> combine(df,
               :a1 => inspect,
               r"a" => inspect,
               Cols(endswith("1")) => inspect)
1×3 DataFrame
 Row │ a1_inspect  a1_a2_inspect     a1_b1_inspect
     │ Tuple…      Tuple…            Tuple…
─────┼────────────────────────────────────────────────
   1 │ ([1, 2],)   ([1, 2], [5, 6])  ([1, 2], [3, 4])

julia> combine(df,
               AsTable(:a1) => inspect,
               AsTable(r"a") => inspect,
               AsTable(Cols(endswith("1"))) => inspect)
1×3 DataFrame
 Row │ a1_inspect      a1_a2_inspect             a1_b1_inspect
     │ Tuple…          Tuple…                    Tuple…
─────┼─────────────────────────────────────────────────────────────────────
   1 │ ((a1=[1, 2],),) ((a1=[1, 2], a2=[5, 6]),) ((a1=[1, 2], b1=[3, 4]),)

What we can see in these examples is the following:

  • r"a" picks all columns whose names match this regular expression
    (in this case contain "a");
  • Cols(endswith("1")) picks all columns whose names meet the endswith("1")
    predicate (that is, end with "1");
  • by default the selected columns are passed as multiple positional arguments to
    the executed function;
  • if you wrap the selected columns in AsTable then they get passed as a single
    positional argument, a NamedTuple.

It is crucially important to understand at this point that r"a" and
Cols(endswith("1")) column selectors do not get resolved before being passed
to combine:

julia> r"a" => inspect
r"a" => inspect

julia> Cols(endswith("1")) => inspect
Cols{Tuple{Base.Fix2{typeof(endswith), String}}}((Base.Fix2{typeof(endswith), String}(endswith, "1"),)) => inspect

What I mean by resolved is that the expression will get its meaning (i.e. will
determine which columns it actually selects) inside the combine function, in
the context of the data frame that is passed as the first argument to combine.
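Put differently, a selector is an ordinary value until something pairs it with a list of column names. The sketch below mimics that step by hand in Base Julia; col_names stands in for the names of df, and this is not how DataFrames.jl implements it internally:

```julia
col_names = ["a1", "b1", "a2"]

# The selector alone does not know about any columns.
sel = r"a"

# Resolution needs the column names as context.
resolved_regex = filter(n -> occursin(sel, n), col_names)   # ["a1", "a2"]
resolved_pred  = filter(endswith("1"), col_names)           # ["a1", "b1"]
```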

Having seen these basic examples, let us check how one can apply a given
function to multiple columns individually. You can do it e.g. like this:

julia> combine(df, :a1 => inspect, :a2 => inspect)
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

but there is an easier way. You can do it like this:

julia> combine(df, [:a1, :a2] .=> inspect)
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

The point is that combine (and other functions in DataFrames.jl) accept
vectors and matrices of operation specification syntax expressions and above
I create such a vector using broadcasting with .=>.

Let us check:

julia> [:a1, :a2] .=> inspect
2-element Vector{Pair{Symbol, typeof(inspect)}}:
 :a1 => inspect
 :a2 => inspect

julia> combine(df, [:a1 => inspect, :a2 => inspect])
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

Having the information I have shared above we are now ready to face the pirates.

Using column selectors to pick columns to which we apply the function

We have seen above that r"a" and Cols(endswith("1")) do not get resolved
before they get passed to combine. Therefore how can we apply the inspect
function to all columns selected by them?

The basic approach is to use the names function like this:

julia> names(df, r"a")
2-element Vector{String}:
 "a1"
 "a2"

julia> names(df, r"a") .=> inspect
2-element Vector{Pair{String, typeof(inspect)}}:
 "a1" => inspect
 "a2" => inspect

julia> combine(df, names(df, r"a") .=> inspect)
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

This method works, but it is a bit heavy-handed, as it requires us to use
the names function that needs the df as its first argument. This duplication
of information is not optimal.

Therefore we are tempted to skip the names(df, r"a") and just write
r"a" .=> inspect. Let us see what it gives us:

julia> combine(df, r"a" .=> inspect)
1×1 DataFrame
 Row │ a1_a2_inspect
     │ Tuple…
─────┼──────────────────
   1 │ ([1, 2], [5, 6])

julia> r"a" .=> inspect
r"a" => inspect

Unfortunately this is not what we expected. The reason is that, as you can see,
r"a" is treated by broadcasting as a scalar. Here we need to note that
the r"a" .=> inspect is resolved before the value of this expression is
passed to combine, so evaluation of this expression cannot be made by Julia
in the context of the df data frame.
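You can observe the scalar treatment without DataFrames.jl. Here is a short sketch in Base Julia, using identity in place of inspect:

```julia
# A Regex is wrapped in a Ref for broadcasting, i.e. treated as a scalar.
Base.Broadcast.broadcastable(r"a") isa Ref    # true

# So broadcasting => over it yields a single Pair ...
single = r"a" .=> identity
single isa Pair                               # true

# ... while a vector of names yields one Pair per name.
many = ["a1", "a2"] .=> identity
length(many)                                  # 2
```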

Let us check what happens if we use the same approach with
Cols(endswith("1")) .=> inspect:

julia> combine(df, Cols(endswith("1")) .=> inspect)
1×2 DataFrame
 Row │ a1_inspect  b1_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([3, 4],)

This time we got what we wanted. It seems that this expression had to be
resolved only after it got passed to combine. Let us inspect it:

julia> Cols(endswith("1")) .=> inspect
DataAPI.BroadcastedSelector{Cols{Tuple{Base.Fix2{typeof(endswith), String}}}}(Cols{Tuple{Base.Fix2{typeof(endswith), String}}}((Base.Fix2{typeof(endswith), String}(endswith, "1"),))) => inspect

We see an unfamiliar DataAPI.BroadcastedSelector type. Its role is exactly to
delay the final resolution of the broadcasted operation until the expression
is processed in combine. When combine sees a value of type
DataAPI.BroadcastedSelector it does its post-processing in the context
of the df data frame to give us the desired result.

So how does this relate to type piracy? The answer is:

  • in DataFrames.jl we could give Cols(endswith("1")) .=> inspect
    a delayed broadcasting behavior because the Cols selector
    is defined in DataAPI.jl, so we can customize how it is handled in broadcasting;
  • we could not do the same for r"a", because the Regex type is defined in
    Base Julia, and it has a defined broadcasting behavior. Packages like
    DataFrames.jl should not change how r"a" is handled by broadcasting because
    it would be type piracy.
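The first point can be illustrated with a toy type of your own. The sketch below is hypothetical: MySelector, DelayedSelector, and resolve are names invented for this example and are not the DataAPI.jl or DataFrames.jl implementation. Because MySelector is defined here, customizing its broadcasting is not type piracy:

```julia
# A selector type we own, holding a predicate on column names.
struct MySelector{F}
    pred::F
end

# Marker wrapper meaning "expand me later, once column names are known".
struct DelayedSelector{S}
    sel::S
end

# Customizing broadcasting for our own type is allowed.
Base.Broadcast.broadcastable(s::MySelector) = Ref(DelayedSelector(s))

# Stand-in for what a function like combine could do with the pair.
function resolve(col_names, spec::Pair{<:DelayedSelector})
    picked = filter(spec.first.sel.pred, col_names)
    return picked .=> Ref(spec.second)
end

spec = MySelector(endswith("1")) .=> identity   # DelayedSelector(...) => identity
resolve(["a1", "b1", "a2"], spec)               # ["a1" => identity, "b1" => identity]
```

Defining the same broadcastable method for Regex in a package would, by contrast, change the behavior of a Base type for every user of that package, which is exactly what the rule asks us to avoid.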

Conclusions

So what do we learn from today’s post?

  • for DataFrames.jl users: remember that using a regular expression as a column
    selector, while convenient, does not have the nice broadcasting support that
    other selectors, like Cols, Not, Between, and All, have;
  • for general audience: Julia is really flexible and allows packages to
    customize almost everything (in our case how broadcasting works); the only
    limitation is that such customization should be done on the types that you
    define yourself; changing the behavior of types defined in external packages
    is not recommended and it is called type piracy.

If you want to learn what exactly happens when you pass
Cols(endswith("1")) .=> inspect to combine
you can check it here and here.