Author Archives: Blog by Bogumił Kamiński

Sorting data by a transformation of columns in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/12/sorting.html

Introduction

I have recently been asked several times how to sort a data frame by some
transformation of its columns. In this post I will show you how this can
currently be done.

The post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

Before we start, let us create the data frame we will work with. It contains
the coordinates of some points in three-dimensional space.

julia> using DataFrames

julia> using LinearAlgebra

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10), y=rand(10), z=rand(10)) .- 0.5
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.0908446   0.148882   0.450498
   2 │  0.266797   -0.489094   0.46467
   3 │  0.0662374  -0.433577   0.445775
   4 │ -0.0399147   0.456753   0.289904
   5 │  0.294026    0.146691   0.32116
   6 │  0.354147   -0.387514  -0.46584
   7 │ -0.299414   -0.223979  -0.405456
   8 │ -0.201386    0.151664  -0.185074
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.0796722   0.342714  -0.125813

Basic sorting

You can sort this data frame by column :y like this:

julia> sort(df, :y)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.266797   -0.489094   0.46467
   2 │ -0.253163   -0.443358  -0.37219
   3 │  0.0662374  -0.433577   0.445775
   4 │  0.354147   -0.387514  -0.46584
   5 │ -0.299414   -0.223979  -0.405456
   6 │  0.294026    0.146691   0.32116
   7 │  0.0908446   0.148882   0.450498
   8 │ -0.201386    0.151664  -0.185074
   9 │  0.0796722   0.342714  -0.125813
  10 │ -0.0399147   0.456753   0.289904

If you want to sort it in reverse just do:

julia> sort(df, :y, rev=true)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.0399147   0.456753   0.289904
   2 │  0.0796722   0.342714  -0.125813
   3 │ -0.201386    0.151664  -0.185074
   4 │  0.0908446   0.148882   0.450498
   5 │  0.294026    0.146691   0.32116
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.354147   -0.387514  -0.46584
   8 │  0.0662374  -0.433577   0.445775
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.266797   -0.489094   0.46467

or

julia> sort(df, order(:y, rev=true))
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.0399147   0.456753   0.289904
   2 │  0.0796722   0.342714  -0.125813
   3 │ -0.201386    0.151664  -0.185074
   4 │  0.0908446   0.148882   0.450498
   5 │  0.294026    0.146691   0.32116
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.354147   -0.387514  -0.46584
   8 │  0.0662374  -0.433577   0.445775
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.266797   -0.489094   0.46467

Using order is useful if you want to sort a data frame by several columns
and apply different ordering rules to them.

If you want to apply a transformation to a single column and sort the data
frame by the transformed values, use the by option:

julia> sort(df, :y, by=abs)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.294026    0.146691   0.32116
   2 │  0.0908446   0.148882   0.450498
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.299414   -0.223979  -0.405456
   5 │  0.0796722   0.342714  -0.125813
   6 │  0.354147   -0.387514  -0.46584
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │ -0.0399147   0.456753   0.289904
  10 │  0.266797   -0.489094   0.46467

or equivalently

julia> sort(df, order(:y, by=abs))
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.294026    0.146691   0.32116
   2 │  0.0908446   0.148882   0.450498
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.299414   -0.223979  -0.405456
   5 │  0.0796722   0.342714  -0.125813
   6 │  0.354147   -0.387514  -0.46584
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │ -0.0399147   0.456753   0.289904
  10 │  0.266797   -0.489094   0.46467

These patterns naturally extend to multiple columns, and sorting is performed
lexicographically. Here is an example:

julia> df2 = DataFrame(x=rand(Bool, 16), y=rand(Bool, 16), z=rand(Bool, 16))
16×3 DataFrame
 Row │ x      y      z
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │ false   true  false
   2 │  true   true   true
   3 │  true  false   true
   4 │ false  false   true
   5 │  true  false   true
   6 │  true  false  false
   7 │  true  false  false
   8 │ false  false  false
   9 │ false  false   true
  10 │ false   true   true
  11 │  true  false   true
  12 │ false  false  false
  13 │  true  false  false
  14 │  true  false  false
  15 │ false  false  false
  16 │ false   true  false

julia> sort(df2, [:y, order(:z, rev=true), :x])
16×3 DataFrame
 Row │ x      y      z
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │ false  false   true
   2 │ false  false   true
   3 │  true  false   true
   4 │  true  false   true
   5 │  true  false   true
   6 │ false  false  false
   7 │ false  false  false
   8 │ false  false  false
   9 │  true  false  false
  10 │  true  false  false
  11 │  true  false  false
  12 │  true  false  false
  13 │ false   true   true
  14 │  true   true   true
  15 │ false   true  false
  16 │ false   true  false

However, what if we want to sort a data frame on a function of multiple
columns taken together?

Sorting on multiple columns considered jointly

Going back to our df data frame, what if we wanted to sort it by the distance
of each point from the origin?

In this case the sortperm function is useful. What you need to
do is to create a temporary object, get its sortperm, and apply it to a
source data frame. Here is how it is done in practice:

julia> df[sortperm(norm.(eachrow(df))), :]
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.201386    0.151664  -0.185074
   2 │  0.0796722   0.342714  -0.125813
   3 │  0.294026    0.146691   0.32116
   4 │  0.0908446   0.148882   0.450498
   5 │ -0.0399147   0.456753   0.289904
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │  0.354147   -0.387514  -0.46584
  10 │  0.266797   -0.489094   0.46467

A nice thing is that sortperm also works for data frames, so if you wanted to
sort the data frame by the sign of :x and then by the sum of :y and :z
columns you could write:

julia> df[sortperm(select(df, :x => ByRow(sign), [:y, :z] => +)), :]
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.253163   -0.443358  -0.37219
   2 │ -0.299414   -0.223979  -0.405456
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.0399147   0.456753   0.289904
   5 │  0.354147   -0.387514  -0.46584
   6 │  0.266797   -0.489094   0.46467
   7 │  0.0662374  -0.433577   0.445775
   8 │  0.0796722   0.342714  -0.125813
   9 │  0.294026    0.146691   0.32116
  10 │  0.0908446   0.148882   0.450498

Conclusion

The thing to remember is that because a data frame fully supports standard
matrix-like indexing, you can easily reorder it using the sortperm function
applied to an object other than the original data frame.

However, since this feature request is raised quite often, we are currently
discussing how to add support for it to the standard sort syntax. If
you are interested in the details you can check this issue.

What is new in PooledArrays.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/05/pooledarrays.html

Introduction

Recently PooledArrays.jl 1.2.1 was released. The most significant change
since the 1.0 release is improved performance of the basic operations
getindex, copy, copy!, and copyto!. The effect of the change is
especially significant for PooledArrays that have large pools. This change is
one of the steps towards making Julia run fast in the Database-like ops
benchmark for joins.

Let me start with the examples and then I will comment on the internals.
The post was tested under Julia 1.6.0-rc1.

The Benchmarks

I will just share a recording of a Julia session doing the benchmarks. We start
with PooledArrays.jl 1.0.0:

(@v1.6) pkg> activate .
  Activating new environment at `~/Project.toml`

(bkamins) pkg> add PooledArrays@1.0.0
   Resolving package versions...
    Updating `~/Project.toml`
  [2dfb63ee] + PooledArrays v1.0.0
    Updating `~/Manifest.toml`
  [9a962f9c] + DataAPI v1.6.0
  [2dfb63ee] + PooledArrays v1.0.0
  Progress [========================================>]  1/1
1 dependency successfully precompiled in 2 seconds (1 already precompiled)

julia> using BenchmarkTools, PooledArrays

julia> x = PooledArray(string.(1:10^6));

julia> @benchmark copy($x)
BenchmarkTools.Trial:
  memory estimate:  37.44 MiB
  allocs estimate:  13
  --------------
  minimum time:     16.109 ms (0.00% GC)
  median time:      17.820 ms (0.00% GC)
  mean time:        29.851 ms (40.25% GC)
  maximum time:     175.307 ms (88.17% GC)
  --------------
  samples:          168
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial:
  memory estimate:  33.63 MiB
  allocs estimate:  12
  --------------
  minimum time:     15.334 ms (0.00% GC)
  median time:      16.929 ms (0.00% GC)
  mean time:        30.654 ms (44.06% GC)
  maximum time:     188.496 ms (89.90% GC)
  --------------
  samples:          164
  evals/sample:     1

In order to assess if the timings are good or bad let us do the same operations
using a plain Vector{String}:

julia> x = string.(1:10^6);

julia> @benchmark copy($x)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     395.687 μs (0.00% GC)
  median time:      441.842 μs (0.00% GC)
  mean time:        757.598 μs (38.52% GC)
  maximum time:     6.194 ms (90.51% GC)
  --------------
  samples:          6598
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.310 ns (0.00% GC)
  median time:      26.013 ns (0.00% GC)
  mean time:        34.384 ns (8.88% GC)
  maximum time:     2.433 μs (98.05% GC)
  --------------
  samples:          10000
  evals/sample:     992

As you can see, things are really bad in PooledArrays.jl 1.0.0 compared to a
plain vector. Now start a fresh Julia session and install the 1.2.1 release of
the package.

(bkamins) pkg> activate .
  Activating environment at `~/Project.toml`

(bkamins) pkg> add PooledArrays@1.2.1
   Resolving package versions...
    Updating `~/Project.toml`
  [2dfb63ee] ↑ PooledArrays v1.0.0 ⇒ v1.2.1
    Updating `~/Manifest.toml`
  [2dfb63ee] ↑ PooledArrays v1.0.0 ⇒ v1.2.1
  [9fa8497b] + Future
  [9a3f8284] + Random
  [9e88b42a] + Serialization

julia> using BenchmarkTools, PooledArrays

julia> x = PooledArray(string.(1:10^6));

julia> @benchmark copy($x)
BenchmarkTools.Trial:
  memory estimate:  3.81 MiB
  allocs estimate:  4
  --------------
  minimum time:     893.391 μs (0.00% GC)
  median time:      948.740 μs (0.00% GC)
  mean time:        1.009 ms (5.08% GC)
  maximum time:     103.288 ms (98.55% GC)
  --------------
  samples:          4953
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  3
  --------------
  minimum time:     47.806 ns (0.00% GC)
  median time:      53.053 ns (0.00% GC)
  mean time:        184.150 ns (44.04% GC)
  maximum time:     55.779 μs (68.58% GC)
  --------------
  samples:          10000
  evals/sample:     976

This looks better. You still have to pay some cost over a Vector{String}, but
it is much smaller (the cost is due to the fact that the PooledArray
constructor performs some consistency checks of the passed data to ensure
extra safety).

You can expect other operations that take a PooledArray or its view and
produce a PooledArray (such as copyto!) to experience similar speedups.

So, what has changed between the 1.0 and 1.2 releases of PooledArrays.jl?

The Internals

In order to understand why the speedups were possible one needs to understand
first why the original code was slow. The reason is that PooledArray struct
contains three key fields:

  • refs: a vector of integer references (levels) of a PooledArray;
  • pool: a vector giving a mapping from references to actual values;
  • invpool: a dictionary providing a reverse mapping – from values to references.
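
To illustrate how the three fields relate to each other, here is a toy model.
It is only a sketch: the type ToyPooled and the function getvalue are
hypothetical names, and this is not the code used in PooledArrays.jl.

```julia
# Toy model of a pooled vector of strings (an assumption for illustration).
struct ToyPooled
    refs::Vector{UInt32}          # one integer reference per element
    pool::Vector{String}          # reference -> value
    invpool::Dict{String,UInt32}  # value -> reference
end

function ToyPooled(values::Vector{String})
    refs = UInt32[]
    pool = String[]
    invpool = Dict{String,UInt32}()
    for v in values
        # Look up the reference of v; add a new level if v was not seen yet.
        ref = get!(invpool, v) do
            push!(pool, v)
            UInt32(length(pool))
        end
        push!(refs, ref)
    end
    return ToyPooled(refs, pool, invpool)
end

# Reading element i goes through refs and then through pool.
getvalue(p::ToyPooled, i::Integer) = p.pool[p.refs[i]]

p = ToyPooled(["a", "b", "a", "c", "b"])
```

In this example refs has five entries while pool and invpool have three
entries each. When all values are distinct, as in the benchmark above, pool
and invpool are as long as refs.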

As you can see this data structure is quite heavy if the size of pool relative
to the size of the PooledArray is large. In the example above I have shown an
extreme case where they were equal. But even if pool has e.g. 10% of size of
the PooledArray the cost is noticeable.

In PooledArrays.jl 1.0 all three fields refs, pool and invpool were always
copied when a new PooledArray was created. This was expensive. What
PooledArrays.jl 1.2 introduces is copy-on-write behavior, which is well known
from R and often asked about in Julia. We now copy only refs. The
pool and invpool fields are shared across PooledArrays, which is a safe
thing to do as long as you are not changing the set of levels in the pool.

So where does the copy on write happen? PooledArrays.jl is now aware if you
share pool and invpool across several arrays and if this is the case and you
add levels to PooledArray then pool and invpool get copied. So essentially
we have a lazy mechanism that copies them only if needed.
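
The lazy mechanism described above can be sketched with a toy type that
counts how many arrays share a pool. Everything here (CowPooled, cowcopy,
addlevel! and the sharers counter) is a hypothetical illustration of the
idea and an assumption on my side, not the PooledArrays.jl implementation.
In particular, this toy is not thread safe.

```julia
mutable struct CowPooled
    refs::Vector{Int}
    pool::Vector{String}
    invpool::Dict{String,Int}
    sharers::Base.RefValue{Int}  # number of arrays sharing pool and invpool
end

# Cheap copy: only refs is copied, pool and invpool are shared.
function cowcopy(p::CowPooled)
    p.sharers[] += 1
    return CowPooled(copy(p.refs), p.pool, p.invpool, p.sharers)
end

# Adding a level is the only place where pool and invpool may get copied.
function addlevel!(p::CowPooled, v::String)
    haskey(p.invpool, v) && return p.invpool[v]
    if p.sharers[] > 1
        # Someone else uses the same pool: detach and work on private copies.
        p.sharers[] -= 1
        p.pool = copy(p.pool)
        p.invpool = copy(p.invpool)
        p.sharers = Ref(1)
    end
    push!(p.pool, v)
    p.invpool[v] = length(p.pool)
    return length(p.pool)
end

a = CowPooled([1, 2, 1], ["x", "y"], Dict("x" => 1, "y" => 2), Ref(1))
b = cowcopy(a)            # b shares pool and invpool with a
shared_before = b.pool === a.pool
addlevel!(b, "z")         # b gets its own pool, a is left untouched
shared_after = b.pool === a.pool
```

Here shared_before is true and shared_after is false, so the expensive copy
of the pool happens only at the moment a new level is added to a shared pool.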

One could ask why we copy pool and invpool at all, since we could just keep
sharing them and never pay the cost of copying. The decision was
guided here by two considerations:

  • in practice new levels are added to PooledArray quite rarely (you mostly do
    it when constructing an initial source PooledArray);
  • it is safer to copy the pool and invpool in consideration of potential
    multi-threaded usages of PooledArray (where many tricky corner cases can
    happen).

Conclusions

The main takeaway is that you can expect your code using PooledArrays.jl to be
much faster since the 1.2 release.

Before I finish let me comment on one important feature of PooledArray object
to keep in mind in the multi-threaded applications I have mentioned above.

It is currently not thread safe to add new levels to the pool
of PooledArray. So the rule is: if you add levels to PooledArray make sure
you are not performing any other operations on it in other threads.

However, you can safely perform any operations in multi-threaded context that do
not change the pool. So, in particular (barring standard considerations of
correct multi-threaded code), you are allowed to use setindex! on
PooledArray as long as you do not add new levels.

How to check the version of a package?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/02/27/pkg_version.html

Introduction

I think the most common type of question related to the DataFrames.jl package
is a user reporting that some functionality does not work as documented.

Sometimes it is indeed a bug but in the majority of cases the reason is that
the user does not have a correct version of the package installed. In this post
I discuss several ways of checking the version of the package one has in
a current project environment.

This post was written under Julia 1.6.0-rc1, DataFrames.jl 0.22.5 and Chain.jl
0.4.4 (so as usual you can expect some exercises in using DataFrames.jl).

Elementary methods

If you are in an interactive mode then you have two basic options. The first one
uses package manager mode. Press ] in Julia REPL and then write the following:

(bkamins) pkg> status
      Status `~/Project.toml`
  [4c9194b5] ABCDGraphGenerator v0.1.0 `https://github.com/bkamins/ABCDGraphGenerator.jl#master`
  [8be319e6] Chain v0.4.4
  [9a962f9c] DataAPI v1.6.0 `~/.julia/dev/DataAPI`
  [a93c6f00] DataFrames v0.22.5

or if you are interested in a particular package do:

(bkamins) pkg> status DataFrames
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.22.5

In the above output it is worth noting two common scenarios:

  • ABCDGraphGenerator.jl is tracking master branch of a GitHub repository as a
    source (so it means it was not installed from Julia registry);
  • DataAPI.jl is checked out for development (using dev command) and the package
    is tracking a local folder.

Alternatively, we could have generated the same output using the Pkg API like
this:

julia> using Pkg

julia> Pkg.status()
      Status `~/Project.toml`
  [4c9194b5] ABCDGraphGenerator v0.1.0 `https://github.com/bkamins/ABCDGraphGenerator.jl#master`
  [8be319e6] Chain v0.4.4
  [9a962f9c] DataAPI v1.6.0 `~/.julia/dev/DataAPI`
  [a93c6f00] DataFrames v0.22.5

julia> Pkg.status("DataFrames")
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.22.5

The downside of both approaches is that they only print information to the
screen. However, one is often interested in processing the status of installed
packages programmatically.

Working with Pkg.dependencies

The Pkg.dependencies function returns a dictionary mapping package UUIDs
to information about them. As you can check in the documentation string
of the function, the available information is stored in the following fields:

  • name: the name of the package
  • version: the version of the package (this is Nothing for stdlibs)
  • is_direct_dep: the package is a direct dependency
  • is_tracking_path: whether a package is directly tracking a directory
  • is_pinned: whether a package is pinned
  • source: the directory containing the source code for that package
  • dependencies: the dependencies of that package as a vector of UUIDs

Using Pkg.dependencies we can easily write a function that returns a version
of the package. Here is an example:

julia> using Chain

julia> get_pkg_version(name::AbstractString) =
           @chain Pkg.dependencies() begin
               values
               [x for x in _ if x.name == name]
               only
               _.version
           end
get_pkg_version (generic function with 1 method)

julia> get_pkg_version("DataFrames")
v"0.22.5"
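
Note that only requires exactly one match, so the function above throws an
error if the package is not present in the project environment. Below is a
sketch of a variant that does not need Chain.jl and returns nothing in that
case. The name pkg_version_or_nothing is hypothetical.

```julia
using Pkg

# Return the version of the package called `name` in the active project
# environment, or `nothing` if there is no such package (the version is also
# `nothing` for stdlibs, as described in the field list above).
function pkg_version_or_nothing(name::AbstractString)
    for info in values(Pkg.dependencies())
        info.name == name && return info.version
    end
    return nothing
end

missing_version = pkg_version_or_nothing("SurelyNotAnInstalledPackage")
```

Since nothing is returned both for stdlibs and for packages that are not
found, check the name field yourself if you need to tell these cases apart.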

Here is another example getting summary statistics about installed packages in a
data frame:

julia> using DataFrames

julia> get_pkg_status(;direct::Bool=true) =
           @chain Pkg.dependencies() begin
               values
               DataFrame
               direct ? _[_.is_direct_dep, :] : _
               select(:name, :version,
                      [:is_tracking_path, :is_tracking_repo, :is_tracking_registry] =>
                      ByRow((a, b, c) -> ["path", "repo", "registry"][a+2b+3c]) =>
                      :tracking)
           end
get_pkg_status (generic function with 1 method)

julia> get_pkg_status()
4×3 DataFrame
 Row │ name                version  tracking
     │ String              Union…   String
─────┼───────────────────────────────────────
   1 │ DataAPI             1.6.0    path
   2 │ DataFrames          0.22.5   registry
   3 │ Chain               0.4.4    registry
   4 │ ABCDGraphGenerator  0.1.0    repo

As you can see, I have chosen to show only the most essential information
about packages in the output: name, version and whether the package is
tracking a registry, a local path, or an external repository.

If you pass direct=false you get information about all available
packages (direct and indirect dependencies of the project). It is usually not
very useful, however, as the list tends to be long, as you can see here:

julia> get_pkg_status(direct=false)
62×3 DataFrame
 Row │ name                         version  tracking
     │ String                       Union…   String
─────┼────────────────────────────────────────────────
   1 │ OrderedCollections           1.4.0    registry
   2 │ LibSSH2_jll                           registry
   3 │ Statistics                            registry
   4 │ ArgTools                              registry
   5 │ Compat                       3.25.0   registry
   6 │ Reexport                     1.0.0    registry
   7 │ SharedArrays                          registry
  ⋮  │              ⋮                  ⋮        ⋮
  56 │ Dates                                 registry
  57 │ MbedTLS_jll                           registry
  58 │ Serialization                         registry
  59 │ IteratorInterfaceExtensions  1.0.0    registry
  60 │ Libdl                                 registry
  61 │ Artifacts                             registry
  62 │ InteractiveUtils                      registry
                                       48 rows omitted

Concluding remarks

I hope you might find these patterns useful in your work with the Julia language.

Before finishing, let me mention one other case that you might occasionally
need. The above examples show you the version of the package in your current
project environment. However, in one Julia session you can change the active
project environment many times. If you are interested in getting the version
of the currently loaded package, here is the way to do it (this will not
work for packages from stdlib as they are bundled with Julia and have a fixed
version):

julia> Pkg.TOML.parsefile(joinpath(pkgdir(DataFrames), "Project.toml"))["version"]
"0.22.5"

Let us check that indeed the loaded version does not change if we change project
environment:

(bkamins) pkg> status DataFrames
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.22.5

(bkamins) pkg> add DataFrames@0.21.8
   Resolving package versions...
    Updating `~/Project.toml`
  [a93c6f00] ↓ DataFrames v0.22.5 ⇒ v0.21.8
    Updating `~/Manifest.toml`
  [324d7699] ↓ CategoricalArrays v0.9.3 ⇒ v0.8.3
  [a8cc5b0e] - Crayons v4.0.4
  [a93c6f00] ↓ DataFrames v0.22.5 ⇒ v0.21.8
  [59287772] - Formatting v0.4.2
  [2dfb63ee] ↓ PooledArrays v1.1.0 ⇒ v0.5.3
  [08abe8d2] - PrettyTables v0.11.1
  [189a3867] ↓ Reexport v1.0.0 ⇒ v0.2.0
  Progress [========================================>]  3/3
  ? DataFrames
2 dependencies successfully precompiled in 2 seconds (21 already precompiled)
1 dependency failed but may be precompilable after restarting julia

(bkamins) pkg> status DataFrames
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.21.8

julia> Pkg.TOML.parsefile(joinpath(pkgdir(DataFrames), "Project.toml"))["version"]
"0.22.5"

and we see that although the version of the package in the project has changed,
the loaded version remains the same.
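
The one-liner above can be wrapped in a small helper. The sketch below is an
assumption on my side rather than an established recipe: loaded_version is a
hypothetical name, and the code relies on the pkgversion function that, to my
knowledge, is available in Base starting from Julia 1.9. On older Julia
versions it falls back to reading Project.toml as shown above.

```julia
using TOML

# Return the version of the loaded package that owns module `m`,
# or `nothing` if it cannot be determined.
function loaded_version(m::Module)
    # Newer Julia versions provide this functionality directly.
    isdefined(Base, :pkgversion) && return Base.pkgversion(m)
    dir = pkgdir(m)
    dir === nothing && return nothing          # e.g. Main has no package dir
    project = joinpath(dir, "Project.toml")
    isfile(project) || return nothing
    v = get(TOML.parsefile(project), "version", nothing)
    return v === nothing ? nothing : VersionNumber(v)
end

main_version = loaded_version(Main)
```

Unlike the one-liner, this helper returns a VersionNumber instead of a string,
so the result can be compared directly, for example with v"0.22".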