Author Archives: Blog by Bogumił Kamiński

Upcoming DataFrames.jl features

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/18/upcoming.html

Introduction

In this post I want to introduce two features of DataFrames.jl that will be
added in 1.4 release that are already available on main branch. The new
additions are:

  • extensions of the transformation mini-language;
  • pivot table functionality of the unstack function.

The post was written under Julia 1.7.0 and DataFrames.jl current main branch,
and DataFramesMeta.jl 0.10.

Environment setup

Since we want to work on a main branch we set it up first:

(@v1.7) pkg> add DataFrames#main
    Updating git-repo `https://github.com/JuliaData/DataFrames.jl.git`
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
    Updating `~/.julia/environments/v1.7/Project.toml`
  [a93c6f00] ~ DataFrames v1.3.2 ⇒ v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main`
    Updating `~/.julia/environments/v1.7/Manifest.toml`
  [a93c6f00] ~ DataFrames v1.3.2 ⇒ v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main`
Precompiling project...
  2 dependencies successfully precompiled in 20 seconds (47 already precompiled)

julia> using DataFrames

Extensions of the transformation mini-language

In DataFrames.jl 1.3 the nrow function has a special syntax as it does not
require passing source column (but optionally allows specifying output column
name) and returns the number of rows in a data frame or grouped data frame.
Here is an example:

julia> df = DataFrame(year=repeat([2020, 2021]; outer=3),
                      region=repeat(["north", "center", "south"]; inner=2),
                      sales=[500, 600, 700, 550, 630, 680])
6×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2020  north     500
   2 │  2021  north     600
   3 │  2020  center    700
   4 │  2021  center    550
   5 │  2020  south     630
   6 │  2021  south     680

julia> gdf = groupby(df, :year)
GroupedDataFrame with 2 groups based on key: year
First Group (3 rows): year = 2020
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2020  north     500
   2 │  2020  center    700
   3 │  2020  south     630
⋮
Last Group (3 rows): year = 2021
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2021  north     600
   2 │  2021  center    550
   3 │  2021  south     680

julia> combine(df, nrow)
1×1 DataFrame
 Row │ nrow
     │ Int64
─────┼───────
   1 │     6

julia> combine(gdf, nrow => :region_count)
2×2 DataFrame
 Row │ year   region_count
     │ Int64  Int64
─────┼─────────────────────
   1 │  2020             3
   2 │  2021             3

In DataFrames.jl 1.4 three more functions having the same special handling are
going to be added. They are:

  • eachindex: returning row number in a data frame or in group of grouped data
    frame;
  • proprow: returning proportion of rows in group
    (only supported for grouped data frame);
  • groupindices: returning group number (only supported for grouped data frame);

As you can see none of these functions require passing a concrete source column
to know how to produce their result and this is the reason why they follow the
non-standard syntax.

Let me show you how they work by example. Start with eachindex:

julia> transform(df, eachindex)
6×4 DataFrame
 Row │ year   region  sales  eachindex
     │ Int64  String  Int64  Int64
─────┼─────────────────────────────────
   1 │  2020  north     500          1
   2 │  2021  north     600          2
   3 │  2020  center    700          3
   4 │  2021  center    550          4
   5 │  2020  south     630          5
   6 │  2021  south     680          6

julia> transform(gdf, eachindex => :row_number)
6×4 DataFrame
 Row │ year   region  sales  row_number
     │ Int64  String  Int64  Int64
─────┼──────────────────────────────────
   1 │  2020  north     500           1
   2 │  2021  north     600           1
   3 │  2020  center    700           2
   4 │  2021  center    550           2
   5 │  2020  south     630           3
   6 │  2021  south     680           3

Note, that in the second example the returned row indices are within-group.

Now let us see how proprow and groupindices work. As I have noted above
they are only supported for GroupedDataFrame:

julia> combine(gdf, groupindices => :group_number, proprow)
2×3 DataFrame
 Row │ year   group_number  proprow
     │ Int64  Int64         Float64
─────┼──────────────────────────────
   1 │  2020             1      0.5
   2 │  2021             2      0.5

Let us use the proprow function to check the distribution of sum of two
tosses using standard six-sided dice additionally using DataFramesMeta.jl:

julia> using Random

julia> using DataFramesMeta

julia> Random.seed!(1234);

julia> @chain DataFrame(res=rand(1:6, 10^8) + rand(1:6, 10^8)) begin
           groupby(:res)
           @combine(:empirical = $proprow)
           @rtransform(:theoretical = (6 - abs(7 - :res)) / 36)
       end
11×3 DataFrame
 Row │ res    empirical  theoretical
     │ Int64  Float64    Float64
─────┼───────────────────────────────
   1 │     2  0.0277863    0.0277778
   2 │     3  0.0555783    0.0555556
   3 │     4  0.0832717    0.0833333
   4 │     5  0.111112     0.111111
   5 │     6  0.138934     0.138889
   6 │     7  0.166655     0.166667
   7 │     8  0.138885     0.138889
   8 │     9  0.111094     0.111111
   9 │    10  0.0833185    0.0833333
  10 │    11  0.0555788    0.0555556
  11 │    12  0.0277863    0.0277778

Pivot table functionality of the unstack function

The second upcoming functionality is addition of valuestransform keyword
argument to unstack. It allows you to create pivot tables. Start with
generating some table we might want to aggregate:

julia> Random.seed!(1234);

julia> sales_df = DataFrame(year=rand(2020:2021, 100),
                            region=rand(["north", "center", "south"], 100),
                            sales=rand(100))
100×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Float64
─────┼──────────────────────────
   1 │  2020  center  0.558861
   2 │  2021  center  0.122447
   3 │  2020  north   0.20889
  ⋮  │   ⋮      ⋮         ⋮
  98 │  2020  center  0.983299
  99 │  2020  north   0.882375
 100 │  2021  south   0.232082
                 94 rows omitted

julia> transform!(groupby(sales_df, [:year, :region]), eachindex => :store_id)
100×4 DataFrame
 Row │ year   region  sales      store_id
     │ Int64  String  Float64    Int64
─────┼────────────────────────────────────
   1 │  2020  center  0.558861          1
   2 │  2021  center  0.122447          1
   3 │  2020  north   0.20889           1
  ⋮  │   ⋮      ⋮         ⋮         ⋮
  98 │  2020  center  0.983299         13
  99 │  2020  north   0.882375         12
 100 │  2021  south   0.232082         19
                           94 rows omitted

I have generated the :store_id column that gives us store identifier per year
and region as an exercise of eachindex usage.

What we want to get is a pivot table showing us total sales by year and region.

Traditionally you would do it like this:

julia> combine(groupby(sales_df, [:year, :region]), :sales => sum => :total_sales)
6×3 DataFrame
 Row │ year   region  total_sales
     │ Int64  String  Float64
─────┼────────────────────────────
   1 │  2020  center      7.25926
   2 │  2021  center      6.00744
   3 │  2020  north       7.10255
   4 │  2021  south       7.53942
   5 │  2020  south       9.35856
   6 │  2021  north      10.7248

and later use unstack on the result like this:

julia> @chain sales_df begin
           groupby([:year, :region])
           combine(:sales => sum => :total_sales)
           unstack(:region, :year, :total_sales)
       end
3×3 DataFrame
 Row │ region  2020      2021
     │ String  Float64?  Float64?
─────┼────────────────────────────
   1 │ center   7.25926   6.00744
   2 │ north    7.10255  10.7248
   3 │ south    9.35856   7.53942

Now you can do it in one step with valuestransform keyword argument to
unstack which allows you to pass aggregation function you want to apply
to the unstacked data:

julia> unstack(sales_df, :region, :year, :sales; valuestransform=sum)
3×3 DataFrame
 Row │ region  2020      2021
     │ String  Float64?  Float64?
─────┼────────────────────────────
   1 │ center   7.25926   6.00744
   2 │ north    7.10255  10.7248
   3 │ south    9.35856   7.53942

Let us now perform a bit more complex transformation:

julia> @chain sales_df begin
           @rtransform(:region_year = string(:region, "_", :year))
           unstack(:store_id, :region_year, :sales; valuestransform=sum, fill=0.0)
       end
20×7 DataFrame
 Row │ store_id  center_2020  center_2021  north_2020  south_2021  south_2020  north_2021
     │ Int64     Float64      Float64      Float64     Float64     Float64     Float64
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │        1    0.558861     0.122447     0.20889    0.143436    0.329521   0.866798
   2 │        2    0.0380815    0.309235     0.034361   0.552426    0.310385   0.959957
   3 │        3    0.516981     0.261204     0.95434    0.413821    0.104218   0.199257
   4 │        4    0.0618067    0.075524     0.099118   0.706231    0.217979   0.370522
   5 │        5    0.0507147    0.275146     0.328642   0.975947    0.677177   0.00706048
   6 │        6    0.732012     0.0207132    0.991149   0.620299    0.814166   0.531297
   7 │        7    0.204139     0.931495     0.832146   0.201035    0.554558   0.759341
   8 │        8    0.799132     0.194907     0.731654   0.234099    0.67543    0.194442
   9 │        9    0.831656     0.894577     0.694614   0.0841972   0.339449   0.164982
  10 │       10    0.979482     0.057445     0.826054   0.177784    0.858287   0.529903
  11 │       11    0.989703     0.568018     0.519209   0.185454    0.0682582  0.315405
  12 │       12    0.513391     0.573917     0.882375   0.294397    0.465337   0.426687
  13 │       13    0.983299     0.0552395    0.0        0.145345    0.55148    0.630227
  14 │       14    0.0          0.6342       0.0        0.627799    0.0697159  0.83729
  15 │       15    0.0          0.435403     0.0        0.743466    0.756869   0.940755
  16 │       16    0.0          0.102038     0.0        0.641528    0.867504   0.978431
  17 │       17    0.0          0.495932     0.0        0.236       0.436487   0.937941
  18 │       18    0.0          0.0          0.0        0.324075    0.955728   0.241034
  19 │       19    0.0          0.0          0.0        0.232082    0.248491   0.833433
  20 │       20    0.0          0.0          0.0        0.0         0.0575235  0.0

Here, I have shown you the fill keyword argument that is used to provide
value that should be used in cases we do not have any data for some combination
of row and column keys (by default such cells are filled with missing).

Conclusions

I hope you will find these upcoming functionalities useful. If you have any
questions or comments regarding them please open an issue on DataFrames.jl
repository.

Safety-efficiency trade-offs in the map function

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/11/mappure.html

Introduction

The map function is borrowed from functional programming languages in Julia.
In many such languages you can safely assume that the functions you pass to
map are pure, meaning that they have no side effects and always return the
same output for the same passed arguments.

In practice, however, you sometimes pass a non-pure function to map. In this
post I present selected cases when this can lead to surprising results.

The post was written under Julia 1.7.0 and PooledArrays.jl 1.4.0.

Example scenario of non-pure function used with map

Assume you have a vector of data and you want to add some noise to values stored
in it. Here is an example how you can do it using map:

julia> using Random

julia> Random.seed!(1234);

julia> x = zeros(5)
5-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0

julia> map(v -> v + randn(), x)
5-element Vector{Float64}:
 -0.3597289068234817
  1.0872084924285859
 -0.4195896169388487
  0.7189099374659392
  0.4202471777937789

As expected we have added a different random number to every element of the
passed vector.

Problematic vector types

Let us now move to a case when things get problematic:

julia> using SparseArrays

julia> y = sparse(x)
5-element SparseVector{Float64, Int64} with 0 stored entries

julia> Random.seed!(1234);

julia> map(v -> v + randn(), y)
5-element SparseVector{Float64, Int64} with 5 stored entries:
  [1]  =  -0.359729
  [2]  =  -0.359729
  [3]  =  -0.359729
  [4]  =  -0.359729
  [5]  =  -0.359729

To our surprise we have added the same noise to every element of y. The reason
is that the map function assumes for SparseVector that the function passed
to it is pure. Therefore it is called only once for 0.0 element of sparse
vector (note that technically this element is not stored in the vector).

Let us try another sparse vector:

julia> z = sparse(ones(5)) .- 1.0
5-element SparseVector{Float64, Int64} with 5 stored entries:
  [1]  =  0.0
  [2]  =  0.0
  [3]  =  0.0
  [4]  =  0.0
  [5]  =  0.0

julia> Random.seed!(1234);

julia> map(v -> v + randn(), z)
5-element SparseVector{Float64, Int64} with 5 stored entries:
  [1]  =  1.08721
  [2]  =  -0.41959
  [3]  =  0.71891
  [4]  =  0.420247
  [5]  =  -0.685671

What is going on? There are two things to observe:

  1. This time I have stored 0.0 values in the vector so for each entry in the
    sparse vector I get a new random number generated.
  2. The stream of random numbers is shifted by one, as map internally called
    randn() one more time for the default 0.0 value (which is not stored
    in this vector in our case).

Things look complex. The short lesson is: do not use the default map with
sparse vectors when you want to use a non-pure function.

As a side note, the same problem is present with broadcasting:

julia> Random.seed!(1234);

julia> (v -> v + randn()).(y)
5-element SparseVector{Float64, Int64} with 5 stored entries:
  [1]  =  -0.359729
  [2]  =  -0.359729
  [3]  =  -0.359729
  [4]  =  -0.359729
  [5]  =  -0.359729

How to use map with non-pure function and sparse vector

Fortunately there is a relatively simple way to resolve our problems. We need to
use the Base.@invoke macro. In this way we can force Julia to use the default
map method (not the optimized one that the SparseArrays module defines).
Here is how you can do it:

julia> Random.seed!(1234);

julia> Base.@invoke map(v -> v + randn(), y)
5-element Vector{Float64}:
 -0.3597289068234817
  1.0872084924285859
 -0.4195896169388487
  0.7189099374659392
  0.4202471777937789

julia> Random.seed!(1234);

julia> Base.@invoke map(v -> v + randn(), z)
5-element Vector{Float64}:
 -0.3597289068234817
  1.0872084924285859
 -0.4195896169388487
  0.7189099374659392
  0.4202471777937789

This time we get the same results as for the x vector. Note that the returned
object has Vector type (previously we were getting SparseVector).

The PooledArrays.jl case

The same problem as for SparseVector is present in PooledArrays.jl. However,
here, by default, the map function assumes that the function you pass to it
is not-pure for safety:

julia> using PooledArrays

julia> p = PooledArray(x)
5-element PooledVector{Float64, UInt32, Vector{UInt32}}:
 0.0
 0.0
 0.0
 0.0
 0.0

julia> Random.seed!(1234);

julia> map(v -> v + randn(), p)
5-element PooledVector{Float64, UInt32, Vector{UInt32}}:
 -0.3597289068234817
  1.0872084924285859
 -0.4195896169388487
  0.7189099374659392
  0.4202471777937789

If you are sure that you are passing a pure function to map in this case
you can pass pure=true keyword argument to speed things up:

julia> Random.seed!(1234);

julia> map(v -> v + randn(), p, pure=true)
5-element PooledVector{Float64, UInt32, Vector{UInt32}}:
 -0.3597289068234817
 -0.3597289068234817
 -0.3597289068234817
 -0.3597289068234817
 -0.3597289068234817

(or to break things, as in this case, since we passed a non-pure function)

Conclusions

In summary let me highlight that the situation we have discussed in this post is
a hard design decision. The reason is that most of the time you indeed pass
pure functions to the map function. Therefore having methods that are
optimized for performance makes sense. However, sometimes, you want to perform
mapping using a non-pure function, in which case you can get surprising results.
Hopefully, after reading this post you now know how to handle such cases.

Happy hacking!

Using DelimitedFiles module with DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/04/delimitedfiles.html

Introduction

From time to time users ask how to use the DelimitedFiles module with
DataFrames.jl so today I have decided to write a short tutorial on this
topic.

The examples were tested under Julia 1.7.0 and DataFrames.jl 1.3.2.

Reading CSV files in Julia

The CSV.jl package is a go-to package for reading CSV files. It provides dozens
of options allowing you to customize parsing of such files and is fast
when you read large files.

However, there are cases when you have small data stored in CSV format and might
want to avoid having the CSV.jl package as a dependency. In such situations
you can consider using the built-in DelimitedFiles module. In what follows
I will explain you how to use it with DataFrames.jl.

A basic scenario:

Assume you have the following data you want to parse as CSV:

csv = """
a,b,c
1,2,3
4,5,6
"""

Normally you would read the data from disk, but here I will use the IOBuffer
wrapper on a string to avoid creating such file.

Let us read this data using the DelimimitedFiles module:

julia> using DelimitedFiles

julia> mat, head = readdlm(IOBuffer(csv), ',', header=true)
([1.0 2.0 3.0; 4.0 5.0 6.0], AbstractString["a" "b" "c"])

julia> mat
2×3 Matrix{Float64}:
 1.0  2.0  3.0
 4.0  5.0  6.0

julia> head
1×3 Matrix{AbstractString}:
 "a"  "b"  "c"

The readdlm function, when passed header=true keyword argument returned two
values. The first is mat matrix containing data, the second is a 1-row matrix
head containing column names.

If you want to ingest this data into a data frame write:

julia> using DataFrames

julia> DataFrame(mat, vec(head))
2×3 DataFrame
 Row │ a        b        c
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     1.0      2.0      3.0
   2 │     4.0      5.0      6.0

A crucial element of the process is vec(head). The DataFrame constructor
expects that column names are passed as a vector, not as a matrix. Therefore
we use the vec function.

Mixed element types in CSV source

Let us now consider the following CSV data:

csv2 = """
a,b,c
1,2,x
4,5,y
"""

In this case we see that the first two columns contain numbers, but the last
column consists of strings. Reading it using readdlm produces:

julia> mat2, head2 = readdlm(IOBuffer(csv2), ',', header=true)
(Any[1 2 "x"; 4 5 "y"], AbstractString["a" "b" "c"])

julia> mat2
2×3 Matrix{Any}:
 1  2  "x"
 4  5  "y"

julia> head2
1×3 Matrix{AbstractString}:
 "a"  "b"  "c"

Observe that now element type of mat2 matrix is Any. Therefore when we
create a data frame we will get columns having Any element type:

julia> df2 = DataFrame(mat2, vec(head2))
2×3 DataFrame
 Row │ a    b    c
     │ Any  Any  Any
─────┼───────────────
   1 │ 1    2    x
   2 │ 4    5    y

Fortunately it is easy to narrow down the element types of columns by
broadcasting the identity function:

julia> identity.(df2)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  SubStrin…
─────┼─────────────────────────
   1 │     1      2  x
   2 │     4      5  y

Conclusions

What is the benefit of using DelimitedFiles module over CSV.jl? The first that
it is shipped with Base Julia so it does not require installation. The second
is that for small files it will be faster on the first run as compilation
of functions from the CSV.jl package takes several seconds.

What are the drawbacks? For large data the readdlm function will be slower.
Additionally it lacks many options. DelimitedFiles is best suited for reading
data having homogeneous type. If columns have mixed types it becomes less
convenient. Similarly, e.g. when you have missing data in the CSV file you
would have to manually identify them after reading it in with
the readdlm function.