Tag Archives: julialang

What is new in DataFrames.jl 1.5

Re-posted from: https://bkamins.github.io/julialang/2023/03/24/df15.html

Introduction

The objective of this post is simple: I want to discuss
the most important new functionalities that DataFrames.jl 1.5 offers.

The changes I cover today are:

more powerful Cols selector;
group order setting in groupby;
setting row order in joins;
new options of handling duplicate rows in unique;
making flatten function scalar-aware.

The post was tested under Julia 1.9.0-rc1 and DataFrames.jl 1.5.0.

More powerful `Cols` selector

The Cols selector has gotten two new features:

ability to mix condition function selector with other selectors;
new operator keyword argument.

Let me explain both by example. Start with using condition function selector.

This is the old behavior that was supported:

julia> using DataFrames

julia> df = DataFrame(x1=1, x2=2, y1=3, y2=4)
1×4 DataFrame
 Row │ x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

julia> select(df, Cols(startswith("x")))
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The startswith("x") condition function was required to be the only argument to Cols.
Now you can flexibly mix it with other selectors:

julia> select(df, Cols(startswith("x"), r"2"))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

julia> select(df, Cols(startswith("x"), :y2))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

The second change is operator keyword argument. By default Cols, when passed multiple arguments
takes their union:

julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

However, you can pass other operators that specify the way how columns selected by individual
arguments should be combined together. For example you can take their intersection:

julia> select(df, Cols(startswith("x"), endswith("2"), operator=intersect))
1×1 DataFrame
 Row │ x2
     │ Int64
─────┼───────
   1 │     2

or set difference:

julia> select(df, Cols(startswith("x"), endswith("2"), operator=setdiff))
1×1 DataFrame
 Row │ x1
     │ Int64
─────┼───────
   1 │     1

Group order setting in `groupby`

Previously grouby when you passed sort=true sorted groups in ascending order.
Here is an example:

julia> df = DataFrame(id=["a", "c", "b"], row=1:3)
3×2 DataFrame
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ c           2
   3 │ b           3

julia> show(groupby(df, :id, sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "a"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1
Group 2 (1 row): id = "b"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ b           3
Group 3 (1 row): id = "c"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ c           2

Now you can use pass in sort any set of keyword arguments that sort accepts as a named tuple.
For example if you wanted groups to be sorted in reverse order do:

julia> show(groupby(df, :id, sort=(rev=true,)), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "c"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ c           2
Group 2 (1 row): id = "b"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ b           3
Group 3 (1 row): id = "a"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1

Setting row order in joins

By default join operations do not guarantee the order of rows in the output
(just like databases):

julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
 Row │ id     left
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     4      3
   4 │     5      4

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
 Row │ id     right
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2
   3 │     3      3
   4 │     6      4
   5 │     7      5

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

However, often users want the result to follow the order of rows of one of the source tables.
This can be achieved now using the order keyword argument.

If you want the result to have rows in the order of left table (and then added
non-matching rows from the right table at the end, if needed) do:

julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        2
   2 │     2        2        1
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

Similar option is available if you want to keep the row order of the right table:

julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     3  missing        3
   4 │     6  missing        4
   5 │     7  missing        5
   6 │     4        3  missing
   7 │     5        4  missing

New options of handling duplicate rows in `unique`

By default unique (and related functions unique! and nonunique) kept
the first duplicate row in case of duplicates. Here is an example:

julia> df = DataFrame(a=[1, 2, 3, 1, 2, 4], id=1:6)
6×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      4
   5 │     2      5
   6 │     4      6

julia> unique(df, :a)
4×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      6

However, users sometimes wanted a different behavior. Two more are now supported.
If instead you want to keep the last of duplicate rows pass keep=:last:

julia> unique(df, :a, keep=:last)
4×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     1      4
   3 │     2      5
   4 │     4      6

While, if you do not want to keep any duplicate rows pass keep=:noduplicates:

julia> unique(df, :a, keep=:noduplicates)
2×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     4      6

Making `flatten` function scalar-aware

The flatten function is often useful when one wants to unnest a column that
holds collections (e.g. vectors). Here is an example:

julia> df = DataFrame(id=1:3, col=[["a", "b"], ["c", "d"], ["e", "f"]])
3×2 DataFrame
 Row │ id     col
     │ Int64  Array…
─────┼───────────────────
   1 │     1  ["a", "b"]
   2 │     2  ["c", "d"]
   3 │     3  ["e", "f"]

julia> flatten(df, :col)
6×2 DataFrame
 Row │ id     col
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  b
   3 │     2  c
   4 │     2  d
   5 │     3  e
   6 │     3  f

However, sometimes such a column might hold values that are not collections and we do not want to try expanding them.
The most common case is missing:

julia> df = DataFrame(id=1:3, col=[["a", "b"], missing, ["e", "f"]])
3×2 DataFrame
 Row │ id     col
     │ Int64  Array…?
─────┼───────────────────
   1 │     1  ["a", "b"]
   2 │     2  missing
   3 │     3  ["e", "f"]

julia> flatten(df, :col)
ERROR: MethodError: no method matching length(::Missing)

As you can see the operation failed as missing is not a collection that has length (thus it cannot be expanded).

Now you can specify that missing is a scalar that, when encountered, should not be expanded.
You do it by passing scalar=Missing keyword argument:

julia> flatten(df, :col, scalar=Missing)
5×2 DataFrame
 Row │ id     col
     │ Int64  String?
─────┼────────────────
   1 │     1  a
   2 │     1  b
   3 │     2  missing
   4 │     3  e
   5 │     3  f

Conclusions

I hope you will find the new additions that were introduced in DataFrames.jl useful for your data wrangling tasks!

A puzzle hidden within a classic rod cutting puzzle

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/17/rope.html

Introduction

Today im my post I want to concentrate on a less technical topic
(at least to start with ?).

We will discuss solving the following classic puzzle:

You are given a 1 meter long rod. Someone cuts it in n places.
The locations of the cuts are picked independently uniformly at random.
As a result of the process you get n+1 pieces. What is the probability
that it is possible to create a polygon using them?

There are many approaches that can be used to solve this puzzle.
I will want to discuss one that uses only basic probability
theory and does not require any tricks.

Next we will validate the solution using Julia 1.9.0-rc1. The solution
will lead us to another puzzle, that I think is important to understand
by people using random number generation in Julia.

Analytical solution

We start with the observation that if we are given n+1 line segments it is
possible to create a polygon using them if and only if every of them has less
than 50% of total length of the original rod.

Denote the locations of the cut points on the rod by X(i) for i going
from 1 to n. We are measuring location from the left of the rod.

When would be the above condition not met? We have two cases (I am ignoring
events having zero probability for simplicity of notation):

all X(i) are greater than 0.5 (then the first segment is overly long), or
for some X(i) it is less than 0.5 and there is no other X(j) such that
0 < X(j) - X(i) < 0.5 (then the segment that begins at point X(i) is overly long).
We have n such cases.

The second crucial observation is that the n+1 considered events are disjoint.
The reason is that it is impossible that two line segments that are longer than 0.5
at the same time.

Therefore the probability that any of the segments is longer than 0.5 is sum
of the probabilities of the individual events.

So what is the probability that all X(i) are greater than 0.5? It is relatively
easy. We have n independent events, each having probability 0.5 so the total
probability is 1/2^n.

And what is the probability that X(i) is less than 0.5 and there is no other
X(j) such that 0 < X(j) - X(i) < 0.5? Again we have n independent events
(one for location of X(i) and n-1 for location of X(j) for j other than i)
and each of these events has 0.5 probability, so again we have that the probability
is 1/2^n.

In total we get that the probability that at least one of the segments is longer than
0.5 is (n+1)/2^n, so the probability that we can create a polygon is 1-(n+1)/2^n.

Checking the solution using simulation

Let us write a simple simulation in Julia to check our result. First write a function that
takes n points from the [0, 1] interval and checks if they cut the rod into segments
from which one could create a polygon:

check(x) = maximum(diff(sort!([0.0; x; 1.0]))) < 0.5

First add start (0.0) and end (1.0) points of the rod. Next, we arrange the passed points
in ascending order in-place using sort!. By running the diff operation, we compute the lengths
of the created rod segments, and check if longest of them is less than 0.5.

Now we are ready to write a function that runs this simulation multiple times
(1 million in my example) to get an estimate of the probability:

runtest(n) = count(_ -> check(rand(n)), 1:10^6) / 10^6

Let us run the simulation for several values of n and compare the results against analytical result:

julia> using Random

julia> Random.seed!(1234);

julia> [(n=n, theory=1-(n+1)/2^n, simulation=runtest(n)) for n in 1:10]
10-element Vector{NamedTuple{(:n, :theory, :simulation), Tuple{Int64, Float64, Float64}}}:
 (n = 1, theory = 0.0, simulation = 0.0)
 (n = 2, theory = 0.25, simulation = 0.250275)
 (n = 3, theory = 0.5, simulation = 0.500066)
 (n = 4, theory = 0.6875, simulation = 0.687842)
 (n = 5, theory = 0.8125, simulation = 0.812825)
 (n = 6, theory = 0.890625, simulation = 0.890247)
 (n = 7, theory = 0.9375, simulation = 0.9374)
 (n = 8, theory = 0.96484375, simulation = 0.964881)
 (n = 9, theory = 0.98046875, simulation = 0.980502)
 (n = 10, theory = 0.9892578125, simulation = 0.989465)

The results look consistent.

Making the simulation faster

Let us benchmark the speed of our code:

julia> @time [(n=n, theory=1-(n+1)/2^n, simulation=runtest(n)) for n in 1:10];
 10.075038 seconds (100.03 M allocations: 4.621 GiB, 10.55% gc time, 0.69% compilation time)

We could make it faster by avoiding unnecessary allocations:

function runtest_faster(n)
    x = zeros(n+2)
    x[end] = 1.0
    v = @view x[2:end-1]
    return count(1:10^6) do _
        rand!(v)
        sort!(v)
        m = 0.0
        xa = first(x)
        @inbounds for i in 2:n+2
            xb = x[i]
            m = max(m, xb - xa)
            xa = xb
        end
        return m < 0.5
    end / 10^6
end

First let us check the results:

julia> Random.seed!(1234);

julia> [(n=n, theory=1-(n+1)/2^n, simulation=runtest_faster(n)) for n in 1:10]
10-element Vector{NamedTuple{(:n, :theory, :simulation), Tuple{Int64, Float64, Float64}}}:
 (n = 1, theory = 0.0, simulation = 0.0)
 (n = 2, theory = 0.25, simulation = 0.250275)
 (n = 3, theory = 0.5, simulation = 0.500066)
 (n = 4, theory = 0.6875, simulation = 0.687842)
 (n = 5, theory = 0.8125, simulation = 0.812825)
 (n = 6, theory = 0.890625, simulation = 0.890247)
 (n = 7, theory = 0.9375, simulation = 0.9374)
 (n = 8, theory = 0.96484375, simulation = 0.964919)
 (n = 9, theory = 0.98046875, simulation = 0.980717)
 (n = 10, theory = 0.9892578125, simulation = 0.989311)

What is interesting is that up to n=7 we get identical results, but they start to differ
for n=8. We will investigate it shortly.

However, first we check the timing:

julia> @time [(n=n, theory=1-(n+1)/2^n, simulation=runtest_faster(n)) for n in 1:10];
  1.792463 seconds (22.75 k allocations: 1.533 MiB, 2.80% compilation time)

As we can see there is a significant reduction in the number of allocations.

Investigating random number generation process

Have a look at this code:

julia> Random.seed!(1234); rand(7)
7-element Vector{Float64}:
32597672886359486
5490511363155669
21858665481883066
8942454282009883
35311164439921205
39425536741585077
9531246272848422

julia> Random.seed!(1234); rand!(zeros(7))
7-element Vector{Float64}:
32597672886359486
5490511363155669
21858665481883066
8942454282009883
35311164439921205
39425536741585077
9531246272848422

julia> Random.seed!(1234); rand!(view(zeros(7), :))
7-element view(::Vector{Float64}, :) with eltype Float64:
32597672886359486
5490511363155669
21858665481883066
8942454282009883
35311164439921205
39425536741585077
9531246272848422

All looks good. The results are identical. Now increase the length of the vector by one:

julia> Random.seed!(1234); rand(8)
8-element Vector{Float64}:
5798621201341324
4112941179498505
9721360824554687
014908849285099945
520354993723718
6395615996802734
8396219340580711
967142768915383

julia> Random.seed!(1234); rand!(zeros(8))
8-element Vector{Float64}:
5798621201341324
4112941179498505
9721360824554687
014908849285099945
520354993723718
6395615996802734
8396219340580711
967142768915383

julia> Random.seed!(1234); rand!(view(zeros(8), :))
8-element view(::Vector{Float64}, :) with eltype Float64:
32597672886359486
5490511363155669
21858665481883066
8942454282009883
35311164439921205
39425536741585077
9531246272848422
7955469475347194

What is going on here? We have just hit a puzzle I have promised
that is hidden in our original puzzle.

It turns out that the default random number generator
in Julia uses a different method for random number generation for Array
than for general AbstractArray in some cases.

The relevant code for our case is as follows.
The one used by Julia for Vector{Float64}:

function rand!(rng::Union{TaskLocalRNG, Xoshiro},
               dst::Array{Float64},
               ::SamplerTrivial{CloseOpen01{Float64}})
    GC.@preserve dst xoshiro_bulk(rng,
                                  convert(Ptr{UInt8},
                                  pointer(dst)),
                                  length(dst)*8,
                                  Float64,
                                  xoshiroWidth(),
                                  _bits2float)
    dst
end

and the code used for a view:

function rand!(rng::AbstractRNG, A::AbstractArray{T}, sp::Sampler) where T
    for i in eachindex(A)
        x = rand(rng, sp)
        @inbounds A[i] = x
    end
    A
end

Therefore, what you need to remember that if you change in your
code rand into rand! to improve performance the results
of your computations might change.

Conclusions

I hope you liked the puzzle. If you wonder if the analytical solution
could have been simpler the answer is yes. We can get rid of the special
cases in our derivation by the following observation. Instead of analyzing
cutting a rod in n places you can imagine that you cut a circle in n+1
places. Then we notice that all cutting points are indistinguishable,
and what we need to check if there is a gap of 0.5 length from a given
point looking clockwise from it. The probability of such gap is 1/2^n
as we have n points that cannot lie in a zone of length 0.5. The rest of
the arguments in the derivation are the same as above.

Also, as usual, we have a (hopefully useful) lesson about running
stochastic simulations in Julia (especially if you want them to be reproducible):
rand and rand! are not guaranteed to produce the same sequences of random numbers.

To copy or not to copy, that is the question

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/10/copying.html

Introduction

I have used Julia for quite some time now.
Therefore I thought that I have acquired adequate habits of using this language safely.
Unfortunately, it is not fully true.
Today I wanted to share with you my thoughts on an important pattern that I need to
constantly re-learn.

As you probably know developers of packages in Julia want them to be fast. For this
reason they often optimize them. One of the common optimizations is avoidance of
unnecessary allocations. This approach makes sense, but runs the risk of unwanted
mutation of such data.

In general you have two scenarios when you can can have an issue with unwanted
mutation:

when you pass your data to a function that might mutate it;
when you get data that is returned from function and you later mutate it
and at the same time this operation mutates some other data that you have not
expected to be changed.

The first scenario is handled pretty cleanly in Julia. Any function that
might mutate its arguments by convention should have its name terminated with !.
This is well explained in the Julia manual in the
Append ! to names of functions that modify their arguments section.

The second case is less obvious and is the focus of my post today. I will explain
it using two examples taken from real questions of Julia users asked during last week.

The post was written under Julia 1.9.0-rc1, DataFrames.jl 1.5.0, GLM.jl 1.8.1, and
ShiftedArrays.jl 2.0.0.

First example: shifted arrays

Consider the following code:

julia> using DataFrames

julia> using Random

julia> using ShiftedArrays: lag

julia> Random.seed!(1234);

julia> df = DataFrame(x = rand(10))
10×1 DataFrame
 Row │ x
     │ Float64
─────┼───────────
   1 │ 0.579862
   2 │ 0.411294
   3 │ 0.972136
   4 │ 0.0149088
   5 │ 0.520355
   6 │ 0.639562
   7 │ 0.839622
   8 │ 0.967143
   9 │ 0.131026
  10 │ 0.946453

julia> df.xlag = lag(df.x)
10-element ShiftedVector{Float64, Missing, Vector{Float64}}:
  missing
 0.5798621201341324
 0.4112941179498505
 0.9721360824554687
 0.014908849285099945
 0.520354993723718
 0.6395615996802734
 0.8396219340580711
 0.967142768915383
 0.13102565622085904

julia> df
10×2 DataFrame
 Row │ x          xlag
     │ Float64    Float64?
─────┼────────────────────────────
   1 │ 0.579862   missing
   2 │ 0.411294         0.579862
   3 │ 0.972136         0.411294
   4 │ 0.0149088        0.972136
   5 │ 0.520355         0.0149088
   6 │ 0.639562         0.520355
   7 │ 0.839622         0.639562
   8 │ 0.967143         0.839622
   9 │ 0.131026         0.967143
  10 │ 0.946453         0.131026

The potential trap is that the :xlag column is a view. If we modify :x also :xlag will be modified.

Here is an example:

julia> df.x[1] = 0.0
0.0

julia> df
10×2 DataFrame
 Row │ x          xlag
     │ Float64    Float64?
─────┼────────────────────────────
   1 │ 0.0        missing
   2 │ 0.411294         0.0
   3 │ 0.972136         0.411294
   4 │ 0.0149088        0.972136
   5 │ 0.520355         0.0149088
   6 │ 0.639562         0.520355
   7 │ 0.839622         0.639562
   8 │ 0.967143         0.839622
   9 │ 0.131026         0.967143
  10 │ 0.946453         0.131026

In some cases you might want it indeed. But in many cases this will lead to a bug,
especially if both objects are used in different parts of code. Here I stored them in
a data frame because it is a convenient way to show what happens.

Another pitfall is when you try to mutate such a data frame by e.g. pushing rows to it:

julia> push!(df, (10.0, 11.0))
┌ Error: Error adding value to column :xlag. Maybe it was forgotten to ask for column element type promotion, which can be done by passing the promote=true keyword argument.

As you can see the operation errored because we have a view stored
as a column of a data frame, and views are not resizable.

This might seem to be a simple mistake, but in practice users report making this error.
The simplest way to resolve this issue is to make sure you copy the view:

julia> df.xlag = copy(lag(df.x))
10-element Vector{Union{Missing, Float64}}:
  missing
0
4112941179498505
9721360824554687
014908849285099945
520354993723718
6395615996802734
8396219340580711
967142768915383
13102565622085904

julia> push!(df, (10.0, 11.0))
11×2 DataFrame
 Row │ x           xlag
     │ Float64     Float64?
─────┼─────────────────────────────
│  0.0        missing
│  0.411294         0.0
│  0.972136         0.411294
│  0.0149088        0.972136
│  0.520355         0.0149088
│  0.639562         0.520355
│  0.839622         0.639562
│  0.967143         0.839622
│  0.131026         0.967143
│  0.946453         0.131026
│ 10.0             11.0

Second example: linear models

Let me illustrate another pitfall with linear models. Start with creating a simple data set:

julia> Random.seed!(1234);

julia> df = DataFrame(randn(10, 3), [:x1, :x2, :error]);

julia> df.y = df.x1 + df.x2 + df.error;

julia> df.wts = rand(10);

julia> df
10×5 DataFrame
 Row │ x1          x2          error       y           wts
     │ Float64     Float64     Float64     Float64     Float64
─────┼───────────────────────────────────────────────────────────
   1 │  0.970656    0.705993   -1.0565      0.620151   0.840641
   2 │ -0.979218    1.09156     0.148361    0.260698   0.523948
   3 │  0.901861    0.871498   -1.83851    -0.0651514  0.0128461
   4 │ -0.0328031   0.0856911  -1.07363    -1.02074    0.40573
   5 │ -0.600792    0.960079   -0.20563     0.153657   0.124074
   6 │ -1.44518     0.907837    0.770703    0.233363   0.648874
   7 │  2.70742    -1.46506    -1.21394     0.0284219  0.574296
   8 │  1.52445    -0.215859    1.0915      2.40009    0.408537
   9 │  0.759804    0.575094    1.22004     2.55494    0.0731709
  10 │ -0.881437   -1.79563    -0.0758316  -2.7529     0.501756

Now fit a weighted linear model:

julia> using GLM

julia> model = lm(@formula(y ~ x1 + x2), df, wts=df.wts)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.163691    0.721581  -0.23    0.8550   -7.38861    7.06123
x1            0.698275    0.554725   1.26    0.4108   -4.85598    6.25253
x2            1.03178     0.732565   1.41    0.3753   -6.30312    8.36668
─────────────────────────────────────────────────────────────────────────

Assume we write a simple function that computes variance-covariance
matrix of coefficients of our model:

function our_vcov(model)
    X = modelmatrix(model)
    u² = residuals(model) .^ 2
    wts = model.model.rr.wts
    X .*= sqrt.(wts)
    u² .*= wts
    dfr = dof_residual(model)
    M = inv(X'*X)
    return M * sum(u²) / dfr
end

Let us check if we get the same results as returned by the built-in vcov function:

julia> vcov(model)
3×3 Matrix{Float64}:
  0.520679   -0.0861524  -0.0613806
 -0.0861524   0.30772     0.168677
 -0.0613806   0.168677    0.536652

julia> our_vcov(model)
3×3 Matrix{Float64}:
  0.520679   -0.0861524  -0.0613806
 -0.0861524   0.30772     0.168677
 -0.0613806   0.168677    0.536652

All looks good. Let us run our function again:

julia> our_vcov(model)
3×3 Matrix{Float64}:
  0.954193  -0.196279  -0.191441
 -0.196279   0.534882   0.295785
 -0.191441   0.295785   0.965165

Wait – we got a different result this time. What is the problem?

The issue is with the X .*= sqrt.(wts) operation that updates X in place.
And in operation X = modelmatrix(model) we bound to X a value that
is internally stored in the model object. Thus we, unintentionally mutated
model.

In this case a fix would be for example to write X = X .* sqrt.(wts) or
X = copy(modelmatrix(model)).

The problem is that it is easy to forget it. Both X = modelmatrix(model)
and X .*= sqrt.(wts) operations look natural and can be easily used together.

Additionally if you look at modelmatrix help:

help?> modelmatrix
search: modelmatrix ModelMatrix

  modelmatrix(model::RegressionModel)

  Return the model matrix (a.k.a. the design matrix).

nothing warns you that the operation may return some value that is stored in model
or its copy.

A note to package developers: this kind of bug is not likely to be caught by standard tests
as it becomes apparent only after you run the operation twice on some object (and usually
tests are run only once).

Conclusions

So what is the pattern that I need to re-learn? The golden rule is:

Always check if a value returned by some function is freshly allocated.

Unfortunately, as you could see from the examples I gave, it is easy to forget to verify this.
The major problem is that there is no visual aid that would help you to identify such issues
(as opposed to functions mutating their arguments that end with !).

If you are user of DataFrames.jl you are partially guarded against the risks I discuss today.
The commonly used functions like select, combine, transform, or the DataFrame constructor
make sure to make a copy of columns of a data frame they return.
Indeed it reduces their performance, but it helps with safety and usually the performance hit is minimal.
(and for users obsessed with speed we have copycols=false keyword argument or functions with
! that work in-place).

So my answer to the question from the title of the post is: do copy (unless you are sure that you can
safely skip copying and the performance benefits are worth it).

juliabloggers.com

A Julia Language Blog Aggregator

Tag Archives: julialang

What is new in DataFrames.jl 1.5

Introduction

More powerful `Cols` selector

Group order setting in `groupby`

Setting row order in joins

New options of handling duplicate rows in `unique`

Making `flatten` function scalar-aware

Conclusions

A puzzle hidden within a classic rod cutting puzzle

Introduction

Analytical solution

Checking the solution using simulation

Making the simulation faster

Investigating random number generation process

Conclusions

To copy or not to copy, that is the question

Introduction

First example: shifted arrays

Second example: linear models

Conclusions

Introduction

More powerful Cols selector

Group order setting in groupby

Setting row order in joins

New options of handling duplicate rows in unique

Making flatten function scalar-aware

Conclusions

Introduction

Analytical solution

Checking the solution using simulation

Making the simulation faster

Investigating random number generation process

Conclusions

Introduction

First example: shifted arrays

Second example: linear models

Conclusions

More powerful `Cols` selector

Group order setting in `groupby`

New options of handling duplicate rows in `unique`

Making `flatten` function scalar-aware