juliabloggers.com

Annotating columns of a data frame with DataFramesMeta.jl

Blog by Bogumił Kamiński — Fri, 26 Apr 2024 08:12:31 +0000

Re-posted from: https://bkamins.github.io/julialang/2024/04/26/labels.html

Introduction

Today I want to discuss a functionality that was recently added to DataFramesMeta.jl.
These utility macros and functions make it easy to add custom labels and notes to columns
of a data frame. This functionality is especially useful when working with wide data frames,
as is often the case when e.g. analyzing economic data.

This post is written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.

Column labels

A column label is a short description of the contents of a column.
When using DataFramesMeta.jl you can use the following basic commands to work with them:

@label! attaches a label to a column;
label allows you to retrieve column label;
printlabels presents you labels of all annotated columns in a data frame.

Here is a simple example:

julia> using DataFramesMeta

julia> df = DataFrame(year=[2000, 2001], rev=[12, 17])
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> @label!(df, :rev = "Revenue (USD)")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> label(df, :rev)
"Revenue (USD)"

julia> printlabels(df)
┌────────┬───────────────┐
│ Column │         Label │
├────────┼───────────────┤
│   year │          year │
│    rev │ Revenue (USD) │
└────────┴───────────────┘

Note that if some column did not get an explicit label (like :year in our example)
by default its name is its label.

Column notes

Column notes are meant to give more detailed information about a column in a data frame.
You can use the following basic commands to work with them:

@note! attaches a note to a column;
note allows you to retrieve column note;
printnotes presents you notes of all columns in a data frame.

julia> @note!(df, :rev = "Total revenue of a company in in a calendar year in nominal USD")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> note(df, :rev)
"Total revenue of a company in in a calendar year in nominal USD"

julia> printnotes(df)
Column: rev
───────────
Total revenue of a company in in a calendar year in nominal USD

julia> @note!(df, :year = "Calendar year")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> printnotes(df)
Column: year
────────────
Calendar year

Column: rev
───────────
Total revenue of a company in in a calendar year in nominal USD

Observe that printnotes only prints notes that were actually added to
a column (as opposed to printlabels which prints labels of all columns,
using the default fallback to column name).

Conclusions

Today I covered the basic functions allowing to work with column
metadata of data frames. If you are interested in learning more
advanced functionalities please refer to DataFrames.jl
and TableMetadataTools.jl documentations.

I hope that you will find the metadata functionality provided by
DataFramesMeta.jl useful in your work.

CUDA.jl 5.2 and 5.3: Maintenance releases

Tim Besard — Fri, 26 Apr 2024 00:00:00 +0000

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2024-04-26-cuda_5.2_5.3/index.html

CUDA.jl 5.2 and 5.3 are two minor release of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.

Profiler improvements

CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia GPU applications without having to use NSight Systems or other external tools. The tool has seen continued development, mostly improving its robustness, but CUDA.jl now also provides a @bprofile equivalent that runs your application multiple times and reports on the time distribution of individual events:

julia> CUDA.@bprofile CuArray([1]) .+ 1
Profiler ran for 1.0 s, capturing 1427349 events.Host-side activity: calling CUDA APIs took 792.95 ms (79.29% of the trace)
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │  Calls │ Time distribution                     │ Name                    │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│   19.27% │  192.67 ms │ 109796 │   1.75 µs ± 10.19  (  0.95 ‥ 1279.83) │ cuMemAllocFromPoolAsync │
│   17.08% │   170.8 ms │  54898 │   3.11 µs ± 0.27   (  2.15 ‥ 23.84)   │ cuLaunchKernel          │
│   16.77% │  167.67 ms │  54898 │   3.05 µs ± 0.24   (  0.48 ‥ 16.69)   │ cuCtxSynchronize        │
│   14.11% │  141.12 ms │  54898 │   2.57 µs ± 0.79   (  1.67 ‥ 70.57)   │ cuMemcpyHtoDAsync       │
│    1.70% │   17.04 ms │  54898 │ 310.36 ns ± 132.89 (238.42 ‥ 5483.63) │ cuStreamSynchronize     │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘Device-side activity: GPU was busy for 87.38 ms (8.74% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name               │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│    6.66% │   66.61 ms │ 54898 │   1.21 µs ± 0.16   (  0.95 ‥ 1.67)    │ kernel             │
│    2.08% │   20.77 ms │ 54898 │ 378.42 ns ± 147.66 (238.42 ‥ 1192.09) │ [copy to device]   │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│   98.99% │  989.94 ms │ 54898 │  18.03 µs ± 49.88  ( 15.26 ‥ 10731.22) │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘

By default, CUDA.@bprofile runs the application for 1 second, but this can be adjusted using the time keyword argument.

Display of the time distribution isn't limited to CUDA.@bprofile, and will also be used by CUDA.@profile when any operation is called more than once. For example, with the broadcasting example from above we allocate both the input CuArray and the broadcast result, which results in two calls to the allocator:

julia> CUDA.@profile CuArray([1]) .+ 1Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│   99.92% │   99.42 ms │     1 │                                     │ cuMemcpyHtoDAsync       │
│    0.02% │   21.22 µs │     2 │  10.61 µs ± 6.57   (  5.96 ‥ 15.26) │ cuMemAllocFromPoolAsync │
│    0.02% │   17.88 µs │     1 │                                     │ cuLaunchKernel          │
│    0.00% │  953.67 ns │     1 │                                     │ cuStreamSynchronize     │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘

Kernel launch debugging

A common issue with CUDA programming is that kernel launches may fail when exhausting certain resources, such as shared memory or registers. This typically results in a cryptic error message, but CUDA.jl will now try to diagnose launch failures and provide a more helpful error message, as suggested by @simonbyrne:

For example, when using more parameter memory than allowed by the architecture:

julia> kernel(x) = nothing
julia> @cuda kernel(ntuple(_->UInt64(1), 2^13))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.

Or when using an invalid launch configuration, violating a device limit:

julia> @cuda threads=2000 identity(nothing)
ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

We also diagnose launch failures that involve kernel-specific limits, such as exceeding the number of threads that are allowed in a block (e.g., because of register use):

julia> @cuda threads=1024 heavy_kernel()
ERROR: Number of threads per block exceeds kernel limit (1024 > 512).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

Sorting improvements

Thanks to @xaellison, our bitonic sorting implementation now supports sorting specific dimensions, making it possible to implement sortperm for multi-dimensional arrays:

julia> A = cu([8 7; 5 6])
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 8  7
 5  6julia> sortperm(A, dims = 1)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 2  4
 1  3julia> sortperm(A, dims = 2)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 3  1
 2  4

The bitonic kernel is now used for all sorting operations, in favor of the often slower quicksort implementation:

# before (quicksort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
  2.760 ms (30 allocations: 1.02 KiB)# after (bitonic sort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
  246.386 μs (567 allocations: 13.66 KiB)# reference CPU time
julia> @btime sort($(rand(Float32, 1024, 1024)); dims=1)
  4.795 ms (1030 allocations: 5.07 MiB)

Unified memory fixes

CUDA.jl 5.1 greatly improved support for unified memory, and this has continued in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting CuArrays we now correctly preserve the memory type of the input arrays. This means that if you broadcast a CuArray that is allocated as unified memory, the result will also be allocated as unified memory. In case of a conflict, e.g. broadcasting a unified CuArray with one backed by device memory, we will prefer unified memory:

julia> cu([1]; host=true) .+ 1
1-element CuArray{Int64, 1, Mem.HostBuffer}:
 2julia> cu([1]; host=true) .+ cu([2]; device=true)
1-element CuArray{Int64, 1, Mem.UnifiedBuffer}:
 3

Software updates

Finally, we also did routine updates of the software stack, support the latest and greatest by NVIDIA. This includes support for CUDA 12.4 (Update 1), cuDNN 9, and cuTENSOR 2.0. This latest release of cuTENSOR is noteworthy as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to follow this change. For more details, refer to the cuTENSOR 2 migration guide by NVIDIA.

Of course, cuTENSOR.jl also provides a high-level Julia API which has been mostly unaffected by these changes:

using CUDA
A = CUDA.rand(7, 8, 3, 2)
B = CUDA.rand(3, 2, 2, 8)
C = CUDA.rand(3, 3, 7, 2)using cuTENSOR
tA = CuTensor(A, ['a', 'f', 'b', 'e'])
tB = CuTensor(B, ['c', 'e', 'd', 'f'])
tC = CuTensor(C, ['b', 'c', 'a', 'd'])using LinearAlgebra
mul!(tC, tA, tB)

This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and have to adapt to the new API, now is a good time to consider improving the high-level interface instead!

Future releases

The next release of CUDA.jl is gearing up to be a much larger release, with significant changes to both the API and internals of the package. Although the intent is to keep these changes non-breaking, it is always possible that some code will be affected in unexpected ways, so we encourage users to test the upcoming release by simply running ] add CUDA#master and report any issues.

Onboarding DataFrames.jl

Blog by Bogumił Kamiński — Fri, 19 Apr 2024 04:21:22 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/19/starting.html

Introduction

Working with data frames is one of the basic needs of any data scientist.
In the Julia ecosystem DataFrames.jl is a package providing support
for these operations. It was designed to be efficient and flexible.

Sometimes, however, novice users can be overwhelmed by the syntax due to its flexibility.
Therefore data scientists often find it useful to use the
packages that make it easier to do transformations of data frames.

Interestingly, these packages use metaprogramming, which might sound
to novices as something scary, while in reality it is the opposite. Metaprogramming
is used to make them easier to use.

Today I want do do a quick review of the main
metaprogramming packages that are available in the ecosystem.
I will not go into the details functionality and syntax of the packages, but rather just
present them briefly and give my personal (opinionated) view of their status.

This post is written under Julia 1.10.1, DataFrames.jl 1.6.1, Chain.jl 0.5.0, DataFramesMeta.jl 0.15.2,
DataFrameMacros.jl 0.4.1, and TidyData.jl 0.15.1.

A basic example

Let us start with a basic example of DataFrames.jl syntax, which we will later rewrite using metaprogramming:

julia> using Statistics

julia> using DataFrames

julia> df = DataFrame(id=[1, 2, 1, 2], v=1:4)
4×2 DataFrame
 Row │ id     v
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4

julia> transform(groupby(df, :id), :v => (x -> x .- mean(x)) => :v100)
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

The syntax looks complex and might be scary. Let us see if we can make it simpler.

Chain.jl

The first functionality we might want to use is to put the operations in a pipe. This is achieved with the Chain.jl package:

julia> using Chain

julia> @chain df begin
           groupby(:id)
           transform(:v => (x -> x .- mean(x)) => :v100)
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

We have achieved the benefit of a better visual separation of operations. In my opinion Chain.jl can be considered
as a currently mostly accepted approach to piping operations in Julia (there are alternatives in the ecosystem
but as far as I can tell they have lower adoption level).

DataFramesMeta.jl

Still the transform(:v => (x -> x .- mean(x)) => :v100) part looks verbose. Let us start by showing
how it can be made simpler using DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @chain df begin
           groupby(:id)
           @transform(:v100 = :v .- mean(:v))
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

In my opinion the code is now really easy to read.

Here is the status of DataFramesMeta.jl:

It is actively maintained.
Its syntax is close to DataFrames.jl.
It uses : to signal that some name is a column of a data frame.

DataFrameMacros.jl

The DataFrameMacros.jl is another package that is closely tied to DataFrames.jl. Let us see how we can use it.
Note that you need to restart the Julia session before running the code as the macro names are overlapping with DataFramesMeta.jl:

julia> using DataFrameMacros

julia> @chain df begin
           groupby(:id)
           @transform(:v100 = @bycol :v .- mean(:v))
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

Note the difference with the @bycol expression. It is needed because in DataFrameMacros.jl @transform by default vectorizes operations.
This is often more convenient for users, but sometimes (like in this case), one wants to suppress vectorization.

What is the status of DataFramesMeta.jl?

It is maintained but less actively developed than DataFramesMeta.jl.
Its syntax is close to DataFrames.jl, but several macros, for user convenience, vectorize operations by default (as opposed to Base Julia).
It uses : to signal that some text is a column of a data frame.

TidierData.jl

Now let us see the TidierData.jl package that is designed to follow dplyr from R:

julia> using TidierData

julia> @chain df begin
           @group_by(id)
           @mutate(v100 = v - mean(v))
           @ungroup
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     1      3      1.0
   3 │     2      2     -1.0
   4 │     2      4      1.0

If you know dplyr you should be at home with this syntax.

What is the status of DataFramesMeta.jl:

It is actively maintained.
It tries to guess as much as possible; the package automatically decides which functions should be vectorized (in our example - was vectorized but mean was not).
You do not need a : prefix in column names, the package uses scoping similar to R to resolve variable names.

As you can see, the R-style syntax is designed for maximum convenience, at the expense of control (a lot of “magic” happens behind the scenes;
admittedly most of the time this magic is what novice users would want).

Conclusions

Here is a recap of what we have discussed:

Meta-packages are here to make life easier for users. There is no need to be afraid of them.
For piping I recommend using Chain.jl.
Use plain DataFrames.jl if you are a die-hard Julia user and want all your code to be valid Julia syntax (I prefer it when writing production stuff).
Use DataFramesMeta.jl if you want an experience most consistent with Base Julia (this is my personal preference for interactive sessions, but it requires most knowledge of Julia).
DataFrameMacros.jl is an in-between package, it adds some more convenience (e.g. vectorization by default), but does not push it to the extreme
(it also has a super convenient {} notation which you might find useful; I decided to skip it to keep the post simple to follow).
TidyData.jl goes for maximum convenience. It follows R-style and tries to guess what you most likely wanted to do. Users with dplyr should be able to start using it immediately.

Sorting data with missing values

Blog by Bogumił Kamiński — Fri, 12 Apr 2024 05:33:45 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/12/sorting.html

Introduction

Sorting is one of the most common operations one wants to do with collections.
In this post I discuss how one can sort data that contain missing values.

The post was written under Julia 1.10.1 and Missings.jl 1.2.0.

General rules of comparison with missing values

By default missing is considered as greater than any other different value it is compared with:

julia> isless(Inf, missing)
true

julia> isless("abc", missing)
true

julia> isless(r"abc", missing)
true

Note, in particular, the last case. Although Regex does not support comparisons it can be compared to missing.
The reason is that isless has a general catch-all definition when one of the arguments is missing. Let us see it:

isless(::Missing, ::Missing) = false
isless(::Missing, ::Any) = false
isless(::Any, ::Missing) = true

The rule that missing is greater than all else has an important consequence when sorting.

Default sorting with missing values

Let us create a simple vector containing missing values:

julia> x = [missing, 3, 1, missing, 2, 4, missing]
7-element Vector{Union{Missing, Int64}}:
  missing
 3
 1
  missing
 2
 4
  missing

If we sort it missing values end up at the end of the produced vector
because, by default, sorting is done in ascending order:

julia> sort(x)
7-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
  missing
  missing
  missing

If we want to get values in descending order missing values come first:

julia> sort(x, rev=true)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 4
 3
 2
 1

But what if we wanted to have values sorted in descending order, but put missing at the end?

Supplementary sorting order

Users often wanted a functionality that would allow them to sort values, but treat missing
as the smallest. This means that if you sort your data in a descending order missing would be put at the end.
Similarly, if you want to sort your data in ascending order missing would be put at the beginning.

With Missings.jl release 1.2 this functionality is supported with the missingsmallest function:

julia> sort(x, lt=missingsmallest)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 1
 2
 3
 4

julia> sort(x, lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, Int64}}:
 4
 3
 2
 1
  missing
  missing
  missing

By default missingsmallest uses the isless comparison.

More advanced cases of treating missing as smallest

Assume that you have the following vector that you want to sort by
the length of the string:

julia> s = [missing, "abc", "x", missing, "bcde", "pq", missing]
7-element Vector{Union{Missing, String}}:
 missing
 "abc"
 "x"
 missing
 "bcde"
 "pq"
 missing

If you try a simple way to do it you get an error:

julia> sort(s, by=length)
ERROR: MethodError: no method matching length(::Missing)

We need to wrap length in passmissing to get what we want:

julia> sort(s, by=passmissing(length))
7-element Vector{Union{Missing, String}}:
 "x"
 "pq"
 "abc"
 "bcde"
 missing
 missing
 missing

julia> sort(s, by=passmissing(length), rev=true)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "bcde"
 "abc"
 "pq"
 "x"

But what if we wanted to treat missing values as smallest?

The first approach is the one we already know:

julia> sort(s, by=passmissing(length), lt=missingsmallest)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, by=passmissing(length), lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

However, there is an alternative. You can define a comparison function that works on strings:

julia> isshorter(s1::AbstractString, s2::AbstractString) = length(s1) < length(s2)
isshorter (generic function with 1 method)

Then you can pass the isshorter function to missingsmallest
as a single argument to generate a comparison function
that automatically treats missing values as smallest:

julia> sort(s, lt=missingsmallest(isshorter))
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, lt=missingsmallest(isshorter), rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

Conclusions

The missingsmallest functionality was added in Missings.jl 1.2.
I hope you will find it useful when working with your data!

Deduplication of rows in DataFrames.jl

Blog by Bogumił Kamiński — Fri, 05 Apr 2024 04:12:34 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/05/duplicates.html

Introduction

Deduplication of rows in a table is one of the basic functionalities that
is often needed when working with data frames. Today I discuss the
allunique, nonunique, unique, and unique! functions that
are provided by DataFrames.jl and can help you with this task.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Checking if a data frame has duplicate rows

Let us start with discussing how one can check if a data frame has duplicate rows
as this is the simplest check and the functionalities that we discuss here
carry-over to other functions that we discuss later.

First create a simple data frame:

julia> using DataFrames

julia> df = DataFrame(x=1:6, y=[1.0, 2.0, 1.0, 2.0, 0.0, -0.0])
6×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     3      1.0
   4 │     4      2.0
   5 │     5      0.0
   6 │     6     -0.0

By just calling the allunique function we can check if whole rows of this data frame are unique:

julia> allunique(df)
true

In this case we get true as indeed all rows are unique. It is guaranteed by the column "x" which holds
consecutive integers.

However, we can pass a second positional argument to allunique. In this case we can narrow down the list of
checked columns:

julia> allunique(df, "y")
false

Here we checked uniqueness of only column "y", which contains duplicates, e.g. row 1 and row 3 contain the same value 1.0,
so we got false.

But this is not all. The second positional argument can be any transformation that is supported by the select function.
Therefore, for example, we can run:

julia> allunique(df, "x" => ByRow(iseven))
false

We got false, as applying the iseven to the x column creates duplicates since we have multiple even and odd values in it.
But e.g. we have:

julia> allunique(df, "x" => ByRow(x -> x^2))
true

Now we get true as squares of consecutive integers are unique.

We can pass several transformations as well:

julia> allunique(df, ["x" => ByRow(x -> mod(x, 3)), "y" => identity])
true

To convince ourselves that the true result is correct let us run the select operation with the same argument:

julia> select(df, ["x" => ByRow(x -> mod(x, 3)), "y" => identity])
6×2 DataFrame
 Row │ x_function  y_identity
     │ Int64       Float64
─────┼────────────────────────
   1 │          1         1.0
   2 │          2         2.0
   3 │          0         1.0
   4 │          1         2.0
   5 │          2         0.0
   6 │          0        -0.0

Indeed the rows produced by this operation are unique.

Finding duplicate rows

To get a vector with indicators of duplicate rows in a data frame use the nonunique function. Here are three examples of its usage
(note it also can take a second positional argument just like allunique):

julia> nonunique(df)
6-element Vector{Bool}:
 0
 0
 0
 0
 0
 0

All rows are unique in df, as we already know, so we got a vector of falses in the call above.

Now the second example:

julia> nonunique(df, "x" => ByRow(iseven))
6-element Vector{Bool}:
 0
 0
 1
 1
 1
 1

Here we see that we get true for all rows for which there was already a duplicate row before. So first two rows get false (non-duplicated)
and the following rows have the true indicator (as we have already seen an even and an odd number in column "x").

Now look at the last example:

julia> nonunique(df, "y")
6-element Vector{Bool}:
 0
 0
 1
 1
 0
 0

You might be surprised by the last false. The reason is that all the de-duplication functions use isequal to compare values for equality,
and 0.0 is not equal to -0.0 in this comparison:

julia> isequal(0.0, -0.0)
false

This behavior matches the way how dictionaries work in Julia.

Additionally the nonunique has a keep keyword argument. It allows us to change the default behavior which rows are marked as duplicate.
If we pass keep=:last then the last of the duplicated rows is marked as unique. See for example:

julia> nonunique(df, "x" => ByRow(iseven); keep=:last)
6-element Vector{Bool}:
 1
 1
 1
 1
 0
 0

We get false in last two rows as 5 and 6 are last even and odd numbers respectively.

The third option is keep=:noduplicates in which case only rows that have no duplicates are marked as unique. So we have:

julia> nonunique(df, "x" => ByRow(iseven); keep=:noduplicates)
6-element Vector{Bool}:
 1
 1
 1
 1
 1
 1

as no row was truly unique, but we have:

julia> nonunique(df, "y"; keep=:noduplicates)
6-element Vector{Bool}:
 1
 1
 1
 1
 0
 0

as first four rows were duplicated, but rows with 0.0 and -0.0 are indeed unique.

Removing duplicate rows from a data frame

The nonunique function returns a vector of duplicate indicators. Often we just want to get rid of them from our data frame.
The unique and unique! functions can be used to perform this operation. They support the same arguments as nonunique.
You have three options how you cen get your result:

using unique you get a new data frame by default;
using unique with view=true keyword argument passed you get a view of the source data frame with duplicates removed;
using unique! you drop the duplicates in-place from the source data frame.

Let us see how it works. First plain unique:

julia> unique(df, "y")
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

We got a new data frame. The df data frame is unchanged. The second option is a view:

julia> unique(df, "y"; view=true)
4×2 SubDataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

Note that still df is untouched:

julia> df
6×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     3      1.0
   4 │     4      2.0
   5 │     5      0.0
   6 │     6     -0.0

And finally we can change the df data frame in place:

julia> unique!(df, "y")
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

julia> df
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

In this case, as you can see, the df data frame was updated.

Conclusions

I hope that you will find this review of the functionalities of the
allunique, nonunique, unique, and unique! functions useful.

As a summary remember that:

You can determine uniqueness of rows based on transformations of data contained in the source data frame.
You can decide which rows are marked as duplicate using the keep keyword argument.

Getting full factorial design in DataFrames.jl

Blog by Bogumił Kamiński — Fri, 29 Mar 2024 08:32:12 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/29/ffd.html

Introduction

Often when working with data we need to get all possible combinations of some
input factors in a data frame. In the field of design of experiments
this is called full factorial design. In this post I will discuss two functions
that DataFrames.jl provides that can help you to generate such designs if
you needed them.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

What is a full factorial design and how to create it?

Assume we are a cardboard box producer have three factors describing a box: its width, height, and depth.
Each of them has a finite set of possible values (due to production process limitations).
Let us create some sample data of this kind:

julia> height = [10, 12]
2-element Vector{Int64}:
 10
 12

julia> width = [8, 10, 15]
3-element Vector{Int64}:
  8
 10
 15

julia> depth = [5, 6]
2-element Vector{Int64}:
 5
 6

Our task is to compute the volume of all possible boxes that can be created by our factory.
The list of all possible cardboard box configurations is a full factorial design.
You can get an iterator of these values by using the Iterators.product function:

julia> Iterators.product(height, width, depth)
Base.Iterators.ProductIterator{Tuple{Vector{Int64}, Vector{Int64}, Vector{Int64}}}(([10, 12], [8, 10, 15], [5, 6]))

This function is lazy, to see the result we need to materialize its return value using e.g. collect:

julia> collect(Iterators.product(height, width, depth))
2×3×2 Array{Tuple{Int64, Int64, Int64}, 3}:
[:, :, 1] =
 (10, 8, 5)  (10, 10, 5)  (10, 15, 5)
 (12, 8, 5)  (12, 10, 5)  (12, 15, 5)

[:, :, 2] =
 (10, 8, 6)  (10, 10, 6)  (10, 15, 6)
 (12, 8, 6)  (12, 10, 6)  (12, 15, 6)

We can see that we get an array of tuples of all possible combinations of dimensions. Let us now compute the volumes:

julia> prod.(collect(Iterators.product(height, width, depth)))
2×3×2 Array{Int64, 3}:
[:, :, 1] =
 400  500  750
 480  600  900

[:, :, 2] =
 480  600   900
 576  720  1080

The results are nice and efficient. However, sometimes it is more convenient to have this data in a data frame.

Full factorial design in DataFrames.jl

Let us repeat the exercise using DataFrames.jl:

julia> using DataFrames

julia> df = allcombinations(DataFrame; height, width, depth)
12×3 DataFrame
 Row │ height  width  depth
     │ Int64   Int64  Int64
─────┼──────────────────────
   1 │     10      8      5
   2 │     12      8      5
   3 │     10     10      5
   4 │     12     10      5
   5 │     10     15      5
   6 │     12     15      5
   7 │     10      8      6
   8 │     12      8      6
   9 │     10     10      6
  10 │     12     10      6
  11 │     10     15      6
  12 │     12     15      6

Note that we passed height, width, depth as keyword arguments to allcombinations taking advantage of a nice functionality of Julia that in this case we can avoid writing height=height as just writing height gives us the same result.

Now we can add a volume column:

julia> transform!(df, All() => ByRow(*) => "volume")
12×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64
─────┼──────────────────────────────
│     10      8      5     400
│     12      8      5     480
│     10     10      5     500
│     12     10      5     600
│     10     15      5     750
│     12     15      5     900
│     10      8      6     480
│     12      8      6     576
│     10     10      6     600
│     12     10      6     720
│     10     15      6     900
│     12     15      6    1080

We have added the "volume" column in place to df. Note that we used * as it can take any number of positional arguments and returns their product. The ByRow wrapper signals that we want to perform this operation row-wise.

In comparison to the solution shown before many users find presentation of a full factorial design easier to work with.

What if we have a fractional factorial design?

Sometimes your data is incomplete, and some level combinations are missing. Let us start by creating such a data frame:

julia> df2 = df[Not(2, 5, 9), :]
9×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64
─────┼──────────────────────────────
   1 │     10      8      5     400
   2 │     10     10      5     500
   3 │     12     10      5     600
   4 │     12     15      5     900
   5 │     10      8      6     480
   6 │     12      8      6     576
   7 │     12     10      6     720
   8 │     10     15      6     900
   9 │     12     15      6    1080

Now one might ask to complete this design and re-fill the design to be complete. This can be done by the fillcombinations function. Let us see it at work:

julia> fillcombinations(df2, Not("volume"))
12×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64?
─────┼───────────────────────────────
│     10      8      5      400
│     12      8      5  missing
│     10     10      5      500
│     12     10      5      600
│     10     15      5  missing
│     12     15      5      900
│     10      8      6      480
│     12      8      6      576
│     10     10      6  missing
│     12     10      6      720
│     10     15      6      900
│     12     15      6     1080

Observe that after calling this function we have created a new data frame with the missing rows added. The "volume" column is filled by default with missing for rows that were added. The Not("volume") argument meant that we want to get all combinations of values in all columns except "volume".

Conclusions

Today we worked with two functions: allcombinations and fillcombinations. You will find them useful if in your work you will ever need to create all combinations of levels of some factors. This functionality seems niche, but it is needed in practice surprisingly often.

ML Project Environment Setup in Julia, a Comprehensive Step-by-step Guide

Julia Frank — Thu, 28 Mar 2024 16:44:13 +0000

By: Julia Frank

Re-posted from: https://juliaifrank.com/ml-project-environment-setup-in-julia/

If you opt for running your ML project code locally on your machine, one of the very first things to do is to take care of the ML environment setup. But why and how?

Storing vectors of vectors in DataFrames.jl

Blog by Bogumił Kamiński — Fri, 22 Mar 2024 04:32:12 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/22/minicontainers.html

Introduction

The beauty of DataFrames.jl design is that you can store any data
as columns of a data frame.
However, this leads to one tricky issue – what if we want to store
a vector as a single cell of a data frame? Today I will explain you
what is exactly the problem and how to solve it.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Basic transformations of columns in DataFrames.jl

Let us start with a simple example:

julia> using DataFrames

julia> df = DataFrame(id=repeat(1:2, 5), x=1:10)
10×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10

We want to group the df data frame by "id" and then store the "x" column unchanged in the result.

This can be done by writing:

julia> combine(groupby(df, "id", sort=true), "x")
10×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      3
   3 │     1      5
   4 │     1      7
   5 │     1      9
   6 │     2      2
   7 │     2      4
   8 │     2      6
   9 │     2      8
  10 │     2     10

Note that the column "x" is expanded into multiple rows by combine. The rule that is applied here states that if some transformation of data returns a vector it gets expanded into multiple rows. The reason for such a behavior is that this is what we want most of the time.

However, what if we would want the vectors to be kept as they are without expanding them?
This can be achieved by writing:

julia> combine(groupby(df, "id", sort=true), "x" => Ref => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

We see that we got what we wanted, but the question is why does it work?
Let me explain.

Containers holding one element in Julia

What we just did with Ref is that we wrapped some value in a container that held exactly one element.
There are three basic ways to create such a container in Julia.
The first is to wrap a vector within another vector:

julia> [[1,2,3]]
1-element Vector{Vector{Int64}}:
 [1, 2, 3]

Above you have a vector that has one element, which is a [1, 2, 3] vector.

The second method is to create a 0-dimensional array with fill:

julia> fill([1,2,3])
0-dimensional Array{Vector{Int64}, 0}:
[1, 2, 3]

The key point here is that 0-dimensional arrays are guaranteed to hold exactly one element (as opposed to a vector presented above).

The third approach is to use Ref:

julia> Ref([1,2,3])
Base.RefValue{Vector{Int64}}([1, 2, 3])

Wrapping an object with Ref also creates a 0-dimensional container. The difference between Ref and fill is that fill creates an array, while Ref is just a container (but not an array).

How to use 1-element containers in DataFrames.jl as wrappers

All three methods described above can be used to ensure that we protect a vector from being expanded into multiple rows. Therefore the following operations give the same output:

julia> combine(groupby(df, "id", sort=true), "x" => (x -> [x]) => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

julia> combine(groupby(df, "id", sort=true), "x" => fill => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

julia> combine(groupby(df, "id", sort=true), "x" => Ref => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

The point is that combine unwraps the outer container (vector, 0-dimensional array, and Ref respectively) and stores its contents as a cell of a data frame.

Now, you might ask why initially I recommended Ref? The reason is that it is the method that has the smallest memory footprint:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> @allocated [x]
64

julia> @allocated fill(x)
64

julia> @allocated Ref(x)
16

This difference is important if you have a huge data frame that has millions of groups.

Also writing Ref is simpler than writing (x -> [x]) .

Aliasing trap

You might have noticed that in the above examples the resulting "x" column held SubArrays? Why it is the case?
To improve performance combine did not copy the inner vectors from the source df data frame, but instead made their views. This is faster and more memory efficient, but results in creating an alias between the source data frame and the result. In many cases this is not a problem.

However, in some cases you might want to avoid it. A most common case is when you later want to mutate df in place, but do not want the result of combine to reflect this change. If you want to de-alias data you need to copy the data in the produced columns. Therefore you should do:

julia> combine(groupby(df, "id", sort=true), "x" => Ref∘copy => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  Array…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

Notice that now the "x" column stores Array (which indicates that the copy was made). The Ref∘copy expression signals function composition. We first applly the copy function to the source data and then pass the result to Ref.

An alternative

Sometimes we want to keep the groups as columns not as rows of a data frame. In this case you can use unstack to achieve the desired result. Here is an example how to do it:

julia> unstack(df, :id, :x, combine=identity)
1×2 DataFrame
 Row │ 1                2
     │ SubArray…?       SubArray…?
─────┼───────────────────────────────────
   1 │ [1, 3, 5, 7, 9]  [2, 4, 6, 8, 10]

and a version copying the underlying data:

julia> unstack(df, :id, :x, combine=copy)
1×2 DataFrame
 Row │ 1                2
     │ Array…?          Array…?
─────┼───────────────────────────────────
   1 │ [1, 3, 5, 7, 9]  [2, 4, 6, 8, 10]

Conclusions

Having read this post you should be comfortable with protecting vectors from being expanded into multiple rows when processing data frames in DataFrames.jl. Enjoy!

Mastering Efficient Array Operations with StaticArrays.jl in Julia

Steven Whitaker — Mon, 18 Mar 2024 15:03:32 +0000

By: Steven Whitaker

Re-posted from: https://blog.glcs.io/staticarrays

The Julia programming languageis known for being a high-level languagethat can still compete with Cin terms of performance.As such,Julia already has performant data structures built-in,such as arrays.But what if arrays could be even faster?That’s where the StaticArrays.jl package comes in.

StaticArrays.jl provides drop-in replacements for Array,the standard Julia array type.These StaticArrays work just like Arrays,but they provide one additional piece of informationin the type:the size of the array.Consequently,you can’t insert or remove elements of a StaticArray;they are statically sized arrays(hence the name).However,this restriction allows more informationto be given to Julia’s compiler,which in turn results in more efficient machine code(for example, via loop unrolling and SIMD operations).The resulting speed-up can often be 10x or more!

In this post,we will learn how to use StaticArrays.jland compare the performance of StaticArraysto that of regular Arraysfor several different operations.

Note that the code examples in this postassume StaticArrays.jl has been installed and loaded:

# Press ] to enter the package prompt.pkg> add StaticArrays# Press Backspace to return to the Julia prompt.julia> using StaticArrays

(Check out our post on the Julia REPLfor more details about the package promptand navigating the REPL.)

How to Use StaticArrays.jl

When working with StaticArrays.jl,typically one will use the SVector typeor the SMatrix type.(There is also the SArray type for N-dimensional arrays,but we will focus on 1D and 2D arrays in this post.)SVectors and SMatrixes have both static sizeand static data,meaning the data contained in such objectscannot be modified.For statically sized arrayswhose contents can be modified,StaticArrays.jl provides MVector and MMatrix (and MArray).We will stick with SVectors and SMatrixes in this postunless we specifically need mutability.

Constructors

There are three ways to construct StaticArrays.

Convenience constructor SA:

julia> SA[1, 2, 3]3-element SVector{3, Int64} with indices SOneTo(3): 1 2 3julia> SA[1 2; 3 4]22 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)SOneTo(2): 1  2 3  4

Normal constructor functions:

julia> SVector(1, 2)2-element SVector{2, Int64} with indices SOneTo(2): 1 2julia> SMatrix{2,3}(1, 2, 3, 4, 5, 6)23 SMatrix{2, 3, Int64, 6} with indices SOneTo(2)SOneTo(3): 1  3  5 2  4  6

Macros:

julia> @SVector [1, 2, 3]3-element SVector{3, Int64} with indices SOneTo(3): 1 2 3julia> @SMatrix [1 2; 3 4]22 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)SOneTo(2): 1  2 3  4

Note that using macrosalso enables a convenient wayto create StaticArrays from common array-creation functions(eliminating the need to create an Array firstjust to convert it immediately to a StaticArray):

@SVector [10 * i for i = 1:10]@SVector zeros(5)@SVector rand(7)@SMatrix [(i, j) for i = 1:2, j = 1:3]@SMatrix zeros(2, 2)@SMatrix randn(6, 6)

Conversion to/from `Array`

It may occasionally be necessaryto convert to or from Arrays.To convert from an Array to a StaticArray,use the appropriate constructor function.However, because Arrays do not have size information in the type,we ourselves must provide the size to the constructor:

SVector{3}([1, 2, 3])SMatrix{4,4}(zeros(4, 4))

To convert back to an Array, use the collect function:

julia> collect(SVector(1, 2))2-element Vector{Int64}: 1 2

Comparing `StaticArray`s to `Array`s

Once a StaticArray is created,it can be operated on in the same wayas an Array.To illustrate,we will run a simple benchmark,both to compare the run-time speedsof the two types of arraysand to show that the same code can workwith either type of array.

Here’s the benchmark code,inspired by StaticArrays.jl’s benchmark:

using BenchmarkTools, StaticArrays, LinearAlgebra, Printfadd!(C, A, B) = C .= A .+ Bfunction run_benchmarks(N)    A = rand(N, N); A = A' * A    B = rand(N, N)    C = Matrix{eltype(A)}(undef, N, N)    D = rand(N)    SA = SMatrix{N,N}(A)    SB = SMatrix{N,N}(B)    MA = MMatrix{N,N}(A)    MB = MMatrix{N,N}(B)    MC = MMatrix{N,N}(C)    SD = SVector{N}(D)    speedup = [        @belapsed($A + $B) / @belapsed($SA + $SB),        @belapsed(add!($C, $A, $B)) / @belapsed(add!($MC, $MA, $MB)),        @belapsed($A * $B) / @belapsed($SA * $SB),        @belapsed(mul!($C, $A, $B)) / @belapsed(mul!($MC, $MA, $MB)),        @belapsed(norm($D)) / @belapsed(norm($SD)),        @belapsed(det($A)) / @belapsed(det($SA)),        @belapsed(inv($A)) / @belapsed(inv($SA)),        @belapsed($A \ $D) / @belapsed($SA \ $SD),        @belapsed(eigen($A)) / @belapsed(eigen($SA)),        @belapsed(map(abs, $A)) / @belapsed(map(abs, $SA)),        @belapsed(sum($D)) / @belapsed(sum($SD)),        @belapsed(sort($D)) / @belapsed(sort($SD)),    ]    return speedupendfunction main()    benchmarks = [        "Addition",        "Addition (in-place)",        "Multiplication",        "Multiplication (in-place)",        "L2 Norm",        "Determinant",        "Inverse",        "Linear Solve (A \\ b)",        "Symmetric Eigendecomposition",        "`map`",        "Sum of Elements",        "Sorting",    ]    N = [3, 5, 10, 30]    speedups = map(run_benchmarks, N)    fmt_header = Printf.Format("%-$(maximum(length.(benchmarks)))s" * " | %7s"^length(N))    header = Printf.format(fmt_header, "Benchmark", string.("N = ", N)...)    println(header)    println("="^length(header))    fmt = Printf.Format("%-$(maximum(length.(benchmarks)))s" * " | %7.1f"^length(N))    for i = 1:length(benchmarks)        println(Printf.format(fmt, benchmarks[i], getindex.(speedups, i)...))    endendmain()

Notice that all the functions calledwhen creating the array speedupin run_benchmarksare the same whether using Arrays or StaticArrays,illustrating that StaticArraysare drop-in replacements for standard Arrays.

Running the above codeprints the following results on my laptop(the numbers indicate the speedupof StaticArrays over normal Arrays;e.g., a value of 17.7 meansusing StaticArrays was 17.7 times fasterthan using Arrays):

Benchmark                    |   N = 3 |   N = 5 |  N = 10 |  N = 30====================================================================Addition                     |    17.7 |    14.5 |     7.9 |     2.0Addition (in-place)          |     1.6 |     1.3 |     1.4 |     0.7Multiplication               |     8.2 |     7.0 |     4.2 |     2.6Multiplication (in-place)    |     1.9 |     5.9 |     3.0 |     1.0L2 Norm                      |     4.2 |     4.0 |     5.4 |     9.7Determinant                  |    66.6 |     2.5 |     1.3 |     0.9Inverse                      |    54.8 |     5.9 |     1.8 |     0.9Linear Solve (A \ b)         |    65.5 |     3.7 |     1.8 |     0.9Symmetric Eigendecomposition |     3.7 |     1.0 |     1.0 |     1.0`map`                        |    10.6 |     8.2 |     4.9 |     2.1Sum of Elements              |     1.5 |     1.1 |     1.7 |     2.1Sorting                      |     7.1 |     2.9 |     1.5 |     1.1

There are two main conclusions from this table.First,using StaticArrays instead of Arrayscan result in some nice speed-ups!Second,the gains from using StaticArrays tend to diminishas the sizes of the arrays increase.So,you can’t expect StaticArrays.jlto always magically make your code faster,but if your arrays are small enough(the recommendation being fewer than about 100 elements)then you can expect to see some good speed-ups.

Of course,the above code timed just individual operations;how much faster a particular application would beis a different matter.

For example,consider a physical simulationwhere many 3D vectorsare manipulated over several time steps.Since 3D vectors are static in size(i.e., are 1D arrays with exactly three elements),such a situation is a prime exampleof where StaticArrays.jl is useful.To illustrate,here is an example(taken from the field of magnetic resonance imaging)of a physical simulationusing Arrays vs using StaticArrays:

using BenchmarkTools, StaticArrays, LinearAlgebrafunction sim_arrays(N)    M = Matrix{Float64}(undef, 3, N)    M[1,:] .= 0.0    M[2,:] .= 0.0    M[3,:] .= 1.0    M2 = similar(M)    (sin, cos) = sincosd(30)    R = [1 0 0; 0 cos sin; 0 -sin cos]    E1 = exp(-0.01)    E2 = exp(-0.1)    (sin, cos) = sincosd(1)    F = [E2 * cos E2 * sin 0; -E2 * sin E2 * cos 0; 0 0 E1]    FR = F * R    C = [0, 0, 1 - E1]    # Run for 100 time steps (each loop iteration does 2 time steps).    for t = 1:50        mul!(M2, FR, M)        M2 .+= C        mul!(M, FR, M2)        M .+= C    end    total = sum(M; dims = 2)    return complex(total[1], total[2])endfunction sim_staticarrays(N)    M = fill(SVector(0.0, 0.0, 1.0), N)    (sin, cos) = sincosd(30)    R = @SMatrix [1 0 0; 0 cos sin; 0 -sin cos]    E1 = exp(-0.01)    E2 = exp(-0.1)    (sin, cos) = sincosd(1)    F = @SMatrix [E2 * cos E2 * sin 0; -E2 * sin E2 * cos 0; 0 0 E1]    FR = F * R    C = @SVector [0, 0, 1 - E1]    # Run for 100 time steps (each loop iteration does 1 time step).    for t = 1:100        # Apply simulation dynamics to each 3D vector.        for i = 1:length(M)            M[i] = FR * M[i] + C        end    end    total = sum(M)    return complex(total[1], total[2])endfunction main(N)    r1 = @btime sim_arrays($N)    r2 = @btime sim_staticarrays($N)    @assert r1  r2 # Make sure the results are the same.end

The speed-ups on my laptopfor different values of Nwere as follows:

N = 10: 14.6x faster
N = 100: 7.1x faster
N = 1000: 5.2x faster

(Here, N is the number of 3D vectors in the simulation,not the size of the StaticArrays.)

Note also that I wrote sim_arraysto be as performant as possibleby doing in-place operations(like mul!),which has the unfortunate side effectof making the code a bit harder to read.Therefore,sim_staticarrays is both faster and easier to read!

As another exampleof how StaticArrays.jlcan speed up a more involved application,see the DifferentialEquations.jl docs.

Summary

In this post,we discussed StaticArrays.jl.We saw that StaticArrays are drop-in replacementsfor regular Julia Arrays.We also saw that using StaticArrayscan result in some nice speed-upsover using Arrays,at least when the sizes of the arraysare not too big.

Are array operations a bottleneck in your code?Try out StaticArrays.jland then comment below how it helps!

Additional Links

StaticArrays.jl Docs
- Documentation for StaticArrays.jl.

Cover image background fromhttps://openverse.org/image/875bf026-11ef-47a8-a63c-ee1f1877c156?q=circuit%20board%20array.

Transforming multiple columns in DataFrames.jl

Blog by Bogumił Kamiński — Fri, 15 Mar 2024 11:32:12 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/15/transforms.html

Introduction

Today I want to comment on a recurring topic that DataFrames.jl users raise.
The question is how one should transform multiple columns of a data frame using
operation specification syntax.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

What is operation specification syntax?

In DataFrames.jl the combine, select, and transform functions allow
users for passing the requests for data transformation using operation
specification syntax. This syntax is feature-rich, and you can find its
description for example here. Today I want to focus on its principal concept.

In a general form each request for making an operation on data has the (E)xtract-(T)ransform-(L)oad form.
That means that we need to specify:

source columns to get data from (the extract part);;
the operation to apply to these columns (the transform part);
the target columns where we want to store the result of the operation (the load part).

These tree parts are syntactically expressed using the following form:

[source columns specification] => [transformation function] => [target columns specification]

Let me give an example. Assume you have the following data:

julia> using DataFrames

julia> df = DataFrame(reshape(1:15, 5, 3), :auto)
5×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      6     11
   2 │     2      7     12
   3 │     3      8     13
   4 │     4      9     14
   5 │     5     10     15

We want to compute the sum of column "x1" and store it in column names "x1_sum"
Since the sum function performs the addition operation the syntax specification should be:

"x1" => sum => "x1_sum"

Let us check it with the combine function:

julia> combine(df, "x1" => sum => "x1_sum")
1×1 DataFrame
 Row │ x1_sum
     │ Int64
─────┼────────
   1 │     15

In this syntax it is important to note two things:

the "x1" column as a whole was passed to the sum function (as we want to compute its sum);
the "x1" column is a single positional argument passed to the sum function.

Two natural questions that arise are the following:

What if I do not want to perform an operation on a whole column, but on its elements (a.k.a. vectorization of operation)?
What if I want to pass multiple columns as a source for computations?

We will now investigate these two dimensions.

Vectorization of operations

Vectorization in DataFrames.jl is easy. Just wrap the function you use in the ByRow object. Here is an example:

julia> combine(df, "x1" => string => "x1_str")
1×1 DataFrame
 Row │ x1_str
     │ String
─────┼─────────────────
   1 │ [1, 2, 3, 4, 5]

julia> combine(df, "x1" => ByRow(string) => "x1_strs")
5×1 DataFrame
 Row │ x1_strs
     │ String
─────┼─────────
   1 │ 1
   2 │ 2
   3 │ 3
   4 │ 4
   5 │ 5

Note that "x1" => string => "x1_str" passed the whole "x1" column to the string function so we got a single "[1, 2, 3, 4, 5]"
string in the output.

While writing "x1" => ByRow(string) => "x1_strs" passed each element of "x1" column to the string function individually,
so in the result we got a vector of five string representations of numbers of the numbers from the source.

Passing multiple columns

Now let us have a look at passing multiple columns. There are two ways you can do it.

The first is when your function accepts multiple positional arguments. An example of such function is string see:

julia> string(df.x1, df.x2)
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"

If we pass a collection of columns as a source in operation specification syntax we get this behavior:

julia> combine(df, ["x1", "x2"] => string => "x1_x2_str")
1×1 DataFrame
 Row │ x1_x2_str
     │ String
─────┼─────────────────────────────────
   1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]

Naturally, the above combines with vectorization. Therefore since:

julia> string.(df.x1, df.x2)
5-element Vector{String}:
 "16"
 "27"
 "38"
 "49"
 "510"

we also have:

julia> combine(df, ["x1", "x2"] => ByRow(string) => "x1_x2_strs")
5×1 DataFrame
 Row │ x1_x2_strs
     │ String
─────┼────────────
   1 │ 16
   2 │ 27
   3 │ 38
   4 │ 49
   5 │ 510

However, there are cases when we have a function that expects multiple columns to be passed as a single positional argument.
This is handled in DataFrames.jl with the AsTable wrapper, which you can apply to the source columns.
If you use it then instead of getting multiple positional arguments the function will get a single positional argument
that will be a NamedTuple holding the source columns.

To convince ourselves that this is indeed what happens let us create a helper function:

julia> function helper(x)
           @show x
           return string(x.x1, x.x2)
       end
helper (generic function with 1 method)

This helper function first prints us its only argument x and next assumes that it has x1 and x2 fields and applies the string function to them.
Let us first check it in practice:

julia> helper((x1=[1, 2, 3, 4, 5], x2=[6, 7, 8, 9, 10]))
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"

Now let us use the helper function with combine:

julia> combine(df, AsTable(["x1", "x2"]) => helper => "x1_x2_str")
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
1×1 DataFrame
 Row │ x1_x2_str
     │ String
─────┼─────────────────────────────────
   1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]

Indeed, we see that helper got a named tuple holding two columns of the source data frame.

Again, this syntax plays well with ByRow:

julia> combine(df, AsTable(["x1", "x2"]) => ByRow(helper) => "x1_x2_strs")
x = (x1 = 1, x2 = 6)
x = (x1 = 2, x2 = 7)
x = (x1 = 3, x2 = 8)
x = (x1 = 4, x2 = 9)
x = (x1 = 5, x2 = 10)
5×1 DataFrame
 Row │ x1_x2_strs
     │ String
─────┼────────────
   1 │ 16
   2 │ 27
   3 │ 38
   4 │ 49
   5 │ 510

We see that this time helper got a separate named tuple for each row of source data frame.

Conclusions

In summary today we discussed two special operations in DataFrames.jl operation specification syntax:

the ByRow which vectorizes the function passed to it;
the AsTable which allows us to pass source columns as a single named tuple to the transformation function
(instead of passing them as consecutive positional arguments, which is the default).

I hope these examples were useful in helping you understand the design of operation specification syntax.

Calibrating an Ornstein–Uhlenbeck Process

Dean Markwick's Blog -- Julia — Sat, 09 Mar 2024 00:00:00 +0000

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2024/03/09/Calibrating-an-Ornstein-Uhlenbeck-Process.html

Read enough quant finance papers or books and you’ll come across the
Ornstein–Uhlenbeck (OU) process. This is a post that explores the OU
process, the equations, how we can simulate such a process and then estimate the parameters.

Enjoy these types of posts? Then you should sign up for my newsletter.

I’ve briefly touched on mean reversion and OU processes before in my
Stat Arb – An Easy Walkthrough
blog post where we modelled the spread between an asset and its
respective ETF. The whole concept of ‘mean reversion’ is something
that comes up frequently in finance and at different time scales. It
can be thought of as the first basic extension as Brownian motion and
instead of things moving randomly there is now a slight structure
where it be oscillating around a constant value.

The Hudson Thames group have a similar post on OU processes (Mean-Reverting Spread Modeling: Caveats in Calibrating the OU Process) and
my post should be a nice compliment with code and some extensions.

The Ornstein-Uhlenbeck Equation

As a continuous process, we write the change in \(X_t\) as an increment in time and some noise

\[\mathrm{d}X_t = \theta (\mu – x_t) \mathrm{d}t + \sigma \mathrm{d}W_t\]

The amount it changes in time depends on the previous \(X_t\) and to free parameters \(\mu\) and \(\theta\).

The \(\mu\) is the long-term drift of the process
The \(\theta\) is the mean reversion or momentum parameter depending on the sign.

If \(\theta\) is 0 we can see the equation collapses down to a simple random walk.

If we assume \(\mu = 0\), so the long-term average is 0, then a positive value of \(\theta\) means we see mean reversion. Large values of \(X\) mean the next change is likely to have a negative sign, leading to a smaller value in \(X\).

A negative value of \(\theta\) means the opposite and we end up with a large value in X generating a further large positive change and the process explodes.
E
If discretise the process we can simulate some samples with different parameters to illustrate these two modes.

\[X_{t+1} – X_t = \theta (\mu – X_t) \Delta t + \sigma \sqrt{\Delta t} W_t\]

where \(W_t \sim N(0,1)\).

which is easy to write out in Julia. We can save some time by drawing the random values first and then just summing everything together.

using Distributions, Plots

function simulate_os(theta, mu, sigma, dt, maxT, initial)
    p = Array{Float64}(undef, length(0:dt:maxT))
    p[1] = initial
    w = sigma * rand(Normal(), length(p)) * sqrt(dt)
    for i in 1:(length(p)-1)
        p[i+1] = p[i] + theta*(mu-p[i])*dt + w[i]
    end
    return p
end

We have two classes of OU processes we want to simulate, a mean
reverting \(\theta > 0\) and a momentum version (\(\theta < 0\)) and
we also want to simulate a random walk at the same time, so \(\theta =
0\). We will assume \(\mu = 0\) which keeps the pictures simple.

maxT = 5
dt = 1/(60*60)
vol = 0.005

initial = 0.00*rand(Normal())

p1 = simulate_os(-0.5, 0, vol, dt, maxT, initial)
p2 = simulate_os(0.5, 0, vol, dt, maxT, initial)
p3 = simulate_os(0, 0, vol, dt, maxT, initial)

plot(0:dt:maxT, p1, label = "Momentum")
plot!(0:dt:maxT, p2, label = "Mean Reversion")
plot!(0:dt:maxT, p3, label = "Random Walk")

The mean reversion (orange) hasn’t moved away from the long-term average (\(\mu=0\)) and the momentum has diverged the furthest from the starting point, which lines up with the name. The random walk, inbetween both as we would expect.

Now we have successfully simulated the process we want to try and
estimate the \(\theta\) parameter from the simulation. We have two
slightly different (but similar methods) to achieve this.

OLS Calibration of an OU Process

When we look at the generating equation we can simply rearrange it into a linear equation.

\[\Delta X = \theta \mu \Delta t – \theta \Delta t X_t + \epsilon\]

and the usual OLS equation

\[y = \alpha + \beta X + \epsilon\]

such that

\[\alpha = \theta \mu \Delta t\]

\[\beta = -\theta \Delta t\]

where \(\epsilon\) is the noise. So we just need a DataFrame with the difference between subsequent observations and relate that to the current observation. Just a diff and a shift.

using DataFrames, DataFramesMeta
momData = DataFrame(y=p1)
momData = @transform(momData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

Then using the standard OLS process from the GLM package.

mdl = lm(@formula(diffY ~ prevY), momData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

Which gives us \(\mu = 0.0075, \theta = -0.3989\), so close to zero
for the drift and the reversion parameter has the correct sign.

Doing the same for the mean reversion data.

mdl = lm(@formula(diffY ~ prevY), revData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

This time \(\mu = 0.001\) and \(\theta = 1.2797\). So a little wrong
compared to the true values, but at least the correct sign.

Does Bootstrapping Help?

It could be that we need more data, so we use the bootstrap to randomly sample from the population to give us pseudo-new draws. We use the DataFrames again and pull random rows with replacement to build out the data set. We do this sampling 1000 times.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), momData[sample(2:nrow(momData), nrow(momData), replace=true), :])
    res[i] = -first(coef(mdl)/dt)
end

bootMom = histogram(res, label = :none, title = "Momentum", color = "#7570b3")
bootMom = vline!(bootMom, [-0.5], label = "Truth", momentum = 2)
bootMom = vline!(bootMom, [0.0], label = :none, color = "black")

We then do the same for the reversion data.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), revData[sample(2:nrow(revData), nrow(revData), replace=true), :])
    res[i] = first(-coef(mdl)/dt)
end

bootRev = histogram(res, label = :none, title = "Reversion", color = "#1b9e77")
bootRev = vline!(bootRev, [0.5], label = "Truth", lw = 2)
bootRev = vline!(bootRev, [0.0], label = :none, color = "black")

Then combining both the graphs into one plot.

plot(bootMom, bootRev, 
  layout=(2,1),dpi=900, size=(800, 300),
  background_color=:transparent, foreground_color=:black,
     link=:all)

The momentum bootstrap has worked and centred around the correct
value, but the same cannot be said for the reversion plot. However, it
has correctly guessed the sign.

AR(1) Calibration of a OU Process

If we continue assuming that \(\mu = 0\) then we can simplify the OLS
to a 1-parameter regression – OLS without an intercept. From the
generating process, we can see that this is an AR(1) process – each
observation depends on the previous observation by some amount.

\[\phi = \frac{\sum _i X_i X_{i-1}}{\sum _i X_{i-1}^2}\]

then the reversion parameter is calculated as

\[\theta = – \frac{\log \phi}{\Delta t}\]

This gives us a simple equation to calculate \(\theta\) now.

For the momentum sample:

phi = sum(p1[2:end] .* p1[1:(end-1)]) / sum(p1[1:(end-1)] .^2)
-log(phi)/dt

Givens \(\theta = -0.50184\), so very close to the true value.

For the reversion sample

phi = sum(p2[2:end] .* p2[1:(end-1)]) / sum(p2[1:(end-1)] .^2)
-log(phi)/dt

Gives \(\theta = 1.26\), so correct sign, but quite a way off.

Finally, for the random walk

phi = sum(p3[2:end] .* p3[1:(end-1)]) / sum(p3[1:(end-1)] .^2)
-log(phi)/dt

Produces \(\theta = -0.027\), so quite close to zero.

Again, values are similar to what we expect, so our estimation process
appears to be working.

Using Multiple Samples for Calibrating an OU Process

If you aren’t convinced I don’t blame you. Those point estimates above are nowhere near the actual values that simulated the data so it’s hard to believe the estimation method is working. Instead, what we need to do is repeat the process and generate many more price paths and estimate the parameters of each one.

To make things a bit more manageable code-wise though I’m going to
introduce a struct that contains the parameters and allows to
simulate and estimate in a more contained manner.

struct OUProcess
    theta
    mu 
    sigma
    dt
    maxT
    initial
end

We now write specific functions for this object and this allows us to
simplify the code slightly.

function simulate(ou::OUProcess)
    simulate_os(ou.theta, ou.mu, ou.sigma, ou.dt, ou.maxT, ou.initial)
end

function estimate(ou::OUProcess)
   p = simulate(ou)
   phi =  sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^2)
   -log(phi)/ou.dt
end

function estimate(ou::OUProcess, N)
    res = zeros(N)
    for i in 1:N
        p = simulate(ou)
        res[i] = estimate(ou)
    end
    res
end

We use these new functions to draw from the process 1,000 times and
sample the parameters for each one, collecting the results as an
array.

ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
revPlot = histogram(estimate(ou, 1000), label = :none, title = "Reversion")
vline!(revPlot, [0.5], label = :none);

And the same for the momentum OU process

ou = OUProcess(-0.5, 0.0, vol, dt, maxT, initial)
momPlot = histogram(estimate(ou, 1000), label = :none, title = "Momentum")
vline!(momPlot, [-0.5], label = :none);

Plotting the distribution of the results gives us a decent
understanding of how varied the samples can be.

plot(revPlot, momPlot, layout = (2,1), link=:all)

We can see the heavy-tailed nature of the estimation process, but
thankfully the histograms are centred around the correct number. This
goes to show how difficult it is to estimate the mean reversion
parameter even in this simple setup. So for a real dataset, you need to
work out how to collect more samples or radically adjust how accurate
you think your estimate is.

Summary

We have progressed from simulating an Ornstein-Uhlenbeck process to
estimating its parameters using various methods. We attempted to
enhance the accuracy of the estimates through bootstrapping, but we
discovered that the best approach to improve the estimation is to have
multiple samples.

So if you are trying to fit this type of process on some real world
data, be it the spread between two stocks
(Statistical Arbitrage in the U.S. Equities Market),
client flow (Unwinding Stochastic Order Flow: When to Warehouse Trades) or anything
else you believe might be mean reverting, then understand how much
data you might need to accurately model the process.

Working with a grouped data frame, part 2

Blog by Bogumił Kamiński — Fri, 08 Mar 2024 08:32:12 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/08/gdf.html

Introduction

This is a follow up to the post from last week. We will continue
discussing how one can work with GroupedDataFrame objects in DataFrames.jl.
Today we focus on indexing of grouped data frames.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Warm-up: getting group indices

First create some grouped data frame:

julia> using DataFrames

julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
                      str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> gdf = groupby(df, :str, sort=true)
GroupedDataFrame with 3 groups based on key: str
First Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
⋮
Last Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

It is sometimes useful to learn what is a group number of each row of the source data frame df in a grouped data frame gdf.
You can easily get this information with groupindices:

julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
 1
 1
 3
 3
 2
 2

Extracting a single group

A basic operation when indexing a GroupedDataFrame is to pick a group by its number. Here is an example:

julia> gdf[1]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[2]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[3]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

Note, that gdf behaves similarly to a vector. You can even use begin and end in indexing:

julia> gdf[begin]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[end]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c

Often you might want to extract a group not by its position in gdf, but by the value of the grouping
variable or variables. In this case you can use GroupKey, dictionary, tuple, or named tuple to achieve this.

Let us check how it works. Start with dictionary, tuple, and named tuple:

julia> gdf[Dict("str" => "b")] # dictionary
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[("b",)] # tuple
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

julia> gdf[(; str="b")] # named tuple
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

With GroupKey we first need to get it from keys, but everything else works the same:

julia> key = keys(gdf)[1]
GroupKey: (str = "a",)

julia> gdf[key]
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

You might ask why we require passing grouping variable in a container (dictionary, tuple, named tuple, GroupKey)
and not directly pass the required value when indexing? The reason is that if you grouped your data by integer column
the result would be ambiguous. Here is an example showing that under the defined rules there is no such ambiguity:

julia> gdf2 = groupby(df, :int, sort=false)
GroupedDataFrame with 3 groups based on key: int
First Group (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
⋮
Last Group (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> gdf2[3] # third group
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> gdf2[(3, )] # group with value of the grouping variable equal to 3
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b

Extracting multiple groups

You now know how to pick a single group, so selecting multiple groups is a natural next step.
You can use a collection of any of the selectors we have already discussed. Here are some examples:

julia> gdf[[3, 1]] # selection by group number
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
⋮
Last Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

julia> gdf[[("c",), ("a",)]] # selection by grouping variable value
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
⋮
Last Group (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a

Note that indexing allows both for reordering and for dropping groups, which often comes handy when analyzing data.
Also note that groupindices is aware of such changes:

julia> groupindices(gdf[[3, 1]])
6-element Vector{Union{Missing, Int64}}:
 2
 2
 1
 1
  missing
  missing

Here group with "c" is first, with "a" is second and with "b" is dropped, so missing is returned in the produced vector.

It is also worth to remember that subset and filter can be used with GroupedDataFrames. This topic is discussed in this post.

Key lookup

Sometimes we do not want to index into a grouped data frame, but just check if it contains some key. This is easily achievable with the haskey function:

julia> haskey(gdf, ("a",))
true

julia> haskey(gdf, ("z",))
false

Conclusions

In this post we discussed indexing of GroupedDataFrames. This concludes the basic tutorial of working with these data structures.
I hope you will find the functionalities I have covered useful in your work.

Working with a grouped data frame, part 1

Blog by Bogumił Kamiński — Fri, 01 Mar 2024 14:32:12 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/01/gdf.html

Introduction

One of the features of DataFrames.jl that I often find useful is that when you group
a data frame by some of its columns the resulting GroupedDataFrame is an object
that gains new and useful functionalities.

Some time ago I have discussed how GroupedDataFrame can be filtered. You can find
this post here. In this post and the following one that I plan to write next
week I thought that it would be useful to review other key functionalities of
a GroupedDataFrame.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Creating a grouped data frame

You can create a GroupedDataFrame using the groupby function.

Here are some examples:

julia> using DataFrames

julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
                      str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> show(groupby(df, :int), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
Group 3 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
julia> show(groupby(df, :int; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
Group 3 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
julia> show(groupby(df, :int; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
Group 2 (2 rows): int = 3
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
Group 3 (2 rows): int = 2
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b
julia> show(groupby(df, :str), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
Group 3 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b
julia> show(groupby(df, :str; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b
Group 3 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
julia> show(groupby(df, :str; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
Group 2 (2 rows): str = "c"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     1  c
Group 3 (2 rows): str = "b"
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  b

What this example shows is that the key thing you need to remember
to decide about a grouped data frame is the order of groups.

There are two options here:

groups sorted by the grouping column value, when you pass sort=true;
groups sorted by the order of appearance of values in the source, when you pass sort=true.

You might ask what happens if you do not pass the sort keyword argument?
In this case either of the options is used depending on which one is faster.
Therefore, omitting sort, can be thought of as an information that the user does not
care about the order of groups but wants the grouping operation to be as fast as possible.

When does the order of groups not matter?

In some cases the order of groups is irrelevant (so you can safely skip passing it).
The most important scenario of this kind is when you use the select or transform function
with a GroupedDataFrame. The reason is that these functions anyway always keep the order of
rows from the source data frame (no matter how the groups are rearranged in a GroupedDataFrame).
However, it is not the case with combine, as it respects the order of groups in a GroupedDataFrame.

Let us see an example highlighting the difference between these cases:

julia> select(groupby(df, :int, sort=true), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> combine(groupby(df, :int, sort=true), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
   3 │     2  c
   4 │     2  b
   5 │     3  a
   6 │     3  b

julia> select(groupby(df, :int, sort=false), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     3  a
   3 │     2  c
   4 │     1  c
   5 │     3  b
   6 │     2  b

julia> combine(groupby(df, :int, sort=false), :str)
6×2 DataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
   3 │     3  a
   4 │     3  b
   5 │     2  c
   6 │     2  b

As you can see select kept the rows in the order in which they are present in df no matter if we
passed sort=true or sort=false. On the other hand combine returns rows grouped by the groups and
the order of groups corresponds to their order in GroupedDataFrame, so passing sort=true or
sort=false in general changes.

Special operation specification syntax for working with grouped data frames

When discussing select or combine in conjunction with GroupedDataFrame it is important to mention
that there are four special cases of operation specification syntax designed specifically for working with
them. They are:

nrow to compute the number of rows in each group.
proprow to compute the proportion of rows in each group.
eachindex to return a vector holding the number of each row within each group.
groupindices to return the group number.

Each of them optionally allows you to specify the name of the target column by => syntax.
Here are some examples:

julia> combine(groupby(df, :int, sort=false), nrow)
3×2 DataFrame
 Row │ int    nrow
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      2
   3 │     2      2

julia> combine(groupby(df, :int, sort=false), proprow => "row %")
3×2 DataFrame
 Row │ int    row %
     │ Int64  Float64
─────┼─────────────────
   1 │     1  0.333333
   2 │     3  0.333333
   3 │     2  0.333333

julia> combine(groupby(df, :int, sort=false), eachindex)
6×2 DataFrame
 Row │ int    eachindex
     │ Int64  Int64
─────┼──────────────────
   1 │     1          1
   2 │     1          2
   3 │     3          1
   4 │     3          2
   5 │     2          1
   6 │     2          2

julia> combine(groupby(df, :int, sort=false), groupindices => "group #")
3×2 DataFrame
 Row │ int    group #
     │ Int64  Int64
─────┼────────────────
   1 │     1        1
   2 │     3        2
   3 │     2        3

Iterating a grouped data frame

Apart from using functions such as select or combine on a GroupedDataFrame it is useful to know
that it supports iteration. Therefore you can use a GroupedDataFrame in a loop or in a comprehension.
When iterated GroupedDataFrame returns data frames corresponding to the groups. Let us see:

julia> for v in groupby(df, :int, sort=false)
           println(v)
       end
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> [v for v in groupby(df, :int, sort=false)]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

julia> collect(groupby(df, :int, sort=false))
3-element Vector{Any}:
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

The last example has shown you that you can pass a GroupedDataFrame to a function expecting an iterable, in this case the collect function. The one exception to this rule is that you cannot use GroupedDataFrame with the map function directly:

julia> map(identity, groupby(df, :int, sort=false))
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved

The reason is that it was not clear if such operation should produce a vector or a data frame, and it is easy enough to achieve both results with other means. If you want e vector use e.g. a comprehension. If you want a data frame use e.g. combine or select.

Advanced iteration

Sometimes, when iterating a GroupedDataFrame we might be interested not only in a data frame per group, but also in a value of grouping variable. This is easily achieved with the keys and pairs functions (depending on whether you only want grouping values or both grouping values and data frames):

julia> map(identity, keys(groupby(df, :int, sort=false)))
3-element Vector{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}}:
 GroupKey: (int = 1,)
 GroupKey: (int = 3,)
 GroupKey: (int = 2,)

julia> map(identity, pairs(groupby(df, :int, sort=false)))
3-element Vector{Pair{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}, SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}}:
 GroupKey: (int = 1,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  c
 GroupKey: (int = 3,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     3  a
   2 │     3  b
 GroupKey: (int = 2,) => 2×2 SubDataFrame
 Row │ int    str
     │ Int64  String
─────┼───────────────
   1 │     2  c
   2 │     2  b

I used the map function to show you that it is only reserved to use it with plain GroupedDataFrame.

Working with group keys

As you can see in this example each group in a GroupedDataFrame is associated with a GroupKey. To get all
keys use the keys function:

julia> keys(groupby(df, :int, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (int = 1,)
 GroupKey: (int = 3,)
 GroupKey: (int = 2,)

Let us, as an example extract the last key so see how one can work with it:

julia> key = last(keys(groupby(df, :int, sort=false)))
GroupKey: (int = 2,)

You can get a value of the key by property access or indexing:

julia> key.int
2

julia> key[1]
2

julia> key["int"]
2

julia> key[:int]
2

It is also easy co convert GroupKey to a dictionary, vector, Tuple or NamedTuple if you would need it:

julia> Dict(key)
Dict{Symbol, Int64} with 1 entry:
  :int => 2

julia> collect(key)
1-element Vector{Int64}:
 2

julia> Tuple(key)
(2,)

julia> NamedTuple(key)
(int = 2,)

Note that, in general, you can group a data frame by multiple columns so you could query value of any grouping column
in the examples above. If you needed to get a list of grouping columns use the groupcols function:

julia> groupcols(groupby(df, :int, sort=false))
1-element Vector{Symbol}:
 :int

Conclusions

In this post we have learned how one can create a grouped data frame and how to choose the order of groups in it.
As a follow-up we have shown how GroupedDataFrame interacts with functions like select or combine.
Next we discussed iterator interface support by GroupedDataFrame and how to get and use information about
values of grouping columns for each group. I hope you found these examples useful.

In the post next week we will discuss how GroupedDataFrame supports the indexing interface.

GSoC in LLVM 2024

Miguel Raz Guzmán Macedo — Thu, 29 Feb 2024 00:00:00 +0000

By: Miguel Raz Guzmán Macedo

Re-posted from: https://miguelraz.github.io/blog/gsoc2024/index.html

I'm trying to get a GSoC 2024 in LLVM

and I will be documenting my work with this ongoing blogpost in reverse chronological order.

If you want to see more posts like this, consider chucking a buck or two on my GitHub sponsors, or, you know, hire me as a grad student.

29/02/2024

"hazlo cobarde"

Add the 3 way comparison instruction <=> to LLVM.

I like this GSoC in particular because

I will learn a wide swath of LLVM
I'll be working with a lot of optimization passes
I'll get to bring cool perf to C++/Rust and Julia
I was dared by the other, more talented Miguel to actually help improve LLVM

Next task

Add a new intrinsic – Langref, then Intrinsics.td, then maybe the pass verifier.

I've already put up a sample PR and got redirected on what looks like the proper working path for this endeavour.

Partial function application in Julia

Blog by Bogumił Kamiński — Fri, 23 Feb 2024 04:00:00 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/23/fix.html

Introduction

Some functions provided in Base Julia support partial application.
I often find this functionality useful.
Therefore in this post I want to give you its explanation and a summary which functions have this property.

The post was tested with Julia Version 1.12.0-DEV.53.

Explaining partial function application

We will focus on partial application of functions having two positional arguments.
Let us work by example.

Consider the in function. You can call it to check if some item is in a collection.
Here is an example:

julia> in('a', "Abracadabra")
true

julia> in('x', "Abracadabra")
false

A common pattern you might need is to perform a repeated check if various items are contained in the same collection.
For example assume you have a vector of characters and you want to filer it to keep only the elements contained in a reference collection.
You can do it like this:

julia> v = 'a':'z'
'a':1:'z'

julia> filter(x -> in(x, "Abracadabra"), v)
5-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

This pattern is so commonly needed that there is a shorthand for x -> in(x, "Abracadabra").
Instead of creating this anonymous function you can just write in("Abracadabra").
The value returned by this function call behaves in the same way as x -> in(x, "Abracadabra").
Let us check:

julia> filter(in("Abracadabra"), v)
5-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

You can think of this operation as if we partially applied the in function by fixing its second argument
(the collection) and leaving the first (the item we check) to be specified later.

In other words the following two operations are equivalent:

julia> in('a', "Abracadabra")
true

julia> in("Abracadabra")('a')
true

Fixing of the second argument is most common. However, sometimes it is useful to fix the first argument.
This is exactly the case of the filter function we have just used.

What if you wanted to perform the filter(in("Abracadabra"), v) test for multiple different values of v but with a fixed predicate function?
Here is an example:

julia> vv = ['a'+i:'z' for i in 0:4]
5-element Vector{StepRange{Char, Int64}}:
 'a':1:'z'
 'b':1:'z'
 'c':1:'z'
 'd':1:'z'
 'e':1:'z'

julia> map(v -> filter(in("Abracadabra"), v), vv)
5-element Vector{Vector{Char}}:
 ['a', 'b', 'c', 'd', 'r']
 ['b', 'c', 'd', 'r']
 ['c', 'd', 'r']
 ['d', 'r']
 ['r']

You probably see, where I am getting at. Instead of v -> filter(in("Abracadabra"), v) we can write filter(in("Abracadabra")) and fix
the first positional argument of filter, leaving the second to be specified later.
Let us check if this works:

julia> map(filter(in("Abracadabra")), vv)
5-element Vector{Vector{Char}}:
 ['a', 'b', 'c', 'd', 'r']
 ['b', 'c', 'd', 'r']
 ['c', 'd', 'r']
 ['d', 'r']
 ['r']

Indeed, we get what we expected. Again, for a reference note that the following two operations are equivalent:

julia> filter(in("Abracadabra"), v)
5-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

julia> filter(in("Abracadabra"))(v)
5-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

Before I finish this section let me note that if you do not like writing that many parentheses you could use the |> operator.
In our example we could write:

julia> map("Abracadabra" |> in |> filter, vv)
5-element Vector{Vector{Char}}:
 ['a', 'b', 'c', 'd', 'r']
 ['b', 'c', 'd', 'r']
 ['c', 'd', 'r']
 ['d', 'r']
 ['r']

Which style you use is a matter of preference.

Catalogue of function supporting partial applications

We saw that some functions taking two arguments support partial application.
Below I give you a list of all of them that are currently supported (and this is the reason why the post is written under Julia nightly,
as there were recent changes in this list).

There is only one function in Base Julia that supports fixing its first argument and this function is filter.

However, there are many functions supporting fixing of their second argument. Here is their list:

comparisons: isequal, ==, !=, >=, <=, >, <;
inclusion testing: in, ∈, ∋, ∉, ∌;
string checking: contains, occursin, endswith, startswith;
set operations (supported since Julia 1.11; not released yet): issubset,⊆, ⊇, ⊈, ⊉, ⊊, ⊋, isdisjoint, issetequal.

Conclusions

After reading this post you know how to use partial function application in Julia and which functions from Base support it.
I hope you will find this functionality useful in your code.

Efficient Julia Optimization through an MRI Case Study

Steven Whitaker — Mon, 19 Feb 2024 15:03:46 +0000

By: Steven Whitaker

Re-posted from: https://glcs.hashnode.dev/optimization-mri

Welcome to our firstJulia for Devsblog post!This will be a continuously running series of postswhere our team will discusshow we have used Juliato solve real-life problems.So, let’s get started!

In this Julia for Devs post,we will discuss using Juliato optimize scan parametersin magnetic resonance imaging (MRI).

If you are interestedjust in seeing example codeshowing how to use Juliato optimize your own cost function,feel free to skip ahead.

While we will discuss MRI specifically,the concepts(and even some of the code)we will seewill be applicablein many situations where optimization is useful,particularly when there is a modelfor an observable signal.The model could be, e.g.,a mathematical equationor a trained neural network,and the observable signal(aka the output of the model)could be, e.g.,the image intensity of a medical imaging scanor the energy needed to heat a building.

Problem Setup

In this post,the signal model we will useis a mathematical equationthat gives the image intensityof a balanced steady-state free precession (bSSFP) MRI scan.This model has two sets of inputs:those that are user-definedand those that are not.In MRI,user-defined inputs are scan parameters,such as how long to run the scan foror how much power to use to generate MRI signal.The other input parameters are called tissue parametersbecause they are properties that are intrinsicto the imaged tissue (e.g., muscle, brain, etc.).

Tissue parameters often are assumedto take on a pre-defined valuefor each specific tissue type.However,because they are tissue-specificand can vary with health and age,sometimes tissue parameters are considered to be unknown valuesthat need to be estimated from data.

The problem we will discuss in this postis how to estimate tissue parametersfrom a set of bSSFP scans.Then we will discusshow to optimize the scan parametersof the set of bSSFP scansto improve the tissue parameter estimates.

To estimate tissue parameters,we need the following:

a signal model, and
an estimation algorithm.

Signal Model

Here’s the signal model:

# Reference: https://www.mriquestions.com/true-fispfiesta.htmlfunction bssfp(TR, flip, T1, T2)    E1 = exp(-TR / T1)    E2 = exp(-TR / T2)    (sin, cos) = sincosd(flip)    num = sin * (1 - E1)    den = 1 - cos * (E1 - E2) - E1 * E2    signal = num / den    return signalend

This function returns the bSSFP signalas a function of two scan parameters(TR, a value in units of seconds,and flip, a value in units of degrees)and two tissue parameters(T1 and T2, each a value in units of seconds).

Estimation Algorithm

There are various estimation algorithms out there,but, for simplicity in this post,we will stick with a grid search.We will compute the bSSFP signalover many pairs of T1 and T2 values.We will compare these computed signalsto the actual observed signal,and the pair of T1 and T2 valuesthat results in the closest matchwill be chosen as the estimates of the tissue parameters.Here’s the algorithm in code:

# `signal` is the observed signal.# `TR` and `flip` are the scan parameters that were used,# and we want to estimate `T1` and `T2`.# `signal`, `TR` and `flip` should be vectors of the same length,# representing running multiple scans# and recording the observed signal for each pair of `TR` and `flip` values.function gridsearch(signal, TR, flip)    # Specify the grid of values to search over.    # `T1_grid` and `T2_grid` could optionally be inputs to this function.    T1_grid = exp10.(range(log10(0.5), log10(3), 40))    T2_grid = exp10.(range(log10(0.005), log10(0.7), 40))    T1_est = T1_grid[1]    T2_est = T2_grid[1]    best = Inf    # Pre-allocate memory for some computations to speed up the following loop.    residual = similar(signal)    # Iterate over the Cartesian product of `T1_grid` and `T2_grid`.    for (T1, T2) in Iterators.product(T1_grid, T2_grid)        # Physical constraint: T1 is greater than T2, and both are positive.        T1 > T2 > 0 || continue        # For the given T1 and T2 pairs, compute the (noiseless) bSSFP signal        # for each bSSFP scan and subtract them from the given signals.        # Tip: In Julia, one can apply a scalar function elementwise on vector        #      inputs using the "dot" notation (see the `.` after `bssfp` below).        residual .= signal .- bssfp.(TR, flip, T1, T2)        # Compute the norm squared of the above difference.        err = residual' * residual        # If the candidate T1 and T2 pair fit the given signals better,        # keep them as the current estimate of T1 and T2.        if err < best            best = err            T1_est = T1            T2_est = T2        end    end    isinf(best) && error("no valid grid points; consider changing `T1_grid` and/or `T2_grid`")    return (T1_est, T2_est)end

Cost Function

Now,to optimize scan parameterswe need a function to optimize,also called a cost function.In this case,we want to minimize the errorbetween what our estimator tells usand the true tissue parameter values.But because we need to optimize scan parametersbefore running any scans,we will simulate bSSFP MRI scansusing the given sets of scan parametersand average the estimator errorover several sets of tissue parameters.Here’s the code for the cost function:

# Load the Statistics standard library package to get access to the `mean` function.using Statistics# `TR` and `flip` are vectors of the same length# specifying the pairs of scan parameters to use.function estimator_error(TR, flip)    # Specify the grid of values to average over and the noise level to simulate.    # For a given pair of `T1_true` and `T2_true` values,    # bSSFP signals will be simulated, and then the estimator    # will attempt to estimate what `T1_true` and `T2_true` values were used    # based on the simulated bSSFP signals and the given scan parameters.    T1_true = [0.8, 1.0, 1.5]    T2_true = [0.05, 0.08, 0.1]    noise_level = 0.01    T1_avg = mean(T1_true)    T2_avg = mean(T2_true)    # Pre-allocate memory for some computations to speed up the following loop.    signal = float.(TR)    # The following computes the mean estimator error over all pairs    # of true T1 and T2 values.    return mean(Iterators.product(T1_true, T2_true)) do (T1, T2)        # Ignore pairs for which `T2 > T1`.        T2 > T1 && return 0        # Simulate noisy signal using the true T1 and T2 values.        signal .= bssfp.(TR, flip, T1, T2) .+ noise_level .* randn.()        # Estimate T1 and T2 from the noisy signal.        (T1_est, T2_est) = gridsearch(signal, TR, flip)        # Compare estimates to truth.        T1_err = (T1_est - T1)^2        T2_err = (T2_est - T2)^2        # Combine the T1 and T2 errors into one single error metric        # by averaging the respective errors        # after normalizing by the mean true value of each parameter.        err = (T1_err / T1_avg + T2_err / T2_avg) / 2    endend

Optimization

Now we are finally ready to optimize!We will use Optimization.jl.Many optimizers are available,but because we have a non-convex optimization problem,we will use the adaptive particle swarm global optimization algorithm.Here’s a function that does the optimization:

# Load needed packages.# Optimization.jl provides the optimization infrastructure,# while OptimizationOptimJL.jl wraps the Optim.jl package# that provides the optimization algorithm we will use.using Optimization, OptimizationOptimJL# Load the Random standard library package to enable setting the random seed.using Random# `TR_init` and `flip_init` are vectors of the same length# that provide a starting point for the optimization algorithm.# The length of these vectors also determines the number of bSSFP scans to simulate.function scan_optimization(TR_init, flip_init)    # Ensure randomly generated noise is the same for each evaluation.    Random.seed!(0)    N_scans = length(TR_init)    length(flip_init) == N_scans || error("`TR_init` and `flip_init` have different lengths")    # The following converts the cost function we created (`estimator_error`)    # into a form that Optimization.jl can use.    # Specifically, Optimization.jl needs a cost function that takes two inputs    # (the first of which contains the parameters to optimize)    # and returns a real number.    # The input `x` is a concatenation of `TR` and `flip`,    # i.e., `TR == x[1:N_scans]` and `flip == x[N_scans+1:end]`.    # The input `p` is unused, but is needed by Optimization.jl.    cost_fun = (x, p) -> estimator_error(x[1:N_scans], x[N_scans+1:end])    # Specify constraints.    # The lower and upper bounds are chosen to ensure reasonable bSSFP scans.    (min_TR, max_TR) = (0.001, 0.02)    (min_flip, max_flip) = (1, 90)    constraints = (;        lb = [fill(min_TR, N_scans); fill(min_flip, N_scans)],        ub = [fill(max_TR, N_scans); fill(max_flip, N_scans)],    )    # Set up and solve the problem.    f = OptimizationFunction(cost_fun)    prob = OptimizationProblem(f, [TR_init; flip_init]; constraints...)    sol = solve(prob, ParticleSwarm(lower = prob.lb, upper = prob.ub, n_particles = 3))    # Extract the optimized TRs and flip angles, remembering that `sol.u == [TR; flip]`.    TR_opt = sol.u[1:N_scans]    flip_opt = sol.u[N_scans+1:end]    return (TR_opt, flip_opt)end

Results

Let’s see how scan parameter optimizationcan improve tissue parameter estimates.We will use the following functionto take scan parameters,simulate bSSFP scans,and then estimate the tissue parametersfrom those scans.We will compare the results of this functionwith manually chosen scan parametersto those with optimized scan parameters.

# Load the LinearAlgebra standard library package for access to `norm`.using LinearAlgebra# Load Plots.jl for displaying the results.using Plots# Helper function for plotting.function plot_true_vs_est(T_true, T_est, title, rmse, clim)    return heatmap([T_true T_est];        title,        clim,        xlabel = "RMSE = $(round(rmse; digits = 4)) s",        xticks = [],        yticks = [],        showaxis = false,        aspect_ratio = 1,    )endfunction run(TR, flip)    # Create a synthetic object to scan.    # The background of the object will be indicated with values of `0.0`    # for the tissue parameters.    nx = ny = 128    object = map(Iterators.product(range(-1, 1, nx), range(-1, 1, ny))) do (x, y)        r = hypot(x, y)        if r < 0.5            return (0.8, 0.08)        elseif r < 0.8            return (1.3, 0.09)        else            return (0.0, 0.0)        end    end    # Simulate the bSSFP scans.    T1_true = first.(object)    T2_true = last.(object)    noise_level = 0.001    signal = map(T1_true, T2_true) do T1, T2        # Ignore the background of the object.        (T1, T2) == (0.0, 0.0) && return 0.0        bssfp.(TR, flip, T1, T2) .+ noise_level .* randn.()    end    # Estimate the tissue parameters.    T1_est = zeros(nx, ny)    T2_est = zeros(nx, ny)    for i in eachindex(signal, T1_est, T2_est)        # Don't try to estimate tissue parameters for the background.        signal[i] == 0.0 && continue        (T1_est[i], T2_est[i]) = gridsearch(signal[i], TR, flip)    end    # Compute the root mean squared error.    m = T1_true .!= 0.0    T1_rmse = sqrt(norm(T1_true[m] - T1_est[m]) / count(m))    T2_rmse = sqrt(norm(T2_true[m] - T2_est[m]) / count(m))    # Plot the results.    p_T1 = plot_true_vs_est(T1_true, T1_est, "True vs Estimated T1", T1_rmse, (0, 2.5))    p_T2 = plot_true_vs_est(T2_true, T2_est, "True vs Estimated T2", T2_rmse, (0, 0.25))    return (p_T1, p_T2)end

First, let’s see the results with no optimization:

TR_init = [0.005, 0.005, 0.005]flip_init = [30, 60, 80](p_T1_init, p_T2_init) = run(TR_init, flip_init)

display(p_T1_init)

display(p_T2_init)

Now let’s optimize the scan parametersand then see how the tissue parameter estimates look:

(TR_opt, flip_opt) = scan_optimization(TR_init, flip_init)(p_T1_opt, p_T2_opt) = run(TR_opt, flip_opt)

display(p_T1_opt)

display(p_T2_opt)

We can see that the optimized scansresult in much better tissue parameter estimates!

Summary

In this post,we saw how Julia can be usedto optimize MRI scan parametersto improve tissue parameter estimates.Even though we discussed MRI specifically,the concepts presented hereeasily extend to other applicationswhere signal models are knownand optimization is required.

Note that most of the code for this postwas taken from a Dash appI helped create for JuliaHub.Feel free to check it outif you want to see this code in action!

Additional Links

Optimizing MRI Scans Dash App
- Web app showcasing the capabilities discussed in this post.
Optimizing MRI Scans Pluto Notebook
- Pluto notebook accompanying the Dash app mentioned above.(Unfortunately, the data used in the notebook didn’t get set up properly,so most of the code can’t run correctly.)
Links to other blog posts discussing how to do MRI in Julia:
Optimization.jl Docs
- Official documentation for Optimization.jl.

Efficient Julia Optimization through an MRI Case Study

Steven Whitaker — Mon, 19 Feb 2024 15:03:46 +0000

By: Steven Whitaker

Re-posted from: https://blog.glcs.io/optimization-mri

Welcome to our firstJulia for Devsblog post!This will be a continuously running series of postswhere our team will discusshow we have used Juliato solve real-life problems.So, let’s get started!

In this Julia for Devs post,we will discuss using Juliato optimize scan parametersin magnetic resonance imaging (MRI).

If you are interestedjust in seeing example codeshowing how to use Juliato optimize your own cost function,feel free to skip ahead.

Problem Setup

To estimate tissue parameters,we need the following:

a signal model, and
an estimation algorithm.

Signal Model

Here’s the signal model:

# Reference: https://www.mriquestions.com/true-fispfiesta.htmlfunction bssfp(TR, flip, T1, T2)    E1 = exp(-TR / T1)    E2 = exp(-TR / T2)    (sin, cos) = sincosd(flip)    num = sin * (1 - E1)    den = 1 - cos * (E1 - E2) - E1 * E2    signal = num / den    return signalend

Estimation Algorithm

# `signal` is the observed signal.# `TR` and `flip` are the scan parameters that were used,# and we want to estimate `T1` and `T2`.# `signal`, `TR` and `flip` should be vectors of the same length,# representing running multiple scans# and recording the observed signal for each pair of `TR` and `flip` values.function gridsearch(signal, TR, flip)    # Specify the grid of values to search over.    # `T1_grid` and `T2_grid` could optionally be inputs to this function.    T1_grid = exp10.(range(log10(0.5), log10(3), 40))    T2_grid = exp10.(range(log10(0.005), log10(0.7), 40))    T1_est = T1_grid[1]    T2_est = T2_grid[1]    best = Inf    # Pre-allocate memory for some computations to speed up the following loop.    residual = similar(signal)    # Iterate over the Cartesian product of `T1_grid` and `T2_grid`.    for (T1, T2) in Iterators.product(T1_grid, T2_grid)        # Physical constraint: T1 is greater than T2, and both are positive.        T1 > T2 > 0 || continue        # For the given T1 and T2 pairs, compute the (noiseless) bSSFP signal        # for each bSSFP scan and subtract them from the given signals.        # Tip: In Julia, one can apply a scalar function elementwise on vector        #      inputs using the "dot" notation (see the `.` after `bssfp` below).        residual .= signal .- bssfp.(TR, flip, T1, T2)        # Compute the norm squared of the above difference.        err = residual' * residual        # If the candidate T1 and T2 pair fit the given signals better,        # keep them as the current estimate of T1 and T2.        if err < best            best = err            T1_est = T1            T2_est = T2        end    end    isinf(best) && error("no valid grid points; consider changing `T1_grid` and/or `T2_grid`")    return (T1_est, T2_est)end

Cost Function

# Load the Statistics standard library package to get access to the `mean` function.using Statistics# `TR` and `flip` are vectors of the same length# specifying the pairs of scan parameters to use.function estimator_error(TR, flip)    # Specify the grid of values to average over and the noise level to simulate.    # For a given pair of `T1_true` and `T2_true` values,    # bSSFP signals will be simulated, and then the estimator    # will attempt to estimate what `T1_true` and `T2_true` values were used    # based on the simulated bSSFP signals and the given scan parameters.    T1_true = [0.8, 1.0, 1.5]    T2_true = [0.05, 0.08, 0.1]    noise_level = 0.01    T1_avg = mean(T1_true)    T2_avg = mean(T2_true)    # Pre-allocate memory for some computations to speed up the following loop.    signal = float.(TR)    # The following computes the mean estimator error over all pairs    # of true T1 and T2 values.    return mean(Iterators.product(T1_true, T2_true)) do (T1, T2)        # Ignore pairs for which `T2 > T1`.        T2 > T1 && return 0        # Simulate noisy signal using the true T1 and T2 values.        signal .= bssfp.(TR, flip, T1, T2) .+ noise_level .* randn.()        # Estimate T1 and T2 from the noisy signal.        (T1_est, T2_est) = gridsearch(signal, TR, flip)        # Compare estimates to truth.        T1_err = (T1_est - T1)^2        T2_err = (T2_est - T2)^2        # Combine the T1 and T2 errors into one single error metric        # by averaging the respective errors        # after normalizing by the mean true value of each parameter.        err = (T1_err / T1_avg + T2_err / T2_avg) / 2    endend

Optimization

# Load needed packages.# Optimization.jl provides the optimization infrastructure,# while OptimizationOptimJL.jl wraps the Optim.jl package# that provides the optimization algorithm we will use.using Optimization, OptimizationOptimJL# Load the Random standard library package to enable setting the random seed.using Random# `TR_init` and `flip_init` are vectors of the same length# that provide a starting point for the optimization algorithm.# The length of these vectors also determines the number of bSSFP scans to simulate.function scan_optimization(TR_init, flip_init)    # Ensure randomly generated noise is the same for each evaluation.    Random.seed!(0)    N_scans = length(TR_init)    length(flip_init) == N_scans || error("`TR_init` and `flip_init` have different lengths")    # The following converts the cost function we created (`estimator_error`)    # into a form that Optimization.jl can use.    # Specifically, Optimization.jl needs a cost function that takes two inputs    # (the first of which contains the parameters to optimize)    # and returns a real number.    # The input `x` is a concatenation of `TR` and `flip`,    # i.e., `TR == x[1:N_scans]` and `flip == x[N_scans+1:end]`.    # The input `p` is unused, but is needed by Optimization.jl.    cost_fun = (x, p) -> estimator_error(x[1:N_scans], x[N_scans+1:end])    # Specify constraints.    # The lower and upper bounds are chosen to ensure reasonable bSSFP scans.    (min_TR, max_TR) = (0.001, 0.02)    (min_flip, max_flip) = (1, 90)    constraints = (;        lb = [fill(min_TR, N_scans); fill(min_flip, N_scans)],        ub = [fill(max_TR, N_scans); fill(max_flip, N_scans)],    )    # Set up and solve the problem.    f = OptimizationFunction(cost_fun)    prob = OptimizationProblem(f, [TR_init; flip_init]; constraints...)    sol = solve(prob, ParticleSwarm(lower = prob.lb, upper = prob.ub, n_particles = 3))    # Extract the optimized TRs and flip angles, remembering that `sol.u == [TR; flip]`.    TR_opt = sol.u[1:N_scans]    flip_opt = sol.u[N_scans+1:end]    return (TR_opt, flip_opt)end

Results

# Load the LinearAlgebra standard library package for access to `norm`.using LinearAlgebra# Load Plots.jl for displaying the results.using Plots# Helper function for plotting.function plot_true_vs_est(T_true, T_est, title, rmse, clim)    return heatmap([T_true T_est];        title,        clim,        xlabel = "RMSE = $(round(rmse; digits = 4)) s",        xticks = [],        yticks = [],        showaxis = false,        aspect_ratio = 1,    )endfunction run(TR, flip)    # Create a synthetic object to scan.    # The background of the object will be indicated with values of `0.0`    # for the tissue parameters.    nx = ny = 128    object = map(Iterators.product(range(-1, 1, nx), range(-1, 1, ny))) do (x, y)        r = hypot(x, y)        if r < 0.5            return (0.8, 0.08)        elseif r < 0.8            return (1.3, 0.09)        else            return (0.0, 0.0)        end    end    # Simulate the bSSFP scans.    T1_true = first.(object)    T2_true = last.(object)    noise_level = 0.001    signal = map(T1_true, T2_true) do T1, T2        # Ignore the background of the object.        (T1, T2) == (0.0, 0.0) && return 0.0        bssfp.(TR, flip, T1, T2) .+ noise_level .* randn.()    end    # Estimate the tissue parameters.    T1_est = zeros(nx, ny)    T2_est = zeros(nx, ny)    for i in eachindex(signal, T1_est, T2_est)        # Don't try to estimate tissue parameters for the background.        signal[i] == 0.0 && continue        (T1_est[i], T2_est[i]) = gridsearch(signal[i], TR, flip)    end    # Compute the root mean squared error.    m = T1_true .!= 0.0    T1_rmse = sqrt(norm(T1_true[m] - T1_est[m]) / count(m))    T2_rmse = sqrt(norm(T2_true[m] - T2_est[m]) / count(m))    # Plot the results.    p_T1 = plot_true_vs_est(T1_true, T1_est, "True vs Estimated T1", T1_rmse, (0, 2.5))    p_T2 = plot_true_vs_est(T2_true, T2_est, "True vs Estimated T2", T2_rmse, (0, 0.25))    return (p_T1, p_T2)end

First, let’s see the results with no optimization:

TR_init = [0.005, 0.005, 0.005]flip_init = [30, 60, 80](p_T1_init, p_T2_init) = run(TR_init, flip_init)

display(p_T1_init)

display(p_T2_init)

Now let’s optimize the scan parametersand then see how the tissue parameter estimates look:

(TR_opt, flip_opt) = scan_optimization(TR_init, flip_init)(p_T1_opt, p_T2_opt) = run(TR_opt, flip_opt)

display(p_T1_opt)

display(p_T2_opt)

We can see that the optimized scansresult in much better tissue parameter estimates!

Summary

Note that most of the code for this postwas taken from a Dash appI helped create for JuliaHub.Feel free to check it outif you want to see this code in action!

Additional Links

Optimizing MRI Scans Dash App
- Web app showcasing the capabilities discussed in this post.
Optimizing MRI Scans Pluto Notebook
- Pluto notebook accompanying the Dash app mentioned above.(Unfortunately, the data used in the notebook didn’t get set up properly,so most of the code can’t run correctly.)
Links to other blog posts discussing how to do MRI in Julia:
Optimization.jl Docs
- Official documentation for Optimization.jl.

Changes in random number generation performance in Julia

Blog by Bogumił Kamiński — Fri, 16 Feb 2024 04:11:22 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/16/rng.html

Introduction

Today I want to present a small benchmark of random number generation
performance improvements between current Julia release 1.10.1 and
current LTS version 1.6.7.

The idea for the benchmark follows a discussion with a friend who needed
to run some compute intensive Julia code on its LTS version.

The post was written under Julia 1.10.1 and Julia 1.6.7.

The benchmark

Let us start with presenting the benchmark functions

function test_rand1()
    s = 0
    for i in 1:10^9
        s += rand(1:1_000_000)
    end
    return s
end

function test_rand2()
    s = 0.0
    for i in 1:10^9
        s += rand()
    end
    return s
end

They are relatively simple. I wanted to compare the performance of:
(1) integer generation from some range and (2) generation of floating point numbers from [0, 1) interval
as these are two most common scenarios in practice.

Let us see the results. First comes Julia 1.6.7:

julia> @time test_rand1()
  4.949335 seconds (13 allocations: 35.406 KiB)
499993991047124

julia> @time test_rand1()
  4.663646 seconds
499998112691460

julia> @time test_rand2()
  2.175424 seconds
5.000141761909688e8

julia> @time test_rand2()
  2.238839 seconds
4.9999424544883996e8

And now we have Julia 1.10.1:

julia> @time test_rand1()
  2.355028 seconds
500001818410630

julia> @time test_rand1()
  2.287840 seconds
499998082399284

julia> @time test_rand2()
  1.123886 seconds
5.000026226340503e8

julia> @time test_rand2()
  1.117811 seconds
4.9999201274214923e8

So we see that things run roughly two times faster.

Some additional remarks

What is the reason for this difference?
The major point is that between Julia 1.6.7 and Julia 1.10.1
a default random number generator was changed. Let us see
(below I use copy to ensure explicit instantiation of the random number generator object under Julia 1.10.1).

Again, first we test Julia 1.6.7:

julia> using Random

julia> copy(Random.default_rng())
MersenneTwister(0x2fe644ceb724000ca5e5b4409dc3c6ea, (0, 4502994048, 4502993046, 986, 2502992778, 986))

and next we check Julia 1.10.1:

julia> using Random

julia> copy(Random.default_rng())
Xoshiro(0x1273707731737276, 0x187b3d2e82fb1d48, 0x13f9fd1a82642acb, 0xa7dcba727da742e6, 0x3ed2b4d410aa4b31)

So indeed, we see that MersenneTwister was replaced by Xoshiro generator (to be exact Xoshiro256++).

This has one important consequence, apart from random number generation speed that is related to seeding
of the generator. Let us check, Julia 1.6.7:

julia> Random.seed!(1)
MersenneTwister(1)

julia> rand()
0.23603334566204692

vs Julia 1.10.1:

julia> Random.seed!(1)
TaskLocalRNG()

julia> rand()
0.07336635446929285

This means that when you use the default random number generator you should not expect reproducibility of results between these two Julia versions.
This lack of stability is documented as not ensured across Julia versions.

If you need to ensure such reproducibility you can use e.g. the StableRNGs.jl package.

Conclusions

The topic of changes in random number generation in Julia is probably well known to people doing compute intensive simulations.
However, I thought it is worth to present these results for new users, who might be using different versions of Julia to execute the same code
and wonder why the performance or the results themselves are different across them.

Symbolic-Numerics: how compiler smarts can help improve the performance of numerical methods (nonlinear solvers in Julia)

Christopher Rackauckas — Tue, 13 Feb 2024 10:06:17 +0000

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/symbolic-numerics-how-compiler-smarts-can-help-improve-the-performance-of-numerical-methods-nonlinear-solvers-in-julia/

Many problems can be reduced down to solving f(x)=0, maybe even more than you think! Solving a stiff differential equation? Finding out where the ball hits the ground? Solving an inverse problem to find the parameters to fit a model? In this talk we’ll showcase how SciML’s NonlinearSolve.jl is a general system for solving nonlinear equations and demonstrate its ability to efficiently handle these kinds of problems with high stability and performance. We will focus on how compilers are being integrated into the numerical stack so that many of the things that were manual before, such as defining sparsity patterns, Jacobians, and adjoints, are all automated out-of-the-box making it greatly outperform purely numerical codes like SciPy or NLsolve.jl.

PyData Global 2023

The post Symbolic-Numerics: how compiler smarts can help improve the performance of numerical methods (nonlinear solvers in Julia) appeared first on Stochastic Lifestyle.

Semantic Versioning (Semver) is flawed, and Downgrade CI is required to fix it

Christopher Rackauckas — Sun, 11 Feb 2024 12:44:34 +0000

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/semantic-versioning-semver-is-flawed-and-downgrade-ci-is-required-to-fix-it/

Semantic versioning is great. If you don’t know what it is, it’s just a versioning scheme for software that goes MAJOR.MINOR.PATCH, where

MAJOR version when you make incompatible API changes
MINOR version when you add functionality in a backward compatible manner
PATCH version when you make backward compatible bug fixes

That’s all it is, but it’s a pretty good system. If you see someone has updated their package from v3.2.0 to v3.2.1, then you know that you can just take that update because it’s just a patch, it won’t break your code. You can easily accept patch updates. Meanwhile, if you see they released v3.3.0, then you know that some new features were added, but it’s safe for you to update. This allows you to be compatible with v3.3.0 so that if a different package requires it, great you can both use it! Thus a lot of version updates to your dependencies can be accepted without even thinking about it. However, when you see that v4.0.0, you know that your dependency broke some APIs, so you need to do that compatibility bump automatically. Thus the semvar system makes it much easier to maintain large organizations of packages since the number of manual version bumps that you need to do are rather small.

Because of how useful this can be, many package managers have incorporated a form of semantic versioning into its system. Julia’s package manager, Rust’s package manager, Node’s package manager, and more all have ways that integrate semantic versioning into its systems, making it easy to automatically accept dependency updates and thus keeper a wider set of compatibility than can effectively done manually. It’s a vital part of our current dependency system.

Okay if it’s great, then how can it be flawed?

Semver is flawed for two reasons:

The definition of “breaking” is vague and ill-defined at its edges
Current tooling does not accurately check for Semver compatibility

Breaking: a great concept but with unclear boundaries

The first point is somewhat known and is best characterized by the classic XKCD comic:

Any change can break code. It’s really up to the definition of “what is breaking”. There’s many nuances:

“Breaking” only applies to the “public facing API”, i.e. things that users interact with. If anything changing was considered breaking then every change would be a breaking change, so in order for semver to work you have to have some sense of what is considered internals and what is considered public. Julia in its next recent version has a new “public” keyword to declare certain internals as public, i.e. things which are exported and specifically chosen values in a package module are considered internal by default. If you have many users, you will still find someone say “but I use function __xxxxyyyz_internal because the API doesn’t allow me to pass mycacheunsafemathbeware optimally”, but at least you can blame them for it. This is the most solvable issue of semver and simply requires due diligence and sticking to a clear system for what’s exposed and what’s not. That’s a bit harder in dynamic languages, but as shown there are systems in place for this.
What is considered “breaking” in terms of functionality can have some fuzzy edges. I work on numerical solvers and connections to machine learning (scientific machine learning or SciML). If someone calls the ODE solver with abstol=1e-6 and reltol=1e-3, then what is returned is an approximation to the ODE’s solution with a few digits of accuracy (there’s some details in here I will ignore). If a change is made internally to the package, say more SIMD for better performance, which causes the result to change in the 12 digit, is that breaking? Because the ODE solver only is guaranteeing at most 3-6 digits of accuracy, probably not. But what if the 6th digit changes? The 5th? If the built-in sin function in the language changes in the 15th digit of accuracy, is that breaking? Most documentation strings don’t explicitly say “this is computed to 1ulp (units in the last place)”, so it’s not always clear what one is truly guaranteed from a numerical function. If someone improves the performance of a random number generator and now the random numbers for the same seed are different, is this breaking? Were you guaranteed that wasn’t going to change? People will argue all day about some of these edge cases, “it broke my tests because I set the random number seed and now it broke”. Look at any documentation, for example numpy.random.rand, and it won’t clarify these details on you can rely on to change and not change. This granularity is a discussion with a vague boundary.
One man’s bug fix is another man’s breaking change. You may have intended for all instantiations of f(x::T) (or T.f(x)) to return an integer, but one of them returned a float. So what do you do? You go fix it, make them all return floats, and add documentation on the interface that they all return a floating point value and implement some interface to enforce it across all functions… and then the issues roll in “you broke my code because I required that this version of the function returned an integer!”. A bugfix is by definition correcting an unintended behavior. However, someone has to define “unintended”, and your users may not be able to read your brain and may consider what was “intended” differently. I’m not sure there really is a solution to this because a bug is by definition unintended: if you knew it was there then you would have either fixed it or documented it earlier. But left with no documentation on what to do, the user may thing the behavior is intentional and use it.
Adding new functionality may have unintended consequences. You may have previously threw an error in a given case, but now return an approximation. The user may only want exact solutions to some math function f(x), so they relied on the error throw before in order to know if the solution would have been exactly calculable. Your new approximation functionality that you just released with a nice blog post thus just broke somebody’s code. So is it a major update, or a minor update? You never “intended” for only giving exact solutions, the error message might’ve even said “we intend to add this case in the near future with an approximation”, but you still broke their code.

As Churchill said, “democracy is the worst form of government, except for all the others”. In this case, semver is great because it conveys useful information, but we shouldn’t get ahead of ourselves and thus assume it does everything perfectly. Its definitions can be vague and it requires discussion to figure out whether something is breaking or a patch sometimes.

But if it does fail, hopefully our tooling can help us know. We all have continuous integration and continuous deployment (CI/CD), that helps us handle semver… right?

Standard CI/CD Systems are Insufficient to Check Semver Compatibility

I’m no chump so I set my versioning to use semantic versioning. My Project.tomls are all setup to put lower bounds, for example I list out all of my version requirements like a champ (if you’re not familiar with Julia’s package manager, everything defaults to semver and thus DiffEqBase = “6.41” in the compat implicitly means any 6.x with x>41, but 7 is not allowed). We laugh in the face of the Python PyPI system because our package registration system rejects any package (new or new version) which does not have an upper bound. Every package is required to have compatibilities specified, and thus random breakage is greatly reduced. We have forced all package authors to “do the right thing” and users have ultimately one. Package it up, we’re done here.

But then… users… see some breakage? They make a post where they show you that user your package failed. How could that happen? Well it goes back to part one that there are some edges in semantic versioning that may have creeped in somewhere. But many times what has happened is that the authors have simply forgotten what their lower bound means. v3.3.0 introduced the function f(x) in PkgA so when you started to use that function from the dependency, you set the lower bound there and life is good. g(x) was introduced in v3.4.0 and a few years later PkgA is at 3.11.2 you learn about it and go “cool PkgA is great!”, you start using g(x), your CI system says everything is fine, and then a user pops up and says your package is broken for them. When digging into the logs, you see that there’s some other package that only allows The Missing Link: Downgrade CI

The real core issue here is that semantic versioning is generally inadequately tested. In theory, if I put a lower bound saying I accept v3.3.0 and anything above it until v4, then I might be saying I am allowing more than 100 versions of PkgA. If I also have a PkgB with similar semantic versioning, I could be allowing 100 variations of that as well. However, the way everyone’s CI/CD system runs is to take the latest version of the packages. Okay, maybe for some major dependency like the standard library you list ‘Programming Language :: Python :: 3.8’, ‘Programming Language :: Python :: 3.9’, … to test multiple versions of that, but are you checking the 100,000 permutations of all allowed dependency versions? Almost certainly not. Not only have I not seen this, it’s also just infeasible to do in practice.

But as a shortcut, what you should be doing is at least checking the most edgy of edge cases. If I state v3.3.0 is my allowed lower bound, most CI systems will simply grab the latest v3.y.z with y and z as big as possible. However, I should at least have one run with v3.3.0 to see if it’s still sensible. This would have caught that g(x) was not defined in v3.4.0. While this wouldn’t fix all issues with semantic versioning, it can at least identify many of them pretty straightforwardly.

We call this scheme “Downgrade CI”, i.e. downgrade all dependencies to their minimum versions and run it. Most users will only ever see the maximum versions so sure it doesn’t matter to most people, but as people add more and more into their environment they will start to see earlier versions, and it’s these minimum versions that are the true key to whether your package will give a sensible environment, not the version maximums which semver puts so much effort into!

Setting up Downgrade CI

Okay, so hopefully I’ve convinced you that semver is not a magical solution to all compatibility problems, it’s a nice tool but not a silver bullet, and you need to have some form of downgrade CI. How do you actually accomplish this? Thankfully the Julia ecosystem has a julia-downgrade-compat-action which sets up Github Actions CI to automatically run the package versions with this downgrade idea in mind. If you’re scared of trying to figure that out, don’t worry and just copy-paste a script out of SciML. For example, from SciMLBase.jl:

name: Downgrade
on:
  pull_request:
    branches:
      - master
    paths-ignore:
      - 'docs/**'
  push:
    branches:
      - master
    paths-ignore:
      - 'docs/**'
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        version: ['1']
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v1
        with:
          version: ${{ matrix.version }}
      - uses: cjdoris/julia-downgrade-compat-action@v1
        with:
          skip: Pkg,TOML
      - uses: julia-actions/julia-buildpkg@v1
      - uses: julia-actions/julia-runtest@v1

This will add a new set of CI tests which run in the downgraded form and ensure your lower bounds are up to date. Will this solve all version compatibility issues? No, but hopefully this catches most of the major classes of issues.

Conclusion

In conclusion, use downgrade CI because semver isn’t perfect and while it does give a decent idea as to handling of upper bounds, lower bounds still need to be handled quite manually and “manual” is synonym for “can break”.

The post Semantic Versioning (Semver) is flawed, and Downgrade CI is required to fix it appeared first on Stochastic Lifestyle.

Semantic Versioning (Semver) is flawed, and Downgrade CI is required to fix it

Christopher Rackauckas — Sun, 11 Feb 2024 12:44:34 +0000

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/semantic-versioning-semver-is-flawed-and-downgrade-ci-is-required-to-fix-it/

Semantic versioning is great. If you don’t know what it is, it’s just a versioning scheme for software that goes MAJOR.MINOR.PATCH, where

MAJOR version when you make incompatible API changes
MINOR version when you add functionality in a backward compatible manner
PATCH version when you make backward compatible bug fixes

Okay if it’s great, then how can it be flawed?

Semver is flawed for two reasons:

The definition of “breaking” is vague and ill-defined at its edges
Current tooling does not accurately check for Semver compatibility

Breaking: a great concept but with unclear boundaries

The first point is somewhat known and is best characterized by the classic XKCD comic:

Any change can break code. It’s really up to the definition of “what is breaking”. There’s many nuances:

“Breaking” only applies to the “public facing API”, i.e. things that users interact with. If anything changing was considered breaking then every change would be a breaking change, so in order for semver to work you have to have some sense of what is considered internals and what is considered public. Julia in its next recent version has a new “public” keyword to declare certain internals as public, i.e. things which are exported and specifically chosen values in a package module are considered internal by default. If you have many users, you will still find someone say “but I use function __xxxxyyyz_internal because the API doesn’t allow me to pass mycacheunsafemathbeware optimally”, but at least you can blame them for it. This is the most solvable issue of semver and simply requires due diligence and sticking to a clear system for what’s exposed and what’s not. That’s a bit harder in dynamic languages, but as shown there are systems in place for this.
What is considered “breaking” in terms of functionality can have some fuzzy edges. I work on numerical solvers and connections to machine learning (scientific machine learning or SciML). If someone calls the ODE solver with abstol=1e-6 and reltol=1e-3, then what is returned is an approximation to the ODE’s solution with a few digits of accuracy (there’s some details in here I will ignore). If a change is made internally to the package, say more SIMD for better performance, which causes the result to change in the 12 digit, is that breaking? Because the ODE solver only is guaranteeing at most 3-6 digits of accuracy, probably not. But what if the 6th digit changes? The 5th? If the built-in sin function in the language changes in the 15th digit of accuracy, is that breaking? Most documentation strings don’t explicitly say “this is computed to 1ulp (units in the last place)”, so it’s not always clear what one is truly guaranteed from a numerical function. If someone improves the performance of a random number generator and now the random numbers for the same seed are different, is this breaking? Were you guaranteed that wasn’t going to change? People will argue all day about some of these edge cases, “it broke my tests because I set the random number seed and now it broke”. Look at any documentation, for example numpy.random.rand, and it won’t clarify these details on you can rely on to change and not change. This granularity is a discussion with a vague boundary.
One man’s bug fix is another man’s breaking change. You may have intended for all instantiations of f(x::T) (or T.f(x)) to return an integer, but one of them returned a float. So what do you do? You go fix it, make them all return floats, and add documentation on the interface that they all return a floating point value and implement some interface to enforce it across all functions… and then the issues roll in “you broke my code because I required that this version of the function returned an integer!”. A bugfix is by definition correcting an unintended behavior. However, someone has to define “unintended”, and your users may not be able to read your brain and may consider what was “intended” differently. I’m not sure there really is a solution to this because a bug is by definition unintended: if you knew it was there then you would have either fixed it or documented it earlier. But left with no documentation on what to do, the user may thing the behavior is intentional and use it.
Adding new functionality may have unintended consequences. You may have previously threw an error in a given case, but now return an approximation. The user may only want exact solutions to some math function f(x), so they relied on the error throw before in order to know if the solution would have been exactly calculable. Your new approximation functionality that you just released with a nice blog post thus just broke somebody’s code. So is it a major update, or a minor update? You never “intended” for only giving exact solutions, the error message might’ve even said “we intend to add this case in the near future with an approximation”, but you still broke their code.

But if it does fail, hopefully our tooling can help us know. We all have continuous integration and continuous deployment (CI/CD), that helps us handle semver… right?

Standard CI/CD Systems are Insufficient to Check Semver Compatibility

Setting up Downgrade CI

name: Downgrade
on:
  pull_request:
    branches:
      - master
    paths-ignore:
      - 'docs/**'
  push:
    branches:
      - master
    paths-ignore:
      - 'docs/**'
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        version: ['1']
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v1
        with:
          version: ${{ matrix.version }}
      - uses: cjdoris/julia-downgrade-compat-action@v1
        with:
          skip: Pkg,TOML
      - uses: julia-actions/julia-buildpkg@v1
      - uses: julia-actions/julia-runtest@v1

Conclusion

The post Semantic Versioning (Semver) is flawed, and Downgrade CI is required to fix it appeared first on Stochastic Lifestyle.

One puzzle, two solutions

Blog by Bogumił Kamiński — Fri, 09 Feb 2024 14:53:32 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/09/pe116.html

Introduction

Today, I wanted to switch back to a lighter subject.
Therefore I decided to have a look at my favorite Project Euler website.

I picked the problem 116 as I have not tried to solve it yet.
Interestingly, it turned out that there are two ways to approach this puzzle,
so I thought to share them here.

The post was written under Julia 1.10.0.

The puzzle

The Project Euler puzzle 116 can be briefly stated as follows:

Given a row of 50 grey squares is to have a number of its tiles replaced with
coloured oblong tiles chosen from red (length two),
green (length three), or blue (length four).
How many different ways can the grey tiles be replaced if colours
cannot be mixed and at least one coloured tile must be used?

(If you want to see some visual examples of valid tilings, I encourage you to
visit the puzzle 116 page.)

The first approach

When we think of this problem, it is natural to generalize it. By C(n, d) we can
define the number of ways that n gray squares can be replaced with tiles of length d.
Then the solution to our problem is C(n, 2) + C(n, 3) + C(n, 4).
So let us focus on computing C(n, d) (assuming d is positive).

The first approach is to ask how many tiles of length d can be put. There must be at least
1, and we cannot put more than n ÷ d (here I use the ÷ notation taken from Julia that
denotes integer division; in other words the integer part of n / d).

So now assume that we want to put i blocks of length d (assuming i is valid). In
how many ways can we do it. Well, we put i long blocks and we are left with n - d*i gray blocks.
In total we have i + (n - d*i) blocks. You can then think of it as you have that many slots
from which you need pick i slots to put the long blocks. The number of ways you can do it is
given by the value of binomial coefficient. In Julia notation it is:
binomial(BigInt(i + (n - d*i)), BigInt(i)).

Now you might ask why I put the BigInt wrapper around the passed numbers? The reason is
that binomial coefficient gets large pretty quickly, so I want to make sure I will not
have issues with integer overflow.

Given these considerations the first function that produces C(n, d) can be defined as:

function C1(n::Integer, d::Integer)
    @assert d > 0 && n >= 0
    return sum(i -> binomial(BigInt(i + (n - d*i)), BigInt(i)), 1:n ÷ d; init=big"0")
end

Note that I use the init=big"0" initialization statement in the sum to ensure the
correct handling of the cases when n < d when we are given an empty collection to sum over.

The second approach

However, there is a different way how we can think of computing C(n, d).

Assume we know the values of C(n, d) for values of n smaller than the requested one.

We look at the last tile in our row.

If it is empty, then we are down to n-1 tiles to be filled.
This can be done in C(n-1, d) ways (remember that this value takes care
of the fact that at least one block of length d has to be used).

But what if the last tile in our row is filled with a block of length d?
Then we have two options. Either all other blocks are left gray (which gives us 1 combination)
or we are left with n-d tiles that are filled with at least one block of length d. The
second value is exactly C(n-d, d).

In summary we get that C(n, d) = C(n-1, d) + C(n-d, d) + 1.

This formula assumes n is at least d. But clearly for n < d
we have 0 ways to arrange the blocks.

Let us write down the code that performs the required computation:

function C2(n::Integer, d::Integer)
    @assert d > 0 && n >= 0
    npos = Dict{Int,BigInt}(i => 0 for i in 0:d-1)
    for j in d:n
        npos[j] = npos[j-1] + npos[j-d] + 1
    end
    return npos[n]
end

Note in the code that I used the npos dictionary to flexibly allow
for any potential integer values of n. The dictionary has
Dict{Int,BigInt} type, again, to ensure that the results of the computations
are stored correctly even if they are large.

Testing

Now we have two functions C1 and C2 that look completely differently.
Do they produce the same results. Let us check:

julia> using Test

julia> @testset "test C1 and C2 equality" begin
           for n in 0:200, d in 1:20
               @test C1(n, d) == C2(n, d)
           end
       end;
Test Summary:           | Pass  Total  Time
test C1 and C2 equality | 4020   4020  0.9s

Indeed we see that both C1 and C2 functions produce the same results.

To convince ourselves that using arbitrary precision integers was indeed needed
let us check some example values of the functions:

julia> C1(200, 2)
453973694165307953197296969697410619233825

julia> C2(200, 2)
453973694165307953197296969697410619233825

julia> typemax(Int)
9223372036854775807

Indeed, if we were not careful, we would have an integer overflow issue.

Conclusions

As usual I will not show the value of the solution to the problem to encourage you
to run the code yourself. You can get it by executing either
sum(d -> C1(50, d), 2:4) or sum(d -> C2(50, d), 2:4).
(We have just checked that the value produced in both cases is the same).

My First Julia Package – TriangulArt.jl

Jonathan Carroll — Sun, 04 Feb 2024 00:00:00 +0000

By: Jonathan Carroll

Re-posted from: https://jcarroll.com.au/2024/02/04/my-first-julia-package-triangulart-jl/

I’ve tried to get this same image transformation working at least three times now, but
I can finally celebrate that it’s working! I’ve been (re-)learning Julia and I still
love the language, so it was time to take my learning to the next level and actually
build a package.

For those not familiar, Julia is a much newer language than my daily-driver R, and with
that comes the freedom to take a lot of good features from other languages and implement
them. There are some features that R just won’t ever get, but they’re available in Julia and
they’re very nice to use.

I’ve written solutions to the first 20
or so Project Euler problems in Julia … wow, 5 years ago.

More recently, I have solved the first 18 days of Advent of Code 2023 in Julia
(my solutions
are in a fork of a package that I’m not using, so they more or less run independently).

With those under my belt, I revisited a project I’ve tried to implement several times. I like
the low-poly look and wanted to recreate it – it’s just an image transformation,
right? I’m even somewhat familiar with Delaunay Triangulation, or at least its dual the
Voronoi Tesselation from my days building spatial maps
of fishing areas.

It sounds like a simple enough problem; choose some points all over the image, triangulate between
all of them, then shade the resulting triangles with the average colour of the pixels they enclose.

I found this nice image of a rainbow lorikeet (these frequent my backyard)

so I got to work trying to chop it up into triangles.

Well, the naive approach is simple enough, but it produces some terrible results. I’ve built that version into what I did eventually get working, and it’s… not what I want

The problem is that by randomly selecting points across the image, you lose all the structure. With
enough triangles you might recover some of that, but then you have a lot of triangles and lose that
low-poly vibe.

After much searching for a better way to do this, I found this article from 2017. It’s
a python approach, but I figured I knew enough Julia and Python now that I could try to make
a 1:1 translation.

The first step is to get the random sampling working, because it allows me to start testing
the triangulation parts quickly. Generating those is pretty clean

function generate_uniform_random_points(image::Matrix{RGB{N0f8}}, n_points::Integer=100)
    ymax, xmax = size(image)[1:2]
    rand(n_points, 2) .* [xmax ymax]
end

The triangulation itself is handled by DelaunayTriangulation::triangulate() –
for once I’m happy that there’s so much scientific/statistical support in Julia

rng = StableRNG(2)
tri = triangulate(all_points; rng)

Slightly trickier is figuring out which points are in which triangle. For that, I am
thankful for PolygonOps::inpolygon(). With the pixels for each triangle identified,
it was only a matter of averaging the R, G, and B channels to get the median colour.

I got that working, but with the results above – far from pleasant. The next, much harder step,
was to weight the points towards the ’edges’ of the image. I couldn’t find an easy
way to translate the python code for locally sampling the entropy (via skimage)

filters.rank.entropy(im2, morphology.disk(entropy_width))

so I tried to build something of my own. I tried edge-detection algorithms but
I was clearly doing something wrong with it. Partly, I suspect, not doing the down-weighting
that the python version includes.

Since the pixels we want to up-weight are all along lines, choosing these at random can
end up with several right next to each other, which we don’t want. The python version does
something a little clever – it selects one point, then reduces the weighting of the entire image
with a Gaussian around that point, so that nearby points are unlikely to also be selected.

In the end, I failed to find a good Julia alternative, but calling python code is (almost) as
simple as using PyCall; @pyimport skimage as skimage (with slight modifications to use in a
package, as I would later discover).

With that in place, I was able to successfully weight towards high-entropy regions; regions
where a larger number of bytes are required to encode a histogram of the grayscale pixels, i.e.
where there’s a lot going on. The results are much more pleasing

Along the way I added some debug features, such as plotting the vertices and edges of the
triangulation on top of the image

With the workflow more or less working, I ran some profiling to see if I could
speed it up. Unsurprisingly, generating the weighted points was one area where a
lot of time was spent, though it’s not yet clear if that’s because it’s python
code or because that’s genuinely one of the most complex parts of the code – my
best Julia alternative was to write my own short Shannon entropy function and
make it search locally with ImageFiltering::mapwindow

function shannon(w::AbstractMatrix)
   cm = collect(values(countmap(w))) / sum(size(w)) / 256
   sum([-x * log2(x) for x in cm])
end

mapwindow(shannon, image, (19, 19))

though, this creates a square subsampling, whereas the python version uses a nicer disk.

The profiling shows a lot of blank areas, and I’m not sure how to interpret those

I realised at this point that I actually didn’t know how long the python version takes
to run. I grabbed the original source code
and tried running it (after installing the relevant python packages) but it failed –
some packages had changed their arguments and signatures since this was written. A couple
of small updates later, my fork
now runs the code. It doesn’t take terribly long to run – it doesn’t display the image,
it saves it, and I’m not sure if that’s a factor. I (naively?) expected that the Julia
version would be a lot faster, and I’m hopeful that there’s performance I’ve left on the
table.

If anyone is interested in playing with a small-ish Julia package, feel free to poke at
this – it’s definitely not being used for anything critical.

For now, I’m enjoying throwing images at this and getting some nice looking results

If you’re interested in having a play with this package or helping to improve it,
it’s on GitHub – I’m not planning
to publish it to the registry any time soon, but that’s perhaps something to look
forward to in the future. For now, the main issues I see with this package are:

The white border around the produced image remains – I have tried setting
margin=0mm but that doesn’t appear to help
Performance is not as good as it can be, I suspect; the entropy calculation
(calling python) is definitely a bottleneck.
To speed up the processing, only every 10th pixel is used to determine the
average colour of the triangle – this may fail to identify an entire triangle.
CI – I generated this package in VSCode using PkgTemplates and it is the
first Julia package I’ve built. CI failed immediately, so I’ve probably done
something wrong.
I am still somewhat of a beginner in Julia, so there are probably many places
in which improvements can be made – feel free to suggest them!

As always, I can be found on Mastodon and
the comment section below.

devtools::session_info()

One puzzle, two solutions

Blog by Bogumił Kamiński — Fri, 02 Feb 2024 14:53:32 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/02/pe116.html

Introduction

Today, I wanted to switch back to a lighter subject.
Therefore I decided to have a look at my favorite Project Euler website.

I picked the problem 116 as I have not tried to solve it yet.
Interestingly, it turned out that there are two ways to approach this puzzle,
so I thought to share them here.

The post was written under Julia 1.10.0.

The puzzle

The Project Euler puzzle 116 can be briefly stated as follows:

Given a row of 50 grey squares is to have a number of its tiles replaced with
coloured oblong tiles chosen from red (length two),
green (length three), or blue (length four).
How many different ways can the grey tiles be replaced if colours
cannot be mixed and at least one coloured tile must be used?

(If you want to see some visual examples of valid tilings, I encourage you to
visit the puzzle 116 page.)

The first approach

Given these considerations the first function that produces C(n, d) can be defined as:

function C1(n::Integer, d::Integer)
    @assert d > 0 && n >= 0
    return sum(i -> binomial(BigInt(i + (n - d*i)), BigInt(i)), 1:n ÷ d; init=big"0")
end

Note that I use the init=big"0" initialization statement in the sum to ensure the
correct handling of the cases when n < d when we are given an empty collection to sum over.

The second approach

However, there is a different way how we can think of computing C(n, d).

Assume we know the values of C(n, d) for values of n smaller than the requested one.

We look at the last tile in our row.

In summary we get that C(n, d) = C(n-1, d) + C(n-d, d) + 1.

This formula assumes n is at least d. But clearly for n < d
we have 0 ways to arrange the blocks.

Let us write down the code that performs the required computation:

function C2(n::Integer, d::Integer)
    @assert d > 0 && n >= 0
    npos = Dict{Int,BigInt}(i => 0 for i in 0:d-1)
    for j in d:n
        npos[j] = npos[j-1] + npos[j-d] + 1
    end
    return npos[n]
end

Testing

Now we have two functions C1 and C2 that look completely differently.
Do they produce the same results. Let us check:

julia> using Test

julia> @testset "test C1 and C2 equality" begin
           for n in 0:200, d in 1:20
               @test C1(n, d) == C2(n, d)
           end
       end;
Test Summary:           | Pass  Total  Time
test C1 and C2 equality | 4020   4020  0.9s

Indeed we see that both C1 and C2 functions produce the same results.

To convince ourselves that using arbitrary precision integers was indeed needed
let us check some example values of the functions:

julia> C1(200, 2)
453973694165307953197296969697410619233825

julia> C2(200, 2)
453973694165307953197296969697410619233825

julia> typemax(Int)
9223372036854775807

Indeed, if we were not careful, we would have an integer overflow issue.

Conclusions

Working with vectors using DataFrames.jl minilanguage

Blog by Bogumił Kamiński — Fri, 02 Feb 2024 14:53:32 +0000

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/02/combine.html

Introduction

I have written in the past about DataFrames.jl operation specification syntax
(also called minilanguage), see for example this post or this post.

Today I want to discuss one design decision made in this minilanguage and its consequences.
It is related with how vectors are handled when they are returned from some transformation function.

The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.

A basic example

Consider the following example, where we want to compute a profit from some sales data:

julia> using DataFrames

julia> df = DataFrame(name=["A", "B", "C"],
                      revenue=[10, 20, 30],
                      cost=[5, 12, 18])
3×3 DataFrame
 Row │ name    revenue  cost
     │ String  Int64    Int64
─────┼────────────────────────
   1 │ A            10      5
   2 │ B            20     12
   3 │ C            30     18

julia> combine(df, All(), ["revenue", "cost"] => (-) => "profit")
3×4 DataFrame
 Row │ name    revenue  cost   profit
     │ String  Int64    Int64  Int64
─────┼────────────────────────────────
   1 │ A            10      5       5
   2 │ B            20     12       8
   3 │ C            30     18      12

The crucial point to understand here is that the - function takes
two columns "revenue" and "cost" and returns a vector.
Users typically expect, as in this example, that this vector should
be spread across several rows.

When vector spreading is not desirable?

However, there are cases, when we might not want to spread a vector
into multiple rows. Consider for example a transformation in which
we want to put "revenue" and "profit" values in a 2-element vector
per product. Intuitively we could write something like:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
ERROR: ArgumentError: all functions must return vectors of the same length

We get an error unfortunately. We will soon understand why, but before
I proceed let me comment on the [only(x), only(y)] part of the definition.
The only function makes sure that we have exactly one row per product.

To diagnose the issue let us drop the All() part in our call:

julia> combine(groupby(df, :name),
               ["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
6×2 DataFrame
 Row │ name    vec
     │ String  Int64
─────┼───────────────
   1 │ A          10
   2 │ A           5
   3 │ B          20
   4 │ B          12
   5 │ C          30
   6 │ C          18

Now we understand the problem. Because our function returns a vector it gets
spread over several rows (which leads to an error as other columns of df have
a different length).

Solving the vector-spreading issue

As I have said above, most of the time vector spreading is a desired feature,
but in the example we have just studied it is not wanted.
For such cases DataFrames.jl allows you to protect vectors from being spread.
What you need to do is to call Ref function on the returned value.
This will protect the result from being spread:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> Ref([only(x), only(y)])) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
─────┼──────────────────────────────────
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

Now, as we wanted, the entries of the "vec" columns are vectors. Wrapping the return
value of our function with Ref protected the vectors from being spread.

The alternative function that you could use to get the same effect is fill:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> fill([only(x), only(y)])) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
─────┼──────────────────────────────────
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

or you could wrap the return value with another pair of [...]:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> [[only(x), only(y)]]) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
─────┼──────────────────────────────────
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

What is going on here? In all three cases (Ref, fill, and [...]) we are wrapping a vector in another object that works like an outer vector.
In the case of [...] it is just a vector, fill produces a 0-dimensional array, and Ref creates a wrapper that behaves like 0-dimensional array.
In all cases DataFrames.jl treats this outer wrapper as a 1-element vector and just stores its contents in a single row (because there is one element to store).

Conclusions

I hope that you will find the example I gave today useful when transforming vectors using DataFrames.jl.

juliabloggers.com

Annotating columns of a data frame with DataFramesMeta.jl

Introduction

Column labels

Column notes

Conclusions

CUDA.jl 5.2 and 5.3: Maintenance releases

Profiler improvements

Kernel launch debugging

Sorting improvements

Unified memory fixes

Software updates

Future releases

Onboarding DataFrames.jl

Introduction

A basic example

Chain.jl

DataFramesMeta.jl

DataFrameMacros.jl

TidierData.jl

Conclusions

Sorting data with missing values

Introduction

General rules of comparison with missing values

Default sorting with missing values

Supplementary sorting order

More advanced cases of treating missing as smallest

Conclusions

Deduplication of rows in DataFrames.jl

Introduction

Checking if a data frame has duplicate rows

Finding duplicate rows

Removing duplicate rows from a data frame

Conclusions

Getting full factorial design in DataFrames.jl

Introduction

What is a full factorial design and how to create it?

Full factorial design in DataFrames.jl

What if we have a fractional factorial design?

Conclusions

ML Project Environment Setup in Julia, a Comprehensive Step-by-step Guide

Storing vectors of vectors in DataFrames.jl

Introduction

Basic transformations of columns in DataFrames.jl

Containers holding one element in Julia

How to use 1-element containers in DataFrames.jl as wrappers

Aliasing trap

An alternative

Conclusions

Mastering Efficient Array Operations with StaticArrays.jl in Julia

How to Use StaticArrays.jl

Constructors

Conversion to/from Array

Comparing StaticArrays to Arrays

Summary

Additional Links

Transforming multiple columns in DataFrames.jl

Introduction

What is operation specification syntax?

Vectorization of operations

Passing multiple columns

Conclusions

Calibrating an Ornstein–Uhlenbeck Process

The Ornstein-Uhlenbeck Equation

OLS Calibration of an OU Process

Does Bootstrapping Help?

AR(1) Calibration of a OU Process

Using Multiple Samples for Calibrating an OU Process

Summary

Working with a grouped data frame, part 2

Introduction

Warm-up: getting group indices

Extracting a single group

Extracting multiple groups

Key lookup

Conclusions

Working with a grouped data frame, part 1

Introduction

Creating a grouped data frame

When does the order of groups not matter?

Conversion to/from `Array`

Comparing `StaticArray`s to `Array`s