Author Archives: Blog by Bogumił Kamiński

Why, how, and when of ∘

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/11/05/circ.html

Introduction

New users of Julia often ask me about the use of the operator. In this post
I discuss the most important things one should know about it.

In this the post I use Julia 1.6.3.

Getting started with

Let us start with the basics. Before even understanding what does you
probably wonder how you can type it. Fortunately it is easy to check in Julia’s
help:

help?> ∘
"∘" can be typed by \circ<tab>

search: ∘

  f ∘ g

  Compose functions: i.e. (f ∘ g)(args...) means f(g(args...)). The ∘ symbol can
  be entered in the Julia REPL (and most editors, appropriately configured) by
  typing \circ<tab>.

  Function composition also works in prefix form: ∘(f, g) is the same as f ∘ g.
  The prefix form supports composition of multiple functions:
  ∘(f, g, h) = f ∘ g ∘ h and splatting ∘(fs...) for composing an iterable
  collection of functions.

As you can see you can easily write in the Julia REPL by typing \circ and
pressing the <tab> key. Also most editors that support Julia allow use of
this key sequence.

We can also check the Unicode code of :

julia> '∘'
'∘': Unicode U+2218 (category Sm: Symbol, math)

and its UTF-8 representation:

julia> codeunits("∘")
3-element Base.CodeUnits{UInt8, String}:
 0xe2
 0x88
 0x98

Understanding

When f and g are functions then writing f ∘ g creates a new callable
object that takes the same arguments as the g function and passes the value
returned by calling g to the f function. This operation is called function
composition.

Let us check this on some example functions:

julia> c = abs ∘ sin
abs ∘ sin

julia> c(-1)
0.8414709848078965

julia> abs(sin(-1))
0.8414709848078965

Additionally, once the composed object is created it can be easily inspected:

julia> typeof(c)
ComposedFunction{typeof(abs), typeof(sin)}

julia> fieldnames(typeof(c))
(:outer, :inner)

julia> c.outer
abs (generic function with 13 methods)

julia> c.inner
sin (generic function with 13 methods)

As you can see you can:

  • easily identify that some object is a composed function (as it has
    ComposedFunction type);
  • easily recover the functions that were used in composition.

In some usage scenarios these two properties can be quite useful.

One important feature one has to keep in mind is that you have to put the
f ∘ g expression in parentheses if you want to immediately call it:

julia> abs∘sin(-1) # incorrect
abs ∘ -0.8414709848078965

julia> (abs∘sin)(-1) # correct
0.8414709848078965

Rationale for

You might wonder why someone would want to write (f ∘ g)(x) if you can write
f(g(x)) or x |> g |> f to get the same. The reason is that f ∘ g is an
object that can be passed around. Clearly you could have created an anonymous
function e.g. via x -> f(g(x)) or x -> x |> g |> f. However, these
approaches are more visually noisy, less explicit, and create a new anonymous
function each time they used (which means triggering compilation).

The f ∘ g object is most commonly passed as an argument to higher order
functions. Here are some examples:

julia> map(uppercase∘strip, [" a", "b ", " c "])
3-element Vector{String}:
 "A"
 "B"
 "C"

julia> sum(sqrt∘abs, -10:10)
44.936556372408205

Before I finish let me explain the compilation issue. Start with a fresh Julia
session:

julia> x = 1:100;

julia> @time sum(abs∘sin, x);
  0.146191 seconds (446.61 k allocations: 25.602 MiB, 4.85% gc time, 99.78% compilation time)

julia> @time sum(abs∘sin, x);
  0.000011 seconds (3 allocations: 80 bytes)

Now open a new fresh Julia session again:

julia> x = 1:100;

julia> @time sum(x -> abs(sin(x)), x);
  0.150860 seconds (445.16 k allocations: 25.541 MiB, 8.25% gc time, 97.47% compilation time)

julia> @time sum(x -> abs(sin(x)), x);
  0.034172 seconds (54.68 k allocations: 3.256 MiB, 99.82% compilation time)

As you can see in the first case compilation happened only once. The reason is
that the type of abs∘sin is always the same (it is
ComposedFunction{typeof(abs), typeof(sin)}). On the other hand if you define
an anonymous function it gets a distinct type each time it is created:

julia> x -> abs(sin(x))
#5 (generic function with 1 method)

julia> x -> abs(sin(x))
#7 (generic function with 1 method)

Such differences matter most if the higher order function to which you pass
the composed function is complex and thus expensive to compile.

Conclusions

Hopefully after reading this post you understand why, how, and when the
operator can be used.

Why, how, and when of ∘

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/11/05/circ.html

Introduction

New users of Julia often ask me about the use of the operator. In this post
I discuss the most important things one should know about it.

In this the post I use Julia 1.6.3.

Getting started with

Let us start with the basics. Before even understanding what does you
probably wonder how you can type it. Fortunately it is easy to check in Julia’s
help:

help?> ∘
"∘" can be typed by \circ<tab>

search: ∘

  f ∘ g

  Compose functions: i.e. (f ∘ g)(args...) means f(g(args...)). The ∘ symbol can
  be entered in the Julia REPL (and most editors, appropriately configured) by
  typing \circ<tab>.

  Function composition also works in prefix form: ∘(f, g) is the same as f ∘ g.
  The prefix form supports composition of multiple functions:
  ∘(f, g, h) = f ∘ g ∘ h and splatting ∘(fs...) for composing an iterable
  collection of functions.

As you can see you can easily write in the Julia REPL by typing \circ and
pressing the <tab> key. Also most editors that support Julia allow use of
this key sequence.

We can also check the Unicode code of :

julia> '∘'
'∘': Unicode U+2218 (category Sm: Symbol, math)

and its UTF-8 representation:

julia> codeunits("∘")
3-element Base.CodeUnits{UInt8, String}:
 0xe2
 0x88
 0x98

Understanding

When f and g are functions then writing f ∘ g creates a new callable
object that takes the same arguments as the g function and passes the value
returned by calling g to the f function. This operation is called function
composition.

Let us check this on some example functions:

julia> c = abs ∘ sin
abs ∘ sin

julia> c(-1)
0.8414709848078965

julia> abs(sin(-1))
0.8414709848078965

Additionally, once the composed object is created it can be easily inspected:

julia> typeof(c)
ComposedFunction{typeof(abs), typeof(sin)}

julia> fieldnames(typeof(c))
(:outer, :inner)

julia> c.outer
abs (generic function with 13 methods)

julia> c.inner
sin (generic function with 13 methods)

As you can see you can:

  • easily identify that some object is a composed function (as it has
    ComposedFunction type);
  • easily recover the functions that were used in composition.

In some usage scenarios these two properties can be quite useful.

One important feature one has to keep in mind is that you have to put the
f ∘ g expression in parentheses if you want to immediately call it:

julia> abs∘sin(-1) # incorrect
abs ∘ -0.8414709848078965

julia> (abs∘sin)(-1) # correct
0.8414709848078965

Rationale for

You might wonder why someone would want to write (f ∘ g)(x) if you can write
f(g(x)) or x |> g |> f to get the same. The reason is that f ∘ g is an
object that can be passed around. Clearly you could have created an anonymous
function e.g. via x -> f(g(x)) or x -> x |> g |> f. However, these
approaches are more visually noisy, less explicit, and create a new anonymous
function each time they used (which means triggering compilation).

The f ∘ g object is most commonly passed as an argument to higher order
functions. Here are some examples:

julia> map(uppercase∘strip, [" a", "b ", " c "])
3-element Vector{String}:
 "A"
 "B"
 "C"

julia> sum(sqrt∘abs, -10:10)
44.936556372408205

Before I finish let me explain the compilation issue. Start with a fresh Julia
session:

julia> x = 1:100;

julia> @time sum(abs∘sin, x);
  0.146191 seconds (446.61 k allocations: 25.602 MiB, 4.85% gc time, 99.78% compilation time)

julia> @time sum(abs∘sin, x);
  0.000011 seconds (3 allocations: 80 bytes)

Now open a new fresh Julia session again:

julia> x = 1:100;

julia> @time sum(x -> abs(sin(x)), x);
  0.150860 seconds (445.16 k allocations: 25.541 MiB, 8.25% gc time, 97.47% compilation time)

julia> @time sum(x -> abs(sin(x)), x);
  0.034172 seconds (54.68 k allocations: 3.256 MiB, 99.82% compilation time)

As you can see in the first case compilation happened only once. The reason is
that the type of abs∘sin is always the same (it is
ComposedFunction{typeof(abs), typeof(sin)}). On the other hand if you define
an anonymous function it gets a distinct type each time it is created:

julia> x -> abs(sin(x))
#5 (generic function with 1 method)

julia> x -> abs(sin(x))
#7 (generic function with 1 method)

Such differences matter most if the higher order function to which you pass
the composed function is complex and thus expensive to compile.

Conclusions

Hopefully after reading this post you understand why, how, and when the
operator can be used.

Top 10 in the Julia language

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/29/top10.html

Introduction

Today I will write about a common question asked when working with data that has
a general and nice, but not very well known solution in Julia.

The problem users often want to solve is how to get top-n values of some vector
or top-n rows in a data frame subject some condition (as usual we will use
DataFrames.jl data frame implementation). Here is an example of a recent
Stack Overflow post that posed exactly this question.

Before I move on let me make a small comment about the post from last week. Due
to my error when pushing things to GitHub its incorrect version was shared on
juliabloggers.com. The proper post about doing model selection using
cross validation can be found here.

In this the post I use Julia 1.6.3, BenchmarkTools 1.2.0, and DataFrames.jl
1.2.2.

Getting top-n values of a vector

In order to find top-n values in a vector use the partialsort function that is
shipped with Julia Base. Here is an example:

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(10)
10-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 0.24683718661000897
 0.5796722333690416

julia> partialsort(x, 1:3, rev=true)
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
 0.8541465903790502
 0.7940257103317943
 0.7667970365022592

Above we have selected three top values, but in general we could have passed any
range instead of 1:3. The key benefit of this approach, apart from directly
producing the requested result, is that it is very fast in comparison to e.g.
sorting the whole vector and getting the top-n values. Here is a simple
benchmark:

julia> using BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> for n in [10^i for i in 3:7]
           @show n
           x = rand(n)
           @btime partialsort(x, 1:10)
           @btime sort(x)[1:10]
       end
n = 1000
  3.683 μs (2 allocations: 7.98 KiB)
  13.606 μs (2 allocations: 8.09 KiB)
n = 10000
  82.545 μs (3 allocations: 78.25 KiB)
  427.913 μs (3 allocations: 78.36 KiB)
n = 100000
  1.168 ms (3 allocations: 781.38 KiB)
  5.277 ms (3 allocations: 781.48 KiB)
n = 1000000
  11.676 ms (3 allocations: 7.63 MiB)
  67.431 ms (3 allocations: 7.63 MiB)
n = 10000000
  141.493 ms (3 allocations: 76.29 MiB)
  868.380 ms (3 allocations: 76.29 MiB)

As you can see there is a significant difference in timing and it grows with
the size of the input vector.

Getting top-n values in a data frame

In order to do the similar operation on a data frame we need to use another
function, i.e. partialsortperm. It returns the indices of the requested
values. Here is an example:

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(10)
10-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 0.24683718661000897
 0.5796722333690416

julia> partialsortperm(x, 1:3, rev=true)
3-element view(::Vector{Int64}, 1:3) with eltype Int64:
 6
 5
 2

Now these indices can be used to subset a data frame as follows. Assume we have
a large data frame having columns :x and :y and want to get top 10 rows
with respect to column :x. You can achieve it efficiently like this:

julia> using DataFrames

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6))
1000000×2 DataFrame
     Row │ x          y
         │ Float64    Float64
─────────┼──────────────────────
       1 │ 0.590845   0.901919
       2 │ 0.766797   0.491144
       3 │ 0.566237   0.0750416
       4 │ 0.460085   0.476479
       5 │ 0.794026   0.296354
       6 │ 0.854147   0.399129
       7 │ 0.200586   0.0894096
    ⋮    │     ⋮          ⋮
  999994 │ 0.798217   0.645084
  999995 │ 0.587953   0.978664
  999996 │ 0.47863    0.105387
  999997 │ 0.444533   0.0169648
  999998 │ 0.0250811  0.921271
  999999 │ 0.527551   0.731145
 1000000 │ 0.713854   0.0222126
             999986 rows omitted

julia> df[partialsortperm(df.x, 1:10, rev=true), :]
10×2 DataFrame
 Row │ x         y
     │ Float64   Float64
─────┼─────────────────────
   1 │ 0.999999  0.586541
   2 │ 0.999999  0.538707
   3 │ 0.999999  0.375452
   4 │ 0.999998  0.0909446
   5 │ 0.999998  0.426267
   6 │ 0.999997  0.747272
   7 │ 0.999995  0.0559812
   8 │ 0.999995  0.390885
   9 │ 0.999994  0.956622
  10 │ 0.999993  0.0909472

Conclusions

I use the partialsort and partialsortperm functions surprisingly often.
As you have seen they are not only easy to learn but also fast.

Finally, if you would need even more efficiency you can use the partialsort!
and partialsortperm! functions that are in-place variants of the discussed
functions.