Learning to create a vector in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/11/03/vec.html

Introduction

Last week I have written a “Learning to zip stuff in Julia” post
that discussed the zip function. To my surprise, even though the topic was basic,
it has received a lot of positive feedback. Therefore I thought of writing about
another entry-level problem.

Often, when working with arrays in Julia we want to flatten them to a vector.
Today, I want to discuss two ways how you can do it and the differences between them.

The post was written under Julia 1.9.2.

The problem

Assume you have the following three dimensional array:

julia> a3d = [(i, j, k) for i in 1:2, j in 1:3, k in 1:4]
2×3×4 Array{Tuple{Int64, Int64, Int64}, 3}:
[:, :, 1] =
 (1, 1, 1)  (1, 2, 1)  (1, 3, 1)
 (2, 1, 1)  (2, 2, 1)  (2, 3, 1)

[:, :, 2] =
 (1, 1, 2)  (1, 2, 2)  (1, 3, 2)
 (2, 1, 2)  (2, 2, 2)  (2, 3, 2)

[:, :, 3] =
 (1, 1, 3)  (1, 2, 3)  (1, 3, 3)
 (2, 1, 3)  (2, 2, 3)  (2, 3, 3)

[:, :, 4] =
 (1, 1, 4)  (1, 2, 4)  (1, 3, 4)
 (2, 1, 4)  (2, 2, 4)  (2, 3, 4)

In many situations you might want to transform it into a one dimensional vector.
For example many functions explicitly require AbstractVector as their input.
The question is how can you do it.

There are two fundamental ways to perform this operation. The first one creates
a new independent vector from the source data, and the second one reuses the memory
of the source data. Let me discuss them in more detail.

Copied vector

If you want to copy the data in a3d into a new vector you can write:

julia> a3d[:]
24-element Vector{Tuple{Int64, Int64, Int64}}:
 (1, 1, 1)
 (2, 1, 1)
 (1, 2, 1)
 (2, 2, 1)
 (1, 3, 1)
 (2, 3, 1)
 (1, 1, 2)
 (2, 1, 2)
 (1, 2, 2)
 (2, 2, 2)
 (1, 3, 2)
 (2, 3, 2)
 (1, 1, 3)
 (2, 1, 3)
 (1, 2, 3)
 (2, 2, 3)
 (1, 3, 3)
 (2, 3, 3)
 (1, 1, 4)
 (2, 1, 4)
 (1, 2, 4)
 (2, 2, 4)
 (1, 3, 4)
 (2, 3, 4)

The syntax is short and easy to read. It takes advantage of the fact that every Julia
array should support linear indexing, as is explained in the Julia Manual section
on Linear Indexing.

The benefit of this approach is that the new object is freshly allocated, so modifying
it will not modify the source. The downside is that it requires memory allocation.

Aliased vector

If we want to avoid excessive memory allocation (which might be relevant for large objects)
we can create a vector from an array without copying the data. This can be achieved using
the vec function:

julia> v = vec(a3d)
24-element Vector{Tuple{Int64, Int64, Int64}}:
 (1, 1, 1)
 (2, 1, 1)
 (1, 2, 1)
 (2, 2, 1)
 (1, 3, 1)
 (2, 3, 1)
 (1, 1, 2)
 (2, 1, 2)
 (1, 2, 2)
 (2, 2, 2)
 (1, 3, 2)
 (2, 3, 2)
 (1, 1, 3)
 (2, 1, 3)
 (1, 2, 3)
 (2, 2, 3)
 (1, 3, 3)
 (2, 3, 3)
 (1, 1, 4)
 (2, 1, 4)
 (1, 2, 4)
 (2, 2, 4)
 (1, 3, 4)
 (2, 3, 4)

Now v is a vector that shares memory with a3d. We can check it using the pointer function:

julia> pointer(v)
Ptr{Tuple{Int64, Int64, Int64}} @0x000002606d121540

julia> pointer(a3d)
Ptr{Tuple{Int64, Int64, Int64}} @0x000002606d121540

or the Base.mightalias function:

julia> Base.mightalias(a3d, v)
true

The benefit is that vec(a3d) is in general much faster than a3d[:] since it makes less work.
The downside is that you need to remember that mutating one of the objects will change the other.
For example:

julia> a3d[1, 1, 1] = (100, 100, 100);

julia> a3d
2×3×4 Array{Tuple{Int64, Int64, Int64}, 3}:
[:, :, 1] =
 (100, 100, 100)  (1, 2, 1)  (1, 3, 1)
 (2, 1, 1)        (2, 2, 1)  (2, 3, 1)

[:, :, 2] =
 (1, 1, 2)  (1, 2, 2)  (1, 3, 2)
 (2, 1, 2)  (2, 2, 2)  (2, 3, 2)

[:, :, 3] =
 (1, 1, 3)  (1, 2, 3)  (1, 3, 3)
 (2, 1, 3)  (2, 2, 3)  (2, 3, 3)

[:, :, 4] =
 (1, 1, 4)  (1, 2, 4)  (1, 3, 4)
 (2, 1, 4)  (2, 2, 4)  (2, 3, 4)

julia> v
24-element Vector{Tuple{Int64, Int64, Int64}}:
 (100, 100, 100)
 (2, 1, 1)
 (1, 2, 1)
 (2, 2, 1)
 (1, 3, 1)
 (2, 3, 1)
 (1, 1, 2)
 (2, 1, 2)
 (1, 2, 2)
 (2, 2, 2)
 (1, 3, 2)
 (2, 3, 2)
 (1, 1, 3)
 (2, 1, 3)
 (1, 2, 3)
 (2, 2, 3)
 (1, 3, 3)
 (2, 3, 3)
 (1, 1, 4)
 (2, 1, 4)
 (1, 2, 4)
 (2, 2, 4)
 (1, 3, 4)
 (2, 3, 4)

Conclusions

In summary if in your analytical workflow you need a vector,
but you have a general array a as an input you have two options:

  • write a[:], which creates a copy (safer, but slower and uses more memory);
  • write vec(a), which creates an alias (unsafe, but faster and uses less memory).

In practice I find both options useful depending on the circumstances, so I think it is worth to be aware of them.

Learning to zip stuff in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/27/zip.html

Introduction

Today I want to discuss the basics of the design of the zip function in Julia.
This is a relatively introductory topic but, in my experience, wrong usage of this function can lead to hard to catch bugs.

The post was written under Julia 1.9.2.

What does zip do?

Let us start with the description of the zip function from its documentation:

Run multiple iterators at the same time, until any of them is exhausted.
The value type of the zip iterator is a tuple of values of its subiterators.

If it is not fully clear what it does let me give two examples:

julia> [v for v in zip(1:3, 11:13)]
3-element Vector{Tuple{Int64, Int64}}:
 (1, 11)
 (2, 12)
 (3, 13)

julia> [v for v in zip("abcdef", Iterators.cycle([1, 2]))]
6-element Vector{Tuple{Char, Int64}}:
 ('a', 1)
 ('b', 2)
 ('c', 1)
 ('d', 2)
 ('e', 1)
 ('f', 2)

The examples show the following nice features of the zip function:

  • It can work with any iterator, not just e.g. vectors.
    In the second example we used a string and a special cyclic iterator from Iterators module.
    Importantly we can work with iterators that are not indexable.
  • The zip function works until the shortest iterator of all passed to it is exhausted.
    It is often useful when we have some of the iterators that are infinite.
    Again, in our second example Iterators.cycle([1, 2])) is infinite and allowed us to clearly indicate even and odd characters in the string.

The last nice feature of zip is that it is lazy. To understand what it means check this:

julia> zip(1:3, 11:13)
zip(1:3, 11:13)

julia> typeof(zip(1:3, 11:13))
Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}

We can see that the zip function by itself does not do any work. Instead, it returns an iterator that can be queried.
That is why in the first examples I used comprehensions to show the values produced by the iterator returned by zip.
Why is it useful? Being lazy avoids excessive memory allocation, which is often important, especially if you work with large data.

Why is zip risky?

So what are the potential problems with zip in practice? The issue is that we often want to use it in scenarios
when we would want to be sure that all iterators passed to zip have the same length. Unfortunately zip does not check it.

Here is a simple example of a problematic definition of a function comupting a dot product:

julia> dot_bad(x, y) = sum(a * b for (a, b) in zip(x, y))
dot_bad (generic function with 1 method)

julia> dot_bad(1:3, 4:6)
32

julia> dot_bad(1:3, 4:7)
32

julia> dot_bad(1:4, 4:6)
32

Typically with such a function we would want it to error if it gets input iterators of unequal length.

In practice, if it is required that the passed iterators must have equal length, they usually support the length function.
Therefore often fix of the definition that is similar to the following is enough:

julia> function dot_safer(x, y)
           @assert length(x) == length(y)
           return sum(a * b for (a, b) in zip(x, y))
       end
dot_safer (generic function with 1 method)

julia> dot_safer(1:3, 4:6)
32

julia> dot_safer(1:3, 4:7)
ERROR: AssertionError: length(x) == length(y)

julia> dot_safer(1:4, 4:6)
ERROR: AssertionError: length(x) == length(y)

Conclusions

In summary: if you use zip always ask yourself a question if you assume that the passed
iterators have the same length. If you do – make sure to check it in your code explicitly.

Back to the Project Euler puzzles

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/20/pe.html

Introduction

After several technical posts today I decided to switch back to puzzle-solving mode.
My readers probably know that I like and promote the Project Euler project.

This time I picked a relatively easy puzzle 205 as it nicely shows several
functionalities of Julia that are worth learning.

The post was written under Julia 1.9.2 and StatsBase.jl v0.34.2.

The problem

The problem 205 is stated as follows:

Peter has nine four-sided dice, each with faces numbered from 1 to 4.
Colin has six six-sided dice, each with faces numbered from 1 to 6.
Peter and Colin roll their dice and compare totals: the highest total wins.
The result is a draw if the totals are equal.
What is the probability that Peter beats Colin?

Simulation approach

To get some intuition for our problem let us first try simulating the distribution of the results.
We will draw one million times from Peter’s and Colin’s dice:

julia> using Random

julia> using StatsBase

julia> Random.seed!(1234);

julia> sim_p = [sum(rand(1:4) for _ in 1:9) for _ in 1:10^6]
1000000-element Vector{Int64}:
 24
 23
 20
 26
 23
  ⋮
 20
 27
 22
 24
 23

julia> sim_c = [sum(rand(1:6) for _ in 1:6) for _ in 1:10^6]
1000000-element Vector{Int64}:
 21
 23
 23
 24
 24
  ⋮
 16
 30
 15
 20
 25

Now let us compare the results:

julia> describe(sim_p)
Summary Stats:
Length:         1000000
Missing Count:  0
Mean:           22.498804
Std. Deviation: 3.355036
Minimum:        9.000000
1st Quartile:   20.000000
Median:         22.000000
3rd Quartile:   25.000000
Maximum:        36.000000
Type:           Int64

julia> describe(sim_c)
Summary Stats:
Length:         1000000
Missing Count:  0
Mean:           20.996381
Std. Deviation: 4.186906
Minimum:        6.000000
1st Quartile:   18.000000
Median:         21.000000
3rd Quartile:   24.000000
Maximum:        36.000000
Type:           Int64

julia> mean(sim_p .> sim_c)
0.5728

We see that Peter’s chances of winning are around 57%.
We also see that both the mean and the median of Peter’s dice are better than Colin’s.

However, the simulation results are only approximate. Let us thus compute the exact result.

Exact approach

To compute the exact probability of Peter’s win first calculate the distribution of
Peter’s and Colin’s dice. With StatsBase.jl it is easy to do using the countmap function:

julia> ex_p = countmap(map(sum, Iterators.product((1:4 for _ in 1:9)...)))
Dict{Int64, Int64} with 28 entries:
  16 => 4950
  20 => 23607
  35 => 9
  12 => 165
  24 => 27876
  28 => 8451
  30 => 2598
  17 => 8451
  23 => 30276
  19 => 18351
  22 => 30276
  32 => 486
  11 => 45
  36 => 1
  9  => 1
  31 => 1206
  ⋮  => ⋮

julia> ex_c = countmap(map(sum, Iterators.product((1:6 for _ in 1:6)...)))
Dict{Int64, Int64} with 31 entries:
  16 => 2247
  20 => 4221
  35 => 6
  12 => 456
  24 => 3431
  28 => 1161
  8  => 21
  17 => 2856
  30 => 456
  23 => 3906
  19 => 3906
  22 => 4221
  32 => 126
  6  => 1
  11 => 252
  36 => 1
  ⋮  => ⋮

As a result we get the number of times out of the total possible outcomes that a given sum on a dice occurs.
Let us check that the total counts of the outcomes are correct. We can do it as we know that on Peter’s dice
we can get 4^9 outcomes and on Colin’s dice 6^6:

julia> sum(values(ex_p)), 4^9
(262144, 262144)

julia> sum(values(ex_c)), 6^6
(46656, 46656)

Indeed the results match.

Using distributions stored in ex_p and ex_c variables we can count the number of times Peter wins with Collin
using the Cartesian product of the distributions (the tosses of the dice are independent) and, in consequence compute the
exact probability that Peter wins:

julia> p_win = sum(pk > ck ? pv*cv : 0 for
                   (pk, pv) in pairs(ex_p),
                   (ck, cv) in pairs(ex_c))
7009890480

julia> total = 4^9 * 6^6
12230590464

julia> p_win / total
0.5731440767829801

julia> p_win // total
48679795//84934656

Note that in Julia we can nicely compute both approximate solution (using Float64) and exact solution using rational numbers.

One important aspect that we should have checked when solving this puzzle is if we do not have an integer overflow issue when
computing the result. On 64-bit machine the overflow happens for the value:

julia> typemax(Int)
9223372036854775807

which is much larger than 12230590464. But how can we be sure that we do not get an overflow when computing 4^9 * 6^6?
The easiest check is to take the logarithms of both expressions:

julia> log(typemax(Int))
43.66827237527655

julia> 9 * log(4) + 6 * log(6)
23.227206065447348

Indeed we see that we have a wide safety margin.

Note, however, that on 32-bit machine Julia would use Int32 type to represent integers by default, and hen we have:

julia> typemax(Int32)
2147483647

julia> log(typemax(Int32))
21.487562596892644

And in this case we would have an integer overflow issue, so some care is needed.

Conclusions

I hope you enjoyed the puzzle, the solution, and the code examples I have presented!