Tag Archives: julialang

Back to the Project Euler puzzles

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/20/pe.html

Introduction

After several technical posts today I decided to switch back to puzzle-solving mode.
My readers probably know that I like and promote the Project Euler project.

This time I picked a relatively easy puzzle 205 as it nicely shows several
functionalities of Julia that are worth learning.

The post was written under Julia 1.9.2 and StatsBase.jl v0.34.2.

The problem

The problem 205 is stated as follows:

Peter has nine four-sided dice, each with faces numbered from 1 to 4.
Colin has six six-sided dice, each with faces numbered from 1 to 6.
Peter and Colin roll their dice and compare totals: the highest total wins.
The result is a draw if the totals are equal.
What is the probability that Peter beats Colin?

Simulation approach

To get some intuition for our problem let us first try simulating the distribution of the results.
We will draw one million times from Peter’s and Colin’s dice:

julia> using Random

julia> using StatsBase

julia> Random.seed!(1234);

julia> sim_p = [sum(rand(1:4) for _ in 1:9) for _ in 1:10^6]
1000000-element Vector{Int64}:
 24
 23
 20
 26
 23
  ⋮
 20
 27
 22
 24
 23

julia> sim_c = [sum(rand(1:6) for _ in 1:6) for _ in 1:10^6]
1000000-element Vector{Int64}:
 21
 23
 23
 24
 24
  ⋮
 16
 30
 15
 20
 25

Now let us compare the results:

julia> describe(sim_p)
Summary Stats:
Length:         1000000
Missing Count:  0
Mean:           22.498804
Std. Deviation: 3.355036
Minimum:        9.000000
1st Quartile:   20.000000
Median:         22.000000
3rd Quartile:   25.000000
Maximum:        36.000000
Type:           Int64

julia> describe(sim_c)
Summary Stats:
Length:         1000000
Missing Count:  0
Mean:           20.996381
Std. Deviation: 4.186906
Minimum:        6.000000
1st Quartile:   18.000000
Median:         21.000000
3rd Quartile:   24.000000
Maximum:        36.000000
Type:           Int64

julia> mean(sim_p .> sim_c)
0.5728

We see that Peter’s chances of winning are around 57%.
We also see that both the mean and the median of Peter’s dice are better than Colin’s.

However, the simulation results are only approximate. Let us thus compute the exact result.

Exact approach

To compute the exact probability of Peter’s win first calculate the distribution of
Peter’s and Colin’s dice. With StatsBase.jl it is easy to do using the countmap function:

julia> ex_p = countmap(map(sum, Iterators.product((1:4 for _ in 1:9)...)))
Dict{Int64, Int64} with 28 entries:
  16 => 4950
  20 => 23607
  35 => 9
  12 => 165
  24 => 27876
  28 => 8451
  30 => 2598
  17 => 8451
  23 => 30276
  19 => 18351
  22 => 30276
  32 => 486
  11 => 45
  36 => 1
  9  => 1
  31 => 1206
  ⋮  => ⋮

julia> ex_c = countmap(map(sum, Iterators.product((1:6 for _ in 1:6)...)))
Dict{Int64, Int64} with 31 entries:
  16 => 2247
  20 => 4221
  35 => 6
  12 => 456
  24 => 3431
  28 => 1161
  8  => 21
  17 => 2856
  30 => 456
  23 => 3906
  19 => 3906
  22 => 4221
  32 => 126
  6  => 1
  11 => 252
  36 => 1
  ⋮  => ⋮

As a result we get the number of times out of the total possible outcomes that a given sum on a dice occurs.
Let us check that the total counts of the outcomes are correct. We can do it as we know that on Peter’s dice
we can get 4^9 outcomes and on Colin’s dice 6^6:

julia> sum(values(ex_p)), 4^9
(262144, 262144)

julia> sum(values(ex_c)), 6^6
(46656, 46656)

Indeed the results match.

Using distributions stored in ex_p and ex_c variables we can count the number of times Peter wins with Collin
using the Cartesian product of the distributions (the tosses of the dice are independent) and, in consequence compute the
exact probability that Peter wins:

julia> p_win = sum(pk > ck ? pv*cv : 0 for
                   (pk, pv) in pairs(ex_p),
                   (ck, cv) in pairs(ex_c))
7009890480

julia> total = 4^9 * 6^6
12230590464

julia> p_win / total
0.5731440767829801

julia> p_win // total
48679795//84934656

Note that in Julia we can nicely compute both approximate solution (using Float64) and exact solution using rational numbers.

One important aspect that we should have checked when solving this puzzle is if we do not have an integer overflow issue when
computing the result. On 64-bit machine the overflow happens for the value:

julia> typemax(Int)
9223372036854775807

which is much larger than 12230590464. But how can we be sure that we do not get an overflow when computing 4^9 * 6^6?
The easiest check is to take the logarithms of both expressions:

julia> log(typemax(Int))
43.66827237527655

julia> 9 * log(4) + 6 * log(6)
23.227206065447348

Indeed we see that we have a wide safety margin.

Note, however, that on 32-bit machine Julia would use Int32 type to represent integers by default, and hen we have:

julia> typemax(Int32)
2147483647

julia> log(typemax(Int32))
21.487562596892644

And in this case we would have an integer overflow issue, so some care is needed.

Conclusions

I hope you enjoyed the puzzle, the solution, and the code examples I have presented!

Getting ready for JuliaCon Local Eindhoven 2023

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/13/juliacon.html

Introduction

I am really excited that soon we will have a Julia conference in Europe.
The JuliaCon Local Eindhoven 2023 will take place on December 1, 2023 in Eindhoven.

I thought that as nice preparation for this event it would be useful to share
some tips that can be useful for DataFrames.jl users.
Frequently, when working with tables, one has to process time series data.
I want to share a few examples how such data can be stored in DataFrames.jl.

The post was written under Julia 1.9.2 and DataFrames.jl 1.6.1.

Setting up the stage

Let us start with a data frame that has 4 columns and 1000 rows representing consecutive measurements of some data.
I also add the :period column keeping track of the moment at which the measurement was made.

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df1 = DataFrame(rand(1000, 4), :auto);

julia> df1.period = 1:1000;

julia> df1
1000×5 DataFrame
  Row │ x1         x2         x3        x4         period
      │ Float64    Float64    Float64   Float64    Int64
──────┼───────────────────────────────────────────────────
    1 │ 0.579862   0.473374   0.366288  0.771849        1
    2 │ 0.411294   0.176997   0.652475  0.938484        2
    3 │ 0.972136   0.676856   0.900155  0.390673        3
    4 │ 0.0149088  0.177963   0.374975  0.694063        4
  ⋮   │     ⋮          ⋮         ⋮          ⋮        ⋮
  997 │ 0.424295   0.78012    0.577681  0.274228      997
  998 │ 0.481554   0.0441882  0.700181  0.317677      998
  999 │ 0.401528   0.658903   0.438309  0.211217      999
 1000 │ 0.58985    0.909857   0.354735  0.442829     1000
                                          992 rows omitted

Reshaping data: wide to long

The first operation we want to do is reshaping the data from wide to long format,
in which we will have three columns: the period, the variable name, and the variable value.
To achieve this we can use the stack function:

julia> df2 = stack(df1, Not(:period))
4000×3 DataFrame
  Row │ period  variable  value
      │ Int64   String    Float64
──────┼─────────────────────────────
    1 │      1  x1        0.579862
    2 │      2  x1        0.411294
    3 │      3  x1        0.972136
    4 │      4  x1        0.0149088
  ⋮   │   ⋮        ⋮          ⋮
 3997 │    997  x4        0.274228
 3998 │    998  x4        0.317677
 3999 │    999  x4        0.211217
 4000 │   1000  x4        0.442829
                   3992 rows omitted

Reshaping data: long to wide

Now let us perform the reverse operation, but this time put :period as columns.
We can do it using the unstack function:

julia> df3 = unstack(df2, :variable, :period, :value)
4×1001 DataFrame
 Row │ variable  1         2         3         4          5         6        ⋯
     │ String    Float64?  Float64?  Float64?  Float64?   Float64?  Float64? ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1        0.579862  0.411294  0.972136  0.0149088  0.520355  0.639562 ⋯
   2 │ x2        0.473374  0.176997  0.676856  0.177963   0.670122  0.042361
   3 │ x3        0.366288  0.652475  0.900155  0.374975   0.387857  0.779594
   4 │ x4        0.771849  0.938484  0.390673  0.694063   0.724383  0.130453
                                                           995 columns omitted

Nesting columns

Another way to reshape this data is to put all periods in a single cell of a data frame.
We can do it in two ways.

The first is to keep them in columns :x1, :x2, :x3, and :x4. Here are two ways
how you can do it starting from df1 or df2:

julia> combine(df1, Not(:period) .=> Ref, renamecols=false)
1×4 DataFrame
 Row │ x1                                 x2                                 ⋯
     │ Array…                             Array…                             ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ [0.579862, 0.411294, 0.972136, 0…  [0.473374, 0.176997, 0.676856, 0…  ⋯
                                                             2 columns omitted
julia> unstack(df2, [], :variable, :value, combine=collect)
1×4 DataFrame
 Row │ x1                                 x2                                 ⋯
     │ Array…?                            Array…?                            ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ [0.579862, 0.411294, 0.972136, 0…  [0.473374, 0.176997, 0.676856, 0…  ⋯
                                                             2 columns omitted

Alternatively, we could want to have two columns, the first with variable names
and the second with the vectors. Again, let me show two ways to do it starting
from df2 and df3 data frames:

julia> df4 = combine(groupby(df2, :variable), :value => Ref, renamecols=false)
4×2 DataFrame
 Row │ variable  value
     │ String    SubArray…
─────┼─────────────────────────────────────────────
   1 │ x1        [0.579862, 0.411294, 0.972136, 0…
   2 │ x2        [0.473374, 0.176997, 0.676856, 0…
   3 │ x3        [0.366288, 0.652475, 0.900155, 0…
   4 │ x4        [0.771849, 0.938484, 0.390673, 0…

julia> combine(df3, :variable, AsTable(Not(:variable)) => ByRow(identity∘collect) => :value)
4×2 DataFrame
 Row │ variable  value
     │ String    Array…
─────┼─────────────────────────────────────────────
   1 │ x1        Union{Missing, Float64}[0.579862…
   2 │ x2        Union{Missing, Float64}[0.473374…
   3 │ x3        Union{Missing, Float64}[0.366288…
   4 │ x4        Union{Missing, Float64}[0.771849…

These examples are slightly more complicated so let me explain them.

In the first one we use Ref to protect a vector against expansion into multiple columns
(note that no copying of data happens here, to perform a copy you should write Ref∘copy).

In the second one the AsTable(Not(:variable)) source column selector produces a NamedTuple.
However, Julia users probably are aware that with 1000 entries such a NamedTuple would take a long
time to compile. For this reason we are using an optimization that DataFrames.jl provides.
If we pass a function of a form some_function∘collect then this compilation step is avoided.
In our case the function is just identity as we are happy with producing a vector (which collect already returns).

Unnesting columns

As a last example let us show how to expand the df4 data frame back into multiple columns.
This is easy. Just write:

julia> select(df4, :variable, :value => AsTable)
4×1001 DataFrame
 Row │ variable  x1        x2        x3        x4         x5        x6       ⋯
     │ String    Float64   Float64   Float64   Float64    Float64   Float64  ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1        0.579862  0.411294  0.972136  0.0149088  0.520355  0.639562 ⋯
   2 │ x2        0.473374  0.176997  0.676856  0.177963   0.670122  0.042361
   3 │ x3        0.366288  0.652475  0.900155  0.374975   0.387857  0.779594
   4 │ x4        0.771849  0.938484  0.390673  0.694063   0.724383  0.130453
                                                           995 columns omitted

Since we used AsTable as target column names specifier we got auto-generated column names.

Conclusions

I hope you liked the examples I presented today and learned something from them.

I am sure that if you attend JuliaCon Local Eindhoven 2023 you are going to
see a lot of highly informative and inspirational talks there!

Column unioning in Tables.jl: row vs column oriented storage

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/06/tables.html

Introduction

Today I want to continue exploration of Tables.jl package functionality.
Some time ago I wrote a post about getting a schema of Tables.jl tables.
In the post I discussed, in particular, “column unioning”.

The column unioning approach allows you to put data with heterogenous entries into a single table
by filling the missing entries with missing values.

In my previous post I discussed the Tables.dictrowtable function that performs column unioning.
Today I want to introduce you to the Tables.dictcolumntable function that has a similar functionality,
but uses a different storage format.

The post was written using Julia 1.9.2, Tables.jl 1.11.0, and DataFrames.jl 1.6.1.

Introduction: an example of column unioning

Let us start with a simple example of column unioning behavior:

julia> using DataFrames

julia> vnt = [(a=1, b=2), (a=3, c=4), (b=5, d=6)]
3-element Vector{NamedTuple{names, Tuple{Int64, Int64}} where names}:
 (a = 1, b = 2)
 (a = 3, c = 4)
 (b = 5, d = 6)

julia> DataFrame(Tables.dictrowtable(vnt))
3×4 DataFrame
 Row │ a        b        c        d
     │ Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────
   1 │       1        2  missing  missing
   2 │       3  missing        4  missing
   3 │ missing        5  missing        6

julia> DataFrame(Tables.dictcolumntable(vnt))
3×4 DataFrame
 Row │ a        b        c        d
     │ Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────
   1 │       1        2  missing  missing
   2 │       3  missing        4  missing
   3 │ missing        5  missing        6

We have a heterogeneous table vnt (a vector of named tuples).
Each of its rows has a different set of columns.
After column unioning operation we get a table (displayed as a data frame in our example) where each row has four columns and the missing values are filled with missing.

Such situations are quite common when one e.g. processes JSON data and each object may have a varying set of attributes.

Why do we have two functions that perform column unioning?

A natural question to ask is why do we have Tables.dictrowtable and Tables.dictcolumntable that perform the same operation?

The reason is simple. Tables.dictrowtable returns a row-oriented object, while Tables.dictcolumntable a column oriented one.

We can easily check it by digging into the internals of these objects:

julia> getfield(Tables.dictrowtable(vnt), :values)
3-element Vector{Dict{Symbol, Any}}:
 Dict(:a => 1, :b => 2)
 Dict(:a => 3, :c => 4)
 Dict(:b => 5, :d => 6)

julia> getfield(Tables.dictcolumntable(vnt), :values)
OrderedCollections.OrderedDict{Symbol, AbstractVector} with 4 entries:
  :a => Union{Missing, Int64}[1, 3, missing]
  :b => Union{Missing, Int64}[2, missing, 5]
  :c => Union{Missing, Int64}[missing, 4, missing]
  :d => Union{Missing, Int64}[missing, missing, 6]

As you can see Tables.dictrowtable internally stores a vector of dictionaries, while Tables.dictcolumntable a dictionary of vectors.

So the question is when you should use which? A simple answer is that most of the time Tables.dictcolumntable is what you want,
unless you want to process data row-by-row and the data is very sparse. The reason is that creating a Dict for each row of the data
has a big overhead. Additionally, most often analytical routines expect data stored in columns. For example DataFrame has a columnar storage.
Therefore, creating a data frame from a Tables.dictcolumntable object will be much faster.

Let us make a small test (I repeat the same @time call several times to capture the variability of the results and make sure all gets compiled):

julia> vnt2 = repeat(vnt, 10^6);

julia> @time DataFrame(Tables.dictrowtable(vnt2));
  9.625842 seconds (99.00 M allocations: 5.018 GiB, 20.14% gc time)

julia> @time DataFrame(Tables.dictrowtable(vnt2));
  9.792331 seconds (99.00 M allocations: 5.018 GiB, 24.41% gc time)

julia> @time DataFrame(Tables.dictcolumntable(vnt2));
  2.029009 seconds (30.00 M allocations: 892.629 MiB)

julia> @time DataFrame(Tables.dictcolumntable(vnt2));
  2.251707 seconds (30.00 M allocations: 892.629 MiB)

Indeed we have a faster execution time, less allocations, and less memory allocated with Tables.dictcolumntable.

If we wanted to squeeze out more performance we could pass copycols=false to DataFrame constructor since we know
that in this case we can safely skip making copies of the vectors representing columns that are passed to the constructor:

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  2.584373 seconds (30.02 M allocations: 790.861 MiB, 17.83% gc time, 3.40% compilation time)

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  2.139848 seconds (30.00 M allocations: 789.632 MiB)

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  1.959573 seconds (30.00 M allocations: 789.632 MiB)

Indeed we get less allocations, but the timing of the operation is only minimally better.

Conclusions

I dedicated my post today only to two functions Tables.dictrowtable and Tables.dictcolumntable.
The reason is that they are not commonly known, and at the same time they are very useful if you
need column unioning behavior when working with your source data. Happy hacking!