Tag Archives: julialang

One puzzle, two solutions

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/02/pe116.html

Introduction

Today, I wanted to switch back to a lighter subject.
Therefore I decided to have a look at my favorite Project Euler website.

I picked the problem 116 as I have not tried to solve it yet.
Interestingly, it turned out that there are two ways to approach this puzzle,
so I thought to share them here.

The post was written under Julia 1.10.0.

The puzzle

The Project Euler puzzle 116 can be briefly stated as follows:

Given a row of 50 grey squares is to have a number of its tiles replaced with
coloured oblong tiles chosen from red (length two),
green (length three), or blue (length four).
How many different ways can the grey tiles be replaced if colours
cannot be mixed and at least one coloured tile must be used?

(If you want to see some visual examples of valid tilings, I encourage you to
visit the puzzle 116 page.)

The first approach

When we think of this problem, it is natural to generalize it. By C(n, d) we can
define the number of ways that n gray squares can be replaced with tiles of length d.
Then the solution to our problem is C(n, 2) + C(n, 3) + C(n, 4).
So let us focus on computing C(n, d) (assuming d is positive).

The first approach is to ask how many tiles of length d can be put. There must be at least
1, and we cannot put more than n ÷ d (here I use the ÷ notation taken from Julia that
denotes integer division; in other words the integer part of n / d).

So now assume that we want to put i blocks of length d (assuming i is valid). In
how many ways can we do it. Well, we put i long blocks and we are left with n - d*i gray blocks.
In total we have i + (n - d*i) blocks. You can then think of it as you have that many slots
from which you need pick i slots to put the long blocks. The number of ways you can do it is
given by the value of binomial coefficient. In Julia notation it is:
binomial(BigInt(i + (n - d*i)), BigInt(i)).

Now you might ask why I put the BigInt wrapper around the passed numbers? The reason is
that binomial coefficient gets large pretty quickly, so I want to make sure I will not
have issues with integer overflow.

Given these considerations the first function that produces C(n, d) can be defined as:

function C1(n::Integer, d::Integer)
    @assert d > 0 && n >= 0
    return sum(i -> binomial(BigInt(i + (n - d*i)), BigInt(i)), 1:n ÷ d; init=big"0")
end

Note that I use the init=big"0" initialization statement in the sum to ensure the
correct handling of the cases when n < d when we are given an empty collection to sum over.

The second approach

However, there is a different way how we can think of computing C(n, d).

Assume we know the values of C(n, d) for values of n smaller than the requested one.

We look at the last tile in our row.

If it is empty, then we are down to n-1 tiles to be filled.
This can be done in C(n-1, d) ways (remember that this value takes care
of the fact that at least one block of length d has to be used).

But what if the last tile in our row is filled with a block of length d?
Then we have two options. Either all other blocks are left gray (which gives us 1 combination)
or we are left with n-d tiles that are filled with at least one block of length d. The
second value is exactly C(n-d, d).

In summary we get that C(n, d) = C(n-1, d) + C(n-d, d) + 1.

This formula assumes n is at least d. But clearly for n < d
we have 0 ways to arrange the blocks.

Let us write down the code that performs the required computation:

function C2(n::Integer, d::Integer)
    @assert d > 0 && n >= 0
    npos = Dict{Int,BigInt}(i => 0 for i in 0:d-1)
    for j in d:n
        npos[j] = npos[j-1] + npos[j-d] + 1
    end
    return npos[n]
end

Note in the code that I used the npos dictionary to flexibly allow
for any potential integer values of n. The dictionary has
Dict{Int,BigInt} type, again, to ensure that the results of the computations
are stored correctly even if they are large.

Testing

Now we have two functions C1 and C2 that look completely differently.
Do they produce the same results. Let us check:

julia> using Test

julia> @testset "test C1 and C2 equality" begin
           for n in 0:200, d in 1:20
               @test C1(n, d) == C2(n, d)
           end
       end;
Test Summary:           | Pass  Total  Time
test C1 and C2 equality | 4020   4020  0.9s

Indeed we see that both C1 and C2 functions produce the same results.

To convince ourselves that using arbitrary precision integers was indeed needed
let us check some example values of the functions:

julia> C1(200, 2)
453973694165307953197296969697410619233825

julia> C2(200, 2)
453973694165307953197296969697410619233825

julia> typemax(Int)
9223372036854775807

Indeed, if we were not careful, we would have an integer overflow issue.

Conclusions

As usual I will not show the value of the solution to the problem to encourage you
to run the code yourself. You can get it by executing either
sum(d -> C1(50, d), 2:4) or sum(d -> C2(50, d), 2:4).
(We have just checked that the value produced in both cases is the same).

My understanding of object property access in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/01/26/fields.html

Introduction

Today I wanted to discuss a conceptual aspect of Julia programming.
It is related to the question how you should query some object for its properties.
The topic is especially relevant if you want to write code that is expected to be stable
in the longer term, that means that it is easy to maintain as versions of its dependencies change.

The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.

The internals

A fundamental element of Julia design are composite types. This kind of object
is a collection of fields, that have names. Each of such fields can hold some value.

To make things non-abstract let us have a look at a SubDataFrame type from DataFrames.jl.
First create an instance of such object:

julia> using DataFrames

julia> df = DataFrame(x=1:3, y=11:13, z=111:113)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11    111
   2 │     2     12    112
   3 │     3     13    113

julia> sdf = @view df[1:2, 1:2]
2×2 SubDataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12

To check what fields SubDataFrame contains you can use the the fieldnames function:

julia> fieldnames(SubDataFrame)
(:parent, :colindex, :rows)

Note that we pass a type to fieldnames. It is important – the list of fields is fixed for every
instance of an object of a given type.

In this case we learned that SubDataFrame has three fields. The three functions associated with
fieldnames are: fieldcount returning the number of fields of a type,
fieldtypes returning their declared types, and hasfield allowing you
to query if a specific field is present. There is an example:

julia> fieldcount(SubDataFrame)
3

julia> fieldtypes(SubDataFrame)
(AbstractDataFrame, DataFrames.AbstractIndex, AbstractVector{Int64})

julia> hasfield(SubDataFrame, :parent)
true

julia> hasfield(SubDataFrame, :parentx)
false

For a given instance of a type you can query the field with getfield and set it with setfield!.
For example, let us get the field :parent of our sdf object (a source data frame in this case):

julia> getfield(sdf, :parent)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11    111
   2 │     2     12    112
   3 │     3     13    113

Having learned all these methods you might ask yourself when to use it. The short answer is:

Never directly access fields of a type. They might be changed
between versions of code you use without warning.

The longer answer is that you should assume that direct field access is typically considered internal.
The list and fields and their types are an implementation detail and as a user of this type you should
not rely on them. The use of property access is restricted to the designers of a type to allow them
manipulate its inner physical representation.

So how should we work with composite types then?

The composite type interface

Julia introduces a concept of property that is a logical representation of data stored in a given object.
You can query for properties of an object with the propertynames function. You also have the hasproperty,
getproperty and setproperty! functions similar as for fields.

In case of our sdf SubDataFrame we have the following logical representation:

julia> propertynames(sdf)
2-element Vector{Symbol}:
 :x
 :y

julia> hasproperty(sdf, :x)
true

julia> getproperty(sdf, :x)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 1
 2

julia> setproperty!(sdf, :x, [1001, 1002])
2-element Vector{Int64}:
 1001
 1002

julia> sdf
2×2 SubDataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │  1001     11
   2 │  1002     12

We immediately see a significant difference. The sdf properties in this case are columns of our data frame.
We do not care how they are mapped to a physical representation of SubDataFrame, this is taken care of
by designers of the DataFrames.jl package.

There are the following important aspects of properties.

The first is that property access is typically considered a public API.
Designers of the type should make sure that the way you can access properties
of an object should remain stable and a change in this area would be breaking, so:

You should access properties of objects in your code (not fields).

The second is that properties are bound to object, not to a type.
This means that different objects of the same type may have different sets of properties.
It is quite useful, e.g. each data frame can have a different set of columns.

The third, practical, information is that by default properties fall back to fields,
as you can read here in the Julia Manual.

The next aspect is convenient syntax.
You do not need to call the getproperty and setproperty! functions explicitly.
The getproperty(a, :b) is equivalent to a.b, and setproperty!(a, :b, v) is the same as a.b = v.

Finally note that the propertynames function optionally takes a second positional argument
that is Bool. If it is passed and set to true you get a list of all properties of some object.
By default the second argument is false and you get a list of public properties of some object
(and in practice you should use the default mode).

Conclusions

Today I have a short conclusion.

Fields represent physical layout of a type.
Properties represent a logical view of an object.

In your code use object properties and not their fields.
Field access is considered internal and typically should be only done by developers of a
package providing a given object.

A little exercise in CSV.jl and DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/01/19/puzzles.html

Introduction

This week I have discussed with my colleague the Lichess puzzle dataset
that I use in my Julia for Data Analysis book.

The dataset contains a list of puzzles along with information about them,
such as puzzle difficulty, puzzle solution, and tags describing puzzle type.

We were discussing if tags assigned to puzzles in this dataset are accurate.
In this post I give you an example how one can check it
(and practice a bit CSV.jl and DataFrames.jl).

The post was written under Julia 1.10.0, CSV.jl 0.10.12, and DataFrames.jl 1.6.1.

Getting the data

In this post I show you a relatively brief code. Therefore I assume that first
you download the file with the puzzle dataset and unpack it manually.
(In the book I show how to do it using Julia. You can find the source code on
GitHub repository of the book.)

Assuming you downloaded and unpacked the dataset into the puzzles.csv file
we read it in. We are interested only in columns 3 and 8 of this file,
so I use the following commands:

julia> using CSV

julia> using DataFrames

julia> df = CSV.read("puzzles.csv", DataFrame; select=[3, 8], header=false)
2132989×2 DataFrame
     Row │ Column3                            Column8
         │ String                             String
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1      crushing hangingPiece long middl…
       2 │ d3d6 f8d8 d6d8 f6d8                advantage endgame short
       3 │ b6c5 e2g4 h3g4 d1g4                advantage middlegame short
       4 │ g5e7 a5c3 b2c3 c6e7                advantage master middlegame short
       5 │ e8f7 e2e6 f7f8 e6f7                mate mateIn2 middlegame short
       6 │ a6a5 e5c7 a5b4 c7d8                crushing endgame fork short
       7 │ d4b6 f6e4 h1g1 e4f2                crushing endgame short trappedPi…
       8 │ d8f6 d1h5 h7h6 h5c5                advantage middlegame short
    ⋮    │                 ⋮                                  ⋮
 2132982 │ d2c2 c5d3 c2d3 c4d3                crushing fork middlegame short
 2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7      crushing long middlegame quietMo…
 2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6      crushing defensiveMove endgame l…
 2132985 │ g1h1 e3e1 f7f1 e1f1                endgame mate mateIn2 short
 2132986 │ g5c1 d5d6 d7f6 h7h8                advantage middlegame short
 2132987 │ d2f3 d8a5 c1d2 a5b5                advantage fork opening short
 2132988 │ f7f2 b2c2 c1b1 e2d1                endgame mate mateIn2 queensideAt…
 2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3      advantage long opening
                                                            2132973 rows omitted

julia> rename!(df, ["moves", "tags"])
2132989×2 DataFrame
     Row │ moves                              tags
         │ String                             String
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1      crushing hangingPiece long middl…
       2 │ d3d6 f8d8 d6d8 f6d8                advantage endgame short
       3 │ b6c5 e2g4 h3g4 d1g4                advantage middlegame short
       4 │ g5e7 a5c3 b2c3 c6e7                advantage master middlegame short
       5 │ e8f7 e2e6 f7f8 e6f7                mate mateIn2 middlegame short
       6 │ a6a5 e5c7 a5b4 c7d8                crushing endgame fork short
       7 │ d4b6 f6e4 h1g1 e4f2                crushing endgame short trappedPi…
       8 │ d8f6 d1h5 h7h6 h5c5                advantage middlegame short
    ⋮    │                 ⋮                                  ⋮
 2132982 │ d2c2 c5d3 c2d3 c4d3                crushing fork middlegame short
 2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7      crushing long middlegame quietMo…
 2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6      crushing defensiveMove endgame l…
 2132985 │ g1h1 e3e1 f7f1 e1f1                endgame mate mateIn2 short
 2132986 │ g5c1 d5d6 d7f6 h7h8                advantage middlegame short
 2132987 │ d2f3 d8a5 c1d2 a5b5                advantage fork opening short
 2132988 │ f7f2 b2c2 c1b1 e2d1                endgame mate mateIn2 queensideAt…
 2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3      advantage long opening
                                                            2132973 rows omitted

Note that the file does not have a header so when reading it we passed header=false
and then manually named the columns using rename!.

The task

I wanted only these two columns since today I want to check if the tags related
to mating are accurate. You can notice in the above printout that in the "tags"
column we have a tag "mateIn2". It indicates that the puzzle is mate in two moves.
This is the case for example for rows 5, 2132985, and 2132988.
In the matching "moves" column we see that we have 4 corresponding moves.
The reason is that we have two players making the move (and 2 + 2 = 4).

What we want to check if these "mateInX" tags are correct. I will check the
values of X from 1 to 5 (as only these five options are present in tags,
I leave it to you as an exercise to verify).

When should we call the tags correct. There are two conditions:

  • there is no duplicate tagging (e.g. a puzzle cannot be "mateIn1" and "mateIn2" at the same time);
  • the number of moves in a puzzle matches the tag.

Let us check it.

The solution

As a first step we (in place, i.e. modifying our df data frame) transform the original columns
into more convenient form. Instead of the raw "moves" I want the "nmoves" column that gives me
a number of moves in the puzzle. Similarly instead of "tags" I want indicator columns "mateInX"
for X ranging from 1 to 5 showing me the puzzle type. Here is how you can achieve this:

julia> select!(df,
               "moves" => ByRow(length∘split) => "nmoves",
               ["tags" => ByRow(contains("mateIn$i")) => "mateIn$i" for i in 1:5])
2132989×6 DataFrame
     Row │ nmoves  mateIn1  mateIn2  mateIn3  mateIn4  mateIn5
         │ Int64   Bool     Bool     Bool     Bool     Bool
─────────┼─────────────────────────────────────────────────────
       1 │      6    false    false    false    false    false
       2 │      4    false    false    false    false    false
       3 │      4    false    false    false    false    false
       4 │      4    false    false    false    false    false
       5 │      4    false     true    false    false    false
       6 │      4    false    false    false    false    false
       7 │      4    false    false    false    false    false
       8 │      4    false    false    false    false    false
    ⋮    │   ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 2132982 │      4    false    false    false    false    false
 2132983 │      6    false    false    false    false    false
 2132984 │      6    false    false    false    false    false
 2132985 │      4    false     true    false    false    false
 2132986 │      4    false    false    false    false    false
 2132987 │      4    false    false    false    false    false
 2132988 │      4    false     true    false    false    false
 2132989 │      6    false    false    false    false    false
                                           2132973 rows omitted

Now we see that some of the rows are not tagged as "mateInX". Let us filter them out,
to have only tagged rows left (again, we do the operation in-place):

julia> filter!(row -> any(row[Not("nmoves")]), df)
491743×6 DataFrame
    Row │ nmoves  mateIn1  mateIn2  mateIn3  mateIn4  mateIn5
        │ Int64   Bool     Bool     Bool     Bool     Bool
────────┼─────────────────────────────────────────────────────
      1 │      4    false     true    false    false    false
      2 │      4    false     true    false    false    false
      3 │      2     true    false    false    false    false
      4 │      4    false     true    false    false    false
      5 │      2     true    false    false    false    false
      6 │      4    false     true    false    false    false
      7 │      4    false     true    false    false    false
      8 │      2     true    false    false    false    false
   ⋮    │   ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 491736 │      6    false    false     true    false    false
 491737 │      4    false     true    false    false    false
 491738 │      2     true    false    false    false    false
 491739 │      4    false     true    false    false    false
 491740 │      2     true    false    false    false    false
 491741 │      2     true    false    false    false    false
 491742 │      4    false     true    false    false    false
 491743 │      4    false     true    false    false    false
                                           491727 rows omitted

Note that in the condition I used the row[Not("nmoves")] selector, as I wanted to check all columns except the "nmoves".

Now we are ready to check the correctness of tags:

julia> combine(groupby(df, "nmoves"), Not("nmoves") .=> sum)
10×6 DataFrame
 Row │ nmoves  mateIn1_sum  mateIn2_sum  mateIn3_sum  mateIn4_sum  mateIn5_sum
     │ Int64   Int64        Int64        Int64        Int64        Int64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │      2       136843            0            0            0            0
   2 │      4            0       274135            0            0            0
   3 │      6            0            0        68623            0            0
   4 │      8            0            0            0         9924            0
   5 │     10            0            0            0            0         1691
   6 │     12            0            0            0            0          367
   7 │     14            0            0            0            0          127
   8 │     16            0            0            0            0           25
   9 │     18            0            0            0            0            7
  10 │     20            0            0            0            0            1

The table reads as follows:

  • There are no duplicates in tags.
  • Tags "mateInX" for X in 1 to 4 range are correct.
    The "mateIn5" tag actually means a situation where there are five or more moves.

So the verdict is that tagging is correct, but we need to know the interpretation of
"mateIn5" column as it is actually five or more moves. We could rename the column to
e.g. "mateIn5+" to reflect that or add a metadata to our df table where we would store
this information (I leave this to you as an exercise).

Conclusions

I hope that CSV.jl and DataFrames.jl users found the examples that I gave today useful and interesting. Enjoy!