Author Archives: Blog by Bogumił Kamiński

What is the shortest path to AlphaTensor?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/10/21/distances.html

Introduction

Recently I have talked with my friend who teaches undergraduate algebra class.
We discussed how one can show usefulness of matrix multiplication to students.
At the same time a paper about AlphaTensor has been published in nature
showing how reinforcement learning was used to discover fast matrix
multiplication algorithms.

Trying to combine these two events I thought of a problem where multiplication
is much more expensive than addition. You might ask why? Because this condition
is required for algorithms like ones discovered by AlphaTensor to be indeed
beneficial. One natural candidate is of course multiplication of large
matrices, but I wanted something where multiplication of underlying elements of
a matrix is much more expensive than their addition. At the same time I wanted
a nice example for my friend.

After some thought I decided to use some graph theory. The problem I will write
about today is finding shortest paths between all pairs of vertices in a graph.
One of the standard algorithms that can be used to perform this task is
Floyd–Warshall. Interestingly, the path finding problem can be expressed
in terms of matrix multiplication, as is mentioned here. See e.g.
chapter 5.9 in “The Design and Analysis of Computer Algorithms”
for an introductory explanation.

In this post I will not go into all details of the algorithmic discussion (to
leave something for teachers of algebra classes, as it requires giving an
introduction to both linear algebra and group theory). Instead, I want to show
you, by example, how to compute shortest paths using matrix multiplication and
check that indeed we get the expected result. This means that the post today is
a bit harder than usual to follow every detail of what I do, but I hope you will
still enjoy the narrative.

The presented code was tested under Julia 1.8.2 and Graphs.jl 1.7.4.

Setting up the scene

Let us first generate some test graph and compute distances between all
its nodes using Floyd-Warshall algorithm. In the example we will use
a well known Zachary’s karate club graph.

julia> using Graphs

julia> g = smallgraph(:karate)
{34, 78} undirected simple Int64 graph

julia> connected_components(g)
1-element Vector{Vector{Int64}}:
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  25, 26, 27, 28, 29, 30, 31, 32, 33, 34]

julia> g_distances = floyd_warshall_shortest_paths(g).dists
34×34 Matrix{Int64}:
 0  1  1  1  1  1  1  1  1  2  …  2  3  2  2  3  2  1  2  2
 1  0  1  1  2  2  2  1  2  2     3  3  2  2  3  1  2  2  2
 1  1  0  1  2  2  2  1  1  1     3  3  1  1  2  2  2  1  2
 1  1  1  0  2  2  2  1  2  2     3  3  2  2  3  2  2  2  2
 1  2  2  2  0  2  1  2  2  3     3  4  3  3  4  3  2  3  3
 1  2  2  2  2  0  1  2  2  3  …  3  4  3  3  4  3  2  3  3
 ⋮              ⋮              ⋱  ⋮              ⋮
 2  2  1  2  3  3  3  2  2  2     2  2  2  0  2  2  1  2  1
 3  3  2  3  4  4  4  3  2  2     2  1  2  2  0  2  2  1  1
 2  1  2  2  3  3  3  2  1  2  …  3  2  2  2  2  0  2  1  1
 1  2  2  2  2  2  2  2  2  2     1  2  2  1  2  2  0  1  1
 2  2  1  2  3  3  3  2  1  2     2  2  2  2  1  1  1  0  1
 2  2  2  2  3  3  3  3  1  1     2  1  1  1  1  1  1  1  0

julia> maximum(g_distances)
5

Let us look at the plot of the graph:

Zachary's karate club graph

Indeed, we can notice that the graph has one connected component and the biggest
distance between two nodes is 5, e.g. between nodes 16 and 17.

We will want now to reproduce the same distances using matrix multiplication
and next find the shortest paths also using matrix multiplication.

The first step: distances

To calculate distances let us define Distance type (I make it as simple
as possible to concentrate on matrix multiplication). I supply this type
with custom addition and multiplication definitions that are used for
distance calculation.

julia> struct Distance
           d::Float64
       end

julia> Base.:+(a::Distance, b::Distance) = Distance(min(a.d, b.d))

julia> Base.:*(a::Distance, b::Distance) = Distance(a.d + b.d)

julia> Base.zero(::Distance) = Distance(Inf)

julia> Base.show(io::IO, d::Distance) = show(io, d.d)

julia> Base.float(d::Distance) = d.d

We will start our computations with adjacency matrix, where we put Inf
for nodes that are not connected:

julia> n = nv(g)
34

julia> dm = Distance.([has_edge(g, i, j) ? 1 : i == j ? 0 : Inf
                       for i in 1:n, j in 1:n])
34×34 Matrix{Distance}:
 0.0  1.0  1.0  1.0  1.0  1.0  …  Inf  Inf  1.0  Inf  Inf
 1.0  0.0  1.0  1.0  Inf  Inf     Inf  1.0  Inf  Inf  Inf
 1.0  1.0  0.0  1.0  Inf  Inf     Inf  Inf  Inf  1.0  Inf
 1.0  1.0  1.0  0.0  Inf  Inf     Inf  Inf  Inf  Inf  Inf
 1.0  Inf  Inf  Inf  0.0  Inf     Inf  Inf  Inf  Inf  Inf
 1.0  Inf  Inf  Inf  Inf  0.0  …  Inf  Inf  Inf  Inf  Inf
 ⋮                        ⋮    ⋱       ⋮
 Inf  Inf  1.0  Inf  Inf  Inf     Inf  Inf  1.0  Inf  1.0
 Inf  Inf  Inf  Inf  Inf  Inf     0.0  Inf  Inf  1.0  1.0
 Inf  1.0  Inf  Inf  Inf  Inf  …  Inf  0.0  Inf  1.0  1.0
 1.0  Inf  Inf  Inf  Inf  Inf     Inf  Inf  0.0  1.0  1.0
 Inf  Inf  1.0  Inf  Inf  Inf     1.0  1.0  1.0  0.0  1.0
 Inf  Inf  Inf  Inf  Inf  Inf     1.0  1.0  1.0  1.0  0.0

In the implementation Inf means that at some point we have not yet found
any path between nodes so their distance is infinite.

We are now ready to solve the problem. Since we know that maximal distance
in the graph is five, we just can do:

julia> dm^5
34×34 Matrix{Distance}:
 0.0  1.0  1.0  1.0  1.0  1.0  …  3.0  2.0  1.0  2.0  2.0
 1.0  0.0  1.0  1.0  2.0  2.0     3.0  1.0  2.0  2.0  2.0
 1.0  1.0  0.0  1.0  2.0  2.0     2.0  2.0  2.0  1.0  2.0
 1.0  1.0  1.0  0.0  2.0  2.0     3.0  2.0  2.0  2.0  2.0
 1.0  2.0  2.0  2.0  0.0  2.0     4.0  3.0  2.0  3.0  3.0
 1.0  2.0  2.0  2.0  2.0  0.0  …  4.0  3.0  2.0  3.0  3.0
 ⋮                        ⋮    ⋱       ⋮
 2.0  2.0  1.0  2.0  3.0  3.0     2.0  2.0  1.0  2.0  1.0
 3.0  3.0  2.0  3.0  4.0  4.0     0.0  2.0  2.0  1.0  1.0
 2.0  1.0  2.0  2.0  3.0  3.0  …  2.0  0.0  2.0  1.0  1.0
 1.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  0.0  1.0  1.0
 2.0  2.0  1.0  2.0  3.0  3.0     1.0  1.0  1.0  0.0  1.0
 2.0  2.0  2.0  2.0  3.0  3.0     1.0  1.0  1.0  1.0  0.0

to get the required distances. Let us check:

julia> float.(dm^5) == g_distances
true

Also note that further multiplications of dm do not change anything:

julia> dm^5 == dm^6
true

However, clearly five multiplications are needed:

julia> dm^5 == dm^4
false

In this example addition is min and multiplication is +. Can we do better
and make a bigger difference in cost of these two operations? Yes, we can.
Let us, instead of calculating distances calculate shortest paths.

The final blow: shortest paths

Let us now use the same pattern to compute shortest paths. First define
Path type:

julia> struct Path
           p::Union{Nothing, Vector{Int}}
       end

julia> function Base.:+(a::Path, b::Path)
           isnothing(a.p) && return b
           isnothing(b.p) && return a
           return length(a.p) < length(b.p) ? a : b
       end

julia> function Base.:*(a::Path, b::Path)
           isnothing(a.p) && return Path(nothing)
           isnothing(b.p) && return Path(nothing)
           @assert last(a.p) == first(b.p)
           return Path([a.p; b.p[2:end]])
       end

julia> Base.zero(::Path) = Path(nothing)

julia> Base.show(io::IO, p::Path) = show(io, p.p)

julia> Distance(p::Path) = Distance(isnothing(p.p) ? Inf : length(p.p) - 1)

julia> Base.:(==)(a::Path, b::Path) = a.p == b.p

This time the design is a bit more complex. We denote non-existent path with
nothing. The plus is that clearly multiplication, which involves vector
concatenation, is significantly more expensive than addition (I know it could
be optimized, but I wanted to keep the code simple).

Let us check if indeed still matrix multiplication gives us the desired solution:

julia> pm = [has_edge(g, i, j) ? Path([i, j]) : Path(i == j ? [i] : nothing)
             for i in 1:n, j in 1:n]
34×34 Matrix{Path}:
 [1]      [1, 2]   [1, 3]   [1, 4]   …  nothing   nothing
 [2, 1]   [2]      [2, 3]   [2, 4]      nothing   nothing
 [3, 1]   [3, 2]   [3]      [3, 4]      [3, 33]   nothing
 [4, 1]   [4, 2]   [4, 3]   [4]         nothing   nothing
 [5, 1]   nothing  nothing  nothing     nothing   nothing
 [6, 1]   nothing  nothing  nothing  …  nothing   nothing
 ⋮                                   ⋱
 nothing  nothing  [29, 3]  nothing     nothing   [29, 34]
 nothing  nothing  nothing  nothing     [30, 33]  [30, 34]
 nothing  [31, 2]  nothing  nothing  …  [31, 33]  [31, 34]
 [32, 1]  nothing  nothing  nothing     [32, 33]  [32, 34]
 nothing  nothing  [33, 3]  nothing     [33]      [33, 34]
 nothing  nothing  nothing  nothing     [34, 33]  [34]

julia> pm5 = pm^5
34×34 Matrix{Path}:
 [1]              [1, 2]           …  [1, 32, 34]
 [2, 1]           [2]                 [2, 31, 34]
 [3, 1]           [3, 2]              [3, 33, 34]
 [4, 1]           [4, 2]              [4, 14, 34]
 [5, 1]           [5, 1, 2]           [5, 1, 32, 34]
 [6, 1]           [6, 1, 2]        …  [6, 1, 32, 34]
 ⋮                                 ⋱
 [29, 32, 1]      [29, 3, 2]          [29, 34]
 [30, 34, 32, 1]  [30, 34, 31, 2]     [30, 34]
 [31, 9, 1]       [31, 2]          …  [31, 34]
 [32, 1]          [32, 1, 2]          [32, 34]
 [33, 32, 1]      [33, 31, 2]         [33, 34]
 [34, 32, 1]      [34, 31, 2]         [34]

julia> Distance.(pm5) == dm^5
true

julia> Distance.(pm^6) == dm^5
true

julia> validate(g, p) = all(i -> has_path(g, p.p[i], p.p[i+1]), 1:length(p.p)-1)
validate (generic function with 1 method)

julia> all(p -> validate(g, p), pm5)
true

The validate function checks if indeed some Path is a path in our graph g.

Conclusions

So where is the AlphaTensor stuff in this post? Well – it turns out that it
was a click bait. Why cannot we apply AlphaTensor algorithm to our problem?
The issue is that it requires the set of values we operate on to be a
ring, while the non-standard addition and multiplication we have
defined provide only a tropical semiring. Therefore, standard matrix
multiplication algorithm works here, as it requires only addition and
multiplication. However, AlphaTensor algorithm requires subtraction.
So maybe it was not a click bait, but rather an example they one should always
check all assumptions of every theorem before using it?

Now let me comment a bit on a Julia side of the examples. I think it is amazing
that everything just worked if we defined custom types with a few extra methods.
Still, there is one thing that might worry you. What if matrix multiplication
algorithm implemented in Base Julia were changed to AlphaTensor or e.g.
Strassen variant (we all know that people behind linear algebra in
Julia are obsessed about performance)? I have a bad and a good news here. The
bad is that we would get an error (and would need to explicitly ask for a
standard multiplication algorithm). Why? Because when subtraction would be
attempted no method to perform it on Distance or Path type would be found.
So what is good news? Well, we would get an error and not an incorrect result,
so we would be explicitly informed that something went wrong.

You might ask why, by default, Julia uses the standard matrix multiplication
algorithm? The reason is what we discussed in the introduction. The faster
algorithms, like AlphaTensor or Strassen, give benefit only if
multiplication is much more expensive than addition. This in practice is
encountered for very large matrices. This is not a typical use case against
which default linear algebra algorithms in Julia were optimized.

I hope you enjoyed this post! Next week we will go back to discussion of
new features introduced in DataFrames.jl 1.4, so stay tuned.

130 graded exercises to train your Julia for data analysis muscle

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/10/14/exercises.html

Introduction

My Julia for Data Analysis book will be soon published (now all its
chapters are already available in preview for free).

An important part of the book is its GitHub repository containing
all the codes used in the book and ensuring their reproducibility.

Since the book was prepared to fit one semester course on data analysis using
Julia I am now preparing supporting teaching materials that accompany it.

Today together with Daniel Kaszyński we have released first part of these
supporting materials. In the exercises folder of the
book’s GitHub repository we have added 130 exercises that should help you
master the material covered in the book.

The exercises are grouped by book chapter. There are 10 exercises for
each chapter. Each exercise has a proposed solution. We have prepared the
exercises so that they have a varying difficulty level. The exercises from
initial chapters should be relatively easy. However, to solve exercises
from the final chapters you might need to have a significant knowledge of
Julia’s ecosystem for data analysis.

In the post I use Julia 1.8.2, and DataFrames.jl 1.4.1.

A sample exercise

To have some concrete example of what a typical exercise is I have picked a
question that was asked today on Discourse that I liked. The
problem is stated as follows.

Consider the following data frame:

julia> using DataFrames

julia> df = DataFrame(country=["Poland", "Poland", "Canada", "Canada"],
                      city=["Olecko", "Ełk", "Toronto", "Mississauga"])
4×2 DataFrame
 Row │ country  city
     │ String   String
─────┼──────────────────────
   1 │ Poland   Olecko
   2 │ Poland   Ełk
   3 │ Canada   Toronto
   4 │ Canada   Mississauga

The task is to reduce it by unique value in country column. More specifically
we want to create a new data frame with two columns. One of them should be
country that will store unique values of country column in the source data
frame df. The second column should be cities that should store a vector
of values in the city column from df that correspond to a given country.

Now let me show three ways how you can do it using the combine function.
The key to the solution is the following rule of how combine works
(taken from the documentation):

In all of these cases, function can return either a single row or multiple
rows. As a particular rule, values wrapped in a Ref or a
0-dimensional AbstractArray are unwrapped and then treated as a single row.

This means that in order to make a vector to be treated as a single row we have
three options:

  • wrap a vector in another vector as its single element (so we have a multi-row
    object but with a single row);
  • wrap a vector in Ref;
  • wrap a vector in a 0-dimensional AbstractArray, which can be done using the
    fill function.

So the three solutions to our problem are:

julia> combine(groupby(df, :country, sort=true), :city => (x -> [x]) => :cities)
2×2 DataFrame
 Row │ country  cities
     │ String   SubArray…
─────┼─────────────────────────────────────
   1 │ Canada   ["Toronto", "Mississauga"]
   2 │ Poland   ["Olecko", "Ełk"]

julia> combine(groupby(df, :country, sort=true), :city => Ref => :cities)
2×2 DataFrame
 Row │ country  cities
     │ String   SubArray…
─────┼─────────────────────────────────────
   1 │ Canada   ["Toronto", "Mississauga"]
   2 │ Poland   ["Olecko", "Ełk"]

julia> combine(groupby(df, :country, sort=true), :city => fill => :cities)
2×2 DataFrame
 Row │ country  cities
     │ String   SubArray…
─────┼─────────────────────────────────────
   1 │ Canada   ["Toronto", "Mississauga"]
   2 │ Poland   ["Olecko", "Ełk"]

Conclusions

I hope you will enjoy and benefit from doing the exercises that I have added
to the book’s GitHub repository.

Expect that soon two other sets of materials will be added to this repository:

  • For each chapter you will get a notebook which can serve as a starting point
    to develop teaching materials for a given chapter.
  • Several openly available notebooks with additional data analysis problems that
    are solved end-to-end. They will be prepared in a
    similar style to Hands-on Data Science with Julia notebooks and are
    meant to be studied as a follow-up material after you have studied the book.

When these are added I will post about it.

Is DataFrames.jl Hamiltonian?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/10/07/metadata.html

Introduction

Hamilton is an interesting general purpose micro-framework for
creating dataflows from python functions. It has been developed by Stitch Fix
to help managing complex DAGs, where the resulting data frames are wide
(1000+ columns).

While DataFrames.jl is not a framework for DAG execution it is a natural tool
for performing single steps in such processes. When I was reading the
explanation of motivation and design of Hamilton I was struck
by the fact that it shares similarities with some concepts in DataFrames.jl.

In this post I want to discuss some of these principles. Additionally, I want to
highlight how I believe that having metadata support in DataFrames.jl nicely
combines with them.

In the post I use Julia 1.8.2, and DataFrames.jl 1.4.1.

The principles

I do not list here all design choices that Hamilton makes. Instead I
want to discuss ideas that are similar between this framework and DataFrames.jl.

These concepts are simple:

  • if you use some column name it should have a single meaning in your pipeline;
  • every transformation should be a function with a name.

Such an approach helps with understanding of the code, its maintenance,
and documentation of the pipeline.

Let me discuss these two concepts one by one in the context of DataFrames.jl
design.

Single column name has a single meaning

This rule seems pretty intuitive but is often violated in practice. For example
users often transform a column (e.g. by taking log) and still keep its name.
The general recommendation is that such practices should be avoided. If you
significantly transform a column a recommended thing to do is to give it a new
name. In this way you are sure that you will not be confused later.

As an additional benefit of following this rule you can safely annotate your
columns with metadata to keep important information about their meaning. In
DataFrames.jl column-level metadata design works as follows:

  • you can add a keyvalue pair as metadata to any column of a data frame;
  • you can choose one of two styles of metadata propagation; it is either :note
    (signaling it is safe to retain metadata under certain transformations) and :default
    (signaling that metadata should be stripped after any transformation).

Let us see it in action. You can use colmetadata! to set column-level metadata
and colmetadata to retrieve it:

julia> using DataFrames

julia> df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])
3×2 DataFrame
 Row │ sales    year
     │ Float64  Int64
─────┼────────────────
   1 │     1.2   2001
   2 │     3.4   2002
   3 │     2.1   2003

julia> colmetadata!(df, :sales, "label", "Yearly sales in USD", style=:note);

julia> colmetadata!(df, :year, "label", "Fiscal year (ending on June 30)", style=:note);

julia> colmetadata!(df, :sales, "sum", sum(df.sales), style=:default);

julia> colmetadata(df, :sales, "label")
"Yearly sales in USD"

julia> colmetadata(df, :year, "label")
"Fiscal year (ending on June 30)"

julia> colmetadata(df, :sales, "sum")
6.699999999999999

You might ask what is the use of :note and :default metadata distinction.

Most of the time metadata of :default style are things that are specific to
only a given instance of a data frame. An example of such metadata is column
creation time. In some scenarios having such information might be used, e.g.
for auditing purposes. However, such information should not be propagated under
transformations.

On the other hand :note meatadata is kept when data frame is transformed.
The rules, in short, are:

  • if the column is not changed under transformation (data it contains remains
    unchanged) :note-style metadata is kept;
  • if you change column data, but decide to keep the original column name
    metadata is also kept.

What is the rationale behind these rules? We could be super strict and say that
metadata is propagated only if the column data does not change and its name and
does not change. However, such a rule in practice seems to be overly strict.
When you get some source data you usually might want to e.g. rename the column
to some more descriptive name or perform data cleaning (e.g. dropping rows with
missing values). In such cases users usually feel that the contents of the colum
conceptually remains the same. However, some information about data in the
column might change, like for example its length. Therefore :note-style
metadata should not be used to store information that is strictly attached to a
concrete instance of a column (:default-style metadata should be used
instead).

Let us now look back at our example code.
A typical :note-style metadata is descriptive label of a column. We used
"label" key for it. Now we made a "sum" metadata to have style :default.
Someone might have wanted to store the sum information for easy access later.
However, such metadata should not be propagated under any transformation of
a data frame as it might potentially be invalidated, so we made its style
:default.

If you ask why such metadata might be useful have a look at this example:

julia> show(df, header=colmetadata.(Ref(df), names(df), "label"))
3×2 DataFrame
 Row │ Yearly sales in USD  Fiscal year (ending on June 30)
─────┼──────────────────────────────────────────────────────
   1 │                 1.2                             2001
   2 │                 3.4                             2002
   3 │                 2.1                             2003

Note that we have substituted column names (which might be not clear for end
users) with descriptive labels.

The metadata propagation rules work nice and will help you documenting your
data frame assuming that you follow the first principle: do not use a single
column name to store different information.

Every transformation should be a function with a name

The principle to define named functions for transformations, combined with
the principle of single name for single meaning, is a core of
operation specification syntax in DataFrames.jl.

Consider the following example code:

julia> usd2¢(x) = 100x
usd2¢ (generic function with 1 method)

julia> transform(df, :sales => usd2¢)
3×3 DataFrame
 Row │ sales    year   sales_usd2¢
     │ Float64  Int64  Float64
─────┼─────────────────────────────
   1 │     1.2   2001        120.0
   2 │     3.4   2002        340.0
   3 │     2.1   2003        210.0

Note that the auto-generated column name sales_usd2¢ gives us information
about a source column and transformation function applied to it. Of course you
could use a custom output name. However, using automatically generated column
name, assuming that you pass a names function as a transformation, has a big
benefit that it helps in automatic visual documentation of the transformation
applied to data (and looking up the code of the transformation function used).
This is a big feature if you work with hundreds or thousands of columns.

As with the previous principle this works if you follow a simple rule –
define transformations you want to apply as functions with a name.

Conclusions

In this post I have highlighted two principles that are useful when implementing
complex data transformation jobs:

  • if you use some column name it should have a single meaning;
  • every transformation should be a function with a name.

DataFrames.jl was designed to work nicely if you want to follow these rules, as
I have tried to explain in this post. However, nothing is carved in stone in
DataFrames.jl. You can write your code whatever way you want and nothing is
forced on the user. This is especially important in quick-and-dirty one time
interactive sessions, where users want flexibility.

If you would like to learn more about the design of operation specification
syntax I recommend you start with JuliaCon 2022 tutorial.

Similarly, we tried to design handling of metadata in a way that is easy to
understand and convenient to use. You can find more details about metadata
support in DataFrames.jl in Metadata section of the DataFrames.jl
documentation. I also plan to write a separate post in which I will give a more
in-depth example how metadata can be used in DataFrames.jl. Also there is a plan
to release TableMetadataTools.jl package, which will add even more
convenience to most common operations.