Author Archives: Blog by Bogumił Kamiński

Solving Sicherman dice puzzle using DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/06/14/Sicherman-dice.html

Sicherman dice puzzle

In many games players roll two normal dice to get a result that is then used
to decide the course of play.

By a normal die we understand a 6-sided die with faces numbered from 1 to 6.

Now the puzzle is to check if there exist other pairs of 6-sided dice
with faces numbered with positive integers that have the same probability
distribution for the sum as normal dice.

A standard approach to answering this question is to use generating functions
to show that it is actually possible, see e.g. the Wikipedia article about
Sicherman dice.
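
For reference, the generating-function argument can be checked numerically. The
sketch below is not the approach taken in this post: it represents each die as
a vector of polynomial coefficients (entry k counts the faces showing k) and
multiplies two such polynomials by convolution. The helper names diepoly and
polymul are illustrative, and the second pair of dice is the known Sicherman
solution, which the enumeration later in the post also arrives at.

```julia
# Coefficient vector of the generating polynomial of a die:
# entry k is the number of faces showing k.
function diepoly(faces)
    p = zeros(Int, maximum(faces))
    for f in faces
        p[f] += 1
    end
    return p
end

# Product of two polynomials without a constant term; entry s of the result
# is the coefficient of x^s, i.e. the number of ways to roll the sum s.
function polymul(p, q)
    r = zeros(Int, length(p) + length(q))
    for (i, a) in enumerate(p), (j, b) in enumerate(q)
        r[i + j] += a * b
    end
    return r
end

normal = polymul(diepoly(1:6), diepoly(1:6))
sicherman = polymul(diepoly((1, 2, 2, 3, 3, 4)), diepoly((1, 3, 4, 5, 6, 8)))
normal == sicherman  # both pairs give the same number of ways for every sum
```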

However, in this post we will want to use DataFrames.jl to enumerate the
possible solutions and find the feasible ones. The exercise mainly showcases
the new API for the filter function.

The code was tested under Julia 1.4.2 and DataFrames.jl 0.21, so please make
sure you have the proper versions of both (the examples should work under any
post-1.0 release of Julia, but require DataFrames.jl to be at least 0.21).

Getting the reference probability distribution

First let us get the distribution of outcomes on a pair of normal dice.
We start by defining the getdist function that takes the numbers on the
faces of the dice and returns the distribution of their sum:

function getdist(d1, d2)
    min1, max1 = extrema(d1)
    min2, max2 = extrema(d2)
    @assert min1 > 0 && min2 > 0
    d = zeros(Rational{Int}, max1 + max2)
    for p1 in d1, p2 in d2
        s = p1 + p2
        d[s] += 1
    end
    d .//= length(d1) * length(d2)
    return d
end

In the function we assume that sides of both dice are numbered with positive
integers. As we are working with a finite probability space all probabilities
are rationals, so we store them as Rational{Int} type to avoid rounding. We
could have used Float64 (or even just Int without doing normalization), but
as we will soon learn our code is still fast enough, so no such shortcut is
required.
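
As an illustration of the Int alternative mentioned above, here is a minimal
sketch of a variant that keeps raw integer counts instead of normalized
rationals. The name getcounts is hypothetical and is not used elsewhere in
this post; for pairs of dice with the same number of sides, equal count
vectors mean equal distributions.

```julia
# Hypothetical variant of getdist: number of ways to roll each sum,
# without dividing by the number of outcomes.
function getcounts(d1, d2)
    @assert minimum(d1) > 0 && minimum(d2) > 0
    c = zeros(Int, maximum(d1) + maximum(d2))
    for p1 in d1, p2 in d2
        c[p1 + p2] += 1
    end
    return c
end

counts = getcounts(1:6, 1:6)
# Dividing by the number of equally likely outcomes recovers exact probabilities.
probs = counts .// sum(counts)
```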

To test the code let us get a distribution for two normal dice and store it in
the NORMAL_DIST constant (we will use it later):

julia> const NORMAL_DIST = getdist(1:6, 1:6)
12-element Array{Rational{Int64},1}:
 0//1
 1//36
 1//18
 1//12
 1//9
 5//36
 1//6
 5//36
 1//9
 1//12
 1//18
 1//36

julia> sum(NORMAL_DIST)
1//1

In the second line we have checked that we actually have a probability
distribution, as all its entries add up to 1.

Generating all possible dice

In the next step we generate all possible 6-sided dice that could possibly be
used to generate the NORMAL_DIST distribution. In order to use some new features
of DataFrames.jl let us make one observation. Note that the sum equal to 2
is obtained with 1/36 probability. This means that each die must have 1 on
exactly one of its faces.

Now what is the maximal possible value on the face? We see that the sum of two
maximal values is 12 and is obtained with 1/36 probability. This means that
the maximal value must be unique on both dice. But this implies that it must be
at most 8:

  • if it were 11, then the other die would have to have only 1 on all sides
    which is not possible;
  • if it were 10, then the other die would have to have one 1 and five 2s
    on its sides, which again is not allowed;
  • if it were 9, then the other die would have to be (1,2,2,2,2,3), but this
    would mean that the probability of rolling 11 would be at least 1/9 and it
    must be equal to 1/18.
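
The last case can be checked by direct counting. In this sketch the first die
is a hypothetical one whose largest face is 9 (the remaining faces are
arbitrary placeholders; a face equal to 8 would only add more ways to roll 11),
and the second die is the (1,2,2,2,2,3) forced by the reasoning above.

```julia
d1 = (1, 2, 3, 4, 5, 9)   # placeholder faces; only the unique 9 matters here
d2 = (1, 2, 2, 2, 2, 3)   # forced: one 1, one 3, and four 2s

# Number of ordered face pairs that sum to 11 (here: 9 paired with each 2).
ways11 = count(p1 + p2 == 11 for p1 in d1, p2 in d2)
prob11 = ways11 // 36     # exceeds the required 1//18
```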

So let us create a data frame, call it df1, with six columns, where each
column represents a single side of the die, and rows represent possible numbers
on its sides:

julia> using DataFrames

julia> df1 = DataFrame(Iterators.product((2:8 for i in 1:5)...));

julia> insertcols!(df1, 1, "0" => 1);

julia> show(df1, eltypes=false)
16807×6 DataFrame
│ Row   │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │
├───────┼───┼───┼───┼───┼───┼───┤
│ 1     │ 1 │ 2 │ 2 │ 2 │ 2 │ 2 │
│ 2     │ 1 │ 3 │ 2 │ 2 │ 2 │ 2 │
│ 3     │ 1 │ 4 │ 2 │ 2 │ 2 │ 2 │
│ 4     │ 1 │ 5 │ 2 │ 2 │ 2 │ 2 │
│ 5     │ 1 │ 6 │ 2 │ 2 │ 2 │ 2 │
│ 6     │ 1 │ 7 │ 2 │ 2 │ 2 │ 2 │
│ 7     │ 1 │ 8 │ 2 │ 2 │ 2 │ 2 │
│ 8     │ 1 │ 2 │ 3 │ 2 │ 2 │ 2 │
⋮
│ 16799 │ 1 │ 7 │ 7 │ 8 │ 8 │ 8 │
│ 16800 │ 1 │ 8 │ 7 │ 8 │ 8 │ 8 │
│ 16801 │ 1 │ 2 │ 8 │ 8 │ 8 │ 8 │
│ 16802 │ 1 │ 3 │ 8 │ 8 │ 8 │ 8 │
│ 16803 │ 1 │ 4 │ 8 │ 8 │ 8 │ 8 │
│ 16804 │ 1 │ 5 │ 8 │ 8 │ 8 │ 8 │
│ 16805 │ 1 │ 6 │ 8 │ 8 │ 8 │ 8 │
│ 16806 │ 1 │ 7 │ 8 │ 8 │ 8 │ 8 │
│ 16807 │ 1 │ 8 │ 8 │ 8 │ 8 │ 8 │

After loading the DataFrames.jl package, we first create a df1 data frame
with sides not equal to 1 (remember that there is exactly one such side).
For this we use the Iterators.product function, which gives us an iterator
that is a product of passed iterators. As the side with value 1 on it is
excluded we used five such iterators. Also note that by default DataFrame
constructor treats the values produced by the iterator as rows of the produced
data frame.
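
A tiny Base-only example (not from the original code) shows what
Iterators.product yields and in which order: the first iterator changes
fastest, which is why the column named 1 in df1 cycles through 2:8 before the
column named 2 changes.

```julia
# Every combination of the two ranges, as tuples; the first range varies fastest.
combos = vec(collect(Iterators.product(1:2, 1:3)))

# Five copies of 2:8, as used to build df1, give 7^5 combinations.
n = length(Iterators.product((2:8 for i in 1:5)...))
```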

Next we use insertcols! function to insert a column containing only 1 in the
first position and name it "0" (the reason will be soon seen – the
DataFrame constructor by default has named the other columns as "1", "2",
etc.). Note that the column_name => value syntax of insertcols! performs
automatic broadcasting of single values if needed (similar behavior is
implemented in DataFrame constructor, select, transform, and combine).

Note that we could have just written:

julia> DataFrame(Iterators.product(1, (2:8 for i in 1:5)...))
16807×6 DataFrame
│ Row   │ 1     │ 2     │ 3     │ 4     │ 5     │ 6     │
│       │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1     │ 1     │ 2     │ 2     │ 2     │ 2     │ 2     │
│ 2     │ 1     │ 3     │ 2     │ 2     │ 2     │ 2     │
│ 3     │ 1     │ 4     │ 2     │ 2     │ 2     │ 2     │
│ 4     │ 1     │ 5     │ 2     │ 2     │ 2     │ 2     │
│ 5     │ 1     │ 6     │ 2     │ 2     │ 2     │ 2     │
│ 6     │ 1     │ 7     │ 2     │ 2     │ 2     │ 2     │
│ 7     │ 1     │ 8     │ 2     │ 2     │ 2     │ 2     │
│ 8     │ 1     │ 2     │ 3     │ 2     │ 2     │ 2     │
⋮
│ 16799 │ 1     │ 7     │ 7     │ 8     │ 8     │ 8     │
│ 16800 │ 1     │ 8     │ 7     │ 8     │ 8     │ 8     │
│ 16801 │ 1     │ 2     │ 8     │ 8     │ 8     │ 8     │
│ 16802 │ 1     │ 3     │ 8     │ 8     │ 8     │ 8     │
│ 16803 │ 1     │ 4     │ 8     │ 8     │ 8     │ 8     │
│ 16804 │ 1     │ 5     │ 8     │ 8     │ 8     │ 8     │
│ 16805 │ 1     │ 6     │ 8     │ 8     │ 8     │ 8     │
│ 16806 │ 1     │ 7     │ 8     │ 8     │ 8     │ 8     │
│ 16807 │ 1     │ 8     │ 8     │ 8     │ 8     │ 8     │

to get a similar result (but with different column names), but I wanted to show
the use of the insertcols! function.

Finally we show our data frame, but pass eltypes=false keyword argument to
avoid printing the column type information, as we do not need it.

We immediately notice that some rows in our df1 data frame are duplicates. For
example row 2 and row 8 represent the same die (permutation of numbers on sides
does not affect the distribution of outcomes). We get rid of the duplicates
in-place by requiring that the numbers on sides are sorted in the filter!
function:

julia> filter!(AsTable(:) => issorted, df1)
462×6 DataFrame
│ Row │ 0     │ 1     │ 2     │ 3     │ 4     │ 5     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 2     │ 2     │ 2     │ 2     │
│ 2   │ 1     │ 2     │ 2     │ 2     │ 2     │ 3     │
│ 3   │ 1     │ 2     │ 2     │ 2     │ 3     │ 3     │
│ 4   │ 1     │ 2     │ 2     │ 3     │ 3     │ 3     │
│ 5   │ 1     │ 2     │ 3     │ 3     │ 3     │ 3     │
│ 6   │ 1     │ 3     │ 3     │ 3     │ 3     │ 3     │
│ 7   │ 1     │ 2     │ 2     │ 2     │ 2     │ 4     │
│ 8   │ 1     │ 2     │ 2     │ 2     │ 3     │ 4     │
⋮
│ 454 │ 1     │ 6     │ 7     │ 8     │ 8     │ 8     │
│ 455 │ 1     │ 7     │ 7     │ 8     │ 8     │ 8     │
│ 456 │ 1     │ 2     │ 8     │ 8     │ 8     │ 8     │
│ 457 │ 1     │ 3     │ 8     │ 8     │ 8     │ 8     │
│ 458 │ 1     │ 4     │ 8     │ 8     │ 8     │ 8     │
│ 459 │ 1     │ 5     │ 8     │ 8     │ 8     │ 8     │
│ 460 │ 1     │ 6     │ 8     │ 8     │ 8     │ 8     │
│ 461 │ 1     │ 7     │ 8     │ 8     │ 8     │ 8     │
│ 462 │ 1     │ 8     │ 8     │ 8     │ 8     │ 8     │

Notice that we have significantly reduced the number of possibilities this way
– from 16807 to 462.
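
The number 462 can be cross-checked without DataFrames.jl. A sorted choice of
five faces from 2:8 is a multiset of size 5 drawn from 7 values, and there are
binomial(7 + 5 - 1, 5) of those. The brute-force count below is only a sanity
check and not part of the workflow of this post.

```julia
# Brute force: keep only the non-decreasing 5-tuples with entries in 2:8.
n_sorted = count(issorted, Iterators.product((2:8 for i in 1:5)...))

# Closed form: multisets of size 5 from 7 distinct values.
n_formula = binomial(7 + 5 - 1, 5)
```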

In the filter! call note that we used the AsTable(:) => issorted predicate
specifier. It means that each row of a DataFrame is converted to a
NamedTuple before being passed to the issorted function.
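
To see what the predicate receives, here is a Base-only sketch with
hand-written NamedTuples standing in for rows (the field names are arbitrary
illustrations). issorted iterates over the values of a NamedTuple, so it
checks whether the faces are in non-decreasing order.

```julia
# Stand-ins for rows of df1 after conversion to NamedTuples.
row_sorted = (a=1, b=2, c=2, d=3, e=3, f=4)
row_unsorted = (a=1, b=3, c=2, d=2, e=2, f=2)

keep_first = issorted(row_sorted)      # true: this row would be kept
keep_second = issorted(row_unsorted)   # false: this row would be dropped
```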

Going from one die to two dice

In df1 we have all possible configurations of one die. Now let us generate
all possibilities for two dice:

julia> df2 = crossjoin(df1, df1, makeunique=true);

julia> rename!(df2, [Symbol(die, side) for die in ["l", "r"] for side in 1:6]);

julia> show(df2, eltypes=false)
213444×12 DataFrame
│ Row    │ l1 │ l2 │ l3 │ l4 │ l5 │ l6 │ r1 │ r2 │ r3 │ r4 │ r5 │ r6 │
├────────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 1      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │
│ 2      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 2  │ 2  │ 2  │ 3  │
│ 3      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 2  │ 2  │ 3  │ 3  │
│ 4      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 2  │ 3  │ 3  │ 3  │
│ 5      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 3  │ 3  │ 3  │ 3  │
│ 6      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 3  │ 3  │ 3  │ 3  │ 3  │
│ 7      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 2  │ 2  │ 2  │ 4  │
│ 8      │ 1  │ 2  │ 2  │ 2  │ 2  │ 2  │ 1  │ 2  │ 2  │ 2  │ 3  │ 4  │
⋮
│ 213436 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 6  │ 7  │ 8  │ 8  │ 8  │
│ 213437 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 7  │ 7  │ 8  │ 8  │ 8  │
│ 213438 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 2  │ 8  │ 8  │ 8  │ 8  │
│ 213439 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 3  │ 8  │ 8  │ 8  │ 8  │
│ 213440 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 4  │ 8  │ 8  │ 8  │ 8  │
│ 213441 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 5  │ 8  │ 8  │ 8  │ 8  │
│ 213442 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 6  │ 8  │ 8  │ 8  │ 8  │
│ 213443 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 7  │ 8  │ 8  │ 8  │ 8  │
│ 213444 │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │ 1  │ 8  │ 8  │ 8  │ 8  │ 8  │

Using crossjoin we generate all possible combinations of both dice. We use
makeunique=true as we pass df1 as the left and right data frame in cross
join. Therefore we next rename! the data frame that we got to properly name
its columns so that they clearly show if we are considering left or right data
frame and which side of it (note that now we number sides from 1 to 6).
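
The comprehension that builds the new column names can be inspected on its
own. This snippet only evaluates the expression already used in the rename!
call above.

```julia
# The outer loop (die) comes first, the inner loop (side) varies fastest.
names12 = [Symbol(die, side) for die in ["l", "r"] for side in 1:6]

# The number of rows of the cross join is the square of the rows of df1.
nrows = 462^2
```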

Final step – finding Sicherman dice

So now we want to find if our df2 data frame contains any pair of dice that
produces the same probability distribution as NORMAL_DIST we have computed
above. For this we define a helper function:

function test_dice(x...)
    d1 = ntuple(i -> x[i], 6)
    d2 = ntuple(i -> x[i + 6], 6)
    return d1 <= d2 && getdist(d1, d2) == NORMAL_DIST
end

The function assumes it is passed twelve positional arguments (soon these will
be values stored in a single row of our data frame). It constructs d1 and d2
tuples from them, to represent the dice. First we check if d1 <= d2 to avoid
permuted duplicates in the results, and if this test passes we check if the
probability distribution produced by our dice is the same as NORMAL_DIST.
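
Tuples are compared lexicographically in Julia, which is what makes d1 <= d2
a convenient way to keep only one ordering of each pair. A small illustration,
with values chosen only for demonstration:

```julia
a = (1, 2, 2, 3, 3, 4)
b = (1, 3, 4, 5, 6, 8)

ab = a <= b   # true: first entries tie, then 2 < 3 decides
ba = b <= a   # false: the swapped pair is rejected
aa = a <= a   # true: a pair of identical dice is kept
```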

Let us run the filter function to find the solution of the Sicherman dice
puzzle:

julia> show(filter(All() => test_dice, df2), eltypes=false)
2×12 DataFrame
│ Row │ l1 │ l2 │ l3 │ l4 │ l5 │ l6 │ r1 │ r2 │ r3 │ r4 │ r5 │ r6 │
├─────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 1   │ 1  │ 2  │ 2  │ 3  │ 3  │ 4  │ 1  │ 3  │ 4  │ 5  │ 6  │ 8  │
│ 2   │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │

Indeed we get that there exists only one pair of dice different from two normal
dice that meets our requirements, namely (1,2,2,3,3,4) and (1,3,4,5,6,8).
A surprising finding indeed!

Note that this time in filter we have used the All() => test_dice predicate.
It means that, for each row of a data frame, test_dice is passed all its
columns as positional arguments.
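
The effect is similar to splatting a row into the function call. In the
Base-only sketch below a plain tuple plays the role of a row, and
first_six_sum is a made-up function with the same x... signature as test_dice.

```julia
# A varargs function receives its positional arguments as a tuple.
first_six_sum(x...) = sum(ntuple(i -> x[i], 6))

row = (1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)  # stand-in for one row of df2
total = first_six_sum(row...)   # each value becomes a separate positional argument
```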

Concluding remarks

I hope you found the examples interesting and that they gave you some insight
into how filtering in DataFrames.jl can be used efficiently.

Note that the proposed code is not only relatively terse but also quite fast:

julia> @time begin
           df1 = DataFrame(Iterators.product((2:8 for i in 1:5)...))
           insertcols!(df1, 1, "0" => 1)
           filter!(AsTable(:) => issorted, df1)
           df2 = crossjoin(df1, df1, makeunique=true)
           rename!(df2, [Symbol(die, side) for die in ["l", "r"] for side in 1:6])
           filter(All() => test_dice, df2)
       end
  0.306816 seconds (307.21 k allocations: 62.547 MiB, 2.62% gc time)
2×12 DataFrame
│ Row │ l1    │ l2    │ l3    │ l4    │ l5    │ l6    │ r1    │ r2    │ r3    │ r4    │ r5    │ r6    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 2     │ 3     │ 3     │ 4     │ 1     │ 3     │ 4     │ 5     │ 6     │ 8     │
│ 2   │ 1     │ 2     │ 3     │ 4     │ 5     │ 6     │ 1     │ 2     │ 3     │ 4     │ 5     │ 6     │

As you can see we are able to solve the puzzle in sub-second time.

I have not attempted to reproduce these examples using data frames in R or
Python, as I do not feel competent enough to write exemplary code for these
environments. However, I would be interested to see how to reproduce the steps
I have shown here and how fast they would run.

If you are interested in making such an implementation and benchmarking it,
please contact me with your proposal and I will update this post below,
giving the solution and credit to the submitter.

Learning who is the author of the current state of the Julia language

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/06/07/analyze-julia-git.html

Development of the Julia language

During 20 years of my work as a researcher I have used numerous programming
languages to do scientific computing, chiefly R, Python, and Java.
However, when I learned Julia I immediately felt this is a go-to solution,
although I started using it when version 0.3 was released and the language
and its ecosystem were still immature.

Currently Julia has reached version 1.4.2 and in many fields its package
ecosystem provides best-in-class functionality.

A natural question to ask is who has made this happen. It is easy enough to find
out on the GitHub page of the Julia project here.
However, the default GitHub interface allows you to only see contributions
by number of commits, additions or deletions. We can learn from this that
Jeff Bezanson is a leader by far in all these categories.

However, the statistics show you the whole history of the git repository.
I was always curious who is the author of the current state of the code.
Essentially, what I wanted to do is blame the whole repository and count
the distribution of the number of lines committed by the authors.

The problem is that by default git does not give you such an option.
There are ways to achieve this, which I discuss below. The project was
interesting for me, because I think it nicely shows what Julia offers you
when you have a scripting task at hand.

Before we start

In order to follow the examples below you need to have git installed.
Also you should have git-extras installed. If you are on Ubuntu just write
sudo apt install git-extras and it should be added.

In order to analyze the repository we need to download it to our local machine
e.g. to julia_src folder.
This can be done using the following command (warning! it takes some time):

~$ git clone https://github.com/JuliaLang/julia.git julia_src
Cloning into 'julia_src'...
remote: Enumerating objects: 83, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (78/78), done.
remote: Total 325678 (delta 31), reused 16 (delta 5), pack-reused 325595
Receiving objects: 100% (325678/325678), 181.28 MiB | 1.20 MiB/s, done.
Resolving deltas: 100% (244259/244259), done.

Now switch our working directory to the newly downloaded repository:

~$ cd julia_src
~/julia_src (master)$

Using git

You can get the information we want using the summary command provided by
git-extras. Here is how you can do it:

~/julia_src (master)$ time git summary --line

 project  : julia_src
 lines    : 469506
 authors  :
  70542 Jeff Bezanson                     15.0%
  65223 Jameson Nash                      13.9%
  31294 Keno Fischer                      6.7%
  28562 Katharine Hyatt                   6.1%
  24794 Yichao Yu                         5.3%
  15353 Michael Hatherly                  3.3%
  11146 Stefan Karpinski                  2.4%
  10955 Kristoffer Carlsson               2.3%
  10216 Steven G. Johnson                 2.2%
   8976 Tim Holy                          1.9%
   8763 Rafael Fourquet                   1.9%
   8587 Andreas Noack Jensen              1.8%
   7992 Sacha Verweij                     1.7%
   7934 Fredrik Ekre                      1.7%
   6750 Matt Bauman                       1.4%
   6277 Simon Byrne                       1.3%
   5943 Amit Murthy                       1.3%
   5895 Jacob Quinn                       1.3%
   5392 Milan Bouchet-Valat               1.1%
   5373 Alex Arslan                       1.1%
   4792 Tony Kelman                       1.0%
   4712 Curtis Vogt                       1.0%
   4637 Viral B. Shah                     1.0%
[...]

real    8m31.993s
user    8m1.853s
sys 0m39.650s

The whole list is quite long so I have cut it down to show only people with at
least 1.0% contribution. As you can see from the distribution, Jameson Nash is
really close to Jeff Bezanson in the ranking.

As you can see I have additionally added time in front of the command to see
how long the operation took. For such a large repository as this one
(note that it has almost 500,000 lines of code) it is quite time consuming.

The first thing I did was search over the Internet and I have found the
following proposal here:

git ls-files | while read f; do git blame --line-porcelain $f \
| grep '^author '; done | sort -f | uniq -ic | sort -n

The solution finished in 4 minutes and 14 seconds, so it was two times faster
(the downside is that it does not report percentages).

In general it led me to think about writing a Julia script that would do the
job and checking its speed. In the next section you can find my take on it.

Using Julia

In the solution I use FreqTables.jl, ProgressMeter.jl, and Pipe.jl in the
following versions:

(@v1.4) pkg> status FreqTables ProgressMeter Pipe
Status `~/.julia/environments/v1.4/Project.toml`
  [da1fdf0e] FreqTables v0.4.0
  [b98c9c47] Pipe v1.2.0
  [92933f4c] ProgressMeter v1.3.0

Here is the code that does the job of listing authors of all lines in the git
repository:

using FreqTables, ProgressMeter, Random

function get_git_data()
    println("Using ", Threads.nthreads(), " threads")
    files = readlines(`git ls-files`)
    shuffle!(files)
    auths = String[]
    l = Threads.SpinLock()
    p = Progress(length(files))
    Threads.@threads for f in files
        isempty(f) && continue
        lines = readlines(`git blame --line-porcelain $f`)
        filter!(k -> startswith(k, "author "), lines)
        Threads.lock(l)
        append!(auths, chop.(lines, head=7, tail=0))
        Threads.unlock(l)
        next!(p)
    end
    println()
    return auths
end

As you can see I am using Threads.@threads to use multiple threads for
my computations. In variable p I keep a progress meter that helps me
to visually track how the computations go.
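
The filtering and trimming step inside the loop can be tried on a few
hand-written lines. The sample below only imitates the shape of
git blame --line-porcelain output (the values are made up); the point is that
the prefix "author " is 7 characters long and that related headers such as
author-mail do not match it.

```julia
sample = [
    "author Jane Doe",
    "author-mail <jane@example.com>",
    "author-time 1591500000",
    "summary Fix a typo",
]

# Keep only lines starting with "author " (note the trailing space).
kept = filter(k -> startswith(k, "author "), sample)

# Drop the first 7 characters ("author ") and nothing from the end.
authors = chop.(kept, head=7, tail=0)
```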

In the code a line that looks innocent but is actually quite relevant is
shuffle!(files). You might wonder why I randomly reorder files for
processing. The reason is that the files most probably (and in fact also
actually) do not have the same cost of processing using git blame. Therefore
I do not want to have expensive files clumped together. This has two benefits:

  • ProgressMeter.jl is able to quickly give me a good estimate of ETA (e.g. if
    cheap files were clumped together at the beginning of processing the
    estimate would be overly optimistic);
  • Threads.@threads does static allocation of jobs to threads; this again
    means that we do prefer to shuffle jobs in order to reduce the risk that
    all expensive jobs go to a single thread, which would negatively affect the
    overall processing time.
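
The second point can be illustrated with a toy model. The sketch assumes that
static scheduling cuts the list of jobs into equally sized contiguous chunks,
one per thread (an assumption made for this illustration), and the cost
numbers are invented. The function name chunkloads is hypothetical.

```julia
using Random

# Invented per-file costs: 25 expensive files followed by 75 cheap ones.
costs = [fill(10, 25); fill(1, 75)]

# Total cost of each of n contiguous, equally sized chunks.
chunkloads(c, n) = [sum(chunk) for chunk in Iterators.partition(c, length(c) ÷ n)]

clumped = chunkloads(costs, 4)             # all expensive files land in chunk 1
shuffled = chunkloads(shuffle(costs), 4)   # random, typically far more even

# The slowest chunk determines the overall running time.
maximum(clumped), maximum(shuffled)
```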

Finally note that I wrap append! to auths vector in a lock to avoid
race condition (different threads potentially might try to update auths at
the same time). This is not needed for next!(p) operation as ProgressMeter.jl
is thread-safe.
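
For comparison, the same locking pattern can be written with the do-block form
of lock, which releases the lock even if the protected code throws. This is a
minimal sketch with a ReentrantLock and dummy work in place of git blame; it is
not the code used for the timings in this post, and collect_squares is a
made-up name.

```julia
function collect_squares(n)
    results = Int[]
    l = ReentrantLock()
    Threads.@threads for i in 1:n
        v = i^2                 # stands in for the expensive per-file work
        lock(l) do
            push!(results, v)   # only one thread at a time mutates results
        end
    end
    return results
end

res = collect_squares(100)
# The order of res depends on thread scheduling, so compare after sorting.
sort(res) == [i^2 for i in 1:100]
```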

Now let us test the above code. First start Julia using four threads
(you can of course change it to another number of threads) using the command:

~/julia_src (master)$ JULIA_NUM_THREADS=4 julia

(on Windows do set JULIA_NUM_THREADS=4 before running Julia)

Next load the script I have given above. You are now ready for the test. Here
is the code I have run on my machine:

julia> using Pipe

julia> @time @pipe get_git_data() |>
                   freqtable |>
                   prop |>
                   sort!(_, rev=true) |>
                   filter(>=(0.01), _)
Using 4 threads
Progress: 100%|██████████████████████████████████████████████████|  ETA: 0:00:00
 97.533423 seconds (8.14 M allocations: 1.353 GiB, 0.11% gc time)
22-element Named Array{Float64,1}
Dim1                 
─────────────────────┼──────────
Jeff Bezanson          0.149081
Jameson Nash           0.141561
Keno Fischer          0.0661324
Katharine Hyatt       0.0602627
Yichao Yu             0.0523127
Michael Hatherly      0.0323932
Stefan Karpinski      0.0235169
Kristoffer Carlsson   0.0231223
Steven G. Johnson     0.0215547
Tim Holy              0.0189384
Rafael Fourquet        0.018489
Andreas Noack Jensen  0.0181176
Fredrik Ekre          0.0169466
Sacha Verweij         0.0168623
Matt Bauman           0.0142418
Simon Byrne           0.0132501
Amit Murthy           0.0125391
Jacob Quinn           0.0124378
Milan Bouchet-Valat   0.0113765
Alex Arslan           0.0113364
Curtis Vogt           0.0101254
Tony Kelman           0.0101148

As you can see I am well under 2 minutes now.
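
If you prefer not to depend on FreqTables.jl, the counting and normalization
can be sketched in Base Julia. The function name author_shares and the sample
input are made up for illustration; in practice the input would be the vector
returned by get_git_data().

```julia
# Share of lines per author, sorted from largest to smallest.
function author_shares(auths)
    counts = Dict{String,Int}()
    for a in auths
        counts[a] = get(counts, a, 0) + 1
    end
    shares = [a => c / length(auths) for (a, c) in counts]
    return sort!(shares, by=last, rev=true)
end

sample = ["A", "B", "A", "C", "A", "B"]      # made-up author names
shares = author_shares(sample)
top = filter(p -> last(p) >= 0.3, shares)    # keep authors with at least 30%
```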

In the last part of the code I have used Pipe.jl, which greatly facilitates
using pipes in Julia (there is also a very nice package, Underscores.jl, which
I recommend you investigate; it has more functionality but this comes at the
cost of being a bit more complex to master).

What Pipe.jl does is best described by a section of its manual, so I just reuse
it here:

if after @pipe you place a underscore in the right hand of |>,
it will be replaced with the left hand side. So:

@pipe a |> b(x, _) # == b(x, a)
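
For comparison, plain Julia can express the same thing with an anonymous
function on the right-hand side of |>. The function b below is a made-up
two-argument function used only to show the equivalence.

```julia
b(x, y) = x - y   # made-up function; argument order matters
a = 10
x = 3

piped = a |> y -> b(x, y)   # what `@pipe a |> b(x, _)` stands for
direct = b(x, a)
piped == direct
```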

I hope you enjoyed this little exercise (and now we know exactly whose code we
run when using Julia).

P.S. Setting up your environment

As you probably know I am obsessed with proper environment setup. In an earlier
post I discussed that you should always make sure you run the proper versions
of the packages. What is a quick way to set up the environment for the project
described in this post?

When you are in Julia REPL (e.g. started as instructed above in the julia_src
directory) switch to the package manager mode by pressing ] and execute the
following commands (I am showing the whole output which is a bit long but allows
you to check which packages got recursively added to Manifest.toml):

(@v1.4) pkg> activate .
 Activating new environment at `~/julia_src/Project.toml`

(julia_src) pkg> add FreqTables@0.4.0 Pipe@1.2.0 ProgressMeter@1.3.0
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
  Installed Parsers ─ v1.0.5
   Updating `~/julia_src/Project.toml`
  [da1fdf0e] + FreqTables v0.4.0
  [b98c9c47] + Pipe v1.2.0
  [92933f4c] + ProgressMeter v1.3.0
   Updating `~/julia_src/Manifest.toml`
  [324d7699] + CategoricalArrays v0.8.1
  [861a8166] + Combinatorics v1.0.2
  [9a962f9c] + DataAPI v1.3.0
  [864edb3b] + DataStructures v0.17.17
  [e2d170a0] + DataValueInterfaces v1.0.0
  [da1fdf0e] + FreqTables v0.4.0
  [41ab1584] + InvertedIndices v1.0.0
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [682c06a0] + JSON v0.21.0
  [e1d29d7a] + Missings v0.4.3
  [86f7a689] + NamedArrays v0.9.4
  [bac558e1] + OrderedCollections v1.2.0
  [69de0a69] + Parsers v1.0.5
  [b98c9c47] + Pipe v1.2.0
  [92933f4c] + ProgressMeter v1.3.0
  [ae029012] + Requires v1.0.1
  [3783bdb8] + TableTraits v1.0.0
  [bd369af6] + Tables v1.0.4
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [9fa8497b] + Future
  [b77e0a4c] + InteractiveUtils
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [de0858da] + Printf
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode

(julia_src) pkg> status
Status `~/julia_src/Project.toml`
  [da1fdf0e] FreqTables v0.4.0
  [b98c9c47] Pipe v1.2.0
  [92933f4c] ProgressMeter v1.3.0

Now you are sure all will work as expected. Just press backspace to leave the
package manager mode and you are ready to run the examples.

Tutorials for DataFrames.jl release 0.21. Part II

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/05/30/data-frames-part2.html

This is a follow-up to Part I, which covered new functionalities in
DataFrames.jl release 0.21 and which you can find here.

New material

I have created two additional notebooks that can be downloaded from
this GitHub Gist.

The content is shared as two notebooks:

  1. airports.ipynb that mostly focuses on transformation operations and
    the syntax source => function => sink.
  2. bison.ipynb that discusses new functionalities related to population of
    data frames with heterogeneous data that were added to push! and append!.

I hope you will find the new examples useful.

Environment setup

The code was tested under Julia 1.4.1.

Please make sure that you download all materials from the Gist. In particular,
having the proper Project.toml and Manifest.toml files will ensure that you
have the right versions of the packages that were used.

Also note that for the first part of the tutorial (airports.ipynb) you need to
download the file I used from Kaggle (it is quite large). For the second
notebook bison.ipynb the JSON file bison.json is bundled in the gist.

If you have any questions regarding the materials please do not hesitate
to contact me.