A classical sorting exercise

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/11/14/sorting.html

Introduction

A classical problem in computing is finding what is the number comparisons that
a comparison sort algorithm minimally requires in worst-case given an
\(n\)-element list. Let us try to tackle this problem, as usual using
DataFrames.jl to support the presentation of the results.

In what follows I use Julia 1.5.3, DataFrames.jl 0.21.8 and Pipe.jl 1.3.0.

Theoretical lower bound

As there are \(n!\) permutations of an \(n\)-element set and doing \(k\)
comparisons allow us to distinguish at most \(2^k\) states so we get that
\(k\geq\lceil\log_2(n!)\rceil\).

Let us use DataFrames.jl and to quickly calculate this bound on the number of
queries for \(n\in[7]\):

julia> using DataFrames

julia> using Pipe

julia> @pipe DataFrame(n=1:7) |>
             transform(_, :n => ByRow(factorial) => :permutations) |>
             transform(_, :permutations =>
                          ByRow(x -> ceil(Int, log2(x))) =>
                          :queries)
7×3 DataFrame
│ Row │ n     │ permutations │ queries │
│     │ Int64 │ Int64        │ Int64   │
├─────┼───────┼──────────────┼─────────┤
│ 1   │ 1     │ 1            │ 0       │
│ 2   │ 2     │ 2            │ 1       │
│ 3   │ 3     │ 6            │ 3       │
│ 4   │ 4     │ 24           │ 5       │
│ 5   │ 5     │ 120          │ 7       │
│ 6   │ 6     │ 720          │ 10      │
│ 7   │ 7     │ 5040         │ 13      │

Now, we switch to a core of the exercise — let us write a program that sorts
up to seven elements in no more than the above calculated number of queries.

The judge and the player

Assume we want to have a Judge that knows the order of objects and can be
queried to compare the position in the order of the queried objects:

julia> mutable struct Judge
           state::Vector{Int}
           queries::Int

           function Judge(state)
               @assert isperm(state)
               new(state, 0)
           end
       end

julia> function compare(judge::Judge, i::Int, j::Int)
           judge.queries += 1
           judge.state[i] < judge.state[j]
       end
compare (generic function with 1 method)

As you can see the Judge will count the number of queries that it received.
Also we restrict the Judge to accept only permutations for simplicity
(in general it could accept any comparable objects though).

Now we create a player function that will be allowed to query the Judge
and it is expected to return the ordering of objects that the Judge has.

julia> using Combinatorics

julia> function player(judge::Judge, options::Vector{Vector{Int}})
           @assert !isempty(options)
           n = length(options[1])
           length(options) == 1 && return options[1]
           besti, bestj, best_split = 0, 0, typemax(Int)
           starti = length(options) == factorial(n) ? n - 1 : 1
           for i in starti:n, j in i+1:n
               yes_count = count(opt -> opt[i] < opt[j], options)
               current_split = max(yes_count, length(options) - yes_count)
               if current_split < best_split
                   best_split = current_split
                   besti, bestj = i, j
               end
           end
           query = compare(judge, besti, bestj)
           filter!(opt -> (opt[besti] < opt[bestj]) == query, options)
           return player(judge, options)
       end
player (generic function with 2 methods)

julia> player(judge::Judge) =
           player(judge, collect(permutations(1:length(judge.state))))
player (generic function with 2 methods)

Initially player allows all permutations. With each question only the
permutations that are consistent with the received answers are retained. Note
that the player function uses a very simple rule: it asks the question after
which the number of remaining options in the worst case is minimized. As a
special case the starti = length(options) == factorial(n) ? n - 1 : 1 line
uses the fact that if we ask the first question it does not really matter which
pair of elements we compare (but starting from the second question we consider
all possible comparisons).

Let us test our player function on a few examples:

julia> j1 = Judge([1, 3, 5, 6, 4, 2])
Judge([1, 3, 5, 6, 4, 2], 0)

julia> player(j1)
6-element Array{Int64,1}:
 1
 3
 5
 6
 4
 2

julia> j1.queries
10

julia> j2 = Judge([7, 6, 5, 4, 3, 1, 2])
Judge([7, 6, 5, 4, 3, 1, 2], 0)

julia> player(j2)
7-element Array{Int64,1}:
 7
 6
 5
 4
 3
 1
 2

julia> j2.queries
12

The function seems to be working correctly. Now let us test it systematically
for \(n\in[7]\):

julia> df = DataFrame()
0×0 DataFrame


julia> for n in 1:7
           queries = map(permutations(1:n)) do perm
               j = Judge(perm)
               if player(j) != perm
                   error("player gave wrong result for $perm input")
               end
               return j.queries
           end
           append!(df, DataFrame(n=n, queries=queries))
       end

julia> @pipe groupby(df, [:n, :queries], sort=true) |>
             combine(_, nrow) |>
             unstack(_, :queries, :n, :nrow) |>
             coalesce.(_, "") |>
             show(_, eltypes=false, summary=false)

│ Row │ queries │ 1 │ 2 │ 3 │ 4  │ 5   │ 6   │ 7    │
├─────┼─────────┼───┼───┼───┼────┼─────┼─────┼──────┤
│ 1   │ 0       │ 1 │   │   │    │     │     │      │
│ 2   │ 1       │   │ 2 │   │    │     │     │      │
│ 3   │ 2       │   │   │ 2 │    │     │     │      │
│ 4   │ 3       │   │   │ 4 │    │     │     │      │
│ 5   │ 4       │   │   │   │ 8  │     │     │      │
│ 6   │ 5       │   │   │   │ 16 │     │     │      │
│ 7   │ 6       │   │   │   │    │ 8   │     │      │
│ 8   │ 7       │   │   │   │    │ 112 │     │      │
│ 9   │ 9       │   │   │   │    │     │ 304 │      │
│ 10  │ 10      │   │   │   │    │     │ 416 │      │
│ 11  │ 11      │   │   │   │    │     │     │ 80   │
│ 12  │ 12      │   │   │   │    │     │     │ 2912 │
│ 13  │ 13      │   │   │   │    │     │     │ 2048 │

As you can see we never got an error, so our player seems to work correctly.
Also it works optimally in worst-case. The bound we have calculated above was
never exceeded.

As an additional exercise I have used several common functions from
DataFrames.jl to (hopefully) nicely format the resulting table (in DataFrames.jl
0.22 release it will be yet nicer as we are switching to PrettyTables.jl as a
back-end for text/plain printing).

Take away notes

Unfortunately for \(n=8\) the simple algorithm we used does not produce the
desired results any more (warning — this calculation is a bit more lengthly):

julia> queries = map(permutations(1:8)) do perm
           j = Judge(perm)
           if player(j) != perm
               error("player gave wrong result for $perm input")
           end
           return j.queries
       end;

julia> @pipe DataFrame(queries=queries) |>
             groupby(_, :queries, sort=true) |>
             combine(_, nrow)
4×2 DataFrame
│ Row │ queries │ nrow  │
│     │ Int64   │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ 14      │ 1072  │
│ 2   │ 15      │ 22104 │
│ 3   │ 16      │ 16936 │
│ 4   │ 17      │ 208   │

and ceil(Int, log2(factorial(8))) is equal to 16. It is known that this
bound is achievable, so we fail in 208 cases. Therefore, as a challenge, I leave
you to think of an algorithm that is better. A natural approach is to find
optimal splits using backtracking while aggressively pruning branches that
cannot lead to an optimal solution to reduce the complexity, but maybe some
simple heuristic on top of the one I considered will be enough?