Tag Archives: julialang

E-Graph Pattern Matching (Part II)

By: Philip Zucker

Re-posted from: https://www.philipzucker.com/egraph-2/

Last time we spent a bit of effort building an e-graph data structure in Julia. The e-graph is a data structure that compactly stores and maintains equivalence classes of terms.

We can use the data structure for equational rewriting by iteratively pattern matching terms inside of it and adding new equalities found by instantiating those patterns. For example, we could use the pattern ~x + 0 to instantiate the rule ~x + 0 == ~x, find that ~x binds to foo23(a,b,c) in the e-graph and then insert into the e-graph the equality foo23(a,b,c) + 0 == foo23(a,b,c) and do this over and over. The advantage of the e-graph vs just rewriting terms is that we never lose information while doing this or rewrite ourselves into a corner. The disadvantage is that we maintain this huge data structure.

The rewrite rule problem can be viewed something like a graph. The vertices of the graph are every possible term. A rewrite rule that is applicable to that node describes a directed edge in the graph.

This graph is very large, infinite often, so it can only be described implicitly.

This graph perspective fails to capture some useful properties though. Treating each term as an indivisible vertex fails the capture that there can be rewriting happening in only a small part of the term, the vast majority of it left unchanged. There is a lot of opportunity for shared computations.

The EGraph from this perspective is a data structure for holding the already seen set of vertices efficiently and sharing these computations.

You can use this procedure for theorem proving. Insert the two terms you want to prove equal into the e-graph. Iteratively rewrite using your equalities. At each iteration, check whether two nodes in the e-graph have become equivalent. If so, congrats, problem proved! This is the analog of finding a path between two nodes in a graph by doing a bidirectional search from the start and endpoint. This is roughly the main method by which Z3 reasons about syntactic terms under equality in the presence of quantified equations. There are also methods by which to reconstruct the proof found if you want it.

Another application of basically the same algorithm is finding optimized rewrites. Apply rewrites until you’re sick of it, then extract from the equivalence class of your query term your favorite equivalent term. This is a very intriguing application to me as it produces more than just a yes/no answer.

Pattern Matching

Pattern matching takes a pattern with named holes in it and tries to find a way to fill the holes to match a term. The package SymbolicUtils.jl is useful for pattern matching against single terms, which is what I’ll describe in this section, but not quite set up to pattern match in an e-graph.

Pattern matching against a term is a very straightforward process once you sit down to do it. Where it gets complicated is trying to be efficient. But one should walk before jogging. I sometimes find myself so overwhelmed by performance considerations that I never get started.

First, one needs a data type for describing patterns (well, one doesn’t need it, but it’s nice. You could directly use pattern matching combinators).

struct PatVar
    id::Symbol
end

struct PatTerm
    head::Symbol
    args::Array{Union{PatTerm, PatVar}}
end

Pattern = Union{PatTerm,PatVar}

There is an analogous data structure in SymbolicUtils which I probably should’ve just used. For example Slot is basically PatVar.

The pattern matching function takes a Term and Pattern and returns a dictionary Dict{PatVar,Term} of bindings of pattern variables to terms or nothing if it fails

Matching a pattern variable builds a dictionary of bindings from that variable to a term.

match(t::Term,  p::PatVar ) = Dict(p => t)

Matching a PatTerm requires we check if anything is obviously wrong (wrong head, or wrong number of args) and then recurse on the args.
The resulting binding dictionaries are checked for consistency of their bindings and merged.


function merge_consistent(ds)
    newd = Dict()
    for d in ds
        for (k,v) in d
            if haskey(newd,k)
                if newd[k] != v
                    return nothing
                end
            else
                newd[k] = v
            end
        end
    end
    return newd
end

function match(t::Term, p::PatTerm) 
    if t.head != p.head || length(t.args) != length(p.args)
        return nothing
    else
        merge_consistent( [ match(t1,p1) for (t1,p1) in zip(t.args, p.args) ])
    end
end

There are other ways of arranging this computation, such as passing a dictionary parameter along and modifying it, by I find the above purely functional definition the most clear.

Pattern Matching in E-Graphs

The twist that E-Graphs throw into the pattern matching process is that instead of checking whether a child of a term matches the pattern, we need to check for any possible child matching in the equivalence class of children. This means our pattern matching procedure contains a backtracking search.

The de Moura and Bjorner e-matching paper describes how to do this efficiently via an interpreted virtual machine. An explicit virtual machine is important for them because the patterns change during the Z3 search process, but it seems less relevant for the use case here or in egg. It may be better to use partial evaluation techniques to remove the existence of the virtual machine at runtime like in A Typed, Algebraic Approach to Parsing but I haven’t figured out how to do it yet.

The Simplify paper describes a much simpler inefficient pattern matching algorithm in terms of generators. This can be directly translated to Julia using Channels. This is a nice blog post describing how to build generators in Julia that work like python generators. Basically, as soon as you define your generator function, wrap the whole thing in a Channel() do c and when you want to yield myvalue call put!(c, myvalue).

# https://www.hpl.hp.com/techreports/2003/HPL-2003-148.pdf
function ematchlist(e::EGraph, t::Array{Union{PatTerm, PatVar}} , v::Array{Id}, sub)
    Channel() do c
        if length(t) == 0
            put!(c, sub)
        else
            for sub1 in ematch(e, t[1], v[1], sub)
                for sub2 in ematchlist(e, t[2:end], v[2:end], sub1)
                    put!(c, sub2)
                end
            end
        end
    end
end
# sub should be a map from pattern variables to Id
function ematch(e::EGraph, t::PatVar, v::Id, sub)
    Channel() do c
        if haskey(sub, t)
            if find_root!(e, sub[t]) == find_root!(e, v)
                put!(c, sub)
            end
        else
            put!(c,  Base.ImmutableDict(sub, t => find_root!(e, v)))
        end
    end
end

    
function ematch(e::EGraph, t::PatTerm, v::Id, sub)
    Channel() do c
        for n in e.classes[find_root!(e,v)].nodes
            if n.head == t.head
                for sub1 in ematchlist(e, t.args , n.args , sub)
                    put!(c,sub1)
                end
            end
        end
    end
end

You can then instantiate patterns with the returned dictionaries via


function instantiate(e::EGraph, p::PatVar , sub)
    sub[p]
end

function instantiate(e::EGraph, p::PatTerm , sub)
    push!( e, Term(p.head, [ instantiate(e,a,sub) for a in p.args ] ))
end

And build rewrite rules

struct Rule
    lhs::Pattern
    rhs::Pattern
end

function rewrite!(e::EGraph, r::Rule)
    matches = []
    EMPTY_DICT2 = Base.ImmutableDict{PatVar, Id}(PatVar(:____),  Id(-1))
    for (n, cls) in e.classes
        for sub in ematch(e, r.lhs, n, EMPTY_DICT2)
            push!( matches, ( instantiate(e, r.lhs ,sub)  , instantiate(e, r.rhs ,sub)))
        end
    end
    for (l,r) in matches
        union!(e,l,r)
    end
    rebuild!(e)
end

Here’s a very simple equation proving function that takes in a pile of rules

function prove_eq(e::EGraph, t1::Id, t2::Id , rules)
    for i in 1:3
        if in_same_set(e,t1,t2)
            return true
        end
        for r in rules
            rewrite!(e,r) # I should split this stuff up. We only need to rebuild at the end
        end
    end
    return nothing
end

As sometimes happens, I’m losing steam on this and would like to work on something different for a bit. But I loaded up the WIP at https://github.com/philzook58/EGraphs.jl.

Bits and Bobbles

Catlab

You can find piles of equational axioms in Catlab like here for example

A tricky thing is that some equational axioms of categories seem to produce variables out of thin air.
Consider the rewrite rule ~f => id(~A) ⋅ ~f. Where does the A come from? It’s because we’ve highly suppressed all the typing information. The A should come from the type of f.

I think the simplest way to fix is the “type tag” approach I mentioned here https://www.philipzucker.com/theorem-proving-for-catlab-2-lets-try-z3-this-time-nope/ and can be read about here https://people.mpi-inf.mpg.de/~jblanche/mono-trans.pdf. You wrap every term in a tagging function so that f becomes tag(f, Hom(a,b)). This approach makes sense because it turns GATs purely equational without the implicit typing context, I think. It is a bummer that it will probably choke the e-graph with junk though.

It may be best to build a TypedTerm to use rather than Term that has an explicit field for the type. This brings it close to the representation that Catlab already uses for syntax, so maybe I should just directly use that. I like having control over my own type definitions though and Catlab plays some monkey business with the Julia level types. 🙁

struct TypedTerm
    type
    head
    args
end

As I have written it so far, I don’t allow the patterns to match on the heads of terms, although there isn’t any technical reason to not allow it. The natural way of using a TypedTerm would probably want to do so though, since you’d want to match on the type sometimes while keeping the head a pattern. Another possible way to do this that avoids all these issues is to actually make terms a bit like how regular Julia Expr are made, which usually has :call in the head position. Then by convention the first arg could be the actual head, the second arg the type, and the rest of the arguments the regular arguments.

A confusing example sketch:

TypedTerm(:mcopy, typedterm(Hom(a,otimes(a,a))) , [TypedTerm(a,Ob,[])])
would become
Term(:call, [:mcopy, term!(Hom(a,otimes(a,a)) , term!(a) ])

Catlab already does some simplifications on it’s syntax, like collecting up associative operations into lists. It is confusing how to integrate the with the e-graph structure so I think the first step is to just not. That stuff isn’t on by default, so you can get rid of it by defining your own versions using @syntax

using Catlab
using Catlab.Theories
import Catlab.Theories: compose, Ob, Hom
@syntax MyFreeCartesianCategory{ObExpr,HomExpr} CartesianCategory begin
  #otimes(A::Ob, B::Ob) = associate_unit(new(A,B), munit)
  #otimes(f::Hom, g::Hom) = associate(new(f,g))
  #compose(f::Hom, g::Hom) = associate(new(f,g; strict=true))

  #pair(f::Hom, g::Hom) = compose(mcopy(dom(f)), otimes(f,g))
  #proj1(A::Ob, B::Ob) = otimes(id(A), delete(B))
  #proj2(A::Ob, B::Ob) = otimes(delete(A), id(B))
end

# we can translate the axiom sets programmatically.
names(Main.MyFreeCartesianCategory)
#A = Ob(MyFreeCartesianCategory, :A)

A = Ob(MyFreeCartesianCategory, :A)
B = Ob(MyFreeCartesianCategory, :B)
f = Hom(:f, A,B)
g = Hom(:g, B, A)
h = compose(compose(f,g),f)
# dump(h)
dump(f)

#=
Main.MyFreeCartesianCategory.Hom{:generator}
  args: Array{Any}((3,))
    1: Symbol f
    2: Main.MyFreeCartesianCategory.Ob{:generator}
      args: Array{Symbol}((1,))
        1: Symbol A
      type_args: Array{GATExpr}((0,))
    3: Main.MyFreeCartesianCategory.Ob{:generator}
      args: Array{Symbol}((1,))
        1: Symbol B
      type_args: Array{GATExpr}((0,))
  type_args: Array{GATExpr}((2,))
    1: Main.MyFreeCartesianCategory.Ob{:generator}
      args: Array{Symbol}((1,))
        1: Symbol A
      type_args: Array{GATExpr}((0,))
    2: Main.MyFreeCartesianCategory.Ob{:generator}
      args: Array{Symbol}((1,))
        1: Symbol B
      type_args: Array{GATExpr}((0,))
=#

Rando thought: Can an EGraph data structure itself be expressed in Catlab? I note that IntDisjointSet does show up in a Catlab definition of pushouts. The difficulty and scariness of the EGraph is maintaining it’s invariants, which may be expressible as some kind of composition condition on the various maps that make up the EGraph.

The rewrite rule problem can be viewed something like a graph. The vertices of the graph is every possible term. A rewrite rule that is applicable to that node describes a directed edge in the graph.

This graph is very large, infinite often, so it can only be described implicitly.

Proving the equivalence of two terms is like finding a path in this graph. You could attempt a greedy approach or a breadth first search approach, or any variety of your favorite search algorithm like A*.

This graph perspective fails to capture some useful properties though. Treating each term as an indivisible vertex fails the capture that there can be rewriting happening in only a small part of the term, the vast majority of it left unchanged. There is a lot of opportunity for shared computations.

The EGraph from this perspective is a data structure for holding the already seen vertices.

I suspect some heuristics to be helpful like applying the “simplifying” direction of the equations more often than the “complicating” direction. In a 5:1 ratio let’s say.

A natural algorithm to consider for optimizing terms is to take the best expression found so far, destroy the e-graph and place just that expression in it.
Try rewrite rules. If the best is still old query, don’t destroy e-graph and apply a new round of rewrites the widen the radius of the e-graph. Gives you kind of a combo greedy + complete.

Partial Evaluation

Partial evaluation seems like a good technique for optimizing pattern matches. The ideal pattern match code expands out to a nicely nested set of if statements. One alternative to passing a dictionary might be to assign each pattern variable to an integer at compile time and instead pass an array, which would be a bit better. However, by using metaprogramming we can insert local variables into generated code and avoid the need for a dictionary being based around at runtime. Then the julia compiler can register allocate and what not like it usually does (and quite efficiently I’d expect).

See this post (in particular to reference links https://www.philipzucker.com/metaocaml-style-partial-evaluation-in-coq/ for more on partial evaluation.

A first thing to consider is that we’re building up and destroying dictionaries at a shocking rate.

Secondly dictionaries themselves are (relatively) expensive things.

I experimented a bit with using a curried form for match to see if maybe the Julia compiler was smart enough to sort of evaluate away my use of dictionaries, but that did not seem to be the case.

I found the examining the @code_llvm and @code_native of the following simple experiments illuminating as to what Julia can and can no get rid of when the dictionary is obviously known at compile time to human eyes. Mostly it needs to be very obvious. I suspect Julia does not have any built special compiler magic reasoning for dictionaries and trying to automatically infer what’s safely optimizable by actually expanding the definition of a dictionary op sounds impossible.

function foo() #bad
    x = Dict(:a => 4)
    return x[:a]
end
function foo2()
    return 4
end


function foo3() # bad
    x = Dict(:a => 4)
    () -> x[:a]
end

function foo4() # much better. But we do pay a closure cost.
    x = Dict(:a => 4)
    r = x[:a]
    () -> r
end

function foo5()
    x = Dict(:a => 4)
    :( () -> $(x[:a]) )
end

function foo7()
    return foo()
end


function foo6() # julia does however do constant propagation and deletion of unnecessary bullcrap
    x = 4
    y = 7
    y += x
    return x
end

function foo7() # this compiles well
    x = Base.ImmutableDict( :a => 4)
    return x[:a]
end

function foo8()
    x = Base.ImmutableDict( :a => 4)
    r = x[:a]
    z -> z == r # still has a closure indirection. It's pretty good though
end

function foo9()
    x = Base.ImmutableDict( :a => 4)
    r = x[:a]
    z -> z == 4 
end


#@code_native foo7()

z = foo4()
#@code_llvm z()

z = eval(foo5())
@code_native z()
@code_native foo2()
@code_llvm foo6()
z = foo8()
@code_native z(3)
z = foo9()
@code_native z(3)
@code_native foo7()

# so it's possible that using an ImmutableDict is sufficient for julia to inline almost evertything itself.
# You want the indexing to happen outside the closure.

A possibly useful technique is partial evaluation. We have in a sense built an interpreter for the pattern language. Specializing this interpeter gives a fast pattern matcher. We explicitly can build up the code Expr that will do the pattern matching by interpreting the pattern.

So here’s a version that takes in a pattern and builds code the perform the pattern in terms of if then else statements and runtime bindings.

Here’s a sketch. This version only returns true or false rather than returning the bindings. I probably need to cps it to get actual bindings out. I think that’s what i was starting to do with the c parameter.

function prematch(p::PatVar, env, t, c )
    if haskey(env, p)
       :(  $(env[p]) != $t  && return false )
    else
       env[p] = gensym(p.id) #make fresh variable
       :(  $(env[p]) = $t  ) #bind it at runtime to the current t
    end
end


function sequence(code)
    foldl(
        (s1,s2) -> quote
                        $s1
                        $s2
                    end
        , code)
end

function prematch(p::PatTerm, env, t, c)
    if length(p.args) == 0
        :( $(t).head == $( QuoteNode(p.head)  ) && length($t) == 0 && return false   )
    else
        quote
            if $(t).head != $( QuoteNode(p.head) ) || $( length(p.args) ) != length($(t).args)
                return false
            else
                $( sequence([prematch(a, env, :( $(t).args[$n] ), c)   for (n,a) in enumerate(p.args) ]))
            end
        end 
    end
end


println( prematch( PatTerm(:f, []) , Dict(),  :w , nothing) )
println( prematch( PatTerm(:f, [PatVar(:a)]) , Dict(),  :t , nothing) )
println( prematch( PatTerm(:f, [PatVar(:a), PatVar(:a)]) , Dict(),  :t , nothing) )


DataFrames 0.22 @ JuliaAcademy

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/05/juliaacademy.html

Introduction

This time the blog post will be shorter than usual.

Logan Kilpatrick has just released a new version of JuliaAcademy
course for DataFrames.jl that was updated to its 0.22 release and also contains
some new material.

You can find course materials on GitHub here, while the videos will
be released in the coming days; first two
1. Environment Setup and
2. First Steps With Data Frames are already
available for watching.

Before you go

Not to leave you with just bare links let me present a short example how you can
process dates in DataFrames.jl while taking into account the possibility that
there might be missing values in the data (this is a question I was recently
asked how to do it).

I am using Julia 1.5.3, DataFrames 0.22.1, and Missings 0.4.4.

First we prepare a sample data frame:

julia> using Dates, Missings, DataFrames

julia> df1 = DataFrame(date = Date.(2020, 1, 1:10));

julia> allowmissing!(df1);

julia> df1.date[5] = missing;

julia> df1
10×1 DataFrame
 Row │ date
     │ Date?
─────┼────────────
   1 │ 2020-01-01
   2 │ 2020-01-02
   3 │ 2020-01-03
   4 │ 2020-01-04
   5 │ missing
   6 │ 2020-01-06
   7 │ 2020-01-07
   8 │ 2020-01-08
   9 │ 2020-01-09
  10 │ 2020-01-10

We now want to split :date column into three columns that will contain year,
month and day respectively. Here is the way how you can achieve it:

julia> df2 = transform(df1, @. :date =>
                               ByRow(passmissing([year, month, day])) =>
                               [:year, :month, :day])
10×4 DataFrame
 Row │ date        year     month    day
     │ Date?       Int64?   Int64?   Int64?
─────┼───────────────────────────────────────
   1 │ 2020-01-01     2020        1        1
   2 │ 2020-01-02     2020        1        2
   3 │ 2020-01-03     2020        1        3
   4 │ 2020-01-04     2020        1        4
   5 │ missing     missing  missing  missing
   6 │ 2020-01-06     2020        1        6
   7 │ 2020-01-07     2020        1        7
   8 │ 2020-01-08     2020        1        8
   9 │ 2020-01-09     2020        1        9
  10 │ 2020-01-10     2020        1       10

Note that we are using here a common pattern that you can use broadcasting to
easily specify multiple operations on the same object (in this case this is the
same source column).

Finally we go back and collect the :year, :month and :day columns into one
column that contains the original Date values:

julia> df3 = transform(df2, [:year, :month, :day] =>
                            ByRow(passmissing(Date)) =>
                            :date2)
10×5 DataFrame
 Row │ date        year     month    day      date2
     │ Date?       Int64?   Int64?   Int64?   Date?
─────┼───────────────────────────────────────────────────
   1 │ 2020-01-01     2020        1        1  2020-01-01
   2 │ 2020-01-02     2020        1        2  2020-01-02
   3 │ 2020-01-03     2020        1        3  2020-01-03
   4 │ 2020-01-04     2020        1        4  2020-01-04
   5 │ missing     missing  missing  missing  missing
   6 │ 2020-01-06     2020        1        6  2020-01-06
   7 │ 2020-01-07     2020        1        7  2020-01-07
   8 │ 2020-01-08     2020        1        8  2020-01-08
   9 │ 2020-01-09     2020        1        9  2020-01-09
  10 │ 2020-01-10     2020        1       10  2020-01-10

This time we took advantage of the fact that Date takes three positional
arguments and this is the default behavior of transformation specifications
in DataFrames.jl in which multiple source columns are provided.

This is all for today. Bye!

ABC of A/B Testing

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/11/27/abt.html

Introduction

After my last post
on random number generation I think that it is good to add one more thing
that Chris Rackauckas pointed out to me as relevant to show a complete
picture. If you are working with pseudo random number generators it is good
to know that Julia has an excellent RandomNumbers.jl package that
offers a wide array of different random number generators to choose from.
So in this post I will use xoroshiro128** generator.

The post was tested under Julia 1.5.3 and packages DataFrames.jl 0.22.1,
Distributions.jl 0.23.8, FreqTables.jl 0.4.2, Pipe.jl 1.3.0,
RandomNumbers.jl 1.4.0, PyPlot.jl 2.9.9 and intends to present a pretty complete
work-flow of investigating a sample data science problem.

A/B testing

In the post we will investigate several simple options how A/B tests
can be performed.

In our scenario we assume that we have to choose from \(n\) alternatives.
Each alternative \(i\in[n]\) has an unknown probability of success \(p_i\).
Assume that e.g. we have \(n\) different ways to promote a product to a customer
and \(p_i\) is a probability that customer buys after seeing a promotion.

Initially we do not know which of the options is the best, i.e. has the highest
\(p_i\), but we are allowed to present them to customers and collect their
responses to learn the unknown probabilities.

The most simple way to run such an experiment is to present each option to the
customer with the same probability \(1/n\). Such an approach focuses purely on
exploration of the alternatives. However, intuitively we feel that it is
wasteful. If after some trials we see that some option is very bad and has
marginal chances of being the best then it is not worth showing it to the
customer. We are not only losing money (we present something that we know is not
very good), but additionally we will not learn much in terms of choosing the
best option.

Another approach would be to always show our customer an option that we
currently believe is best. Such a strategy focuses purely on exploitation.
Similarly to the uniform sampling described above, we can see that it does not
have to lead to good outcomes. We can get stuck with sub-optimal solution, as
we do not give ourselves a chance to learn that maybe some alternative that is
currently not considered the best is actually superior.

So how can we balance the exploration with exploitation? There are many
approaches that can be considered, but one of the simplest, yet very efficient,
is Thompson sampling. In this approach we do the following. We constantly
keep information about the distribution of our beliefs about the quality of each
alternative. Denote the random variable corresponding to the distribution of
these beliefs as \(P_i\) for alternative \(i\). Now in each step we
independently sample a value from each distribution \(P_i\) and choose to
measure the option that has the highest value of the sample. Now why does this
heuristic work? Note that we will tend to sample the alternatives that have a
high value of beliefs (so we do not waste time on sampling very bad options)
but at the same time we will also from time to time sample options that have
a high variance of \(P_i\) distribution (so the options about which we do not
know much; even if at the moment of sampling we believe they are not so good
our beliefs show us that actually they could turn out good if we collect more
evidence on them).

In the codes below I compare these three strategies of running A/B tests. I call
them respectively:

  1. Uniform: sample each alternative with probability \(1/n\);
  2. Greedy: sample the alternative with the highest \(E(P_i)\);
  3. Thompson: sample the alternative with the highest value of a single
    independent sample from \(P_i\).

Before we turn to coding one last thing should be discussed. How should we
represent our beliefs \(P_i\)? Fortunately here Bayesian statistics comes handy.
As our experiments are Bernoulli trials, we know that if we represent the
prior beliefs as Beta distributed then the posterior beliefs are also
going to follow the same distribution. Formally we say that the Beta
distribution is a conjugate prior of Bernoulli distribution. If our beliefs have
\(Beta(\alpha, \beta)\) distribution and we see a success then the distribution
of beliefs becomes \(Beta(\alpha+1, \beta)\). On the other hand if we see a
failure it becomes \(Beta(\alpha, \beta+1)\). As you can see it is very easy to
implement.

Setting up the stage

In a fresh Julia session we first load the required packages:

using DataFrames
using Distributions
using FreqTables
using Pipe
using RandomNumbers
using PyPlot
using LinearAlgebra

Next define a function that updates the beliefs about the quality of some
alternative:

update_belief(b::Beta, r::Bool) = Beta(b.α + r, b.β + !r)

Finally define the three decision rules that we want to compare: uniform,
greedy, and Thompson:

abstract type ABRule end

struct Greedy <: ABRule end
struct Thompson <: ABRule end
struct Unif <: ABRule end

function decide(options::Vector{<:Beta}, ::Unif)
    return rand(axes(options, 1))
end

function decide(options::Vector{<:Beta}, ::Greedy)
    m = mean.(options)
    mm = maximum(m)
    return rand(findall(==(mm), m))
end

function decide(options::Vector{<:Beta}, ::Thompson)
    m = rand.(options)
    mm = maximum(m)
    return rand(findall(==(mm), m))
end

One technical aspect of our implementations of decision making rule for greedy
and Thompson sampling is that we have to take into consideration that more than
one option can have the same evaluation. That is why we need first to find
which option has maximum mean with mm = maximum(m) and then randomly select
one of the options that are best with rand(findall(==(mm), m)).

Now we are ready to run the experiment.

Implementing the experimenter

Here is an implementation of the function that runs the experiments:

function coupled_run!(n::Int, steps::Int, rng::AbstractRNG,
                      results::DataFrame, runid::Int)
    truep = rand(rng, n)
    bestp = argmax(truep)
    streams = [rand(rng, Bernoulli(p), steps) for p in truep]

    for alg in (Greedy(), Unif(), Thompson())
        options = [Beta(1, 1) for i in 1:n]
        loc = fill(1, n)
        for i in 1:steps
            d = decide(options, alg)
            options[d] = update_belief(options[d], streams[d][loc[d]])
            loc[d] += 1
        end
        push!(results, (runid=runid, alg=string(typeof(alg)),
                        correct=mean(options[bestp]) == maximum(mean.(options)),
                        payoff=sum(x -> x.α - 1, options)))
    end
end

The coupler_run! takes five arguments: k is number of alternatives to
consider, steps is the length of A/B test experiment, rng is random number
generator we will use (remember that we want to test RandomNumbers.jl library),
results is a data frame that we will update and store the experiment results
in, finally runid is experiment number.

Let us comment on several crucial parts of the above code. Note that we sample
unknown true probabilities of success in the experiment with
truep = rand(rng, n). As they are uniformly distributed we set our initial
beliefs about options to also follow a uniform distribution in this line:
options = [Beta(1, 1) for i in 1:n].

The next notable thing is that we want the experiments for all algorithms we
consider to be coupled, that is: each alternative should generate the same
sequence of successes and failures for each option. Therefore, for simplicity,
we pre-populate the results of the trials in this line
streams = [rand(rng, Bernoulli(p), steps) for p in truep] and then just tack
in loc vector which random number for each option should be picked. Such an
approach will reduce noise when comparing the alternatives. (note that we could
have implemented it more efficiently, but I did not want to over-complicate the
code).

Finally note that we are collecting two informations:

  • correct: if what we think is best at the end of the experiment is best
    indeed.
  • payoff: how much payoff we have collected when running the experiment.

Note that both of them are desirable. Everyone probably agrees that it would
be best to have correct near to one (so that after running the experiment
we are confident that we know what is best). At the same time it is also
desirable to ensure that the experiment itself is not wasteful, so that we
collect as big payoff as possible.

We will compare the three considered A/B testing rules using these two criteria.

Analyzing the experiment results

First run the experiment (it should take under 20 seconds):

julia> results = DataFrame()
0×0 DataFrame

julia> rng = Xorshifts.Xoroshiro128StarStar(1234)
RandomNumbers.Xorshifts.Xoroshiro128StarStar(0x9e929e92000004d2, 0xda409a400013489a)

julia> @time foreach(runid -> coupled_run!(10, 500, rng, results, runid), 1:20_000)
 18.736959 seconds (143.71 M allocations: 8.944 GiB, 6.78% gc time)

julia> results
60000×4 DataFrame
   Row │ runid  alg       correct  payoff
       │ Int64  String    Bool     Float64
───────┼───────────────────────────────────
     1 │     1  Greedy      false    451.0
     2 │     1  Unif         true    354.0
     3 │     1  Thompson     true    486.0
     4 │     2  Greedy       true    468.0
     5 │     2  Unif         true    264.0
   ⋮   │   ⋮       ⋮         ⋮        ⋮
 59997 │ 19999  Thompson     true    464.0
 59998 │ 20000  Greedy      false    472.0
 59999 │ 20000  Unif         true    394.0
 60000 │ 20000  Thompson     true    462.0
                         59991 rows omitted

We have run the sampling 20,000 times in under 20 seconds. We have 60,000 rows
in our DataFrame as for each sampling we produced three rows corresponding
to three alternatives we consider. In our experiment we considered 10
alternatives and a budget of 500 samples for testing.

First we collect some initial statistics about the alternatives:

julia> @pipe groupby(results, :alg) |>
             combine(_, [:correct, :payoff] .=> mean)
3×3 DataFrame
 Row │ alg       correct_mean  payoff_mean
     │ String    Float64       Float64
─────┼─────────────────────────────────────
   1 │ Greedy         0.38145      405.997
   2 │ Unif           0.8031       250.394
   3 │ Thompson       0.87955      431.31

We see that Thompson sampling has both best probability of being correct and
the best cumulative payoff. Interestingly greedy strategy has low probability
of being correct, but relatively high payoff. The opposite holds for uniform
sampling strategy.

Now we investigate correct selection probability in more detail:

julia> correct_wide = unstack(results, :runid, :alg, :correct)
20000×4 DataFrame
   Row │ runid  Greedy  Unif   Thompson
       │ Int64  Bool?   Bool?  Bool?
───────┼────────────────────────────────
     1 │     1   false   true      true
     2 │     2    true   true      true
     3 │     3   false   true      true
     4 │     4   false   true      true
     5 │     5   false  false     false
   ⋮   │   ⋮      ⋮       ⋮       ⋮
 19997 │ 19997   false   true      true
 19998 │ 19998   false  false      true
 19999 │ 19999   false   true      true
 20000 │ 20000   false   true      true
                      19991 rows omitted

julia> proptable(correct_wide, :Greedy, :Unif)
2×2 Named Array{Float64,2}
Greedy ╲ Unif │   false     true
──────────────┼─────────────────
false         │ 0.14145   0.4771
true          │ 0.05545    0.326

julia> proptable(correct_wide, :Greedy, :Thompson)
2×2 Named Array{Float64,2}
Greedy ╲ Thompson │   false     true
──────────────────┼─────────────────
false             │  0.0907  0.52785
true              │ 0.02975   0.3517

julia> proptable(correct_wide, :Unif, :Thompson)
2×2 Named Array{Float64,2}
Unif ╲ Thompson │   false     true
────────────────┼─────────────────
false           │  0.0893   0.1076
true            │ 0.03115  0.77195

Observe that Thompson and uniform sampling give quite similar results most of
the time , with Thompson sampling having a slight edge when they disagree.
Greedy strategy is much worse. It is easy to understand these findings. Greedy
strategy is not doing enough exploration. Both Thompson and uniform sampling
do explore. The edge of Thompson sampling is that it does not waste time testing
very bad alternatives (so it can focus on the good ones and better discriminate
between them).

Similarly we can investigate which alternative gives the best payoff (remember
that we made sure to have the scenarios coupled):

julia> function find_best(alg, payoff)
           best_idxs = findall(==(maximum(payoff)), payoff)
           best_algs = alg[best_idxs]
           sort!(best_algs)
           return join(best_algs, '|')
       end
find_best (generic function with 1 method)

julia> alg_freq = @pipe groupby(results, :runid) |>
                        combine(_, [:alg, :payoff] => find_best => :best) |>
                        groupby(_, :best, sort=true) |>
                        combine(_, nrow)
3×2 DataFrame
 Row │ best             nrow
     │ String           Int64
─────┼────────────────────────
   1 │ Greedy            9736
   2 │ Greedy|Thompson    120
   3 │ Thompson         10144

We observe that uniform sampling for 20,000 simulations was never the best,
while Thompson sampling has a slight edge over greedy sampling in the
probability of being best (there is also a small probability of a tie).

As an exercise let us check, sticking to Bayesian approach, what is the
probability that Thompson sampling has higher probability of being best than
greedy one. Here we use the fact that the Dirichlet distribution is a
conjugate prior for the Categorical distribution, and set uniform a priori
beliefs:

julia> posterior = Dirichlet(alg_freq.nrow .+ 1)
Dirichlet{Float64}(alpha=[9737.0, 121.0, 10145.0])

julia> posterior_dis = rand(rng, posterior, 100_000)
3×100000 Array{Float64,2}:
 0.484744    0.486453    0.492359    …  0.494059   0.484432    0.482332
 0.00632059  0.00692795  0.00563844     0.0076129  0.00706873  0.00782947
 0.508936    0.506619    0.502002       0.498328   0.508499    0.509838

julia> mean(x -> x[1] < x[3], eachcol(posterior_dis))
0.99707

We can see that given the sample size of 20,000 this probability is greater than
99%.

However, we see that the considered fractions actually do not differ that much:

julia> transform(alg_freq, :nrow => (x -> x ./ sum(x)) => :freq)
3×3 DataFrame
 Row │ best             nrow   freq
     │ String           Int64  Float64
─────┼─────────────────────────────────
   1 │ Greedy            9736   0.4868
   2 │ Greedy|Thompson    120   0.006
   3 │ Thompson         10144   0.5072

and the difference in average payoffs we observed above was relatively higher.
Let us dig into this. First we plot a distribution of difference between payoffs
with Thompson and greedy sampling:

payoff_wide = unstack(results, :runid, :alg, :payoff)
Δpayoff = payoff_wide.Thompson .- payoff_wide.Greedy
hist(Δpayoff, 25)
xlabel("Δpayoff: Thompson-Greedy")

getting the following plot:

Delta of payoff

Now we see what is going on. If Thompson sampling is worse than greedy sampling
it is worse only slightly. On the other hand greedy sampling can be much worse
than Thompson sampling (and we understand when: this is a scenario when greedy
algorithm gets stuck at some sub-optimal alternative).

We also quickly check the distribution of estimator of the difference in payoffs
using Bayesian bootstrapping (to keep to the style we use in this post):

julia> dd = Dirichlet(length(Δpayoff), 1);
julia> boot = [dot(Δpayoff, rand(rng, dd)) for _ in 1:10_000];

julia> quantile(boot, 0:0.25:1)
5-element Array{Float64,1}:
 23.427965439713827
 24.64825057737705
 24.911299035010433
 25.17170629935465
 26.513561155964837

and we see, that, as expected the distribution of the difference is very
significantly greater than zero (we could probably easily detect this
significance even with a much smaller number of tests than 20,000).

Conclusion

That was a long post, so let me finish at this point. I hope you enjoyed it and
could also appreciate the data wrangling capabilities that JuliaData ecosystem
offers you. I also encourage you to run some more experiments on your own
to chcek how the considered A/B testing strategies behave under different
parameterizations.

And a usual caveat – performance geeks (are there any in the Julia community?)
can find several spots in my codes that I left sub-optimal to keep them simple.
Feel free to experiment with improving the execution speed of my examples.