ABC of A/B Testing

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/11/27/abt.html

Introduction

After my last post on random number generation I think it is good to add one
more thing that Chris Rackauckas pointed out to me as relevant to give a
complete picture. If you are working with pseudorandom number generators it is
good to know that Julia has an excellent RandomNumbers.jl package that offers
a wide array of random number generators to choose from. So in this post I
will use the xoroshiro128** generator.

The post was tested under Julia 1.5.3 with DataFrames.jl 0.22.1,
Distributions.jl 0.23.8, FreqTables.jl 0.4.2, Pipe.jl 1.3.0,
RandomNumbers.jl 1.4.0, and PyPlot.jl 2.9.9. It intends to present a fairly
complete workflow of investigating a sample data science problem.

A/B testing

In this post we will investigate several simple ways in which A/B tests
can be performed.

In our scenario we assume that we have to choose among \(n\) alternatives.
Each alternative \(i\in[n]\) has an unknown probability of success \(p_i\).
Assume, for example, that we have \(n\) different ways to promote a product to
a customer and \(p_i\) is the probability that the customer buys after seeing
promotion \(i\).

Initially we do not know which of the options is the best, i.e. has the highest
\(p_i\), but we are allowed to present them to customers and collect their
responses to learn the unknown probabilities.

The simplest way to run such an experiment is to present each option to the
customer with the same probability \(1/n\). Such an approach focuses purely on
exploration of the alternatives. However, intuitively we feel that it is
wasteful. If after some trials we see that some option is very bad and has
only marginal chances of being the best, then it is not worth showing it to
the customer. Not only are we losing money (we present something that we know
is not very good), but we also will not learn much that helps us choose the
best option.

Another approach would be to always show the customer the option that we
currently believe is the best. Such a strategy focuses purely on exploitation.
Similarly to the uniform sampling described above, it does not have to lead to
good outcomes. We can get stuck with a sub-optimal solution, as we do not give
ourselves a chance to learn that some alternative that is currently not
considered the best is actually superior.

So how can we balance exploration with exploitation? There are many
approaches that can be considered, but one of the simplest, yet very efficient,
is Thompson sampling. In this approach we do the following. We constantly
keep information about the distribution of our beliefs about the quality of
each alternative. Denote the random variable corresponding to the distribution
of these beliefs as \(P_i\) for alternative \(i\). In each step we
independently sample a value from each distribution \(P_i\) and choose to
measure the option with the highest sampled value. Why does this heuristic
work? Note that we will tend to sample the alternatives for which our beliefs
are high (so we do not waste time sampling very bad options), but at the same
time we will also, from time to time, sample options whose \(P_i\) distribution
has a high variance (that is, the options about which we do not know much;
even if at the moment of sampling we believe they are not so good, our beliefs
allow that they could turn out to be good if we collect more evidence on them).

In the code below I compare these three strategies of running A/B tests. I
call them, respectively:

  1. Uniform: sample each alternative with probability \(1/n\);
  2. Greedy: sample the alternative with the highest \(E(P_i)\);
  3. Thompson: sample the alternative with the highest value of a single
    independent sample from \(P_i\).

Before we turn to coding, one last thing should be discussed. How should we
represent our beliefs \(P_i\)? Fortunately, here Bayesian statistics comes in
handy. As our experiments are Bernoulli trials, we know that if we represent
the prior beliefs as Beta distributed then the posterior beliefs are also
going to follow the same distribution. Formally, we say that the Beta
distribution is a conjugate prior of the Bernoulli distribution. If our beliefs
have a \(Beta(\alpha, \beta)\) distribution and we see a success then the
distribution of beliefs becomes \(Beta(\alpha+1, \beta)\). On the other hand,
if we see a failure it becomes \(Beta(\alpha, \beta+1)\). As you can see, it is
very easy to implement.
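
For example, starting from a uniform \(Beta(1, 1)\) prior and observing \(s\)
successes and \(f\) failures, the beliefs become \(Beta(1+s, 1+f)\), a
distribution with mean \((1+s)/(2+s+f)\).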

Setting up the stage

In a fresh Julia session we first load the required packages:

using DataFrames
using Distributions
using FreqTables
using Pipe
using RandomNumbers
using PyPlot
using LinearAlgebra

Next define a function that updates the beliefs about the quality of some
alternative:

update_belief(b::Beta, r::Bool) = Beta(b.α + r, b.β + !r)
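
For example (the comments show the resulting means; a success shifts the
belief towards one, a failure towards zero):

b = Beta(1, 1)          # uniform prior, mean 1/2
update_belief(b, true)  # Beta(2, 1), mean 2/3
update_belief(b, false) # Beta(1, 2), mean 1/3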

Finally define the three decision rules that we want to compare: uniform,
greedy, and Thompson:

abstract type ABRule end

struct Greedy <: ABRule end
struct Thompson <: ABRule end
struct Unif <: ABRule end

function decide(options::Vector{<:Beta}, ::Unif)
    return rand(axes(options, 1))
end

function decide(options::Vector{<:Beta}, ::Greedy)
    m = mean.(options)
    mm = maximum(m)
    return rand(findall(==(mm), m))
end

function decide(options::Vector{<:Beta}, ::Thompson)
    m = rand.(options)
    mm = maximum(m)
    return rand(findall(==(mm), m))
end

One technical aspect of our implementation of the greedy and Thompson decision
rules is that we have to take into consideration that more than one option can
have the same evaluation. That is why we first find the maximum evaluation with
mm = maximum(m) and then randomly select one of the options attaining it with
rand(findall(==(mm), m)).
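
For instance, here is a small sketch with made-up numbers showing how ties are
broken:

m = [0.5, 0.9, 0.9]
mm = maximum(m)            # 0.9
findall(==(mm), m)         # [2, 3], i.e. options 2 and 3 tie
rand(findall(==(mm), m))   # returns 2 or 3, each with probability 1/2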

Now we are ready to run the experiment.

Implementing the experimenter

Here is an implementation of the function that runs the experiments:

function coupled_run!(n::Int, steps::Int, rng::AbstractRNG,
                      results::DataFrame, runid::Int)
    truep = rand(rng, n)
    bestp = argmax(truep)
    streams = [rand(rng, Bernoulli(p), steps) for p in truep]

    for alg in (Greedy(), Unif(), Thompson())
        options = [Beta(1, 1) for i in 1:n]
        loc = fill(1, n)
        for i in 1:steps
            d = decide(options, alg)
            options[d] = update_belief(options[d], streams[d][loc[d]])
            loc[d] += 1
        end
        push!(results, (runid=runid, alg=string(typeof(alg)),
                        correct=mean(options[bestp]) == maximum(mean.(options)),
                        payoff=sum(x -> x.α - 1, options)))
    end
end

The coupled_run! function takes five arguments: n is the number of alternatives
to consider, steps is the length of the A/B test experiment, rng is the random
number generator we will use (remember that we want to test the
RandomNumbers.jl library), results is a data frame in which we will store the
experiment results, and finally runid is the experiment number.

Let us comment on several crucial parts of the above code. Note that we sample
the unknown true probabilities of success in the experiment with
truep = rand(rng, n). As they are uniformly distributed, we set our initial
beliefs about the options to also follow a uniform distribution in this line:
options = [Beta(1, 1) for i in 1:n].
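
As a quick check (a minimal sketch), \(Beta(1, 1)\) is indeed the uniform
distribution on \([0, 1]\):

pdf.(Beta(1, 1), [0.1, 0.5, 0.9]) # all equal to 1.0, i.e. a flat density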

The next notable thing is that we want the experiments for all algorithms we
consider to be coupled, that is: each option should produce the same sequence
of successes and failures under every algorithm. Therefore, for simplicity,
we pre-populate the results of the trials in the line
streams = [rand(rng, Bernoulli(p), steps) for p in truep] and then just track
in the loc vector which draw should be used next for each option. Such an
approach will reduce noise when comparing the algorithms. (Note that we could
have implemented this more efficiently, but I did not want to over-complicate
the code.)

Finally, note that we collect two pieces of information:

  • correct: whether the option we consider best at the end of the experiment
    is indeed the best.
  • payoff: how much payoff we collected while running the experiment.

Both of them are desirable. Everyone probably agrees that it would be best to
have correct near one (so that after running the experiment we are confident
that we know which option is best). At the same time it is also desirable that
the experiment itself is not wasteful, i.e. that we collect as large a payoff
as possible.

We will compare the three considered A/B testing rules using these two criteria.

Analyzing the experiment results

First run the experiment (it should take under 20 seconds):

julia> results = DataFrame()
0×0 DataFrame

julia> rng = Xorshifts.Xoroshiro128StarStar(1234)
RandomNumbers.Xorshifts.Xoroshiro128StarStar(0x9e929e92000004d2, 0xda409a400013489a)

julia> @time foreach(runid -> coupled_run!(10, 500, rng, results, runid), 1:20_000)
 18.736959 seconds (143.71 M allocations: 8.944 GiB, 6.78% gc time)

julia> results
60000×4 DataFrame
   Row │ runid  alg       correct  payoff
       │ Int64  String    Bool     Float64
───────┼───────────────────────────────────
     1 │     1  Greedy      false    451.0
     2 │     1  Unif         true    354.0
     3 │     1  Thompson     true    486.0
     4 │     2  Greedy       true    468.0
     5 │     2  Unif         true    264.0
   ⋮   │   ⋮       ⋮         ⋮        ⋮
 59997 │ 19999  Thompson     true    464.0
 59998 │ 20000  Greedy      false    472.0
 59999 │ 20000  Unif         true    394.0
 60000 │ 20000  Thompson     true    462.0
                         59991 rows omitted

We have run the sampling 20,000 times in under 20 seconds. We have 60,000 rows
in our DataFrame, as each run produced three rows corresponding to the three
decision rules we compare. In each run we considered 10 alternatives and a
budget of 500 samples for testing.

First we compute some summary statistics for each decision rule:

julia> @pipe groupby(results, :alg) |>
             combine(_, [:correct, :payoff] .=> mean)
3×3 DataFrame
 Row │ alg       correct_mean  payoff_mean
     │ String    Float64       Float64
─────┼─────────────────────────────────────
   1 │ Greedy         0.38145      405.997
   2 │ Unif           0.8031       250.394
   3 │ Thompson       0.87955      431.31

We see that Thompson sampling has both the best probability of being correct
and the best cumulative payoff. Interestingly, the greedy strategy has a low
probability of being correct but a relatively high payoff. The opposite holds
for the uniform sampling strategy.

Now we investigate the probability of correct selection in more detail:

julia> correct_wide = unstack(results, :runid, :alg, :correct)
20000×4 DataFrame
   Row │ runid  Greedy  Unif   Thompson
       │ Int64  Bool?   Bool?  Bool?
───────┼────────────────────────────────
     1 │     1   false   true      true
     2 │     2    true   true      true
     3 │     3   false   true      true
     4 │     4   false   true      true
     5 │     5   false  false     false
   ⋮   │   ⋮      ⋮       ⋮       ⋮
 19997 │ 19997   false   true      true
 19998 │ 19998   false  false      true
 19999 │ 19999   false   true      true
 20000 │ 20000   false   true      true
                      19991 rows omitted

julia> proptable(correct_wide, :Greedy, :Unif)
2×2 Named Array{Float64,2}
Greedy ╲ Unif │   false     true
──────────────┼─────────────────
false         │ 0.14145   0.4771
true          │ 0.05545    0.326

julia> proptable(correct_wide, :Greedy, :Thompson)
2×2 Named Array{Float64,2}
Greedy ╲ Thompson │   false     true
──────────────────┼─────────────────
false             │  0.0907  0.52785
true              │ 0.02975   0.3517

julia> proptable(correct_wide, :Unif, :Thompson)
2×2 Named Array{Float64,2}
Unif ╲ Thompson │   false     true
────────────────┼─────────────────
false           │  0.0893   0.1076
true            │ 0.03115  0.77195

Observe that Thompson and uniform sampling give quite similar results most of
the time, with Thompson sampling having a slight edge when they disagree. The
greedy strategy is much worse. It is easy to understand these findings. The
greedy strategy does not do enough exploration, while both Thompson and uniform
sampling do explore. The edge of Thompson sampling is that it does not waste
time testing very bad alternatives (so it can focus on the good ones and better
discriminate between them).

Similarly, we can investigate which decision rule gives the best payoff
(remember that we made sure to have the scenarios coupled):

julia> function find_best(alg, payoff)
           best_idxs = findall(==(maximum(payoff)), payoff)
           best_algs = alg[best_idxs]
           sort!(best_algs)
           return join(best_algs, '|')
       end
find_best (generic function with 1 method)

julia> alg_freq = @pipe groupby(results, :runid) |>
                        combine(_, [:alg, :payoff] => find_best => :best) |>
                        groupby(_, :best, sort=true) |>
                        combine(_, nrow)
3×2 DataFrame
 Row │ best             nrow
     │ String           Int64
─────┼────────────────────────
   1 │ Greedy            9736
   2 │ Greedy|Thompson    120
   3 │ Thompson         10144

We observe that in 20,000 simulations uniform sampling was never the best,
while Thompson sampling has a slight edge over greedy sampling in the
probability of being the best (there is also a small probability of a tie).

As an exercise, sticking to the Bayesian approach, let us check what the
probability is that Thompson sampling has a higher probability of being the
best than the greedy one. Here we use the fact that the Dirichlet distribution
is a conjugate prior of the Categorical distribution, and we set uniform prior
beliefs:

julia> posterior = Dirichlet(alg_freq.nrow .+ 1)
Dirichlet{Float64}(alpha=[9737.0, 121.0, 10145.0])

julia> posterior_dis = rand(rng, posterior, 100_000)
3×100000 Array{Float64,2}:
 0.484744    0.486453    0.492359    …  0.494059   0.484432    0.482332
 0.00632059  0.00692795  0.00563844     0.0076129  0.00706873  0.00782947
 0.508936    0.506619    0.502002       0.498328   0.508499    0.509838

julia> mean(x -> x[1] < x[3], eachcol(posterior_dis))
0.99707

We can see that, given the sample size of 20,000, this probability is greater
than 99%.

However, we see that the considered fractions actually do not differ that much:

julia> transform(alg_freq, :nrow => (x -> x ./ sum(x)) => :freq)
3×3 DataFrame
 Row │ best             nrow   freq
     │ String           Int64  Float64
─────┼─────────────────────────────────
   1 │ Greedy            9736   0.4868
   2 │ Greedy|Thompson    120   0.006
   3 │ Thompson         10144   0.5072

while the difference in average payoffs we observed above was relatively
larger. Let us dig into this. First we plot the distribution of the difference
between payoffs under Thompson and greedy sampling:

payoff_wide = unstack(results, :runid, :alg, :payoff)
Δpayoff = payoff_wide.Thompson .- payoff_wide.Greedy
hist(Δpayoff, 25)
xlabel("Δpayoff: Thompson-Greedy")

getting the following plot:

[Figure: Delta of payoff, histogram of Δpayoff: Thompson-Greedy]

Now we see what is going on. If Thompson sampling is worse than greedy
sampling, it is worse only slightly. On the other hand, greedy sampling can be
much worse than Thompson sampling (and we understand when: this is the scenario
in which the greedy algorithm gets stuck at some sub-optimal alternative).

We also quickly check the distribution of the estimator of the difference in
payoffs using Bayesian bootstrapping (to keep with the style we use in this
post):

julia> dd = Dirichlet(length(Δpayoff), 1);
julia> boot = [dot(Δpayoff, rand(rng, dd)) for _ in 1:10_000];

julia> quantile(boot, 0:0.25:1)
5-element Array{Float64,1}:
 23.427965439713827
 24.64825057737705
 24.911299035010433
 25.17170629935465
 26.513561155964837

and we see that, as expected, the distribution of the difference is
significantly greater than zero (we could probably easily detect this
significance even with a much smaller number of runs than 20,000).
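
If you want to verify this claim, here is a minimal sketch (the names Δsub,
dd_sub, and boot_sub are introduced here; output not shown) that re-runs the
bootstrap on the first 1,000 runs only:

Δsub = Δpayoff[1:1_000]
dd_sub = Dirichlet(length(Δsub), 1)
boot_sub = [dot(Δsub, rand(rng, dd_sub)) for _ in 1:10_000]
quantile(boot_sub, [0.025, 0.975]) # a 95% interval; it should still lie above zero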

Conclusion

That was a long post, so let me finish at this point. I hope you enjoyed it and
could also appreciate the data wrangling capabilities that the JuliaData
ecosystem offers you. I also encourage you to run some more experiments on your
own to check how the considered A/B testing strategies behave under different
parameterizations.
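
For example, here is a minimal sketch (the name results_small is introduced
here) of re-running the comparison with 3 alternatives and a budget of 100
samples per run:

results_small = DataFrame()
foreach(runid -> coupled_run!(3, 100, rng, results_small, runid), 1:1_000)
@pipe groupby(results_small, :alg) |> combine(_, [:correct, :payoff] .=> mean)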

And the usual caveat – performance geeks (are there any in the Julia
community?) can find several spots in my code that I left sub-optimal to keep
things simple. Feel free to experiment with improving the execution speed of my
examples.