Accidentally Quadratic with DataFrames in Julia

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2021/04/21/Accidentally-Quadratic.html

using DataFrames, DataFramesMeta
using BenchmarkTools
using Statistics # median is from Statistics; BenchmarkTools adds methods for its trial results
using Plots

A post recently did the rounds showing that GTA had a bad
implementation of an algorithm that scaled quadratically (How I cut GTA Online loading times by 70%),
which echoed a Bruce Dawson article about how common it is for
quadratic algorithms to end up in production. Quadratic
algorithms are fast enough when testing, but once in production the
performance issues suddenly catch up with you and you're sat with a
very inefficient process.

Well that happened to me.

Every month I recalibrate a model using the latest data pulled from a
database. I take this raw data, generate some features, fit a model
and save down the results. One of those operations is to match the
IDs in the old data against the new data to work out which trades need
new features generated.

Basically, imagine I have a dataframe, and I want to find all the rows
that match some values. In this mock example, column B contains the
IDs and I’ve some new IDs that I want to filter the dataframe for.

I’ll create a large mock dataframe as an example.

N = 1000000
df = DataFrame(A = rand(N), B = 1:N);
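Just to illustrate the shape of the data (my own peek, not in the original post): column A holds a random value and column B plays the role of the ID.

first(df, 3)
# 3×2 DataFrame: a Float64 column A and an integer column B starting 1, 2, 3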

My slow implementation used the DataFramesMeta package and the broadcasted in function to check whether each value was in the new IDs. This worked without a hitch last month, but then all of a sudden it seemed incredibly slow. This was strange as I hadn't changed anything. I did the usual reboot of the machine and started afresh, but it was still painfully slow.

function slow(df, ids)
  @where(df, in(ids).(:B))
end
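For reference, that @where call is doing something roughly equivalent to the plain indexing below (a sketch, not the macro's exact expansion): build a Boolean mask by checking each element of column B against the ids vector, then keep the rows where it is true.

# One in-check per row, and each check is a linear scan of ids
df[in(ids).(df.B), :]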

After some quick profiling, I found that the slow function was the
bottleneck. So I refactored it to remove the DataFramesMeta dependency and just used base functions.

function quick(df, ids)
  df[findall(in(ids), df.B), :]
end
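As far as I understand it, the reason this helps is that in against a plain Vector is a linear scan, so the broadcasted version pays that scan for every one of the million rows, whereas findall(in(ids), v) hits a specialised method in Base that builds a set-like structure internally. If you prefer to keep the broadcasted style, the same idea can be applied by hand; a minimal sketch, where idSet is my own name and not from the post:

idSet = Set(ids)          # hash-based set: each membership check is effectively O(1)
df[in(idSet).(df.B), :]   # same broadcasted filter as before, without the per-row linear scan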

Thankfully this solved the issue: it was much quicker and allowed my
process to complete without a hitch. This got me thinking, how slow was my original implementation and how different is the new version? So onto the benchmarking.

Using the BenchmarkTools.jl package I can run multiple iterations of each function across larger and larger samples of IDs.

nSamps = [1, 10, 100, 1000, 10000, 100000, 1000000]
resQuick = zeros(length(nSamps))
resSlow = zeros(length(nSamps))

for (i, n) in enumerate(nSamps)
  ids = collect(1:n) 
    
  qb = @benchmark quick($df, $ids)
  sb = @benchmark slow($df, $ids)
    
  resQuick[i] = median(qb).time
  resSlow[i] = median(sb).time
end

I’ve made sure that I compiled the original function before starting
this benchmarking too.
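One way to do that (a sketch of the idea, not the post's actual code) is a throwaway call to each function before the timing loop:

# Run each function once on a tiny input so compilation isn't part of the measurements
slow(df, [1]);
quick(df, [1]);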

plot(log.(nSamps), log.(resQuick), label="Quick", legend=:topleft, xlabel="log(Number of IDs selected)", ylabel="log(Time)")
plot!(log.(nSamps), log.(resSlow), label="Slow")

(Figure: log-log plot of median run time against the number of IDs selected, for the quick and slow implementations.)

The difference in performance is remarkable. In this log-log plot the
quick function is pretty much flat, with just a slight increase towards
the largest sizes, whereas the slow version is always increasing. When we model the slow implementation's performance as a power law we find that it is not quite quadratic, but more importantly, we can see that the faster method is pretty much constant, so a much more scalable solution.

using GLM
lm(@formula(LogTime ~ LogSamps),
     DataFrame(LogSamps = log.(nSamps), LogTime=log.(resSlow)))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

LogTime ~ 1 + LogSamps

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  15.1134     0.275726   54.81    <1e-07  14.4046    15.8221
LogSamps      0.885168   0.0332117  26.65    <1e-05   0.799794   0.970541
─────────────────────────────────────────────────────────────────────────
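Reading the fit (my interpretation): a slope of roughly 0.89 means the run time grows close to linearly with the number of IDs selected while the table stays fixed at a million rows, and since each broadcasted in check scans the ID vector for every row, the total work grows like rows × IDs, which is the classic accidentally-quadratic pattern once both grow with the data. The coefficients also give a rough predictive formula for the slow version; predictedSlowTime is a hypothetical helper and the times are in nanoseconds.

# exp(intercept) * n^slope, taken from the regression table above
predictedSlowTime(n) = exp(15.1134) * n^0.885168

predictedSlowTime(1000)   # ≈ 1.7e9 ns, i.e. roughly 1.7 seconds to filter for 1,000 IDs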

When I first came across this issue I was ready to book out my week to rewrite the data functions to iron out any of the slowdowns, so I was pretty happy that rewriting that one function made everything manageable.