Tag Archives: julialang

A Simplified E-graph Implementation

By: Philip Zucker

Re-posted from: https:/www.philipzucker.com/a-simplified-egraph/

I’ve been spending some time mulling over e-graphs. I think I have it kind of pared down until it’s fairly simple.
This implementation is probably not high performance as the main trick is removing a genuinely useful indexing structure. Still, this implementation is small enough that I can kind of keep it in my head. It has become rather cute.

For a user ready implementation of egraphs, see Metatheory https://github.com/0x0f0f0f/Metatheory.jl or egg https://egraphs-good.github.io/

In a computer, terms like (a + b) * c are typically stored as trees. The EGraph converts this tree structure to one with an extra layer of indirection. Instead of nodes directly connecting to nodes, they connect first through a different kind of node labelled by an integer. These integer nodes are called eclasses and the modified original nodes are called enodes.

The EGraph is a bipartite directed graph between eclasses and enodes. ENodes belong uniquely to eclasses. Once we start merging eclasses, eclasses will point to more than one enode. The graph may also develop loops allowing representation in a sense of infinite terms.

Last time https://www.philipzucker.com/union-find-dict/, I built the abstraction of a union-find dict. This data structure allows you to retrieve information keyed on an equivalence class while still supporting the union operation. Given this piece, it is simple to define the two data structures.

@auto_hash_equals struct ENode
    head::Symbol
    args::Vector{Int64}
end

struct EGraph
    eclass::IntDisjointMap
    memo::Dict{ENode, Int64}
end

EGraph() = EGraph( IntDisjointMap(union)  , Dict{ENode,Int64}())

The eclass field is a union-find dictionary from equivalence classes to vectors of ENodes. We tell the underlying IntDisjointMap that upon a union! of equivalence classes, we will union the enode vectors in the codomain of the IntDisjointMap to each other.

The memo table is not strictly necessary, but it gives us a good way to lookup which eclass an enode belongs to. Otherwise we’d have to brute force search the entire IntDisjointMap to find ENodes when we want them.

ENodes hold references to EClass ids, which unfortunately can go stale. We can canonize ENodes to use the freshest equivalence class indices.

canonize(G::EGraph, f::ENode) = ENode(f.head, [find_root(G.eclass, a) for a in f.args])

To add an ENode to the egraph first we canonize it, then we check if it already is in the memo table, and if not we actually push it in the IntDisjointMap and update the memo table.

function addenode!(G::EGraph, f::ENode)
    f = canonize(G,f)
    if haskey(G.memo, f)
        return G.memo[f]
    else
        id = push!(G.eclass, [f])
        G.memo[f] = id
        return id
    end
end

#convenience functions for pushing Julia Expr
addexpr!(G::EGraph, f::Symbol) = addenode!(G, ENode(f,[]))
function addexpr!(G::EGraph, f::Expr)
    @assert f.head == :call
    addenode!(G,  ENode(f.args[1], [ addexpr!(G,a) for a in f.args[2:end] ]  ))
end

When we assert an equality to an egraph, we take the union! of the two corresponding eclasses. We union! the underlying IntDisjointMap, then we recanonize all the held ENodes in that eclass and update the memo table.

function Base.union!(G::EGraph, f::Int64, g::Int64)
    id = union!(G.eclass, f, g)
    eclass = ENode[]
    for enode in G.eclass[id] 
        delete!(G.memo, enode) # should we even bother with delete?
        enode = canonize(G, enode) # should canonize return whether it did anything or not?
        G.memo[enode] = id
        push!(eclass, enode)
    end
    G.eclass[id] = eclass
end

That’s kind of it.

The big thing we haven’t discussed is calculating congruence closure. In my original presentation, this was a whole ordeal and the reason why we needed to maintain parent pointers from eclasses to enodes. This was very confusing.

Instead we can just find congruences via a brute force sweep over the egraph. This is inefficient compared to having likely candidates pointed out to us by the parent pointers. However, during naive ematching we are sweeping over the egraph anyway to find possible rewrite rules applications. This approach makes congruence closure feel rather similar to the other rewrite rules in the sense. There may be some utility in not considering congruence closure as a truly intrinsic part of the egraph. Perhaps you could use it for systems where congruence does not strictly hold?

An unfortunate thing is that congruences needs to be applied in worst case a number of time proportional to the depth of the egraph, as it only propagates congruences one layer at a time.

How it works: after a union! operation there are non canonical ENodes held in both memo and eclass. These noncanonical ENodes are exactly those who have arguments that include the eclass that was just turned into a child of another eclass. These are also exactly those ENodes that are candidates for congruence closure propagation. We can detect them during the sweep by canonization.

This expensive congruence sweep forgives more sins than the more efficient one. Something that can happen is that we try to addexpr! an ENode that is one of the stale ones, in other words it should be in the memo table but is not. This will falsely create a new eclass for this ENode. However, the congruence closure sweep will find this equivalence on the next pass.


# I forgot to include this IntDisjointMap iterator function in my last post.
# Conceptually it belongs there.
function Base.iterate(x::IntDisjointMap, state=1)
    while state <= length(x.parents)
        if x.parents[state] < 0
            return ((state, x.values[state]) , state + 1)
        end
        state += 1
    end
    return nothing
end

# returns a list of tuples of found congruences
function congruences(G::EGraph)
    buf = Tuple{Int64,Int64}[] 
    for (id1, eclass) in G.eclass #alternatively iterate over memo
        for enode in eclass
            cnode = canonize(G,enode)
            if haskey(G.memo, cnode)
                id2 = G.memo[cnode]
                if id1 != id2
                    push!(buf, (id1,id2))
                end
            end
        end
    end
    return buf
end

# propagate all congruences
function propagate_congruence(G::EGraph)
    cong = congruences(G)
    while length(cong) > 0
        for (i,j) in cong
            union!(G,i,j)
        end
        cong = congruences(G)
    end
end

Bits and Bobbles

In principle I think this formulation makes it easier to parallelize congruence finding alongside rewrite rule matching. The rewriting process becomes a swapsies between finding tuples to union and actually applying them.

Everything in this post could probably be tuned up to be more efficient.

To add analyses, you want to store a compound structure in the IntDisjointMap. Tuple{Vector{ENode}, Analysis) The merge operation then does both enode merge and analysis merge.

Possibly changing enodes to be binary might be nice. One can compile arbitrary arity into this. Then everything fits in place in the appropriate arrays, removing significant indirection

Uses of egraphs:

My other implementation

Some tests

using EGraphs
using Test

@testset "Basic EGraph" begin
G = EGraph()
a = addenode!(G, ENode(:a, []))
b = addenode!(G, ENode(:b, []))
#println(G)
union!(G, a, b)
#println(G)
@test addenode!(G, ENode(:a, [])) == 2
@test addenode!(G, ENode(:c, [])) == 3
@test addenode!(G, ENode(:f, [a])) == 4
union!(G, 3, 4)

#= println(G)
for (k,v) in G.eclass
    println(k,v)
end =#
G = EGraph()
a = addenode!(G, ENode(:a, []))
b = addenode!(G, ENode(:b, []))
c = addenode!(G, ENode(:c, []))
union!(G, a, b)
fa = addenode!(G, ENode(:f, [a])) 
fb = addenode!(G, ENode(:f, [b])) 
fc = addenode!(G, ENode(:f, [c])) 

ffa = addenode!(G, ENode(:f, [fa])) 
ffb = addenode!(G, ENode(:f, [fb])) 

@test congruences(G) == [(fa,fb)]


for (x,y) in congruences(G)
    union!(G,x,y)
end

@test congruences(G) == [(ffa,ffb)]

union!(G, a, c)

@test congruences(G) == [(fc,fb), (ffa,ffb)]

for (x,y) in congruences(G)
    union!(G,x,y)
end

@test congruences(G) == []


G = EGraph()
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
f2a = addexpr!(G, :( f(f(a))  ))
@test length(G.eclass) == 6
union!(G , f5a, f2a)
@test find_root(G,f5a) == find_root(G,f2a)
@test length(G.eclass) == 5
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
f2a = addexpr!(G, :( f(f(f(a)))  ))
@test length(G.eclass) == 5

G = EGraph()
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
fa = addexpr!(G, :( f(a)  ))
a = addexpr!(G, :a)
@test length(G.eclass) == 6
union!(G , fa, a)
@test find_root(G,fa) == find_root(G,a)

propagate_congruence(G)
@test length(G.eclass) == 1

G = EGraph()
ffa = addexpr!(G, :( f(f(a))  ))
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
a = addexpr!(G, :a)
@test length(G.eclass) == 6
union!(G , ffa, a)
@test find_root(G,ffa) == find_root(G,a)

propagate_congruence(G)
@test length(G.eclass) == 2



end

A Simplified E-graph Implementation

By: Philip Zucker

Re-posted from: https://www.philipzucker.com/a-simplified-egraph/

I’ve been spending some time mulling over e-graphs. I think I have it kind of pared down until it’s fairly simple.
This implementation is probably not high performance as the main trick is removing a genuinely useful indexing structure. Still, this implementation is small enough that I can kind of keep it in my head. It has become rather cute.

For a user ready implementation of egraphs, see Metatheory https://github.com/0x0f0f0f/Metatheory.jl or egg https://egraphs-good.github.io/. For more motivation, see the egg paper or my first post https://www.philipzucker.com/egraph-1/

In a computer, terms like (a + b) * c are typically stored as trees. The EGraph converts this tree structure to one with an extra layer of indirection. Instead of nodes directly connecting to nodes, they connect first through a different kind of node labelled by an integer. These integer nodes are called eclasses and the modified original nodes are called enodes.

The EGraph is a bipartite directed graph between eclasses and enodes. ENodes belong uniquely to eclasses. Once we start merging eclasses, eclasses will point to more than one enode. The graph may also develop loops allowing representation in a sense of infinite terms.

Last time https://www.philipzucker.com/union-find-dict/, I built the abstraction of a union-find dict. This data structure allows you to retrieve information keyed on an equivalence class while still supporting the union operation. Given this piece, it is simple to define the two data structures.

@auto_hash_equals struct ENode
    head::Symbol
    args::Vector{Int64}
end

struct EGraph
    eclass::IntDisjointMap
    memo::Dict{ENode, Int64}
end

EGraph() = EGraph( IntDisjointMap(union)  , Dict{ENode,Int64}())

The eclass field is a union-find dictionary from equivalence classes to vectors of ENodes. We tell the underlying IntDisjointMap that upon a union! of equivalence classes, we will union the enode vectors in the codomain of the IntDisjointMap to each other.

The memo table is not strictly necessary, but it gives us a good way to lookup which eclass an enode belongs to. Otherwise we’d have to brute force search the entire IntDisjointMap to find ENodes when we want them.

ENodes hold references to EClass ids, which unfortunately can go stale. We can canonize ENodes to use the freshest equivalence class indices.

canonize(G::EGraph, f::ENode) = ENode(f.head, [find_root(G.eclass, a) for a in f.args])

To add an ENode to the egraph first we canonize it, then we check if it already is in the memo table, and if not we actually push it in the IntDisjointMap and update the memo table.

function addenode!(G::EGraph, f::ENode)
    f = canonize(G,f)
    if haskey(G.memo, f)
        return G.memo[f]
    else
        id = push!(G.eclass, [f])
        G.memo[f] = id
        return id
    end
end

#convenience functions for pushing Julia Expr
addexpr!(G::EGraph, f::Symbol) = addenode!(G, ENode(f,[]))
function addexpr!(G::EGraph, f::Expr)
    @assert f.head == :call
    addenode!(G,  ENode(f.args[1], [ addexpr!(G,a) for a in f.args[2:end] ]  ))
end

When we assert an equality to an egraph, we take the union! of the two corresponding eclasses. We union! the underlying IntDisjointMap, then we recanonize all the held ENodes in that eclass and update the memo table.

function Base.union!(G::EGraph, f::Int64, g::Int64)
    id = union!(G.eclass, f, g)
    eclass = ENode[]
    for enode in G.eclass[id] 
        delete!(G.memo, enode) # should we even bother with delete?
        enode = canonize(G, enode) # should canonize return whether it did anything or not?
        G.memo[enode] = id
        push!(eclass, enode)
    end
    G.eclass[id] = eclass
end

That’s kind of it.

The big thing we haven’t discussed is calculating congruence closure. In my original presentation, this was a whole ordeal and the reason why we needed to maintain parent pointers from eclasses to enodes. This was very confusing.

Instead we can just find congruences via a brute force sweep over the egraph. This is inefficient compared to having likely candidates pointed out to us by the parent pointers. However, during naive ematching we are sweeping over the egraph anyway to find possible rewrite rules applications. This approach makes congruence closure feel rather similar to the other rewrite rules in the sense. There may be some utility in not considering congruence closure as a truly intrinsic part of the egraph. Perhaps you could use it for systems where congruence does not strictly hold?

An unfortunate thing is that congruences needs to be applied in worst case a number of time proportional to the depth of the egraph, as it only propagates congruences one layer at a time.

How it works: after a union! operation there are non canonical ENodes held in both memo and eclass. These noncanonical ENodes are exactly those who have arguments that include the eclass that was just turned into a child of another eclass. These are also exactly those ENodes that are candidates for congruence closure propagation. We can detect them during the sweep by canonization.

This expensive congruence sweep forgives more sins than the more efficient one. Something that can happen is that we try to addexpr! an ENode that is one of the stale ones, in other words it should be in the memo table but is not. This will falsely create a new eclass for this ENode. However, the congruence closure sweep will find this equivalence on the next pass.


# I forgot to include this IntDisjointMap iterator function in my last post.
# Conceptually it belongs there.
function Base.iterate(x::IntDisjointMap, state=1)
    while state <= length(x.parents)
        if x.parents[state] < 0
            return ((state, x.values[state]) , state + 1)
        end
        state += 1
    end
    return nothing
end

# returns a list of tuples of found congruences
function congruences(G::EGraph)
    buf = Tuple{Int64,Int64}[] 
    for (id1, eclass) in G.eclass #alternatively iterate over memo
        for enode in eclass
            cnode = canonize(G,enode)
            if haskey(G.memo, cnode)
                id2 = G.memo[cnode]
                if id1 != id2
                    push!(buf, (id1,id2))
                end
            end
        end
    end
    return buf
end

# propagate all congruences
function propagate_congruence(G::EGraph)
    cong = congruences(G)
    while length(cong) > 0
        for (i,j) in cong
            union!(G,i,j)
        end
        cong = congruences(G)
    end
end

Bits and Bobbles

In principle I think this formulation makes it easier to parallelize congruence finding alongside rewrite rule matching. The rewriting process becomes a swapsies between finding tuples to union and actually applying them.

Everything in this post could probably be tuned up to be more efficient.

To add analyses, you want to store a compound structure in the IntDisjointMap. Tuple{Vector{ENode}, Analysis) The merge operation then does both enode merge and analysis merge.

Possibly changing enodes to be binary might be nice. One can compile arbitrary arity into this. Then everything fits in place in the appropriate arrays, removing significant indirection

Uses of egraphs:

My other implementation

Edit: Max Willsey on twitter, an author of egg, says that egg originally took an approach to congruence like the above but found it too slow on larger workloads. It does indeed have a worse asymptotic performance than actually tracking parents and sniping the congruence locations. https://twitter.com/mwillsey/status/1378476707460509698?s=20

Some tests

using EGraphs
using Test

@testset "Basic EGraph" begin
G = EGraph()
a = addenode!(G, ENode(:a, []))
b = addenode!(G, ENode(:b, []))
#println(G)
union!(G, a, b)
#println(G)
@test addenode!(G, ENode(:a, [])) == 2
@test addenode!(G, ENode(:c, [])) == 3
@test addenode!(G, ENode(:f, [a])) == 4
union!(G, 3, 4)

#= println(G)
for (k,v) in G.eclass
    println(k,v)
end =#
G = EGraph()
a = addenode!(G, ENode(:a, []))
b = addenode!(G, ENode(:b, []))
c = addenode!(G, ENode(:c, []))
union!(G, a, b)
fa = addenode!(G, ENode(:f, [a])) 
fb = addenode!(G, ENode(:f, [b])) 
fc = addenode!(G, ENode(:f, [c])) 

ffa = addenode!(G, ENode(:f, [fa])) 
ffb = addenode!(G, ENode(:f, [fb])) 

@test congruences(G) == [(fa,fb)]


for (x,y) in congruences(G)
    union!(G,x,y)
end

@test congruences(G) == [(ffa,ffb)]

union!(G, a, c)

@test congruences(G) == [(fc,fb), (ffa,ffb)]

for (x,y) in congruences(G)
    union!(G,x,y)
end

@test congruences(G) == []


G = EGraph()
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
f2a = addexpr!(G, :( f(f(a))  ))
@test length(G.eclass) == 6
union!(G , f5a, f2a)
@test find_root(G,f5a) == find_root(G,f2a)
@test length(G.eclass) == 5
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
f2a = addexpr!(G, :( f(f(f(a)))  ))
@test length(G.eclass) == 5

G = EGraph()
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
fa = addexpr!(G, :( f(a)  ))
a = addexpr!(G, :a)
@test length(G.eclass) == 6
union!(G , fa, a)
@test find_root(G,fa) == find_root(G,a)

propagate_congruence(G)
@test length(G.eclass) == 1

G = EGraph()
ffa = addexpr!(G, :( f(f(a))  ))
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
a = addexpr!(G, :a)
@test length(G.eclass) == 6
union!(G , ffa, a)
@test find_root(G,ffa) == find_root(G,a)

propagate_congruence(G)
@test length(G.eclass) == 2



end

Poor man’s guide to despecialization

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/04/02/despecialization.html

Introduction

We are just about to release DataFrames.jl 1.0. It will bring many new
features we are excited about. One of the major additions is support of multiple
threading in many core operations. These changes promise performance
improvements when processing large data frames. The cost is that code complexity
has also grown significantly.

Clearly code-base size is an issue for maintainers of the package, but should
end-users care? The answer is that unfortunately yes, as making code more
complex makes it compile longer.

In this post I summarize the experience we have when trying to reduce compile
time latency.

TLDR: if you have functions that are expensive to compile but are not
performance critical perform argument standardization.

In this post I am using Julia 1.6, MethodAnalysis v0.4.4, SnoopCompile v2.6.0,
and DataFrames.jl main branch state at this commit (in this case it
is relevant as changes in the code base will likely affect the results).

Before we start our experiments disable precompilation (to have a clean ground
for comparisons and simplify things; the conclusions I present hold also when
proper precompilation statements are added). To do so comment-out lines 152 and
153 in src/DataFrames.jl file:

#include("other/precompile.jl")
#precompile()

The context

In DataFrames.jl when one performs transformation of data the following two
areas cause challenges related to run-time compilation:

  • users can pass arbitrary functions as transformations;
  • the ouptut of these functions can be arbitrary values (and the logic of
    processing them depends on their type).

Let us have a look at a simple example:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.440215 seconds (8.09 M allocations: 469.718 MiB, 8.16% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.083986 seconds (550.13 k allocations: 33.481 MiB, 99.73% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.364469 seconds (996.17 k allocations: 61.213 MiB, 4.25% gc time, 99.74% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.337301 seconds (983.31 k allocations: 60.455 MiB, 1.91% gc time, 99.72% compilation time)

We can see that in each call we pass a new anonymous function to combine.
Additionally, the return value of this function is a NamedTuple that has a
fresh type each time, as the name of the column changes.

As you can see the compilation time in these examples is quite high.

Let us check how to reduce it. We will concentrate on only one method
_combine_multicol that is defined in line 7 of
src/groupeddataframe/complextransforms.jl. Its signature is:

_combine_multicol(firstres, fun::Base.Callable, gd::GroupedDataFrame,
                  incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})

We start witch checking how many method instances were generated for it in our code:

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
9-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#1#2", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#3#4", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::var"#5#6", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::var"#7#8", ::GroupedDataFrame{DataFrame}, ::Nothing)

As you can see each call like combine(gdf, x -> (c=1,)) generates two new method instances.

Before we move forward let us check how this code would be run in 0.22 release
(precompilation is turned on here so we are not comparing apples to apples,
but if we disabled precompilation the conclusion would be similar):

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  1.412728 seconds (2.69 M allocations: 167.339 MiB, 3.50% gc time, 45.87% compilation time)

julia> @time combine(gdf, x -> (a=1,));
  0.041718 seconds (190.24 k allocations: 11.522 MiB, 20.74% gc time, 99.03% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.209238 seconds (481.99 k allocations: 30.126 MiB, 99.58% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.194280 seconds (468.45 k allocations: 29.367 MiB, 5.33% gc time, 99.53% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
13-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::DataFrame, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})
 MethodInstance for _combine_multicol(::DataFrame, ::Type, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::NamedTuple)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::NamedTuple)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Tuple)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple)
 MethodInstance for _combine_multicol(::NamedTuple, ::Type, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})
 MethodInstance for _combine_multicol(::NamedTuple, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Bool}})

So we can see that DataFrames.jl 0.22 produced even more instances, but since the
code was simpler the compilation time was lower.

Despecialization

The first advice one gets in such cases is to use @nospecialize on function
arguments to avoid excessive specialization. In our case we see that problematic
are firstres and fun arguments. Now exit Julia and change the signature of
the method to:

_combine_multicol(@nospecialize(firstres), @nospecialize(fun::Base.Callable),
                  gd::GroupedDataFrame,
                  incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})

Now start your Julia again and run the code we have checked above:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.238896 seconds (8.10 M allocations: 470.113 MiB, 5.34% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.095581 seconds (550.16 k allocations: 33.482 MiB, 9.94% gc time, 99.75% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.304849 seconds (982.79 k allocations: 60.359 MiB, 2.95% gc time, 99.71% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.297039 seconds (969.93 k allocations: 59.620 MiB, 6.12% gc time, 99.72% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
7-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#1#2", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Function, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Type, ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:a,), Tuple{Int64}}, ::var"#3#4", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:b,), Tuple{Int64}}, ::var"#5#6", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::NamedTuple{(:c,), Tuple{Int64}}, ::var"#7#8", ::GroupedDataFrame{DataFrame}, ::Nothing)
 MethodInstance for _combine_multicol(::Any, ::Union{Function, Type}, ::GroupedDataFrame{DataFrame}, ::Nothing)

As you can see things got slightly better. GC time is distorting the comparison
a bit, but correcting for it we see an improvement. Also we have reduced the
number of method instances from 9 to 7. But wait – we wanted to disable specialization,
and we still get 7 method instances to be compiled. How to reduce it?

Poor man’s despecialization

The approach I found to work reliably is to aggresively perform
argument standardization. A standard trick to be sure that we break
method specialization for some value is to wrap it in Ref{Any}, and then
immediately unwrap.

Let us try it. Now we change our method signature to:

_combine_multicol((firstres,)::Ref{Any}, (fun,)::Ref{Any},
                  gd::GroupedDataFrame,
                  incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})

we also need to change two lines of code where _combine_multicol is called,
namely lines 268:

idx, outcols, nms = _combine_multicol(Ref{Any}(firstres), Ref{Any}(cs_i), gd, nothing)

and 395:

idx, outcols, nms = _combine_multicol(Ref{Any}(firstres), Ref{Any}(fun), gd, incols)

in file src/groupeddataframe/splitapplycombine.jl.

Apply the changes and run our code in a fresh Julia session again:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.262996 seconds (7.70 M allocations: 445.018 MiB, 8.53% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.047010 seconds (299.33 k allocations: 18.038 MiB, 99.52% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.278649 seconds (733.47 k allocations: 45.020 MiB, 2.44% gc time, 99.62% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.245550 seconds (720.62 k allocations: 44.284 MiB, 2.73% gc time, 99.61% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
1-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::Base.RefValue{Any}, ::Base.RefValue{Any}, ::GroupedDataFrame{DataFrame}, ::Nothing)

Now that is much better. We compiled only one method instance for
_combine_multicol and the timings have significantly improved.

Conclusions

Let us now check what timings the above code has after applying such techniques
to all relevant functions in source code, as proposed in this PR.
I still keep precompilation disabled so the comparison is again a bit unfair
in reference to 0.22 release:

julia> using DataFrames

julia> gdf = groupby(DataFrame(a=1), :a);

julia> @time combine(gdf, x -> (a=1,));
  2.478645 seconds (8.43 M allocations: 489.011 MiB, 9.96% gc time)

julia> @time combine(gdf, x -> (a=1,));
  0.005444 seconds (7.43 k allocations: 486.211 KiB, 96.44% compilation time)

julia> @time combine(gdf, x -> (b=1,));
  0.199044 seconds (371.68 k allocations: 22.773 MiB, 99.39% compilation time)

julia> @time combine(gdf, x -> (c=1,));
  0.173116 seconds (358.82 k allocations: 22.033 MiB, 4.05% gc time, 99.50% compilation time)

julia> using MethodAnalysis

julia> methodinstances(DataFrames._combine_multicol)
1-element Vector{Core.MethodInstance}:
 MethodInstance for _combine_multicol(::Base.RefValue{Any}, ::Base.RefValue{Any}, ::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any})

And now we are under 0.22 timings. Also note that the whole compilation cost
is now related to generation of methods related to the new type of the return
value (the NamedTuple issue). Let us check:

julia> using SnoopCompile

julia> t = @snoopi_deep combine(gdf, x -> (d=1,));

julia> sort(SnoopCompile.flatten(t), by = x -> x.exclusive_time)
271-element Vector{SnoopCompileCore.InferenceTiming}:
 InferenceTiming: 0.000019/0.000019 on InferenceFrameInfo for convert(::Type{Int64}, 1::Int64)
 InferenceTiming: 0.000019/0.000019 on InferenceFrameInfo for convert(::Type{Int64}, 0::Int64)
 InferenceTiming: 0.000020/0.000020 on InferenceFrameInfo for convert(::Type{Int64}, 2::Int64)
 ⋮
 InferenceTiming: 0.007164/0.022407 on InferenceFrameInfo for DataFrames._combine_rows_with_first!(::NamedTuple{(:d,), Tuple{Int64}}, ::Tuple{Vector{Int64}}, ::Function, ::GroupedDataFrame{DataFrame}, nothing::Nothing, ::Tuple{Symbol}, Val{true}()::Val{true})
 InferenceTiming: 0.096098/0.096305 on InferenceFrameInfo for map(::Type{Tuple}, ::Tuple{NamedTuple{(:d,), Tuple{Int64}}})
 InferenceTiming: 0.136345/0.279824 on InferenceFrameInfo for Core.Compiler.Timings.ROOT()

and we see that the biggest cost is paid by map function applied to NamedTuple,
which is a function in Julia Base.

In summary the approaches I discuss here are most useful when you expect to get
values of very heterogeneous types as arguments to your functions. In DataFrames.jl
the two most common cases of these situations are:

  • anonymous transformation functions;
  • NamedTuples as produced values from transformations.

If you would have any comments on the best strategies to avoid code
specialization please contact me as reducing compilation latency is one
of the priorities of DataFrames.jl 1.0 release.

Before I finish let me add one thing. If you are interested in true performance
guided tips to reduce compilation latency (not just poor man’s ones I have given
in this post) I highly recommend you to check out this, this, and
this post.