Tag Archives: julialang

Union Find Dicts: Dictionaries Keyed on Equivalence Classes

By: Philip Zucker

Re-posted from: https:/www.philipzucker.com/union-find-dict/

Dictionary/map data structures usually have a set-like counter part. You get a Set data structure over the keys of a dict by just storing some dummy value such as ().

  • Inserting element k in the set is dictionary insertion d[k] = ()
  • Key lookup is the same thing as check set membership. haskey(d,k)
  • Set union is dictionary merge.
  • Enumerating elements of the set is enumerating the keys of the dict.

Not every set data structure is derived this way. BitSets for example are not really a dict you ignore the value in.
Nevertheless, hash and binary tree based Maps and Sets have this similarity to each other. They are largely the same except perhaps the set data structure is specialized to not even bothering to hold the default value.

The union find data structure is a way of storing disjoint unions of sets in such a way that the union operation remains fast. There is a pretty good explanation here https://en.wikipedia.org/wiki/Disjoint-set_data_structure

I want a version of union find that is a dictionary that stores values keyed on these equivalence classes. I mention why at the end (spoiler: Egraphs).

I’m kind of surprised this abstract data structure interface is not commonly available. I couldn’t find anything when I looked. It is only a very small addition to the regular union find.

Basic Union Find Without Path Compression

You can find the DataStructures.jl version of the union find data structure with path compression here https://github.com/JuliaCollections/DataStructures.jl/blob/master/src/disjoint_set.jl. It’s pretty readable, which is a nice aspect of Julia.

struct IntDisjointSet
    parents::Vector{Int64}
end

This stores and array of indices of the parents of the nodes.

IntDisjointSet() = IntDisjointSet(Vector{Int64}[])
Base.length(x::IntDisjointSet) = length(x.parents)

Here we use a little trick, perhaps ill advised. When an item j is a root node, the storage at
x.parents[j] is redundant. One choice is to set x.parents[j] = j in this case. This is what the DataStructures.jl version does, and it does make some operations clean. Another is to use this space to store extra information. I used negatives numbers at the parent node pointer to represent the size of the tree. This allows union to easily place the smaller tree under the larger one. Hence creating a set of size 1 has a -1 put in its parents array here.

Base.push!(x::IntDisjointSet) = push!(x.parents, -1)

Here is a simple find_root loop. It traverses up the parent array. Here is a basic version of it where I stripped out path compression. Sacrilege you say? Perhaps. Path compression is kind of the cleverest and most confusing part.

function find_root(x::IntDisjointSet, i::Int64)
    while x.parents[i] >= 0
        i = x.parents[i]
    end
    return i
end

The union operation is almost interesting. It sets the parent of the smaller tree to the larger one. This has a tendency to keep the trees shallow. You could also use the rank of the subtree. In the absence of path compression, the rank is the height.

function Base.union!(x::IntDisjointSet, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.parents[pi] = pj # set parent of pi to pj
    end
    return pj
end

I have some suspicion for a use case that normalizing the parent pointers all at once in a batch might be kind of nice. Then further lookups are no longer mutating and can more easily be done in parallel and without a loop. Depends on your usage pattern.

function normalize!(x::IntDisjointSet)
    for i in length(x)
        pi = find_root(x, i)
        if pi != i
            x.parents[i] = pi
        end
    end
end

# If normalized we don't even need a loop here.
function _find_root_normal(x::IntDisjointSet, i::Int64)
    pi = x.parents[i]
    if pi < 0 # Is `i` a root?
        return i
    else
        return pi
    end
end

A question I had is if it was possible to make a dictionary structure that is keyed on equivalence classes. Yes, it is, With some caveats.

You need to define what you want to happen to the values when two equivalance class keys merge. I think a reasonable requirement is that values held in the are combined using an associative and commutative operation, since you are unlikely to be able to predict these orderings. It is tempting to think that this should be a semilattice (idempotent) but it is not necessary.
If you wanted to use arrays with array concatenation for example, expect the order of the array to be scrambled.

struct IntDisjointMap
    parents::Vector{Int64}
    values::Vector{Any}
    merge_fun
end

IntDisjointMap(merge_fun) = IntDisjointMap(Vector{Int64}[], [], merge_fun)

Most of the previous definitions stay the same. We do have to change the definition of union and we can add new definitions for retrieving and updating the dict.

Base.getindex(x::IntDisjointMap, i::Int64) = x.values[find_root(x,i)]
Base.setindex!(x::IntDisjointMap, v, i::Int64) = x.values[find_root(x,i)] = v


function Base.union!(x::IntDisjointMap, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.values[pj] = x.merge_fun(x.values[pj], x.values[pi])
        x.values[pi] = nothing # clear out unused storage
        x.parents[pi] = pj
    end
    return pj
end

Some usage storing integers in the Map using + as a merge operation

using Test
G = IntDisjointMap(+)
push!(G, 1)
@test G.parents == [-1]
@test G.values == [1]
push!(G,14)
@test G.parents == [-1, -1]
@test G.values == [1, 14]
@test G[1] == 1
@test G[2] == 14
union!(G,1,2)
@test G.parents == [2,-2]
@test G.values == [nothing, 15]
@test G[1] == 15
@test G[2] == 15
push!(G, 4)
@test find_root(G,1) == 2
@test find_root(G,2) == 2
@test find_root(G,3) == 3
union!(G,1,3)
@test G[3] == 19
@test find_root(G,3) == 2
@test find_root(G,1) == 2
@test G.parents == [2,-3,2]

G[3] = 42
@test G[2] == 42

Why?

Well, I think the the IntDisjointMap is a missing obvious intersection of abstract data type interfaces.

I’ve been playing around with EGraphs lately https://www.philipzucker.com/egraph-1/, and they involve a bipartite graph between Equivalence classes and ENodes, which are tern nodes with the children replaced by EClass Ids. A fairly minimal representation of such a structure would be an IntDisjointMap holding ENodes.

struct ENode
    head::Symbol
    args::Vector{Int64}
end

I’m kind of hoping to make a very minimal, perhaps inefficient implementation of the essence of egraphs.

Notes

Structures mapping equivalence classes to terms also show up in unification.

There is no impediment to implementing this interface over a full path compressed union find.
It is also possible to implement this as a wrapper over a normal union find, at the expense of some unnecessary find operations

struct IntDisjointMap
    unionfind::IntDisjointSet
    values::Vector{Int64}
    merge_fun
end

I think that perhaps using Parametric types like so may give a speed boost as Julia specializes on types.

struct IntDisjointMap{V}
    parents::Vector{Int64}
    values::Vector{V}
    merge_fun
end

Even further, putting a tag of the merge_fun in the type might do something good specializing in that function call.

struct IntDisjointMap{V, merge_fun}
    parents::Vector{Int64}
    values::Vector{V}
end

Union Find Dicts: Dictionaries Keyed on Equivalence Classes

By: Philip Zucker

Re-posted from: https://www.philipzucker.com/union-find-dict/

Dictionary/map data structures usually have a set-like counter part. You get a Set data structure over the keys of a dict by just storing some dummy value such as ().

  • Inserting element k in the set is dictionary insertion d[k] = ()
  • Key lookup is the same thing as check set membership. haskey(d,k)
  • Set union is dictionary merge.
  • Enumerating elements of the set is enumerating the keys of the dict.

Not every set data structure is derived this way. BitSets for example are not really a dict you ignore the value in.
Nevertheless, hash and binary tree based Maps and Sets have this similarity to each other. They are largely the same except perhaps the set data structure is specialized to not even bothering to hold the default value.

The union find data structure is a way of storing disjoint unions of sets in such a way that the union operation remains fast. There is a pretty good explanation here https://en.wikipedia.org/wiki/Disjoint-set_data_structure

I want a version of union find that is a dictionary that stores values keyed on these equivalence classes. I mention why at the end (spoiler: Egraphs).

I’m kind of surprised this abstract data structure interface is not commonly available. I couldn’t find anything when I looked. It is only a very small addition to the regular union find.

Basic Union Find Without Path Compression

You can find the DataStructures.jl version of the union find data structure with path compression here https://github.com/JuliaCollections/DataStructures.jl/blob/master/src/disjoint_set.jl. It’s pretty readable, which is a nice aspect of Julia.

struct IntDisjointSet
    parents::Vector{Int64}
end

This stores and array of indices of the parents of the nodes.

IntDisjointSet() = IntDisjointSet(Vector{Int64}[])
Base.length(x::IntDisjointSet) = length(x.parents)

Here we use a little trick, perhaps ill advised. When an item j is a root node, the storage at
x.parents[j] is redundant. One choice is to set x.parents[j] = j in this case. This is what the DataStructures.jl version does, and it does make some operations clean. Another is to use this space to store extra information. I used negatives numbers at the parent node pointer to represent the size of the tree. This allows union to easily place the smaller tree under the larger one. Hence creating a set of size 1 has a -1 put in its parents array here.

Base.push!(x::IntDisjointSet) = push!(x.parents, -1)

Here is a simple find_root loop. It traverses up the parent array. Here is a basic version of it where I stripped out path compression. Sacrilege you say? Perhaps. Path compression is kind of the cleverest and most confusing part.

function find_root(x::IntDisjointSet, i::Int64)
    while x.parents[i] >= 0
        i = x.parents[i]
    end
    return i
end

The union operation is almost interesting. It sets the parent of the smaller tree to the larger one. This has a tendency to keep the trees shallow. You could also use the rank of the subtree. In the absence of path compression, the rank is the height.

function Base.union!(x::IntDisjointSet, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.parents[pi] = pj # set parent of pi to pj
    end
    return pj
end

I have some suspicion for a use case that normalizing the parent pointers all at once in a batch might be kind of nice. Then further lookups are no longer mutating and can more easily be done in parallel and without a loop. Depends on your usage pattern.

function normalize!(x::IntDisjointSet)
    for i in length(x)
        pi = find_root(x, i)
        if pi != i
            x.parents[i] = pi
        end
    end
end

# If normalized we don't even need a loop here.
function _find_root_normal(x::IntDisjointSet, i::Int64)
    pi = x.parents[i]
    if pi < 0 # Is `i` a root?
        return i
    else
        return pi
    end
end

A question I had is if it was possible to make a dictionary structure that is keyed on equivalence classes. Yes, it is, With some caveats.

You need to define what you want to happen to the values when two equivalance class keys merge. I think a reasonable requirement is that values held in the are combined using an associative and commutative operation, since you are unlikely to be able to predict these orderings. It is tempting to think that this should be a semilattice (idempotent) but it is not necessary.
If you wanted to use arrays with array concatenation for example, expect the order of the array to be scrambled.

struct IntDisjointMap
    parents::Vector{Int64}
    values::Vector{Any}
    merge_fun
end

IntDisjointMap(merge_fun) = IntDisjointMap(Vector{Int64}[], [], merge_fun)

Most of the previous definitions stay the same. We do have to change the definition of union and we can add new definitions for retrieving and updating the dict.

Base.getindex(x::IntDisjointMap, i::Int64) = x.values[find_root(x,i)]
Base.setindex!(x::IntDisjointMap, v, i::Int64) = x.values[find_root(x,i)] = v


function Base.union!(x::IntDisjointMap, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.values[pj] = x.merge_fun(x.values[pj], x.values[pi])
        x.values[pi] = nothing # clear out unused storage
        x.parents[pi] = pj
    end
    return pj
end

Some usage storing integers in the Map using + as a merge operation

using Test
G = IntDisjointMap(+)
push!(G, 1)
@test G.parents == [-1]
@test G.values == [1]
push!(G,14)
@test G.parents == [-1, -1]
@test G.values == [1, 14]
@test G[1] == 1
@test G[2] == 14
union!(G,1,2)
@test G.parents == [2,-2]
@test G.values == [nothing, 15]
@test G[1] == 15
@test G[2] == 15
push!(G, 4)
@test find_root(G,1) == 2
@test find_root(G,2) == 2
@test find_root(G,3) == 3
union!(G,1,3)
@test G[3] == 19
@test find_root(G,3) == 2
@test find_root(G,1) == 2
@test G.parents == [2,-3,2]

G[3] = 42
@test G[2] == 42

Why?

Well, I think the the IntDisjointMap is a missing obvious intersection of abstract data type interfaces.

I’ve been playing around with EGraphs lately https://www.philipzucker.com/egraph-1/, and they involve a bipartite graph between Equivalence classes and ENodes, which are tern nodes with the children replaced by EClass Ids. A fairly minimal representation of such a structure would be an IntDisjointMap holding ENodes.

struct ENode
    head::Symbol
    args::Vector{Int64}
end

I’m kind of hoping to make a very minimal, perhaps inefficient implementation of the essence of egraphs.

Notes

Structures mapping equivalence classes to terms also show up in unification.

There is no impediment to implementing this interface over a full path compressed union find.
It is also possible to implement this as a wrapper over a normal union find, at the expense of some unnecessary find operations

struct IntDisjointMap
    unionfind::IntDisjointSet
    values::Vector{Int64}
    merge_fun
end

I think that perhaps using Parametric types like so may give a speed boost as Julia specializes on types.

struct IntDisjointMap{V}
    parents::Vector{Int64}
    values::Vector{V}
    merge_fun
end

Even further, putting a tag of the merge_fun in the type might do something good specializing in that function call.

struct IntDisjointMap{V, merge_fun}
    parents::Vector{Int64}
    values::Vector{V}
end

Would Int32 give savings?

Is there a category of Equivalence Maps? Maybe.

How to do this in prolog? A dict keyed on unification variables. Would attributed variables suffice?

Construction vs conversion in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/26/conversion.html

Introduction

In the recent 0.22.6 release of the DataFrames.jl package we have
deprecated a bunch of convert methods for objects defined in it.

In this post I want to comment on when convert is used and when it is
appropriate to define a convert method for custom types.

The history

DataFrames.jl is an old package, with its 0.0.0 release made on Feb 14, 2013.

During this time the Julia language has evolved. Even in Julia 0.6 convert
method was a default fallback for construction as you can see here:

However, in some cases you could consider adding methods to Base.convert
instead of defining a constructor, because Julia falls back to calling
convert() if no matching constructor is found. For example, if no constructor
T(args...) = ... exists Base.convert(::Type{T}, args...) = ... is called.

This is a past of pre-1.0 Julia design (if you are interested in some more
history you can start with this issue). The things have changed now,
but until DataFrames.jl 0.22.6 release we had some convert methods that were
inspired by this rule, and we even had constructors that fell back to convert
explicitly.

So what is the current rule for using convert?

The present

Fortunately, as of Julia 1.6 the manual is quite clear that
convert should be only defined in cases when it is safe to perform a
conversion even when the user does not ask for it explicitly. Before we move
forward let us see it in action:

julia> struct Example
       a::Int
       b::Complex{Int}
       end

julia> Example(1.0, 1.0)
Example(1, 1 + 0im)

You can see that 1.0 (of type Float64) got implicitly converted to 1 (of
type Int). Similarly 1.0 got converted to 1 + 0im (Complex{Int}).
These conversions are performed implicitly to ensure programmer convenience.

However, the following fails:

julia> example(a::Int, b::Complex{Int}) = (a, b)
example (generic function with 1 method)

julia> example(1.0, 1.0)
ERROR: MethodError: no method matching example(::Float64, ::Float64)

So when does the implicit conversion happen? Again the maunal explains it here:

  • Assigning to an array converts to the array’s element type.
  • Assigning to a field of an object converts to the declared type of the field.
  • Constructing an object with new converts to the object’s declared field types.
  • Assigning to a variable with a declared type (e.g. local x::T) converts to that type.
  • A function with a declared return type converts its return value to that type.
  • Passing a value to ccall converts it to the corresponding argument type.

So a short, simplifying, rule is that the conversion happens when you want to make
an assignment to a value that has a pre-specified type.

Let us see it at work:

julia> d1 = Set{Int}()
Set{Int64}()

julia> push!(d1, 1.0)
Set{Int64} with 1 element:
  1

julia> d2 = BitSet()
BitSet([])

julia> push!(d2, 1.0)
ERROR: MethodError: no method matching push!(::BitSet, ::Float64)

You can push 1.0 to Set{Int}, because its push! method makes no
restriction on the type of value that is pushed to the Set{Int}. Then,
internally, Set{Int} converts the passed value to Int before storing it. You
can make sure that this happens by writing:

julia> push!(d1, "a")
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...
Stacktrace:
 [1] setindex!(h::Dict{Int64, Nothing}, v0::Nothing, key0::String)
   @ Base ./dict.jl:374
 [2] push!(s::Set{Int64}, x::String)
   @ Base ./set.jl:57
 [3] top-level scope
   @ REPL[32]:1

and if you check the setindex! implementation you will see the conversion:

function setindex!(h::Dict{K,V}, v0, key0) where V where K
    key = convert(K, key0)
    if !isequal(key, key0)
        throw(ArgumentError("$(limitrepr(key0)) is not a valid key for type $K"))
    end
    setindex!(h, v0, key)
end

So why you are not allowed to push! a float to a BitSet? The reason is that
its push! method is defined more restrictively like this:

@inline push!(s::BitSet, n::Integer) = _setint!(s, _check_bitset_bounds(n), true)

and since no conversion happens when passing parameters to functions an error is
thrown.

Conclusion

To wrap up let us go back to the manual here:

since convert can be called implicitly, its methods are restricted to cases
that are considered “safe” or “unsurprising”. convert will only convert between
types that represent the same basic kind of thing (e.g. different
representations of numbers, or different string encodings). It is also usually
lossless; converting a value to a different type and back again should result in
the exact same value.

In essence this means that, unless you are developing a low level infrastructure
package (like defining new numeric type), you are not likely to need to define
convert methods at all. Just define proper constructors for your types.

In DataFrames.jl we left only three conversions.

The first is from SubDataFrame to DataFrame. It is clearly a lose-less
conversion that sometimes might be useful (e.g. you have a container of
DataFrame objects and want to be able to implicitly add SubDataFrame to it).

The second and third conversions are from DataFrameRow and GroupKey to
NamedTuple. The reason behind allowing them is that we try to make both these
types to work like NamedTuple (except for the fact that they are views).
Again in this case the conversion is lose-less.

As a final note — I think Julia 1.6 manual is really a great resource
(half-way of writing this post I considered to stop it as the manual
is so clear about the rules).