Tag Archives: julialang

Progress on Automated Reasoning for Catlab with Metatheory.jl Egraphs

Re-posted from: https://www.philipzucker.com/metatheory-progress/

Oh cursed equation! How you haunt me!

I’ve been working on mechanizing equational reasoning for categories for a while now with not so much success.

Using ATP like Vampire https://www.philipzucker.com/notes-on-synthesis-and-equation-proving-for-catlab-jl/
Using SMT like Z3 https://www.philipzucker.com/theorem-proving-for-catlab-2-lets-try-z3-this-time-nope

The Egg paper came out and was very compelling to me. It is an implementation with novel twists of egraphs, a useful data structure for equational reasoning. I tried my hand at implementing the egraph data structure here with the purpose of using it for Catlab reasoning in mind.

But Alessandro Cheli has come along and really taken things to the next level with his Metatheory.jl library. The package includes a significantly improved egraph implementation, plus a classical rewriting system. It also has some really interesting ideas in regards to homoiconicity and theory composition. Maybe it’s useful to you. Check it out.

The equational reasoning that occurs in Catlab is compelling because it feels like high school algebra, but describes significantly more powerful systems of reasoning. It also has a beautiful and intuitive diagrammatic interpretation.

Understanding how to take a system like egraph rewriting, which in examples seems at the moment best suited for rewriting ordinary algebraic expressions, and tweak it to work with the more complicated logic of Catlab has been a challenge.

There are a couple problems.

The typing context contains important information. Consider the rule $f \rightarrow id(a) \cdot f $. Looking at only the syntax of this rule, a is invented on the right hand side. This information actually comes from the type of f::hom(a,b)
We need to strip off overloading? $\otimes$ is a symbol used in catlab for both the binary operator on objects and morphisms, as is customary. I guess this might fine? It makes me squeamish.
Some rules require guarding on the type before execution. For most of the rules in Catlab, if the left hand side of the rule is well-typed, then so is the right. This is not the case in general though.
Consider the trick question: Can you apply the interchange law (f ⊗ g) ⋅ (h ⊗ k) => (f ⋅ h) ⊗ (g ⋅ k) to the term (f ⊗ id(a)) ⋅ (id(b) ⊗ k)?
No you can’t. For in my example, I have secretly decided that f actually has type b ⊗ c -> b ⊗ c and k has type c ⊗ a -> c ⊗ a.
The other direction is always fine though. Given (f ⋅ h) ⊗ (g ⋅ k) is well typed, so is (f ⊗ g) ⋅ (h ⊗ k)

I thought way back that the types could be stripped and everything would be fine, like if you strip the types from the simply typed lambda calculus. This is not the case. I think this misconception is due to my inexperience doing hand calculations of this type (which almost no one does. String diagrams rule), plus I think you “code the happy path” when doing it by hand and may not even notice a situation where just shallow pattern matching on the syntax would allow you do do something verboten, since you’re smart enough to not make that dumb mistake.

My initial thinking this time around was to take a type tagging approach https://people.mpi-inf.mpg.de/~jblanche/mono-trans.pdf. This is a standard way of encoding more sophisticated typed logics into simpler typed ones. Type tagging replaces every term ever used with a more complex term that contains the type. In a curious analogy, it is the analog of dynamic typing vs static typing. In dynamically typed languages, all values carry with them tags describing which type they are at runtime.

As an example, the identity id(a) could be tagged as type(id(type(a,ob)), hom(typ(a,ob), type(a,ob))) or as hom(id(ob(a)), ob(a), ob(a)). You can see how verbose this is. It is not ideal both because it is unacceptable to write by hand, and because this extra structure is just more junk to choke your solver. But you can see that now the necessary types to fill in missing pieces of the right hand side is always available.

I tried this through some arcane metaprogramming methods involving Julog and it seemed too slow. Alessandro and James Fairbanks have a more elegant looking version of this here.

It feels insane that you need all this overhead when it is not necessary when working by hand. Many of the rules do not have this rhs invention problem or the typed guarding requirement. For cartesian categories, only about 5 of 30 rules have this condition. $(f \cdot g) \cdot h \rightarrow (f \cdot g) \cdot h$ is totally fine for example.

So here is idea 1, replace any new derived information on the right hand side with “function calls” that find that information. For our example, this becomes f => id(dom(type(f))) ⋅ f. Then we need to internalize the typing rules into the rewrite system

typing_rules = @theory begin
    dom(hom(a,b)) => a
    cod(hom(a,b)) => b
    type(id(a)) => hom(a,a)
    type(f ⋅ g) => hom( dom(type(f)) , cod(type(g)) )
    type(f ⊗ₘ g) => hom( dom(type(f)) ⊗ₒ dom(type(g)) , cod(type(f)) ⊗ₒ cod(type(g)))
    type(a ⊗ₒ b) => :ob
    type(munit()) => :ob
    type(σ(a,b)) => hom(a ⊗ₒ b, b ⊗ₒ a)
    type(⋄(a)) => hom(a, munit())
    type(Δ(a)) => hom(a, a ⊗ₒ a)
    type(pair(f,g)) => hom(dom(type(f)), cod(type(f)) ⊗ₒ cod(type(g)))
    type(proj1(a,b)) => hom(a ⊗ₒ b, a)
    type(proj2(a,b)) => hom(a ⊗ₒ b, b)
end

Idea two is to use Metatheory “guards” to enforce type correctness of the left hand side before a rule can fire. As part of this, the necessary type queries are added to the egraph. Alessandro had this idea of just returning the original expression to enforce a guard, which I think is pretty clever. Here is how I think the guard works for the interchange law. This will be cleaner with an upcoming Metatheory macro.

    (f ⊗ₘ p) ⋅ (g ⊗ₘ q) |>
    # future metatheory macros will clean this up
    begin
        fcod = addexpr!(_egraph, :(cod(type($f))))
        gdom = addexpr!(_egraph, :(dom(type($g))))
        pcod = addexpr!(_egraph, :(cod(type($p))))
        qdom = addexpr!(_egraph, :(dom(type($q))))
        if (fcod == gdom && pcod == qdom)
            :(($f ⋅ $g) ⊗ₘ ($p ⋅ $q))
        else
            :(($f ⊗ₘ $p) ⋅ ($g ⊗ₘ $q))
        end
    end

You can see that even making the query about whether the types are right adds them to the egraph. It may be currently unknown what the types are, but by adding them, eventually the typing rules will fire and this rule will have its chance again. This also isn’t as bad as it may appear performance wise, because the egraph in a sense memoizes these typing calls. See the Fibonacci example by Mason Protter which has non exponential complexity despite the naive implementation https://github.com/0x0f0f0f/Metatheory.jl/blob/master/test/test_fibonacci.jl

It is my belief that both of these points (internalization of the typing rules and guards) can be mechanized to produce a sound rule system for an arbitrary GAT. I could easily be wrong. The rough picture I suggest is that you need to infer by looking at the lhs of each rule what typing constraints are implied. If you do not imply the entire typing context, you need to add as many extra guards to ensure it as needed. For example, the well typed-ness of (f ⊗ₘ p) ⋅ (g ⊗ₘ q) implies $cod(f) \otimes cod(p) = dom(g) \otimes dom(q)$ but this does not imply the condition $cod(f) = dom(g) \land cod(p) = dom(q)$ so they must be added.

EGraphs are confusing to reason about. But 2 points:

I think it is important that the Egraph inductively contains only well typed terms in some sense. This should be definable and checkable in a precise manner.
I think it is important to have a semantic model in mind when working with egraphs. Pure syntax considerations lead you astray. EGraphs are more than just a bag of terms.

Here is a full script to run the derivation pair(proj1(a,b),proj2(a,b)) = id(a ⊗ₒ b). I can’t get it to go end to end without assistance yet. We’ll get there. I believe. The real holy grail is not proofs, but simplifications. This also works, but not yet at the depth I’m hoping for.

Evan Patterson’s explanation of the proof has been absolutely vital https://julialang.zulipchat.com/#narrow/stream/230248-catlab.2Ejl/topic/Automatic.20Theorem.20Proving/near/211396362. Zulip is nice for this.

using Pkg
Pkg.activate("metacat")
Pkg.add(url = "https://github.com/0x0f0f0f/Metatheory.jl.git")
using Metatheory
using Metatheory.EGraphs
using Metatheory.Util

# See this link for the catlab axioms
# https://github.com/AlgebraicJulia/Catlab.jl/blob/ce2fde9c63a8aab65cf2a7697f43cd24e5e00b3a/src/theories/Monoidal.jl#L127

cat_rules = @theory begin
    f ⋅ id(b) => f
    id(a) ⋅ f => f

    #f => f ⋅ id(dom(type(f)))
    #f => id(cod(type(f))) ⋅ f

    a ⊗ₒ munit() => a
    munit() ⊗ₒ a => a

    #a => a ⊗ₒ munit() 
    #a => munit() ⊗ₒ a

    f ⋅ (g ⋅ h) == (f ⋅ g) ⋅ h


end

monoidal_rules = @theory begin
    id(munit()) ⊗ₘ f => f
    f ⊗ₘ id(munit()) => f
    a ⊗ₒ (b ⊗ₒ c) == (a ⊗ₒ b) ⊗ₒ c
    f ⊗ₘ (h ⊗ₘ j) == (f ⊗ₘ h) ⊗ₘ j
    id(a ⊗ₒ b) == id(a) ⊗ₘ id(b)

    (f ⊗ₘ p) ⋅ (g ⊗ₘ q) |>
    # future metatheory macros will clean this up
    begin
        fcod = addexpr!(_egraph, :(cod(type($f))))
        gdom = addexpr!(_egraph, :(dom(type($g))))
        pcod = addexpr!(_egraph, :(cod(type($p))))
        qdom = addexpr!(_egraph, :(dom(type($q))))
        if (fcod == gdom && pcod == qdom)
            :(($f ⋅ $g) ⊗ₘ ($p ⋅ $q))
        else
            :(($f ⊗ₘ $p) ⋅ ($g ⊗ₘ $q))
        end
    end

    (f ⋅ g) ⊗ₘ (p ⋅ q) => (f ⊗ₘ p) ⋅ (g ⊗ₘ q)
end



sym_rules = @theory begin
    σ(a, b) ⋅ σ(b, a) == id(a ⊗ₒ b)
    (σ(a, b) ⊗ₘ id(c)) ⋅ (id(b) ⊗ₘ σ(a, c)) == σ(a, b ⊗ₒ c)
    (id(a) ⊗ₘ σ(b, c)) ⋅ (σ(a, c) ⊗ₘ id(b)) == σ(a ⊗ₒ b, c)

    (f ⊗ₘ h) ⋅ σ(a, b) |> begin
        fcod = addexpr!(_egraph, :(cod(type($f)))).id
        hcod = addexpr!(_egraph, :(cod(type($h)))).id
        if fcod == a && hcod == b
            :(σ(dom(type($f)), dom(type($h))) ⋅ ($h ⊗ₘ $f))
        else
            :(($f ⊗ₘ $h) ⋅ σ($a, $b))
        end
    end


    σ(c, d) ⋅ (h ⊗ₘ f) |> begin
        fdom = addexpr!(_egraph, :(dom(type($f)))).id
        hdom = addexpr!(_egraph, :(dom(type($h)))).id
        if fdom == c && hdom == d
            :(($f ⊗ₘ $h) ⋅ σ(cod(type($f)), cod(type($h))))
        else
            :(σ($c, $d) ⋅ ($h ⊗ₘ $f))
        end
    end

    # these rules arer not catlab
    σ(a, munit()) == id(a)
    σ(munit(), a) == id(a)
    σ(munit(), munit()) => id(munit() ⊗ₒ munit())

end



diag_rules = @theory begin

    Δ(a) ⋅ (⋄(a) ⊗ₘ id(a)) == id(a)
    Δ(a) ⋅ (id(a) ⊗ₘ ⋄(a)) == id(a)
    Δ(a) ⋅ σ(a, a) == Δ(a)

    (Δ(a) ⊗ₘ Δ(b)) ⋅ (id(a) ⊗ₘ σ(a, b) ⊗ₘ id(b)) == Δ(a ⊗ₒ b)


    Δ(a) ⋅ (Δ(a) ⊗ₘ id(a)) == Δ(a) ⋅ (id(a) ⊗ₘ Δ(a))
    ⋄(a ⊗ₒ b) == ⋄(a) ⊗ₘ ⋄(b)

    Δ(munit()) == id(munit())
    ⋄(munit()) == id(munit())
end


cart_rules = @theory begin

    Δ(a) ⋅ (f ⊗ₘ k) |> begin
        a_id = find(_egraph, a)
        if (
            addexpr!(_egraph, :(dom(type($f)))).id == a_id &&
            addexpr!(_egraph, :(dom(type($k)))).id == a_id
        )
            :(pair($f, $k))
        else
            :(Δ($a) ⋅ ($f ⊗ₘ $k))
        end
    end


    pair(f, k) => Δ(dom(type(f))) ⋅ (f ⊗ₘ k)
    proj1(a, b) == id(a) ⊗ₘ ⋄(b)
    proj2(a, b) == ⋄(a) ⊗ₘ id(b)
    f ⋅ ⋄(b) => ⋄(dom(type(f)))
    # Has to invent f. Hard to fix.
    # ⋄(b) => f ⋅ ⋄(b)  

    f ⋅ Δ(b) => Δ(dom(type(f))) ⋅ (f ⊗ₘ f)
    Δ(a) ⋅ (f ⊗ₘ f) => f ⋅ Δ(cod(type(f)))
end





typing_rules = @theory begin
    dom(hom(a, b)) => a
    cod(hom(a, b)) => b
    type(id(a)) => hom(a, a)
    type(f ⋅ g) => hom(dom(type(f)), cod(type(g)))
    type(f ⊗ₘ g) => hom(dom(type(f)) ⊗ₒ dom(type(g)), cod(type(f)) ⊗ₒ cod(type(g)))
    type(a ⊗ₒ b) => :ob
    type(munit()) => :ob
    type(σ(a, b)) => hom(a ⊗ₒ b, b ⊗ₒ a)
    type(⋄(a)) => hom(a, munit())
    type(Δ(a)) => hom(a, a ⊗ₒ a)
    type(pair(f, g)) => hom(dom(type(f)), cod(type(f)) ⊗ₒ cod(type(g)))
    type(proj1(a, b)) => hom(a ⊗ₒ b, a)
    type(proj2(a, b)) => hom(a ⊗ₒ b, b)
end


rules = typing_rules ∪ cat_rules ∪ monoidal_rules ∪ sym_rules ∪ diag_rules ∪ cart_rules ∪ typing_rules


# A goofy little helper macro
# Taking inspiration from Lean/Dafny/Agda
macro calc(e...)
    theory = eval(e[1])
    e = rmlines(e[2])
    @assert e.head == :block
    for (a, b) in zip(e.args[1:end-1], e.args[2:end])
        println(a, " =? ", b)
        @time println(areequal(theory, a, b; timeout = 40))
    end
end

# Get the Julia motor hummin'
@calc rules begin

    ((⋄(a) ⊗ₘ ⋄(b)) ⋅ σ(munit(), munit()))
    (σ(a, b) ⋅ (⋄(b) ⊗ₘ ⋄(a)))

end 


@calc rules begin
    id(a ⊗ₒ b)
    id(a) ⊗ₘ id(b)
    (Δ(a) ⋅ (id(a) ⊗ₘ ⋄(a))) ⊗ₘ id(b)
    (Δ(a) ⋅ (id(a) ⊗ₘ ⋄(a))) ⊗ₘ (Δ(b) ⋅ (⋄(b) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ ((id(a) ⊗ₘ ⋄(a)) ⊗ₘ (⋄(b) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ (id(a) ⊗ₘ (⋄(a) ⊗ₘ ⋄(b)) ⊗ₘ id(b))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ (id(a) ⊗ₘ ((⋄(a) ⊗ₘ ⋄(b)) ⋅ σ(munit(), munit())) ⊗ₘ id(b))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ ((id(a) ⊗ₘ (σ(a, b) ⋅ (⋄(b) ⊗ₘ ⋄(a))) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ ((id(a) ⊗ₘ (σ(a, b) ⋅ (⋄(b) ⊗ₘ ⋄(a))) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ ((id(a) ⋅ id(a)) ⊗ₘ (σ(a, b) ⋅ (⋄(b) ⊗ₘ ⋄(a))) ⊗ₘ id(b))
    (Δ(a) ⊗ₘ Δ(b)) ⋅ ((id(a) ⊗ₘ σ(a, b) ⊗ₘ id(b)) ⋅ (id(a) ⊗ₘ (⋄(b) ⊗ₘ ⋄(a)) ⊗ₘ id(b)))
    Δ(a ⊗ₒ b) ⋅ (id(a) ⊗ₘ (⋄(b) ⊗ₘ ⋄(a)) ⊗ₘ id(b))
    Δ(a ⊗ₒ b) ⋅ (id(a) ⊗ₘ (⋄(b) ⊗ₘ ⋄(a)) ⊗ₘ id(b))
    Δ(a ⊗ₒ b) ⋅ (proj1(a, b) ⊗ₘ proj2(a, b))
    pair(proj1(a, b), proj2(a, b))
end

# shorter proof also accepted
@calc rules begin
    id(a ⊗ₒ b)
    (Δ(a) ⊗ₘ Δ(b)) ⋅ ((id(a) ⊗ₘ (σ(a, b) ⋅ (⋄(b) ⊗ₘ ⋄(a))) ⊗ₘ id(b)))
    pair(proj1(a, b), proj2(a, b))
end

Where to go from here

We could just say this is already nice and push forward.
You may have noticed I have a few rules commented out. These may or may not present extra difficulties. The unit production rules are probably fine with another kind of guard. The deletion rule in the other direction will never be fine. This rule truly invents f in an essential way.
This is all totally unintegrated with catlab itself. A WIP. There are different directions one could go on this. Alessandro has a branch of Metatheory in the work for custom user defined data structures, so it is possible to use Catlab’s native GATExpr maybe. He’s looking for help too btw!
I’m kind of intrigued at trying to make a pure egg version of the above, especially so I can compile to wasm
-String diagrams https://julialang.zulipchat.com/#narrow/stream/230248-catlab.2Ejl/topic/Using.20Metatheory.20for.20String.20Diagrams There is a compelling argument that string diagrams are a preferred representation, normalizing the associativity,commutative, and unit properties of a categorical construction. The useful canonicity of string diagrams is a viewpoint Evan Patterson has put forward. It has been suggested by Alessandro and others that generalizing the EGraph data structure in some sense may be to go. Extensions of the egraph that work natively on something more string diagram-like?
More wildly speculative. Co-end calculus? How to deal with binding forms? Daniele Palombi has brought up the coend calculus https://arxiv.org/pdf/1501.02503.pdf This seems like a very interesting application. I think the direct attack on this problem using Metatheory requires understanding how to deal with alpha equivalence in the EGraph Conversation here: https://julialang.zulipchat.com/#narrow/stream/277860-metatheory.2Ejl/topic/Formal.20Proofs.20for.20CT.3F. I personally would like to start attacking the problem from the angle of regular integrals and sums, which are syntactically analogous.
Proof Recording and output. The actual proof is there in the egraph rewriting process if we record it. Alessandro sounded interested in this direction and my understanding is it is also in the works for egg
Cody has also pointed out to me that he had a project with similar goals but different methods many moons ago https://github.com/codyroux/catnapp. I don’t think this got that far.

What is new in PooledArrays.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/05/pooledarrays.html

Introduction

Recently PooledArrays.jl 1.2.1 has been released. The most significant change
since 1.0 release is an improvement of performance of basic operations:
getindex, copy, copy!, and copyto!. The effect of the change is
especially significant for PooledArrays that have large pools. This change, is
one of the steps towards making Julia run fast in Database-like ops
benchmark for joins.

Let me start with the examples and then I will comment on the internals.
The post was tested under Julia 1.6.0-rc1.

The Benchmarks

I will just share a recordng of a Julia session doing the benchmarks. We start
with PooledArrays.jl 1.0.0:

(@v1.6) pkg> activate .
  Activating new environment at `~/Project.toml`

(bkamins) pkg> add [email protected]
   Resolving package versions...
    Updating `~/Project.toml`
  [2dfb63ee] + PooledArrays v1.0.0
    Updating `~/Manifest.toml`
  [9a962f9c] + DataAPI v1.6.0
  [2dfb63ee] + PooledArrays v1.0.0
  Progress [========================================>]  1/1
1 dependency successfully precompiled in 2 seconds (1 already precompiled)

julia> using BenchmarkTools, PooledArrays

julia> x = PooledArray(string.(1:10^6));

julia> @benchmark copy($x)
BenchmarkTools.Trial:
  memory estimate:  37.44 MiB
  allocs estimate:  13
  --------------
  minimum time:     16.109 ms (0.00% GC)
  median time:      17.820 ms (0.00% GC)
  mean time:        29.851 ms (40.25% GC)
  maximum time:     175.307 ms (88.17% GC)
  --------------
  samples:          168
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial:
  memory estimate:  33.63 MiB
  allocs estimate:  12
  --------------
  minimum time:     15.334 ms (0.00% GC)
  median time:      16.929 ms (0.00% GC)
  mean time:        30.654 ms (44.06% GC)
  maximum time:     188.496 ms (89.90% GC)
  --------------
  samples:          164
  evals/sample:     1

In order to assess if the timings are good or bad let us do the same operations
using a plain Vector{String}:

julia> x = string.(1:10^6);

julia> @benchmark copy($x)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     395.687 μs (0.00% GC)
  median time:      441.842 μs (0.00% GC)
  mean time:        757.598 μs (38.52% GC)
  maximum time:     6.194 ms (90.51% GC)
  --------------
  samples:          6598
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.310 ns (0.00% GC)
  median time:      26.013 ns (0.00% GC)
  mean time:        34.384 ns (8.88% GC)
  maximum time:     2.433 μs (98.05% GC)
  --------------
  samples:          10000
  evals/sample:     992

As you can see things are really bad in PooledArrays.jl. Now start a fresh Julia
session and install 1.2.1 release of the package.

(bkamins) pkg> activate .
  Activating environment at `~/Project.toml`

(bkamins) pkg> add [email protected]
   Resolving package versions...
    Updating `~/Project.toml`
  [2dfb63ee] ↑ PooledArrays v1.0.0 ⇒ v1.2.1
    Updating `~/Manifest.toml`
  [2dfb63ee] ↑ PooledArrays v1.0.0 ⇒ v1.2.1
  [9fa8497b] + Future
  [9a3f8284] + Random
  [9e88b42a] + Serialization

julia> using BenchmarkTools, PooledArrays

julia> x = PooledArray(string.(1:10^6));

julia> @benchmark copy($x)
BenchmarkTools.Trial:
  memory estimate:  3.81 MiB
  allocs estimate:  4
  --------------
  minimum time:     893.391 μs (0.00% GC)
  median time:      948.740 μs (0.00% GC)
  mean time:        1.009 ms (5.08% GC)
  maximum time:     103.288 ms (98.55% GC)
  --------------
  samples:          4953
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  3
  --------------
  minimum time:     47.806 ns (0.00% GC)
  median time:      53.053 ns (0.00% GC)
  mean time:        184.150 ns (44.04% GC)
  maximum time:     55.779 μs (68.58% GC)
  --------------
  samples:          10000
  evals/sample:     976

This looks better. Still you have to pay some cost over a Vector{Sting} but it
is much smaller (the cost is due to the fact that PooledArray constructor
performs some consistency checks of passed data to ensure extra safety).

You can expect that other operations that take a PooledArray or its view and
produce a PooledArray (like or copyto!) to experience similar
speedups.

So, what has changed between 1.0 and 1.2 release of PooledArrays.jl?

The Internals

In order to understand why the speedups were possible one needs to understand
first why the original code was slow. The reason is that PooledArray struct
contains three key fields:

refs: a vector of integer references (levels) of a PooledArray;
pool: a vector giving a mapping from references to actual values;
invpool: a dictionary providing a reverse mapping – from values to references.

As you can see this data structure is quite heavy if the size of pool relative
to the size of the PooledArray is large. In the example above I have shown an
extreme case where they were equal. But even if pool has e.g. 10% of size of
the PooledArray the cost is noticeable.

In PooledArrays.jl 1.0 all thee fields refs, pool and invpool were always
copied when a new PooledArray was created. This was expensive. What
PooeldArrays.jl 1.2 introduces is a well known from R and often asked about in
Julia copy on write behavior. What we now do is that we only copy refs. The
pool and invpool fields are shared across PooledArrays as this is a safe
thing to do as long as you are not changing the set of levels in the pool.

So where does the copy on write happen? PooledArrays.jl is now aware if you
share pool and invpool across several arrays and if this is the case and you
add levels to PooledArray then pool and invpool get copied. So essentially
we have a lazy mechanism that copies them only if needed.

One could ask why do we copy pool and invpool at all. One could just keep
sharing them without having to pay the cost of copying at all. The decision was
guided here by two considerations:

in practice new levels are added to PooledArray quite rarely (you mostly do
it when constructing an initial source PooledArray);
it is safer to copy the pool and invpool in consideration of potential
multi-threaded usages of PooledArray (where many tricky corner cases can
happen).

Conclusions

The main take away is that you can expect your code using PooledArrays.jl to be
much faster since 1.2 release.

Before I finish let me comment on one important feature of PooledArray object
to keep in mind in the multi-threaded applications I have mentioned above.

It is currently not thread safe to add new levels to the pool
of PooledArray. So the rule is: if you add levels to PooledArray make sure
you are not performing any other operations on it in other threads.

However, you can safely perform any operations in multi-threaded context that do
not change the pool. So, in particular (barring standard considerations of
correct multi-threaded code), you are allowed to use setindex! on
PooledArray as long as you do not add new levels.

How to check the version of a package?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/02/27/pkg_version.html

Introduction

I think the most common type of question related to DataFrames.jl package is
that users are reporting that some functionality does not work as documented.

Sometimes it is indeed a bug but in the majority of cases the reason is that
the user does not have a correct version of the package installed. In this post
I discuss several ways of checking the version of the package one has in
a current project environment.

This post was written under Julia 1.6.0-rc1, DataFrames.jl 0.22.5 and Chain.jl
0.4.4 (so as usual you can expect some exercises in using DataFrames.jl).

Elementary methods

If you are in an interactive mode then you have two basic options. The first one
uses package manager mode. Press ] in Julia REPL and then write the following:

(bkamins) pkg> status
      Status `~/Project.toml`
  [4c9194b5] ABCDGraphGenerator v0.1.0 `https://github.com/bkamins/ABCDGraphGenerator.jl#master`
  [8be319e6] Chain v0.4.4
  [9a962f9c] DataAPI v1.6.0 `~/.julia/dev/DataAPI`
  [a93c6f00] DataFrames v0.22.5

or if you are interested in a particular package do:

(bkamins) pkg> status DataFrames
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.22.5

In the above output it is worth to note two common scenarios:

ABCDGraphGenerator.jl is tracking master branch of a GitHub repository as a
source (so it means it was not installed from Julia registry);
DataAPI.jl is checked out for development (using dev command) and the package
is tracking a local folder.

Alternatively we could have generated the same outputs using API like this:

julia> using Pkg

julia> Pkg.status()
      Status `~/Project.toml`
  [4c9194b5] ABCDGraphGenerator v0.1.0 `https://github.com/bkamins/ABCDGraphGenerator.jl#master`
  [8be319e6] Chain v0.4.4
  [9a962f9c] DataAPI v1.6.0 `~/.julia/dev/DataAPI`
  [a93c6f00] DataFrames v0.22.5

julia> Pkg.status("DataFrames")
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.22.5

The downside of both approaches is that they produce information to the screen.
However, often one is interested in processing programmatically the installed
packages status.

Working with `Pkg.dependencies`

The Pkg.dependencies function returns a dictionary mapping package UUIDs
to information about them. As you can check in the documentation string
of the function the available information is stored in the following fields:

name: the name of the package
version: the version of the package (this is Nothing for stdlibs)
is_direct_dep: the package is a direct dependency
is_tracking_path: whether a package is directly tracking a directory
is_pinned: whether a package is pinned
source: the directory containing the source code for that package
dependencies: the dependencies of that package as a vector of UUIDs

Using Pkg.dependencies we can easily write a function that returns a version
of the package. Here is an example:

julia> using Chain

julia> get_pkg_version(name::AbstractString) =
           @chain Pkg.dependencies() begin
               values
               [x for x in _ if x.name == name]
               only
               _.version
           end
get_pkg_version (generic function with 1 method)

julia> get_pkg_version("DataFrames")
v"0.22.5"

Here is another example getting summary statistics about installed packages in a
data frame:

julia> using DataFrames

julia> get_pkg_status(;direct::Bool=true) =
           @chain Pkg.dependencies() begin
               values
               DataFrame
               direct ? _[_.is_direct_dep, :] : _
               select(:name, :version,
                      [:is_tracking_path, :is_tracking_repo, :is_tracking_registry] =>
                      ByRow((a, b, c) -> ["path", "repo", "registry"][a+2b+3c]) =>
                      :tracking)
           end
get_pkg_status (generic function with 1 method)

julia> get_pkg_status()
4×3 DataFrame
 Row │ name                version  tracking
     │ String              Union…   String
─────┼───────────────────────────────────────
   1 │ DataAPI             1.6.0    path
   2 │ DataFrames          0.22.5   registry
   3 │ Chain               0.4.4    registry
   4 │ ABCDGraphGenerator  0.1.0    repo

As you see I have selected to provide only the most essential information
about packages in the output: name, version and whether package is tracking
registry, local path, or external repository.

If you would pass direct=false you get information about all available
packages (direct and indirect dependencies of the project). It is usually not
very useful, however, as the list tends to be long, as you can see here:

julia> get_pkg_status(direct=false)
62×3 DataFrame
 Row │ name                         version  tracking
     │ String                       Union…   String
─────┼────────────────────────────────────────────────
   1 │ OrderedCollections           1.4.0    registry
   2 │ LibSSH2_jll                           registry
   3 │ Statistics                            registry
   4 │ ArgTools                              registry
   5 │ Compat                       3.25.0   registry
   6 │ Reexport                     1.0.0    registry
   7 │ SharedArrays                          registry
  ⋮  │              ⋮                  ⋮        ⋮
  56 │ Dates                                 registry
  57 │ MbedTLS_jll                           registry
  58 │ Serialization                         registry
  59 │ IteratorInterfaceExtensions  1.0.0    registry
  60 │ Libdl                                 registry
  61 │ Artifacts                             registry
  62 │ InteractiveUtils                      registry
                                       48 rows omitted

Concluding remarks

I hope you might find these patterns useful in your work with the Julia language.

Before finishing, let me mention one other case that you might occasionally
need. The above examples show you the version of the package in your current
project environment. However, in one Julia session you can change active project
environment many times. If you would be interested in getting information about
a version of the currently loaded package here is the way to do it (this will not
work for packages from stdlib as they are bundled with Julia and have a fixed
version):

julia> Pkg.TOML.parsefile(joinpath(pkgdir(DataFrames), "Project.toml"))["version"]
"0.22.5"

Let us check that indeed the loaded version does not change if we change project
environment:

(bkamins) pkg> status DataFrames
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.22.5

(bkamins) pkg> add [email protected]
   Resolving package versions...
    Updating `~/Project.toml`
  [a93c6f00] ↓ DataFrames v0.22.5 ⇒ v0.21.8
    Updating `~/Manifest.toml`
  [324d7699] ↓ CategoricalArrays v0.9.3 ⇒ v0.8.3
  [a8cc5b0e] - Crayons v4.0.4
  [a93c6f00] ↓ DataFrames v0.22.5 ⇒ v0.21.8
  [59287772] - Formatting v0.4.2
  [2dfb63ee] ↓ PooledArrays v1.1.0 ⇒ v0.5.3
  [08abe8d2] - PrettyTables v0.11.1
  [189a3867] ↓ Reexport v1.0.0 ⇒ v0.2.0
  Progress [========================================>]  3/3
  ? DataFrames
2 dependencies successfully precompiled in 2 seconds (21 already precompiled)
1 dependency failed but may be precompilable after restarting julia

(bkamins) pkg> status DataFrames
      Status `~/Project.toml`
  [a93c6f00] DataFrames v0.21.8

julia> Pkg.TOML.parsefile(joinpath(pkgdir(DataFrames), "Project.toml"))["version"]
"0.22.5"

and we see that although project version of the package is changed the loaded
version remains the same.

juliabloggers.com

A Julia Language Blog Aggregator

Tag Archives: julialang

Progress on Automated Reasoning for Catlab with Metatheory.jl Egraphs

Where to go from here

What is new in PooledArrays.jl?

Introduction

The Benchmarks

The Internals

Conclusions

How to check the version of a package?

Introduction

Elementary methods

Working with `Pkg.dependencies`

Concluding remarks

Where to go from here

Introduction

The Benchmarks

The Internals

Conclusions

Introduction

Elementary methods

Working with Pkg.dependencies

Concluding remarks

Working with `Pkg.dependencies`