Tag Archives: julialang

Expand your DataFrames.jl toolbox: the flatten function

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/20/flatten.html

Introduction

Recently I have commented on an interesting question on StackOveflow.

The problem was stated as follows. Given this input table:

2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

produce the following output table:

5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

As you can see the task is to repeat each row of the source data frame as many
times as column :Duration tells you but additionally increment the
Start_Time column by one minute in each consecutive row.

This question caught my attention, because it referenced to a similar
question
using Pandas. However, I found it quite hard to immediately
understand what is going on in that code, while in DataFrames.jl the solution
seemed to be relatively simple.

This post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

The solution using flatten

We start with creating the source data frame:

julia> using DataFrames, Dates

julia> df = DataFrame(Name=["John", "Joseph"],
                      Channel=["A", "B"],
                      Duration=[2,3],
                      Start_Time=Time.(["16:00:00", "15:05:00"]))
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

Now in order to solve the task one needs to remember that data frame can store
columns having any element type. Therefore a first natural step is to transform
the :Start_Time column from a vector holding only a starting time to a vector
holding a range of times as defined by :Duration and :Start_Time columns.

This is easy to achieve using the transform function:

julia> df2 = transform(df, [:Start_Time, :Duration] =>
                           ByRow((x,y) -> x .+ Minute.(0:y-1)) =>
                           :Start_Time)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

alternatively one could create the df2 data frame e.g. like this:

julia> df2 = copy(df)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

julia> df2.Start_Time = [x .+ Minute.(0:y-1) for
                         (x, y) in zip(df2.Start_Time, df2.Duration)]
2-element Vector{Vector{Time}}:
 [Time(16), Time(16, 1)]
 [Time(15, 5), Time(15, 6), Time(15, 7)]

julia> df2
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

a small benefit of transform is that is is easier to put this operation in a
chain of transformations as it takes and returns a data frame.

Once you have a df2 data frame then you need to flatten the :Start_Time
column into multiple rows. This is easily done using the flatten function like
this:

julia> flatten(df2, :Start_Time)
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

and you are done!

For sure I know DataFrames.jl much better than Pandas. However, what I feel
(and I am for sure biased here) is that it is much easier to reason about what
DataFrames.jl code does.

The solution using iteration

Another approach that could be used to handle this task would be to construct
the resulting data frame incrementally. In this case it is a bit more complex
than the flatten solution, but it is very often quite convenient so I thought
to show it. Here is the code:

julia> df3 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df)
           chunk = repeat(DataFrame(row), row.Duration)
           chunk.Start_Time .+= Minute.(0:row.Duration-1)
           append!(df3, chunk)
       end

julia> df3
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

or

julia> df4 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df), i in 0:row.Duration-1
           push!(df4, row)
           df4.Start_Time[end] += Minute(i)
       end

julia> df4
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

The point of these examples is that append! and push! are quite fast in
DataFrames.jl and I find them easy to reason about.

Conclusion

I hope that you found these examples useful. In particular functions like
flatten are easy to forget about while they often are very handy, especially in
combination with the fact that data frame can store objects of any type in its
columns.

In particular, you can store a vector of vectors or a vector of structs as a
data frame column. This is a type of storage that users of such data bases as
BigQuery or Snowflake tend to like. An especially notable feature
of this functionality is that such data frames can be easily written to and read
back from a file using e.g. Arrow.jl.

If you would like to check out another example of using a vector of vectors as
a column of a data frame you can have a look at notebook 5 of the
JuliaAcademy DataFrames.jl tutorial.

Sorting data by a transformation of columns in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/12/sorting.html

Introduction

Recently I have several times received a question how to sort a data frame
based on some transformation of its columns. In this post I will show you
the way it can be currently done.

The post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

Before we start create a data frame we will work with. It contains
coordinates of some points in a three-dimensional space.

julia> using DataFrames

julia> using LinearAlgebra

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10), y=rand(10), z=rand(10)) .- 0.5
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.0908446   0.148882   0.450498
   2 │  0.266797   -0.489094   0.46467
   3 │  0.0662374  -0.433577   0.445775
   4 │ -0.0399147   0.456753   0.289904
   5 │  0.294026    0.146691   0.32116
   6 │  0.354147   -0.387514  -0.46584
   7 │ -0.299414   -0.223979  -0.405456
   8 │ -0.201386    0.151664  -0.185074
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.0796722   0.342714  -0.125813

Basic sorting

You can sort this data frame by column :y like this:

julia> sort(df, :y)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.266797   -0.489094   0.46467
   2 │ -0.253163   -0.443358  -0.37219
   3 │  0.0662374  -0.433577   0.445775
   4 │  0.354147   -0.387514  -0.46584
   5 │ -0.299414   -0.223979  -0.405456
   6 │  0.294026    0.146691   0.32116
   7 │  0.0908446   0.148882   0.450498
   8 │ -0.201386    0.151664  -0.185074
   9 │  0.0796722   0.342714  -0.125813
  10 │ -0.0399147   0.456753   0.289904

If you want to sort it in reverse just do:

julia> sort(df, :y, rev=true)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.0399147   0.456753   0.289904
   2 │  0.0796722   0.342714  -0.125813
   3 │ -0.201386    0.151664  -0.185074
   4 │  0.0908446   0.148882   0.450498
   5 │  0.294026    0.146691   0.32116
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.354147   -0.387514  -0.46584
   8 │  0.0662374  -0.433577   0.445775
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.266797   -0.489094   0.46467

or

julia> sort(df, order(:y, rev=true))
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.0399147   0.456753   0.289904
   2 │  0.0796722   0.342714  -0.125813
   3 │ -0.201386    0.151664  -0.185074
   4 │  0.0908446   0.148882   0.450498
   5 │  0.294026    0.146691   0.32116
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.354147   -0.387514  -0.46584
   8 │  0.0662374  -0.433577   0.445775
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.266797   -0.489094   0.46467

Using order is useful if you would want to sort a data frame by several columns
and apply different ordering rules to them.

If you want to apply a transformation to a single column and sort it based on
the transformed values use the by option:

julia> sort(df, :y, by=abs)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.294026    0.146691   0.32116
   2 │  0.0908446   0.148882   0.450498
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.299414   -0.223979  -0.405456
   5 │  0.0796722   0.342714  -0.125813
   6 │  0.354147   -0.387514  -0.46584
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │ -0.0399147   0.456753   0.289904
  10 │  0.266797   -0.489094   0.46467

or equivalently

julia> sort(df, order(:y, by=abs))
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.294026    0.146691   0.32116
   2 │  0.0908446   0.148882   0.450498
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.299414   -0.223979  -0.405456
   5 │  0.0796722   0.342714  -0.125813
   6 │  0.354147   -0.387514  -0.46584
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │ -0.0399147   0.456753   0.289904
  10 │  0.266797   -0.489094   0.46467

These patterns naturally extend to multiple columns, and sorting is performed
lexicographically. Here is an example:

julia> df2 = DataFrame(x=rand(Bool, 16), y=rand(Bool, 16), z=rand(Bool, 16))
16×3 DataFrame
 Row │ x      y      z
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │ false   true  false
   2 │  true   true   true
   3 │  true  false   true
   4 │ false  false   true
   5 │  true  false   true
   6 │  true  false  false
   7 │  true  false  false
   8 │ false  false  false
   9 │ false  false   true
  10 │ false   true   true
  11 │  true  false   true
  12 │ false  false  false
  13 │  true  false  false
  14 │  true  false  false
  15 │ false  false  false
  16 │ false   true  false

julia> sort(df2, [:y, order(:z, rev=true), :x])
16×3 DataFrame
 Row │ x      y      z
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │ false  false   true
   2 │ false  false   true
   3 │  true  false   true
   4 │  true  false   true
   5 │  true  false   true
   6 │ false  false  false
   7 │ false  false  false
   8 │ false  false  false
   9 │  true  false  false
  10 │  true  false  false
  11 │  true  false  false
  12 │  true  false  false
  13 │ false   true   true
  14 │  true   true   true
  15 │ false   true  false
  16 │ false   true  false

However, a question is what if I want to sort a data frame on a function of
multiple columns taken together?

Sorting on multiple columns considered jointly

Going back to our df data frame what if we wanted to sort it by the distance
from the origin?

In this case the sortperm function is useful. What you need to
do is to create a temporary object, get its sortperm, and apply it to a
source data frame. Here is how it is done in practice:

julia> df[sortperm(norm.(eachrow(df))), :]
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.201386    0.151664  -0.185074
   2 │  0.0796722   0.342714  -0.125813
   3 │  0.294026    0.146691   0.32116
   4 │  0.0908446   0.148882   0.450498
   5 │ -0.0399147   0.456753   0.289904
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │  0.354147   -0.387514  -0.46584
  10 │  0.266797   -0.489094   0.46467

A nice thing is that sortperm also works for data frames, so if you wanted to
sort the data frame by the sign of :x and then by the sum of :y and :z
columns you could write:

julia> df[sortperm(select(df, :x => ByRow(sign), [:y, :z] => +)), :]
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.253163   -0.443358  -0.37219
   2 │ -0.299414   -0.223979  -0.405456
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.0399147   0.456753   0.289904
   5 │  0.354147   -0.387514  -0.46584
   6 │  0.266797   -0.489094   0.46467
   7 │  0.0662374  -0.433577   0.445775
   8 │  0.0796722   0.342714  -0.125813
   9 │  0.294026    0.146691   0.32116
  10 │  0.0908446   0.148882   0.450498

Conclusion

The thing to remember is that because data frame fully supports standard
indexing like a matrix you can easily reorder it using the sortperm function
applied to an object different than the original data frame.

However, since this feature request is raised quite often we are currently
discussing how to add a support to it in a standard sort syntax. If
you are interested in the details you can check this issue.

Progress on Automated Reasoning for Catlab with Metatheory.jl Egraphs

By: Philip Zucker

Re-posted from: https:/www.philipzucker.com/metatheory-progress/

Oh cursed equation! How you haunt me!

I’ve been working on mechanizing equational reasoning for categories for a while now with not so much success.

The Egg paper came out and was very compelling to me. It is an implementation with novel twists of egraphs, a useful data structure for equational reasoning. I tried my hand at implementing the egraph data structure here.

But Alessandro Cheli has come along and really taken things to the next level with his Metatheory.jl library. The package includes a significantly improved egraph implementation, plus a classical rewriting system. It also has some really interesting ideas in regards to homoiconicity and theory composition. Maybe it’s useful to you. Check it out.

The equational reasoning that occurs in Catlab is compelling because it feels like high school algebra, but describes significantly more powerful systems of reasoning. It also has a beautiful and intuitive diagrammatic interpretation.

Understanding how to take a system like egraph rewriting, which in examples seems at the moment best suited for rewriting ordinary algebraic expressions, and tweak it to work with the more complicated logic of Catlab has been a challenge.

There are a couple problems.

  1. The typing context contains important information. Consider the rule $f \rightarrow id(a) \cdot f $. Looking at only the syntax of this rule, a is invented on the right hand side. This information actually comes from the type of f::hom(a,b)
  2. We need to strip off overloading? $\otimes$ is a symbol used in catlab for both the binary operator on objects and morphisms, as is customary. I guess this might fine? It makes me squeamish.
  3. Some rules require guarding on the type before execution. For most of the rules in Catlab, if the left hand side of the rule is well-typed, then so is the right. This is not the case in general though.
    Consider the trick question: Can you apply the interchange law (f ⊗ g) ⋅ (h ⊗ k) => (f ⋅ h) ⊗ (g ⋅ k) to the term (f ⊗ id(a)) ⋅ (id(b) ⊗ k)?
    No you can’t. For in my example, I have secretly decided that f actually has type b ⊗ c -> b ⊗ c and k has type c ⊗ a -> c ⊗ a.
    The other direction is always fine though. Given (f ⋅ h) ⊗ (g ⋅ k) is well typed, so is (f ⊗ g) ⋅ (h ⊗ k)

I thought way back that the types could be stripped and everything would be finem like if you strip the types from the simply typed lambda calculus STLC. This is not the case. I think this misconception is due to my inexperience doing hand calculations of this type (which almost no one does. String diagrams rule), plus I think you “code the happy path” when doing it by hand and may not even notice a situation where just shallow pattern matching on the syntax would allow you do do something verboten, since you’re smart enough to not make that dumb mistake.

My initial thinking this time around was to take a type tagging approach https://people.mpi-inf.mpg.de/~jblanche/mono-trans.pdf. This is a standard way of encoding more sophisticated typed logics into simpler typed ones. Type tagging replaces every term ever used with a more complex term that contains the type. In a curious analogy, it is the analog of dynamic typing vs static typing. In dynamically typed languages, all values carry with them tags describing which type they are at runtime.

As an example, the identity id(a) could be tagged as type(id(type(a,ob)), hom(typ(a,ob), type(a,ob))) or as hom(id(ob(a)), ob(a), ob(a)). You can see how verbose this is. It is not ideal both because it is unacceptable to write by hand, and because this extra structure is just more junk to choke your solver. But you can see that now the necessary types to fill in missing pieces of the right hand side is always available.

I tried this through some arcane metaprogramming methods involving Julog and it seemed too slow. Alessandro and James Fairbanks have a more elegant looking version of this here.

It feels insane that you need all this overhead when it is not necessary when working by hand. Many of the rules do not have this rhs invention problem or the typed guarding requirement. For cartesian categories, only about 5 of 30 rules have this condition. $(f \cdot g) \cdot h \rightarrow (f \cdot g) \cdot h$ is totally fine for example.

So here is idea 1, replace any new derived information on the right hand side with “function calls” that find that information. For our example, this becomes f => id(dom(type(f))) ⋅ f. Then we need to internalize the typing rules into the rewrite system

typing_rules = @theory begin
    dom(hom(a,b)) => a
    cod(hom(a,b)) => b
    type(id(a)) => hom(a,a)
    type(f ⋅ g) => hom( dom(type(f)) , cod(type(g)) )
    type(f ⊗ₘ g) => hom( dom(type(f)) ⊗ₒ dom(type(g)) , cod(type(f)) ⊗ₒ cod(type(g)))
    type(a ⊗ₒ b) => :ob
    type(munit()) => :ob
    type(σ(a,b)) => hom(a ⊗ₒ b, b ⊗ₒ a)
    type(⋄(a)) => hom(a, munit())
    type(Δ(a)) => hom(a, a ⊗ₒ a)
    type(pair(f,g)) => hom(dom(type(f)), cod(type(f)) ⊗ₒ cod(type(g)))
    type(proj1(a,b)) => hom(a ⊗ₒ b, a)
    type(proj2(a,b)) => hom(a ⊗ₒ b, b)
end

Idea two is to use Metatheory “guards” to enforce type correctness of the left hand side before a rule can fire. As part of this, the necessary type queries are added to the egraph. Alessandro had this idea of just returning the original expression to enforce a guard, which I think is pretty clever. Here is how I think the guard works for the interchange law. This will be cleaner with an upcoming Metatheory macro.

    (f ⊗ₘ p) ⋅ (g ⊗ₘ q) |>
    # future metatheory macros will clean this up
    begin
        fcod = addexpr!(_egraph, :(cod(type($f))))
        gdom = addexpr!(_egraph, :(dom(type($g))))
        pcod = addexpr!(_egraph, :(cod(type($p))))
        qdom = addexpr!(_egraph, :(dom(type($q))))
        if (fcod == gdom && pcod == qdom)
            :(($f ⋅ $g) ⊗ₘ ($p ⋅ $q))
        else
            :(($f ⊗ₘ $p) ⋅ ($g ⊗ₘ $q))
        end
    end

You can see that even making the query about whether the types are right adds them to the egraph. It may be currently unknown what the types are, but by adding them, eventually the typing rules will fire and this rule will have its chance again. This also isn’t as bad as it may appear performance wise, because the egraph in a sense memoizes these typing calls. See the Fibonacci example by Mason Protter which has non exponential complexity despite the naive implementation https://github.com/0x0f0f0f/Metatheory.jl/blob/master/test/test_fibonacci.jl

It is my belief that both of these points (internalization of the typing rules and guards) can be mechanized to produce a sound rule system for an arbitrary GAT. I could easily be wrong. The rough picture I suggest is that you need to infer by looking at the lhs of each rule what typing constraints are implied. If you do not imply the entire typing context, you need to add as many extra guards to ensure it as needed. For example, the well typed-ness of (f ⊗ₘ p) ⋅ (g ⊗ₘ q) implies $cod(f) \otimes cod(p) = dom(g) \otimes dom(q)$ but this does not imply the condition `cod(f) = dom(g) \land cod(p) == dom(q)$ so they must be added.

Here is a full script to run the derivation pair(proj1(a,b),proj2(a,b)) = id(a ⊗ₒ b). I can’t get it to go end to end without assistance yet. We’ll get there. I believe. The real holy grail is not proofs, but simplifications. This also works, but not yet at the depth I’m hoping for.

Evan Patterson’s explanation of the proof has been absolutely vital https://julialang.zulipchat.com/#narrow/stream/230248-catlab.2Ejl/topic/Automatic.20Theorem.20Proving/near/211396362. Zulip is nice for this.

using Pkg
Pkg.activate("metacat")
Pkg.add(url = "https://github.com/0x0f0f0f/Metatheory.jl.git")
using Metatheory
using Metatheory.EGraphs

# See this link for the catlab axioms
# https://github.com/AlgebraicJulia/Catlab.jl/blob/ce2fde9c63a8aab65cf2a7697f43cd24e5e00b3a/src/theories/Monoidal.jl#L127

cat_rules = @theory begin
    f  id(b) => f
    id(a)  f => f

    #f => f ⋅ id(dom(type(f)))
    #f => id(cod(type(f))) ⋅ f

    a ⊗ₒ munit() => a
    munit() ⊗ₒ a => a

    #a => a ⊗ₒ munit() 
    #a => munit() ⊗ₒ a

    f  (g  h) == (f  g)  h


end

monoidal_rules = @theory begin
    id(munit()) ⊗ₘ f => f
    f ⊗ₘ id(munit()) => f
    a ⊗ₒ (b ⊗ₒ c) == (a ⊗ₒ b) ⊗ₒ c
    f ⊗ₘ (h ⊗ₘ j) == (f ⊗ₘ h) ⊗ₘ j
    id(a ⊗ₒ b) == id(a) ⊗ₘ id(b)

    (f ⊗ₘ p)  (g ⊗ₘ q) |>
    # future metatheory macros will clean this up
    begin
        fcod = addexpr!(_egraph, :(cod(type($f))))
        gdom = addexpr!(_egraph, :(dom(type($g))))
        pcod = addexpr!(_egraph, :(cod(type($p))))
        qdom = addexpr!(_egraph, :(dom(type($q))))
        if (fcod == gdom && pcod == qdom)
            :(($f  $g) ⊗ₘ ($p  $q))
        else
            :(($f ⊗ₘ $p)  ($g ⊗ₘ $q))
        end
    end

    (f  g) ⊗ₘ (p  q) => (f ⊗ₘ p)  (g ⊗ₘ q)
end



sym_rules = @theory begin
    σ(a, b)  σ(b, a) == id(a ⊗ₒ b)
    (σ(a, b) ⊗ₘ id(c))  (id(b) ⊗ₘ σ(a, c)) == σ(a, b ⊗ₒ c)
    (id(a) ⊗ₘ σ(b, c))  (σ(a, c) ⊗ₘ id(b)) == σ(a ⊗ₒ b, c)

    (f ⊗ₘ h)  σ(a, b) |> begin
        fcod = addexpr!(_egraph, :(cod(type($f)))).id
        hcod = addexpr!(_egraph, :(cod(type($h)))).id
        if fcod == a && hcod == b
            :(σ(dom(type($f)), dom(type($h)))  ($h ⊗ₘ $f))
        else
            :(($f ⊗ₘ $h)  σ($a, $b))
        end
    end


    σ(c, d)  (h ⊗ₘ f) |> begin
        fdom = addexpr!(_egraph, :(dom(type($f)))).id
        hdom = addexpr!(_egraph, :(dom(type($h)))).id
        if fdom == c && hdom == d
            :(($f ⊗ₘ $h)  σ(cod(type($f)), cod(type($h))))
        else
            :(σ($c, $d)  ($h ⊗ₘ $f))
        end
    end

    # these rules arer not catlab
    σ(a, munit()) == id(a)
    σ(munit(), a) == id(a)
    σ(munit(), munit()) => id(munit() ⊗ₒ munit())

end



diag_rules = @theory begin

    Δ(a)  ((a) ⊗ₘ id(a)) == id(a)
    Δ(a)  (id(a) ⊗ₘ (a)) == id(a)
    Δ(a)  σ(a, a) == Δ(a)

    (Δ(a) ⊗ₘ Δ(b))  (id(a) ⊗ₘ σ(a, b) ⊗ₘ id(b)) == Δ(a ⊗ₒ b)


    Δ(a)  (Δ(a) ⊗ₘ id(a)) == Δ(a)  (id(a) ⊗ₘ Δ(a))
    (a ⊗ₒ b) == (a) ⊗ₘ (b)

    Δ(munit()) == id(munit())
    (munit()) == id(munit())
end


cart_rules = @theory begin

    Δ(a)  (f ⊗ₘ k) |> begin
        a_id = find(_egraph, a)
        if (
            addexpr!(_egraph, :(dom(type($f)))).id == a_id &&
            addexpr!(_egraph, :(dom(type($k)))).id == a_id
        )
            :(pair($f, $k))
        else
            :(Δ($a)  ($f ⊗ₘ $k))
        end
    end


    pair(f, k) => Δ(dom(type(f)))  (f ⊗ₘ k)
    proj1(a, b) == id(a) ⊗ₘ (b)
    proj2(a, b) == (a) ⊗ₘ id(b)
    f  (b) => (dom(type(f)))
    # Has to invent f. Hard to fix.
    # ⋄(b) => f ⋅ ⋄(b)  

    f  Δ(b) => Δ(dom(type(f)))  (f ⊗ₘ f)
    Δ(a)  (f ⊗ₘ f) => f  Δ(cod(type(f)))
end





typing_rules = @theory begin
    dom(hom(a, b)) => a
    cod(hom(a, b)) => b
    type(id(a)) => hom(a, a)
    type(f  g) => hom(dom(type(f)), cod(type(g)))
    type(f ⊗ₘ g) => hom(dom(type(f)) ⊗ₒ dom(type(g)), cod(type(f)) ⊗ₒ cod(type(g)))
    type(a ⊗ₒ b) => :ob
    type(munit()) => :ob
    type(σ(a, b)) => hom(a ⊗ₒ b, b ⊗ₒ a)
    type((a)) => hom(a, munit())
    type(Δ(a)) => hom(a, a ⊗ₒ a)
    type(pair(f, g)) => hom(dom(type(f)), cod(type(f)) ⊗ₒ cod(type(g)))
    type(proj1(a, b)) => hom(a ⊗ₒ b, a)
    type(proj2(a, b)) => hom(a ⊗ₒ b, b)
end


rules = typing_rules  cat_rules  monoidal_rules  sym_rules  diag_rules  cart_rules  typing_rules

using MacroTools

# A goofy little helper macro
# Taking inspiration from Lean/Dafny/Agda
macro calc(e...)
    theory = eval(e[1])
    e = rmlines(e[2])
    @assert e.head == :block
    for (a, b) in zip(e.args[1:end-1], e.args[2:end])
        println(a, " =? ", b)
        @time println(areequal(theory, a, b; timeout = 40))
    end
end

# Get the Julia motor hummin'
@calc rules begin

    (((a) ⊗ₘ (b))  σ(munit(), munit()))
    (σ(a, b)  ((b) ⊗ₘ (a)))

end 


@calc rules begin
    id(a ⊗ₒ b)
    id(a) ⊗ₘ id(b)
    (Δ(a)  (id(a) ⊗ₘ (a))) ⊗ₘ id(b)
    (Δ(a)  (id(a) ⊗ₘ (a))) ⊗ₘ (Δ(b)  ((b) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b))  ((id(a) ⊗ₘ (a)) ⊗ₘ ((b) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b))  (id(a) ⊗ₘ ((a) ⊗ₘ (b)) ⊗ₘ id(b))
    (Δ(a) ⊗ₘ Δ(b))  (id(a) ⊗ₘ (((a) ⊗ₘ (b))  σ(munit(), munit())) ⊗ₘ id(b))
    (Δ(a) ⊗ₘ Δ(b))  ((id(a) ⊗ₘ (σ(a, b)  ((b) ⊗ₘ (a))) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b))  ((id(a) ⊗ₘ (σ(a, b)  ((b) ⊗ₘ (a))) ⊗ₘ id(b)))
    (Δ(a) ⊗ₘ Δ(b))  ((id(a)  id(a)) ⊗ₘ (σ(a, b)  ((b) ⊗ₘ (a))) ⊗ₘ id(b))
    (Δ(a) ⊗ₘ Δ(b))  ((id(a) ⊗ₘ σ(a, b) ⊗ₘ id(b))  (id(a) ⊗ₘ ((b) ⊗ₘ (a)) ⊗ₘ id(b)))
    Δ(a ⊗ₒ b)  (id(a) ⊗ₘ ((b) ⊗ₘ (a)) ⊗ₘ id(b))
    Δ(a ⊗ₒ b)  (id(a) ⊗ₘ ((b) ⊗ₘ (a)) ⊗ₘ id(b))
    Δ(a ⊗ₒ b)  (proj1(a, b) ⊗ₘ proj2(a, b))
    pair(proj1(a, b), proj2(a, b))
end

# shorter proof also accepted
@calc rules begin
    id(a ⊗ₒ b)
    (Δ(a) ⊗ₘ Δ(b))  ((id(a) ⊗ₘ (σ(a, b)  ((b) ⊗ₘ (a))) ⊗ₘ id(b)))
    pair(proj1(a, b), proj2(a, b))
end

Where to go from here

  • We could just say this is already nice and push forward.
  • This is all totally unintegrated with catlab itself.
  • I’m kind of intrigued at trying to make a pure egg version of the above, especially so I can compile to wasm
    -String diagrams https://julialang.zulipchat.com/#narrow/stream/230248-catlab.2Ejl/topic/Using.20Metatheory.20for.20String.20Diagrams There is a compelling argument that string diagrams are a preferred representation, normalizing the associativity,commutative, and unit properties of a categorical construction. The useful canonicity of string diagrams is a viewpoint Evan Patterson has put forward. It has been suggested by Alessandro and others that generalizing the EGraph data structure in some sense may be to go. Extensions of the egraph that work natively on something more string diagram-like?
  • More wildly speculative. Co-end calculus? How to deal with binding forms? Daniele Palombi has brought the coend calculus https://arxiv.org/pdf/1501.02503.pdf This seems like a very interesting application. I think the direct attack on this problem using Metatheory requires understanding how to deal with alpha equivalence in the EGraph Conversation here: https://julialang.zulipchat.com/#narrow/stream/277860-metatheory.2Ejl/topic/Formal.20Proofs.20for.20CT.3F. I personally would like to start attacking the problem from the angle of regular integrals and sums, which are syntactically analogous.
  • Proof Recording and output. The actual proof is there in the egraph rewriting process if we record it. Alessandro sounded interested in this direction and my understanding is it is also in the works for egg
  • Cody has also pointed out to me that he had a project with similar goals but different methods many moons ago https://github.com/codyroux/catnapp. I don’t think this got that far.