
A Buchberger in Julia

By: philzook58

Re-posted from: https://www.philipzucker.com/a-buchberger-in-julia/

Similarly to how Gaussian elimination, which puts linear equations into LU form, solves most linear algebra problems one might care about, Buchberger's algorithm for finding a Grobner basis of a system of multivariate polynomial equations solves most questions you might ask of such a system. Some fun applications:

  • Linkages
  • Geometrical theorem proving. Circles are x^2 + y^2 - 1 = 0 and so on.
  • Optics
  • Constraint satisfaction problems. x^2 - 1 = 0 gives you a boolean variable (see the sketch after this list). It's a horrible method, but it works if your computer doesn't explode.
  • Energy and momentum conservation. “Classical Feynman Diagrams” p1 + p2 = p3 + p4 and so on.
  • Frequency domain circuits and linear dynamical systems 😉 more on this another day
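
As a small illustration of the boolean-variable trick from the list, here is a toy of my own invention; it relies on the buchberger function defined later in this post:

using DynamicPolynomials
@polyvar x y

# x^2 - 1 = 0 forces x ∈ {-1, 1}, ditto for y; x*y - 1 = 0 demands the
# two "booleans" agree
buchberger([x^2 - 1, y^2 - 1, x*y - 1])
# the ideal contains y - x = x*(x*y - 1) - y*(x^2 - 1), so any Grobner
# basis of this system encodes the fact that x and y must be equal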

To learn more about Grobner bases I highly recommend Cox, Little, and O'Shea's Ideals, Varieties, and Algorithms.

To understand what a Grobner basis is, first know that univariate polynomial long division is a thing. It’s useful for determining if one polynomial is a multiple of another. If so, then you’ll find the remainder is zero.
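
A minimal check of this, using DynamicPolynomials (which the rest of the post uses; divrem here behaves like classic long division):

using DynamicPolynomials
@polyvar x

# x^2 - 1 = (x + 1)(x - 1), so dividing by x - 1 leaves remainder zero
q, r = divrem(x^2 - 1, x - 1)
# q == x + 1 and iszero(r), confirming x^2 - 1 is a multiple of x - 1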

One would like to lift the problem of determining whether a polynomial is a multiple of others to multivariate polynomials. Somewhat surprisingly, the definition of long division now involves some choices. Sure, x^2 is a term that comes ahead of x, but is x a larger term than y? Than y^2? Multiple different orderings are admissible. In addition, one now has systems of equations: which equation do we divide by first? It turns out to matter and to change the result, unless one has converted the system into a Grobner basis. The sketch below shows the order dependence concretely.
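A classic concrete instance (the example is from Cox, Little, and O'Shea; the exact remainders assume divrem tries the divisors in the order given, as the textbook algorithm does):

using DynamicPolynomials
@polyvar x y

f = x*y^2 - x
_, r1 = divrem(f, [x*y + 1, y^2 - 1])  # divide by x*y + 1 first
_, r2 = divrem(f, [y^2 - 1, x*y + 1])  # divide by y^2 - 1 first
# r1 == -x - y but r2 == 0: same f, same divisors, different remainder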

A Grobner basis is a set of polynomials such that the remainder under multivariate division becomes unique, regardless of the order in which division occurs.

How does one find such a basis? In essence, by a kind of brute force: for each pair of polynomials in the set, you build the s-polynomial, which cancels the two leading terms against each other and so exposes exactly the ambiguity of which one you could have divided by first. For example, for p = x^2*y - 1 and q = x*y^2 - x, the lcm of the leading monomials is x^2*y^2, and the s-polynomial is y*p - x*q = x^2 - y. If an s-polynomial does not reduce to zero against the current set, its remainder gets added to the set, and you repeat until every s-polynomial reduces to zero.

Julia has packages for multivariate polynomials. https://github.com/JuliaAlgebra/MultivariatePolynomials.jl defines an abstract interface and generic functions, DynamicPolynomials gives a flexible representation for construction, and TypedPolynomials gives a faster representation with the variables encoded in the type.

These already implement the bulk of what we need to get a basic Buchberger going: data structures, arithmetic, and division with remainder. With one caveat: a monomial ordering is already picked, and it's not lexicographic, which is the nice one for eliminating variables. This would not be too hard to change, though.

Polynomial long division with respect to a set of polynomials is implemented here

https://github.com/JuliaAlgebra/MultivariatePolynomials.jl/blob/9a0f7bf531ba3346f0c2ccf319ae92bf4dc261af/src/division.jl#L60

Unfortunately (or fortunately? It was a good learning experience, and I learned some things about data structures and types in Julia), quite late I realized that a Grobner basis algorithm very similar to the one below is already implemented inside the SemiAlgebraic.jl package. Sigh.

using MultivariatePolynomials
using DataStructures


# The s-polynomial of p and q: scale each so their leading terms both equal
# lcm(leadingmonomial(p), leadingmonomial(q)), then subtract to cancel them.
function spoly(p, q)
    pq = lcm(leadingmonomial(p), leadingmonomial(q))
    return div(pq, leadingterm(p)) * p - div(pq, leadingterm(q)) * q
end

function isgrobner(F::Array{T}) where {T <: AbstractPolynomialLike} # check buchberger criterion
    for (i, f1) in enumerate(F)
        for f2 in F[i+1:end]
            s = spoly(f1,f2)
            _,s = divrem(s,F)
            if !iszero(s)
                return false
            end
        end
    end
    return true
end

function buchberger(F::Array{T}) where {T <: AbstractPolynomialLike}
    pairs = Queue{Tuple{T,T}}()
    # initialize with all pairs from F
    for (i, f1) in enumerate(F)
        for f2 in F[i+1:end]
            enqueue!(pairs, (f1,f2))
        end
    end
    
    # consider all possible s-polynomials and reduce them
    while !isempty(pairs)
        (f1,f2) = dequeue!(pairs)
        s = spoly(f1,f2)
        _,s = divrem(s,F)
        if !iszero(s) #isapproxzero? Only add to our set if doesn't completely reduce
            for f in F
                enqueue!(pairs, (s,f))
            end
            push!(F,s)
        end
    end

    # reduce redundant entries in grobner basis.
    G = Array{T}(undef, 0)
    while !isempty(F)
        f = pop!(F)
        _,r = divrem(f, vcat(F,G))
        if !iszero(r)
            push!(G,r)
        end
    end
    
    return G
end

Some usage: you can see here that Gaussian elimination, as implemented by the backslash operator, is a special case of taking the Grobner basis of a set of linear equations.


using DynamicPolynomials
@polyvar x y

buchberger( [ x + 1.0 + y   , 2.0x + 3y + 7  ] )
#= 
2-element Array{Polynomial{true,Float64},1}:
 -0.5y - 2.5
 x - 4.0
=#

[ 1 1 ; 2  3 ] \ [-1 ; -7]
#=
2-element Array{Float64,1}:
  4.0
 -5.0
=#


buchberger( [ x^3 - y , x^2 - x*y ])
#=
3-element Array{Polynomial{true,Int64},1}:
 -xy + y²
 y³ - y
 x² - y²
=#

Improvements

Many. This is not a good Buchberger implementation, but it is simple. See http://www.scholarpedia.org/article/Buchberger%27s_algorithm for some tips, which include criteria for avoiding unneeded s-polynomial pairs and smart orderings of the pair queue. Better implementations use the F4 or F5 algorithm, which use sparse matrix facilities to perform many division steps in parallel. My vague impression of the F4 algorithm is that you prefill a sparse matrix (rows correspond to an s-polynomial or a monomial multiple of your current basis, columns correspond to monomials) with the monomial multiples of your current basis that you know you might need.

In my implementation, I'm tossing away the div part of divrem. It can be useful to retain the quotients so you know how to write your Grobner basis in terms of the original basis.

You may also want to look at Singular.jl, the Julia bindings to the Singular computer algebra system.


Data Structure Design in CSV.jl

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/06/07/data-structure-design-in-csv-jl/

I'm currently working through a refactor of CSV.jl internals and once again am faced with some tricky questions surrounding which data structures to use. In csv reading, like many things, there are trade-offs between performance/efficiency and convenience: we want to read delimited files as fast as possible, but we also want the most natural, standard data structures. Some of the complexity in these decisions comes from the csv spec (or lack thereof!) itself, but to fully understand that, let's talk through the primary task of csv reading.

The aim of CSV.jl is to take plain text data (columns separated by a delimiter, rows separated by newlines), parse the text values into native typed values, and output the data in columnar data structures, like a native Julia Vector. Unfortunately, csv files do not contain any metadata to tell us how many rows or columns there are, what data types the columns will be, or whether they will have null values. This lack of metadata means we have to use tricks like guessing the # of rows in a file, "detecting" data types, and being flexible in case we detect one type and need to "promote" to a different data type later. The complexities are compounded when considering multi-threaded parsing, because you either have to play the synchronization dance between threads, or give each thread a chunk of the file and "merge" their results afterwards (the current CSV.jl approach).
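As a toy version of the detect-and-promote dance (names like detecttype and promotetype are made up for illustration; CSV.jl's real machinery is far more involved):

# guess the narrowest type a single cell could parse as
function detecttype(s::AbstractString)
    tryparse(Int, s) !== nothing && return Int
    tryparse(Float64, s) !== nothing && return Float64
    return String
end

# widen when two cells disagree: Int and Float64 promote to Float64,
# anything else falls back to String
promotetype(T::Type, S::Type) =
    T === S ? T : (T <: Real && S <: Real ? Float64 : String)

# scanning a column cell by cell:
reduce(promotetype, detecttype.(["1", "2.5", "3"]))  # Float64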

Given these processing considerations, here are some of the data structure questions I’m currently pondering:

Vector{Union{T, Missing}} vs. SentinelVector{T}: Base Julia includes an optimization that allows Arrays with isbits Union elements to be stored inline, which means they can be extremely efficient when working with, for example, Union{Float64, Missing}. I've been prototyping a new SentinelArrays.jl package that allows wrapping any AbstractArray and specifying a special "sentinel" value of the array type that should return a different "special value", like missing. For Float64, for example, we can use a non-standard NaN bit pattern as a sentinel for missing, and this works quite well: we're basically working with a plain Vector{Float64} in terms of storage, but get the advantage of representing Union{Float64, Missing} through our sentinel.
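To illustrate the idea, here is a minimal toy (hypothetical; not the actual SentinelArrays.jl API): a wrapper around a plain Vector{Float64} that reserves one NaN bit pattern to mean missing.

# Toy sentinel vector: storage is a plain Vector{Float64}; one reserved
# NaN bit pattern reads back as `missing`.
struct ToySentinelVector <: AbstractVector{Union{Float64, Missing}}
    data::Vector{Float64}
end

const SENTINEL_BITS = 0x7ff8_dead_beef_0001  # a non-standard NaN payload

Base.size(v::ToySentinelVector) = size(v.data)
Base.getindex(v::ToySentinelVector, i::Int) =
    reinterpret(UInt64, v.data[i]) === SENTINEL_BITS ? missing : v.data[i]
Base.setindex!(v::ToySentinelVector, ::Missing, i::Int) =
    (v.data[i] = reinterpret(Float64, SENTINEL_BITS); v)
Base.setindex!(v::ToySentinelVector, x::Float64, i::Int) = (v.data[i] = x; v)

# the truncation question below becomes trivial: if parsing wrote no
# sentinels, just unwrap and return the plain Vector{Float64}, zero copies
unwrap(v::ToySentinelVector) = v.data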

The pros of Vector{Union{T, Missing}} are that it's built into Base Julia, standard, and much of the data ecosystem has become familiar with the type and integrates with it well. The cons are that it takes up a little extra space (1 extra byte per element, which signals whether an element holds a Float64 or a missing), and that there currently isn't a way to "truncate" the array, like convert(Vector{Float64}, A::Vector{Union{Float64, Missing}}), without copying data. This is desirable because while parsing, we need to assume Union{T, Missing} in case missing values are encountered, but we might finish parsing and know that there were in fact none, in which case we'd like to just return a Vector{Float64}. With a SentinelVector, this is trivial: we can just "unwrap" the underlying array and return that, given no sentinels were used while parsing. The disadvantage of SentinelArrays is just that they're non-standard, though the Julia ecosystem has evolved pretty well to rely on the general AbstractArray interface instead of needing specific array types.

The 2nd question is: how to return/represent String columns? String columns are a bit of an odd duck with respect to parsing, because you're not really "parsing" anything, just noting the start/end positions of each cell in the csv file. And indeed, one representation of a String column is just that: a custom array type that holds on to the original file byte buffer, where each "element" is just a byte position and length of a cell. Upon indexing, the String can be fully materialized, but otherwise this lazily materializing structure provides a lot of efficiency. The disadvantage of this approach is needing to hold on to the original file buffer, which has caused confusion for users in the past when they try to modify/delete the file after parsing and get errors saying the file is still in use. Another disadvantage is trying to support the full AbstractArray interface with this lazy structure, particularly with regard to mutating operations (push!, append!, etc.). The WeakRefStrings.jl package provides a structure that can mostly be used for this kind of task, but there are a few internal mismatches with how the positions/lengths are represented. The alternative of just materializing a full Vector{String} has the advantage of being extremely standard and easy to work with, but is expensive, because each string cell in the file must be copied/allocated, even if it never ends up getting used.
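A sketch of that lazy representation (LazyStringVector is hypothetical; WeakRefStrings.jl's actual layout differs, as noted above):

# Each "element" is just (byte position, length) into the original file
# buffer; the String is only allocated when someone indexes it.
struct LazyStringVector <: AbstractVector{String}
    buffer::Vector{UInt8}             # original file bytes, kept alive
    poslens::Vector{Tuple{Int, Int}}  # (start, length) for each cell
end

Base.size(v::LazyStringVector) = size(v.poslens)
function Base.getindex(v::LazyStringVector, i::Int)
    pos, len = v.poslens[i]
    return String(v.buffer[pos:pos + len - 1])  # materialize on demand
end

# usage over a fake "file" with two cells, "abc" and "de":
buf = Vector{UInt8}("abc,de\n")
col = LazyStringVector(buf, [(1, 3), (5, 2)])
col[2] == "de"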

The 3rd data structure question involves multi-threaded parsing: how best to return/represent columns when multiple threads are involved? As noted earlier, CSV.jl currently chunks up large files and lets each thread process a chunk separately: detecting types, guessing rows, the full parsing task. After each thread has finished, a "merge" operation is performed where columns will be promoted together and recoded, ensuring that each has a consistent type. CSV.jl currently defines its own CSV.Column2 type (admittedly not the greatest name 😜) that "chains" each thread's columns together into a single "column". This seems to work pretty well in practice, with iteration still being extremely efficient, but there's a gotcha if you try to do

for i = 1:length(column)
    x = column[i]
    # do stuff with x
end

That is, plain getindex operations like this are slower (O(log N) per lookup) because the structure has to determine which underlying chained array i belongs to; a sketch of why follows.
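Here is a rough sketch of the chained idea (ToyChainedVector is hypothetical; CSV.Column2's internals differ): each lookup does a binary search over cumulative chunk lengths, which is where the O(log N) comes from.

struct ToyChainedVector{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    cumlens::Vector{Int}  # cumulative chunk lengths, e.g. [3, 5, 6]
end

ToyChainedVector(chunks::Vector{Vector{T}}) where {T} =
    ToyChainedVector{T}(chunks, cumsum(length.(chunks)))

Base.size(v::ToyChainedVector) = (isempty(v.cumlens) ? 0 : v.cumlens[end],)
function Base.getindex(v::ToyChainedVector, i::Int)
    c = searchsortedfirst(v.cumlens, i)    # binary search: which chunk owns i?
    offset = c == 1 ? 0 : v.cumlens[c - 1]
    return v.chunks[c][i - offset]
end

# usage: behaves like one long vector over the three chunks
v = ToyChainedVector([[1, 2, 3], [4, 5], [6]])
v[5] == 5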
The most obvious alternative solution is to just vcat all the thread columns together, but my worry there is we’re essentially doubling the required memory for parsing, if only for the brief operation of appending columns together.

Anyway, these are some of the questions I’ve been sleeping on for the past few days; I don’t have firm commitments to choose one solution over the other, but with the other internal refactoring, I thought it would be good to think through the ideals and see if we can improve things somehow. My ulterior motive in writing this all up in a blog post is to hopefully elicit some discussion/ideas around ways we can accomplish some of the things I’ve discussed here. As always, feel free to ping me on Twitter or the #data julialang slack channel to chat.

Tables.jl 1.0 Release

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/02/12/tables-jl-1-0-release/

Since its inception on a couple of whiteboards at JuliaCon 2018 in London, the Tables.jl package has grown to become a foundational set of interfaces for data packages in Julia, even accumulating 74 direct dependents in the General registry as of this writing (Feb 2020)!

So why the 1.0 release? An official 1.0 release can signal stability and maturity, particularly in the case of an “interface package” like Tables.jl. With very little change in API since the package started, we figured it was time to send that signal out to the broader community: hey! this package can be pretty useful and we’re not going to break it (intentionally)! It also gives the opportunity to polish up a few APIs, iron out a few wrinkles, and cleanup a lot of documentation.

For those who aren't familiar, the Tables.jl package is about providing powerful interfaces for "table types" in all their shapes, sizes, and formats, so they can talk with each other seamlessly, and most importantly, without direct knowledge of each other! At the highest level, it provides the Tables.rows and Tables.columns functions, which cover the two most common access patterns to table data everywhere (row-by-row, or by entire columns). Table types become valid "sources" by implementing access to their data via Tables.rows or Tables.columns (or both!), and "sinks" can then operate on any source by calling Tables.rows or Tables.columns and processing the resulting data appropriately. A key feature for sinks is the "orientation agnostic" nature of Tables.rows and Tables.columns; i.e. sinks don't need to worry whether the source is row- or column-oriented by nature, since performant, generic fallbacks are provided both ways. That is, calling Tables.rows on a source that only defined Tables.columns falls back to a generic lazy row iteration over the input columns. And vice versa, calling Tables.columns on a source that only defined Tables.rows will process the input rows to build up columns. This works because it's work the sink would have had to do anyway; if the sink really needs columns to do its work, then it would have had to turn the rows into columns regardless, so Tables.jl provides that with the best, community-sourced implementation possible.
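For instance, here is a small sketch of a sink (columnsums is hypothetical, but Tables.columns, Tables.columnnames, and Tables.getcolumn are the real interface functions): it works on any source, row- or column-oriented, because the fallbacks described above kick in either way.

using Tables

# sum each column of any Tables.jl-compatible source
function columnsums(table)
    cols = Tables.columns(table)  # rows-only sources get converted for us
    return Dict(name => sum(Tables.getcolumn(cols, name))
                for name in Tables.columnnames(cols))
end

# a NamedTuple of vectors is the simplest built-in column table:
columnsums((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))
# Dict(:a => 6, :b => 15.0)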

Another core concept, now clarified in the 1.0 release, is the interface for accessing data on an individual row, as well as on a set of columns. It turns out that viewing a "row" as an ordered set of named column values, and "columns" as an ordered set of named columns, lends itself naturally to a common interface, simplified implementations, and the ability to provide powerful default functionality. Check out the new docs, particularly about Tables.AbstractRow and Tables.AbstractColumns, to learn more. You can also check out the Discourse announcement, which is geared a little more towards 1.0 release notes and an upgrade guide.
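A tiny example of the Tables.AbstractRow side (MyRow is hypothetical; the methods overloaded below are the documented required interface): subtyping AbstractRow buys you property access, value iteration, and printing for free.

using Tables

struct MyRow <: Tables.AbstractRow
    names::Vector{Symbol}
    vals::Vector{Any}
end

# note getfield: AbstractRow overloads getproperty for row.name access,
# so internal field access must bypass it
Tables.columnnames(r::MyRow) = getfield(r, :names)
Tables.getcolumn(r::MyRow, i::Int) = getfield(r, :vals)[i]
Tables.getcolumn(r::MyRow, nm::Symbol) =
    getfield(r, :vals)[findfirst(==(nm), getfield(r, :names))]

row = MyRow([:a, :b], [1, "two"])
row.b == "two"  # getproperty falls back to Tables.getcolumn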

A recent popular blog post highlighted the features of Julia that lend themselves to such widespread composability between packages, and Tables.jl is a powerful example of this. Through just a few simple APIs, it allows a DataFrame to be serialized to disk in the binary feather format, read back, converted to JSON to be sent over the web, and loaded into a mysql database. All without any of those packages knowing about each other, and without needing a single intermediary table type to convert between. Applications and processing tasks can feel free to leverage any number of in-memory representations, disk formats, and databases to handle table manipulations.
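A hedged sketch of that pipeline (the package names are real, but treat the exact calls as illustrative of the flow rather than a vetted recipe):

using DataFrames, Feather, JSONTables

df = DataFrame(a = 1:3, b = ["x", "y", "z"])
Feather.write("table.feather", df)   # DataFrame -> feather file on disk
tbl = Feather.read("table.feather")  # read back, as some other table type
json = arraytable(tbl)               # any Tables.jl source -> JSON string
# a database step (e.g. via MySQL.jl) would consume the same
# Tables.rows/Tables.columns interfaces; none of these packages know
# about each other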

Here’s to more integrations, more composability, and more table types in the future!