FemtoCleaner – A bot to automatically upgrade your Julia syntax

TL;DR: FemtoCleaner is a GitHub bot that upgrades old Julia syntax to new syntax. It has been installed
in more than 700 repositories, submitted 100+ pull requests and touched 10000 lines of code since
last Friday. Scroll down for instructions, screen shots and pretty plots.

Background

As julia is approaching its 1.0 release, we have been revisiting
several key areas of the language. We want to ensure that the 1.0
release is of sufficient quality that it can serve as a stable
foundation of the Julia ecosystem for many years to come without
requiring breaking changes. In effect, however, prioritizing such
breaking changes over ones that can be safely done in a non-breaking
fashion after 1.0 means that we are currently making many more
breaking changes than we otherwise might. Two particularly disruptive
such changes were the syntax changes to type keywords and parametric
function syntax, both of which were introduced in 0.6. The old syntax
is now deprecated on master will be removed in 1.0. The former change
involves changing the type definition keywords from type/immutable to
mutable struct/struct, e.g.

immutable RGB{T}
    r::T
    g::T
    b::T
end

becomes

struct RGB{T}
    r::T
    g::T
    b::T
end

The parametric function syntax change is a bit more tricky.
In the simplest case, it involves rewriting functions like:

eltype{T}(::Array{T}) = T

to

eltype(::Array{T}) where {T} = T

which is relatively straightforward. However, there are more complicated corner cases
involving inner constructors such as:

immutable Wrapper{T}
    data::Vector{T}
    Wrapper{S}(data::Vector{S}) = new(convert(Vector{T}, data))
end

which now has to read

struct Wrapper{T}
    data::Vector{T}
    Wrapper{T}(data::Vector{S}) where {T,S} = new(convert(Vector{T}, data))
end

This last example also shows why this syntax was changed. In prior versions of julia,
the braces syntax (F{T} for some F,T) was inconsistent between meaning parameter application
and introducing parameters for a method. Julia 0.6 features a significantly more powerful (and correct) type
system. At the same time, the F{T} syntax was changed to always mean parameter application
(modulo support for parsing the deprecated syntax for backwards compatibility of course),
reducing confusion and making it possible to more easily express some of the new method signatures
now made possible by the new type system improvements. For further information see also
Stefan’s Discourse post
and the 0.6 release notes.

Realizing the magnitude of the required change and the growing amount of Julia code that exists in the wild,
several julia contributors suggested on Discourse that we attempt to automate these syntax upgrades.
Unfortunately, is not simply a of search/replace in a source file. The rewrite can be quite complex
and depends on the scope in which it is used. Nevertheless, we set out to build such an automated system, with the following goals in mind:

  • Correctness – Being able to upgrade syntax is not very useful if we have to go in and clean up after the automated process’ mistakes,
    it probably would have been faster to just do it ourselves in the first place.
  • Style preservation – Many programmers carefully write their code in their own preferred style and we should try hard to preserve
    such choices whenever possible (otherwise people might not want to use the tool)
  • Convenience – Ideally no setup would be required to use the tool

CSTParser

The first goal, correctness, forces us to use a proper parser for our undertaking, rather than
relying on simple find/replace or regular expressions. Unfortunately, while julia’s parser is
accessible from within the language and can be used to find these instances of deprecated syntax,
it cannot be used for our purposes. This is because it does not support our second goal – style preservation.
In going from the input text to the Abstract Syntax Tree, the parser discards a significant amount
of semantically-irrelevant information (formatting, comments, distinctions between different syntax
forms that are otherwise semantically equivalent). Instead, we need a parser that retains and exposes
all of this information. There are several names of this concept, “round-tripable representation”,
“Concrete Syntax Tree (CST)” or “Lossless Syntax Tree” being perhaps the most common. Luckily,
in the Julia ecosystem we have not one, but two choices for such a parser:

  • JuliaParser.jl – a slightly older translation of the scheme parser from the main julia codebase into Julia,
    later retrofitted with precise location information.
  • CSTParser.jl – a ground up rewrite of the parser with the explicit goal of writing a high performance, correct,
    lossless parser, originally for use in the VS Code IDE extension

Ultimately the decision came down to the fact that CSTParser.jl was actively maintained, while JuliaParser.jl had
not yet been updated to the new Julia syntax. With a number of small enhancements and additional features I
contributed in order to make it useful for this project, CSTParser is now able to parse essentially all publicly
available Julia code correctly, while retaining the needed formatting information.

The design of CSTParser.jl is somewhat similar to that of the Roslyn parser (a good overview can be found here). Each leaf node in the AST stores
only its total size (but not its absolute position in the file), as well as what part of its contents are semantically significant
as opposed to leading or trailing trivia (comments, whitespace, semicolons etc). This is useful for the IDE use case,
since it allows efficient reparsing when small changes are made to a file (since a local change does not invalidate any
data in a far away node). The resulting tree can be a little awkward to work with, but as we shall see it is easy to work
around this for our use case.

Deprecations.jl

The new Deprecations.jl package is the heart of this project. It contains all the
logic to rewrite Julia code making use of deprecated syntax constructs. It supports two modes of specifying such rewrites:

  • Using CST matcher templates
  • By working with the raw CST api
    Independent of the mode, a new rewrite is introduced as such:

      struct OldStructSyntax; end
      register(OldStructSyntax, Deprecation(
          "The type-definition keywords (type, immutable, abstract) where changed in Julia 0.6",
          "julia",
          v"0.6.0",
          v"0.7.0-DEV.198",
          typemax(VersionNumber)
      ))
    

    which gives a description, as well as some version bounds. This is important because we need to make sure
    to only apply rewrites that are compatible with the package’s declared minimum supported julia version
    (i.e. we need to make sure not to introduce julia 0.6 syntax to a package that still supports julia 0.5).
    Each Julia package provides a REQUIRE file specifying it’s supported minimum versions.

Having declared the new rewrite, let’s actually make it work by adding some CST matcher templates to it:

    match(OldStructSyntax,
        "immutable \$name\n\$BODY...\nend",
        "struct\$name\n\$BODY!...\nend",
        format_paramlist
    )
    match(OldStructSyntax,
        "type \$name\n\$BODY...\nend",
        "mutable struct\$name\n\$BODY!...\nend",
        format_paramlist
    )

The way this works is fairly straightforward. For each match call, the first line is the template to
match and the second is its replacement. Under the hood, this works by parsing both expressions, pattern
matching the resulting template tree against the tree we want to update and then splicing in the replacement
tree (with the appropriate parameters taken from the tree we’re matching against). The whole thing is implemented
in about 200 lines of code.

In this description I’ve skipped a bit of magic. Simply splicing together a new tree of CST nodes, doesn’t quite
work. As mentioned above the CST nodes only know their kind and size and very little else. In particular,
they know neither their position in the original buffer, nor what text is at that position. Instead, the replacement
tree is made out of different kind of node that retains both pieces of information (which the original buffer is
and where in the buffer that node is located). Conceptually this is again similar to Roslyn’s red-green trees. However, there
is very little code
associated with this abstraction. Most of the functionality is provided by the AbstractTrees.jl package by lifting the tree structure of the underlying CST nodes.

Lastly, there’s a couple of other node types to be found in this “replacement tree” to insert or replace
whitespace or other trivia. This is useful for formatting purposes. E.g. the example above, we passed format_paramlist
as a formatter. This function runs after the matching and adjusts formatting. To see this consider:

immutable RGB{RT,
              GT,
              BT}
    r::RT
    g::GT
    b::BT
end

Naively, this would end up as

struct RGB{RT,
              GT,
              BT}
    r::RT
    g::GT
    b::BT
end

leaving us with unhappy users. Instead, the formatter turns this into

struct RGB{RT,
           GT,
           BT}
    r::RT
    g::GT
    b::BT
end

by adjusting the leading trivia of the GT and BT nodes (or rather the trailing trivia of their predecessors).

Lastly, while the CST templates shown above are powerful, they are still limited to simple pattern matching.
Sometimes we need to perform more complicated kinds of transformation to decide which rewrites to perform.
One example is code like:

if VERSION > v"0.5"
    do_this()
else
    do_that()
end

which, depending on the current julia version, executes either one branch or the other. Of course, if
the package declares that it requires julia 0.6 at a minimum, the condition is true for any supported
julia version, so we can “constant fold” the expression and remove the else branch. Doing so with simple
templates is infeasible, since we need to recognize all patterns of the form “comparison of VERSION against
some version number” and then compute whether the condition is actually always true (or false) given the declared
version bounds. Such transformations are possible using the raw API. Writing such transformations is more complicated
(and beyond the scope of this post), but can be very powerful.

FemtoCleaner

Having addressed the first two goals, let’s get to the third goal – convenience. The vast majority of public Julia code
is hosted on GitHub, so the natural way to do this is create a GitHub bot that clones a repository, applies the rewrites
and submits a pull request to the relevant repository. The simplest way to do would be to clone all the repositories,
apply the rewrites, and then programmatically submit pull requests to all of them (the PkgDev.jl packages has a function
to automatically submit a pull request against a Julia package). However, this approach falls short for several reasons:

  • It’s very manual. When new features are added, we have to manually perform a new such run. This is also problematic,
    because in practice it means that these runs have to always be done by the person who knows how the setup works. He’s
    a very busy guy.
  • It would only catch registered Julia packages. There are a significant number of repositories that use Julia code,
    but are not registered Julia packages. Of course one could go the other way and submit pull requests to repositories
    that look like Julia code, but that risks creating a significant number of useless pull request (because of forks,
    long dead codebases, etc)
  • It wouldn’t work on private packages
  • It doesn’t allow the user to control and interact with the process

A better alternative that addresses all these problems is to create a GitHub bot (also called a GitHub app) to perform these functions. The
Julia community is quite familiar with these already. We have the venerable nanosoldier, which performs on-demand performance benchmarking of julia commits, attobot which assists Julia users in registering their packages with METADTA and (perhaps less well known) jlbuild which controls the julia buildbots (which build releases and perform some additional continous integration on secondary platforms).

Joining these now is femtocleaner (phew that took a while to get to – I hope the background above was useful though), which performs exactly this function. Let’s see how it works. First go to https://github.com/apps/femtocleaner and click “Configure”. You’ll be presented with a
choice of accounts to install femtocleaner into:

Choosing an account will give you the option to install femtocleaner on either one or all of
the repositories in that account:

In this case, I will install femtocleaner in all repositories of the JuliaParallel organization.
Without any further ado, femtocleaner will go to work, cloning each repository, applying
the rewrites it knows about and then submitting a pull request to each repository where it was
able to make a change:

From now on, FemtoCleaner will listen to events on these repositories and submit another pull
request whenever these packages decide to drop support for an older julia version, thus allowing
the bot remove more deprecated syntax. The bot can also be triggered manually by opening an
issue with the title “Run femtocleaner”.

The bot has a few additional features meant to make interacting with it easier. The most used one
is the bad bot command, which is used to inform the developers that the bot has made a mistake.
It can be triggered by simply creating a “Changes Requested” GitHub PR review, and annotating an incorrect
change with the review comment bad bot, like so:

In response the bot will open an issue on its source code repository giving the relevant context
and linking back to the PR:

Enabling this functionality right from the the pull request review window has proven very powerful.
Rather than requiring the user to leave their current context (reviewing a pull request) and navigate
to a different repository to file an issue, everything can be done right there in the pull request
review. Lastly, once the rewrite bug has been addressed, the bot will come back, update the pull
request and leave a comment to inform the user it did so:

This workflow is also very convenient from the other side. All the issues are in one place (rather
than having to monitor activity on all pull requests filed by the bug) and addressing the bug is
as simple as fixing the code and pushing it to the source code repository. The bot will automatically,
update itself and go back and fix up any pull requests that would now differ as a result of the new code:

Results

The whole project from the first line of code written in support of it until this blog post (which
represents its completion) took about three weeks. As part of it, I made a number of changes
to CSTParser and its dependencies (which should prove very useful for future parsing endeavors) as
well as GitHub.jl (which will hopefully help write more of these kinds of bots to support the Julia
community). After some initial testing and an alpha run on JuliaStats on Aug 8 (huge thanks to Alex Arslan for aggreeing to diligently review and try out the process), we announced the public availability of
femtocleaner on discourse last friday (Aug 11). Since then, the bot has been installed on 759 repositories (though about 200 of them were ineligible for femtocleaner processing, either because they had missing or malformed REQUIRE files or because they were not actually Julia packages), submitting 132 pull requests that add 8850 lines and
delete 9331. Most of these pull requests have been merged:

As people started using femtocleaner, a number of issues were discovered, but developers took advantage
of the bad bot mechanism described above to report them and we did our best to address them quickly.
The following graph shows the number of open/closed such issues over the time period that femtocleaner has
been active:

Alex Arslan’s original testing on Aug 8 is well visible (and took a few days to catch up to), but all known
issues have been addressed. Another interesting data point is the distribution of supported julia versions
that femtocleaner was installed on. As discussed above, it was primarily written to aid in moving to the new
syntax available in julia 0.6, though a few rewrites (such as the generic VERSION comparisons) are also applicable to older versions. The following shows the number of repositories as well as the number of open
prs by minimum supported julia version (no pr opened means that the bot found no deprecated syntax):

As expected, packages supporting 0.6.0 got proportionally the most pull requests. However, this just means that
femtocleaner will be back for the remaining 0.5.0 packages once they decide to drop support for 0.5.0.
We can also look at the number of changed lines by the package’s supported minimum version:

Again the bias of the bot for upgrading 0.6 syntax stands out. It is perhaps interesting to note that
most of the 0.6 packages with a small to medium number of changes had already been upgraded manually to
the new syntax. Still, the bot was able to find a few changes that were missed in this process and clean
them up automatically.

Conclusions

Overall, this work should accelerate the movement of the package ecosystem
towards 1.0 by making upgrading code easier. Generally, the package ecosystem lags
behind the julia release by a few months as package maintainers upgrade their code bases.
We hope this system will help make sure that 1.0 is released with a full set of up-to-date
packages, as well as ease the burden on package maintainers, allowing them to spend their time
on improving their packages instead of being forced to spend a lot of time performing tedious
syntax upgrades. We are very happy with the outcome of this work. There are already almost ten thousand
fewer deprecation warnings across the Julia ecosystem and more will be removed automatically once
the package developers are ready for it. Additionally, the underlying technology should help
with a number of other developer-productivity tools and improvements, such as IDE support, better
error messages and the debugger. All code is open source and available on GitHub.
You are welcome to contribute, improve the code or build your own GitHub bots.

We would like to thank GitHub for providing a rich enough API to allow this convenient workflow.

Lastly, we thank and acknowledge the Sloan foundation for their continued supported of the Julia
ecosystem by providing the funding for this work.