Local packages in a separate directory in Julia

Re-posted from: http://tpapp.github.io/post/julia-local-test/

I run Pkg.update() fairly often to stay up to date and benefit from the latest improvements of various packages. I rarely pin a package to a specific version, but I occasionally check out master for some packages, especially if I am contributing.
I found that the documentation subtly diverged from what I was experiencing for some packages. After looking into the issue, I learned that I was 2–3 minor versions behind despite updating regularly.

Conducting analysis on LendingClub data using JuliaDB

If there is anything years of research on how to streamline data analytics has shown us, it is that working with big data is no cakewalk. No matter how one looks at it, it is time-consuming and computationally intensive to create, maintain, and build models based upon large datasets.

In Julia v0.6, we aim to take another step towards solving this problem with our new package, JuliaDB.

JuliaDB is a high performance, distributed, column-oriented data store providing functionality for both in-memory and out-of-core calculations. Being fully implemented in Julia, JuliaDB allows for ease of integration with data loading, analytics, and visualization packages throughout the Julia language ecosystem. Such seamless integration allows for rapid development of data and compute intensive applications.

This example shall use datasets provided by LendingClub, the world’s largest online marketplace for connecting borrowers and investors. On their website, they provide publicly available, detailed datasets that contain anonymous data regarding all loans that have been issued through their system, including the current loan status and latest payment information.

The analysis conducted below is similar to that performed on the same datasets in this post by the Microsoft R Server Tiger Team for the Azure Data Science VM.

The first step in conducting this analysis is to download the following files from the website: LoanStats2016_Q1.csv, LoanStats2016_Q2.csv, LoanStats2016_Q3.csv, LoanStats2016_Q4.csv, LoanStats2017_Q1.csv, LoanStats3b.csv, LoanStats3c.csv and LoanStats3d.csv. A basic clean-up of the data files is performed by deleting the first and last line of descriptive text from each csv.
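The clean-up step can also be scripted. The following is a minimal sketch for current Julia (1.x) rather than 0.6; the helper name strip_first_last_line and the sample file contents are made up for illustration, and it assumes the descriptive text really is exactly one line at the top and one at the bottom of each file.

```julia
# Remove the first and last line of a text file and write the rest to `dest`.
# `strip_first_last_line` is a hypothetical helper, not part of any package.
function strip_first_last_line(src::AbstractString, dest::AbstractString)
    lines = readlines(src)
    # Guard against very short files so the range below stays valid.
    kept = length(lines) > 2 ? lines[2:end-1] : String[]
    open(dest, "w") do io
        for line in kept
            println(io, line)
        end
    end
    return length(kept)
end

# Usage on a small stand-in file (invented contents, not real LendingClub data)
tmp = mktempdir()
raw = joinpath(tmp, "raw.csv")
clean = joinpath(tmp, "clean.csv")
write(raw, "Some descriptive header text\nid,loan_amnt\n1,1000\n2,2500\nSome descriptive footer text\n")
n_kept = strip_first_last_line(raw, clean)
```

Writing to a separate destination file keeps the downloaded originals untouched, which makes it easy to redo the step if the assumption about the file layout turns out to be wrong.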

Writing the Julia code

Once the file clean-up is done, add the following packages: JuliaDB, TextParse, IndexedTables, NullableArrays, DecisionTree, CoupledFields, Gadfly, Cairo, Fontconfig, Dagger, and Compose, followed by loading the required ones.

# Packages that need to be installed with Julia 0.6
Pkg.clone("https://github.com/AndyGreenwell/ROC.jl.git")
using Dagger, Compose
import TextParse: Numeric, NAToken, CustomParser, tryparsenext, eatwhitespaces, Quoted, Percentage


Now define a variable holding the path to the directory that contains the data files, and a dictionary whose keys are the names of the columns in the dataset and whose values are the parsers to use for them.

dir = "/home/venkat/LendingClubDemo/files"
const floatparser = Numeric(Float64)
const intparser = Numeric(Int)

t = Dict(
    "id"                             => Quoted(Int),
    "member_id"                      => Quoted(Int),
    "loan_amnt"                      => Quoted(Nullable{Float64}),
    "funded_amnt"                    => Quoted(Nullable{Float64}),
    "funded_amnt_inv"                => Quoted(Nullable{Float64}),
    "term"                           => Quoted(TextParse.StrRange),
    "int_rate"                       => Quoted(NAToken(Percentage())),
    "delinq_2yrs"                    => Quoted(Nullable{Int}),
    "earliest_cr_line"               => Quoted(TextParse.StrRange),
    "inq_last_6mths"                 => Quoted(Nullable{Int}),
    # ...and so on
)


Calling the function “loadfiles” from the JuliaDB package parses the data files, and constructs the corresponding table (providing the above dictionary as input helps it construct the table, although it doesn’t necessarily need this input). Since none of the dictionary columns are index columns, JuliaDB will itself create its own implicit index column with each row having a unique integer value, starting with 1.

LS = loadfiles(glob("*.csv", dir), indexcols=[], colparsers=t, escapechar='"')


Once done, we classify some loans as bad loans and others as good loans based upon whether the payment on the loan is late, in default, or has been charged off. We then split the table based upon whether the loans are good or bad.

bad_status = ("Late (16-30 days)","Late (31-120 days)","Default","Charged Off")
# Determine which loans are bad loans
is_bad = map(x->(x in bad_status), getdatacol(LS, :loan_status)) |> collect |> Vector{Bool}
# Split the table into two based on the loan classification
LStrue = filter(x->x.loan_status in bad_status, LS)
LSfalse = filter(x->!(x.loan_status in bad_status), LS)
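To see the classification rule in isolation, here is a small sketch on a plain vector instead of a JuliaDB table, written for current Julia (1.x). The sample status values are made up for illustration; only bad_status comes from the code above.

```julia
bad_status = ("Late (16-30 days)", "Late (31-120 days)", "Default", "Charged Off")

# Stand-in for the loan_status column (illustrative values only)
statuses = ["Current", "Charged Off", "Fully Paid", "Default", "Late (31-120 days)"]

# A loan is "bad" exactly when its status is one of the bad_status entries
is_bad = [s in bad_status for s in statuses]

# Split into the two groups with a boolean mask and its negation
bad_loans  = statuses[is_bad]
good_loans = statuses[.!is_bad]
```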


Constructing a relevant model requires that we identify which factors best distinguish good loans from bad ones. Here, the feature selection method that we use is a graphical comparison based upon how each numerical column’s row values are associated with either a good or bad categorization of individual loans. We construct two density plots of the values contained in each numerical column, one for good loans and the other for bad. This process requires that we first figure out which columns are numerical. We do that by using the following set of “isnumeric” functions.

# Define a function for determining if a value is numeric, whether or not the
# value is a Nullable.
isnumeric(::Number) = true
isnumeric{T<:Number}(::Nullable{T}) = true
isnumeric(::Any) = false
isnumeric{T<:Number}(x::Quoted{T}) = true
isnumeric{T<:Nullable}(x::Quoted{T}) = eltype(T) <: Number


We then map our isnumeric function over each column of the JuliaDB table, construct Gadfly layers for each density plot for the good and bad loans, and then display that collection for feature selection.

# Produce density plots of the numeric columns based on the loan classification
varnames = map(Symbol, collect(keys(filter((k,v)->(k != "id" && k!="member_id" && isnumeric(v)), t))))

layers = []  # collects the Gadfly layers: bad loans, then good loans, per variable
for s in varnames
    nt = dropnull(collect(getdatacol(LStrue,s)))
    nf = dropnull(collect(getdatacol(LSfalse,s)))
    push!(layers, layer(x = nt, Geom.density, Theme(default_color=colorant"blue")))
    push!(layers, layer(x = nf, Geom.density, Theme(default_color=colorant"red")))
end

# Layout the individual plots on a 2D grid
N = length(varnames)
M = round(Int,ceil(sqrt(N)))
cs = Array{Compose.Context}(M,M)
for i = 1:N
Guide.title(string(varnames[i])),
Guide.xlabel("value"),Guide.ylabel("density")))
end
for i = N+1:M^2
end
draw(PNG("featureplot.png",24inch, 24inch), gridstack(cs))


The Gadfly plots would typically look like this:

To keep our analysis as close as possible to the one conducted by Microsoft, we’ll select the same set of predictor variables that they did:

revol_util, int_rate, mths_since_last_record, annual_inc_joint, dti_joint,
total_rec_prncp, all_util


Creating the predictive model

Our predictive model will be created using the random forest model of the DecisionTree.jl package. There are two steps here: first, we use a large portion of the data to construct the model; second, we use a smaller portion to test it. So we randomly split the data into two parts, one containing 75% of the data points, used for training the model, and the other containing the remaining 25%, used for testing it.

# Split the data into 75% training / 25% test
n = length(LS)
srand(1)
p = randperm(n)
m = round(Int,n*3/4)
a = sort(p[1:m])
b = sort(p[m+1:end])
LStrain = LS[a]
LStest  = LS[b]
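The same index-based split can be tried on its own. This is a minimal sketch for current Julia (1.x), where randperm lives in the Random standard library and srand has been replaced by seeded RNG objects; the function name split_indices is made up for illustration.

```julia
using Random

# Split the row indices 1:n into a training part (fraction `frac`) and a test part.
# `split_indices` is a hypothetical helper name.
function split_indices(n::Integer, frac::Real; seed::Integer = 1)
    rng = MersenneTwister(seed)      # local RNG so the split is reproducible
    p = randperm(rng, n)             # random permutation of 1:n
    m = round(Int, n * frac)         # number of training rows
    return sort(p[1:m]), sort(p[m+1:end])
end

a, b = split_indices(20, 3/4)
```

Because both index vectors come from one permutation, every row lands in exactly one of the two parts, and the same vectors can later be reused to subset any other column that is aligned with the table, such as a label vector.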


The random forest model needs two inputs: a vector of labels and the corresponding feature matrix. For the label vector, we reuse the index vector that extracted the training subset of the original data above, this time to extract the corresponding subset of the is_bad label vector. For the feature matrix, we extract the columns for our selected features from the distributed JuliaDB table, gather those columns to the master process, and finally concatenate the resulting vectors into our feature matrix.

features_train = [revol_util_train int_rate_train mths_since_last_record_train annual_inc_joint_train dti_joint_train total_rec_prncp_train all_util_train]


Having done this, we can now call the “build_forest” function from the DecisionTree.jl package.

model = build_forest(labels_train, features_train, 3, 10, 0.8, 6)


Should we want to save our model to reuse at a later time, we can store it to our disk.

f = open("loanmodel.jls", "w")
serialize(f, model)
close(f)
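For completeness, here is a sketch of the full save-and-restore round trip in current Julia (1.x), where serialize and deserialize live in the Serialization standard library. The dictionary is only a stand-in for the trained model, and the file is written to a temporary directory.

```julia
using Serialization

# A stand-in for the trained model; any serializable Julia value works here.
model_stub = Dict("name" => "loanmodel", "trees" => 10)

path = joinpath(mktempdir(), "loanmodel.jls")

# The do-block form closes the file even if serialization throws.
open(path, "w") do io
    serialize(io, model_stub)
end

# Read the value back from disk
restored = open(deserialize, path)
```

Note that serialized files are generally only safe to read back with the same Julia and package versions that wrote them.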


We can now test our model on the rest of the data. To do this, we will generate predictions in parallel across all workers by mapping the “apply_forest” function onto every row of the JuliaDB dataset.

predictions = collect(map(row->DecisionTree.apply_forest(model, [row.revol_util.value; row.int_rate.value;row.mths_since_last_record.value;row.annual_inc_joint.value;row.dti_joint.value;row.total_rec_prncp.value;row.all_util.value]), LStest)).data


With our set of predictions, we construct a ROC curve using the ROC.jl package and calculate the area under the curve to find a single measure of how predictive our trained model is on the dataset.

# Receiver Operating Characteristics curve

# An ROC plot in Gadfly with data calculated using ROC.jl
Gadfly.plot(layer(x = curve.FPR,y = curve.TPR, Geom.line),
layer(x = linspace(0.0,1.0,101), y = linspace(0.0,1.0,101),
Geom.point, Theme(default_color=colorant"red")), Guide.title("ROC"),
Guide.xlabel("False Positive Rate"),Guide.ylabel("True Positive Rate"))


The ROC would look like this.

Area under the curve would be:

# Area Under Curve
AUC(curve)
0.5878135617540067
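To make the meaning of that number concrete, the sketch below computes ROC points and the area under them by hand for a tiny made-up dataset, using current Julia (1.x) and no packages. It is not how ROC.jl is implemented; the function names and the toy scores are assumptions for illustration, and it assumes both classes are present in the labels.

```julia
# Sweep a threshold from the highest score downward and record the
# (false positive rate, true positive rate) after each distinct score.
function roc_points(scores::Vector{Float64}, labels::Vector{Bool})
    order = sortperm(scores; rev = true)
    P = count(labels)                # number of positives (bad loans)
    N = length(labels) - P           # number of negatives (good loans)
    fpr = [0.0]
    tpr = [0.0]
    tp = 0
    fp = 0
    for (k, i) in enumerate(order)
        if labels[i]
            tp += 1
        else
            fp += 1
        end
        # Only emit a point once all tied scores have been counted.
        if k == length(order) || scores[order[k+1]] != scores[i]
            push!(fpr, fp / N)
            push!(tpr, tp / P)
        end
    end
    return fpr, tpr
end

# Area under the piecewise-linear curve (trapezoidal rule)
auc_trapezoid(fpr, tpr) =
    sum((fpr[k+1] - fpr[k]) * (tpr[k+1] + tpr[k]) / 2 for k in 1:length(fpr)-1)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [true, true, false, true, false, false]
fpr, tpr = roc_points(scores, labels)
auc = auc_trapezoid(fpr, tpr)
```

An area of 0.5 corresponds to the red diagonal in the plot (random guessing) and 1.0 to a perfect ranking, so a value near 0.59 indicates a model that is only modestly better than chance.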


There. That is how you would create a model that can predictively determine the quality of a loan using JuliaDB.

Re-posted from: http://juliacomputing.com/blog/2017/08/17/femtocleaner.html

TL;DR: FemtoCleaner is a GitHub bot that upgrades old Julia syntax to new syntax. It has been installed
in more than 700 repositories, submitted 100+ pull requests and touched 10000 lines of code since
last Friday. Scroll down for instructions, screen shots and pretty plots.

Background

As julia is approaching its 1.0 release, we have been revisiting
several key areas of the language. We want to ensure that the 1.0
release is of sufficient quality that it can serve as a stable
foundation of the Julia ecosystem for many years to come without
requiring breaking changes. In effect, however, prioritizing such
breaking changes over ones that can be safely done in a non-breaking
fashion after 1.0 means that we are currently making many more
breaking changes than we otherwise might. Two particularly disruptive
such changes were the syntax changes to type keywords and parametric
function syntax, both of which were introduced in 0.6. The old syntax
is now deprecated on master and will be removed in 1.0. The former change
involves changing the type definition keywords from type/immutable to
mutable struct/struct, e.g.

immutable RGB{T}
    r::T
    g::T
    b::T
end


becomes

struct RGB{T}
    r::T
    g::T
    b::T
end


The parametric function syntax change is a bit more tricky.
In the simplest case, it involves rewriting functions like:

eltype{T}(::Array{T}) = T


to

eltype(::Array{T}) where {T} = T


which is relatively straightforward. However, there are more complicated corner cases
involving inner constructors such as:

immutable Wrapper{T}
    data::Vector{T}
    Wrapper{S}(data::Vector{S}) = new(convert(Vector{T}, data))
end

becomes

struct Wrapper{T}
    data::Vector{T}
    Wrapper{T}(data::Vector{S}) where {T,S} = new(convert(Vector{T}, data))
end


This last example also shows why this syntax was changed. In prior versions of julia,
the braces syntax (F{T} for some F,T) was inconsistent between meaning parameter application
and introducing parameters for a method. Julia 0.6 features a significantly more powerful (and correct) type
system. At the same time, the F{T} syntax was changed to always mean parameter application
(modulo support for parsing the deprecated syntax for backwards compatibility of course),
reducing confusion and making it possible to more easily express some of the new method signatures.
For more information, see Stefan’s Discourse post
and the 0.6 release notes.
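Both new forms can be tried directly in a current Julia (1.x) session. In this small sketch, myeltype is a stand-in name chosen so that Base.eltype is left untouched.

```julia
# Parametric method: `T` is introduced by `where`, not by braces after the name.
myeltype(::Array{T}) where {T} = T

struct Wrapper{T}
    data::Vector{T}
    # `Wrapper{T}` applies the struct's parameter; `S` is the method's own parameter.
    Wrapper{T}(data::Vector{S}) where {T,S} = new(convert(Vector{T}, data))
end

# The caller picks T explicitly; S is inferred from the argument (here Int).
w = Wrapper{Float64}([1, 2, 3])
```

Here the braces in Wrapper{Float64} unambiguously mean parameter application, while the method's own parameters appear only in the where clause.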

Realizing the magnitude of the required change and the growing amount of Julia code that exists in the wild,
several julia contributors suggested on Discourse that we attempt to automate these syntax upgrades.
Unfortunately, this is not simply a matter of search/replace in a source file. The rewrite can be quite complex
and depends on the scope in which it is used. Nevertheless, we set out to build such an automated system, with the following goals in mind:

• Correctness – Being able to upgrade syntax is not very useful if we have to go in and clean up after the automated process’ mistakes;
it would probably have been faster to just do it ourselves in the first place.
• Style preservation – Many programmers carefully write their code in their own preferred style and we should try hard to preserve
such choices whenever possible (otherwise people might not want to use the tool)
• Convenience – Ideally no setup would be required to use the tool

CSTParser

The first goal, correctness, forces us to use a proper parser for our undertaking, rather than
relying on simple find/replace or regular expressions. Unfortunately, while julia’s parser is
accessible from within the language and can be used to find these instances of deprecated syntax,
it cannot be used for our purposes. This is because it does not support our second goal – style preservation.
In going from the input text to the Abstract Syntax Tree, the parser discards a significant amount
of semantically-irrelevant information (formatting, comments, distinctions between different syntax
forms that are otherwise semantically equivalent). Instead, we need a parser that retains and exposes
all of this information. There are several names for this concept, “round-tripable representation”,
“Concrete Syntax Tree (CST)” or “Lossless Syntax Tree” being perhaps the most common. Luckily,
in the Julia ecosystem we have not one, but two choices for such a parser:

• JuliaParser.jl – a slightly older translation of the scheme parser from the main julia codebase into Julia,
later retrofitted with precise location information.
• CSTParser.jl – a ground up rewrite of the parser with the explicit goal of writing a high performance, correct,
lossless parser, originally for use in the VS Code IDE extension

Ultimately the decision came down to the fact that CSTParser.jl was actively maintained, while JuliaParser.jl had
not yet been updated to the new Julia syntax. With a number of small enhancements and additional features I
contributed in order to make it useful for this project, CSTParser is now able to parse essentially all publicly
available Julia code correctly, while retaining the needed formatting information.

The design of CSTParser.jl is somewhat similar to that of the Roslyn parser (a good overview can be found here). Each leaf node in the AST stores
only its total size (but not its absolute position in the file), as well as what part of its contents are semantically significant
as opposed to leading or trailing trivia (comments, whitespace, semicolons etc). This is useful for the IDE use case,
since it allows efficient reparsing when small changes are made to a file (since a local change does not invalidate any
data in a far away node). The resulting tree can be a little awkward to work with, but as we shall see it is easy to work
around this for our use case.
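The idea of storing sizes instead of positions can be illustrated with a toy tree. The sketch below (current Julia 1.x) does not use CSTParser's actual types; the Node type, its field names, and the restriction to trailing trivia are assumptions made for illustration. Absolute positions are recovered by accumulating sizes while walking the tree.

```julia
# Toy stand-in for a lossless syntax tree node (not CSTParser's real type).
struct Node
    kind::Symbol
    fullspan::Int            # total bytes covered, including trailing trivia
    span::Int                # bytes that are semantically significant
    children::Vector{Node}
end
Node(kind, fullspan, span) = Node(kind, fullspan, span, Node[])

# Walk the tree, accumulating fullspans to recover each leaf's absolute byte range.
function leaf_ranges(node::Node, offset::Int = 0, out = Tuple{Symbol,UnitRange{Int}}[])
    if isempty(node.children)
        push!(out, (node.kind, offset+1:offset+node.span))
    else
        for child in node.children
            leaf_ranges(child, offset, out)
            offset += child.fullspan   # the next sibling starts after this one
        end
    end
    return out
end

src = "x  = 1 # one"
tree = Node(:assignment, 12, 6, [
    Node(:identifier, 3, 1),   # "x" plus two spaces
    Node(:operator,   2, 1),   # "=" plus one space
    Node(:literal,    7, 1),   # "1" plus " # one"
])
ranges = leaf_ranges(tree)
```

Editing the comment at the end of the line only changes the fullspan of the last leaf and its ancestors; no other node has to be touched, which is what makes this layout attractive for incremental reparsing.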

Deprecations.jl

The new Deprecations.jl package is the heart of this project. It contains all the
logic to rewrite Julia code making use of deprecated syntax constructs. It supports two modes of specifying such rewrites:

• Using CST matcher templates
• By working with the raw CST API

Independent of the mode, a new rewrite is introduced as such:

struct OldStructSyntax; end
register(OldStructSyntax, Deprecation(
    "The type-definition keywords (type, immutable, abstract) were changed in Julia 0.6",
    "julia",
    v"0.6.0",
    v"0.7.0-DEV.198",
    typemax(VersionNumber)
))


which gives a description, as well as some version bounds. This is important because we need to make sure
to only apply rewrites that are compatible with the package’s declared minimum supported julia version
(i.e. we need to make sure not to introduce julia 0.6 syntax to a package that still supports julia 0.5).
Each Julia package provides a REQUIRE file specifying its supported minimum versions.
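The version check can be sketched in a few lines of current Julia (1.x). This is not the Deprecations.jl implementation: the Rewrite type, its available_since field, and the two helper functions are hypothetical names, and the REQUIRE parsing only handles the simple "julia 0.6" form.

```julia
# Hypothetical record: the first Julia version in which the replacement syntax works.
struct Rewrite
    name::String
    available_since::VersionNumber
end

# Read the minimum Julia version from REQUIRE-style text such as "julia 0.6".
function minimum_julia(require_text::AbstractString)
    for line in split(require_text, '\n')
        fields = split(strip(line))
        if length(fields) >= 2 && fields[1] == "julia"
            return VersionNumber(fields[2])
        end
    end
    return nothing   # no julia line found: treat the package as ineligible
end

# A rewrite is safe only if every supported Julia version understands the new syntax.
applicable(rw::Rewrite, minver::VersionNumber) = minver >= rw.available_since
applicable(rw::Rewrite, ::Nothing) = false

struct_rewrite = Rewrite("OldStructSyntax", v"0.6.0")
ok_on_06 = applicable(struct_rewrite, minimum_julia("julia 0.6\nCompat 0.30.0"))
ok_on_05 = applicable(struct_rewrite, minimum_julia("julia 0.5"))
```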

Having declared the new rewrite, let’s actually make it work by adding some CST matcher templates to it:

match(OldStructSyntax,
    "immutable \$name\n\$BODY...\nend",
    "struct\$name\n\$BODY!...\nend",
    format_paramlist
)
match(OldStructSyntax,
    "type \$name\n\$BODY...\nend",
    "mutable struct\$name\n\$BODY!...\nend",
    format_paramlist
)


The way this works is fairly straightforward. For each match call, the first line is the template to
match and the second is its replacement. Under the hood, this works by parsing both expressions, pattern
matching the resulting template tree against the tree we want to update and then splicing in the replacement
tree (with the appropriate parameters taken from the tree we’re matching against). The whole thing is implemented
in about 200 lines of code.
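The match-and-splice idea can be sketched on Julia's ordinary Expr trees. This is a simplification for illustration, written for current Julia (1.x): it works on the lossy AST rather than a CST, so it discards formatting, and the underscore placeholder convention and all function names are assumptions of this sketch rather than the Deprecations.jl API.

```julia
# Placeholders are symbols starting with an underscore, e.g. :_x
# (a convention chosen for this sketch only).
isplaceholder(s) = s isa Symbol && startswith(string(s), "_")

# Try to match `pattern` against `ex`, recording placeholder bindings.
function match_template!(bindings::Dict{Symbol,Any}, pattern, ex)
    if isplaceholder(pattern)
        haskey(bindings, pattern) && return bindings[pattern] == ex
        bindings[pattern] = ex
        return true
    elseif pattern isa Expr && ex isa Expr
        pattern.head == ex.head || return false
        length(pattern.args) == length(ex.args) || return false
        return all(match_template!(bindings, p, e) for (p, e) in zip(pattern.args, ex.args))
    else
        return pattern == ex
    end
end

# Build the replacement by substituting the bindings into the template.
function substitute(template, bindings)
    isplaceholder(template) && return bindings[template]
    template isa Expr || return template
    return Expr(template.head, (substitute(a, bindings) for a in template.args)...)
end

# Rewrite every subexpression that matches, bottom-up.
function rewrite(ex, pattern, template)
    rebuilt = ex isa Expr ?
        Expr(ex.head, (rewrite(a, pattern, template) for a in ex.args)...) : ex
    bindings = Dict{Symbol,Any}()
    return match_template!(bindings, pattern, rebuilt) ? substitute(template, bindings) : rebuilt
end

old = :(y = foo(a + 1, b) * 2)
updated = rewrite(old, :(foo(_x, _y)), :(bar(_y, _x)))
```

In this toy example, every call foo(x, y) is replaced by bar(y, x) with the arguments carried over from the matched tree, which is the same parse, match, and splice sequence described above, minus the formatting information a CST would retain.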

In this description I’ve skipped a bit of magic. Simply splicing together a new tree of CST nodes doesn’t quite
work. As mentioned above the CST nodes only know their kind and size and very little else. In particular,
they know neither their position in the original buffer, nor what text is at that position. Instead, the replacement
tree is made out of a different kind of node that retains both pieces of information (which the original buffer is
and where in the buffer that node is located). Conceptually this is again similar to Roslyn’s red-green trees. However, there
is very little code associated with this abstraction. Most of the functionality is provided by the AbstractTrees.jl package by lifting the tree structure of the underlying CST nodes.

Lastly, there’s a couple of other node types to be found in this “replacement tree” to insert or replace
whitespace or other trivia. This is useful for formatting purposes. E.g. in the example above, we passed format_paramlist
as a formatter. This function runs after the matching and adjusts formatting. To see this consider:

immutable RGB{RT,
              GT,
              BT}
    r::RT
    g::GT
    b::BT
end


Naively, this would end up as

struct RGB{RT,
              GT,
              BT}
    r::RT
    g::GT
    b::BT
end


leaving us with unhappy users. Instead, the formatter turns this into

struct RGB{RT,
           GT,
           BT}
    r::RT
    g::GT
    b::BT
end


by adjusting the leading trivia of the GT and BT nodes (or rather the trailing trivia of their predecessors).
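The effect of the formatter can be imitated on plain text. The sketch below (current Julia 1.x) is not how format_paramlist works, since that operates on CST trivia; it is a line-based approximation with made-up function names that shifts the continuation lines of the parameter list by the change in keyword length.

```julia
# Shift a line left by up to `delta` columns of leading whitespace
# (or right by -delta columns when the new keyword is longer).
function shift_left(line::AbstractString, delta::Int)
    indent = length(line) - length(lstrip(line))
    return delta >= 0 ? line[min(delta, indent)+1:end] : " "^(-delta) * line
end

# Replace the leading keyword and keep the parameter list aligned.
function replace_keyword_aligned(src::AbstractString, oldkw::AbstractString, newkw::AbstractString)
    lines = split(src, '\n')
    startswith(lines[1], oldkw * " ") || return String(src)
    delta = length(oldkw) - length(newkw)
    out = String[newkw * lines[1][length(oldkw)+1:end]]
    # Continuation lines belong to the parameter list until its closing brace.
    in_params = occursin("{", lines[1]) && !occursin("}", lines[1])
    for line in lines[2:end]
        push!(out, in_params ? shift_left(line, delta) : line)
        in_params = in_params && !occursin("}", line)
    end
    return join(out, '\n')
end

src = join(["immutable RGB{RT,", " "^14 * "GT,", " "^14 * "BT}",
            "    r::RT", "    g::GT", "    b::BT", "end"], '\n')
fixed = replace_keyword_aligned(src, "immutable", "struct")
```

Since struct is three characters shorter than immutable, the GT and BT lines move three columns to the left, while the field lines, which are not part of the parameter list, keep their indentation.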

Lastly, while the CST templates shown above are powerful, they are still limited to simple pattern matching.
Sometimes we need to perform more complicated kinds of transformation to decide which rewrites to perform.
One example is code like:

if VERSION > v"0.5"
    do_this()
else
    do_that()
end


which, depending on the current julia version, executes either one branch or the other. Of course, if
the package declares that it requires julia 0.6 at a minimum, the condition is true for any supported
julia version, so we can “constant fold” the expression and remove the else branch. Doing so with simple
templates is infeasible, since we need to recognize all patterns of the form “comparison of VERSION against
some version number” and then compute whether the condition is actually always true (or false) given the declared
version bounds. Such transformations are possible using the raw API. Writing such transformations is more complicated
(and beyond the scope of this post), but can be very powerful.
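To give a flavor of what such a transformation has to decide, here is a sketch on ordinary Expr trees in current Julia (1.x). It is not the raw CST API; the function names are hypothetical, only the four basic comparison operators with VERSION on the left are handled, and the supported versions are assumed to be everything from the declared minimum upward.

```julia
# Fallback: anything that is not an expression cannot be folded.
fold_version_check(::Any, ::VersionNumber) = nothing

# Return true/false when `cond` (a comparison of VERSION against a version
# literal) has the same value on every Julia version >= `minver`; otherwise nothing.
function fold_version_check(cond::Expr, minver::VersionNumber)
    (cond.head == :call && length(cond.args) == 3) || return nothing
    op, lhs, rhs = cond.args
    lhs == :VERSION || return nothing
    (rhs isa Expr && rhs.head == :macrocall && rhs.args[1] == Symbol("@v_str")) || return nothing
    bound = VersionNumber(rhs.args[end])      # the string inside v"..."
    # Supported versions form the interval [minver, ∞), so only one outcome
    # of each comparison can ever be proven.
    op == :(>)  && minver >  bound && return true
    op == :(>=) && minver >= bound && return true
    op == :(<)  && minver >= bound && return false
    op == :(<=) && minver >  bound && return false
    return nothing
end

# Replace an if/else by the branch that is always taken, when that can be proven.
function simplify_if(ex::Expr, minver::VersionNumber)
    ex.head == :if || return ex
    folded = fold_version_check(ex.args[1], minver)
    folded === true  && return ex.args[2]
    folded === false && return length(ex.args) >= 3 ? ex.args[3] : nothing
    return ex
end

ex = Meta.parse("""
if VERSION > v"0.5"
    do_this()
else
    do_that()
end
""")
kept = simplify_if(ex, v"0.6")
```

With a declared minimum of 0.6 the condition is provably true, so only the do_this() branch survives; with a minimum of 0.4 nothing can be proven and the expression is returned unchanged.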

FemtoCleaner

Having addressed the first two goals, let’s get to the third goal – convenience. The vast majority of public Julia code
is hosted on GitHub, so the natural way to do this is to create a GitHub bot that clones a repository, applies the rewrites
and submits a pull request to the relevant repository. The simplest way to do so would be to clone all the repositories,
apply the rewrites, and then programmatically submit pull requests to all of them (the PkgDev.jl package has a function
to automatically submit a pull request against a Julia package). However, this approach falls short for several reasons:

• It’s very manual. When new features are added, we have to manually perform a new such run. This is also problematic,
because in practice it means that these runs have to always be done by the person who knows how the setup works. He’s
a very busy guy.
• It would only catch registered Julia packages. There are a significant number of repositories that use Julia code,
but are not registered Julia packages. Of course one could go the other way and submit pull requests to repositories
that look like Julia code, but that risks creating a significant number of useless pull requests (because of forks, etc.)
• It wouldn’t work on private packages
• It doesn’t allow the user to control and interact with the process

A better alternative that addresses all these problems is to create a GitHub bot (also called a GitHub app) to perform these functions. The
Julia community is quite familiar with these already. We have the venerable nanosoldier, which performs on-demand performance benchmarking of julia commits, attobot which assists Julia users in registering their packages with METADATA and (perhaps less well known) jlbuild which controls the julia buildbots (which build releases and perform some additional continuous integration on secondary platforms).

Joining these now is femtocleaner (phew that took a while to get to – I hope the background above was useful though), which performs exactly this function. Let’s see how it works. First go to https://github.com/apps/femtocleaner and click “Configure”. You’ll be presented with a
choice of accounts to install femtocleaner into:

Choosing an account will give you the option to install femtocleaner on either one or all of
the repositories in that account:

In this case, I will install femtocleaner in all repositories of the JuliaParallel organization.
Without any further ado, femtocleaner will go to work, cloning each repository, applying
the rewrites it knows about and then submitting a pull request to each repository where it was
able to make a change:

From now on, FemtoCleaner will listen to events on these repositories and submit another pull
request whenever these packages decide to drop support for an older julia version, thus allowing
the bot to remove more deprecated syntax. The bot can also be triggered manually by opening an
issue with the title “Run femtocleaner”.

The bot has a few additional features meant to make interacting with it easier. The most used one
is the bad bot command, which is used to inform the developers that the bot has made a mistake.
It can be triggered by simply creating a “Changes Requested” GitHub PR review, and annotating an incorrect
change with the review comment bad bot, like so:

In response the bot will open an issue on its source code repository giving the relevant context
and linking back to the PR:

Enabling this functionality right from the pull request review window has proven very powerful.
Rather than requiring the user to leave their current context (reviewing a pull request) and navigate
to a different repository to file an issue, everything can be done right there in the pull request
review. Lastly, once the rewrite bug has been addressed, the bot will come back and update the pull
request.

This workflow is also very convenient from the other side. All the issues are in one place (rather
than having to monitor activity on all pull requests filed by the bot) and addressing the bug is
as simple as fixing the code and pushing it to the source code repository. The bot will automatically
update itself and go back and fix up any pull requests that would now differ as a result of the new code:

Results

The whole project from the first line of code written in support of it until this blog post (which
represents its completion) took about three weeks. As part of it, I made a number of changes
to CSTParser and its dependencies (which should prove very useful for future parsing endeavors) as
well as GitHub.jl (which will hopefully help write more of these kinds of bots to support the Julia
community). After some initial testing and an alpha run on JuliaStats on Aug 8 (huge thanks to Alex Arslan for agreeing to diligently review and try out the process), we announced the public availability of
femtocleaner on Discourse last Friday (Aug 11). Since then, the bot has been installed on 759 repositories (though about 200 of them were ineligible for femtocleaner processing, either because they had missing or malformed REQUIRE files or because they were not actually Julia packages), submitting 132 pull requests that add 8850 lines and
delete 9331. Most of these pull requests have been merged:

As people started using femtocleaner, a number of issues were discovered, but developers took advantage
of the bad bot mechanism described above to report them and we did our best to address them quickly.
The following graph shows the number of open/closed such issues over the time period that femtocleaner has
been active:

Alex Arslan’s original testing on Aug 8 is well visible (and took a few days to catch up to), but all known
issues have been addressed. Another interesting data point is the distribution of supported julia versions
that femtocleaner was installed on. As discussed above, it was primarily written to aid in moving to the new
syntax available in julia 0.6, though a few rewrites (such as the generic VERSION comparisons) are also applicable to older versions. The following shows the number of repositories as well as the number of open
PRs by minimum supported julia version (no PR opened means that the bot found no deprecated syntax):

As expected, packages supporting 0.6.0 got proportionally the most pull requests. However, this just means that
femtocleaner will be back for the remaining 0.5.0 packages once they decide to drop support for 0.5.0.
We can also look at the number of changed lines by the package’s supported minimum version:

Again the bias of the bot for upgrading 0.6 syntax stands out. It is perhaps interesting to note that
most of the 0.6 packages with a small to medium number of changes had already been upgraded manually to
the new syntax. Still, the bot was able to find a few changes that were missed in this process and clean
them up automatically.

Conclusions

Overall, this work should accelerate the movement of the package ecosystem
towards 1.0 by making upgrading code easier. Generally, the package ecosystem lags
behind the julia release by a few months as package maintainers upgrade their code bases.
We hope this system will help make sure that 1.0 is released with a full set of up-to-date
packages, as well as ease the burden on package maintainers, allowing them to spend their time
on improving their packages instead of being forced to spend a lot of time performing tedious
syntax upgrades. We are very happy with the outcome of this work. There are already almost ten thousand
fewer deprecation warnings across the Julia ecosystem and more will be removed automatically once
the package developers are ready for it. Additionally, the underlying technology should help
with a number of other developer-productivity tools and improvements, such as IDE support, better
error messages and the debugger. All code is open source and available on GitHub.
You are welcome to contribute, improve the code or build your own GitHub bots.

We would like to thank GitHub for providing a rich enough API to allow this convenient workflow.

Lastly, we thank and acknowledge the Sloan foundation for their continued support of the Julia
ecosystem by providing the funding for this work.