# Local packages in a separate directory in Julia

Re-posted from: http://tpapp.github.io/post/julia-local-test/

I run Pkg.update() fairly often to stay up to date and benefit from the latest improvements of various packages. I rarely ever pin to a specific package version, but I occasionally checkout master for some packages, especially if I am contributing.
Despite updating regularly, I found that the documentation subtly diverged from what I was experiencing for some packages. After looking into the issue, I learned that I was 2–3 minor versions behind despite updating regularly.

# Conducting analysis on LendingClub data using JuliaDB

If there is anything years of research on how to streamline data analytics has shown us, it is that working with big data is not cake walk. No matter how one looks at it, it is time consuming and computationally intensive to create, maintain, and build models based upon large datasets.

In Julia v0.6, we aim to take another step towards solving this problem with our new package, JuliaDB.

JuliaDB is a high performance, distributed, column-oriented data store providing functionality for both in-memory and out-of-core calculations. Being fully implemented in Julia, JuliaDB allows for ease of integration with data loading, analytics, and visualization packages throughout the Julia language ecosystem. Such seamless integration allows for rapid development of data and compute intensive applications.

This example shall use datasets provided by LendingClub, the world’s largest online marketplace for connecting borrowers and investors. On their website, they provide publicly available, detailed datasets that contain anonymous data regarding all loans that have been issued through their system, including the current loan status and latest payment information.

The analysis conducted below is similar to that performed on the same datasets in this post by the Microsoft R Server Tiger Team for the Azure Data Science VM.

The first step in conducting this analysis is to download the following files from the website: LoanStats2016_Q1.csv, LoanStats2016_Q2.csv, LoanStats2016_Q3.csv, LoanStats2016_Q4.csv, LoanStats2017_Q1.csv, LoanStats3b.csv, LoanStats3c.csv and LoanStats3d.csv. A basic clean-up of the data files is performed by deleting the first and last line of descriptive text from each csv.

## Writing the Julia code

Once the file clean-up is done, add the following packages: JuliaDB, TextParse, IndexedTables, NullableArrays, DecisionTree, CoupledFields, Gadfly, Cairo, Fontconfig, Dagger, and Compose, followed by loading the required ones.

# Packages that need to be installed with Julia 0.6
Pkg.add("Cairo") # Needed for PNG creation with Gadfly
Pkg.add("Fontconfig") # Needed for PNG creation with Gadfly
Pkg.clone("https://github.com/AndyGreenwell/ROC.jl.git")
using Dagger, Compose
using DecisionTree, JuliaDB, TextParse, NullableArrays
import TextParse: Numeric, NAToken, CustomParser, tryparsenext, eatwhitespaces, Quoted, Percentage


Now define a variable that contains a path to the directory containing the data files, and a dictionary that contains the names of all of the columns that are contained in the dataset as keys.

dir = "/home/venkat/LendingClubDemo/files"
const floatparser = Numeric(Float64)
const intparser = Numeric(Int)

t  = Dict("id"                 => Quoted(Int),
"member_id"                      => Quoted(Int),
"loan_amnt"                      => Quoted(Nullable{Float64}),
"funded_amnt"                    => Quoted(Nullable{Float64}),
"funded_amnt_inv"                => Quoted(Nullable{Float64}),
"term"                           => Quoted(TextParse.StrRange),
"int_rate"                       => Quoted(NAToken(Percentage())),
"delinq_2yrs"                    => Quoted(Nullable{Int}),
"earliest_cr_line"               => Quoted(TextParse.StrRange),
"inq_last_6mths"                 => Quoted(Nullable{Int}),
...and so on
)


Calling the function “loadfiles” from the JuliaDB package parses the data files, and constructs the corresponding table (providing the above dictionary as input helps it construct the table, although it doesn’t necessarily need this input). Since none of the dictionary columns are index columns, JuliaDB will itself create its own implicit index column with each row having a unique integer value, starting with 1.

LS = loadfiles(glob("*.csv", dir), indexcols=[], colparsers=t, escapechar='"')


Once done, we classify some loans as bad loans and others as good loans based upon whether the payment on the loan is late, in default, or has been charged off. We then split the table based upon whether the loans are good or bad.

bad_status = ("Late (16-30 days)","Late (31-120 days)","Default","Charged Off")
# Determine which loans are bad loans
getdatacol(LS, :loan_status)) |> collect |> Vector{Bool}
# Split the table into two based on the loan classification
LStrue = filter(x->x.loan_status in bad_status, LS)
LSfalse = filter(x->!(x.loan_status in bad_status), LS)


Constructing a relevant model necessitates that we identify which factors are the best in identifying good and bad loans. Over here, the feature selection method that we use is a graphical comparison based upon how each numerical column’s row values are associated with either a good or bad categorization of individual loans. We construct two density plots of the values contained in each numerical column, one for good loans and the other for bad. This process necessitates that we first figure out which columns are numerical. We do that by using the following set of “isnumeric” functions.

# Define a function for determining if a value is numeric, whether or not the
# value is a Nullable.
isnumeric(::Number) = true
isnumeric{T<:Number}(::Nullable{T}) = true
isnumeric(::Any) = false
isnumeric{T<:Number}(x::Quoted{T}) = true
isnumeric{T<:Nullable}(x::Quoted{T}) = eltype(T) <: Number


We then map our isnumeric function over each column of the JuliaDB table, construct Gadfly layers for each density plot for the good and bad loans, and then display that collection for feature selection.

# Produce density plots of the numeric columns based on the loan classification
varnames = map(Symbol, collect(keys(filter((k,v)->(k != "id" && k!="member_id" && isnumeric(v)), t))))

for s in varnames
nt = dropnull(collect(getdatacol(LStrue,s)))
nf = dropnull(collect(getdatacol(LSfalse,s)))
push!(layers, layer(x = nt, Geom.density, Theme(default_color=colorant"blue")))
push!(layers, layer(x = nf, Geom.density, Theme(default_color=colorant"red")))
end

# Layout the individual plots on a 2D grid
N = length(varnames)
M = round(Int,ceil(sqrt(N)))
cs = Array{Compose.Context}(M,M)
for i = 1:N
Guide.title(string(varnames[i])),
Guide.xlabel("value"),Guide.ylabel("density")))
end
for i = N+1:M^2
end
draw(PNG("featureplot.png",24inch, 24inch), gridstack(cs))


The Gadfly plots would typically look like this:

In order to make sure that our analysis is as close as possible as that conducted by Microsoft, we’ll select the same set of predictor variables that they did:

revol_util, int_rate, mths_since_last_record, annual_inc_joint, dti_joint
total_rec_prncp, all_util


## Creating the predictive model

Our predictive model will be created by using the random forest model of the DecisionTree.jl package. There are two steps here — one where we use a large amount of data to construct the model, and two, a smaller set of data to test the model. So we randomly split the data into two parts, one containing 75% of the data points, to be used for training the model, and the other containing the other 25%, to be used to test the model.

# Split the data into 75% training / 25% test
n = length(LS)
srand(1)
p = randperm(n)
m = round(Int,n*3/4)
a = sort(p[1:m])
b = sort(p[m+1:end])
LStrain = LS[a]
LStest  = LS[b]


The random forest model needs us to create two vectors — one being a vector of labels, and the other being the corresponding feature matrix. For the label vector, we reuse the index vector used above (when extracting the training subset of the original data to extract the corresponding subset of the is_bad label vector). For the construction of the feature matrix, we extract the columns for our selected features from the distributed JuliaDB table, gather those columns to the master process, and finally concatenate the resulting vectors into our feature matrix.

features_train = [revol_util_train int_rate_train mths_since_last_record_train annual_inc_joint_train total_rec_prncp_train all_util_train]


Having done this, we can now call the “build_forest” function from the DecisionTree.jl package.

model = build_forest(labels_train, features_train, 3, 10, 0.8, 6)


Should we want to save our model to reuse at a later time, we can store it to our disk.

f = open("  loanmodel.jls", "w")
serialize(f, model)
close(f)


We can now test our model on the rest of the data. To do this, we will generate predictions in parallel across all workers by mapping the “apply_forest” function onto every row of the JuliaDB dataset.

predictions = collect(map(row->DecisionTree.apply_forest(model, [row.revol_util.value; row.int_rate.value;row.mths_since_last_record.value;row.annual_inc_joint.value;row.dti_joint.value;row.total_rec_prncp.value;row.all_util.value]), LStest)).data


With our set of predictions, we construct a ROC curve using the ROC.jl package and calculate the area under the curve to find a single measure of how predictive our trained model is on the dataset.

# Receiver Operating Characteristics curve
curve = roc(convert(Vector{Float64},predictions), convert(BitArray{1},is_bad[b]))

# An ROC plot in Gadfly with data calculuated using ROC.jl
Gadfly.plot(layer(x = curve.FPR,y = curve.TPR, Geom.line),
layer(x = linspace(0.0,1.0,101), y = linspace(0.0,1.0,101),
Geom.point, Theme(default_color=colorant"red")), Guide.title("ROC"),
Guide.xlabel("False Positive Rate"),Guide.ylabel("True Positive Rate"))


The ROC would look like this.

Area under the curve would be:

# Area Under Curve
AUC(curve)
0.5878135617540067


There. That is how you would create a model that can predictively determine the quality of a loan using JuliaDB.

# Using julia -L startupfile.jl, rather than machinefiles for starting workers.

Re-posted from: http://white.ucc.asn.au/2017/08/17/starting-workers.html

If one wants to have full control over the worker process to method to use is addprocs and the -L startupfile.jl commandline arguement when you start julia
See the documentation for addprocs.

The simplest way to add processes to the julia worker is to invoke it with julia -p 4.
The -p 4 argument says start 4 worker processes, on the local machine.
For more control, one uses julia --machinefile ~/machines
Where ~/machines is a file listing the hosts.
The machinefile is often just a list of hostnames/IP-addresses,
but sometimes is more detailed.
Julia will connect to each host and start a number of workers on each equal to the number of cores.

Even the most detailed machinefile doesn’t give full control,
for example you can not specify the topology, or the location of the julia exectuable.

For full control, one shoud invoke addprocs directly,
and to do so, one should use julia -L startupfile.jl