Tag Archives: Machine Learning

A Tour of the Data Ecosystem in Julia

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2019/07/21/a-tour-of-the-data-ecosystem-in-julia/

Julia 1.0 was released at JuliaCon 2018 and it’s been a quick year for the package ecosystem to build upon the first long-term stable release. In a lot of ways, pre-1.0 for packages involved a lot of experimentation; a lot of trying out various ideas, shotgun-style and seeing what sticks, in addition to trying to keep up with the evolving core language. One of my favorite things about Julia is the great efforts that have been made to collaborate, coordinate, and modularize not just the base language and standard libraries, but the entire package ecosystem. Julia was born in the age of GitHub, Discourse, and Slack, which has led to an exceptional amount of public communication from and with core developers, targeted efforts to automatically update package code and test the impact of core changes to the ecosystem, and a proliferation of user-group meetups and domain-specific GitHub organizations. I don’t feel dishonest in saying I think Julia is the most collaborative programming language that exists today.

With all this context, one might be wondering: so what is the current status of working with data in Julia? How do various packages like CSV.jl, DataFrames.jl, JuliaDB.jl, and Query.jl play together. And so I present, “A Tour of the Data Ecosystem in Julia”. Let’s begin…

Data I/O: How do I get data in and out of Julia?

Text Files
I once heard Jeff Bezanson say tongue-in-cheek, “It doesn’t really matter what fancy features you put in a programming language, all people really want to do is read csv files. That’s it, just csv file reading.” Julia has grown a lot since the earliest days of dlmread, with several excellent packages for reading csv and other delimited files, each growing out of a unique idea for approaching this old problem or addressing a specific integration need.

CSV.jl
As blog author, you’ll have to forgive the shameless plug for my own packages: I started CSV.jl in 2015 to try and make a strong effort to match csv parsing functionality provided by other popular languages (in particular, Pandas and fread in R). Since then, it’s accumulated some ~400 commits from over 25 collaborators, and recently hit issue #464. It also released a significant upgrade with the recent 0.5 release, bringing performance inline with the fastest parsers in other languages, while providing some powerful and unique features not available elsewhere, including: “perfect” column typing without needing to restart parsing at all, auto delimiter detection, automatic handling of invalid rows, and several options/layers for lazily parsing files, all with unparalleled performance (a full feature comparison with pandas and R can be found in the latest discourse announcement post). Part of CSV.jl’s speed comes from the hand-tuned type parsers made available in the Parsers.jl package, which includes extremely performant parsing for ints, floats, bools, and dates/datetimes, all in pure Julia.

Tables.jl
We’re going to take a quick detour from I/O packages here to mention a key building block in the I/O story. The Tables.jl package was born during the 2018 JuliaCon hack-a-thon. It was a collaboration between several data-related package authors to come up with essentially two access patterns for “table” like data: “rows” and “columns”. The core idea is simple, yet powerful: any table-like data format that can implement one of the access patterns (via row iteration, or column access), can “automatically” integrate with any downstream package that then uses one of the access patterns, all without needing to take bi-directional dependencies. Even being a relatively new package, the power of this interface can be seen already in the integrations already available:

  • In-memory datastructures
    • DataFrames.jl
    • TypedTables.jl
    • IndexedTables.jl/JuliaDB.jl
    • FunctionalTables.jl
  • Data Formating/Processing Packages
    • DataKnots.jl
    • FreqTables.jl
    • Mustache.jl
    • FormattedTables.jl
    • PrettyTables.jl
    • TableView.jl
    • BrowseTables.jl
  • Data File Format Packages
    • CSV.jl
    • Feather.jl
    • StataDTAFiles.jl
    • Taro.jl
    • XLSX.jl
  • Database Packages
    • ODBC.jl
    • SQLite.jl
    • MySQL.jl
    • LibPQ.jl
    • JDBC.jl
  • Statistics Packages
    • StatsModels.jl
    • MLJ.jl
    • GLM.jl
  • Plotting Packages
    • StatsMakie.jl
    • StatsPlots.jl
    • TableWidgets.jl

Ok, back to data I/O. We also have the wonderful TextParse.jl, which started as the data ingestion engine for JuliaDB.jl. While blazing fast, it doesn’t have quite the maturity or breadth of functionality compared to CSV.jl, like this long-standing issue of incorrect float parsing. The CSVFiles.jl package also exists to provide FileIO.jl integration, meaning you can just do load("data.csv") and FileIO.jl automatically recognizes the .csv extension and knows how to load the data. (A full feature comparison between CSV.jl and CSVFiles.jl can be found here). Another interesting approach to csv reading popped up recently in the form of TableReader.jl from the ever-talented Kenta Sato, using finite state machines. A pretty decent collection of benchmarks can be found here.

In addition to the excellent packages for handling text-based, delimited files, there are also some great files for managing/handling data formats in general.

  • RData.jl: for reading .rda files from R into Julia
  • DBFTables.jl: for reading .dbf files into Julia
  • StatFiles.jl: for reading SAS, SPSS, and Stata files into Julia
  • StataDTAFiles.jl: for reading and writing Stata files
  • DataDeps.jl: for declaring data dependencies and managing reproducible setups for with data in Julia
  • ExcelFiles.jl: for reading excel files into Julia and integration with FileIO.jl

Another popular option for data storage is binary-based formats, including feather, apache arrow, parquet, orc, avro, and BSON. Julia has pretty good coverage of these formats, including native Julia implementations in Feather.jl (and FileIO.jl integration in FeatherFiles.jl), Arrow.jl, and Parquet.jl (again, with FileIO.jl integration with ParquetFiles.jl). There’s also the BSON.jl package for binary JSON support. Beginning support for avro has been started in Avro.jl, and I’m personally interested in diving into the ORC format.

Database Support
As mentioned above in Tables.jl integrations, there are also great packages in place to support extracting data from databases, including:

  • ODBC.jl: provides generic support for any database that provides a compatible ODBC driver. Supports parameterized queries for inserting data, as well as extracting data from a query, and with Tables.jl support, “exporting” the results to any Tables.jl-compatible sink. It does require setting up an ODBC administrator tool on OSX and linux (windows has builtin support), and then proper setup/installation of the specific database’s ODBC driver, but once setup, things generally work pretty seamlessly.
  • JDBC.jl: the JDBC.jl package also provides access for databases that support JDBC access. It does so via JavaCall.jl, which does require an active JDK to work, which can also be tricky to setup, but database vendors tend to have pretty good support for a JDBC driver
  • MySQL.jl/LibPQ.jl: specific packages for mysql/postgres databases, respectively. Both provide integration with Tables.jl and aim to provide support for database-specific types and functionality (beyond the more generic interfaces of ODBC/JDBC). They can be easier to setup since there’s no “middle man” interface library and you just need to interact with the database libraries directly.
  • SQLite.jl: a library providing integration with the excellent sqlite database; supports in-memory databases as well as file-based. Supports parameterized queries and Tables.jl integration for loading data and extracting query results.

Data Processing: what can I do with my data once it’s in Julia?

Once you identify the right I/O package for reading your data, you then have a choice to make with regards to what you need/want to do with it. Clean it? Reshape it? Filter/calculate/group/sort it? Just browse around for a little while? Do advanced statistics, model training, or other machine learning techniques? Here we take a tour of the current landscape of packages that provide various types of functionality with regards to data processing, including table-like structures, query functionality, and statistics/machine learning.

Table Structures
Yes, in Julia we can have an entire section where we talk about table structures, plural. Sometimes when I mention to people that there are several dataframe-like packages in Julia, they are initially confused: why does Julia need more than one table package? Why doesn’t everyone just work on a single package to focus on quality over quantity? My most common response goes something like this: for a project like Pandas, or data.table, or data.frame, most of the actual code is written in what language? Python? Or R? It’s actually C/C++. The common belief is that because R/Python are dynamic, “high-level” languages, they can’t be fast, but that’s no worry, you just write the parts that need to be fast in C. But therein lies a core issue: for languages like Python and R, where the user-to-developer ratio is so high, having core pieces of functionality written in a lower-level language like C makes the code much less accessible to someone who wants to contribute, let alone just take a peek under the hood to see how things work. It automatically splits the code into a lower-level “black box” component, and a higher-level “wrapper” component. I believe it stunts innovation due to such a high “barrier to entry” to take a new approach to a table structure. I might be a budding R or Python user, getting into developing things, but shoot, if I want to make a meaningful contribution to Pandas or data.table, all the sudden I’m wading into deep make/cmake/build issues, compiler versions, and manual memory management that I’ve never dealt with before. The other alternative is I write something in pure R/Python which will just surely be doomed to performance issues, regardless of how novel my interfaces or APIs might be.
Julia, however, is a famously declared solution to this “two-language problem”. No longer do budding developers need to fear plain for-loops, or vectorize every operation, or rely on C/C++ for the “core stuff”. You can just write plain Julia, down in the trenches, and up at the highest-level user APIs. Julia, all the way down.
I firmly believe this has led to greater innovation in Julia for experimenting with unique table structures, as well as making packages more accessible to those hoping to contribute.
Ok, enough soap-boxing, let’s talk packages.

DataFrames.jl
The DataFrames.jl package is one of the very oldest packages in the Julia ecosystem. This tenure and naming proximity with its cousins in Pandas and R have also made it one of the most popular packages for those hoping to give Julia a try. The package has evolved quite a bit since its early days, and is rapidly approaching its own 1.0 release (expected around JuliaCon 2019). Development over the last year or two has focused on core performance, safety of APIs, and overall consistency with Base APIs. The amount of thought, effort, discussion, and documentation by numerous collaborators makes it the most mature “table” package in my opinion. So what’s unique about DataFrames? I’ll try to give what I deem to be notable highlights of how DataFrames.jl approaches representing tabular data:

  • A DataFrame stores columns internally as a Vector{AbstractVector}; but wait, you might ask, isn’t that type unstable (since we’re essentially lumping all columns, regardless of individual column type, as AbstractVector)? Yes! And on purpose! Experienced Julia developers are quick to point out that sometimes code can get “overly typed”, leading to “compilation overdrive”, where the compiler is having to generate very specialized code for every operation, with compiled code reuse rare. A DataFrame can represent any number of columns, with any combination of column types, so it’s a natural scenario where you may want to “hide” too much type information from the compiler by slapping the lowest common denominator abstract type as the type label (AbstractVector in this case). This design decision has certainly been extensively discussed, but remains as-is, if not as a more compiler-friendly option than other “strongly typed table” types.
  • DataFrames.jl includes specialized subtypes for representing SubDataFrames and GroupedDataFrames, as opposed to returning full DataFrames; these “lazy” structures are mostly used in intermediate operations and is a useful way to avoid too much unnecessary data copying/movement
  • Core manipulation operations included in the package itself include grouping, joining, and indexing; in particular, supporting column indexing via regex matches, functional selecting, Not indexing (inverted indices), and flexible interfaces similar to Base.Arrays in terms of filtering/selecting specific indices for rows or columns.
  • A lot of work in recent years has also been to simplify and move code out of the DataFrames.jl package, to focus on the core types, functionality, and reduce the dependency burden, being a common dependency for packages around the ecosystem; this has included notably the creation of the DataFramesMeta.jl package to support other common query/filter/manipulation operations
  • DataFrames.jl supports the Tables.jl interface, which means any I/O package also supporting it can automatically convert it’s table format into an in-memory DataFrame, and convert back to the format for output

IndexedTables.jl / JuliaDB.jl

JuliaDB.jl splashed onto the Julia scene in early 2017, touting a new approach to query/table operations that utilized type stability and Julia’s built in parallelism to provide “out of core” functionality like a database. JuliaDB.jl’s core table type actually lives in the IndexedTables.jl package, which in turn uses the clever StructArrays.jl package to turn NamedTuple rows into an efficient struct-of-arrays structure more suitable for columnar analytics. JuliaDB.jl itself then, with the use of Dagger.jl, adds the “parallel” layer on top of IndexedTables.jl. The benefits of “type stability” come from a table being essentially encoded as Table{Col1T, Col2T, Col3T, ...}, where the full type of each column is encoded in the top-level table type. This allows operations like selecting columns, filtering, and aggregating to be extremely efficient due to the compiler knowing the exact types it’s dealing with (as opposed to having to do “runtime” checks like in the DataFrames.jl case). Now, as discussed in the DataFrames.jl section, this doesn’t come without a cost; indeed, there’s a long-standing issue for dealing with the compilation cost for tables with a large number of columns. But, in the case of a manageable number of columns, and with the ability to scale tables up beyond a single machine’s memory limits is powerful functionality. JuliaDB.jl is also sponsored by JuliaComputing, which gives it a nice stamp of support and stability. While it may lack some of the maturity of DataFrames.jl long-discussed APIs, I’m excited by the unique, “Julian” approach JuliaDB.jl provides in the big data analytics space. Go check out the docs here and give it a spin. An additional package providing experimental manipulation functions for JuliaDB is JuliaDBMeta.jl, which mostly mirrors the before-mentioned DataFramesMeta.jl, but for JuliaDB tables.

TypedTables.jl
The TypedTables.jl package is one that started out in “experimentation” mode for a while until recently being declared “ready” by its primary author, the venerable Andy Ferris. Andy has long been known in the Julia community for his eye for API consistency and being able to strike that elusive balance between theoretical ideals and practical use. In TypedTables.jl, the “fully type stable” approach is taken, similar to IndexedTables.jl/JuliaDB.jl, with a Table being defined literally as <: AbstractVector{T} where {T <: NamedTuple}, that is, a collection of “rows” or NamedTuples. While TypedTables.jl includes two @Select and @Compute macros for simple manipulations, some of the more interesting promise comes from the TypedTables.jl “supporting cast” packages:

  • AcceleratedArrays.jl: a tidy package to turn any array into an “indexed” array (in the database sense), to provide optimized “search” functions: findall, findfirst, filter, unique, group, join, etc. An AcceleratedArray can thus be used in a TypedTable to provide powerful indexing behavior for an entire table (though note that it works on any AbstractArray, which means these indexed columns could even be used in a DataFrame).
  • SplitApplyCombine.jl: this package provides a powerful set of functions to perform common split, apply, and combine operations on generic collections like mapmany, group, product, groupreduce, and innerjoin. The aim is to provide the basic building blocks of relational algebra functions that work on any kind of collection, obviously including a TypedTable as a specialized type of AbstractVector of NamedTuples.

CSV.File
One more honorable mention for table structures (and another shameless plug) actually comes from the CSV.jl package. While CSV.read materializes a file as a DataFrame, a CSV.File, which supports all the same keyword arguments as CSV.read, can be used to transfer data to any other Tables.jl sink, or used itself as a table directly. It supports getproperty for column view access and iterates a CSV.Row type (which acts like a NamedTuple). I think as packages continue to evolve, we’ll see more and more cases of customized structures like this, which allow for certain efficiencies or specialized views into raw data formats, and with sufficiently general interfaces like Tables.jl and SplitApplyCombine.jl, users won’t need to worry as much about conforming to a single table structure for everything, but can focus on understanding more generic interfaces, and using data structures optimized for specific use-cases, data formats, and workflows.

Query.jl
Another package (set of packages really) that must be discussed in the Julia data ecosystem is Query.jl. Pioneered by David Anthoff, Query.jl provides a LINQ implementation for Julia. Query.jl and its sister package QueryOperators.jl provide a custom “query dsl” by a set of macros that allow convenient “query context” syntax for common manipulation tasks: selection and projection, filtering, grouping, and joining. These “query verbs” also are able to operate on any iterator, in true LINQ fashion, which makes the processing functions extremely versatile. While currently Query.jl/QueryOperators.jl hold the sole implementations of the query verbs, the grander scheme of having a custom dsl is the ability to represent entire queries in an AST (abstract syntax tree), which could then allow custom implementations that “execute” a structured query. This is most immediately useful when one considers being able to use a single “query dsl” to operate on both DataFrames and database tables, having a query translated into a vendor-specific SQL syntax. While not currently fully fleshed out, the ambitious undertaking is exciting to track.

DataValues.jl
One note for users on the use of Query.jl is the current reliance on the DataValues.jl package to represent missing data. What that means is that query functions in Query.jl/QueryOperators.jl aren’t integrated with the Base-builtin representation of missing, but rely on the DataValues.jl package, which defines a DataValue{T} wrapper type that also holds whether a value is missing or not (along with what “type” of missing value it is). While the history of missing data in Julia is long and storied, Andy Ferris wisely noted that no missing value representation is perfect. missing was included in Base largely due to the ease of working with a single sentinel value, and compiler support for code generation involving Union{T, Missing}. Inherent in the use of Union{T, Missing}, however, is a current compiler complexity involving inference of Union values that are stored in a parametric struct field. This affects the @map macro in Query.jl with NamedTuple inputs to NamedTuple outputs, hence, Query.jl relies on the use of DataValue{T} to more conveniently pass type information through projections. There are also active efforts explore ways the core language can avoid the need for an explicit wrapper type while still propagating the type information soundly through projections.

Additional Data Efforts

Other efforts integrating with the data ecosystem include (but are certainly not limited to):

  • StatsModels.jl: for specifying (in familiar “formula” notation), fitting and evaluating statistical models; can operate on any Tables.jl-compatible
  • GLM.jl for working specifically with generalized linear models
  • LightQuery.jl: another budding approach to type-stable querying capabilities
  • Flux.jl for an extremely extensible approach to machine learning modelling, GPU integration, and pure Julia, all the way down
  • Distributions.jl: for sampling, moments, and density/mass functions for a wide variety of distributions
  • StatsMakie.jl for GPU-enabled statistical plotting goodness
  • MLJ.jl: another new holistic approach to representing a variety of machine learning models sponsored by the Alan Turing Institute
  • ScikitLearn.jl for Julia access to the scikit-learn APIs for machine learning, including pure Julia implementations and integration with python models via PyCall.jl
  • OnlineStats.jl: a mature, fully featured statistical package for “online” statistical algorithms, including well-documented source code for published algorithms
  • TextAnalysis.jl: providing algorithms, statistical support, and feature engineering for text analysis
  • Dagger.jl: a dask-like Julia framework for distributed, parallel computation graphs
  • MultivariateStats.jl: stats, but for multiple variables!
  • RCall.jl: package that allows integrating with the R statistical language; transfer objects between languages, call R functions/libraries, etc.
  • StatsPlots.jl: another strong statistical plotting package

Future of Data in Julia

So now JuliaCon 2019 is upon us and we have to wonder: what’s next for working with data in Julia? As I’ve tried to illustrate in this long showcase of Julia packages, the data ecosystem has come along way since the official 1.0 release of the language itself. Support for the most common data formats is about as mature as any other language, but there’s always room to improve. The in-memory processing is an exciting space to watch in Julia due to the number of approaches being fleshed out, with varying degrees of maturity. DataFrames.jl is solid and should be a main utility knife for any Julia programmer, but one of the wonders of Julia, as mentioned above, is the ease of developing high-level, performant solutions in the language itself, so it’s exciting to see alternative approaches that can offer trade-offs or additional features that may enable better workflows depending on the environment. But given all that, here’s a shortlist of things swimming in my head around the future of the data ecosystem in Julia:

  • Ensure the performance and usability of Union{T, Missing} to represent missing data in Julia; currently, 95% of uses and workflows work amazingly well, but we always want to track down corner cases and do everything we can to improve the compiler, core language, or APIs to improve
  • Data format support: the job is never done here, but on my mind are an officially blessed (and integrated) implementation of apache arrow, write support for Parquet.jl, and a Julia package for supporting the ORC data format
  • Working towards a common API package/definitions for common table processing tasks; while exploratory efforts are always encouraged, it could be immensely convenient to users of various table types if there were a common set of operations that “just worked”. While Query.jl currently provides the best solution for this, it has other integration issues in the ecosystem; so working to resolve those or define a new common “table operations” type package
  • Relatedly, defining a full “structured query graph” model could be one way to provide a lower-level shared representation of querying tasks. This could catalyze “frontend” efforts (custom DSLs like Query.jl’s, or new dplyr-like verbs, or even a plain SQL parsing package) to “lower” to this common representation, while allowing similar types of backend innovation in *how* these structured query graphs are executed (with parallel support, out-of-core, etc.). I’ve recently been studying an old effort to do something like this for inspiration

For those who made it this far, kudos! As always, follow and ping me on twitter to chat data in Julia. And if you’ll be at JuliaCon 2019 in Baltimore, hit me up in the official Julia slack to meet up and chat.

Moving to Julia 1.1

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2019/03/moving-to-julia-11.html

Moving from Julia 0.6 to 1.1


Finally, all files in the GitHub repository have been updated to be able to run on Julia 1.1. In order to be able to run them (at the time of writing), the developmental versions of the Tensorflow.jl and PyCall.jl packages need to be installed. Some notable changes are listed below:

General

  • The shuffle command has been moved to the Random package as Random.shuffle().
  • linspace has been replaced with range(start, stop=…, length=…).
  • To determine the length of a matrix, use size(…,2) instead of length().
  • For accessing the first and last part of datasets, head() as been replaced with first(), and tail() with last().
  • For arithmetic calculations involving constants and arrays, the better syntax const .- array needs to be used instead of const – array. In connection with this, a space might be required before the operator; i.e. to avoid confusion with number types use 1 .- array, and not 1.-array.
  • Conversion of a dataframe df to a matrix can be done via convert(Matrix, df) instead of convert(Array, df).

Exercise 8

  • The creation of an initially undefined vector etc. now requires the undef keyword as in activation_functions = Vector{Function}(undef, size(hidden_units,1)).
  • An assignment of a function to multiple entries of a vector requires the dot-operator: activation_functions[1:end-1] .= z->nn.dropout(nn.relu(z), keep_probability)

Exercise 10

  • For this exercise to work, the MNIST.jl package needs an update. A quick fix can be found in the repository together with the exercise notebooks.
  • Instead of flipdim(A, d), use reverse(A, dims=d).
  • indmax has been replaced by argmax.
  • Parsing expressions has been moved to the Meta package and can be done by Meta.parse(…).

Exercise 11

  • The behaviour of taking the transpose of a vector has been changed – it now creates an abstract linear algebra object. We use collect() on the transpose for functions that cannot handle the new type.
  • To test if a string contains a certain word, contains() has been replaced by occursin().




Intro to Sparse Data and Embeddings

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2018/09/intro-to-sparse-data-and-embeddings.html

This is the final exercise of Google’s Machine Learning Crash Course. We use the ACL 2011 IMDB dataset to train a Neural Network in predicting wether a movie review is favourable or not, based on the words used in the review text.

There are two notable differences from the original exercise:

  • We do not build a proper input pipeline for the data. This creates a lot of computational overhead – in principle, we need to preprocess the whole dataset before we start training the network. In practise, this if often not feasible. It would be interesting to see how such a pipeline can be implemented for TensorFlow.jl. The Julia package MLLabelUtils.jl might come handy for this task.
  • When visualizing the embedding layer, our Neural Network builds effectively a 1D-representation of keywords to describe if a movie has a favorable review or not. In the Python version, a real 2D embedding is obtained (see the pictures). The reasons for this difference are unknown.


Julia embedding – effectively a 1D line

Python embedding




The Jupyter notebook can be downloaded here






This notebook is based on the file Embeddings programming exercise, which is part of Google’s Machine Learning Crash Course.
In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Intro to Sparse Data and Embeddings

Learning Objectives:
  • Convert movie-review string data to a feature vector
  • Implement a sentiment-analysis linear model using a feature vector
  • Implement a sentiment-analysis DNN model using an embedding that projects data into two dimensions
  • Visualize the embedding to see what the model has learned about the relationships between words
In this exercise, we’ll explore sparse data and work with embeddings using text data from movie reviews (from the ACL 2011 IMDB dataset). Open and run TFrecord Extraction.ipynb Colaboratory notebook to extract the data from the original .tfrecord file as Julia variables.

Setup

Let’s import our dependencies and open the training and test data. We have exported the test and training data as hdf5 files in the previous step, so we use the HDF5-package to load the data.
In [1]:
using Plots
using Distributions
gr()
using DataFrames
using TensorFlow
import CSV
import StatsBase
using PyCall
@pyimport sklearn.metrics as sklm
using Images
using Colors
using HDF5

sess=Session(Graph())
Out[1]:
Session(Ptr{Void} @0x0000000128c2ac30)
Open the test and training raw data sets.
In [2]:
c = h5open("train_data.h5", "r") do file
global train_labels=read(file, "output_labels")
global train_features=read(file, "output_features")
end
c = h5open("test_data.h5", "r") do file
global test_labels=read(file, "output_labels")
global test_features=read(file, "output_features")
end
train_labels=train_labels'
test_labels=test_labels';
2018-09-16 16:21:02.045635: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
Have a look at one data set:
In [4]:
test_labels[301,:]
Out[4]:
1-element Array{Float32,1}:
1.0
In [5]:
test_features[301]
Out[5]:
"['\"' 'don' \"'\" 't' 'change' 'your' 'husband' '\"' 'is' 'another' 'soap'\n 'opera' 'comedy' 'from' 'producer' '/' 'director' 'cecil' 'b' '.' 'de'\n 'mille' '.' 'it' 'is' 'notable' 'as' 'the' 'first' 'of' 'several' 'films'\n 'he' 'made' 'starring' 'gloria' 'swanson' '.' 'i' 'guess' 'you' 'could'\n 'also' 'call' 'it' 'a' 'sequel' 'of' 'sorts' 'to' 'his' '\"' 'old' 'wives'\n 'for' 'new' '\"' '(' '####' ')' '.' 'james' '(' 'elliot' 'dexter' ')'\n 'and' 'leila' '(' 'swanson' ')' 'porter' 'are' 'a' 'fortyish' 'couple'\n 'where' 'james' 'has' 'gone' 'to' 'seed' 'and' 'become' 'slovenly' 'and'\n 'lazy' '.' 'he' 'has' 'a' 'penchant' 'for' 'smelly' 'cigars' 'and'\n 'eating' 'raw' 'onions' '.' 'he' 'takes' 'his' 'wife' 'for' 'granted' '.'\n 'leila' 'tries' 'to' 'get' 'him' 'to' 'straighten' 'out' 'to' 'no'\n 'avail' '.' 'one' 'night' 'at' 'a' 'dinner' 'party' 'at' 'the' 'porters'\n ',' 'leila' 'meets' 'the' 'dashing' 'schyler' 'van' 'sutphen' '(' 'now'\n \"there's\" 'a' 'moniker' ')' ',' 'the' 'playboy' 'nephew' 'of' 'socialite'\n 'mrs' '.' 'huckney' '(' 'sylvia' 'ashton' ')' '.' 'she' 'invites' 'leila'\n 'to' 'her' 'home' 'for' 'the' 'weekend' 'to' 'make' 'james' '\"' 'miss'\n 'her' '\"' '.' 'once' 'there' 'schyler' 'begins' 'to' 'put' 'the' 'moves'\n 'on' 'her' ',' 'promising' 'her' 'pleasure' ',' 'wealth' 'and' 'love' ','\n 'if' 'she' 'will' 'leave' 'her' 'husband' 'and' 'go' 'with' 'him' '.'\n 'the' 'sequences' 'involving' \"leila's\" 'imagining' 'this' 'promised'\n 'new' 'life' 'are' 'lavishly' 'staged' 'and' 'forecast' 'de' \"mille's\"\n 'epic' 'costume' 'drams' 'later' 'in' 'his' 'career' '.' 'leila' ','\n 'bored' 'with' 'her' 'marriage' 'and' 'her' 'disinterested' 'husband' ','\n 'divorces' 'james' 'and' 'marries' 'the' 'playboy' '.' 'james'\n 'ultimately' 'realizes' 'that' 'he' 'has' 'lost' 'the' 'only' 'thing'\n 'that' 'mattered' 'to' 'him' 'and' 'begins' 'to' 'mend' 'his' 'ways' '.'\n 'he' 'shaves' 'off' 'his' 'mustache' ',' 'works' 'out' ',' 'shuns'\n 'onions' 'and' 'reacquires' 'some' 'manners' '.' 'meanwhile' ',' 'all'\n 'is' 'not' 'rosy' 'with' \"leila's\" 'new' 'marriage' '.' 'schyler' 'it'\n 'seems' 'likes' 'to' 'gamble' 'and' 'has' 'taken' 'up' 'with' 'the'\n 'gold' 'digging' 'nanette' '(' 'aka' 'tootsie' ',' 'or' 'some' 'such'\n 'name' ')' '(' 'julia' 'faye' ')' '.' 'schyler' 'loses' 'all' 'of' 'his'\n 'money' 'and' 'steals' \"leila's\" 'diamond' 'ring' 'to' 'cover' 'his'\n 'losses' '.' 'one' 'fateful' 'day' ',' 'leila' 'meets' 'the' '\"' 'new'\n '\"' 'james' 'and' 'is' 'taken' 'by' 'the' 'changes' 'in' 'him' '.'\n 'james' 'drives' 'her' 'home' 'and' 'becomes' 'aware' 'of' 'her'\n 'situation' 'and' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.'\n '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.'\n '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.'\n 'this' 'film' 'marked' 'the' 'beginning' 'of' 'gloria' \"swanson's\" 'rise'\n 'to' 'super' 'stardom' 'in' 'a' 'career' 'that' 'would' 'rival' 'that'\n 'of' 'mary' 'pickford' '.' 'barely' '##' 'years' 'of' 'age' ',' 'she'\n 'had' 'begun' 'her' 'career' 'in' 'mack' 'sennett' 'two' 'reel'\n 'comedies' 'as' 'a' 'teen' 'ager' '.' 'elliot' 'dexter' 'was' 'almost'\n '##' 'at' 'this' 'time' 'but' 'he' 'and' 'swanson' 'make' 'a' 'good'\n 'team' ',' 'although' \"it's\" 'hard' 'to' 'imagine' 'anyone' 'tiring' 'of'\n 'the' 'lovely' 'miss' 'swanson' 'as' 'is' 'the' 'case' 'in' 'this' 'film'\n '.' 'dexter' 'and' 'sylvia' 'ashton' 'had' 'appeared' 'in' 'the'\n 'similar' '\"' 'old' 'wives' 'for' 'new' '\"' 'where' 'the' 'wife' 'had'\n 'gone' 'to' 'seed' 'and' 'the' 'husband' 'was' 'wronged' '.' 'also' 'in'\n 'the' 'cast' 'are' 'de' 'mille' 'regulars' 'theodore' 'roberts' 'as' 'a'\n 'bishop' 'and' 'raymond' 'hatton' 'as' 'a' 'gambler' '.']"

Building a Sentiment Analysis Model

Let’s train a sentiment-analysis model on this data that predicts if a review is generally favorable (label of 1) or unfavorable (label of 0).
To do so, we’ll turn our string-value terms into feature vectors by using a vocabulary, a list of each term we expect to see in our data. For the purposes of this exercise, we’ve created a small vocabulary that focuses on a limited set of terms. Most of these terms were found to be strongly indicative of favorable or unfavorable, but some were just added because they’re interesting.
Each term in the vocabulary is mapped to a coordinate in our feature vector. To convert the string-value terms for an example into this vector format, we encode such that each coordinate gets a value of 0 if the vocabulary term does not appear in the example string, and a value of 1 if it does. Terms in an example that don’t appear in the vocabulary are thrown away.
NOTE: We could of course use a larger vocabulary, and there are special tools for creating these. In addition, instead of just dropping terms that are not in the vocabulary, we can introduce a small number of OOV (out-of-vocabulary) buckets to which you can hash the terms not in the vocabulary. We can also use a feature hashing approach that hashes each term, instead of creating an explicit vocabulary. This works well in practice, but loses interpretability, which is useful for this exercise.

Building the Input Pipeline

First, let’s configure the input pipeline to import our data into a TensorFlow model. We can use the following function to parse the training and test data and return an array of the features and the corresponding labels.
In [3]:
function create_batches(features, targets, steps, batch_size=5, num_epochs=0)
"""Create batches.

Args:
features: Input features.
targets: Target column.
steps: Number of steps.
batch_size: Batch size.
num_epochs: Number of epochs, 0 will let TF automatically calculate the correct number
Returns:
An extended set of feature and target columns from which batches can be extracted.
"""
if(num_epochs==0)
num_epochs=ceil(batch_size*steps/size(features,1))
end

features_batches=copy(features)
target_batches=copy(targets)

for i=1:num_epochs
select=shuffle(1:size(features,1))
if i==1
features_batches=(features[select,:])
target_batches=(targets[select,:])
else
features_batches=vcat(features_batches, features[select,:])
target_batches=vcat(target_batches, targets[select,:])
end
end
return features_batches, target_batches
end
Out[3]:
create_batches (generic function with 3 methods)
In [4]:
function construct_feature_columns(input_features):
"""Construct the TensorFlow Feature Columns.

Args:
input_features: The numerical input features to use.
Returns:
A set of feature columns
"""
out=convert(Array, input_features[:,:])
return convert.(Float64,out)
end
Out[4]:
construct_feature_columns (generic function with 1 method)
In [5]:
function next_batch(features_batches, targets_batches, batch_size, iter)
"""Next batch.

Args:
features_batches: Features batches from create_batches.
targets_batches: Target batches from create_batches.
batch_size: Batch size.
iter: Number of the current iteration
Returns:
A batch of features and targets.
"""
select=mod((iter-1)*batch_size+1, size(features_batches,1)):mod(iter*batch_size, size(features_batches,1));

ds=features_batches[select,:];
target=targets_batches[select,:];

return ds, target
end
Out[5]:
next_batch (generic function with 1 method)
In [6]:
function my_input_fn(features_batches, targets_batches, iter, batch_size=5, shuffle_flag=1):
"""Prepares a batch of features and labels for model training.

Args:
features_batches: Features batches from create_batches.
targets_batches: Target batches from create_batches.
iter: Number of the current iteration
batch_size: Batch size.
shuffle_flag: Determines wether data is shuffled before being returned
Returns:
Tuple of (features, labels) for next data batch
"""

# Construct a dataset, and configure batching/repeating.
ds, target = next_batch(features_batches, targets_batches, batch_size, iter)

# Shuffle the data, if specified.
if shuffle_flag==1
select=shuffle(1:size(ds, 1));
ds = ds[select,:]
target = target[select, :]
end

# Return the next batch of data.
return ds, target
end
Out[6]:
my_input_fn (generic function with 3 methods)

Task 1: Use a Linear Model with Sparse Inputs and an Explicit Vocabulary

For our first model, we’ll build a Linear Classifier model using 50 informative terms; always start simple!
The following code constructs the feature column for our terms.
In [7]:
# 50 informative terms that compose our model vocabulary 
informative_terms = ["bad", "great", "best", "worst", "fun", "beautiful",
"excellent", "poor", "boring", "awful", "terrible",
"definitely", "perfect", "liked", "worse", "waste",
"entertaining", "loved", "unfortunately", "amazing",
"enjoyed", "favorite", "horrible", "brilliant", "highly",
"simple", "annoying", "today", "hilarious", "enjoyable",
"dull", "fantastic", "poorly", "fails", "disappointing",
"disappointment", "not", "him", "her", "good", "time",
"?", ".", "!", "movie", "film", "action", "comedy",
"drama", "family"]
Out[7]:
50-element Array{String,1}:
"bad"
"great"
"best"
"worst"
"fun"
"beautiful"
"excellent"
"poor"
"boring"
"awful"
"terrible"
"definitely"
"perfect"

"her"
"good"
"time"
"?"
"."
"!"
"movie"
"film"
"action"
"comedy"
"drama"
"family"
The following function takes the input data and vocabulary and converts the data to a one-hot encoded matrix.
In [8]:
# function for creating categorial colum from vocabulary list in one hot encoding
function create_data_columns(data, informative_terms)
onehotmat=zeros(length(data), length(informative_terms))

for i=1:length(data)
string=data[i]
for j=1:length(informative_terms)
if contains(string, informative_terms[j])
onehotmat[i,j]=1
end
end
end
return onehotmat
end
Out[8]:
create_data_columns (generic function with 1 method)
In [9]:
train_feature_mat=create_data_columns(train_features, informative_terms)
test_features_mat=create_data_columns(test_features, informative_terms);
Next, we’ll construct the Linear Classifier model, train it on the training set, and evaluate it on the evaluation set. After you read through the code, run it and see how you do.
In [10]:
function train_linear_classifier_model(learning_rate,
steps,
batch_size,
training_examples,
training_targets,
validation_examples,
validation_targets)
"""Trains a linear classifier model.

Args:
learning_rate: A `float`, the learning rate.
steps: A non-zero `int`, the total number of training steps. A training step
consists of a forward and backward pass using a single batch.
batch_size: A non-zero `int`, the batch size.
training_examples, etc: The input data.
Returns:
weight: The weights of the linear model.
bias: The bias of the linear model.
validation_probabilities: Probabilities for the validation examples.
p1: Plot of loss function for the different periods
"""

periods = 10
steps_per_period = steps / periods

# Create feature columns.
feature_columns = placeholder(Float32)
target_columns = placeholder(Float32)
eps=1E-8

# these two variables need to be initialized as 0, otherwise method gives problems
m=Variable(zeros(size(training_examples,2),1).+0.0)
b=Variable(0.0)

ytemp=nn.sigmoid(feature_columns*m + b)
y= clip_by_value(ytemp, 0.0, 1.0)
loss = -reduce_mean(log(y+eps).*target_columns + log(1-y-eps).*(1-target_columns))

features_batches, targets_batches = create_batches(training_examples, training_targets, steps, batch_size)

# Advanced Adam optimizer decent with gradient clipping
my_optimizer=(train.AdamOptimizer(learning_rate))
gvs = train.compute_gradients(my_optimizer, loss)
capped_gvs = [(clip_by_norm(grad, 5.0), var) for (grad, var) in gvs]
my_optimizer = train.apply_gradients(my_optimizer,capped_gvs)

run(sess, global_variables_initializer()) #this needs to be run after constructing the optimizer!

# Train the model, but do so inside a loop so that we can periodically assess
# loss metrics.
println("Training model...")
println("LogLoss (on training data):")
training_log_losses = []
validation_log_losses=[]
for period in 1:periods
# Train the model, starting from the prior state.
for i=1:steps_per_period
features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
run(sess, my_optimizer, Dict(feature_columns=>construct_feature_columns(features), target_columns=>construct_feature_columns(labels)))
end
# Take a break and compute predictions.
training_probabilities = run(sess, y, Dict(feature_columns=> construct_feature_columns(training_examples)));
validation_probabilities = run(sess, y, Dict(feature_columns=> construct_feature_columns(validation_examples)));

# Compute loss.
training_log_loss=run(sess,loss,Dict(feature_columns=> construct_feature_columns(training_examples), target_columns=>construct_feature_columns(training_targets)))
validation_log_loss =run(sess,loss,Dict(feature_columns=> construct_feature_columns(validation_examples), target_columns=>construct_feature_columns(validation_targets)))

# Occasionally print the current loss.
println(" period ", period, ": ", training_log_loss)
weight = run(sess,m)
bias = run(sess,b)

loss_val=run(sess,loss,Dict(feature_columns=> construct_feature_columns(training_examples), target_columns=>construct_feature_columns(training_targets)))

# Add the loss metrics from this period to our list.
push!(training_log_losses, training_log_loss)
push!(validation_log_losses, validation_log_loss)
end

weight = run(sess,m)
bias = run(sess,b)

println("Model training finished.")

# Output a graph of loss metrics over periods.
p1=plot(training_log_losses, label="training", title="LogLoss vs. Periods", ylabel="LogLoss", xlabel="Periods")
p1=plot!(validation_log_losses, label="validation")

println("Final LogLoss (on training data): ", training_log_losses[end])

# calculate additional ouputs
validation_probabilities = run(sess, y, Dict(feature_columns=> construct_feature_columns(validation_examples)));

return weight, bias, validation_probabilities, p1
end
Out[10]:
train_linear_classifier_model (generic function with 1 method)
In [14]:
weight, bias, validation_probabilities,  p1 = train_linear_classifier_model(
0.0005, #learning rate
1000, #steps
50, #batch_size
train_feature_mat,
train_labels,
test_features_mat,
test_labels)
Training model...
LogLoss (on training data):
period 1: 0.6716383366395725
period 2: 0.6520718400586106
period 3: 0.6347970177101597
period 4: 0.6191596702328207
period 5: 0.6047782721033566
period 6: 0.5922131685262155
period 7: 0.580839687465644
period 8: 0.5702423812570189
period 9: 0.5607007472318791
period 10: 0.5519328267238274
Model training finished.
Out[14]:
([-0.36884; 0.331766; … ; 0.127656; 0.181674], 0.038435446715259586, [0.590407; 0.390335; … ; 0.662796; 0.529415], Plot{Plots.GRBackend() n=2})
Final LogLoss (on training data): 0.5519328267238274
In [15]:
plot(p1)
Out[15]:
2468100.560.580.600.620.640.66LogLoss vs. PeriodsPeriodsLogLosstrainingvalidation
The following function converts the validation probabilites back to 0-1-predictions.
In [13]:
# Function for converting probabilities to 0/1 decision
function castto01(probabilities)
out=copy(probabilities)
for i=1:length(probabilities)
if(probabilities[i]<0.5)
out[i]=0
else
out[i]=1
end
end
return out
end
Out[13]:
castto01 (generic function with 1 method)
Let’s have a look at the accuracy of the model:
In [17]:
evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
vec(construct_feature_columns(test_labels)), vec(validation_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(validation_probabilities))
validation_predictions=castto01(validation_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
In [18]:
println("AUC on the validation set: ",  evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])
AUC on the validation set: [0.865503]
Accuracy on the validation set: [0.781591]
In [19]:
plot(p2)
Out[19]:
0.000.250.500.751.000.000.250.500.751.00our modelrandom classifier

Task 2: Use a Deep Neural Network (DNN) Model

The above model is a linear model. It works quite well. But can we do better with a DNN model?
Let’s constructa NN classification model. Run the following cells, and see how you do.
In [11]:
function train_nn_classification_model(learning_rate,
steps,
batch_size,
hidden_units,
is_embedding,
keep_probability,
training_examples,
training_targets,
validation_examples,
validation_targets)
"""Trains a neural network classification model.

Args:
learning_rate: A `float`, the learning rate.
steps: A non-zero `int`, the total number of training steps. A training step
consists of a forward and backward pass using a single batch.
batch_size: A non-zero `int`, the batch size.
hidden_units: A vector describing the layout of the neural network.
is_embedding: 'true' or 'false' depending on if the first layer of the NN is an embedding layer.
keep_probability: A `float`, the probability of keeping a node active during one training step.
Returns:
p1: Plot of the loss function for the different periods.
y: The final layer of the TensorFlow network.
final_probabilities: Final predicted probabilities on the validation examples.
weight_export: The weights of the first layer of the NN
feature_columns: TensorFlow feature columns.
target_columns: TensorFlow target columns.
"""

periods = 10
steps_per_period = steps / periods

# Create feature columns.
feature_columns = placeholder(Float32, shape=[-1, size(training_examples,2)])
target_columns = placeholder(Float32, shape=[-1, size(training_targets,2)])

# Network parameters
push!(hidden_units,size(training_targets,2)) #create an output node that fits to the size of the targets
activation_functions = Vector{Function}(size(hidden_units,1))
activation_functions[1:end-1]=z->nn.dropout(nn.relu(z), keep_probability)
activation_functions[end] = nn.sigmoid #Last function should be idenity as we need the logits

# create network
flag=0
weight_export=Variable([1])
Zs = [feature_columns]

for (ii,(hlsize, actfun)) in enumerate(zip(hidden_units, activation_functions))
Wii = get_variable("W_$ii"*randstring(4), [get_shape(Zs[end], 2), hlsize], Float32)
bii = get_variable("b_$ii"*randstring(4), [hlsize], Float32)

if((is_embedding==true) & (flag==0))
Zii=Zs[end]*Wii
else
Zii = actfun(Zs[end]*Wii + bii)
end
push!(Zs, Zii)

if(flag==0)
weight_export=Wii
flag=1
end
end

y=Zs[end]
eps=1e-8
cross_entropy = -reduce_mean(log(y+eps).*target_columns + log(1-y+eps).*(1-target_columns))

features_batches, targets_batches = create_batches(training_examples, training_targets, steps, batch_size)

# Standard Adam Optimizer
my_optimizer=train.minimize(train.AdamOptimizer(learning_rate), cross_entropy)

run(sess, global_variables_initializer())


# Train the model, but do so inside a loop so that we can periodically assess
# loss metrics.
println("Training model...")
println("LogLoss error (on validation data):")
training_log_losses = []
validation_log_losses = []
for period in 1:periods
# Train the model, starting from the prior state.
for i=1:steps_per_period
features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
run(sess, my_optimizer, Dict(feature_columns=>construct_feature_columns(features), target_columns=>construct_feature_columns(labels)))
end
# Take a break and compute log loss.
training_log_loss = run(sess, cross_entropy, Dict(feature_columns=> construct_feature_columns(training_examples), target_columns=>construct_feature_columns(training_targets)));
validation_log_loss = run(sess, cross_entropy, Dict(feature_columns=> construct_feature_columns(validation_examples), target_columns=>construct_feature_columns(validation_targets)));

# Occasionally print the current loss.
println(" period ", period, ": ", training_log_loss)

# Add the loss metrics from this period to our list.
push!(training_log_losses, training_log_loss)
push!(validation_log_losses, validation_log_loss)
end


println("Model training finished.")

# Calculate final predictions (not probabilities, as above).
final_probabilities = run(sess, y, Dict(feature_columns=> validation_examples, target_columns=>validation_targets))
final_predictions=0.0.*copy(final_probabilities)
final_predictions=castto01(final_probabilities)

accuracy = sklm.accuracy_score(validation_targets, final_predictions)
println("Final accuracy (on validation data): ", accuracy)

# Output a graph of loss metrics over periods.
p1=plot(training_log_losses, label="training", title="LogLoss vs. Periods", ylabel="LogLoss", xlabel="Periods")
p1=plot!(validation_log_losses, label="validation")

return p1, y, final_probabilities, weight_export, feature_columns, target_columns
end
Out[11]:
train_nn_classification_model (generic function with 1 method)
In [21]:
sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
0.003, #learning rate
1000, #steps
50, #batch_size
[20, 20], #hidden_units
false, #is_embedding
1.0, # keep probability
train_feature_mat,
train_labels,
test_features_mat,
test_labels)
Training model...
LogLoss error (on validation data):
period 1: 0.471395788925613
period 2: 0.45770820644301585
period 3: 0.44868498081679836
period 4: 0.4486270877372173
period 5: 0.44886359529693076
period 6: 0.45785272663956894
period 7: 0.4486596600689161
period 8: 0.44640892834409557
period 9: 0.44672501642012086
period 10: 0.44597568632613904
Model training finished.
Out[21]:
(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.759592; 0.194817; … ; 0.910422; 0.592667], TensorFlow.Variables.Variable{Float32}(<Tensor W_1BDge:1 shape=(50, 20) dtype=Float32>, <Tensor W_1BDge/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 50) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)
Final accuracy (on validation data): 0.7854314172566903
In [22]:
plot(p1)
Out[22]:
2468100.4500.4550.4600.4650.470LogLoss vs. PeriodsPeriodsLogLosstrainingvalidation
In [23]:
evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ", evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])
AUC on the validation set: [0.871963]
Accuracy on the validation set: [0.785431]
In [24]:
plot(p2)
Out[24]:
0.000.250.500.751.000.000.250.500.751.00our modelrandom classifier

Task 3: Use an Embedding with a DNN Model

In this task, we’ll implement our DNN model using an embedding column. An embedding column takes sparse data as input and returns a lower-dimensional dense vector as output. We’ll add the embedding layer as the first layer in the hidden_units-vector, and set is_embedding to true.
NOTE: In practice, we might project to dimensions higher than 2, like 50 or 100. But for now, 2 dimensions is easy to visualize.
In [14]:
sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
0.003, #learning rate
1000, #steps
50, #batch_size
[2, 20, 20], #hidden_units
true,
1.0, # keep probability
train_feature_mat,
train_labels,
test_features_mat,
test_labels)
Training model...
LogLoss error (on validation data):
period 1: 0.5663698194950431
period 2: 0.45220806683037423
period 3: 0.45205136369869175
period 4: 0.44547315482684624
period 5: 0.4489593760718842
period 6: 0.4484996714455285
period 7: 0.44584040864739305
period 8: 0.4457339047966633
period 9: 0.4464958611862487
period 10: 0.4485086547636147
Model training finished.
Out[14]:
(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.736896; 0.157856; … ; 0.895887; 0.54718], TensorFlow.Variables.Variable{Float32}(<Tensor W_1fwgD:1 shape=(50, 2) dtype=Float32>, <Tensor W_1fwgD/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 50) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)
In [15]:
plot(p1)
Out[15]:
2468100.4500.4750.5000.5250.550LogLoss vs. PeriodsPeriodsLogLosstrainingvalidation
Final accuracy (on validation data): 0.7846313852554102
In [16]:
evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ", evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])
AUC on the validation set: [0.873001]
Accuracy on the validation set: [0.784631]
In [17]:
plot(p2)
Out[17]:
0.000.250.500.751.000.000.250.500.751.00our modelrandom classifier

Task 4: Examine the Embedding

Let’s now take a look at the actual embedding space, and see where the terms end up in it. Do the following:
  1. Run the following code to see the embedding we trained in Task 3. Do things end up where you’d expect?
  2. Re-train the model by rerunning the code in Task 3, and then run the embedding visualization below again. What stays the same? What changes?
  3. Finally, re-train the model again using only 10 steps (which will yield a terrible model). Run the embedding visualization below again. What do you see now, and why?
In [18]:
xy_coord=run(sess, weight_export, Dict(feature_columns=> test_features_mat, target_columns=>test_labels))
p3=plot(title="Embedding Space", xlims=(minimum(xy_coord[:,1])-0.3, maximum(xy_coord[:,1])+0.3), ylims=(minimum(xy_coord[:,2])-0.1, maximum(xy_coord[:,2]) +0.3) )
for term_index=1:length(informative_terms)
p3=annotate!(xy_coord[term_index,1], xy_coord[term_index,1], informative_terms[term_index] )
end
plot(p3)
Out[18]:
-0.50-0.250.000.250.500.75-0.50-0.250.000.250.50Embedding Spacebadgreatbestworstfunbeautifulexcellentpoorboringawfulterribledefinitelyperfectlikedworsewasteentertaininglovedunfortunatelyamazingenjoyedfavoritehorriblebrillianthighlysimpleannoyingtodayhilariousenjoyabledullfantasticpoorlyfailsdisappointingdisappointmentnothimhergoodtime?.!moviefilmactioncomedydramafamily

Task 5: Try to improve the model’s performance

See if you can refine the model to improve performance. A couple things you may want to try:
  • Changing hyperparameters, or using a different optimizer than Adam (you may only gain one or two accuracy percentage points following these strategies).
  • Adding additional terms to informative_terms. There’s a full vocabulary file with all 30,716 terms for this data set that you can use at: https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/terms.txt You can pick out additional terms from this vocabulary file, or use the whole thing.
In the following code, we will import the whole vocabulary file and run the model with it.
In [30]:
vocabulary=Array{String}(0)
open("terms.txt") do file
for ln in eachline(file)
push!(vocabulary, ln)
end
end
In [31]:
vocabulary
Out[31]:
30716-element Array{String,1}:
"the"
"."
","
"and"
"a"
"of"
"to"
"is"
"in"
"i"
"it"
"this"
"'"

"soapbox"
"softening"
"user's"
"od"
"potter's"
"renard"
"impacting"
"pong"
"nobly"
"nicol"
"ff"
"MISSING"
We will now load the test and training features matrices from disk. Open and run the Conversion of Movie-review data to one-hot encoding-notebook to prepare the IMDB_fullmatrix_datacolumns.jld-file. The notebook can be found here.
In [48]:
using JLD
train_features_full=load("IMDB_fullmatrix_datacolumns.jld", "train_features_full")
test_features_full=load("IMDB_fullmatrix_datacolumns.jld", "test_features_full")
Out[48]:
24999×30716 Array{Float64,2}:
1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋮ ⋱ ⋮ ⋮
0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Now run the session with the full vocabulary file. Again, this will take a long time to finish. It assigns about 50GB of memory.
In [49]:
sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
# TWEAK THESE VALUES TO SEE HOW MUCH YOU CAN IMPROVE THE RMSE
0.003, #learning rate
1000, #steps
50, #batch_size
[2, 20, 20], #hidden_units
true,
1.0, # keep probability
train_features_full,
train_labels,
test_features_full,
test_labels)
Training model...
LogLoss error (on validation data):
period 1: 0.37313920231826037
period 2: 0.26477444394028044
period 3: 0.25086412221834486
period 4: 0.2087771610933073
period 5: 0.19335710666766756
period 6: 0.16094674715105722
period 7: 0.15969771663371266
period 8: 0.14275753878385306
period 9: 0.13488925137953908
period 10: 0.11471432399972212
Model training finished.
Out[49]:
(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.0975639; 0.0719205; … ; 0.993713; 0.678265], TensorFlow.Variables.Variable{Float32}(<Tensor W_1tR5j:1 shape=(30716, 2) dtype=Float32>, <Tensor W_1tR5j/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 30716) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)
Final accuracy (on validation data): 0.8656746269850794
In [4]:
#plot(p1)
In [51]:
evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ", evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])
AUC on the validation set: [0.939391]
Accuracy on the validation set: [0.865675]
In [3]:
#plot(p2)

Task 6: Try out sparse matrices

We will now convert the feature matrices from the previous step to sparse matrices and re-run our code. The sparse matrices take about 350MB of memory. The code for the NN will still convert the sparse matrix containing the data for the current batch to a full matrix, which leads to a memory requirement of about 35GB.
In [54]:
train_features_sparse=sparse(train_features_full)
test_features_sparse=sparse(test_features_full)
Out[54]:
24999×30716 SparseMatrixCSC{Float64,Int64} with 10780292 stored entries:
[1 , 1] = 1.0
[2 , 1] = 1.0
[3 , 1] = 1.0
[4 , 1] = 1.0
[5 , 1] = 1.0
[6 , 1] = 1.0
[7 , 1] = 1.0
[8 , 1] = 1.0
[9 , 1] = 1.0
[10 , 1] = 1.0

[24977, 30715] = 1.0
[24978, 30715] = 1.0
[24979, 30715] = 1.0
[24981, 30715] = 1.0
[24982, 30715] = 1.0
[24983, 30715] = 1.0
[24984, 30715] = 1.0
[24989, 30715] = 1.0
[24990, 30715] = 1.0
[24993, 30715] = 1.0
[24996, 30715] = 1.0
In [56]:
# For saving the data
#save("IMDB_sparsematrix_datacolumns.jld", "train_features_sparse", train_features_sparse, "test_features_sparse", test_features_sparse)
In [55]:
sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
0.003, #learning rate
1000, #steps
50, #batch_size
[2, 20, 20], #hidden_units
true,
1.0, # keep probability
train_features_sparse,
train_labels,
test_features_sparse,
test_labels)
Training model...
LogLoss error (on validation data):
period 1: 0.4264791202329761
period 2: 0.27906764597962996
period 3: 0.24588130824603802
period 4: 0.24137786867062228
period 5: 0.1982403807820223
period 6: 0.195120145753206
period 7: 0.15590201135616108
period 8: 0.145190807054994
period 9: 0.128827185543495
period 10: 0.12583784552275887
Model training finished.
Out[55]:
(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.302512; 0.50181; … ; 0.977541; 0.652478], TensorFlow.Variables.Variable{Float32}(<Tensor W_1gvhS:1 shape=(30716, 2) dtype=Float32>, <Tensor W_1gvhS/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 30716) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)
Final accuracy (on validation data): 0.8696347853914157
In [2]:
#plot(p1)
In [57]:
evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ", evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])
AUC on the validation set: [0.940761]
Accuracy on the validation set: [0.869635]
In [1]:
#plot(p2)

A Final Word

We may have gotten a DNN solution with an embedding that was better than our original linear model, but the linear model was also pretty good and was quite a bit faster to train. Linear models train more quickly because they do not have nearly as many parameters to update or layers to backprop through.
In some applications, the speed of linear models may be a game changer, or linear models may be perfectly sufficient from a quality standpoint. In other areas, the additional model complexity and capacity provided by DNNs might be more important. When defining your model architecture, remember to explore your problem sufficiently so that you know which space you’re in.