JSOC 2015 project – Automatic Differentiation in Julia with ForwardDiff.jl

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/MoVbDxwx03A/auto-diff-in-julia

This summer, I’ve had the good fortune to be able to participate in the first ever Julia Summer of Code (JSoC), generously sponsored by the Gordon and Betty Moore Foundation. My JSoC project was to explore the use of Julia for automatic differentiation (AD), a topic with a wide array of applications in the field of optimization.

Under the mentorship of Miles Lubin and Theodore Papamarkou, I completed a major overhaul of ForwardDiff.jl, a Julia package for calculating derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions (or any callable Julia type, really).

By the end of this post, you’ll hopefully know a little bit about how ForwardDiff.jl works, why it’s useful, and why Julia is uniquely well-suited for AD compared to other languages.

Seeing ForwardDiff.jl In Action

Before we get down to the nitty-gritty details, it might be helpful to see a simple example that illustrates various methods from ForwardDiff.jl’s API.

The snippet below is a somewhat contrived example, but works well enough as an introduction to the package. First, we define a target function we’d like to differentiate, then use ForwardDiff.jl to calculate some derivatives of the function at a given input:

julia> using ForwardDiff

julia> f(x::Vector) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

julia> x = rand(5)
5-element Array{Float64,1}:

julia> g = ForwardDiff.gradient(f); # g = ∇f

julia> g(x)
5-element Array{Float64,1}:

julia> j = ForwardDiff.jacobian(g); # j = J(∇f)

julia> j(x)
5×5 Array{Float64,2}:
0.585111 3.48083 1.7706 0.994057 1.03257
3.48083 1.06079 5.79299 3.25245 3.37871
1.7706 5.79299 0.423981 1.65416 1.71818
0.994057 3.25245 1.65416 0.251396 0.964566
1.03257 3.37871 1.71818 0.964566 0.140689

julia> ForwardDiff.hessian(f, x) # H(f)(x) == J(∇f)(x), as expected
5×5 Array{Float64,2}:
0.585111 3.48083 1.7706 0.994057 1.03257
3.48083 1.06079 5.79299 3.25245 3.37871
1.7706 5.79299 0.423981 1.65416 1.71818
0.994057 3.25245 1.65416 0.251396 0.964566
1.03257 3.37871 1.71818 0.964566 0.140689

Tada! Okay, that’s not too exciting – I could’ve just done the same thing with Calculus.jl! What makes ForwardDiff.jl different? Read on to find out…

How ForwardDiff.jl Works

What is Automatic Differentiation?

In broad terms, automatic differentiation describes a class of algorithms for automatically taking exact derivatives of user-provided functions. In addition to producing more accurate results, AD methods are also often faster than other common differentiation methods (such as finite differencing).

There are two main flavors of AD: forward mode, and reverse mode. As you might’ve guessed, this post only discusses forward mode, which is the kind of AD implemented by ForwardDiff.jl.

ForwardDiffNumber Types

ForwardDiff.jl implements several new number types, all of which are subtypes of ForwardDiffNumber{N,T,C} <: Number.

Elementary numerical functions on these types are then overloaded to evaluate both the original function and its derivative(s), returning the results in the form of a new ForwardDiffNumber. Thus, we can pass these number types into a general function $f$ (which is assumed to be composed of the overloaded elementary functions), and the derivative information is naturally propagated at each step of the calculation by way of the chain rule.

This propagation occurs all the way through to the result of the function, which is itself a ForwardDiffNumber (or an Array{F<:ForwardDiffNumber}). This final result contains the value $f(x)$ and the derivative $f’(x)$, where $x$ was the original point of evaluation.

Simple Forward Mode AD in Julia: Dual Numbers

The easiest way to write actual Julia code demonstrating this technique is to implement a simple dual number type. Note that there is already a Julia package dedicated to such an implementation, but we’re going to roll our own here for pedagogical purposes.

Here’s how we’ll define our DualNumber type:

immutable DualNumber{T} <: Number

value(d::DualNumber) = d.value
deriv(d::DualNumber) = d.deriv

Next, we can start defining functions on DualNumber. Here are a few examples to give you a feel for the process:

function Base.sqrt(d::DualNumber)
new_value = sqrt(value(d))
new_deriv = 0.5 / new_value
return DualNumber(new_value, new_deriv*deriv(d))

function Base.sin(d::DualNumber)
new_value = sin(value(d))
new_deriv = cos(value(d))
return DualNumber(new_value, new_deriv*deriv(d))

function Base.(:+)(a::DualNumber, b::DualNumber)
new_value = value(a) + value(b)
new_deriv = deriv(a) + deriv(b)
return DualNumber(new_value, new_deriv)

function Base.(:*)(a::DualNumber, b::DualNumber)
val_a, val_b = value(a), value(b)
new_value = val_a * val_b
new_deriv = val_b * deriv(a) + val_a * deriv(b)
return DualNumber(new_value, new_deriv)

We can now evaluate the derivative of any scalar function composed of the above elementary functions. To do so, we simply pass an instance of our DualNumber type into the function, and extract the derivative from the result. For example:

julia> f(x) = sqrt(sin(x * x)) + x
f (generic function with 1 method)

julia> f(1.0)

julia> d = f(DualNumber(1.0, 1.0))

julia> deriv1 = deriv(d)

julia> using Calculus; deriv2 = Calculus.derivative(f, 1.0)

julia> deriv1 – deriv2

Notice that our dual number result comes close to the result obtained from Calculus.jl, but is actually slightly different. That slight difference is due to the approximation error inherent to the finite differencing method employed by Calculus.jl.

In reality, the number types that ForwardDiff.jl implements are quite a bit more complicated than DualNumber. ForwardDiffNumbers behave like ensembles of dual numbers and hyper-dual numbers (the higher-order analog of dual numbers), allowing for simultaneous calculation of multiple higher-order partial derivatives. For an in-depth examination of ForwardDiff.jl’s number type implementation, see this section of the developer documentation.

Performance Comparison: The Ackley Function

To illustrate the performance gains that can be achieved using ForwardDiff.jl, let’s do a (very) naive benchmark comparison between ForwardDiff.jl and other libraries. For this comparison, we’ll be comparing the time to calculate the gradient of a function using ForwardDiff.jl, Calculus.jl, and a Python-based AD tool, AlgoPy.

To put these benchmarks in context, here are my machine specs, along the version info of the relevant libraries:

Machine Specs

  • Model: Late 2013 MacBook Pro
  • OS: OSX 10.9.5
  • Processor: 2.6 GHz Intel Core i5
  • Memory: 8 GB 1600 MHz DDR3

Version Info

  • Julia v0.4.1-pre+15
    • ForwardDiff.jl v0.1.2
    • Calculus.jl v0.1.13
  • Python v2.7.9
    • AlgoPy v0.5.1

Defining the Ackley Function

The function we’ll be using in our test is the Ackley function, which is mathematically defined as

Here’s the definition of the function in Julia:

function ackley(x)
a, b, c = 20.0, -0.2, 2.0*π
len_recip = inv(length(x))
sum_sqrs = zero(eltype(x))
sum_cos = sum_sqrs
for i in x
sum_cos += cos(c*i)
sum_sqrs += i^2
return (-a * exp(b * sqrt(len_recip*sum_sqrs)) -
exp(len_recip*sum_cos) + a + e)

…and here’s the corresponding Python definition:

def ackley(x):
a, b, c = 20.0, -0.2, 2.0*numpy.pi
len_recip = 1.0/len(x)
sum_sqrs, sum_cos = 0.0, 0.0
for i in x:
sum_cos += algopy.cos(c*i)
sum_sqrs += i*i
return (-a * algopy.exp(b*algopy.sqrt(len_recip*sum_sqrs)) -
algopy.exp(len_recip*sum_cos) + a + numpy.e)

Benchmark Results

The benchmarks were performed with input vectors of length 16, 1600, and 16000, taking the best time out of 5 trials for each test.

Function evaluation time

The below table compares the evaluation times of ackley(x) in both Python and Julia:

length(x) Python time (s) Julia time (s) Speed-Up vs. Python
16 0.00011 2.3e-6 47.83x
1600 0.00477 4.0269e-5 118.45x
16000 0.04747 0.00037 128.30x

Gradient evaluation time

The below table compares the evaluation times of ∇ackley(x) using various libraries (the chunk_size column denotes a configuration option passed to the ForwardDiff.gradient method, see the chunk-mode docs for details.):

length(x) AlgoPy time (s) Calculus.jl time (s) ForwardDiff time (s) chunk_size
16 0.00212 2.2e-5 3.5891e-5 16
1600 0.53439 0.10259 0.01304 10
16000 101.55801 11.18762 1.35411 10

Here, we show the speed-up ratio of ForwardDiff.jl over the other libraries:

length(x) Speed-Up vs. AlgoPy Speed-Up vs. Calculus.jl
16 59.07x 0.61x
1600 40.98x 7.86x
16000 74.99x 8.26x

As you can see, Python + AlgoPy falls pretty short of the speeds achieved by Julia + ForwardDiff.jl, or even Julia + Calculus.jl. While Calculus.jl is actually almost twice as fast as ForwardDiff.jl for the lowest input dimension vector, it is ~8 times slower than ForwardDiff.jl for the higher input dimension vectors.

Another metric that might be useful to look at is the “slowdown ratio” between the gradient evaluation time and the function evaluation time, defined as:

Here are the results (lower is better):

length(x) AlgoPy ratio Calculus.jl ratio ForwardDiff.jl ratio
16 19.27 9.56 15.60
1600 112.03 2547.61 323.82
16000 2139.41 30236.81 3659.77

Both AlgoPy and ForwardDiff.jl beat out Calculus.jl for evaluation at higher input dimensions, which isn’t too surprising. AlgoPy beating ForwardDiff.jl, though, might catch you off guard – ForwardDiff.jl had the fastest absolute runtimes, after all! One explanation for this outcome is that AlgoPy falls back to vectorized Numpy methods when calculating the gradient, while the ackley function itself uses your usual, slow Python scalar arithmetic. Julia’s scalar arithmetic performance is much faster than Python’s, so ForwardDiff.jl doesn’t have as much “room for improvement” as AlgoPy does.

Julia’s AD Advantage

At the beginning of this post, I promised I would give the reader an answer to the question: “Why is Julia uniquely well-suited for AD compared to other languages?”

There are several good answers, but the chief reason for Julia’s superiority is its efficient implementation of multiple dispatch.

Unlike many other languages, Julia’s type-based operator overloading is fast and natural, as it’s one of the central design tenets of the language. Since Julia is JIT-compiled, the bytecode representation of a Julia function can be tied directly to the types with which the function is called. This allows the compiler to optimize every Julia method for the specific input type at runtime.

This ability is phenomenally useful for implementing forward mode AD, which relies almost entirely on operator overloading in order to work. In most other scientific computing languages, operator overloading is either very slow (e.g. MATLAB), fraught with weird edge cases (e.g. Python), arduous to implement generally (e.g. C++) or some combination of all three. In addition, very few languages allow operator overloading to naturally extend to native, black-box, user-written code. Julia’s multiple dispatch is the secret weapon leveraged by ForwardDiff.jl to overcome these hurdles.

Future Directions

The new version of ForwardDiff.jl has just been released, but development of the package is still ongoing! Here’s a list of things I’d like to see ForwardDiff.jl support in the future:

  • More elementary function definitions on ForwardDiffNumbers
  • More optimized versions of existing elementary function definitions on ForwardDiffNumbers
  • Methods for evaluating Jacobian-matrix products (highly useful in conjunction with reverse mode AD).
  • Parallel/shared-memory/distributed-memory versions of current API methods for handling problems with huge input/output dimensions
  • A more robust benchmarking suite for catching performance regressions

If you have any ideas on how to make ForwardDiff.jl more useful, feel free to open a pull request or issue in the package’s GitHub repository.