Amazon Web Services Announces MXNet Is “Framework of Choice” for Deep Learning Using Julia and Other Languages

By: Julia Computing, Inc.

Re-posted from:

Las Vegas, NV – Amazon Web Services announced at this week’s AWS re:Invent 2016 conference that MXNet is the framework of choice for deep learning using Julia and other languages.

According to CTO Werner Vogels, “In addition to scalability, MXNet offers the ability to both mix programming models (imperative and declarative), and code in a wide number of programming languages, including Python, C++, R, Scala, Julia, Matlab and JavaScript.” The full blog post is available here.

AWS provides the world’s largest cloud environment, which means that deep learning using Julia and MXNet is now available for more than 1 million businesses that use AWS.

Matt Wood, GM Product Strategy, Amazon Web Services
AWS re:Invent 2016
Photo by Greg Kelleher @gregkel

According to Matt Wood, GM Product Strategy for Amazon Web Services: “MXNet has a lot of the characteristics that developers like when they are going off and building deep learning.  First is programmability.  MXNet supports a really broad set of programming languages.  So whether you are used to using Python or Scala, or whether, like me, you are a fan of Julia or Javascript or Matlab or Go, you can use all of the languages you are used to using, and start running your deep learning straight away.” The full video is available here.

You can learn more about MXNet.jl by clicking here, or visit our Website to learn about Julia’s deep learning and GPU capabilities.

About Julia Computing and Julia

Julia Computing was founded in 2015 by the co-creators of the Julia language to provide support to businesses and researchers who use Julia, the fastest modern open source programming language for data and analytics.

Julia combines the functionality of quantitative environments such as Python and R with the speed of production programming languages like Java and C++ to solve big data and analytics problems. Julia delivers dramatic improvements in simplicity, speed, capacity and productivity.

Julia users and partners include: IBM, Intel, DARPA, Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center (NERSC), Federal Aviation Administration (FAA), MIT Lincoln Labs, Moore Foundation, Nobel Laureate Thomas J. Sargent, Federal Reserve Bank of New York (FRBNY), Brazilian National Development Bank (BNDES), BlackRock, Conning, Berkery Noyes, BestX and many of the world’s largest investment banks, asset managers, fund managers, foreign exchange analysts, insurers, hedge funds and regulators. Julia is being used to analyze images of the universe and research dark matter, drive parallel computing on supercomputers, diagnose medical conditions, manage 3D printers, build drones, improve air safety, provide analytics for foreign exchange trading, insurance, regulatory compliance, macroeconomic modeling, sports analytics, manufacturing and much, much more.

Learning without Backpropagation: Intuition and Ideas (Part 2)

By: Tom Breloff

Re-posted from:

In part one, we peeked into the rabbit hole of backprop-free network training with asymmetric random feedback. In this post, we’ll jump into the rabbit hole with both feet. First I’ll demonstrate how it is possible to learn by “gradient” descent with zero-derivative activations, where learning by backpropagation is impossible. The technique is a modification of Direct Feedback Alignment. Then I’ll review several different (but unexpectedly related) research directions: targetprop, e-prop, and synthetic gradients, which set up my ultimate goal: efficient training of arbitrary recurrent networks.

Direct Feedback Alignment

In recent research from Arild Nøkland, he explores extensions to random feedback (see part one) that avoid backpropagating error signals sequentially through the network. Instead, he proposes Direct Feedback Alignment (DFA) and Indirect Feedback Alignment (IFA) which connect the final error layer directly to earlier hidden layers through random feedback connections. Not only are they more convenient for error distribution, but they are more biologically plausible as there is no need for weight symmetry or feedback paths that match forward connectivity. A quick tutorial on the method:

Learning through flat activations

In this post, we’re curious whether we can use a surrogate gradient algorithm that will handle threshold activations. Nøkland connects the direct feedback connections from the transformation output error gradient to the “layer output”, which in this case is the output of the activation functions. However, we want to use activation functions with zero derivative, so even with direct feedback the gradients would be zeroed during propagation through the activations.

To get around this issue, we modify DFA to instead connect the error layer directly to the inputs of the activations, instead of the outputs. The result is that we have affine transformations which can learn to connect latent input ($h_{i-1}$ from earlier layers) to a projection of output error ($B_i \nabla y$) into the space of $h_i$, before applying the threshold nonlinearity. The effect of the application of a nonlinear activation is “handled” by the progressive re-learning of later network layers. Effectively, each layer learns how to align their inputs with a fixed projection of the error. The hope is that, by aligning layer input with final error gradients, we can project the inputs to a space that is useful for later layers. Learning happens in parallel, and later layers eventually learn to adjust to the learning that happens in the earlier layers.

MNIST with Modified DFA

Reusing the approach in an earlier post on JuliaML, we will attempt to learn neural network parameters both with backpropagation and our modified DFA method. The combination of Plots and JuliaML makes digging into network internals and building custom learning algorithms super-simple, and the DFA learning algorithm was fairly quick to implement. The full notebook can be found here. To ease understanding, I’ve created a video to review the notebook, method, and preliminary results:

Nice animations can be built using the super-convenient animation facilities of Plots:

Target Propagation

The concept of Target Propagation (targetprop) goes back to LeCun 1987, but has recently been explored in depth in Bengio 2014, Lee et al 2014, and Bengio et al 2015. The intuition is simple: instead of focusing solely on the “forward-direction” model ($y = f(x)$), we also try to fit the “backward-direction” model ($x = g(y)$). $f$ and $g$ form an auto-encoding relationship; $f$ is the encoder, creating a latent representation and predicted outputs given inputs $x$, and $g$ is the decoder, generating input representations/samples from latent/output variables.

Bengio 2014 iteratively adjusts weights to push latent outputs $h_i$ towards the targets. The final layer adjusts towards useful final targets using the output gradients as a guide:

Difference Target Propagation makes a slight adjustment to the update, and attempts to learn auto-encoders which fulfill:

Finally, Bengio et al 2015 extend targetprop to a Bayesian/generative setting, in which they attempt to reduce divergence between generating distributions p and q, such that the pair of conditionals form a denoising auto-encoder:

Targetprop (and its variants/extensions) is a nice alternative to backpropagation. There is still sequential forwards and backwards passes through the layers, however we:

  • avoid the issues of vanishing and exploding gradients, and
  • focus on the role of intermediate layers: creating latent representations of the input which are useful in the context of the target values.

Equilibrium Propagation

Equilibrium Propagation (e-prop) is a relatively new approach which (I’m not shy to admit) I’m still trying to get my head around. As I understand, it uses an iterative process of perturbing components towards improved values and allowing the network dynamics to settle into a new equilibrium. The proposed algorithm alternates between phases of “learning” in a forward and backward direction, though it is a departure from the simplicity of backprop and optimization.

The concepts are elegant, and it offers many potential advantages for efficient learning of very complex networks. However it will be a long time before those efficiencies are realized, given the trend towards massively parallel GPU computations. I’ll follow this line of research with great interest, but I don’t expect it to be used in a production setting in the near future.

Synthetic Gradients

A recent paper from DeepMind takes an interesting approach. What if we use complex models to estimate useful surrogate gradients for our layers? Their focus is primarily from the perspective of “unlocking” (i.e. parallelizing) the forward, backward, and update steps of a typical backpropagation algorithm. However they also offer the possibility of estimating (un-truncated) Backpropagation Through Time (BPTT) gradients, which would be a big win.

Layers output to a local model, called a Decoupled Neural Interface. This local model estimates the value of the backpropagated gradient that would be used for updating the parameters of that layer, estimated using only the layer outputs and target vectors. If you’re like me, you noticed the similarity to DFA in modeling the relationship of the local layer and final targets in order to choose a search direction which is useful for improving the final network output.

What next?

I think the path forward will be combinations and extensions of the ideas presented here. Like synthetic gradients and direct feedback, I think we should be attempting to find reliable alternatives to backpropagation which are:

  • Highly parallel
  • Asymmetric
  • Local in time and space

Obviously they must still enable learning, and efficient/simple solutions are preferred. I like the concept of synthetic gradients, but wonder if they are optimizing the wrong objective. I like direct feedback, but wonder if there are alternate ways to initialize or update the projection matrices ($B_1, B_2, …$). Combining the concepts, can we add non-linearities to the error projections (direct feedback) and learn a more complex (and hopefully more useful) layer?

There is a lot to explore, and I think we’re just at the beginning. I, for one, am happy that I chose the red pill.

Julia for Astronomy: Parallel Computing with Julia on NERSC Supercomputer Increases Speed of Image Analysis 225x

By: Julia Computing, Inc.

Re-posted from:

Berkeley, CA – Researchers from Julia Computing, UC Berkeley, Intel, the National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory, and JuliaLabs@MIT have developed a new parallel computing method to dramatically scale up the process of cataloging astronomical objects. This major improvement leverages 8,192 Intel® Xeon® processors in Berkeley Lab’s new Cori supercomputer and Julia, the high-performance, open-source scientific computing language to deliver a 225x increase in the speed of astronomical image analysis.

The code used for this analysis is called Celeste. It was developed at Berkeley Lab and uses statistical inference to mathematically locate and characterize light sources in the sky. When it was first released in 2015, Celeste was limited to single-node execution on at most hundreds of megabytes of astronomical images. In the case of the Sloan Digital Sky Survey, which is the dataset used for this research, this analysis is conducted by identifying points of light in nearly 5 million images of approximately 12 megabytes each – a dataset of 55 terabytes.

Using the new parallel implementation, the research team dramatically increased the speed of its analysis by an estimated 225x. This enabled the processing of more than 20 thousand images, or 250 gigabytes – an increase of more than 3 orders of magnitude compared with previous iterations.

“Astronomical surveys are the primary source of data about the Universe beyond our solar system,” said Jeff Regier, a postdoctoral fellow in the UC Berkeley Department of Electrical Engineering and Computer Sciences who has been instrumental in the development of Celeste. “Through Bayesian statistics, Celeste combines what we already know about stars and galaxies from previous surveys and from physics theories, with what can be learned from new data. Its output is a highly accurate catalog of galaxies’ locations, shapes and colors. Such catalogs let astronomers test hypotheses about the origin of the Universe, as well as about the nature of dark matter and dark energy.”

“It is exactly to enable such cutting-edge machine-learning algorithms on massive data that we designed the Julia language,” said Viral Shah, CEO of Julia Computing. “Researchers can now focus on problem solving rather than programming.”

NERSC provided the extensive computing resources the team needed to apply such a complex algorithm to so much data, assisting with many aspects of designing a program to run at scale, including load balancing and interprocess communication, Regier noted.

“Practically all the significant code that runs on supercomputers is written in C/C++ and Fortran, for good reason: efficiency is critically important,” said Pradeep Dubey, Intel Fellow and Director of the Parallel Computing Lab at Intel. “With Celeste, we are closer to bringing Julia into the conversation because we’ve demonstrated excellent efficiency using hybrid parallelism – not just processes, but threads as well – something that’s still impossible to do with Python or R.”

Alan Edelman, co-creator of the Julia language and professor of applied mathematics at MIT, said, “The JuliaLabs group at MIT is thrilled and impressed with this advancement in the use of Julia for High Performance Computing. The dream of ‘ease of use’ and (‘and’ not ‘or!’) ‘high performance’ is becoming a reality.”

The Celeste project is at the cutting edge of scientific big data analysis along multiple fronts, added Prabhat, NERSC Data and Analytics Services Group Lead and principal investigator for the MANTISSA project. “From a scientific perspective, it is one of the first codes that can conduct inference across multiple imaging surveys and create a unified catalog with uncertainties,” he said. “From a methods perspective, it is the first demonstration of large scale variational inference applied to hundreds of gigabytes of scientific data. From a software perspective, I believe it is one of the largest applications of the Julia language to a significant problem: we have integrated the DTree scheduler and utilized MPI-3 one-sided communication primitives.”

This implementation of Celeste also demonstrated good weak and strong scaling properties on 256 nodes of the Cori Phase I system, Prabhat added. The group’s next step will be to apply Celeste to the entire SDSS imaging dataset, followed by a joint SDSS + DECaLS analysis on Cori Phase II.

About the National Energy Research Scientific Computing Center (NERSC) and Lawrence Berkeley National Laboratory: The National Energy Research Scientific Computing Center (NERSC) is the primary high-performance computing facility for scientific research sponsored by the U.S. Department of Energy’s Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 6,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a U.S. Department of Energy national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. DOE Office of Science.

About Julia, Julia Computing and JuliaLabs@MIT: Julia is the high performance open source computing language that is taking astronomy, finance and other big data analytics fields by storm.  Julia users and partners include: Intel, DARPA, Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center (NERSC), IBM, Federal Aviation Administration (FAA), MIT Lincoln Labs, Moore Foundation, Nobel Laureate Thomas J. Sargent, Federal Reserve Bank of New York (FRBNY), Brazilian National Development Bank (BNDES), BlackRock, Conning, Berkery Noyes, BestX and researchers at MIT, Harvard, UC Berkeley, Stanford and NYU. Julia Computing is the for-profit Julia consulting firm founded by the co-creators of the Julia computing language to help researchers and businesses maximize productivity and efficiency using Julia. JuliaLabs@MIT, led by Professor Alan Edelman, conducts research using the Julia language.
Intel is a registered trademark of Intel Corporation in the United States and other countries.