Neural Style Transfer in Julia on GPUs

Have you ever wondered how the app Prisma manages to turn your photos into impressionist paintings? It uses, amongst
other things, an algorithm called neural style transfer.

Neural style transfer combines the content of one image with the style of
another using deep neural networks. It was first introduced in a famous paper by L. A. Gatys
and his team at the University of Tübingen, Germany. There, they demonstrate how one
can use a class of deep neural networks to extract features from any “style” image
and subsequently “apply” them to any “content” image.
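To make this concrete, here is a minimal sketch of the losses in the Gatys et al. formulation, written in plain Julia. The feature matrices F, P, and S (generated, content, and style features, laid out as spatial positions by channels) and the weights α and β are illustrative placeholders rather than the exact layout or values used in our code.

```julia
# Content loss: how far the generated features F are from the content features P
content_loss(F, P) = 0.5 * sum((F .- P) .^ 2)

# Gram matrix: correlations between feature channels, which capture "style"
gram(F) = F' * F

# Style loss: distance between the Gram matrices of the generated and style features
function style_loss(F, S)
    G, A = gram(F), gram(S)
    M, N = size(F)                    # spatial positions, channels
    sum((G .- A) .^ 2) / (4 * N^2 * M^2)
end

# The generated image is optimized to minimize a weighted sum of both terms
total_loss(F, P, S; α = 1.0, β = 1e3) = α * content_loss(F, P) + β * style_loss(F, S)
```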

The class of deep neural networks most powerful for image processing
tasks is the Convolutional Neural Network (CNN).
A CNN consists of a series of layers that act as image filters; each filter
extracts a feature from the input image. Together, these layers form a model, or network, that describes the transformation from the input image to the output features. The model used for this particular exercise was the VGG network, a popular model for object recognition tasks.

Training

As mentioned earlier, we have two images: a style image, from which we extract
features, and a content image, on which these features are applied.

Here’s the content image, a photograph of the Taj Mahal in Agra, India.


The Taj Mahal

Here’s the style image, one of the famous Water Lilies by the French impressionist, Claude Monet. We shall extract features from this particular image, and then apply them to our content image.


Water Lilies by Claude Monet, 1916-19

When we train our model on these two images, learning the styles from the style image and applying them to the content image, we obtain the following output: the Taj Mahal drawn in the style of the water lilies.

Notice that the structural features of the content image (in other words, the
outlines of the building) have been preserved, while a kind of texture has
been extracted from the style image and applied on top of them.
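For intuition, the sketch below runs the kind of gradient descent this training step performs, but in a deliberately simplified setting where the “features” are just the pixels themselves. In the real algorithm the gradients are propagated back through the VGG network instead, and the weights and step size here are arbitrary.

```julia
# Toy version of the optimization: the image X is updated to match the
# content features P while its Gram matrix moves towards that of the style S.
P = rand(Float32, 64, 64)          # stand-in for content features
S = rand(Float32, 64, 64)          # stand-in for style features
X = copy(P)                        # image being optimized, initialized from the content

α, β, η = 1.0f0, 1.0f-3, 0.1f0     # illustrative content/style weights and step size

for step in 1:200
    g_content = X .- P                          # gradient of the content term
    G, A = X' * X, S' * S                       # Gram matrices
    M, N = size(X)
    g_style = X * (G .- A) ./ (N^2 * M^2)       # gradient of the style term
    X .-= η .* (α .* g_content .+ β .* g_style) # gradient step on the image
end
```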

Pretrained Models

We can also load pretrained models. In fact, this is what most mobile apps do.
The picture we take with our camera is the content image. The app then performs
a single forward pass through the model, “applying” a texture to it.
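The sketch below shows what that single forward pass might look like with MXNet.jl. The one-layer net and the random array standing in for the preprocessed content image are placeholders; a real pretrained style network would be loaded from disk and its weights copied into the executor before the forward pass.

```julia
using MXNet

# Placeholder for the real pretrained style network (a single convolution here)
data = mx.Variable(:data)
net  = mx.Convolution(data=data, kernel=(3,3), pad=(1,1), num_filter=3)

# Stand-in for the preprocessed content image: width × height × channels × batch
content = rand(Float32, 224, 224, 3, 1)

# Bind the network on the GPU and feed in the image
exec = mx.simple_bind(net, mx.gpu(), data=size(content))
copy!(exec.arg_dict[:data], content)
# (in practice, the pretrained weights would also be copied into exec.arg_dict here)

mx.forward(exec)                  # one forward pass "applies" the texture
stylized = copy(exec.outputs[1])  # result back as a plain Julia Array
```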

We can see this with another content image. This is an artist’s impression of Kvothe,
one of my favorite fictional characters.


Portrait of a Red-haired Man

We have two pretrained models that provide two kinds of textures:
fire and frost. We pass the content image through each pretrained model
and obtain the following results, which show the original image with the two different styles applied.

Fire Texture
Frost Texture

The Code

The code for the above exercise can be found here.
We observe fast training times, thanks in large part to optimized convolution
kernels on the GPU. With the MXNet deep learning library, Julia makes it very easy
to run these operations on a GPU. The following chart shows the substantial speedup we obtain from the GPU.
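To give a flavor of why this is easy: in MXNet.jl, moving work onto the GPU is largely a matter of choosing the device context when arrays and models are created. The snippet below is a small illustration; the device id and array sizes are arbitrary.

```julia
using MXNet

ctx = mx.gpu(0)                    # or mx.cpu() to fall back to the CPU

# NDArrays allocated on the chosen device; operations on them run there too
a = mx.ones((1024, 1024), ctx)
b = mx.ones((1024, 1024), ctx)
c = a + b                          # elementwise add executed on the GPU

# The same context is passed when binding or training a network,
# e.g. mx.simple_bind(net, ctx, data=(224, 224, 3, 1))
```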

We performed this exercise on an IBM PowerNV 8335-GCA server, which has 160 CPU
cores and a Tesla K80 (dual) GPU accelerator.

Future Work

It would be nice to make stylized videos, such as this.
This would involve passing every frame of the video through a pretrained model,
thereby generating a new, stylized video. We will write about that work in subsequent blog posts.
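A rough sketch of how that loop might look is below, assuming the VideoIO.jl package for reading frames. Here stylize is a hypothetical helper standing in for the preprocessing, forward pass, and postprocessing described above, and encoding the stylized frames back into a video file is not shown.

```julia
using VideoIO

# Hypothetical helper: would apply the pretrained style model to a single frame
stylize(frame) = frame  # placeholder; the real version would run the forward pass

reader = VideoIO.openvideo("input.mp4")
stylized_frames = []
while !eof(reader)
    frame = read(reader)                 # read the next frame as an image array
    push!(stylized_frames, stylize(frame))
end
close(reader)
# stylized_frames would then be encoded back into a new video
```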