Triangle frenzy

By: Can Candan's Blog

Re-posted from: https://cancandan.github.io/julia/graphics/cuda/2022/05/07/triangles.html

Suppose we want to draw a batch of images, where each image is made up of randomly positioned and colored triangles that blend with each other, so it looks like this:

[Image: a batch of images composed of randomly positioned, blended colored triangles]

and then find the Euclidean distance of each such image to a given target image. Why on earth am I doing this? Well, this turns into an interesting optimization problem of finding the closest triangle image, and it is also an excuse to practice Julia. The inspiration is from this repo.

Towards framing this as an optimization problem, we will represent a triangle as a vector of size 10, made up of floating point numbers between 0 and 1. Four numbers of this vector are for the color of the triangle (r, g, b and alpha), and the three vertices of the triangle need 6 numbers: (x1,y1), (x2,y2), (x3,y3). Hence, if we want to draw M images, each image having N triangles, we need a matrix of size (10,N,M), which will be our parameters matrix. I want to randomly create such a matrix and min-max scale it along the triangle dimension. By this I mean: for each image, I first find the minimum and maximum of a triangle parameter among the N triangles, then subtract this minimum from the parameter and divide the result by the difference between the maximum and the minimum. I want to end up with an array of size (3,w,h,M) for the images, where w is width and h is height, and an array of size M for the distances. Let's see how fast we can do this.

First order of business is setting this up; note that I am scaling the vertex coordinates to the given width and height:

using Images
using Statistics
using Cairo
using BenchmarkTools
using CUDA            # needed for the GPU version further below
using Random: seed!

function prepare()
    seed!(123)
    w, h = 128, 128
    num_params = 10
    num_triangles = 50
    num_images = 256
    target = rand(Float32, 3, h, w)
    prms = rand(Float32, num_params, num_triangles, num_images)
    # min-max scale each parameter along the triangle dimension
    mn = minimum(prms, dims=2)
    mx = maximum(prms, dims=2)
    prms .= (prms .- mn) ./ (mx .- mn)
    prms[1:2:6, :, :] .*= w   # x coordinates of the vertices
    prms[2:2:6, :, :] .*= h   # y coordinates of the vertices
    return prms, target, num_images, num_triangles, w, h
end

The first thing that comes to mind is to use a 2D graphics library for drawing, and since the Cairo lib is available, let's try that. The function below draws the triangles on a blank white Cairo canvas and copies the result into the img array at the end:

@views function renderCairo(img, prms, num_triangles, w, h)
    # blank white canvas: 0xffffffff is opaque white in ARGB32
    buffer = ones(UInt32, w, h) * typemax(UInt32)
    c = Cairo.CairoImageSurface(buffer, Cairo.FORMAT_ARGB32, flipxy=false)
    cr = CairoContext(c)

    for i in 1:num_triangles
        q = prms[:, i]
        # q[1:6] hold the three vertices, q[7:10] the r, g, b, alpha color
        set_source_rgba(cr, q[7], q[8], q[9], q[10])
        move_to(cr, q[1], q[2])
        line_to(cr, q[3], q[4])
        line_to(cr, q[5], q[6])
        close_path(cr)
        fill(cr)
    end

    # convert the ARGB32 buffer into a (3, h, w) Float32 channel view
    resultimg = reinterpret(ARGB32, permutedims(c.data, (2, 1)))
    resultchn = Float32.(channelview(RGB.(resultimg)))
    img .= resultchn
    Cairo.finish(c)
    Cairo.destroy(c)
end

Now let’s draw each image in this fashion:

@views function withCairo()
    prms, target, num_images, num_triangles, w, h = prepare()

    imgs = Array{Float32}(undef, 3, h, w, num_images)
    for i in 1:num_images
        img = imgs[:, :, :, i]   # a view, so renderCairo writes in place
        renderCairo(img, prms[:, :, i], num_triangles, w, h)
    end
    # mean squared distance to the target, one number per image
    dists = reshape(mean((imgs .- target) .^ 2, dims=(1, 2, 3)), num_images)
    return imgs, 1 .- dists      # fitness = 1 - distance
end

Benchmarking this with @btime withCairo(); I see 428.101 ms (3157 allocations: 205.09 MiB).

The cross product method

Now something really cool. Take a triangle ABC and a point P, and look at the magnitude and direction of the 3rd (z) components of the cross products of AB with AP (reds), BC with BP (greens) and CA with CP (blues); the original post has an interactive demo where P follows your mouse. Observe that all those bars point in the same direction only when P is inside the triangle!

What's great about this is that cross products are just multiplications and subtractions: a perfect job to parallelize on a GPU. What needs to be done is clear: for each of the M images, for each of the N triangles, our operation is to update a pixel color to blend with the current triangle's color if that pixel is inside the triangle. We will parallelize this operation with a CUDA kernel, shown below:

function puttri(prms, imgs, tri, ins)
    # pixel coordinates handled by this thread
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    idy = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    # the launch grid may overscan the image; skip out-of-range threads
    if idx > size(imgs, 2) || idy > size(imgs, 3)
        return
    end

    # z-component of AP × AB
    abx = prms[3,tri,ins] - prms[1,tri,ins]
    aby = prms[4,tri,ins] - prms[2,tri,ins]
    apx = idx - prms[1,tri,ins]
    apy = idy - prms[2,tri,ins]
    cr1 = apx * aby - apy * abx

    # z-component of BP × BC
    bcx = prms[5,tri,ins] - prms[3,tri,ins]
    bcy = prms[6,tri,ins] - prms[4,tri,ins]
    bpx = idx - prms[3,tri,ins]
    bpy = idy - prms[4,tri,ins]
    cr2 = bpx * bcy - bpy * bcx

    # z-component of CP × CA
    cax = prms[1,tri,ins] - prms[5,tri,ins]
    cay = prms[2,tri,ins] - prms[6,tri,ins]
    cpx = idx - prms[5,tri,ins]
    cpy = idy - prms[6,tri,ins]
    cr3 = cpx * cay - cpy * cax

    # all three signs agree => the pixel is inside; alpha-blend the color in
    if ((cr1>=0) & (cr2>=0) & (cr3>=0)) | ((cr1<=0) & (cr2<=0) & (cr3<=0))
        oneMinusAlpha = (1.0f0 - prms[10,tri,ins])
        imgs[1,idx,idy,ins] = imgs[1,idx,idy,ins] * oneMinusAlpha + prms[7,tri,ins] * prms[10,tri,ins]
        imgs[2,idx,idy,ins] = imgs[2,idx,idy,ins] * oneMinusAlpha + prms[8,tri,ins] * prms[10,tri,ins]
        imgs[3,idx,idy,ins] = imgs[3,idx,idy,ins] * oneMinusAlpha + prms[9,tri,ins] * prms[10,tri,ins]
    end
    return
end
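As a quick sanity check, the same sign test is easy to try on the CPU (intriangle is just an illustrative name, not part of the post's code):

function intriangle(px, py, ax, ay, bx, by, cx, cy)
    cr1 = (px - ax) * (by - ay) - (py - ay) * (bx - ax)   # (AP × AB)_z
    cr2 = (px - bx) * (cy - by) - (py - by) * (cx - bx)   # (BP × BC)_z
    cr3 = (px - cx) * (ay - cy) - (py - cy) * (ax - cx)   # (CP × CA)_z
    (cr1 >= 0 && cr2 >= 0 && cr3 >= 0) || (cr1 <= 0 && cr2 <= 0 && cr3 <= 0)
end

intriangle(1.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0, 4.0)  # true: inside
intriangle(5.0, 5.0, 0.0, 0.0, 4.0, 0.0, 0.0, 4.0)  # false: outside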

We will need to pass our parameters and target arrays to the GPU, and then call the kernel with @cuda. We can create the white canvases with CUDA.ones here, so there is no need to pass them in.

function withGpu()
    prms, target, num_images, num_triangles, w, h = prepare()

    prms = CuArray(prms)
    imgs = CUDA.ones(3, h, w, num_images)   # white canvases, directly on the GPU
    target = CuArray(target)
    for tri in 1:num_triangles
        for i in 1:num_images
            # 32x32 threads per block, 8x8 blocks; the kernel's guard skips
            # the threads that fall outside the 128x128 image
            @cuda threads=(32,32) blocks=(8,8) puttri(prms, imgs, tri, i)
        end
    end
    gpufitnesses = 1.0f0 .- reshape(mean((imgs .- target) .^ 2, dims=(1,2,3)), num_images)
    return Array(imgs), Array(gpufitnesses)
end

Benchmarking this I see 120.315 ms (38689 allocations: 52.53 MiB).

Awesome, that's about a 4x speedup. Note that I benchmarked with --check-bounds=no, a startup option you pass when launching Julia that disables bounds checking. In the next post, I will talk about the very cool PGPE algorithm used in evojax to steer these images towards a target image. You can see one example of this here.

I am pretty new to Julia, so please let me know if you have any comments or suggestions.

minGPT in Julia using Flux!

By: Can Candan's Blog

Re-posted from: https://cancandan.github.io/julia/flux/machine-learning/2022/03/30/mingpt-julia.html

Introduction

As a learning exercise, I tried to port Andrej Karpathy's awesome minGPT, which is based on Python and PyTorch, to Julia and Flux. GPT is a language model trained by the error signal of its prediction for the next element of a given sequence. Karpathy runs the model on three different problems, each in a distinct domain but fitting this format: language, vision and math. Here I concentrate on the self-contained math problem, in which we are interested in seeing whether the model can learn to do addition given two two-digit numbers. We begin by creating a dataset where we encode the addition problem and its result as one string: the two two-digit input numbers followed by the three-digit result (both inputs and the result padded with zeros if necessary). For example, the addition of 85 and 50, which results in 135, is encoded as the sequence [8, 5, 5, 0, 1, 3, 5]. Given 85 and 50, the model should predict 135. This amounts to predicting [8, 5, 5, 0, 1] given [8, 5, 5, 0], predicting [8, 5, 5, 0, 1, 3] given [8, 5, 5, 0, 1], and finally predicting [8, 5, 5, 0, 1, 3, 5] given [8, 5, 5, 0, 1, 3].
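Schematically, one training example can be encoded like this (a sketch of the idea; makeData in the notebook below does the real work, and these variable names are mine):

a, b = 85, 50
c = a + b
# zero-pad the inputs to 2 digits and the sum to 3 digits, then split into digits
seq = parse.(Int, collect(string(a, pad=2) * string(b, pad=2) * string(c, pad=3)))
# seq == [8, 5, 5, 0, 1, 3, 5]
x = seq[1:end-1]                      # model input:  [8, 5, 5, 0, 1, 3]
y = vcat(fill(-100, 3), seq[5:end])   # model target: [-100, -100, -100, 1, 3, 5]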

Hence, our input to the model will look like [8, 5, 5, 0, 1, 3]. For the output he considers a sequence like [-100, -100, -100, 1, 3, 5]. The -100s are to be ignored in the loss calculation. How this translates to Julia code can be understood from this part of the code:


Note that since Julia indexing starts from 1, our labels start from 1, and we have -99 instead of -100. What I am doing here is to one-hot encode the digits and also the -99, drop that -99 row (see the [1:end-1, :, :]) and then element-wise multiply (the .* in the loss function). This amounts to ignoring the known part of the given sequence in the loss calculation.
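The code block itself is an image in the original post; a minimal sketch of the idea could look like this (masked_loss and the shapes are my assumptions: logits is (10, T, B) and targets is (T, B), with -99 marking ignored positions):

using Flux

function masked_loss(logits, targets)
    # one-hot over the ten digit classes plus -99; -99 lands in the last row
    oh = Flux.onehotbatch(targets, [1:10; -99])
    # dropping the last row turns every ignored position into an all-zero column
    mask = Float32.(oh[1:end-1, :, :])
    logp = logsoftmax(logits; dims=1)
    # element-wise multiply: only the non-ignored positions contribute to the loss
    -sum(logp .* mask) / sum(mask)
end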

Components

It was quite straightforward to port all of the PyTorch components to Flux. For example, below on the left you see the Python class definition for the CausalSelfAttention component, and on the right is how to define it in Julia.

[Side-by-side code screenshots in the original post: Python (left), Julia (right)]
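The screenshots are not reproduced here; the Julia side might look roughly like this (a sketch, with field names of my own choosing):

using Flux

struct CausalSelfAttention
    key::Dense
    query::Dense
    value::Dense
    proj::Dense
    attn_drop::Dropout
    resid_drop::Dropout
    n_head::Int
end
Flux.@functor CausalSelfAttention

A plain struct plus Flux.@functor is all Flux needs to collect the parameters and move the component to the GPU.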

The meat of this component follows next. One thing that tripped me up here was the application of the mask. As you can see on the left below, the att variable is modified in-place using the masked_fill function of PyTorch. Doing the same thing with Flux led to an error saying Mutating arrays is not supported. I guess in-place modification is not possible in the current AD component of Flux, i.e. Zygote. To work around that, I added the upper triangular mask to the output att of the batch matrix multiplication operation, which I do using the Flux functions batched_mul and batched_transpose. Note that here Flux requires the batch dimension to be last, as evidenced by the difference in the order of B, T, C.

[Side-by-side code screenshots in the original post: Python (left), Julia (right)]
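Again the screenshots are missing; here is a minimal sketch of the non-mutating mask trick (my own names, assuming q and k of shape (head_size, T, batch)):

using Flux

function causal_scores(q, k)
    T = size(q, 2)
    # att[i, j, b] scores key i against query j; the batch dimension is last
    att = batched_mul(batched_transpose(k), q) ./ Float32(sqrt(size(q, 1)))
    # additive mask: -Inf where key i lies in the future of query j, so
    # softmax gives those positions zero weight, with no in-place writes
    mask = [i > j ? -Inf32 : 0f0 for i in 1:T, j in 1:T]
    softmax(att .+ mask; dims=1)
end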

Weight Decay and Optimiser

An interesting bit in Karpathy's code is how he had to select the parameters of the model to apply weight decay to. The following lengthy function is doing that:


In Flux one can implement the trainable function for this, as described in the docs. Taking inspiration from that, I added a decayed_trainable. In my custom optimiser code (adapted from Flux's ADAM) I handle the weight decay if the parameters need to be decayed. Hence this is how I specify the parameters:
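The selection itself is an image in the original post; by way of illustration it might look like this (a sketch: MyDense and decayed_trainable are my conventions, not a Flux API):

using Flux

struct MyDense
    W
    b
end

Flux.trainable(l::MyDense) = (l.W, l.b)     # everything the optimiser updates
decayed_trainable(l::MyDense) = (l.W,)      # weight matrices only, never biases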


Flux docs mention the weight-decayed version of ADAM, the ADAMW. But as far as I understand, this is not quite how PyTorch's AdamW works, so I grabbed the code of the basic ADAM and added the bag of tricks Karpathy used in his code, like norm clipping the gradients and decoupled weight decay of selected parameters. To be precise, I tried to implement the algorithm in the paper shown below, with these added bells and whistles.

[Figure: the AdamW algorithm with decoupled weight decay, from the paper]

Hence our optimiser looks like this:
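The optimiser code is an image in the original post; the sketch below shows how such an optimiser could be adapted from Flux's ADAM (the name DecayedADAM, the decayed set and the default values are my assumptions):

using Flux

mutable struct DecayedADAM
    eta::Float64
    beta::Tuple{Float64,Float64}
    wd::Float64
    decayed::IdDict{Any,Bool}   # parameters selected via decayed_trainable
    state::IdDict
end
DecayedADAM(eta = 6e-4, beta = (0.9, 0.95), wd = 0.1) =
    DecayedADAM(eta, beta, wd, IdDict{Any,Bool}(), IdDict())

function Flux.Optimise.apply!(o::DecayedADAM, x, Δ)
    η, (β1, β2) = o.eta, o.beta
    mt, vt, βp = get!(o.state, x) do
        (zero(x), zero(x), Float64[β1, β2])
    end
    @. mt = β1 * mt + (1 - β1) * Δ
    @. vt = β2 * vt + (1 - β2) * Δ^2
    @. Δ = η * (mt / (1 - βp[1])) / (sqrt(vt / (1 - βp[2])) + 1e-8)
    # decoupled weight decay: shrink the parameter itself, not the gradient
    haskey(o.decayed, x) && (@. Δ += η * o.wd * x)
    βp .= βp .* (β1, β2)
    return Δ
end

Gradient norm clipping can then be chained in front with Flux.Optimise.Optimiser(ClipNorm(1.0), ...).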


Loss and Gradient Calculation

For training, we need a loss function and its gradient, computed on batches of data. So we get the output from the model, apply our cross entropy / softmax loss function via Zygote.pullback to get both of these in one shot, and then hand the result to the optimiser via Flux.Optimise.update!, as shown:
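The screenshot does not survive the re-post; schematically, one training step looks like this (a sketch: model, x, y, masked_loss and opt stand in for the post's actual objects):

using Flux, Zygote

ps = Flux.params(model)
# pullback returns the loss and a closure that backpropagates into ps
loss, back = Zygote.pullback(ps) do
    masked_loss(model(x), y)
end
gs = back(one(loss))                 # gradients with respect to ps
Flux.Optimise.update!(opt, ps, gs)   # one optimiser step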



Making it Fast

My model was training well at this point, but it was like 10x slower than the Python version on the GPU. Having no idea what could possibly make it run so slowly, I googled for Transformers in Julia, and of course found Transformers.jl, a Julia library for Transformers. In this library, we see a custom implementation of the batched matrix multiplication AND how to efficiently differentiate it:


The batched_gemm! of the Transformers.jl lib shown above also hits a CUDA version in the library. And indeed, bringing those into my code, it started running as fast as Python. However, thanks to the wonderful people at Julia Slack (Michael Abbott, Andrew Dinhobl), I learned that all of this is already integrated into the Flux library, hence no need to grab code from anywhere. Yay!.. For example, the efficient differentiation is part of Flux now, in the form of an rrule of ChainRules.jl.
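In spirit, that rule looks like the sketch below (not the library's exact code, which also handles adjoints and thunked gradients; mybmm is a stand-in so we don't overwrite NNlib's real rule):

using NNlib, ChainRulesCore

mybmm(A, B) = batched_mul(A, B)

function ChainRulesCore.rrule(::typeof(mybmm), A, B)
    C = batched_mul(A, B)
    function mybmm_pullback(C̄)
        # for C = A ⊠ B: Ā = C̄ ⊠ Bᵀ and B̄ = Aᵀ ⊠ C̄, batched over the last dim
        Ā = batched_mul(unthunk(C̄), batched_transpose(B))
        B̄ = batched_mul(batched_transpose(A), unthunk(C̄))
        return NoTangent(), Ā, B̄
    end
    return C, mybmm_pullback
end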


It turned out that what made my code run extremely slowly was.. (wait for it).. NOT casting the output of the sqrt below to Float32. The sqrt function here outputs a Float64 and makes the whole chain afterwards VERY inefficient. So, the number one thing to look out for when tracking down inefficiencies in Julia is making sure you are using the correct types.
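The offending line is an image in the re-post; schematically, the difference is (my reconstruction, with made-up variable names):

# size(...) returns an Int, and sqrt of an Int is a Float64, which silently
# promotes every downstream array operation to Float64:
att = batched_mul(kt, q) ./ sqrt(size(q, 1))

# casting keeps the whole chain in Float32:
att = batched_mul(kt, q) ./ Float32(sqrt(size(q, 1)))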


Try it yourself

If you want to try this out yourself, this notebook shows what needs to be done, which I copy below for reference:

include("minGPT.jl")

using Random
Random.seed!(123)

ndigit=2

(trnx,trny),(tstx,tsty)=makeData(ndigit)    

map(addOneForJulia, [trnx, trny, tstx, tsty])

config = Dict("vocab_size"=>10, "n_embed"=>128, "attn_pdrop"=>0.1f0, "resid_pdrop"=>0.1f0, "embd_pdrop"=>0.1f0, "block_size"=>6, "n_layer"=>2, "n_head"=>4,
"max_epochs"=>110, "batch_size"=>512, "learning_rate"=>6f-4, "lr_decay"=>true, "warmup_tokens"=>1024, "final_tokens"=>50*size(trnx)[2]*(ndigit+1), "betas"=>(0.9f0, 0.95f0));

model = mytraining(trnx, trny, tstx, tsty, config)
Epoch: 1 Iter: 1 Train Loss: 2.95 lr_mult: 1.00 tokens: 1536
Epoch: 1 Iter: 11 Train Loss: 2.07 lr_mult: 1.00 tokens: 16896
Test Loss: 1.90209
Epoch: 2 Iter: 1 Train Loss: 1.98 lr_mult: 1.00 tokens: 25536
Epoch: 2 Iter: 11 Train Loss: 1.91 lr_mult: 1.00 tokens: 40896
Test Loss: 1.7956433
Epoch: 3 Iter: 1 Train Loss: 1.86 lr_mult: 1.00 tokens: 49536
Epoch: 3 Iter: 11 Train Loss: 1.78 lr_mult: 0.99 tokens: 64896
Test Loss: 1.7278897
Epoch: 4 Iter: 1 Train Loss: 1.76 lr_mult: 0.99 tokens: 73536
Epoch: 4 Iter: 11 Train Loss: 1.73 lr_mult: 0.99 tokens: 88896    
...    
Epoch: 109 Iter: 1 Train Loss: 0.01 lr_mult: 0.94 tokens: 2593536
Epoch: 109 Iter: 11 Train Loss: 0.00 lr_mult: 0.93 tokens: 2608896
Test Loss: 0.00010189927
Epoch: 110 Iter: 1 Train Loss: 0.01 lr_mult: 0.92 tokens: 2617536
Epoch: 110 Iter: 11 Train Loss: 0.01 lr_mult: 0.91 tokens: 2632896
Test Loss: 0.0002310586
give_exam(model, trnx, trny, config)
tot: 8000 tot_correct: 7999
give_exam(model, tstx, tsty, config)
tot: 2000 tot_correct: 2000