Category Archives: Julia

oneAPI.jl status update

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-04-06-oneapi_update/index.html

It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs (and other accelerators) using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.

@atomic intrinsics

oneAPI.jl now supports atomic operations, which are required to implement a variety of parallel algorithms. Low-level atomic functions (atomic_add!, atomic_xchg!, etc) are available as unexported methods in the oneAPI module:

a = oneArray(Int32[0])function kernel(a)
    oneAPI.atomic_add!(pointer(a), Int32(1))
    return
end@oneapi items=256 kernel(a)
@test Array(a)[1] == 256

Note that these methods are only available for those types that are supported by the underlying OpenCL intrinsics. For example, the atomic_add! from above can only be used with Int32 and UInt32 inputs.

Most users will instead rely on the higher-level @atomic macro, which can be easily put in front of many array operations to make them behave atomically. To avoid clashing with the new @atomic macro in Julia 1.7, this macro is also unexported:

a = oneArray(Int32[0])function kernel(a)
    oneAPI.@atomic a[1] += Int32(1)
    return
end@oneapi items=256 kernel(a)
@test Array(a)[1] == 512

When used with operations that are supported by OpenCL, this macro will lower to calls like atomic_add!. For other operations, a compare-and-exchange loop will be used. Note that for now, this is still restricted to 32-bit operations, as we do not support the cl_khr_int64_base_atomics extension for 64-bit atomics.

Initial integration with vendor libraries

One significant missing features is the integration with vendor libraries like oneMKL. These integrations are required to ensure good performance for important operations like matrix multiplication, which currently fall-back to generic implementations in Julia that may not always perform as good.

To improve this situation, we are working on a wrapper library that allows us to integrate with oneMKL and other oneAPI and SYCL libraries. Currently, only matrix multiplication is supported, but once the infrastructural issues are worked out we expect to quickly support many more operations.

If you need support for specific libraries, please have a look at this PR. As the API surface is significant, we will need help to extend the wrapper library and integrate it with high-level Julia libraries like LinearAlgebra.jl.

Correctness issues

In porting existing Julia GPU applications to oneAPI.jl, we fixed several issues that caused correctness issues when executing code on Intel GPUs:

  • when the garbage collector frees GPU memory, it now blocks until all outstanding commands (which may include uses of said memory) are completes

  • the barrier function to synchronize threads is now marked as convert to avoid LLVM miscompilations

Note that if you are using Tiger Lake hardware, there is currently a known issue in the back-end Intel compiler that affects oneAPI.jl, causing correctness issues that can be spotted by running the oneAPI.jl test suite.

Future work

To significantly improve usability of oneAPI.jl, we will add support to the KernelAbstraction.jl package. This library is used by many other packages for adding GPU acceleration to algorithms that cannot be easily expressed using only array operations. As such, support for oneAPI.jl will make it possible to use your oneAPI GPUs with all of these packages.

Astonishing performance of lookup in vectors in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/04/01/fast.html

Introduction

Several years ago I switched to Julia because of its speed. However, every day I
am positively surprised how fast it is.

Today I was writing a simulation where I needed to check if some value was
contained in some collection. Traditionally programmers are recommended to use
the Set structure to perform such operation. However, since I needed to
squeeze out every last bit of performance in my code I investigated all
available options. Julia developers positively surprised me as it seems that
they managed to implement an extremely fast lookup in a standard Vector
type.

Therefore in my post today I decided to share some benchmark results showing
how fast lookup in Vector container is against Set lookup.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.2,
BenchmarkTools.jl 1.3.1, and Plots.jl 1.27.1.

The experiment

In order to test scalability of lookup I decided to test the performance for
collections from size 10 to 40 to see if some trend can be observed. I wanted
to store in them random strings and then perform a check if every value stored
in the collection is indeed present in it.

To ensure fair evaluation of the results I used the @belapsed macro
from the BenchmarkTools.jl package. Here is the test code:

using DataFrames
using BenchmarkTools
using Random
using Plots

f(x, r) = all(in(x), r) # function testing lookup speed

df = DataFrame()
Random.seed!(1234)

for i in 10:40
    v = [randstring(rand(1:1000)) for _ in 1:i] # randomly generate a Vector
    s = Set(v) # turn it to Set for comparison
    @assert length(s) == i # make sure we had no duplicates
    t1 = @belapsed f($s, $v)
    t2 = @belapsed f($v, $v)
    # display the intermetiate results so that I can monitor the progress
    @show (i, t1, t2)
    push!(df, (;i, t1, t2))
end

plot(df.i, [df.t1 df.t2];
     labels=["Set" "Vector"],
     legend=:topleft,
     xlabel="collection size",
     ylabel="time in seconds")

The code produces the following plot:

Benchmarks plot 1

In the plot it seems that performance of function f doing lookup for both
collections scales approximately linearly in the tested range. The lookup in
Set is slower than the lookup in Vector. Additionally, which is positive,
lookup time in Vector has a more stable timing (Set timings are strangely
jagged). Finally, it seems that the difference in the timing increases with
collection size. Let us visualize this:

plot(df.i, df.t1-df.t2;
     labels="Set-Vector",
     legend=:topleft,
     xlabel="collection size",
     ylabel="time difference in seconds")

Benchmarks plot 2

Indeed it seems that the timing difference increased in the tested range of
collection sizes.

Conclusions

It is getting late today, and I do not want to delay my weekly posts. Therefore,
I will do some more testing in the coming week and report you the results in my
next post.

From today’s experiments we can see that in the test setting I used
Vector lookup was significantly faster than Set lookup.

As a side note, it was super easy to implement these benchmarks in Julia.

minGPT in Julia using Flux!

By: Can Candan's Blog

Re-posted from: https://cancandan.github.io/julia/flux/machine-learning/2022/03/30/mingpt-julia.html

Introduction

As a learning exercise, I tried to port Andrey Karpathy’s awesome minGPT, which is based on Python and PyTorch to Julia and Flux. GPT is a language model, that is trained by the error signal of its prediction for the next element of a given sequence. Karpathy runs the model on three different problems, each in a distinct domain, but fitting this format; language, vision and math. Here I concentrate on the self contained math problem, in which we are interested in seeing whether the model can learn to do addition given two, two digit numbers. Therefore, we begin by creating a dataset where we encode the addition problem and its result as one string. The two, two digit numbers, and the result of addition, which is three digits (both inputs and the result padded with zeros if necessary), is encoded as a string. For example, the addition of 85 and 50 which results in 135 is encoded as the sequence [8, 5, 5, 0, 1, 3, 5]. Given 85 and 50, the model should predict 135. This amounts to predicting [8, 5, 5, 0, 1] given [8, 5, 5, 0]. Predicting [8, 5, 5, 0, 1, 3] given [8, 5, 5, 0, 1] and finally predicting [8, 5, 5, 0, 1, 3, 5] given [8, 5, 5, 0, 1, 3].

Hence, our input to the model will look like [8, 5, 5, 0, 1, 3]. For the ouput he considers a sequence like this [-100, -100, -100, 1, 3, 5]. The -100s are to be ignored here in the loss calculation. How this translates to Julia code can be understood from this part of the code:


Note that since the Julia indexing starts from 1, our labels start from 1, and we also have the -99. What I am doing is here is to one hot encode the digits and also the -100 (-99 in Julia) and drop that -99 in the last row (see that [1:end-1, :, :]) and then element wise multiply (the .* in the loss function). This amounts to ignoring known part of the given sequence in the loss calculation.

Components

It was quite straightforward to port all of the PyTorch components to Flux. For example below on the left you see the Python class definition for the CausalSelfAttention component, and on the right is how to define it in Julia.

Python

Julia

The meat of this component follows next. One thing that tripped me here, has been the application of the mask. As you can see, on the left below, the att variable is modified in-place by using the masked_fill function of PyTorch. Doing the same thing with Flux lead to an error saying Mutating arrays is not supported. I guess in-place modification is not possible in the current AD component of Flux, ie. Zygote. To work around that I added the upper triangular mask to the output att of the batch matrix multiplication operation, which I do using Flux functions batched_mul and batched_transpose. Note that here, Flux requires the batch dimesion to be the last, as evidenced by the difference in the order of B, T, C.

Python

Julia

Weight Decay and Optimiser

An interesting bit in Karpathy’s code is how he had to select the parameters of the model to apply weight decay to. The following lengthy function below is doing that:


In Flux one can implement the trainable function for this, as described in the docs. Getting inspiration from that, I added a decayed_trainable. In my custom optimiser code (that I adapted from the Flux’s ADAM) I handle the weight decay if the parameters needs to be decayed. Hence this is how I specify the parameters:


Flux docs mention the weight decayed version of ADAM, the ADAMW. But as far as I understand, this is not quite how Pytorch’s ADAMW works, so I grabbed the code of basic ADAM and added the bag of tricks Karpathy used in his code, like norm clipping the gradients and decoupled weight decay of selected parameters. To be precise I tried to implement the algorithm in the paper shown below, with these added bells and whistles.

ADAMW

Hence our optimiser looks like this:


Loss and Gradient Calculation

For training, we need a loss function and its gradient, computed on batches of data. So we get the ouput from the model, apply our cross entropy / softmax loss function via the Zygote.pullback to get both of these in one shot, and then hit to the optimiser Flux.Optimise.update! with it as shown:



Making it Fast

My model was training well at this point, but it was like 10x slower than the Python version, on the GPU. Having no idea what could possible make it run so slowly, I googled for Transformers in Julia, and of course found about Transformers.jl, a Julia library for Transformers. In this library, we see a custom implementation of the batched matrix multiplication AND how to efficiently differentiate it:


The batched_gemm! of the Transformers.jl lib shown above here is also hitting a CUDA version in the library. And indeed, bringing those in to my code, it started running as fast as Python. However, thanks to the wonderful people at Julia Slack (Michael Abbott, Andrew Dinhobl), I learned that all of this is already integrated into the Flux library. Hence no need to grab code from anywhere. Yay!.. For example, the efficient differentiation, is part of Flux now, in the form of a rrule of ChainRules.jl.


It turned out that, what made my code run extremely slowly was.. (wait for it).. NOT casting the output of the sqrt below to Float32. The function sqrt outputs here a Float64 and makes the whole chain afterwards VERY inefficient. So, number one thing to look out for when tracking down inefficiencies in Julia is making sure you are using the correct types.


Try it yourself

If you want to try this out yourself, this notebook shows what needs to be done, which I copy below for reference:

include("minGPT.jl")

using Random
Random.seed!(123)

ndigit=2

(trnx,trny),(tstx,tsty)=makeData(ndigit)    

map(addOneForJulia, [trnx, trny, tstx, tsty])

config = Dict("vocab_size"=>10, "n_embed"=>128, "attn_pdrop"=>0.1f0, "resid_pdrop"=>0.1f0, "embd_pdrop"=>0.1f0, "block_size"=>6, "n_layer"=>2, "n_head"=>4,
"max_epochs"=>110, "batch_size"=>512, "learning_rate"=>6f-4, "lr_decay"=>true, "warmup_tokens"=>1024, "final_tokens"=>50*size(trnx)[2]*(ndigit+1), "betas"=>(0.9f0, 0.95f0));

model = mytraining(trnx, trny, tstx, tsty, config)
Epoch: 1 Iter: 1 Train Loss: 2.95 lr_mult: 1.00 tokens: 1536
Epoch: 1 Iter: 11 Train Loss: 2.07 lr_mult: 1.00 tokens: 16896
Test Loss: 1.90209
Epoch: 2 Iter: 1 Train Loss: 1.98 lr_mult: 1.00 tokens: 25536
Epoch: 2 Iter: 11 Train Loss: 1.91 lr_mult: 1.00 tokens: 40896
Test Loss: 1.7956433
Epoch: 3 Iter: 1 Train Loss: 1.86 lr_mult: 1.00 tokens: 49536
Epoch: 3 Iter: 11 Train Loss: 1.78 lr_mult: 0.99 tokens: 64896
Test Loss: 1.7278897
Epoch: 4 Iter: 1 Train Loss: 1.76 lr_mult: 0.99 tokens: 73536
Epoch: 4 Iter: 11 Train Loss: 1.73 lr_mult: 0.99 tokens: 88896    
...    
Epoch: 109 Iter: 1 Train Loss: 0.01 lr_mult: 0.94 tokens: 2593536
Epoch: 109 Iter: 11 Train Loss: 0.00 lr_mult: 0.93 tokens: 2608896
Test Loss: 0.00010189927
Epoch: 110 Iter: 1 Train Loss: 0.01 lr_mult: 0.92 tokens: 2617536
Epoch: 110 Iter: 11 Train Loss: 0.01 lr_mult: 0.91 tokens: 2632896
Test Loss: 0.0002310586
give_exam(model, trnx, trny, config)
tot: 8000 tot_correct: 7999
give_exam(model, tstx, tsty, config)
tot: 2000 tot_correct: 2000