oneAPI.jl 1.0: oneMKL, Intel Arc and Julia 1.9

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2023-02-08-oneapi_1.0/index.html

The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel Library (oneMKL) to accelerate linear algebra operations on Intel GPUs. It also brings support for Julia 1.9 and Intel Arc GPUs.

oneMKL integration

oneAPI.jl now uses the Intel oneAPI Math Kernel Library (oneMKL), automatically downloaded as part of oneAPI_Support_jll.jl, to accelerate a great number of BLAS and LAPACK operations on Intel GPUs. Similar to how it is implemented in our other GPU back-ends, these wrappers are available at different levels of abstraction.

At the lowest level, we use a C library that wraps the oneMKL C++ APIs. For example, the oneapi::mkl::blas::column_major::gemm function for matrix-matrix multiplication is wrapped by the C functions onemklSgemm, onemklDgemm, etc. These wrappers are used to implement low-level methods like oneMKL.gemm!:

julia> using oneAPI

julia> A = oneArray(rand(Float32, 2, 3))
2×3 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.44302   0.125576  0.859145
 0.674291  0.428346  0.0400119

julia> B = oneArray(rand(Float32, 3, 4))
3×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.592748   0.529413   0.0323396  0.659528
 0.22489    0.0872259  0.253291   0.376519
 0.0121506  0.591135   0.706755   0.751686

julia> C = similar(B, (2, 4));

julia> oneMKL.gemm!('N', 'N', true, A, B, false, C)
2×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.301279  0.753365  0.65334   0.985274
 0.496501  0.417994  0.158581  0.63607

julia> Array(C) ≈ Array(A) * Array(B)
true

Of course, these low-level functions aren't very user-friendly, so we also integrate with Julia's standard libraries where possible:

julia> using LinearAlgebra

julia> A = oneArray(rand(Float32, 2, 3));

julia> B = oneArray(rand(Float32, 3, 4));

julia> C = A * B;

julia> Array(C) ≈ Array(A) * Array(B)
true

The most frequently used oneMKL BLAS functions have been wrapped and integrated with Julia’s standard linear algebra libraries. If you run into a missing function, please file a request to add it, or take a look at the source and contribute to oneAPI.jl! The current state of the wrappers should make it easy to extend their functionality, as well as form a good basis for integrating with other libraries like oneDNN.

Intel Arc support

Intel's new Arc series of discrete GPUs is now fully supported by oneAPI.jl. These GPUs offer a significant performance improvement over their integrated predecessors:

julia> using oneAPI

julia> oneAPI.versioninfo()
1 device:
- Intel(R) Arc(TM) A770 Graphics [0x56a0]

julia> T = Float32;

julia> n = p = m = 2048;

julia> a = oneArray(rand(T, n, p));

julia> b = oneArray(rand(T, p, m));

julia> c = oneArray(zeros(T, n, m));

julia> using BenchmarkTools, LinearAlgebra

julia> bench = @benchmark oneAPI.@sync mul!(c, a, b)
BenchmarkTools.Trial: 1510 samples with 1 evaluation.
 Range (min … max):  3.233 ms …  3.791 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.298 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.308 ms ± 48.426 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▃▄▇█▅▄▃▂   ▁▁▁
  ▁▁▃▃▅▇██████████████████▇▇▇▅▆▄▅▅▄▂▃▂▂▂▂▂▂▁▂▂▂▁▂▁▂▁▂▂▂▂▁▁▂▂ ▃
  3.23 ms        Histogram: frequency by time        3.47 ms <

 Memory estimate: 272 bytes, allocs estimate: 11.

julia> flops = n*m*(2p-1)
17175674880

julia> flops / (minimum(bench.times)/1e9)
5.3131281169900205e12

For example, here we're getting over 5 TFlops of Float32 performance, which is over 10x faster than the Intel Xe Graphics G7 we had been previously using for oneAPI.jl development. At the same time, the A770 used above should be able to deliver close to 20 TFlops, so there's still room for improvement in our software stack.
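The arithmetic behind that throughput figure is easy to verify: a dense n×p times p×m product performs n*m*(2p-1) floating-point operations (p multiplications and p-1 additions per output element), and dividing by the elapsed time in seconds yields FLOPS. A quick CPU-side check of the numbers above:

```julia
# FLOP count of a dense (n×p)·(p×m) matrix product: each of the n*m outputs
# takes p multiplications and p-1 additions.
n = p = m = 2048
flops = n * m * (2p - 1)
@show flops                    # 17175674880

# divide by the fastest measured time (3.233 ms, in seconds) to get FLOPS
min_time_s = 3.233e-3
@show flops / min_time_s       # ≈ 5.31e12, i.e. over 5 TFlops
```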

To use oneAPI.jl with an Arc series GPU, you need to run Linux 6.2. At the time of writing, that kernel is still in beta, so refer to your distribution's documentation for how to install it. For example, on Arch Linux you can use the linux-mainline package from the AUR, Ubuntu has the kernel-ppa archive, Fedora provides the stable-rc repository, etc.

Other changes

  • Support for Julia 1.9 has been added.

CUDA.jl 4.0

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2023-02-01-cuda_4.0/index.html

CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.

JLLs for CUDA artifacts

While CUDA.jl has been using binary artifacts for a while, it was manually managing installation and selection of them, i.e., not by using standardised JLL packages. This complicated use of the artifacts by other packages, and made it difficult to build other binary packages against the CUDA runtime.

With CUDA.jl 4.0, we now use JLLs to load the CUDA driver and runtime. Specifically, there are two JLLs in play: CUDA_Driver_jll and CUDA_Runtime_jll. The former is responsible for loading the CUDA driver library (possibly upgrading it using a forward-compatible version), and determining the CUDA version that your set-up supports:

❯ JULIA_DEBUG=CUDA_Driver_jll julia
julia> using CUDA_Driver_jll
┌ System CUDA driver found at libcuda.so.1, detected as version 12.0.0
└ @ CUDA_Driver_jll
┌ System CUDA driver is recent enough; not using forward-compatible driver
└ @ CUDA_Driver_jll

With the driver identified and loaded, CUDA_Runtime_jll can select a compatible toolkit. By default, it uses the latest supported toolkit that is compatible with the driver:

julia> using CUDA_Runtime_jll

julia> CUDA_Runtime_jll.cuda_toolkits
10-element Vector{VersionNumber}:
 v"10.2.0"
 v"11.0.0"
 v"11.1.0"
 v"11.2.0"
 v"11.3.0"
 v"11.4.0"
 v"11.5.0"
 v"11.6.0"
 v"11.7.0"
 v"11.8.0"

julia> CUDA_Runtime_jll.host_platform
Linux x86_64 {cuda=11.8}
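Conceptually, this selection boils down to picking the newest packaged toolkit that the loaded driver still supports. The following is a simplified sketch of that idea; pick_toolkit is a hypothetical helper for illustration, not part of CUDA_Runtime_jll's API, and the real logic additionally honors preferences and forward-compatible drivers:

```julia
# pick the newest packaged CUDA toolkit the driver can run
# (hypothetical helper; the actual CUDA_Runtime_jll selection also
#  considers user preferences and forward-compatibility packages)
function pick_toolkit(driver::VersionNumber, toolkits::Vector{VersionNumber})
    compatible = filter(tk -> tk <= driver, toolkits)
    isempty(compatible) && error("no compatible CUDA toolkit for driver $driver")
    return maximum(compatible)
end

toolkits = [v"10.2.0", v"11.0.0", v"11.7.0", v"11.8.0"]
pick_toolkit(v"12.0.0", toolkits)  # newest packaged toolkit: v"11.8.0"
pick_toolkit(v"11.7.2", toolkits)  # driver too old for 11.8: v"11.7.0"
```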

As you can see, the selected CUDA runtime is encoded in the host platform. This makes it possible for Julia to automatically select compatible versions of other binary packages. For example, if we install and load SuiteSparse_GPU_jll, which right now provides builds for CUDA 10.2, 11.0 and 12.0, the artifact resolution code knows to load the build for CUDA 11.0, which is compatible with the selected CUDA 11.8 runtime:

julia> using SuiteSparse_GPU_jll

julia> SuiteSparse_GPU_jll.best_wrapper
"~/.julia/packages/SuiteSparse_GPU_jll/.../x86_64-linux-gnu-cuda+11.0.jl"

The change to JLLs requires a breaking change: the JULIA_CUDA_VERSION and JULIA_CUDA_USE_BINARYBUILDER environment variables have been removed, and are replaced by preferences that are set in the current environment. For convenience, you can set these preferences by calling CUDA.set_runtime_version!:

❯ julia --project
julia> using CUDA

julia> CUDA.runtime_version()
v"11.8.0"

julia> CUDA.set_runtime_version!(v"11.7")
┌ Set CUDA Runtime version preference to 11.7,
└ please re-start Julia for this to take effect.

❯ julia --project
julia> using CUDA

julia> CUDA.runtime_version()
v"11.7.0"

julia> using CUDA_Runtime_jll

julia> CUDA_Runtime_jll.host_platform
Linux x86_64 {cuda=11.7}

The changed preference is reflected in the host platform, which means that you can use this mechanism to load different builds of other binary packages. For example, if you rely on a package or JLL that does not yet have a build for CUDA 12, you could set the preference to v"11.x" to load an available build.
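Under the hood, CUDA.set_runtime_version! simply records a preference in the active environment, in the LocalPreferences.toml file that Preferences.jl maintains next to Project.toml. Assuming the preference is stored under the runtime JLL's name (the exact key may differ between versions), the file would look something like:

```toml
# LocalPreferences.toml, alongside the environment's Project.toml
[CUDA_Runtime_jll]
version = "11.7"
```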

For discovering a local runtime, you can set the version to "local", which replaces CUDA_Runtime_jll with CUDA_Runtime_Discovery.jl, an API-compatible package that discovers and uses a locally-installed CUDA toolkit:

❯ julia --project
julia> using CUDA

julia> CUDA.set_runtime_version!("local")
┌ Set CUDA Runtime version preference to local,
└ please re-start Julia for this to take effect.

❯ JULIA_DEBUG=CUDA_Runtime_Discovery julia --project
julia> using CUDA
┌ Looking for CUDA toolkit via environment variables CUDA_PATH
└ @ CUDA_Runtime_Discovery
┌ Looking for binary ptxas in /opt/cuda
│   all_locations =
│    2-element Vector{String}:
│     "/opt/cuda"
│     "/opt/cuda/bin"
└ @ CUDA_Runtime_Discovery
┌ Debug: Found ptxas at /opt/cuda/bin/ptxas
└ @ CUDA_Runtime_Discovery
...

Memory limits

By popular demand, support for memory limits has been reinstated. This functionality had been removed after the switch to CUDA memory pools, as the memory pool allocator does not yet support memory limits. Awaiting improvements by NVIDIA, we have added functionality to impose memory limits from the Julia side, in the form of two environment variables:

  • JULIA_CUDA_SOFT_MEMORY_LIMIT: This is an advisory limit, used to configure the memory pool, which will result in the pool being shrunk down to the requested limit at every synchronization point. That means that the pool may temporarily grow beyond the limit. This limit is unavailable when disabling memory pools (with JULIA_CUDA_MEMORY_POOL=none).

  • JULIA_CUDA_HARD_MEMORY_LIMIT: This is a hard limit, checked before every allocation. Doing so is relatively expensive, so it is recommended to use the soft limit instead.

The value of these variables can be formatted as a number of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: 100M, 50%, 1.5GiB, 10000.
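Interpreting these strings is straightforward. The sketch below is purely illustrative, not CUDA.jl's actual parsing code, and the exact set of accepted units is an assumption:

```julia
# illustrative parser for memory-limit strings ("100M", "50%", "1.5GiB", "10000");
# hypothetical sketch, not CUDA.jl's actual implementation
function parse_limit(str::AbstractString, total_memory::Integer)
    if endswith(str, "%")                  # percentage of total device memory
        pct = parse(Float64, chop(str))
        return floor(Int, total_memory * pct / 100)
    end
    # order matters: try the longer suffixes first
    for (suffix, factor) in ("GiB" => 2^30, "MiB" => 2^20, "KiB" => 2^10,
                             "G" => 2^30, "M" => 2^20, "K" => 2^10)
        if endswith(str, suffix)
            value = parse(Float64, str[1:end-length(suffix)])
            return floor(Int, value * factor)
        end
    end
    return parse(Int, str)                 # plain number of bytes
end

parse_limit("100M", 2^33)    # 104857600
parse_limit("50%", 2^33)     # 4294967296
parse_limit("1.5GiB", 2^33)  # 1610612736
parse_limit("10000", 2^33)   # 10000
```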

CUSPARSE improvements

Thanks to the work of @amontoison, the CUSPARSE interface has undergone many improvements:

  • Better support of the CuSparseMatrixCOO format with, in particular, the addition of CuSparseMatrixCOO * CuVector and CuSparseMatrixCOO * CuMatrix products;

  • Routines specialized for -, +, * operations between sparse matrices (CuSparseMatrixCOO, CuSparseMatrixCSC and CuSparseMatrixCSR) have been interfaced;

  • New generic routines for backward and forward sweeps with sparse triangular matrices are now used by \;

  • CuMatrix * CuSparseVector and CuMatrix * CuSparseMatrix products have been added;

  • Conversions between sparse and dense matrices have been updated to use more recent, optimized routines;

  • High-level Julia functions for the new set of sparse BLAS 1 routines, such as dot products between CuSparseVectors, have been added;

  • Missing dispatches for the mul! and ldiv! functions have been added;

  • Almost all new CUSPARSE routines added by the v"11.x" CUDA toolkits have been interfaced.

Other changes

  • Removal of the CUDNN, CUTENSOR, CUTENSORNET and CUSTATEVEC submodules: These have been moved into their own packages, respectively cuDNN.jl, cuTENSOR.jl, cuTensorNet.jl and cuStateVec.jl (note the change in capitalization, now following NVIDIA's naming scheme);

  • Removal of the NVTX submodule: NVTX.jl should be used instead, which is a more complete implementation of the NVTX API;

  • Support for CUDA 11.8 (support for CUDA 12.0 is being worked on);

  • Support for Julia 1.9.

Backport releases

Because CUDA.jl 4.0 is a breaking release, two additional releases have been made that backport bugfixes and select features:

  • CUDA.jl 3.12.1 and 3.12.2: backports of bugfixes since 3.12

  • CUDA.jl 3.13.0: additionally adding the memory limit functionality

Technical preview: Programming Apple M1 GPUs in Julia with Metal.jl

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-06-24-metal/index.html

Julia has gained a new GPU back-end: Metal.jl, for working with Apple's M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who's already programmed GPUs in Julia. In the following post I'll demonstrate some of that functionality and explain how it works.

But first, note that Metal.jl is under heavy development: The package is considered experimental for now, as we're still working on squashing bugs and adding essential functionality. We also haven't optimized for performance yet. If you're interested in using Metal.jl, please consider contributing to its development! Most of the package is written in Julia, and checking out the source code is a single Pkg.develop away :-)

Quick start

Start by getting a hold of the upcoming Julia 1.8, launch it, and enter the package manager by pressing ]:

pkg> add Metal
  Installed Metal

Installation is as easy as that, and we'll automatically download the necessary binary artifacts (a C wrapper for the Metal APIs, and an LLVM back-end). Then, leave the package manager by pressing backspace, import the Metal package, and e.g. call the versioninfo() method for some details on the toolchain:

julia> using Metal

julia> Metal.versioninfo()
macOS 13.0.0, Darwin 21.3.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- Apple M1 Pro (64.000 KiB allocated)

And there we go! You'll note here that I'm using the upcoming macOS 13 (Ventura); this is currently the only supported operating system. We also only support M-series GPUs, even though Metal does support other GPUs. These choices were made to simplify development, and aren't technical limitations. In fact, Metal.jl does work on e.g. macOS Monterey with an Intel GPU, but it's an untested combination that may suffer from bugs.

Array programming

Just like our other GPU back-ends, Metal.jl offers an array abstraction that greatly simplifies GPU programming. The abstraction centers around the MtlArray type that can be used to manage memory and perform GPU computations:

# allocate + initialize
julia> a = MtlArray(rand(Float32, 2, 2))
2×2 MtlArray{Float32, 2}:
 0.158752  0.836366
 0.535798  0.153554

# perform some GPU-accelerated operations
julia> b = a * a
2×2 MtlArray{Float32, 2}:
 0.473325  0.261202
 0.167333  0.471702

# back to the CPU
julia> Array(b)
2×2 Matrix{Float32}:
 0.473325  0.261202
 0.167333  0.471702

Beyond these simple operations, Julia's higher-order array abstractions can be used to express more complex operations without ever having to write a kernel:

julia> mapreduce(sin, +, a; dims=1)
1×2 MtlArray{Float32, 2}:
 1.15276  0.584146

julia> cos.(a .+ 2) .* 3
2×2 MtlArray{Float32, 2}:
 -2.0472   -1.25332
 -2.96594  -2.60351

Much of this functionality comes from the GPUArrays.jl package, which provides vendor-neutral implementations of common array operations. As a result, MtlArray is already pretty capable, and should be usable with realistic array-based applications.

Kernel programming

Metal.jl's array operations are implemented in Julia, using our native kernel programming capabilities and accompanying JIT-compiler. A small demonstration:

# a simple kernel that sets elements of an array to a value
function memset_kernel(array, value)
  i = thread_position_in_grid_1d()
  if i <= length(array)
    @inbounds array[i] = value
  end
  return
end

a = MtlArray{Float32}(undef, 512)
@metal threads=512 grid=2 memset_kernel(a, 42)

# verify
@assert all(isequal(42), Array(a))
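Note why the bounds check matters: this launch uses grid=2 groups of threads=512 threads, so thread_position_in_grid_1d() ranges over 1024 positions for an array of only 512 elements. A CPU analogue of the same indexing scheme (a plain loop standing in for the hardware thread grid) makes that explicit:

```julia
# CPU analogue of the Metal launch above: `grid` groups of `threads` threads;
# the linear index plays the role of thread_position_in_grid_1d(), and the
# guard keeps any surplus threads from writing out of bounds.
function memset_cpu!(array, value; threads, grid)
    for group in 1:grid, thread in 1:threads
        i = (group - 1) * threads + thread   # thread_position_in_grid_1d()
        if i <= length(array)
            @inbounds array[i] = value
        end
    end
    return array
end

a = Vector{Float32}(undef, 512)
memset_cpu!(a, 42; threads=512, grid=2)   # 1024 "threads", 512 elements
@assert all(isequal(42), a)
```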

As can be seen here, we've opted to deviate slightly from the Metal Shading Language, instead providing a programming experience that's similar to Julia's existing back-ends. Some key differences:

  • we use intrinsic functions instead of special kernel function arguments to access properties like the thread position, grid size, …;

  • all types of arguments (buffers, indirect buffers, value-typed inputs) are transparently converted to a GPU-compatible structure[1];

  • global (task-bound) state is used to keep track of the active device and a queue;

  • compute pipeline set-up and command encoding is hidden behind a single macro.

Behind the scenes, we compile Julia to LLVM IR and use a tiny LLVM back-end (based on @a2flo's libfloor) that (re)writes the bitcode to a Metal-compatible library containing LLVM 5 bitcode. You can inspect the generated IR using @device_code_metal:

julia> @device_code_metal @metal threads=512 grid=2 memset_kernel(a, 42)
[header]
program_count: 1
...

[program]
name: julia_memset_kernel
type: kernel
...
target datalayout = "..."
target triple = "air64-apple-macosx13.0.0"

; the (rewritten) kernel function:
;  - %value argument passed by reference
;  - %thread_position_in_grid argument added
;  - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel(
    { i8 addrspace(1)*, [1 x i64] } addrspace(1)* %array,
    i64 addrspace(1)* %value,
    i32 %thread_position_in_grid) {
  ...
  %9 = tail call float @air.convert.f.f32.s.i64(i64 %7)
  ...
  ret void
}

; minimal required argument metadata
!air.kernel = !{!10}
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*,
              i64 addrspace(1)*, i32)* @julia_memset_kernel, !11, !12}
!12 = !{!13, !14, !15}
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1,
       !"air.read_write", !"air.address_space", i32 1,
       !"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8}
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1,
       !"air.read_write", !"air.address_space", i32 1,
       !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!15 = !{i32 0, !"air.thread_position_in_grid"}

; other metadata not shown, for brevity

Shout-out to @max-Hawkins for exploring Metal code generation during his internship at Julia Computing!

Metal APIs in Julia

Lacking an Objective C or C++ FFI, we interface with the Metal libraries using a shim C library. Most users won't have to interface with Metal directly – the array abstraction is sufficient for many – but more experienced developers can make use of the high-level wrappers that we've designed for the Metal APIs:

julia> dev = MtlDevice(1)
MtlDevice:
  name:             Apple M1 Pro
  lowpower:         false
  headless:         false
  removable:        false
  unified memory:   true

julia> desc = MtlHeapDescriptor()
MtlHeapDescriptor:
  type:             MtHeapTypeAutomatic
  storageMode:      MtStorageModePrivate
  size:             0

julia> desc.size = 16384
16384

julia> heap = MtlHeap(dev, desc)
MtlHeap:
  type:                 MtHeapTypeAutomatic
  size:                 16384
  usedSize:             0
  currentAllocatedSize: 16384

# etc

These wrappers are based on @PhilipVinc's excellent work on MetalCore.jl, which formed the basis for (and has been folded into) Metal.jl.

What's next?

The current release of Metal.jl focusses on code generation capabilities, and is meant as a preview for users and developers to try out on their system or with their specific GPU application. It is not production-ready yet, and is lacking some crucial features:

  • performance optimization

  • integration with Metal Performance Shaders

  • integration / documentation for use with Xcode tools

  • fleshing out the array abstraction based on user feedback

Please consider helping out with any of these! Since Metal.jl and its dependencies are almost entirely implemented in Julia, any experience with the language is sufficient to contribute. If you're not certain, or have any questions, please drop by the #gpu channel on the JuliaLang Slack, ask questions on our Discourse, or chat to us during the GPU office hours every other Monday.

If you encounter any bugs, feel free to let us know on the Metal.jl issue tracker. For information on upcoming releases, subscribe to this website's blog where we post about significant developments in Julia's GPU ecosystem.


[1] This relies on Metal 3 from macOS 13, which introduced bindless argument buffers, as we didn't fully figure out how to reliably encode arbitrarily-nested indirect buffers in argument encoder metadata.