CUDA.jl 5.8: CuSparseVector broadcasting, CUDA 12.9, and more

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2025-05-14-cuda_5.8/index.html

CUDA.jl v5.8 brings several enhancements, most notably the introduction of broadcasting support for CuSparseVector. The release also includes support for CUDA 12.9, and updates to key CUDA libraries like cuTENSOR, cuQuantum, and cuDNN.

Broadcasting for CuSparseVector

A significant enhancement in CUDA.jl v5.8 is the support for broadcasting CuSparseVector. Thanks to @kshyatt, it is now possible to use sparse GPU vectors in broadcast expressions just like it was already possible with sparse matrices:

julia> using CUDA, CUDA.CUSPARSE, SparseArrays

julia> x = cu(sprand(Float32, 10, 0.3))
10-element CuSparseVector{Float32, Int32} with 4 stored entries:
  [2]  =  0.459139
  [3]  =  0.964073
  [8]  =  0.904363
  [9]  =  0.721723

julia> # a zero-preserving elementwise operation
       x .* 2
10-element CuSparseVector{Float32, Int32} with 4 stored entries:
  [2]  =  0.918278
  [3]  =  1.928146
  [8]  =  1.808726
  [9]  =  1.443446

julia> # a non-zero-preserving elementwise operation
       x .+ 1
10-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 1.0
 1.4591388
 1.9640732
 1.0
 1.0
 1.0
 1.0
 1.9043632
 1.7217231
 1.0

julia> # combining multiple sparse inputs
       x .+ cu(sprand(Float32, 10, 0.3))
10-element CuSparseVector{Float32, Int32} with 6 stored entries:
  [1]  =  0.906
  [2]  =  0.583197
  [3]  =  0.964073
  [4]  =  0.259103
  [8]  =  0.904363
  [9]  =  0.935917

Minor Changes

CUDA.jl 5.8 also includes several other useful updates.

As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the changelog for a full list of changes.

CUDA.jl 5.6 and 5.7: Allocator cache, and asynchronous CUBLAS wrappers

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2025-03-11-cuda_5.6_5.7/index.html

CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved CuRef type, which enables fully asynchronous CUBLAS calls.

Reworking CuRef for asynchronous CUBLAS

The CuRef type is similar to Julia's Ref, a boxed value, often used with C APIs. In CUDA.jl v5.7, we've made several changes to this type. First of all, we've aligned its API much more closely with the Ref type from Base, e.g., adding getindex and setindex! methods, which should make it more familiar to users:

julia> box = CuRef(1)
CuRefValue{Int64}(1)

julia> box[]
1

julia> box[] = 2
2

julia> box
CuRefValue{Int64}(2)

We also optimized and improved the CuRef implementation. As part of that work, we removed the eager synchronization when copying from unpinned memory. That synchronization existed so that Julia code could keep executing while waiting for the memory copy to start. However, it turns out that certain small copies, such as those performed by CuRef, can be performed without having to wait for the copy to start. By removing the eager synchronization from those copies, CuRef objects can now be constructed fully asynchronously, i.e., without having to wait for the GPU to be ready.

Building on these changes, @kshyatt has switched our CUBLAS wrappers over to using GPU-based CuRef boxes for scalar inputs instead of host-based Ref boxes. Although this increases the complexity of invoking CUBLAS APIs – the allocation of CuRef boxes requires CUDA API calls whereas a Ref box is much cheaper to allocate – this results in the API behaving asynchronously, whereas before every CUBLAS API taking scalar inputs would have resulted in a so-called "bubble" waiting for the GPU to finish executing.
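As a rough sketch of what this means in practice (the exact call shown is illustrative, not taken from the post): a CUBLAS routine taking a scalar argument, such as scaling a vector, now queues on the GPU stream without stalling the host:

```julia
using CUDA
using CUDA.CUBLAS

x = CUDA.rand(Float32, 1024)

# The scalar 2f0 is boxed in a GPU-resident CuRef behind the scenes,
# so this call returns immediately rather than waiting for the GPU.
CUBLAS.scal!(length(x), 2f0, x)

# The work is only waited upon when the result is actually needed:
Array(x)
```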

A Julia-level allocator cache

To help with the common issue of running out of GPU memory, or to reduce the cost of CUDA.jl hitting the GC too often, @pxl-th has added a reusable caching allocator to GPUArrays.jl, which CUDA.jl now supports and integrates with.

The idea is simple: GPU allocations made in a GPUArrays.@cached block are recorded in a cache, and when the block is exited the allocations are made available for reuse. Only when the cache goes out of scope, or when you call unsafe_free! on it, the allocations will be fully freed. This is useful when you have a repetitive workload that performs the same allocations over and over again, such as in a machine learning training loop:

cache = GPUArrays.AllocCache()
for epoch in 1:1000
    GPUArrays.@cached cache begin
        # dummy workload
        sin.(CUDA.rand(Float32, 1024^3))
    end
end

# wait for `cache` to be collected, or optionally eagerly free the memory
GPUArrays.unsafe_free!(cache)

Even though CUDA already has a caching allocator, the Julia-level caching mechanism may still improve performance by lowering pressure on the GC and reducing fragmentation of the underlying allocator. For example, the above snippet only performs two memory allocations that require 8 GiB, instead of 2000 allocations totalling 8 TiB (!) of GPU memory.
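The numbers behind that claim can be checked with some quick arithmetic: each iteration materializes two Float32 buffers of 1024^3 elements (the CUDA.rand output and the broadcast result), which the cache lets subsequent iterations reuse:

```julia
# each iteration allocates two Float32 buffers of 1024^3 elements
bytes_per_buffer = 1024^3 * sizeof(Float32)       # 4 GiB
bytes_per_iteration = 2 * bytes_per_buffer        # 8 GiB, reused via the cache
total_without_cache = 1000 * bytes_per_iteration  # ≈ 8 TiB over 2000 allocations
```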

The cherry on top is that the caching interface is generic, implemented in GPUArrays.jl, and available to all GPU back-ends that are compatible with GPUArrays.jl v11.2.

Minor changes

OpenCL.jl 0.10: Now with native Julia kernels

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2025-01-13-opencl_0.10/index.html

Version 0.10 of OpenCL.jl is a significant release that adds support for native Julia kernels. This necessitated a major overhaul of the package's internals, bringing the package in line with modern Julia GPU programming practices.

Native Julia kernels

The highlight of this release is the addition of a compiler that makes it possible to write OpenCL kernels in Julia instead of having to use OpenCL C and accompanying string-based APIs. Let's illustrate using the typical vadd vector-addition example, which starts by generating some data and uploading it to the GPU:

using OpenCL

dims = (2,)
a = round.(rand(Float32, dims) * 100)
b = round.(rand(Float32, dims) * 100)
c = similar(a)

d_a = CLArray(a)
d_b = CLArray(b)
d_c = CLArray(c)

The typical way to write a kernel is to use a string with OpenCL C code, which is then compiled and executed on the GPU. This is done as follows:

const source = """
   __kernel void vadd(__global const float *a,
                      __global const float *b,
                      __global float *c) {
      int i = get_global_id(0);
      c[i] = a[i] + b[i];
    }"""

prog = cl.Program(; source) |> cl.build!
kern = cl.Kernel(prog, "vadd")

len = prod(dims)
clcall(kern, Tuple{Ptr{Float32}, Ptr{Float32}, Ptr{Float32}},
       d_a, d_b, d_c; global_size=(len,))

With the new GPUCompiler.jl-based compiler, you can now write the kernel in Julia just like with our other back-ends:

function vadd(a, b, c)
    i = get_global_id()
    @inbounds c[i] = a[i] + b[i]
    return
end

len = prod(dims)
@opencl global_size=len vadd(d_a, d_b, d_c)

This is of course a much more natural way to write kernels, and it also allows OpenCL.jl to be plugged into the rest of the JuliaGPU ecosystem. Concretely, OpenCL.jl now implements the GPUArrays.jl interface, enabling lots of vendor-neutral functionality, and also provides a KernelAbstractions.jl back-end for use with the many libraries that build on top of KernelAbstractions.jl.
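As a small sketch of the vendor-neutral functionality this unlocks (assuming an OpenCL device with SPIR-V support is available), generic array operations now work on CLArray without writing any OpenCL C:

```julia
using OpenCL

a = CLArray(rand(Float32, 1024))
b = a .+ 1f0    # broadcasting, compiled through the native Julia kernel path
s = sum(b)      # generic reduction provided by GPUArrays.jl
```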

There is no free lunch, though: the native compiler currently relies on your OpenCL driver supporting SPIR-V. This is sadly not a common feature; e.g., neither NVIDIA's nor AMD's OpenCL drivers support it, only Intel's do. But if you are stuck with a driver that does not support SPIR-V, there is still hope: SPIR-V can be compiled back to OpenCL C using Google's clspv. If you are interested, check out this issue and feel free to reach out.

Breaking API changes

Existing users of OpenCL.jl will of course have noticed that even the string-based example above uses a different API than before. In order to support the new compiler, and bring OpenCL.jl in line with modern Julia programming practices, we have significantly overhauled the package's internals as well as some external APIs.

The most significant high-level changes include:

  • Memory management is now done using CLArray, backed by Shared Virtual Memory (SVM), instead of opaque buffers. Raw buffers are still supported, but not compatible with native kernel execution (because they can not be converted to a pointer).

  • Kernels are called using the new clcall function, which performs automatic conversion of objects much like how ccall works.

At the lower-level (of the cl submodule), the changes are more extensive:

  • Context, device and queue arguments have been removed from most APIs, and are now stored in task-local storage. These values can be queried (cl.platform(), cl.device(), etc) and set (cl.platform!(platform), cl.device!(device), etc) as needed.

  • As part of the above change, questionable APIs like cl.create_some_context() and cl.devices() have been removed.

  • The Buffer API has been completely reworked. It now only provides low-level functionality, such as unsafe_copyto! or unsafe_map!, while high-level functionality like copy! is implemented for the CLArray type.

  • The cl.info method, and the getindex overloading to access properties of OpenCL objects, have been replaced by getproperty overloading on the objects themselves (e.g., cl.info(dev, :name) and dev[:name] are now simply dev.name).

  • The blocking cl.launch has been replaced by a nonblocking cl.call, while also removing the getindex-overloading shorthand. However, it's recommended to use the newly-added cl.clcall function, which takes an additional tuple type argument and performs automatic conversions of arguments to those types. This makes it possible to pass a CLArray to an OpenCL C function expecting Buffer-backed pointers, for example.

  • Argument conversion has been removed; users should make sure Julia arguments passed to kernels match the OpenCL argument types (e.g., no empty types, and a 4-element tuple for a 3-element float3 argument).

  • The to_host function has been replaced by simply calling Array on the CLArray.

  • Queue and execution capabilities of a device are now to be queried using dedicated functions, cl.queue_properties and cl.exec_capabilities.
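A minimal sketch tying a few of these changes together (assuming an OpenCL device is available; the array contents are illustrative):

```julia
using OpenCL

d = CLArray(Float32[1, 2, 3])   # SVM-backed array instead of a raw Buffer
h = Array(d)                    # replaces the old to_host(d)
name = cl.device().name         # task-local device, queried via getproperty
```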

Working towards the first stable version of this package, we anticipate having to make even more breaking changes. However, we want to get the current changes out there to get feedback from the community. If some of the removed functionality is crucial to your workflow, feel free to reach out and we can discuss how to best support it in the future.

JLL-based OpenCL drivers

Another significant change is the integration with OpenCL drivers built and provided using Julia's BinaryBuilder infrastructure. Over time, this should simplify the installation of OpenCL drivers by avoiding the need to install global drivers. For now, the only driver provided as a JLL is a CPU driver based on the Portable Computing Language (PoCL) library. This driver can be used by simply installing and loading pocl_jll before you start using OpenCL.jl:

julia> using OpenCL, pocl_jll

julia> OpenCL.versioninfo()
OpenCL.jl version 0.10.0

Toolchain:
 - Julia v1.11.2
 - OpenCL_jll v2024.5.8+1

Available platforms: 1
 - Portable Computing Language
   OpenCL 3.0, PoCL 6.0  Apple, Release, RELOC, SPIR-V, LLVM 16.0.6jl, SLEEF, DISTRO, POCL_DEBUG
   · cpu (fp16, fp64, il)

Notice the il capability reported by OpenCL.versioninfo(), indicating that PoCL supports SPIR-V and can thus be used with the new native Julia kernel compiler. In fact, this is one of the goals of reworking OpenCL.jl: to provide a CPU fallback implementation for use with Julia GPU libraries.

Work towards OpenCL.jl 1.0

This release is a significant step towards a stable 1.0 release of OpenCL.jl, bringing the package in line with our other Julia GPU-backends. Our focus is on improving OpenCL.jl in order to support a CPU fallback back-end for KernelAbstractions.jl based on PoCL. If you are a user of OpenCL.jl, or are interested in using the package in the future, please test out this release with your application and/or driver, and provide feedback on the changes we've made. Pull requests are greatly appreciated, and we are happy to help you get started with contributing to the package.