Author Archives: Tim Besard

oneAPI.jl status update

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-04-06-oneapi_update/index.html

It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs (and other accelerators) using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.

@atomic intrinsics

oneAPI.jl now supports atomic operations, which are required to implement a variety of parallel algorithms. Low-level atomic functions (atomic_add!, atomic_xchg!, etc) are available as unexported methods in the oneAPI module:

a = oneArray(Int32[0])function kernel(a)
    oneAPI.atomic_add!(pointer(a), Int32(1))
    return
end@oneapi items=256 kernel(a)
@test Array(a)[1] == 256

Note that these methods are only available for those types that are supported by the underlying OpenCL intrinsics. For example, the atomic_add! from above can only be used with Int32 and UInt32 inputs.

Most users will instead rely on the higher-level @atomic macro, which can be easily put in front of many array operations to make them behave atomically. To avoid clashing with the new @atomic macro in Julia 1.7, this macro is also unexported:

a = oneArray(Int32[0])function kernel(a)
    oneAPI.@atomic a[1] += Int32(1)
    return
end@oneapi items=256 kernel(a)
@test Array(a)[1] == 512

When used with operations that are supported by OpenCL, this macro will lower to calls like atomic_add!. For other operations, a compare-and-exchange loop will be used. Note that for now, this is still restricted to 32-bit operations, as we do not support the cl_khr_int64_base_atomics extension for 64-bit atomics.

Initial integration with vendor libraries

One significant missing features is the integration with vendor libraries like oneMKL. These integrations are required to ensure good performance for important operations like matrix multiplication, which currently fall-back to generic implementations in Julia that may not always perform as good.

To improve this situation, we are working on a wrapper library that allows us to integrate with oneMKL and other oneAPI and SYCL libraries. Currently, only matrix multiplication is supported, but once the infrastructural issues are worked out we expect to quickly support many more operations.

If you need support for specific libraries, please have a look at this PR. As the API surface is significant, we will need help to extend the wrapper library and integrate it with high-level Julia libraries like LinearAlgebra.jl.

Correctness issues

In porting existing Julia GPU applications to oneAPI.jl, we fixed several issues that caused correctness issues when executing code on Intel GPUs:

  • when the garbage collector frees GPU memory, it now blocks until all outstanding commands (which may include uses of said memory) are completes

  • the barrier function to synchronize threads is now marked as convert to avoid LLVM miscompilations

Note that if you are using Tiger Lake hardware, there is currently a known issue in the back-end Intel compiler that affects oneAPI.jl, causing correctness issues that can be spotted by running the oneAPI.jl test suite.

Future work

To significantly improve usability of oneAPI.jl, we will add support to the KernelAbstraction.jl package. This library is used by many other packages for adding GPU acceleration to algorithms that cannot be easily expressed using only array operations. As such, support for oneAPI.jl will make it possible to use your oneAPI GPUs with all of these packages.

CUDA.jl 3.5-3.8

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-01-28-cuda_3.5_3.8/index.html

CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post will highlight a couple: direct copies between devices, better performance by preserving array index types and changing the memory pool, and a much-improved interface to the compute sanitizer utility.

Copies between devices

Typically, when sending data between devices you need to stage through the CPU. CUDA.jl now does this automatically, making it possible to directly copy between CuArrays on different devices:

julia> device!(0);julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119julia> device!(1);julia> b = CUDA.zeros(2,2);julia> copyto!(b, a)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

When your hardware supports it, CUDA.jl will automatically enable so-called peer-to-peer mode, making it possible to copy data directly without going through the CPU. This can result in significant bandwidth and latency reductions. You can check if this mode of communication is possible:

julia> src = CuDevice(0)
CuDevice(0): NVIDIA A100-PCIE-40GBjulia> dst = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GBjulia> can_access_peer(src, dst)
false

In this case, peer-to-peer communication is not possible because the devices have a different compute capability major revision number. With a compatible device, the function reports true:

julia> src = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GBjulia> dst = CuDevice(2)
CuDevice(2): Tesla V100-PCIE-16GBjulia> can_access_peer(src, dst)
true

Thanks to @kshyatt for help with this change!

Helper function to use compute-sanitizer

The CUDA toolkit comes with a powerful tool to check GPU kernels for common issues like memory errors and race conditions: the compute sanitizer. To make it easier to use this tool, CUDA.jl now ships the binary as part of its artifacts, and provides a helper function to restart Julia under the compute-sanitizer. Let's demonstrate, and trigger a memory error to show what the compute sanitizer can detect:

julia> using CUDAjulia> CUDA.run_compute_sanitizer()
Re-starting your active Julia session...========= COMPUTE-SANITIZER
julia> using CUDAjulia> unsafe_wrap(CuArray, pointer(CuArray([1])), 2) .= 1
========= Invalid __global__ write of size 8 bytes
=========     at 0x2a0 in LLVM/src/interop/base.jl:45:julia_broadcast_kernel_1892(CuKernelContext, CuDeviceArray<Int64, (int)1, (int)1>, Broadcasted<void, Tuple<OneTo<Int64>>, _identity, Broadcasted<Int64>>, Int64)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0xa64000008 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0xa64000000 of size 8 bytes

Other tools are available too, e.g. racecheck for detecting races or synccheck for finding synchronization issues. These tools can be selected using the tool keyword argument to run_compute_sanitizer.

Updated binary dependencies

As is common with every release, CUDA.jl now supports newer versions of NVIDIA's tools and libraries:

The update to CUDA toolkit 11.6 comes with improved debug info compatibility. If you need to debug Julia GPU code with tools like compute-sanitizer or cuda-gdb, and you need debug info (the equivalent of nvcc -G), ensure CUDA.jl can use the latest version of the CUDA toolkit.

To make it easier to use the latest supported toolkit, CUDA.jl now implements CUDA's so-called Forward Compatibility mode: When your driver is outdated, CUDA.jl will attempt to load a newer version of the CUDA driver library, enabling use of a newer CUDA toolkit and libraries. Note that this is only supported on select hardware, refer to the NVIDIA documentation for more details.

Preserving array indices

Julia's integers are typically 64-bits wide, which can be wasteful when dealing with GPU indexing intrinsics that are typically only 32-bits wide. CUDA.jl's device array type now carefully preserves the type of indices so that 32-bits indices aren't unnecessarily promoted to 64-bits. With some careful kernel programming (note the use of 0x1 instead of 1 below), this makes it possible to significantly reduce the register pressure surrounding indexing operations, which may be useful in register-constrained situations:

julia> function memset(arr, val)
           i = (blockIdx().x-0x1) * blockDim().x + threadIdx().x
           @inbounds arr[i] = val
           return
       endjulia> CUDA.code_ptx(memset, Tuple{CuDeviceArray{Float32,1,AS.Global},Float32})
.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<5>;        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %tid.x;
        mad.lo.s32      %r4, %r2, %r1, %r3;
        ld.u64          %rd2, [%rd1];
        mul.wide.s32    %rd3, %r4, 4;
        add.s64         %rd4, %rd2, %rd3;
        st.global.f32   [%rd4], %f1;
        ret;
}

On CUDA.jl 3.4, this simple function used 3 more 64-bit registers:

.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<8>;        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mul.wide.u32    %rd2, %r2, %r1;
        mov.u32         %r3, %tid.x;
        add.s32         %r4, %r3, 1;
        cvt.u64.u32     %rd3, %r4;
        ld.u64          %rd4, [%rd1];
        add.s64         %rd5, %rd2, %rd3;
        shl.b64         %rd6, %rd5, 2;
        add.s64         %rd7, %rd4, %rd6;
        st.global.f32   [%rd7+-4], %f1;
        ret;
}

More aggressive memory management

Starting with CUDA 3.8, the memory pool used to allocate CuArrays will be configured differently: The pool will now be allowed to use all available GPU memory, whereas previously all cached memory was released at each synchronization point. This can significantly improve performance, and makes synchronization much cheaper.

This behavior can be observed by calling the memory_status() function:

julia> CUDA.memory_status()
Effective GPU memory usage: 13.57% (2.001 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)julia> a = CuArray{Float32}(undef, (1024, 1024, 1024));
julia> Base.format_bytes(sizeof(a))
"4.000 GiB"julia> a = nothing
julia> GC.gc()julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

So far nothing new. On previous versions of CUDA.jl however, any subsequent synchronization of the GPU (e.g., by copying memory to the CPU) would have resulted in a release of this reserved memory. This is not the case anymore:

julia> synchronize()julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

If you still want to release this memory, you can call the reclaim() function:

julia> CUDA.reclaim()julia> CUDA.memory_status()
Effective GPU memory usage: 13.48% (1.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

With interactive Julia sessions, this function is called periodically so that the GPU's memory isn't held on to unnecessarily. Otherwise it shouldn't be necessary to call this function, as memory is freed automatically when it is needed.

Minor changes and improvements

CUDA.jl 3.4

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2021-08-13-cuda_3.4/index.html

The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl is now the default, and support for memory pools other than the CUDA stream-ordered one has been removed.

Streamlined atomic operations

In preparation of integrating with the new standard @atomic macro introduced in Julia 1.7, we have streamlined the capabilities of atomic operations in CUDA.jl. The API is now split into two levels: low-level atomic_ methods for atomic functionality that's directly supported by the hardware, and a high-level @atomic macro that tries to perform operations natively or falls back to a loop with compare-and-swap. This fall-back implementation makes it possible to use more complex operations that do not map onto a single atomic operation:

julia> a = CuArray([1]);julia> function kernel(a)
         CUDA.@atomic a[] <<= 1
         return
       endjulia> @cuda threads=16 kernel(a)julia> a
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 65536julia> 1<<16
65536

The only requirement is that the types being used are supported by CUDA.atomic_cas!. This includes common types like 32 and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.

Note that on Julia 1.7 and higher, CUDA.jl does not export the @atomic macro anymore to avoid conflicts with the version in Base. That means it is recommended to always fully specify uses of the macro, i.e., use CUDA.@atomic as in the example above.

Arrays with unified memory

You may have noticed that the CuArray type in the example above included an additional parameter, Mem.DeviceBuffer. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it's now possible to allocate arrays backed by unified buffers that can be used on multiple devices:

julia> a = cu([0]; unified=true)
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 0julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1julia> device!(1)julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 2

Although all operations should work equally well with arrays backed by unified memory, they have not been optimized yet. For example, copying memory to the device could be avoided as the driver can automatically page in unified memory on-demand.

New default random number generator

CUDA.jl 3.0 introduced a new random number generator, and starting with CUDA.jl 3.2 performance and quality of this generator was improved up to the point it could be used by applications. A couple of features were still missing though, such as generating normally-distributed random numbers, or support for complex numbers. These features have been added in CUDA.jl 3.3, and the generator is now used as the default fallback when CURAND does not support the requested element types.

Both the performance and quality of this generator is much better than the previous, GPUArrays.jl-based one:

julia> using BenchmarkTools
julia> cuda_rng = CUDA.RNG();
julia> gpuarrays_rng = GPUArrays.default_rng(CuArray);
julia> a = CUDA.zeros(1024,1024);julia> @benchmark CUDA.@sync rand!($cuda_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.040 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.04%
 Time  (median):     18.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.604 μs ± 34.734 μs  ┊ GC (mean ± σ):  1.17% ±  0.99%         ▃▆█▇▇▅▄▂▁
  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time        24.1 μs <julia> @benchmark CUDA.@sync rand!($gpuarrays_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  72.489 μs …  2.790 ms  ┊ GC (min … max): 0.00% … 98.44%
 Time  (median):     74.479 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.211 μs ± 61.598 μs  ┊ GC (mean ± σ):  0.67% ±  1.40%  █                                                           ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs      Histogram: log(frequency) by time       443 μs <
julia> using RNGTest
julia> test_cuda_rng = RNGTest.wrap(cuda_rng, UInt32);
julia> test_gpuarrays_rng = RNGTest.wrap(gpuarrays_rng, UInt32);julia> RNGTest.smallcrushTestU01(test_cuda_rng)
 All tests were passedjulia> RNGTest.smallcrushTestU01(test_gpuarrays_rng)
 The following tests gave p-values outside [0.001, 0.9990]:       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 ----------------------------------------------
 (eps  means a value < 1.0e-300):

Removal of old memory pools

With the new stream-ordered allocator, caching memory allocations at the CUDA library level, much of the need for memory pools to cache memory allocations has disappeared. To simplify the allocation code, we have removed support for those Julia-managed memory pools (i.e., binned, split and simple). You can now only use the cuda memory pool, or use no pool at all by setting the JULIA_CUDA_MEMORY_POOL environment variable to none.

Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.

Also note that the new stream-ordered allocator has turned out incompatible with legacy cuIpc APIs as used by OpenMPI. If that applies to you, consider disabling the memory pool or reverting to CUDA.jl 3.3 if your application's allocation pattern benefits from a memory pool.

Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.

Device capability-dependent kernel code

Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that can be used to branch on.

Currently, the device capability and PTX ISA version are embedded and made available using respectively the compute_capability and ptx_isa_version functions. A simplified version number type, constructable using the sv"..." string macro, can be used to test against these properties. For example:

julia> function kernel(a)
           a[] = compute_capability() >= sv"6.0" ? 1 : 2
           return
       end
kernel (generic function with 1 method)julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_1({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 1.000000e+00, float addrspace(1)* %2, align 4
  ret void
}julia> capability(device!(1))
v"3.5.0"julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_2({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 2.000000e+00, float addrspace(1)* %2, align 4
  ret void
}

The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.

Other changes