CUDA.jl 2.4 and 2.5

By: Blog on JuliaGPU

Re-posted from: https://juliagpu.org/2021-01-08-cuda_2.4_2.5/

CUDA.jl v2.4 and v2.5 are two almost-identical feature releases, for Julia 1.5 and 1.6
respectively. These releases feature greatly improved findmin and findmax kernels, an
improved interface for kernel introspection, support for CUDA 11.2, and of course many bug
fixes.

Improved findmin and findmax kernels

Thanks to @tkf and @Ellipse0934, CUDA.jl now uses a single-pass kernel for finding the
minimum or maximum item in a CuArray. This fixes compatibility with NaN-valued elements,
while improving performance on average. Depending on the rank, shape, and size of the
array, these improvements range from a minor regression to order-of-magnitude speedups.
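
For example, finding the maximum of an array that contains NaN now matches what Base
returns for the same data (a quick illustration; output shown assuming Base-compatible
semantics):

julia> a = CuArray([1.0, NaN, 3.0]);

julia> findmax(a)
(NaN, 2)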

New kernel introspection interface

It is now possible to obtain a compiled-but-not-launched kernel by passing the
launch=false keyword to @cuda. This is useful for reflection, e.g., querying the number of
registers or other kernel properties:

julia> kernel = @cuda launch=false identity(nothing)
CUDA.HostKernel{identity,Tuple{Nothing}}(...)

julia> CUDA.registers(kernel)
4
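
The object returned by @cuda launch=false is callable, so after inspecting it you can
still launch it with the usual keyword arguments. The occupancy API can help pick those (a
sketch; the reported numbers are illustrative and depend on your GPU):

julia> config = CUDA.launch_configuration(kernel.fun)
(blocks = 10, threads = 1024)

julia> kernel(nothing; threads=config.threads, blocks=config.blocks)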

The old API is still available, and will even be extended in future versions of CUDA.jl for
the purpose of compiling device functions (not kernels):

julia> kernel = cufunction(identity, Tuple{Nothing})
CUDA.HostKernel{identity,Tuple{Nothing}}(...)

Support for CUDA 11.2

CUDA.jl now supports the latest version of CUDA, version 11.2. Because CUDNN and CUTENSOR
are not yet compatible with this release, CUDA.jl won’t automatically switch to it unless
you explicitly request it:

julia> ENV["JULIA_CUDA_VERSION"] = "11.2"
"11.2"

julia> using CUDA

julia> CUDA.versioninfo()
CUDA toolkit 11.2.0, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Alternatively, if you disable use of artifacts through JULIA_CUDA_USE_BINARYBUILDER=false,
CUDA 11.2 can be picked up from your local system.
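
For example (output abbreviated and illustrative; the reported version depends on what is
installed locally):

julia> ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
"false"

julia> using CUDA

julia> CUDA.versioninfo()
CUDA toolkit 11.2.0, local installation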

Future developments

Due to upstream compiler changes, CUDA.jl 2.4 is expected to be the last release compatible
with Julia 1.5. Patch releases are still possible, but are not automatic: if you need a
specific bugfix from a future CUDA.jl release, create an issue or PR to backport the change.

Introducing: oneAPI.jl

By: Blog on JuliaGPU

Re-posted from: https://juliagpu.org/2020-11-05-oneapi_0.1/

We’re proud to announce the first version of oneAPI.jl, a Julia package for programming
accelerators with the oneAPI programming model. It is currently
available for select Intel GPUs, including common integrated ones, and offers a similar
experience to CUDA.jl.

The initial version of this package, v0.1, consists of three key components:

  • wrappers for the oneAPI Level Zero interfaces;
  • a compiler for Julia source code to SPIR-V IR;
  • and an array interface for convenient data-parallel programming.

In this post, I’ll briefly describe each of these. But first, some essentials.

Installation

oneAPI.jl is currently only supported on 64-bit Linux with a sufficiently recent kernel,
and requires Julia 1.5. Furthermore, it currently supports only a limited set of Intel
GPUs: Gen9 (Skylake, Kaby Lake, Coffee Lake), Gen11 (Ice Lake), and Gen12 (Tiger Lake).

If your Intel CPU has an integrated GPU supported by oneAPI, you can just go ahead and
install the oneAPI.jl package:

pkg> add oneAPI

That’s right, no additional drivers required! oneAPI.jl ships its own copy of the Intel
Compute Runtime
, which works out of the box on
any (sufficiently recent) Linux kernel. The initial download, powered by Julia’s artifact
subsystem, might take a while to complete. After that, you can import the package and start
using its functionality:

julia> using oneAPI

julia> oneAPI.versioninfo()
Binary dependencies:
- NEO_jll: 20.42.18209+0
- libigc_jll: 1.0.5186+0
- gmmlib_jll: 20.3.2+0
- SPIRV_LLVM_Translator_jll: 9.0.0+1
- SPIRV_Tools_jll: 2020.2.0+1

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1

1 driver:
- 00007fee-06cb-0a10-1642-ca9f01000000 (v1.0.0, API v1.0.0)

1 device:
- Intel(R) Graphics Gen9

The oneArray type

Similar to CUDA.jl’s CuArray type, oneAPI.jl provides an array abstraction that you can
use to easily perform data-parallel operations on your GPU:

julia> a = oneArray(zeros(2,3))
2×3 oneArray{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> a .+ 1
2×3 oneArray{Float64,2}:
 1.0  1.0  1.0
 1.0  1.0  1.0

julia> sum(ans; dims=2)
2×1 oneArray{Float64,2}:
 3.0
 3.0

This functionality builds on the GPUArrays.jl
package, which means that a lot of operations are supported out of the box. Some are still
missing, of course, and we haven’t carefully optimized for performance either.
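
For example, generic reductions provided by GPUArrays.jl should also work on a oneArray (a
small sketch, assuming the generic mapreduce fallback applies; individual operations may
not be tuned for speed yet):

julia> a = oneArray(rand(Float32, 4, 4));

julia> mapreduce(abs, +, a) ≈ mapreduce(abs, +, Array(a))
true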

Kernel programming

The above array operations are made possible by a compiler that transforms Julia source code
into SPIR-V IR for use with oneAPI. Most of this work is part of
GPUCompiler.jl. In oneAPI.jl, we use this
compiler to provide a kernel programming model:

julia> function vadd(a, b, c)
           i = get_global_id()
           @inbounds c[i] = a[i] + b[i]
           return
       end

julia> a = oneArray(rand(10));

julia> b = oneArray(rand(10));

julia> c = similar(a);

julia> @oneapi items=10 vadd(a, b, c)

julia> using Test

julia> @test Array(a) .+ Array(b) == Array(c)
Test Passed

Again, the @oneapi macro resembles @cuda from CUDA.jl. One of the differences from the
CUDA stack is that we use OpenCL-style built-ins, like get_global_id instead of threadIdx
and barrier instead of sync_threads. Other familiar functionality, such as reflecting on
the compiler, is available as well:

julia> @device_code_spirv @oneapi vadd(a, b, c)
; CompilerJob of kernel vadd(oneDeviceArray{Float64,1,1}, oneDeviceArray{Float64,1,1},
; oneDeviceArray{Float64,1,1}) for GPUCompiler.SPIRVCompilerTarget

; SPIR-V
; Version: 1.0
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 46
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
               OpCapability Float64
               OpCapability Int64
               OpCapability Int8
          %1 = OpExtInstImport "OpenCL.std"
               OpMemoryModel Physical64 OpenCL
               OpEntryPoint Kernel
               ...
               OpReturn
               OpFunctionEnd

Level Zero wrappers

To interface with the oneAPI driver, we use the Level Zero API. Wrappers for this API are
available under the oneL0 submodule of oneAPI.jl:

julia> using oneAPI.oneL0

julia> drv = first(drivers())
ZeDriver(00000000-0000-0000-1642-ca9f01000000, version 1.0.0)

julia> dev = first(devices(drv))
ZeDevice(GPU, vendor 0x8086, device 0x1912): Intel(R) Graphics Gen9

This is a low-level interface, and importing this submodule should not be required for the
vast majority of users. It is only useful when you want to perform very specific
operations, like submitting certain operations to a command queue or working with events.
In that case, you should refer to the upstream specification; the wrappers in the oneL0
module closely mimic the C APIs.

Status

Version 0.1 of oneAPI.jl forms a solid base for future oneAPI developments in Julia. Thanks
to the continued effort of generalizing the Julia GPU support in packages like GPUArrays.jl
and GPUCompiler.jl, this initial version is already much more usable than early versions of
CUDA.jl or AMDGPU.jl ever were.

That said, there are crucial parts missing. For one, oneAPI.jl does not integrate with any
of the vendor libraries like oneMKL or oneDNN. That means several important operations, e.g.
matrix-matrix multiplication, will be slow. Hardware support is also limited, and the
package currently only works on Linux.

If you want to contribute to oneAPI.jl, or run into problems, check out the GitHub
repository at JuliaGPU/oneAPI.jl. For questions, please use the Julia Discourse forum
under the GPU domain, or the #gpu channel of the Julia Slack.

CUDA.jl 2.1

By: Blog on JuliaGPU

Re-posted from: https://juliagpu.org/2020-10-30-cuda_2.1/

CUDA.jl v2.1 is a bug-fix release, with one new feature: support for cubic texture
interpolations. The release also partly reverts a change from v2.0: reshape, reinterpret
and contiguous views now return a CuArray again.

Generalized texture interpolations

CUDA’s texture hardware only supports nearest-neighbour and linear interpolation; for
other modes, one is required to perform the interpolation by hand. In CUDA.jl v2.1 we have
generalized the texture interpolation API so that it is possible to use both
hardware-backed and software-implemented interpolation modes in exactly the same way:

using CUDA

N = 2        # the dimensionality (1, 2 or 3)
T = Float32  # the element type (needs to be supported by the texture hardware)

# source array
src = rand(T, fill(10, N)...)

# indices we want to interpolate
idx = [tuple(rand(1:0.1:10, N)...) for _ in 1:10]

# upload to the GPU
gpu_src = CuArray(src)
gpu_idx = CuArray(idx)

# interpolate using a texture
gpu_dst = CuArray{T}(undef, size(gpu_idx))
gpu_tex = CuTexture(gpu_src; interpolation=CUDA.NearestNeighbour())
broadcast!(gpu_dst, gpu_idx, Ref(gpu_tex)) do idx, tex
    tex[idx...]
end

# back to the CPU
dst = Array(gpu_dst)

Here, the interpolation argument to CuTexture can be set to NearestNeighbour or
LinearInterpolation, both supported by the hardware, or to CubicInterpolation, which is
implemented in software (building on the hardware-supported linear interpolation).
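
For example, switching the same code to software-implemented cubic interpolation only
requires changing that one keyword argument (reusing the gpu_src array from above):

gpu_tex = CuTexture(gpu_src; interpolation=CUDA.CubicInterpolation())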

Partial revert of array wrapper changes

In CUDA.jl v2.0, we changed the behavior of several important array operations to reuse
available wrappers in Base: reshape returned a ReshapedArray, view returned a SubArray,
and reinterpret was reworked to use a ReinterpretArray. These changes were made to ensure
maximal compatibility with Base’s array types, and to simplify the implementation in
CUDA.jl and GPUArrays.jl.

However, this change turned out to regress the time to precompile and load CUDA.jl.
Consequently, the change has been reverted, and these wrappers are now implemented as part
of the CuArray type again. Note however that we intend to revisit this change in the
future. It is therefore recommended to use the DenseCuArray type alias for methods that
need a CuArray backed by contiguous GPU memory. For strided CuArrays, i.e.
non-contiguous views, you should use the StridedCuArray alias.
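
For example, a method that hands memory to a library expecting contiguous data can
dispatch on the alias (copy_to_host! is a hypothetical helper, shown only to illustrate
the signature):

# only accepts CuArrays backed by contiguous GPU memory
function copy_to_host!(dst::Array{T}, src::DenseCuArray{T}) where {T}
    copyto!(dst, src)
end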