Author Archives: Tim Besard

cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2026-05-05-cutile_0.3/index.html

cuTile.jl v0.3 integrates with CUDA.jl, making it even easier to write and run CUDA Tile kernels in Julia. Performance has also been greatly improved, closing the gap with cuTile Python on every benchmark we ship. Added features include a random number generator, and support for array slicing.

Performance: matching cuTile Python

Three months ago, several of our benchmarks lagged cuTile Python by 5–15%. Today, cuTile.jl matches or outperforms cuTile Python on every kernel we ship. The headline numbers (RTX 5080, tileiras 13.2.51):

Kernel Julia Python Δ
Vector Addition 845 GB/s 846 GB/s =
Matrix Transpose 812 GB/s 814 GB/s =
Layer Norm fwd 983 GB/s 716 GB/s +37%
Layer Norm bwd 248 GB/s 251 GB/s -1%
Matrix Multiplication 47.5 TFLOPS 43.5 TFLOPS +9%
Batch Matrix Multiply 34.0 TFLOPS 30.8 TFLOPS +10%
FFT (3-stage Cooley-Tukey) 529 μs 554 μs +5%
Mixture of Experts 27.0 TFLOPS 20.1 TFLOPS +34%
Attention (FMHA, causal) 103.6 TFLOPS 63.4 TFLOPS +63%
Softmax (TMA) 849 GB/s 857 GB/s -1%
Softmax (Chunked) 1684 GB/s 1640 GB/s +3%

Most of the gains come from extending the IR-level optimization pipeline introduced in v0.2 with a new dataflow framework that now powers several analyses and transformations.

CUDA.jl integration: @cuda backend=cuTile

Until v0.3, launching a cuTile kernel meant calling cuTile.launch(...) directly. cuTile.jl now plugs into CUDA.jl's existing @cuda macro as a first-class backend, making it much easier to launch cuTile.jl kernels:

using CUDA, cuTile
import cuTile as ctfunction vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
enda = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = CUDA.zeros(Float32, 1024)@cuda backend=cuTile blocks=8 vadd(a, b, c)

Time-to-first-launch

Compiling a cuTile kernel goes through several stages: Julia type inference, our IR rewriting passes, Tile IR bytecode emission, and finally tileiras-driven CUBIN generation. None of these are fast. Significant effort in v0.3 went into reducing the time-to-first-launch, and the latency is now comparable to a typical CUDA.jl kernel launch on the same hardware:

Benchmark 1: julia -e 'using CUDACore;
                       @cuda identity(nothing)'
  Time (mean ± σ):      1.882 s ±  0.012 s    [User: 2.554 s, System: 0.305 s]
  Range (min … max):    1.867 s …  1.906 s    10 runsBenchmark 2: julia -e 'using CUDACore, cuTile;
                       @cuda backend=cuTile identity(nothing)'
  Time (mean ± σ):      1.840 s ±  0.009 s    [User: 2.488 s, System: 0.329 s]
  Range (min … max):    1.827 s …  1.859 s    10 runs

Array slicing

view and @view now derive sub-range TileArrays from existing ones:

function copy_rows!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2},
                    i::Int32, j::Int32)
    sub = @view A[i:j, :]                         # sub-range TileArray
    t = ct.load(sub; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end@cuda backend=cuTile copy_rows!(A, B, Int32(3), Int32(10))

Each index must be : or a UnitRange; other forms (StepRange, scalar indexes, CartesianIndex, …) are currently rejected at compile time. The result is itself a TileArray, and can be passed to ct.load / ct.store (or sliced again, for nested views). The new divisibility analysis sees through the slicing chain so contiguous-axis fast paths are preserved, while literal slice sizes fold to compile-time-constant shape operands.

Random number generation

cuTile.jl now ships a tile-vectorized Philox2x32-7 RNG, both as in-kernel intrinsics and as a host-side cuTile.RNG handle for filling CuArrays. The kernel API mirrors Base.Random:

function noise!(out::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    t = randn(Float32, (128,))                 # in-kernel randn
    ct.store(out; index=pid, tile=t)
    return
end@cuda backend=cuTile blocks=cld(N, 128) noise!(A)

rand covers all of Int{8,16,32,64}, UInt{8,16,32,64}, Float16, BFloat16, Float32, and Float64; randn (via Box-Muller, sharing its uniforms with the existing rand path) and randexp (via -log(U)) cover the four floating-point types. ct.DeviceRNG() opens an independent stream inside a kernel; Random.seed! re-seeds.

The host-side cuTile.RNG integrates with Random.rand! / Random.randn! / Random.randexp! and auto-advances its counter, so consecutive fills produce disjoint streams:

A = CUDACore.zeros(Float32, 1 << 20)
rng = ct.RNG(42)
randn!(rng, A)                                 # fill via fused tile kernel
B = rand(rng, Float64, 16)                     # out-of-place

Performance of both the in-kernel and host-side APIs is excellent, matching or exceeding the performance of cuRAND and GPUArrays.jl' new generator.

What's next

If you've been watching cuTile.jl from a distance: now's a good time to try it out: add cuTile from the Julia REPL, or grab the examples to see how the moving parts fit together.

There is a webinar scheduled on May 12, 2026 at 1 PM ET, where Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar, covering the design of CUDA Tile, how cuTile.jl is built, and several relevant examples. Click here to sign up.

cuTile.jl 0.2: New features, improved performance, and Julia 1.13 support

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2026-04-08-cutile_0.2/index.html

cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA's tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will be presenting about it in a joint webinar with NVIDIA on May 12.

The release is showcased by two new examples that exercise many of the features described below: a fused Mixture of Experts kernel with token routing via gather/scatter, and a Flash Multi-Head Attention implementation with online softmax and causal masking.

Breaking changes

  • ct.where removed: use ifelse.(cond, x, y) (standard Julia broadcast);

  • FP mode kwargs removed: per-operation rounding_mode and flush_to_zero kwargs on reductions/scans replaced with ct.@fpmode blocks (see below);

  • Matmul batch dimensions: muladd now uses trailing batch dims (M, K, B...) matching Julia convention, instead of leading (B, M, K).

Native for loops

Previously, cuTile.jl required a while-loop workaround for iteration. Starting with v0.2, standard Julia for loops work directly:

for k in Int32(1):K_tiles
    a = ct.load(A; index=(pid_m, k), shape=(TILE_M, TILE_K))
    b = ct.load(B; index=(k, pid_n), shape=(TILE_K, TILE_N))
    acc = muladd(a, b, acc)
end

The compiler recognizes Julia's iterator protocol and lowers for i in
start:stop
and for i in start:step:stop to Tile IR ForOp directly.

Floating-point mode: ct.@fpmode

A new scoped macro controls floating-point rounding and flush-to-zero for all operations in its body, matching how FP modes work in hardware:

ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin
    s = sum(tile; dims=1)
    x = exp2.(tile)
end

Blocks can be nested, with inner blocks inheriting unspecified settings from the enclosing scope.

Keyword arguments for operations

Most cuTile operations now use keyword arguments, aligning with cuTile Python's API and making call sites more readable:

  • load/store: index, shape, and tile are now kwargs (ct.load(arr; index=pid, shape=(M, N))). Positional syntax still works.

  • arange: now works without a type, defaulting to Int32. Use dtype for other types (e.g. ct.arange(16; dtype=Int64)).

  • gather/scatter: new mask, padding_value, and check_bounds kwargs. User masks are AND'd with automatic bounds masks; check_bounds=false skips bounds comparisons when indices are known safe.

  • Atomics: all atomic operations accept check_bounds to optionally skip the bounds mask.

  • allow_tma: default changed from true to nothing (compiler decides).

Experimental host abstractions

cuTile.jl now provides a limited set of host-level APIs that generate cuTile kernels automatically, without writing explicit kernel code. They are exposed using the ct.Tiled wrapper type, which represents a tiled view of an array.

Broadcasting fuses an entire expression into a single cuTile kernel, with tile sizes chosen automatically:

ct.Tiled(C) .= ct.Tiled(A) .+ ct.Tiled(B)# Or via the convenience macro (wraps all arrays automatically):
ct.@. C = A + sin(B)

mapreduce on Tiled arrays generates a tiled reduction kernel:

mapreduce(identity, +, ct.Tiled(A); dims=1)

These APIs are experimental and may not persist in their current form. The goal is to eventually fold them into the default CuArray operations in CUDA.jl.

Debugging with print/println

You can now use standard Julia print and println inside kernels:

function debug_kernel(A, tile_size::Int)
    pid = ct.bid(1)
    tile = ct.load(A; index=pid, shape=(tile_size,))
    println("Block ", pid, ": sum=", sum(tile; dims=1))
    return
end

String constants, scalars, and tiles can be mixed freely. String interpolation ("x=$x") is also supported.

Minor changes

  • Atomics: atomic_max, atomic_min, atomic_or, atomic_and, atomic_xor join the existing atomic_cas, atomic_xchg, and atomic_add;

  • fill/zeros/ones overlays: standard Base constructors now work inside kernels (e.g. zeros(Float32, 64, 64)), thanks to @AntonOresten;

  • isnan: works via a single unordered float self-comparison, avoiding Julia's bit-manipulation fallback;

  • Debug info: source file and line information is now embedded in Tile IR bytecode;

  • Julia 1.13: support for the upcoming Julia release.

Performance improvements

A major change in cuTile.jl 0.2 is under the hood: a new multi-pass optimization pipeline that significantly improves the quality of generated Tile IR. In v0.1, the compiler emitted IR almost directly from the structured Julia code; now, a series of passes transform and simplify it before bytecode emission.

The foundation is a declarative IR rewrite infrastructure inspired by MLIR that makes it easy to express pattern-matched transformations:

# FMA fusion: mulf + addf → fma
@rewrite addf(mulf(~a, ~b), ~c) => fma(~a, ~b, ~c)# pow(x, 2) → x * x
@rewrite pow(~x, broadcast(constant(2.0))) => mulf(~x, ~x)

Built on this, the pipeline includes:

  • Algebraic simplification cancels matching arithmetic pairs like (x + 1) -
    1 → x
    , even when reshapes or broadcasts sit in between. This eliminates the overhead of Julia's 1-based indexing normalization;

  • Comparison strength reduction canonicalizes patterns like (x + 1) <= y into x < y, collapsing the two-instruction arange-plus-compare that results from Julia's 1-based arange mask idiom down to a single comparison. This alone reduced layernorm's SASS from 10036 to 3253 instructions;

  • Pow2 strength reduction replaces pow(x, 2) with x * x, eliminating the expensive pow transcendental in layernorm's variance computation;

  • LICM hoists loop-invariant operations out of loop bodies;

  • Constant folding and propagation evaluates compile-time-known arithmetic and tracks constants through the IR for further optimizations;

  • Alias-aware token ordering uses alias analysis to identify independent memory operations on different arrays, avoiding unnecessary serialization that previously blocked instruction-level parallelism;

Thanks to these improvements, all examples from cuTile Python that have been ported to Julia using cuTile.jl perform within 10% of their Python counterparts, and some are even faster. For up to date performance comparisons, see the cuTile.jl README.

Upcoming webinar

On May 12, 2026 at 1 PM ET, Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar. We will cover the package's design, demonstrate writing GPU kernels in Julia using the tile programming model, and discuss what's next for cuTile.jl and its integration with the Julia GPU ecosystem. To sign up, see the JuliaHub event.

CUDA.jl 5.8: CuSparseVector broadcasting, CUDA 12.9, and more

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2025-05-14-cuda_5.8/index.html

CUDA.jl v5.8 brings several enhancements, most notably the introduction of broadcasting support for CuSparseVector. The release also includes support for CUDA 12.9, and updates to key CUDA libraries like cuTENSOR, cuQuantum, and cuDNN.

Broadcasting for CuSparseVector

A significant enhancement in CUDA.jl v5.8 is the support for broadcasting CuSparseVector. Thanks to @kshyatt, it is now possible to use sparse GPU vectors in broadcast expressions just like it was already possible with sparse matrices:

julia> using CUDA, .CUSPARSE, SparseArraysjulia> x = cu(sprand(Float32, 10, 0.3))
10-element CuSparseVector{Float32, Int32} with 4 stored entries:
  [2]  =  0.459139
  [3]  =  0.964073
  [8]  =  0.904363
  [9]  =  0.721723julia> # a zero-preserving elementwise operation
       x .* 2
10-element CuSparseVector{Float32, Int32} with 4 stored entries:
  [2]  =  0.918278
  [3]  =  1.928146
  [8]  =  1.808726
  [9]  =  1.443446julia> # a non-zero-preserving elementwise operation
       x .+ 1
10-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 1.0
 1.4591388
 1.9640732
 1.0
 1.0
 1.0
 1.0
 1.9043632
 1.7217231
 1.0julia> # combining multiple sparse inputs
       x .+ cu(sprand(Float32, 10, 0.3))
10-element CuSparseVector{Float32, Int32} with 6 stored entries:
  [1]  =  0.906
  [2]  =  0.583197
  [3]  =  0.964073
  [4]  =  0.259103
  [8]  =  0.904363
  [9]  =  0.935917

Minor Changes

CUDA.jl 5.8 also includes several other useful updates:

As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the changelog for a full list of changes.