cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2026-05-05-cutile_0.3/index.html

cuTile.jl v0.3 integrates with CUDA.jl, making it even easier to write and run CUDA Tile kernels in Julia. Performance has also been greatly improved, closing the gap with cuTile Python on every benchmark we ship. New features include a random number generator and support for array slicing.

Performance: matching cuTile Python

Three months ago, several of our benchmarks lagged cuTile Python by 5–15%. Today, cuTile.jl matches or outperforms cuTile Python on every kernel we ship. The headline numbers (RTX 5080, tileiras 13.2.51):

Kernel                       Julia          Python         Δ
Vector Addition              845 GB/s       846 GB/s       =
Matrix Transpose             812 GB/s       814 GB/s       =
Layer Norm fwd               983 GB/s       716 GB/s       +37%
Layer Norm bwd               248 GB/s       251 GB/s       -1%
Matrix Multiplication        47.5 TFLOPS    43.5 TFLOPS    +9%
Batch Matrix Multiply        34.0 TFLOPS    30.8 TFLOPS    +10%
FFT (3-stage Cooley-Tukey)   529 μs         554 μs         +5%
Mixture of Experts           27.0 TFLOPS    20.1 TFLOPS    +34%
Attention (FMHA, causal)     103.6 TFLOPS   63.4 TFLOPS    +63%
Softmax (TMA)                849 GB/s       857 GB/s       -1%
Softmax (Chunked)            1684 GB/s      1640 GB/s      +3%

Most of the gains come from extending the IR-level optimization pipeline introduced in v0.2 with a new dataflow framework that now powers several analyses and transformations.

CUDA.jl integration: @cuda backend=cuTile

Until v0.3, launching a cuTile kernel meant calling cuTile.launch(...) directly. cuTile.jl now plugs into CUDA.jl's existing @cuda macro as a first-class backend, making it much easier to launch cuTile.jl kernels:

using CUDA, cuTile
import cuTile as ct

function vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = CUDA.zeros(Float32, 1024)

@cuda backend=cuTile blocks=8 vadd(a, b, c)
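For comparison, here is the same launch in both styles. The exact argument order of the direct cuTile.launch call is an assumption for illustration; only the existence of that entry point is stated above.

```julia
# Pre-v0.3 style: calling cuTile.launch directly. The argument order shown
# (kernel, grid size, then kernel arguments) is assumed, not documented here.
ct.launch(vadd, 8, a, b, c)

# v0.3 style: the same launch through CUDA.jl's @cuda macro.
@cuda backend=cuTile blocks=8 vadd(a, b, c)
```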

Time-to-first-launch

Compiling a cuTile kernel goes through several stages: Julia type inference, our IR rewriting passes, Tile IR bytecode emission, and finally tileiras-driven CUBIN generation. None of these are fast. Significant effort in v0.3 went into reducing the time-to-first-launch, and the latency is now comparable to a typical CUDA.jl kernel launch on the same hardware:

Benchmark 1: julia -e 'using CUDACore;
                       @cuda identity(nothing)'
  Time (mean ± σ):      1.882 s ±  0.012 s    [User: 2.554 s, System: 0.305 s]
  Range (min … max):    1.867 s …  1.906 s    10 runs

Benchmark 2: julia -e 'using CUDACore, cuTile;
                       @cuda backend=cuTile identity(nothing)'
  Time (mean ± σ):      1.840 s ±  0.009 s    [User: 2.488 s, System: 0.329 s]
  Range (min … max):    1.827 s …  1.859 s    10 runs

Array slicing

view and @view now derive sub-range TileArrays from existing ones:

function copy_rows!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2},
                    i::Int32, j::Int32)
    sub = @view A[i:j, :]                         # sub-range TileArray
    t = ct.load(sub; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_rows!(A, B, Int32(3), Int32(10))

Each index must be : or a UnitRange; other forms (StepRange, scalar indexes, CartesianIndex, …) are currently rejected at compile time. The result is itself a TileArray, and can be passed to ct.load / ct.store (or sliced again, for nested views). The new divisibility analysis sees through the slicing chain so contiguous-axis fast paths are preserved, while literal slice sizes fold to compile-time-constant shape operands.
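Under those rules, the accepted and rejected index forms can be sketched as follows (a hypothetical kernel; the function name is illustrative, not part of the cuTile.jl API):

```julia
# Hedged sketch of the slicing rules described above.
function slice_demo(A::ct.TileArray{Float32,2})
    r = @view A[1:64, :]          # UnitRange and Colon: accepted
    s = @view r[1:32, 1:32]       # nested views: accepted
    # @view A[1:2:64, :]          # StepRange: rejected at compile time
    # @view A[5, :]               # scalar index: rejected at compile time
    t = ct.load(s; index=(1, 1), shape=(8, 8))
    ct.store(A; index=(1, 1), tile=t)
    return
end
```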

Random number generation

cuTile.jl now ships a tile-vectorized Philox2x32-7 RNG, both as in-kernel intrinsics and as a host-side cuTile.RNG handle for filling CuArrays. The kernel API mirrors Base.Random:

function noise!(out::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    t = randn(Float32, (128,))                 # in-kernel randn
    ct.store(out; index=pid, tile=t)
    return
end

@cuda backend=cuTile blocks=cld(N, 128) noise!(A)

rand covers all of Int{8,16,32,64}, UInt{8,16,32,64}, Float16, BFloat16, Float32, and Float64; randn (via Box-Muller, sharing its uniforms with the existing rand path) and randexp (via -log(U)) cover the four floating-point types. ct.DeviceRNG() opens an independent stream inside a kernel; Random.seed! re-seeds.
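Putting those pieces together, an explicit in-kernel stream might look like this. This is a sketch only: the text above names ct.DeviceRNG and Random.seed!, but the rng-passing signatures of rand and randexp inside kernels are assumed to mirror Base.Random.

```julia
using CUDA, cuTile, Random
import cuTile as ct

# Sketch: an explicit, seeded in-kernel stream. Assumes rand/randexp accept
# an rng argument as in Base.Random; the tile shape (128,) is arbitrary.
function mixed_noise!(u::ct.TileArray{Float32,1}, e::ct.TileArray{Float64,1})
    pid = ct.bid(1)
    rng = ct.DeviceRNG()                  # independent stream for this kernel
    Random.seed!(rng, 2026)               # reproducible across launches
    ct.store(u; index=pid, tile=rand(rng, Float32, (128,)))     # uniforms
    ct.store(e; index=pid, tile=randexp(rng, Float64, (128,)))  # via -log(U)
    return
end
```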

The host-side cuTile.RNG integrates with Random.rand! / Random.randn! / Random.randexp! and auto-advances its counter, so consecutive fills produce disjoint streams:

A = CUDACore.zeros(Float32, 1 << 20)
rng = ct.RNG(42)
randn!(rng, A)                                 # fill via fused tile kernel
B = rand(rng, Float64, 16)                     # out-of-place
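Because the counter auto-advances, two fills from one handle never overlap, while constructing a fresh handle with the same seed replays the stream. A minimal sketch, assuming ct.RNG(seed) is deterministic as the seeded constructor above suggests:

```julia
using CUDA, cuTile, Random
import cuTile as ct

A = CUDA.zeros(Float32, 1024)
B = CUDA.zeros(Float32, 1024)
C = CUDA.zeros(Float32, 1024)

rng = ct.RNG(42)
rand!(rng, A)        # first fill
rand!(rng, B)        # counter has advanced: disjoint from A

rng2 = ct.RNG(42)    # same seed, fresh counter
rand!(rng2, C)       # expected to reproduce A's values
```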

Performance of both the in-kernel and host-side APIs is excellent, matching or exceeding that of cuRAND and GPUArrays.jl's new generator.

What's next

If you've been watching cuTile.jl from a distance, now is a good time to try it out: run "add cuTile" from the Julia package REPL, or grab the examples to see how the moving parts fit together.

Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar on May 12, 2026 at 1 PM ET, covering the design of CUDA Tile, how cuTile.jl is built, and several relevant examples. Click here to sign up.