
Metal.jl 1.10: Linear algebra, FFTs, and a faster runtime

By: Christian Guinard, Tim Besard

Re-posted from: https://juliagpu.org/post/2026-07-01-metal-1.10/index.html

Metal.jl 1.10 is a big release. It adds native matrix multiplication, GPU-accelerated linear solvers and FFTs, BFloat16 support, and MPS-backed reductions, scans and sorting. The runtime also got considerably faster and leaner, and there is a new in-process profiler.

Before getting into the new features, one thing to flag up front: Metal.jl 1.10 requires macOS 14 or later, up from macOS 13. On older systems the package now refuses to initialize, and Metal.functional() returns false. The supported range is macOS 14 through 26, on Julia 1.10 through 1.13.
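For scripts that also have to run on machines below the new minimum, branching on Metal.functional() is one way to degrade gracefully. The following is only a sketch: it assumes Metal.jl is installed and loads on the host, and the CPU fallback is an illustrative choice rather than anything the package does for you.

```julia
using Metal

# Pick an array type depending on whether the GPU stack is usable on this host.
# On macOS 13 and older, Metal.functional() is expected to return false.
ArrayT = Metal.functional() ? MtlArray : Array

x = ArrayT(Float32[1, 2, 3])
y = x .+ 1f0          # runs on the GPU when ArrayT is MtlArray, on the CPU otherwise
result = Array(y)     # copy back to host memory in either case
```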

Tied to that requirement is a change in how kernels are compiled. Previously Metal.jl pinned a conservative baseline (AIR 2.5 / metallib v1.2.6) regardless of the host. Since Metal.jl only ever compiles for the machine it runs on, it now emits the newest AIR, MSL and metallib versions the host macOS supports, exactly like Apple's offline metal compiler does. That unlocks newer language features for free: AIR 2.6 / Metal 3.1 on macOS 14, up to AIR 2.8 / Metal 4.0 on macOS 26 (and Metal 4.1 on the macOS 27 beta). You can see what your machine targets in versioninfo:

julia> Metal.versioninfo()
macOS 26.6.0, Darwin 25.6.0

Toolchain:
- Julia: 1.12.6
- LLVM: 18.1.7
- Metal: 4.0 (MSL), 2.8 (AIR), 1.2.9 (metallib)

Julia packages:
- Metal.jl: 1.10.0
- GPUArrays: 11.5.8
- GPUCompiler: 1.22.7
- KernelAbstractions: 0.9.42
- ObjectiveC: 6.0.0
- LLVM: 9.10.0
- LLVMDowngrader_jll: 0.8.1+0

1 device:
- Apple M3 Pro (14 GPU cores, 80.000 KiB allocated; Apple9, Metal4 family)

Native matrix multiplication

Up to now, every A * B on an MtlArray went straight to Apple's vendor libraries (Metal Performance Shaders or MPSGraph). That works well on large matrices, but it leaves us at the mercy of the vendor: there are eltypes MPS does not support, small matrices pay a steep launch overhead, and bugs like the M1/M2 matmul NaN issue are out of our hands. Anything unsupported fell back to GPUArrays' generic implementation, resulting in poor performance.

Metal.jl 1.10 ships its own native GEMM kernels. You pick a backend through the Metal.matmul_alg scoped value, which defaults to :auto:

  • :scalar is a per-element tiled kernel that handles any Metal-supported eltype (integers, complex, BFloat16) and any transpose or offset. It's the universal fallback.

  • :simd is a simdgroup_matrix kernel for Float16/Float32 (and BFloat16) with Float32 accumulation.

  • :tensor is a Metal 4 tensor_ops::matmul2d kernel, available on Metal 4-capable devices running macOS 26+.

  • :native picks the best of the three, per device and per operand.

  • :auto (the default) tries the vendor libraries first, then falls back to :native.
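As a rough illustration of switching backends, the sketch below overrides the scoped value for a single multiplication. It assumes Metal.matmul_alg works with the with function from Base.ScopedValues (Julia 1.11 and later; on Julia 1.10 the ScopedValues.jl package would be the likely equivalent), and the matrix sizes are arbitrary.

```julia
using Metal
using Base.ScopedValues: with   # assumption: Julia 1.11+

A = MtlArray(rand(Float32, 256, 128))
B = MtlArray(rand(Float32, 128, 64))

# Default setting (:auto): vendor libraries first, then the native kernels.
C_auto = A * B

# Force the native kernels for everything inside this dynamic scope.
C_native = with(Metal.matmul_alg => :native) do
    A * B
end

# CPU reference; both GPU results should agree up to Float32 rounding.
C_ref = Array(A) * Array(B)
```

Because the setting is scoped, it reverts to :auto once the do block returns, so a benchmark comparing backends does not leak its choice into the rest of the session.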

Linear solvers

Closing a long-standing request, many more standard LinearAlgebra operations on Float32/Float16 MtlMatrixes now run on the GPU through MPS-backed solvers. That covers \, lu, cholesky (including on Symmetric/Hermitian wrappers), triangular solves, and inv/det/logdet:

julia> using Metal, LinearAlgebra

julia> A = MtlArray(rand(Float32, 512, 512) + 512I);

julia> b = MtlArray(rand(Float32, 512));

julia> x = A \ b;                              # MPS LU solve, on the GPU

julia> norm(Array(A) * Array(x) - Array(b))    # residual, at the Float32 noise floor
4.4064095f-6

julia> M = MtlArray(rand(Float32, 256, 256));

julia> logdet(cholesky(Symmetric(M'M + I)))    # cholesky factorization, also on the GPU
638.44586f0

BFloat16

BFloat16 arrays now run natively on the GPU as well:

julia> using Metal, BFloat16s

julia> a = MtlArray(BFloat16[1.5, 2.5, 3.5])
3-element MtlVector{BFloat16, Metal.PrivateStorage}:
 1.5
 2.5
 3.5

julia> sum(a .* BFloat16(2))
BFloat16(15.0f0)

All supported Julia versions work, but before Julia 1.13 operations involving scalar BFloat16 values (e.g. a .+ BFloat16(1)) may be slower because they go through a software emulation path in BFloat16s.jl.

FlashAttention example

To tie the new building blocks together, there is a FlashAttention example that spells out scaled dot-product attention in four different ways, one per programming model Metal.jl exposes:

  • with plain array operations (*, broadcasting, maximum, sum, exp);

  • with MPSGraph's fused scaledDotProductAttention op;

  • with a hand-written kernel built on MtlSimdgroupMatrix{Float16,8,8};

  • and with a fused kernel using the Metal 4 tensor_ops::matmul2d primitives.

It's a good read if you want to see how the simdgroup and tensor intrinsics look in practice; you'll find it in examples/flashattention.jl.
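To give a flavor of the first variant, here is a minimal sketch of scaled dot-product attention written only with *, broadcasting, maximum, sum and exp. It is not the code from the example file: the function name and the row-major layout (Q is n×d, K is m×d, V is m×dv) are assumptions made for illustration. It runs on plain CPU arrays as written; since it only uses generic array operations, passing MtlArrays should keep every step on the GPU.

```julia
# Scaled dot-product attention from plain array operations.
# Assumed layout (illustration only): Q is n×d, K is m×d, V is m×dv.
function attention(Q, K, V)
    d = size(Q, 2)
    S = (Q * K') ./ sqrt(eltype(Q)(d))   # n×m similarity scores
    S = S .- maximum(S; dims=2)          # subtract row maxima for numerical stability
    P = exp.(S)
    P = P ./ sum(P; dims=2)              # row-wise softmax
    return P * V                         # n×dv weighted sum of the values
end

Q = rand(Float32, 8, 16)
K = rand(Float32, 12, 16)
V = ones(Float32, 12, 4)

# Each softmax row sums to 1, so with an all-ones V the output is all ones.
out = attention(Q, K, V)
```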

Reductions, scans and sorting

Reductions, prefix scans and sorting now route through MPSGraph when it makes sense. This speeds up reductions and scans, and introduces support for sorting:

julia> sort(MtlVector(Int16[5, -3, 2, 9, -7, 0]))
6-element MtlVector{Int16, Metal.PrivateStorage}:
 -7
 -3
  0
  2
  5
  9

julia> accumulate(max, MtlVector(Int32[1, 3, 2, 5, 4]))
5-element MtlVector{Int32, Metal.PrivateStorage}:
 1
 3
 3
 5
 5

Neural-network primitives

Metal.jl 1.10 also wraps the core MPSGraph neural-network primitives: softmax and logsoftmax, 2D convolution, and max/mean pooling, each with its gradient. These are wired up as the Metal backend for NNlib.jl, so once that release lands, Flux models gain GPU acceleration on Apple hardware through the functions you already use (conv, maxpool, softmax, …) rather than any Metal-specific API.

FFTs

On the back of the MPSGraph work, Metal.jl now supports FFTs through the AbstractFFTs.jl interface:

julia> using Metal, AbstractFFTs

julia> x = MtlArray(rand(ComplexF32, 2048, 2048));

julia> y = fft(x);          # just works

julia> Array(ifft(y)) ≈ Array(x)
true

julia> p = plan_fft(x);     # reusable plans, too

julia> Array(p * x) ≈ Array(y)
true

Real transforms (rfft/irfft), transforms along specific dimensions, and batched transforms are all supported. Running on the GPU is a large win over a CPU FFT, even one backed by AppleAccelerate. The following are timings on a 30-core M2 Max:

| Size | CPU (FFTW) | CPU (FFTW + AppleAccelerate) | GPU (Metal) | Speedup vs. Accelerate |
|---|---|---|---|---|
| 512×512 | 4.2 ms | 766.2 µs | 173.4 µs | 4.4× |
| 1024×1024 | 19.7 ms | 3.7 ms | 246.3 µs | 15× |
| 2048×2048 | 99.5 ms | 20.8 ms | 588.4 µs | 35× |
| 4096×4096 | 580.1 ms | 99.0 ms | 2.5 ms | 39× |

A faster, leaner runtime

A lot of work in this cycle went into the cost of getting work onto the GPU and back.

Batched command submission. Metal.jl used to create, encode and commit a fresh command buffer for every single launch. It now keeps one command buffer open and submits launches into it, flushing on synchronization or other triggers. That amortizes the per-launch command-buffer overhead, which is the dominant cost for workloads built out of many small kernels.
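The effect of batching can be shown with a toy model in plain Julia. Everything in it is hypothetical (the ToyQueue type, the 64-launch flush trigger, the function names); it only mimics the bookkeeping and is unrelated to Metal.jl's actual implementation.

```julia
# Toy model of batched submission; all names and numbers here are hypothetical.
mutable struct ToyQueue
    pending::Vector{Function}   # launches encoded into the open command buffer
    commits::Int                # number of command buffers committed so far
    max_pending::Int            # made-up flush trigger
end
ToyQueue(; max_pending=64) = ToyQueue(Function[], 0, max_pending)

# Commit the open buffer: run everything that was encoded, count one commit.
function flush!(q::ToyQueue)
    isempty(q.pending) && return q
    foreach(f -> f(), q.pending)
    empty!(q.pending)
    q.commits += 1
    return q
end

# Encode a launch; only commit when the trigger fires.
function launch!(q::ToyQueue, f)
    push!(q.pending, f)
    length(q.pending) >= q.max_pending && flush!(q)
    return q
end

synchronize!(q::ToyQueue) = flush!(q)

q = ToyQueue()
counter = Ref(0)
for _ in 1:100
    launch!(q, () -> counter[] += 1)
end
synchronize!(q)
# 100 launches end up in 2 commits: one at the 64-launch trigger, one at synchronize!
```

With one commit per launch the same loop would have paid the commit cost 100 times, which is the overhead the batching amortizes.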

Non-blocking synchronization. Synchronization now uses a spin-then-yield scheme, ported from CUDA.jl, instead of blocking inside Metal. The primary motivation is correctness (a blocked main thread can deadlock against a Metal callback that needs to do I/O), but it is also dramatically faster on the fast paths:

| Scenario | Before | After | Speedup |
|---|---|---|---|
| synchronize() on a queue that never ran work | 15.87 µs | 0.19 µs | ~86× |
| synchronize() when the queue is idle | 15.55 µs | 0.37 µs | ~42× |
| small kernel + synchronize() in a tight loop | 359 µs | 149 µs | ~2.4× |
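The general shape of a spin-then-yield wait is easy to sketch in plain Julia. The version below waits on an atomic flag; the function name and the spin count are made up for illustration, and Metal.jl's real implementation waits on GPU completion rather than on a flag.

```julia
# Generic spin-then-yield wait on a completion flag (a sketch, not Metal.jl's code).
function spin_then_yield(done::Threads.Atomic{Bool}; spins::Int=1_000)
    # Fast path: poll briefly without giving up the thread.
    for _ in 1:spins
        done[] && return :spun
    end
    # Slow path: let other Julia tasks (e.g. callbacks doing I/O) run while waiting.
    while !done[]
        yield()
    end
    return :yielded
end

already_done = Threads.Atomic{Bool}(true)
fast = spin_then_yield(already_done)     # completes during the spin phase

pending = Threads.Atomic{Bool}(false)
setter = @async (pending[] = true)       # only gets to run once the waiter yields
slow = spin_then_yield(pending)
```

The second call shows why yielding matters for correctness: the task that sets the flag can only make progress because the waiter hands control back to the scheduler instead of blocking.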

GC under memory pressure. Because MtlArray buffers are allocated by Metal, Julia's garbage collector can't see them, and on a unified-memory Mac that means it happily lets you allocate until the system starts paging and freezes. Metal.jl now reads the memory pressure straight from Metal and triggers an incremental GC when usage gets high (above 75% normally, lower on synchronization points where the pause is hidden behind a wait anyway), rate-limited so it never spends more than a small fraction of wall-clock time collecting.
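A policy of that shape could look roughly like the sketch below. Only the 75% threshold comes from the description above; the lower synchronization threshold, the 5% time budget, and all names are assumptions, and the usage fraction is passed in by hand instead of being read from Metal.

```julia
# Hypothetical pressure-driven GC policy; only the 0.75 threshold is from the text.
mutable struct GCPolicy
    threshold::Float64        # usage fraction that triggers a collection
    sync_threshold::Float64   # lower bar at synchronization points (assumed value)
    max_gc_fraction::Float64  # cap on the wall-clock share spent collecting (assumed)
    gc_time::Float64          # seconds spent in triggered collections so far
    start::Float64            # when tracking started
end
GCPolicy() = GCPolicy(0.75, 0.50, 0.05, 0.0, time())

function maybe_collect!(p::GCPolicy, usage::Float64; at_sync::Bool=false)
    limit = at_sync ? p.sync_threshold : p.threshold
    usage < limit && return false
    # Rate limit: skip if collections already used up their share of wall-clock time.
    elapsed = time() - p.start
    p.gc_time > p.max_gc_fraction * elapsed && return false
    p.gc_time += @elapsed GC.gc(false)   # false requests an incremental collection
    return true
end

policy = GCPolicy()
low       = maybe_collect!(policy, 0.40)                 # below both thresholds
mid_async = maybe_collect!(policy, 0.60)                 # below the normal threshold
mid_sync  = maybe_collect!(policy, 0.60; at_sync=true)   # above the sync threshold
```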

Cheaper object lifetimes. The hand-rolled retain/release/finalizer bookkeeping for Metal objects was replaced with ObjectiveC.jl's automatic reference counting, removing a few hundred lines of fiddly code and simplifying per-launch bookkeeping.

Faster large copies. Shared-storage GPU→GPU copies used to always go through a CPU memcpy. For large arrays it's faster to use a GPU blit, so copies above 32 MB now switch to that path (small copies stay on memcpy, where the API overhead would dominate):

| Size | Before (CPU memcpy) | After | Speedup |
|---|---|---|---|
| 64 MB | 3.28 ms | 1.19 ms | 2.8× |
| 256 MB | 6.54 ms | 2.08 ms | 3.1× |
| 1024 MB | 21.55 ms | 5.98 ms | 3.6× |

Separately, copies larger than 4 GiB no longer silently fail; they are chunked into pieces Metal can handle.
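The two size-based decisions can be summarized in a few lines of plain Julia. This is a sketch under assumptions: the helper names are invented, 32 MB is read as 32 × 1024² bytes, and the chunk size is taken to be exactly 4 GiB, which the text does not spell out.

```julia
const BLIT_THRESHOLD = 32 * 1024^2   # switch-over point from the text (binary MB assumed)
const MAX_CHUNK      = 4 * 1024^3    # assumed per-copy limit of 4 GiB

# Which path a shared-storage GPU→GPU copy would take under this sketch.
copy_path(nbytes::Integer) = nbytes > BLIT_THRESHOLD ? :gpu_blit : :cpu_memcpy

# Split a byte count into (offset, length) pieces no larger than `chunk`.
function chunks(nbytes::Integer; chunk::Integer=MAX_CHUNK)
    pieces = Tuple{Int,Int}[]
    offset = 0
    while offset < nbytes
        len = min(chunk, nbytes - offset)
        push!(pieces, (offset, len))
        offset += len
    end
    return pieces
end

small_path = copy_path(1024^2)        # small copy stays on memcpy
large_path = copy_path(64 * 1024^2)   # large copy uses the blit path
pieces = chunks(10 * 1024^3)          # a 10 GiB copy becomes 4 GiB + 4 GiB + 2 GiB
```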

Time to first kernel. A real precompilation workload plus some despecialization brought the time to a first kernel down significantly:

$ julia -e 'using Metal; a = MtlArray([1, 2, 3]); @time a .+ 1'
0.161035 seconds (178.09 k allocations: 8.552 MiB, 52.46% compilation time: 22% of which was recompilation)

Compare that to the previous version of Metal.jl:

$ julia -e 'using Metal; a = MtlArray([1, 2, 3]); @time a .+ 1'
8.133787 seconds (33.91 M allocations: 1.636 GiB, 3.64% gc time, 99.25% compilation time: 1% of which was recompilation)

A profiler that doesn't need Xcode

Timing a single kernel with BenchmarkTools is easy enough, but understanding where time goes in a larger program used to mean reaching for Xcode's Instruments. Metal.jl 1.10 adds an in-process profiler, Metal.@profile, that captures the GPU operations Metal.jl submits and prints a summary, no Xcode required:

julia> a = Metal.rand(Float32, 1024, 1024); b = similar(a); c = similar(a);

julia> b .= a .+ 1f0; c .= sqrt.(b); Metal.synchronize();   # warm up

julia> Metal.@profile begin
           b .= a .+ 1f0
           c .= sqrt.(b)
       end
Profiled over 58.7 ms.

Host-side activity: 42 Objective-C calls taking 118.0 µs (0.20% of wall-clock)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                     │
├──────────┼────────────┼───────┼──────────────────────────────────────────┤
│    0.06% │   33.29 µs │     2 │ [MTLCommandBuffer commit]                │
│    0.04% │   22.17 µs │     2 │ [MTLCommandQueue commandBuffer]          │
│    0.04% │    22.0 µs │     2 │ [MTLCommandBuffer computeCommandEncoder] │
│     ...  │     ...    │   ... │ ...                                      │
└──────────┴────────────┴───────┴──────────────────────────────────────────┘

Device-side activity: GPU was busy 831.75 µs (1.42% of wall-clock)
┌──────────┬────────────┬───────┬───────────────────────────┬──────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution         │ Name         │
├──────────┼────────────┼───────┼───────────────────────────┼──────────────┤
│    1.42% │  831.75 µs │     2 │ 415.87 µs ± 220.62        │ broadcast_2d │
└──────────┴────────────┴───────┴───────────────────────────┴──────────────┘

The host table groups the Objective-C calls, the device table groups kernels and blits, and the slowest operations are color-highlighted. Pass trace=true for a chronological timeline (with threadgroup, occupancy and threadgroup-memory columns) instead of a summary, and use Metal.@bprofile to benchmark a snippet by running it repeatedly. The old Xcode-based capture is still there under Metal.@profile external=true. One caveat worth knowing: MPS and MPSGraph operations, including the default matmul backend, submit their own command buffers and don't show up in the integrated trace yet, so reach for the external profiler to inspect those.

Better debugging

Device-side printing. On macOS 15+, you can now print from inside a kernel, built on Apple's os_log. There's @mtlprintf, plus the friendlier @mtlprint, @mtlprintln, and @mtlshow:

julia> function device_println()
           @mtlprintln("Hello, world!")
           return
       end

julia> @metal device_println();
Hello, world!

Device-side printing is also wired up to KernelAbstractions' @print, so the same works in KA kernels.

Richer exceptions. When a kernel throws, say a bounds error, that used to surface as an opaque failure. Device exceptions are now reported back to the host as a KernelException carrying the actual cause:

julia> function oob(a)
           a[2] = 1f0   # a has length 1
           return
       end

julia> a = MtlArray(zeros(Float32, 1));

julia> @metal threads=1 oob(a)
ERROR: KernelException: A BoundsError was thrown

Launching with debug_level=2 adds a full device-side stacktrace. The detailed machinery only kicks in at the higher debug level, so the common case stays fast.

Device-side allocation. A minimal device-side malloc means kernels that need dynamic allocation (notably exception-throwing code and some broadcasts) now compile and run where they previously failed outright.

New intrinsics

The warp-level primitive set is now much more complete. Metal.jl 1.10 adds the indexed simd_shuffle/simd_shuffle_xor shuffles, the simd_ballot/simd_vote_all/simd_vote_any voting intrinsics, and the full set of quad-group (4-thread) equivalents: quad_shuffle, quad_ballot, quad_vote_all, and friends. There are also UInt16 variants of every thread- and grid-indexing intrinsic (thread_position_in_grid_i16() and so on) for when 16-bit indices are enough.

Please refer to the Metal Shading Language Specification to verify where indexing types must match for a kernel to be valid.
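For readers who have not used an xor shuffle before, the exchange pattern can be emulated on the CPU. The sketch below is not device code and does not call Metal.jl's simd_shuffle_xor; emulated_shuffle_xor and butterfly_sum are made-up helpers that model a 32-lane simdgroup as a plain vector.

```julia
# CPU emulation of a shuffle-xor exchange: lane i reads the value held by lane (i ⊻ mask).
# Lanes are numbered from 0, as on the device; Julia arrays are 1-based, hence the +1.
emulated_shuffle_xor(lanes::Vector, mask::Integer) =
    [lanes[((i - 1) ⊻ mask) + 1] for i in eachindex(lanes)]

# Butterfly all-reduce: after log2(width) exchange steps every lane holds the total.
function butterfly_sum(lanes::Vector)
    width = length(lanes)
    @assert ispow2(width)
    mask = width ÷ 2
    while mask >= 1
        lanes = lanes .+ emulated_shuffle_xor(lanes, mask)
        mask ÷= 2
    end
    return lanes
end

vals = collect(1:32)           # one value per lane of a 32-wide simdgroup
totals = butterfly_sum(vals)   # every lane ends up with sum(1:32) == 528
```

On the device the same five steps would run without touching threadgroup memory, which is the main appeal of shuffle-based reductions.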

Other improvements

Metal.jl 1.10 includes plenty more:

One breaking fix to be aware of: launching a kernel with a grid dimension larger than typemax(UInt32) used to silently truncate. It now raises an error instead, so use grid-stride loops for kernels that need to cover arrays larger than that.
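A grid-stride loop keeps the launch dimensions fixed and lets each thread process several elements. The kernel below is a minimal sketch of the pattern; the kernel name, the launch configuration and the array size are arbitrary, and the indices are widened to Int so they could address arrays beyond the UInt32 range.

```julia
using Metal

# Grid-stride loop: a fixed-size grid walks over the whole array,
# so the launch dimensions stay small no matter how long `a` is.
function add_one!(a)
    i = Int(thread_position_in_grid_1d())   # 1-based global thread index, widened to Int64
    stride = Int(threads_per_grid_1d())     # total number of threads in the launch
    while i <= length(a)
        @inbounds a[i] += 1f0
        i += stride
    end
    return
end

a = Metal.ones(Float32, 1_000_000)
@metal threads=256 groups=64 add_one!(a)    # 16,384 threads cover 1,000,000 elements
result = Array(a)                           # copy back to the host for inspection
```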

As always, update to the latest version to get these improvements, and check out the changelog for the full list.