Author Archives: Christian Guinard

Metal.jl 1.6: Initial MPSGraph Support

By: Christian Guinard

Re-posted from: https://juliagpu.org/post/2025-05-30-metal_1.6/index.html

Metal.jl adds initial support for MPSGraph, wrapping the matrix multiplication functions and resolving some matrix multiplication issues present in the previous implementation.

Initial MPSGraph support

PR #526 enabled the automatic generation of wrappers for all enums, structs, and Objective-C objects in the frameworks that Metal.jl relies upon. This made it realistic to add support for MPSGraph, Apple's MLIR-based GPU compiler interface.

To try out the new framework, the constructor and method wrappers necessary for matrix multiplication were added, and the new functionality was hooked into the LinearAlgebra interface to work around the NaN issue that could show up on M1/M2 devices.

Let's go through a simple example that performs pairwise multiplication followed by pairwise addition using MPSGraph directly:

using Metal, Random
using ObjectiveC: Foundation.NSDictionary
using Metal: encode!
using .MPS: MPSCommandBuffer
using .MPSGraphs: MPSGraph, placeholderTensor, MPSGraphTensorData, MPSGraphTensor, multiplicationWithPrimaryTensor, additionWithPrimaryTensor

T = Float32
a = Metal.rand(10);
b = Metal.rand(10);
c = Metal.rand(10);

# To compare with the MPSGraph equivalent
res = (a .* b) .+ c;

# Initialize the graph
graph = MPSGraph()

# Create placeholder tensors to be used to compile our graph
placeA = placeholderTensor(graph, size(a), T)
placeB = placeholderTensor(graph, size(b), T)
placeC = placeholderTensor(graph, size(c), T)

# Link the placeholder tensors to the data via a Dict
feeds = Dict{MPSGraphTensor, MPSGraphTensorData}(
    placeA => MPSGraphTensorData(a),
    placeB => MPSGraphTensorData(b),
    placeC => MPSGraphTensorData(c)
)

# Add multiplication to the graph
pwisemul = MPSGraphs.multiplicationWithPrimaryTensor(graph, placeA, placeB)

# Add addition to the graph
pwiseadd = MPSGraphs.additionWithPrimaryTensor(graph, pwisemul, placeC)

# Our output tensor will be our c MtlArray
resultdict = Dict{MPSGraphTensor, MPSGraphTensorData}(
    pwiseadd => feeds[placeC]
)

# Encode and run the graph
cmdbuf = MPS.MPSCommandBuffer(Metal.global_queue(device()))
MPS.encode!(cmdbuf, graph, NSDictionary(feeds), NSDictionary(resultdict))
Metal.commit!(cmdbuf)
Metal.wait_completed(cmdbuf)

# The MPSGraph result is equal to the typical way of doing things.
@assert isapprox(res, c)

Clearly, for simple operations like the example above, this is a lot of extra boilerplate without much benefit. For more complex operations, however, MPSGraph will optimize the graph and its operations before running, reducing expensive kernel launches and removing unnecessary operations.

Another exciting aspect of this new framework wrapper is that it makes long-requested functionality easier to add. One can find MPSGraph functionality not yet in Metal.jl and write wrappers using the existing ones as a starting point. If anyone is interested in helping out, feel free to open a pull request or an issue on the Metal.jl repository, and we will do our best to help you get your code merged.

Minor Changes

Metal.jl 1.6 also includes several other useful updates.

As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the changelog for a full list of changes.

Metal.jl 1.4: Improved random numbers

By: Christian Guinard

Re-posted from: https://juliagpu.org/post/2024-10-07-metal-1.4/index.html

Metal.jl 1.4 adds higher-quality random number generators from the Metal Performance Shaders library. Some limitations apply, with a fallback to the current implementation in those situations.

Metal.rand and friends

Using functionality provided by the Metal Performance Shaders (MPS) library, Metal.jl now comes with much improved GPU random number generators. Uniform distributions via Metal.rand (and its in-place variant Metal.rand!) are available for all Metal-supported integer types and Float32. However, due to Metal API limitations, 8-bit and 16-bit integers may fall back to the lower-quality GPUArrays.jl random number generator if their size in bytes is not a multiple of 4. Normally distributed Float32 values can be generated with Metal.randn and Metal.randn!, while Float16 is not supported by the MPS library and will always fall back to the GPUArrays implementation.
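As a minimal sketch of the byte-size rule above (this assumes a Metal-capable Mac; the array lengths are illustrative, not from the original post):

using Metal

# 8 Int8 elements occupy 8 bytes, a multiple of 4,
# so the high-quality MPS generator is used.
a = Metal.rand(Int8, 8)

# 6 Int8 elements occupy 6 bytes, not a multiple of 4,
# so this call falls back to the GPUArrays.jl generator.
b = Metal.rand(Int8, 6)

Both calls succeed either way; the fallback only affects the quality of the generator used.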

The easiest way to use these is to use the Metal convenience functions Metal.rand[n][!] as you would the usual functions from the Random.jl standard library:

julia> a = Metal.rand(Float32, 2)
2-element MtlVector{Float32, Metal.PrivateStorage}:
 0.95755994
 0.7110207

julia> Metal.randn!(a)
2-element MtlVector{Float32, Metal.PrivateStorage}:
 1.7230463
 0.55636907

However, the Random.jl methods can also be used by providing an appropriate RNG, obtained from either MPS.default_rng() or MPS.RNG(), to the standard Random.rand[n][!] functions:

julia> using Random

julia> rng = MPS.RNG();

julia> Random.rand(rng, 2)
2-element MtlVector{Float32, Metal.PrivateStorage}:
 0.8941469
 0.67628527

Seeding is done by calling Metal.seed! for the global RNG, or Random.seed! when working with an explicit RNG object.
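The two seeding paths described above can be sketched as follows (a hedged example assuming a Metal-capable Mac; the seed value is arbitrary):

using Metal, Random

# Seed the global RNG used by Metal.rand and friends.
Metal.seed!(1234)
x = Metal.rand(Float32, 4)

# Seed an explicit RNG object via the Random.jl interface.
rng = MPS.RNG()
Random.seed!(rng, 1234)
y = Random.rand(rng, Float32, 4)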

Other improvements since the last blog post

  • Since v0.5: MtlArray storage mode has been parameterized, allowing one to create a shared storage MtlArray by calling MtlArray{eltype, ndims, Metal.SharedStorage}(...).

  • Since v0.3: MPS-accelerated decompositions were added.

  • Various performance improvements.

  • Many bug fixes.
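The parameterized storage mode mentioned above can be sketched as follows (assuming a Metal-capable Mac; the element type and length are illustrative):

using Metal

# Default storage is private (GPU-only memory).
a = Metal.rand(Float32, 16)

# Explicitly request shared storage, accessible from both CPU and GPU.
b = MtlArray{Float32, 1, Metal.SharedStorage}(undef, 16)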

Future work

Although Metal.jl is now in v1, there is still work to be done to make it as fast and feature-complete as possible. In particular:

  • Metal.jl is now using native ObjectiveC FFI for wrapping Metal APIs. However, these wrappers have to be written manually for every piece of Objective-C code. We are looking for help with improving Clang.jl and ObjectiveC.jl to enable the automatic generation of these wrappers;

  • The MPS wrappers are incomplete; automatic wrapper generation would greatly help with full MPS support;

  • To implement a full-featured KernelAbstractions.jl back-end, Metal atomic operations need to be hooked up to Atomix;

  • Full support for BFloat16, which Metal has supported since Metal 3.1 (macOS 14), is not yet available in Metal.jl. There is, however, a draft PR in the works. Check it out if you're interested in helping out;

  • Some functionality present in CUDA.jl could be ported to Metal.jl to improve usability.