By: Staging Team
Re-posted from: https://sciml.ai/news/2026/05/26/sciml_small_grants_two_year_update/index.html
SciML Small Grants Program: Two Years In, Eight More Projects Funded and Shipped
By: Picaud Vincent
Re-posted from: https://pixorblog.wordpress.com/2026/05/05/julia-custom-serialization-with-json-jl/
The JSON3.jl package has been deprecated. That bothered me a little because I had to migrate a lot of my code to JSON.jl. Luckily, the migration turned out to be easier than I expected.
My use case is a bit special: I have to serialize my structures with type information so that I can retrieve the exact types after deserialization.
I know about BSON.jl (see also the Wikipedia article on BSON) and Julia’s Serialization standard library, but I didn’t want to use them because they produce binary files. I wanted to keep a human‑readable format.
In this note I give a minimal working example that might save you some time.
We’ll need the JSON.jl package. We also use StaticArrays.jl to show how to preserve the right vector type when deserializing an AbstractVector.
using JSON
using StaticArrays
Let’s imagine we have an abstract type Abstract_Foo and two concrete types: Foo_A and Foo_B.
abstract type Abstract_Foo end

@nonstruct struct Foo_A{V <: AbstractVector} <: Abstract_Foo
    v::V
    x::Float64
end

@nonstruct struct Foo_B <: Abstract_Foo
    v::AbstractVector
    n::Int
end
Nothing special here, except the @nonstruct macro. That macro comes from StructUtils.jl, a package used by JSON.jl to automate common struct operations (construction, etc.).
Using @nonstruct in front of a struct definition marks it as “special”. You tell JSON.jl to treat it as a primitive type that should be converted directly using lift() and lower() methods, rather than constructing it from field values. In short, you have to do all the work by hand, but you also get all the freedom to serialize and deserialize the structure however you want.
During serialization the lower() method is called. We save the field values together with any type information needed for deserialization. Personally, I store this information in a field called type that holds the type of the structure. The name type isn’t special; you could call it internal_type, but I think it’s good practice to adopt a convention and stick to it.
function StructUtils.lower(to_serialize::Foo_A)
    return (type = string(typeof(to_serialize)),
            v = to_serialize.v,
            x = to_serialize.x)
end
For Foo_B, it’s a bit more complicated: the v field is declared with the abstract type AbstractVector, so typeof(to_serialize) says nothing about the concrete vector type, and we need an extra field to save it:
function StructUtils.lower(to_serialize::Foo_B)
    return (type = string(typeof(to_serialize)),
            v_type = string(typeof(to_serialize.v)),
            v = to_serialize.v,
            n = to_serialize.n)
end
Here’s a demonstration of serialization:
a = Foo_A(@SVector(Int[1,2]),1.2)
a_json_str = JSON.json(a, pretty=true)
{
  "type": "Foo_A{SVector{2, Int64}}",
  "v": [
    1,
    2
  ],
  "x": 1.2
}
Now for Foo_B:
b = Foo_B(Float16[3,4],34)
b_json_str = JSON.json(b, pretty=true)
{
  "type": "Foo_B",
  "v_type": "Vector{Float16}",
  "v": [
    3.0,
    4.0
  ],
  "n": 34
}
To deserialize you have to define the lift() methods.
First, we intercept all Abstract_Foo occurrences and extract the concrete type. At this point the type is still a String; to turn it into a Julia DataType we use Meta.parse() and Base.eval(). Once we have the concrete type, we continue deserialization with it.
function StructUtils.lift(type::Type{<:Abstract_Foo},
                          to_deserialize)
    actual_type = Base.eval(Main,Meta.parse(to_deserialize.type))
    StructUtils.lift(actual_type,to_deserialize)
end
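The string-to-type step can be tried in isolation with Base alone. The sketch below is illustrative only: it round-trips a type through its string form, and parse_type is a hypothetical helper (not part of JSON.jl or StructUtils.jl) that rejects results that are not subtypes of what is expected. Note that eval() runs whatever the string contains before any check can happen, so this approach is only appropriate for files you trust.

```julia
# Round trip: concrete type -> String -> type, using only Base.
v = Float16[3, 4]
type_str = string(typeof(v))     # what lower() would store, e.g. in v_type

# Meta.parse turns the string into an expression; eval resolves it to the type.
recovered = Base.eval(@__MODULE__, Meta.parse(type_str))

println(type_str)
println(recovered === typeof(v))

# Hypothetical helper: check the evaluated result against an expected supertype.
# This validates the result only; the string has already been evaluated.
function parse_type(str::AbstractString, expected::Type)
    T = Base.eval(@__MODULE__, Meta.parse(str))
    (T isa Type && T <: expected) || error("unexpected type string: $str")
    return T
end

parse_type("Vector{Float16}", AbstractVector)
```

The sketch evaluates in the current module via @__MODULE__; the article's code evaluates in Main, which is where its types are defined.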
Now we define lift() for the specific concrete types. You have to be careful to define these methods for every concrete subtype; otherwise dispatch falls back to the previous function, which calls itself again with the same type, and you get an infinite recursion. It would be nice to detect this situation, but how? (Feel free to add a comment.)
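One possible way to detect the missing-method case is sketched here with Base only. Everything in it is a hypothetical stand-in: build plays the role of StructUtils.lift, and Abstract_Shape, Circle and Square play the role of Abstract_Foo and its subtypes. The idea is that if the generic method is entered with a type that already equals the type named in the data, no specialized method exists, and recursing would never terminate.

```julia
abstract type Abstract_Shape end                  # stand-in for Abstract_Foo
struct Circle <: Abstract_Shape; r::Float64; end
struct Square <: Abstract_Shape; s::Float64; end  # deliberately has no specialized method

# Generic entry point, analogous to lift(::Type{<:Abstract_Foo}, ...).
function build(type::Type{<:Abstract_Shape}, data)
    actual_type = Base.eval(@__MODULE__, Meta.parse(data.type))
    # Dispatch landed here although `type` is already the concrete type:
    # there is no specialized method, so stop instead of recursing forever.
    actual_type === type &&
        error("no specialized build method for $actual_type")
    return build(actual_type, data)
end

# Specialized method for one concrete type only.
build(::Type{Circle}, data) = Circle(data.r)

build(Abstract_Shape, (type = "Circle", r = 1.0))    # dispatches to the Circle method
# build(Abstract_Shape, (type = "Square", s = 2.0))  # raises the error above
```

Whether the same guard fits the real lift() method depends on how JSON.jl calls it, so treat this as a starting point rather than a verified fix.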
For Foo_A:
function StructUtils.lift(type::Type{<:Foo_A{V}},
                          to_deserialize) where {V<:AbstractVector}
    v = StructUtils.lift(V,to_deserialize.v) # deserialize vect.
    x = to_deserialize.x
    type(v,x)
end
For Foo_B:
function StructUtils.lift(type::Type{<:Foo_B},
                          to_deserialize)
    v_type = Base.eval(Main,Meta.parse(to_deserialize.v_type))
    v = StructUtils.lift(v_type,to_deserialize.v) # deserialize vect.
    n = to_deserialize.n
    type(v,n)
end
Now we can deserialize. Notice that we don’t need to give the exact type; Abstract_Foo is enough.
JSON.parse(a_json_str,Abstract_Foo)
Foo_A{SVector{2, Int64}}([1, 2], 1.2)
JSON.parse(b_json_str,Abstract_Foo)
Foo_B(Float16[3.0, 4.0], 34)
@kwdef and @nonstruct together
You cannot use @kwdef and @nonstruct together. The following code generates an error:
@nonstruct @kwdef struct Foo_C <: Abstract_Foo end
The solution is to do the work of @nonstruct by hand. First, look at what this macro does:
@macroexpand @nonstruct struct Foo_C <: Abstract_Foo end
quote
begin
$(Expr(:meta, :doc))
struct Foo_C <: Abstract_Foo
end
end
StructUtils.structlike(::StructUtils.StructStyle, ::Type{<:Foo_C}) = false
end
So the fix is simply to replace
@nonstruct @kwdef struct Foo_C <: Abstract_Foo end
by
@kwdef struct Foo_C <: Abstract_Foo
end
StructUtils.structlike(::StructUtils.StructStyle,
::Type{<:Foo_C}) = false
To read and write files, follow the official JSON.jl documentation; there is nothing special here:
JSON.json(file, a, pretty=true) # write file
JSON.parsefile(file, Abstract_Foo) # read file
To make your life easier, here’s the complete code:
using JSON
using StaticArrays

abstract type Abstract_Foo end

@nonstruct struct Foo_A{V <: AbstractVector} <: Abstract_Foo
    v::V
    x::Float64
end

@nonstruct struct Foo_B <: Abstract_Foo
    v::AbstractVector
    n::Int
end

function StructUtils.lower(to_serialize::Foo_A)
    return (type = string(typeof(to_serialize)),
            v = to_serialize.v,
            x = to_serialize.x)
end

function StructUtils.lower(to_serialize::Foo_B)
    return (type = string(typeof(to_serialize)),
            v_type = string(typeof(to_serialize.v)),
            v = to_serialize.v,
            n = to_serialize.n)
end

a = Foo_A(@SVector(Int[1,2]),1.2)
a_json_str = JSON.json(a, pretty=true)
println(a_json_str)

b = Foo_B(Float16[3,4],34)
b_json_str = JSON.json(b, pretty=true)
println(b_json_str)

function StructUtils.lift(type::Type{<:Abstract_Foo},
                          to_deserialize)
    actual_type = Base.eval(Main,Meta.parse(to_deserialize.type))
    StructUtils.lift(actual_type,to_deserialize)
end

function StructUtils.lift(type::Type{<:Foo_A{V}},
                          to_deserialize) where {V<:AbstractVector}
    v = StructUtils.lift(V,to_deserialize.v) # deserialize vect.
    x = to_deserialize.x
    type(v,x)
end

function StructUtils.lift(type::Type{<:Foo_B},
                          to_deserialize)
    v_type = Base.eval(Main,Meta.parse(to_deserialize.v_type))
    v = StructUtils.lift(v_type,to_deserialize.v) # deserialize vect.
    n = to_deserialize.n
    type(v,n)
end

JSON.parse(a_json_str,Abstract_Foo)
JSON.parse(b_json_str,Abstract_Foo)
There’s nothing more ridiculous than a conclusion, because nothing is ever finished. But I admit it’s still handy to say goodbye.
By: Tim Besard
Re-posted from: https://juliagpu.org/post/2026-05-05-cutile_0.3/index.html
cuTile.jl v0.3 integrates with CUDA.jl, making it even easier to write and run CUDA Tile kernels in Julia. Performance has also been greatly improved, closing the gap with cuTile Python on every benchmark we ship. Added features include a random number generator, and support for array slicing.
Three months ago, several of our benchmarks lagged cuTile Python by 5–15%. Today, cuTile.jl matches or outperforms cuTile Python on every kernel we ship. The headline numbers (RTX 5080, tileiras 13.2.51):
| Kernel | Julia | Python | Δ |
|---|---|---|---|
| Vector Addition | 845 GB/s | 846 GB/s | = |
| Matrix Transpose | 812 GB/s | 814 GB/s | = |
| Layer Norm fwd | 983 GB/s | 716 GB/s | +37% |
| Layer Norm bwd | 248 GB/s | 251 GB/s | -1% |
| Matrix Multiplication | 47.5 TFLOPS | 43.5 TFLOPS | +9% |
| Batch Matrix Multiply | 34.0 TFLOPS | 30.8 TFLOPS | +10% |
| FFT (3-stage Cooley-Tukey) | 529 μs | 554 μs | +5% |
| Mixture of Experts | 27.0 TFLOPS | 20.1 TFLOPS | +34% |
| Attention (FMHA, causal) | 103.6 TFLOPS | 63.4 TFLOPS | +63% |
| Softmax (TMA) | 849 GB/s | 857 GB/s | -1% |
| Softmax (Chunked) | 1684 GB/s | 1640 GB/s | +3% |
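The Δ column is not defined in the post. The sketch below assumes it is the relative difference of Julia over Python, rounded to the nearest percent, with the ratio inverted for the FFT row because that row reports run time (lower is better) rather than throughput. The helper name speedup_pct is hypothetical; under that assumption the table's figures are reproduced:

```julia
# Relative difference in percent, rounded to the nearest integer.
# Throughput (GB/s, TFLOPS): higher is better, so use julia / python.
# Run time (μs): lower is better, so the ratio is inverted.
speedup_pct(julia, python; lower_is_better=false) =
    round(Int, 100 * ((lower_is_better ? python / julia : julia / python) - 1))

speedup_pct(983, 716)                         # Layer Norm fwd: 37
speedup_pct(103.6, 63.4)                      # Attention (FMHA, causal): 63
speedup_pct(529, 554; lower_is_better=true)   # FFT: 5
speedup_pct(849, 857)                         # Softmax (TMA): -1
```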
Most of the gains come from extending the IR-level optimization pipeline introduced in v0.2 with a new dataflow framework that now powers several analyses and transformations.
@cuda backend=cuTile
Until v0.3, launching a cuTile kernel meant calling cuTile.launch(...) directly. cuTile.jl now plugs into CUDA.jl's existing @cuda macro as a first-class backend, making it much easier to launch cuTile.jl kernels:
using CUDA, cuTile
import cuTile as ct

function vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = CUDA.zeros(Float32, 1024)

@cuda backend=cuTile blocks=8 vadd(a, b, c)
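The blocks=8 in the launch follows from the sizes: 1024 elements split into tiles of 128 need 8 blocks. As a rough CPU analogy (plain Julia arrays, no GPU, and not how cuTile executes the kernel), the same partitioning can be sketched as follows. The helper tile_range is hypothetical, and the sketch assumes that tile indices are 1-based.

```julia
n, tile = 1024, 128
blocks = cld(n, tile)      # number of tiles needed to cover n elements

a = rand(Float32, n)
b = rand(Float32, n)
c = zeros(Float32, n)

# Assumption: tile `pid` (1-based) covers this element range.
tile_range(pid) = ((pid - 1) * tile + 1):min(pid * tile, n)

for pid in 1:blocks        # on the GPU, each iteration would be one block
    r = tile_range(pid)
    c[r] .= a[r] .+ b[r]
end
```

Using cld rather than integer division keeps the last partial tile covered when the length is not a multiple of the tile size.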
Compiling a cuTile kernel goes through several stages: Julia type inference, our IR rewriting passes, Tile IR bytecode emission, and finally tileiras-driven CUBIN generation. None of these are fast. Significant effort in v0.3 went into reducing the time-to-first-launch, and the latency is now comparable to a typical CUDA.jl kernel launch on the same hardware:
Benchmark 1: julia -e 'using CUDACore;
@cuda identity(nothing)'
Time (mean ± σ): 1.882 s ± 0.012 s [User: 2.554 s, System: 0.305 s]
Range (min … max): 1.867 s … 1.906 s 10 runs
Benchmark 2: julia -e 'using CUDACore, cuTile;
@cuda backend=cuTile identity(nothing)'
Time (mean ± σ): 1.840 s ± 0.009 s [User: 2.488 s, System: 0.329 s]
Range (min … max): 1.827 s … 1.859 s 10 runs
view and @view now derive sub-range TileArrays from existing ones:
function copy_rows!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2},
                    i::Int32, j::Int32)
    sub = @view A[i:j, :] # sub-range TileArray
    t = ct.load(sub; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_rows!(A, B, Int32(3), Int32(10))
Each index must be : or a UnitRange; other forms (StepRange, scalar indexes, CartesianIndex, …) are currently rejected at compile time. The result is itself a TileArray, and can be passed to ct.load / ct.store (or sliced again, for nested views). The new divisibility analysis sees through the slicing chain so contiguous-axis fast paths are preserved, while literal slice sizes fold to compile-time-constant shape operands.
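For readers less familiar with views, the host-side behaviour of Base's view on ordinary arrays gives a feel for the semantics described above. This is only a CPU analogy with plain Array values; TileArray views live on the device and may differ in detail.

```julia
A = reshape(collect(Float32, 1:64), 16, 4)   # 16×4 matrix, column-major

sub = @view A[3:10, :]          # UnitRange and Colon: rows 3 to 10, all columns
nested = @view sub[2:3, :]      # a view of a view: rows 4 to 5 of A

println(size(sub))              # (8, 4)
println(nested[1, 1] == A[4, 1])

sub[1, 1] = -1f0                # views share memory with the parent array
println(A[3, 1])

# The two index forms behave differently at the type level:
println((3:10) isa UnitRange)   # contiguous range
println((1:2:9) isa UnitRange)  # StepRange, the kind of index cuTile rejects
```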
cuTile.jl now ships a tile-vectorized Philox2x32-7 RNG, both as in-kernel intrinsics and as a host-side cuTile.RNG handle for filling CuArrays. The kernel API mirrors Base.Random:
function noise!(out::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    t = randn(Float32, (128,)) # in-kernel randn
    ct.store(out; index=pid, tile=t)
    return
end

@cuda backend=cuTile blocks=cld(N, 128) noise!(A)
rand covers all of Int{8,16,32,64}, UInt{8,16,32,64}, Float16, BFloat16, Float32, and Float64; randn (via Box-Muller, sharing its uniforms with the existing rand path) and randexp (via -log(U)) cover the four floating-point types. ct.DeviceRNG() opens an independent stream inside a kernel; Random.seed! re-seeds.
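The two transforms named above are standard and can be illustrated on the CPU. The sketch below is not cuTile's implementation: it uses Julia's MersenneTwister instead of Philox, scalar code instead of tiles, and the hypothetical helper names box_muller and exp_from_uniform.

```julia
using Random

# Box-Muller: two independent uniforms -> one standard normal sample.
box_muller(u1, u2) = sqrt(-2 * log(u1)) * cos(2π * u2)

# Inverse-CDF sampling for the unit exponential distribution: -log(U).
exp_from_uniform(u) = -log(u)

rng = MersenneTwister(42)
n = 100_000

# rand returns values in [0, 1); 1 - u lies in (0, 1], which avoids log(0).
normals = [box_muller(1 - rand(rng), rand(rng)) for _ in 1:n]
exps    = [exp_from_uniform(1 - rand(rng)) for _ in 1:n]

mean_normal = sum(normals) / n                            # close to 0
var_normal  = sum(abs2, normals .- mean_normal) / (n - 1) # close to 1
mean_exp    = sum(exps) / n                               # close to 1

println((mean_normal, var_normal, mean_exp))
```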
The host-side cuTile.RNG integrates with Random.rand! / Random.randn! / Random.randexp! and auto-advances its counter, so consecutive fills produce disjoint streams:
A = CUDACore.zeros(Float32, 1 << 20)
rng = ct.RNG(42)
randn!(rng, A) # fill via fused tile kernel
B = rand(rng, Float64, 16) # out-of-place
Performance of both the in-kernel and host-side APIs is excellent, matching or exceeding the performance of cuRAND and GPUArrays.jl's new generator.
If you've been watching cuTile.jl from a distance, now's a good time to try it out: add cuTile from the Julia REPL, or grab the examples to see how the moving parts fit together.
A joint webinar is scheduled for May 12, 2026 at 1 PM ET, in which Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl, covering the design of CUDA Tile, how cuTile.jl is built, and several relevant examples. The sign-up link is in the original post.