Julia, custom serialization with JSON.jl

By: Picaud Vincent

Re-posted from: https://pixorblog.wordpress.com/2026/05/05/julia-custom-serialization-with-json-jl/

Introduction

The GitHub:JSON3.jl package has been deprecated. That bothered me a little because I had to migrate a lot of my code to use GitHub:JSON.jl. Luckily, the migration turned out to be easier than I expected.

My use case is a bit special: I have to serialize my structures with type information so that I can retrieve the exact types after deserialization.

I know about GitHub:BSON.jl (see also Wiki:BSON) and Julia:Serialization, but I didn’t want to use them because they produce binary files. I wanted to keep a human‑readable format.

In this note I give a minimal working example that might save you some time.

Code

We’ll need the JSON.jl package. We also use StaticArrays.jl to show how to preserve the right vector type when deserializing an AbstractVector.

using JSON
using StaticArrays 

Let’s imagine we have an abstract type Abstract_Foo and two concrete types: Foo_A and Foo_B.

abstract type Abstract_Foo end

@nonstruct struct Foo_A{V <: AbstractVector}  <: Abstract_Foo
    v::V
    x::Float64
end

@nonstruct struct Foo_B <: Abstract_Foo
    v::AbstractVector
    n::Int
end 

Nothing special here, except the @nonstruct macro. That macro comes from GitHub:StructUtils.jl, a package used by JSON.jl to automate common struct operations (construction, etc.).

Placing Doc:@nonstruct in front of a struct definition marks it as “special”: it tells JSON.jl to treat the struct as a primitive type that is converted directly through lift() and lower() methods, rather than constructed from field values. In short, you have to do all the work by hand, but you also get complete freedom to serialize and deserialize the structure however you want.

Serialization

During serialization the lower() method is called. We save the field values but also any type information needed for deserialization. Personally, I store this information in a field called type that holds the type of the structure. The name type isn’t special; you could call it internal_type, but I think it’s good practice to adopt a convention and stick to it.

function StructUtils.lower(to_serialize::Foo_A)

    return (type = string(typeof(to_serialize)),
            v = to_serialize.v,
            x = to_serialize.x)
end

For Foo_B, it’s a bit more complicated because the v field is an AbstractVector type, so we need an extra field to save the type information:

function StructUtils.lower(to_serialize::Foo_B)

    return (type = string(typeof(to_serialize)),
            v_type = string(typeof(to_serialize.v)),
            v = to_serialize.v,
            n = to_serialize.n)
end

Demonstration

Here’s a demonstration of serialization:

a = Foo_A(@SVector(Int[1,2]),1.2)

a_json_str = JSON.json(a, pretty=true)
{
  "type": "Foo_A{SVector{2, Int64}}",
  "v": [
    1,
    2
  ],
  "x": 1.2
}

Now for Foo_B

b = Foo_B(Float16[3,4],34)

b_json_str = JSON.json(b, pretty=true)
{
  "type": "Foo_B",
  "v_type": "Vector{Float16}",
  "v": [
    3.0,
    4.0
  ],
  "n": 34
}

Deserialization

To deserialize you have to define the lift() methods.

First, we intercept all Abstract_Foo occurrences and extract the concrete type. At this point the type is a String; to turn it into a Julia DataType we use Meta.parse() and Base.eval(). Once we have that concrete type, we continue deserialization with it.

function StructUtils.lift(type::Type{<:Abstract_Foo},
                          to_deserialize)

    actual_type = Base.eval(Main,Meta.parse(to_deserialize.type))
    StructUtils.lift(actual_type,to_deserialize)
end
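As a standalone illustration of this string-to-type round trip (using only Base types, independent of JSON.jl):

```julia
# string(T) and Meta.parse + Base.eval are inverse operations, provided
# the type name is resolvable in the module passed to eval (here Main):
t = Vector{Float16}
s = string(t)                        # "Vector{Float16}"
t2 = Base.eval(Main, Meta.parse(s))
t2 === t                             # true
```

Note that eval() resolves the name in the given module, so the type must be defined or imported there, and parsing arbitrary strings from untrusted files is effectively code execution.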

Now we redefine lift() for the specific concrete types. You have to be careful to define these new methods for all possible specializations; otherwise you’ll get infinite recursion through the previous function. It would be nice to detect this situation, but how? (feel free to add a comment 🙂 )
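One possible detection scheme, sketched here with plain functions standing in for StructUtils.lift() so it runs without any packages: if the type recovered from the payload is the same type the fallback was called with, dispatch made no progress, meaning no specialized method exists, and we can raise a clear error instead of overflowing the stack.

```julia
abstract type AF end
struct A1 <: AF end   # has a specialized lift() below
struct A2 <: AF end   # deliberately has none

lift(::Type{A1}, payload) = A1()

# Generic fallback: recover the concrete type from the payload. If it is
# the very type we were called with, no more-specific method exists and we
# would recurse forever, so error out with a clear message instead.
function lift(T::Type{<:AF}, payload)
    actual = Base.eval(Main, Meta.parse(payload.type))
    actual === T && error("no specialized lift() method for $actual")
    lift(actual, payload)
end

lift(AF, (type = "A1",))        # dispatches to the specialized method
try
    lift(AF, (type = "A2",))    # would recurse forever without the guard
catch e
    println(e)                  # clear error instead of a StackOverflowError
end
```

This is only a sketch under the assumption that the fallback is always entered with an abstract type first; adapt the comparison as needed for your own lift() hierarchy.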

For Foo_A:

function StructUtils.lift(type::Type{<:Foo_A{V}},
                          to_deserialize) where {V<:AbstractVector}

    v = StructUtils.lift(V,to_deserialize.v) # deserialize vect.
    x = to_deserialize.x

    type(v,x)
end 

For Foo_B:

function StructUtils.lift(type::Type{<:Foo_B},
                          to_deserialize)

    v_type = Base.eval(Main,Meta.parse(to_deserialize.v_type))
    v = StructUtils.lift(v_type,to_deserialize.v) # deserialize vect.
    n = to_deserialize.n

    type(v,n)
end 

Demonstration

Notice that we don’t need to give the exact type: Abstract_Foo is enough.

JSON.parse(a_json_str,Abstract_Foo)
Foo_A{SVector{2, Int64}}([1, 2], 1.2)
JSON.parse(b_json_str,Abstract_Foo)
Foo_B(Float16[3.0, 4.0], 34)

Remarks

@kwdef and @nonstruct together

You cannot use @kwdef and @nonstruct together. The following code generates an error:

@nonstruct @kwdef struct Foo_C <: Abstract_Foo
end

The solution is to do the work of @nonstruct by hand. First, look at what this macro does:

@macroexpand @nonstruct  struct Foo_C <: Abstract_Foo
end
quote
    begin
        $(Expr(:meta, :doc))
        struct Foo_C <: Abstract_Foo
        end
    end
    StructUtils.structlike(::StructUtils.StructStyle, ::Type{<:Foo_C}) = false
end

So the fix is simply to replace

@nonstruct @kwdef struct Foo_C <: Abstract_Foo
end

by

@kwdef struct Foo_C <: Abstract_Foo
end

StructUtils.structlike(::StructUtils.StructStyle,
                       ::Type{<:Foo_C}) = false

Writing / reading file

Just follow the official JSON.jl documentation; nothing special here:

JSON.json(file, a, pretty=true)      # write file
JSON.parsefile(file, Abstract_Foo)   # read file

Complete code

To make your life easier, here’s the complete code:

using JSON
using StaticArrays

abstract type Abstract_Foo end

@nonstruct struct Foo_A{V <: AbstractVector}  <: Abstract_Foo
    v::V
    x::Float64
end

@nonstruct struct Foo_B <: Abstract_Foo
    v::AbstractVector
    n::Int
end

function StructUtils.lower(to_serialize::Foo_A)

    return (type = string(typeof(to_serialize)),
            v = to_serialize.v,
            x = to_serialize.x)
end

function StructUtils.lower(to_serialize::Foo_B)

    return (type = string(typeof(to_serialize)),
            v_type = string(typeof(to_serialize.v)),
            v = to_serialize.v,
            n = to_serialize.n)
end

a = Foo_A(@SVector(Int[1,2]),1.2)

a_json_str = JSON.json(a, pretty=true)

println(a_json_str)

b = Foo_B(Float16[3,4],34)

b_json_str = JSON.json(b, pretty=true)

println(b_json_str)

function StructUtils.lift(type::Type{<:Abstract_Foo},
                          to_deserialize)

    actual_type = Base.eval(Main,Meta.parse(to_deserialize.type))
    StructUtils.lift(actual_type,to_deserialize)
end

function StructUtils.lift(type::Type{<:Foo_A{V}},
                          to_deserialize) where {V<:AbstractVector}

    v = StructUtils.lift(V,to_deserialize.v) # deserialize vect.
    x = to_deserialize.x

    type(v,x)
end

function StructUtils.lift(type::Type{<:Foo_B},
                          to_deserialize)

    v_type = Base.eval(Main,Meta.parse(to_deserialize.v_type))
    v = StructUtils.lift(v_type,to_deserialize.v) # deserialize vect.
    n = to_deserialize.n

    type(v,n)
end

JSON.parse(a_json_str,Abstract_Foo)

JSON.parse(b_json_str,Abstract_Foo)

Conclusion

There’s nothing more ridiculous than a conclusion, because nothing is ever finished. But I admit it’s still handy to say goodbye 🙂

cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2026-05-05-cutile_0.3/index.html

cuTile.jl v0.3 integrates with CUDA.jl, making it even easier to write and run CUDA Tile kernels in Julia. Performance has also been greatly improved, closing the gap with cuTile Python on every benchmark we ship. New features include a random number generator and support for array slicing.

Performance: matching cuTile Python

Three months ago, several of our benchmarks lagged cuTile Python by 5–15%. Today, cuTile.jl matches or outperforms cuTile Python on every kernel we ship. The headline numbers (RTX 5080, tileiras 13.2.51):

Kernel                      Julia         Python        Δ
Vector Addition             845 GB/s      846 GB/s      =
Matrix Transpose            812 GB/s      814 GB/s      =
Layer Norm fwd              983 GB/s      716 GB/s      +37%
Layer Norm bwd              248 GB/s      251 GB/s      -1%
Matrix Multiplication       47.5 TFLOPS   43.5 TFLOPS   +9%
Batch Matrix Multiply       34.0 TFLOPS   30.8 TFLOPS   +10%
FFT (3-stage Cooley-Tukey)  529 μs        554 μs        +5%
Mixture of Experts          27.0 TFLOPS   20.1 TFLOPS   +34%
Attention (FMHA, causal)    103.6 TFLOPS  63.4 TFLOPS   +63%
Softmax (TMA)               849 GB/s      857 GB/s      -1%
Softmax (Chunked)           1684 GB/s     1640 GB/s     +3%

Most of the gains come from extending the IR-level optimization pipeline introduced in v0.2 with a new dataflow framework that now powers several analyses and transformations.

CUDA.jl integration: @cuda backend=cuTile

Until v0.3, launching a cuTile kernel meant calling cuTile.launch(...) directly. cuTile.jl now plugs into CUDA.jl's existing @cuda macro as a first-class backend, making it much easier to launch cuTile.jl kernels:

using CUDA, cuTile
import cuTile as ct

function vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = CUDA.zeros(Float32, 1024)

@cuda backend=cuTile blocks=8 vadd(a, b, c)

Time-to-first-launch

Compiling a cuTile kernel goes through several stages: Julia type inference, our IR rewriting passes, Tile IR bytecode emission, and finally tileiras-driven CUBIN generation. None of these are fast. Significant effort in v0.3 went into reducing the time-to-first-launch, and the latency is now comparable to a typical CUDA.jl kernel launch on the same hardware:

Benchmark 1: julia -e 'using CUDACore;
                       @cuda identity(nothing)'
  Time (mean ± σ):      1.882 s ±  0.012 s    [User: 2.554 s, System: 0.305 s]
  Range (min … max):    1.867 s …  1.906 s    10 runs

Benchmark 2: julia -e 'using CUDACore, cuTile;
                       @cuda backend=cuTile identity(nothing)'
  Time (mean ± σ):      1.840 s ±  0.009 s    [User: 2.488 s, System: 0.329 s]
  Range (min … max):    1.827 s …  1.859 s    10 runs

Array slicing

view and @view now derive sub-range TileArrays from existing ones:

function copy_rows!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2},
                    i::Int32, j::Int32)
    sub = @view A[i:j, :]                         # sub-range TileArray
    t = ct.load(sub; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_rows!(A, B, Int32(3), Int32(10))

Each index must be : or a UnitRange; other forms (StepRange, scalar indexes, CartesianIndex, …) are currently rejected at compile time. The result is itself a TileArray, and can be passed to ct.load / ct.store (or sliced again, for nested views). The new divisibility analysis sees through the slicing chain so contiguous-axis fast paths are preserved, while literal slice sizes fold to compile-time-constant shape operands.

Random number generation

cuTile.jl now ships a tile-vectorized Philox2x32-7 RNG, both as in-kernel intrinsics and as a host-side cuTile.RNG handle for filling CuArrays. The kernel API mirrors Base.Random:

function noise!(out::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    t = randn(Float32, (128,))                 # in-kernel randn
    ct.store(out; index=pid, tile=t)
    return
end

@cuda backend=cuTile blocks=cld(N, 128) noise!(A)

rand covers all of Int{8,16,32,64}, UInt{8,16,32,64}, Float16, BFloat16, Float32, and Float64; randn (via Box-Muller, sharing its uniforms with the existing rand path) and randexp (via -log(U)) cover the four floating-point types. ct.DeviceRNG() opens an independent stream inside a kernel; Random.seed! re-seeds.

The host-side cuTile.RNG integrates with Random.rand! / Random.randn! / Random.randexp! and auto-advances its counter, so consecutive fills produce disjoint streams:

A = CUDACore.zeros(Float32, 1 << 20)
rng = ct.RNG(42)
randn!(rng, A)                                 # fill via fused tile kernel
B = rand(rng, Float64, 16)                     # out-of-place

Performance of both the in-kernel and host-side APIs is excellent, matching or exceeding the performance of cuRAND and GPUArrays.jl's new generator.

What's next

If you’ve been watching cuTile.jl from a distance, now’s a good time to try it out: add cuTile from the Julia REPL, or grab the examples to see how the moving parts fit together.

Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar on May 12, 2026 at 1 PM ET, covering the design of CUDA Tile, how cuTile.jl is built, and several relevant examples. Click here to sign up.

TestItems – Modern Julia testing

By: Abel Soares Siqueira

Re-posted from: https://abelsiqueira.com/blog/2026-04-25-testitems-modern-testing-for-julia/

This post is the written companion for my video on TestItems.
Liking and subscribing there is appreciated.

TestItems is a modern testing framework for Julia that supports parallel testing, test isolation, setup steps, and filtering.
It has a nice VSCode integration, and through TestItemRunner it can be used with Revise to automatically rerun tests, or by AI agents via julia-mcp to improve iteration speed.
The official website is https://julia-testitems.org/.

This post is aimed at anyone developing packages in Julia.