Author Archives: Tim Besard

Profiling oneAPI.jl applications with VTune

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2023-07-19-oneapi_profiling/index.html

Profiling GPU applications is hard, so this post shows how to use Intel's VTune Profiler to profile GPU applications written in Julia with oneAPI.jl.

Because of the asynchronous nature of GPU execution, profiling GPU applications with Julia's tried and tested tools like @profile or even @time can be misleading: They will only show the time spent on the CPU, and will likely report that your application is spending most of its time waiting for the GPU.
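For example, compare an unsynchronized and a synchronized measurement (a minimal sketch; the array size is arbitrary):

using oneAPI

a = oneArray(rand(Float32, 4096, 4096))
b = oneArray(rand(Float32, 4096, 4096))

a .* b  # warm-up, to exclude compilation time

# only times the asynchronous kernel launch:
@time a .* b

# also accounts for the time the GPU spends executing it:
@time oneAPI.@sync a .* b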

To get a better understanding of what is happening on the GPU, we need specialized tools. In this post, we'll show how to use Intel's VTune Profiler to profile GPU applications written in Julia using oneAPI.jl.

Set-up

Start by downloading and installing the Intel VTune Profiler. This does not require administrative permissions, and will install in your home folder under the intel directory. On Linux, binaries will appear in ~/intel/oneapi/vtune/latest/bin64. There are three that are particularly important:

  • vtune: a command-line tool to profile applications;

  • vtune-gui: a graphical user interface to profile applications, or to visualize the results of a command-line profiling session;

  • vtune-backend: a daemon that creates a web interface for VTune, which you can use to profile applications both locally and remotely.
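To make these tools available from your shell, you can add that directory to your PATH (a sketch for Linux, assuming the default install location; sourcing Intel's setvars.sh script from the install root achieves the same):

$ export PATH=$HOME/intel/oneapi/vtune/latest/bin64:$PATH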

Hello VTune!

Let's start with a simple example: A Julia program that computes the sum of two arrays (i.e., the vadd example from the oneAPI repository):

using oneAPI

function kernel(a, b, c)
    i = get_global_id()
    @inbounds c[i] = a[i] + b[i]
    return
end

function vadd(a, b)
    d_a = oneArray(a)
    d_b = oneArray(b)
    d_c = similar(d_a)

    @oneapi items=size(d_c) kernel(d_a, d_b, d_c)
    Array(d_c)
end

function main(N=256)
    a = round.(rand(Float32, N) * 100)
    b = round.(rand(Float32, N) * 100)
    c = vadd(a, b)
end
main()

We've tweaked this example to make it better suited for profiling: we've enclosed the main application in a function so that it gets compiled, and we've increased the array sizes to make the GPU work harder.

There are several ways to profile this application. We'll start by demonstrating the command-line interface:

$ vtune -collect gpu-offload julia vadd.jl
vtune: Collection started.
vtune: Collection stopped.
vtune: Using result path `/home/tim/Julia/pkg/oneAPI/r000gh'
    GPU Time: 0.002s
EU Array Stalled/Idle: 100.0% of Elapsed time with GPU busy
 | The percentage of time when the EUs were stalled or idle is high, which has a
 | negative impact on compute-bound applications.
FPU Utilization: 0.0% of Elapsed time with GPU busy
...

This will run the application, and collect a number of GPU-related metrics. A summary is shown in the terminal, and a more detailed report will be written to a directory in the current working directory. You can open that report with the graphical user interface, possibly even on a different machine:

$ vtune-gui r000gh

Instrumenting the application

The trace we just collected includes the time spent compiling our application, making it difficult to analyze what is happening. To refine the trace, we can instrument our application with Intel's Instrumentation and Tracing Technology (ITT) APIs:

  • only start the profiler when we're running code of interest;

  • add markers to the trace to indicate what is happening.

We can interface with the ITT APIs using the IntelITT.jl package. Let's update our example:

using oneAPI, IntelITT

# same as before

function main(N=256)
    a = round.(rand(Float32, N) * 100)
    b = round.(rand(Float32, N) * 100)
    c = IntelITT.@task "vadd" oneAPI.@sync vadd(a, b)
end

# warm-up
main()

# actual profile
IntelITT.@collect main()

Here, the IntelITT.@collect macro will start and stop the collection, so we should launch VTune with the -start-paused option:

$ vtune -collect gpu-offload -start-paused julia vadd.jl

In the GUI, we can now clearly see a nicely packed stream of API calls, grouped under the vadd task we added. Note that because API calls are asynchronous, i.e. they return immediately, before the GPU has executed them, we grouped them under a oneAPI.@sync call so that the task captures not only the time spent on the CPU, but also the time spent on the GPU. This may not be what you want for your application.

VTune timeline

Kernel details

The timeline view is great for getting an application-level overview of what is happening, but once you've isolated a kernel that doesn't perform as expected, you may want to switch from the GPU Offload to the GPU Compute Hotspots analysis. Here, you get a more detailed view of what's happening during execution on the GPU, including the memory bandwidth and execution properties:

$ vtune -collect gpu-hotspots -start-paused julia vadd.jl

VTune timeline

Many of these analyses can be configured to collect more or less data, at the cost of higher or lower overhead.
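This configuration is exposed on the command line through knobs. For example (a sketch; knob names vary between analyses and VTune versions, so check vtune -help collect gpu-hotspots for what your installation supports):

$ vtune -collect gpu-hotspots -knob sampling-interval=0.1 -start-paused julia vadd.jl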

Working remotely

In many cases, your local system will not have a GPU, and you will want to profile an application running on a remote system. As shown above, you can use the vtune CLI to create a trace and open it locally using vtune-gui. However, there is an easier way: the vtune-backend daemon.

Start by launching the VTune back-end on the remote system:

$ vtune-backend --enable-server-profiling --web-port 8443 --log-to-console

If your remote system is directly reachable, you can add --allow-remote-access --base-url "https://remoteServer:8443". However, most people will need to set up an SSH tunnel:

$ ssh -L 8443:localhost:8443 remoteServer

You can now access the VTune GUI at https://localhost:8443/. Note that the first time you connect, you will need to do so using the one-time URL that is shown in the terminal where you launched the vtune-backend daemon.

The web interface that vtune-backend provides is identical to the GUI from vtune-gui. Start by creating a new project and configuring an analysis: select the local VTune profiler server, enter the path to the Julia executable along with arguments and a working directory, and select the GPU Offload analysis type:

VTune WebUI

To start the analysis, click the big blue play button. If you use IntelITT.@collect to restrict the trace to the code of interest, use the second button with the pause symbol.

Give it a try!

Hopefully, this guide has shed some light on how to accurately profile oneAPI.jl applications using Intel's VTune Profiler. It turns out that one package could significantly benefit from some rigorous profiling: oneAPI.jl! Until now, development has focused on correctness and usability, leaving considerable room for performance enhancements.

If you have access to an Intel GPU and want to gain experience profiling GPU applications with VTune, we encourage you to get involved! A good starting point would be analyzing some of oneAPI.jl's array operations like mapreduce or broadcast to identify potential bottlenecks. For more information or any queries, feel free to open an issue on GitHub, or join the discussion on Slack or Discourse. Your help could make a significant difference!
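As a concrete starting point, the instrumentation pattern shown earlier applies directly to these operations (a minimal sketch; the array size is arbitrary):

using oneAPI, IntelITT

a = oneArray(rand(Float32, 1024, 1024))

# warm-up, so that compilation doesn't pollute the trace
sum(a)

# actual profile, run under e.g. `vtune -collect gpu-hotspots -start-paused`
IntelITT.@collect oneAPI.@sync sum(a)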

Metal.jl 0.2: Metal Performance Shaders

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2023-03-03-metal_0.2/index.html

Metal.jl 0.2 marks a significant milestone in the development of the Metal.jl package. The release comes with initial support for the Metal Performance Shaders (MPS) framework for accelerating common operations like matrix multiplications, as well as various improvements for writing Metal kernels in Julia.

Metal Performance Shaders

Quoting the Apple documentation: "The Metal Performance Shaders (MPS) framework contains a collection of highly optimized compute and graphics shaders for use in Metal applications." With Metal.jl 0.2, we have added initial support for this framework, and used it to accelerate the matrix multiplication operation:

julia> using Metal, BenchmarkTools, LinearAlgebra

julia> n = p = m = 2048;

julia> flops = n*m*(2p-1)
17175674880

julia> a = MtlArray(rand(Float32, n, p));

julia> b = MtlArray(rand(Float32, p, m));

julia> c = MtlArray(zeros(Float32, n, m));

julia> bench = @benchmark Metal.@sync mul!(c, a, b)
BenchmarkTools.Trial: 518 samples with 1 evaluation.
 Range (min … max):  9.366 ms …  13.354 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.629 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.646 ms ± 192.169 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▃▂▅▅▆▆▆▇█▇▇▆▅▄▄▁▁ ▁
  ▄▁▄▄▄▄▆▆▆▄▄▁▇█████████████████▄█▄▁▆▁▄▁▆▁▇▁▄▄▁▁▄▄▇▁▄▆▄▁▁▁▁▁▄ █
  9.37 ms      Histogram: log(frequency) by time      10.1 ms <

 Memory estimate: 352 bytes, allocs estimate: 12.

julia> flops / (minimum(bench.times)/1e9)
1.83e12

The benchmark above shows that, on an 8-core M1 Pro, matrix multiplication now reaches 1.8 TFLOPS (out of the 2.6 TFLOPS of theoretical peak performance). The accelerated matrix multiplication is available for a variety of input types, including mixed-mode operations, and as shown above is integrated with the LinearAlgebra.jl mul! interface.
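For example, a mixed-mode multiplication could look as follows (a sketch; we assume here that Float16 inputs with a Float32 output are among the supported combinations):

using Metal, LinearAlgebra

a = MtlArray(rand(Float16, 1024, 1024))
b = MtlArray(rand(Float16, 1024, 1024))
c = MtlArray(zeros(Float32, 1024, 1024))

# Float16 inputs accumulated into a Float32 output, dispatched to MPS
mul!(c, a, b)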

Of course, the MPS framework offers more than just matrix multiplication, and we expect to support more of it in the future. If you have a specific operation you would like to use from Julia, please let us know by opening an issue on the Metal.jl repository.

GPU profiling support

To support the development of Metal kernels, Max Hawkins has added support for GPU profiling. Similar to how this works in CUDA.jl, you can run code under the Metal.@profile macro to record its execution. However, this first requires setting the METAL_CAPTURE_ENABLED environment variable before importing Metal.jl:

julia> ENV["METAL_CAPTURE_ENABLED"] = 1;

julia> using Metal

julia> a = mtl(rand(1024, 1024));

julia> Metal.@profile sum(a);
[ Info: GPU frame capture saved to jl_metal.gputrace/

The resulting capture can be opened with Xcode, presenting a timeline that's similar to other profilers:

XCode viewing a Metal.jl capture trace
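The trace can also be opened from a shell on the same machine (assuming Xcode is installed, which makes it the default handler for .gputrace bundles):

$ open jl_metal.gputrace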

Other improvements

  • Julia 1.9 is supported, but requires an up-to-date macOS version (issues have been encountered on macOS 12.4);

  • An mtl function has been added for converting Julia arrays to Metal arrays, similar to the cu function in CUDA.jl;

  • Multiple GPUs are supported, and the device! function can be used to select one (see the sketch after this list);

  • Coverage of SIMD Group functions has been improved, so it is now possible to use simdgroup_load, simdgroup_store, simdgroup_multiply, and simdgroup_multiply_accumulate in kernel functions.
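For instance, device selection might look as follows (a sketch; we assume here that Metal.devices() lists the available GPUs, alongside the device! function mentioned above):

using Metal

# enumerate the available GPUs, and select the second one
gpus = Metal.devices()
Metal.device!(gpus[2])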

Future work

Although Metal.jl is now usable for a variety of applications, there is still work to be done before it can be considered production-ready. In particular:

  • there are known performance issues with mapreduce, and with other operations that rely on CartesianIndices;

  • the libcmt wrapper library for interfacing with the Metal APIs is cumbersome to use and improve, and we are looking into native Objective-C FFI instead;

  • the MPS wrappers are incomplete, and like the Metal API wrappers, require a replacement of libcmt to be improved;

  • support for atomic operations is missing, which is required to implement a full-featured KernelAbstractions.jl back-end.

Once (most of) these issues are addressed, we should be able to release Metal.jl 1.0.

oneAPI.jl 1.0: oneMKL, Intel Arc and Julia 1.9

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2023-02-08-oneapi_1.0/index.html

The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel Library (oneMKL) to accelerate linear algebra operations on Intel GPUs. It also brings support for Julia 1.9 and Intel Arc GPUs.

oneMKL integration

oneAPI.jl now uses the Intel oneAPI Math Kernel Library (oneMKL), automatically downloaded as part of oneAPI_Support_jll.jl, to accelerate a great number of BLAS and LAPACK operations on Intel GPUs. Similar to how it is implemented in our other GPU back-ends, these wrappers are available at different levels of abstraction.

At the lowest level, we use a C library that wraps the oneMKL C++ APIs. For example, the oneapi::mkl::blas::column_major::gemm function for matrix-matrix multiplication is wrapped by the C functions onemklSgemm, onemklDgemm, etc. These wrappers are used to implement low-level methods like oneMKL.gemm!:

julia> using oneAPI

julia> A = oneArray(rand(Float32, 2, 3))
2×3 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.44302   0.125576  0.859145
 0.674291  0.428346  0.0400119

julia> B = oneArray(rand(Float32, 3, 4))
3×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.592748   0.529413   0.0323396  0.659528
 0.22489    0.0872259  0.253291   0.376519
 0.0121506  0.591135   0.706755   0.751686

julia> C = similar(B, (2, 4));

julia> oneMKL.gemm!('N', 'N', true, A, B, true, C)
2×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.301279  0.753365  0.65334   0.985274
 0.496501  0.417994  0.158581  0.63607

julia> Array(C) ≈ Array(A) * Array(B)
true

Of course, these low-level functions aren't very user-friendly, so we also integrate with Julia's standard libraries where possible:

julia> A = oneArray(rand(Float32, 2, 3));

julia> B = oneArray(rand(Float32, 3, 4));

julia> using LinearAlgebra

julia> C = A * B;

julia> Array(C) ≈ Array(A) * Array(B)
true

The most frequently used oneMKL BLAS functions have been wrapped and integrated with Julia’s standard linear algebra libraries. If you run into a missing function, please file a request to add it, or take a look at the source and contribute to oneAPI.jl! The current state of the wrappers should make it easy to extend their functionality, as well as form a good basis for integrating with other libraries like oneDNN.

Intel Arc support

The new Arc series of discrete Intel GPUs are now fully supported by oneAPI.jl. These GPUs offer a significant performance improvement over their integrated predecessors:

julia> using oneAPI

julia> oneAPI.versioninfo()
1 device:
- Intel(R) Arc(TM) A770 Graphics [0x56a0]

julia> T = Float32;

julia> n = p = m = 2048;

julia> a = oneArray(rand(T, n, p));

julia> b = oneArray(rand(T, p, m));

julia> c = oneArray(zeros(T, n, m));

julia> using BenchmarkTools, LinearAlgebra

julia> bench = @benchmark oneAPI.@sync mul!(c, a, b)
BenchmarkTools.Trial: 1510 samples with 1 evaluation.
 Range (min … max):  3.233 ms …  3.791 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.298 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.308 ms ± 48.426 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▄▇█▅▄▃▂   ▁▁▁
  ▁▁▃▃▅▇██████████████████▇▇▇▅▆▄▅▅▄▂▃▂▂▂▂▂▂▁▂▂▂▁▂▁▂▁▂▂▂▂▁▁▂▂ ▃
  3.23 ms        Histogram: frequency by time        3.47 ms <

 Memory estimate: 272 bytes, allocs estimate: 11.

julia> flops = n*m*(2p-1)
17175674880

julia> flops / (minimum(bench.times)/1e9)
5.3131281169900205e12

For example, here we're getting over 5 TFLOPS of Float32 performance, which is over 10x faster than the Intel Xe Graphics G7 we had previously been using for oneAPI.jl development. At the same time, the A770 used above should be able to deliver close to 20 TFLOPS, so there's still room for improvement in our software stack.

To use oneAPI.jl with an Arc series GPU, you need to run Linux 6.2. At the time of writing, that kernel is still in beta, so refer to your distribution's documentation for how to install it. For example, on Arch Linux you can use the linux-mainline package from the AUR, Ubuntu has the kernel-ppa archive, Fedora provides the stable-rc repository, etc.

Other changes

  • Support for Julia 1.9 has been added.