Category Archives: Julia

The curious case of subset condition

Re-posted from: https://bkamins.github.io/julialang/2022/01/28/subset.html

Introduction

Recently on Julia Slack there was a question about using the subset function
to drop whole groups from GroupedDataFrame in DataFrames.jl.
I thought that indeed this case is tricky enough to be worth a post.

The examples were tested under Julia 1.7.0 and DataFrames.jl 1.3.2.

Standard use cases of the `subset` function

Let us start with creating some sample data:

julia> using DataFrames

julia> df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
6×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     2      5
   6 │     2      6

julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
⋮
Last Group (2 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     2      6

Assume we want to keep the rows of df whose value of :x is less than the mean
of this column. This can be achieved with:

julia> using Statistics

julia> subset(df, :x => x -> x .< mean(x))
3×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3

The same operation can be easily done groupwise. Now we keep rows that have the
value of :x less than the mean of this column per group defined by :id:

julia> subset(gdf, :x => x -> x .< mean(x))
3×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     2      5

The limitation of the `subset` contract

The subset function requires that the return value of the passed condition
is a vector. Therefore the following operation fails:

julia> subset(df, :x => x -> true)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

although we might expect that broadcasting would be applied to the result of
the function and all rows would be kept. For reference, select, for example,
performs such broadcasting automatically:

julia> select(df, All(), :x => x -> true)
6×3 DataFrame
 Row │ id     x      x_function
     │ Int64  Int64  Bool
─────┼──────────────────────────
   1 │     1      1        true
   2 │     1      2        true
   3 │     1      3        true
   4 │     1      4        true
   5 │     2      5        true
   6 │     2      6        true

You might wonder why this restriction was made. Initially we allowed non-vector
return values, but they turned out to be confusing for users, so we disallowed
them.

Let me give an example. If the user wants to keep all rows for which the :id
column is equal to 1, one should write:

julia> subset(df, :id => ByRow(==(1)))
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

However, it turned out that users frequently forgot to add the ByRow
wrapper and instead wrote:

julia> subset(df, :id => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

Now it throws an error, but if we had not imposed the restriction that a vector
must be returned, we would get the following result (emulated here by expanding
the scalar with fill):

julia> subset(df, :id => x -> fill(x == 1, length(x)))
0×2 DataFrame

as the whole column :id would be compared to 1 and the result of this
comparison is false.
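Why the comparison yields false can be checked in base Julia, without DataFrames.jl. In the sketch below, the vector `ids` is only a stand-in for the :id column, and the variable names are chosen for illustration:

```julia
# Stand-in for the :id column of df (illustrative data).
ids = [1, 1, 1, 1, 2, 2]

# Comparing the whole vector with a number gives a single Bool.
whole = ids == 1      # false

# Broadcasting the comparison gives one Bool per element,
# which is what ByRow(==(1)) computes row by row.
rowwise = ids .== 1   # Bool vector: true for the first four entries
```

The first form is what `:id => ==(1)` evaluates, which is why it can never select rows by their value.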

Dropping whole groups from a `GroupedDataFrame`

The requirement that the condition must return a vector was added for safety
reasons. However, there is one case when it is a bit problematic.

Assume we want to keep from the gdf GroupedDataFrame all groups for which
the mean of :x column is less than 3. The problem is that the following
condition fails:

julia> subset(gdf, :x => x -> mean(x) < 3)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

since comparing the mean of the :x column to 3 produces a scalar Bool
value.

The solution is to manually expand the result of the condition to match the
number of rows in the group:

julia> subset(gdf, :x => x -> fill(mean(x) < 3, length(x)))
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

This is unfortunately a bit inconvenient.
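If this pattern is needed often, the expansion can be wrapped once. The following is a minimal sketch, assuming a helper named `wholegroup` that is hypothetical and not part of DataFrames.jl:

```julia
using DataFrames, Statistics

# Hypothetical helper (not provided by DataFrames.jl): wrap a group-level
# predicate returning a Bool so that it returns a Bool vector whose length
# matches the number of rows in the group.
wholegroup(pred) = x -> fill(pred(x), length(x))

df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
gdf = groupby(df, :id)

# Keep only the groups whose mean of :x is less than 3.
kept = subset(gdf, :x => wholegroup(x -> mean(x) < 3))
```

The condition still returns a vector, so the safety check in subset is not bypassed; the helper only makes the intent explicit at the call site.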

An alternative approach is to use the filter function, which, when applied
to a GroupedDataFrame, always works on whole groups:

julia> filter(:x => x -> mean(x) < 3, gdf) |> DataFrame
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

(we had to pass the result of filter to the DataFrame constructor, as otherwise
we would get a filtered GroupedDataFrame)

Conclusions

The design of subset I discussed in this post shows one of the challenges we
face when defining APIs in DataFrames.jl. There often is a tension between
developer convenience and safety. In this example, allowing only vectors as
results of conditions in the subset function is safer since it allows us to
catch some common bugs in users' code. The cost is that in some cases
(most notably dropping whole groups from a GroupedDataFrame) it is a bit
inconvenient.

CUDA.jl 3.5-3.8

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-01-28-cuda_3.5_3.8/index.html

CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post will highlight a couple: direct copies between devices, better performance by preserving array index types and changing the memory pool, and a much-improved interface to the compute sanitizer utility.

Copies between devices

Typically, when sending data between devices you need to stage through the CPU. CUDA.jl now does this automatically, making it possible to directly copy between CuArrays on different devices:

julia> device!(0);

julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

julia> device!(1);

julia> b = CUDA.zeros(2,2);

julia> copyto!(b, a)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

When your hardware supports it, CUDA.jl will automatically enable so-called peer-to-peer mode, making it possible to copy data directly without going through the CPU. This can result in significantly higher bandwidth and lower latency. You can check if this mode of communication is possible:

julia> src = CuDevice(0)
CuDevice(0): NVIDIA A100-PCIE-40GB

julia> dst = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GB

julia> can_access_peer(src, dst)
false

In this case, peer-to-peer communication is not possible because the devices have a different compute capability major revision number. With a compatible device, the function reports true:

julia> src = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GB

julia> dst = CuDevice(2)
CuDevice(2): Tesla V100-PCIE-16GB

julia> can_access_peer(src, dst)
true

Thanks to @kshyatt for help with this change!

Helper function to use `compute-sanitizer`

The CUDA toolkit comes with a powerful tool to check GPU kernels for common issues like memory errors and race conditions: the compute sanitizer. To make it easier to use this tool, CUDA.jl now ships the binary as part of its artifacts, and provides a helper function to restart Julia under the compute-sanitizer. Let's demonstrate, and trigger a memory error to show what the compute sanitizer can detect:

julia> using CUDA

julia> CUDA.run_compute_sanitizer()
Re-starting your active Julia session...

========= COMPUTE-SANITIZER
julia> using CUDA

julia> unsafe_wrap(CuArray, pointer(CuArray([1])), 2) .= 1
========= Invalid __global__ write of size 8 bytes
=========     at 0x2a0 in LLVM/src/interop/base.jl:45:julia_broadcast_kernel_1892(CuKernelContext, CuDeviceArray<Int64, (int)1, (int)1>, Broadcasted<void, Tuple<OneTo<Int64>>, _identity, Broadcasted<Int64>>, Int64)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0xa64000008 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0xa64000000 of size 8 bytes

Other tools are available too, e.g. racecheck for detecting races or synccheck for finding synchronization issues. These tools can be selected using the tool keyword argument to run_compute_sanitizer.

Updated binary dependencies

As is common with every release, CUDA.jl now supports newer versions of NVIDIA's tools and libraries:

CUDA toolkit 11.5 and 11.6
CUDNN 8.3.2
CUTENSOR 1.4.0

The update to CUDA toolkit 11.6 comes with improved debug info compatibility. If you need to debug Julia GPU code with tools like compute-sanitizer or cuda-gdb, and you need debug info (the equivalent of nvcc -G), ensure CUDA.jl can use the latest version of the CUDA toolkit.

To make it easier to use the latest supported toolkit, CUDA.jl now implements CUDA's so-called Forward Compatibility mode: when your driver is outdated, CUDA.jl will attempt to load a newer version of the CUDA driver library, enabling use of a newer CUDA toolkit and libraries. Note that this is only supported on select hardware; refer to the NVIDIA documentation for more details.

Preserving array indices

Julia's integers are typically 64 bits wide, which can be wasteful when dealing with GPU indexing intrinsics that are typically only 32 bits wide. CUDA.jl's device array type now carefully preserves the type of indices so that 32-bit indices aren't unnecessarily promoted to 64 bits. With some careful kernel programming (note the use of 0x1 instead of 1 below), this makes it possible to significantly reduce the register pressure surrounding indexing operations, which may be useful in register-constrained situations:

julia> function memset(arr, val)
           i = (blockIdx().x-0x1) * blockDim().x + threadIdx().x
           @inbounds arr[i] = val
           return
       end

julia> CUDA.code_ptx(memset, Tuple{CuDeviceArray{Float32,1,AS.Global},Float32})
.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<5>;

        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %tid.x;
        mad.lo.s32      %r4, %r2, %r1, %r3;
        ld.u64          %rd2, [%rd1];
        mul.wide.s32    %rd3, %r4, 4;
        add.s64         %rd4, %rd2, %rd3;
        st.global.f32   [%rd4], %f1;
        ret;
}

On CUDA.jl 3.4, this simple function used 3 more 64-bit registers:

.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<8>;

        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mul.wide.u32    %rd2, %r2, %r1;
        mov.u32         %r3, %tid.x;
        add.s32         %r4, %r3, 1;
        cvt.u64.u32     %rd3, %r4;
        ld.u64          %rd4, [%rd1];
        add.s64         %rd5, %rd2, %rd3;
        shl.b64         %rd6, %rd5, 2;
        add.s64         %rd7, %rd4, %rd6;
        st.global.f32   [%rd7+-4], %f1;
        ret;
}
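The role of `0x1` in the kernel can be reproduced on the CPU with plain integer arithmetic. The sketch below assumes that the indexing intrinsics return `Int32` values, which is what the 32-bit registers in the PTX suggest; the variables `block`, `dim`, and `thread` are stand-ins, not CUDA.jl calls:

```julia
# Stand-ins for blockIdx().x, blockDim().x and threadIdx().x,
# assumed here to be Int32 values.
block, dim, thread = Int32(2), Int32(256), Int32(5)

# Subtracting the literal 1 (an Int, 64 bits wide on a 64-bit machine)
# promotes the whole expression to the wider type.
idx_wide = (block - 1) * dim + thread

# Subtracting 0x1 (a UInt8) keeps the expression in Int32.
idx_narrow = (block - 0x1) * dim + thread

wide_type = typeof(idx_wide)      # Int64 on a 64-bit machine
narrow_type = typeof(idx_narrow)  # Int32
```

Both expressions compute the same index value; only the integer width differs, and that width is what the device array type now preserves.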

More aggressive memory management

Starting with CUDA.jl 3.8, the memory pool used to allocate CuArrays will be configured differently: the pool will now be allowed to use all available GPU memory, whereas previously all cached memory was released at each synchronization point. This can significantly improve performance, and makes synchronization much cheaper.

This behavior can be observed by calling the memory_status() function:

julia> CUDA.memory_status()
Effective GPU memory usage: 13.57% (2.001 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

julia> a = CuArray{Float32}(undef, (1024, 1024, 1024));

julia> Base.format_bytes(sizeof(a))
"4.000 GiB"

julia> a = nothing

julia> GC.gc()

julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

So far nothing new. On previous versions of CUDA.jl however, any subsequent synchronization of the GPU (e.g., by copying memory to the CPU) would have resulted in a release of this reserved memory. This is not the case anymore:

julia> synchronize()

julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

If you still want to release this memory, you can call the reclaim() function:

julia> CUDA.reclaim()

julia> CUDA.memory_status()
Effective GPU memory usage: 13.48% (1.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

With interactive Julia sessions, this function is called periodically so that the GPU's memory isn't held on to unnecessarily. Otherwise it shouldn't be necessary to call this function, as memory is freed automatically when it is needed.

Minor changes and improvements

Bitonic sort is now used instead of quicksort (by @xaellison).
CuDeviceArray now stores the length of the array, greatly speeding up indexing with high-dimensional arrays.
Device intrinsics cannot be called on the CPU anymore, protecting against segfaults when something isn't dispatching correctly.
Support for Multi-GPU instances has been improved, providing the parent_uuid function to look up the UUID of the parent device.
randn and randexp are now supported in kernel code, which should help with initial support of Distributions.jl-based operations.

Speed up your Python code using Julia

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/speed-up-your-python-code-using-julia-f97a6c155630?source=rss----ab3660314556--julia

Part two of the series on achieving high performance with high-level code

By Abel Soares Siqueira and Faruk Diblen

Python holds the steering wheel, but we can make it faster with other languages. Photo by Spencer Davis on Unsplash (https://unsplash.com/photos/QUfxuCqdpH0), modified by us.

In part 1 of this series, we set up an environment so that we can run Julia code in Python. You can also check our Docker image with the complete environment if you want to follow along. We also have a GitHub repository with the complete code if you want to see the result.

Background

In the blog post "50 times faster data loading for Pandas: no problem", our colleague and Senior Research Software Engineer, Patrick Bos, discussed improving the speed of reading non-tabular data into a DataFrame in Python. Since the data is not tabular, one must read, split, and stack the data. All of that can be done with pandas in a few lines of code. However, since the data files are large, performance issues with Python and pandas become visible and prohibitive. So, instead of doing all those operations with pandas, Patrick shows a nice way of doing it with C++ and Python bindings. Well done, Patrick!

In this blog post, we will look into improving the Python code in a similar fashion. However, instead of moving to C++, a low-level language considerably harder to learn than Python, we will move the heavy lifting to Julia and compare the results.

A very short summary of Patrick’s blog post

Before anything, we recommend checking Patrick’s blog post to read more into the problem, the data, and the approach of using Python with C++. The short version is that we have a file where each row is an integer, followed by the character #, followed by an unknown number of comma-separated values, which we call elements. Each row can have a different number of elements, and that’s why we say the data is non-tabular, or irregular. An example file is below:

Embedded example file: https://medium.com/media/d4c0f4780182e3d67fe694011e4c1e67/href
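To make the row format described above concrete, here is a minimal sketch of parsing a single row in Julia. It is an illustration only, not the code from either blog post: the function name `parse_line` is hypothetical, and the elements are assumed to be integers.

```julia
# Illustrative only: parse one row of the form "index#v1,v2,...".
# `parse_line` is a hypothetical name; integer elements are an assumption.
function parse_line(line::AbstractString)
    # Split once at '#': the integer before it, the elements after it.
    key, rest = split(line, '#'; limit=2)
    index = parse(Int, key)
    # A row may carry any number of elements, including none.
    elements = isempty(rest) ? Int[] : parse.(Int, split(rest, ','))
    return index, elements
end

index, elements = parse_line("3#10,20,30")
```

Because each call returns a vector of a different length, the rows cannot be stored directly as columns of a table, which is the irregularity the strategies below have to handle.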

From now on, we refer to the initial approach of solving the problem with Python and pandas as the Pure Python strategy, and we will call the approach of solving the problem with Python and C++ the C++ strategy.

We will compare the strategies using a dataset we generated. The dataset has 180 files, generated randomly, varying the number of rows, the maximum number of elements per row, and the distribution of the number of elements per row.

Adding some Julia spice to Python

The version below is the first approach to solve our problem using Julia. There are shorter alternatives, but this one is sufficiently descriptive. We start with a very basic approach so it is easier to digest.

Embedded code (Basic Julia version): https://medium.com/media/9e5f9faa5fa8b143ebd3be12b1f5bb45/href

You can test this function in Julia directly to see that it works independently of Python. After doing that, we want to call it from Python. As you should know by now, that is fairly easy to do, especially if you use the Docker image we created for part 1.

The next code snippet includes the file that we created above into Julia’s Main namespace and defines two functions in Python. The first, load_external, is used to read the arrays that were parsed by either C++ or Julia. The second Python function, read_arrays_julia_basic, is just a wrapper around the Julia function defined in the included file.

Embedded code (Python wrappers): https://medium.com/media/56953c564183da0bb58224d35be10b17/href

Now we will benchmark this strategy, which we will call the Basic Julia strategy, against the Pure Python and C++ strategies. We are using Python 3.10.1 and Julia 1.6.5. We run each strategy three times and take the average time. Our hardware is a Dell Precision 5530 notebook, with 16 GB of RAM and an i7-8850H CPU, and we are using a Docker image based on Ubuntu Linux 21.10 to run the tests (from inside another Linux machine). You can reproduce the results by pulling the abelsiqueira/faster-python-with-julia-blogpost Docker image, downloading the dataset, and running the following command in your terminal:

$ docker run --rm --volume "$PWD/dataset:/app/dataset" --volume "$PWD/out:/app/out" abelsiqueira/faster-python-with-julia-post2

See the figure below for the results.

Run time of the **Pure Python**, **C++**, and **Basic Julia** strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the **C++** strategy in the log-log scale.

A few interesting things happen in the image. First, both Pure Python and Basic Julia show a lot of variability with respect to the number of elements. We believe this happens because the code’s performance depends on the number of rows as well as on the distribution of elements per row. The code allocates a new array for each row, so even if the number of elements is small, the execution will be slow if the number of rows is large. Remember that our dataset has a lot of variability in the number of rows, the maximum number of elements per row, and the distribution of elements per row. This means that some files are close in the number of elements but may be vastly different in structure. Second, Basic Julia and Pure Python have different efficiency profiles. Our Julia code must move all stored elements into a new array for each new row that it reads, meaning it allocates a new array for every row.

The code for Basic Julia is simple and does what is expected, but it does not pre-allocate the memory that will be used, which really hurts its performance. In low-level languages, that would be one of the first things we would have to worry about. Indeed, if we look into the C++ code, we can see that it starts by figuring out the size of the output vectors and allocating them. We need to improve our Julia code at least a little bit.
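The cost of growing an array row by row, compared with allocating it once, can be illustrated with a toy contrast. This is a sketch under stated assumptions and not the authors' code: the data in `rows` and the function names `collect_by_vcat` and `collect_prealloc` are hypothetical, and concatenation with `vcat` is just one way the copying pattern described above can arise.

```julia
# Toy rows standing in for the parsed lines of a file (hypothetical data).
rows = [[1, 2, 3], [4], [5, 6]]

# Growing by concatenation: every vcat allocates a new array and copies
# everything stored so far, so the work grows with the number of rows.
function collect_by_vcat(rows)
    out = Int[]
    for r in rows
        out = vcat(out, r)
    end
    return out
end

# Preallocating: count the elements first, allocate once, fill in place.
function collect_prealloc(rows)
    out = Vector{Int}(undef, sum(length, rows; init=0))
    pos = 1
    for r in rows
        copyto!(out, pos, r, 1, length(r))
        pos += length(r)
    end
    return out
end

grown = collect_by_vcat(rows)
filled = collect_prealloc(rows)
```

Both functions return the same vector; the difference is only in how many intermediate arrays are created along the way.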

Basic improvements for the Julia Code

The first version of our Julia code is inefficient in a few ways, as explained above. With that in mind, our first change is to compute the number of elements a priori and allocate our output vectors. Here is our improved Julia code:

Embedded code (Prealloc Julia version): https://medium.com/media/f6086539828ae9fbacfd88dfeeb0c953/href

Here, we use a dictionary generator comprehension, which has the closest resemblance to the data. This allows us to count the number of elements and keep the values to be stored later. We also use the package Parsers, which provides a slightly faster parser for integers. Here is the updated figure comparing the three previous strategies and the new Prealloc Julia strategy that we just created:

Run time of the **Pure Python**, **C++**, **Basic Julia**, and **Prealloc Julia** strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the **C++** strategy in the log-log scale.

Now we have made a nice improvement. The results depend more consistently on the number of elements, as with the C++ strategy. We can also see a stabilization of the trend that Prealloc Julia follows. It appears to be the same as for C++, which is expected since the run time should depend linearly on the number of elements. For files with more than 1 million elements, the Prealloc Julia strategy has a speedup of 5.83 over the Pure Python strategy, on average, while C++ has a speedup of 16.37, on average.

Next steps

We have achieved an amazing result today. Using only high-level languages, we were able to achieve some speedup in relation to the Pure Python strategy. We remark that we have not optimized the Python or the C++ strategies, simply using what was already available from Patrick’s blog post. Let us know in the comments if you have optimized versions of these codes to share with the community.

In the next post, we will optimize our Julia code even further. It is said that Julia’s speed sometimes rivals low-level code. Can we achieve that for our code? Let us know what you think and stay tuned for more!

Many thanks to our proofreaders and reviewers, Elena Ranguelova, Jason Maassen, Jurrian Spaaks, Patrick Bos, Rob van Nieuwpoort, and Stefan Verhoeven.

Speed up your Python code using Julia was originally published in Netherlands eScience Center on Medium, where people are continuing the conversation by highlighting and responding to this story.

juliabloggers.com

A Julia Language Blog Aggregator
