Category Archives: Julia

ML Project Environment Setup in Julia, a Comprehensive Step-by-step Guide

By: Julia Frank

Re-posted from: https://juliaifrank.com/ml-project-environment-setup-in-julia/

If you opt for running your ML project code locally on your machine, one of the very first things to do is to take care of the ML environment setup. But why and how?

Storing vectors of vectors in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/22/minicontainers.html

Introduction

The beauty of DataFrames.jl design is that you can store any data
as columns of a data frame.
However, this leads to one tricky issue – what if we want to store
a vector as a single cell of a data frame? Today I will explain
exactly what the problem is and how to solve it.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Basic transformations of columns in DataFrames.jl

Let us start with a simple example:

julia> using DataFrames

julia> df = DataFrame(id=repeat(1:2, 5), x=1:10)
10×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10

We want to group the df data frame by "id" and then store the "x" column unchanged in the result.

This can be done by writing:

julia> combine(groupby(df, "id", sort=true), "x")
10×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      3
   3 │     1      5
   4 │     1      7
   5 │     1      9
   6 │     2      2
   7 │     2      4
   8 │     2      6
   9 │     2      8
  10 │     2     10

Note that the column "x" is expanded into multiple rows by combine. The rule applied here is that if a transformation of data returns a vector, it gets expanded into multiple rows. The reason for this behavior is that this is what we want most of the time.
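A minimal sketch of that rule, assuming the same df as above: a transformation that returns a scalar gives one row per group, while one that returns a vector gives one row per element. The names gdf, scalar_result, and vector_result are only illustrative.

```julia
using DataFrames

df = DataFrame(id=repeat(1:2, 5), x=1:10)
gdf = groupby(df, "id", sort=true)

# A transformation returning a scalar produces one row per group.
scalar_result = combine(gdf, "x" => sum => "x_sum")

# A transformation returning a vector is expanded: one row per element.
vector_result = combine(gdf, "x" => (v -> v .* 2) => "x_doubled")

println((nrow(scalar_result), nrow(vector_result)))
```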

However, what if we wanted the vectors to be kept as they are, without expanding them?
This can be achieved by writing:

julia> combine(groupby(df, "id", sort=true), "x" => Ref => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

We see that we got what we wanted, but the question is why does it work?
Let me explain.

Containers holding one element in Julia

What we just did with Ref is that we wrapped some value in a container that held exactly one element.
There are three basic ways to create such a container in Julia.
The first is to wrap a vector within another vector:

julia> [[1,2,3]]
1-element Vector{Vector{Int64}}:
 [1, 2, 3]

Above you have a vector that has one element, which is a [1, 2, 3] vector.

The second method is to create a 0-dimensional array with fill:

julia> fill([1,2,3])
0-dimensional Array{Vector{Int64}, 0}:
[1, 2, 3]

The key point here is that 0-dimensional arrays are guaranteed to hold exactly one element (as opposed to a vector presented above).

The third approach is to use Ref:

julia> Ref([1,2,3])
Base.RefValue{Vector{Int64}}([1, 2, 3])

Wrapping an object with Ref also creates a 0-dimensional container. The difference between Ref and fill is that fill creates an array, while Ref is just a container (but not an array).
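The difference can be checked directly in Base Julia. The sketch below uses illustrative names (a for the 0-dimensional array, r for the Ref) and needs no packages:

```julia
x = [1, 2, 3]

a = fill(x)   # 0-dimensional array holding x
r = Ref(x)    # reference container holding x

# Both hold exactly one element, retrieved with empty indexing.
same_from_fill = a[] === x
same_from_ref = r[] === x

# Only the result of fill is an array; Ref is not.
fill_is_array = a isa AbstractArray
ref_is_array = r isa AbstractArray

# Neither container copies x, so mutating x is visible through both.
push!(x, 4)
println((length(a[]), length(r[])))
```

Because neither wrapper copies the vector, wrapping is cheap, but the wrapped value stays shared with the original.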

How to use 1-element containers in DataFrames.jl as wrappers

All three methods described above can be used to ensure that we protect a vector from being expanded into multiple rows. Therefore the following operations give the same output:

julia> combine(groupby(df, "id", sort=true), "x" => (x -> [x]) => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

julia> combine(groupby(df, "id", sort=true), "x" => fill => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

julia> combine(groupby(df, "id", sort=true), "x" => Ref => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  SubArray…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

The point is that combine unwraps the outer container (vector, 0-dimensional array, and Ref respectively) and stores its contents as a cell of a data frame.

Now, you might ask why I initially recommended Ref. The reason is that, of the three methods, it has the smallest memory footprint:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> @allocated [x]
64

julia> @allocated fill(x)
64

julia> @allocated Ref(x)
16

This difference is important if you have a huge data frame that has millions of groups.

Also writing Ref is simpler than writing (x -> [x]) 😄.

Aliasing trap

You might have noticed that in the above examples the resulting "x" column held SubArrays. Why is this the case?
To improve performance, combine did not copy the inner vectors from the source df data frame, but instead created views of them. This is faster and more memory efficient, but it creates an alias between the source data frame and the result. In many cases this is not a problem.

However, in some cases you might want to avoid it. The most common case is when you later want to mutate df in place, but do not want the result of combine to reflect this change. If you want to de-alias the data, you need to copy the data in the produced columns. Therefore you should do:

julia> combine(groupby(df, "id", sort=true), "x" => Ref∘copy => "x")
2×2 DataFrame
 Row │ id     x
     │ Int64  Array…
─────┼─────────────────────────
   1 │     1  [1, 3, 5, 7, 9]
   2 │     2  [2, 4, 6, 8, 10]

Notice that now the "x" column stores Array (which indicates that a copy was made). The Ref∘copy expression denotes function composition: we first apply the copy function to the source data and then pass the result to Ref.
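To see the aliasing in action, here is a small sketch, assuming the behavior described above: we build both variants, mutate df in place, and then compare. The names aliased and copied are illustrative only.

```julia
using DataFrames

df = DataFrame(id=repeat(1:2, 5), x=collect(1:10))
gdf = groupby(df, "id", sort=true)

aliased = combine(gdf, "x" => Ref => "x")        # cells are views into df.x
copied = combine(gdf, "x" => Ref∘copy => "x")    # cells are independent copies

# Mutate the source data frame in place.
df.x[1] = 100

# The view-based result follows the change, the copied one does not.
println(aliased.x[1])
println(copied.x[1])
```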

An alternative

Sometimes we want to keep the groups as columns, not as rows, of a data frame. In this case you can use unstack to achieve the desired result. Here is an example of how to do it:

julia> unstack(df, :id, :x, combine=identity)
1×2 DataFrame
 Row │ 1                2
     │ SubArray…?       SubArray…?
─────┼───────────────────────────────────
   1 │ [1, 3, 5, 7, 9]  [2, 4, 6, 8, 10]

and a version copying the underlying data:

julia> unstack(df, :id, :x, combine=copy)
1×2 DataFrame
 Row │ 1                2
     │ Array…?          Array…?
─────┼───────────────────────────────────
   1 │ [1, 3, 5, 7, 9]  [2, 4, 6, 8, 10]

Conclusions

Having read this post you should be comfortable with protecting vectors from being expanded into multiple rows when processing data frames in DataFrames.jl. Enjoy!

Mastering Efficient Array Operations with StaticArrays.jl in Julia

By: Steven Whitaker

Re-posted from: https://blog.glcs.io/staticarrays

The Julia programming language is known for being a high-level language that can still compete with C in terms of performance. As such, Julia already has performant data structures built-in, such as arrays. But what if arrays could be even faster? That’s where the StaticArrays.jl package comes in.

StaticArrays.jl provides drop-in replacements for Array, the standard Julia array type. These StaticArrays work just like Arrays, but they provide one additional piece of information in the type: the size of the array. Consequently, you can’t insert or remove elements of a StaticArray; they are statically sized arrays (hence the name). However, this restriction allows more information to be given to Julia’s compiler, which in turn results in more efficient machine code (for example, via loop unrolling and SIMD operations). The resulting speed-up can often be 10x or more!

In this post, we will learn how to use StaticArrays.jl and compare the performance of StaticArrays to that of regular Arrays for several different operations.

Note that the code examples in this post assume StaticArrays.jl has been installed and loaded:

# Press ] to enter the package prompt.
pkg> add StaticArrays
# Press Backspace to return to the Julia prompt.
julia> using StaticArrays

(Check out our post on the Julia REPL for more details about the package prompt and navigating the REPL.)

How to Use StaticArrays.jl

When working with StaticArrays.jl, typically one will use the SVector type or the SMatrix type. (There is also the SArray type for N-dimensional arrays, but we will focus on 1D and 2D arrays in this post.) SVectors and SMatrixes have both static size and static data, meaning the data contained in such objects cannot be modified. For statically sized arrays whose contents can be modified, StaticArrays.jl provides MVector and MMatrix (and MArray). We will stick with SVectors and SMatrixes in this post unless we specifically need mutability.
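The difference between the static and mutable variants can be sketched as follows. The variable names are illustrative, and the try/catch is only there so the failed assignment does not stop the script:

```julia
using StaticArrays

s = SVector(1.0, 2.0, 3.0)
m = MVector(1.0, 2.0, 3.0)

# MVector has a fixed size but mutable contents.
m[1] = 10.0

# SVector contents cannot be changed in place; the attempt throws an error.
svector_is_immutable = try
    s[1] = 10.0
    false
catch
    true
end

# A common way to "update" an SVector is to build a new one.
s2 = SVector(10.0, s[2], s[3])
println((svector_is_immutable, m, s2))
```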

Constructors

There are three ways to construct StaticArrays.

  1. Convenience constructor SA:

    julia> SA[1, 2, 3]
    3-element SVector{3, Int64} with indices SOneTo(3):
     1
     2
     3

    julia> SA[1 2; 3 4]
    2×2 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)×SOneTo(2):
     1  2
     3  4

  2. Normal constructor functions:

    julia> SVector(1, 2)
    2-element SVector{2, Int64} with indices SOneTo(2):
     1
     2

    julia> SMatrix{2,3}(1, 2, 3, 4, 5, 6)
    2×3 SMatrix{2, 3, Int64, 6} with indices SOneTo(2)×SOneTo(3):
     1  3  5
     2  4  6

  3. Macros:

    julia> @SVector [1, 2, 3]
    3-element SVector{3, Int64} with indices SOneTo(3):
     1
     2
     3

    julia> @SMatrix [1 2; 3 4]
    2×2 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)×SOneTo(2):
     1  2
     3  4

    Note that using macros also enables a convenient way to create StaticArrays from common array-creation functions (eliminating the need to create an Array first just to convert it immediately to a StaticArray):

    @SVector [10 * i for i = 1:10]
    @SVector zeros(5)
    @SVector rand(7)
    @SMatrix [(i, j) for i = 1:2, j = 1:3]
    @SMatrix zeros(2, 2)
    @SMatrix randn(6, 6)

Conversion to/from Array

It may occasionally be necessary to convert to or from Arrays. To convert from an Array to a StaticArray, use the appropriate constructor function. However, because Arrays do not have size information in the type, we ourselves must provide the size to the constructor:

SVector{3}([1, 2, 3])
SMatrix{4,4}(zeros(4, 4))

To convert back to an Array, use the collect function:

julia> collect(SVector(1, 2))
2-element Vector{Int64}:
 1
 2
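As a small round-trip sketch (variable names are illustrative), the following converts a Vector to an SVector and back, showing that the collected result is an ordinary, resizable Vector again:

```julia
using StaticArrays

v = [1, 2, 3]

# The length must be supplied as a type parameter because the type of v
# (Vector{Int}) does not record it.
sv = SVector{3}(v)

# collect returns an ordinary Vector, which can be resized again.
w = collect(sv)
push!(w, 4)

println((typeof(sv), typeof(w), w))
```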

Comparing StaticArrays to Arrays

Once a StaticArray is created, it can be operated on in the same way as an Array. To illustrate, we will run a simple benchmark, both to compare the run-time speeds of the two types of arrays and to show that the same code can work with either type of array.

Here’s the benchmark code, inspired by StaticArrays.jl’s benchmark:

using BenchmarkTools, StaticArrays, LinearAlgebra, Printf

add!(C, A, B) = C .= A .+ B

function run_benchmarks(N)
    A = rand(N, N); A = A' * A
    B = rand(N, N)
    C = Matrix{eltype(A)}(undef, N, N)
    D = rand(N)
    SA = SMatrix{N,N}(A)
    SB = SMatrix{N,N}(B)
    MA = MMatrix{N,N}(A)
    MB = MMatrix{N,N}(B)
    MC = MMatrix{N,N}(C)
    SD = SVector{N}(D)
    speedup = [
        @belapsed($A + $B) / @belapsed($SA + $SB),
        @belapsed(add!($C, $A, $B)) / @belapsed(add!($MC, $MA, $MB)),
        @belapsed($A * $B) / @belapsed($SA * $SB),
        @belapsed(mul!($C, $A, $B)) / @belapsed(mul!($MC, $MA, $MB)),
        @belapsed(norm($D)) / @belapsed(norm($SD)),
        @belapsed(det($A)) / @belapsed(det($SA)),
        @belapsed(inv($A)) / @belapsed(inv($SA)),
        @belapsed($A \ $D) / @belapsed($SA \ $SD),
        @belapsed(eigen($A)) / @belapsed(eigen($SA)),
        @belapsed(map(abs, $A)) / @belapsed(map(abs, $SA)),
        @belapsed(sum($D)) / @belapsed(sum($SD)),
        @belapsed(sort($D)) / @belapsed(sort($SD)),
    ]
    return speedup
end

function main()
    benchmarks = [
        "Addition",
        "Addition (in-place)",
        "Multiplication",
        "Multiplication (in-place)",
        "L2 Norm",
        "Determinant",
        "Inverse",
        "Linear Solve (A \\ b)",
        "Symmetric Eigendecomposition",
        "`map`",
        "Sum of Elements",
        "Sorting",
    ]
    N = [3, 5, 10, 30]
    speedups = map(run_benchmarks, N)
    fmt_header = Printf.Format("%-$(maximum(length.(benchmarks)))s" * " | %7s"^length(N))
    header = Printf.format(fmt_header, "Benchmark", string.("N = ", N)...)
    println(header)
    println("="^length(header))
    fmt = Printf.Format("%-$(maximum(length.(benchmarks)))s" * " | %7.1f"^length(N))
    for i = 1:length(benchmarks)
        println(Printf.format(fmt, benchmarks[i], getindex.(speedups, i)...))
    end
end

main()

Notice that all the functions called when creating the array speedup in run_benchmarks are the same whether using Arrays or StaticArrays, illustrating that StaticArrays are drop-in replacements for standard Arrays.

Running the above code prints the following results on my laptop (the numbers indicate the speedup of StaticArrays over normal Arrays; e.g., a value of 17.7 means using StaticArrays was 17.7 times faster than using Arrays):

Benchmark                    |   N = 3 |   N = 5 |  N = 10 |  N = 30
====================================================================
Addition                     |    17.7 |    14.5 |     7.9 |     2.0
Addition (in-place)          |     1.6 |     1.3 |     1.4 |     0.7
Multiplication               |     8.2 |     7.0 |     4.2 |     2.6
Multiplication (in-place)    |     1.9 |     5.9 |     3.0 |     1.0
L2 Norm                      |     4.2 |     4.0 |     5.4 |     9.7
Determinant                  |    66.6 |     2.5 |     1.3 |     0.9
Inverse                      |    54.8 |     5.9 |     1.8 |     0.9
Linear Solve (A \ b)         |    65.5 |     3.7 |     1.8 |     0.9
Symmetric Eigendecomposition |     3.7 |     1.0 |     1.0 |     1.0
`map`                        |    10.6 |     8.2 |     4.9 |     2.1
Sum of Elements              |     1.5 |     1.1 |     1.7 |     2.1
Sorting                      |     7.1 |     2.9 |     1.5 |     1.1

There are two main conclusions from this table. First, using StaticArrays instead of Arrays can result in some nice speed-ups! Second, the gains from using StaticArrays tend to diminish as the sizes of the arrays increase. So, you can’t expect StaticArrays.jl to always magically make your code faster, but if your arrays are small enough (the recommendation being fewer than about 100 elements) then you can expect to see some good speed-ups.

Of course, the above code timed just individual operations; how much faster a particular application would be is a different matter.

For example, consider a physical simulation where many 3D vectors are manipulated over several time steps. Since 3D vectors are static in size (i.e., are 1D arrays with exactly three elements), such a situation is a prime example of where StaticArrays.jl is useful. To illustrate, here is an example (taken from the field of magnetic resonance imaging) of a physical simulation using Arrays vs using StaticArrays:

using BenchmarkTools, StaticArrays, LinearAlgebra

function sim_arrays(N)
    M = Matrix{Float64}(undef, 3, N)
    M[1,:] .= 0.0
    M[2,:] .= 0.0
    M[3,:] .= 1.0
    M2 = similar(M)
    (sin, cos) = sincosd(30)
    R = [1 0 0; 0 cos sin; 0 -sin cos]
    E1 = exp(-0.01)
    E2 = exp(-0.1)
    (sin, cos) = sincosd(1)
    F = [E2 * cos E2 * sin 0; -E2 * sin E2 * cos 0; 0 0 E1]
    FR = F * R
    C = [0, 0, 1 - E1]
    # Run for 100 time steps (each loop iteration does 2 time steps).
    for t = 1:50
        mul!(M2, FR, M)
        M2 .+= C
        mul!(M, FR, M2)
        M .+= C
    end
    total = sum(M; dims = 2)
    return complex(total[1], total[2])
end

function sim_staticarrays(N)
    M = fill(SVector(0.0, 0.0, 1.0), N)
    (sin, cos) = sincosd(30)
    R = @SMatrix [1 0 0; 0 cos sin; 0 -sin cos]
    E1 = exp(-0.01)
    E2 = exp(-0.1)
    (sin, cos) = sincosd(1)
    F = @SMatrix [E2 * cos E2 * sin 0; -E2 * sin E2 * cos 0; 0 0 E1]
    FR = F * R
    C = @SVector [0, 0, 1 - E1]
    # Run for 100 time steps (each loop iteration does 1 time step).
    for t = 1:100
        # Apply simulation dynamics to each 3D vector.
        for i = 1:length(M)
            M[i] = FR * M[i] + C
        end
    end
    total = sum(M)
    return complex(total[1], total[2])
end

function main(N)
    r1 = @btime sim_arrays($N)
    r2 = @btime sim_staticarrays($N)
    @assert r1 ≈ r2 # Make sure the results are the same.
end

The speed-ups on my laptop for different values of N were as follows:

  • N = 10: 14.6x faster
  • N = 100: 7.1x faster
  • N = 1000: 5.2x faster

(Here, N is the number of 3D vectors in the simulation, not the size of the StaticArrays.)

Note also that I wrote sim_arrays to be as performant as possible by doing in-place operations (like mul!), which has the unfortunate side effect of making the code a bit harder to read. Therefore, sim_staticarrays is both faster and easier to read!

As another example of how StaticArrays.jl can speed up a more involved application, see the DifferentialEquations.jl docs.

Summary

In this post, we discussed StaticArrays.jl. We saw that StaticArrays are drop-in replacements for regular Julia Arrays. We also saw that using StaticArrays can result in some nice speed-ups over using Arrays, at least when the sizes of the arrays are not too big.

Are array operations a bottleneck in your code? Try out StaticArrays.jl and then comment below how it helps!

Additional Links

Cover image background from https://openverse.org/image/875bf026-11ef-47a8-a63c-ee1f1877c156?q=circuit%20board%20array.