Author Archives: Abel Soares Siqueira

10 examples of embedding Julia in C/C++

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/10-examples-of-embedding-julia-in-c-c-66282477e62c?source=rss----ab3660314556--julia

A beginner-friendly collection of examples

A wooden table with 4 cardboard boxes, one inside the other. In the back, the C++ and Julia logos.
We can put C/C++ inside Julia, and Julia inside C/C++, but wait until we call C/C++ from Julia from C/C++. This image is derived from the photos of Andrej Lišakov and Kelli McClintock found on Unsplash.

I have been asked recently about how easy it would be to use Julia from inside C/C++. That was a very interesting question that I was eager to figure out. This question gave me the chance to explore a few toy problems, but I had some issues figuring out the basics. This post aims to help people in a similar situation. I don’t make any benchmarks or claims. Rather, I want to help kickstart future investigations.

The 10 examples in this post are more-or-less ordered by increasing difficulty. I explain some basic bits of Makefile, which can be skipped, if you know what you are doing.

Please note that none of this was done in a production environment or in a real project. Furthermore, I use Linux, so some details are specific to it. I also recommend that you check the code on GitHub to see the full result. If possible, leave a star, so I can use it as an interest metric.

Target audience

This post should be useful for people evaluating whether using Julia inside C/C++ will be a good idea, or who are just very curious about it. I assume some knowledge of Julia and C/C++.

Where is libjulia.so

First, install Julia and remember where you are installing it. I usually use Jill — which is my bash script to install Julia — with the following commands:

In this case, julia will be installed in /opt/julias/julia-x.y.z/ with a link to /usr/local/bin/.

You can try finding out where your Julia is installed using which to find the full path of julia, then listing with -l to check whether that is a link and where it points. For instance,

ls -l $(dirname $(which julia))/julia*

shows all relevant links.

Now, that folder, which I will call JULIA_DIR, should have folders include and lib, and therefore files $JULIA_DIR/include/julia/julia.h and $JULIA_DIR/lib/libjulia.so should exist.

The 10 examples

Now we start with the examples. Since we are using C/C++, I will start from 0, I guess. These examples are some of the files in the GitHub repository. Look for files sqrtX.cpp, integrationX.cpp, and linear-algebraX.cpp. Notice that there are more files than examples, because some are mostly repetition.

0: The basics

Let’s make sure that we can build and run a basic example, taken directly from the official documentation:

I hope the code comments are self-explanatory. To compile this code, let’s prepare a very simple Makefile.

Explaining:

  • -fPIC: Position independent code. This is needed because we will work with shared libraries
  • -g: Adding debug information because we will probably need it
  • JULIA_DIR: It’s our Julia dir!
  • -I…: Include path for julia.h
  • -L…: Linking path for libjulia.so
  • -Wl,…: Options passed through to the linker
  • -ljulia: libjulia.so
  • main.exe: main.cpp: The file main.exe needs the file main.cpp
  • On line 7 there must be a TAB, not spaces
  • $<: Expands to the first (left-most) prerequisite (main.cpp)
  • $@: Expands to the target (main.exe)
  • The Makefile expression as a whole: “To build a main.exe, look for main.cpp and run the command g++ … main.cpp … -o main.exe”

Enter make main.exe in your terminal, and then ./main.exe. Your output should be 1.4142135623730951. I was not expecting this to just work, but it did. I hope you have a similar experience.

1: Expanding the basics

The simplest way to execute anything in Julia is to use jl_eval_string. Variables created using jl_eval_string remain in the Julia interpreter scope, so you can access them:

jl_eval_string("x = sqrt(2.0)");
jl_eval_string("print(x)");

Returned values can be stored as pointers to jl_value_t. To access their values, use jl_unbox_SOMETYPE. For instance:

jl_value_t *x = jl_eval_string("sqrt(2.0)");
double x_value = jl_unbox_float64(x);

Finally, you can also store pointers to Julia functions using jl_get_function. The returned type is a pointer to a jl_function_t and we can’t just use it as a C++ function. Instead, we will use jl_call1 and jl_box_float64.

jl_function_t *sqrt = jl_get_function(jl_base_module, "sqrt");
jl_value_t *x = jl_call1(sqrt, jl_box_float64(2.0));

In the code above we have jl_base_module, which is everything in Base of Julia. The other common module is jl_main_module, which will include whatever we load (using) or create.

The function jl_call1 is used to execute a Julia function with 1 argument. Variants with 0 to 3 arguments also exist, but if you need to call a function with more arguments, you need to use jl_call. We will get to that later.

In the end, we have something like this:

2: Exceptions

Now, try changing the 2.0 in the code above to -1.0. If you call sqrt(-1.0) in Julia you have a DomainError. But if you run the code with the change above, you will have a segmentation fault.

The problem is not in the execution, though; it is in the subsequent unboxing of x, which no longer holds a valid value. To check for exceptions on the Julia side we can use jl_exception_occurred. It returns a pointer to the error or 0 (NULL), so it can be used in a conditional statement.

To show the contents of the exception returned by jl_exception_occurred, we can use showerror from Julia’s Base.

After calling sqrt of -1.0, add the following:

The first line creates a variable ex and assigns the exception to it. The condition then evaluates the value of ex: it is false if ex is NULL (no exception occurred) and true otherwise.

Then, we call showerror from Julia and pass to it Julia’s error stream, obtained with jl_stderr_obj(), and the exception. We add some flourish by printing before and after the error message.

To call this a few times, we can create a function called handle_julia_exception wrapping it, and move it to auxiliary files (we’ll call them aux.h and aux.cpp). However, since we are printing, we would have to add iostream or stdio to our auxiliary files, and we don’t want that. Instead, what we can do is use jl_printf and jl_stderr_stream to print using only julia.h.

Therefore we can create the following files:

In our main file we can just add #include "aux.h" and call handle_julia_exception() directly. And since we are already here, we can also create a wrapper for jl_eval_string that checks for exceptions as well. Add the following to your auxiliary files:

// To your aux.h
jl_value_t *handle_eval_string(const char* code);

and

// To your aux.cpp
jl_value_t *handle_eval_string(const char* code) {
    // Evaluate the code, then report any exception raised on the Julia side
    jl_value_t *result = jl_eval_string(code);
    handle_julia_exception();
    // assert needs #include <cassert>
    assert(result && "Missing return value but no exception occurred!");
    return result;
}

And modify your Makefile accordingly:

main.exe: main.cpp aux.o
	g++ $(CARGS) $< $(JLARGS) aux.o -o $@
%.o: %.cpp
	g++ $(CARGS) -c $< $(JLARGS) -o $@

The % in the Makefile acts as a wildcard.

After running the newest version, you should see something like:

Exception:
DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
I quit!

Another possibility for handling exceptions is to use JL_TRY and JL_CATCH, but I don’t have an example for it yet.

If you think that example would be useful, leave a comment.

3: Including Julia files and calling functions with 4 arguments or more

Let’s move on to something a little more convoluted, the trapezoid method for computing approximations for integrals.

Animation of Trapezoid approximation to the integral of a function.

The idea of the method is to approximate the integral of the function, shown as the blue-shaded region, by the total area of a finite number of trapezoids (caveat: the geometric interpretation only applies to positive functions). As the animation suggests, by increasing the number of trapezoids, we tend to get better approximations for the integral.

The neat formula for the approximation is

Trapezoid formula. LaTeX: \int_a^b f(x) \text{d}x \approx \frac{(b - a)}{2n}\left(f(a) + 2 \sum_{i = 1}^{n - 1} f(x_i) + f(b)\right)
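For comparison, the formula above can also be written directly in C++. This is only a sketch, assuming equally spaced points x_i = a + i(b - a)/n; the names trapezoid and square are made up for this illustration and are not part of the repository.

```cpp
#include <cassert>
#include <cstddef>

// Composite trapezoid rule on n equal subintervals of [a, b]:
// (b - a) / (2n) * (f(a) + 2 * sum_{i=1}^{n-1} f(x_i) + f(b)),
// with x_i = a + i * (b - a) / n (assumed equally spaced).
template <typename F>
double trapezoid(F f, double a, double b, std::size_t n) {
    assert(n > 0 && "need at least one subinterval");
    const double h = (b - a) / static_cast<double>(n);
    double interior = 0.0;
    for (std::size_t i = 1; i < n; i++) {
        interior += f(a + static_cast<double>(i) * h);
    }
    return h / 2.0 * (f(a) + 2.0 * interior + f(b));
}

// Illustrative integrand: the integral of x^2 over [0, 1] is 1/3.
inline double square(double x) { return x * x; }
```

For instance, trapezoid(square, 0.0, 1.0, 1000) should land close to 1/3, and the gap shrinks as n grows.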

A basic implementation of the trapezoid method in Julia is as follows:

Don’t worry, we are not allocating when using range, not even when accessing [2:end-1].

Save this to a file named aux.jl and include it using the code below:

handle_eval_string("include(\"aux.jl\")");

Now we can access trapezoid as any other Julia code, for instance, using jl_eval_string or jl_get_function. It is important to notice that trapezoid is not part of the Base module. Instead, we must use the Main module through jl_main_module.

To test this function, let’s compute the integral of x^2 from 0 to 1. We will use the evaluator to compute x -> x^2, which is the notation for anonymous functions in Julia. The result should be 1/3.

Notice that the function x -> x^2 was created with a handle_eval_string, which calls jl_eval_string. The return value of jl_eval_string is a jl_value_t *, but surprise, a jl_function_t is actually just another name for jl_value_t. The difference is just for readability purposes.

The trapezoid function has 4 arguments, therefore we have to use the general jl_call that we mentioned before. The arguments of jl_call are the function, an array of jl_value_t * arguments, and the number of arguments.

4: C function from Julia from C

How about computing the integral of a C function? We will need to access it through Julia to be able to pass it to a Julia function. First, we must create the function in C. Create a file my_c_func.cpp with the following contents:
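A minimal sketch of what such a file could contain. It assumes the function name my_c_func and the double-to-double signature used by the ccall later in this section, and the body x^3 mentioned at the end of it; treat it as a plausible reconstruction rather than the exact file from the repository.

```cpp
// Hypothetical contents of my_c_func.cpp. The name and the
// double -> double signature follow the ccall described in this
// section; the body x^3 follows the remark that the function is x^3.
extern "C" {
double my_c_func(double x) {
    return x * x * x;
}
}
```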

It is important that we use extern "C" here, otherwise, C++ will mangle the function name. If you use C instead of C++, then this will not be an issue, but we intend to use C++ down the road. We will compile this code to a shared library, not only a .o object. Therefore, add the following to your Makefile:

lib%.so: %.o
	ld -shared $< -o $@

ld is the linker and -shared is because we want a shared library. Furthermore, you should modify the following:

main.exe: main.cpp aux.o libmy_c_func.so

Now, when you run make main.exe, the libmy_c_func.so library will be compiled.

Finally, to call this function, we use the same string evaluator and Julia’s ccall.

The ccall function has 4+ arguments:

  • (:my_c_func, "libmy_c_func.so"): A tuple with the function name and the library;
  • Cdouble: Return type;
  • (Cdouble,): Tuple with the types of the arguments;
  • Then, all the arguments. In this case, only x.

That is it. This change is enough to make the code run. Notice that the function is x^3, so the integral result should be 1 / 4. Those are the only differences in the code.

5: Using a package

Instead of implementing our own integration method, we can use some existing one. One option is QuadGK.jl. To install it, open julia, press ], and enter add QuadGK.

An important note here is that I have not investigated much into maintaining a separate environment for these packages. If you know more about this subject, don’t hesitate to leave a comment.

Here is the code:

handle_eval_string("using QuadGK");
jl_value_t *integrator = handle_eval_string(
"(f, a, b, n) -> quadgk(f, a, b, maxevals=n)[1]"
);

Just like that we can compute the integral, and compare it with our implementation. Let’s use a harder integral to make things more interesting:

The integral of 1 over 1 plus x squared from 0 to 1 is Pi over 4. LaTeX: \int_0^1 \frac{1}{1 + x^2} \text{d}x = \frac{\pi}{4}.

Here is the complete code for this example:

The results you should see are

Integral of 1 / (1 + x^2) is approx: 0.785394
Error: 4.16667e-06
Integral of 1 / (1 + x^2) is approx: 0.785398
Error: -1.11022e-16

6: Using the Distributions package

The package Distributions contains various probability-related tools. We are going to use the Normal distribution’s PDF (Probability Density Function) and CDF (Cumulative Distribution Function) in this example. Don’t worry if you don’t know what these mean; we won’t need to understand the concept, only the formulas.

The Normal distribution with mean Mu (µ) and standard deviation Sigma (σ) has PDF given by

Normal probability density function. LaTeX: f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}

And the CDF corresponding to a PDF f is

Cumulative distribution function definition. LaTeX: F(x) = \int_{-\infty}^x f(t) \text{d}t
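To make the two formulas concrete, here is a small C++ sketch that evaluates them without Julia. The function names normal_pdf and normal_cdf are made up for this illustration, and the CDF uses the standard closed form in terms of the complementary error function instead of a numerical integral.

```cpp
#include <cassert>
#include <cmath>

// PDF of the Normal distribution with mean mu and standard deviation sigma.
double normal_pdf(double x, double mu, double sigma) {
    assert(sigma > 0.0 && "standard deviation must be positive");
    const double pi = std::acos(-1.0);
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}

// CDF of the same distribution, via the closed form
// F(x) = erfc(-(x - mu) / (sigma * sqrt(2))) / 2.
double normal_cdf(double x, double mu, double sigma) {
    assert(sigma > 0.0 && "standard deviation must be positive");
    return 0.5 * std::erfc(-(x - mu) / (sigma * std::sqrt(2.0)));
}
```

Values computed this way can serve as an independent check on what comes back from the Julia side.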

What we will do is use the Distributions package to access the PDF and compute the CDF integral using QuadGK. We will then compare it to the existing CDF function in Distributions.

Once more, there is no big secret. You only have to create the Normal structure on the Julia side and use Julia closures to define the PDF and CDF as jl_function_t with one argument. This is the code:

handle_eval_string("normal = Normal()");
jl_function_t *pdf = handle_eval_string("x -> pdf(normal, x)");
jl_function_t *cdf = handle_eval_string("x -> cdf(normal, x)");

The full code is below

7: Creating a class to wrap the Distributions package

To complicate it a little bit more, let’s create a class wrapping the Distributions package. The basic idea will be a constructor to call Normal, and C++ functions wrapping pdf and cdf. This can be done simply by having a call to handle_eval_string or by creating the function with jl_get_function and calling it with one of the jl_callX variants.

However, to make it more efficient, we want to avoid frequent calls to the functions that deal with strings. One solution is to store the functions returned by jl_get_function and just use them when necessary. To do that, we will use static members in C++.

The two files below show the implementation of our class:

As you can see, we keep a distributions_loaded flag to let the constructor know that the static variables can be used. In the initialization function, we define the necessary functions. The actual implementation of the constructor and the PDF and CDF functions is straightforward.
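The caching pattern itself does not depend on Julia, so it can be sketched in isolation. In the sketch below, expensive_lookup is a hypothetical stand-in for a string-based call such as jl_get_function, the class name CachedHandles is made up, and plain int values play the role of the stored function pointers.

```cpp
#include <cassert>

// Hypothetical stand-in for an expensive, string-based lookup such as
// jl_get_function; here it only counts how many times it is called.
int lookup_calls = 0;
int expensive_lookup(const char * /*name*/) {
    lookup_calls++;
    return 42;  // pretend this is a handle to a Julia function
}

class CachedHandles {
public:
    // The first constructed object performs the lookups; later ones reuse them.
    CachedHandles() {
        if (!loaded) {
            initialize();
        }
    }
    int pdf_handle() const {
        assert(loaded && "handles not initialized");
        return pdf;
    }
    int cdf_handle() const {
        assert(loaded && "handles not initialized");
        return cdf;
    }

private:
    static void initialize() {
        pdf = expensive_lookup("pdf");
        cdf = expensive_lookup("cdf");
        loaded = true;
    }
    static bool loaded;  // plays the role of distributions_loaded
    static int pdf;
    static int cdf;
};

// Static members need a definition outside the class body.
bool CachedHandles::loaded = false;
int CachedHandles::pdf = 0;
int CachedHandles::cdf = 0;
```

Constructing several CachedHandles objects triggers the two lookups only once, which is the effect we want for the stored Julia functions.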

We can use this new class in our main file easily:

Don’t forget to update your Makefile by replacing aux.o with aux.o Normal.o, i.e., add Normal.o next to aux.o. The result of this execution is

x: -4.00e+00 pdf: +4.97e-08 cdf: +1.12e-08
x: -3.00e+00 pdf: +2.73e-06 cdf: +7.07e-07
x: -2.00e+00 pdf: +8.29e-05 cdf: +2.52e-05
x: -1.00e+00 pdf: +1.39e-03 cdf: +5.11e-04
x: +0.00e+00 pdf: +1.30e-02 cdf: +5.95e-03
x: +1.00e+00 pdf: +6.68e-02 cdf: +4.04e-02
x: +2.00e+00 pdf: +1.90e-01 cdf: +1.64e-01
x: +3.00e+00 pdf: +3.00e-01 cdf: +4.18e-01
x: +4.00e+00 pdf: +2.62e-01 cdf: +7.13e-01

8: Linear algebra: Arrays, Vectors, and Matrices

Let’s start our linear algebra exploration with a matrix-vector multiplication and solving a linear system. We will define the following:

x is a vector of ones and A is a matrix with n in the diagonal, 1 below the diagonal and -1 above the diagonal. LaTeX: x = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, A = \begin{bmatrix} n & -1 & -1 & \cdots & -1 \\ 1 & n & -1 & \cdots & -1 \\ 1 & 1 & n & \cdots & -1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & n \end{bmatrix}

Let’s start with some code:

The first two statements define the vector and matrix types. Notice that we make explicit that the first has 1 dimension and the second has 2 dimensions.

The next 3 statements allocate the memory for the two vectors x and y, and the matrix A, using the array types we previously defined.

Finally, we have a JL_GC_PUSH3, which informs Julia’s Garbage Collector to not touch this memory. Naturally, we will have to pop these eventually.

Lastly, we declare C arrays pointing to the Julia data. You will notice that AData is a 1-dimensional array because Julia stores dense matrices as a linearized array by columns. That means that the element (i, j) will be at the linearized position i + j * nrows (using 0-based indexing).
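The column-major layout can be tried out without Julia. The sketch below uses a std::vector in place of the Julia-owned memory; the helper names colmajor, build_matrix, and matvec are made up for this illustration. It builds the matrix A defined above and multiplies it by a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of element (i, j) in a column-major array with nrows rows,
// using 0-based indices.
inline std::size_t colmajor(std::size_t i, std::size_t j, std::size_t nrows) {
    return i + j * nrows;
}

// Fill an n x n matrix, stored by columns, with n on the diagonal,
// 1 below it, and -1 above it.
std::vector<double> build_matrix(std::size_t n) {
    std::vector<double> A(n * n);
    for (std::size_t j = 0; j < n; j++) {
        for (std::size_t i = 0; i < n; i++) {
            double value = -1.0;              // above the diagonal
            if (i == j) {
                value = static_cast<double>(n);
            } else if (i > j) {
                value = 1.0;                  // below the diagonal
            }
            A[colmajor(i, j, n)] = value;
        }
    }
    return A;
}

// y = A * x for a column-major n x n matrix.
std::vector<double> matvec(const std::vector<double> &A,
                           const std::vector<double> &x) {
    const std::size_t n = x.size();
    assert(A.size() == n * n && "A must be n x n");
    std::vector<double> y(n, 0.0);
    for (std::size_t j = 0; j < n; j++) {
        for (std::size_t i = 0; i < n; i++) {
            y[i] += A[colmajor(i, j, n)] * x[j];
        }
    }
    return y;
}
```

With n = 3 and x a vector of ones, the product is (1, 3, 5), which is easy to check by hand against the matrix definition.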

To fill the values of the vector x and the matrix A, we can use the code below:

The product of A and x is pretty much the same as any function we had so far.

The noteworthy part of this code is that we have to cast the arrays to jl_value_t * to use them as arguments to jl_call2, and the output is cast to jl_array_t *. Similarly, we can use jl_array_data(Ax) to access the content of the product.

We can also use mul! to compute the product in place, i.e., without allocating more memory:

Notice that we use jl_main_module because mul! is part of LinearAlgebra, not Base, so it is reachable from Main only after LinearAlgebra has been loaded with using.

Finally, we move on to solving the linear system. To do that, let’s use the LU factorization and the ldiv! function. The \ (backslash) operator is usually used here, but we choose ldiv! to solve the linear system in place.

jl_function_t *lu_fact = jl_get_function(jl_main_module, "lu");
jl_value_t *LU = jl_call1(lu_fact, (jl_value_t *) A);
jl_function_t *ldiv = jl_get_function(jl_main_module, "ldiv!");
jl_call3(ldiv, (jl_value_t *) y, LU, (jl_value_t *) Ax);

The last call defines y as the solution of the linear system Ay = (Ax). Since A is non-singular, we expect y and x to be sufficiently close (numerical errors could appear here). We can verify this using

double *yData = (double *) jl_array_data(y);
double norm2 = 0.0;
for (size_t i = 0; i < n; i++) {
    double dif = yData[i] - xData[i];
    norm2 += dif * dif;
}
cout << "|x - y|² = " << norm2 << endl;

My result was 6.48394e-26.

To finalize this code, we have to run

JL_GC_POP();

This allows the Julia Garbage Collector to collect the allocated memory. The complete code can be seen below:

9: Sparse matrices

For our next example, we will solve the heat equation in one spatial dimension, using a discretization of time and space called Backward Time Centered Space (BTCS), which is not quick to explain. Check these notes for a thorough explanation.

For our interests, it suffices to say that we will be solving a sparse linear system multiple times, where the matrix is the one below:

Tridiagonal matrix, where the diagonal stores 1 plus 2 times kappa, and the off-diagonal values are -kappa. LaTeX: A = \begin{bmatrix} 1 + 2\kappa & -\kappa \\ -\kappa & 1 + 2\kappa & -\kappa \\ & \ddots & \ddots & \ddots \\ & & -\kappa & 1 + 2\kappa & -\kappa \\ & & & -\kappa & 1 + 2\kappa \end{bmatrix}

We don’t have to store this matrix as a dense matrix (like in the previous example). Instead, we want to store only the relevant elements. To do that, we will create three vectors: one for the row indices, one for the column indices, and one for the values corresponding to these indices.

The code is below:

long int rows[3 * n - 2], cols[3 * n - 2];
double vals[3 * n - 2];
for (size_t i = 0; i < n; i++) {
    rows[i] = i + 1;
    cols[i] = i + 1;
    vals[i] = (1 + 2 * kappa);
    if (i < n - 1) {
        rows[n + i] = i + 1;
        cols[n + i] = i + 2;
        vals[n + i] = -kappa;
        rows[2 * n + i - 1] = i + 2;
        cols[2 * n + i - 1] = i + 1;
        vals[2 * n + i - 1] = -kappa;
    }
}
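One way to check the triplets before handing them to Julia is to apply them to a vector directly in C++. The sketch below is an assumption-laden variant of the snippet above: it uses std::vector instead of variable-length arrays, appends the entries in a different order (which is fine for coordinate-format input), and the names Triplets, build_tridiagonal, and coo_matvec are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coordinate-format storage: entry k is vals[k] at (rows[k], cols[k]),
// with 1-based indices as in the snippet above.
struct Triplets {
    std::vector<long> rows, cols;
    std::vector<double> vals;
};

// Tridiagonal matrix with 1 + 2 * kappa on the diagonal and -kappa on
// the two off-diagonals, giving 3n - 2 stored entries.
Triplets build_tridiagonal(std::size_t n, double kappa) {
    assert(n > 0 && "matrix must have at least one row");
    Triplets t;
    auto push = [&t](std::size_t r, std::size_t c, double v) {
        t.rows.push_back(static_cast<long>(r));
        t.cols.push_back(static_cast<long>(c));
        t.vals.push_back(v);
    };
    for (std::size_t i = 0; i < n; i++) {
        push(i + 1, i + 1, 1 + 2 * kappa);
        if (i + 1 < n) {
            push(i + 1, i + 2, -kappa);  // entry above the diagonal
            push(i + 2, i + 1, -kappa);  // entry below the diagonal
        }
    }
    return t;
}

// y = A * x, accumulating one stored entry at a time.
std::vector<double> coo_matvec(const Triplets &t, const std::vector<double> &x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t k = 0; k < t.vals.size(); k++) {
        y[t.rows[k] - 1] += t.vals[k] * x[t.cols[k] - 1];
    }
    return y;
}
```

For n = 4 and kappa = 0.5, multiplying by a vector of ones gives (1.5, 1, 1, 1.5): the interior rows sum to 1 and the two boundary rows to 1 + kappa.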

Now, we will create a sparse matrix using the sparse function from the SparseArrays module in Julia. For that, we allocate two array types, one for the integers, and one for the floating point numbers.

In the jl_call3 call, we also use jl_ptr_to_array_1d to directly create and return a Julia vector wrapping the data we give it.

The A_sparsematrix is a Julia sparse matrix. Many of the matrix operations that work with dense matrices will work with sparse matrices. To test a different factorization, let’s use the function ldl from the LDLFactorizations package.

Now, we can use ldiv! with ldlObj instead of the LU factorization that we used in the previous example. There is one catch, though. Since we are using ldlObj “for a while”, we need to prevent the garbage collector from cleaning it up. But the JL_GC_PUSHX macros can only be called once per scope. Therefore, to use one we have to create an inner scope. So something like the following:

The complete code is below:

In the algorithm, we define u as the initial vector, then solve the linear system with u as the right-hand side to obtain unew. Then we assign unew to u and repeat. Each u is an approximation to the solution of the heat equation for a specific moment in time.

You will notice that, in addition to computing the solution, we also plot it using the Plots package. We plot the initial condition and the solution at several later times. This makes the code much slower, unfortunately. The result can be seen below:

The image shows the solution starting from the function given by the exponential of the negative of the distance from the center. The other curves show the decay of this function, each one flatter and more similar to a symmetric quadratic with negative curvature.
Plot of heat equation solution at different moments in time.

Finalizing and open questions

I hope these 10 examples are helpful to get you started with embedding Julia in C. There are many more things not covered here, in particular things I do not know. Some of them are:

  • How to deal with strings?
  • How to deal with keyword arguments?
  • How to deal with installing packages and environments?
  • How to make it faster (e.g., using precompiled images)?

I will be on the lookout for future projects to investigate these. In the meantime, like and follow for more Julia and C/C++ content.


10 examples of embedding Julia in C/C++ was originally published in Netherlands eScience Center on Medium, where people are continuing the conversation by highlighting and responding to this story.

Can Python with Julia be faster than low-level code?

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/can-python-with-julia-be-faster-than-low-level-code-cd71a72fbcf4?source=rss----ab3660314556--julia

Part 3 of the series on achieving high performance with high-level code

By Abel Soares Siqueira and Faruk Diblen

Here comes a new challenger: It is Julia. Photo by Joran Quinten on Unsplash (https://unsplash.com/photos/MR9xsNWVKvo), modified by us.

Introduction

In our last post, we were able to improve Python code using a few lines of Julia code. We were able to achieve a very interesting result without optimizing prematurely or using low-level code. However, what if we want more? In this blog post, we will investigate that.

It is quite common that a developer prototypes with a high-level language, but when the need for speed arises, they eventually move to a low-level language. This is called the “two-language problem”, and Julia was created with the objective of solving this issue (read more on their blog post from 2012). Unfortunately, achieving the desired speedup is not always easy. It depends highly on the problem, and on how much previous work was done trying to tackle it. Today we find out how much more we can speed up our Julia code, and how much effort it took.

Previously

  • Patrick Bos presented the problem of reading irregular data, or non-tabular data, in this blog post.
  • He also presented his original solution to the problem using just Python with pandas, which we are calling Pure Python in our benchmarks.
  • Finally, he presented a faster strategy which consists of calling C++ from Python, which we denote C++.
  • In the previous blog post of this series, we created two strategies with Python calling Julia code. Our first strategy, Basic Julia, wasn’t that great, but our second strategy, Prealloc Julia, was sufficiently faster than Pure Python, but not as fast as C++.

Remember that we have set up a GitHub repository with our whole code, and also, that we have a Docker image for reproducibility.

For the C fans

Our first approach to speeding things up is to simulate what C++ is doing. We believe that the C++ version is faster because it can read the data directly as the desired data type. In Julia, we had to read the data as String and then convert it to Int. We don’t know how to do that with Julia. But we know how to do that with C.

Using Julia’s built-in ccall function, we can directly call the C functions to open and close a file, namely fopen and fclose, and call fscanf to read and parse the file at the same time. Our updated Julia code which uses these C functions is below.

Let’s see if that helped increase the speed of our code. We include in our benchmark the previous strategies as well. This new strategy will be called Julia + C parsing.

Run time of Pure Python, C++, Basic Julia, Prealloc Julia, and Julia + C parsing strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ version in the log-log scale.

Our code is much more C-like now, so understanding it requires more knowledge about how C works. However, the code is way faster than our previous implementation. For files with more than 1 million elements, the Julia + C parsing strategy has a 10.38 speedup over the Pure Python strategy, on average. This is almost double the speedup we got with Prealloc Julia, which is an amazing result. For comparison, on average, C++ has a 16.37 speedup.

No C for me, thanks

Our C approach was very fast, and we would like to replicate it with pure Julia. Unfortunately, we could not find anything in Julia to perform the same type of reading as fscanf. However, after some investigation, we found an alternative.

Using the read function of Julia, we can parse the file as a stream of bytes. This way we can manually walk through the file and parse the integers. This is the code:

We denote this strategy Optimized Julia. This version of the code manually keeps track of the sequence of bytes related to integers, so it is much less readable. However, this version achieves an impressive speedup, surpassing the C++ version:

Run time of Pure Python, C++, Basic Julia, Prealloc Julia, Julia + C parsing, and Optimized Julia strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ version in the log-log scale.

It was not easy to get to this point, and the code itself is convoluted, but we managed to achieve a large speedup in relation to Python using only Julia, another high-level language. The average speedup for files with over 1 million elements is 40.25, which is over 2 times faster than what we got with the C++ strategy. We remark again that the Pure Python and C++ strategies have not been optimized, and that readers can let us know in the comments if they found a better strategy.

So yes, we can achieve a speedup equivalent to a low-level language using Julia.

Conclusions: We won, but at what cost?

One thing to keep in mind is that to achieve high speedups, we had to put more effort into getting to that point. This effort comes in diverse ways:

  • To write and use the C++ strategy, we had to know sufficient C++, as well as understand the libraries used. If you don’t have enough C++ knowledge, the effort is higher, since what needs to be done is quite different from what Python developers are used to. If you already know C++, then the effort is that of searching the right keywords and using the right libraries.
  • To write and use any of the Julia strategies, you need to put some effort into having the correct environment. Using Julia from Python is still an experimental feature, so your experience may vary.
  • To write the Basic Julia and Prealloc Julia strategies, not much previous knowledge is required. So, we can classify this as a small effort.
  • To write the Julia + C and Optimized Julia strategies, we need more specialized knowledge. This is again a high-effort task if you do not already know the language.

Here’s our conclusion. To achieve a high speedup, we need specialized knowledge which requires a big effort. However, we can conclude as well that, if you are not familiar with either C++ or Julia, then acquiring some knowledge in Julia allows you to get a smaller improvement. That is, a small effort with Julia already gets you some speedup. You can prototype quickly in Julia and get a reasonable result and keep improving that version to get C-like speedups over time.

Speedup gain relative to the effort of moving the code to a different language.

We hope you have enjoyed the series and that it helps you with your code in any way. Let us know what you think and what you missed. Follow us for more research software content.

Many thanks to our proofreaders and reviewers, Elena Ranguelova, Jason Maassen, Jurrian Spaaks, Patrick Bos, Rob van Nieuwpoort, and Stefan Verhoeven.


Can Python with Julia be faster than low-level code? was originally published in Netherlands eScience Center on Medium, where people are continuing the conversation by highlighting and responding to this story.

Speed up your Python code using Julia

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/speed-up-your-python-code-using-julia-f97a6c155630?source=rss----ab3660314556--julia

Part two of the series on achieving high performance with high-level code

By Abel Soares Siqueira and Faruk Diblen

Python holds the steering wheel, but we can make it faster with other languages. Photo by Spencer Davis on Unsplash (https://unsplash.com/photos/QUfxuCqdpH0), modified by us.

In part 1 of this series, we set up an environment so that we can run Julia code in Python. You can also check our Docker image with the complete environment if you want to follow along. We also have a GitHub repository with the complete code if you want to see the result.

Background

In the blog post 50 times faster data loading for Pandas: no problem, our colleague and Senior Research Software Engineer, Patrick Bos, discussed improving the speed of reading non-tabular data into a DataFrame in Python. Since the data is not tabular, one must read, split, and stack the data. All of that can be done with pandas in a few lines of code. However, since the data files are large, performance issues with Python and pandas become visible and prohibitive. So, instead of doing all those operations with pandas, Patrick shows a nice way of doing it with C++ and Python bindings. Well done, Patrick!

In this blog post, we will look into improving the Python code in a similar fashion. However, instead of moving to C++, a low-level language considerably harder to learn than Python, we will move the heavy lifting to Julia and compare the results.

A very short summary of Patrick’s blog post

Before anything, we recommend checking Patrick’s blog post to read more into the problem, the data, and the approach of using Python with C++. The short version is that we have a file where each row is an integer, followed by the character #, followed by an unknown number of comma-separated values, which we call elements. Each row can have a different number of elements, and that’s why we say the data is non-tabular, or irregular. An example file is below:
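To make the row format concrete, here is a sketch in C++ of how a single row could be parsed by walking its characters once. This is not the code used in this series: the names Row and parse_row are made up, and the sketch assumes that the leading value and all elements are non-negative integers with no whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Row {
    long key = 0;                // the integer before '#'
    std::vector<long> elements;  // the comma-separated values after '#'
};

// Parse one row of the form "<integer>#<e1>,<e2>,...".
// Assumes non-negative integers and no whitespace (illustration only).
Row parse_row(const std::string &line) {
    Row row;
    std::size_t pos = 0;
    // Digits before '#' form the leading integer.
    while (pos < line.size() && line[pos] != '#') {
        assert(line[pos] >= '0' && line[pos] <= '9');
        row.key = row.key * 10 + (line[pos] - '0');
        pos++;
    }
    assert(pos < line.size() && "missing '#' separator");
    pos++;  // skip '#'
    long current = 0;
    bool in_number = false;
    for (; pos < line.size(); pos++) {
        const char c = line[pos];
        if (c == ',') {
            // A comma closes the number read so far.
            if (in_number) {
                row.elements.push_back(current);
            }
            current = 0;
            in_number = false;
        } else {
            assert(c >= '0' && c <= '9');
            current = current * 10 + (c - '0');
            in_number = true;
        }
    }
    if (in_number) {
        row.elements.push_back(current);  // last element has no trailing comma
    }
    return row;
}
```

With a made-up row such as "7#10,20,3", parse_row returns the key 7 and the elements 10, 20, and 3; a row like "12#5" has a single element, which is what makes the data irregular.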

From now on, we refer to the initial approach of solving the problem with Python and pandas as the Pure Python strategy, and we will call the strategy of solving the problem with Python and C++ as the C++ strategy.

We will compare the strategies using a dataset we generated. The dataset has 180 files, generated randomly, varying the number of rows, the maximum number of elements per row, and the distribution of the number of elements per row.

Adding some Julia spice to Python

The version below is the first approach to solve our problem using Julia. There are shorter alternatives, but this one is sufficiently descriptive. We start with a very basic approach so it is easier to digest.

You can test this function in Julia directly to see that it works independently of Python. After doing that, we want to call it from Python. As you should know by now, that is fairly easy to do, especially if you use the Docker image we created for part 1.

The next code snippet includes the file that we created above into Julia’s Main namespace and defines two functions in Python. The first, load_external, is used to read the arrays that were parsed by either C++ or Julia. The second Python function, read_arrays_julia_basic, is just a wrapper around the Julia function defined in the included file.

Now we will benchmark this strategy, which we will call the Basic Julia strategy, against the Pure Python and C++ strategies. We are using Python 3.10.1 and Julia 1.6.5. We run each strategy three times and take the average time. Our hardware is a Dell Precision 5530 notebook, with 16 GB of RAM and an i7-8850H CPU, and we are using a Docker image based on Ubuntu Linux 21.10 to run the tests (from inside another Linux machine). You can reproduce the results by pulling the abelsiqueira/faster-python-with-julia-blogpost Docker image, downloading the dataset, and running the following command in your terminal:

$ docker run --rm --volume "$PWD/dataset:/app/dataset" --volume "$PWD/out:/app/out" abelsiqueira/faster-python-with-julia-post2
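The timing protocol described above (three runs per strategy, averaged, then reported per element) can be sketched in Python as follows. This is an illustration only and not the benchmark script shipped in the Docker image. The names average_time and dummy_strategy are hypothetical, and the workload is a stand-in for a real strategy.

```python
import time
from statistics import mean


def average_time(func, *args, repeats=3):
    """Run func(*args) `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def dummy_strategy(values):
    """Stand-in workload; a real run would call one of the loading strategies."""
    return sorted(values)


data = list(range(10_000, 0, -1))        # hypothetical input with 10,000 elements
elapsed = average_time(dummy_strategy, data)
per_element = elapsed / len(data)        # the quantity plotted as "time per element"
print(f"{elapsed:.6f} s on average, {per_element:.3e} s per element")
```

Dividing the per-element time of one strategy by that of the C++ strategy on the same file would give the relative time shown in the second panel of the figures.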

See the figure below for the results.

Run time of Pure Python, C++, and Basic Julia strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ strategy in the log-log scale.

A few interesting things happen in the image. First, both Pure Python and Basic Julia show a lot of variability with respect to the number of elements. We believe this happens because the code’s performance depends on the number of rows, as well as on the distribution of elements per row. The code allocates a new array for each row, so even if the number of elements is small, the execution will be slow if the number of rows is large. Remember that our dataset has a lot of variability in the number of rows, the maximum number of elements per row, and the distribution of elements per row. This means that some files are close in the number of elements but may be vastly different in structure. Second, Basic Julia and Pure Python have different efficiency profiles. Our Julia code must move all stored elements into a new array for each new row that it reads, so the cost of every row grows with the amount of data already stored.

The code for Basic Julia is simple and does what is expected, but it does not pre-allocate the memory that will be used, and that really hurts its performance. In low-level languages, that would be one of the first things we would have to worry about. Indeed, if we look into the C++ code, we can see that it starts by figuring out the size of the output vectors and allocating them. We need to improve our Julia code at least a little bit.
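The difference between the two allocation patterns can be sketched in Python. This is an analogy under stated assumptions, not the Julia or C++ code from the posts: stack_by_copying and stack_preallocated are hypothetical names, and the rows are invented.

```python
def stack_by_copying(rows):
    """Rebuild the output for every row: each step copies everything stored so far."""
    out = []
    for row in rows:
        out = out + row  # allocates a new list and copies all previous elements
    return out


def stack_preallocated(rows):
    """Two passes: count the elements first, allocate once, then fill in place."""
    total = sum(len(row) for row in rows)  # first pass: size of the output
    out = [0] * total                      # single allocation
    pos = 0
    for row in rows:                       # second pass: write each row at its offset
        out[pos:pos + len(row)] = row
        pos += len(row)
    return out


rows = [[10, 20, 30], [7], [4, 5]]  # hypothetical parsed rows
assert stack_by_copying(rows) == stack_preallocated(rows)
print(stack_preallocated(rows))  # [10, 20, 30, 7, 4, 5]
```

Both functions return the same result, but the copying variant does work that grows roughly with the square of the number of rows, while the preallocated variant touches each element once after the counting pass. That is consistent with the observation that a file with many short rows can be slow even when its total number of elements is small.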

Basic improvements for the Julia Code

The first version of our Julia code is inefficient in a few ways, as explained above. With that in mind, our first change is to compute the number of elements a priori and allocate our output vectors. Here is our improved Julia code:

Here, we use a dictionary generator comprehension, which has the closest resemblance to the data. This allows us to count the number of elements and keep the values to be stored later. We also use the package Parsers, which provides a slightly faster parser for integers. Here is the updated figure comparing the three previous strategies and the new Prealloc Julia strategy that we just created:

Run time of the Pure Python, C++, Basic Julia, and Prealloc Julia strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ strategy in the log-log scale.

Now we have made a nice improvement. The results depend more consistently on the number of elements, like the C++ strategy. We can also see a stabilization of the trend that Prealloc Julia follows. It appears to be the same trend as C++, which is expected since the run time should depend linearly on the number of elements. For files with more than 1 million elements, the Prealloc Julia strategy has an average speedup of 5.83 over the Pure Python strategy, while C++ has an average speedup of 16.37.

Next steps

We have achieved an amazing result today. Using only high-level languages, we were able to achieve some speedup in relation to the Pure Python strategy. We remark that we have not optimized the Python or the C++ strategies, simply using what was already available from Patrick’s blog post. Let us know in the comments if you have optimized versions of these codes to share with the community.

In the next post, we will optimize our Julia code even further. It is said that Julia’s speed sometimes rivals low-level code. Can we achieve that for our code? Let us know what you think and stay tuned for more!

Many thanks to our proofreaders and reviewers, Elena Ranguelova, Jason Maassen, Jurrian Spaaks, Patrick Bos, Rob van Nieuwpoort, and Stefan Verhoeven.


Speed up your Python code using Julia was originally published in Netherlands eScience Center on Medium.