Tag Archives: Cpp

10 examples of embedding Julia in C/C++

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/10-examples-of-embedding-julia-in-c-c-66282477e62c?source=rss----ab3660314556--julia

A beginner-friendly collection of examples

A wooden table with 4 cardboard boxes, one inside the other. In the back, the C++ and Julia logos.
We can put C/C++ inside Julia, and Julia inside C/C++, but wait until we call C/C++ from Julia from C/C++. This image is derived from the photos of Andrej Lišakov and Kelli McClintock found on Unsplash.

I have been asked recently about how easy it would be to use Julia from inside C/C++. That was a very interesting question that I was eager to figure out. This question gave me the chance to explore a few toy problems, but I had some issues figuring out the basics. This post aims to help people in a similar situation. I don’t make any benchmarks or claims. Rather, I want to help kickstart future investigations.

The 10 examples in this post are more-or-less ordered by increasing difficulty. I explain some basic bits of the Makefile, which can be skipped if you know what you are doing.

Please notice that this was not done in a production environment, nor in any real project. Furthermore, I use Linux, which is also very specific. I also recommend checking the code on GitHub, so you can see the full result. If possible, leave a star, so I can use it as an interest metric.

Target audience

This post should be useful for people evaluating whether using Julia inside C/C++ is a good idea, or who are just very curious about it. I assume some knowledge of Julia and C/C++.

Where is libjulia.so

First, install Julia and remember where you are installing it. I usually use Jill — which is my bash script to install Julia — with the following commands:

In this case, julia will be installed in /opt/julias/julia-x.y.z/ with a link to /usr/local/bin/.

You can try finding out where your Julia is installed using which to find the full path of julia, then listing with -l to check whether that is a link and where it points. For instance,

ls -l $(dirname $(which julia))/julia*

shows all relevant links.

Now, that folder, which I will call JULIA_DIR, should have folders include and lib, and therefore the files $JULIA_DIR/include/julia/julia.h and $JULIA_DIR/lib/libjulia.so should exist.

The 10 examples

Now we start with the examples. Since we are using C/C++, I will start from 0, I guess. These examples correspond to some of the files in the GitHub repository. Look for the files sqrtX.cpp, integrationX.cpp, and linear-algebraX.cpp. Notice that there are more files than examples, because some are mostly repetition.


0: The basics

Let’s make sure that we can build and run a basic example, taken directly from the official documentation:
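
Something like the following minimal program (adapted from the example in the official embedding documentation) does the job:

#include <julia.h>

int main(int argc, char *argv[])
{
    // Required: set up the Julia context
    jl_init();

    // Run a Julia command; this prints 1.4142135623730951
    jl_eval_string("print(sqrt(2.0))");

    // Strongly recommended: let Julia run pending finalizers before exiting
    jl_atexit_hook(0);
    return 0;
}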

I hope the code comments are self-explanatory. To compile this code, let’s prepare a very simple Makefile.
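
A sketch of such a Makefile (substitute your own Julia directory for JULIA_DIR; in this sketch the g++ command is on line 7):

JULIA_DIR = /opt/julias/julia-x.y.z
CARGS  = -fPIC -g
JLARGS = -I$(JULIA_DIR)/include/julia -L$(JULIA_DIR)/lib \
         -Wl,-rpath,$(JULIA_DIR)/lib -ljulia

main.exe: main.cpp
	g++ $(CARGS) $< $(JLARGS) -o $@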

Explaining:

  • -fPIC: Position independent code. This is needed because we will work with shared libraries
  • -g: Adding debug information because we will probably need it
  • JULIA_DIR: It’s out Julia dir!
  • -I…: Include path for julia.h
  • -L…: Linking path for libjulia.so
  • -Wl,…: Linking path for the linker
  • -ljulia: libjulia.so
  • main.exe: main.cpp: The file main.exe needs the file main.cpp
  • On line 7 there must be a TAB, not spaces
  • $<: Expands to the left-most requirement (main.cpp)
  • $@: Expands to the target (main.exe)
  • The Makefile expression as a whole: “To build a main.exe, look for main.cpp and run the command g++ … main.cpp … -o main.exe”

Enter make main.exe in your terminal, and then ./main.exe. Your output should be 1.4142135623730951. I was not expecting this to just work, but it did. I hope you have a similar experience.

1: Expanding the basics

The simplest way to execute anything in Julia is to use jl_eval_string. Variables created using jl_eval_string remain in the Julia interpreter scope, so you can access them:

jl_eval_string("x = sqrt(2.0)");
jl_eval_string("print(x)");

Returned values can be stored as pointers of type jl_value_t. To access their values, use jl_unbox_SOMETYPE. For instance:

jl_value_t *x = jl_eval_string("sqrt(2.0)");
double x_value = jl_unbox_float64(x);

Finally, you can also store pointers to Julia functions using jl_get_function. The returned type is a pointer to a jl_function_t and we can’t just use it as a C++ function. Instead, we will use jl_call1 and jl_box_float64.

jl_function_t *sqrt = jl_get_function(jl_base_module, "sqrt");
jl_value_t *x = jl_call1(sqrt, jl_box_float64(2.0));

In the code above we have jl_base_module, which is everything in Base of Julia. The other common module is jl_main_module, which will include whatever we load (using) or create.

The function jl_call1 is used to execute a Julia function with 1 argument. Variants with 0 to 3 arguments also exist, but if you need to call a function with more arguments, you need to use jl_call. We will get to that later.

In the end, we have something like this:
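
Putting the pieces of this section together, the program could look something like this sketch:

#include <julia.h>
#include <iostream>

int main()
{
    jl_init();

    // Evaluate a string and unbox the returned value
    jl_value_t *x = jl_eval_string("sqrt(2.0)");
    std::cout << jl_unbox_float64(x) << std::endl;

    // Call Base.sqrt through a function pointer
    jl_function_t *sqrt_fun = jl_get_function(jl_base_module, "sqrt");
    jl_value_t *y = jl_call1(sqrt_fun, jl_box_float64(2.0));
    std::cout << jl_unbox_float64(y) << std::endl;

    jl_atexit_hook(0);
    return 0;
}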

2: Exceptions

Now, try changing the 2.0 in the code above to -1.0. If you call sqrt(-1.0) in Julia you have a DomainError. But if you run the code with the change above, you will have a segmentation fault.

The problem is not in the execution itself, though; it is in the subsequent unboxing of the now-undefined x. To check for exceptions on the Julia side we can use jl_exception_occurred. It returns the pointer to the error or 0 (NULL), so it can be used in a conditional statement.

To check the contents of jl_exception_occurred, we can use showerror from Julia’s Base.

After calling sqrt of -1.0, add the following:
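
Something along these lines (the messages around the error are the flourish mentioned below):

jl_value_t *ex = jl_exception_occurred();
if (ex) {
    jl_function_t *showerror = jl_get_function(jl_base_module, "showerror");
    std::cerr << "Exception:" << std::endl;     // flourish before
    jl_call2(showerror, jl_stderr_obj(), ex);   // print the error itself
    std::cerr << "I quit!" << std::endl;        // flourish after
    exit(1);                                    // from <cstdlib>
}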

The first line creates a variable ex and assigns the exception to it. The condition then evaluates ex as a pointer, which is false (NULL) if there is no exception, and true if something else was returned.

Then, we call showerror from Julia and pass to it Julia’s error stream with jl_stderr_obj(), and the exception. We add some flourish printing before and after the error message.

To call this a few times, we can create a function called handle_julia_exception wrapping it, and move it to auxiliary files (we’ll call them aux.h and aux.cpp). However, since we are printing, we would have to add iostream or stdio to our auxiliary files, and we don’t want that. Instead, what we can do is use jl_printf and jl_stderr_stream to print using only julia.h.

Therefore we can create the following files:
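
A possible version of these two files, printing only through julia.h:

// aux.h
#pragma once
#include <julia.h>

void handle_julia_exception();

// aux.cpp
#include "aux.h"
#include <cstdlib>

void handle_julia_exception() {
    jl_value_t *ex = jl_exception_occurred();
    if (!ex)
        return;
    jl_function_t *showerror = jl_get_function(jl_base_module, "showerror");
    jl_printf(jl_stderr_stream(), "Exception:\n");
    jl_call2(showerror, jl_stderr_obj(), ex);
    jl_printf(jl_stderr_stream(), "\nI quit!\n");
    exit(1);
}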

In our main file we can just add #include "aux.h" and call handle_julia_exception() directly. And since we are already here, we can also create a wrapper for jl_eval_string that checks for exceptions as well. Add the following to your auxiliary files:

// To your aux.h
jl_value_t *handle_eval_string(const char* code);

and

// To your aux.cpp
#include <cassert>   // for the assert below

jl_value_t *handle_eval_string(const char* code) {
    jl_value_t *result = jl_eval_string(code);
    handle_julia_exception();
    assert(result && "Missing return value but no exception occurred!");
    return result;
}

And modify your Makefile accordingly:

main.exe: main.cpp aux.o
	g++ $(CARGS) $< $(JLARGS) aux.o -o $@

%.o: %.cpp
	g++ $(CARGS) -c $< $(JLARGS) -o $@

The % in the Makefile acts as a wildcard.

After running the newest version, you should see something like:

Exception:
DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
I quit!

Other possibilities to handle exceptions are JL_TRY and JL_CATCH, but I don’t have an example for them yet.

If you think that example would be useful, leave a comment.

3: Including Julia files and calling functions with 4 arguments or more

Let’s move on to something a little more convoluted, the trapezoid method for computing approximations for integrals.

Animation of Trapezoid approximation to the integral of a function.

The idea of the method is to approximate the integral of the function — the blue-shaded region — by a finite amount of trapezoid areas (caveat: the geometric interpretation only applies to positive functions). As the animation suggests, by increasing the number of trapezoids, we tend to get better approximations for the integral.

The neat formula for the approximation is

Trapezoid formula. LaTeX: \int_a^b f(x) \,\text{d}x \approx \frac{b - a}{2n}\left(f(a) + 2 \sum_{i = 1}^{n - 1} f(x_i) + f(b)\right)

A basic implementation of the trapezoid method in Julia is as follows:

Don’t worry, we are not allocating when using range, not even when accessing [2:end-1].

Write this down to an aux.jl file and include it using the code below:

handle_eval_string("include(\"aux.jl\")");

Now we can access trapezoid as any other Julia code, for instance, using jl_eval_string or jl_get_function. It is important to notice that trapezoid is not part of the Base module. Instead, we must use the Main module through jl_main_module.

To test this function, let’s compute the integral of x^2 from 0 to 1. We will use the evaluator to create x -> x^2, which is the Julia notation for anonymous functions. The result should be 1/3.

Notice that the function x -> x^2 was created with a handle_eval_string, which calls jl_eval_string. The return value of jl_eval_string is a jl_value_t *, but surprise, a jl_function_t is actually just another name for jl_value_t. The difference is just for readability purposes.

The trapezoid function has 4 arguments, therefore we have to use the general jl_call that we mentioned before. The arguments of jl_call are the function, an array of jl_value_t * arguments, and the number of arguments.
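
For instance, assuming the Julia function has the signature trapezoid(f, a, b, n), a fragment of the calling code could be:

jl_function_t *trapezoid = jl_get_function(jl_main_module, "trapezoid");
jl_value_t *f = handle_eval_string("x -> x^2");

jl_value_t *args[4] = {f, jl_box_float64(0.0), jl_box_float64(1.0),
                       jl_box_int64(1000)};
jl_value_t *integral = jl_call(trapezoid, args, 4);
handle_julia_exception();
std::cout << jl_unbox_float64(integral) << std::endl;  // close to 1/3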

4: C function from Julia from C

How about computing the integral of a C function? We will need to access it through Julia to be able to pass it to a Julia function. First, we must create the function in C. Create a file my_c_func.cpp with the following contents:
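
A sketch of that file, using x^3 as the function (so that its integral over [0, 1] is 1/4):

// my_c_func.cpp
// extern "C" prevents C++ from mangling the symbol name
extern "C" double my_c_func(double x) {
    return x * x * x;
}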

It is important that we use extern "C" here, otherwise, C++ will mangle the function name. If you use C instead of C++, then this will not be an issue, but we intend to use C++ down the road. We will compile this code to a shared library, not only a .o object. Therefore, add the following to your Makefile:

lib%.so: %.o
	ld -shared $< -o $@

ld is the linker and -shared is because we want a shared library. Furthermore, you should modify the following:

main.exe: main.cpp aux.o libmy_c_func.so

Now, when you run make main.exe, the libmy_c_func.so library will be compiled.

Finally, to call this function, we use the same string evaluator and Julia’s ccall.
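
A sketch of the evaluated string (reusing the handle_eval_string helper):

jl_value_t *f = handle_eval_string(
    "x -> ccall((:my_c_func, \"libmy_c_func.so\"), Cdouble, (Cdouble,), x)");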

The ccall function has 4+ arguments:

  • (:my_c_func, "libmy_c_func.so"): A tuple with the function name and the library;
  • Cdouble: Return type;
  • (Cdouble,): Tuple with the types of the arguments;
  • Then, all the arguments. In this case, only x.

That is it. This change is enough to make the code run. Notice that the function is x^3, so the integral result should be 1 / 4. Those are the only differences in the code.

5: Using a package

Instead of implementing our own integration method, we can use some existing one. One option is QuadGK.jl. To install it, open julia, press ], and enter add QuadGK.

An important note here is that I have not investigated much into maintaining a separate environment for these packages. If you know more about this subject, don’t hesitate to leave a comment.

Here is the code:

handle_eval_string("using QuadGK");
jl_value_t *integrator = handle_eval_string(
"(f, a, b, n) -> quadgk(f, a, b, maxevals=n)[1]"
);

Just like that we can compute the integral, and compare it with our implementation. Let’s use a harder integral to make things more interesting:

The integral of 1 over 1 plus x squared from 0 to 1 is Pi over 4. LaTeX: \int_0^1 \frac{1}{1 + x^2} \text{d}x = \frac{\pi}{4}.

Here is the complete code for this example:
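
A sketch of such a program, assuming aux.jl defines the trapezoid(f, a, b, n) function from the previous examples (the number of evaluations passed to both integrators is arbitrary):

#include <julia.h>
#include <cmath>
#include <iostream>
#include "aux.h"

int main()
{
    jl_init();

    handle_eval_string("include(\"aux.jl\")");  // defines trapezoid
    handle_eval_string("using QuadGK");

    jl_function_t *trapezoid = jl_get_function(jl_main_module, "trapezoid");
    jl_function_t *integrator = handle_eval_string(
        "(f, a, b, n) -> quadgk(f, a, b, maxevals=n)[1]");
    jl_value_t *f = handle_eval_string("x -> 1 / (1 + x^2)");

    // A more careful version would root f and the boxed arguments with JL_GC_PUSH
    jl_value_t *args[4] = {f, jl_box_float64(0.0), jl_box_float64(1.0),
                           jl_box_int64(10000)};

    jl_function_t *methods[2] = {trapezoid, integrator};
    for (int i = 0; i < 2; i++) {
        jl_value_t *result = jl_call(methods[i], args, 4);
        handle_julia_exception();
        double approx = jl_unbox_float64(result);
        std::cout << "Integral of 1 / (1 + x^2) is approx: " << approx
                  << "\nError: " << M_PI / 4 - approx << std::endl;
    }

    jl_atexit_hook(0);
    return 0;
}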

The results you should see are

Integral of 1 / (1 + x^2) is approx: 0.785394
Error: 4.16667e-06
Integral of 1 / (1 + x^2) is approx: 0.785398
Error: -1.11022e-16

6: Using the Distributions package

The package Distributions contains various probability-related tools. We are going to use the Normal distribution’s PDF (Probability Density Function) and CDF (Cumulative Distribution Function) in this example. Don’t worry if you don’t know what these mean, we won’t need to understand the concept, only the formulas.

The Normal distribution with mean Mu (µ) and standard deviation Sigma (σ) has PDF given by

Normal probability density function. LaTeX: f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}

And the CDF of a PDF is

Cumulative density function definition. LaTeX: F(x) = \int_{-\infty}^x f(t) \text{d}t

What we will do is use the Distributions package to access the PDF and compute the CDF integral using QuadGK. We will then compare it to the existing CDF function in Distributions.

Once more, there is not much to it. You only have to create the Normal structure on the Julia side and use Julia closures to define one-argument jl_function_t handles for the PDF and CDF. This is the code:

handle_eval_string("normal = Normal()");
jl_function_t *pdf = handle_eval_string("x -> pdf(normal, x)");
jl_function_t *cdf = handle_eval_string("x -> cdf(normal, x)");

The full code is below

7: Creating a class to wrap the Distributions package

To complicate it a little bit more, let’s create a class wrapping the Distributions package. The basic idea will be a constructor that calls Normal, and C++ functions wrapping pdf and cdf. This can be done simply by having a call to handle_eval_string or by creating the function with jl_get_function and calling jl_call_X.

However, to make it more efficient, we want to avoid frequent calls to the functions that deal with strings. One solution is to store the functions returned by jl_get_function and just use them when necessary. To do that, we will use static members in C++.

The two files below show the implementation of our class:
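
A sketch consistent with that description; the member names, the initialize function name, and the constructor signature (mean and standard deviation) are my own choices:

// Normal.h
#pragma once
#include <julia.h>

class Normal {
public:
    Normal(double mu = 0.0, double sigma = 1.0);
    double pdf(double x) const;
    double cdf(double x) const;

private:
    static void initialize();
    static bool distributions_loaded;
    static jl_function_t *normal_constructor, *pdf_function, *cdf_function;

    jl_value_t *normal;  // the Distributions.Normal object on the Julia side
};

// Normal.cpp
#include "Normal.h"
#include "aux.h"

bool Normal::distributions_loaded = false;
jl_function_t *Normal::normal_constructor = nullptr;
jl_function_t *Normal::pdf_function = nullptr;
jl_function_t *Normal::cdf_function = nullptr;

void Normal::initialize() {
    handle_eval_string("using Distributions");
    normal_constructor = jl_get_function(jl_main_module, "Normal");
    pdf_function = jl_get_function(jl_main_module, "pdf");
    cdf_function = jl_get_function(jl_main_module, "cdf");
    distributions_loaded = true;
}

Normal::Normal(double mu, double sigma) {
    if (!distributions_loaded)
        initialize();
    // Note: a long-lived object should also protect `normal` from Julia's GC
    normal = jl_call2(normal_constructor, jl_box_float64(mu), jl_box_float64(sigma));
    handle_julia_exception();
}

double Normal::pdf(double x) const {
    return jl_unbox_float64(jl_call2(pdf_function, normal, jl_box_float64(x)));
}

double Normal::cdf(double x) const {
    return jl_unbox_float64(jl_call2(cdf_function, normal, jl_box_float64(x)));
}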

As you can see, we keep a distributions_loaded flag to let the constructor know that the static variables can be used. In the initialization function, we define the necessary functions. The actual implementation of the constructor and the PDF and CDF functions is straightforward.

We can use this new class in our main file easily:

Don’t forget to update your Makefile by replacing aux.o with aux.o Normal.o, i.e., add Normal.o next to aux.o. The result of this execution is

x: -4.00e+00  pdf: +4.97e-08  cdf: +1.12e-08
x: -3.00e+00 pdf: +2.73e-06 cdf: +7.07e-07
x: -2.00e+00 pdf: +8.29e-05 cdf: +2.52e-05
x: -1.00e+00 pdf: +1.39e-03 cdf: +5.11e-04
x: +0.00e+00 pdf: +1.30e-02 cdf: +5.95e-03
x: +1.00e+00 pdf: +6.68e-02 cdf: +4.04e-02
x: +2.00e+00 pdf: +1.90e-01 cdf: +1.64e-01
x: +3.00e+00 pdf: +3.00e-01 cdf: +4.18e-01
x: +4.00e+00 pdf: +2.62e-01 cdf: +7.13e-01

8: Linear algebra: Arrays, Vectors, and Matrices

Let’s start our linear algebra exploration with a matrix-vector multiplication and solving a linear system. We will define the following:

x is a vector of ones and A is a matrix with n in the diagonal, 1 below the diagonal and -1 above the diagonal. LaTeX: x = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, A = \begin{bmatrix} n & -1 & -1 & \cdots & -1 \\ 1 & n & -1 & \cdots & -1 \\ 1 & 1 & n & \cdots & -1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & n \end{bmatrix}

Let’s start with some code:
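
A sketch of that code (the type variables vector_type and matrix_type are my names; n is the problem size, assumed to be defined earlier):

jl_value_t *vector_type = jl_apply_array_type((jl_value_t *) jl_float64_type, 1);
jl_value_t *matrix_type = jl_apply_array_type((jl_value_t *) jl_float64_type, 2);

jl_array_t *x = jl_alloc_array_1d(vector_type, n);
jl_array_t *y = jl_alloc_array_1d(vector_type, n);
jl_array_t *A = jl_alloc_array_2d(matrix_type, n, n);

JL_GC_PUSH3(&x, &y, &A);

double *xData = (double *) jl_array_data(x);
double *AData = (double *) jl_array_data(A);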

The first two statements define the vector and matrix types. Notice that we make explicit that the first has 1 dimension and the second has 2 dimensions.

The next 3 statements allocate the memory for the two vectors x and y, and the matrix A, using the array types we previously defined.

Finally, we have a JL_GC_PUSH3, which informs Julia’s Garbage Collector to not touch this memory. Naturally, we will have to pop these eventually.

Lastly, we declare C arrays pointing to the Julia data. You will notice that AData is a 1-dimensional array because Julia implements dense matrices as a linearized array stored by columns. That means that the element (i, j) will be at the linearized position i + j * nrows (using 0-based indexing).

To fill the values of the vector x and the matrix A, we can use the code below:
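
A sketch, using the column-major indexing described above:

for (size_t i = 0; i < n; i++) {
    xData[i] = 1.0;
    for (size_t j = 0; j < n; j++) {
        if (i == j)
            AData[i + j * n] = n;       // diagonal
        else if (i > j)
            AData[i + j * n] = 1.0;     // below the diagonal
        else
            AData[i + j * n] = -1.0;    // above the diagonal
    }
}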

The product of A and x is pretty much the same as any function we had so far.
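
For instance, a sketch using the * function fetched from Base (the Ax and AxData names match the ones used below):

jl_function_t *mul = jl_get_function(jl_base_module, "*");
jl_array_t *Ax = (jl_array_t *) jl_call2(mul, (jl_value_t *) A, (jl_value_t *) x);
handle_julia_exception();
double *AxData = (double *) jl_array_data(Ax);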

The noteworthy part of this code is that we have to cast the arrays to jl_value_t * to use them as arguments to jl_call2, and the output is cast to jl_array_t *. Similarly, we can use jl_array_data(Ax) to access the contents of the product.

We can also use mul! to compute the product in place, i.e., without allocating more memory:
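
A sketch (mul! becomes visible in Main once LinearAlgebra is loaded):

handle_eval_string("using LinearAlgebra");
jl_function_t *mul_in_place = jl_get_function(jl_main_module, "mul!");
jl_call3(mul_in_place, (jl_value_t *) y, (jl_value_t *) A, (jl_value_t *) x);
handle_julia_exception();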

Notice that we use jl_main_module because mul! is part of LinearAlgebra.

Finally, we move on to solving the linear system. To do that, let’s use the LU factorization and the ldiv! function. The \ (backslash) operator is usually used here, but we choose ldiv! to solve the linear system in place.

jl_function_t *lu_fact = jl_get_function(jl_main_module, "lu");
jl_value_t *LU = jl_call1(lu_fact, (jl_value_t *) A);
jl_function_t *ldiv = jl_get_function(jl_main_module, "ldiv!");
jl_call3(ldiv, (jl_value_t *) y, LU, (jl_value_t *) Ax);

The last call defines y as the solution of the linear system Ay = (Ax). Since A is non-singular, we expect y and x to be sufficiently close (numerical errors could appear here). We can verify this using

double *yData = (double *) jl_array_data(y);
double norm2 = 0.0;
for (size_t i = 0; i < n; i++) {
    double dif = yData[i] - xData[i];
    norm2 += dif * dif;
}
cout << "|x - y|² = " << norm2 << endl;

My result was 6.48394e-26.

To finalize this code, we have to run

JL_GC_POP();

This allows the Julia Garbage Collector to collect the allocated memory. The complete code can be seen below:

9: Sparse matrices

For our next example, we will solve a heat equation in one spatial dimension, using a discretization of time and space called Backward Time Centered Space (BTCS), which is not quick to explain. Check these notes for a thorough explanation.

For our interests, it suffices to say that we will be solving a sparse linear system multiple times, where the matrix is the one below:

Tridiagonal matrix, where the diagonal stores 1 plus 2 times kappa, and the off-diagonal values are -kappa. LaTeX: A = \begin{bmatrix} 1 + 2\kappa & -\kappa \\ -\kappa & 1 + 2\kappa & -\kappa \\ & \ddots & \ddots & \ddots \\ & & -\kappa & 1 + 2\kappa & -\kappa \\ & & & -\kappa & 1 + 2\kappa \end{bmatrix}

We don’t have to store this matrix as a dense matrix (like in the previous example). Instead, we want to store only the relevant elements. To do that, we will create three vectors: one for the row indexes, one for the column indexes, and one for the values at those positions.

The code is below:

long int rows[3 * n - 2], cols[3 * n - 2];
double vals[3 * n - 2];
for (size_t i = 0; i < n; i++) {
    rows[i] = i + 1;
    cols[i] = i + 1;
    vals[i] = (1 + 2 * kappa);
    if (i < n - 1) {
        rows[n + i] = i + 1;
        cols[n + i] = i + 2;
        vals[n + i] = -kappa;
        rows[2 * n + i - 1] = i + 2;
        cols[2 * n + i - 1] = i + 1;
        vals[2 * n + i - 1] = -kappa;
    }
}

Now, we will create a sparse matrix using the sparse function from the SparseArrays module in Julia. For that, we allocate two array types, one for the integers, and one for the floating point numbers.
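
A sketch of these two steps, assuming rows, cols, and vals are the C arrays filled above (the type variable names are mine):

handle_eval_string("using SparseArrays");

jl_value_t *int_vector_type = jl_apply_array_type((jl_value_t *) jl_int64_type, 1);
jl_value_t *float_vector_type = jl_apply_array_type((jl_value_t *) jl_float64_type, 1);

jl_function_t *sparse = jl_get_function(jl_main_module, "sparse");
jl_value_t *A_sparsematrix = jl_call3(
    sparse,
    (jl_value_t *) jl_ptr_to_array_1d(int_vector_type, rows, 3 * n - 2, 0),
    (jl_value_t *) jl_ptr_to_array_1d(int_vector_type, cols, 3 * n - 2, 0),
    (jl_value_t *) jl_ptr_to_array_1d(float_vector_type, vals, 3 * n - 2, 0));
handle_julia_exception();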

In the jl_call3, we also call jl_ptr_to_array_1d to directly create and return a Julia vector wrapping the data we give it.

The A_sparsematrix is a Julia sparse matrix. Many of the matrix operations that work with dense matrices will work with sparse matrices. To test a different factorization, let’s use the function ldl from the LDLFactorizations package.

Now, we can use ldiv! with ldlObj instead of the LU factorization that we used in the previous example. There is one catch, though. Since we are using ldlObj “for a while”, we need to prevent the Garbage Collector from cleaning it up. But the JL_GC_PUSHX macro can only be used once per scope. Therefore, to use it again we have to create an internal scope, something like the following:
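
A sketch of that internal scope (assuming LDLFactorizations has been installed like QuadGK and Distributions before):

handle_eval_string("using LDLFactorizations");
jl_function_t *ldl = jl_get_function(jl_main_module, "ldl");
{
    // Internal scope, so that JL_GC_PUSH1/JL_GC_POP can be paired here
    jl_value_t *ldlObj = jl_call1(ldl, A_sparsematrix);
    JL_GC_PUSH1(&ldlObj);

    // ... the time-stepping loop calling ldiv! with ldlObj goes here ...

    JL_GC_POP();
}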

The complete code is below:

In the algorithm, we define u as the initial vector, then solve the linear system with u as the right-hand side to obtain unew. Then we assign unew to u and repeat. Each u is an approximation to the solution of the heat equation at a specific moment in time.

You will notice that, in addition to computing the solution, we also plot it using the Plots package, at several moments in time. This makes the code much slower, unfortunately. The result can be seen below:

The image shows the solution starting from the function given by the exponential of the negative of the distance from the center. The other curves show the decay of this function, each one flatter and more similar to a symmetric quadratic with negative curvature.
Plot of heat equation solution at different moments in time.

Finalizing and open questions

I hope these 10 examples are helpful to get you started with embedding Julia in C. There are many more things not covered here, in particular things I do not know. Some of them are:

  • How to deal with strings?
  • How to deal with keyword arguments?
  • How to deal with installing packages and environments?
  • How to make it faster (e.g., using precompiled images)?

I will be on the lookout for future projects to investigate these. In the meantime, like and follow for more Julia and C/C++ content.



10 examples of embedding Julia in C/C++ was originally published in Netherlands eScience Center on Medium, where people are continuing the conversation by highlighting and responding to this story.

Direct convolution

By: Picaud Vincent

Re-posted from: https://pixorblog.wordpress.com/2016/07/17/direct-convolution/

For small kernels, direct convolution beats the FFT-based one. I present here a basic implementation. This implementation allows computing

\gamma[k]=\sum\limits_{i\in\Omega^\alpha}\alpha[i]\beta[k+\lambda i],\text{ with }\lambda\in\mathbb{Z}^*\text{\ \ \ \ \ \ \ \ \ \ \ \ (1)}

From time to time we will use the notation \gamma=\alpha\bigodot\limits_\lambda\beta.

An arbitrary stride \lambda has been introduced; for instance, \lambda=-1 corresponds to the usual convolution (see the usage examples at the end of this post).

Also note that with proper boundary extension (periodic and zero padding essentially), changing the sign of \lambda gives the adjoint operator:

\langle \delta, \alpha\bigodot\limits_\lambda\beta \rangle = \langle \alpha\bigodot\limits_{-\lambda}\delta,\beta \rangle

Disclaimer

Maybe the following is overwhelmingly detailed for a simple task like Eq. (1), but I have found it worthwhile to write this down once and for all. Maybe it can be useful for someone else.

Some notations

We denote by \Omega our vector domain (or support); for instance

\Omega^\alpha =\llbracket i_{\min} ,i_{\max} \rrbracket

means that \alpha[i] is defined for

i\in \llbracket i_{\min}, i_{\max} \rrbracket

To get interval lower/upper bounds we use the notation

i_{\min}=l(\Omega^\alpha)\text{ and }i_{\max}=u(\Omega^\alpha)

We denote by \lambda\Omega the scaled domain \Omega defined by:

\lambda\Omega=\llbracket \lambda^+\,i_{min}+\lambda^-\,i_{max},  \lambda^+\,i_{max}+\lambda^-\,i_{min} \rrbracket

where \lambda^+=\max{(0,\lambda)} and \lambda^-=\min{(0,\lambda)}

Finally we use A\setminus B the relative complement of B with respect to the set A defined by

A\setminus B = \{ i\ /\ (i\in A) \wedge (i\notin B) \}

This set is not necessarily connected; however, since we are working in \mathbb{Z}, it is sufficient to introduce the left and right parts (which can be empty)

(A\setminus B)_{\text{Left}}=\llbracket  l(A), \min{(u(A),l(B)-1)} \rrbracket

(A\setminus B)_{\text{Right}}=\llbracket \max{(l(A),u(B)+1)}, u(A) \rrbracket

Goal

Given two vectors \alpha, \beta defined on \Omega^\alpha, \Omega^\beta we want to define and implement an algorithm that computes \gamma[k] for k\in\Omega^\gamma.

First step, no boundary extension

We need to define \Omega^\gamma_1, the domain that does not violate the domain of definition of \beta. This can be expressed as

\Omega^\gamma_1=\{k\in\mathbb{Z}\ /\ \forall i \in \Omega^\alpha \Rightarrow k+\lambda i \in \Omega^\beta \}

Let’s write the details,

(\forall i \in \Omega^\alpha  \Rightarrow k+\lambda i \in \Omega^\beta)\Leftrightarrow (\forall i \in \Omega^\alpha \Rightarrow l(\Omega^\beta)-\lambda i \le k \le u(\Omega^\beta)-\lambda i)

\Leftrightarrow \max\limits_{i\in \Omega^\alpha} l(\Omega^\beta)-\lambda i \le k \le \min\limits_{i\in \Omega^\alpha} u(\Omega^\beta)-\lambda i

\Leftrightarrow l(\Omega^\beta)-l(\lambda \Omega^\alpha) \le k \le u(\Omega^\beta)-u(\lambda \Omega^\alpha)

hence we have

\boxed{   \Omega^\gamma_1=\llbracket  l(\Omega^\beta)-l(\lambda \Omega^\alpha) , u(\Omega^\beta)-u(\lambda \Omega^\alpha) \rrbracket   }

Thus the computation of \gamma[k],\ k\in\Omega^\gamma (Eq. 1) is split into two parts:

  • one part \Omega^\gamma \cap \Omega^\gamma_1 free of boundary effect,
  • one part \Omega^\gamma \setminus \Omega^\gamma_1 that requires boundary extension \tilde{\beta}=\Phi(\beta,k)

The algorithm takes the following form:

(Figure: latex-test.png, the algorithm in pseudo-code form.)

Second step, boundary extensions

Usually we define some classical boundary extensions. These extensions are computed from \beta[.] and are sometimes subject to a validity condition. For better clarity I give explicit lower/upper bounds:

\Omega^\beta = \llbracket  j_{\min} , j_{\max} \rrbracket \neq \emptyset

Left boundary (j<j_{min}), \tilde{\beta}_j = \Phi_{left}(\beta,j):

  • Mirror: \tilde{\beta}_j = \beta[2\,j_{min}-j], valid for 2\,j_{min}-j_{max} \le j
  • Periodic (or cyclic): \tilde{\beta}_j = \beta[j_{max}-j_{min}+j+1], valid for 2\,j_{min}-j_{max}-1 \le j
  • Constant: \tilde{\beta}_j = \beta[j_{min}], no validity condition
  • Zero padding: \tilde{\beta}_j = 0, no validity condition

Right boundary (j>j_{max}), \tilde{\beta}_j = \Phi_{right}(\beta,j):

  • Mirror: \tilde{\beta}_j = \beta[2\,j_{max}-j], valid for j \le 2\,j_{max}-j_{min}
  • Periodic (or cyclic): \tilde{\beta}_j = \beta[-j_{max}+j_{min}+j-1], valid for j \le 2\,j_{max}-j_{min}+1
  • Constant: \tilde{\beta}_j = \beta[j_{max}], no validity condition
  • Zero padding: \tilde{\beta}_j = 0, no validity condition

As we want something general we want to get rid of these validity conditions.

Periodic case

Starting from a vector \beta defined on \llbracket L=0, U \rrbracket we want to define a periodic function \tilde{\beta} of period T=U+1. This function must fulfill the \tilde{\beta}[j+T]=\tilde{\beta}[j] relation.

We can do that by considering \tilde{\beta}=\beta \circ \phi^P_U(j) where

\phi^P_U(j)=\bmod_F(j,U+1)

and \bmod_F is the modulus function associated to a floored division.

For a vector defined on an arbitrary domain \llbracket j_{\min}, j_{\max} \rrbracket, we first translate the indices

\tau_{j_{\min}}(j)=j-j_{\min}

and then translate them back using \tau^{(-1)}_{j_{\min}}=\tau_{-j_{\min}}

Putting all together, we build a periodized vector

\boxed{\tilde{\beta} = \beta \circ \phi^P_{j_{\min},j_{\max}}}

where

\phi^P_{j_{\min},j_{\max}} = \tau^{(-1)}_{j_{\min}} \circ  \phi^P_{j_{\max}- j_{\min}} \circ \tau_{j_{\min}}

\boxed{\phi^P_{j_{\min},j_{\max}} = j_{\min} + \bmod_F(j-j_{\min},j_{\max}- j_{\min}+1)}

Mirror Symmetry case

Starting from a vector \beta defined on \llbracket L=0, U \rrbracket we can extend it by mirror symmetry on \llbracket U+1, 2U \rrbracket using \tilde{\beta}=\beta\circ \phi^M_U with

\phi^M_U(j)=U-|U-j|

The resulting vector \tilde{\beta}=\beta\circ \phi^M_U fulfills the \tilde{\beta}[U-j]=\tilde{\beta}[U+j] relation for j\in \llbracket 0, U \rrbracket.

To get a “global” definition we then periodize it on \llbracket 0, 2U-1 \rrbracket using \phi^P_{2U-1} (attention 2U-1 and not 2U, otherwise the component 0 is duplicated!).

For an arbitrary domain \llbracket j_{\min}, j_{\max} \rrbracket we use index translation as for the periodic case. Putting everything together we get:

\boxed{\tilde{\beta} = \beta \circ \phi^M_{j_{\min},j_{\max}}}

where

\phi^M_{j_{\min},j_{\max}} =  \tau^{(-1)}_{j_{\min}} \circ \phi^M_{j_{\max}- j_{\min}} \circ  \phi^P_{2(j_{\max}- j_{\min})-1} \circ \tau_{j_{\min}}

\boxed{ \phi^M_{j_{\min},j_{\max}} =j_{\max}-|j_{\max}-j_{\min}-\bmod_F(j-j_{\min},2(j_{\max}-j_{\min}))| }

Boundary extensions

To use the algorithm with boundary extensions, you only have to define:

\tilde{\beta}=\Phi(\beta,k+\lambda i)=\beta[\phi^X[k+\lambda i]]

where X is the boundary extension you have chosen (periodic, constant…). You do not have to take care of any validity condition; these formulas are general.

Implementation

This is a straightforward implementation following the presented formulas as closely as possible. We did not try to optimize it, as this would have obscured the presentation. Some ideas: reverse \alpha for \lambda<0 (to access memory in the right order), use SIMD, use C++ meta-programming with loop unrolling for a fixed \alpha size, specialize with respect to Vector/StridedVector or \lambda=\pm 1…

Preamble

Index translation / domain definition

There is however one last thing we have to explain. In languages like Julia, C… we are manipulating arrays having a common starting index: 1 in Julia, Fortran… or 0 in C, C++…

For this reason we do not manipulate \alpha on \Omega^\alpha but another, translated array \tilde{\alpha} defined on \llbracket 1, N^\alpha \rrbracket (Julia) or \llbracket 0, N^\alpha-1 \rrbracket (C++).

To cover all cases, I assume that the starting index is denoted by \tilde{i}_0.

The array \tilde{\alpha} is defined by:

\alpha[i] =  \tilde{\alpha}[\tilde{i}] = \tilde{\alpha}[i-l(\Omega^\alpha)+\tilde{i}_0]

Hence we must modify the initial Eq. (1) to use \tilde{\alpha} instead of \alpha

\gamma[k]=\sum\limits_{i\in\Omega^\alpha}\alpha[i]\beta[k+\lambda i] = \sum\limits_{i\in\Omega^\alpha}\tilde{\alpha}[i-l(\Omega^\alpha)+\tilde{i}_0]\beta[k+\lambda i]

With \tilde{i}=i-l(\Omega^\alpha)+\tilde{i}_0 we have

i\in\Omega^\alpha \Leftrightarrow \tilde{i}\in\llbracket \tilde{i}_0,u(\Omega^\alpha)-l(\Omega^\alpha)+\tilde{i}_0 \rrbracket

and

k+\lambda i = k+ \lambda \tilde{i} + \underbrace{\lambda (l(\Omega^\alpha) - \tilde{i}_0)}_{\beta\_\text{offset}}

Thus, Eq (1) becomes:

\boxed{ \gamma[k]=\sum\limits_{\tilde{i}=\tilde{i}_0}^{u(\Omega^\alpha)-l(\Omega^\alpha)+\tilde{i}_0}\tilde{\alpha}[\tilde{i}]\beta[k+ \lambda \tilde{i} + \lambda (l(\Omega^\alpha) - \tilde{i}_0)]}

The 2 other arrays are less problematic:

  • For \beta array, which is our input array, we implicitly use \Omega^\beta = \llbracket \tilde{i}_0, \tilde{i}_0 + \text{length}(\beta) - 1 \rrbracket. This does not reduce the generality of the subroutine.
  • For \gamma which is the output array, as for \beta we assume it is defined on \llbracket \tilde{i}_0, \tilde{i}_0 +    \text{length}(\gamma) - 1 \rrbracket, but we provide \Omega^\gamma\subset \llbracket \tilde{i}_0, \tilde{i}_0 +    \text{length}(\gamma) - 1 \rrbracket to define the components we want to compute. The other components, \llbracket \tilde{i}_0, \tilde{i}_0 +    \text{length}(\gamma) - 1 \rrbracket \setminus \Omega^\gamma, will remain unmodified by the subroutine.

Definition of \alpha\_\text{offset}

As we have seen before, the convolution subroutine will have \tilde{\alpha} as argument, but we also need \Omega^\alpha. For the driver subroutine we do not directly provide this interval because its length is redundant with \tilde{\alpha} length. Instead we provide an \alpha\_\text{offset} offset. \Omega^\alpha is deduced from:

\Omega^\alpha = \llbracket -\alpha\_\text{offset}, -\alpha\_\text{offset} + \text{length}(\tilde{\alpha}) -1 \rrbracket

Note: this definition does not depend on \tilde{i}_0.

With \alpha\_\text{offset}=0 you are in the “usual situation”. If you have a window of size 2n+1, taking \alpha\_\text{offset}=n gives the middle of the window. Here, in the Fig. below, the graphical representation of an arbitrary case: a filter of size 4, with \alpha\_\text{offset}=2 and \lambda=3.

(Figure: a_offset.png)

Julia

Auxiliary subroutines

We start by defining the basic operations on sets:

function scale(λ::Int64, Ω::UnitRange)
    ifelse(λ>0,
           UnitRange(λ*start(Ω),λ*last(Ω)),
           UnitRange(λ*last(Ω),λ*start(Ω)))
end

function compute_Ωγ1(Ωα::UnitRange,
                     λ::Int64,
                     Ωβ::UnitRange)
    
    λΩα = scale(λ,Ωα)

    UnitRange(start(Ωβ)-start(λΩα),
              last(Ωβ)-last(λΩα))
end

# Left & Right relative complements A\B
#
function relelativeComplement_left(A::UnitRange,
                                   B::UnitRange)
    UnitRange(start(A),
              min(last(A),start(B)-1))
end

function relelativeComplement_right(A::UnitRange,
                                    B::UnitRange)
    UnitRange(max(start(A),last(B)+1),
              last(A))
end

Boundary extensions

We then define the boundary extensions. Nothing special there, we only had to check that the Julia mod(x,y) function is the floored division version (as opposed to the rem(x,y) function, which is the round-toward-zero division version).

const tilde_i0 = Int64(1)

function boundaryExtension_zeroPadding{T}(β::StridedVector{T},
                                          k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1
    
    if (k>=kmin)&&(k<=kmax)
        β[k]
    else
        T(0)
    end
end

function boundaryExtension_constant{T}(β::StridedVector{T},
                                       k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1

    if k<kmin
        β[kmin]
    elseif k<=kmax
        β[k]
    else
        β[kmax]
    end
end

function boundaryExtension_periodic{T}(β::StridedVector{T},
                                       k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1

    β[kmin+mod(k-kmin,1+kmax-kmin)]
end

function boundaryExtension_mirror{T}(β::StridedVector{T},
                                     k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1

    β[kmax-abs(kmax-kmin-mod(k-kmin,2*(kmax-kmin)))]
end

# For the user interface
#
boundaryExtension = 
    Dict(:ZeroPadding=>boundaryExtension_zeroPadding,
         :Constant=>boundaryExtension_constant,
         :Periodic=>boundaryExtension_periodic,
         :Mirror=>boundaryExtension_mirror)

Main subroutine

Finally we define the main subroutine. Its arguments have been defined in the preamble part. I just added @simd & @inbounds annotations because this has a significant impact on performance (see the end of this post).

function direct_conv!{T}(tilde_α::StridedVector{T},
                         Ωα::UnitRange,
                         λ::Int64,
                         β::StridedVector{T},
                         γ::StridedVector{T},
                         Ωγ::UnitRange,
                         LeftBoundary::Symbol,
                         RightBoundary::Symbol)
    # Sanity check
    @assert λ!=0
    @assert length(tilde_α)==length(Ωα)
    @assert (start(Ωγ)>=1)&&(last(Ωγ)<=length(γ))

    # Initialization
    Ωβ = UnitRange(1,length(β))
    tilde_Ωα = 1:length(Ωα)
    
    for k in Ωγ
        γ[k]=0 
    end

    rΩγ1=intersect(Ωγ,compute_Ωγ1(Ωα,λ,Ωβ))
    
    # rΩγ1 part: no boundary effect
    #
    β_offset = λ*(start(Ωα)-tilde_i0)
    @simd for k in rΩγ1
        for i in tilde_Ωα
            @inbounds γ[k]+=tilde_α[i]*β[k+λ*i+β_offset]
        end
    end

    # Left part
    #
    rΩγ1_left = relelativeComplement_left(Ωγ,rΩγ1)
    Φ_left = boundaryExtension[LeftBoundary]
    
    for k in rΩγ1_left
        for i in tilde_Ωα
            γ[k]+=tilde_α[i]*Φ_left(β,k+λ*i+β_offset)
        end
    end

    # Right part
    #
    rΩγ1_right = relelativeComplement_right(Ωγ,rΩγ1)
    Φ_right = boundaryExtension[RightBoundary]
    
    for k in rΩγ1_right
        for i in tilde_Ωα
            γ[k]+=tilde_α[i]*Φ_right(β,k+λ*i+β_offset)
        end
    end
end

# Some UI functions, γ inplace modification 
#
function direct_conv!{T}(tilde_α::StridedVector{T},
                         α_offset::Int64,
                         λ::Int64,

                         β::StridedVector{T},

                         γ::StridedVector{T},
                         Ωγ::UnitRange,
                         
                         LeftBoundary::Symbol,
                         RightBoundary::Symbol)

    Ωα = UnitRange(-α_offset,
                   length(tilde_α)-α_offset-1)
    
    direct_conv!(tilde_α,
                 Ωα,
                 λ,
                 
                 β,

                 γ,
                 Ωγ,

                 LeftBoundary,
                 RightBoundary)
end

# Some UI functions, allocates γ 
#
function direct_conv{T}(tilde_α::StridedVector{T},
                        α_offset::Int64,
                        λ::Int64,

                        β::StridedVector{T},

                        LeftBoundary::Symbol,
                        RightBoundary::Symbol)

    γ = Array{T,1}(length(β))
    
    direct_conv!(tilde_α,
                 α_offset,
                 λ,

                 β,

                 γ,
                 UnitRange(1,length(γ)),

                 LeftBoundary,
                 RightBoundary)

    γ
end

In C/C++

As this post is already long I will not provide a complete code here. The only trap is to use the right mod function.

The C/C++ modulus operator % is not fully standardized. Only the D%d=D-d*(D/d) relation is guaranteed, which ties % to the integer division. On the other hand, a lot of CPUs (x86 idiv…) truncate toward zero; as a consequence C/C++ generally uses this direction.

To be sure, we have to explicitly use our F-mod function:

// Floored mod
int modF(int D, int d)
{
    int r = D % d;  // remainder of the truncated (toward zero) division
    if ((r > 0 && d < 0) || (r < 0 && d > 0)) r = r + d;
    return r;
}


Usages examples

Basic usages

Beware that, due to the asymmetric role of \alpha and \beta, the proposed approach does not preserve all the mathematical properties of the \alpha\bigodot\limits_\lambda\beta operator.

  • Commutativity:

\alpha\bigodot\limits_{\lambda=-1}\beta=\beta\bigodot\limits_{\lambda=-1}\alpha

only for ZeroPadding

  • Adjoint operator:

\forall \lambda\in\mathbb{Z}^*,\ \langle \alpha\bigodot\limits_{\lambda}v ,w \rangle_E = \langle v , \alpha\bigodot\limits_{-\lambda} w \rangle_F

only for ZeroPadding and Periodic

  • I have assumed \mathbb{R} arrays (not \mathbb{C} ones): some conjugations are missing
  • Not considered here, but extension to n-dimensional & separable filters is immediate
push!(LOAD_PATH,"./")
using DirectConv

α=rand(4);
β=rand(10);

# Check adjoint operator
# -> restricted to ZeroPadding & Periodic
#    (asymmetric role of α and β)
#    
vβ=rand(length(β))
d1=dot(direct_conv(α,2,-3,vβ,:ZeroPadding,:ZeroPadding),β)
d2=dot(direct_conv(α,2,+3,β,:ZeroPadding,:ZeroPadding),vβ)

@assert abs(d1-d2)<sqrt(eps())

d1=dot(direct_conv(α,-1,-3,vβ,:Periodic,:Periodic),β)
d2=dot(direct_conv(α,-1,+3,β,:Periodic,:Periodic),vβ)

@assert abs(d1-d2)<sqrt(eps())

# Check commutativity 
# -> λ = -1 (convolution) and
#    restricted to ZeroPadding
#    (asymmetric role of α and β)
v1=zeros(20)
v2=zeros(20)
direct_conv!(α,0,-1,
             β,v1,UnitRange(1,20),:ZeroPadding,:ZeroPadding)
direct_conv!(β,0,-1,
             α,v2,UnitRange(1,20),:ZeroPadding,:ZeroPadding)

@assert (norm(v1-v2)<sqrt(eps()))

# Check Interval splitting
# (should work for any boundary extension type)
#
γ=direct_conv(α,3,2,β,:Mirror,:Periodic) # global computation
Γ=zeros(length(γ))
Ω1=UnitRange(1:3)
Ω2=UnitRange(4:length(γ))
direct_conv!(α,3,2,β,Γ,Ω1,:Mirror,:Periodic) # compute on Ω1
direct_conv!(α,3,2,β,Γ,Ω2,:Mirror,:Periodic) # compute on Ω2

@assert (norm(γ-Γ)<sqrt(eps()))

Performance?

In a previous post I gave a short derivation of the Savitzky-Golay filters. I used a FFT based convolution to apply the filters. It is interesting to compare the performance of the presented direct approach vs the FFT one.

push!(LOAD_PATH,"./")
using DirectConv

function apply_filter{T}(filter::StridedVector{T},signal::StridedVector{T})

    @assert isodd(length(filter))

    halfWindow = round(Int,(length(filter)-1)/2)
    
    padded_signal = 
        [signal[1]*ones(halfWindow);
         signal;
         signal[end]*ones(halfWindow)]

    filter_cross_signal = conv(filter[end:-1:1],
                               padded_signal)

    filter_cross_signal[2*halfWindow+1:end-2*halfWindow]
end

# Now we can create a (very) rough benchmark
M=Array(Float64,0,3)
β=rand(1000000);
for halfWidth in 1:2:40
    α=rand(2*halfWidth+1);

    fft_t0 = time()
    fft_v = apply_filter(α,β)
    fft_t1 = time()

    direct_t0 = time()
    direct_v = direct_conv(α,halfWidth,1,β,
                           :Constant,:Constant)
    direct_t1 = time()

    @assert (norm(fft_v -direct_v)<sqrt(eps()))

    M=vcat(M,
           Float64[length(α) (fft_t1-fft_t0)*1e3 (direct_t1-direct_t0)*1e3])
end
M

(Figures: cpu_time.png and ratio_cpu_time.png, the CPU times of the FFT-based and direct convolutions and their ratio, as a function of the filter length.)

We see that for small filters the direct method can easily be 10 times faster than the FFT approach!

Conclusion: for small filters, use a direct approach!

Discussion

Optimization/performance

If I have time I will try to benchmark two basic implementations, a Julia one vs a C/C++ one. I'm a beginner in the Julia language; with C++ I'm more at home.

I would be curious to see the difference between a basic implementation and an optimized one in Julia. Just to see how optimization can obfuscate (or not) the initial code and the performance gain. In C++ you generally have a lot of boiler-plate code (meta-programming…).

Applications

The basic Eq. (1) is a common tool that can be used for:

  • deconvolution procedures,
  • decimated and undecimated wavelet transforms,
  • …

For wavelet transforms, especially the undecimated one, AFAIK Eq. (1) is really the right choice. I will certainly write some posts on this stuff.


Code

The code is on github.

Complement: more domains

The \Omega^\gamma_2 domain

We have introduced \Omega^\gamma_1 the domain that does not violate \beta domain of definition (given \Omega^\alpha and \Omega^\beta).

To be exhaustive we can introduce \Omega^\gamma_2, the domain that uses at least one (i,k+\lambda i)\in \Omega^\alpha \times \Omega^\beta.

This domain is:

\Omega^\gamma_2=\{ k\in\mathbb{Z}\ /\ \exists i \in \Omega^\alpha \Rightarrow k+\lambda i \in \Omega^\beta \}

following arguments similar to those used for \Omega^\gamma_1 we get:

\boxed{  \Omega^\gamma_2=\llbracket  l(\Omega^\beta)-u(\lambda \Omega^\alpha) , u(\Omega^\beta)-l(\lambda \Omega^\alpha) \rrbracket  }

The \Omega^\beta_{2'} domain

We can also ask the “dual” question: given \Omega^\alpha and \Omega^\gamma, what is the domain of \beta, \Omega^\beta_{2'}, involved in the computation of \gamma?

By definition, this domain must fulfill the following relation:

\Omega^\gamma_2(\Omega^\beta_{2'})=\Omega^\gamma

hence, using the previous result

\llbracket  l(\Omega^\beta_{2'})-u(\lambda \Omega^\alpha) , u(\Omega^\beta_{2'})-l(\lambda \Omega^\alpha) \rrbracket = \llbracket l(\Omega^\gamma),u(\Omega^\gamma) \rrbracket

which gives:

\boxed{ \Omega^\beta_{2'} = \llbracket l(\Omega^\gamma)+u(\lambda \Omega^\alpha),u(\Omega^\gamma)+l(\lambda \Omega^\alpha) \rrbracket }

Some remarks about min()/max() functions

By: Picaud Vincent

Re-posted from: https://pixorblog.wordpress.com/2016/06/27/some-remarks-about-minmax-functions/

Computing the min (or max) of two values looks like a trivial task.

For C++ the default Standard Template Library implementation is

template <typename _Tp>
inline const _Tp& min(const _Tp& __a, const _Tp& __b)
{
    return __b < __a ? __b : __a;
}

However, for numerical computations this definition does not mix well with the IEEE 754 standard when dealing with potential NaN values.

For instance, min() is not commutative and is not equivalent to cmath::fmin().

We have:

#include <iostream>
#include <cmath>
#include <limits>
#include <cassert>

using namespace std;

int main()
{
    const auto x_nan = numeric_limits<double>::quiet_NaN();
    const auto x_1 = 1.;

    cout << boolalpha;

    cout << "\n\ncommutativity?";

    cout << "\n min: "
     << (min(x_1, x_nan) == min(x_nan, x_1));

    cout << "\nfmin: "
     << (fmin(x_1, x_nan) == fmin(x_nan, x_1));
}

The output is:

commutativity?
 min: false
fmin: true

The IEEE 754 says:

In section 6.2 of the revised IEEE 754-2008 standard there are two anomalous functions (the maxnum and minnum functions that return the maximum of two operands that are expected to be numbers) that favor numbers — if just one of the operands is a NaN then the value of the other operand is returned.

On the opposite side, you can read about the C++ rules here:

NaN values have the curious property that they compare as “unordered” with all values, even with themselves. In other words, if x is a NaN, and y is any floating-point value, NaN or not, then x<y, x>y, x<=y, x>=y, and x==y are all false, and x!=y is true.

Remark: in C/C++ that is the reason why you can implement the isnan function as

bool isnan(const double x) { return x!=x;}

Using min/max in numerical code

C++

Since the IEEE 754 and C++ standards do not mix well, I personally redefine the min()/max() functions as follows:

#include <type_traits>
#include <cmath>

namespace libraryNamespace
{
    template <typename T>
    inline std::enable_if_t<std::is_floating_point<T>::value, T>
      min(const T& t1, const T& t2)
    {
        return std::fmin(t1, t2);
    }

    template <typename T>
    constexpr std::enable_if_t<std::is_integral<T>::value, const T&>
      min(const T& t1, const T& t2)
    {
        return (t1 < t2) ? t1 : t2;
    }
    // max() function follows the same scheme
}

Julia

Julia follows the same scheme I use in C++: there is a specialization for floating point numbers in julia/base/math.jl

min{T<:AbstractFloat}(x::T, y::T) =
    ifelse((y < x) | (signbit(y) > signbit(x)),
           ifelse(isnan(y), x, y), ifelse(isnan(x), y, x))

and a generic definition in julia/base/operator.jl

min(x,y) = ifelse(y < x, y, x)

We can check that the Float specialization follows the IEEE 754 rule:

(max(5,NaN)==max(NaN,5))&&(max(5,NaN)==5)
true

What is the cost?

Julia

There is a big difference between the simple definition in julia/base/operator.jl and the NaN aware code of julia/base/math.jl.

We can have a look at the assembly code.

I would like to thank Erik Schnetter who made me notice that the Julia 0.4.6 @code_native macro was bugged (see his comment at the end of this post). Hence, to get the same result you must have a recent Julia version. In this post I use:

versioninfo()
Julia Version 0.5.0-dev+5453
Commit 1fd440e (2016-07-15 23:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, westmere)

With this Julia version you get the following asm codes:

@code_native(min(1,2))

gives

	.text
Filename: promotion.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 257
	cmpq	%rdi, %rsi
	cmovgeq	%rdi, %rsi
	movq	%rsi, %rax
	popq	%rbp
	retq

whereas

@code_native min(1.0,2.0)

gives

	.text
Filename: math.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 203
	ucomisd	%xmm1, %xmm0
	seta	%al
	movmskpd	%xmm1, %ecx
	movmskpd	%xmm0, %edx
	andl	$1, %edx
	xorb	$1, %dl
	andb	%cl, %dl
	orb	%al, %dl
	je	L54
	movapd	%xmm1, %xmm2
	cmpordsd	%xmm2, %xmm2
	andpd	%xmm2, %xmm1
	andnpd	%xmm0, %xmm2
	orpd	%xmm1, %xmm2
	jmp	L75
L54:
	movapd	%xmm0, %xmm2
	cmpordsd	%xmm2, %xmm2
	andpd	%xmm2, %xmm0
	andnpd	%xmm1, %xmm2
	orpd	%xmm0, %xmm2
L75:
	movapd	%xmm2, %xmm0
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

C++

We have the same in C++

#include <cmath>
#include <iostream>

int main()
{
  double x, y;
  std::cin >> x >> y;
  asm("#ASM FOR FMIN");
  double fmin_x_y = std::fmin(x, y);
  asm("#ASM FOR FMIN - END");
  std::cout << "\n" << fmin_x_y;

  asm("#ASM FOR MIN");
  double min_x_y = std::min(x, y);
  asm("#ASM FOR MIN - END");
  std::cout << "\n" << min_x_y;
  return 0;
}

compiled with

g++ -std=c++11 -O3 -S min.cpp -o min.asm

gives for min()

#ASM FOR MIN
	movsd	24(%rsp), %xmm0
	minsd	16(%rsp), %xmm0
	movsd	%xmm0, 8(%rsp)
#ASM FOR MIN - END

and for fmin()

#ASM FOR FMIN
	movsd	24(%rsp), %xmm1
	movsd	16(%rsp), %xmm0
	call	fmin
	movsd	%xmm0, 8(%rsp)
#ASM FOR FMIN - END

CPU time?

Illustration with the Jaccard distance defined by

d_J(x,y)=1-\frac{\sum\limits_i \min(x_i,y_i)}{\sum\limits_i \max(x_i,y_i)},\ \text{where}\ (x,y)\in\mathbb{R}_+^n\times\mathbb{R}_+^n

We compute this distance using 4 different approaches:

  • Julia min()/max() NaN aware
  • Julia min()/max() comparison
  • C fmin()/fmax() NaN aware
  • C min()/max() comparison

The Julia code is given below

function jaccard_julia_NaN_aware(a::Array{Float64,1},
                                 b::Array{Float64,1})

    @assert length(a)==length(b) 

    num::Float64 = 0
    den::Float64 = 0

    for i in 1:length(a)

        @inbounds num += min(a[i],b[i])
        @inbounds den += max(a[i],b[i])

    end
    return 1. - num/den
end

function jaccard_julia_comparison(a::Array{Float64,1},
                                  b::Array{Float64,1})

    @assert length(a)==length(b) 

    num::Float64 = 0
    den::Float64 = 0

    for i in 1:length(a)

        @inbounds num += ifelse(a[i]<b[i],a[i],b[i])
        @inbounds den += ifelse(a[i]>b[i],a[i],b[i])

    end
    return 1. - num/den
end

function jaccard_C_NaN_aware(a::Array{Float64,1},
                             b::Array{Float64,1})

    @assert length(a)==length(b) 

    return ccall((:jaccard_C_NaN_aware,
                  "./libjaccard.so"),
                 Float64,
                 (Int64,Ptr{Float64},Ptr{Float64}),
                 length(a),a,b)

end

function jaccard_C_comparison(a::Array{Float64,1},
                              b::Array{Float64,1})

    @assert length(a)==length(b) 

    return ccall((:jaccard_C_comparison,
                  "./libjaccard.so"),
                 Float64,
                 (Int64,Ptr{Float64},Ptr{Float64}),
                 length(a),a,b)

end

function test_distance(f,
                       v1::Array{Float64,1},
                       v2::Array{Float64,1})
    sum=0
    for i in 1:5000
        sum+=f(v1,v2)
    end
    sum
end

v1=rand(10000);
v2=rand(10000);

for name in (:jaccard_julia_NaN_aware,
             :jaccard_C_NaN_aware,
             :jaccard_julia_comparison,
             :jaccard_C_comparison)
    print("$name")
    @time @eval test_distance($name,v1,v2)
end

The associated C subroutines are:

#include <math.h>
#include <stdint.h>

double jaccard_C_NaN_aware(uint64_t size,
               double *a, double *b)
{
  double num = 0, den = 0;

  for(uint64_t i = 0; i < size; ++i)
    {
      num += fmin(a[i],b[i]);
      den += fmax(a[i],b[i]);
    }
  return 1. - num / den;
}

double jaccard_C_comparison(uint64_t size,
                double *a, double *b)
{
  double num = 0, den = 0;

  for(uint64_t i = 0; i < size; ++i)
    {
      num += (a[i] < b[i] ? a[i] : b[i]);
      den += (a[i] > b[i] ? a[i] : b[i]);
    }
  return 1. - num / den;
}

and the Makefile is

all: bench

bench: libjaccard.so
    @julia --optimize --check-bounds=no jaccard.jl

libjaccard.so: jaccard.c
    @gcc -O2 -shared -fPIC jaccard.c -o libjaccard.so

The results I get on my computer are:

make

(Figure: output.png, the timings of the four implementations.)

We see that:

  • for the NaN-aware case, C and Julia running times are equivalent, great!
  • for the comparison case, C seems to be a little bit faster, but the gap is very small and a more precise benchmark would be necessary to quantify it.
  • min()/max() using simple comparisons is more than 3.5 times faster than an implementation taking into account possible NaN values.

Julia 0.4.6 previous results

In the first version of this post, using Julia version 0.4.6, I got this result:

(Figure: output_4_6.png, the timings obtained with Julia 0.4.6.)

There was a clear advantage for the C code, but it seems this is not the case anymore with julia version 0.5.

You can reproduce the results of this post, using the code available on GitHub.

A source of bugs?

You have to take care that a simple statement like

x\le\min(x,y)

is mathematically true but can be false in your code. It is even false for both the IEEE 754 and the comparison-based versions of the min() function.
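
For instance, a quick C++ check:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    const double x = std::numeric_limits<double>::quiet_NaN();
    const double y = 1.0;

    // Any comparison involving NaN is false, so x <= min(x, y) does not hold,
    // whichever definition of min is used.
    std::cout << std::boolalpha
              << (x <= std::min(x, y)) << "\n"    // false
              << (x <= std::fmin(x, y)) << "\n";  // false
    return 0;
}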

In the future I will write a post on Automatic Differentiation. To be brief, automatic differentiation is a tool that allows you to efficiently compute gradients with nearly no modification of your original code. As an example, my personal implementation takes the form:

// Declare the current tape
//
AD_Tape<double> tape;

// Computes gradient of f(x,y,z)=x.z.sin(x.y)
//                   at (x,y,z)=(2,1,5)
//
// Note:
//
// grad(f)={ x * y * z * Cos(x * y) + z * Sin(x * y),
//           x^2 * z *
//           Cos(x * y), x * Sin(x * y) }
//
AD_Scalar<double> x, y, z, f;

x = 2;
y = 1;
z = 5;

f = x * z * sin(x * y);

const auto grad = ad_gradient(f);

std::cout << "\ngrad(f) = {"
          << grad[x] << "," 
          << grad[y] << ","
          << grad[z] << "}";

// The screen output is
//
// grad(f) = {0.385019,-8.32294,1.81859}

One classical approach uses operator overloading. For each basic operation you compute the function value and its differential.

One important and desirable property of an AutoDiff library is that its use does not modify your program's result.

Unfortunately a lot of AutoDiff libraries are bugged when they define the min()/max() functions.

It is really “natural” to define min() overloading by something like:

function min(x,y)
    ifelse(x<y,
           return (x,dx),
           return (y,dy))

But this implementation is buggy: if y is NaN we now know that x<y is always false, hence:

  • the original code will return x
  • the AutoDiff code will return (y,dy) (a priori =(NaN,NaN))

To stay consistent the correct implementation is something like:

function min(x,y)
    ifelse(x==min(x,y),
           return (x,dx),
           return (y,dy))

which provides the right result (i.e. consistent with the original code) whether x or y is NaN or not.

We can check that, for instance, with Julia ForwardDiff.jl

(githash: 045a828)

using ForwardDiff
f(x::Vector)=max(2*x[1],x[2]);
x=[1.,NaN]
print("Original value: $(f(x))")
print("\nAutoDiff gradient: $(ForwardDiff.gradient(f, x))")
Original value: 2.0
AutoDiff gradient: [0.0,1.0]

which is wrong; in this case the right gradient is

\nabla [(x_1,x_2)\rightarrow max(2.x_1,x_2)]_{x_1=1,x_2=NaN}=(2,0)

Final word

  • Concerning the min()/max() examples of this post, I have observed a big performance gain when using Julia version 0.5.0. What is not clear is the reason for this gain:

    • I compiled my own version of Julia 0.5.0, whereas I used the 0.4.6 version shipped with Debian. Maybe my compiled version uses some different compiler flags (like -march=native…).
    • or there is a real improvement between versions 0.4.6 and 0.5.0

    You can reproduce the benchmark using the code available on GitHub. Any feedback is welcome.

  • The @code_native macro of Julia version 0.4.6 seems to be bugged.
  • It is “easy” to introduce bugs when you mix comparison operators and the IEEE 754 min()/max() functions. Implementing min()/max() in an Automatic Differentiation framework is one perfect illustration of this and was the point that motivated me to write this post.
  • (Joke) If you are fed up with min()/max() you can use absolute value:

min(x,y)=\frac{1}{2}(x+y-|x-y|)

max(x,y)=\frac{1}{2}(x+y+|x-y|)

… but these relations do not follow the IEEE 754 standard!

I would like to thank my colleague JP. Both for having noticed that min()/max() was “abnormally” slow (Julia 0.4.6), Y. Borstein for some email exchanges concerning min()/max() in Julia (before I wrote this post), and Erik Schnetter who told me to switch to a more recent Julia version.