
DifferentialEquations.jl 2.0: State of the Ecosystem

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/differentialequations-jl-2-0-state-ecosystem/

In this blog post I want to summarize what we have accomplished with DifferentialEquations’ 2.0 release and detail where we are going next. I want to put the design changes and development work into a larger context so that way everyone can better understand what has been achieved, and better understand how we are planning to tackle our next challenges.

If you find this project interesting and would like to support our work, please star our Github repository. Thanks!

Now let’s get started.

DifferentialEquations.jl 1.0: The Core

Before we start talking about 2.0, let’s first understand what 1.0 was all about. DifferentialEquations.jl 1.0 was about answering a single question: how can we put the wide array of differential equations into one simple and efficient interface? The result was the common interface explained in the first blog post. Essentially, we created one interface that could:

  1. Specify a differential equation problem
  2. Solve a differential equation problem
  3. Analyze a differential equation problem

The problem types, solve command, and solution interface were all introduced here as part of the unification of differential equations. Here, most of the work was on developing the core. DifferentialEquations.jl 1.0 was about having the core methods for solving ordinary differential equations, stochastic differential equations, and differential algebraic equations. There were some nice benchmarks to show that our core native solvers were on the right track, even besting well-known Fortran methods in terms of efficiency, but the key of 1.0 was the establishment of this high level unified interface and the core libraries for solving the problems.

DifferentialEquations.jl 2.0: Extended Capabilities

DifferentialEquations.jl 2.0 asked an unusual question for differential equations libraries. Namely, “how flexible can a differential equations solver be?”. This was motivated by an off-putting remark where someone noted that standard differential equations solvers were limited in their usefulness because many of the higher level analyses that people need to do cannot be done with a standard differential equations solver.

So okay, then we won’t be a standard differential equations solver. But what do we need to do to make all of this possible? I gathered a list of things which were considered impossible to do with “blackbox” differential equations solvers. People want to model continuous equations for protein concentrations inside of each cell, but allow the number of cells (and thus the number of differential equations) to change stochastically over time. People want to model multiscale phenomena, and have discontinuities. Some “differential equations” may only be discontinuous changes of discrete values (like in Gillespie models). People want to solve equations with colored noise, and re-use the same noise process in other calculations. People want to solve the same ODE efficiently hundreds of times, and estimate parameters. People want to quantify the uncertainty and the sensitivity of their model. People want their solutions to conserve properties like energy.

People want to make simulations of reality more so than solve equations.

And this became the goal for DifferentialEquations.jl 2.0. But the sights were actually set a little higher. The underlying question was:

How do you design a differential equations suite such that it can have this “simulation engine” functionality, but also such that adding new methods automatically makes the method compatible with all of these features?

That is DifferentialEquations.jl 2.0. The previous DifferentialEquations.jl ecosystem blog post details the strategies we were going to employ to achieve this goal, but let me take a little bit of time to explain the solution that eventually resulted.

The Integrator Interface

The core of the solution is the integrator interface. Instead of just having an interface on the high-level solve command, the integrator interface is the interface on the core type. Everything inside of the OrdinaryDiffEq.jl, StochasticDiffEq.jl, DelayDiffEq.jl packages (will be referred to as the *DiffEq solvers) is actually just a function on the integrator type. This means that anything that the solver can do, you can do by simply having access to the integrator type. Then, everything can be unified by documenting this interface.

This is a powerful idea. It makes development easy, since the devdocs just explain what is done internally to the integrator. Adding new differential equations algorithms is now simply adding a new perform_step dispatch. But this isn’t just useful for development, this is useful for users too. Using the integrator, you can step one at a time if you wanted, and do anything you want between steps. Resizing the differential equation is now just a function on the integrator type since this type holds all of the cache variables. Adding discontinuities is just changing integrator.u.
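For example, here is a minimal sketch of stepping manually (assuming OrdinaryDiffEq’s init/step! interface and the f(t,u) function signature used throughout this post; the exact signatures have shifted between versions):

  using OrdinaryDiffEq

  f(t,u) = 1.01u                          # toy ODE: exponential growth
  prob = ODEProblem(f, 0.5, (0.0, 1.0))

  integrator = init(prob, Tsit5())        # like solve, but hands back the integrator
  step!(integrator)                       # take a single step
  integrator.u *= 2                       # apply a discontinuity between steps
  step!(integrator)                       # keep going with the modified state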

But the key that makes this all work is Julia. In my dark past, I wrote some algorithms which used R’s S3 objects, and I used objects in numerical Python codes. Needless to say, these got in the way of performance. However, the process of implementing the integrator type was a full refactor from straight loops to the type format. The result was no performance loss (actually, there was a very slight performance gain!). The abstraction that I wanted to use did not have a performance tradeoff because Julia’s type system optimized its usage. I find that fact incredible.

But back to the main story, the event handling framework was re-built in order to take full advantage of the integrator interface, allowing the user to directly affect the integrator. This means that doubling the size of your differential equation the moment some value hits 1 is now a possibility. It also means you can cause your integration to terminate when “all of the bunnies” die. But this became useful enough that you might not want to just use it for traditional event handling (aka cause some effect when some function hits zero, which we call the ContinuousCallback), but you may just want to apply some effect after steps. The DiscreteCallback allows one to check a boolean function for true/false, and if true apply some function to the integrator. For example, we can use this to always apply a projection to a manifold at the end of each step, effectively preserving the order of the integration while also conserving model properties like energy or angular momentum.
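As a hedged sketch of both callback types (given some prob with a vector state defined as before; the condition/affect! argument order has changed across package versions, and the (t,u,integrator) form here matches this post’s era):

  # Terminate the integration when the first variable ("the bunnies") hits zero.
  condition(t, u, integrator) = u[1]                    # fires on a zero crossing
  affect!(integrator) = terminate!(integrator)
  bunny_cb = ContinuousCallback(condition, affect!)

  # After every step, project the state back onto a constraint (here: renormalize).
  always_true(t, u, integrator) = true
  project!(integrator) = (integrator.u ./= sum(integrator.u))
  proj_cb = DiscreteCallback(always_true, project!)

  sol = solve(prob, Tsit5(), callback = CallbackSet(bunny_cb, proj_cb))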

The integrator interface and thus its usage in callbacks then became a way that users could add arbitrary functionality. It’s useful enough that a DiscreteProblem (an ODE problem with no ODE!) is now a thing. All that is done is the discrete problem walks through the ODE solver without solving a differential equation, just hitting callbacks.

But entirely new sets of equations could be added through callbacks. For example, discrete stochastic equations (or Gillespie simulations) are models where rate equations determine the time to the next discontinuity or “jump”. The JumpProblem types simply add callbacks to a differential (or discrete) equation that perform these jumps at specific rates. This effectively turns the “blank ODE solver” into an equation which can solve these models of discrete proteins stochastically changing their levels over time. In addition, since it’s built directly onto the differential equations solvers, mixing these types of models is an instant side effect. These models which mix jumps and differential equations, such as jump diffusions, were an immediate consequence of this design.

The design of the integrator interface meant that dynamicness of the differential equation (changing the size, the solver options, or any other property in the middle of solving) was successfully implemented, and handling of equations with discontinuities directly followed. This turned a lot of “not differential equations” into “models and simulations which can be handled by the same DifferentialEquations.jl interface”.

Generic algorithms over abstract types

However, the next big problem was being able to represent a wider array of models. “Models and simulations which do weird non-differential equation things over time” are handled by the integrator interface, but “weird things which aren’t just a system of equations which do weird non-differential equation things over time” were still out of reach.

The solution here is abstract typing. The *DiffEq solvers accept two basic formats. Let’s stick to ODEs for the explanation. For ODEs, there is the out-of-place format

du = f(t,u)

where the derivative/change is returned by the function, and there is the in-place format

f(t,u,du)

where the function modifies the object du which stores the derivative/change. Both of these formats were generalized to the extreme. In the end, the requirements for a type to work in the out-of-place format can be described as the ability to do basic arithmetic (+,-,/,*), and you add the requirement of having a linear index (or simply having a broadcast! function defined) in order to satisfy the in-place format. If the method is using adaptivity, the user can pass an appropriate norm function to be used for calculating the norm of the error estimate.

This means that wild things work in the ODE solvers. I have already demonstrated arbitrary precision numbers, and unit-checked arithmetic.
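As a small illustration (a hedged sketch using the f(t,u) form from above), nothing extra is needed to run the same solver in arbitrary precision:

  using OrdinaryDiffEq

  f(t,u) = -u
  prob = ODEProblem(f, big"1.0", (big"0.0", big"1.0"))   # BigFloat state and time
  sol = solve(prob, Tsit5(), abstol = 1e-30, reltol = 1e-30)

Unit-checked arithmetic (e.g. via Unitful.jl) flows through the same generic code path, as long as the arithmetic and the error norm stay consistent.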

But now there’s even crazier. Now different parts of your equation can have different units using the ArrayPartition. You can store and update discrete values along with your differential equation using the DEDataArray type. Just the other day I showed this can be used to solve problems where the variable is actually a symbolic mathematical expression. We are in the late stages of getting a type which represents a spectral discretization of a function compatible with the *DiffEq solvers.

But what about those “purely scientific non-differential equations” applications? A multiscale model of an embryo which has tissues, each with different populations of cells, and modeling the proteins in each cell? That’s just a standard application of the AbstractMultiScaleArray.

Thus using the abstract typing, even simulations which don’t look like systems of equations can now be directly handled by DifferentialEquations.jl. But not only that, since this is done simply via Julia’s generic programming, this compatibility is true for any of the new methods which are added (one caveat: if they use an external library like ForwardDiff.jl, their compatibility is limited by the compatibility of that external library).

Refinement of the problem types

The last two big ideas made it possible for a very large set of problems to be written down as a “differential equation on an array” in a much expanded sense of the term. However, there was another design problem to solve: not every algorithm could be implemented with “the information” we had! By “information” I mean the information we could get from the user. The ODEProblem type specified an ODE as

 \frac{du}{dt} = f(t,u)

but some algorithms do special things. For example, for the ODE

 \frac{du}{dt} = f(t,u) = Au + g(t,u)

the Lawson-Euler algorithm for solving the differential equation is

 u_{n+1} = \exp(A \Delta t)(u_n + g(t,u_n)\Delta t)

This method exploits the fact that it knows that the first part of the equation is A for some matrix, and uses it directly to improve the stability of the algorithm. However, if all we know is f, we could never implement this algorithm. This would violate our goal of “full flexibility at full performance” if this algorithm was the most efficient for the problem!
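To make that concrete, here is a plain-Julia sketch of the update (not the package’s implementation); it needs the matrix A and the nonlinear part g separately, not just f:

  # One Lawson-Euler step for du/dt = A*u + g(t,u), with A given explicitly.
  function lawson_euler_step(A, g, u, t, dt)
      exp(dt*A) * (u .+ g(t, u) .* dt)   # exp(dt*A) is the matrix exponential
  end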

The solution is to have a more refined set of problem types. I discussed a bit at the end of the previous blog post how we could define things like splitting problems. The solution is quite general, where

 M \frac{du}{dt} = f_1(t,u) + f_2(t,u) + ... + f_n(t,u)

can be defined using the SplitODEProblem (M being a mass matrix). Then specific methods can request specific forms, like here the linear-nonlinear ODE. Together, the ODE solver can implement this algorithm for the ODE, and that implementation, being part of a *DiffEq solver, will have interpolations, the integrator interface, event handling, abstract type compatibility, etc. all for free. Check out the other “refined problem types”: these are capable of covering wild things like staggered grid PDE methods and symplectic integrators.

In addition to specifying the same equations in new ways, we created avenues for common analyses of differential equations which are not related to simulating them over time. For example, one common problem is to try to find steady states, or points where the differential equation satisfies f(u)=0. This can now easily be done by defining a SteadyStateProblem from an ODE, and then using the steady state solver. This new library will also lead to the implementation of accelerated methods for finding steady states, and the development of new accelerated methods. The steady state behavior can now also be analyzed using the bifurcation analysis tools provided by the wrapper to PyDSTool.
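For instance, a hedged sketch of that workflow (assuming SteadyStateDiffEq’s SSRootfind solver and the in-place f(t,u,du) form from above; newer versions order the arguments differently):

  using OrdinaryDiffEq, SteadyStateDiffEq

  f(t, u, du) = (du .= 2 .- 2 .* u)          # fixed point at u = 1
  odeprob = ODEProblem(f, [0.0], (0.0, Inf))
  ssprob  = SteadyStateProblem(odeprob)      # reinterpret the ODE as a rootfinding problem f(u) = 0
  sol = solve(ssprob, SSRootfind())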

Lastly, the problem types themselves have become much more expressive. In addition to solving the standard ODE, one can specify mass matrices in any appropriate DE type, to instead solve the equation

 M \frac{du}{dt} = f(t,u)

where M is some linear operator (similarly in DDEs and SDEs). While the vast majority of solvers are not able to use M right now, this infrastructure is there for algorithms to support it. In addition, one can now specify the noise process used in random and stochastic equations, allowing the user to solve problems with colored noise. Using the optional fields, a user can define non-diagonal noise problems, and specify sparse noise problems using sparse matrices.

As of now, only some very basic methods using all of this infrastructure have been made for the most extreme examples for testing purposes, but these show that the infrastructure works and this is ready for implementing new methods.

Common solve extensions

Okay, so once we can specify efficient methods for weird models which evolve over time in weird ways, we can simulate and get what solutions look like. Great! We have a tool that can be used to get solutions! But… that’s only the beginning of most analyses!

Most of the time, we are simulating solutions to learn more about the model. If we are modeling chemical reactions, what is the reaction rate that makes the model match the data? How sensitive is our climate model to our choice of the albedo coefficient?

To back out information about the model, we rely on analysis algorithms like parameter estimation and sensitivity analysis. However, the common solve interface acts as the perfect level for implementing these algorithms because they can be done in a problem- and algorithm-agnostic way. I discuss this in more detail in a previous blog post, but the general idea is that most of these algorithms can be written with a term y(t) which is the solution of a differential equation. Thus we can write the analysis algorithms at a very high level and allow the user to pass in the arguments for a solve command, using those to generate the y(t). The result is an implementation of the analysis algorithm which works with any of the problems and methods which use the common interface. Again, chaining all of the work together to get one more complete product. You can see this in full force by looking at the parameter estimation docs.
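A hand-rolled sketch of the idea (the actual packages wrap exactly this pattern up; the names here are illustrative, and I am again using the f(t,u) form from this post):

  using OrdinaryDiffEq

  t_data = 0.0:0.5:5.0
  data   = 2 .* exp.(-1.3 .* t_data)          # synthetic data from a "true" rate of 1.3

  function loss(p)                            # p[1] is the decay rate to recover
      f(t,u) = -p[1]*u
      prob = ODEProblem(f, 2.0, (0.0, 5.0))
      y = solve(prob, Tsit5(), saveat = t_data)   # this is the y(t) term
      sum(abs2, y.u .- data)
  end

  # Any optimizer can now minimize `loss`, e.g. Optim.optimize(loss, [0.5]).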

Modeling Tools

In many cases one is solving differential equations not for their own sake, but to solve scientific questions. To this end, we created a framework for modeling packages which make this easier. The financial models make it easy to specify common financial equations, and the biological models make it easy to specify chemical reaction networks. This functionality all works on the common solver / integrator interface, meaning that models specified in these forms can be used with the full stack and analysis tools. Also, I would like to highlight BioEnergeticFoodWebs.jl as a great modeling package for bio-energetic food web models.

Over time, we hope to continue to grow these modeling tools. The financial tools I hope to link with Julia Computing’s JuliaFin tools (Miletus.jl) in order to make it easy to efficiently solve the SDE and PDE models which result from their financial DSL. In addition, DiffEqPhysics.jl is planned to make it easy to specify the equations of motion just by giving a Hamiltonian or Lagrangian, or by giving the particles + masses and automatically developing a differential equation. I hope that we can also tackle domains like mechanical systems and pharmacokinetics/pharmacodynamics to continually expand what is easily able to be solved using this infrastructure.

DifferentialEquations 2.0 Conclusion

In the end, DifferentialEquations 2.0 was about finding the right infrastructure such that pretty much anything CAN be specified and solved efficiently. While there were some bumps along the road (that caused breaking API changes), I believe we came up with a very good solution. The result is a foundation which feeds back onto itself, allowing projects like parameter estimation of multiscale models which change size due to events to be standard uses of the ODE solver.

And one of the key things to note is that this follows by design. None of the algorithms were specifically written to make this work. The design of the *DiffEq packages gives interpolation, event handling, compatibility with analysis tools, etc. for free for any algorithm that is implemented in it. One contributor, @ranocha, came to chat in the chatroom and a few hours later had implemented 3 strong stability preserving Runge-Kutta methods (methods which are efficient for hyperbolic PDEs) in the *DiffEq solvers. All of this extra compatibility followed for free, making it a simple exercise. And that leads me to DifferentialEquations 3.0.

DifferentialEquations 3.0: Stiff solvers, parallel solvers, PDEs, and improved analysis tools

1.0 was about building the core. 2.0 was about making sure that the core packages were built in a way that could be compatible with a wide array of problems, algorithms, and analysis tools. However, in many cases, only the simplest of each type of algorithm was implemented since this was more about building out the capabilities than it was to have completeness in each aspect. But now that we have expanded our capabilities, we need to fill in the details. These details are efficient algorithms in the common problem domains.

Stiff solvers

Let’s start by talking about stiff solvers. As of right now, we have all of the standard solvers (CVODE, LSODA, radau) wrapped in the packages Sundials.jl, LSODA.jl, and ODEInterface.jl respectively. These can all be used in the DifferentialEquations.jl common interface, meaning that it’s mostly abstracted away from the user that these aren’t actually Julia codes. However, these lower level implementations will never be able to reach the full flexibility of the native Julia solvers simply because they are restricted in the types they use and don’t fully expose their internals. This is fine, since our benchmarks against the standard Runge-Kutta implementations (dopri5, dop853) showed that the native Julia solvers, being more modern implementations, can actually have performance gains over these older methods. But, we need to get our own implementations of these high order stiff solvers.

Currently there exists the Rosenbrock23 method. This method is similar to the MATLAB ode23s method (it is the Order 2/3 Shampine-Rosenbrock method). This method is A and L stable, meaning it’s great for stiff equations. It was thus used for testing how event handling, parameter estimation, etc. interact with the coming set of stiff solvers, and what their restrictions are. However, where it lacks is order. As an order 2 method, it is only efficient at higher error tolerances, and thus at “standard tolerances” it tends not to be competitive with the other methods mentioned before. That is why one of our main goals in DiffEq 3.0 will be the creation of higher order methods for stiff equations.

The main obstacle here will be the creation of a library for making the differentiation easier. There are lots of details involved here. Since a function defined using the macros of ParameterizedFunctions can symbolically differentiate the user’s function, in some cases a pre-computed function for the inverted or factorized Jacobian can be used to make a stiff method explicit. In other cases, we need autodifferentiation, and in some we need to use numerical differentiation. This is all governed by a system of traits set up behind the scenes, and thus proper logic for determining and using Jacobians can immensely speed up our calculations.

The Rosenbrock23 method did some of this ad hoc within its own method, but it was determined that the method would be greatly simplified if there was some library that could handle this. In fact, if there was a library to handle this, then the Rosenbrock23 code for defining steps would be as simple as defining steps for explicit RK methods. The same would be true for implicit RK methods like radau. Thus we will be doing that: building a library which handles all of the differentiation logic. The development of this library, DiffEqDiffTools.jl, is @miguelraz ‘s Google Summer of Code project. Thus with the completion of this project (hopefully summer?), efficient and fully compatible high order Rosenbrock methods and implicit RK methods will easily follow. Also included will be additive Runge-Kutta methods (IMEX RK methods) for SplitODEProblems. Since these methods double as native Julia DAE solvers, and since this code will also ease the development of stiff solvers for SDEs, this will be a major win to the ecosystem on many fronts.

Stiffness Detection and Switching

In many cases, the user may not know if a problem is stiff. In many cases, especially in stochastic equations, the problem may be switching between being stiff and non-stiff. In these cases, we want to change the method of integration as we go along. The general setup for implementing switching methods has already been implemented by the CompositeAlgorithm. However, current usage of the CompositeAlgorithm requires that the user define the switching behavior. This makes it quite difficult to use.
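For reference, a minimal sketch of what that user-defined switching currently looks like (hedged: given some prob defined as earlier, and the choice-function contract is an internal detail that may change):

  using OrdinaryDiffEq

  # The choice function returns the index of the algorithm to use for the next step.
  # A real strategy would inspect the integrator (e.g. a stiffness estimate);
  # this toy version just always picks the non-stiff solver.
  choice(integrator) = 1
  alg = CompositeAlgorithm((Tsit5(), Rosenbrock23()), choice)
  sol = solve(prob, alg)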

Instead, we will be building methods which make use of this infrastructure. Stiffness detection estimates can be added to the existing methods (in a very efficient manner), and could be toggled on/off. Then standard switching strategies can be introduced such that the user can just give two algorithms, a stiff and a non-stiff solver, and basic switching can then occur. The strategies deemed most effective can then be implemented as standard algorithm choices. Then at the very top, these methods can be added as defaults for solve(prob), making the fully automated solver efficiently handle difficult problems. This will be quite a unique feature and is borderline a new research project. I hope to see some really cool results.

Parallel-in-time ODE/BVP solvers

While traditional methods (Runge-Kutta, multistep) all step one time point at a time, in many cases we want to use parallelism to speed up our problem. It’s not hard to buy an expensive GPU, and a lot of us already have one for optimization, so why not use it?

Well, parallelism for solving differential equations is very difficult. Let’s take a look at some quick examples. In the Euler method, the discretization calculates the next time step u_{n+1} from the previous time step u_n using the equation

u_{n+1} = u_n + \Delta t f(t,u)

In code, this is the update step

u .= uprev .+ dt.*f(t,uprev)

I threw in the .’s to show this is broadcasted over some arrays, i.e. for systems of equations u is a vector. And that’s it, that’s what the inner loop is. The most you can parallelize are the loops internal to the broadcasts. This means that for very large problems, you can parallelize this method efficiently (this form is called parallelism within the method). Also, if your input vector was a GPUArray, this will broadcast using CUDA or OpenCL. However, if your problem is not a sufficiently large vector, this parallelism will not be very efficient.

Similarly for implicit equations, we need to repeatedly solve (I-\Delta tJ)u = b where J is the Jacobian matrix. This linear solve will only parallelize well if the Jacobian matrix is sufficiently large. But many small differential equations problems can still be very difficult: for example, think about solving a very stiff ODE with a few hundred variables. Here the issue is that we are stepping serially over time, and we need to use completely different algorithms which parallelize over time.

One of these approaches is a collocation method. Collocation methods build a very large nonlinear equation F(X)=0 which describes a numerical method over all time points at once, and simultaneously solve for all of the time points using a nonlinear solver. Internally, the nonlinear solver repeatedly solves linear systems Ax=b with a very large A. Thus, if the user passes in a custom linear solver method, say one using PETSc.jl or CUSOLVER, this can be parallelized efficiently over many nodes of an HPC or over a GPU. In fact, these methods are the standard methods for Boundary Value Problems (BVPs). The development of these methods is the topic of @YingboMa’s Google Summer of Code project. While written for BVPs, these same methods can then solve IVPs with a small modification (including stochastic differential equations).

By specifying an appropriate preconditioner with the linear solver, these can be some of the most efficient parallel methods. When no good preconditioner is found, these methods can be less efficient. One may wonder then if there exists a different approach, one which may sacrifice some “theoretical top performance” in order to be better in the “low user input” case (purely automatic). There is! Another approach to solving the parallelism over time issue is to use a neural network. This is the topic of @akaysh’s Google Summer of Code project. Essentially, you can define a cost function which is the difference between the numerical derivative and f(t_i,u_i) at each time point. This then gives an optimization problem: find the u_i at each time point such that the difference between the numerical and the desired derivatives is minimized. Then you solve that cost function minimization using a neural network. The neat thing here is that neural nets are very efficient to parallelize over GPUs, meaning that even for somewhat small problems we can get very good parallelism. These neural nets can use very advanced methods from packages like Knet.jl to optimize efficiently and in parallel with very little required user input (no preconditioner to set). There really isn’t another standard differential equations solver package which has a method like this, so it’s hard to guess how efficient it will be in advance. But given the properties of this setup, I suspect this should be a very good “automatic” method for medium-sized (100’s of variables) differential equations.
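To spell out the cost function being described (a hedged, plain-Julia sketch, not the project’s code):

  # Given trial values U[:,i] at time points t[i], penalize the mismatch between a
  # finite-difference derivative and f(t_i, u_i) at every interior time point.
  function derivative_mismatch(f, U, t)
      cost = 0.0
      for i in 2:length(t)-1
          du_numeric = (U[:, i+1] .- U[:, i-1]) ./ (t[i+1] - t[i-1])   # central difference
          cost += sum(abs2, du_numeric .- f(t[i], U[:, i]))
      end
      return cost
  end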

The really great thing about these parallel-in-time methods is that they are inherently implicit, meaning that they can be used even on extremely stiff equations. There are also simple extensions to make these solve SDEs and DDEs. So add this to the bank of efficient methods for stiff diffeqs!

Improved methods for stochastic differential equations

As part of 3.0, the hope is to release brand new methods for stochastic differential equations. These methods will be high order and highly stable, some explicit and some implicit, and will have adaptive timestepping. That is all of the detail I am giving until these methods are published, but I do want to tease that many types of SDEs will become much more efficient to solve.

Improved methods for jump equations

For jump equations, in order to show that everything is complete and can work, we have only implemented the Gillespie method. However, we hope to add many different forms of tau-leaping and Poisson(/Binomial) Runge-Kutta methods for these discrete stochastic equations. Our roadmap is here and it seems there may be a great deal of participation to complete this task. Additionally, we plan on having a specialized DSL for defining chemical reaction networks and automatically turning them into jump problems or ODE/SDE systems.

Geometric and symplectic integrators

In DifferentialEquations.jl 2.0, the ability to specify partitioned ODEs for dynamical problems (or directly specify a second order ODE problem) was added. However, only a symplectic Euler method has been added to solve these equations so far. This was used to make sure the *DiffEq solvers were compatible with this infrastructure, and showed that event handling, resizing, parameter estimation, etc. all work together on these new problem types. But, obviously we need more algorithms. Velocity Verlet and higher order Nyström methods are asking to be implemented. This isn’t difficult for the reasons described above, and will be a very nice boost to DifferentialEquations.jl 3.0.

(Stochastic/Delay) Partial differential equations

Oh boy, here’s a big one. Everyone since the dawn of time has wanted me to focus on building a method that makes solving the PDE that they’re interested in dead simple to do. We have a plan for how to get there. The design is key: instead of implementing numerical methods one-by-one for each PDE, we needed a way to pool the efforts together and make implementations on one PDE matter for other PDEs.

Let’s take a look at how we can do this for efficient methods for reaction-diffusion equations. In this case, we want to solve the equation

 u_t = \Delta u + f(t,u)

The first step is always to discretize this over space. Each of the different spatial discretization methods (finite difference, finite volume, finite element, spectral) ends up with some equation

 U_t = AU + f(t,U)

where now U is a vector of points in space (or a discretization in some basis). At this point, a specialized numerical method for stepping this equation efficiently in time can be written. For example, if diffusion is fast and f is stiff, one really efficient method is the implicit integrating factor method. This would solve the equation by updating time points like:

U_{n+1} = e^{A\Delta t}U_n + \Delta t f(t_{n+1},U_{n+1})

where we have to solve this implicit equation each time step. The nice thing is that the implicit equation decouples in space, and so we actually only need to solve a bunch of very small implicit equations.

How can we do this in a way that is not “specific to the heat equation”? There were two steps here, the first is discretizing in space, the second is writing an efficient method specific to the PDE. The second part we already have an answer for: this numerical method can be written as one of the methods for linear/nonlinear SplitODEProblems that we defined before. Thus if we just write a SplitODEProblem algorithm that does this form of updating, it can be applied to any ODE (and any PDE discretization) which splits out a linear part. Again, because it’s now using the ODE solver, all of the extra capabilities (event handling, integrator interface, parameter estimation tools, interpolations, etc.) all come for free as well. The development of ODE/SDE/DDE solvers for handling this split, like implicit integrating factor methods and exponential Runge-Kutta methods, is part of DifferentialEquations.jl 3.0’s push for efficient (S/D)PDE solvers.

So with that together, we just need to solve the discretization problem. First let’s talk about finite difference. For the Heat Equation with a fixed grid-size \Delta x, many people know what the second-order discretization matrix A is in advance. However, what if you have variable grid sizes, and want different order discretizations of different derivatives (say a third derivative)? In this case the Fornberg algorithm can be used to construct this A. And instead of making this an actual matrix, because this is sparse we can make this very efficient by creating a “matrix-free type” where AU acts like the appropriate matrix multiplication, but in reality no matrix is ever created. This can save a lot of memory and make the multiplication a lot more efficient by ignoring the zeros. In addition, because of the reduced memory requirement, we can easily distribute this operator to the GPU or across the cluster, and make the AU function utilize this parallelism.
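Here is a hedged sketch of that “matrix-free type” idea for the simplest case, the second-order Laplacian with zero boundary values (illustrative only; the actual types and boundary handling in the project differ):

  struct Laplacian1D
      dx::Float64
  end

  # Make L*U act like multiplying by the tridiagonal stencil matrix, without storing it.
  function Base.:*(L::Laplacian1D, U::AbstractVector)
      AU = similar(U)
      n = length(U)
      AU[1] = (-2U[1] + U[2]) / L.dx^2
      for i in 2:n-1
          AU[i] = (U[i-1] - 2U[i] + U[i+1]) / L.dx^2
      end
      AU[n] = (U[n-1] - 2U[n]) / L.dx^2
      return AU
  end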

The development of these efficient linear operators is the goal of @shivin9’s Google Summer of Code project. The goal is to get a function where the user can simply specify the order of the derivative and the order of the discretization, and it will spit out this highly efficient A to be used in the discretization, turning any PDE into a system of ODEs. In addition, other operators which show up in finite difference discretizations, such as the upwind scheme, can be encapsulated in such an A. Thus this project would make turning these types of PDEs into efficient ODE discretizations much easier!

The other very popular form of spatial discretization is the finite element method. For this form, you choose some basis function over space and discretize the basis function. The definition of this basis function’s discretization then derives what the A discretization of the derivative operators should be. However, there is a vast array of different choices for basis and discretization. If we wanted to create a toolbox which would implement all of what’s possible, we wouldn’t get anything else done. Thus we will instead, at least for now, piggyback off of the efforts of FEniCS. FEniCS is a toolbox for the finite element method. Using PyCall, we can build an interface to FEniCS that makes it easy to receive the appropriate A linear operator (usually a sparse matrix) that arises from the desired discretization. This, the development of a FEniCS.jl, is the topic of @ysimillides’s Google Summer of Code. The goal is to make this integration seamless, so that way getting ODEs out for finite element discretizations is a painless process, once again reducing the problem to solving ODEs.

The last form of spatial discretization is spectral discretizations. These can be handled very well using the Julia library ApproxFun.jl. All that is required is to make it possible to step in time the equations which can be defined using the ApproxFun types. This is the goal of DiffEqApproxFun.jl. We already have large portions of this working, and for fixed basis lengths the ApproxFunProblems can actually be solved using any ODE solver (not just native Julia solvers, but also methods from Sundials and ODEInterface work!). This will get touched up soon and will be another type of PDE discretization which will be efficient and readily available.

Improved Analysis Tools

What was described above is how we are planning to solve very common difficult problems with high efficiency and simplify the problems for the user, all without losing functionality. However, the tools at the very top of the stack, the analysis tools, can also become much more efficient. This is the other focus of DifferentialEquations.jl 3.0.

Local sensitivity analysis is nice because it not only tells you how sensitive your model is to the choice of parameters, but it gives this information at every time point. However, in many cases this is overkill. Also, this makes the problem much more computationally difficult. If we wanted to classify parameter space, like to answer the question “where is the model least sensitive to parameters?”, we would have to solve this equation many times. When this is the question we wish to answer, global sensitivity analysis methods are much more efficient. We plan on adding methods like the Morris method in order for sensitivities to be globally classified.

In addition, we really need better parameter estimation functionality. What we have is very good: you can build an objective function for your parameter estimation problem to then use Optim.jl, BlackBoxOptim.jl or any MathProgBase/JuMP solver (example: NLopt.jl) to optimize the parameters. This is great, but it’s very basic. In many cases, more advanced methods are necessary in order to get convergence. Using likelihood functions instead of directly solving the nonlinear regression can often times be more efficient. Also, in many cases statistical methods (the two-stage method) can be used to approximately optimize parameters without solving the differential equations repeatedly, a huge win for performance. Additionally, Bayesian methods will not only give optimal parameters, but distributions for the parameters which the user can use to quantify how certain they are about estimation results. The development of these methods is the focus of @Ayush-iitkgp’s Google Summer of Code project.

DifferentialEquations.jl 3.0 Conclusion

2.0 was about building infrastructure. 3.0 is about filling out that infrastructure and giving you the most efficient methods in each of the common problem domains.

DifferentialEquations.jl 4.0 and beyond

I think 2.0 puts us in a really great position. We have a lot, and the infrastructure allows us to be able to keep expanding and adding more and more algorithms to handle different problem types more efficiently. But there are some things which are not slated in the 3.0 docket. One thing that keeps getting pushed back is the automatic promotion of problem types. For example, if you specified a SplitODEProblem and you want to use an algorithm which wants an ODEProblem (say, a standard Runge-Kutta algorithm like Tsit5), it’s conceivable that this conversion can be handled automatically. Also, it’s conceivable that since you can directly convert an ODEProblem into a SteadyStateProblem, the steady state solvers should work directly on an ODEProblem as well. However, with development time always being a constraint, I am planning on spending more time developing new efficient methods for these new domains rather than the automatic conversions. However, if someone else is interested in tackling this problem, this could probably be addressed much sooner!

Additionally, there is one large set of algorithms which have not been addressed in the 3.0 plan: multistep methods. I have been holding off on these for a few reasons. For one, we have wrappers to Sundials, DASKR, and LSODA which supply well-known and highly efficient multistep methods. However, these wrappers, having the internals not implemented in Julia, are limited in their functionality. They will never be able to support arbitrary Julia types and will be constrained to equations on contiguous arrays of Float64s. Additionally, the interaction with Julia’s GC makes it extremely difficult to implement the integrator interface and thus event handling (custom allocators would be needed). Also, given what we have seen with other wrappers, we know we can actually improve the efficiency of these methods.

But lastly, and something I think is important, these methods are actually only efficient in a small but important problem domain. When the ODE f is not “expensive enough”, implicit Runge-Kutta and Rosenbrock methods are more efficient. When it’s a discretization of a PDE and there exists a linear operator, exponential Runge-Kutta and implicit integrating factor methods are more efficient. Also, if there are lots of events or other dynamic behavior, multistep methods have to “restart”. This is an expensive process, and so in most cases using a one-step method (any of the previously mentioned methods) is more efficient. This means that multistep methods like Adams and BDF (/NDF) methods are really only the most efficient when you’re talking about a large spatial discretization of a PDE which doesn’t have a linear operator that you can take advantage of and events are non-existent/rare. Don’t get me wrong, this is still a very important case! But, given the large amount of wrappers which handle this quite well already, I am not planning on tackling these until the other parts are completed. Expect the *DiffEq infrastructure to support multistep methods in the future (actually, there’s already quite a bit of support in there, just the adaptive order and adaptive time step needs to be made possible), but don’t count on it in DifferentialEquations 3.0.

Also not part of 3.0 but still of importance is stochastic delay differential equations. The development of a library for handling these equations can follow in the same manner as DelayDiffEq.jl, but likely won’t make it into 3.0 as there are more pressing (more commonly used) topics to address first. In addition, methods for delay equations with non-constant lags (and neutral delay equations) also will likely have to wait for 4.0.

In the planning stages right now is a new domain-specific language for the specification of differential equations. The current DSL, the @ode_def macro in ParameterizedFunctions.jl, does great for the problems that it can handle (ODEs and diagonal SDEs). However, there are many good reasons to want to expand the capabilities here. For example, equations defined by this DSL can be symbolically differentiated, leading to extremely efficient code even for stiff problems. In addition, one could theoretically “split the ODE function” to automatically turn the problem into a SplitODEProblem with a stiff and nonstiff part suitable for IMEX solvers. If PDEs also can be written in the same syntax, then the PDEs can be automatically discretized and solved using the tools from 2.0. Additionally, one can think about automatically reducing the index of DAEs, and specifying DDEs.

This all sounds amazing, but it will need a new DSL infrastructure. After a discussion to find out what kind of syntax would be necessary, it seems that a major overhaul of the @ode_def macro would be needed in order to support all of this. The next step will be to provide a new library, DiffEqDSL.jl, for providing this enhanced DSL. As described in the previously linked roadmap discussion, this DSL will likely take a form closer in style to JuMP, and the specs seem to be compatible at reaching the above mentioned goals. Importantly, this will be developed as a new DSL and thus the current @ode_def macro will be unchanged. This is a large development which will most likely not be in 3.0, but please feel free to contribute to the roadmap discussion which is now being continued at the new repository.

Conclusion

DifferentialEquations.jl 1.0 was about establishing a core that could unify differential equations. 2.0 was about developing the infrastructure to tackle a very vast domain of scientific simulations which are not easily or efficiently written as differential equations. 3.0 will be about producing efficient methods for the most common sets of problems which haven’t been adequately addressed yet. This will put the ecosystem in a very good state and hopefully make it a valuable tool for many people. After this, 4.0+ will keep adding algorithms, expand the problem domain some more, and provide a new DSL.

The post DifferentialEquations.jl 2.0: State of the Ecosystem appeared first on Stochastic Lifestyle.

My quest to responsive visualizations with Julia

By: Simon Danisch

Re-posted from: http://randomfantasies.com/2016/09/my-quest-to-responsive-visualizations-with-julia/

Responsive and fun interactivity is notoriously hard! But it’s also the key to less frustration and more patience when working on a project.

There are endless important projects that humanity needs to solve. To become less of a burden to the environment and have a better society we constantly need to advance the state of the art.

Still, people give up on doing this when the initial amount of patience, motivation and curiosity is depleted without getting new incentives.

This actually happened to me with my 3D modeling hobby a few years ago. Not only was the 3D software difficult to learn, it also turned unresponsive fairly quickly, even on a powerful machine.

Just as an example, rendering a 30 second clip in high quality took ~2 weeks with my PC running day and night. This slowness drained my patience and at some point I decided that this is too much hassle for a hobby.

Half a year later, I ran into a similar problem when working on a computer vision pipeline. We simply couldn’t find frameworks which were fast enough for real time processing while maintaining high programming productivity. This time I didn’t give up, but the job could have been much more fun and productive!

At this point I decided that I want to work on software that turns really tough problems like object recognition of 3D models into weekend projects.

I started working on GLVisualize, a library for fast and interactive graphics written in Julia.

Why graphics?

Visualizing something is the first step to understanding and it allows us to explore huge problem spaces that were invisible before.

But current tools seem to be divided into 2D and 3D libraries, and into high-performance libraries, which are kind of a hassle to use, and easily usable libraries, which quickly turn into an unresponsive mess.

So with my background, this seemed like a worthy start!

Why Julia?

While there are several reasons for choosing Julia, today I want to show you its impressive speed despite being an easy-to-use dynamic language.

Let me show you the benefit with two toy examples (all graphics including the plots were produced with GLVisualize!).

Consider the following function, the famous Lorenz Attractor:

It takes a point t0 and parameters a to d and returns a new point. Could you imagine what kind of line this will draw, when recursively applying it to the new result and changing parameters?

I guess no one has such a powerful brain, which is why we visualize these kinds of things, together with sliders to explore the function:

lorenz

code

These are 100,000 points, and as you can see GLVisualize can deal with that amount easily.
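For reference, here is a hedged sketch of the kind of step function being iterated (the textbook Lorenz system with a plain Euler step; the demo’s actual a–d parameterization and types differ):

  function lorenz_step(p, a, b, c, h)
      x, y, z = p
      (x + h*a*(y - x),
       y + h*(x*(b - z) - y),
       z + h*(x*y - c*z))
  end

  points = Vector{NTuple{3,Float64}}(undef, 100_000)
  points[1] = (0.1, 0.0, 0.0)
  for i in 2:length(points)
      points[i] = lorenz_step(points[i-1], 10.0, 28.0, 8/3, 0.01)   # feed each result back in
  end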

Let’s see how long we could have stayed at pleasant interactive speeds in another language.

Let’s take, for example, Python, a language comparable in usability:

 

pyjulo

 

minimal speedup 70.4680453961563
maximal speedup 148.02061279172827
max seconds Julia 0.193422334
max seconds Python 13.630093812942505

As you can see, computation times jump beyond one second at around 10⁵ points for Python, while Julia stays interactive with up to 10⁷ points.

So if you’re interested in a higher resolution in Python you better crank up your patience or call out to C, eliminating all convenience of Python abruptly!

Next example is a 3D mandelbulb visualization:

mandelbulb on the gpu

code

One step through the function takes around 24 seconds with Julia on the CPU, which makes it fairly painful to explore the parameters.

So why is the shown animation still smooth? Well, when choosing Julia, I’ve been betting on the fact that it should be fairly simple to run Julia code on the GPU. This bet is now starting to become reality.

I’m using CUDAnative and GPUArrays to speed up the calculation of the mandelbulb.

CUDAnative allows you to run Julia code on the GPU while GPUArrays offers a simpler array interface for this task.

Here are the benchmark results:

cujlmandel

minimal speedup 37.05708590791309
maximal speedup 401.59165062897495
max seconds cpu 24.275534426
max seconds cuda 0.128474155

This means that by running Julia on the GPU we can still enjoy a smooth interaction with this function.

But the problem is actually so hard, that I can’t run this interactively at the resolution I would like to.

As you can see, the iso surface visualization still looks very coarse. So even when using state of the art software, you still run into problems that won’t compute under one second.

But with Julia you can at least squeeze out the last bit of performance of your hardware, be it CPU or GPU, while enjoying the comfort of a dynamic high level language!

I’m sure that with Julia and advances in hardware, algorithms and optimizations, we can soon crack even the hardest problem with ease!

 


Benchmarking system:

RAM: 32GB

CPU: Intel® Core™ i7-6700 CPU @ 3.40GHz × 8

GPU: GeForce GTX 950

 

Multiple-GPU Parallelism on the HPC with Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/multiple-gpu-on-the-hpc-with-julia/

This is the exciting Part 3 of using Julia on an HPC. First I got you started with using Julia on multiple nodes. Second, I showed you how to get the code running on the GPU. That gets you pretty far. However, if you got a trial allocation on Comet and started running jobs, you may have noticed when looking at the architecture that you’re not getting to use the full GPU. In the job script I showed you, I asked for 2 GPUs. Why? Well, that’s because the flagship NVIDIA GPU, the Tesla K80, is actually a dual GPU and you have to control the two parts separately. You may have been following along on your own computer and have been wondering how you use the multiple GPUs in your setup as well. This tutorial will cover how to extend our function to multiple-GPU setups. You will notice that while this code was made for 2 GPUs, by simply extending the array cudaCores with the values of other GPUs this code automatically extends to however many you need.

Recall

I want to recall the problem a bit to rephrase it and make it easier to understand why what we are doing will work. Recall that we are simply looping over a mesh of (i,j) (you can think of it as (x,y)), solving some inner function at each point, and adding together the results. The CUDA kernel is shown in the previous part. Thus, how we will extend our problem to multiple GPUs is to simply split the mesh and solve the inner function on different GPUs. It’s critical to realize that this solution method requires independence, i.e. that different parts of the grid do not affect each other. Thus you can think of our function as being a “vectorized” evaluation.

Note that we are not doing the most memory efficient implementation. Each GPU will have the full array of i‘s and j‘s even though they will only evaluate it at specific parts. One could split it up, though that would take more work for little performance gain. Thus you should really look into this if you are running low on memory (or to do as an exercise). If you would like to do this, then you can calculate how far in the i‘s the first GPU calculates, and then only send the right portion of the i‘s to the other GPU (since we loop down the j‘s, both will need the full j array). Since this problem is not memory limited, I will not show this solution.

Upgrading the Kernel

Upgrading the kernel will be quite simple. Recall that we wrote the problem with linear indexing before. We will keep the linear indexing, except now we will pass each GPU an integer, startIdx, which tells the GPU where in the linear indexing it should start. Thus as before we have the value equalDiv which tells the GPU how many indices each core will evaluate, and we have

int index = threadIdx.x + blockIdx.x * blockDim.x;

which tells the GPU which core it is, we have a new global index which is

int globalIndex = index*equalDiv+startIdx;

Notice that each globalIndex is equalDiv apart, and so that is the section each core will operate on. The CUDA kernel then just sums up the solution of the inner function over the next equalDiv indices:

    __global__ void integration(const float *coefs, const float *iArr, const float *jArr, const int sizei, const int sizej, const int equalDiv,const int startIdx, int *tmp)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        int globalIndex = index*equalDiv+startIdx;
        int loopInd;
        float i;
        float j;
        float isq2;
        float isq3;
        float isq4;
        float isq5;
        float isq6;
        float isq7;
        float isq8;
        float jsq2;
        float jsq3;
        float jsq4;
        float jsq5;
        float jsq6;
        float jsq7;
        float jsq8;
        int ans = 0;
        for(loopInd=0;loopInd<equalDiv;loopInd=loopInd+1){
          // Check bounds before indexing so we never read past the end of iArr/jArr.
          if(globalIndex+loopInd >= sizei*sizej){
            break;
          }
          i = iArr[(globalIndex+loopInd)/sizej];
          j = jArr[(globalIndex+loopInd)%sizej];
          if((globalIndex+loopInd)%sizej==0 || loopInd==0){
            isq2 = i*i;
            isq3 = i*isq2;
            isq4 = isq2*isq2;
            isq5 = i*isq4;
            isq6 = isq4*isq2;
            isq7 = i*isq6;
            isq8 = isq4*isq4;
          }
          jsq2 = j*j;
          jsq3 = j*jsq2;
          jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4;
          jsq6 = jsq2*jsq4;
          jsq7 = j*jsq6;
          jsq8 = jsq4*jsq4;
          ans = ans + innerFunc(coefs,i,isq2,isq3,isq4,isq5,isq6,isq7,isq8,j,jsq2,jsq3,jsq4,jsq5,jsq6,jsq7,jsq8);
        }
        tmp[index] = ans;
    }

Notice that the inner function did not have to be changed at all:

  __device__ int innerFunc(const float *coefs,const float i,const float isq2,const float isq3,const float isq4,const float isq5,const float isq6,const float isq7,const float isq8,const float j,const float jsq2,const float jsq3,const float jsq4,const float jsq5,const float jsq6,const float jsq7,const float jsq8)
  {
    return abs(coefs[0]*(jsq2) + coefs[1]*(jsq3) + coefs[2]*(jsq4) + coefs[3]*(jsq5) + coefs[4]*jsq6 + coefs[5]*jsq7 + coefs[6]*jsq8 + coefs[7]*(i) + coefs[8]*(i)*(jsq2) + coefs[9]*i*jsq3 + coefs[10]*(i)*(jsq4) + coefs[11]*i*jsq5 + coefs[12]*(i)*(jsq6) + coefs[13]*i*jsq7 + coefs[14]*(isq2) + coefs[15]*(isq2)*(jsq2) + coefs[16]*isq2*jsq3 + coefs[17]*(isq2)*(jsq4) + coefs[18]*isq2*jsq5 + coefs[19]*(isq2)*(jsq6) + coefs[20]*(isq3) + coefs[21]*(isq3)*(jsq2) + coefs[22]*isq3*jsq3 + coefs[23]*(isq3)*(jsq4) + coefs[24]*isq3*jsq5 + coefs[25]*(isq4) + coefs[26]*(isq4)*(jsq2) + coefs[27]*isq4*jsq3 + coefs[28]*(isq4)*(jsq4) + coefs[29]*(isq5) + coefs[30]*(isq5)*(jsq2) + coefs[31]*isq5*jsq3+ coefs[32]*(isq6) + coefs[33]*(isq6)*(jsq2) + coefs[34]*(isq7) + coefs[35]*(isq8))<1;
  }

As before, we wrap this in extern “C”{} for libraries.

Setup Multi-GPU Julia Driver

Recall that we are starting with the following parameters:

  imin = -12
  imax = 1
  jmin = -4
  jmax = 4
  iarr = convert(Vector{Float32},collect(imin:dx:imax))
  jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
  sizei = length(imin:dx:imax)
  sizej = length(jmin:dx:jmax)
  totArea = ((imax-imin)*(jmax-jmin)/(sizei*sizej));
 
  coefs = getCoefficients(A0,A1,B0,B1,α,β1234)

Where getCoefficients is some function which returns an array of 36 integers. You can make this be a call for random integers or something of the like. Now we need to call our kernel from Julia. Start by compiling the kernel into a .ptx as in part two. Our simple implementation will be to split the job by the number of cores on each GPU. On Comet, each part of the K80 has 2496 CUDA cores, so we would set

  cudaCores = [2496,2496]

I will instead use the more interesting case of my development computer. It has a GTX 970 with 1664 cores and a GTX 750Ti with 640 cores. Therefore I start by setting

  cudaCores = [1664,640]

The implementation is then as follows: I first get equalDiv as the number of calculations to be done per core. Since I am looping over a mesh with sizei and sizej, if I let totCores=sum(cudaCores) be the total number of cores, then

equalDiv = sizei*sizej÷totCores + 1

tells me how many calculations per core (we add 1, i.e. take the ceiling, in order to make sure that we cover all of it; otherwise the part truncated by the integer division would not be calculated). Next we need to calculate where to start the second GPU. Let portions=equalDiv*cudaCores be the total number of calculations in each GPU (it returns a vector of 2 values). I want my code to work when there’s only 1 card (i.e. cudaCores=[2496]), so to calculate the starting indices I run

  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end

What this does is give an array where the first value is zero, the second value is portions[1], the third is portions[1]+portions[2], etc., and if there is only one card then it’s simply the array with zero. Notice this is exactly the index in the linear indexing at which each GPU needs to start, so this works.
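To make the bookkeeping concrete, here is a small worked example with a hypothetical 1000×1000 mesh (the real sizei and sizej depend on dx):

  cudaCores = [1664, 640]
  sizei, sizej = 1000, 1000                    # hypothetical mesh: 10^6 total points
  totCores = sum(cudaCores)                    # 2304
  equalDiv = sizei*sizej ÷ totCores + 1        # 435 evaluations per core
  portions = equalDiv .* cudaCores             # [723840, 278400] points per GPU
  startIdx = [0; cumsum(portions)[1:end-1]]    # [0, 723840]
  # 723840 + 278400 = 1002240 ≥ 10^6, so the whole mesh is covered.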

Now we need to move our arrays over to the GPUs. Note that each array is stored on only one GPU, so we need to loop through the devices and put a copy of each array on each device. We change devices with the command device(i), where i=0 is the first device, i=1 is the second, etc. Since the values are really C pointers, we just store them in arrays of type Any (of size numCards, the number of GPU cards, calculated as length(cudaCores)). The setup is then:

  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end

Lastly, we need to call the kernel. Recall that in my problem the inner loop is to change coefs, run the kernel calculation, and use the result. Thus the test function is where I send coefs to the GPU. This calls the kernel as before, except we need to do it asynchronously via Julia’s streams. Notice we have a new variable (or array of variables) around: streams. The setup code above initialized one GPU stream on each device. What we will do is tell the kernel to run on its stream and move on before it finishes (but not sum the result until after the GPU stream has finished). This is done by the following code:

        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end

@async is the command which makes the block asynchronous: Julia keeps going forward even if the block isn’t done yet. Thus what we do is start work on all of the GPUs and then wait for each to finish via the @sync command. The full loop is then as follows:

  function f4()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end

I put it in the function f4 for benchmarking. Notice we have @sync around the loop, telling Julia not to go past the loop until every portion is done, and inside the loop we wrap each body in @async so that Julia can start the GPU code on every card at once without waiting for the first to finish, then come back and finish up as each one completes (this does not need to be done with multiple processes since the bookkeeping overhead here is tiny; the overhead of @parallel would probably increase the runtime!). After we are done, we scale by the area element to get the answer.
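
To see why this pattern overlaps the work, here is a minimal illustration of the @sync/@async structure with no GPUs involved (sleep stands in for launching a kernel and waiting on its stream); the whole block takes roughly as long as the slowest task, not the sum of the tasks:

  function async_demo()
    @sync begin
      for t in [0.5, 0.2]
        @async begin
          sleep(t)  # stand-in for launch(...) followed by wait(streams[i])
          println("task that took $t seconds is done")
        end
      end
    end
  end
  @time async_demo() # ~0.5 seconds total, not ~0.7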

So how well did it do? After my last post someone told me to check out the Benchmark package, so to finish off the code I run:

  df4 = benchmark(f4,"MultiGPU",1000)
  println(df4)
  CUDArt.close(devices(dev->true))

benchmark will run the function 1000 times and report the average runtime when we print the dataframe df4. Then I close the GPU devices to clean up the memory.

Heterogeneous GPU Setup

Recall that our last implementation simply split up the problem by how many cores each GPU has. We will get back to whether that is the right choice later, but first let’s build something a little more complicated to test it out. What about making equalDiv higher on faster GPUs and lower on slower GPUs? What we will do is declare an array trueBias, like

trueBias = [.75,.25]

which says how to bias the per-core workloads (the equalDiv values) amongst the GPUs. How do we calculate the division? It’s easiest to see by example. With trueBias = [.75,.25] we want

$equalDiv_1 = 3 \, equalDiv_2$,

that is, each core of the first GPU should do 3 times as much work as each core of the second GPU. Since the number of values we want to calculate is sizei*sizej, and GPU $i$ computes equalDiv values on each of its cudaCores[i] cores, the totals must satisfy

$equalDiv_1 \cdot cudaCores_1 + equalDiv_2 \cdot cudaCores_2 = sizei \cdot sizej$,

and by substituting $equalDiv_1 = 3 \, equalDiv_2$ we get

$equalDiv_2 = \frac{sizei \cdot sizej}{3 \, cudaCores_1 + cudaCores_2}$.

Do a bigger example with trueBias = [.75, 1/16, 1/8, 1/16] and you will see the pattern: the factor in front of $cudaCores_k$ is trueBias[k]/trueBias[i], i.e. the ratio of GPU $k$’s per-core workload to GPU $i$’s. The left-hand side is then just the dot product of these ratios with the cudaCores vector, so we can solve for $equalDiv_i$ by dividing sizei*sizej by that dot product.
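
In formula form (taking the ceiling so that the whole mesh is covered), the quantity we want is

$equalDiv_i = \left\lceil \frac{sizei \cdot sizej}{\sum_k \frac{trueBias_k}{trueBias_i}\, cudaCores_k} \right\rceil$

and the denominator is exactly dot(trueBias/trueBias[i], cudaCores), so the following code calculates it: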

  equalDiv = Vector{Int32}(numCards)
  for i = 1:numCards
    equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
  end
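
For example (again assuming dx = 1/400 so that sizei*sizej = 16648401), with my desktop cards and trueBias = [.75,.25] this works out to:

  cudaCores = [1664,640]
  trueBias = [.75,.25]
  sizei = 5201; sizej = 3201
  equalDiv = [ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores)) for i=1:2]
  # equalDiv == [8869, 2957]; the ratio 8869/2957 ≈ 3, so each GTX 970 core is
  # assigned three times the work of each GTX 750Ti core.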

We then need to make sure we scale portions by the equalDiv per GPU:

  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end

When calling the kernel, all we have to do is pass the equalDiv[i] for that card, and thus we get the test function:

  function f5()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv[i],startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end
  df5 = benchmark(f5,"MultiGPU2",1000)
  println(df5)
  CUDArt.close(devices(dev->true))

Same as before with equalDiv[i] now.

Does Changing the Divisions Help?

I ran this on my desktop and got the following results:

1x12 DataFrames.DataFrame
| Row | Category | Benchmark | Iterations | TotalWall | AverageWall |
|-----|----------|-----------|------------|-----------|-------------|
| 1   | "GPU"    | "GPU"     | 1000       | 23.1013   | 0.0231013   |
 
| Row | MaxWall   | MinWall  | Timestamp             |
|-----|-----------|----------|-----------------------|
| 1   | 0.0264843 | 0.021724 | "2016-02-27 17:16:32" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |
1x12 DataFrames.DataFrame
| Row | Category   | Benchmark  | Iterations | TotalWall | AverageWall |
|-----|------------|------------|------------|-----------|-------------|
| 1   | "MultiGPU" | "MultiGPU" | 1000       | 22.3445   | 0.0223445   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0593403 | 0.0211532 | "2016-02-27 17:16:55" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |
1x12 DataFrames.DataFrame
| Row | Category    | Benchmark   | Iterations | TotalWall | AverageWall |
|-----|-------------|-------------|------------|-----------|-------------|
| 1   | "MultiGPU2" | "MultiGPU2" | 1000       | 22.5787   | 0.0225787   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0557531 | 0.0213999 | "2016-02-27 17:17:18" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |
1x12 DataFrames.DataFrame
| Row | Category             | Benchmark            | Iterations | TotalWall |
|-----|----------------------|----------------------|------------|-----------|
| 1   | "MultiGPU2-Balanced" | "MultiGPU2-Balanced" | 1000       | 22.3453   |
 
| Row | AverageWall | MaxWall   | MinWall   | Timestamp             |
|-----|-------------|-----------|-----------|-----------------------|
| 1   | 0.0223453   | 0.0593927 | 0.0211599 | "2016-02-27 17:17:41" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |

The first run was 1 GPU, which averaged .0231 seconds. The second run was the equal split, which got .0223 seconds. Next was the split trueBias = [.51,.49], which got .0225 seconds. Lastly I set trueBias = [.5,.5] and got .0223 seconds. Notice two things: first, allowing for a heterogeneous split adds no overhead (we still only pass an integer per card); second, and more interesting, the best split was still 50-50. This is reproducible, and it only gets worse the more we bias the split ([.75,.25] is terrible!).

So what’s going on? Well, first of all notice that this is not a memory-bound problem: in the loop we only send an array of 36 values. The largest difference between the GPUs is their memory bandwidth. From the GTX 970 specs we see that its memory bandwidth is 224 GB/sec vs. the GTX 750Ti’s 86.4 GB/sec. That’s a whopping ~2.6x difference, but it doesn’t matter in this case. All that matters in this test is the compute power per core. Here the GTX 970 has a base clock of 1050 MHz and a boost clock of 1178 MHz, while the GTX 750Ti has a base clock of 1020 MHz and a boost clock of 1085 MHz, and 1178/1085 ≈ 1.09: they are almost the same. Note that this is the comparison between an entry-level graphics card and one of the high-end cards of today (only beaten in the gaming sphere by the GTX 980(Ti) and the Titan X). GPUs tend to have about the same strength per core; the difference is mostly in how many cores there are. In that sense, our result is not surprising. However, knowing how to deal with the heterogeneity is probably good for more memory-taxing problems.

Running it on Comet

I compiled this code onto Comet and ran it with the job script

#!/bin/bash
#SBATCH -A uci128
#SBATCH --job-name="jOpt"
#SBATCH --output="output/jOpt.%j.%N.out"
#SBATCH --partition=gpu-shared
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --mail-type=ALL
#SBATCH --mail-user=crackauc@uci.edu
#SBATCH --ntasks-per-node=12
#SBATCH -t 0:10:00
module load cuda/7.0
module load cmake
export SLURM_NODEFILE=`generate_pbs_nodefile`
/home/crackauc/julia-a2f713dea5/bin/julia --machinefile $SLURM_NODEFILE /home/crackauc/projectCode/ImprovedSRK/Optimization/integrationTestComet.jl

The result was the following:

| Row | Category    | Benchmark   | Iterations | TotalWall | AverageWall |
|-----|-------------|-------------|------------|-----------|-------------|
| 1   | "SingleGPU" | "SingleGPU" | 1000       | 26.5858   | 0.0265858   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0440866 | 0.0260225 | "2016-02-27 17:28:41" |
 
| Row | JuliaHash                                  | CodeHash | OS      |
|-----|--------------------------------------------|----------|---------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Linux" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 24       |
 
1x12 DataFrames.DataFrame
| Row | Category   | Benchmark  | Iterations | TotalWall | AverageWall |
|-----|------------|------------|------------|-----------|-------------|
| 1   | "MultiGPU" | "MultiGPU" | 1000       | 22.3377   | 0.0223377   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0458176 | 0.0219705 | "2016-02-27 17:29:06" |
 
| Row | JuliaHash                                  | CodeHash | OS      |
|-----|--------------------------------------------|----------|---------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Linux" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 24       |
1x12 DataFrames.DataFrame
| Row | Category             | Benchmark            | Iterations | TotalWall |
|-----|----------------------|----------------------|------------|-----------|
| 1   | "MultiGPU2-Balanced" | "MultiGPU2-Balanced" | 1000       | 22.4148   |
 
| Row | AverageWall | MaxWall   | MinWall   | Timestamp             |
|-----|-------------|-----------|-----------|-----------------------|
| 1   | 0.0224148   | 0.0549695 | 0.0220449 | "2016-02-27 17:29:29" |
 
| Row | JuliaHash                                  | CodeHash | OS      |
|-----|--------------------------------------------|----------|---------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Linux" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 24       |
1x12 DataFrames.DataFrame
| Row | Category               | Benchmark              | Iterations |
|-----|------------------------|------------------------|------------|
| 1   | "MultiGPU2-Unbalanced" | "MultiGPU2-Unbalanced" | 1000       |
 
| Row | TotalWall | AverageWall | MaxWall   | MinWall   |
|-----|-----------|-------------|-----------|-----------|
| 1   | 22.6683   | 0.0226683   | 0.0568948 | 0.0224583 |
 
| Row | Timestamp             | JuliaHash                                  |
|-----|-----------------------|--------------------------------------------|
| 1   | "2016-02-27 17:29:53" | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" |
 
| Row | CodeHash | OS      | CPUCores |
|-----|----------|---------|----------|
| 1   | NA       | "Linux" | 24       |

Notice that there is a performance gain from going to multiple GPUs, but again we lose performance by unbalancing (and by running this a few times we see that the two balanced versions give the same result within random error). Still, the speedup from going to multiple GPUs is not large here. The difference is bigger for bigger meshes. This shows that multiple GPUs are really good for tackling larger problems, though in general if your problem is “not too big”, then going to multiple GPUs will only be a mild gain (remember, you want each processor to run enough calculations for it to be worthwhile!).

Lastly, notice that there is not much of a performance difference between our consumer GPU and the Tesla! This is for a few reasons. One, we are using single precision. The Tesla and GTX lines are almost the same when it comes to single-precision calculations; we would see a huge (~32x) difference for double-precision calculations. Secondly, the Tesla has ECC memory, which slows it down a little but makes sure it does not have errors in its memory. Lastly, the big advantage of a Tesla K80 is that it has a “huge” amount of memory compared to other GPUs. The GTX 970 has ~3.5 GB of full-speed memory (due to NVIDIA’s lies…) whereas the Tesla K80 has 24 GB. This means that if the problem is single precision and fits on your GTX, you’re fine. Otherwise, the Tesla is required.

Where do we go from here?

We started with an HPC allocation. We got Julia running on the HPC, and we set up a parallel function to use multiple nodes. Then, realizing how parallel our problem is, we took it to GPUs, first on one GPU and then on multiple, with a setup that works for as many GPUs as you need. Our implementation centers on a kernel which simply evaluates a function along a vector, i.e. a vectorized function call, so very similar code works for all vectorized functions. What else do we need?

One obvious generalization is higher-dimensional arrays. If you need to do vectorized calls on matrices etc., you can just flatten them in Julia and send them to the GPU in that form. If you want to keep the matrix form inside the CUDA kernel, you just have to play with indices, but the same approach we used here works.
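
As a minimal sketch of that flatten-and-restore pattern on the Julia side (the GPU part is elided here):

  A = rand(Float32,100,200)   # a matrix input
  a = vec(A)                  # flatten it; this is what gets copied to the GPU
  # ... CudaArray(a), launch the vectorized kernel, copy the result back ...
  B = reshape(a,size(A))      # restore the matrix shape on the host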

One place to go is to start using Float64’s (doubles). Remember, on most consumer GPUs they are really slow, but on Tesla GPUs like those in Comet they are still quite fast. All you have to do is change the array declarations to Float64’s (and the kernel’s float arguments to double) and now you have double precision.
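
Concretely, on the Julia side the change is just the element type in the conversions (a sketch; g_tmp can stay Int32 since it only stores counts):

  iarr = convert(Vector{Float64},collect(imin:dx:imax))
  jarr = convert(Vector{Float64},collect(jmin:dx:jmax))
  # coefs would likewise be built as a Vector{Float64}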

This means that you can basically code whatever vectorized calls you need to do. What about harder problems? For those, check out the libraries. cuBLAS is a BLAS implementation for doing linear algebra on the GPU; it contains cuBLAS-XT, which automatically scales cuBLAS to multiple GPUs. Another tool for linear algebra is cuSolver, which provides GPU methods for solving the linear system Ax=b. You can use these in your CUDA kernels.

There are also Julia packages for making this easier. There’s cuBLAS.jl, ArrayFire.jl, etc., all for making GPU calculations easier, and these do not require you to write/compile CUDA kernels. I may write a tutorial on this with some experiments, but it should be pretty easy given what you know now.

At this point you should be pretty capable with GPUs. My next focus will then be on Xeon Phis. I will have to write a Multigrid solver soon, so I may write one on the GPU and also on the Xeon Phi (called from Julia) and see the performance difference. Since I also have an allocation on LSU’s SuperMIC, I will also be looking for good ways to both use the Xeon Phi at the same time as the GPU (they have hybrid nodes!) and understand which problems are better on which device. Stay tuned.

Full Julia Code

Again, I am going to put my full Julia code here for reference. I use a bunch of values to calculate the coefficients as it pertains to my real problem, but you can just swap out the getCoefficients call with any function that returns an array of 36 Float32’s (e.g. a conversion of rand(36)); a hypothetical stand-in is sketched below.
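
If you want to run this without my getCoefs.jl, a stand-in like the following (not my actual function; it just returns 36 random Float32 coefficients plus placeholder exponent vectors so the destructuring below still works) is enough:

  # Hypothetical stand-in for getCoefficients.  The real version computes the
  # 36 polynomial coefficients from the tableau; here we just return random
  # Float32 values.  powz/poww are the exponent vectors used only by the
  # commented-out CPU versions, so placeholders suffice.
  function getCoefficients(args...)
    coefs = convert(Vector{Float32},rand(36))
    powz = zeros(Int,36)
    poww = zeros(Int,36)
    return coefs,powz,poww
  end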

function integrationTest()
  A0 = [0 0 0 0
      1 0 0 0
      1/4 1/4 0 0
      0 0 0 0];
  A1 = [0 0 0 0
      1/4 0 0 0
      1 0 0 0
      0 0 1/4 0];
  B0 = [0 0 0 0
       0 0 0 0
      1 1/2 0 0
      0 0 0 0];
  B1 = [0 0 0 0
      -1/2 0 0 0
      1 0 0 0
      2 -1 1/2 0];
 
  α = [1/6 1/6 2/3 0];
 
  β1 = [-1 4/3 2/3 0];
  β2 = [1 -4/3 1/3 0];
  β3 = [2 -4/3 -2/3 0];
  β4 = [-2 5/3 -2/3 1];
 
  dx = 1/400
 
 
 
  imin = -12
  imax = 1
  jmin = -4
  jmax = 4
  coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1,β2,β3,β4)
  iarr = convert(Vector{Float32},collect(imin:dx:imax))
  jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
  sizei = length(imin:dx:imax)
  sizej = length(jmin:dx:jmax)
  totArea = ((imax-imin)*(jmax-jmin)/(sizei*sizej));
 
  #=
  function coefFunc()
    getCoefficients(A0,A1,B0,B1,α,β1,β2,β3,β4)
  end
  dfcoef = benchmark(coefFunc,"coef",10)
  println(dfcoef)
  =#
 
  #=
  #Some SIMD
  function f1()
    res = @sync @parallel (+) for i = imin:dx:imax
      tmp = 0
       for j=jmin:dx:jmax
        ans = 0
        @simd for k=1:36
          @inbounds ans += coefs[k]*(i^(powz[k]))*(j^(poww[k]))
        end
        tmp += abs(ans)<1
      end
      tmp
    end
    res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
    println(res)#Mathematica gives 2.98
  end
  df1 = benchmark(f1,"CPUSimp",2)
  println(df1)
  =#
  #=
  ## Full unravel
  function f2()
    res = @sync @parallel (+) for i = imin:dx:imax
        tmp = 0
        isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
        isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
        @simd for j=jmin:dx:jmax
          jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
          @inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
          + coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
          coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
          coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
          coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
        end
        tmp
      end
      res = res*totArea
      println(res)#Mathematica gives 2.98
  end
  df2 = benchmark(f2,"CPU",2)
  println(df2)
  =#
 
  #=
  #GPU
  cudaCores = 2496;
  equalDiv = sizei*sizej÷cudaCores + 1
  CUDArt.init(devices(dev->true))
  dev = device(0)
  md = CuModule("integrationWin.ptx",false)
  integrationFunc = CuFunction(md,"integration")
  g_iarr = CudaArray(iarr)
  g_jarr = CudaArray(jarr)
  g_tmp = CudaArray(Int32,cudaCores)
  function f3()
    g_coefs = CudaArray(coefs)
    launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,)); res = sum(to_host(g_tmp));
    res = res*totArea
    #println(res)
  end
  df3 = benchmark(f3,"GPU",10)
  println(df3)
  CUDArt.close(devices(dev->true))
  =#
 
  #Single-GPU
  cudaCores = [2496];
  #cudaCores = [1664,1]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = sizei*sizej÷totCores + 1
  portions = equalDiv*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f4()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    println(res)
  end
  df4 = benchmark(f4,"SingleGPU",1000)
  println(df4)
  CUDArt.close(devices(dev->true))
 
  #Multi-GPU
  cudaCores = [2496,2496];
  #cudaCores = [1664,1]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = sizei*sizej÷totCores + 1
  portions = equalDiv*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f4()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    println(res)
  end
  df4 = benchmark(f4,"MultiGPU",1000)
  println(df4)
  CUDArt.close(devices(dev->true))
 
 
  #= Hardware Bias Spec.
  cudaCores = [1664,640];
  trueBias = [.75,.25]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  coreBias = cudaCores ./ totCores
  workloadBias = trueBias./(coreBias)
  equalDiv = ceil(Int32,sizei*sizej/totCores .* workloadBias)
  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  =#
 
  ## Heterogeneous Multi-GPU
  # trueBias = coreBias.* workloadBias*numCards
  cudaCores = [2496,2496]
  trueBias = [.5,.5]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = Vector{Int32}(numCards)
  for i = 1:numCards
    equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
  end
 
  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f5()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv[i],startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end
  df5 = benchmark(f5,"MultiGPU2-Balanced",1000)
  println(df5)
  CUDArt.close(devices(dev->true))
 
 
 
  ## Heterogeneous Multi-GPU
  # trueBias = coreBias.* workloadBias*numCards
  cudaCores = [2496,2496];
  trueBias = [.55,.45]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = Vector{Int32}(numCards)
  for i = 1:numCards
    equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
  end
 
  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f5()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv[i],startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end
  df5 = benchmark(f5,"MultiGPU2-Unbalanced",1000)
  println(df5)
  CUDArt.close(devices(dev->true))
end
 
using CUDArt
using Benchmark
include("getCoefs.jl")
integrationTest()

The post Multiple-GPU Parallelism on the HPC with Julia appeared first on Stochastic Lifestyle.