# The Numerical Algorithms Group (NAG) and the NAG Library Introduction

The Numerical Algorithms Group (NAG) provides expertise in numerical engineering, by delivering high-quality computational software,
consulting services and high performance computing services. For over four decades NAG have collaborated with world-leading researchers
in academia and industry to create powerful, reliable and flexible software which is relied on by tens of thousands of individual users,
as well as numerous independent software vendors. As a not-for-profit company,
NAG reinvests surpluses into the research and development of its products, services, staff and its collaborations.

The NAG Library is the world’s largest collection of robust, documented, tested and maintained numerical and statistical algorithms.
It is callable from many different programming languages and environments such as .NET, C++, Java®, MATLAB®, Python® and Visual Basic®.
Some users of the Julia language already call the NAG Library into their applications.
This blog demonstrates that the NAG Library is also callable directly from Julia, using one of the routines from the NAG Library as an example.

This blog post is also available on the NAG website.

# Introduction to Correlation Matrices

Correlations provide helpful information when working with possibly mutually related data sources, such as the value of stocks, options, precious metals or others. In the case of three or more data sources this produces a matrix known as a “correlation matrix” which has a special structure, shown in Julia code below:

Correlation (or, to give its full name, “Pearson’s product-moment correlation coefficient”),
is a measure of the linear relationship between two random variables. One of the features
that makes it particularly useful is that it is invariant to location-scale
transformations of the variables, giving values on the interval [-1,1]: 1 being perfect
positive linear relationship, 0 being no linear relationship (though, it should be
emphasised, not necessarily independent), and -1 being a perfect negative relationship.

A correlation matrix is then the set of pairwise correlations between multiple random
variables, where the (i,j)th element is the correlation between the ith and the
jth variable. Mathematically, a correlation matrix must satsify the following
properties:

• it must be symmetric,

• the diagonal elements must be 1, and

• it must be positive semidefinite, which is equivalent to all eigenvalues being non-negative.

## Estimating Correlation Matrices

Of course, we never get to observe the correlation matrix directly, instead it must be
estimated from data. Given a matrix Y, where each row is a joint observation of the
variables, the empirical correlation matrix can be computed using the
cor function in Julia.

The resulting matrix should satisfy the three properties above, though some small
deviations may arise due to numerical error in extreme cases:

The structure of correlations produces a matrix that should be symmetric, have unit diagonal and be
positive semidefinite. In
applications however it can happen that approximate correlations lead
to accordingly approximate correlation matrices which are symmetric and
have unit diagonal, but also have nontrivial negative eigenvalues, meaning
it is not positive semidefinite.

Larger deviations from positive definiteness may arise when the correlations are computed
from incomplete data: for example, two stock symbols may not have price data on the exact
same days.

The available approaches to handling the situation of missing data can be broken down into two broad categories;
model the missing values and use that model to generate a full set of data,
hence guaranteeing a semi-definitive correlation matrix; or calculate an approximate correlation matrix from the available data and adjust it to ensure that it is definite.

The first of these approaches is beyond the scope of this article and we only deal with the second:
calculating an approximate correlation matrix from existing data, adjusting if required.
This approach has the advantage of being applicable when an approximate correlation matrix is indefinite for a reason other than missing data,
for example, due to rounding during its calculation.

In the example given below, based on real financial data, it was not possible to calculate a full correlation matrix due to missing data.
Pairwise correlations between the stocks were calculated using as much data as possible, which yielded a matrix with a negative eigenvalue.

It is useful therefore to correlate incomplete data and then clean the resulting
correlation matrix by finding the “nearest” correlation matrix, and then use the nearest correlation
matrix as a proxy for the original correlation matrix in workflows which require those properties.

# Nearest Correlation Matrix

To fix the problem of indefiniteness in an approximation correlation matrix, one
can define an optimization problem which fixes it appropriately without reducing
the original value of the true correlations. One could for example simply find the closest matrix in the Frobenius norm to an input
matrix, but subject to the constraints that the output is a correlation matrix.
In practice this may not be an acceptable solution, so there are alternative
formulations which enforce important properties
of the approximate correlation matrix input. NAG has implemented many of these different
approaches and through a partnership with Julia Computing a Julia
interface has been developed to access these nearest correlation routines in the NAG Library. The next
section will first detail the routines and following this there is
an explanation of how they can be accessed through Julia.

## NAG Routines

The first and simplest routine in the NAG Library for nearest correlation matrices
solves an optimization problem and accepts an arbitrary matrix as input, this is
g02aa, which solves the following optimization problem

where X is a valid correlation matrix. An algorithm devised in 1 uses alternating projection to solve this optimization problem,
which has the virtue of being simple to implement, but tends to converge slowly. Newton’s algorithm
can have rapid convergence and was investigated in 2 along with suitable preconditioners
for the linear systems it yields. This algorithm has been implemented by NAG

Some common situations may make this simple approach less
appealing however. For example suppose only a few data pairs have problematic missing data,
then only a small sub-block of the approximate correlation matrix needs to be fixed and
a likely-small change added to the rest of the matrix to preserve the properties it had
to begin with. Thus targeting of a sub-block can be formulated as follows:

this was solved by Higham et al. in 3 and is implemented by NAG as routine g02an. A similar routine “g02ap” which will be released with the NAG Library Mark 26 can keep arbitrary entries of a matrix fixed, rather than a full sub-block, which affords the user more flexibility.

If fixing a sub-block or fixed entries of the input matrix does not model well the additional information
a user may have, then one can also weight fixed rows and columns in the optimization, formulated as follows:

where W is an input diagonal matrix of weights.
The NAG routine g02ab incorporates this weighted formulation.

Alternatively if weighting whole rows and columns is not adequate one can optionally weight
only specified entries with the routine g02aj,
(note this is a very expensive formulation).

Finally one may know that the output should have a k-factor structure, which means
that the off-diagonal entries of the output can be low-rank approximated.
The NAG Library also incorporates this additional information through the
reformulation

This optimization problem is solved in the NAG routine g02ae and produces a nearest correlation matrix which respects k-factor structure specified by one.

In summary, the NAG Library contains the following nearest correlation routines:

• g02aa Basic problem using Newton’s method.
• g02ab Incorporates row/column weights.
• g02aj Incorporates entrywise weights.
• g02an Incorporates fixed submatrix.
• g02ap Incorporates fixed entries.
• g02ae Incorporates k-factor structure specified by user.

As a final remark it is worth noting that g02ab, g02aj, and g02ap can also optionally include a bound on the lowest eigenvalue. This might be desirable for example
to clamp the smallest eigenvalue to zero rather than allowing a possible roundoff to be negative. Bounding the smallest eigenvalue can also help in enforcing a positive definite
output, rather than only positive semidefinite.

What follows will detail a NAG partnership with Julia Computing which makes these routines much easier to utilize from a Julia environment.

# Julia Interface

Julia’s ccall function provides an easy to use, “no boilerplate,” interface for calling out to shared C and Fortran libraries, such as the NAG C Library and NAG Fortran Library. The ccall function allows for calling into compiled shared libraries directly without the need to write any “glue” code, perform low-level code generation to create function interfaces or even recompile a binary. Full documentation on the use of ccall can be found in the “Calling C and Fortran Code” section of the Julia Language documentation.

For this project, we developed a set of Julia wrappers to the above set of nearest correlation matrix routines. Development of these wrappers consisted of the following components:

• a Julia NAGdemo module containing all of the Julia code for the interface. The module includes all Julia type definitions, const values, function definitions and export definitions for the Julia/NAG interface.
• a constant string (LIBNAGC_NAG) pointing to the location of the NAG shared library distributed with a NAG C Library installation.
• a function nag_licence_query/nag_license_query that tests execution of the simple NAG routine a00acc as a method of performing proper license checkout for the NAG Library.
• translation of entries in the nag_types.h header file included with the NAG C Library into corresponding Julia objects.
• definition of proper error handling routines that can capture, store and print any errors in Julia that occur within the NAG C Library routines.
• definition of Julia wrapper functions to NAG’s nearest correlation matrix routines.

Below we will discuss each of these items in turn.

## Defining a NAGdemo module

In this post we define a simple NAGdemo module, included in full at the bottom of the page. NAGdemo is a Julia module much like any other Julia module, providing a standard mechanism for the organization of Julia code, as well as namespace resolution for all functions defined as part of the module when called from external Julia code. The list of exported functions in the module enumerate those functions that do not need to be called with full namespace resolution when the NAGdemo module is loaded into a Julia session via a using NAGdemo statement.

## Defining a constant for the location of the NAG shared library

The constant string LIBNAGC_NAG contains the location of the NAG shared library which is a required argument for all uses of ccall that will call into the NAG C Library from Julia.

The nag_licence_query \ nag_license_query function provides our first, and simplest, example of calling into the NAG C Library from Julia using ccall.

This particular use of ccall takes 3 arguments:

• a tuple (:a00acc, LIBNAGC_NAG) containing a symbol for the name of the exported function to be accessed, :a00acc, and the const string, LIBNAGC_NAG, pointing to the location of the NAG shared library. The NAG routine :a00acc outputs information about the version of the NAG C Library in use.
• the output type, Cint, of the NAG C routine being called. In the case of :a00acc, the output type is a C integer. The routine will return 0 if a licence is acquired and the function executes successfully, or 1 if the routine was unsuccessful.
• a tuple of the types for all input arguments, In this particular case, :a00acc does not take any input arguments, so this tuple argument is empty.

## Translating the NAG header file

The header file nag_types.h is a large file containing many C enums, typedefs and pre-processor macros. As the contents of this header file are widely used throughout the NAG C Library when referring to the input and output types of most functions, the definitions included in this header, or at least the atomic types and values that are represented by these definitions, need to be present in Julia when wrapping routines via ccall.

While there are packages within the Julia ecosystem, such as Clang.jl, that can assist in automation of wrapping C libraries in certain situations, a manual translation of the necessary portions of nag_types.h file into corresponding Julia code is not difficult, and is sufficient for this exercise. Each of the enum entries within nag_types.h are fundamentally integer constant values and this Nearest Correlation Matrix demonstration only requires one of those enum values:

NagInt is an alias for an appropriately sized integer value on Windows or Unix platforms, as needed for use of the NAG C Library in each operating system environment for appropriate 32-bit or 64-bit flavours as necessary.

## Handling errors from NAG routines

The following block of Julia code defines a NagError type that inherits from Julia’s abstract Exception type, as well as various functions for handling and storing any errors returned by any NAG routine as Julia exceptions:

The call to reset_nag_error() on load of the NAGdemo module sets our custom error_handler routine to collect and handle any errors thrown by NAG routines, as well as throw those errors as Julia exceptions. The last_nag_error() function resets the stored list of NagError values.

## Creating Julia wrappers

For each of the NAG routines for which a Julia function API is to be created we start with inspection of the appropriate NAG function signature to determine the datatypes of all arguments.

With a ccall invocation defined we can subsequently make a determination of which arguments to a NAG routine can be considered as purely input arguments, which arguments could be considered as purely outputs and which arguments could be considered both input and output arguments. This classification can help in the design of a user-facing Julia API.

Pure input arguments are those arguments which are passed to the NAG routine but are not modified in any way. Pure output arguments are those arguments for which memory is allocated prior to calling the NAG routine, but for which the initial values in that memory location have no importance the memory is only needed for returning an output value. Arguments that are both inputs and outputs are those arguments for which the initial value is both used by the NAG routine and modified before being returned.

Understanding the context of each argument in the NAG C API also helps to determine which of the C arguments might be able to have default values assigned via Julia keyword arguments, which of the C arguments might be able to infer their values from other input arguments at the Julia level and which of the C arguments might not need to be present at all within a user-facing Julia API.

For each of the nearest correlation routines considered in this exercise, the routine returns an output of type Void.

For the g02aac routine, its C API is the following:

Constructing a ccall command for this API signature requires mapping the type of each input argument from C type to the corresponding Julia data type. The following table provides that mapping:

argument C data type Julia data type
order Nag_OrderType NagInt
g double[] Ptr{Float64}
pdg Integer NagInt
n Integer NagInt
errtol double Float64
maxits Integer NagInt
maxit Integer NagInt
x double[] Ptr{Float64}
pdx Integer NagInt
iter Integer* Ptr{NagInt}
feval Integer* Ptr{NagInt}
nrmgrd double* Ptr{Float64}
fail NagError* Ptr{Void}

With each of these variables defined in the current Julia scope a pure ccall into NAG’s g02aac routine would look like the following block of code:

As stated previously, for the design of our initial Julia API, we need to make a few decisions about which of the arguments to ccall should be Julia inputs and which values should be returned as outputs.

In this proof of concept project, we made the following decisions for a first level Julia wrapper to g02aac named nag_nearest_correlation!:

• the scalar arguments being passed to ccall by value (e.g. order, pdg, n, errtol, maxits, maxit, pdx) will be arguments to nag_nearest_correlation!
• all array arguments (e.g. g and x) will be arguments to nag_nearest_correlation!. Argument g is actually a pure input argument, as its value is not modified, while argument x could be considered a pure output argument as its memory is only used to store the output correlation matrix. We will return to this point below.
• the scalar arguments being passed by reference into the ccall (e.g. iter, feval, nrmgrd) are being used to track and report internal behaviour of the NAG routine. These pointers to scalar values will be created within the Julia routine and their values returned as Julia integer or floating point values as needed.

The initial implementation of the nag_nearest_correlation! Julia function subsequently took the following form:

Using the same data defined above, we can call this routine as follows:

Now if we decide that we would like to provide an interface that does not require one to manage the memory allocation for x (and also only provides the nearest correlation matrix x as output), we can define the following method nag_nearest_correlation that allocates the memory for x internally and then calls nag_nearest_correlation!.

But the values we have assigned for errtol, maxits and maxit are essentially default values for these arguments. Consequently, we can assign these values in keyword arguments in our nag_nearest_correlation function signature:

Furthermore, the values for arguments pdg, n and pdx can be deduced from the size of the input array g and can also be removed from the function signature for nag_nearest_correlation as follows:

Calling this method now involves only the following code:

Having seen the entire process for construction of a user-friendly wrapper for the nag_nearest_correlation function below are the full contents of our NAGdemo module:

## Summary

By using Julia’s features for calling external libraries we have provided easy interfaces into the NAG nearest correlation matrix routines. This example shows that the numerical computing capabilities of Julia can be easily extended to incorporate pre-existing industry grade libraries. For users of Julia, NAG integration allows the possibility of including widely utilized, specialized libraries developed outside of the Julia ecosystem, and for users of the NAG Library it means their existing workflows depending on NAG need not be disrupted by a switch to Julia.

In summary: the NAG Library, which is the worlds’ largest collection of robust and commercially maintained numerical and statistical routines and Julia, which is
a high-productivity, high-performance programming language is a winning combination.

If you are interested in a more extensive integration of Julia and the NAG Library then please contact Julia Computing at info@juliacomputing.com and NAG at infodesk@nag.com.

Java® is a registered trademark of Oracle Corporation

MATLAB® is a registered trademark of The MathWorks, Inc.

Python® is a registered trademark of the Python Software Foundation

Visual Basic® is a registered trademark of Microsoft Corporation

## References

1. N. J. Higham. “Computing the nearest correlation matrix – A problem from finance”. IMA J. Numer. Anal., 22(3):329–343, 2002. doi: 10.1093/imanum/22.3.329.

2. R. Borsdorf and N. J. Higham. “A preconditioned (Newton) algorithm for the nearest correlation matrix”. IMA J. of Numer. Anal., 30(1):94–107, 2010. doi: 10.1093/imanum/drn085.

3. N. J. Higham, N. Strabić, and V. Šego, “Restoring definiteness via shrinking, with an application to correlation matrices with a fixed block”. SIAM Rev., 58(2), 245–263, 2016. doi:10.1137/140996112.

# Effects of cervical cancer mortality trends on life expectancy

Re-posted from: http://static-dust.klpn.se/posts/2016-10-16-cerv.html

# Effects of cervical cancer mortality trends on life expectancy

Taggar: epidemiology, julia

Since the 1960s, mortality, and also incidence, from cervical cancer has decreased greatly in Sweden, as well as other high-income countries, and screening with Pap tests, which may detect even precancerous lesions, is said to be the most important factor in this decline (Danielsson m.fl. 2012). During the same period, life expectancy among Swedish females has increased about 9 years. One might ask how much the decrease in cervical cancer mortality has contributed to this increase.

One way to assess this question is to take a life table for a recent year, and add the difference between age-specific cervical cancer mortality rates for the early 1960s and recent rates to the age-specific total morality rates used to calculate risks of death, and then recalculate the life table with the new death rates. This shows what the life table had looked like if cervical cancer mortality had stayed the same for the last 50 years, while recent age-specific mortality rates from all other causes had been as they actually are. To do these calculations, you need:

1. Recent data for age-specific number of deaths (due to all causes) and population at risk. This can be obtained from Statistics Sweden (2016) for single years of age.
2. Data for age-specific number of deaths due to cervical cancer and population at risk from recent years as well as the 1960s. This can be obtained from WHO (2016) in 5-year age groups.
3. Some software which can be used to calculate life tables from vectors of population at risk and deaths. In this post, my LifeTable package for Julia will be used.

All files used in the analysis below are available in a gist. The CSV file with cervical cancer mortality has been generated using my mortchartgen scripts scripts. With the files in the gist available (and LifeTable installed), include("cervclt.jl") will calculate a life table for Swedish females in 2013, and then a new life table with difference between cervical mortality rates 1960 and 2013 added to the 2013 mortality rates. The Julia file contains the following code:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  using DataFrames, LifeTable cerv = readtable("cervc4290rate2.csv") fse13r = readtable("fse13.csv") fse13 = DataFrame(age = collect(0:105), pop = [fse13r[2][1:105]; sum(fse13r[2][106:111])], dths = [fse13r[3][1:105]; sum(fse13r[3][106:111])]) fse13lt = PeriodLifeTable(fse13, 2) cerv60 = transpose(convert(Array, cerv[10, 12:23])) cerv13 = transpose(convert(Array, cerv[63, 12:23])) cervdiff = cerv60 .- cerv13 cervdiff5 = map((x)->fill(x, 5), cervdiff) cervdiffyr = [fill(0, 25); vcat(cervdiff5...); fill(0, 21)] fse13mc60 = fse13lt[2] .+ cervdiffyr fse13c60 = DataFrame(age = fse13[1], mort = fse13mc60) fse13ltc60 = PeriodLifeTable(fse13c60, 2, true, "rate")

The files from Statistics Sweden (2016) contain ages up to 110 years; ages over 105 are summed together due to a very small population (line 5–6). Only cervical cancer mortality rates from 25–29 to 80–84 years are included (line 8–9): the contribution of other age groups to cervical cancer mortality is small, and inclusion of these would create some technical complications due to differing age formats. The array of differences in cervical cancer mortality is expanded to have the same length as the array of total mortality (line 11–12).

After the script has run, round(fse13lt[:e][1], 2) gives the life expectancy at birth in the life table calculated from the actual 2013 rates, rounded to two decimals. This should return 83.71, which is the same as in the life expectancy column in the precalculated table available from Statistics Sweden (2016). On the other hand, round(fse13ltc60[:e][1], 2) gives the life expectancy at birth given 2013 rates plus the difference in cervical cancer from 1960, and should return 83.55. So, if the decline in cervical cancer mortality since 1960, presumably due to Pap smears and other factors, had not occurred, life expectancy among Swedish females would have been about 1.9 months shorter.

The difference in mortality rates can be visualized (with PyPlot):

 1 2 3 4 5 6 7 8 9 10  using PyPlot plot(fse13lt[:age][26:85],log(fse13lt[:m][26:85]), label = "2013") plot(fse13lt[:age][26:85],log(fse13ltc60[:m][26:85]), label = "2013 with 1960 cervical rates") xlim(25,84) xlabel("age") ylabel("log(rate)") legend(loc=2) title("Total mortality, Swedish females 25\u201384")

The decline in cervical mortality makes significant differences to total mortality mainly at younger ages, where mortality already is low, which explains the modest effect on life expectancy.

It may be questioned whether trends in age-specific mortality rates from other causes would have stayed the same if trends in cervical cancer mortality had been different. The data is about so-called underlying causes, for which exactly one is reported for each death. The interpretation of this can be problematic if there are complex dependencies between different causes of death. This should be less of a problem for cervical cancer than for causes mainly occurring at older ages, where people often suffer from multiple diseases. However, there might be correlations between cervical cancer and risk factors for certain other causes of deaths, so that women who would have died from cervical cancer without e.g. Pap smears, have increased (or perhaps decreased) mortality rates from some other causes, compared to the general population.

## References

Danielsson, Maria, Torsten Berglund, Margareta Forsberg, Margareta Larsson, Christina Rogala och Tanja Tydén. 2012. ”Sexual and reproductive health: Health in sweden: The national public health report 2012. chapter 9”. Scandinavian Journal of Public Health 40: 176–196. doi:10.1177/1403494812459600.

Statistics Sweden. 2016. ”Life table by sex and age”. http://www.statistikdatabasen.scb.se/goto/en/ssd/LivslangdEttariga.

WHO. 2016. ”WHO Mortality Database”. http://www.who.int/healthinfo/mortality_data/en/index.html.

# Quick Baseball Series Simulations With Julia

Listening to game three of the National League Divisional Series
between the Washington Nationals and Los Angeles Dodgers today on
the way to the airport, I heard a startingly weighty statistic: The
team that wins game three of a five-game series wins 77% of series!
That's one pivotal game!

After mulling over that for a second, I thought to myself, "Hmm, I
wonder if that's significantly different from what you'd expect even
with each game being a coin flip. After all, it's not like the winner
of any game in a series is only 50% to win the whole best-of-five."
And with Gogo internet on the flight in typical fine form, I had
plenty of extra time to jump into a quick Julia REPL and find out.

We can simulate a 5-game series between teams 1 and 0 like:

julia> (round(Int, rand(1,5)))
1x5 Array{Int64,2}:
1  1  1  1  0


In real life they don't play games 4 and 5, but there's less effort
making the machine calculate those dead rubber games than real
major league playoff games.

Here's a function to see whether the winner of game three won
the series:

julia> function gameThreeWonSeries(series)
winner = sum(series) > 2 ? 1 : 0
series[3] == winner
end


So now let's just run that simulation 100,000 times and see what happens.

julia> wonGameThree = Array{Int}(10000)
100000-element Array{Int64,1}:
...

julia> for i = 1:100000
wonGameThree[i] = gameThreeWonSeries(round(Int, rand(1,5))) ? 1 : 0
end

julia> mean(wonGameThree)
0.68857


So even if every game were a simple coin flip, the team that won
game three would win about 69% of series. And of course, the games
aren't exactly coin flips — one of the teams is probably a
little bit better than the other, skewing that number even higher.

A nice thought experiment realized while flying without internet.
What I failed to figure out is how my laptop could chat with a
ground-based customer service rep to troubleshoot the broken
airborne Wi-Fi.