
Measuring the prevalence of documentation, testing and continuous integration in Julia packages

By: Mosè Giordano

Re-posted from: http://giordano.github.io/blog/2021-01-23-documentation-testing-julia/

Introduction

Documentation,
testing, and continuous
integration
(CI) are some
of the key ingredients to make software maintainable and reliable in the long
term. During the first part of my (short) career in academia, when my daily
software tools were C, Fortran and IDL, I barely knew these practices (probably
didn’t know continuous integration at all), and never applied them in practice.
Some time around 5 years ago I started my journey with the Julia programming
language, and I learned to value and use these principles.

I have the general feeling that Julia makes documentation and testing very
simple and lowers the entry barrier for newcomers. Today, the Julia ecosystem
offers many tools for this:

  • Documenter.jl: a package to
    generate HTML and PDF versions of the documentation
  • the Test standard library
    provides very basic tools for testing, but it’s also extremely simple to use:
    you don’t have a good reason for not testing your code
  • in addition to Test, there are third-party packages which offer more
    advanced testing frameworks

  • packages like PkgTemplates.jl
    and PkgSkeleton.jl let you
    easily generate a new package with minimal setup for documentation, testing,
    and continuous integration.

If you’re looking for tips about testing workflow in Julia, see these best
testing practices with Julia by Erik Engheim.
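To give a taste of how low the barrier is, here is a minimal test suite using only the Test standard library (the function `add` is a made-up example):

```julia
using Test

# A made-up function to test
add(x, y) = x + y

@testset "add" begin
    @test add(1, 2) == 3
    # use ≈ (isapprox) for floating-point comparisons
    @test add(0.1, 0.2) ≈ 0.3
end
```

Saving a file like this as `test/runtests.jl` is all the package manager needs to run the tests with `pkg> test`.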

Analysing Julia packages in the General registry

Is my feeling that Julia makes documentation and testing easy actually true?
Prompted by the recent analysis of the prevalence of continuous integration in
JOSS
made by my colleague
Jamie Quinn, I decided to look at how Julia packages in the General
registry
fare with regard to
documentation, testing and continuous integration. I quickly wrote a Julia
package, AnalyzeRegistry.jl,
for this task: it clones all packages in the registry (only the last commit of
the default branch, to save time and bandwidth) and looks for some specific
files to decide whether a package uses documentation, testing, and continuous
integration.

The usage of the package is described in its
README.md
(yes, I haven’t set up proper documentation, yet!). I ran the analysis with 8
threads; it took less than 30 minutes to analyse the entire General registry (I
was mainly limited by my Internet connection, so using fewer threads wouldn’t
have changed much):

julia> using AnalyzeRegistry

julia> @time report = analyze(find_packages());
1567.008404 seconds (7.41 M allocations: 857.636 MiB, 0.01% gc time, 0.00% compilation time)

analyze returns a vector of a data structure, Package, which describes a
package:

  • name
  • URL of the git repository
  • can the repository be cloned? Some repositories have been deleted or made
    private, so they cannot be accessed by the public anymore
  • does it have documentation? This is the hardest criterion: I looked for the
    file docs/make.jl, or doc/make.jl, which is used to generate the
    documentation with Documenter.jl, but many packages may do something else,
    see more in the comments below
  • does it have the test/runtests.jl file? This is what the Test standard
    library uses to launch the tests
  • does it use GitHub Actions?
  • does it use Travis CI?
  • does it use AppVeyor?
  • does it use Cirrus CI?
  • does it use Circle CI?
  • does it use Drone CI?
  • does it use Buildkite?
  • does it use Azure Pipelines?
  • does it use GitLab Pipeline?
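In Julia such a record is naturally a plain struct of booleans and strings. A hypothetical sketch of the data structure described above (the field names here are illustrative and may differ from the ones actually used by AnalyzeRegistry.jl):

```julia
# Hypothetical sketch of the `Package` data structure; field names are
# illustrative, not necessarily those of AnalyzeRegistry.jl.
struct Package
    name::String
    repo::String             # URL of the git repository
    reachable::Bool          # could the repository be cloned?
    docs::Bool               # has docs/make.jl or doc/make.jl?
    runtests::Bool           # has test/runtests.jl?
    github_actions::Bool
    travis::Bool
    appveyor::Bool
    cirrus::Bool
    circle::Bool
    drone::Bool
    buildkite::Bool
    azure_pipelines::Bool
    gitlab_pipeline::Bool
end
```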

Report of the analysis (2021-01-23)

Before summarising my findings, I saved the results of the analysis in a
JLD2 archive, so that anyone can look into them.

Now, some statistics:

  • total number of packages: 4312 (note: I excluded JLL
    packages, which are automatically
    generated and not very useful to measure whether humans follow best
    programming practices);
  • packages hosted on GitHub: 4296
  • packages hosted on GitLab: 16
  • packages that could be cloned: 4287 (99.4% of the total). The rest of the
    percentages will refer to this number, since I could look for documentation
    and testing only if I could actually clone the repository of a package;
  • packages using Documenter.jl to publish documentation: 1887 (44%)
  • packages with testing: 4139 (96.5%)
  • packages using any of the CI services below: 4105 (95.8%)
  • packages using GitHub Actions: 3240 (75.6%)
  • packages using Travis CI: 2512 (58.6%)
  • packages using AppVeyor: 783 (18.3%)
  • packages using Cirrus CI: 60 (1.4%)
  • packages using Circle CI: 13 (0.3%)
  • packages using Drone CI: 43 (1%)
  • packages using Buildkite: 29 (0.7%)
  • packages using Azure Pipelines: 7 (0.2%)
  • packages using GitLab Pipeline: 85 (1.9%)

I ran the analysis on 2021-01-23, on this revision of the General registry.
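Percentages like the above can be computed from the vector returned by analyze with `count` and `filter`. A sketch of the idea, with `report` mocked here by named tuples standing in for the Package records (field names are illustrative):

```julia
# Sketch: compute statistics over the analysis results.  `report` is mocked
# with named tuples; the real code would use the `Package` records.
report = [(reachable = true,  runtests = true),
          (reachable = true,  runtests = false),
          (reachable = false, runtests = false)]

# restrict to packages that could actually be cloned
cloneable = filter(p -> p.reachable, report)

# percentage of cloneable packages satisfying a predicate
pct(f) = round(100 * count(f, cloneable) / length(cloneable); digits = 1)

pct(p -> p.runtests)   # fraction of cloneable packages with tests
```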

Comments

  • The results are biased by the criteria I’ve chosen to determine whether a
    package “has documentation” or “has tests”, and should be taken cum grano
    salis: these criteria aren’t bullet-proof; for example, an empty
    test/runtests.jl would count as “has tests”, but see below;
  • remember that the General
    registry
    is not a curated list
    of Julia packages, and at the moment there is no requirement to have
    documentation, testing, or continuous integration to be accepted;
  • about 44% of packages are set up to use Documenter.jl to publish
    documentation. While this fraction doesn’t look particularly high, consider
    that many packages don’t keep the documentation within the repository but
    store it somewhere else (very notable examples are
    DifferentialEquations.jl
    and Plots.jl), or they are so
    simple that the entire documentation is written down in the README file (for
    example Tar.jl) or the wiki.

    • I looked at a random sample of 43 packages (1% of the total) “without
      documentation”; of these:

      • 4 didn’t have any documentation, docstrings, or examples of use
        whatsoever (but 2 of them still had meaningful tests!)
      • 1 didn’t have documentation or examples, only docstrings
      • 1 only had an examples directory, with poorly commented samples of code
      • all the others had documentation in the README, in the wiki, or anyway
        published on a different website.

    So, about 6 packages out of 43 (14%) had very poor or completely missing
    documentation for users. Assuming this was a representative sample, my
    conclusion is that about 8% of packages in the registry lack documentation,
    while the rest have some documentation to get users started, and a bit less
    than half of the total are set up to use Documenter.jl to publish
    documentation;

  • an overwhelming majority of packages, 96.5%, have tests!
    • I looked at a random sample of 43 packages “with tests”, to check whether
      they actually have tests: only two of them (4.7%) had a dummy
      test/runtests.jl file. If this sample is representative of all Julia
      packages in the General registry, we can say that about 92% of Julia
      packages do test their code. Interestingly enough, this is about the same
      fraction as that of packages with some documentation;
  • almost all packages with testing also set up continuous integration, even
    though a small fraction is probably running dummy tests. Compared to the
    about 80% of papers in JOSS found by Jamie not to have any obvious CI setup,
    the uptake of CI among Julia packages is remarkable (the audiences of
    authors of JOSS papers and of Julia packages have a large overlap, so I
    believe it makes sense to look at them together);
  • GitHub Actions is the most used CI service (75.6%), followed by Travis CI
    (58.6%). Note that measuring usage of GitHub Actions for CI is tricky
    because, unlike all the other services, there isn’t a single standard name
    for the configuration script, and GitHub Actions is also often used for
    tasks other than pure CI. For example, I didn’t consider files for
    CompatHelper or
    TagBot. It would have been
    interesting to look at these statistics before last November, when the new
    pricing model pushed
    many open-source users away from Travis CI.

My take-away message is that despite the General registry not being curated and
not enforcing specific styles or best practices, a large fraction of Julia
packages do embrace documentation, testing, and continuous integration, showing
that the ecosystem provides easy-to-use tools for these practices.



Random-based indexing for arrays

By: Mosè Giordano

Re-posted from: http://giordano.github.io/blog/2020-05-23-random-array-indexing/

Image credit: “xkcd: Donald Knuth” (CC-BY-NC 2.5)

Let me introduce a package I recently wrote:
RandomBasedArrays.jl, a
hassle-free package in the Julia programming language
for dealing with arrays. Every time you access an element of an array, the
first index is random, so this package relieves you from having to remember
whether Julia uses 0- or 1-based indexing: you simply cannot ever know what the
initial element will be. As an additional benefit, you can use any Int to
index a RandomBasedArray.

Motivation

This package takes a new stance in the long-standing debate over whether arrays
should have 0-based or 1-based indexing.

The main motivation for this package is that I’m sick of reading about this
debate. It is incredibly hard to convince people that there is no “one size
fits all” indexing in programming; both alternatives have their merits:

  • 0-based indexing is natural every time you deal with offsets, e.g. when
    referencing memory addresses, or when doing modular arithmetic
  • 1-based indexing is natural when you are counting elements: the 1st element is
    “1”, the 2nd element is “2”, etc…

It is pointless to claim the superiority of one indexing over the other one, as
they’re useful in different situations.

As a matter of fact, many “math-oriented” languages (e.g., Fortran, Julia,
Mathematica, MATLAB, R), that are less likely to fiddle with pointers’
addresses, default to 1-based indexing, even though probably the majority of the
programming languages nowadays uses 0-based indexing.

Many people claim superiority of 0-based indexing over 1-based indexing because
of a popular note by Edsger W. Dijkstra: “Why numbering should start at zero”.
However, I personally find this note unconvincing and partial, mainly based on
subjective arguments (like elegance and ugliness) rather than really compelling
reasons. That said, 0-based indexing is certainly useful in many situations.

A good programming language, whatever indexing convention it uses, should
provide an abstraction layer to let users forget what the initial index is.
For example, Fortran has lbound
to reference the first element of an array. Besides the first
function to reference the first element of a collection, the Julia programming
language has different utilities to iterate over collections:

  • arrays are iterables, this means that you can write a for loop like

    for element in my_array
        # do things with the `element`...
    end
    

    without using indices at all

  • the length
    function gives you the length of a collection, so that Dijkstra’s argument
    about the length of a sequence remains aesthetic rather than practical
  • eachindex is
    a function that returns an iterable object with all the indices of the array.
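Putting these utilities together, a short sketch:

```julia
A = [10, 20, 30]

# Arrays are iterable: loop over the values directly, no indices involved
for x in A
    println(x)
end

# `eachindex` yields the valid indices, whatever the first one is
for i in eachindex(A)
    A[i] += 1
end

first(A)    # first element, with no assumption on the initial index
length(A)   # number of elements
```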

Sometimes you need to use the indices of an array and know what the first one
is. In this case, the abstraction layer above is not useful. Thus, I think it
is important for a programming language also to provide a way to easily switch
to the most appropriate indexing for the task at hand. In Fortran you can set a
different initial index for an array with the
dimension
statement. Julia allows you to define custom indices for your new array-like
type, as described in the
documentation. The
most notable application of custom indices is probably the
OffsetArrays.jl package.
Other use cases of custom indices are shown in this blog post.
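To give an idea of the mechanism, here is a minimal, hypothetical zero-based vector type built on the custom-indices interface; OffsetArrays.jl does this properly and for arbitrary offsets, this is only a toy:

```julia
# Toy custom-index (0-based) vector; a minimal sketch of what
# OffsetArrays.jl implements in full generality.
struct ZeroBased{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::ZeroBased) = size(v.data)
Base.axes(v::ZeroBased) = (0:length(v.data) - 1,)   # indices start at 0
Base.IndexStyle(::Type{<:ZeroBased}) = IndexLinear()
Base.getindex(v::ZeroBased, i::Int) = v.data[i + 1]
Base.setindex!(v::ZeroBased, x, i::Int) = (v.data[i + 1] = x)

v = ZeroBased([10, 20, 30])
v[0]           # 10: the first element
firstindex(v)  # 0
```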

Usage

Let’s come to RandomBasedArrays.jl. You can install it with Julia’s built-in
package manager. In a Julia
session, after entering the package manager mode with ], run the command

pkg> add RandomBasedArrays

After that, just run using RandomBasedArrays in the REPL to load the package.
It defines a new AbstractArray type, RandomBasedArray, which is a thin
wrapper around any other AbstractArray.

julia> using RandomBasedArrays

julia> A = RandomBasedArray(reshape(collect(1:12), 3, 4))
3×4 Array{Int64,2}:
 10  2  6   6
  8  9  1  11
  7  8  8   7

Every time you access an element, you conveniently get a random element of the
parent array, making any doubt about the first index go away. This means that
you can use any Int as index, including negative numbers:

julia> A[-35]
6

julia> A[-35]
9

julia> A[-35]
4

You can also set elements of the array in the usual way, just remember that
you’ll set a random element of the parent array:

julia> A[28,-19] = 42
42

julia> A
5×5 Array{Int64,2}:
 13  16   3  25   9
 23  20  16  18   1
  5  17  21   6   8
  5   3  42  10  13
 25   6  23   4  11

julia> A
5×5 Array{Int64,2}:
  9  25  25   3   3
  4  14   9   7  18
 22  14  13  21   2
 11  12  19  14  19
 19   2  21   2  21

There are other Julia packages that play with different indexing of arrays, e.g.:

  • OffsetArrays.jl:
    Fortran-like arrays with arbitrary, zero or negative starting indices
  • TwoBasedIndexing.jl:
    Two-based indexing (note: this is currently out-of-date and doesn’t work in
    Julia 1.0)



A case for multiple dispatch: complex division

By: Mosè Giordano

Re-posted from: http://giordano.github.io/blog/2019-05-26-complex-division/

Multiple dispatch is the ability to dispatch a function based on the number and
the run time types of its arguments. The Julia programming language uses
multiple dispatch as its main
paradigm, and this has been a central design concept of the language from its
inception.

I already talked about multiple dispatch when I showed a simple implementation
of the rock-paper-scissors game.
Multiple dispatch is useful not only to develop games, but also to write
mathematically correct code.

A key idea of multiple dispatch is that methods don’t belong to a single class,
but are shared by all of their arguments. Quoting from the Julia
documentation:

does the addition operation in x + y belong to x any more than it does to
y? The implementation of a mathematical operator generally depends on the
types of all of its arguments. Even beyond mathematical operations, however,
multiple dispatch ends up being a powerful and convenient paradigm for
structuring and organizing programs.

An interesting example of how multiple dispatch allows one to more easily write
mathematically correct code is division involving complex numbers, in
particular the corner case of division by 0. But… is the 0 real or complex?
Indeed, the answer is different depending on the nature of the 0. When you
divide a finite complex number z by a (positive) real 0, you’re computing the
limit

    lim_{x→0⁺} z / x

and the direction in which you take the limit is clear: along the real axis.
Thus, for example, from the division (1 + 0i)/0 we should expect Inf + NaN·i
as a result, assuming that the limit is taken from the positive part of the
real axis. When the 0 is complex, instead, the direction of the limit in the
complex plane is not well-defined, and so the result of the division by a
complex zero should be undefined. This behaviour is also mandated by the
ISO/IEC 10967 standard (part 3,
section 5.2.5.5 “Fundamental complex floating point arithmetic”).

Comparison between languages

Let’s see how different programming languages behave with this operation.

Python

>>> import numpy as np
>>> np.complex128(1) / 0
__main__:1: RuntimeWarning: divide by zero encountered in cdouble_scalars
__main__:1: RuntimeWarning: invalid value encountered in cdouble_scalars
(inf+nanj) # This is correct
>>> np.float64(1) / np.complex128(0)
__main__:1: RuntimeWarning: divide by zero encountered in true_divide
__main__:1: RuntimeWarning: invalid value encountered in true_divide
(inf+nanj) # Should be (nan+nanj)
>>> np.complex128(1) / np.complex128(0)
(inf+nanj) # Should be (nan+nanj)
>>> np.float64(1) / np.complex128(1)
(1+0j)     # Should be (1-0j)

R

> complex(real = 1.0) / 0.0
[1] Inf+NaNi # This is correct
> 1.0 / complex(real = 0.0)
[1] Inf+NaNi # Should be NaN+NaNi
> complex(real = 1.0) / complex(real = 0.0)
[1] Inf+NaNi # Should be NaN+NaNi
> 1.0 / complex(real = 1.0)
[1] 1+0i     # Should be 1-0i

GNU Octave

octave:1> (1.0 + 0.0*i) / 0.0
warning: division by zero
ans =  Inf
octave:2> 1.0 / (0.0 + 0.0*i)
warning: division by zero
ans =  Inf
octave:3> (1.0 + 0.0*i) / (0.0 + 0.0*i)
warning: division by zero
ans =  Inf
octave:4> 1.0 / (1.0 + 0.0*i)
ans =  1

C

#include <complex.h>
#include <stdio.h>
int main(void)
{
  double complex c_zero = 0.0 + 0.0*I;
  double complex c_one  = 1.0 + 0.0*I;
  double complex w1 = c_one / 0.0;
  double complex w2 = 1.0 / c_zero;
  double complex w3 = c_one / c_zero;
  double complex w4 = 1.0 / c_one;
  printf("(%g,%g) # Should be (inf,nan)\n", creal(w1), cimag(w1));
  printf("(%g,%g) # Should be (nan,nan)\n", creal(w2), cimag(w2));
  printf("(%g,%g) # Should be (nan,nan)\n", creal(w3), cimag(w3));
  printf("(%g,%g)      # Should be (1,-0)\n", creal(w4), cimag(w4));
  return 0;
}

Compiled with GCC

$ gcc -std=c99 test.c&&./a.out
(inf,-nan) # Should be (inf,nan)
(inf,-nan) # Should be (nan,nan)
(inf,-nan) # Should be (nan,nan)
(1,0)      # Should be (1,-0)

Compiled with ICC

$ icc -std=c99 test.c&&./a.out
test.c(7): warning #39: division by zero
    double complex w1 = c_one / 0.0;
                                ^

(inf,-nan) # Should be (inf,nan)
(inf,-nan) # Should be (nan,nan)
(inf,-nan) # Should be (nan,nan)
(1,0)      # Should be (1,-0)

C++

#include <complex>
#include <iostream>

int main(void)
{
  std::complex<double> c_zero(0.0, 0.0);
  std::complex<double> c_one(1.0, 0.0);
  std::complex<double> w1 = c_one / 0.0;
  std::complex<double> w2 = 1.0 / c_zero;
  std::complex<double> w3 = c_one / c_zero;
  std::complex<double> w4 = 1.0 / c_one;
  std::cout << w1 << " # Should be (inf,nan)"   << std::endl;
  std::cout << w2 << " # Should be (nan,nan)"   << std::endl;
  std::cout << w3 << " # Should be (nan,nan)"   << std::endl;
  std::cout << w4 << "      # Should be (1,-0)" << std::endl;
  return 0;
}

Compiled with GCC

$ g++ test.cpp&&./a.out
(inf,-nan) # Should be (inf,nan)
(inf,-nan) # Should be (nan,nan)
(inf,-nan) # Should be (nan,nan)
(1,0)      # Should be (1,-0)

Compiled with ICC

$ icc test.cpp&&./a.out
(inf,-nan) # Should be (inf,nan)
(inf,-nan) # Should be (nan,nan)
(inf,-nan) # Should be (nan,nan)
(1,0)      # Should be (1,-0)

Fortran 90

program main
  implicit none
  complex :: c_zero = (0.0, 0.0)
  complex :: c_one  = (1.0, 0.0)
  double precision :: f_zero = 0.0
  double precision :: f_one  = 1.0
  write (*, '(F0.0,"+i",F0.0,A)') c_one / f_zero,  " # Should be Inf+iNaN"
  write (*, '(F0.0,"+i",F0.0,A)') f_one / c_zero,  " # Correct"
  write (*, '(F0.0,"+i",F0.0,A)') c_one / c_zero,  " # Correct"
  write (*, '(F0.0,"+i",F0.0,A)') f_one / c_one, "   # Should be 1.-i0."
endprogram main

Compiled with Gfortran

$ gfortran test.f90&&./a.out
NaN+iNaN # Should be Inf+iNaN
NaN+iNaN # Should be NaN+iNaN
NaN+iNaN # Should be NaN+iNaN
1.+i0.   # Should be 1.-i0.

Compiled with ifort

$ ifort test.f90&&./a.out
Inf+iNaN # Should be Inf+iNaN
Inf+iNaN # Should be NaN+iNaN
Inf+iNaN # Should be NaN+iNaN
1.+i0.   # Should be 1.-i0.

Julia

julia> complex(1) / 0
Inf + NaN*im # This is correct

julia> 1 / complex(0)
NaN + NaN*im # This is correct

julia> complex(1) / complex(0)
NaN + NaN*im # This is correct

julia> 1 / complex(1)
1.0 - 0.0im  # This is correct

Summary

The difference between Julia and the other languages is that the division is
implemented in a different way depending on the types of the two operands. By
using the @which
macro we can see which methods are called in the examples above:

julia> @which complex(1) / 0
/(z::Complex, x::Real) in Base at complex.jl:324

julia> @which 1 / complex(0)
/(a::R, z::S) where {R<:Real, S<:Complex} in Base at complex.jl:323

julia> @which complex(1) / complex(0)
/(a::Complex{T}, b::Complex{T}) where T<:Real in Base at complex.jl:327

Thus each combination of real and complex arguments calls a different method.
This directly matches the different mathematical operations that should be
carried out.
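A toy model of the same idea: one generic function with a method per combination of argument types. The names and formulas below are illustrative, not Base’s actual implementation:

```julia
# Toy re-implementation of the dispatch structure of complex division.
mydiv(z::Complex, x::Real)    = complex(real(z) / x, imag(z) / x)  # limit along the real axis
mydiv(x::Real,    z::Complex) = mydiv(complex(x), z)
mydiv(a::Complex, b::Complex) = a * conj(b) / abs2(b)              # textbook formula

mydiv(complex(1.0), 0.0)   # Inf + NaN*im, matching the example above
```

Each call picks the method matching the runtime types of both operands, so the complex-by-real case can implement the directional limit while the complex-by-complex case stays undefined at zero.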

The fact that in the other languages reviewed here we get the same result for
any combination of the types of the operands suggests that in those cases the
arguments are always silently converted to, or treated as if they were, all
complex, thus losing the information about their original nature.

Multiple dispatch is not magic. One could achieve similar
results in Python with a bunch of ifs that check the types of the arguments,
but multiple dispatch keeps the different implementations neatly separated in
different methods, instead of putting the code for the different operations in
a single catch-them-all function. In C++, too, one could define different
methods for different combinations of the types of the arguments, with the
difference that in that case the dispatch would be dynamic only on the first
argument (the class instance, the receiver of the method) and static on the
others, whereas with multiple dispatch it is based on the runtime types of all
the arguments.

Note: this post is largely inspired by the discussion in issue
#22983 of the Julia
repository.