Author Archives: Tamás K. Papp

Local packages in a separate directory in Julia

By: Tamás K. Papp

Re-posted from: https://tamaspapp.eu/post/julia-local-test/

I run Pkg.update() fairly often to stay up to date and benefit from
the latest improvements of various packages. I rarely ever pin to a
specific package version, but I occasionally checkout master for
some packages, especially if I am contributing.

Despite updating regularly, I found that the documentation subtly diverged from what I was experiencing for some packages. After looking into the issue, I learned that I was 2–3 minor versions behind despite updating regularly. For example, when I would

Pkg.update("ForwardDiff")

I would be told that there is a new version, and to get it I should update ReverseDiff and Optim. But

Pkg.update("ForwardDiff", "ReverseDiff", "Optim")

would just run quietly without updating.

I could not figure out the cause for this and did not want to get sidetracked debugging it, so I decided to wipe the package directory and start over. However, in order to do this, I had to make sure that no code is lost, especially for local packages. First, I moved my local packages into a separate directory, and added that to LOAD_PATH in .juliarc.jl:

push!(LOAD_PATH, expanduser("~/src/julia-local-packages/"))

Then I ran multi-git-status to make sure that there were no unpushed changes. Finally, I deleted the package directory and reinstalled everything. Surprisingly, Pkg.add ran much faster than before.

In case I have to do this again, I decided to keep my local packages separate — the only drawback is that Pkg.test now can’t find them. A workaround is below, using some code from Base.Pkg:

"""
    local_test(pkgname, [coverage])

Find and test a package in `LOAD_PATH`. Useful when the package is outside
`Pkg.dir()`.

Assumes the usual directory structure: package has the same name as the module,
the main file is in `src/Pkgname.jl`, while tests are in `test/runtests.jl`.
"""
function local_test(pkgname; coverage::Bool=false)
    module_path = Base.find_in_path(pkgname, nothing)
    src_dir, module_file = splitdir(module_path)
    dir = normpath(src_dir, "..")
    test_path = joinpath(dir, "test", "runtests.jl")
    @assert isfile(test_path) "Could not find $(test_path)"
    Base.cd(dir) do
        try
            color = Base.have_color? "--color=yes" : "--color=no"
            codecov = coverage? ["--code-coverage=user"] : ["--code-coverage=none"]
            compilecache = "--compilecache=" * (Bool(Base.JLOptions().use_compilecache) ? "yes" : "no")
            julia_exe = Base.julia_cmd()
            run(`$julia_exe --check-bounds=yes $codecov $color $compilecache $test_path`)
            info("$module_file tests passed")
        catch err
            Base.Pkg.Entry.warnbanner(err, label="[ ERROR: $module_file ]")
        end
    end
end

Compared to simply include("wherever-it-is/runtests.jl"), this has the advantage of running a separate Julia process, so your workspace does not contaminate the test environment and in case of segfaults, the parent process won’t be affected.

Hopefully, the code above will be obsolete once Pkg3 is released, but until then it is a useful workaround.

edit: function above was corrupted during copy-paste, corrected.

Emacs 25.2 on Ubuntu

By: Tamás K. Papp

Re-posted from: https://tamaspapp.eu/post/emacs25-ubuntu/

Emacs is undoubtedly the most important program on my computers. On my
laptop, I use it to keep track of stuff with org-mode, read mail with mu4e, edit LaTeX with AUCTeX, and of course program. On servers, the first alias I define is usually qe='emacs -Q -nw', which give me a fast and responsive editor. With helm, doing just about anything (eg locating files, rgreping for something) is orders of magnitude faster and more convenient than any alternative I have tried.

I also try to keep up with the latest versions for software in general. Usually, whatever Ubuntu stable/Debian testing has is good enough not to justify the extra effort, but when I really need it, I grab the source and compile. That is usually only a minor hassle, but I try to restrict it to a few critical programs, otherwise it adds up. The major issue is not compiling, but having cruft in the filesystem (despite stow and checkinstall, it piles up). So I try to avoid it if I can.

Emacs 25.2 was released in April 2017, but there is no sign of an Ubuntu package for it yet. On various forums the PPA of kelleyk is recommended, but that does not have 25.2 for 17.04 (some files clash if you install previous versions).

Fortunately, Robert Bruce Park has now added Emacs 25.2 to the Ubuntu Emacs Lisp PPA, so having the latest of your favorite editor is only an add-apt-repository away. You may want to add a file to /etc/apt/preferences.d with contents

Package: *
Pin: release o=LP-PPA-ubuntu-elisp
Pin-Priority: 600

to make sure the right packages are installed.

Sampling variation in effective sample size estimates (MCMC)

By: Tamás K. Papp

Re-posted from: https://tamaspapp.eu/post/ess-sampling/

Introduction

MCMC samples, used in Bayesian statistics, are not independent — in fact, unless one uses specialized methods or modern HMC, posterior draws are usually at highly autocorrelated. For independent draws,

\[
\text{variance of simulation mean} \propto \frac1N
\]

where \(N\) is the sample size, but for correlated draws, one has to scale the sample size with a factor

\[
\tau = \frac{1}{1+2\sum_{k=1}^\infty \rho_k}
\]

where \(\rho_k\) is the lag-\(k\) autocorrelation. \(\tau N\) is the effective sample size.

Usually, \(\rho_k\) is estimated from the data using the variogram

\[
V_k = \frac{1}{N-k} \sum_{i=1}^{N-k} x_i x_{i+k}
\]

from which we obtain

\[
\rho_k = 1-\frac{V_k}{2\text{var}(x)}
\]

where an estimator for the variance is also used. Then, to avoid using noisy estimates, we only add up to the last \(K\) where

\[
\rho_{K} + \rho_{K+1} \ge 0
\]

I will call \(K\) the last lag. Stan does something slightly different, using FFT for autocorrelations, and cutting off at the first negative \(\rho_K\), but for HMC this does not make a whole lot of difference.

The sampling variation

I was coding up the above calculation, and needed some unit tests. Surprisignly, I could not find anything on the sampling variation of \(\tau\), so I wrote some simulations in Julia (source code for everything). I did the following simulation exercise:

  1. for a given autocorrelation coefficient \(\phi\), simulate \(N\) draws from the AR(1) process
    \(
    x_t = \phi x_{t-1} + \sigma \epsilon_t
    \qquad
    \epsilon_t \sim \text{Normal}(0,1), \text{IID}
    \)
  2. calculate \(\tau\) and \(K\),
  3. repeat 1000 times and plot the results.

I use \(N=1000\) and \(N=10000\), as these would be typical sample sizes, first for a fairly efficient algorithm, then for a more stubborn but still manageable posterior.

IID samples

Let \(\phi=0\), then we expect \(\tau=1\) (red line in histogram, coefficient of variation on top).

Results with \(\phi=0\) (IID), \(N=1000\). (a) \(\tau\), (b) last lag \(K\), (c) scatterplot.

Results with $$\phi=0$$ (IID), $$N=1000$$. (a) $$\tau$$, (b) last lag $$K$$, (c) scatterplot.

Results with \(\phi=0\) (IID), \(N=10000\). (a) \(\tau\), (b) last lag \(K\), (c) scatterplot.

Results with $$\phi=0$$ (IID), $$N=10000$$. (a) $$\tau$$, (b) last lag $$K$$, (c) scatterplot.

With \(1000\) samples, there is a lot of variation in ESS: 800 could show up very easily in practice. \(600\) is not improbable either. Using up to \(10\) lags is not uncommon. For \(10000\) samples, the precision is improved considerably, we commonly use \(2\) or \(4\) lags. For both sample sizes, notice the high correlation between the last lag \(K\), and \(\tau\): given the method above, using more lags increases \(\tau^{-1}\), so this is to be expected.

AR(1) samples with \(\rho=0.5\)

This is a more autocorrelated process, here theory tells us that \(\tau\)=1/3.

Results with \(\phi=0.5\), \(N=1000\). (a) \(\tau\), (b) last lag \(K\), (c) scatterplot.

Results with $$\phi=0.5$$, $$N=1000$$. (a) $$\tau$$, (b) last lag $$K$$, (c) scatterplot.

Results with \(\phi=0.5\), \(N=10000\). (a) \(\tau\), (b) last lag \(K\), (c) scatterplot.

Results with $$\phi=0.5$$, $$N=10000$$. (a) $$\tau$$, (b) last lag $$K$$, (c) scatterplot.

Notice that \(\tau\) is now more dispersed, compared to the IID case. Even with 10000 samples, the coefficient of variation is 6%, with 1000 it is around 1/6. In practice, expect effective sample sizes all over the place.

AR(1) samples with \(\rho=0.8\)

This is an even more autocorrelated process, here theory tells us that \(\tau\)=1/9.

Results with \(\phi=0.8\), \(N=1000\). (a) \(\tau\), (b) last lag \(K\), (c) scatterplot.

Results with $$\phi=0.8$$, $$N=1000$$. (a) $$\tau$$, (b) last lag $$K$$, (c) scatterplot.

Results with \(\phi=0.8\), \(N=10000\). (a) \(\tau\), (b) last lag \(K\), (c) scatterplot.

Results with $$\phi=0.8$$, $$N=10000$$. (a) $$\tau$$, (b) last lag $$K$$, (c) scatterplot.

There is now so much variation that in order to get an estimate for ESS that we can use for comparing various MCMC implementations, we need to run much more than \(1000\) samples.

Conclusion

  1. For unit testing ESS calculations, I will need to use 10000 samples, with \(\pm10\) or similar error bands.

  2. As a rule of thumb, I would ignore less than 1.5x variation in ESS for 1000 samples, or run longer chains: it may be just random noise.

Bibliography

  • Gelman, Andrew, et al. 2013. Bayesian data analysis. 3rd edition. Chapman & Hall/CRC.
  • Stan Development Team. 2016. Stan Modeling Language Users Guide and Reference Manual, Version 2.15.0. http://mc-stan.org