Author Archives: Kristoffer Carlsson on Kristoffer Carlsson

SIMD and SIMD-intrinsics in Julia

By: Kristoffer Carlsson on Kristoffer Carlsson

Re-posted from: http://kristofferc.github.io/post/intrinsics/

Introduction

The good old days of processors getting a higher clock-speed every year have been over for quite some time now.
Instead, other features of CPUs are getting improved like the number of cores, the size of the cache, and the instruction set
they support.
In order to be responsible programmers, we should try our best to take advantage of the features the hardware
provides.
One way of doing this is multithreading which is exploiting the fact that modern CPUs have multiple cores and
can run multiple (possibly independent) instruction streams simultaneously.
Another feature to take advantage of is that a single core on modern processors can do operations on multiple values in one instruction (called SIMD – Single Instruction Multiple Data).

This article is intended to give a short summary of using SIMD in the Julia programming language.
It is intended for people already quite familiar with Julia.
The first part is likely familiar to people that have been using Julia for a while, the latter part, which is about explicitly calling
SIMD intrinsics might be new. Feel free to scroll down to the intrinsic bit or read the TLDR about it below.

TLDR (SIMD intrinsics)

To call an intrinsic like _mm_aesdec_si128:

  • call the intrinsic from C
  • use Clang with -emit-llvm to figure out the LLVM intrinsic (for example)
  • call the intrinsic from Julia with ccall("insert_llvm_intrinsic", llvmcall, return_type, input_types, args...)
    where an LLVM vector type like <2 x i64> is translated to the Julia type NTuple{2, VecElement{Int64}}

For example:

julia> const __m128i = NTuple{2, VecElement{Int64}};

julia> aesdec(a, roundkey) = ccall("llvm.x86.aesni.aesdec", llvmcall, __m128i, (__m128i, __m128i), a, roundkey);

julia> aesdec(__m128i((213132, 13131)), __m128i((31231, 43213)))
(VecElement{Int64}(-1627618977772868053), VecElement{Int64}(999044532936195731))

Automatic SIMD

The backend compiler for Julia is LLVM which can in some cases vectorize loops using the Loop Vectorizer
and it can even promote scalar code to SIMD operations using the SLP Vectorizer.

Automatic loop vectorization

Defining a simple loop that does an “axpy” like operation c .= a .* b

function axpy!(c::Array, a::Array, b::Array)
    @assert length(a) == length(b) == length(c)
    @inbounds for i in 1:length(a)
        c[i] = a[i] * b[i]
    end
end

and using the code introspection tool to inspect the generated LLVM IR, we see:

julia> V64 = Vector{Float64}
Array{Float64,1}

julia> code_llvm(axpy!, Tuple{V64, V64, V64})
...
  %56 = fmul <4 x double> %wide.load, %wide.load24
  %57 = fmul <4 x double> %wide.load21, %wide.load25
  %58 = fmul <4 x double> %wide.load22, %wide.load26
  %59 = fmul <4 x double> %wide.load23, %wide.load27
...

The type <4 x double> is in LLVM IR terminology a vector, and is here the resulting type
of adding two arguments of the same vector type.
As a note, LLVM has also decided to unroll the loop by a factor of four.
Moving on to look at the assembly, we see that indeed (at least on the computer of the author), this LLVM IR has turned into processor instructions that
does four multiplications in one instruction.

julia> code_native(axpy!, Tuple{V64, V64, V64})
 vmulpd  (%esi,%ecx,8), %ymm0, %ymm0
 vmulpd  32(%esi,%ecx,8), %ymm1, %ymm1
 vmulpd  64(%esi,%ecx,8), %ymm2, %ymm2
 vmulpd  96(%esi,%ecx,8), %ymm3, %ymm3

The instruction vmulpd does “packed” double-precision floating point addition and
the ymm registers fit 256 bits which can thus fit four 64-bit floats.

Any type of control flow inside the loop will likely mean the loop will not vectorize. That is why @inbounds is important here,
otherwise we have control flow to the part that throws the bounds error.

Note that for reductions using non-associative arithmetic (like floating point airthmetic) you will have to tell the compiler that it is ok to reorder
the accumulations into the reduction variable using the @simd macro.

Automatic scalar vectorization

LLVM can also auto-vectorize scalar operations that follow a certain pattern. Here
is a function that does not contain any loop:

function mul_tuples(a::NTuple{4,Float64}, b::NTuple{4,Float64})
    return (a[1]*b[1], a[2]*b[2], a[3]*b[3], a[4]*b[4])
end

The function mul_tuples just multiplies numbers from two tuples of length four and forms a new tuple.
The pattern here should be obvious, it is clear that the four additions
could be done at the same time.
LLVM can identify such patterns and generate code that uses SIMD. Again,
inspecting the code we find that SIMD instructions are used:

LLVM IR:

julia> code_llvm(mul_tuples, Tuple{NTuple{4,Float64}, NTuple{4, Float64}})
...
  %7 = fmul <4 x double> %4, %6
...

Native code:

julia> code_native(mul_tuples, Tuple{NTuple{4,Float64}, NTuple{4, Float64}})
...
  vmulpd  (%edx), %ymm0, %ymm0
...

The scalar auto-vectorizer is quite impressive. It manages for example to very nicely vectorize a 4×4 matrix multiplication
in the StaticArrays.jl package which you can see doing something like

julia> using StaticArrays # import Pkg, Pkg.add("StaticArrays") to install

julia> @code_native rand(SMatrix{4,4}) * rand(SMatrix{4,4})

which almost only uses SIMD-instructions without StaticArrays.jl having to do any work for it.

SIMD using a vector library

While the auto-vectorizer can sometimes work pretty well, it quite easily gets confused. Alternatively, the data
is not laid out in such a way that it is allowed or beneficial to vectorize the code.
For example, trying a matrix multiplication of size
3×3 instead of 4×4 matrices in StaticArrays.jl and things are not so pretty anymore:

@code_native rand(SMatrix{3,3}) * rand(SMatrix{3,3})
    vmovsd  (%rdx), %xmm0           ## xmm0 = mem[0],zero
    vmovsd  8(%rdx), %xmm7          ## xmm7 = mem[0],zero
    vmovsd  16(%rdx), %xmm6         ## xmm6 = mem[0],zero
    vmovsd  16(%rsi), %xmm11        ## xmm11 = mem[0],zero
    vmovsd  40(%rsi), %xmm12        ## xmm12 = mem[0],zero
    vmovsd  64(%rsi), %xmm9         ## xmm9 = mem[0],zero
    vmovsd  24(%rdx), %xmm3         ## xmm3 = mem[0],zero
    vmovupd (%rsi), %xmm4
    vmovupd 8(%rsi), %xmm10
    vmovhpd (%rsi), %xmm11, %xmm5   ## xmm5 = xmm11[0],mem[0]
    vinsertf128     $1, %xmm5, %ymm4, %ymm5
    vunpcklpd       %xmm3, %xmm0, %xmm1 ## xmm1 = xmm0[0],xmm3[0]
    vmovddup        %xmm0, %xmm0    ## xmm0 = xmm0[0,0]
    vinsertf128     $1, %xmm1, %ymm0, %ymm0
...

In the code above there is a lot of activity in the xmm registers (which are smaller than ymm).
indeed, if we benchmark the 3×3 matrix multiply we find that it is in fact slower than the 4×4 version (note that in the
benchmark below the matrix is wrapped in a Ref here to prevent the compiler from constant folding the benchmark loop):

julia> using BenchmarkTools # import Pkg; Pkg.add("BenchmarkTools") to install

julia> for n in (2,3,4)
           s = Ref(rand(SMatrix{n,n}))
           @btime $(s)[] * $(s)[]
       end
  2.711 ns (0 allocations: 0 bytes)
  10.273 ns (0 allocations: 0 bytes)
  6.059 ns (0 allocations: 0 bytes)

In these cases, we can explicitly vectorize the code using the SIMD vector library SIMD.jl.
SIMD.jl provides a type Vec{N, T} where N is the number of elements and T is the element type.
Vec{N, T} is similar to the LLVM <N x T> vector type and operations on Vec typically translate directly to LLVM operations:

For example, below we define some input data and a function g that do some simple arithmetic. We then look at the generated code.

julia> using SIMD

julia> a = Vec((1,2,3,4))
<4 x Int64>[1, 2, 3, 4]

julia> b = Vec((1,2,3,4))
<4 x Int64>[1, 2, 3, 4]

julia> g(a, b, c) = a * b + c;

julia> @code_llvm g(a, b, 3)
...
  %res.i = mul <4 x i64> %7, %6
  %8 = insertelement <4 x i64> undef, i64 %3, i32 0
  %9 = shufflevector <4 x i64> %8, <4 x i64> undef, <4 x i32> zeroinitializer
  %res.i1 = add <4 x i64> %res.i, %9
...

The mul <4 x i64> is the multiplication of the two vectors, and then the scalar 3 is “broadcasted” to a vector
and added to the result. Feel free to look at @code_native to see the native SIMD instructions.
We cand use SIMD.jl to write a faster 3×3 matrix multiplication:

function matmul3x3(a::SMatrix, b::SMatrix)
    D1 = a.data; D2 = b.data
    # Extract data from matrix into SIMD.jl Vec
    SV11 = Vec((D1[1], D1[2], D1[3]))
    SV12 = Vec((D1[4], D1[5], D1[6]))
    SV13 = Vec((D1[7], D1[8], D1[9]))

    # Form the columns of the resulting matrix
    r1 = muladd(SV13, D2[3], muladd(SV12, D2[2], SV11 * D2[1]))
    r2 = muladd(SV13, D2[6], muladd(SV12, D2[5], SV11 * D2[4]))
    r3 = muladd(SV13, D2[9], muladd(SV12, D2[8], SV11 * D2[7]))

    return SMatrix{3,3}((r1[1], r1[2], r1[3],
                         r2[1], r2[2], r2[3],
                         r3[1], r3[2], r3[3]))
end

Let’s compare this new version to the default the 3×3 version:

julia> s = Ref(rand(SMatrix{n,n}))

julia> @btime $(s)[] * $(s)[];
  10.281 ns (0 allocations: 0 bytes)

julia> @btime matmul3x3($(s)[], $(s)[]);
  4.392 ns (0 allocations: 0 bytes)

A guite significant improvement! The code for matmul3x3 could, of course, be generalized to work for more sizes, perhaps using a @generated function.

Using intrinsics

All we have done so far has been architecture independent.
If the CPU only supports an old version of SIMD or perhaps doesn’t support SIMD at all, LLVM will just compile the code
using the latest features that are available, falling back to scalar instructions if needed.
However, in some cases, we really do want to use a specific instruction in a certain instruction set supported by the CPU.
The idea for writing this blog post was from reading about
a new hashing library called “meowhash” was released.
It uses AES decryption which processors now has built-in instructions to perform.
Looking in the source code
we can see the macro:

#define Meow128_AESDEC(Prior, Xor) _mm_aesdec_si128((Prior), (Xor))

In the Intel Intrinsics Guide
this intrinsic is described as

Perform one round of an AES decryption flow on data (state) in a using the round key in RoundKey, and store the result in dst.

If we wanted to port meow-hash to Julia we would need to call this intrinsic in Julia, so how should we do that?

Firstly, Julia allows calling LLVM intrinsics through ccall.
We can for example call the pow intrinsic for two Float64 as:

julia> llvm_pow(a, b) = ccall("llvm.pow.f64", llvmcall, Float64, (Float64, Float64), a, b);

julia> llvm_pow(2.0, 3.0)
8.0

So, in order to call our AES decryption instruction, we need to know the corresponding LLVM intrinsic to _mm_aesdec_si128. Since Julia itself doesn’t provide us
with a way to get the intrinsic,
we need to ask a compiler that does. Fortunately, we can just ask Clang to emit the corresponding LLVM for us.
Using the Godbolt compiler webtool makes this very easy.
In the link to Godbolt we can see the following (slightly cleaned up):

define <2 x i64> @_Z6aesdecDv2_xS_(<2 x i64>, <2 x i64>) local_unnamed_addr #0 !dbg !262 {
  %3 = call <2 x i64> @llvm.x86.aesni.aesdec(<2 x i64> %0, <2 x i64> %1) #3, !dbg !270
  ret <2 x i64> %3, !dbg !271
}

So the intrinsic is called x86.aesni.aesdec. Now, we just need to know how to pass in the argument types which are
<2 x i64>. A normal Julia tuple of integers will not do because it gets passed to LLVM as an array
and not a vector
Instead, we need to send in a tuple with special elements of the type VecElement.
Julia treats a tuple of VecElements special and will pass it to LLVM as a vector.

All that is now left is to define some convenience typealias, create our inputs and call the intrinsic:

julia> const __m128i = NTuple{2, VecElement{Int64}};

julia> aesdec(a, roundkey) = ccall("llvm.x86.aesni.aesdec", llvmcall, __m128i, (__m128i, __m128i), a, roundkey);

julia> aesdec(__m128i((213132, 13131)), __m128i((31231, 43213)))
(VecElement{Int64}(-1627618977772868053), VecElement{Int64}(999044532936195731))

So now, we are in a position to port Meow Hash to Julia!

It should be stated that intrinsics should only be used as a last resort. It will lead to your code being less portable
and harder to maintain.

Conclusion

There are many ways of doing SIMD in Julia. From letting the compiler to do the the job to using a SIMD library, and finally
getting our hands dirty and use the intrinsics. Which way is best will depend on your application but hopefully, this helped a bit
with showing what options are available.

Julia has an issue…

By: Kristoffer Carlsson on Kristoffer Carlsson

Re-posted from: http://kristofferc.github.io/post/julia_issue/

As most of us in the Julia community know, Julia has an issue.
In fact, looking at the Julia repository we see that
there are (at time of writing) 11865 issues, where 1872 are open and 9993 are closed.
An interesting question to ask is:

  • How has the ratio between open and closed issues varied over the development of Julia? And how about for pull requests (PRs)?

In this post, the aim is to answer the question above using the data that can be scraped from the GitHub repo.

Getting the data

A large amount of data for a repository hosted on GitHub can be found via the GitHub API.
The GitHub.jl package provides a convenient
interface to communicate with this API from Julia.

Getting the data for the issues (and PRs) is as simple as:

import GitHub
myauth = GitHub.authenticate(ENV["GITHUB_AUTH"])    
repo = GitHub.repo("JuliaLang/julia")
myparams = Dict("state" => "all", "per_page" => 100, "page" => 1);
issues, page_data = GitHub.issues(repo; params = myparams, auth = myauth)

where I have created a “GitHub access token” and saved it in the environment variable GITHUB_AUTH. Note here that the function name GitHub.issues is a bit of a misnomer.
What is returned is actually both issues and PRs.
The variable issues now contains a Vector of all the issues and PRs made to the Julia repo.
Each element in the vector is an Issue, which is a struct containing fields
corresponding to the keys of the returned JSON from the GitHub REST API.
In order to not have to rescrape the data every time Julia is started, it would be nice to
store it to disk. The standard way of storing Julia data is by using the *JLD.jl package.
Unfortunately, JLD.jl has some problems handling Nullable’s (see this issue).
However, there is an unregistered package called JLD2.jl
that does support Nullable’s. The code below uses JLD2 to save the issues to a .jld file
and then reload them again:

using JLD2

function save_issues(issues)
    f = jldopen("issues.jld", "w")
    write(f, "issues", issues)
    close(f)
end

function load_issues()
    f = jldopen("issues.jld", "r")
    issues = read(f, "issues")
    close(f)
    return issues
end

I put up the resulting .jld file here if you don’t feel like doing the scraping yourself.

Digression: a DateVector

In this post we will deal with Vector’s that are naturally indexed by
Date’s instead of standard integers starting at 1. Therefore, using the powerful
AbstractVector interface
we easily create a Vector that can be indexed with single Date’s ranges consisting of Date’s etc.

The implementation below should be fairly self explanatory. We create the DateVector struct
wrapping a Vector and two Date’s (the start and end date) and define a minimum amount of methods needed.
Vector’s with non conventional indices are still quite new in Julia so not everything work
perfectly with them. Here, we just implement the functionality needed
to set and retrieve data using indexing with Dates.

struct DateVector{T} <: AbstractVector{T}
    v::Vector{T}
    startdate::Date
    enddate::Date

    function DateVector(v::Vector{T}, startdate::Date, enddate::Date) where T
        len = (enddate - startdate) ÷ Dates.Day(1) + 1
        if length(v) != len
            throw(ArgumentError("length of vector v $(length(v)) not equal to date range $len"))
        end
        new{T}(v, startdate, enddate)
    end
end

Base.endof(dv::DateVector) = dv.enddate
Base.indices(dv::DateVector) = (dv.startdate:dv.enddate,)
Base.getindex(dv::DateVector, date::Date) = dv.v[(date - dv.startdate) ÷ Dates.Day(1) + 1]
Base.setindex!(dv::DateVector, v, date::Date) = dv.v[(date - dv.startdate) ÷ Dates.Day(1) + 1] = v
Base.checkindex(::Type{Bool}, d::Range{Date}, v::Range{Date}) = length(v) == 0 || (first(v) in d && last(v) in d)
Base.Array(dv::DateVector) = dv.v

The DateVector can now be seen in action by for example indexing into it
with a Range of Date’s:

julia> v = rand(10);

julia> dv = DateVector(v, Date("2015-01-01"), Date("2015-01-10"));

julia> dv[Date("2015-01-02"):Date("2015-01-05")]
4-element Array{Float64,1}:
 0.299136
 0.898991
 0.0626245
 0.585839

julia> v[2:5]
4-element Array{Float64,1}:
 0.299136
 0.898991
 0.0626245
 0.585839

Total number of opened and closed issues / PRs over time

For a given issue we can check the time it was created, what state it is in (open / closed),
what time it was eventually closed and if it is, in fact, a pull request and not an issue:

julia> get(issues[1].created_at)
2017-05-23T18:09:45

julia> get(issues[1].state)
"open"

julia> isnull(issues[1].closed_at)
true

julia> isnull(issues[1].pull_request) # pull_request is null so this is an issue
true

It is now quite easy to write a function that takes an issue and returns two Date-intervals, the first
for when the issue was opened, and the second for when it was closed up until today.
If the issue is still open, we make sure to return an empty interval for the closed period.

function open_closed_range(issue)
    opened = Date(get(issue.created_at))
    if get(issue.state) == "closed"
        closed = Date(get(issue.closed_at))
    else
        closed = Date(now())
    end
    return opened:closed, (closed + Dates.Day(1)):Date(now())
end

As an example, we can test this function on an issue:

julia> open_closed_range(issues[200])
(2017-05-13:1 day:2017-05-14, 2017-05-15:1 day:2017-05-29)

So, here we see that the issue was opened between 2017-05-13 and 2017-05-14 (and then got closed).
Now, we can simply create two DateVector’s, one that will contain the total number of opened
issues and the second the total number of closed issues / PRs for a given date

function count_closed_opened(PRs::Bool, issues)
    min_date = Date(get(minimum(issue.created_at for issue in issues)))
    days_since_min_date = (Date(now()) - min_date) ÷ Dates.Day(1) + 1
    closed_counter = DateVector(zeros(Int, days_since_min_date), min_date, Date(now()))
    open_counter = DateVector(zeros(Int, days_since_min_date), min_date, Date(now()))

    for issue in filter(x -> !isnull(x.pull_request) == PRs, issues)
        open_range, closed_range = open_closed_range(issue)
        closed_counter[closed_range] += 1
        open_counter[open_range] += 1
    end

    return open_counter, closed_counter
end

For a given date, the number of opened and closed issues are now readily available:

julia> opened, closed = count_closed_opened(false, issues);

julia> opened[Date("2016-01-1")]
288

julia> closed[Date("2016-01-1")]
5713

We could plot these now using one of our favorite plotting packages. I am personally a LaTeX fan and one of the popular
plotting packages for LaTeX is pgfplots. PGFPlotsX.jl is a Julia package that makes
it quite easy to interface with pgfplots so this is what I used here. The total number of open and
closed issues for different dates is shown below.

We can see that in the early development of Julia, PRs was not really used.
Also, the number of closed issues and PRs seem to grow at approximately the same rate.
A noticeable difference is in the number of opened issues and PRs. Open PRs accumulate
significantly slower than the number of open issues. This is likely because an open PR
become stale quite quickly while issues can take a long time to fix, or does not really have
a clear actionable purpose.

Let’s plot the ratio between open and closed issues / PRs:

To reduce some of the noise, I started the plot at 2013-06-01. The ratio between open to closed issues seems to slowly increase while for PRs, the number have stabilized at around 0.05.

Conclusions

Using the GitHub API it is quite easy to do some data anlysis on a repo.
It is hard to say how much actual usefulness can be extracted from the data here
but sometimes it is fun to just hack on data.
Possible future work could be to do the same analysis here but for other programming language repos.
Right now, looking at e.g. the Rust repo they have an open to closed PR ratio of 63 / 21200 ≈ 0.003 which is 20 times lower than Julia. Does this mean that the Julia community need to be better at reviewing PRs to make sure they eventually get merged / closed? Or is the barrier to open a PR to Rust higher so that only PRs with high success of merging gets opened?

What identifier is the most common in Julia? The answer might surprise you!

By: Kristoffer Carlsson on Kristoffer Carlsson

Re-posted from: http://kristofferc.github.io/post/tokenize/

Tokenize is a Julia package to perform lexical analysis of Julia source code.
Lexing is the process of transforming raw source code (represented as normal text) into a sequence of tokens which is
a string with an associated meaning. “Meaning” could here be if the string represent an operator, a keyword, a comment etc.

The example below shows lexing (or tokenization) of some simple code.

julia> using Tokenize

julia> collect(tokenize("""
       100
       "this is a string"
       'a'
       type
       foo
       *
       """))
13-element Array{Tokenize.Tokens.Token,1}:
 1,1-1,3          INTEGER        "100"
 1,4-2,0          WHITESPACE     "\n"
 2,1-2,18         STRING         "\"this is a string\""
 2,19-3,0         WHITESPACE     "\n"
 3,1-3,3          CHAR           "'a'"
 3,4-4,0          WHITESPACE     "\n"
 4,1-4,4          KEYWORD        "type"
 4,5-5,0          WHITESPACE     "\n"
 5,1-5,3          IDENTIFIER     "foo"
 5,4-6,0          WHITESPACE     "\n"
 6,1-6,1          OP             "*"
 6,2-7,0          WHITESPACE     "\n"
 7,1-7,0          ENDMARKER      ""

The displayed array containing the tokens has three columns. The first column shows the location where the string of the token starts and ends,
which is represented as the line number (row) and at how many characters into the line (columns) the token starts / ends.
The second column shows the type (kind) of token and, finally, the right column shows the string the token contains.

One of the different token kinds is the identifier. These are names that refer to different entities in the code.
This includes variables, types, functions etc. The name of the identifiers are chosen by the programmer,
in contrast to keywords which are chosen by the developers of the language.
Some questions I thought interesting are:

  • What is the most common identifier in the Julia Base code (the code making up the standard library). Has it changed from 0.5 to 0.6?
  • How about packages? Is the source code there significantly different from the code in Julia Base in terms of the identifiers used?

The plan is to use Tokenize to lex both Julia Base and a bunch of packages, count the number of occurrences of
each identifier and then summarize this as a top 10 list.

A Julia source code identifier counter

First, let’s create a simple counter type to keep track of how many times each identifier occur.
This is a just a wrapper around a dictionary with a default value of 0 and a
count! method that increments the counter for the supplied key:

immutable Counter{T}
    d::Dict{T, Int}
end
Counter{T}(::Type{T})= Counter(Dict{T, Int}())

Base.getindex{T}(c::Counter{T}, v::T) = get(c.d, v, 0)
getdictionary(c::Counter) = c.d
count!{T}(c::Counter{T}, v::T) = haskey(c.d, v) ? c.d[v] += 1 : c.d[v] = 1

A short example of the Counter type in action is showed below.

julia> c = Counter(String)
Counter{String}(Dict{String,Int64}())

julia> c["foo"]
0

julia> count!(c, "foo"); count!(c, "foo");

julia> c["foo"]
2

Now, we need a function that tokenizes a file and counts the number of identifiers in it.
The code for such a function is shown below and a short explanation follows:

function count_tokentypes!(counter, filepath, tokentype)
    f = open(filepath, "r")
    for token in tokenize(f)
        if Tokens.kind(token) == tokentype
            count!(counter, untokenize(token))
        end
    end
    return counter
end

This opens the file at the path filepath, loops over the tokens, and if the kind of token is the tokentype
the counter is incremented with the string of the token (extracted with untokenize) as the key.
In Tokenize each type of token is represented by an enum, and the one corresponding to identifiers is named
Tokens.IDENTIFIER.

As an example, we could run the function on a short file in base (nofloat_hashing.jl):

julia> BASEDIR =  joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "base")

julia> filepath = joinpath(BASEDIR, "nofloat_hashing.jl");

julia> c = Counter(String);

julia> count_tokentypes!(c, filepath, Tokens.IDENTIFIER)
Counter{String}(Dict("b"=>2,"x"=>8,"a"=>2,"h"=>8,"UInt32"=>1,"UInt16"=>1,"hx"=>3,"abs"=>1,"Int8"=>1,"Int16"=>1…))

julia> c["h"]
8

We see here that there are 8 occurrences of the identifier h in the file.

The next step is to apply the count_tokentypes function to all the files in the base directory.
To that end, we create the applytofolder function:

function applytofolder(path, f)
    for (root, dirs, files) in walkdir(path)
        for file in files
            f(joinpath(root, file))
        end
    end
end

It takes a path to a folder and applies the function f on each file in that path.
The walkdir function works recursively so each file will be visited this way.

Finally, we create a Counter and call the previously created count_tokentypes on all files
that end with ".jl" using the applytofolder function:

julia> BASEDIR = joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "base")

julia> c = Counter(String)

julia> applytofolder(BASEDIR,
                     function(file)
                         if endswith(file, ".jl")
                             count_tokentypes!(c, file, Tokens.IDENTIFIER)
                         end
                     end)

The counter c now contains the count of all identifiers in the base folder:

julia> c["_uv_hook_close"]
12

julia> c["x"]
7643

julia> c["str"]
230

Analysis

We are interested in the most common identifiers so we create a function that
extracts the n most common identifiers as two vectors.
One with the identifiers and one with the counts:

function getntop(c::Counter, n)
    vec = Tuple{String, Int}[]
    for (k, v) in getdictionary(c)
        push!(vec, (k, v))
    end
    sort!(vec, by = x -> x[2], rev = true)
    vec_trunc = vec[1:n-1]
    identifiers = [v[1] for v in vec_trunc]
    counts      = [v[2] for v in vec_trunc]
    return identifiers, counts
end

To visualize this we use the excellent plotting package
UnicodePlots:

julia> using UnicodePlots

julia> identifiers, counts = getntop(c, 10)

julia> barplot(identifiers, counts, title = "Base identifiers")
                   Base identifiers
       ┌────────────────────────────────────────┐
     x │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 7643 │
     T │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 7202   │
     A │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 7001    │
     i │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 5119            │
   Ptr │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 4239                │
     s │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 4128                 │
     n │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 3650                   │
     B │▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 3143                     │
    io │▪▪▪▪▪▪▪▪▪▪▪▪ 2714                       │
       └────────────────────────────────────────┘

So there we have it, x is the winner, closely followed by T and A.
This is perhaps not very surprising; x is a very common variable name,
T is used a lot in parametric functions and A is used a lot in the
Linear Algebra code base which is quite large.

Difference vs. 0.6

The plot below shows the same experiment repeated on the 0.6 code base:

               Base identifiers 0.6
       ┌────────────────────────────────────────┐ 
     x │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 7718 │ 
     A │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 7313   │ 
     T │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 6932    │ 
     i │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 5242            │ 
   Ptr │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 4147                 │ 
     s │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 4093                 │ 
     n │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 3650                   │ 
     B │▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 3174                     │ 
    io │▪▪▪▪▪▪▪▪▪▪▪▪▪ 2933                      │ 
       └────────────────────────────────────────┘ 

Most of the counts are relatively similar between 0.5 and 0.6 with the exception that A overtook T for the second place.
In fact, the number of T identifers have decreased with almost 300 counts!
What could have caused this?
The answer is a new syntactic sugar feature available in Julia 0.6 which was implemented by Steven G. Johnson in PR #20414 .
This allowed a parametric function with the syntax

foo{T <: Real}(Point{T}) = ...

to instead be written more tersely as

foo(Point{<:Real})...

In PR #20446 Pablo Zubieta went through the Julia code base and updated
many of the function signatures to use this new syntax.
Since T is a very common name to use for the parameter, the counts of T significantly decreased.
And this is how A managed to win over T in 0.6 in the prestigeful “most common identifier”-competition.

Julia packages.

We now perform the same experiment but on the Julia package directory. For me, this includes around 130 packages:

julia> length(readdir(Pkg.dir()))
137

The results are:

                   Package identifiers
        ┌────────────────────────────────────────┐ 
      T │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 15425 │ 
      x │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 15062  │ 
   test │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 13624     │ 
      i │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 9989              │ 
      d │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 9562               │ 
      A │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 8280                 │ 
    RGB │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 8041                  │ 
      a │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 7144                    │ 
      n │▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 6470                     │ 
        └────────────────────────────────────────┘ 

When we counted the Julia base folder we excluded all the files used for unit testing.
For packages, these files are included and clearly test, used in the @test macro, is
unsurprisingly very common. T, x and i are common in packages and Base but for some reason
the variable d is more common in packages than in Base.

Conclusion

Doing these type of investigations has perhaps little practical use but it is, at least to me, a lot of fun.
Feel free to tweak the code to find the most common string literal (Tokens.STRING) or perhaps most common integer (`Tokens.INTEGER)
or anything else you can come up with.

Below is a wordcloud I made with the top 50 identifiers in Julia Base.