By: Iain Dunning

Re-posted from: http://iaindunning.com/2014/metadatatools.html

This is my first attempt at turning an IJulia Notebook into a blog post. I gave a lightning talk at the Cambridge Area Julia Users Group (CAJUN) on Sept. 4th 2014 about some fun things you can do with MetadataTools.jl and used the following notebook as my "slides". Here is that notebook, converted to Markdown for my site. It works fairly well, although I've had to do a little bit of manual editing to make it look correct.

## MetadataTools.jl Demo

```
using MetadataTools
```

### Getting information about packages

`MetadataTools` defines a `PkgMeta` type that represents a package's METADATA entry, and contains a `PkgMetaVersion` for each tagged version.

```
pkgs = get_all_pkg() # Returns a Dict{String,PkgMeta}
pkgs["DataArrays"]
```

Output:

```
DataArrays git://github.com/JuliaStats/DataArrays.jl.git
0.0.0,a6ce00,julia 0.2-,StatsBase 0.2.5 0.3-,SortingAlgorithms
0.0.1,0001ff,julia 0.2-,StatsBase 0.2.5 0.3-,SortingAlgorithms
0.0.2,7a61d2,julia 0.2-,StatsBase 0.2.5 0.3-,SortingAlgorithms
0.0.3,613ca1,julia 0.2- 0.3-,StatsBase 0.3.8-
0.1.0,ae7d82,julia 0.3-,StatsBase 0.3-,SortingAlgorithms
0.1.1,3fe861,julia 0.3-,StatsBase 0.3
0.1.2,d0a0b3,julia 0.3-,StatsBase 0.3
0.1.3,d9ad97,julia 0.3-,StatsBase 0.3
0.1.4,4742f2,julia 0.3.0-prerelease+1942,StatsBase 0.3
0.1.5,833e53,julia 0.3.0-,StatsBase 0.3
0.1.6,4be6c8,julia 0.3.0-,StatsBase 0.3
0.1.7,fc8a8a,julia 0.3.0-,StatsBase 0.3
0.1.8,511e2c,julia 0.3.0-,StatsBase 0.3
0.1.9,9c281b,julia 0.3.0-,StatsBase 0.3
0.1.10,440fb0,julia 0.3.0-,StatsBase 0.3
0.1.11,623147,julia 0.3.0-,StatsBase 0.3
0.1.12,e0e4a7,julia 0.3.0-,StatsBase 0.3
0.2.0,d78a6d,julia 0.3.0-,StatsBase 0.3
```

We can check the maximum supported Julia version using `get_upper_limit`, which is useful for checking if a package is deprecated.

Input:

```
get_upper_limit(get_pkg("Monads"))
```

Output:

```
v"0.3.0"
```

Input:

```
get_upper_limit(get_pkg("DataFrames"))
```

Output:

```
v"0.0.0"
```
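The requirement strings in the `DataArrays` listing above follow the METADATA convention of alternating lower and upper bounds, so `julia 0.2- 0.3-` means "at least 0.2-, below 0.3-" while `julia 0.3.0-` has no upper bound. The `v"0.0.0"` returned for `DataFrames` appears to be the "no upper limit" case. Here is a rough sketch of the idea only: it is not how `get_upper_limit` is implemented, `upper_bound` is a hypothetical helper, and it is written for current Julia 1.x rather than the 0.3 used above.

```julia
# Hypothetical helper, NOT part of MetadataTools.
# Reads one requirement entry such as "julia 0.2- 0.3-" and returns the
# final upper bound as a string, or `nothing` if the entry is open-ended.
function upper_bound(req::AbstractString)
    parts = split(req)
    bounds = parts[2:end]          # parts[1] is the package name
    # Bounds alternate lower, upper, lower, ... so an even count means
    # the last interval is closed by an upper bound.
    if !isempty(bounds) && iseven(length(bounds))
        return String(bounds[end])
    end
    return nothing
end

println(upper_bound("julia 0.2- 0.3-"))   # DataArrays 0.0.3 in the listing
println(upper_bound("julia 0.3.0-"))      # DataArrays 0.2.0 in the listing
```

The first call picks out `0.3-`, and the second returns `nothing` because a single bound is only a lower bound.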

We can also request information about a package from GitHub (or wherever it is hosted – only GitHub needed right now!)

```
gadfly_info = get_pkg_info(get_pkg("Gadfly"))
Base.isless(a::MetadataTools.Contributor,b::MetadataTools.Contributor) =
    isless(a.username,b.username)
sort(gadfly_info.contributors, rev=true)[1:10]
```

Output:

```
10-element Array{(Int64,Contributor),1}:
(428,Contributor("dcjones","https://github.com/dcjones"))
(8,Contributor("dchudz","https://github.com/dchudz"))
(7,Contributor("darwindarak","https://github.com/darwindarak"))
(6,Contributor("timholy","https://github.com/timholy"))
(5,Contributor("kleinschmidt","https://github.com/kleinschmidt"))
(5,Contributor("aviks","https://github.com/aviks"))
(5,Contributor("Keno","https://github.com/Keno"))
(4,Contributor("jverzani","https://github.com/jverzani"))
(4,Contributor("inq","https://github.com/inq"))
(4,Contributor("IainNZ","https://github.com/IainNZ"))
```
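The `isless` definition is there because the contributor list holds `(commits, Contributor)` tuples, and tuples sort element by element: whenever two people have the same commit count, the sort has to compare the `Contributor` values themselves. A minimal self-contained sketch, assuming a stand-in `Contributor` type (the real one lives in `MetadataTools`) and current Julia 1.x syntax:

```julia
# Stand-in for MetadataTools.Contributor (an assumption for this demo).
struct Contributor
    username::String
    url::String
end

# Same idea as in the post: order contributors by username.
Base.isless(a::Contributor, b::Contributor) = isless(a.username, b.username)

# A few rows taken from the Gadfly output above.
rows = [
    (5, Contributor("aviks", "https://github.com/aviks")),
    (8, Contributor("dchudz", "https://github.com/dchudz")),
    (5, Contributor("Keno", "https://github.com/Keno")),
    (5, Contributor("kleinschmidt", "https://github.com/kleinschmidt")),
]

by_tuple = sort(rows, rev=true)

# Alternative that needs no isless method: sort on an explicit key.
by_key = sort(rows, by = r -> (r[1], r[2].username), rev=true)

println([r[2].username for r in by_tuple])
```

Both give the same order. The `by` version has the advantage of not adding a method to `Base.isless` for a type owned by another package.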

I pulled all the data about a week ago and serialized it for later use.

```
f = open("metadata.jldata","r")
pkg_info = deserialize(f)
close(f)
pkg_info["Dates"]
```

Output:

```
PkgInfo("https://github.com/quinnj/Dates.jl","Date/DateTime Implementation for the Julia Language; Successor to Datetime.jl","",5,2,[(2,Contributor("jiahao","https://github.com/jiahao")),(131,Contributor("quinnj","https://github.com/quinnj"))])
```
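For completeness, the write side looks much the same as the read side. This is a sketch with a toy dictionary rather than the real `PkgInfo` data; the file name `metadata_demo.jldata` and the variable names are made up for the demo, and on current Julia 1.x `serialize`/`deserialize` live in the `Serialization` standard library.

```julia
using Serialization

# Toy stand-in for the real data: package name => (commits, username) pairs.
pkg_info_demo = Dict("Dates" => [(2, "jiahao"), (131, "quinnj")])

# Write to a scratch directory so nothing in the working directory is touched.
path = joinpath(mktempdir(), "metadata_demo.jldata")
open(path, "w") do f
    serialize(f, pkg_info_demo)
end

# Read it back, as in the cell above.
restored = open(path, "r") do f
    deserialize(f)
end

println(restored["Dates"])
```

One caveat: the serialized format is tied to the Julia version that wrote it, so it suits short-lived caches like this one better than long-term storage.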

Input:

```
# Calculate commit stats
total_coms = Dict()
total_pkgs = Dict()
for pkg in values(pkg_info)
    for contrib in pkg.contributors
        commits, c = contrib
        total_coms[c.username] = get(total_coms,c.username,0) + commits
        total_pkgs[c.username] = get(total_pkgs,c.username,0) + 1
    end
end
# Turn dicts into sorted (num,username) vectors
total_pkgs = sort([(total_pkgs[n],n) for n in keys(total_pkgs)],rev=true)
total_coms = sort([(total_coms[n],n) for n in keys(total_coms)],rev=true)
println("Number of packages contributed to")
map(println, total_pkgs[1:20])
println("Number of commits across all packages")
map(println, total_coms[1:20]);
```

Output:

```
Number of packages contributed to
(51,"timholy")
(45,"johnmyleswhite")
(40,"kmsquire")
(37,"StefanKarpinski")
(35,"Keno")
(34,"lindahua")
(30,"simonster")
(29,"IainNZ")
(25,"mlubin")
(24,"staticfloat")
(24,"aviks")
(21,"vtjnash")
(20,"stevengj")
(20,"ihnorton")
(18,"quinnj")
(17,"tanmaykm")
(17,"dcjones")
(17,"carlobaldassi")
(16,"tkelman")
(16,"powerdistribution")
Number of commits across all packages
(1734,"lindahua")
(1427,"jakebolewski")
(1178,"timholy")
(893,"johnmyleswhite")
(821,"dcjones")
(788,"simonster")
(749,"mlubin")
(678,"milktrader")
(462,"stevengj")
(435,"dmbates")
(415,"nolta")
(402,"one-more-minute")
(398,"quinnj")
(397,"IainNZ")
(372,"joehuchette")
(353,"powerdistribution")
(350,"WestleyArgentum")
(340,"Keno")
(336,"scidom")
(330,"tanmaykm")
```

### Package Ecosystem

`MetadataTools` has a dependency on `Graphs` to enable an analysis of how packages rely on each other.

```
using Graphs
# Get a directed graph where PkgA -> PkgB iff
# PkgA directly requires PkgB
g = get_pkgs_dep_graph(get_all_pkg())
```

Output:

```
Directed Graph (418 vertices, 496 edges)
```

Input:

```
g_gadfly = get_pkg_dep_graph(get_pkg("Gadfly"),g)
```

Output:

```
Directed Graph (24 vertices, 36 edges)
```
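The Gadfly subgraph is much larger than Gadfly's direct requirements, so `get_pkg_dep_graph` presumably follows requirements transitively. The idea can be sketched without `Graphs` at all, using a plain dictionary and a breadth-first walk. The package names `A` to `D` and the `reachable` function are made up for illustration, and the code targets current Julia 1.x.

```julia
# Hypothetical toy data: each package maps to what it directly requires.
deps = Dict(
    "A" => ["B", "C"],
    "B" => ["D"],
    "C" => ["D"],
    "D" => String[],
)

# Collect every package reachable from `root`, including `root` itself.
function reachable(deps, root)
    seen = Set([root])
    queue = [root]
    while !isempty(queue)
        pkg = popfirst!(queue)
        for d in get(deps, pkg, String[])
            if !(d in seen)
                push!(seen, d)
                push!(queue, d)
            end
        end
    end
    return seen
end

sub = reachable(deps, "A")
println(length(sub) - 1)   # dependencies of A, not counting A itself
```

Here `A` reaches `B`, `C` and `D`, so the count printed is 3. `D` is only counted once even though two packages require it.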

To plot the dependency graph for a package, we can use my GraphLayout.jl package which uses Compose.jl internally for drawing. I haven't got around to adding Graphs.jl support to GraphLayout.jl just yet though…

```
using GraphLayout
for pkg_name in ["Gadfly","QuantEcon","JuMP","Twitter"]
    # Extract graph
    g_pkg = get_pkg_dep_graph(get_pkg(pkg_name),g)
    # Extract adjacency matrix
    adj_mat = adjacency_matrix(g_pkg)
    # Build layout
    locs_x,locs_y = layout_spring_adj(adj_mat)
    # Extract name for each vertex
    vert_names = map(pm->pm.name, vertices(g_pkg))
    # Draw as an SVG
    draw_layout_adj(adj_mat, locs_x, locs_y, labels=vert_names)
end
```

We can also look at which packages depend on the most other packages, directly or indirectly:

```
num_pkg_req = [
    (num_vertices(get_pkg_dep_graph(pkg, g)), pkg.name)
    for pkg in values(pkgs)]
sort!(num_pkg_req, rev=true) # Sort descending
println("Top 10 packages by number of packages depended on:")
for i in 1:10
    # Subtract 1 so the package itself is not counted
    println(rpad(num_pkg_req[i][2],20," "), num_pkg_req[i][1]-1)
end
```

Output:

```
Top 10 packages by number of packages depended on:
RobustStats 30
MachineLearning 30
Quandl 26
Twitter 25
Lumira 24
Gadfly 23
QuantEcon 22
ProfileView 22
ImageView 21
Etcd 21
```

We can also reverse the graph – now an arc from PkgA to PkgB means PkgB requires PkgA

```
g_rev = get_pkgs_dep_graph(pkgs, reverse=true)
# Count the size of every subgraph, as above
num_pkg_req = [
    (num_vertices(get_pkg_dep_graph(pkg, g_rev)), pkg.name)
    for pkg in values(pkgs)]
sort!(num_pkg_req, rev=true) # Sort descending
println("Top 10 packages by number of packages that depend on them:")
for i in 1:10
    println(rpad(num_pkg_req[i][2],20," "), num_pkg_req[i][1]-1)
end
```

Output:

```
Top 10 packages by number of packages that depend on them:
URIParser 89
SHA 88
BinDeps 87
ArrayViews 76
JSON 71
StatsBase 66
Homebrew 58
Zlib 49
URLParse 40
Reexport 40
```
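Reversing the graph just means flipping every edge, after which the same "how much can I reach from here" count answers the opposite question. Here is a small sketch with a plain dictionary instead of `Graphs`. The package names and the helpers `reverse_deps` and `count_reachable` are hypothetical, and the code is written for current Julia 1.x.

```julia
# Hypothetical toy data: package => packages it directly requires.
deps = Dict(
    "A" => ["B", "C"],
    "B" => ["D"],
    "C" => ["D"],
    "D" => String[],
)

# Flip every edge: rev[x] lists the packages that directly require x.
function reverse_deps(deps)
    rev = Dict(k => String[] for k in keys(deps))
    for (pkg, reqs) in deps
        for r in reqs
            push!(get!(rev, r, String[]), pkg)
        end
    end
    return rev
end

# Number of packages reachable from `root`, excluding `root` itself.
function count_reachable(graph, root)
    seen = Set([root])
    stack = [root]
    while !isempty(stack)
        pkg = pop!(stack)
        for n in get(graph, pkg, String[])
            if !(n in seen)
                push!(seen, n)
                push!(stack, n)
            end
        end
    end
    return length(seen) - 1
end

rev = reverse_deps(deps)
println(count_reachable(rev, "D"))   # packages that depend on D
println(count_reachable(rev, "A"))   # nothing depends on A
```

In the toy data everything else ends up depending on `D`, which is the same pattern that puts small utility packages like `URIParser` and `SHA` at the top of the real list.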