MetadataTools.jl

By: Iain Dunning - Blog

Re-posted from: http://iaindunning.com/blog/metadatatools.html

This is my first attempt at turning an IJulia Notebook into a blog post. I gave a lightning talk at the Cambridge Area Julia Users Group (CAJUN) on Sept. 4th 2014 about some fun things you can do with MetadataTools.jl and used the following notebook as my “slides”. Here is that notebook, converted to Markdown for my site. It works fairly well, although I’ve had to a little bit of manual editing to make it look correct.

MetadataTools.jl Demo

using MetadataTools

Getting information about packages

MetadataTools defines a PkgMeta type that represents a package’s METADATA entry, and contains a PkgMetaVersion for each tagged version.

pkgs = get_all_pkg()  # Returns a Dict{String,PkgMeta}
pkgs["DataArrays"]

Output:

DataArrays   git://github.com/JuliaStats/DataArrays.jl.git
  0.0.0,a6ce00,julia 0.2-,StatsBase 0.2.5 0.3-,SortingAlgorithms
  0.0.1,0001ff,julia 0.2-,StatsBase 0.2.5 0.3-,SortingAlgorithms
  0.0.2,7a61d2,julia 0.2-,StatsBase 0.2.5 0.3-,SortingAlgorithms
  0.0.3,613ca1,julia 0.2- 0.3-,StatsBase 0.3.8-
  0.1.0,ae7d82,julia 0.3-,StatsBase 0.3-,SortingAlgorithms
  0.1.1,3fe861,julia 0.3-,StatsBase 0.3
  0.1.2,d0a0b3,julia 0.3-,StatsBase 0.3
  0.1.3,d9ad97,julia 0.3-,StatsBase 0.3
  0.1.4,4742f2,julia 0.3.0-prerelease+1942,StatsBase 0.3
  0.1.5,833e53,julia 0.3.0-,StatsBase 0.3
  0.1.6,4be6c8,julia 0.3.0-,StatsBase 0.3
  0.1.7,fc8a8a,julia 0.3.0-,StatsBase 0.3
  0.1.8,511e2c,julia 0.3.0-,StatsBase 0.3
  0.1.9,9c281b,julia 0.3.0-,StatsBase 0.3
  0.1.10,440fb0,julia 0.3.0-,StatsBase 0.3
  0.1.11,623147,julia 0.3.0-,StatsBase 0.3
  0.1.12,e0e4a7,julia 0.3.0-,StatsBase 0.3
  0.2.0,d78a6d,julia 0.3.0-,StatsBase 0.3

We can check that maximum supported Julia version using get_upper_limit – useful for checking if a package is deprecated.

Input:

get_upper_limit(get_pkg("Monads"))

Output:

v"0.3.0"

Input:

get_upper_limit(get_pkg("DataFrames"))

Output:

v"0.0.0"

We can also request information about a package from GitHub (or wherever it is hosted – only GitHub needed right now!)

gadfly_info = get_pkg_info(get_pkg("Gadfly"))
Base.isless(a::MetadataTools.Contributor,b::MetadataTools.Contributor) =
    isless(a.username,b.username)
sort(gadfly_info.contributors, rev=true)[1:10]

Output:

10-element Array{(Int64,Contributor),1}:
 (428,Contributor("dcjones","https://github.com/dcjones"))        
 (8,Contributor("dchudz","https://github.com/dchudz"))            
 (7,Contributor("darwindarak","https://github.com/darwindarak"))  
 (6,Contributor("timholy","https://github.com/timholy"))          
 (5,Contributor("kleinschmidt","https://github.com/kleinschmidt"))
 (5,Contributor("aviks","https://github.com/aviks"))              
 (5,Contributor("Keno","https://github.com/Keno"))                
 (4,Contributor("jverzani","https://github.com/jverzani"))        
 (4,Contributor("inq","https://github.com/inq"))                  
 (4,Contributor("IainNZ","https://github.com/IainNZ"))            

I pulled all the data about a week ago and serialized it for later use.

f = open("metadata.jldata","r")
pkg_info = deserialize(f)
close(f)
pkg_info["Dates"]

Output:

PkgInfo("https://github.com/quinnj/Dates.jl","Date/DateTime Implementation for the Julia Language; Successor to Datetime.jl","",5,2,[(2,Contributor("jiahao","https://github.com/jiahao")),(131,Contributor("quinnj","https://github.com/quinnj"))])

Input:

# Calculate commits stats
total_coms = Dict()
total_pkgs = Dict()

for pkg in values(pkg_info)
    for contrib in pkg.contributors
        commits, c = contrib
        total_coms[c.username] = get(total_coms,c.username,0) + commits
        total_pkgs[c.username] = get(total_pkgs,c.username,0) + 1
    end
end

# Turn dicts into sorted (num,username) vectors
total_pkgs = sort([(total_pkgs[n],n) for n in keys(total_pkgs)],rev=true)
total_coms = sort([(total_coms[n],n) for n in keys(total_coms)],rev=true)

println("Number of packages contributed to")
map(println, total_pkgs[1:20])

println("Number of commits across all packages")
map(println, total_coms[1:20]);

Output:

Number of packages contributed to
(51,"timholy")
(45,"johnmyleswhite")
(40,"kmsquire")
(37,"StefanKarpinski")
(35,"Keno")
(34,"lindahua")
(30,"simonster")
(29,"IainNZ")
(25,"mlubin")
(24,"staticfloat")
(24,"aviks")
(21,"vtjnash")
(20,"stevengj")
(20,"ihnorton")
(18,"quinnj")
(17,"tanmaykm")
(17,"dcjones")
(17,"carlobaldassi")
(16,"tkelman")
(16,"powerdistribution")

Number of commits across all packages
(1734,"lindahua")
(1427,"jakebolewski")
(1178,"timholy")
(893,"johnmyleswhite")
(821,"dcjones")
(788,"simonster")
(749,"mlubin")
(678,"milktrader")
(462,"stevengj")
(435,"dmbates")
(415,"nolta")
(402,"one-more-minute")
(398,"quinnj")
(397,"IainNZ")
(372,"joehuchette")
(353,"powerdistribution")
(350,"WestleyArgentum")
(340,"Keno")
(336,"scidom")
(330,"tanmaykm")

Package Ecosystem

MetadataTools has a dependency on Graphs to enable an analysis of how
packages rely on each other.

using Graphs
# Get a directed graph where PkgA -> PkgB iff 
# PkgA directly requires PkgB
g = get_pkgs_dep_graph(get_all_pkg())

Output:

Directed Graph (418 vertices, 496 edges)

Input:

g_gadfly = get_pkg_dep_graph(get_pkg("Gadfly"),g)

Output:

Directed Graph (24 vertices, 36 edges)

To plot the dependency graph for a package, we can use my GraphLayout.jl package which uses Compose.jl internally for drawing. I haven’t got around to adding Graphs.jl support to GraphLayout.jl just yet though…

using GraphLayout
for pkg_name in ["Gadfly","QuantEcon","JuMP","Twitter"]
    # Extract graph
    g_pkg = get_pkg_dep_graph(get_pkg(pkg_name),g)
    # Extract adjacency matrix
    adj_mat = adjacency_matrix(g_pkg)
    # Build layout
    locs_x,locs_y = layout_spring_adj(adj_mat)
    # Extract name for each vertex
    vert_names = map(pm->pm.name, vertices(g_pkg))
    # Draw as an SVG
    draw_layout_adj(adj_mat, locs_x, locs_y, labels=vert_names)
end

We can also look at which packages depend on the most packages

num_pkg_req = [
    (num_vertices(get_pkg_dep_graph(pkg, g)), pkg.name)
        for pkg in values(pkgs)]
sort!(num_pkg_req, rev=true)  # Sort descending
println("Top 10 packages by number of packages depended on:")
for i in 1:10
    println(rpad(num_pkg_req[i][2],20," "), num_pkg_req[i][1]-1)
end

Output:

Top 10 packages by number of packages depended on:
RobustStats         30
MachineLearning     30
Quandl              26
Twitter             25
Lumira              24
Gadfly              23
QuantEcon           22
ProfileView         22
ImageView           21
Etcd                21

We can also reverse the graph – now an arc from PkgA to PkgB means PkgB requires PkgA

g_rev = get_pkgs_dep_graph(pkgs, reverse=true)
# Count size of every subgraphs like above
num_pkg_req = [
    (num_vertices(get_pkg_dep_graph(pkg, g_rev)), pkg.name)
        for pkg in values(pkgs)]
sort!(num_pkg_req, rev=true)  # Sort descending
println("Top 10 packages by number of packages that depend on them:")
for i in 1:10
    println(rpad(num_pkg_req[i][2],20," "), num_pkg_req[i][1]-1)
end

Output:

Top 10 packages by number of packages that depend on them:
URIParser           89
SHA                 88
BinDeps             87
ArrayViews          76
JSON                71
StatsBase           66
Homebrew            58
Zlib                49
URLParse            40
Reexport            40