
JuliaCon 2016

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/b08JH6V-YnU/juliacon2016

The gentle breeze brushed my face and the mild sunshine warmed an
otherwise chilly morning. I was standing in front of a large building that can
only be described as unique: a series of metal plates jutting out at odd angles,
whose dull resplendence cast an instant impression. It was the Ray and Maria
Stata Center, a towering monolith and the venue for an event that
people from all over the world came to attend and participate in. I was in town for
the third edition of JuliaCon,
the annual Julia Conference at MIT.

On the eve of JuliaCon, a series of workshops was organised on some of the important
areas in which people use Julia. I was conducting the
Parallel Computing workshop
along with some other members of the JuliaLab. The key idea in our workshop was
to show users the many different ways of writing and executing parallel code in Julia.
I was talking about easy GPU computing using my package called ArrayFire and achieving
acceleration using Julia’s multi-threading.

Day 1 started off with the first keynote speaker – Guy Steele, a stalwart
in the software industry and an expert in programming language design. He
spoke about his adventures designing
Fortress, a language that was intended to be good at mathematical programming.
He went through the key design principles and tradeoffs: from the type hierarchy,
to their model for parallelism (automatic work-stealing), and interesting choices
(such as non-transitive operator precedence). My colleague Keno Fischer was up next
with a tour of the new Julia Debugger:
Gallium! Gallium was quite breathtaking in its complexity and versatility, so much so
that Keno himself uses it to debug code in C and C++! A powerful debugger becomes even
better with GUI integration, which Mike Innes neatly demonstrated with
Juno-Gallium integration. Stepping, printing and breakpoints promised a powerful
package development experience.

The next session was all about data science. Simon Byrne
spoke about the data science ecosystem
in Julia and future plans. He touched on the
famous problem
with DataFrames, and then laid out a roadmap for the ecosystem. The rest of the
session featured an interesting demo in
music processing,
while Arch Robison showed
us how to use Julia as a code generator.

The evening had two sessions in parallel in different rooms. This is a recurrent
feature of JuliaCon, and it’s always hard to decide which session to attend.
This time, I chose to attend the sessions on
automatic differentiation in
JuMP and
forward differentiation using
ForwardDiff.jl. I didn’t want to miss the
talk on iterative methods for sparse
linear systems. The performance of different techniques and approaches was compared
and evaluated against one another, making for a compelling presentation
that I really enjoyed.

The evening session featured Jeffrey Sarnoff, one of the sponsors of JuliaCon 2016.
Mr. Sarnoff had some very interesting thoughts
on extended precision floating point numbers. And so ended the first day at JuliaCon.
Now it was time to head to the JuliaHouse! The JuliaHouse was an Airbnb that a bunch of
Julia contributors rented out. They had a yard and a barbecue and it was the ideal place
for people to go relax, unwind and network with the other Julia folks. People chilled there
till the wee hours of the morning, and somehow made it on time for the next day’s session.

The second day started with a keynote speech
by Professor Tim Holy, a prolific contributor to the Julia language and ecosystem.
He spoke about the state of arrays in Julia and showed us a few of his ideas for iterators.
Professor Holy is widely admired throughout the Julia community for his involvement
in various packages and in key issues on the language. I noticed that he asked some sharp,
insightful questions at various earlier sessions too. Stefan was up next with his super-important
Julia 1.0 talk. It was quite a comprehensive list
of things that needed to be done before Julia would be 1.0 ready and he touched on a variety of areas
such as the compiler, the type system, the runtime, multi-threading, strings and so on.

The next session saw a team from UC Berkeley show off their
autonomous racing car that uses some optimization
packages (JuMP and Ipopt in particular) to solve real-time optimization problems. Julia was running
on an ARM chip with Ubuntu 14.04 installed. Julia can also run on the Raspberry Pi, and my colleague
Avik took some time to show off a cool Minecraft demo
running on the Pi. The talk after that was about JuliaBox. Nishanth, another colleague of mine, has
been hard at work porting JuliaBox to Google Cloud from AWS, and he
spoke about his exciting plans for JuliaBox.

Post lunch, I had to choose again between parallel sessions, but I couldn’t quite
resist the session on stochastic PDEs and finite elements. Kristoffer Carlsson
reviewed the state of FEM in Julia,
talking about the packages and ecosystem for every FEM step from assembly to the
conjugate gradient. The next talk
was given by a professor at TU Vienna whose group conducts research on nano-biosensors,
and the group uses Julia to solve the stochastic PDEs that come up when trying to model
noise and fluctuations. The next talk
on astrodynamics was very interesting in that it gave me an insight into the kinds of
computational challenges faced by scientists in the field. There were also some interesting
demos which I enjoyed, particularly the one where we modelled and visualized a target orbit,
which was superimposed upon a visual of the Earth in space.

In the afternoon, after much consideration, I went to the session that featured statistical
modelling and least squares. The first talk on
sparse least squares optimization problems
gave me a flavor of the kinds of models and problems economists need to solve, and how the
Julia ecosystem helps them. The next talk
on computational neuroscience focussed on dealing with tens of terabytes of brain data
coming from both animals and human surgery patients. I had a very interesting discussion
with John earlier about his work, and I was able to get a keen sense of why the package
he was talking about (VinDsl.jl) was important for his work. And so ended Day 2 at JuliaCon,
a highly educational day for me personally, with insights into astrodynamics, finite elements
and computational neuroscience.

I would contend that one of the best ways to begin your day is to listen to a speech by a Nobel
Laureate. It was quite a surreal experience
listening to Professor Tom Sargent, and to see him excited by Julia. He gave us a flavor of
macroeconomics research and introduced dynamic programming squared problems that were
“a walking advertisement for Julia”. As a case in point, the
next session on DSGE models in Julia highlighted
the benefits Julia can bring to macroeconomics research and analysis.

The next session had a bunch of Julia Summer of Code (JSOC) students present their projects.
Some couldn’t make it to the conference so they presented their work
through Google Hangouts or through
pre-recorded video. Unfortunately, I couldn’t
catch all of them because I wanted to catch my colleague Jameson’s Machine Code
talk which was in another room. The material
he spoke about was very interesting, and got me thinking about the Julia compiler. I also had a
very enlightening discussion with him later about the Julia parser.

It turned out that in the afternoon, I was crunched for time. I was helping Shashi plug
ArrayFire into
Dagger.jl for his talk that was due in a
couple of hours, while also working on my own ArrayFire notebooks for late that evening.
But we managed to pull through in time. So the afternoon session had
Shashi presenting
Dagger, his out-of-core framework, followed by a tour
of ParallelAccelerator from the IntelLabs team.
I have been following ParallelAccelerator for a while, and I’m excited by how certain aspects
of it (such as automatic elimination of bounds checking) can be incorporated into Base Julia.

The evening session showed people how they can accelerate their code in Julia. The speaker
before me covered vectorization with Yeppp
before I covered GPU acceleration with ArrayFire.
It was quite overwhelming to be speaking in front of a bunch of experts, but I think I did okay,
though I did finish 5 minutes ahead of my allotted time. As it turned out, both parallel sessions
actually ended up concluding a few minutes early.

Finally, Andreas came up to the podium for the concluding remarks and closed off a very important
JuliaCon for me personally. I was able to appreciate the various kinds of people involved in the
Julia community: from some who worked on the core language to some who worked on their own packages as
part of their research; from some who worked on Julia part-time to some (like myself) who worked
full-time; from the relatively uninitiated JSOC students to experienced old-timers in the community.
One thing tied them all together though: a deep appreciation of a new language whose
flexibility and power enabled people to solve important problems, whose community’s openness and sense
of democracy welcomed more smart people, and the idea that a group of individuals in different time zones
and from different walks of life can drive a revolution in scientific computing.

BioJulia 2016 – The First Report

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/12oUw98qs-Q/biojulia2016-mid

We are pleased to announce the release of
Bio.jl 0.4, a minor release including
significant functionality improvements, as promised in the previous blog
post.

The following features have been added since that post:

  • Online sequence search algorithms.
  • Sequence data structure for reference genomes.
  • Data reader and writer for the .2bit file format.
  • Data reader and writer for the SAM and BAM file formats.
  • Sequence demultiplexing tool.
  • Package to handle BGZF files.

And many other miscellaneous performance and usability improvements! Tutorial
notebooks are available at https://github.com/BioJulia/BioTutorials. Here I
briefly introduce you to these new features one by one.

Online sequence search algorithms

Sequence search is an indispensable tool in sequence analysis. Since the last
post, I have added exact, approximate and regex search algorithms. The search
interface of Bio.jl mimics that of Julia’s standard library.

julia> using Bio.Seq

julia> seq = dna"ACAGCGTAGCT"
11nt DNA Sequence:
ACAGCGTAGCT

# Exact search.
julia> search(seq, dna"AGCG")
3:6

# Approximate search with one error or less.
julia> approxsearch(seq, dna"AGGG", 1)
3:6

# Regular expression search.
julia> search(seq, biore"AGN*?G"d)
3:6

Sequence data structure for reference genomes

In Bio.jl, DNA sequences are encoded using 4 bits per base by default in order
to store ambiguous nucleotides, and this encoding works well in most cases.
However, some biological sequences, such as the chromosomal sequences of
eukaryotic organisms, are so long that the default encoding may waste memory.
ReferenceSequence is a new type introduced in Bio.jl that compresses the
positions of ambiguous nucleotides using a sparse bit vector. This type can
achieve almost 2-bit encoding for common reference sequences, because ambiguous
nucleotides tend to be clustered within a sequence and are few in number
compared to unambiguous nucleotides.

# Converting a DNASequence object to ReferenceSequence.
julia> ReferenceSequence(dna"ACGT"^10000)
40000nt Reference Sequence:
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG…CGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

# Reading chromosome 1 of human from a FASTA file.
julia> open(first, FASTAReader{ReferenceSequence}, "hg38.fa")
Bio.Seq.SeqRecord{Bio.Seq.ReferenceSequence,Bio.Seq.FASTAMetadata}:
  name: chr1
  sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  metadata: Bio.Seq.FASTAMetadata("")

julia> sequence(ans)
248956422nt Reference Sequence:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Data reader and writer for the 2bit file format

2bit is a binary file format to store reference sequences. This is a kind of
binary counterpart of FASTA but
specialized for DNA reference sequences to enable smaller file size and faster
loading. Reference sequences of various organisms are distributed in this file
format from the download page of UCSC. An important advantage of 2bit is that
sequences are indexed by name and can be accessed immediately.

# Opening a sequence file of yeast (S.cerevisiae).
julia> reader = open(TwoBitReader, "sacCer3.2bit");

# Loading a chromosome VI using random access index.
julia> reader["chrVI"]
Bio.Seq.SeqRecord{Bio.Seq.ReferenceSequence,Array{UnitRange{Int64},1}}:
  name: chrVI
  sequence: GATCTCGCAAGTGCATTCCTAGACTTAATTCATATCTGC…GTGTGGTGTGTGGGTGTGGTGTGTGGGTGTGGTGTGTGG
  metadata: UnitRange{Int64}[]

Data reader and writer for the SAM and BAM file formats

The SAM and BAM file formats are designed for storing sequences aligned to
reference sequences. SAM is a line-oriented text file format and easy to handle
with UNIX command line tools. BAM is a compressed binary version of SAM and
suitable for storing data on disk and processing with purpose-built software
like samtools. The BAM data reader is carefully
tuned so that users can use it in real analyses with large files. It is also
feasible to read a CRAM file
by combining the BAM reader with the samtools view command.

An experimental feature is parallel processing using multiple threads.
Multi-threading support was introduced in Julia 0.5 and we use it to parallelize
decompression of BAM files. Here is a simple benchmark script to show how
much reading speed can be improved with multiple threads:

using Bio.Align

# Count the number of mapped records.
function countmapped(reader)
    ret = 0
    record = BAMRecord()
    while !eof(reader)
        # in-place reading
        read!(reader, record)
        if ismapped(record)
            ret += 1
        end
    end
    return ret
end

println(open(countmapped, BAMReader, ARGS[1]))

The JULIA_NUM_THREADS environment variable controls the number of worker threads.
The result below shows that the elapsed time is almost halved using two threads:

~/.j/v/Bio $ time julia countmapped.jl SRR1238088.sort.bam
28040186
       29.27 real        28.64 user         0.66 sys
~/.j/v/Bio $ env JULIA_NUM_THREADS=2 time julia countmapped.jl SRR1238088.sort.bam
28040186
       17.40 real        32.31 user         0.63 sys
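
To illustrate why this speedup is possible, here is a toy sketch of the block-parallel
pattern (illustrative only, not the actual BGZFStreams.jl implementation). Because each
BGZF block is an independent gzip member, blocks can be inflated on separate threads;
the inflate_block function below is a made-up placeholder standing in for the real
zlib inflation step.

# Toy sketch of block-parallel decompression (not BGZFStreams.jl code).
inflate_block(block::Vector{UInt8}) = reverse(block)  # placeholder for inflating one gzip member

function process_blocks(blocks::Vector{Vector{UInt8}})
    out = similar(blocks)
    # Each BGZF block can be decompressed independently, so the iterations are
    # spread across the threads set by JULIA_NUM_THREADS.
    Threads.@threads for i in 1:length(blocks)
        out[i] = inflate_block(blocks[i])
    end
    return out
end

process_blocks([rand(UInt8, 1024) for _ in 1:8])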

Package to handle BGZF files

BGZF (Blocked GZip Format) is a gzip-compliant file format commonly used in
bioinformatics. BGZF can be read using standard gzip tools but files in the
format are compressed block by block and special metadata are added to index the
compressed files for random access. BAM files are compressed in this file format
and sequence alignments in a specific genomic region can be retrieved
efficiently. BGZFStreams.jl is a new package to handle BGZF files like usual
I/O streams and it is built on top of our
Libz.jl package. Parallel decompression
mentioned above is implemented in this package layer.

julia> using BGZFStreams

julia> stream = BGZFStream("/Users/kenta/.julia/v0.5/BGZFStreams/test/bar.bgz")
BGZFStreams.BGZFStream{IOStream}(<mode=read>)

julia> readstring(stream)
"bar"

https://github.com/BioJulia/BGZFStreams.jl
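
The random-access metadata mentioned above is exposed through virtual offsets, which
pack the position of a compressed block together with an offset inside the decompressed
block. As a rough sketch based on the package's virtual offset interface (same test file
as above; the byte values shown are only indicative), a position can be recorded with
virtualoffset and later sought back to:

julia> stream = BGZFStream("/Users/kenta/.julia/v0.5/BGZFStreams/test/bar.bgz");

# Remember the current position as a virtual offset.
julia> offset = virtualoffset(stream);

julia> read(stream, UInt8)  # first byte of "bar"
0x62

# Seek back to the remembered virtual offset and read the same byte again.
julia> seek(stream, offset);

julia> read(stream, UInt8)
0x62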

Sequence demultiplexing tool

Sequence demultiplexing is a technique to distinguish the origin of a sequence
using an artificially attached “barcode” sequence. This is often used as a
preprocessing step after multiplexed
sequencing,
a common technique to sequence multiple samples simultaneously. A barcode
sequence, however, may be corrupted due to sequencing error, and we need to find
the best matching barcode from a barcode set. The demultiplexer algorithm
implemented in Bio.jl is based on a trie-like data structure, and efficiently
finds the optimal barcode from the prefix of a given DNA sequence.

# Set DNA barcode pool.
julia> barcodes = DNASequence["ATGG", "CAGA", "GGAA", "TACG"];

# Create a sequence demultiplexer that allows one error at most.
julia> dplxr = Demultiplexer(barcodes, n_max_errors=1, distance=:hamming)
Bio.Seq.Demultiplexer{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 4
  number of correctable errors: 1

# Demultiplex a given sequence from its prefix.
julia> demultiplex(dplxr, dna"ATGGCGNT")  # 1st barcode with no errors
(1,0)

julia> demultiplex(dplxr, dna"CAGGCGNT")  # 2nd barcode with one error
(2,1)

Next step

This is still the first half of my project this year. The next term will come
with:

  • Supporting more file formats including GFF3, VCF and BCF.
  • Integration with databases.
  • Integration with genome browsers.

And, of course, improving existing features of Bio.jl and other packages. We
welcome any contributions and feature requests from you all. The Gitter chat channel
is the best place to communicate with developers and other users. If you love Julia
and/or biology, is there any reason not to join us?

Acknowledgements

I gratefully acknowledge the Moore Foundation and the Julia project for
supporting the BioJulia project. I also would like to thank Ben J. Ward and
Kevin Murray for comments on my program code and other contributions.

GSoC 2016 Project: Graft.jl

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/-XJNyq8rvJc/GSoC2016-Graft

This blog post describes my work on Graft.jl, a general purpose graph analysis package for Julia. For those unfamiliar with graph algorithms, a quick introduction might help.

Proposal

My proposal, titled ParallelGraphs, was to develop a parallelized/distributed graph algorithms
library. However, in the first month or so, we decided to work towards a more general framework that supports data analysis on
networks (graphs with attributes defined on vertices and edges).
Our change in direction was mainly motivated by:

  • The challenges associated with distributed graph computations. This
    blog post was an eye opener.
  • Only very large graphs, of the order of terabytes or petabytes, require distributed execution. Most useful graphs can be analyzed on a single compute node.
  • Multi-threading is under heavy development, and we decided to wait for the full multi-threaded programming model to be available.
  • As we looked at public datasets, we felt that the ability to combine graph theoretic analyses with real world data was the missing piece in Julia. LightGraphs.jl already provides fast implementations for most graph algorithms, so we decided to target
    graph data analysis.

The modified proposal could be summarized as the development of a package that supports:

  • Vertex and edge metadata : Key value pairs for vertices and edges.
  • Vertex labelling : Allow vertices to be referenced, externally, through arbitrary Julia types.
  • SQL like queries for edge data and metadata.
  • Compatibility with LightGraphs.

Graft

ParallelGraphs turned out to be a misnomer, since we were moving towards a more general purpose data analysis framework. So we chose the name Graft, a kind of abbreviation for Graph Toolkit. The following sections detail Graft's features:

Vertex and Edge Metadata

Graphs are often representations of real-world entities and the relationships between them. Such entities (and their relationships) often have data attached to them.
While it is quite straightforward to store vertex data (a simple table will suffice), storing edges and their data is very tricky. The data should be structured on the
source and target vertices, should support random access and should be vectorized for queries.

At first we tried placing the edge data in a SparseMatrixCSC. This turned out to be a bad idea, because sparse matrices are designed for numeric storage.
A simpler solution is to store edge metadata in a DataFrame, and have a SparseMatrixCSC map edges onto indices for the DataFrame. This strategy needed a lot less
code, and the benchmarks were more promising. Mutations such as the addition or removal of vertices and edges become more complicated, however.
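
To make this design concrete, here is a minimal sketch of the idea in plain Julia
(illustrative only, not Graft's actual internals): edge attributes live in a DataFrame,
and a SparseMatrixCSC maps each (source, target) pair to the corresponding row of that
table. The names edata and eindex are made up for the example, and today's standard-library
SparseArrays is used.

  using DataFrames, SparseArrays

  # Edge attribute table: one row per edge.
  edata = DataFrame(relationship = ["follow", "friend", "follow", "friend"])

  # Sparse index: entry (s, t) stores the row in edata for the edge s -> t.
  sources = [1, 2, 3, 4]
  targets = [2, 3, 4, 1]
  eindex  = sparse(sources, targets, collect(1:4), 4, 4)

  # Look up the metadata of the edge 2 -> 3 in two steps:
  # a sparse-matrix access followed by a table row access.
  row = eindex[2, 3]
  edata[row, :relationship]    # "friend"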

Vertex Labelling

Most graph libraries do not support vertex labelling. It can be very confusing to refer to a vertex by its (often long) integer identifier. It is also
computationally expensive to use non-integer labels in the implementation of the package (any such implementation would involve dictionaries). There is no reason, however,
for the user to have to use integer labels externally. Graft supports two modes of vertex labelling. By default, a vertex is identified by its internal identifier. A user
can assign labels of any arbitrary Julia type to identify vertices, overriding the internal identifiers. This strategy, we feel, makes a reasonable compromise between
user experience and performance.

If vertex labels were used in the internal implementation, the graph data structure would probably look like this:

  Dict(
     "Alice" => Dict(
        "age" => 34,
        "occupation"  => "Doctor",
        "adjacencies" => Dict("Bob" => Dict("relationship" => "follow"))
     ),
     "Bob" => Dict(
        "age" => 36,
        "occupation"  => "Software Engineer",
        "adjacencies" => Dict("Charlie" => Dict("relationship" => "friend"))
     ),
     "Charlie" => Dict(
        "age" => 30,
        "occupation"  => "Lawyer",
        "adjacencies" => Dict("David" => Dict("relationship" => "follow"))
     ),
     "David" => Dict(
        "age" => 29,
        "occupation" => "Athlete",
        "adjacencies" => Dict("Alice" => Dict("relationship" => "friend"))
     )
  )

Clearly, using labels internally is a very bad idea. Any sort of data access would set off multiple dictionary look-ups. Instead, if a bidirectional map
could be used to translate labels into vertex identifiers and back, the number of dictionary lookups could be reduced to one. The data would also be better structured for query processing.

  # Label Map to resolve queries
  LabelMap(
     # Forward map : labels to vertex identifiers
     Dict("Alice" => 1, "David" => 4", "Charlie" => 3, Bob" => 2),

     # Reverse map : vertex identifiers to labels
     String["Alice, "Bob", "Charlie", "David"]
  )

  # Vertex DataFrame
  4×2 DataFrames.DataFrame
  │ Row │ age │ occupation          │
  ├─────┼─────┼─────────────────────┤
  │ 1   │ 34  │ "Doctor"            │
  │ 2   │ 36  │ "Software Engineer" │
  │ 3   │ 30  │ "Lawyer"            │
  │ 4   │ 29  │ "Athlete"           │

  # SparseMatrixCSC : maps edges onto indices into Edge DataFrame
  4×4 sparse matrix with 4 Int64 nonzero entries:
     [4, 1]  =  1
     [1, 2]  =  2
     [2, 3]  =  3
     [3, 4]  =  4

  # Edge DataFrame
  4×1 DataFrames.DataFrame
  │ Row │ relationship │
  ├─────┼──────────────┤
  │ 1   │ "follow"     │
  │ 2   │ "friend"     │
  │ 3   │ "follow"     │
  │ 4   │ "friend"     │

SQL Like Queries

Graft’s query notation is borrowed from Jplyr. The @query macro is used to simplify the query syntax, and
accepts a pipeline of stages separated by the pipe operator |>. The stages are described through the following abstractions:

eachvertex

Accepts an expression that is run over every vertex. Vertex properties can be expressed using the dot notation. Some reserved properties are v.id, v.label,
v.adj, v.indegree and v.outdegree.
Examples:

  # Check if the user has overridden the default labels
  julia> @query(g |> eachvertex(v.id == v.label)) |> all

  # Kirchhoff's law :P
  julia> @query(g |> eachvertex(v.outdegree - v.indegree)) .== 0

eachedge

Accepts an expression that is run over every edge. The symbol s is used to denote
the source vertex, and t is used to denote the target vertex in the edge. The symbol e is used to denote
the edge itself. Edge properties can be expressed through the dot notation. Some reserved properties are e.source, e.target, e.mutualcount, and e.mutual.
Examples:

  # Arithmetic expression on edge, source and target properties
  julia> @query g |> eachedge(e.p1 - s.p1 - t.p1)


  # Check if constituent vertices have the same outdegree
  julia> @query g |> eachedge(s.outdegree == t.outdegree)


  # Count the number of "mutual friends" between the source and target vertices in each edge
  julia> @query g |> eachedge(e.mutualcount)

filter

Accepts vertex or edge expressions and computes subgraphs with a subset of vertices, or a subset
of edges, or both.
Examples:

  # Remove vertices where property p1 equals property p2
  @query g |> filter(v.p1 != v.p2)

  # Remove self loops from the graph
  @query g |> filter(e.source != e.target)

select

Returns a subgraph with a subset of vertex properties, or a subset of edge properties or both.
Examples:

  # Preserve vertex properties p1, p2 and nothing else
  @query g |> select(v.p1, v.p2)

  # Preserve vertex property p1 and edge property p2
  @query g |> select(v.p1, e.p2)

Demonstration

The typical workflow we hope to support with Graft is:
– Load a graph into memory.
– Use the query abstractions to construct new vertex/edge properties or obtain subgraphs.
– Run complex queries on the subgraphs, or export data to LightGraphs and run computationally expensive algorithms there.
– Bring the data back into Graft as a new property, or use it to modify the graph's structure.

The following examples should demonstrate this workflow:

  • Google+: This demo uses a real, somewhat large, dataset with plenty of text data.
  • Baseball Players: Two separate datasets spliced together, a table on baseball players
    and a trust network. The resulting data is quite absurd, but does a good job of showing the quantitative queries Graft can run.

Future Work

  • Graph IO : Support more graph file formats.
  • Improve the query interface: The current pipelined, macro-based syntax has a learning curve, and the macro itself does some eval at runtime. We would like to move towards a cleaner, composable syntax that reads like regular Julia commands.
  • New abstractions, such as Group-by, sort, and table output.
  • Database backends : An RDBMS can be used instead of the DataFrames, or Graft can serve as a wrapper over a graph database such as Neo4j.
  • Integration with ComputeFramework for out of core processing. Support for parallelized IO, traversals and queries.

More information can be found here.

Acknowledgements

This work was carried out as part of the Google Summer of Code program, under the guidance of mentors: Viral B Shah and Shashi Gowda.