BioJulia Project in 2016

By: Julia Developers

Re-posted from:

I am pleased to announce that the next phase of BioJulia is starting! In the next several months, I’m going to implement many crucial features for bioinformatics that will motivate you to use Julia and BioJulia libraries in your work. But before going to the details of the project, let me briefly introduce you what the BioJulia project is. This project is supported by the Moore Foundation and the Julia project.

The BioJulia project is a collaborative open source project to create an infrastructure for bioinformatics in the Julia programming language. It aims to provide fast and accessible software libraries. Julia’s Just-In-Time (JIT) compiler enables this greedy goal without resorting to other compiled languages like C/C++. The central package developed under the project is Bio.jl, which provides fundamental features including biological symbols/sequences, file format parsers, alignment algorithms, wrappers for external softwares, etc. It also supports several common file formats such as FASTA, FASTQ, BED, PDB, and so on. Last year I made the FMIndexes.jl package to build a full-text search index for large genomes as a Julia Summer of Code (JSoC) student, and we released the first development version of Bio.jl. While the BioJulia project is getting more active and the number of contributors are growing, we still lack some important features for realistic applications. Filling in gaps between our current libraries and actual use cases is the purpose of my new project.

So, what will be added in it? Here is the summary of my plan:

  • Sequence analysis:
    • Online sequence search algorithms
    • Data structure for reference genomes
    • Error-correcting algorithms for DNA barcodes
    • Parsers for BAM and CRAM file formats
  • Integration with data viewers and databases:
    • Genome browser backend
    • Parsers for GFF3 and VCF/BCF
    • Database access through web APIs

These things are of crucial importance for writing analysis programs because they connects software components (e.g. programs, archives, databases, viewers, etc.); data analysis softwares in bioinformatics usually read/write formatted data from/to each other. The figure below shows common workflow of detecting genetic variants; underlined deliverables will connect softwares, archives and databases so that you can write your analysis software in the Julia language.


Sequence Analysis

The online sequence search algorithms will come with three flavors: exact, approximate, and regular expression search algorithms. The exact sequence search literally means finding exactly matching positions of a query sequence in another sequence. The approximate search is similar to the exact search but allows up to a specified number of errors: mismatches, insertions, and deletions. The regular expression search accepts a query in regular expression, which enables flexible description of a query pattern like motifs. For these algorithms, there are already half-done pull requests I’m working on: #152, #153, #143.

After the last release of Bio.jl v0.1.0, the sequence data structure has been significantly rewritten to make biological sequence types coherent and extensible. But because we chose an encoding that requires 4 bits per base to represent DNA sequences, the DNA sequence type consumes too much memory than necessary to store a reference genome, which is usually composed of four kinds of DNA nucleotides (denoted by A/C/G/T) and (consecutive and relatively small number of) undetermined nucleotides (denoted by N). After trying some data structures, I found that memory space of N positions can be dramatically saved using IndexableBitVectors.jl, which is a package I created in JSoC 2015. I’m developing a separated package for reference genomes, ReferenceSequences.jl, and going to improve the functionality and performance to handle huge genomes like the human genome.

If you are a researcher or an engineer who handles high-throughput sequencing data, BAM and CRAM parsers would be the most longing feature addition in the list. BAM is the de facto standard file format to accommodate aligned sequences and most sequence mappers generate alignments in this format. CRAM is a storage-efficient alternative of BAM and is getting popular reflecting explosion of accumulated sequence data. Since these files contain massive amounts of DNA sequences from high-throughput sequencing machines, high-speed parsing is a practically desirable feature. I’m going to concentrate on the speed by careful tuning and multi-thread parallel computation which is planned to be introduced in the next Julia release.

Integration with Data Viewers and Databases

Genome browsers enable to interactively visualize genetic features found in
individuals and/or populations. For example, using the UCSC Genome Browser, you can investigate genetic regions along with sequence annotations around the ABO gene in a window. Genome browser is one of the most common visualizations and hence lots of softwares have been developed but unfortunately there is no standardized interface. So, we will need to select a promising one that is an open source and supporting interactions with other softwares. The first candidate is JBrowse, which is built with modern JavaScript and HTML5 technologies. It also supports RESTful APIs and hence it can fetch data from a backend server via HTTP. I’m planning to make an API server that responds to queries from a genome browser to interactively visualize in-memory data.

Many databases distribute their data in some standardized file formats. As for genetic annotations and variants, GFF3 and VCF would be the most common formats. If you are using data from human or mouse, you should know various annotations are available from the GENCODE project. It offers data in GTF or GFF3 file formats. NCBI provides human variation sets in VCF file formats here. These file formats are text, so you may think it is trivial to write parsers when you need them. It is partially true — if you don’t care about completeness and performance. Parsing a text file format in a naive way (for example, split a line by a tab character) allocates many temporary objects and often leads to degrade performance, while careful tuning of a parser leads to complicated code that is hard to maintain. @dcjones challenged this problem and made a great work and made Julia support for Ragel, which generates Julia code that executes a finite state machine. Daniel’s talk of the JuliaCon 2015 is helpful to know about the details if you are interested:

Sometimes you may need only a part of data provided by a database. In such a case, web-based APIs are handy to fetch necessary data on demand. BioMart Central Portal offers a unified access point to a range of biological databases that is programmatically accessible via REST and SOAP APIs. Julian wrapper to BioMart will make it much easier to access data by automatically converting response to Julia objects. In the R language, the biomaRt package is one of the most downloaded packages in Bioconductor packages:

Try BioJulia!

We need users and collaborators of our libraries. Feedbacks from users in the real world are the most precious thing to improve the quality of our libraries. We welcome feature requests and discussions that will make bioinformatics easier and faster. Tools for phylogenetics and structural biology, which I didn’t mention in this post, are also under active development. You can post issues here:; if you want to get in touch with us more casually, this Gitter room may be more convenient:

A parallel recommendation engine in Julia.

By: Julia Computing, Inc.

Re-posted from:


Recommender systems play a pivotal role in various business settings like e-commerce sites, social media platforms, and other platforms involving user interaction with other users or products. Recommender systems provide valuable insights to gain actionable
intelligence on these users.

Large Scale Recommender systems help in unraveling the latent information in the complex relational data between users and items.However mapping the users space to the items space to predict the interaction is a challenge. Inferring actionable information from a variety of data sources collected either implicitly like click patterns, browser history etc, or explicitly like ratings of books and movies, is what well-designed recommender systems do consistently well.

Matrix Factorizations

Depending on the source of information on the users and the items, there are a variety of techniques to build recommender systems, each with a unique mathematical approach. Linear algebra and matrix factorisations are important to certain types of recommenders where user ratings are available and it is most ideal to apply methods like svd in such cases.

In matrix factorization the users and items are mapped onto a joint latent factor space of reduced dimension f, and the inner product of the user vector with the item vector gives the corresponding interaction. Dimensionality reduction is mainly about a more compact representation of the large training data which is obtained by matrix factorization. We want to quantify the nature or the characteristics of the movies defined by a certain number of aspects (factors), i.e., we are trying to generalize the information (independent and unrelated ratings matrix) in a concise and descriptive way.

Example :
Let us consider a simple example to figure out how matrix factorization helps in predicting the likelihood of a user liking a movie or not.
For sake of brevity, we have couple of users, Joe and Jane and couple of movies, Titanic and Troll 2. The users and the movies are characterized based on certain number of factors as show in the below tables.

Factors/Movies Titanic Troll 2
Romance 4 1
Comedy 2 4
Box Office success 5 2
Drama 3 2
Horror 1 4
Factors/Movies Joe Jane
Romance 4 1
Comedy 2 4
Box Office success 5 2
Drama 3 2
Horror 1 4

Consider Joe to be characterized by vector [4 2 5 3 1], which suggests that Joe likes Romance and big hit movies and not so much horror or comedy. Similarly Jane likes comedy horror and she is not very particular about box office success of the movies, neither is she a big fan of romance movies.

The movies Titanic, is a popular romance movie, where as the movie Troll 2, is not so popular and horror comedy. It is intuitively obvious that Joe will end up liking Titanic and Jane will like Troll 2. This is based on how the users and movies score on the 5 factors. Using Cosine distance as shown in the below table, confirms this.

Factors/Movies Joe Jane
Titanic 0.94 0.67
Troll 2 0.50 0.97

With large rating data matrix, like in the NETFLIX dataset which had around 20 thousand movies and 0.5 million users, mapping all the users and the movies in the above way is impossible. This is where matrix factorization helps in factoring the Rating matrix into user matrix and movie matrix.

Alternating Least Squares

ALS Factorization

Let be the user feature matrix where and , and let be the item or
movie feature matrix, where and . Here is the number of factors, i.e., the reduced dimension or the lower rank, which is determined by cross validation. The predictions can be calculated for any user-movie combination,
, as .

Here we minimize the loss function of and as the condition in the iterative process of obtaining these matrices. Let us start by considering the loss due to a single prediction in terms of squared error:

Based on the above equation generalizing it for the whole data set, the
empirical total loss as:
mathcal{L}^{emp}(R,U,M)=frac{1}{n} sum_{(i,j) in
where is the known ratings dataset having ratings.

Julia recommender system

The package RecSys.jl is a package for recommender systems in Julia, it can currently work with explicit ratings data. For preparing the input create an object of ALSWR type. This takes two input parameters, firstly input file location, and second optional input is the variable par which specifies the type of parallelism. The parallelism is about how the data is shared/distributed across the processing units. When par=ParShemm the data is present at one location and is shared across the processing units, when par=ParChunk the data is distributed across the processing units as chunks. For this report only sequential timings were captured, i.e., with nprocs=1.

rec=ALSWR("/location/to/input/file/File.delim", par=ParShemm)

The file can be any tabular structured data, delimited by any character, which needs to be specified,

inp=DlmFile(name::AbstractString; dlm::Char=Base.DataFmt.invalid_dlm(Char), header::Bool=false, quotes::Bool=true)

The call to the function to create a model is train(rec, 10, 10) where 10 is the number of iterations to run and 10 is the number of factors.

Performance Analysis

Sequential version

The sequential performance of the ALS algorithm is tested on Apache Spark and Julia. The scala example code shown in the mentioned link was run with rank = 10 and iterations = 10. The timing of the ALS.train() function is recorded in order to analyse the core computational part only. For the same parameters in Julia, the timings for the computationally intensive train() function is captured.

The algorithm took around 500 seconds to train on the NETFLIX dataset on a single processor, which is good for data as large as 1 billion ratings.

The below table also summarises the performance(single processor) on various other datasets like the Movielens and lastFM.

Parameters/Datasets Size (No. of interactions) Factorization time (in secs)
Movielens 20 Million 119 0.5 Billion 2913

The NETFLIX dataset is not available publicly anymore, however datasets for movielens and lastfm can be downloaded. Please refer the dataset specific julia example scripts in the examples/ directory for more details on how to model the recommender system for the respective datasets.

Parallel version

Parallelism is made possible in Julia mainly 2 ways, a). Multiprocessing and b). Multithreading. The multithreading development is onging. However the multiprocessing based parallel processing in Julia is mature and mainly based around Tasks which are concurrent function calls. The implementation details are not covered here, the following graph summarises the performance of parallel ALS implementation in Julia and Spark,

ALS Factorization

In the above graph, Julia Distr breaks up the problem and uses Julia’s distributed computing capabilities. Julia Shared uses shared memory through mmap arrays. Julia MT is the multithreading version of the ALS. While multi-threading in Julia is nascent, it already gives parallel speedups. There are several planned improvements to Julia’s multi-threading that we expect will make the multi-threaded ALS faster than the other parallel implementations.

The experiments were conducted by invoking spark with flags --master local[1]. The experiments were conducted on a 30 core Intel Xeon machine with 132 GB memory and 2 hyperthreads per core.

Credits to Tanmay KM for contributing towards the parallel implementation of the package.

Apart from methods to model the data and check for accuracy, there are also abilities to make recommendations for users who have not interacted with items, by picking the most likely items the user would interact with. Hence in RecSys.jl we have a fast, scalable and accurate recommender system which can be used to for end to end system. Currently we are working on a demo of such a recommender system with a UI interface too implemented in Julia.

Optimal Number of Workers for Parallel Julia

By: Christopher Rackauckas

Re-posted from:

How many workers do you choose when running a parallel job in Julia? The answer is easy right? The number of physical cores. We always default to that number. For my Core i7 4770K, that means it’s 4, not 8 since that would include the hyperthreads. On my FX8350, there are 8 cores, but only 4 floating-point units (FPUs) which do the math, so in mathematical projects, I should use 4, right? I want to demonstrate that it’s not that simple.

Where the Intuition Comes From

Most of the time when doing scientific computing you are doing parallel programming without even knowing it. This is because a lot of vectorized operations are “implicitly paralleled”, meaning that they are multi-threaded behind the scenes to make everything faster. In other languages like Python, MATLAB, and R, this is also the case. Fire up MATLAB and run

A = randn(10000,10000)
B = randn(10000,10000)

and you will see that all of your cores are used. Threads are a recent introduction to Julia, and so in version 0.5 this will also be the case.

Another large place where implicit parallelization comes up is in linear algebra. When one uses a matrix multiplication, it is almost surely calling an underlying program which is an implementation of BLAS. BLAS (Basic Linear Algebra Subroutines) is aptly named just a set of functions for solving linear algebra problems. These are written in either C or Fortran and are heavily optimized. They are well-studied and many smart people have meticulously crafted “perfect code” which minimizes cache misses and all of that other low level stuff, all to make this very common operation run smoothly.

Because BLAS (and LINPACK, Linear Algebra Package, for other linear algebra routines) is so optimized, people say you should always make sure that it knows exactly how many “real” processors it has to work with. So in my case, with a Core i7 with 4 physical cores and 4 from hyperthreading, forget the hyperthreading and thus there are 4. With the FX8350, there are only 4 processors for doing math, so 4 threads. Check to make sure this is best.

What about for your code?

Most likely this does not apply to your code. You didn’t carefully manage all of your allocations and tell the compiler what needs to be cached etc. You just wrote some beautiful Julia code. So how many workers do you choose?

Let’s take my case. I have 4 real cores, do I choose 4? Or do I make 3 workers to allow for 1 to “command” the others freely? Or do I make 7/8 due to hyperthreading?

I decided to test this out on a non-trivial example. I am not going to share all of the code (I am submitting it as part of a manuscript soon), but the basic idea is that it is a high-order adaptive solver for stochastic differential equations. The code sets up the problem and then calls pmap to do a Monte Carlo simulation and solve the equation 1000 times in parallel. The code is mostly math, but there is a slight twist where some values are stored on stacks (very lightweight datastructure). To make sure I could trust the times, I ran the code 1000 times and took the average, min, and max times.

So in this case, what was best? The results speak for themselves.

Average wall times vs Number of Worker Processes, 1000 iterations
Number of Workers Average Wall Time Max Wall Time Min Wall Time
1 62.8732 64.3445 61.4971
3 25.749 26.6989 25.1143
4 22.4782 23.2046 21.8322
7 19.7411 20.2904 19.1305
8 19.0709 20.1682 18.5846
9 18.3677 18.9592 17.6
10 18.1857 18.9801 17.6823
11 18.1267 18.7089 17.5099
12 17.9848 18.5083 17.5529
13 17.8873 18.4358 17.3664
14 17.4543 17.9513 16.9258
15 16.5952 17.2566 16.1435
16 17.5426 18.4232 16.2633
17 16.927 17.5298 16.4492

Note there are two “1000”s here. I ran the Monte Carlo simulation (each solving the SDE 1000 times itself) 1000 times. I plotted the mean, max, and min times it took to solve the simulation. From the plot it’s very clear that the minimum exists somewhere around 15. 15!

What’s going on? My guess is that this is because of the time that’s not spent on the actual mathematics. Sometimes there are things performing logic, checking if statements, allocating new memory as the stacks grow bigger, etc. Although it is a math problem, there is more than just the math in this problem! Thus it seems the scheduler is able to effectively let the processes compete and more fully utilize the CPU by pushing the jobs around. This can only go so far: if you have too many workers, then you start to get cache misses and then the computational time starts to increase. Indeed, at 10 workers I could already see signs of problems in the resource manager.

Overload at 10 Workers as seen in Window's Resource Manager

Overload at 10 Workers as seen in Window’s Resource Manager

However, allowing one process to start re-allocating memory but causing a cache miss (or whatever it’s doing) seems to be a good tradeoff at low levels. Thus for this code the optimal number of workers is far above the number of physical cores.

Moral of the Story

The moral is, test your code. If your code is REALLY efficient, then sure, making sure you don’t mess with your perfect code is optimal. If your code isn’t optimal (i.e. it’s just some Julia code that is pretty good and you want to parallelize it), try some higher numbers of workers. You may be shocked what happens. In this case, the compute time dropped more than 30% by overloading the number of workers.

The post Optimal Number of Workers for Parallel Julia appeared first on Stochastic Lifestyle.