Author Archives: Julia Computing, Inc.

JuliaDB in Julia 1.0

In many data science applications, it is easy to run out of memory when working with data. An analyst working with big data has a few options available to them:

  1. Buy more RAM.
  2. Rent RAM with a cloud-based service.
  3. Use a sample of the dataset.
  4. Buy a SAS license.

None of these options are particularly good solutions.

Introducing JuliaDB

With JuliaDB, one can easily read big data, save it in an efficient binary format, and even run operations out-of-core. Analytics are available via OnlineStats integration, making statistical calculations on big data a breeze.

OnlineStats implements on-line (single-pass) algorithms for statistics and models, meaning you can run analyses like linear regression on data that is too big to fit in memory. Every statistic/model in OnlineStats also supports merging, enabling parallel processing. The combination of on-line updating/merging eliminates the need for the entire dataset to be loaded into RAM simultaneously, allowing analyses that would not be possible with traditional methods. Below is a visualization of how JuliaDB integrates with OnlineStats by scheduling the updating and merging operations:

[Figure: fitting and merging]
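The update-then-merge idea can be sketched in plain Julia. This is a hand-rolled running mean and not OnlineStats' own implementation; the names `RunningMean`, `update!`, `merge_means`, and `fit_chunk` are hypothetical and chosen only for illustration.

```julia
# A minimal running mean: a sketch of the idea, not OnlineStats' code.
mutable struct RunningMean
    n::Int        # number of observations seen so far
    μ::Float64    # current mean
end
RunningMean() = RunningMean(0, 0.0)

# Single-pass update: fold one observation into the running mean.
function update!(m::RunningMean, x::Real)
    m.n += 1
    m.μ += (x - m.μ) / m.n
    return m
end

# Merge two partial results computed on different chunks (or workers).
function merge_means(a::RunningMean, b::RunningMean)
    n = a.n + b.n
    n == 0 && return RunningMean()
    return RunningMean(n, (a.n * a.μ + b.n * b.μ) / n)
end

# Fit one chunk at a time, as if each chunk lived on a different worker.
fit_chunk(chunk) = foldl(update!, chunk; init = RunningMean())

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = map(fit_chunk, chunks)
total = reduce(merge_means, partials)

println(total.n, " observations, mean ≈ ", total.μ)
```

Each chunk is visited once and only the small `(n, μ)` state is kept, so no chunk has to stay in memory after it has been processed.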

Example

Kaggle’s Huge Stock Market Dataset contains over 7000 CSVs of historical price data, with each stock’s history in a separate file. JuliaDB can quickly load them into a distributed dataset and perform group-by operations:

using Distributed
addprocs(4)

@everywhere using JuliaDB, OnlineStats

using Glob  # glob is provided by Glob.jl; the snippet needs this import

# 7195 CSVs with 14,887,665 rows
files = glob("*.txt", "Stocks")

t = loadtable(files, filenamecol=:Stock)

groupreduce(Mean(), t, :Stock; select=:Volume)
Distributed Table with 7163 rows in 4 chunks:
Stock          Mean
──────────────────────────────────────────────
"a.us.txt"     Mean: n=4521 | value=3.9935e6
"aa.us.txt"    Mean: n=12074 | value=3.33776e6
"aaap.us.txt"  Mean: n=505 | value=1.49989e5
"aaba.us.txt"  Mean: n=5434 | value=2.3218e7
"aac.us.txt"   Mean: n=785 | value=2.20441e5
"aal.us.txt"   Mean: n=989 | value=1.00454e7
"aamc.us.txt"  Mean: n=1211 | value=17644.4
"aame.us.txt"  Mean: n=2926 | value=6715.76
"aan.us.txt"   Mean: n=3201 | value=7.13972e5
"aaoi.us.txt"  Mean: n=1041 | value=7.94087e5
"aaon.us.txt"  Mean: n=3201 | value=2.04258e5
"aap.us.txt"   Mean: n=3201 | value=1.22519e6
"aapl.us.txt"  Mean: n=8364 | value=1.06642e8
"aat.us.txt"   Mean: n=1717 | value=2.15145e5
"aau.us.txt"   Mean: n=3177 | value=1.86901e5
"aav.us.txt"   Mean: n=3199 | value=4.28732e5
"aaww.us.txt"  Mean: n=3162 | value=2.64112e5
"aaxn.us.txt"  Mean: n=3201 | value=1.40883e6
"ab.us.txt"    Mean: n=3201 | value=5.60349e5
⋮
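To make the grouped reduction above more concrete, here is a rough single-process sketch in plain Julia of what a group-by mean computes. It does not use JuliaDB; the `rows` data and the `group_mean` helper are hypothetical stand-ins for the loaded table and for `groupreduce`.

```julia
# Hypothetical rows standing in for the loaded table.
rows = [
    (Stock = "a.us.txt",  Volume = 100.0),
    (Stock = "aa.us.txt", Volume = 10.0),
    (Stock = "a.us.txt",  Volume = 300.0),
    (Stock = "aa.us.txt", Volume = 30.0),
    (Stock = "a.us.txt",  Volume = 200.0),
]

# Single pass over the rows: keep a (count, mean) accumulator per group key.
function group_mean(rows, key::Symbol, col::Symbol)
    acc = Dict{String,Tuple{Int,Float64}}()
    for row in rows
        k = getproperty(row, key)
        n, μ = get(acc, k, (0, 0.0))
        n += 1
        μ += (getproperty(row, col) - μ) / n
        acc[k] = (n, μ)
    end
    return acc
end

result = group_mean(rows, :Stock, :Volume)
for k in sort!(collect(keys(result)))
    n, μ = result[k]
    println(k, "  n=", n, " | value=", μ)
end
```

Because each group only stores a count and a mean, the same pattern extends to data that is streamed from disk one file at a time.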

Main Features

Just-in-Time Compiled

JuliaDB leverages Julia’s just-in-time compiler (JIT) so that table operations – even custom ones – are fast.

Compute in Parallel

Process data in parallel or even calculate statistical models out-of-core through integration with OnlineStats.jl.

Store Any Data Type

JuliaDB supports Strings, Dates, Float64… and any other Julia data type, whether built-in or defined by you.

Fast User-Defined Functions

JuliaDB is written 100% in Julia. That means user-defined functions are JIT compiled.
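As a small illustration of user-defined types and functions sitting next to built-in ones, the sketch below uses plain Julia columns rather than a JuliaDB table. The `PriceQuote` type, the `spread` function, and the sample values are all hypothetical.

```julia
using Dates

# A user-defined value type (hypothetical, for illustration only).
struct PriceQuote
    bid::Float64
    ask::Float64
end

# A user-defined function; Julia compiles a method specialized for PriceQuote.
spread(q::PriceQuote) = q.ask - q.bid

# Columns holding dates, strings, and the custom type side by side.
columns = (
    date   = [Date(2018, 1, 2), Date(2018, 1, 3), Date(2018, 1, 4)],
    ticker = ["aapl", "aapl", "aapl"],
    quotes = [PriceQuote(172.0, 172.5), PriceQuote(172.5, 172.75), PriceQuote(173.0, 174.0)],
)

# Apply the user-defined function to the custom-typed column.
spreads = map(spread, columns.quotes)
widest = columns.date[argmax(spreads)]
println("widest spread on ", widest)
```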

Fast CSV Parser

CSVs are loaded extremely fast! Many files can be read at the same time to create a single table.

Open Source

JuliaDB is released under the MIT License.

JuliaDB for Time Series

The ability to index (sort) on any number of columns and store any data type makes JuliaDB ideal for time series analysis. For a big data time series example, see the demo here.

Feature                  | JuliaDB | Pandas    | xts (R)   | TimeArrays
Distributed Computing    |         |           |           |
Data larger than memory  |         |           |           |
Multiple Indexes         |         |           |           |
Index Type(s)            | Any     | Built-ins | Time      | Time
Value Type(s)            | Any     | Built-ins | Built-ins | Any
Compiled UDFs            |         |           |           |
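Indexing on more than one column amounts to keeping rows sorted by a compound key so that range lookups become binary searches. The following is a minimal sketch in plain Julia, not JuliaDB's implementation; the rows, the `index_key` helper, and the prices are hypothetical.

```julia
using Dates

# Hypothetical price rows, deliberately out of order.
rows = [
    (ticker = "aapl", date = Date(2018, 1, 3), close = 172.2),
    (ticker = "a",    date = Date(2018, 1, 2), close = 67.6),
    (ticker = "aapl", date = Date(2018, 1, 2), close = 172.3),
    (ticker = "a",    date = Date(2018, 1, 3), close = 69.3),
    (ticker = "aapl", date = Date(2018, 1, 4), close = 173.0),
]

# Index on two columns: sort by the compound key (ticker, date).
index_key(r) = (r.ticker, r.date)
sorted = sort(rows; by = index_key)
ks = map(index_key, sorted)

# Binary search for the "aapl" rows between Jan 3 and Jan 4, inclusive.
first_hit = searchsortedfirst(ks, ("aapl", Date(2018, 1, 3)))
last_hit  = searchsortedlast(ks, ("aapl", Date(2018, 1, 4)))
window = sorted[first_hit:last_hit]

for r in window
    println(r.ticker, " ", r.date, " ", r.close)
end
```

Tuples compare lexicographically in Julia, which is what lets a single sorted key vector serve both the ticker lookup and the date range.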


Growing a Compiler – Getting to Machine Learning from a General Purpose Compiler

The Compilers for Machine Learning workshop was recently held at CGO 2019. Since compiler techniques affect a large part of the machine learning stack, this workshop aimed to highlight research that incorporates compiler techniques and algorithms in optimizing machine learning workloads. The workshop included talks from various projects: Julia (Julia Computing), TVM (UW), Glow (Facebook), XLA (Google), nGraph (Intel), TensorRT (Nvidia), and the soon-to-be-released MLIR (Google).

Our talk introduced the abstractions in the Julia language and the kind of compiler transforms involved in implementing them. We then had a deep dive into dynamic semantics + static analysis – our JAOT (Just-Ahead-Of-Time) analysis. Building on these capabilities, the Zygote system implements automatic differentiation, effectively treating it as a compiler problem, giving us differentiable programming for free. Finally, compiler backends for GPUs and TPUs give us high performance execution. All this comes together beautifully in Neural ODEs, which we had to show off as our first slide!
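For readers new to automatic differentiation, the toy below shows derivatives being produced mechanically from ordinary Julia code. It uses forward-mode dual numbers, which is a different and much simpler technique than the compiler-level approach Zygote takes; the `Dual` type and `derivative` helper are hypothetical and written only for this sketch.

```julia
# Forward-mode AD with dual numbers: a toy, and not how Zygote works.
struct Dual
    val::Float64   # function value
    der::Float64   # derivative carried alongside the value
end

# Arithmetic rules that propagate the derivative through each operation.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:+(a::Dual, b::Real) = Dual(a.val + b, a.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Differentiate any function built from the operations defined above.
derivative(f, x::Real) = f(Dual(x, 1.0)).der

# An ordinary Julia function; nothing in it mentions derivatives.
f(x) = 3 * x * x + sin(x) + 1

println(derivative(f, 2.0))
```

The function `f` is written once and evaluated on a different number type, so the derivative 6x + cos(x) falls out of multiple dispatch with no hand-written gradient.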

Our presentation is available online. A PDF is also available in case Google Docs are blocked.

JuliaTeam Vision

Julia Computing has a new product called JuliaTeam. Currently it lets developers install Julia packages behind a company’s firewall without hassles while hooking into authentication systems so that IT and management have insight and control over who’s using what Julia packages. We’ve written about it in our newsletters and customers are already trying it out.

Our plans for JuliaTeam are much bigger than this simple beginning, however: it will provide key infrastructure for the whole Julia ecosystem and be a mechanism for companies benefitting from that ecosystem to give back to the community, not out of some sense of altruism or charity but for good old-fashioned capitalistic reasons: it saves them money. The vision of JuliaTeam is to create a virtuous cycle between Julia’s open source ecosystem and companies deriving value from that ecosystem.

This is the JuliaTeam business model in simple terms: build high quality services for all Julia developers, give them to open source projects for free and charge companies for those services integrated nicely into their corporate environments. We see a lot of commercial demand for great services that make developers more productive; the very same productivity boost is also needed to accelerate open source development, which in turn provides greater value to everyone using Julia. In addition to including private versions of public services, the JuliaTeam product integrates smoothly into each company’s development environment—authentication systems, internal code hosting, issue trackers, application deployment—all wrapped up in a clear, high-level dashboard.

JuliaTeam will also give us visibility into which open source packages companies rely on. This helps us understand what the most important packages are and puts us in a position to ensure that these packages are maintained and funded appropriately. We envision a future where Julia Computing can reinvest revenue from JuliaTeam into the open source ecosystem, providing support to make sure that releases, bug fixes, documentation, and testing all meet the highest standards for software development.

Functionality

Enough high-level vision talk: what is JuliaTeam? What functionality does JuliaTeam already offer? What will it include in the future? Each capability of JuliaTeam comes in a free open source flavor and a corresponding paid enterprise flavor. The following list outlines some of these capabilities with an indication of whether each one is “done”, “in progress” or “on the roadmap”.

  • Installation [done]. JuliaTeam serves registry info, package source, binary dependencies and other artifacts. Instead of trying to fetch these from a smattering of different servers across the Internet which may go down, go away, or be compromised, Julia package clients fetch from a secure, maintained JuliaTeam instance. This avoids firewall issues and offers security, visibility and control. JuliaTeam replaces a mix of GitHub, GitLab, Bintray, random FTP servers, and other miscellaneous sources for serving code and release artifacts. Having a single managed service will be safer and more reliable and allow the Julia community to get valuable (anonymous) telemetry about how often packages are installed and upgraded and on what platforms; currently we don’t have access to this information. This JuliaTeam service is already in use by thousands of JuliaPro users — JuliaTeam serves up all packages installed by JuliaPro.

  • Backup [roadmap]. Long-term copies of all registered code will be stored in case someone deletes a package repo or original versions of required artifacts, whether on purpose or by accident. This prevents “left-pad” scenarios. JuliaTeam thereby minimizes the damage that a disgruntled or fat-fingered developer can cause. Since Julia package versions are addressed by identity and content-addressed in manifest files, once a version is saved, it never has to change. The JuliaTeam instance serving packages to JuliaPro users already provides some degree of protection from potential left-pad situations, but we plan to build out a fully resilient, long-term backup system for both public and private packages.

  • Security [in progress]. Ensure that if there are security vulnerabilities discovered in any Julia packages that you are using, you are made aware and can react in a timely fashion. There have been a number of security compromises in the package systems of open source programming languages in recent years. Having proactive auditing and notification when a compromise is detected will help keep our entire community safe.

  • Registration [done; was “in progress” at publication time]. The open source Julia ecosystem is currently in a slightly awkward state where packages are still registered with the old METADATA.jl repository, which is automatically synced to the new General package registry; the new registry is used by Julia ≥ 0.7 to install packages. This arrangement has the benefit that it allows users of both old and new Julia versions to continue to use all registered packages, but it’s getting to be time to register new package versions directly with the General registry. A new package registration process is under development, which JuliaTeam will include to support registration of private packages inside of organizations.

  • Compatibility [in progress]. For several months now, requests to register new versions of packages have been checked automatically and merged if they pass sanity checks including passing their own tests and not breaking the tests of any reverse dependencies that they claim to be compatible with. This has been very successful but the service is quite hard to keep running (it’s currently down and undergoing long term maintenance, for example). An updated version of this service will be integrated into both the public and enterprise JuliaTeam registration processes, which will help prevent unintended API breakage.

  • Documentation [done; was “in progress” at publication time]. Providing a single place to find all documentation for the Julia packages that you use. This service offers a single consistent place and way to host and publish package documentation. It also makes cross-linking docs between packages easy since they all live in the same place. Developers shouldn’t ever have to set up or think about the how of documentation hosting: they should just need to follow standard conventions for inline docs and then push their code. The docs service does the rest: cross-linked, searchable (see the next bullet point) docs are generated automatically. Those who are interested in following along or contributing are encouraged to check out the work-in-progress DocumentationGenerator.jl package.

  • Search [done; was “in progress” at publication time]. Currently search and discovery of packages is a serious pain point in the Julia ecosystem. JuliaTeam will provide integrated search of documentation and code for all packages. This will let you find the package that does what you need, whether it’s a public open source package or a private package that your organization uses—they’ll all be searchable in a single place. Code search will match the capabilities of the Google Code Search of old.

  • Refactoring [roadmap]. One of the major difficulties of refactoring code bases composed of many packages is updating all the components at the same time. The right tooling, however, can make this much less painful—even smooth. We envision being able to change an API and easily make a set of coordinated changes across all packages that use the API. The Julia community has already experimented with this approach with the FemtoCleaner bot which helped update public packages for Julia 1.0.

  • Upgrades [roadmap]. When doing Julia releases, we test all registered open source packages on release candidates and investigate any failures to determine whether we’ve inadvertently broken compatibility where we shouldn’t. Even when packages rely on unstable Julia internals, we make updates to those packages so that a working version already exists by the time the new Julia version is released. JuliaTeam will allow private packages to get the same kind of testing, which lets us guarantee that the next release of Julia won’t break your applications, enabling you to upgrade without fear.

  • Testing [roadmap]. Smart, Julia-specific continuous integration. Projects that use standard Julia testing tools will fit into JuliaTeam transparently, allowing us to present test results in a clear way that can be drilled into easily. It will also be possible to analyze code changes and prioritize tests so that tests that are most likely to fail are run first, massively reducing test time when tests fail, improving both developer productivity and utilization of CI resources.

  • Debugging [roadmap]. When some test fails during CI, it’s often a time-consuming nightmare to reproduce the failure in a way that can be debugged. Instead of wasting developer time trying to reproduce the circumstances of such a failure, we want to allow the developer to log into the CI server directly and get a Julia debug prompt right at the point where the failure occurred. With rr integration, we could even allow developers to step backward in time in the debugger from the error to find the origin of the problem. Imagine debugging with superpowers without the inhuman setup process.

  • Benchmarking [roadmap]. Julia users care about performance. It’s one of the major selling points of the language, after all. (“Come for the speed, stay for the multiple dispatch.”) Julia Base has infrastructure for measuring and tracking performance of code over time, but Julia itself is not the only thing that needs benchmarking. We’d like to build a similar system for benchmarking all kinds of Julia projects to help people make sure their code is fast and stays fast.
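The Backup item above relies on package versions being content-addressed, which is why a saved version never has to change. The sketch below illustrates that property with a toy in-memory archive. It is not JuliaTeam's design: Julia's manifests record a git tree hash, whereas this example simply hashes raw bytes with SHA-256, and `content_id`, `store!`, and `fetch_verified` are hypothetical names.

```julia
using SHA

# The identifier is derived from the bytes themselves.
content_id(data::Vector{UInt8}) = bytes2hex(sha256(data))

# Store content under its own hash; storing identical bytes again is a no-op.
function store!(archive::Dict{String,Vector{UInt8}}, data::Vector{UInt8})
    id = content_id(data)
    get!(archive, id, copy(data))
    return id
end

# Fetch content and check that it still matches its identifier.
function fetch_verified(archive::Dict{String,Vector{UInt8}}, id::String)
    data = archive[id]
    content_id(data) == id || error("content does not match its identifier")
    return data
end

archive = Dict{String,Vector{UInt8}}()
v1 = Vector{UInt8}("module Example end")
v2 = Vector{UInt8}("module Example; f() = 1; end")

id1 = store!(archive, v1)
id1_again = store!(archive, v1)   # same bytes, same identifier
id2 = store!(archive, v2)         # different bytes, different identifier

println(length(archive), " distinct versions archived")
```

Because the key is a function of the content, a backup can be verified on retrieval and an upstream deletion or modification cannot silently change what a given identifier refers to.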

As you can see, we have quite a few features planned for JuliaTeam. Some are already done. Many are well under way. And some of the most exciting ones that might completely transform the way Julia development is done are in the future. If you want to try JuliaTeam out today, contact us by emailing info+juliateam@juliacomputing.com. If you have ideas for other services and features that you feel that you and every other Julia developer need, please let us know. We will publish more blog posts in the future as we develop and release more features for JuliaTeam. We can’t wait to get these services built and help Julia developers everywhere make amazing software faster and easier!