Author Archives: Julia Computing, Inc.

CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

The very first task in any data analysis workflow is simply reading the
data, and this absolutely must be done quickly and efficiently so the
more interesting work can begin. Across many industries and domains, the
CSV file format is king for storing and sharing tabular data. Loading
CSVs fast and robustly is crucial, and it must scale well across a wide
variety of file sizes, data types, and shapes. This post compares the
performance of three CSV parsers on 8 different real-world datasets:
R’s fread, Pandas’ read_csv, and Julia’s CSV.jl. Each was chosen as the
“best in class” CSV parser for R, Python, and Julia, respectively.

All three tools have robust support for loading a wide variety of data
types with potentially missing values, but only fread (R) and CSV.jl
(Julia) support multithreading; Pandas supports only single-threaded CSV
loading. Julia’s CSV.jl is further unique in that it is the only tool
that is fully implemented in its higher-level language rather than being
implemented in C and wrapped from R / Python. (Pandas does have a
slightly more capable Python-native parser, but it is significantly
slower, and nearly all uses of read_csv default to the C engine.) As
such, the CSV.jl benchmarks here not only represent the speed of loading
data in Julia, but also indicate the sort of performance that is
possible in the subsequent Julia code used in the analysis.

The following benchmarks show that Julia’s CSV.jl is 1.5 to 5 times
faster than Pandas even on a single core; with multithreading enabled,
it is as fast as or faster than R’s fread. The tools used for
benchmarking were BenchmarkTools.jl for Julia, microbenchmark for R, and
timeit for Python.
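
As a rough illustration of the Python side of this setup, the sketch
below times a CSV parse with timeit.repeat and reports the best run. It
is not the benchmark script used for this post: the tiny in-memory
dataset, the helper names make_csv, parse_csv and best_time, and the use
of the standard-library csv module in place of Pandas’ read_csv are all
assumptions made to keep the example self-contained.

```python
import csv
import io
import random
import timeit


def make_csv(n_rows=1000, n_cols=20, seed=0):
    """Build a small uniform-float CSV in memory (hypothetical stand-in data)."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"col{i}" for i in range(n_cols)])
    for _ in range(n_rows):
        writer.writerow([f"{rng.random():.6f}" for _ in range(n_cols)])
    return buf.getvalue()


def parse_csv(text):
    """Parse CSV text into a header and rows of floats (stand-in for read_csv)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [[float(x) for x in row] for row in reader]
    return header, rows


def best_time(func, repeat=5, number=3):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


data = make_csv()
seconds = best_time(lambda: parse_csv(data))
print(f"best of 5: {seconds * 1e3:.2f} ms per parse")
```

Taking the minimum over several repeats is a common way to reduce noise
from other processes; whether the original benchmarks reported the
minimum, the median, or the mean is not stated in this post.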

Homogeneous data

Let’s start with some homogeneous datasets, i.e. datasets which have the
same kind of data in all columns. The datasets in this section, apart
from the stock price dataset, are derived from this benchmark site. The
performance metric is the time taken to load a dataset as the number of
threads is increased from 1 to 20. Since Pandas does not support
multithreading, its single-threaded speed is reported for all thread
counts.

Performance on Homogeneous Datasets:

Uniform Float dataset: The first dataset contains float values
arranged in 1 million rows and 20 columns. Pandas takes 232 milliseconds
to load this file. Single-threaded data.table is 1.6 times faster than
CSV.jl. With multithreading, CSV.jl is at its best, more than double the
speed of data.table. CSV.jl is 1.5 times faster than Pandas without
multithreading, and about 11 times faster with it.
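
To make the ratios above concrete, the following sketch converts a
“times faster” figure into an implied load time, using only the numbers
quoted for this dataset. The helper name implied_time_ms is
hypothetical, and the resulting CSV.jl times are back-of-the-envelope
estimates derived from rounded ratios, not measured values.

```python
def implied_time_ms(baseline_ms, times_faster):
    """Load time implied by being `times_faster` than a baseline time."""
    return baseline_ms / times_faster


PANDAS_MS = 232.0  # Pandas load time quoted for the uniform float dataset

single = implied_time_ms(PANDAS_MS, 1.5)   # CSV.jl, one thread (estimate)
multi = implied_time_ms(PANDAS_MS, 11.0)   # CSV.jl, multithreaded (estimate)

print(f"CSV.jl single-threaded: roughly {single:.0f} ms")
print(f"CSV.jl multithreaded:   roughly {multi:.0f} ms")
```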

Uniform String dataset (I): This dataset contains string values in
all columns and has 1 million rows and 20 columns. Pandas takes 546
milliseconds to load the file. With R, adding threads doesn’t seem to
lead to any performance gain. Single-threaded CSV.jl is 2.5 times faster
than data.table. At 10 threads, it is about 14 times faster than
data.table.

Uniform String dataset (II): The dimensions of this dataset are the
same as those of the one above. However, every column has missing values
as well. Pandas takes 300 milliseconds. Without threading, CSV.jl is 1.2
times faster than R; with threading, it is about 5 times faster.

Apple stock prices:

This dataset contains 50 million rows and 5 columns, and is 2.5GB. Each
row holds the open, high, low, and close prices for AAPL stock. The four
columns with prices are float values, and there is a date column.

Single-threaded CSV.jl is about 1.5 times faster than R’s fread from
data.table. With multithreading, CSV.jl is about 22 times faster! Pandas’
read_csv takes 34s to read the file, which is slower than both R and
Julia.

Performance on Heterogeneous Datasets

Mixed dataset: This dataset has 10k rows and 200 columns. The
columns contain String, Float, DateTime, and missing values. Pandas
takes about 400 milliseconds to load this dataset. Without threading,
CSV.jl is 2 times faster than R, and it is about 10 times faster with 10
threads.
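
For readers who want to try a parser on data of a similar shape, the
sketch below writes a small CSV with String, Float, and DateTime columns
and randomly missing (empty) fields. It is not the generator used for
the benchmark file: the column layout, the 10% missing rate, and the
helper name make_mixed_csv are assumptions, and the row and column
counts are scaled down.

```python
import csv
import io
import random
from datetime import datetime, timedelta


def make_mixed_csv(n_rows=100, n_cols=12, missing_rate=0.1, seed=0):
    """Return CSV text whose columns cycle through String, Float, DateTime."""
    rng = random.Random(seed)
    start = datetime(2020, 1, 1)
    makers = [
        # String column: 8 random lowercase letters
        lambda: "".join(rng.choice("abcdefgh") for _ in range(8)),
        # Float column: fixed-precision decimal
        lambda: f"{rng.uniform(0, 1000):.4f}",
        # DateTime column: ISO 8601 timestamp
        lambda: (start + timedelta(seconds=rng.randrange(10**7))).isoformat(),
    ]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"col{i}" for i in range(n_cols)])
    for _ in range(n_rows):
        row = []
        for j in range(n_cols):
            # An empty field is how a missing value is represented here.
            row.append("" if rng.random() < missing_rate else makers[j % 3]())
        writer.writerow(row)
    return buf.getvalue()


text = make_mixed_csv()
rows = list(csv.reader(io.StringIO(text)))
n_missing = sum(field == "" for row in rows[1:] for field in row)
print(f"{len(rows) - 1} rows, {len(rows[0])} columns, {n_missing} missing fields")
```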

Mortgage risk dataset

Now, let’s look at a wider dataset. This mortgage risk dataset from
Kaggle is a mixed type dataset, with 356k rows and 2190 columns. The
columns are heterogeneous and have values of types String, Int, Float,
Missing. Pandas takes 119s to read in this dataset. Single-threaded
fread is about twice as fast as CSV.jl. However, with more threads Julia
is either as fast as or slightly faster than R.

Wide dataset: This is a considerably wider dataset with 1000 rows
and 20,000 columns. The dataset contains String and Int values. Pandas
takes 7.3 seconds to read the dataset. In this case, single-threaded
data.table is about 5 times faster than CSV.jl. With more threads,
CSV.jl is competitive with data.table. Increasing the number of threads
doesn’t seem to result in any performance gain in the case of
data.table.

Fannie Mae Acquisition dataset: This dataset can be downloaded from
the Fannie Mae site here. The dataset has 4 million rows and 25 columns
and values of types Int, String, Float, Missing.

Single-threaded data.table is 1.25 times faster than CSV.jl. However,
the performance of CSV.jl keeps improving with more threads: CSV.jl gets
about 4 times faster with multithreading.

Summary Charts:

Across all eight datasets, Julia’s CSV.jl is always faster than Pandas,
and with multi-threading it is competitive with R’s data.table.

System Info: The specs of the system on which the benchmarking was
performed are listed below.

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic
$ uname -a
Linux antarctic 5.6.0-custom+ #1 SMP Mon Apr 6 00:47:33 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:            4
CPU MHz:             800.225
CPU max MHz:         3000.0000
CPU min MHz:         800.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39
$ free -h
        total   used    free   shared  buff/cache   available
Mem:      62G   3.3G    6.3G     352K        52G         58G
Swap:     59G   3.2G     56G

JuliaTeam v2.0 Now Available

JuliaTeam provides enterprises using Julia with governance and private package management. New features in JuliaTeam v2.0 include:

  1. New dashboard page with information about new packages, recently updated packages, popular packages and popular tags

  2. Improved search and license detection

  3. New details page for individual packages with rich description including usage, dependencies, reverse dependencies, contributors and documentation

  4. Simpler registration workflow for GitLab registries

  5. Dark mode UI

  6. JuliaRun: Multithreading support, job comparison, output handling and UI improvements

  7. JuliaRun: Use unregistered packages for jobs

Newsletter June 2020

Pfizer uses Julia to accelerate simulations of new therapies for
metabolic diseases up to 175x. More information is available here.

JuliaCon 2020 Goes Online – Register for Free Now to Reserve Your
Spot: Due to the COVID-19 global pandemic, JuliaCon 2020 has moved
online. Please click here to reserve your spot for July 29-31 and for
more information. Julia Computing staff will be available online via
email and Slack during the conference to answer questions.

2020 Julia User & Developer Survey: The second annual Julia User &
Developer Survey is open now. The survey is available in four languages:
English, Chinese, Japanese and Spanish. Please click here to respond and
please forward the link to friends and colleagues who use Julia. Results
will be presented during JuliaCon.

JuliaTeam v1.5 Now Available: JuliaTeam v1.5 is now available.
JuliaTeam provides enterprises using Julia with governance and private
package management. New features in JuliaTeam v1.5 include:

1. New dashboard page with information about new packages, recently
updated packages, popular packages and popular tags

2. Improved search and license detection

3. New details page for individual packages with rich description
including usage, dependencies, reverse dependencies, contributors and
documentation

4. Simpler registration workflow for GitLab registries

5. Dark mode UI

6. JuliaRun: Multithreading support, job comparison, output handling and
UI improvements

7. JuliaRun: Use unregistered packages for jobs

New Free Julia Courses on JuliaAcademy: New Julia courses available
for free through JuliaAcademy include Julia for Data Science (Dr. Huda
Nassar) and Introduction to DataFrames.jl (Dr. Bogumił Kamiński). Other
free Julia courses available through JuliaAcademy include Introduction
to Julia (Dr. Jane Herriman), Foundations of Machine Learning (Dr. Chris
Rackauckas), Deep Learning with Flux.jl (Dr. Matt Bauman), Parallel
Computing (Dr. Matt Bauman) and the World of Machine Learning with Knet
(Dr. Deniz Yuret).

Julia Is Used to Project Hospital Utilization During COVID-19: A
new paper published in the Proceedings of the National Academy of
Sciences uses Julia to project hospital utilization during COVID-19. The
paper is available here and the code is available on GitHub.

Julia for Machine Learning Book Published – Discount Code Available: Dr. Zacharias Voulgaris has published Julia for Machine Learning which joins his earlier volume Julia for Data Science. Both books are available for purchase in PDF, print or ebook. The print and PDF versions are available with a 25% discount using the code JuliaJune.

Stack Overflow 2020 Developer Survey: Stack Overflow has published
their 10th annual developer survey of more than 66,000 developers. Julia
ranked sixth at 62.2% on the list of languages that developers are
currently using and would like to continue using.

Graph Processing Benchmarks: Timothy Lin has published new
benchmarks of LightGraphs.jl vs. Igraph, Graph-tool, SNAP and NetworKit.
More information is available here.

UMB Grants Pumas-AI Exclusive License for Lyv: University of
Maryland, Baltimore (UMB) has granted an exclusive license for Lyv, a
dosing app built in Julia, to Pumas-AI. Pumas-AI is a pharmacology and
pharmacometrics startup created by Julia Computing and UMB Professors
Vijay Ivaturi and Joga Gobburu. Lyv uses patient data to provide
personalized and continuously optimized dosing recommendations for
pharmaceutical treatment.

Context Free Interviews Julia Co-Creators and Julia Computing
Co-Founders Viral Shah and Jeff Bezanson: Context Free interviewed Viral
Shah and Jeff Bezanson about Julia. You can watch the full interview
here.

Julia Event Calendars: For the latest information about upcoming
online events, please visit the Julia Computing Calendar of Events and
the Julia Language Calendar of Events.

Julia Computing Enterprise Products

  • JuliaSure: JuliaSure from Julia Computing provides full service
    development support, production support and indemnification for
    companies using Julia. Subscriptions are USD $99 per month. Click
    here to subscribe.

  • JuliaTeam: JuliaTeam from Julia Computing lets your entire
    enterprise work together using Julia. Collaborate, develop and
    manage private and public packages across your organization, manage
    open source licenses and benefit from continuous integration,
    deployment, security, indemnity and enterprise governance. Click
    here for more information.

  • JuliaRun: JuliaRun from Julia Computing helps you scale and deploy
    Julia using high performance computing (HPC) resources, including
    large parallel simulations and analyses in the cloud: AWS, Microsoft
    Azure or Google Cloud. Click here for more information.

  • Pumas: Pumas from Julia Computing is a comprehensive platform for
    pharmaceutical modeling and simulation, providing a single tool for
    the entire drug development pipeline. Click here for more
    information.

Upcoming Julia Computing Online Trainings, Microtrainings and
Webinars: Julia Computing provides a number of online Trainings,
Microtrainings and Webinars. All are conducted by advanced Julia
Computing instructors. Click the links below to register.

Title | Date and Time | Event Type | Instructor | Cost
Introduction to Julia | Wed June 10 & Thu June 11, 11 AM – 3 PM Eastern (US) | Training | Julia Computing | $250
Bullish on Fast Code? Julia for Finance | Fri June 12, 12 Noon – 1 PM Eastern (US) | Webinar | Dr. Matt Bauman, Julia Computing | Free
Introduction to Machine Learning and Artificial Intelligence in Julia | Wed June 17 & Thu June 18, 11 AM – 3 PM Eastern (US) | Training | Julia Computing | $500
How to Build a Model Using Flux and Productionize It | Fri June 19, 12 Noon – 1 PM Eastern (US) | Microtraining | Dhairya Gandhi, Julia Computing | Free
Parallel Computing in Julia | Wed June 24 & Thu June 25, 11 AM – 3 PM Eastern (US) | Training | Julia Computing | $500
Pharmacology and Pharmacometrics Using Pumas.jl | Tue June 30, 12 Noon – 1 PM Eastern (US) | Webinar | Dr. Vijay Ivaturi, Pumas.ai, University of Maryland School of Pharmacy | Free

Julia and Julia Computing in the News

  • Entwickler: Julia Ist Bei Einfachen Machine-Learning-Aufgaben mit Python Vergleichbar, Aber Besser Geeignet für Komplexere

  • JAXenter: Julia Is Comparable to Python for Simple Machine Learning Tasks and Better for Complex Ones

  • Analytics India: How Differentiable Programming Helps in Complex Computational Models

  • TechCentral: Julia vs. Python – Which Is Best for Data Science?

  • HPCWire: Julia Researchers Create Verifiable Neural Net in Julia for COVID-19 Epidemiology

  • Analytics India: Top 9 Machine Learning Frameworks in Julia

  • Analytics India: Top 10 Speakers at Plugin 2020

  • Analytics Insight: Decoding the Popularity of Jupyter Among Data Scientists

  • Analytics Insight: Top 10 Data Science Experts to Follow on Twitter

  • ZDNet: Programming Languages: Developers Reveal What They Love and Loathe, and What Pays Best

  • Towards Data Science: Bye-Bye Python. Hello Julia!

  • UMB: UMB Grants Pumas-AI Exclusive License for Lyv, a Cutting-Edge Clinical Decision Support System

  • Proceedings of the National Academy of Sciences: Projecting Hospital Utilization During the COVID-19 Outbreaks in the United States

  • Analytics India: AIM’s Virtual Conference – Plugin Day 2 Highlights

  • Business Insider: The 14 Most Loved Programming Languages, According to a Study of 65,000 Developers

Julia Blog Posts

Upcoming Julia Online Events

Recent Julia Online Events

Julia Jobs, Fellowships and Internships

Do you work at or know of an organization looking to hire Julia programmers as staff, research fellows or interns? Would your employer be interested in hiring interns to work on open source packages that are useful to their business? Help us connect members of our community to great opportunities by sending us an email, and we’ll get the word out.

There are hundreds of Julia jobs currently listed on Indeed.com. Click
here to apply.

Contact Us: Please contact us if you wish to:

  • Purchase or obtain license information for Julia Computing products such as JuliaSure, JuliaTeam, JuliaRun or Pumas

  • Obtain pricing for Julia consulting projects for your organization

  • Schedule online Julia training for your organization

  • Share information about exciting new Julia case studies or use cases

  • Spread the word about an upcoming online event involving Julia

  • Partner with Julia Computing to organize a Julia event online

  • Submit a Julia internship, fellowship or job posting

About Julia and Julia Computing

Julia is the fastest high performance open source computing language for data, analytics, algorithmic trading, machine learning, artificial intelligence, and other scientific and numeric computing applications. Julia solves the two language problem by combining the ease of use of Python and R with the speed of C++. Julia provides parallel computing capabilities out of the box and unlimited scalability with minimal effort. Julia has been downloaded more than 15 million times and is used at more than 1,500 universities. Julia co-creators are the winners of the 2019 James H. Wilkinson Prize for Numerical Software and the 2019 Sidney Fernbach Award. Julia has run at petascale on 650,000 cores with 1.3 million threads to analyze over 56 terabytes of data using Cori, one of the ten largest and most powerful supercomputers in the world.

Julia Computing was founded in 2015 by all the creators of Julia to develop products and provide professional services to businesses and researchers using Julia.