Build & Deploy Machine Learning Apps on Big Data Platforms with Microsoft Linux Data Science Virtual Machine

By: Cortana Intelligence and ML Blog Team

Re-posted from: https://blogs.technet.microsoft.com/machinelearning/2017/03/09/deploy-machine-learning-apps-to-big-data-platforms-with-linux-data-science-virtual-machine/

This post is authored by Gopi Kumar, Principal Program Manager in the Data Group at Microsoft.

This post covers our latest additions to the Microsoft Linux Data Science Virtual Machine (DSVM), a custom VM image on Azure, purpose-built for data science, deep learning and analytics. Offered in both Microsoft Windows and Linux editions, DSVM includes a rich collection of tools, seen in the picture below, and makes you more productive when it comes to building and deploying advanced machine learning and analytics apps.

The central theme of our latest Linux DSVM release is to enable the development and testing of ML apps for deployment to distributed, scalable platforms such as Spark, Hadoop and Microsoft R Server, for operating on data at a very large scale. With this release, DSVM also offers Julia Computing’s JuliaPro in both the Linux and Windows editions.

Here’s more on the new DSVM components you can use to build and deploy intelligent apps to big data platforms:

Microsoft R Server 9.0

Version 9.0 of Microsoft R Server (MRS) is a major update to enterprise-scale R from Microsoft, supporting parallel and distributed computation. MRS 9.0 supports analytics execution in the Spark 2.0 context. There’s a new architecture and simplified interface for deploying R models and functions as web services via a new library called mrsdeploy, which makes it easy to consume models from other apps using the open Swagger framework.

Local Spark Standalone Instance

Spark is one of the premier platforms for highly scalable big data analytics and machine learning. Spark 2.0 launched in mid-2016 and brings several improvements such as the revised machine learning library (MLlib), scaling and performance optimization, better ANSI SQL compliance and unified APIs. The Linux DSVM now offers a standalone Spark instance (based on the Apache Spark distribution) and a PySpark kernel in Jupyter to help you build and test applications on the DSVM and deploy them on large-scale clusters such as Azure HDInsight Spark or your own on-premises Spark cluster. You can develop your code using either Jupyter notebooks or the included Community Edition of the PyCharm IDE for Python or RStudio for R.

Single Node Local Hadoop (HDFS and YARN) Instance

To make it easier to develop Hadoop programs and/or use HDFS storage locally for development and testing, a single-node Hadoop installation is built into the VM. Also, if you are developing on Microsoft R Server for execution in Hadoop or Spark remote contexts, you can first test things locally on the Linux DSVM and then deploy the code to a remote scaled-out Hadoop or Spark cluster or to Microsoft R Server. These DSVM additions are designed to help you iterate rapidly when developing and testing your apps, before they get deployed into large-scale production big data clusters.

The DSVM is also a great environment for self-learning and running training classes on big data technologies. We provide sample code and notebooks to help you get started quickly on the different data science tools and technologies offered.

DSVM Resources

New to DSVM? Here are resources to get you started:

Linux Edition

Windows Edition

The goal of DSVM is to make data scientists and developers highly productive in their work and provide a broad array of popular tools. We hope you find it useful to have these new big data tools pre-installed with the DSVM.

We always appreciate feedback, so please send in your comments below or share your thoughts with us at the DSVM community forum.


Using Julia As A ‘Glue’ Language

By: Randy Zwitch

Re-posted from: http://randyzwitch.com/julia-odbc-jl/

While much of the focus in the Julia community has been on the performance aspects of Julia relative to other scientific computing languages, Julia is also perfectly suited to ‘glue’ together multiple data sources/languages. In this blog post, I will cover how to create an interactive plot using Gadfly.jl, by first preparing the data using Hadoop and Teradata Aster via ODBC.jl.

The example problem I am going to solve is calculating and visualizing the number of airplanes in the air during each hour in the U.S. for the year 1987. Because of the structure and storage of the underlying data, I will need to write some custom Hive code, upload the data to Teradata Aster via a command-line utility, re-calculate the number of flights per hour using a built-in Aster function, then use Julia to visualize the data.

Step 1: Getting Data From Hadoop

In a prior set of blog posts, I talked about loading the airline dataset into Hadoop, then analyzing the dataset using Hive or Pig. Using ODBC.jl, we can submit our Hive queries from Julia. The hardest part of setting up this process is making sure that you have the appropriate Hive drivers and credentials for your Hadoop cluster (which isn’t covered here). Once you have your DSN set up, running Hive queries is as easy as the following:

In this code, I’ve written my query as a Julia string, to keep my code easily modifiable. Then, I pass the Julia string object to the query() function, along with my ODBC connection object. This query runs on Hadoop through Hive, then streams the result directly to my local hard drive, making this a very RAM-efficient (though I/O-inefficient!) operation.
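The stream-to-disk pattern described above can be sketched independently of ODBC.jl. The following is a minimal illustration in Python, not the author's Julia code: the standard-library sqlite3 module stands in for the Hive ODBC connection, and the function names, table name and column names are all hypothetical. It fetches the result in fixed-size batches and appends each batch to a CSV file, so only one batch is held in memory at a time.

```python
import csv
import sqlite3


def stream_query_to_csv(conn, sql, path, batch_size=1000):
    """Run sql on a DB-API connection and write the result to path in batches.

    Only batch_size rows are held in memory at once; returns the row count.
    """
    cursor = conn.cursor()
    cursor.execute(sql)
    header = [col[0] for col in cursor.description]  # column names
    rows_written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:  # empty list means the result set is exhausted
                break
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written


def demo_connection():
    """In-memory SQLite database standing in for the Hive DSN (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights (flight_id INTEGER, dep_hour INTEGER)")
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?)",
        [(i, i % 24) for i in range(2500)],
    )
    return conn
```

Calling `stream_query_to_csv(demo_connection(), "SELECT * FROM flights", "flights.csv")` would write a header plus 2500 rows. The trade-off is the one noted above: memory use stays flat, at the cost of a round trip through the local disk.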

Step 2: Shelling Out To Load Data To Aster

Once I’ve created the file with my Hadoop results in it, I have a decision point: I can either A) do the rest of the analysis in Julia or B) use a different tool for my calculations. Because this is a toy example, I’m going to use Teradata Aster to do my calculations, which provides a convenient function called ‘burst()’ to regularize timestamps into fixed intervals. But before I can use Aster to ‘burst’ my data, I first need to upload it to the database.

While I could loop over the data within Julia and insert each record one at a time, Teradata provides a command-line utility to upload data in parallel. Running command-line scripts from within Julia is as easy as using the run() command, with each command surrounded in backticks:

While I could’ve run this at the command-line, having all of this within an IJulia Notebook keeps all my work together, should I need to re-run this in the future.
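The same shell-out pattern exists in most scripting languages. As a rough sketch of the idea in Python rather than Julia, the helper below runs an external command from inside a script or notebook and fails loudly if the command exits with an error. The `run_command` name and the `loader_stub` command are hypothetical: the stub is a child Python process that only echoes a file name, standing in for the real bulk-load utility, whose name and flags are not shown here.

```python
import subprocess
import sys


def run_command(args):
    """Run args (a list of strings, no shell involved) and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


# Stand-in for the real loader: a child process that echoes the file name.
loader_stub = [
    sys.executable,
    "-c",
    "import sys; print('loaded ' + sys.argv[1])",
    "flights.csv",
]
```

Passing the command as a list of arguments avoids shell quoting problems when file names contain spaces, and `check=True` means a failed upload stops the notebook instead of passing silently.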

Step 3: Using Aster For Calculations

With my data now loaded in Aster, I can normalize the timestamps to UTC, then ‘burst’ the data into regular time intervals. Again, all of this can be done via ODBC from within Julia:

Since it might not be clear what I’m doing here, the ‘burst()’ function in Aster takes a row of data with a start and end timestamp, and potentially returns multiple rows which normalize the time between the timestamps. If you’re familiar with pandas in Python, it’s similar functionality to ‘resample’ on a series of timestamps.
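To make the ‘burst’ idea concrete, here is a small Python sketch of the behavior described above: one row with a start and end timestamp becomes one row per clock hour the interval touches, and counting those rows gives planes in the air per hour. This is only an illustration of the concept under that assumption; it is not Aster's implementation, the exact boundary rules of ‘burst()’ may differ, and the function names and sample flights are invented.

```python
from collections import Counter
from datetime import datetime, timedelta


def burst_hourly(start, end):
    """Yield the start of every clock hour that the interval [start, end) touches."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        yield hour
        hour += timedelta(hours=1)


def planes_per_hour(flights):
    """Count flights in the air per clock hour.

    flights is an iterable of (departure, arrival) datetime pairs in one time zone.
    """
    counts = Counter()
    for departure, arrival in flights:
        counts.update(burst_hourly(departure, arrival))
    return counts
```

For example, a flight from 08:30 to 10:15 contributes to the 08:00, 09:00 and 10:00 buckets, while a flight from 09:05 to 09:50 contributes only to 09:00, so the 09:00 bucket holds two planes.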

Step 4: Download Smaller Data Into Julia, Visualize

Now that the data has been processed from Hadoop to Aster through a series of queries, we have a much smaller dataset that can be loaded into RAM and processed by Julia:

The Gadfly code above produces the following plot (using a d3.js backend for interactivity):

Since this chart is in UTC, it might not be obvious how to interpret the trend. Because the airline dataset represents flights either leaving or returning to the United States, there are far fewer planes in the air overnight and in the early morning hours (UTC 7-10, or 2-5am Eastern). During the hours when the airports are open, there appears to be a limit of roughly 2500 planes per hour in the sky.
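The UTC-to-Eastern conversion quoted above is a fixed five-hour shift when Eastern Standard Time applies. A short Python sketch of that arithmetic follows; it assumes a constant UTC-5 offset and ignores daylight saving time (under which the same UTC hours would fall one hour later locally), and the helper name and the sample date are arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Eastern Standard Time; daylight saving is deliberately ignored.
EST = timezone(timedelta(hours=-5), "EST")


def utc_hour_to_est(hour_utc):
    """Return the Eastern Standard Time hour (0-23) for a given UTC hour."""
    moment = datetime(1987, 1, 15, hour_utc, tzinfo=timezone.utc)
    return moment.astimezone(EST).hour
```

With this helper, UTC hours 7 and 10 map to 2 and 5, matching the 2-5am Eastern window in the text.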

Why Not Do All Of This In Julia?

At this point, you might be wondering why I went through all of this effort. Couldn’t this all be done in Julia?

Yes, you probably could do all of this work in Julia with a sufficiently large amount of RAM. As a proof-of-concept, I hope I’ve shown that there is much more to Julia than micro-benchmarking Julia’s speed relative to other scientific programming languages. You’ll notice that in none of my code have I used any type annotations, as none would really make sense (nor would they improve performance). And although this is a toy example purposely using multiple systems, I much more frequently use Julia in this manner at work than doing linear algebra or machine learning.

So next time you’re tempted to use Python or R or shell scripting or whatever, consider Julia as well. Julia is just as at-home as a scripting language as a scientific computing language.