Dr Donald Kinghorn (HPC and Scientific Computing)

Machine Learning and Data Science: Introduction

Written on April 28, 2017 by Dr Donald Kinghorn

This is the first post in a series that I have wanted to do for some time. Machine Learning and Data Science, fascinating! This is one of the most interesting areas of endeavor today. It has truly blossomed over the last few years. There are numerous high-quality tools and frameworks for handling and working with data sets, a seemingly endless number of application domains, and lots and lots of data. Data Science has become "a thing". Many universities are offering graduate programs in "Data Science". It has existed for ages as part of Statistics, Informatics, Business Intelligence/Analytics, Applied Mathematics, etc., but it is now taking on a multidisciplinary life of its own.

In this series I'll be exploring the algorithms and tools of Machine Learning and Data Science. It will include tutorials, guides, how-tos, reviews, "real world" applications, and whatever I feel like writing about. It will be documenting my own journey. Even though I am familiar with the mathematics, many of the algorithms and tools are new to me. This is partly a revival of the fascination I had with neural networks as a graduate student in the mid-1990s but never had the time to pursue. It's about time I did it!

What is Machine Learning

I wrote a blog post in 2016 titled What is Machine Learning. I looked at it again and decided I like the definition I came up with so here it is.

Machine Learning -- Machine Learning (ML) is a multidisciplinary field focused on implementing computer algorithms capable of drawing predictive insight from static or dynamic data sources using analytic or probabilistic models and using refinement via training and feedback. It makes use of pattern recognition, artificial intelligence learning methods, and statistical data modeling. Learning is achieved in two primary ways. Supervised learning is used where the desired outcome is known: an annotated data set or measured values are used to train/fit a model to accurately predict values or labels outside of the training set. This is basically “regression” for values and “classification” for labels. Unsupervised learning uses “unlabeled” data and seeks to find inherent partitions or clusters of characteristics present in the data. ML draws from the fields of computer science, statistics, probability, applied mathematics, optimization, information theory, graph and network theory, biology, and neuroscience.
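To make the two learning modes a little more concrete, here is a minimal sketch in plain Python with no libraries. The data values and the function names `fit_line` and `two_means` are made up for illustration and do not come from any particular framework. The supervised part fits a straight line by ordinary least squares and predicts a value outside the training set; the unsupervised part splits unlabeled one-dimensional points into two clusters with a bare-bones k-means loop.

```python
# Supervised learning (regression): labeled pairs (x, y) train a model
# that predicts y for an x outside the training set.
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Unsupervised learning (clustering): no labels, just look for groups.
def two_means(points, iterations=10):
    """Bare-bones 1-D k-means with k = 2, seeded from the min and max."""
    centers = [min(points), max(points)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            # Assign each point to its nearest center.
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up training data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
prediction = slope * 10.0 + intercept  # predict for x = 10, outside the training set

# Made-up unlabeled data with two obvious clumps.
centers, groups = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print(slope, intercept, prediction)
print(centers)
```

Real projects would normally use a library for both steps; the point here is only that the supervised fit needs the `ys` labels while the clustering step never sees any.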

What is Data Science

In my view Machine Learning is the primary "What" and Data Science is the primary "How". Data Science has historically been just another name for Statistics. However, I feel that it has become much more than that! Since about 2010 everything has changed. It used to be that the largest data set that a Statistician might have to work with was the US census data. That's a few hundred gigabytes, not small, but not so much by today's measure of "big data". The reality of any "real world" Machine Learning project is that a large part of the time and effort is going to be put into getting usable data ready for some machine learning algorithm. There is also the problem of just trying to understand what useful information your data has to offer and how you are going to model that. Data visualization is also an important part of that.
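As a small, hedged illustration of what "getting usable data ready" can mean in practice, the sketch below cleans a handful of made-up records in plain Python. The field names, the values, and the `clean` helper are all hypothetical, and dropping bad rows is only one possible policy (imputing missing values is another).

```python
# Hypothetical raw records, as they might arrive from a CSV export.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "48000"},      # missing value
    {"age": "29", "income": "n/a"},      # unparseable value
    {"age": " 41 ", "income": "61000"},  # stray whitespace
]


def clean(rows):
    """Keep only rows where every field parses as a number."""
    usable = []
    for row in rows:
        try:
            usable.append({key: float(value) for key, value in row.items()})
        except ValueError:
            continue  # drop rows with missing or malformed entries
    return usable


cleaned = clean(raw_rows)
print(len(raw_rows), "raw rows ->", len(cleaned), "usable rows")
```

Even in this toy case half of the rows are unusable as delivered, which is the kind of effort the paragraph above is describing.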

Here's my take on a definition for "Data Science":

Data Science -- Data Science is an emerging field of study establishing formalism and "best practices" for manipulating and extracting useful information from, possibly massive, data sets. It is a widely multidisciplinary field incorporating elements of:

  • Statistics and Probability
  • Machine Learning and Artificial Intelligence
  • Data Visualization
  • Numerical Analysis, Optimization, Applied Mathematics
  • Computer Science, DevOps, Programming, HPC
  • Domain Specific Knowledge (Sciences, Business, Finance, etc.)

It is defining best practices for Data Analysis and Reproducible Research.

Reproducible Research

This is important! There is a crisis in research reproducibility. The further away research literature is from peer-reviewed pure mathematics, the more likely it is to be irreproducible or just plain junk. The Data Science community seems to be making a serious effort to use and establish reproducible methodology.

  • Literate Programming (Documentation and code together)
  • Source control and Continuous Integration (DevOps!)
  • Environment reproducibility via Containerization with Docker ;-) (the wink is for those of you that have read my Docker posts)
  • Well maintained open source tools
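At the level of a single script or notebook, two small habits complement the items above: fixing the random seed and recording the environment the code ran in. The following is a minimal sketch using only the Python standard library; the names `run_experiment`, `environment_record`, and `SEED` are made up for illustration, and this is not a substitute for containers or source control.

```python
import platform
import random
import sys


def run_experiment(seed):
    """Stand-in for a real analysis: draws five pseudo-random numbers."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    return [rng.random() for _ in range(5)]


def environment_record():
    """Collect basic facts someone would need to rerun this."""
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }


SEED = 12345  # arbitrary value; what matters is that it is written down
first = run_experiment(SEED)
second = run_experiment(SEED)
print("identical runs:", first == second)
print(environment_record())
```

Saving the seed and the environment record next to the results gives a reader a fighting chance of reproducing them later.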

The Data Science community is remarkably "open" considering that a lot of Data Science gets done by large commercial interests.

Jupyter Notebooks

In keeping with the spirit of reproducibility I will be writing this series of blog posts as Jupyter notebooks and making them available on GitHub. This blog post so far is markdown code in a notebook cell.

In Jupyter I can easily add LaTeX code like this little formula

$$ R = \sum_{t=0}^{q-1}\frac{1}{q}\gamma_{3}\left( p-q \right) \gamma_{3} \left( t \right) \left( 1-\frac{c}{ab}\right) ^{q-t} $$

and have it display nicely in the notebook,

$$ R = \sum_{t=0}^{q-1}\frac{1}{q}\gamma_{3}\left( p-q \right) \gamma_{3} \left( t \right) \left( 1-\frac{c}{ab}\right) ^{q-t} $$

I can write Python and execute it in the document

In [4]:
print("Hello from Python3")
Hello from Python3

Make a plot

In [5]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

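N = 50  # number of points to draw
x = np.random.rand(N)  # random x coordinates in [0, 1)
y = np.random.rand(N)  # random y coordinates in [0, 1)
colors = np.random.rand(N)  # one value per point, mapped through the colormap
area = np.pi * (15 * np.random.rand(N))**2  # marker sizes (points^2) from random radii up to 15
plt.scatter(x, y, s=area, c=colors, alpha=0.7, cmap='PuBu')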

Good stuff like that ... and then use the notebook to generate HTML for the blog post.

Happy computing! --dbk

Tags: Machine Learning, Data Science, Python, Jupyter notebook, Programming
otto mondo

I have a quick question. Most of the work I do is in R. On the Intel website they describe the utility of recompiling the R distribution with their compiler suite so that Phi cards can be used. The performance improvement seems pretty dramatic.

Nevertheless, on your site, all of the accelerator options appear to be for Tesla cards. Are you currently making anything with Phi cards? Or has that ship already sailed and sunk with Tesla winning the race? I do believe the Sandia Exascale program is planning to use Phi processors in/on Cray hardware.

Posted on 2017-05-03 15:09:03
Donald Kinghorn

We don't do the Phi anymore; it was really kind of a pain for us. The new Phi stuff is pretty good and well optimized code can run well on it. But, yes, my feeling is that GPU acceleration is dominating. There will be some big Phi clusters going in though ... In any case there are a lot of workloads that are just fine on the CPU! If you can rebuild any code to link to MKL it is going to be a performance win. Intel has added a lot of code libs for machine learning acceleration too. AND the new Xeon Skylake should be a big leap forward in performance ... really looking forward to testing that!

Recompile R and link to MKL; it will surely be a big improvement on the CPU (as well as the Phi if that is your target)! Another option would be to use Microsoft R Open. That used to be Revolution Analytics R; it is optimized and should have MKL as an install option. You can install it on various Linux distributions and Windows. I haven't tried it myself but it was a really good project in the past.

Best wishes --Don

Posted on 2017-05-04 00:18:31