Dr Donald Kinghorn (HPC and Scientific Computing)

Machine Learning and Data Science: Linear Regression Part 1

Written on May 26, 2017 by Dr Donald Kinghorn

Linear Regression is one of the fundamental methods of computational science. There are many books entirely dedicated to the subject and it is often taught as a course in graduate school. Does that mean it's hard to understand? No. It's not necessarily that hard to understand! However, it provides a very rich context for "details" that "can" go very deep. It can also be presented in many different contexts and is a useful tool for any kind of scientist or computer professional to know. I will focus on Linear Regression in the context and language of Machine Learning. It will provide a simple but rich and useful example of many of the key concepts and methods used in Machine Learning.

If you fully understand Linear Regression you will be well on your way to completely grasping modern Machine Learning, Data Analysis and AI!

I will break this discussion down into (hopefully) logical sections and key concepts. The work will be presented as Jupyter notebooks along with all of the code and data needed to reproduce the examples. There will be as many parts as are necessary. We will do an example using real data for housing prices in King County, Washington (where I live). Let's get started!

Quirks of the Language of Machine Learning and Statistics

BE FLEXIBLE WITH NOTATION! In general Mathematics is very concise and, of course, it's all about abstraction. "Things" are what you define them to be! The extreme example of trying to standardize notation and definitions is the old French Bourbaki. [ This was a bunch of mathematicians that collectively published under the name "Nicolas Bourbaki". Fascinating! ] They were obsessive about notation and definition. In my opinion Statistics is the opposite of that :-) Statistics always felt like a rogue field of mathematics to me personally [ don't get me wrong, I think Statistics is one of the most interesting fields of mathematics! ]. I believe some of the common statistics notation and language is part of what can make it difficult to understand. This applies to Machine Learning literature too, since one of the short definitions of Machine Learning is that it is Statistics.

I have noticed that people without much mathematics background can really get hung-up badly on notation. Keep in mind that notation is just visual symbols that have no meaning in and of themselves. It's the values/functions that you assign to them that have meaning. If you ever get lost it's good to write down the symbols that are being used, be sure you know what they actually refer to, and note their "sizes" and counting indices (index symbols like subscript or superscript i, j, k, (i), etc.).

Linear Regression is "Supervised Learning"

In Machine Learning speak, Linear Regression is a type of supervised learning. By "supervised learning" we mean that you have a set of data (a training set) where you know the "right answers". That is, for some feature (input) you know a corresponding "answer" (output) to some question for every data point in your training set. For example, you know the price a house sold for (answer) and the square foot living space of that house (feature).

When you look at your data you will have some expected or presumed relationship between those features and answers (input and output). You will be trying to find a model function of that relationship that will let you predict answers for new input data that was not in your training data. You are trying to find a (linear) model that "fits" the data according to some "goodness" criteria. The "fit" model hopefully represents the data realistically and gives a way to generalize the data and estimate missing/unknown values and perhaps predict future trends. The training or fitting is the Machine Learning part. The model that can be used to predict things is the "Artificial Intelligence" part (it may be pretty stupid "intelligence" :-).

The word regression refers to finding the best parameters for a model of data that is continuous. For example, the price of a house given its size in square feet, or the change of a stock price over time, or the change in blood pressure based on your salt intake, etc. The "answer" has a continuous range of values.

Another common type of supervised learning is "Classification". That would be the case where your data is labeled by some kind of discrete categories. For example, a set of images of cats and dogs. With this kind of data you are more concerned with finding a model that describes the boundaries between the categories.

Example of Classification vs Regression

Say you have a set of images of people and for each image you know the age and sex of the person. There are two kinds of models you might be training with this data. Regression: If you are training a model to predict the age of a person from their photo then you are doing Regression. Classification: If you are trying to train a model to predict a person's sex from a photo then that is Classification.

What does Linear mean?

There are mathematical definitions of what linear really means but it is OK to go with the intuitive idea of "line like", i.e. not curving. It is a property of how things "combine" too (combining with a simple sum). For functions (mappings from one thing to another) we can write "linear functions". If our function only has one variable (or feature) then it can be represented as the equation of a straight line, $y=f(x)=ax+b$. If there are two variables or features then it would look like a flat plane (like a table top). You can have more variables for a linear function but you can't easily visualize it. The linear regression problem we look at first will just have one variable (for now), i.e. it will result in a straight line. I will describe this function in more detail in the next post in this series but for right now I want to show what it is and make a comment about how to start thinking about it and the language we will use to talk about it.
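Here is a minimal Python sketch of the "line like" idea. The function names `line` and `plane` and all of the coefficient values are made up for illustration; they are not part of the housing example. The point is only that equal steps in the input always give equal steps in the output, no matter where you start.

```python
def line(x, a=2.0, b=1.0):
    """One feature: y = a*x + b, a straight line (illustrative coefficients)."""
    return a * x + b


def plane(x1, x2, a1=2.0, a2=-3.0, b=1.0):
    """Two features: a flat plane, each feature contributes through a simple sum."""
    return a1 * x1 + a2 * x2 + b


# Step x by 1 from three very different starting points.
# The change in y is the slope a every time: no curving.
steps = [line(x + 1.0) - line(x) for x in (0.0, 5.0, 100.0)]
print(steps)  # [2.0, 2.0, 2.0]

# The two-feature version is evaluated the same way.
print(plane(1.0, 1.0))  # 0.0
```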

Our linear regression model function

$$ h_a(x) = a_0 + a_1x $$
  • $ a = \{a_0, a_1\}$ is a set of parameters for the function $h$. We want to find those parameters.
  • $x$ is our feature variable (which will be the size of a house in square feet in our example problem).
  • The "value" of $h_a(x)$ will be the price that our model predicts i.e. $h$ is a function that maps house sizes to prices.

We are using the symbol $h$ because in Machine Learning this is usually referred to as the hypothesis. $h$ is just an equation for a straight line with a "slope" of $a_1$ and an "intercept" of $a_0$. We can say that $h$ is a linear function of $x$ that is parameterized by $a_0$ and $a_1$. [graph of a line with a0=1 and a1=1]
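As a small sketch, the hypothesis can be written directly as a Python function. The name `h` and the argument order are my choice for illustration; the parameter values $a_0=1$, $a_1=1$ match the line in the graph placeholder above.

```python
def h(x, a0, a1):
    """Hypothesis h_a(x) = a0 + a1*x: intercept a0, slope a1."""
    return a0 + a1 * x


# Parameters for the example line: intercept 1, slope 1.
a0, a1 = 1.0, 1.0

# Evaluate the line at a few x values.
predictions = [h(x, a0, a1) for x in (0.0, 1.0, 2.0)]
print(predictions)  # [1.0, 2.0, 3.0]
```

Notice that `h` takes the parameters as arguments alongside `x`. That makes it easy to hold `x` fixed and vary `a0` and `a1` later on.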

Data representation

Our data will be represented as a set of pairs of values. $$ (x, y) $$

  • $x$ is the feature (size of house in square feet)
  • $y$ is the selling price of the house
  • $ x^{(i)},y^{(i)}$ would be the $i^{th}$ data pair. I know using that superscript for an index is a little ugly but later we will need subscripts too!
  • $m$ will be the number of data points. For the data we will use it will be $m=21613$.
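To make the notation concrete, here is a rough sketch of how these symbols could map onto plain Python lists. The four data pairs are invented round numbers, not values from the King County data set, and I am assuming the index $i$ counts from 1 while Python lists count from 0.

```python
# Invented (size, price) pairs for illustration only; NOT the real housing data.
x = [1000.0, 1500.0, 2000.0, 2500.0]          # feature: size in square feet
y = [200000.0, 310000.0, 390000.0, 500000.0]  # answer: selling price

# m is the number of data points.
m = len(x)

# The i-th data pair (x^(i), y^(i)), assuming i starts at 1.
i = 2
pair = (x[i - 1], y[i - 1])

print(m, pair)  # 4 (1500.0, 310000.0)
```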

I will repeat these definitions when we look closer at the model and start doing the training. For now I just wanted you to see this notation.

The Key to Understanding Machine Learning and Data Analysis!

You need to be able to shift your mental focus from DATA to FEATURES to MODEL PARAMETERS and not get lost in the notation and indexes. If you can do that while you are looking at formulas, algorithms and numbers you will do OK.

$h_a(x)$ above can also be thought of as a function of its parameters $a_0$ and $a_1$!! This is very important! It can be confusing. The subscript $a$ is to remind you of the parameters. You can look at how the line represented by $h_a(x)$ changes when you change the values of the parameters $a_0$ and $a_1$. When you are training a model you are fitting the model to your data by varying the values of the parameters (or weights) to find values that minimize the errors between the predicted values given by $h_a(x)$ and the known values in your data set. Your focus during the training is on the parameters, not the features $x$. You do this by deriving a new function called the Cost Function (or error function) that uses $h_a$. The Cost Function is a function of the parameters $a$.
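Here is a hedged sketch of that shift in focus. The actual cost function is not defined in this post, so I am assuming one common choice, the mean of the squared errors, purely for illustration. The tiny data set is also made up (it lies exactly on the line with $a_0=1$, $a_1=2$). The data stays fixed and only the parameters change.

```python
def h(x, a0, a1):
    """Hypothesis h_a(x) = a0 + a1*x."""
    return a0 + a1 * x


def cost(a0, a1, xs, ys):
    """Mean squared error between predictions and known answers.

    ASSUMPTION: mean squared error is one common cost function,
    used here only as an example.
    """
    m = len(xs)
    return sum((h(x, a0, a1) - y) ** 2 for x, y in zip(xs, ys)) / m


# Made-up data that lies exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Same data every time; only the parameters (a0, a1) vary.
for a0, a1 in [(1.0, 1.0), (0.0, 2.0), (1.0, 2.0)]:
    print(a0, a1, cost(a0, a1, xs, ys))
# 1.0 1.0 3.5
# 0.0 2.0 1.0
# 1.0 2.0 0.0
```

The cost drops to zero when the parameters match the line the data came from. With real data it will not reach zero, but the idea is the same: look for the parameter values that make it as small as possible.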

Got that? :-) It will all become clearer and clearer as we work through the problem. I promise!

I'm going to stop here. Your main take-away is seeing some of the notation and language that is often used in Machine Learning and getting an idea of what linear regression is.

In the next post I'll show you where to get the dataset for the King County housing prices. That is a big data set with lots of features. I will show you how to get the data into Python and extract the actual features we want, i.e. the size of the house in square feet and the selling price. We'll make some plots and start playing with the data. After that I will look in more detail at the regression model, cost function and what it means to find the "best" parameters.

Happy computing --dbk

Tags: Machine Learning, Data Science, Python, Jupyter notebook, Programming