Machine Learning and Data Science: Logistic Regression Implementation







Logistic Regression: Implementation and Evaluating a Trained Model

I’ve done two earlier posts on Logistic Regression, Logistic Regression Theory and Logistic and Linear Regression Regularization. Those posts included general discussion of logistic regression and derivation of the equations with the purpose of giving insight on how it works. Now it’s time to make it work!

In this post I’ll do a simple implementation of the logistic regression equations with Python. This will just be a “calculator” style implementation in this Jupyter notebook. Most statistics packages will include logistic regression functions. The Python modules scikit-learn and statsmodels both have implementations. The purpose of the code in this blog post is to make it clear how some of the functionality of those “black-boxes” really works.
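For comparison, here is what fitting the same kind of model looks like through one of those packages. This is only a rough sketch with placeholder data (x is some feature array, y the 0/1 labels); note that scikit-learn applies L2 regularization by default, so a large C is used here to approximate an unregularized fit.

import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([20., 30., 40., 50., 60.])   # placeholder feature values
y = np.array([0, 0, 1, 0, 1])             # placeholder 0/1 labels

clf = LogisticRegression(C=1e6)           # large C ~ effectively no regularization
clf.fit(x.reshape(-1, 1), y)              # scikit-learn adds the intercept term itself
print(clf.intercept_, clf.coef_)          # bias term and feature weight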

Before I load up the Python modules and try some examples I want to discuss one more bit of “theory”, the very important problem of deciding if your model is any good or not!

This post, along with all of the others in this series, was converted to HTML from a Jupyter notebook. The notebooks are available at https://github.com/dbkinghorn/blog-jupyter-notebooks

Evaluating the quality of the trained Logistic model

This is the last important concept that needs to be discussed before we get on to some code and examples.

In Linear Regression we used the value of the cost function and the statistic $R^2$ to evaluate the goodness of fit. We also used a test-set to see how well the model function generalized and thus its predictive value. For Logistic Regression there isn’t a direct substitute for $R^2$, but there are several standard tests that are used. You will see discussion of Deviance (and variations on that theme), the Hosmer–Lemeshow test, the Wald statistic, the Likelihood ratio test, etc. I am not going to go into detail on any of those! The main thing I want to get across is the importance of evaluating the predictive value of your model. The best way to do that is to see how well it works on a set of test data that was not used during training (or validation) of the model.

Some simple measures to evaluate the quality of a model

  • Cost Function $J$ at the optimized value of the parameters $a$. This is basically the negative “log-likelihood” that is used in many of the statistical measures. It is intuitive that a smaller value of the Cost Function indicates a better fit to the training data. The Cost Function is what is being minimized during the training optimization. Has your optimization converged? … You do need to be careful about over-fitting! Also, note that the Regularization term is only needed during the optimization. For evaluating the fit it’s probably better to leave that out.
  • Training Accuracy, how well the optimized model fits the data. This can be computed as the percentage of correctly predicted values using the optimized model on the training data. When evaluating Training Accuracy it is important to understand the data set that is being used.
    It is good to look at Training Accuracy for the positive cases and the negative cases separately as well as the overall accuracy.

    …look carefully at the data! Say you have 100 data samples but only 10 of those data points are positive cases. That means that a model that always predicts negative will have a 90% Training Accuracy!! Garbage!! (See the short sketch after this list.) For a case like that it might be better to look at Anomaly Detection … we’ll get to that in a later blog post.

  • Test-set Accuracy, how well does the model work on a set of data samples that was NOT used during training.
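Here is a minimal sketch of the class-imbalance point from above, with made-up numbers: 100 samples, only 10 of them positive, and a “model” that always predicts negative.

import numpy as np

y_true = np.array([1]*10 + [0]*90)   # 10 positive cases, 90 negative cases
y_pred = np.zeros(100, dtype=int)    # a useless "model" that always predicts negative

overall  = np.mean(y_pred == y_true) * 100
positive = np.mean(y_pred[y_true == 1] == 1) * 100
negative = np.mean(y_pred[y_true == 0] == 0) * 100
print(overall, positive, negative)   # 90.0  0.0  100.0 -- 90% overall accuracy, useless on positives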

It’s the Test-set Accuracy that actually means something! Having a perfect fit to the data and a zero cost function is useless if your model has poor predictive value on a test-set.

The Cost Function and Training Accuracy are useful measures to look at while you are adjusting hyper-parameters like $\lambda$ and deciding about the complexity of the expression you are using for the model. (…evaluating which features are important in the model, feature expressions, nonlinear features etc..)
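As a rough sketch of what that kind of hyper-parameter check might look like in code, the loop below scans a few values of $\lambda$ and compares the plain (unregularized) cost on a held-out validation set. It assumes the J, J_reg and gradJ_reg functions defined later in this post, plus hypothetical X_train, y_train, X_val, y_val arrays from the kind of data split described in the next section.

import numpy as np
from scipy.optimize import minimize

# Sketch only: fit with a few regularization weights and compare validation cost.
# Assumes J, J_reg, gradJ_reg (defined below) and hypothetical train/validation splits.
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    res = minimize(lambda a: J_reg(X_train, a, y_train, lam),
                   np.zeros(X_train.shape[1]),
                   jac=lambda a: gradJ_reg(X_train, a, y_train, lam),
                   method='BFGS')
    print('lambda = {:6.2f}   validation cost = {:.4f}'.format(lam, J(X_val, res.x, y_val)))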

Best practice for partitioning a data-set

If you only have a small set of data then it could be difficult to evaluate the trained model since pulling out test cases may leave you without enough points to get a good model fit, and, you could have problems with over-fitting. In these cases it may be worthwhile to try to get more data samples or think about generating synthetic data by adding points with small random variations on the existing set. However, if you have lots of data then best practice would be as follows.

  • Training Set 60-80% of data. Keep enough data to get a good optimization fit to the model, and try to have much more data than feature variables to help guard against over-fitting.
  • Validation Set 10-20% of data. The Validation Set is used like a test-set for checking improvements in the model from adjusting “hyper-parameters” and model features. It’s best to have this as a separate dataset from the test-set since making adjustments to improve the model during training will bias the model toward the validation-set.
  • Test Set 10-20% A good test-set with varied representative samples is the best way to evaluate whether a model will likely perform well in “the real world”. If the model accuracy for positive and negative cases is good and similar to what was found in the training-set evaluation then the model may have good predictive value.

I just made up those percentages! In reality what you do will depend on your immediate goal and the nature of your data.
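As a rough sketch of how such a split might be done with numpy (X_all and y_all are hypothetical arrays standing in for the full dataset; the 60/20/20 proportions are just the example values from the list above):

import numpy as np

rng = np.random.RandomState(42)               # fixed seed so the split is reproducible
m = y_all.size
idx = rng.permutation(m)                      # shuffle the sample indices before splitting
i1, i2 = int(0.6*m), int(0.8*m)               # 60% / 20% / 20% cut points

X_train, y_train = X_all[idx[:i1]],   y_all[idx[:i1]]
X_val,   y_val   = X_all[idx[i1:i2]], y_all[idx[i1:i2]]
X_test,  y_test  = X_all[idx[i2:]],   y_all[idx[i2:]]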

Logistic Regression is often used to evaluate things like medical studies. In this case the predictive value of the overall model may not be as important as the analysis of the “significance” of various features. That is where a lot of statistical analysis of the quality of a model comes into play — trying to find causality relationships.

For many machine learning problems the main goal is to come up with a model that has good predictive value. The model will be an “inference engine”. If you are doing a lot of work to get the best model you can, then having a validation-set to check adjustments against is good practice. A test-set is valuable to determine if your model is going to be of any use! A poorly chosen model or feature set will likely be apparent as under-fitting when you check the training accuracy against the training-set. An over-fit or numerically unstable model may not be so obvious when checking against the training data. A good test-set is the best way to evaluate the usefulness of a model.

Implementation of Logistic Regression

A quick look at the formulas and then an interactive “calculator style” implementation in this Jupyter notebook.

Equations for logistic regression

Following is a list of equations we will need for an implementation of logistic regression. They are in matrix form. Be sure to read the notes after this list.

$$ \bbox[25px,border:2px solid green]{
\begin{align}
m & = \text{the number of elements in the training set} \\ \\
n & = \text{the number of elements in the parameter vector }a \\ \\
\lambda & = \text{the adjustable regularization weight} \\ \\
g(z) & = \frac{1}{1 + e^{-z}} \\ \\
h_a(X) &= g(Xa) \\ \\
J(a) &= -\frac{1}{m} \left( y'\log(h_a(X)) + (1-y)'\log(1 - h_a(X)) \right) + \frac{\lambda}{2m} a'La \\ \\
&= -\frac{1}{m} \left(1'\log(h) + (y-1)'Xa \right) + \frac{\lambda}{2m} a'La \\ \\
\nabla J(a) &= \frac{1}{m}X'(h-y) + \frac{\lambda}{m} La \\ \\
&=\frac{1}{m}X'(g(Xa) - y) + \frac{\lambda}{m} La
\end{align} }$$

Notes:

  • $\lambda$ is a training “hyper-parameter”. It is used to smooth the optimization of the model parameters to avoid over-fitting.
  • I have included two expressions for $J(a)$. They are equivalent. The second expression contains the symbol $1'$, which is the transpose of a vector of all ones. Its effect is to sum the elements of the vector $\log(h)$. That’s just to keep the expressions in matrix/vector form, see the Theory post.
  • $L =
    \begin{bmatrix} 0&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0& \ddots&\vdots \\ 0&0&0& \cdots& 1 \end{bmatrix} $ is the $n \times n$ identity matrix with a 0 in the 1,1 position.
    The term $a'La$ is equal to $a[1:n]'a[1:n]$, i.e. the dot product of the vector $a$ excluding the first element [the bias or intercept term in the model]. $La$ is just $a$ with 0 for its first element. … a lot simpler than it looks! (A quick numerical check follows these notes.)
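A quick numerical check of that note, with an illustrative parameter vector:

import numpy as np

a = np.array([2.0, 3.0, 4.0, 5.0])   # illustrative parameter vector, a[0] is the bias term
L = np.eye(a.size)                   # identity matrix ...
L[0, 0] = 0                          # ... with a 0 in the 1,1 position

print(L @ a)                         # [0. 3. 4. 5.]  -- a with its first element zeroed
print(a @ L @ a)                     # 50.0
print(np.dot(a[1:], a[1:]))          # 50.0           -- same value: a'La == a[1:]'a[1:]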

Simple Logistic Regression Implementation

Time for some code! First we’ll load up some Python modules and create definitions for the equations. I’ll add some functions to give information on how well the model is working too.

In [44]:
import pandas as pd  # data handling
import numpy as np   # numerical computing
from scipy.optimize import minimize  # optimization code
import matplotlib.pyplot as plt  # plotting
import seaborn as sns
%matplotlib inline
sns.set()
In [56]:
#
# Main Logistic Regression Equations
#
def g(z) :  # sigmoid function
    return 1/(1 + np.exp(-z))

def h(X,a) : # Model function
    return g(np.dot(X,a))

def J(X,a,y) : # Cost Function 
    m = y.size
    return -(np.sum(np.log(h(X,a))) + np.dot((y-1).T,(np.dot(X,a))))/m

def J_reg(X,a,y,reg_lambda) : # Cost Function with Regularization
    m = y.size
    return J(X,a,y) + reg_lambda/(2*m) * np.dot(a[1:],a[1:])

def gradJ(X,a,y) : # Gradient of Cost Function
    m = y.size
    return (np.dot(X.T,(h(X,a) - y)))/m

def gradJ_reg(X,a,y,reg_lambda) : # Gradient of Cost Function with Regularization
    m = y.size
    a_reg = np.copy(a)  # copy so the caller's parameter vector is not modified
    a_reg[0] = 0        # this is La: the bias term a[0] is not regularized
    return gradJ(X,a,y) + reg_lambda/m * a_reg
In [46]:
#
# Some model checking functions
#
def to_0_1(h_prob) : # convert probabilities to true (1) or false (0) at cut-off 0.5
    return np.where(h_prob >= 0.5, 1, 0)

def model_accuracy(h,y) : # Overall accuracy of model
    return np.sum(h==y)/y.size * 100

def model_accuracy_pos(h,y) : # Accuracy on positive cases
    return np.sum(y[h==1] == 1)/y[y==1].size * 100

def model_accuracy_neg(h,y) : # Accuracy on negative cases
    return np.sum(y[h==0] == 0)/y[y==0].size * 100

def false_pos(h,y) : # Number of false positives
    return np.sum((y==0) & (h==1))

def false_neg(h,y) : # Number of false negatives
    return np.sum((y==1) & (h==0))

def print_model_quality(title, h, y) : # Print the results of the functions above
    print( '\n# \n# {} \n#'.format(title) )
    print( 'Total number of data points   = {}'.format(y.size))
    print( 'Number of Positive values(1s) = {}'.format(y[y==1].size))
    print( 'Number of Negative values(0s) = {}'.format(y[y==0].size))
    print( '\nModel Accuracy = {:.2f}%'.format( model_accuracy(h,y) ) )
    print( '\nModel Accuracy Positive Cases = {:.2f}%'.format( model_accuracy_pos(h,y) ) )
    print( 'Number of False Positives = {}'.format(false_pos(h,y)) )
    print( '\nModel Accuracy Negative Cases = {:.2f}%'.format( model_accuracy_neg(h,y) ) )
    print( 'Number of False Negatives = {}'.format(false_neg(h,y)) )

Example: Coronary heart disease vs Age

The goal of the first example is just to check my work! I want to be sure that the code above produces a model that I can verify against another implementation. I won’t separate out a test set for this example since I really just want to check my result against some of what was done in the following book,

Applied Logistic Regression, 3rd ed. (2013)
David W. Hosmer, Jr., Stanley Lemeshow, Rodney X. Sturdivant
ISBN 978-0-470-58247-3 (cloth)
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

The data is described as “the age in years (AGE), and presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical study of risk factors for heart disease”.

We’ll read in the data with pandas and take a look at the first few entries,

In [47]:
df = pd.read_table("CHDAGE/CHDAGE.txt")
In [48]:
df.head()
Out[48]:
   ID  AGE  CHD
0   1   20    0
1   2   23    0
2   3   24    0
3   5   25    1
4   4   25    0

Define the model training variables and augmented matrix $X$,

In [49]:
x = df['AGE']
y = df['CHD']
X = np.column_stack( (np.ones((x.size,1)) , x ) )

I’ll be doing the optimization without regularization. The following two functions are “wrappers” around my equation functions so that they can be passed to the optimization code.

In [50]:
def opt_J(a) :
    return J(X,a,y)

def opt_gradJ(a) :
    return gradJ(X,a,y)

The optimization is done with the minimize function from scipy.optimize using the BFGS method (the Broyden–Fletcher–Goldfarb–Shanno algorithm), a good all-around choice for many non-linear optimization problems.

In [51]:
aguess = [0.1,2.0] # an initial guess for the parameters
res = minimize(opt_J, aguess, method='BFGS', jac=opt_gradJ, tol=1e-8, options={'disp': True})
Optimization terminated successfully.
         Current function value: 0.536765
         Iterations: 16
         Function evaluations: 22
         Gradient evaluations: 22

The main result I want to check is the optimized parameters $a$. They are in precise agreement with the book.

In [52]:
a_opt = res.x
a_opt
Out[52]:
array([-5.30945337,  0.11092114])
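As an extra cross-check (not in the book, just an optional sanity check), statsmodels has its own maximum-likelihood Logit implementation, and its fitted coefficients should essentially match a_opt above:

import statsmodels.api as sm

sm_result = sm.Logit(y, X).fit(disp=0)   # X already contains the intercept column
print(sm_result.params)                  # should be very close to a_opt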

The “model quality” output and plots of the results

In [53]:
h_prob = h(X,a_opt)
h_pred = to_0_1(h_prob)  # predicted 0/1 values (renamed so the model function h() is not shadowed)
In [54]:
print_model_quality('Coronary heart disease vs Age -- Model Quality on Training-data', h_pred, y)
#
# Coronary heart disease vs Age -- Model Quality on Training-data
#
Total number of data points   = 100
Number of Positive values(1s) = 43
Number of Negative values(0s) = 57

Model Accuracy = 74.00%

Model Accuracy Positive Cases = 67.44%
Number of False Positives = 12

Model Accuracy Negative Cases = 78.95%
Number of False Negatives = 14

The model is not a bad fit given that the data has a lot of variability, i.e. there are young people with heart disease and older people without it.

The “decision boundary” (or logit) from this model is $b(AGE) = -5.309 + 0.1109*AGE$ and the probability model itself would be,

$$ h_a(AGE) = \frac{1}{1+e^{5.309 - 0.1109*AGE} } $$
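As a quick spot check of that expression (an illustrative calculation, not from the book), the predicted probability at, say, age 50 comes out just above one half:

age = 50                                 # an example age, just for illustration
b = -5.309 + 0.1109*age                  # the logit / decision boundary value
p = 1.0/(1.0 + np.exp(-b))               # probability of heart disease at this age
print('b = {:.3f}, probability = {:.3f}'.format(b, p))   # roughly 0.236 and 0.559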

Since this is a 1-D problem, a plot of the probability $h_a(AGE)$ vs AGE along with the training data set is an interesting visualization,

In [57]:
fig, ax = plt.subplots()
xage = np.linspace(20,70,50)
Xage = np.column_stack( (np.ones((xage.size,1)) , xage ) )
ax.plot(xage, h(Xage,a_opt), label='h(Xa)')
ax.plot(x,y,'o', label='Training data')
plt.title("AGE vs h(AGE,a)= 1 / 1 +exp(5.309 - 0.1109*AGE)", fontsize=18)
plt.xlabel('AGE')
plt.ylabel('probability of heart disease')
plt.legend();

It basically just shows that as AGE increases so does the probability of coronary heart disease.

We can also look at the positive (red +) and negative (green o) results plotted on a line (it’s a 1-dimensional problem) together with the decision boundary line.

In [58]:
fig, ax = plt.subplots()
xage = np.linspace(20,70,50)
Xage = np.column_stack( (np.ones((xage.size,1)) , xage ) )
ax.scatter(x[y==1],0.5*np.ones(x[y==1].size), marker='P', color='r', alpha=0.4 ,label='Pos data')
ax.scatter(x[y==0],0.5*np.ones(x[y==0].size), marker='o', color='g', alpha=0.4, label='Neg data')
bx = np.linspace(45,60,50)
ax.plot( bx, -5.309 + 0.1109*bx, label='Decision Boundary')
plt.title("Decision Boundary  b(AGE) = -5.309 + 0.1109*AGE)", fontsize=18)
plt.xlabel('AGE')
plt.ylabel('b(AGE)')
plt.legend();

Where the decision boundary crosses the data at the 0.5 probability “contour” is the model’s cutoff for deciding “yes” heart disease or “no” heart disease.

This example was really just a check to be sure the code was producing the correct model parameters and that the results made sense. It looks good, so in the next post I’ll do more examples of usage. After some binary decision problems I’ll look at multi-class Logistic Regression.

Happy computing –dbk