Logistic Regression: Examples 1 — 2-D data fit with a multinomial model and 0/1 digit classification on the MNIST dataset.¶
This post will be mostly Python code with implementation and examples of the Logistic Regression theory we have been discussing in the last few posts. Examples include fitting 2-feature data using an arbitrary-order multinomial model and a simple 2-class image classification problem using the MNIST digits data.
I’ve done three earlier posts on Logistic Regression: Logistic Regression Theory, Logistic and Linear Regression Regularization, and Logistic Regression Implementation. Those posts include discussion of logistic regression theory and derivation of the equations with the purpose of giving insight into how it works. The last post included a simple Python implementation of the equations and a simple example.
This will be a “calculator” style implementation using Python in this Jupyter notebook. This will hopefully lift some of the veil off of black-box implementations in packages like scikit-learn. The code should be fairly easy to “tinker” with.
I’ll also exercise a “best practice” in data science by dividing the MNIST dataset into training-validation-test subsets for analysis.
This post, along with all of the others in this series, was converted to HTML from a Jupyter notebook. The notebooks are available at https://github.com/dbkinghorn/blog-jupyter-notebooks
Python functions to implement Logistic Regression¶
To understand what is being implemented here refer to the previous posts mentioned near the top of this post.
Note: I have fixed a couple of bugs in code that I had in the last post. These were related to Python variable and function scoping.
import pandas as pd # data handling
import numpy as np # numerical computing
from scipy.optimize import minimize # optimization code
import matplotlib.pyplot as plt # plotting
import seaborn as sns
%matplotlib inline
sns.set()
import itertools # combinatorics functions for multinomial code
#
# Main Logistic Regression Equations
#
def g(z) : # sigmoid function
    return 1.0/(1.0 + np.exp(-z))
def h_logistic(X,a) : # Model function
    return g(np.dot(X,a))
def J(X,a,y) : # Cost Function
    m = y.size
    return -(np.sum(np.log(h_logistic(X,a))) + np.dot((y-1).T,(np.dot(X,a))))/m
def J_reg(X,a,y,reg_lambda) : # Cost Function with Regularization
    m = y.size
    return J(X,a,y) + reg_lambda/(2.0*m) * np.dot(a[1:],a[1:])
def gradJ(X,a,y) : # Gradient of Cost Function
    m = y.size
    return (np.dot(X.T,(h_logistic(X,a) - y)))/m
def gradJ_reg(X,a,y,reg_lambda) : # Gradient of Cost Function with Regularization
    m = y.size
    # the derivative of the (lambda/(2m)) * a.a penalty is (lambda/m) * a; the leading 0 leaves the bias term unregularized
    return gradJ(X,a,y) + reg_lambda/m * np.concatenate(([0], a[1:]))
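For reference, written in the compact matrix form the code uses (with $X$ the augmented feature matrix, $a$ the parameter vector, $y$ the 0/1 label vector, $m$ the number of data points and $\lambda$ the regularization strength), the functions above compute

$g(z) = \frac{1}{1+e^{-z}}, \qquad h = g(Xa)$

$J(a) = -\frac{1}{m}\left[ \sum_{i=1}^{m} \log h_i + (y-1)^T X a \right], \qquad J_{reg}(a) = J(a) + \frac{\lambda}{2m}\sum_{j\ge 1} a_j^2$

$\nabla J(a) = \frac{1}{m} X^T (h - y), \qquad \nabla J_{reg}(a) = \nabla J(a) + \frac{\lambda}{m}\,[0, a_1, \ldots, a_n]^T$

The derivation of this form of the cost function is in the earlier theory post.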
For the model checking functions I have added all of the data for a “confusion matrix”, i.e. true positives, true negatives, false positives and false negatives. Using that data I have added “Precision” and “Recall” measures.
#
# Some model checking functions
#
def to_0_1(h_prob) : # convert probabilities to true (1) or false (0) at cut-off 0.5
    return np.where(h_prob >= 0.5, 1, 0)
def model_accuracy(h,y) : # Overall accuracy of model
    return np.sum(h==y)/y.size * 100
def model_accuracy_pos(h,y) : # Accuracy on positive cases
    return np.sum(y[h==1] == 1)/y[y==1].size * 100
def model_accuracy_neg(h,y) : # Accuracy on negative cases
    return np.sum(y[h==0] == 0)/y[y==0].size * 100
def false_pos(h,y) : # Number of false positives
    return np.sum((y==0) & (h==1))
def false_neg(h,y) : # Number of false negatives
    return np.sum((y==1) & (h==0))
def true_pos(h,y) : # Number of true positives
    return np.sum((y==1) & (h==1))
def true_neg(h,y) : # Number of true negatives
    return np.sum((y==0) & (h==0))
def model_precision(h,y) : # Precision = TP/(TP+FP)
    return true_pos(h,y)/(true_pos(h,y) + false_pos(h,y))
def model_recall(h,y) : # Recall = TP/(TP+FN)
    return true_pos(h,y)/(true_pos(h,y) + false_neg(h,y))
def print_model_quality(title, h, y) : # Print the results of the functions above
    print( '\n# \n# {} \n#'.format(title) )
    print( 'Total number of data points = {}'.format(y.size))
    print( 'Number of Positive values(1s) = {}'.format(y[y==1].size))
    print( 'Number of Negative values(0s) = {}'.format(y[y==0].size))
    print( '\nNumber of True Positives = {}'.format(true_pos(h,y)) )
    print( 'Number of False Positives = {}'.format(false_pos(h,y)) )
    print( '\nNumber of True Negatives = {}'.format(true_neg(h,y)) )
    print( 'Number of False Negatives = {}'.format(false_neg(h,y)) )
    print( '\nModel Accuracy = {:.2f}%'.format( model_accuracy(h,y) ) )
    print( 'Model Accuracy Positive Cases = {:.2f}%'.format( model_accuracy_pos(h,y) ) )
    print( 'Model Accuracy Negative Cases = {:.2f}%'.format( model_accuracy_neg(h,y) ) )
    print( '\nModel Precision = {}'.format(model_precision(h,y)) )
    print( '\nModel Recall = {}'.format(model_recall(h,y)) )
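As a quick sanity check, here is what these functions report on a tiny hand-made example: five predictions with two of them wrong, which should give 2 true positives, 1 false positive, 1 true negative, 1 false negative, 60% accuracy and precision = recall = 2/3.

# quick sanity check of the model quality functions on a tiny made-up example
h_demo = np.array([1, 0, 1, 1, 0]) # predicted classes
y_demo = np.array([1, 0, 0, 1, 1]) # true labels
print_model_quality('Toy example', h_demo, y_demo)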
I added the following “utility functions” to generate multinomial features and apply mean normalization to features. Normalization is particularly important in Logistic regression.
def multinomial_partitions(n, k):
    """returns an array of length k sequences of integer partitions of n"""
    nparts = itertools.combinations(range(1, n+k), k-1)
    tmp = [(0,) + p + (n+k,) for p in nparts]
    sequences = np.diff(tmp) - 1
    return sequences[::-1] # reverse the order
def make_multinomial_features(fvecs,order=[1,2]) :
    '''Make multinomial feature matrix
    fvecs is a matrix of feature vectors (columns)
    "order" is a set of multinomial degrees to create
    default is [1,2] meaning for example: given f1, f2 in fvecs
    return a matrix made up of a [1's column, f1,f2,f1**2,f1*f2,f2**2] '''
    Xtmp = np.ones_like(fvecs[:,0])
    for ord in order :
        if ord==1 :
            fstmp = fvecs
        else :
            pwrs = multinomial_partitions(ord,fvecs.shape[1])
            # build each new column as the product of the features raised to the powers in a row of pwrs
            # (use a list, not a generator, so this also works with newer NumPy versions)
            fstmp = np.column_stack( [np.prod(fvecs**pwrs[i,:], axis=1) for i in range(pwrs.shape[0])] )
        Xtmp = np.column_stack((Xtmp,fstmp))
    return Xtmp
def mean_normalize(X):
    '''apply mean normalization to each column of the matrix X'''
    X_mean=X.mean(axis=0)
    X_std=X.std(axis=0)
    return (X-X_mean)/X_std
def apply_normalizer(X,X_mean,X_std) :
    return (X-X_mean)/X_std
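To see what these helpers produce, here is a quick check on a tiny, made-up two-feature example. multinomial_partitions(2,2) returns the power pairs [2,0], [1,1], [0,2], i.e. the terms $f_1^2, f_1 f_2, f_2^2$, and make_multinomial_features then builds the corresponding columns.

# quick check of the multinomial feature helpers on a tiny example
print(multinomial_partitions(2,2)) # [[2 0] [1 1] [0 2]]
fdemo = np.array([[1.0, 2.0],
                  [3.0, 4.0]]) # two samples, two features f1 and f2
print(make_multinomial_features(fdemo, order=[1,2]))
# columns are [1, f1, f2, f1**2, f1*f2, f2**2]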
Example of 2-D data using a multinomial feature model¶
This first example is a variation of data that I used when describing “decision boundaries”. I have made it a little more complicated and will use multinomial “feature expansion” on the two feature variables. This will also include some over-fitting and Regularization cases.
# Generate some "interesting" random 2-D data
np.random.seed(42)
px1, px2 = np.random.multivariate_normal([0,0], [[1,0.2],[0.8,1]] , 100).T
t = np.linspace(0,2*np.pi,100)
nx1, nx2 = (3+px1)*np.sin(t), (3+px2)*np.cos(t)
# Plot the data
fig, ax = plt.subplots()
ax.plot(nx1,nx2, "o", label='neg data' )
ax.plot(px1,px2, "P", label='pos data')
ax.axis('equal')
plt.title("2-D dataset", fontsize=24)
ax.legend();
We will obviously need a non-linear feature model to fit this data. We’ll start by setting the data up for a logistic regression optimization run using a hybrid multinomial feature set that looks like,
$ [1, f_1, f_2, f_1^2, f_1 f_2, f_2^2 ]$¶
f1 = np.concatenate((nx1,px1))
f2 = np.concatenate((nx2,px2))
fs = np.column_stack((f1,f2))
y = np.concatenate( (np.zeros(100), (np.ones(100))) )
order = np.array([1,2])
X = make_multinomial_features(fs,order=order)
X[:,1:] = mean_normalize(X[:,1:]) # normalize but leave the bias term alone
print('Number of multinomial features = {}'.format(X.shape[1]))
Before the optimization run we need to set up wrappers around the cost function and the gradient so they are in the correct form for the optimization code.
def opt_J_reg(a) :
    return J_reg(X,a,y,reg)
def opt_gradJ_reg(a) :
    return gradJ_reg(X,a,y,reg)
Now do an optimization run to find the parameters $a$ for the model.
reg = 0.0 # no regularization on this run
aguess = np.ones(X.shape[1]) # starting guess of all ones
res = minimize(opt_J_reg, aguess, method='BFGS', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
Take a quick look at the optimized $a$,
a_opt = res.x
print(a_opt)
Here’s some data about the quality of the fit,
h_prob = h_logistic(X,a_opt) # get the probabilities from the model using a_opt
h_predict = to_0_1(h_prob) # convert the probabilities to 0,1 predictions at cutoff 0.5
print_model_quality('Training-data fit', h_predict, y)
That looks like a reasonably good fit to the data. (I am not doing validation and test set checking for this example)
Let's take a look at a plot of the fit,
fig, ax = plt.subplots()
ax.plot(X[0:100,1],X[0:100,2], "o", label='neg data' )
ax.plot(X[100:,1],X[100:,2], "P", label='pos data')
xb1 = np.linspace(-1.5, 1.5, 100)
xb2 = np.linspace(-1.5, 1.5, 100)
#Xb1, Xb2 = np.meshgrid(mean_normalize(xb1), mean_normalize(xb2))
Xb1, Xb2 = np.meshgrid(xb1, xb2)
BX = make_multinomial_features(np.column_stack((Xb1.flatten(),Xb2.flatten())),order=order)
BX[:,1:] = mean_normalize(BX[:,1:])
B = np.sum(a_opt*BX, axis=1).reshape((100,100))
ax.contour(Xb1,Xb2,B, colors='r', levels=[0])
ax.axis('equal')
plt.title("Decision Boundary From Fit", fontsize=24)
plt.legend();
Now we’ll try the fit using a multinomial feature set up to order 6!
$ [1, f_1, f_2, f_1^2, f_1 f_2, f_2^2,f_1^3 \cdots,f_1f_2^5, f_2^6 ]$¶
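(With two base features there are $d+1$ terms of degree $d$, so orders 1 through 6 contribute $2+3+4+5+6+7 = 27$ columns, plus the bias column for 28 features in total.)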
order = np.array([1,2,3,4,5,6])
X = make_multinomial_features(fs,order=order)
X[:,1:] = mean_normalize(X[:,1:]) # normalize but leave the bias term alone
print('Number of multinomial features = {}'.format(X.shape[1]))
reg = 0.0 # no regularization on this run
aguess = np.ones(X.shape[1]) # starting guess of all ones
res = minimize(opt_J_reg, aguess, method='BFGS', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
a_opt = res.x
print(a_opt)
The optimization had some trouble because of the size of the parameters and gave divide-by-zero warnings (np.log() in the cost function complains when some of the predicted probabilities underflow to exactly 0).
h_prob = h_logistic(X,a_opt) # get the probabilities from the model using a_opt
h_predict = to_0_1(h_prob) # convert the probabilities to 0,1 predictions at cutoff 0.5
print_model_quality('Training-data fit', h_predict, y)
The model fit quality does look good. Let’s look at a plot of the fit,
fig, ax = plt.subplots()
ax.plot(X[0:100,1],X[0:100,2], "o", label='neg data' )
ax.plot(X[100:,1],X[100:,2], "P", label='pos data')
xb1 = np.linspace(-1.5, 1.5, 100)
xb2 = np.linspace(-1.5, 1.5, 100)
#Xb1, Xb2 = np.meshgrid(mean_normalize(xb1), mean_normalize(xb2))
Xb1, Xb2 = np.meshgrid(xb1, xb2)
BX = make_multinomial_features(np.column_stack((Xb1.flatten(),Xb2.flatten())),order=order)
BX[:,1:] = mean_normalize(BX[:,1:])
B = np.sum(a_opt*BX, axis=1).reshape((100,100))
ax.contour(Xb1,Xb2,B, colors='r', levels=[0])
ax.axis('equal')
plt.title("Decision Boundary From Fit", fontsize=24)
plt.legend();
The plot looks OK but the model boundary is getting a bit wobbly. Let’s try it again with a small regularization value added.
reg = 1.0 # small regularization on this run
aguess = np.ones(X.shape[1]) # starting guess of all ones
res = minimize(opt_J_reg, aguess, method='BFGS', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
a_opt = res.x
print(a_opt)
h_prob = h_logistic(X,a_opt) # get the probabilities from the model using a_opt
h_predict = to_0_1(h_prob) # convert the probabilities to 0,1 predictions at cutoff 0.5
print_model_quality('Training-data fit', h_predict, y)
Now the parameters $a$ are closer to 1 and the optimization no longer gives a divide-by-zero warning, but it is still complaining about convergence. The fit quality looks good. Let's take a look at a plot,
fig, ax = plt.subplots()
ax.plot(X[0:100,1],X[0:100,2], "o", label='neg data' )
ax.plot(X[100:,1],X[100:,2], "P", label='pos data')
xb1 = np.linspace(-1.5, 1.5, 100)
xb2 = np.linspace(-1.5, 1.5, 100)
#Xb1, Xb2 = np.meshgrid(mean_normalize(xb1), mean_normalize(xb2))
Xb1, Xb2 = np.meshgrid(xb1, xb2)
BX = make_multinomial_features(np.column_stack((Xb1.flatten(),Xb2.flatten())),order=order)
BX[:,1:] = mean_normalize(BX[:,1:])
B = np.sum(a_opt*BX, axis=1).reshape((100,100))
ax.contour(Xb1,Xb2,B, colors='r', levels=[0])
ax.axis('equal')
plt.title("Decision Boundary From Fit", fontsize=24)
plt.legend();
The plot boundary is smoother and probably a more reasonable model. I hope this example has given you some idea of how multinomial features and regularization can affect model fitting for Logistic Regression.
This is really kind of fun to play around with! If you want to try the code and experiment yourself feel free to grab the Jupyter notebook from the GitHub link I gave near the top of the post.
In the next section we’ll look at a simple image classification example.
Two-class image classification example using data from the MNIST dataset¶
For this example I pulled the MNIST training set from Kaggle. For information on the dataset itself see Yann LeCun’s site http://yann.lecun.com/exdb/mnist/index.html. I used the data from Kaggle since it is in an easy-to-use format. (Kaggle is a great site for Machine Learning and there are many useful datasets and example problems there.)
The reason I only pulled the training set from Kaggle is because I want to explicitly show splitting up a dataset into separate sets for training, validation, and testing. The training set should be the largest set since that is where the model fitting takes place. The validation set is used for evaluating the effect of making adjustments to “hyper-parameters” of the method used to construct the model. The test set is the final judge of the “generalization” or usefulness of the model for making predictions.
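As a minimal illustrative sketch (not the exact split used below, where I simply slice the 0/1 subset in order), a generic shuffled split into roughly 70/15/15 portions could look like the helper here; the function name and the split fractions are just placeholders.

# illustration only: shuffle the rows, then split roughly 70/15/15
def train_val_test_split(X, y, frac_train=0.70, frac_val=0.15, seed=42):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(y.size)      # shuffled row indices
    n_train = int(frac_train * y.size)
    n_val = int(frac_val * y.size)
    i_tr = idx[:n_train]
    i_va = idx[n_train:n_train + n_val]
    i_te = idx[n_train + n_val:]       # the remaining ~15% is the test set
    return (X[i_tr], y[i_tr]), (X[i_va], y[i_va]), (X[i_te], y[i_te])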
The first thing to do is load up the data and have a look at it,
data_full = pd.read_csv("./data/kg-mnist/train.csv")
data_full.head(10)
The data is laid out with each row being an image with the image label (0-9) as the first element of the row. The following 784 columns are the 28 x 28 pixel image “flattened out” as a vector of gray-scale values. The 784 pixel gray-scale intensities are the “features”.
I’ll convert the dataframe to a matrix and show the image in the 6th row (a 7). There are 42000 images in this dataset.
data_full_matrix = data_full.values # (as_matrix() was removed in newer pandas; .values gives the same NumPy array)
print(data_full_matrix.shape)
plt.imshow(data_full_matrix[6,1:].reshape((28,28)))
The image is shown fairly large on the screen. You can clearly see the pixels and their relative gray-scale values.
Now I’ll pull out just the 0 and 1 images (data_01) and the corresponding labels (y_01),
idx = np.logical_or(data_full_matrix[:,0]==0, data_full_matrix[:,0]==1)
y_01, data_01 = data_full_matrix[idx,0], data_full_matrix[idx,1:]
print(data_01.shape)
There are now 8816 images of 0’s and 1’s in (data_01). Some of the feature columns in data_01 have all 0 gray-scale values, i.e. they are empty borders. Let’s remove those “features”. We will have 588 columns in data_01 after their removal.
data_01 = data_01[:,data_01.sum(axis=0)!=0]
print(data_01.shape)
The next thing to do is break up the data set into training, validation and test sets. I’ll print out the shapes of the datasets and label vectors. We will have 6170 training samples and 1323 samples each for validation and testing.
data_train_01,y_train_01 = data_01[0:6170,:], y_01[0:6170]
data_val_01, y_val_01 = data_01[6170:7493,:], y_01[6170:7493]
data_test_01, y_test_01 = data_01[7493:,:], y_01[7493:]
print(data_train_01.shape,y_train_01.shape)
print(data_val_01.shape, y_val_01.shape)
print(data_test_01.shape, y_test_01.shape)
Next we set up the data for the optimization run. We want the data to all be normalized by the same mean and standard deviation so that all of the sets are scaled in the same way. I’ll pull the mean and standard deviation from the full 0,1 dataset data_01 and apply it to all of the datasets.
At the same time I’ll set up the augmented matrix with ones in the first column for the bias term using the make_multinomial_features function with “order” set to [1]. [I could add multinomial features but that gets to be VERY large … I tried it and am still experimenting with subsets :-)]
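(To get a feel for how large: with the 588 remaining pixel features, the order-2 terms alone would number 588*589/2 = 173,166, giving 173,755 columns once the 588 linear terms and the bias column are included.)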
X_mean = data_01.mean(axis=0)
X_std = data_01.std(axis=0)
X_std[X_std==0]=1.0 # if there are any 0 values in X_std set them to 1
order = np.array([1])
X_train = make_multinomial_features(data_train_01, order=order)
X_train[:,1:] = apply_normalizer(X_train[:,1:],X_mean,X_std)
y_train = y_train_01
X_val = make_multinomial_features(data_val_01, order=order)
X_val[:,1:] = apply_normalizer(X_val[:,1:],X_mean,X_std)
y_val = y_val_01
X_test = make_multinomial_features(data_test_01, order=order)
X_test[:,1:] = apply_normalizer(X_test[:,1:],X_mean,X_std)
y_test = y_test_01
Optimization run on Training set¶
def opt_J_reg(a) :
    return J_reg(X_train,a,y_train,reg) # regularized cost, consistent with the regularized gradient below
def opt_gradJ_reg(a) :
    return gradJ_reg(X_train,a,y_train,reg)
reg =1.0
aguess = np.random.randn(X_train.shape[1])
%time res = minimize(opt_J_reg, aguess, method='CG', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
a_opt = res.x
h_prob = h_logistic(X_train,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Training-data fit', h_predict, y_train)
We got a 100% fit to the training data! Note: I changed the optimizer to conjugate gradients since it is usually better for larger numbers of optimization parameters. I also have a small regularization value set to help stabilize the optimization.
Results on Validation set¶
This is our first test of how well the fit is “really” doing. 100% is hard to argue with for the training set!
a_opt = res.x
h_prob = h_logistic(X_val,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Validation-data fit', h_predict, y_val)
All we had to do here is use the optimized values in $a$ with our validation set to see how well the model predicts values on that. It looks pretty good with only 7 total misclassifications out of 1323 total.
Maybe our “perfect” fit on the training set was too good. I can try adjusting the regularization term to see if we can get a model that is still good on the training set but does a better job on the validation set. (this is what the validation set is for!)
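A minimal sketch of how that kind of search could be automated is shown below, looping over a few candidate regularization values (the particular values and the temporary variable names are just examples) and scoring each fit on both the training and validation sets. Note that opt_J_reg and opt_gradJ_reg pick up the global reg set by the loop. After that is the run with the value I ended up settling on.

# illustrative sketch: sweep a few regularization values and compare validation accuracy
for reg in [0.1, 1.0, 5.0, 15.0, 50.0]:
    res_tmp = minimize(opt_J_reg, aguess, method='CG', jac=opt_gradJ_reg, tol=1e-4)
    a_tmp = res_tmp.x
    train_acc = model_accuracy(to_0_1(h_logistic(X_train, a_tmp)), y_train)
    val_acc = model_accuracy(to_0_1(h_logistic(X_val, a_tmp)), y_val)
    print('reg = {:5.1f} train acc = {:.2f}% val acc = {:.2f}%'.format(reg, train_acc, val_acc))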
reg = 15.0
np.random.seed(42) # just make sure I'm using the same random guess every time I run
aguess = np.random.randn(X_train.shape[1])
%time res = minimize(opt_J_reg, aguess, method='CG', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
a_opt = res.x
h_prob = h_logistic(X_train,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Training-data fit', h_predict, y_train)
a_opt = res.x
h_prob = h_logistic(X_val,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Validation-data fit', h_predict, y_val)
After a few minutes of trying different values for the regularization, 15.0 seemed to be about as good as it was going to get for both the training and validation sets. This value resulted in just 1 missed prediction out of 1323.
Now let's see how the test set looks,
Results on Test set¶
a_opt = res.x
h_prob = h_logistic(X_test,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Test-data fit', h_predict, y_test)
OK there it is! The test set has 7 errors out of 1323, not bad for simple Logistic Regression!
Next time I'll extend this to multi-class Logistic Regression using a method called one-vs-all. With that we can do the whole set of digits 0-9.
Happy computing! –dbk