Coursera Machine Learning Challenges in Python: ex2 (Logistic Regression)

A series that implements the Coursera Machine Learning programming exercises in Python. (2015/10/23) Added ex2_reg. (2015/12/25) Added a version of ex2_reg that is written more simply using PolynomialFeatures.

ex2 (logistic regression without regularization)

Introduction

In this exercise, the scores of two exams are given as input data and the admission result (pass or fail) as output data, and we build a classifier using logistic regression.

Python code

ex2.py


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

data = pd.read_csv("ex2data1.txt", header=None)
# read 1st, 2nd column as feature matrix (100x2)
X = np.array([data[0],data[1]]).T
# read 3rd column as label vector (100)
y = np.array(data[2])

# plot
pos = (y==1) # numpy bool index
neg = (y==0) # numpy bool index
plt.scatter(X[pos,0], X[pos,1], marker='+', c='b')
plt.scatter(X[neg,0], X[neg,1], marker='o', c='y')
plt.legend(['Admitted', 'Not admitted'], scatterpoints=1)
plt.xlabel("Exam 1 Score")
plt.ylabel("Exam 2 Score")

# Logistic regression model with no regularization
model = linear_model.LogisticRegression(C=1000000.0)
model.fit(X, y)

# Extract model parameter (theta0, theta1, theta2)
[theta0] = model.intercept_
[[theta1, theta2]] = model.coef_
# Plot decision boundary
plot_x = np.array([min(X[:,0])-2, max(X[:,0])+2])   # lowest and highest x1
plot_y = - (theta0 + theta1*plot_x) / theta2   # calculate x2
plt.plot(plot_x, plot_y, 'b')

plt.show()

The resulting plot looks like this. ex2.png

Machine learning points

To perform logistic regression, use the sklearn.linear_model.LogisticRegression class and train with the familiar model.fit(X, y).

In the LogisticRegression class, the strength of regularization is specified by the parameter C. In the course this was specified by the parameter $\lambda$, but C is the reciprocal of $\lambda$ (it will appear again in a later SVM session). The smaller C is, the stronger the regularization; the larger C is, the weaker the regularization. In this example we want no regularization, so we set C to a large value (1,000,000).
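As a minimal sketch of this correspondence (the three $\lambda$ values are the ones from the lecture; $\lambda = 0$ has no finite reciprocal, so a very large C stands in for it):

from sklearn import linear_model

# lambda -> C = 1/lambda; lambda = 0 (no regularization) has no finite
# reciprocal, so a very large C such as 1e6 stands in for it.
for lam, C in [(0, 1000000.0), (1, 1.0), (100, 0.01)]:
    model = linear_model.LogisticRegression(C=C)
    print("lambda =", lam, "-> C =", C)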

After training the model, draw the decision boundary. The decision boundary of logistic regression is the straight line defined by $\theta^T x = 0$. In this example, writing it out in components gives $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$; solving for $x_2$ yields $x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}$, which is used to compute the coordinates of points on the decision boundary and pass them to the plot function.


ex2_reg (regularized logistic regression)

Introduction

In this exercise, two test results for microchips are given as input data and a pass/fail flag as output data. We build a classifier that separates pass from fail using a logistic regression model. Because the data cannot be separated by a straight line, polynomial features are used. We also train models with different regularization parameters $\lambda$ to see the effect of regularization.

Python code

ex2_reg.py


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

# mapFeature(x1, x2)
# Performs feature mapping.
# Arguments: feature vectors x1, x2 (must have the same dimension n)
# Return value: feature matrix X (n x 28)
# Columns are all monomials up to degree 6: 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, ..., x1*x2^5, x2^6 (28 columns)
def mapFeature(x1, x2):
    degree = 6
    out = np.ones(x1.shape)  # the first column is all ones
    for i in range(1, degree+1):  # loop i from 1 to degree
        for j in range(0, i+1):   # loop j from 0 to i
            out = np.c_[out, (x1**(i-j) * x2**j)]  # append the column x1^(i-j) * x2^j
    return out

# main script starts here
data = pd.read_csv("ex2data2.txt", header=None)

x1 = np.array(data[0])
x2 = np.array(data[1])
y = np.array(data[2])

#Plot sample data
pos = (y==1) # numpy bool index
neg = (y==0) # numpy bool index
plt.scatter(x1[pos], x2[pos], marker='+', c='b')  # positive examples as '+'
plt.scatter(x1[neg], x2[neg], marker='o', c='y')  # negative examples as 'o'
plt.legend(['y = 1', 'y = 0'], scatterpoints=1)   # order matches the plotting order above
plt.xlabel("Microchip Test 1")
plt.ylabel("Microchip Test 2")

# feature mapping: X is an n x 28 matrix
X = mapFeature(x1, x2)

#Logistic regression model with regularization
model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y)

# plot the decision boundary
px = np.arange(-1.0, 1.5, 0.1)
py = np.arange(-1.0, 1.5, 0.1)
PX, PY = np.meshgrid(px, py)  # PX and PY are each 25x25 matrices
XX = mapFeature(PX.ravel(), PY.ravel())  # feature mapping; ravel() flattens each grid to a 625-dimensional vector. XX is a 625x28 matrix
Z = model.predict_proba(XX)[:,1]  # predict with the logistic regression model; the probability of y=1 is in the second column. Z is a 625-dimensional vector
Z = Z.reshape(PX.shape)  # reshape Z to a 25x25 matrix
plt.contour(PX, PY, Z, levels=[0.5], linewidths=3)  # the Z=0.5 contour is the decision boundary
plt.show()
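As a quick sanity check (mine, not part of the original exercise): two variables with all monomials up to degree 6 give (6+1)(6+2)/2 = 28 columns, which you can confirm from mapFeature's output shape once the function above is defined.

# Two sample points map to a 2x28 feature matrix.
print(mapFeature(np.array([1.0, 2.0]), np.array([3.0, 4.0])).shape)  # (2, 28)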

Machine learning points

The part that uses the LogisticRegression class is the same as in the previous example. The regularization that appeared in Coursera is L2 regularization (ridge), so we pass the option penalty='l2'. We draw the decision boundary with models trained at different regularization strengths: Coursera used the three values $\lambda = 0, 1, 100$, while the Python example uses C = 1000000.0, C = 1.0, and C = 0.01.

With C = 1000000.0 (no regularization, overfitting): ex2reg1.png

With C = 1.0: ex2reg2.png

With C = 0.01 (too much regularization, underfitting): ex2reg3.png
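
To reproduce all three plots in one pass, a sketch like the following also works (my refactoring, assuming X, y, PX, PY, and XX from the script above are already defined; the C values are the ones listed above):

# Fit one model per C and overlay each decision boundary in its own color.
for C, color in [(1000000.0, 'r'), (1.0, 'g'), (0.01, 'b')]:
    model = linear_model.LogisticRegression(penalty='l2', C=C)
    model.fit(X, y)
    Z = model.predict_proba(XX)[:, 1].reshape(PX.shape)
    plt.contour(PX, PY, Z, levels=[0.5], colors=color, linewidths=2)
plt.show()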


Addendum: simplified version using a library for polynomial feature generation

In the code above, I wrote my own function mapFeature() to generate the polynomial features, but scikit-learn has a class, sklearn.preprocessing.PolynomialFeatures, that does the same thing, so here I replace mapFeature() with it. The code follows.
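As a minimal check (mine, not in the original), PolynomialFeatures(6) applied to two columns produces the same 28 columns, including the leading column of ones, that mapFeature builds by hand:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree-6 polynomial features of two inputs: 1, x1, x2, x1^2, x1*x2, x2^2, ...
poly = PolynomialFeatures(6)
print(poly.fit_transform(np.array([[1.0, 2.0], [3.0, 4.0]])).shape)  # (2, 28)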

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.preprocessing
from sklearn import linear_model

data = pd.read_csv("ex2data2.txt", header=None)

x1 = np.array(data[0])
x2 = np.array(data[1])
y = np.array(data[2])

#Plot sample data
pos = (y==1) # numpy bool index
neg = (y==0) # numpy bool index
plt.scatter(x1[pos], x2[pos], marker='+', c='b')  # positive examples as '+'
plt.scatter(x1[neg], x2[neg], marker='o', c='y')  # negative examples as 'o'
plt.legend(['y = 1', 'y = 0'], scatterpoints=1)   # order matches the plotting order above
plt.xlabel("Microchip Test 1")
plt.ylabel("Microchip Test 2")

# feature mapping: X is an n x 28 matrix
poly = sklearn.preprocessing.PolynomialFeatures(6)
X = poly.fit_transform(np.c_[x1,x2])

#Logistic regression model with regularization
model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y)

# plot the decision boundary
px = np.arange(-1.0, 1.5, 0.1)
py = np.arange(-1.0, 1.5, 0.1)
PX, PY = np.meshgrid(px, py)  # PX and PY are each 25x25 matrices
XX = poly.fit_transform(np.c_[PX.ravel(), PY.ravel()])  # feature mapping on the flattened grid; XX is a 625x28 matrix
Z = model.predict_proba(XX)[:,1]  # the probability of y=1 is in the second column; Z is a 625-dimensional vector
Z = Z.reshape(PX.shape)  # reshape Z to a 25x25 matrix
plt.contour(PX, PY, Z, levels=[0.5], linewidths=3)  # the Z=0.5 contour is the decision boundary
plt.show()

In conclusion

I'm studying both Python and machine learning, so I'd be happy if you could point out any strange points (^^)
