[PYTHON] Ridge regression with Pyspark's Mllib

Practice Map function in the created Pyspark environment I tried ridge regression.

The original story is from edX's Scalable Machine Learning lecture.

The data can be obtained from the Million Song Dataset.

So, if you write the result first, even though you did a Grid search including the Default value The result is that the regression performed with the Default value is more accurate, which is likely to cause nosebleeds.

First of all, import.

#library import
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from test_helper import Test
import numpy as np

#data load
rawData = sc.textFile('millionsong.txt')

Ridge regression (default)

I used LinearRegressionWithSGD.

samplePoints = rawData.take(2)

#Create LabeledPoint
def parsePoint(line):

    tokens = line.split(',')
    label = tokens[0]
    features = tokens[1:]
    return LabeledPoint(label, features)

parsedData = rawData.map(parsePoint)
print parsedData.take(3)

[LabeledPoint(2001.0, [0.884123733793,0.610454259079,0.600498416968,0.474669212493,0.247232680947,0.357306088914,0.344136412234,0.339641227335,0.600858840135,0.425704689024,0.60491501652,0.419193351817]), LabeledPoint(2001.0, [0.854411946129,0.604124786151,0.593634078776,0.495885413963,0.266307830936,0.261472105188,0.506387076327,0.464453565511,0.665798573683,0.542968988766,0.58044428577,0.445219373624]), LabeledPoint(2001.0, [0.908982970575,0.632063159227,0.557428975183,0.498263761394,0.276396052336,0.312809861625,0.448530069406,0.448674249968,0.649791323916,0.489868662682,0.591908113534,0.4500023818])]

Next, divide the data into Traning, Validation, and Test.

#Devide rawData into Traning, Validation and Test
weights = [.8, .1, .1]
seed = 50
parsedTrainData, parsedValData, parsedTestData = parsedData.randomSplit(weights, seed)

# Fit the model with default values
fitModel = LinearRegressionWithSGD.train(parsedTrainData)
print  fitModel

(weights=[348.391703677,312.158507888,303.514705245,268.768326368,260.265535915,321.082923267,345.636059404,369.96940298,414.587178279,328.611497772,389.972179858,348.42792115], intercept=0.0)

Comparison of measured and predicted values based on this model. The error is about 2 years.

# Prediction 
testPoint = parsedTrainData.take(1)[0]
print testPoint.label

testPrediction = fitModel.predict(testPoint.features)
print testPrediction

2001.0
2003.04838193

Next, a model evaluation using the root mean square error (RMSE) was performed. In addition, we will search for hyperparameters by grid search using RMSE as an evaluation function.

# Define a formula to caluculate RMSE(Root Mean Square Error)
def squaredError(label, prediction):
    ##Calculates the the squared error for a single prediction.
    return (label - prediction)**2

def calcRMSE(labelsAndPreds):
    ##Calculates the root mean squared error for an `RDD` of (label, prediction) tuples.
    return np.sqrt(labelsAndPreds
                   .map(lambda (label, prediction): squaredError(label, prediction))
                   .mean())

#Create new RDD with actual label and predicted label 
labelsAndPreds = parsedValData.map(lambda lp: (lp.label, fitModel.predict(lp.features)))

#Calculation RMSE
rmseValLR1 = calcRMSE(labelsAndPreds)
print rmseValLR1

126.788570325

Ridge regression (grid search)

RMSE is about 126 in Fitting performed with the default value. So, next, I will try to find hyperparameters with grid search.

##Grid search
# Values to use when training the linear regression model
miniBatchFrac = 1.0  # miniBatchFraction
regType = 'l2'  # regType
useIntercept = True  # intercept

# Seed of minmum RMSE
min_rmseVal = 10**10

#Fix HyperParameters
modelRMSEs = []

for grid_alpha in [1e-5,1.0,10,100]:
    for grid_numIters in [50,100,500]:
        for grid_reg in [1e-5,0.0, 1.0]:
            model = LinearRegressionWithSGD.train(parsedTrainData,
                                                  iterations=grid_numIters, 
                                                  step=grid_alpha,
                                                  regParam=grid_reg)
            
            labelsAndPreds = parsedValData.map(lambda lp: (lp.label, model.predict(lp.features)))
            rmseVal = calcRMSE(labelsAndPreds)
            
            
            
            if rmseVal < min_rmseVal:
                min_rmseVal = rmseVal
                best_alpha = grid_alpha
                best_numIter =grid_numIters
                best_reg = grid_reg
            
print "best_alpha:{},best_numIter:{}, best_reg:{}, best_rmseVal:{}".format(best_alpha, best_numIter, best_reg, min_rmseVal)
best_alpha:1.0,best_numIter:500, best_reg:1e-05, best_rmseVal:117.120806943

Here, the RMSE has dropped slightly to about 117, and it can be expected that the accuracy will increase. So, fitting immediately and comparing the measured value and the predicted value as in the first model.

#Fitting with HyperParameters fixed by grid search 
Final_model = LinearRegressionWithSGD.train(parsedTrainData,
                                           iterations=best_numIter, 
                                           step=best_alpha, 
                                           regParam =best_reg)

#Labels comparison between Final_model and actual data. 
Final_testPoint = parsedTrainData.take(1)[0]
print Final_testPoint.label

Final_Prediction = Final_model.predict(Final_testPoint.features)
print Final_Prediction

2001.0
1969.7210425

The accuracy is significantly worse with an error of nearly 30 years. Why? I don't know the cause right away, so I'll rewrite it if I try again.