[PYTHON] 100 amateur language processing knocks: 78

This is a record of my attempt at Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 8: Machine Learning

In this chapter, we work on the task of classifying sentences as positive or negative (polarity analysis), using the sentence polarity dataset v1.0 of Movie Review Data published by Bo Pang and Lillian Lee.

78. 5-fold cross-validation

In the experiments of problems 76-77, the cases used for training were also used for evaluation, so the evaluation is not valid. That is, it measures how well the classifier memorizes the training cases and does not measure the generalization performance of the model. Therefore, compute the accuracy, precision, recall, and F1 score of the polarity classification by 5-fold cross-validation.

The finished code:

main.py


# coding: utf-8
import codecs
import snowballstemmer
import numpy as np

fname_sentiment = 'sentiment.txt'
fname_features = 'features.txt'
fname_result = 'result.txt'
fencoding = 'cp1252'		# Windows-1252

division = 5			#Number of data divisions
learn_alpha = 6.0		#Learning rate
learn_count = 1000		#Number of learning iterations

stemmer = snowballstemmer.stemmer('english')

#List of stop words, from the CSV format at http://xpo6.com/list-of-english-stop-words/
stop_words = (
	'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,'
	'as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,'
	'either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,'
	'him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,'
	'likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,'
	'on,only,or,other,our,own,rather,said,say,says,she,should,since,so,'
	'some,than,that,the,their,them,then,there,these,they,this,tis,to,too,'
	'twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,'
	'will,with,would,yet,you,your').lower().split(',')


def is_stopword(str):
	'''Returns whether the given word is a stop word
	Case is ignored.

	Return value:
	True for a stop word, False otherwise
	'''
	return str.lower() in stop_words


def hypothesis(data_x, theta):
	'''Hypothesis function
	Predicts data_y from data_x using theta

	Return value:
	Matrix of predicted values
	'''
	return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))


def cost(data_x, theta, data_y):
	'''Objective function
	Calculates the difference between the predictions for data_x and the correct answers

	Return value:
	Difference between the predictions and the correct answers
	'''
	m = data_y.size			# Number of data points
	h = hypothesis(data_x, theta)		# Matrix of predicted values for data_y
	j = 1 / m * np.sum(-data_y * np.log(h) -
			(np.ones(m) - data_y) * np.log(np.ones(m) - h))

	return j


def gradient(data_x, theta, data_y):
	'''Gradient calculation for the steepest descent method

	Return value:
	Gradient matrix for theta
	'''
	m = data_y.size			# Number of data points
	h = hypothesis(data_x, theta)		# Matrix of predicted values for data_y
	grad = 1 / m * (h - data_y).dot(data_x)

	return grad


def extract_features(data, dict_features):
	'''Extracts features from a sentence
	Extracts from the text the features contained in dict_features and
	returns a vector whose element at position dict_features['(feature)'] is 1.
	The first element is fixed at 1, for the weight that does not correspond to any feature.

	Return value:
	Vector whose first element and the positions of matching features are set to 1
	'''
	data_one_x = np.zeros(len(dict_features) + 1, dtype=np.float64)
	data_one_x[0] = 1		# The first element is fixed at 1, for the weight that does not correspond to any feature.

	for word in data.split(' '):

		#Remove whitespace before and after
		word = word.strip()

		#Stop word removal
		if is_stopword(word):
			continue

		#Stemming
		word = stemmer.stemWord(word)

		#Get index of features, set the corresponding part of the matrix to 1.
		try:
			data_one_x[dict_features[word]] = 1
		except KeyError:
			pass		# Ignore features not found in dict_features

	return data_one_x


def load_dict_features():
	'''Reads features.txt and creates a dictionary that converts a feature into an index
	Index values are 1-based and match the line numbers in features.txt.

	Return value:
	Dictionary that converts a feature into an index
	'''
	with codecs.open(fname_features, 'r', fencoding) as file_in:
		return {line.strip(): i for i, line in enumerate(file_in, start=1)}


def create_training_set(sentiments, dict_features):
	'''Creates the matrix to be learned and the polarity label matrix from the correct answer data sentiments
	The size of the matrix to be learned is (number of reviews in the correct answer data) x (number of features + 1).
	A column value is 1 if the review contains the corresponding feature, 0 otherwise.
	The column index of each feature is determined by dict_features['(feature)'].
	The first column is always 1, for learning the weight that does not correspond to any feature.
	Features that do not exist in dict_features are ignored.

	The size of the polarity label matrix is (number of reviews) x 1.
	1 for positive content, 0 for negative content.

	Return value:
	Matrix to be learned, polarity label matrix
	'''

	#Initialize matrix with 0
	data_x = np.zeros([len(sentiments), len(dict_features) + 1], dtype=np.float64)
	data_y = np.zeros(len(sentiments), dtype=np.float64)

	for i, line in enumerate(sentiments):

		#Feature extraction
		data_x[i] = extract_features(line[3:], dict_features)

		#Set the polarity label
		if line[0:2] == '+1':
			data_y[i] = 1

	return data_x, data_y


def learn(data_x, data_y, alpha, count):
	'''Learns a logistic regression model

	Return value:
	Trained theta
	'''
	theta = np.zeros(data_x.shape[1])
	c = cost(data_x, theta, data_y)
	print('\t Start learning\tcost:{}'.format(c))

	for i in range(1, count + 1):

		grad = gradient(data_x, theta, data_y)
		theta -= alpha * grad

		#Calculate the cost and the maximum adjustment of theta, and show progress (once every 100 iterations)
		if i % 100 == 0:
			c = cost(data_x, theta, data_y)
			e = np.max(np.absolute(alpha * grad))
			print('\t learning(#{})\tcost:{}\tE:{}'.format(i, c, e))

	c = cost(data_x, theta, data_y)
	e = np.max(np.absolute(alpha * grad))
	print('\t Learning completed(#{}) \tcost:{}\tE:{}'.format(i, c, e))
	return theta


def score(fname):
	'''Calculates scores from the result file
	Reads the result file specified by fname and returns the accuracy, precision, recall, and F1 score.

	Return value:
	Accuracy, precision, recall, F1 score
	'''
	#Read and aggregate the results
	TP = 0		# True Positive: prediction +1, correct answer +1
	FP = 0		# False Positive: prediction +1, correct answer -1
	FN = 0		# False Negative: prediction -1, correct answer +1
	TN = 0		# True Negative: prediction -1, correct answer -1

	with open(fname) as data_file:
		for line in data_file:
			cols = line.split('\t')

			if len(cols) < 3:
				continue

			if cols[0] == '+1':			#Correct answer
				if cols[1] == '+1':		#Expected
					TP += 1
				else:
					FN += 1
			else:
				if cols[1] == '+1':
					FP += 1
				else:
					TN += 1

	#Calculation
	accuracy = (TP + TN) / (TP + FP + FN + TN)		# Accuracy
	precision = TP / (TP + FP)		# Precision
	recall = TP / (TP + FN)		# Recall
	f1 = (2 * recall * precision) / (recall + precision) 	# F1 score

	return accuracy, precision, recall, f1


#Reading the feature dictionary
dict_features = load_dict_features()

#Read correct answer data
with codecs.open(fname_sentiment, 'r', fencoding) as file_in:
	sentiments_all = list(file_in)

#Divide the correct answer data into 5
sentiments = []
unit = int(len(sentiments_all) / division)
for i in range(division):
	sentiments.append(sentiments_all[i * unit:(i + 1) * unit])

#5-fold cross-validation
with open(fname_result, 'w') as file_out:
	for i in range(division):

		print('{}/{}'.format(i + 1, division))

		#Separate correct data for learning and verification
		data_learn = []
		for j in range(division):
			if i == j:
				data_validation = sentiments[j]
			else:
				data_learn += sentiments[j]

		#Creating an array of learning targets and polar labels
		data_x, data_y = create_training_set(data_learn, dict_features)

		#Learning
		theta = learn(data_x, data_y, alpha=learn_alpha, count=learn_count)

		#Verification
		for line in data_validation:

			#Feature extraction
			data_one_x = extract_features(line[3:], dict_features)

			#Forecast, result output
			h = hypothesis(data_one_x, theta)
			if h > 0.5:
				file_out.write('{}\t{}\t{}\n'.format(line[0:2], '+1', h))
			else:
				file_out.write('{}\t{}\t{}\n'.format(line[0:2], '-1', 1 - h))

#Result display
print('\n Learning rate:{}\t Number of learning repetitions:{}'.format(learn_alpha, learn_count))
accuracy, precision, recall, f1 = score(fname_result)
print('Accuracy\t{}\nPrecision\t{}\nRecall\t{}\nF1 score\t{}'.format(
	accuracy, precision, recall, f1
))

Execution result:

Execution result


1/5
Start learning cost: 0.6931471805599453
Learning(#100)	cost:0.46843942718055814	E:0.006388382573910524
Learning(#200)	cost:0.4155300488897057	E:0.003950176267083882
Learning(#300)	cost:0.3855283848183693	E:0.002867235531957132
Learning(#400)	cost:0.3648933651792237	E:0.0022495471367582247
Learning(#500)	cost:0.3493282931816998	E:0.0018583498524543404
Learning(#600)	cost:0.3369232080431452	E:0.0016771358183603987
Learning(#700)	cost:0.32666634898652896	E:0.001528412108716516
Learning(#800)	cost:0.31795919554061053	E:0.0014042508127869423
Learning(#900)	cost:0.31041943220497686	E:0.0012990594970099315
Learning(#1000)	cost:0.30378857681325766	E:0.0012088047599478039
Learning completed(#1000) 	cost:0.30378857681325766	E:0.0012088047599478039
2/5
Start learning cost: 0.6931471805599453
Learning(#100)	cost:0.4741687433335998	E:0.006589814822192543
Learning(#200)	cost:0.42144780985764596	E:0.003908261118677938
Learning(#300)	cost:0.3912183151335336	E:0.002804459291483359
Learning(#400)	cost:0.370303379815077	E:0.0023610369221010326
Learning(#500)	cost:0.354477846021314	E:0.0020514997491309413
Learning(#600)	cost:0.3418460542105294	E:0.0018224684562050484
Learning(#700)	cost:0.33139550986560584	E:0.001645643112098399
Learning(#800)	cost:0.3225230456812948	E:0.0015047097369745835
Learning(#900)	cost:0.31484124228803834	E:0.0013896119787524179
Learning(#1000)	cost:0.3080871067835467	E:0.0012937962132790058
Learning completed(#1000) 	cost:0.3080871067835467	E:0.0012937962132790058
3/5
Start learning cost: 0.6931471805599453
Learning(#100)	cost:0.46891949543978517	E:0.006357216339527686
Learning(#200)	cost:0.41580499264287696	E:0.003532830533162978
Learning(#300)	cost:0.3854553165948075	E:0.0027301913427912735
Learning(#400)	cost:0.3644760512004263	E:0.0022545615099526647
Learning(#500)	cost:0.3485986820681382	E:0.001919021249806922
Learning(#600)	cost:0.3359163761795678	E:0.0016705021198879075
Learning(#700)	cost:0.32541428766128333	E:0.0014797071516709523
Learning(#800)	cost:0.31648958311645375	E:0.0013367387334497819
Learning(#900)	cost:0.3087557956043563	E:0.0012494627215075146
Learning(#1000)	cost:0.3019508027016161	E:0.0011779206121903469
Learning completed(#1000) 	cost:0.3019508027016161	E:0.0011779206121903469
4/5
Start learning cost: 0.6931471805599453
Learning(#100)	cost:0.4725342546493931	E:0.006182597071964639
Learning(#200)	cost:0.4194276723005623	E:0.0034649497530972943
Learning(#300)	cost:0.38918298242842136	E:0.0025501444797361994
Learning(#400)	cost:0.36832204557828535	E:0.0021388621069763788
Learning(#500)	cost:0.3525543611131982	E:0.001855410065711756
Learning(#600)	cost:0.33996964450743344	E:0.0016480855756071824
Learning(#700)	cost:0.32955351095109425	E:0.0014898405345723522
Learning(#800)	cost:0.32070420313966275	E:0.001365069555771408
Learning(#900)	cost:0.3130363527272276	E:0.0012842751555114352
Learning(#1000)	cost:0.3062888953703655	E:0.0012201511930926112
Learning completed(#1000) 	cost:0.3062888953703655	E:0.0012201511930926112
5/5
Start learning cost: 0.6931471805599453
Learning(#100)	cost:0.47367883038307196	E:0.006165844710913304
Learning(#200)	cost:0.42196370471708444	E:0.0038294500786362744
Learning(#300)	cost:0.39242868456409186	E:0.002903639748114128
Learning(#400)	cost:0.3720216436950633	E:0.002348459481805761
Learning(#500)	cost:0.3565815749862366	E:0.0019763223680587666
Learning(#600)	cost:0.34425094991837796	E:0.001708469854933442
Learning(#700)	cost:0.33404185010109005	E:0.0015059837001246833
Learning(#800)	cost:0.32536765342218166	E:0.001357007404701798
Learning(#900)	cost:0.3178523514158344	E:0.0012612117027114012
Learning(#1000)	cost:0.3112409530842421	E:0.0011798784899874886
Learning completed(#1000) 	cost:0.3112409530842421	E:0.0011798784899874886

Learning rate: 6.0 Number of learning repetitions: 1000
Accuracy 0.7483114446529081
Precision 0.749058734939759
Recall 0.7466691686995685
F1 score 0.7478620430410676

What is cross-validation?

When checking the accuracy of a trained model, verifying it with the same data used for training does not tell you its accuracy on unknown data. Moreover, verification with the training data gives good results even for a model that has merely memorized the textbook, so to speak, and cannot generalize (this is called overfitting). Therefore, verification must be done with data that was not used for training. This is called cross-validation.

There seem to be several common methods of cross-validation. The K-fold cross-validation specified in this problem divides the data into K parts, uses one of them for validation and the rest for training, and repeats this K times while switching the validation part, averaging the results. The explanation in "Let's start machine learning, part 21 (final)" on Gijutsu-Hyoronsha's gihyo.jp is easy to understand.
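To make the procedure concrete, here is a minimal sketch of K-fold splitting that is independent of the finished code above; the toy data, K = 5, and the placeholder score are my own arbitrary choices.

# Minimal sketch of K-fold cross-validation (toy data; the "score" is only a placeholder)
data = list(range(10))			# toy dataset
K = 5
unit = len(data) // K			# size of each fold

scores = []
for k in range(K):
	validation = data[k * unit:(k + 1) * unit]			# the k-th block is held out for validation
	training = data[:k * unit] + data[(k + 1) * unit:]	# the remaining blocks are used for training
	# ... train on training, evaluate on validation here ...
	scores.append(len(validation) / len(data))			# placeholder for a real metric
print(sum(scores) / K)			# the K scores are averaged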

Coursera's Machine Learning also recommended another method: dividing the data into three parts, training data, adjustment data, and test data, training with the training data, tuning with the adjustment data, and finally checking the accuracy of the result with the test data. For more information, katsu1110's summary "Coursera Machine Learning (6): Machine Learning Model Evaluation (Cross-Validation, Bias & Variance, Precision & Recall)" is easy to understand.
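A minimal sketch of that three-way split follows; the 60/20/20 ratio is my own assumption here, not something prescribed by the course or this article.

import random

# Sketch of a train / adjustment / test split on toy data
data = list(range(100))			# toy dataset
random.shuffle(data)			# shuffle before splitting

n = len(data)
train = data[:int(n * 0.6)]					# used to fit the model
adjust = data[int(n * 0.6):int(n * 0.8)]	# used to tune hyperparameters and model choices
test = data[int(n * 0.8):]					# used only for the final accuracy check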

Implementation of 5-fold cross-validation

I simply divided the data into 5 parts and implemented it. There are 10,662 cases in total, so each fold gets 10,662 / 5 = 2,132 cases with 2 left over; handling those 2 leftover cases seemed like more trouble than it was worth, so I simply ignored them ^^;

Note that all of the functions are reused from the previous problems. For the score calculation, in order to use score() from Problem 77 as is, the prediction results of the 5 rounds are appended in the format of Problem 76 (http://qiita.com/segavvy/items/e107f764534f01c5b105) to a single file, producing one big result file that is then fed to score().

About improving accuracy

Unfortunately, the accuracy has dropped to just under 75%. That feels a little low, but I will move on to the next problem for now.

There are various means of improving the accuracy: adding features, adding data, introducing a regularization parameter, adding polynomial terms to the hypothesis function, reviewing how features are extracted, and so on. However, trying things at random on your own tends to just waste time, and the accuracy is unlikely to improve easily.

In Coursera's Machine Learning, the advice was that the first step toward improvement is to determine whether the cause of the poor accuracy is overfitting (too dependent on the training data, high variance) or underfitting (not making full use of the training data, high bias). This can be diagnosed by introducing a regularization parameter, and the result determines the approach to improvement. You can learn about this kind of validation and improvement in Week 6 of Machine Learning.
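As a rough idea of what introducing a regularization parameter could look like, here is a sketch that adds an L2 penalty to the cost() and gradient() above; the hyperparameter lam is hypothetical and this is not part of the finished code.

import numpy as np

def cost_regularized(data_x, theta, data_y, lam):
	'''Sketch: cost() above plus an L2 penalty.
	lam is a hypothetical regularization strength; theta[0] (the bias weight) is not penalized.'''
	m = data_y.size
	h = 1.0 / (1.0 + np.exp(-data_x.dot(theta)))
	j = 1 / m * np.sum(-data_y * np.log(h) -
			(np.ones(m) - data_y) * np.log(np.ones(m) - h))
	j += lam / (2 * m) * np.sum(theta[1:] ** 2)
	return j


def gradient_regularized(data_x, theta, data_y, lam):
	'''Sketch: gradient() above plus the derivative of the L2 penalty.'''
	m = data_y.size
	h = 1.0 / (1.0 + np.exp(-data_x.dot(theta)))
	grad = 1 / m * (h - data_y).dot(data_x)
	grad[1:] += lam / m * theta[1:]
	return grad

Sweeping lam while watching the training and validation scores is what distinguishes the two cases: a large lam pushes the model toward high bias, while lam = 0 leaves it free to overfit.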

That's all for the 78th knock. If there are any mistakes, I would appreciate it if you could point them out.

