[PYTHON] 100 amateur language processing knocks: 74

It is a challenge record of Language processing 100 knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 : : Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 8: Machine Learning

In this chapter, the task of classifying sentences into positive (positive) or negative (negative) using the sentence polarity dataset v1.0 of Movie Review Data published by Bo Pang and Lillian Lee (polarity analysis). Work on.

74. Forecast

Using the logistic regression model learned in> 73, implement a program that calculates the polarity label ("+1" for a positive example, "-1" for a negative example) of a given sentence and its prediction probability.

The finished code:

main.py


# coding: utf-8
import codecs
import snowballstemmer
import numpy as np

fname_sentiment = 'sentiment.txt'
fname_features = 'features.txt'
fname_theta = 'theta.npy'
fencoding = 'cp1252'		# Windows-1252

stemmer = snowballstemmer.stemmer('english')

#List of stop words http://xpo6.com/list-of-english-stop-words/From CSV Format
stop_words = (
	'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,'
	'as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,'
	'either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,'
	'him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,'
	'likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,'
	'on,only,or,other,our,own,rather,said,say,says,she,should,since,so,'
	'some,than,that,the,their,them,then,there,these,they,this,tis,to,too,'
	'twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,'
	'will,with,would,yet,you,your').lower().split(',')


def is_stopword(str):
	'''Returns whether the character is a stopword
Equalize case

Return value:
True for stop words, False for different
	'''
	return str.lower() in stop_words


def hypothesis(data_x, theta):
	'''Hypothetical function
	data_For x, use theta for data_Predict y

Return value:
Matrix of predicted values
	'''
	return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))


def extract_features(data, dict_features):
	'''Extracting features from sentences
Dict from text_Extract the features included in features and
	dict_features['(Feature)']Returns a matrix with the position of 1.
The first element is fixed at 1. For weights that do not correspond to features.

Return value:
First element and the position of the corresponding feature+Matrix with 1 as 1
	'''
	data_one_x = np.zeros(len(dict_features) + 1, dtype=np.float64)
	data_one_x[0] = 1		#The first element is fixed and 1, for weights that do not correspond to features.

	for word in data.split(' '):

		#Remove whitespace before and after
		word = word.strip()

		#Stop word removal
		if is_stopword(word):
			continue

		#Stemming
		word = stemmer.stemWord(word)

		#Get index of features, set the corresponding part of the matrix to 1.
		try:
			data_one_x[dict_features[word]] = 1
		except:
			pass		# dict_Ignore features not found in features

	return data_one_x


def load_dict_features():
	'''features.Create a dictionary to read txt and convert features into indexes
Index value is 1 based, features.Matches the line number in txt.

Return value:
A dictionary that converts features into indexes
	'''
	with codecs.open(fname_features, 'r', fencoding) as file_in:
		return {line.strip(): i for i, line in enumerate(file_in, start=1)}


#Reading the feature dictionary
dict_features = load_dict_features()

#Reading learning results
theta = np.load(fname_theta)

#input
review = input('Please enter a review--> ')

#Feature extraction
data_one_x = extract_features(review, dict_features)

#Forecast
h = hypothesis(data_one_x, theta)
if h > 0.5:
	print('label:+1 ({})'.format(h))
else:
	print('label:-1 ({})'.format(1 - h))

Execution result:

I put in the first 3 reviews of "sentiment.txt" made in question 70. By the way, the correct answer of the first and third reviews is positive (+1), and the second is negative (-1).

Execution result


segavvy@ubuntu:~/document/100 language processing knock 2015/74$ python main.py 
Please enter a review--> deep intelligence and a warm , enveloping affection breathe out of every frame .
label:+1 (0.9881093733272299)
segavvy@ubuntu:~/document/100 language processing knock 2015/74$ python main.py 
Please enter a review--> before long , the film starts playing like general hospital crossed with a saturday night live spoof of dog day afternoon .
label:-1 (0.6713196688353891)
segavvy@ubuntu:~/document/100 language processing knock 2015/74$ python main.py 
Please enter a review--> by the time it ends in a rush of sequins , flashbulbs , blaring brass and back-stabbing babes , it has said plenty about how show business has infiltrated every corner of society -- and not always for the better .
label:-1 (0.6339673922580253)

The first and second cases were predicted correctly, but the third case predicted a positive review as a negative one. The probability of the first prediction is 98.8%, so it seems to be a fairly confident prediction. The probability of predicting the third wrong prediction is 63.4%, so it seems that he was not confident.

Forecasting method

Forecasting is OK if you extract the features from the input review and give the result to the hypothesis function to make a prediction. The hypothesis function returns a value between 0 and 1. If the value is greater than 0.5, it is predicted to be positive, and if it is smaller than 0.5, it is predicted to be negative. If it's just 0.5, it doesn't matter which one, but this time I made it negative.

The predicted probability is indicated by the value of the hypothesis function itself. For example, if the hypothesis function returns 0.8, there is an 80% chance that it will be positive. However, if it is negative, the probability is higher when it is closer to 0, so the value obtained by subtracting the value of the hypothesis function from 1 is the prediction probability. For example, if the hypothesis function returns 0.3, there is a 70% chance (= 1-0.3) that it will be negative.

That's all for the 75th knock. If you have any mistakes, I would appreciate it if you could point them out.


Recommended Posts

100 amateur language processing knocks: 41
100 amateur language processing knocks: 71
100 amateur language processing knocks: 24
100 amateur language processing knocks: 50
100 amateur language processing knocks: 70
100 amateur language processing knocks: 62
100 amateur language processing knocks: 60
100 amateur language processing knocks: 92
100 amateur language processing knocks: 30
100 amateur language processing knocks: 06
100 amateur language processing knocks: 84
100 amateur language processing knocks: 81
100 amateur language processing knocks: 33
100 amateur language processing knocks: 46
100 amateur language processing knocks: 88
100 amateur language processing knocks: 89
100 amateur language processing knocks: 40
100 amateur language processing knocks: 45
100 amateur language processing knocks: 43
100 amateur language processing knocks: 55
100 amateur language processing knocks: 22
100 amateur language processing knocks: 61
100 amateur language processing knocks: 94
100 amateur language processing knocks: 54
100 amateur language processing knocks: 04
100 amateur language processing knocks: 63
100 amateur language processing knocks: 78
100 amateur language processing knocks: 08
100 amateur language processing knocks: 42
100 amateur language processing knocks: 19
100 amateur language processing knocks: 73
100 amateur language processing knocks: 75
100 amateur language processing knocks: 83
100 amateur language processing knocks: 95
100 amateur language processing knocks: 96
100 amateur language processing knocks: 72
100 amateur language processing knocks: 79
100 amateur language processing knocks: 23
100 amateur language processing knocks: 05
100 amateur language processing knocks: 00
100 amateur language processing knocks: 02
100 amateur language processing knocks: 37
100 amateur language processing knocks: 21
100 amateur language processing knocks: 68
100 amateur language processing knocks: 11
100 amateur language processing knocks: 90
100 amateur language processing knocks: 74
100 amateur language processing knocks: 66
100 amateur language processing knocks: 28
100 amateur language processing knocks: 64
100 amateur language processing knocks: 34
100 amateur language processing knocks: 36
100 amateur language processing knocks: 77
100 amateur language processing knocks: 01
100 amateur language processing knocks: 16
100 amateur language processing knocks: 27
100 amateur language processing knocks: 10
100 amateur language processing knocks: 03
100 amateur language processing knocks: 82
100 amateur language processing knocks: 69
100 amateur language processing knocks: 53