This is a record of my attempt at Language Processing 100 Knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of my past knocks is here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
In this chapter, we work on the task of classifying sentences as positive or negative (polarity analysis) using the sentence polarity dataset v1.0 of Movie Review Data published by Bo Pang and Lillian Lee.
Check the top 10 features with the highest weights and the top 10 features with the lowest weights in the logistic regression model learned in Problem 73.
main.py
```python
# coding: utf-8
import codecs
import numpy as np

fname_features = 'features.txt'
fname_theta = 'theta.npy'
fencoding = 'cp1252'        # Windows-1252

# Read the features
with codecs.open(fname_features, 'r', fencoding) as file_in:
    features = list(file_in)

# Read the learning results (weights)
theta = np.load(fname_theta)

# Build an array of indexes sorted by weight (ascending)
index_sorted = np.argsort(theta)

# Show the top 10 and the worst 10
print('top 10')
for index in index_sorted[:-11:-1]:
    print('\t{}\t{}'.format(theta[index],
            features[index - 1].strip() if index > 0 else '(none)'))

print('worst 10')
for index in index_sorted[:10]:
    print('\t{}\t{}'.format(theta[index],
            features[index - 1].strip() if index > 0 else '(none)'))
```
Execution result

```
top 10
2.5513381450571875 refresh
2.3661550679971954 engross
2.1686400756091198 unexpect
1.9558595013013638 examin
1.921611726928927 remark
1.8642762301453122 glorious
1.6736177761639448 quiet
1.6361584092330672 delight
1.6264395695012035 confid
1.6207851665872708 beauti
worst 10
-2.6661835195544206 bore
-2.381809993645082 dull
-2.264732545707236 wast
-2.0944221636736557 fail
-2.043315628825945 flat
-1.9875250134818985 mediocr
-1.921981567258377 worst
-1.9199082235005225 suppos
-1.9103686908457609 routin
-1.8511208689897838 appar
```
The feature weight is the value of $\theta$ itself. When a weight is large and positive, the value of the hypothesis function moves toward the positive side when that word appears; when it is negative, the value moves toward the negative side. Because the features were stemmed, some words are cut off partway through, but the top 10 look like positive words and the worst 10 like negative words, so it seems the feature weights were learned as expected.
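For reference, in the standard logistic regression formulation (stated here from general knowledge; see Problem 73 for the actual implementation), the hypothesis function is

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

so a large positive $\theta_j$ pushes $h_\theta(x)$ toward 1 (positive) when feature $j$ is present, and a negative $\theta_j$ pushes it toward 0 (negative).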
All we have to do is sort by feature weight, but it requires a little ingenuity.

In the code, the feature weights $\theta$ are stored in the array `theta`, and the corresponding feature strings are stored in `features`. If you simply sort `theta` by value, only `theta` gets sorted: you can see the weight values of the top 10 and the worst 10, but you cannot tell which features they correspond to. Therefore, `features` must be sorted in tandem, which is a bit annoying.

What comes in handy here is `numpy.argsort()`. It sorts the target array, but instead of returning the sorted values, it returns the indexes that would sort the array. For example, when sorting in ascending order as we do here, the return value has the form [index of the smallest element, index of the second smallest, ..., index of the largest element]. Using these indexes, you can look up the corresponding elements in the `features` array. Very convenient!
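Here is a minimal, self-contained illustration of `numpy.argsort()` with made-up weights and feature names (not the actual data from this problem):

```python
import numpy as np

# Made-up weights and the feature strings they correspond to
weights = np.array([1.5, -2.0, 0.1])
names = ['delight', 'bore', 'routin']

order = np.argsort(weights)        # array([1, 2, 0]): indexes in ascending order of weight
for i in order:
    print(weights[i], names[i])    # the same index works for both arrays
# -2.0 bore
# 0.1 routin
# 1.5 delight
```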
One thing to note is that `theta` is an array of 3,228 elements, while `features` has only 3,227. As explained in the vectorization part of Problem 73, the first element of `theta` is the weight corresponding to "no features" (i.e., the bias term), so the two arrays are off by one. Therefore, subtract 1 from the index when looking up the corresponding element of `features`; if the index is 0, the weight is the one corresponding to "no features".
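To make the shift concrete, here is a toy sketch with made-up values, assuming `theta[0]` is the "no features" (bias) weight as described above:

```python
import numpy as np

# theta has a leading "no features" (bias) weight; features does not,
# so features[i] corresponds to theta[i + 1].
theta_toy = np.array([0.3, -2.0, 1.5])   # [bias, weight of 'bore', weight of 'delight']
features_toy = ['bore', 'delight']

for index in np.argsort(theta_toy)[::-1]:    # descending order of weight
    name = features_toy[index - 1] if index > 0 else '(none)'
    print(theta_toy[index], name)
# 1.5 delight
# 0.3 (none)
# -2.0 bore
```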
That's all for the 75th knock. If you find any mistakes, I would appreciate it if you could point them out.