This is a record of my attempt at Language Processing 100 Knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of my past knocks is here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
In this chapter, we work on the task of classifying sentences as positive or negative (polarity analysis) using the sentence polarity dataset v1.0 of Movie Review Data published by Bo Pang and Lillian Lee.
Check the top 10 features with the highest weights and the top 10 features with the lowest weights in the logistic regression model learned in Problem 73.
main.py
```python
# coding: utf-8
import codecs
import numpy as np

fname_features = 'features.txt'
fname_theta = 'theta.npy'
fencoding = 'cp1252'        # Windows-1252

# Read the features
with codecs.open(fname_features, 'r', fencoding) as file_in:
    features = list(file_in)

# Read the learning results (weights)
theta = np.load(fname_theta)

# Build an array of indexes sorted by weight (ascending)
index_sorted = np.argsort(theta)

# Show the top 10 and the worst 10
print('top 10')
for index in index_sorted[:-11:-1]:
    print('\t{}\t{}'.format(theta[index],
            features[index - 1].strip() if index > 0 else '(none)'))

print('worst 10')
for index in index_sorted[:10]:
    print('\t{}\t{}'.format(theta[index],
            features[index - 1].strip() if index > 0 else '(none)'))
```
Execution result

```
top 10
2.5513381450571875 refresh
2.3661550679971954 engross
2.1686400756091198 unexpect
1.9558595013013638 examin
1.921611726928927 remark
1.8642762301453122 glorious
1.6736177761639448 quiet
1.6361584092330672 delight
1.6264395695012035 confid
1.6207851665872708 beauti
worst 10
-2.6661835195544206 bore
-2.381809993645082 dull
-2.264732545707236 wast
-2.0944221636736557 fail
-2.043315628825945 flat
-1.9875250134818985 mediocr
-1.921981567258377 worst
-1.9199082235005225 suppos
-1.9103686908457609 routin
-1.8511208689897838 appar
```
The feature weight is the value of $\theta$ itself. When a weight is large and positive, the value of the hypothesis function moves toward the positive side when that word appears; when it is negative, the value moves toward the negative side. Because the features were stemmed, some words are cut off partway through, but the top 10 look like positive words and the worst 10 like negative words, so it seems the feature weights were learned as expected.
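For reference, in the standard logistic regression formulation (stated here from general knowledge; see Problem 73 for the actual implementation), the hypothesis function is

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

so a large positive $\theta_j$ pushes $h_\theta(x)$ toward 1 (positive) when feature $j$ is present, and a negative $\theta_j$ pushes it toward 0 (negative).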
All we have to do is sort by feature weight, but it requires a little ingenuity.

In the code, the feature weights $\theta$ are stored in the array `theta`, and the corresponding feature strings are stored in `features`. If you simply sort `theta` by value, only `theta` gets sorted: you can see the weight values of the top 10 and the worst 10, but you cannot tell which features they correspond to. Therefore, `features` must be sorted in tandem, which is a bit annoying.

What comes in handy here is `numpy.argsort()`. It sorts the target array, but instead of returning the sorted values, it returns the indexes that would sort the array. For example, when sorting in ascending order as we do here, the return value has the form [index of the smallest element, index of the second smallest, ..., index of the largest element]. Using these indexes, you can look up the corresponding elements in the `features` array. Very convenient!
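Here is a minimal, self-contained illustration of `numpy.argsort()` with made-up weights and feature names (not the actual data from this problem):

```python
import numpy as np

# Made-up weights and the feature strings they correspond to
weights = np.array([1.5, -2.0, 0.1])
names = ['delight', 'bore', 'routin']

order = np.argsort(weights)        # array([1, 2, 0]): indexes in ascending order of weight
for i in order:
    print(weights[i], names[i])    # the same index works for both arrays
# -2.0 bore
# 0.1 routin
# 1.5 delight
```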
One thing to note is that `theta` is an array of 3,228 elements, while `features` has only 3,227. As explained in the vectorization part of Problem 73, the first element of `theta` is the weight corresponding to "no features" (i.e., the bias term), so the two arrays are off by one. Therefore, subtract 1 from the index when looking up the corresponding element of `features`; if the index is 0, the weight is the one corresponding to "no features".
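To make the shift concrete, here is a toy sketch with made-up values, assuming `theta[0]` is the "no features" (bias) weight as described above:

```python
import numpy as np

# theta has a leading "no features" (bias) weight; features does not,
# so features[i] corresponds to theta[i + 1].
theta_toy = np.array([0.3, -2.0, 1.5])   # [bias, weight of 'bore', weight of 'delight']
features_toy = ['bore', 'delight']

for index in np.argsort(theta_toy)[::-1]:    # descending order of weight
    name = features_toy[index - 1] if index > 0 else '(none)'
    print(theta_toy[index], name)
# 1.5 delight
# 0.3 (none)
# -2.0 bore
```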
That's all for the 75th knock. If you find any mistakes, I would appreciate it if you could point them out.