In this article, I explain scikit-learn by working through Chapter 6 of Language Processing 100 Knocks.
First, run pip install scikit-learn.
Download the News Aggregator Data Set and create training data (train.txt), verification data (valid.txt), and evaluation data (test.txt) as follows.
Unzip the downloaded zip file and read the explanation in readme.txt. Extract only the cases (articles) whose information source (publisher) is “Reuters”, “Huffington Post”, “Businessweek”, “Contactmusic.com”, or “Daily Mail”. Randomly shuffle the extracted cases. Use 80% of the extracted cases as training data and split the remaining 20% evenly, 10% each, into verification data and evaluation data, and save them with the file names train.txt, valid.txt, and test.txt, respectively. Write one case per line in each file, in a tab-delimited format of category name and article headline (these files will be reused later in Problem 70).
After creating the training data and evaluation data, check the number of cases in each category.
This problem has nothing to do with scikit-learn, so you can solve it the way you like. First of all, download the file and read readme.txt.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip -c NewsAggregatorDataset.zip readme.txt
Reading the readme this way is fine, but I want to handle compressed files without decompressing them as far as possible. The zip entry that holds the data itself can be handled with the zipfile module. Any way of reading it will do, but here I think pandas is the easiest. Leave the splitting to sklearn.model_selection.train_test_split(), which also shuffles the data.
As a basic point, the library is called scikit-learn, but the module name you import is sklearn.
import csv
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the TSV directly from inside the zip archive
with zipfile.ZipFile("NewsAggregatorDataset.zip") as z:
    with z.open("newsCorpora.csv") as f:
        names = ('ID','TITLE','URL','PUBLISHER','CATEGORY','STORY','HOSTNAME','TIMESTAMP')
        df = pd.read_table(f, names=names, quoting=csv.QUOTE_NONE)

# Keep only the five publishers named in the problem
publisher_set = {"Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"}
df = df[df['PUBLISHER'].isin(publisher_set)]

# 80% train, then split the remaining 20% in half for valid and test
df, valid_test_df = train_test_split(df, train_size=0.8, random_state=0)
df.to_csv('train.txt', columns=('CATEGORY','TITLE'), sep='\t', header=False, index=False)
valid_df, test_df = train_test_split(valid_test_df, test_size=0.5, random_state=0)
valid_df.to_csv('valid.txt', columns=('CATEGORY','TITLE'), sep='\t', header=False, index=False)
test_df.to_csv('test.txt', columns=('CATEGORY','TITLE'), sep='\t', header=False, index=False)
pandas.read_table() reads a TSV file and creates a DataFrame object. names sets the column names. quoting=csv.QUOTE_NONE tells the parser to treat quotation marks as ordinary characters. csv.QUOTE_NONE is just the integer 3, so writing 3 gives the same result.
(read_table() was deprecated at one point, so you can also use read_csv(sep='\t'), but the deprecation seems to have been withdrawn, since no warning is emitted.)
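To make the effect of quoting=csv.QUOTE_NONE concrete, here is a small sketch on an in-memory TSV. The two headlines are made up for illustration; the point is only that quotation marks, even unbalanced ones, are kept as ordinary characters and every line stays one row.

```python
import csv
import io

import pandas as pd

# Hypothetical two-line TSV; the second headline has an unbalanced quote.
tsv = 'b\t"Fed" keeps rates on hold\ne\tStar says "no comment\n'
names = ["CATEGORY", "TITLE"]

# With QUOTE_NONE, quote characters are plain text and never span lines.
df = pd.read_table(io.StringIO(tsv), names=names, quoting=csv.QUOTE_NONE)

# csv.QUOTE_NONE is the integer 3, so passing 3 behaves the same way.
df_same = pd.read_table(io.StringIO(tsv), names=names, quoting=3)

print(df["TITLE"].tolist())
print(df.equals(df_same))
```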
The df['PUBLISHER'] part extracts a column, and its return value is of type Series. In pandas, the DataFrame type represents the structure of the whole table, and each column is represented by the Series type. The Series method isin() returns a Series of booleans holding the result of the in operation for each element. If you pass that Series to df as if it were a key, you get back a DataFrame containing only the rows where it is True.
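The filtering step can be sketched on a tiny table. The rows below, including the publisher "Example Times", are made up for illustration; only isin() and boolean indexing are the point.

```python
import pandas as pd

# Toy table; the publisher "Example Times" is a made-up name.
df = pd.DataFrame({
    "PUBLISHER": ["Reuters", "Example Times", "Daily Mail"],
    "TITLE": ["headline A", "headline B", "headline C"],
})

# isin() returns a boolean Series, one value per row.
mask = df["PUBLISHER"].isin({"Reuters", "Daily Mail"})

# Indexing the DataFrame with that Series keeps only the True rows.
filtered = df[mask]

print(mask.tolist())               # [True, False, True]
print(filtered["TITLE"].tolist())  # ['headline A', 'headline C']
```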
names = ('CATEGORY','TITLE')
df = pd.read_table('train.txt', names=names, quoting=csv.QUOTE_NONE)
df['CATEGORY'].value_counts()
b 4503
e 4254
t 1210
m 717
Name: CATEGORY, dtype: int64
df = pd.read_table('test.txt', names=names, quoting=csv.QUOTE_NONE)
df['CATEGORY'].value_counts()
b 565
e 518
t 163
m 90
Name: CATEGORY, dtype: int64
Extract the features from the training data, verification data, and evaluation data, and save them with the file names train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Feel free to design the features that are likely to be useful for categorization. The minimum baseline would be an article headline converted to a word sequence.
This problem does not say that the extracted features should be converted into a vector (matrix). It seems the features are meant to be saved in a human-readable format so that they can be used for error analysis later.
(If you use scikit-learn's CountVectorizer, feature extraction and vectorization are done together, which does not fit this problem well.)
Therefore, the plan is to extract the features ourselves, build a dictionary object, save it, and then vectorize it with DictVectorizer in the next problem. Each dictionary key is a feature name and each value is 1.0, so these are binary features. Building a dictionary from a headline is also needed at inference time, so it is made a function.
The format for saving features is not specified, but I think the JSON Lines (jsonl) format is a good choice for readability.
I want to separate commas and quotes from words. It doesn't matter how you do it. spaCy is well known as a tokenizer, but I think the tokenizer of CountVectorizer is also adequate for this problem.
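As a quick look at what that tokenizer does, the sketch below runs it on two short strings. The first headline is made up for illustration. With the default settings, build_tokenizer() keeps only runs of two or more word characters and does not lowercase, so punctuation and one-letter tokens disappear.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default tokenizer: findall with the pattern r"(?u)\b\w\w+\b"
tokenize = CountVectorizer().build_tokenizer()

# Made-up headline; the apostrophe and comma act as separators.
tokens_a = tokenize("Fed's rate hike, explained")
# One-letter words such as "I" and "a" are dropped.
tokens_b = tokenize("I have a dog.")

print(tokens_a)  # ['Fed', 'rate', 'hike', 'explained']
print(tokens_b)  # ['have', 'dog']
```

If one-letter tokens matter for your features, a different token_pattern or tokenizer would be needed; the article's code uses the default.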
q51.py
import argparse
import json
from sklearn.feature_extraction.text import CountVectorizer


def ngram_gen(seq, n):
    # Zip n shifted copies of seq to get consecutive n-grams
    return zip(*(seq[i:] for i in range(n)))


nlp = CountVectorizer().build_tokenizer()


def make_feats_dict(title):
    # Binary unigram, bigram, and trigram features
    words = nlp(title)
    feats = {}
    for token in words:
        feats[token] = 1.0
    for bigram in ngram_gen(words, 2):
        feats[' '.join(bigram)] = 1.0
    for trigram in ngram_gen(words, 3):
        feats[' '.join(trigram)] = 1.0
    return feats


def dump_features(input_file, output_file):
    # Write one JSON object (label plus features) per line
    with open(input_file) as fi, open(output_file, 'w') as fo:
        for line in fi:
            vals = line.rstrip().split('\t')
            label, title = vals
            feats = {'**LABEL**': label}
            feats.update(make_feats_dict(title))
            print(json.dumps(feats), file=fo)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file')
    parser.add_argument('output_file')
    args = vars(parser.parse_args())
    dump_features(**args)


if __name__ == '__main__':
    main()
Overwriting q51.py
!python q51.py test.txt test.feature.txt
!python q51.py valid.txt valid.feature.txt
!python q51.py train.txt train.feature.txt
ngram_gen() builds n-grams. For bigrams, it transposes [[I, am, an, NLPer], [am, an, NLPer]] (zip stops at the shorter sequence), which is a rather elegant way to do it.
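The transposition trick is easy to check on the four-word example from the text. This sketch copies ngram_gen() as defined in q51.py and simply lists what it yields.

```python
def ngram_gen(seq, n):
    # zip(*...) transposes the n shifted copies of seq;
    # zip stops at the shortest copy, so no padding is needed.
    return zip(*(seq[i:] for i in range(n)))


words = ["I", "am", "an", "NLPer"]
bigrams = list(ngram_gen(words, 2))
trigrams = list(ngram_gen(words, 3))

print(bigrams)   # [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
print(trigrams)  # [('I', 'am', 'an'), ('am', 'an', 'NLPer')]
```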
(The label is not a feature, but I write it out because it will ease the next problem.)
The main function does something slightly unusual: it uses [Unpacking the argument list](https://docs.python.org/ja/3/tutorial/controlflow.html#unpacking-argument-lists), which also appeared in Chapter 4, to pass the keyword arguments as a dictionary. Since the return value of parse_args() is a namespace object, it is converted to a dictionary with vars() (which appeared in Chapter 5).
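The vars() plus ** combination can be tried without touching any files. In this sketch, dump_features() is a stand-in that only returns a string, and the argument list is passed to parse_args() explicitly instead of coming from the command line.

```python
import argparse


def dump_features(input_file, output_file):
    # Stand-in for the real function: just report what it received.
    return f"{input_file} -> {output_file}"


parser = argparse.ArgumentParser()
parser.add_argument("input_file")
parser.add_argument("output_file")

# Explicit argument list so the sketch runs anywhere
args = parser.parse_args(["test.txt", "test.feature.txt"])

kwargs = vars(args)                # Namespace -> dict
result = dump_features(**kwargs)   # dict -> keyword arguments

print(kwargs)
print(result)
```

This only works because the argparse destination names match the function's parameter names.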
Learn the logistic regression model using the training data constructed in Problem 51.
First, from the file created in Problem 51, build a list X of dictionaries representing the features. To feed it into a machine learning model, we need vectors that list the values of all features, so use DictVectorizer(). Its fit(X) method learns the mapping from feature names to column indices from X and stores it inside the instance. Then transform(X) converts X into a matrix (a SciPy sparse matrix by default). fit_transform(X) does both at once.
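A minimal sketch of that fit/transform behavior, using two made-up feature dictionaries. It also shows that a feature name not seen during fit() is silently ignored by transform(), which is why the fitted vectorizer has to be reused at inference time.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction import DictVectorizer

# Two toy feature dictionaries (made up for illustration)
X = [{"dog": 1.0, "cat": 1.0}, {"dog": 1.0}]

vectorizer = DictVectorizer()
X_mat = vectorizer.fit_transform(X)   # fit() and transform() in one call

print(vectorizer.vocabulary_)         # {'cat': 0, 'dog': 1}
dense = X_mat.toarray()
print(dense)

# "bird" was never seen during fit, so its column does not exist.
unseen = vectorizer.transform([{"bird": 1.0}]).toarray()
print(unseen)
```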
Then use LogisticRegression(). Simply instantiate it and call the fit(X, y) method, and the weight vector is learned inside the instance. Hyperparameters are set at instantiation. X is matrix-like, y is list-like, and it is fine as long as their lengths match.
Save the learned model by referring to Model persistence. When using joblib.dump(), a large number of files may be generated unless the optional argument compress is specified, so be careful.
If you do not also save the mapping between feature names and indices, you will be in trouble at inference time. Let's dump the whole DictVectorizer instance as well.
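As a rough sketch of the save-and-reload round trip, the following trains on four made-up feature dictionaries, dumps both objects into a temporary directory with compress=3, and checks that the reloaded pair predicts the same label. The file names and toy data are assumptions for illustration only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up): two business and two entertainment items
X = [{"stocks": 1.0, "rise": 1.0}, {"stocks": 1.0, "fall": 1.0},
     {"film": 1.0, "star": 1.0}, {"film": 1.0, "award": 1.0}]
y = ["b", "b", "e", "e"]

vectorizer = DictVectorizer()
clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(vectorizer.fit_transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy.model")
    feats_path = os.path.join(tmp, "toy.feats.joblib")
    # One compressed file per object
    joblib.dump(clf, model_path, compress=3)
    joblib.dump(vectorizer, feats_path, compress=3)
    files = sorted(os.listdir(tmp))
    clf2 = joblib.load(model_path)
    vectorizer2 = joblib.load(feats_path)

# The reloaded pair should behave like the original pair.
query = [{"film": 1.0, "premiere": 1.0}]
pred_before = clf.predict(vectorizer.transform(query))
pred_after = clf2.predict(vectorizer2.transform(query))
print(files, pred_before, pred_after)
```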
q52.py
import argparse
import json
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import joblib


def argparse_imf():
    # Shared options: input features, model file, vectorizer file
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input')
    parser.add_argument('-m', '--model')
    parser.add_argument('-f', '--feats')
    args = parser.parse_args()
    return args


def load_xy(filename):
    # Read the jsonl file and separate the label from the features
    X = []
    y = []
    with open(filename) as f:
        for line in f:
            dic = json.loads(line)
            y.append(dic.pop('**LABEL**'))
            X.append(dic)
    return X, y


def main():
    args = argparse_imf()
    X_train, y_train = load_xy(args.input)

    vectorizer = DictVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    y_train = np.array(y_train)
    clf = LogisticRegression(random_state=0, max_iter=1000, verbose=1)
    clf.fit(X_train, y_train)

    # Save both the model and the fitted vectorizer
    joblib.dump(clf, args.model, compress=3)
    joblib.dump(vectorizer, args.feats, compress=3)


if __name__ == '__main__':
    main()
Overwriting q52.py
!python q52.py -i train.feature.txt -m train.logistic.model -f train.feature.joblib
Use the logistic regression model learned in Problem 52 and implement a program that calculates the category and its prediction probability from a given article headline.
I take the "given article headline" in this problem to mean not the test data created above but an arbitrary headline to predict from.
If you load the saved model and call predict(X), you get the labels, and if you call predict_proba(X), you get the prediction probabilities. This X is obtained by building a feature dictionary from the input and converting it with the DictVectorizer saved in Problem 52.
If you enter two titles and apply predict_proba(), you get a numpy.ndarray like this.
>>> y_proba
array([[0.24339871, 0.54111814, 0.10059608, 0.11488707],
[0.19745579, 0.69644375, 0.04204659, 0.06405386]])
Predicted probabilities come out for all labels, but you probably want only the maximum for each headline. What should we do? ndarray seems to have a max() method ...
>>> y_proba.max()
0.6964437549683299
>>> y_proba.max(axis=0)
array([0.24339871, 0.69644375, 0.10059608, 0.11488707])
Neither call gives one value per headline. Let's work it out. Below is an example of the answer.
q53.py
import argparse
import json
import sys
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import joblib
from q51 import make_feats_dict
from q52 import argparse_imf, load_xy


def predict_label_proba(X, vectorizer, clf):
    X = vectorizer.transform(X)
    y_proba = clf.predict_proba(X)
    # Take the most probable class and its probability for each row
    y_pred = clf.classes_[y_proba.argmax(axis=1)]
    y_proba_max = y_proba.max(axis=1)
    return y_pred, y_proba_max


def main():
    args = argparse_imf()
    vectorizer = joblib.load(args.feats)
    clf = joblib.load(args.model)
    # One headline per line on standard input
    X = list(map(make_feats_dict, sys.stdin))
    y_pred, y_proba = predict_label_proba(X, vectorizer, clf)
    for label, proba in zip(y_pred, y_proba):
        print('%s\t%.4f' % (label, proba))


if __name__ == '__main__':
    main()
Overwriting q53.py
!echo 'I have a dog.' | python q53.py -m train.logistic.model -f train.feature.joblib
e 0.5441
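The axis handling inside predict_label_proba() can be checked in isolation with the two-row array shown earlier. The class order used below is an assumption: scikit-learn stores classes_ in sorted order, so for the labels in this data it would be b, e, m, t.

```python
import numpy as np

# The two-headline probability array from the text
y_proba = np.array([[0.24339871, 0.54111814, 0.10059608, 0.11488707],
                    [0.19745579, 0.69644375, 0.04204659, 0.06405386]])

# Assumed class order (classes_ is sorted by label)
classes = np.array(["b", "e", "m", "t"])

# axis=1 reduces across the columns, giving one value per row (headline)
row_max = y_proba.max(axis=1)
row_argmax = y_proba.argmax(axis=1)
labels = classes[row_argmax]

print(row_max)   # [0.54111814 0.69644375]
print(labels)    # ['e' 'e']
```

By contrast, max() with no axis reduces the whole array to one number, and axis=0 gives one value per class column.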
Measure the accuracy of the logistic regression model learned in Problem 52 on the training data and evaluation data.
You could implement it by hand, but I'll leave it to sklearn.metrics.accuracy_score().
The most important thing in learning scikit-learn is the flow so far:
1. Convert a list of dict features into a matrix with DictVectorizer.fit_transform()
2. Instantiate LogisticRegression
3. fit(X_train, y_train)
4. predict(X_test)
Make sure you have this down.
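The whole flow fits in a few lines. This is a minimal sketch on made-up feature dictionaries, not the article's data; the one detail worth noting is that the test data goes through transform(), not fit_transform(), so it uses the column mapping learned from the training data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy data (made up): "b" = business, "e" = entertainment
X_train = [{"stocks": 1.0, "rally": 1.0}, {"stocks": 1.0, "bank": 1.0},
           {"film": 1.0, "singer": 1.0}, {"film": 1.0, "album": 1.0}]
y_train = ["b", "b", "e", "e"]
X_test = [{"stocks": 1.0, "slump": 1.0}, {"film": 1.0, "premiere": 1.0}]
y_test = ["b", "e"]

# 1. dicts -> matrix
vectorizer = DictVectorizer()
X_train_mat = vectorizer.fit_transform(X_train)

# 2. and 3. instantiate and fit
clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X_train_mat, y_train)

# 4. transform (not fit_transform) the test data, then predict
X_test_mat = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_mat)

acc = accuracy_score(y_test, y_pred)
print(y_pred, acc)
```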
q54.py
import argparse
import json
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib
from q52 import argparse_imf, load_xy


def predict(args):
    # Load features, vectorize with the saved vectorizer, and predict
    X_test, y_true = load_xy(args.input)
    vectorizer = joblib.load(args.feats)
    X_test = vectorizer.transform(X_test)
    y_true = np.array(y_true)
    clf = joblib.load(args.model)
    y_pred = clf.predict(X_test)
    return y_true, y_pred


def main():
    args = argparse_imf()
    y_true, y_pred = predict(args)
    accuracy = accuracy_score(y_true, y_pred) * 100
    print('Accuracy: %.3f' % accuracy)


if __name__ == '__main__':
    main()
Overwriting q54.py
!python q54.py -i train.feature.txt -m train.logistic.model -f train.feature.joblib
Accuracy: 99.897
!python q54.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
Accuracy: 87.275
Create a confusion matrix of the logistic regression model learned in Problem 52 on the training data and evaluation data.
Leave it to sklearn.metrics.confusion_matrix().
q55.py
from sklearn.metrics import confusion_matrix
from q52 import argparse_imf
from q54 import predict


def main():
    args = argparse_imf()
    y_true, y_pred = predict(args)
    # Fix the row/column order explicitly
    labels = ('b', 'e', 't', 'm')
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    print(labels)
    print(matrix)


if __name__ == '__main__':
    main()
Overwriting q55.py
!python q55.py -i train.feature.txt -m train.logistic.model -f train.feature.joblib
('b', 'e', 't', 'm')
[[4499 1 3 0]
[ 2 4252 0 0]
[ 3 1 1206 0]
[ 0 1 0 716]]
!python q55.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
('b', 'e', 't', 'm')
[[529 26 10 0]
[ 13 503 2 0]
[ 37 36 89 1]
[ 19 26 0 45]]
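To read these matrices, it helps to remember that in confusion_matrix() each row is a true label and each column a predicted label, in the order given by labels. A toy example with made-up labels shows the layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for four items
y_true = ["b", "b", "e", "t"]
y_pred = ["b", "e", "e", "b"]
labels = ["b", "e", "t"]

matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)
# Row "b": one predicted as b, one as e
# Row "e": one predicted as e
# Row "t": one predicted as b
```

Read this way, the last row of the test matrix above says that 45 of the 90 "m" articles were classified correctly, while 19 went to "b" and 26 to "e".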
Measure the precision, recall, and F1 score of the logistic regression model learned in Problem 52 on the evaluation data. Obtain the precision, recall, and F1 score for each category, and aggregate the per-category performance with the micro-average and macro-average.
Leave it to sklearn.metrics.classification_report(). In multi-class (single-label) classification, the micro-average over all classes equals the accuracy (Reference).
q56.py
from sklearn.metrics import classification_report
from q52 import argparse_imf
from q54 import predict


def main():
    args = argparse_imf()
    y_true, y_pred = predict(args)
    print(classification_report(y_true, y_pred, digits=4))


if __name__ == '__main__':
    main()
Overwriting q56.py
!python q56.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
              precision    recall  f1-score   support

           b     0.8846    0.9363    0.9097       565
           e     0.8511    0.9710    0.9071       518
           m     0.9783    0.5000    0.6618        90
           t     0.8812    0.5460    0.6742       163

    accuracy                         0.8728      1336
   macro avg     0.8988    0.7383    0.7882      1336
weighted avg     0.8775    0.8728    0.8633      1336
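The averages in this report can be reproduced from the test-set confusion matrix printed in Problem 55, which also shows why the micro-average coincides with the accuracy. The sketch below uses only numpy and assumes the row/column order b, e, t, m used there.

```python
import numpy as np

# Test-set confusion matrix from Problem 55
# (rows = true, columns = predicted, order: b, e, t, m)
cm = np.array([[529, 26, 10, 0],
               [13, 503, 2, 0],
               [37, 36, 89, 1],
               [19, 26, 0, 45]])

tp = np.diag(cm)                     # correct predictions per class
precision = tp / cm.sum(axis=0)      # divide by column sums (predicted counts)
recall = tp / cm.sum(axis=1)         # divide by row sums (true counts)
f1 = 2 * precision * recall / (precision + recall)

# Macro-average: plain mean of the per-class scores
macro_p, macro_r, macro_f1 = precision.mean(), recall.mean(), f1.mean()

# Micro-average: pool the counts over classes before dividing.
# Every item is predicted exactly once and has exactly one true label,
# so both denominators equal the total count.
micro_p = tp.sum() / cm.sum(axis=0).sum()
micro_r = tp.sum() / cm.sum(axis=1).sum()
accuracy = tp.sum() / cm.sum()

print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
print(round(micro_p, 4), round(micro_r, 4), round(accuracy, 4))
```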
Check the top 10 features with high weights and the top 10 features with low weights in the logistic regression model learned in Problem 52.
The attribute coef_ holds the weights, but since this is multi-class classification, its shape is (number of classes, number of features). So let's output the features for all four classes.
q57.py
import joblib
import numpy as np
from q52 import argparse_imf


def get_topk_indices(array, k=10):
    # argpartition finds the k largest without a full sort;
    # only those k are then sorted in descending order.
    unsorted_max_indices = np.argpartition(-array, k)[:k]
    max_weights = array[unsorted_max_indices]
    max_indices = np.argsort(-max_weights)
    return unsorted_max_indices[max_indices]


def show_weights(args):
    vectorizer = joblib.load(args.feats)
    # get_feature_names() was removed in scikit-learn 1.2;
    # on versions older than 1.0, use get_feature_names() instead.
    feature_names = np.array(vectorizer.get_feature_names_out())
    clf = joblib.load(args.model)
    coefs = clf.coef_
    y_labels = clf.classes_
    for coef, y_label in zip(coefs, y_labels):
        max_k_indices = get_topk_indices(coef)
        print(y_label)
        for name, weight in zip(feature_names[max_k_indices], coef[max_k_indices]):
            print(name, weight, sep='\t')
        print('...')
        min_k_indices = get_topk_indices(-coef)
        for name, weight in zip(feature_names[min_k_indices], coef[min_k_indices]):
            print(name, weight, sep='\t')
        print()


def main():
    args = argparse_imf()
    show_weights(args)


if __name__ == '__main__':
    main()
Overwriting q57.py
!python q57.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
Since only the top and bottom entries are needed, the code avoids sorting the entire coef_, which makes it a bit roundabout. numpy has no topk()-like function, so argpartition() is used to get the indices of the top k entries in unsorted order, and only those k are sorted afterwards.
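The top-k trick can be tried on a small array. This sketch uses the same two-step idea as get_topk_indices() in q57.py, with made-up weights and k=2.

```python
import numpy as np


def get_topk_indices(array, k):
    # Step 1: indices of the k largest values, in arbitrary order
    unsorted_top = np.argpartition(-array, k)[:k]
    # Step 2: sort only those k indices by descending value
    order = np.argsort(-array[unsorted_top])
    return unsorted_top[order]


# Made-up weights for five features
weights = np.array([0.1, 0.9, -0.5, 0.7, 0.3])

top2 = get_topk_indices(weights, 2)       # largest first
bottom2 = get_topk_indices(-weights, 2)   # smallest first

print(top2)     # [1 3]
print(bottom2)  # [2 0]
```

Note that argpartition() requires k to be smaller than the array length.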
When training a logistic regression model, the degree of overfitting during learning can be controlled by adjusting the regularization parameters. Learn the logistic regression model with different regularization parameters and find the accuracy rate on the training data, validation data, and evaluation data. Summarize the results of the experiment in a graph with the regularization parameters on the horizontal axis and the accuracy rate on the vertical axis.
import argparse
import json
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib
import matplotlib.pyplot as plt
from tqdm import tqdm
from q52 import load_xy


def get_accuracy(clf, X, y_true):
    y_pred = clf.predict(X)
    return accuracy_score(y_true, y_pred)


X_train, y_train = load_xy('train.feature.txt')
X_valid, y_valid = load_xy('valid.feature.txt')
X_test, y_test = load_xy('test.feature.txt')

# Fit the vectorizer on the training data only
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_valid = vectorizer.transform(X_valid)
X_test = vectorizer.transform(X_test)

train_accuracies = []
valid_accuracies = []
test_accuracies = []

# Try C = 1, 2, 4, ..., 512
for exp in tqdm(range(10)):
    clf = LogisticRegression(random_state=0, max_iter=1000, C=2**exp)
    clf.fit(X_train, y_train)
    train_accuracies.append(get_accuracy(clf, X_train, y_train))
    valid_accuracies.append(get_accuracy(clf, X_valid, y_valid))
    test_accuracies.append(get_accuracy(clf, X_test, y_test))

cs = [2**c for c in range(10)]
plt.plot(cs, train_accuracies, label='train')
plt.plot(cs, valid_accuracies, label='valid')
plt.plot(cs, test_accuracies, label='test')
plt.legend()
plt.show()
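One point that is easy to miss: in scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so a larger C means weaker regularization. A rough way to see this is to compare the size of the learned weights for a few values of C. The sketch below uses a synthetic dataset from make_classification, not the news data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the article's data)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000, random_state=0)
    clf.fit(X, y)
    # L2 norm of the weight vector: shrinks as regularization gets stronger
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(C, norm)
```

Small C keeps the weights close to zero, and large C lets them grow, which is where overfitting to the training data becomes possible.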
Learn the categorization model while changing the learning algorithm and learning parameters. Find the learning algorithm and parameters that achieve the highest accuracy on the verification data. Also, find the accuracy on the evaluation data when that learning algorithm and those parameters are used.
The algorithm and hyperparameters should be selected on the verification data, not by tuning on the test set. This time, though, I did not go that far and just used sklearn.ensemble.GradientBoostingClassifier with reasonable-looking settings to wrap things up ... or so I intended.
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0, min_samples_split=0.01,
                                 min_samples_leaf=5, max_depth=10,
                                 max_features='sqrt', n_estimators=500,
                                 subsample=0.8)
clf.fit(X_train, y_train)
valid_acc = get_accuracy(clf, X_valid, y_valid) * 100
print('Validation Accuracy: %.3f' % valid_acc)
test_acc = get_accuracy(clf, X_test, y_test) * 100
print('Test Accuracy: %.3f' % test_acc)
Validation Accuracy: 88.997
Test Accuracy: 88.548
I tried the trendy GBDT, but it was tricky because the performance changes considerably with the hyperparameters. If I have time, I will do a proper grid search ...
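For reference, a grid search that selects on the verification data and touches the test data only once could look roughly like the sketch below. It runs on a synthetic dataset from make_classification, and the grid values are arbitrary assumptions chosen to keep it fast, not settings tuned for the news data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic data split 80/10/10, mirroring the article's split
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.8, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Arbitrary small grid (an assumption for illustration)
grid = ParameterGrid({"max_depth": [2, 4], "n_estimators": [20, 50]})

best_params, best_valid_acc, best_clf = None, -1.0, None
for params in grid:
    clf = GradientBoostingClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    # Model selection uses the verification data only
    valid_acc = accuracy_score(y_valid, clf.predict(X_valid))
    if valid_acc > best_valid_acc:
        best_params, best_valid_acc, best_clf = params, valid_acc, clf

# The test data is used once, for the selected model
test_acc = accuracy_score(y_test, best_clf.predict(X_test))
print(best_params, best_valid_acc, test_acc)
```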
In any case, scikit-learn is the go-to library for machine learning in Python, and I think this chapter is a good way to learn it.