The "Twitter slander repelling machine" article I posted on Qiita earlier turned out well, but its LGTM count was sluggish, which got me thinking. Putting slanderous words into an array one by one seemed like a hassle, so this time I borrowed the power of deep learning. The goal of this article is the same as last time.
Let's go!!!
For the prerequisite knowledge, please read the previous article first: Twitter slander repelling machine
We build a model with word2vec and an RNN (LSTM), using Keras. For data we use "umich-sentiment-train.txt", a dataset commonly used for sentiment analysis.
word2vec is an algorithm that maps human words to vectors (numbers).
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras import backend as K

# CBOW-style word2vec: predict the center word from its surrounding context words.
# vocab_size, EMBEDDING_SIZE, WINDOW_SIZE and the training data are prepared in the preprocessing step (omitted here).
word2vec_model = Sequential()
word2vec_model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_SIZE,
                             embeddings_initializer='glorot_uniform',
                             input_length=WINDOW_SIZE * 2))
# Average the context word embeddings into a single vector
word2vec_model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(EMBEDDING_SIZE,)))
word2vec_model.add(Dense(vocab_size, kernel_initializer='glorot_uniform', activation='softmax'))
word2vec_model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=["accuracy"])

word2vec_model.fit(Xtrain, ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                   validation_data=(Xtest, ytest))

# evaluate
word2vec_score, word2vec_acc = word2vec_model.evaluate(Xtest, ytest, batch_size=BATCH_SIZE)
print("word2vec Test score: {:.3f}, accuracy: {:.3f}".format(word2vec_score, word2vec_acc))

# get embedding_weights (the learned word vectors, shape: vocab_size x EMBEDDING_SIZE)
embedding_weights = word2vec_model.layers[0].get_weights()[0]
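As a quick sanity check, you can pull the learned vector for a single word out of embedding_weights. This is only a minimal sketch: it assumes a word2index dictionary (word to integer ID) was built during the preprocessing step and that the chosen word actually appears in the vocabulary.

# Minimal sketch (assumes word2index from the preprocessing step)
sample_word = "good"  # any word that appears in the training vocabulary
vector = embedding_weights[word2index[sample_word]]
print(vector.shape)   # (EMBEDDING_SIZE,)
print(vector[:5])     # first few components of the learned word vector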
An RNN (LSTM) is a model specialized for handling time-series data; it is also applied to stock price forecasting and machine translation.
from keras.layers import Dropout, Activation, LSTM

rnn_model = Sequential()
# Initialize the embedding layer with the weights learned by word2vec above
rnn_model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=MAX_SENTENCE_LENGTH,
                        weights=[embedding_weights], trainable=True))
rnn_model.add(Dropout(0.5))
rnn_model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.5, recurrent_dropout=0.5))
# Single sigmoid output: close to 0 = negative, close to 1 = positive
rnn_model.add(Dense(1))
rnn_model.add(Activation("sigmoid"))
rnn_model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])

rnn_model.fit(Xtrain, ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
              validation_data=(Xtest, ytest))

# evaluate
rnn_score, rnn_acc = rnn_model.evaluate(Xtest, ytest, batch_size=BATCH_SIZE)
print("rnn Test score: {:.3f}, accuracy: {:.3f}".format(rnn_score, rnn_acc))

# save model
rnn_model.save(os.path.join(DATA_DIR, "sentence_analyzing_rnn.hdf5"))
rnn_model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=MAX_SENTENCE_LENGTH,
                        weights=[embedding_weights], trainable=True))
Here, the embedding layer is initialized with the weights obtained from word2vec above, training proceeds on the sentiment data, and the resulting model is saved as "sentence_analyzing_rnn.hdf5".
Now we combine this model with the slander repelling machine described in the previous article.
# coding=utf-8
import collections
import os
import json
import nltk
import codecs
from requests_oauthlib import OAuth1Session
from keras.models import Sequential, load_model
from keras.preprocessing import sequence

# Authentication process
CK = 'YOUR OWN'
CS = 'YOUR OWN'
AT = 'YOUR OWN'
ATS = 'YOUR OWN'
twitter = OAuth1Session(CK, CS, AT, ATS)

# Tweet search endpoint
url = 'https://api.twitter.com/1.1/search/tweets.json'
# User block endpoint
url2 = 'https://api.twitter.com/1.1/blocks/create.json'

# Setting parameters
DATA_DIR = "./data"
MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40

if os.path.exists(os.path.join(DATA_DIR, "sentence_analyzing_rnn.hdf5")):
    # Read the training data and generate word2index
    maxlen = 0
    word_freqs = collections.Counter()
    with codecs.open(os.path.join(DATA_DIR, "umich-sentiment-train.txt"), "r", 'utf-8') as ftrain:
        for line in ftrain:
            label, sentence = line.strip().split("\t")
            try:
                words = nltk.word_tokenize(sentence.lower())
            except LookupError:
                print("The English tokenizer ('punkt') is not downloaded, so downloading it.")
                nltk.download("punkt")
                words = nltk.word_tokenize(sentence.lower())
            maxlen = max(maxlen, len(words))
            for word in words:
                word_freqs[word] += 1

    vocab_size = min(MAX_FEATURES, len(word_freqs)) + 2
    word2index = {x[0]: i + 2 for i, x in enumerate(word_freqs.most_common(MAX_FEATURES))}
    word2index["PAD"] = 0
    word2index["UNK"] = 1

    # load model
    model = load_model(os.path.join(DATA_DIR, "sentence_analyzing_rnn.hdf5"))

    # Parameters to pass to the endpoint
    target_account = '@hoge '  # Account whose replies are checked
    keyword = target_account + 'exclude:retweets'  # Exclude retweets
    params = {
        'count': 50,   # Number of tweets to get
        'q': keyword,  # Search keyword
    }

    req = twitter.get(url, params=params)
    if req.status_code == 200:
        res = json.loads(req.text)
        for line in res['statuses']:
            target_text = line['text'].replace(target_account, "")
            test_words = nltk.word_tokenize(target_text.lower())
            # Convert each word to its ID, falling back to UNK for unknown words
            test_seqs = []
            for test_word in test_words:
                if test_word in word2index:
                    test_seqs.append(word2index[test_word])
                else:
                    test_seqs.append(word2index["UNK"])

            Xsent = sequence.pad_sequences([test_seqs], maxlen=MAX_SENTENCE_LENGTH)
            ypred = model.predict(Xsent)[0][0]
            if ypred < 0.5:
                params2 = {'user_id': line['user']['id']}  # User to block
                req2 = twitter.post(url2, params=params2)
                if req2.status_code == 200:
                    print("Blocked !!")
                else:
                    print("Failed2: %d" % req2.status_code)
    else:
        print("Failed: %d" % req.status_code)
else:
    print("AI model doesn't exist")

# Note: the Twitter search API cannot find replies older than one week
# Note: obviously aggressive replies often don't show up in search in the first place
First, to feed tweets to the model, every word has to be converted into an ID, so we build word2index from the training data. We then specify the account being replied to, fetch the replies, and pass each one to the AI. Abusive messages are scored close to 0 and compliments close to 1; if the score falls below the 0.5 threshold, the message is treated as slander and the sender's account is blocked.
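To make the threshold logic explicit, here is a minimal standalone sketch of the scoring step, assuming the model, word2index, and MAX_SENTENCE_LENGTH already set up in the script above (is_abusive is just a hypothetical helper name, not part of the original program):

# Minimal sketch of the scoring step (assumes model, word2index, MAX_SENTENCE_LENGTH from the script above)
def is_abusive(text, model, word2index, max_len=MAX_SENTENCE_LENGTH):
    # Convert the text into word IDs, mapping unknown words to UNK
    words = nltk.word_tokenize(text.lower())
    seq = [word2index.get(w, word2index["UNK"]) for w in words]
    x = sequence.pad_sequences([seq], maxlen=max_len)
    score = model.predict(x)[0][0]  # close to 0 = abusive, close to 1 = compliment
    return score < 0.5  # below the threshold -> treat as slander and block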
Because converting Japanese into word IDs is difficult and there is no high-quality Japanese dataset, this program only works with English. I sincerely hope it will be of some help to the world.
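For reference, the English-specific part is mainly nltk.word_tokenize; if you wanted to experiment with Japanese, a morphological analyzer could take its place for tokenization. The sketch below uses the janome package purely as an illustration (an assumption on my part, not something used in this article), and a labeled Japanese dataset would still be needed to retrain the model.

# Sketch only: Japanese tokenization with Janome (not part of this article's pipeline)
from janome.tokenizer import Tokenizer

janome_tokenizer = Tokenizer()

def tokenize_ja(text):
    # wakati=True returns the surface forms as plain strings
    return list(janome_tokenizer.tokenize(text, wakati=True))

print(tokenize_ja("このツイートはひどい"))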
By the way, this is my Twitter account. Feel free to message me! https://twitter.com/downtownakasiya