The "Twitter slander repelling machine" article I posted on Qiita earlier turned out well, but its LGTM count was sluggish, which got me thinking. Putting slanderous words into an array one by one seemed like a hassle, so this time I borrowed the power of deep learning. The goal of this article is the same as last time.
Let's go!!!
For the prerequisite knowledge, please read the previous article first: Twitter slander repelling machine
We build a model with word2vec and an RNN (LSTM), using Keras. For data we use "umich-sentiment-train.txt", a dataset commonly used for sentiment analysis.
word2vec is an algorithm that maps human words to vectors (numbers).
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras import backend as K

# CBOW-style word2vec: predict the center word from its surrounding context words.
# vocab_size, EMBEDDING_SIZE, WINDOW_SIZE and the training data are prepared in the preprocessing step (omitted here).
word2vec_model = Sequential()
word2vec_model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_SIZE,
                             embeddings_initializer='glorot_uniform',
                             input_length=WINDOW_SIZE * 2))
# Average the context word embeddings into a single vector
word2vec_model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(EMBEDDING_SIZE,)))
word2vec_model.add(Dense(vocab_size, kernel_initializer='glorot_uniform', activation='softmax'))
word2vec_model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=["accuracy"])

word2vec_model.fit(Xtrain, ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                   validation_data=(Xtest, ytest))

# evaluate
word2vec_score, word2vec_acc = word2vec_model.evaluate(Xtest, ytest, batch_size=BATCH_SIZE)
print("word2vec Test score: {:.3f}, accuracy: {:.3f}".format(word2vec_score, word2vec_acc))

# get embedding_weights (the learned word vectors, shape: vocab_size x EMBEDDING_SIZE)
embedding_weights = word2vec_model.layers[0].get_weights()[0]
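As a quick sanity check, you can pull the learned vector for a single word out of embedding_weights. This is only a minimal sketch: it assumes a word2index dictionary (word to integer ID) was built during the preprocessing step and that the chosen word actually appears in the vocabulary.

# Minimal sketch (assumes word2index from the preprocessing step)
sample_word = "good"  # any word that appears in the training vocabulary
vector = embedding_weights[word2index[sample_word]]
print(vector.shape)   # (EMBEDDING_SIZE,)
print(vector[:5])     # first few components of the learned word vector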
An RNN (LSTM) is a model specialized for handling time-series data; it is also applied to stock price forecasting and machine translation.
from keras.layers import Dropout, Activation, LSTM

rnn_model = Sequential()
# Initialize the embedding layer with the weights learned by word2vec above
rnn_model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=MAX_SENTENCE_LENGTH,
                        weights=[embedding_weights], trainable=True))
rnn_model.add(Dropout(0.5))
rnn_model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.5, recurrent_dropout=0.5))
# Single sigmoid output: close to 0 = negative, close to 1 = positive
rnn_model.add(Dense(1))
rnn_model.add(Activation("sigmoid"))
rnn_model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])

rnn_model.fit(Xtrain, ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
              validation_data=(Xtest, ytest))

# evaluate
rnn_score, rnn_acc = rnn_model.evaluate(Xtest, ytest, batch_size=BATCH_SIZE)
print("rnn Test score: {:.3f}, accuracy: {:.3f}".format(rnn_score, rnn_acc))

# save model
rnn_model.save(os.path.join(DATA_DIR, "sentence_analyzing_rnn.hdf5"))
rnn_model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=MAX_SENTENCE_LENGTH,
                        weights=[embedding_weights], trainable=True))
Here, the embedding layer is initialized with the weights obtained from word2vec above, training proceeds on the sentiment data, and the resulting model is saved as "sentence_analyzing_rnn.hdf5".
Now we combine this model with the slander repelling machine described in the previous article.
# coding=utf-8
import collections
import os
import json
import nltk
import codecs
from requests_oauthlib import OAuth1Session
from keras.models import Sequential, load_model
from keras.preprocessing import sequence

# Authentication process
CK = 'YOUR OWN'
CS = 'YOUR OWN'
AT = 'YOUR OWN'
ATS = 'YOUR OWN'
twitter = OAuth1Session(CK, CS, AT, ATS)

# Tweet search endpoint
url = 'https://api.twitter.com/1.1/search/tweets.json'
# User block endpoint
url2 = 'https://api.twitter.com/1.1/blocks/create.json'

# Setting parameters
DATA_DIR = "./data"
MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40

if os.path.exists(os.path.join(DATA_DIR, "sentence_analyzing_rnn.hdf5")):
    # Read the training data and generate word2index
    maxlen = 0
    word_freqs = collections.Counter()
    with codecs.open(os.path.join(DATA_DIR, "umich-sentiment-train.txt"), "r", 'utf-8') as ftrain:
        for line in ftrain:
            label, sentence = line.strip().split("\t")
            try:
                words = nltk.word_tokenize(sentence.lower())
            except LookupError:
                print("The English tokenizer ('punkt') is not downloaded, so downloading it.")
                nltk.download("punkt")
                words = nltk.word_tokenize(sentence.lower())
            maxlen = max(maxlen, len(words))
            for word in words:
                word_freqs[word] += 1

    vocab_size = min(MAX_FEATURES, len(word_freqs)) + 2
    word2index = {x[0]: i + 2 for i, x in enumerate(word_freqs.most_common(MAX_FEATURES))}
    word2index["PAD"] = 0
    word2index["UNK"] = 1

    # load model
    model = load_model(os.path.join(DATA_DIR, "sentence_analyzing_rnn.hdf5"))

    # Parameters to pass to the endpoint
    target_account = '@hoge '  # Account whose replies are checked
    keyword = target_account + 'exclude:retweets'  # Exclude retweets
    params = {
        'count': 50,   # Number of tweets to get
        'q': keyword,  # Search keyword
    }

    req = twitter.get(url, params=params)
    if req.status_code == 200:
        res = json.loads(req.text)
        for line in res['statuses']:
            target_text = line['text'].replace(target_account, "")
            test_words = nltk.word_tokenize(target_text.lower())
            # Convert each word to its ID, falling back to UNK for unknown words
            test_seqs = []
            for test_word in test_words:
                if test_word in word2index:
                    test_seqs.append(word2index[test_word])
                else:
                    test_seqs.append(word2index["UNK"])

            Xsent = sequence.pad_sequences([test_seqs], maxlen=MAX_SENTENCE_LENGTH)
            ypred = model.predict(Xsent)[0][0]
            if ypred < 0.5:
                params2 = {'user_id': line['user']['id']}  # User to block
                req2 = twitter.post(url2, params=params2)
                if req2.status_code == 200:
                    print("Blocked !!")
                else:
                    print("Failed2: %d" % req2.status_code)
    else:
        print("Failed: %d" % req.status_code)
else:
    print("AI model doesn't exist")

# Note: the Twitter search API cannot find replies older than one week
# Note: obviously aggressive replies often don't show up in search in the first place
First, to feed tweets to the model, every word has to be converted into an ID, so we build word2index from the training data. We then specify the account being replied to, fetch the replies, and pass each one to the AI. Abusive messages are scored close to 0 and compliments close to 1; if the score falls below the 0.5 threshold, the message is treated as slander and the sender's account is blocked.
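To make the threshold logic explicit, here is a minimal standalone sketch of the scoring step, assuming the model, word2index, and MAX_SENTENCE_LENGTH already set up in the script above (is_abusive is just a hypothetical helper name, not part of the original program):

# Minimal sketch of the scoring step (assumes model, word2index, MAX_SENTENCE_LENGTH from the script above)
def is_abusive(text, model, word2index, max_len=MAX_SENTENCE_LENGTH):
    # Convert the text into word IDs, mapping unknown words to UNK
    words = nltk.word_tokenize(text.lower())
    seq = [word2index.get(w, word2index["UNK"]) for w in words]
    x = sequence.pad_sequences([seq], maxlen=max_len)
    score = model.predict(x)[0][0]  # close to 0 = abusive, close to 1 = compliment
    return score < 0.5  # below the threshold -> treat as slander and block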
Because converting Japanese into word IDs is difficult and there is no high-quality Japanese dataset, this program only works with English. I sincerely hope it will be of some help to the world.
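For reference, the English-specific part is mainly nltk.word_tokenize; if you wanted to experiment with Japanese, a morphological analyzer could take its place for tokenization. The sketch below uses the janome package purely as an illustration (an assumption on my part, not something used in this article), and a labeled Japanese dataset would still be needed to retrain the model.

# Sketch only: Japanese tokenization with Janome (not part of this article's pipeline)
from janome.tokenizer import Tokenizer

janome_tokenizer = Tokenizer()

def tokenize_ja(text):
    # wakati=True returns the surface forms as plain strings
    return list(janome_tokenizer.tokenize(text, wakati=True))

print(tokenize_ja("このツイートはひどい"))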
By the way, this is my Twitter account. Feel free to message me! https://twitter.com/downtownakasiya