[PYTHON] I tried to build an estimation model for article titles that are likely to buzz on Qiita

I have touched Python before, but I have little experience implementing machine learning, and I especially wanted to make something using natural language processing. With that motivation, I built a model that estimates whether an article title is likely to buzz.

Overview

Implementation

Collect article title data

Trending articles

The titles are obtained via the Twitter API from the Twitter account (Qiita Popular Posts) that introduces trending articles. 3,229 entries were collected; after removing the URLs and emoji from each tweet, the data is dumped to a JSON file.

import re
import emoji
from twitter import Twitter, OAuth  # assumed: the "twitter" package is used as the Twitter API client

# Assumed setup for the globals referenced below (credentials omitted)
# t = Twitter(auth=OAuth(token, token_secret, consumer_key, consumer_secret))
trendArticleTitles = []
totalIdx = 0
isFinished = False

def retrieveTweets(screenName, count):
    global totalIdx
    timeLine = t.statuses.user_timeline(screen_name=screenName, count=count)
    maxId = 0
    for tweetsIdx, tweet in enumerate(timeLine):
        maxId = tweet["id"]
        addArticleTitles(tweet)
        totalIdx += 1
    print("Starting additional retrieving...")
    retrieveContinuedTweets(screenName, count, maxId)

def retrieveContinuedTweets(screenName, count, maxId):
    global totalIdx, isFinished
    tmpMaxId = maxId
    while True:
        timeLine = t.statuses.user_timeline(screen_name=screenName, count=count, max_id=tmpMaxId)
        prevMaxId = 0
        for tweetsIdx, tweet in enumerate(timeLine):
            tmpMaxId = tweet["id"]
            addArticleTitles(tweet)
            print("totalIdx = {}, prevMaxId = {}, maxId = {}, title = {}\n".format(totalIdx, prevMaxId, tmpMaxId, trendArticleTitles[totalIdx]["articleTitle"]))
            if prevMaxId == 0 and totalIdx % 200 != 0:
                isFinished = True
                break
            prevMaxId = tmpMaxId
            totalIdx += 1
        if isFinished:
            print("Finished collecting {} qiita_trend_titles.".format(totalIdx))
            break

def addArticleTitles(tweet):
    global trendArticleTitles
    tmpTitle = re.sub(r"(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)", "", tweet["text"])  # Remove URLs in the tweet
    tmpTitle = ''.join(s for s in tmpTitle if s not in emoji.UNICODE_EMOJI)  # Remove emoji
    articleTitle = tmpTitle[:len(tmpTitle)-1]  # Remove the trailing half-width space
    datum = {"articleTitle": articleTitle}
    trendArticleTitles.append(datum)

Regular articles

Regular (non-buzzed) article titles are obtained with the Qiita API. Here, 9,450 entries were collected and, like the trend article titles, dumped to a JSON file.

import requests

# Assumed setup: Qiita API v2 items endpoint; the access token, per_page and the
# "not buzzed" like-count threshold are placeholders (the original values are not shown)
url = "https://qiita.com/api/v2/items"
headers = {"Authorization": "Bearer <your Qiita access token>"}
per_page = 100
notBuzzThreshold = 5  # placeholder: articles below this like count are treated as "not buzzed"

articleTitles = []
idx = 0
print("Starting collecting article titles...")
for page in range(3, 101):
    # Skip the first few pages to avoid articles from spam accounts
    params = {"page": str(page), "per_page": str(per_page)}
    response = requests.get(url, headers=headers, params=params)
    resJson = response.json()
    for article in resJson:
        if article.get("likes_count") < notBuzzThreshold:
            title = article.get("title")
            articleTitles.append({"articleTitle": title})
            print("{}th article title = {}, url = {}".format(idx, title, article["url"]))
            idx += 1
print("Finished collecting {} qiita_article_titles.".format(idx))

Combine article title datasets into one file

First, load the two kinds of article title data collected above. While adding a flag indicating whether each title comes from a trend article, merge them into a single dataset. Just in case, the contents of the merged data are shuffled.

Here too, a JSON file is dumped at the end, which completes the data collection.

import json
import random

# Load the two datasets dumped above (the file names are assumptions)
with open("./datasets/qiita_trend_titles.json", mode="r", encoding="utf-8") as f:
    trendData = json.load(f)
with open("./datasets/qiita_article_titles.json", mode="r", encoding="utf-8") as f:
    normalData = json.load(f)

mergedData = []
for datum in trendData:
    mergedData.append({
        "articleTitle": datum["articleTitle"],
        "isTrend": 1
    })
for datum in normalData:
    mergedData.append({
        "articleTitle": datum["articleTitle"],
        "isTrend": 0
    })

# Shuffle the order of the combined results
random.shuffle(mergedData)
print("Finished shuffling 'Merged Article Titles'.")

# Dump the merged dataset to a JSON file (the file name is an assumption)
with open("./datasets/qiita_merged_titles.json", mode="w", encoding="utf-8") as f:
    json.dump(mergedData, f, ensure_ascii=False)

Practice first with spam detection

I decided to build the inference model with Naive Bayes, but I wasn't sure where to start. So I first reviewed Naive Bayes itself, and then worked through an article that implements spam detection with Naive Bayes to get a feel for it before tackling this implementation.

Naive Bayes - Study

Naive Bayes - Practice with spam detection

Having learned the basics of Naive Bayes, I moved on to a practice implementation, following the article below: Machine learning - Junk mail classification (Naive Bayes classifier).
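As a rough illustration of what that kind of practice pipeline looks like (toy data and placeholder names, not the referenced article's actual code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny toy corpus: 1 = spam, 0 = ham
trainTexts = ["win a free prize now", "meeting schedule for tomorrow"]
trainLabels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(trainTexts)
classifier = MultinomialNB()
classifier.fit(X, trainLabels)

print(classifier.predict(vectorizer.transform(["free prize waiting"])))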

Replacing the dataset with Qiita article titles

Now that I had a feel for Naive Bayes, it was time to get to the main subject. Below I describe the parts that were modified from the implementation in the practice article.

Install MeCab, ipadic-NEologd

The spam detection dataset is in English, so it can be passed to scikit-learn as is, but the Qiita article titles cannot. First, install MeCab and ipadic-NEologd so that the Japanese titles can be split into words properly. (I initially tried splitting with CountVectorizer alone, but the result was unnatural.)

I mainly referred to the site below.
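For reference, here is a minimal sketch of pointing MeCab at the NEologd dictionary (the dictionary path is an assumption; check mecab-config --dicdir on your machine for the actual install location):

import MeCab

# Assumed install location of mecab-ipadic-NEologd
NEOLOGD_DIC = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd"
tagger = MeCab.Tagger("-d {}".format(NEOLOGD_DIC))
print(tagger.parse("自然言語処理を勉強する"))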

Model building

Starting from the spam detection practice implementation, I added the following:

import MeCab
import emoji
import neologdn

def getStopWords():
    stopWords = []
    with open("./datasets/Japanese.txt", mode="r", encoding="utf-8") as f:
        for word in f:
            if word != "\n":
                stopWords.append(word.rstrip("\n"))
    print("amount of stopWords = {}".format(len(stopWords)))
    return stopWords

def removeEmoji(text):
    return "".join(ch for ch in text if ch not in emoji.UNICODE_EMOJI)

stopWords = getStopWords()
tagger = MeCab.Tagger("mecabrc")

def extractWords(text):
    text = removeEmoji(text)
    text = neologdn.normalize(text)  # Normalize full-width/half-width variants, repeated long vowels, etc.
    words = []
    analyzedResults = tagger.parse(text).split("\n")
    for result in analyzedResults:
        # Each MeCab output line looks like "surface\tfeature1,feature2,...": keep only the surface form
        splittedWord = result.split(",")[0].split("\t")[0]
        if splittedWord not in stopWords:
            words.append(splittedWord)
    return words

If you pass the word-splitting function to CountVectorizer's analyzer argument, Japanese text gets split properly. Great.

vecCount = CountVectorizer(analyzer=extractWords, min_df=3)
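The "word size" and "word content" lines in the results below were printed from the fitted vectorizer's vocabulary, roughly like this (titles and the sampling of five entries are assumptions, not the exact code used):

bowData = vecCount.fit_transform(titles)  # titles: the list of article title strings (assumed name)
print("word size: ", len(vecCount.vocabulary_))
print("word content: ", dict(list(vecCount.vocabulary_.items())[:5]))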

Execution result

I prepared three texts for prediction: "I released the app", "Unity tutorial", and "Git command memo". The first one, "I released the app", is the one assumed to be "likely to buzz".
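As a rough sketch of the evaluation step (train/test split, fitting MultinomialNB, then predicting the three texts; the names X, y, vecCount usage and the split ratio are assumptions, not the exact code used):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# X: the vectorized titles, y: the isTrend flags (assumed to be prepared beforehand)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print("Train accuracy = {:.3f}".format(classifier.score(X_train, y_train)))
print("Test accuracy = {:.3f}".format(classifier.score(X_test, y_test)))

# The three texts to try (originally Japanese titles)
testTitles = ["I released the app", "Unity tutorial", "Git command memo"]
testVec = vecCount.transform(testTitles)
print(classifier.predict(testVec))        # predicted labels
print(classifier.predict_proba(testVec))  # classification probabilities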

CountVectorizer without an analyzer specified

Obviously the vocabulary is small; the Japanese titles do not seem to have been split properly.

word size:  1016
word content:  {'From': 809, 'ms': 447, 'nginx': 464, 'django': 232, 'intellij': 363}
Train accuracy = 0.771
Test accuracy = 0.747
[0 0 0]

Specifying MeCab + NEologd as the morphological analyzer

The words are now split properly, though there are quite a lot of them... The classification result is as expected.

word size:  3870
word content:  {'From': 1696, 'MS': 623, 'Teams': 931, 'To': 1853, 'notification': 3711}
Train accuracy = 0.842
Test accuracy = 0.783
[1 0 0]

Remove stop words and emoji

The number of words has been reduced, and the accuracy on the test data increased slightly. This made me appreciate the importance of preprocessing.

word size:  3719
word content:  {'MS': 623, 'Teams': 931, 'To': 1824, 'notification': 3571, 'To do': 1735}
Train accuracy = 0.842
Test accuracy = 0.784
[1 0 0]

Added various normalization processes

The accuracy on the training data decreased slightly, while the accuracy on the test data increased correspondingly. Also, I had forgotten to display the classification probabilities, so they are shown here. I was honestly surprised that the text I assumed would buzz got a higher probability than I expected. (Of course, this is not reliable without trying many more texts...)

word size:  3700
word content:  {'MS': 648, 'Teams': 955, 'To': 1838, 'notification': 3583, 'To do': 1748}
[1 0 0]
[[0.23452364 0.76547636]
 [0.92761086 0.07238914]
 [0.99557625 0.00442375]]
Train accuracy = 0.841
Test accuracy = 0.785

Discussion and remaining issues

This time, the changes in accuracy seemed to be within the margin of error. Since only the terms contained in NEologd are guaranteed to be covered, I think accuracy could be improved by covering technical terms better during vectorization. Beyond that, accuracy should also improve by extracting important words from article titles and article bodies with TF-IDF or the like and making use of them.
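For example, swapping CountVectorizer for TfidfVectorizer with the same Japanese analyzer would be one way to try this (a sketch under that assumption, not something evaluated in this article; titles is a placeholder name):

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF down-weights words that appear in many titles
vecTfidf = TfidfVectorizer(analyzer=extractWords, min_df=3)
tfidfData = vecTfidf.fit_transform(titles)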
