[Free study] Is there a connection between Wikipedia updates and trends?


This is Getta. I'm in the 20th grade of elementary school. Let me present my free-study project.

……

I'll write the motivation below, but the short version is: I was curious, tried it, and got it working quickly. When I thought about posting it to Qiita, it happened to be Advent Calendar season, so here it is.

Motivation

When Kazunari Ninomiya of Arashi announced his marriage, his Wikipedia page was being vandalized.

Of course, at that time, "Nino Marriage" was in the Twitter trends.

So I thought, "Maybe Wikipedia and Twitter trends are related :thinking:".

What would make me happy if there is a relationship: just by watching Wikipedia I could keep up with trends while also picking up concrete knowledge, and I think I would learn more about a trend that way than by looking at SNS.

Method

Get Twitter trends with the API.

A program to fetch Japan's Twitter trends via the API
import tweepy
import api_key  # module holding the Twitter API credentials

def auth_api():
    auth = tweepy.OAuthHandler(api_key.CONSUMER_KEY, api_key.CONSUMER_SECRET)
    auth.set_access_token(api_key.ACCESS_TOKEN, api_key.ACCESS_SECRET)
    return tweepy.API(auth)

def get_trend_words():
    api = auth_api()
    trends = api.trends_place("23424856")  # WOEID for Japan
    return [d["name"] for d in trends[0]["trends"]]

Get a list of titles by scraping Wikipedia's "Recent changes" page.

A program to get recent updates to Wikipedia
import requests
import bs4

def get_wikipedia_log_keywords():
    # Changing the limit parameter changes how many entries are fetched
    url = ('https://ja.wikipedia.org/wiki/%E7%89%B9%E5%88%A5:%E6%9C%80%E8%BF%91%E3%81%AE%E6%9B%B4%E6%96%B0'
           '?hidebots=1&hidecategorization=1&hideWikibase=1&limit=500&days=7&urlversion=2')
    html = requests.get(url)
    soup = bs4.BeautifulSoup(html.text, "html5lib")
    return [el.text for el in soup.find_all(class_="mw-changeslist-title")]

Do a full pairwise search over the two lists, count the matching words, and divide by the length of the longer list to get a score.

import twitter_trend
import wikipedia_ch_log

def main():
    print("------wikipedia 500 keywords------")
    print()
    wiki = wikipedia_ch_log.get_wikipedia_log_keywords()
    print(wiki)
    print()
    print("------twitter trends------")
    print()
    twi = twitter_trend.get_trend_words()
    print(twi)
    cnt = 0
    for s in twi:
        if s.startswith("#"):
            s = s[1:]  # hashtag removal
        for s2 in wiki:
            if s in s2:  # substring match against the Wikipedia title
                print("same word :", s, s2)
                cnt += 1
    print("count :", cnt)

    # Divide by the longer list so the score stays at most 1
    print("coincidence :", cnt / max(len(twi), len(wiki)))

if __name__ == "__main__":
    main()

(Aside: is there a way to break long lines in Pythonista 3?)

result

No matches

Sometimes they didn't match at all.

One match

Just one match this time. The top list is from Wikipedia and the bottom is the Twitter trends. The one match was "Toshinobu Kubota", giving a coincidence score of 0.002.

Consideration

This time I compared Wikipedia and Twitter: across roughly 500 Wikipedia keywords and about 50 Twitter trends, there were only 0 to 1 matches.

Even in runs I didn't take screenshots of, there were at most 4 matches.

I think this result says less about the two being unrelated than about a problem with the evaluation method.

For now, the only processing is removing Twitter hashtags, but that alone is not enough. As I said at the beginning, even when something like "Nino Marriage" is trending, the page is registered on Wikipedia as "Kazunari Ninomiya", so of course this method finds no match.
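A first step that goes slightly beyond stripping the leading "#" is to normalize the text itself before comparing. This is only a sketch (the function name `normalize_trend` is my own): NFKC normalization unifies full-width and half-width characters, though it still cannot bridge aliases like "Nino Marriage" vs. "Kazunari Ninomiya".

```python
import unicodedata

def normalize_trend(word):
    # Strip a single leading hashtag (both "#" and the full-width "＃").
    if word and word[0] in ("#", "＃"):
        word = word[1:]
    # NFKC folds full-width Latin letters and digits into their
    # half-width forms, so "ＡＢＣ" and "ABC" compare equal afterwards.
    return unicodedata.normalize("NFKC", word)
```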

To handle this well, it may be better to first extract related words from each trend word somehow (for example, via a Google search) and then compare those with Wikipedia.

Also, I set the evaluation value so that its maximum is 1, but only the size of the larger list affects the score; the size of the smaller list (the trends, in this case) has no effect at all, so the evaluation formula is not very good.
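One common symmetric alternative, in which both list sizes matter, is the Jaccard index: the number of shared keywords divided by the size of the union of the two sets. A minimal sketch, assuming exact matching rather than the substring matching used above:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|: 1.0 for identical sets, 0.0 for disjoint ones.
    sa, sb = set(a), set(b)
    if not (sa | sb):
        return 0.0  # both lists empty
    return len(sa & sb) / len(sa | sb)
```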

Impressions

I'd had the idea of pulling Wikipedia's edit history for a while, but kept putting it off as a hassle, so I'm glad it turned out to be this easy. Twitter trends were also available through the API, and since I had already obtained an API key, I was glad I could get it done quickly.

If I continue this in the future, I would like to examine the degree of agreement in more detail by also pulling in related words.

As a result, once we know how strongly Wikipedia relates to trends, a possible by-product would be a site that automatically reads Wikipedia articles and explains what's trending.

reference

My own article on how to use bs4 (as a reminder to myself)
