[PYTHON] Get all live tweets of professional baseball

Introduction

I collected live baseball tweets for my university research, so I'm summarizing the process here. I mainly cover the scraping side and how to collect large numbers of tweets with tweepy.

Live tweets of 2019 Professional Baseball (NPB)

Here are the results of collecting live tweets of the 2019 Professional Baseball (NPB) season via hashtags, every day. I ran the Python file with cron daily and kept collecting for a year.
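For reference, a minimal crontab entry for such a daily run could look like the following (the path, interpreter, and start time are my own assumptions, not taken from the repository):

#Hypothetical crontab entry: run the collector every day at 03:00
0 3 * * * /usr/bin/python3 /home/user/getLiveTweet_NPB.py >> /home/user/npb.log 2>&1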


The searched hashtags are as follows. There may be active hashtags that haven't been discovered yet (for Hanshin, perhaps "#虎バン"?).

Central League                            Pacific League
Giants      #kyojin, #giants              Nippon-Ham  #lovefighters
Chunichi    #dragons                      SoftBank    #sbhawks
Hiroshima   #carp                         Rakuten     #rakuteneagles
Yakult      #swallows, #yakultswallows    Seibu       #seibulions
Hanshin     #hanshin, #tigers             Lotte       #chibalotte
DeNA        #baystars                     Orix        #Orix_Buffaloes

Overview of live tweet acquisition

1. Get the match card / match time of the target day

The match cards and times are obtained by scraping websites that provide breaking sports news.

Scraping destinations:
- SPORTS BULL (https://sportsbull.jp/stats/npb/)
- Sports Navi (by Yahoo! JAPAN) (https://baseball.yahoo.co.jp/npb/schedule/)

2. Get live tweets from hashtags by specifying the time of the match

ID and hashtag of each team (search query)

The hashtag dictionary (tag_list) below maps each team to its search query:
key: a team_id I assigned myself
value: the hashtag(s), used as the query when searching for tweets

tag_list = {0: '#kyojin OR #giants', 1: '#dragons',
            2: '#carp', 3: '#swallows OR #yakultswallows', 4: '#hanshin OR #tigers',
            5: '#baystars', 6: '#lovefighters', 7: '#sbhawks', 8: '#rakuteneagles',
            9: '#seibulions', 10: '#chibalotte', 11: '#Orix_Buffaloes'}

Implementation

The full code is published on GitHub → here

Libraries used (Python)

The libraries used this time are as follows. Install them as appropriate (only tweepy and beautifulsoup4 are third-party, e.g. pip install tweepy beautifulsoup4; the rest are in the standard library).

getLiveTweet_NPB.py


from urllib.request import urlopen
import tweepy
from datetime import timedelta
import time
import sqlite3
from contextlib import closing
import datetime
from bs4 import BeautifulSoup
import urllib.request as req

Get a match card

Scrape SPORTS BULL (https://sportsbull.jp/stats/npb/) for the matches played on the specified date. The same information is available from Sportsnavi, but SPORTS BULL has a simpler HTML structure.

getLiveTweet_NPB.py



def get_gameteamId(gamedate):
    url = 'https://sportsbull.jp/stats/npb/home/index/' + gamedate
    print(url)
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    q = soup.select('.game-block a')
    gameId_list = []
    flag_list = [1 for i in range(12)]  #1 = played, 0 = canceled (two slots per game)
    i = 0
    for p in q:
        urls = p.get('href')
        #Handle canceled games ('中止' = canceled)
        p_ = p.select('.st-03')
        for p__ in p_:
            if '中止' in str(p__.text):
                print('Canceled')
                flag_list[i] = 0
                flag_list[i+1] = 0
        if flag_list[i] == 1:
            print(urls[-10:])
            gameId_list.append(urls[-10:])  #The last 10 characters of the URL are the game ID
        i += 2  #Each game block covers two teams
    print('flag_list:', flag_list)
    q = soup.select('.game-block .play-box01 dt')
    teamId_list = []
    #Team names as displayed on the page (Japanese) → my own team IDs
    teamId_dict = {'巨人': 0, '中日': 1, '広島': 2, 'ヤクルト': 3, '阪神': 4, 'DeNA': 5,
                   '日本ハム': 6, 'ソフトバンク': 7, '楽天': 8, '西武': 9, 'ロッテ': 10, 'オリックス': 11}
    i = 0
    for p in q:
        if flag_list[i] == 1:
            teamId_list.append(teamId_dict[p.text])
        i += 1
    return gameId_list, teamId_list


#Date n days ago, as a YYYYMMDD string
def get_date(days_ago):
    date = datetime.date.today()
    date -= datetime.timedelta(days=days_ago)
    date_str = str(date)
    date_str = date_str[:4] + date_str[5:7] + date_str[8:10]
    return date_str


#Example--------------------------
n = 1
game_date = get_date(n)  #Automatic (data from n days ago)
game_date = '20200401'   #Manual input (overrides the automatic value)
print('Getting the data of', game_date)
# -----------------------------

#List of game IDs and team IDs
gameteamId_list = get_gameteamId(game_date)
gameId_list = gameteamId_list[0]
teamId_list = gameteamId_list[1]
print('gameId_list:',gameId_list)
print('teamId_list:',teamId_list)

Example of execution result

Getting the data of 20200401
https://sportsbull.jp/stats/npb/home/index/20200401
flag_list: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
gameId_list: ['2020040101', '2020040102']
teamId_list: [0, 1, 2, 3]

In this case, Giants (home) vs. Chunichi (away) was played as gameId = 2020040101, and Hiroshima (home) vs. Yakult (away) as gameId = 2020040102.

Get the start and end times of the match

Each match page on Sportsnavi (by Yahoo! JAPAN) (https://baseball.yahoo.co.jp/npb/schedule/), at https://baseball.yahoo.co.jp/npb/game/[game_id]/top, provides the start time and the game duration, so adding them together gives the start and end times.

getLiveTweet_NPB.py



#Get the start time and game duration by scraping
def gametime(game_id):
    url = 'https://baseball.yahoo.co.jp/npb/game/' + game_id + '/top'
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    times = []

    #Start time, taken from the tail of the stadium line (e.g. '... 18:00')
    css_select = '#gm_match .gamecard .column-center .stadium'
    q = soup.select(css_select)
    times.append(q[0].text[-6:-4])  #hour
    times.append(q[0].text[-3:-1])  #minute

    #Game duration, e.g. '3時間15分' ('3 hours 15 minutes')
    css_select = '#yjSNLiveDetaildata td'
    q = soup.select(css_select)
    minutes = q[1].text.split('時間')  #split on '時間' (hours)
    minutes[1] = minutes[1][:-1]       #drop the trailing '分' (minutes)
    times = times + minutes
    return times

↑ Output of this function, for a start time of 18:00 and a game time of 3 hours 15 minutes:

['18', '00', '3', '15']  #start hour, start minute, duration hours, duration minutes
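The end time used later is start + duration, plus a 5-minute margin. As a quick sanity check, here is the same arithmetic with datetime (a standalone sketch, not part of the script):

from datetime import datetime, timedelta

#18:00 start + 3 h 15 min game + 5-minute margin → 21:20
start = datetime(2020, 4, 1, 18, 0)
end = start + timedelta(hours=3, minutes=15 + 5)
print(end.strftime('%H:%M'))  #21:20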

Search by Twitter API

Use the Twitter API search to get all tweets in the time window. One request returns at most 100 tweets, so requests are repeated, paging backwards through the results. When the API rate limit is hit, the script pauses for 15 minutes.

The target is tweets posted from the start of the game until 5 minutes after it ends.
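As an aside, tweepy (3.x) can also do this waiting automatically if the API client is built with wait_on_rate_limit, instead of catching the error by hand:

#Alternative setup: tweepy sleeps through rate limits by itself (tweepy 3.x)
auth = tweepy.OAuthHandler(APIK, APIS)
auth.set_access_token(AT, AS)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)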

getLiveTweet_NPB.py



# TwitterAPI
APIK = 'consumer_key'
APIS = 'consumer_secret'
AT = 'access_token'
AS = 'access_token_secret'
auth = tweepy.OAuthHandler(APIK, APIS)
auth.set_access_token(AT, AS)
api = tweepy.API(auth)

#Twitter API search
def search_livetweet(team_num, api, game_id, query):
    print(query)    #Search runs backwards from the latest tweets
    print('Search page: 1')
    #First request: if the rate limit is hit, wait 15 minutes and retry
    while True:
        try:
            tweet_data = api.search(q=query, count=100)
            break
        except tweepy.TweepError as e:
            print('Error: wait 15 minutes')
            print(e.reason)
            time.sleep(60 * 15)

    if len(tweet_data) == 0:
        return
    table_name = 'team' + str(team_num)
    #This function saves the tweets to the database
    saveDB_tweet(table_name, 0, tweet_data, game_id)
    print('************************************************\n')
    next_max_id = tweet_data[-1].id

    page = 1
    while True:
        page += 1
        print('Search page: ' + str(page))
        try:
            tweet_data = api.search(q=query, count=100, max_id=next_max_id - 1)
            if len(tweet_data) == 0:
                break
            else:
                next_max_id = tweet_data[-1].id
                #This function saves the tweets to the database
                saveDB_tweet(table_name, page - 1, tweet_data, game_id)
        except tweepy.TweepError as e:
            print('Error: wait 15 minutes')
            print(datetime.datetime.now().strftime("%Y/%m/%d %H:%M:%S"))
            print(e.reason)
            time.sleep(60 * 15)
            continue
        print('*' * 40 + '\n')
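
#saveDB_tweet is not shown in this article (it lives in the GitHub repo).
#A minimal sketch of what it might look like, assuming one SQLite table per team
#holding (tweet id, timestamp, text, game id); the real schema may differ.
#'npb_tweets.db' is an assumed filename; 'page' is kept only to match the call sites.
def saveDB_tweet(table_name, page, tweet_data, game_id):
    with closing(sqlite3.connect('npb_tweets.db')) as conn:
        c = conn.cursor()
        c.execute('CREATE TABLE IF NOT EXISTS ' + table_name +
                  ' (tweet_id INTEGER PRIMARY KEY, created_at TEXT, text TEXT, game_id TEXT)')
        for tweet in tweet_data:
            c.execute('INSERT OR IGNORE INTO ' + table_name + ' VALUES (?, ?, ?, ?)',
                      (tweet.id, str(tweet.created_at), tweet.text, game_id))
        conn.commit()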


#Specify the time → build the query → call the tweet search function (search_livetweet())
def get_livetweet(team_id, game_id):
    date = game_id[:4] + '-' + game_id[4:6] + '-' + game_id[6:8]
    times = gametime(game_id)
    sh, sm = times[0], times[1]
    eh = int(times[0]) + int(times[2])
    em = int(times[1]) + int(times[3]) + 5  #Until 5 minutes after the end
    eh += em // 60  #Carry minutes over into hours
    em = em % 60
    eh = '{0:02d}'.format(eh)
    em = '{0:02d}'.format(em)

    print(date, sh, sm, eh, em)
    tag_list = {0: '#kyojin OR #giants', 1: '#dragons',
                2: '#carp', 3: '#swallows OR #yakultswallows', 4: '#hanshin OR #tigers',
                5: '#baystars', 6: '#lovefighters', 7: '#sbhawks', 8: '#rakuteneagles',
                9: '#seibulions', 10: '#chibalotte', 11: '#Orix_Buffaloes'}
    tag = tag_list[team_id]
    query = tag + ' exclude:retweets exclude:replies' + \
            ' since:' + date + '_' + sh + ':' + sm + ':00_JST' + \
            ' until:' + date + '_' + eh + ':' + em + ':59_JST lang:ja'
    search_livetweet(team_id, api, game_id, query)
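For the first example game above (Giants, 2020-04-01, 18:00 start, 3 hours 15 minutes), the resulting query would look like this:

#kyojin OR #giants exclude:retweets exclude:replies since:2020-04-01_18:00:00_JST until:2020-04-01_21:20:59_JST lang:ja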

Now execute the above functions.

Using the gameId_list and teamId_list created above, get the tweets of both teams for each match.

getLiveTweet_NPB.py



for i in range(len(gameId_list)):
    game_id = gameId_list[i]

    #away
    team_id = teamId_list[2*i+1]
    get_livetweet(team_id, game_id)
    print('='*60 + '\n')

    #home
    team_id = teamId_list[2*i]
    get_livetweet(team_id, game_id)
    print('='*60 + '\n')

In conclusion

If a game is interrupted by rain, you may not be able to collect all of its tweets; that part still needs improvement.

The technique of collecting tweets in a specified time window can be used in any domain, so I hope it serves as a reference.
