[PYTHON] Studying web scraping for the purpose of extracting data from Filmarks # 2

Home page: https://program-board.com

Final goal: Extract data from Filmarks in order to build a list of movies that are rated highly by each age group. In this post, we extract several items from a specific page. Only one movie is extracted this time.

Check the source code

As before, check the page source for the information you want to extract. In this post, the following items are extracted.

・ Movie title
・ Release date
・ Country of origin
・ Running time
・ Genre (up to 3)
・ Rating (stars)
・ Director
・ Screenplay (up to 2 people)
・ Cast (up to 3 people)

import requests
import bs4
import pandas as pd

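# Fetch the page
url = 'https://filmarks.com/list/trend'
res = requests.get(url)

# Parse the HTML; naming the parser explicitly avoids bs4's "no parser specified" warning
soup = bs4.BeautifulSoup(res.text, 'html.parser')  # or features='lxml' if lxml is installed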
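
The fetch step above assumes that the request succeeds. A slightly more defensive version might look like the sketch below. The helper names `parse_html` and `fetch_soup` and the 10-second timeout are assumptions made for illustration and are not part of the original code.

```python
import requests
import bs4


def parse_html(html):
    """Parse an HTML string with Python's built-in parser (no lxml needed)."""
    return bs4.BeautifulSoup(html, 'html.parser')


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
    return parse_html(res.text)


# Usage (needs network access):
# soup = fetch_soup('https://filmarks.com/list/trend')
```

Separating parsing from downloading also makes it possible to try selectors on a saved or hand-written HTML string without hitting the site each time.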

Extract information for only one work

For now, extract the HTML of the first work shown on the page.

infos = soup.select('div.p-movie-cassette__info')
#infos = infos.select('a.c-label')
movie = infos[0]
print('Number of works:{}'.format(len(infos)))
print(movie.prettify())
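
`select()` always returns a list, so `infos[0]` raises an `IndexError` if nothing matches. The sketch below shows one way to guard against that, using a small hand-written HTML string in place of the real page. Only the class names come from this post; the sample markup and the movie titles are made up.

```python
import bs4

# Tiny stand-in for the real page; only the class names come from the post.
SAMPLE_HTML = """
<div class="p-movie-cassette__info">
  <h3 class="p-movie-cassette__title">Movie A</h3>
</div>
<div class="p-movie-cassette__info">
  <h3 class="p-movie-cassette__title">Movie B</h3>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# select() returns a list of every matching tag (possibly empty)
infos = soup.select('div.p-movie-cassette__info')
if not infos:
    raise ValueError('no works found; the page layout may have changed')

# select_one() returns the first match, or None when there is none
first = infos[0]
title_tag = first.select_one('h3.p-movie-cassette__title')
first_title = title_tag.get_text(strip=True) if title_tag else '-'

print(len(infos), first_title)
```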

[Screenshot: prettified HTML of the first work]

Extraction of multiple items

From here, we try to extract multiple items. The first attempt did not behave as expected, so the trial and error around that problem is described as well.

Error occurred

In the page source the director is marked up as 'a.c-label', so I tried to extract the director with 'a.c-label' in the same way as last time. However, this returned not only the director but also the screenplay and cast. On checking the source, the screenplay and cast links are tagged with 'a.c-label' as well.

[Screenshot: page source in which the director, screenplay, and cast links all use a.c-label]
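
The behaviour described above can be reproduced on a small hand-written snippet. The class names are the ones used in this post; the markup itself and the people's names are invented for the example.

```python
import bs4

# Made-up markup that mimics the situation: every person link is a.c-label.
SAMPLE_HTML = """
<div class="p-movie-cassette__info">
  <div class="p-movie-cassette__people-wrap">
    <a class="c-label">Director X</a>
  </div>
  <div class="p-movie-cassette__people-wrap">
    <a class="c-label">Writer Y</a>
  </div>
  <div class="p-movie-cassette__people-wrap">
    <a class="c-label">Actor Z1</a>
    <a class="c-label">Actor Z2</a>
  </div>
</div>
"""

info = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Selecting on the link class alone mixes all roles together
everyone = [a.get_text(strip=True) for a in info.select('a.c-label')]

# Selecting the wrapper blocks first keeps each role in its own list
wraps = info.select('div.p-movie-cassette__people-wrap')
per_block = [[a.get_text(strip=True) for a in w.select('a.c-label')]
             for w in wraps]

print(everyone)
print(per_block)
```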

Countermeasures against errors

Therefore, the extracted values need to be organized, for example by storing each of them as an element of a list in a fixed order. To keep this organized, the code for extracting multiple items is defined as the function movie_info(). The code and its output are as follows.

def movie_info(info):

    # Values are appended in a fixed order and returned as one list
    out_list = []

    # Title
    title = info.select('h3.p-movie-cassette__title')[0].text
    out_list.append(title)
    
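    # Release date: slice by position, assuming a fixed-width string with
    # one separator character after the year and after the month
    release_date = info.select('span')[2].text
    release_date = '{}/{}/{}'.format(release_date[0:4], release_date[5:7], release_date[8:10])
    out_list.append(release_date)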

    #Country of origin
    country = info.select('div.p-movie-cassette__other-info')[0].select('a')[0].text
    out_list.append(country)

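    # Running time: strip the unit suffix and keep only the number
    # ('Minutes' is the unit as written in this post; use the exact text shown on the page)
    running_time = info.select('span')[3].text.replace('Minutes', '')
    out_list.append(running_time)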
    
    # Genre (up to 3): keep the first three and pad with '-' if there are fewer
    genre_web = info.select('div.p-movie-cassette__genre')[0].select('a')
    genre_list = [a.text for a in genre_web][:3]
    genre_list += ['-'] * (3 - len(genre_list))

    out_list.extend(genre_list)
    
    # Rating (stars)
    score = info.select('div.c-rating__score')[0].text
    out_list.append(score)
    
    # Director: assumed to be the first people block
    director = info.select('div.p-movie-cassette__people-wrap')[0].select('a')[0].text
    out_list.append(director)
    
    # Screenplay (up to 2 people): pad with '-' if there are fewer
    # NOTE: index 1 assumes that a screenplay block exists on the page
    scenario_web = info.select('div.p-movie-cassette__people-wrap')[1].select('a')
    scenario_list = [a.text for a in scenario_web][:2]
    scenario_list += ['-'] * (2 - len(scenario_list))

    out_list.extend(scenario_list)
    
    # Cast (up to 3 people): pad with '-' if there are fewer
    # NOTE: index 2 assumes that director and screenplay blocks come first
    cast_web = info.select('div.p-movie-cassette__people-wrap')[2].select('a')
    cast_list = [a.text for a in cast_web][:3]
    cast_list += ['-'] * (3 - len(cast_list))

    out_list.extend(cast_list)
    
    return out_list

###################################
['Joker',
 '2019/10/04',
 'America',
 '122',
 'Drama',
 'Crime',
 'Thriller',
 '4.4',
 'Todd Phillips',
 'Todd Phillips',
 '-',
 'Joaquin Phoenix',
 'Robert de Niro',
 '-']
###################################
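
Because the returned list relies on position alone, it can help to attach a name to each position before using the values. The sketch below pairs the output shown above with column names. The names in `COLUMNS` are my own choice for illustration and do not appear in the original code.

```python
# Hypothetical column names, one per position in the list returned by movie_info()
COLUMNS = [
    'title', 'release_date', 'country', 'running_time',
    'genre_1', 'genre_2', 'genre_3',
    'rating', 'director',
    'screenplay_1', 'screenplay_2',
    'cast_1', 'cast_2', 'cast_3',
]

# The output shown above, copied as a literal so the example runs offline
row = ['Joker', '2019/10/04', 'America', '122',
       'Drama', 'Crime', 'Thriller',
       '4.4', 'Todd Phillips',
       'Todd Phillips', '-',
       'Joaquin Phoenix', 'Robert de Niro', '-']

# Fail early if the function and the column list ever get out of step
if len(row) != len(COLUMNS):
    raise ValueError('expected {} values, got {}'.format(len(COLUMNS), len(row)))

record = dict(zip(COLUMNS, row))
print(record['title'], record['rating'])
```

A list of such dictionaries can be passed directly to `pd.DataFrame` once several works are extracted.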

Points to fix next time

I tried to extract information for multiple works with this code, but some works have no screenplay credit. In that case the positional index into the people blocks presumably points at the wrong block, or at nothing at all. Next time, instead of relying on list positions, I will modify the code so that each block is identified directly by its label, such as "screenplay".
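
One possible shape for that fix is sketched below. It assumes that each people block contains a heading element naming its role. The `<h4>` tag, the English role names, and the helper names `people_by_label` and `pad` are all assumptions for illustration, so the selectors must be adjusted to whatever the real page source uses.

```python
import bs4

# Assumed markup: each people block starts with a heading naming its role.
# This sample has no screenplay block, like the problematic works.
SAMPLE_HTML = """
<div class="p-movie-cassette__info">
  <div class="p-movie-cassette__people-wrap">
    <h4>Director</h4>
    <a class="c-label">Director X</a>
  </div>
  <div class="p-movie-cassette__people-wrap">
    <h4>Cast</h4>
    <a class="c-label">Actor Z1</a>
    <a class="c-label">Actor Z2</a>
  </div>
</div>
"""


def people_by_label(info):
    """Map each role label to the list of names in its block."""
    people = {}
    for wrap in info.select('div.p-movie-cassette__people-wrap'):
        heading = wrap.select_one('h4')  # hypothetical label element
        if heading is None:
            continue  # skip blocks without a recognizable label
        names = [a.get_text(strip=True) for a in wrap.select('a.c-label')]
        people[heading.get_text(strip=True)] = names
    return people


def pad(values, size, filler='-'):
    """Return exactly `size` items: truncate, then fill with `filler`."""
    values = list(values)[:size]
    return values + [filler] * (size - len(values))


info = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
people = people_by_label(info)

director = pad(people.get('Director', []), 1)
screenplay = pad(people.get('Screenplay', []), 2)  # missing block -> all fillers
cast = pad(people.get('Cast', []), 3)

print(director, screenplay, cast)
```

With a lookup keyed by label, a missing screenplay block yields placeholders instead of shifting the cast into the screenplay columns.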