Home page: https://program-board.com
Final goal: Extract data for the purpose of creating a list of movies that are highly evaluated by age group from Filmarks. In this paper, we extract information about multiple items on a specific page. There is only one movie to extract.
Check the source code of the information you want to extract as before. In this paper, the following information is extracted.
・ Movie title ・ Screening date ・ Country of origin ・ Screening time ・ Genre (up to 3) ・ Star (evaluation) ·directed by ・ Screenplay (up to 2 people) ・ Performers (up to 3 people)
import requests
import bs4
import pandas as pd
#Get information on the web
url= 'https://filmarks.com/list/trend'
res = requests.get(url)
#HTML formatting
soup = bs4.BeautifulSoup(res.text) #features='lxml')
For the time being, extract the web information of the first work on the display screen.
infos = soup.select('div.p-movie-cassette__info')
#infos = infos.select('a.c-label')
movie = infos[0]
print('Number of works:{}'.format(len(infos)))
print(movie.prettify())
We will try to extract information for multiple items from here. However, since an error has occurred, trial and error for the error is also described.
Since the source code was'a.c-label'to extract the director's information in the same way as last time, I tried to extract with'a.c-label'. However, not only the director but also the script and cast were extracted. Upon confirmation, the director, script, and cast were also tagged with'a.c-label' in the source code.
Therefore, when organizing the extracted information, it is necessary to devise such as storing it as an element of the list. The code for extracting multiple items is defined in movie_info () for the purpose of organizing the extraction information. The code and output result are as follows.
def movie_info(info):
#For output
out_list = []
#title
title = info.select('h3.p-movie-cassette__title')[0].text
out_list.append(title)
#Screening date
release_date = info.select('span')[2].text
release_date = '{}/{}/{}'.format(release_date[0:4],release_date[5:7],release_date[8:10])
out_list.append(release_date)
#Country of origin
country = info.select('div.p-movie-cassette__other-info')[0].select('a')[0].text
out_list.append(country)
#Screening time
time = info.select('span')[3].text.replace('Minutes','')
out_list.append(time)
#Genre(Up to 3)
genre_list = ['-','-','-']
genre_web = info.select('div.p-movie-cassette__genre')[0].select('a')#Genre list creation
for i in range(len(genre_web)):
genre_list.insert(i,genre_web[i].text)
out_list.append(genre_list[0])
out_list.append(genre_list[1])
out_list.append(genre_list[2])
#Star(Evaluation)
score = info.select('div.c-rating__score')[0].text
out_list.append(score)
#directed by
director= info.select('div.p-movie-cassette__people-wrap')[0].select('a')[0].text
out_list.append(director)
#Screenplay(Up to 2 people)
scenario_list = ['-','-','-']
scenario_web = info.select('div.p-movie-cassette__people-wrap')[1].select('a')
for i in range(len(scenario_web)):
scenario_list.insert(i,scenario_web[i].text)
out_list.append(scenario_list[0])
out_list.append(scenario_list[1])
#Performer(Up to 3 people)
cast_list = ['-','-','-']
cast_web = info.select('div.p-movie-cassette__people-wrap')[2].select('a')
for i in range(len(cast_web)):
cast_list.insert(i,cast_web[i].text)
out_list.append(cast_list[0])
out_list.append(cast_list[1])
out_list.append(cast_list[2])
return out_list
###################################
['Joker',
'2019/10/04',
'America',
'122',
'Drama',
'Crime',
'Thriller',
'4.4',
'Todd Phillips',
'Todd Phillips',
'-',
'Joaquin Phoenix',
'Robert de Niro',
'-']
###################################
I tried to extract information on multiple works based on this code, but some works did not have a script. Therefore, instead of using the elements of the list, modify it so that it can be identified directly by items such as "screenplay".
Recommended Posts