[PYTHON] Studying web scraping for the purpose of extracting data from Filmarks # 2

Home page: https://program-board.com

Final goal: Extract data from Filmarks in order to build a list of highly rated movies by age group. In this article, we extract multiple pieces of information from a specific page, limited to a single movie for now.

Check the source code

As before, check the page source around the information you want to extract. In this article, the following items are extracted.

・ Movie title
・ Screening date
・ Country of origin
・ Screening time
・ Genre (up to 3)
・ Star rating
・ Director
・ Screenplay (up to 2 people)
・ Performers (up to 3 people)

import requests
import bs4
import pandas as pd

#Get information from the web
url = 'https://filmarks.com/list/trend'
res = requests.get(url)

#Format the HTML (explicitly specifying the lxml parser)
soup = bs4.BeautifulSoup(res.text, features='lxml')

Extract information for only one work

For now, extract the information of the first movie shown on the page.

#Extract the info block of each movie on the page
infos = soup.select('div.p-movie-cassette__info')
movie = infos[0]
print('Number of works: {}'.format(len(infos)))
print(movie.prettify())
(Screenshot: prettified HTML of the first movie's info block)

Extraction of multiple items

From here, we try to extract multiple items of information. An error came up along the way, so the trial and error involved in resolving it is also described.

Error occurred

Since the director's information was tagged with 'a.c-label' in the page source, I tried to extract it with 'a.c-label' in the same way as last time. However, not only the director but also the screenplay and cast were extracted. On closer inspection, the screenplay and cast were also tagged with 'a.c-label' in the source code.

(Screenshot: page source showing the director, screenplay, and cast all tagged with a.c-label)
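
To see the problem concretely, here is a minimal check reusing the movie block extracted above: selecting 'a.c-label' returns the director, screenplay, and cast links as one flat list, with no indication of which role each name belongs to.

#Minimal check: 'a.c-label' matches director, screenplay, and cast alike
labels = movie.select('a.c-label')
print([label.text for label in labels])
#-> a flat list of names, with no way to tell which role each belongs to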

Countermeasures against errors

Therefore, when organizing the extracted information, some care is needed, such as storing each item as an element of a list. To keep the extracted information organized, the code for extracting the multiple items is defined as a function, movie_info(). The code and output are as follows.

def movie_info(info):

    #For output
    out_list = []
    
    #title
    title = info.select('h3.p-movie-cassette__title')[0].text
    out_list.append(title)
    
    #Screening date
    release_date = info.select('span')[2].text
    release_date = '{}/{}/{}'.format(release_date[0:4],release_date[5:7],release_date[8:10])
    out_list.append(release_date)

    #Country of origin
    country = info.select('div.p-movie-cassette__other-info')[0].select('a')[0].text
    out_list.append(country)

    #Screening time (strip the Japanese "minutes" suffix)
    time = info.select('span')[3].text.replace('分','')
    out_list.append(time)
    
    #Genre(Up to 3)
    genre_list = ['-','-','-']
    
    genre_web = info.select('div.p-movie-cassette__genre')[0].select('a')#Genre list creation
    for i in range(len(genre_web)):
        genre_list.insert(i,genre_web[i].text)
        
    out_list.append(genre_list[0])
    out_list.append(genre_list[1])
    out_list.append(genre_list[2])
    
    #Star(Evaluation)
    score = info.select('div.c-rating__score')[0].text
    out_list.append(score)
    
    #Director
    director= info.select('div.p-movie-cassette__people-wrap')[0].select('a')[0].text
    out_list.append(director)
    
    #Screenplay(Up to 2 people)
    scenario_list = ['-','-']
    
    scenario_web = info.select('div.p-movie-cassette__people-wrap')[1].select('a')
    for i in range(len(scenario_web)):
        scenario_list.insert(i,scenario_web[i].text)
        
    out_list.append(scenario_list[0])
    out_list.append(scenario_list[1])
    
    #Performer(Up to 3 people)
    cast_list = ['-','-','-']
    
    cast_web = info.select('div.p-movie-cassette__people-wrap')[2].select('a')
    for i in range(len(cast_web)):
        cast_list.insert(i,cast_web[i].text)
        
    out_list.append(cast_list[0])
    out_list.append(cast_list[1])
    out_list.append(cast_list[2])
    
    return out_list

###################################
['Joker',
 '2019/10/04',
 'America',
 '122',
 'Drama',
 'Crime',
 'Thriller',
 '4.4',
 'Todd Phillips',
 'Todd Phillips',
 '-',
 'Joaquin Phoenix',
 'Robert de Niro',
 '-']
###################################
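
As a usage sketch, movie_info() can be applied to every block in infos and the rows collected into a pandas DataFrame (which is why pandas was imported at the top). The column names below are my own illustrative choices, not part of the original code.

#Sketch: build a table from all movie blocks on the page.
#Note: as described in the next section, movies without a screenplay
#section will raise an IndexError inside movie_info().
columns = ['title', 'release_date', 'country', 'time',
           'genre1', 'genre2', 'genre3', 'score', 'director',
           'scenario1', 'scenario2', 'cast1', 'cast2', 'cast3']
rows = [movie_info(info) for info in infos]
df = pd.DataFrame(rows, columns=columns)
print(df.head())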

Correction points for the next time

I tried to extract information for multiple works with this code, but some works did not have a screenplay section. Therefore, instead of relying on list indices, the code should be modified so that each section can be identified directly by its label, such as "Screenplay", as sketched below.
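
As a rough sketch of that fix, each p-movie-cassette__people-wrap block could be matched by its label text instead of its position. The label strings '監督', '脚本', and '出演者' are assumptions about the page structure and would need to be confirmed against the actual HTML.

#Rough sketch of the planned fix (label texts are assumed, not confirmed)
def people_by_label(info, label):
    for block in info.select('div.p-movie-cassette__people-wrap'):
        if label in block.text:
            return [a.text for a in block.select('a')]
    return []  #the section (e.g. screenplay) is missing for this movie

directors = people_by_label(movie, '監督')
scenario = people_by_label(movie, '脚本')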
