[Beginner] Python web scraping using Google Colaboratory

I found the following article easy to understand. Reference: Scraping with Python (1) Introduction | Let's automatically extract data using scraping

I now understand the basic flow of web scraping. As a next step, I would like to modify the code so that it can collect information from multiple sites.

Get the HTML of a site

import requests
from bs4 import BeautifulSoup
html_doc = requests.get("https://www.yahoo.co.jp/").text  # Fetch the HTML of the Yahoo! JAPAN top page
soup = BeautifulSoup(html_doc, 'html.parser')  # Parse the HTML with Beautiful Soup
print(soup.prettify())  # Print the HTML indented so it is easier to read

Process the acquired HTML

# Get the title
title = soup.title.text
print(title)

# Reference:
# How to use Requests in Python
# https://note.nkmk.me/python-requests-usage/
#
# Attributes of the Response object
# URL: url attribute
# Status code: status_code attribute
# Encoding: encoding attribute
# Response headers: headers attribute
# Text: text attribute
# Binary data: content attribute
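
The Response attributes listed above can be put to use when fetching a page. The following is a minimal sketch, not the original author's code: the helper name `fetch_html` and its `session` parameter are assumptions introduced here for illustration. It checks the status code before reading `text`, and re-detects the encoding when requests has fallen back to its ISO-8859-1 default.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the decoded HTML of `url`, raising on HTTP errors.

    `fetch_html` and `session` are hypothetical names used for this sketch.
    """
    # Use the given session-like object if provided, else the requests module
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError when status_code is 4xx or 5xx
    response.raise_for_status()
    # Without a charset in the headers, requests assumes ISO-8859-1 for text,
    # which can garble Japanese pages; fall back to content-based detection
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text
```

Calling `fetch_html("https://www.yahoo.co.jp/")` would then stand in for `requests.get(...).text` in the first snippet.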


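# Get the description
# find() returns None when the page has no <meta name="description"> tag
meta_description = soup.find('meta', {'name': 'description'})
description = meta_description['content'] if meta_description else None
print(description)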

# Get multiple tags at once
tags = soup.find_all("a")
print(tags)
## Result
[<a class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" data-ylk="slk:help;pos:0" href="https://www.yahoo-help.jp/">help</a>,
 <a class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" data-ylk="rsec:header;slk:logo;pos:0" href="https://www.yahoo.co.jp">Yahoo! JAPAN</a>,
 <a aria-label="Transition to premium" class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" data-ylk="rsec:header;slk:premium;pos:0" href="https://premium.yahoo.co.jp/"><p class="oLvk9L5Yk-9JOuzi-OHW5"><span class="t_jb9bKlgIcajcRS2hZAP">premium</span><span class="_2Uq6Pw5lfFfxr_OD36xHp6 _3JuM5k4sY_MJiSvJYtVLd_ Y8gFtzzcdGMdFngRO9qFV" style="width:36px;height:38px"></span></p></a>,
 <a aria-label="Transition to card" class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" data-ylk="rsec:header;slk:card;pos:0" href="https://card.yahoo.co.jp/service/redirect/top/"><p class="oLvk9L5Yk-9JOuzi-OHW5"><span class="t_jb9bKlgIcajcRS2hZAP">card</span><span class="_2Uq6Pw5lfFfxr_OD36xHp6 _3JuM5k4sY_MJiSvJYtVLd_ _1MaEI7rEHB4FpQ1MwfWxIK" style="width:36px;height:38px"></span></p></a>,
 <a aria-label="Transition to email" class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" data-ylk="rsec:header;slk:mail;pos:0" href="https://mail.yahoo.co.jp/"><p class="oLvk9L5Yk-9JOuzi-OHW5"><span class="t_jb9bKlgIcajcRS2hZAP">Email</span><span class="_2Uq6Pw5lfFfxr_OD36xHp6 _3JuM5k4sY_MJiSvJYtVLd_ _3Qi5P0lTFbNkWishPzz8tb" style="width:36px;height:38px"></span></p></a>,
...]

# Get the text and link of each <a> tag obtained above
for tag in tags:
    print(tag.string)
    print(tag.get("href"))
## Result
help
https://www.yahoo-help.jp/
Yahoo! JAPAN
https://www.yahoo.co.jp
None
https://premium.yahoo.co.jp/
...
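
The `None` in the result above appears because `tag.string` only returns text when the tag contains a single string; the premium link wraps its label in nested `<p>` and `<span>` tags. The sketch below, run on a small inline HTML sample rather than the live page, shows `get_text(strip=True)` as one possible alternative, together with `urljoin` for turning relative links into absolute ones. The `/about` link and `base_url` are made-up values for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Inline sample modeled on the output above; the "/about" link is hypothetical
sample_html = """
<a href="https://www.yahoo-help.jp/">help</a>
<a href="https://premium.yahoo.co.jp/"><p><span>premium</span><span></span></p></a>
<a href="/about">About</a>
"""
base_url = "https://www.yahoo.co.jp/"  # assumed page URL for resolving links

soup = BeautifulSoup(sample_html, "html.parser")
links = []
for tag in soup.find_all("a"):
    text = tag.get_text(strip=True)  # collects text from nested tags too
    url = urljoin(base_url, tag.get("href", ""))  # absolute URLs are kept as-is
    links.append((tag.string, text, url))

for string_value, text, url in links:
    print(string_value, text, url)
```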

Save the data to CSV

import pandas as pd
from google.colab import files

columns = ["name", "url"]
# Collect the link text and URL of each <a> tag as one row
# (DataFrame.append was removed in pandas 2.0, so build the rows first)
rows = []
for tag in tags:
    name = tag.string
    url = tag.get("href")
    rows.append([name, url])
df = pd.DataFrame(rows, columns=columns)
print(df)
# Output to CSV with the name result.csv
filename = "result.csv"
df.to_csv(filename, encoding='utf-8-sig', index=False)
files.download(filename)

# Reference:
# Export / append a CSV file with pandas (to_csv)
# https://note.nkmk.me/python-pandas-to-csv/
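
As a first step toward the goal of collecting information from multiple sites, the title and description logic can be wrapped in a function and applied to each page in turn. This is only a sketch under stated assumptions: `summarize_page`, the `pages` mapping, and the example.com URLs are hypothetical, and inline HTML strings stand in for what `requests.get(url).text` would return.

```python
from bs4 import BeautifulSoup


def summarize_page(url, html):
    """Return the URL, title, and meta description of one page as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> tag
    title = soup.title.get_text(strip=True) if soup.title else None
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content") if meta else None
    return {"url": url, "title": title, "description": description}


# Hypothetical pages; in practice each value would be fetched with requests
pages = {
    "https://example.com/a": (
        "<html><head><title>Page A</title>"
        "<meta name='description' content='First page'></head></html>"
    ),
    "https://example.com/b": "<html><head><title>Page B</title></head></html>",
}

rows = [summarize_page(url, html) for url, html in pages.items()]
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` and saved with `to_csv` in the same way as the single-site example.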
