Web crawling, web scraping, text extraction and image saving with Python

Preparation

import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

Create a working folder

output_folder = Path('Working folder')
output_folder.mkdir(exist_ok=True)

I want to get Yahoo! weather data

Fetch the page's HTML using requests.

url = 'https://weather.yahoo.co.jp/weather/jp/13/4410.html'
html = requests.get(url).text
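
The request above assumes the server answers quickly and successfully. As a minimal sketch, the same fetch could be wrapped so that a slow server or an HTTP error fails loudly. The helper name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors.

    fetch_html and the default timeout are illustrative choices,
    not names taken from the article.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

With this helper, the line above would become html = fetch_html(url).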

The raw HTML is hard to read as it is, so parse it into a structured object with Beautiful Soup.

soup = BeautifulSoup(html, 'lxml')

Inspect the soup to find where the information you want is located. This time I want to get **today's and tomorrow's weather**.

Search for that text with Ctrl+F.


The heading I was looking for has class="yjMt".

Select the elements with that class from the soup.

today = soup.select('.yjMt')

To get div elements, use select('div'). To get elements by class, use select('.class'). To get an element by id, use select('#id'). For img elements, soup.find_all('img') may be more convenient than select.
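
The selector forms listed above can be tried on a small piece of markup. This is only a sketch: the HTML string is invented for illustration, and it uses the built-in html.parser instead of lxml so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this example (not the real Yahoo page).
sample_html = """
<div id="main">
  <h2 class="yjMt">Today tomorrow weather</h2>
  <p class="pict"><img src="a.png" alt="Sunny"></p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

divs = sample_soup.select("div")        # by tag name
headings = sample_soup.select(".yjMt")  # by class
main = sample_soup.select("#main")      # by id
images = sample_soup.find_all("img")    # every img tag in the document
```

Both select and find_all return list-like results, even when only one element matches.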

Check the acquired contents

today
[<h2 class="yjMt">Today tomorrow weather</h2>,
 <h2 class="yjMt">Weekly weather</h2>,
 <h2 class="yjMt">Pinpoint weather</h2>]

select returned a list of three elements, so pick the one you want by its list index.
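
As a small sketch of that indexing step, the markup below is copied from the output shown above, and get_text() is used to read the text of a single element. The variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown above.
sample_html = (
    '<h2 class="yjMt">Today tomorrow weather</h2>'
    '<h2 class="yjMt">Weekly weather</h2>'
    '<h2 class="yjMt">Pinpoint weather</h2>'
)
headings = BeautifulSoup(sample_html, "html.parser").select(".yjMt")

first_heading = headings[0].get_text()           # text of the first match
all_headings = [h.get_text() for h in headings]  # text of every match
```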

Get the maximum and minimum temperatures in the same way

high = soup.select('.high')
low = soup.select('.low')
low
[<li class="low"><em>25</em>℃[+2]</li>,
 <li class="low"><em>28</em>℃[+3]</li>]

The list holds both today's and tomorrow's values, so pick today's by its list index and strip the surrounding markup.

today_low = str(low[0]).replace('<li class="low"><em>', '').replace('</em>', '').replace('</li>', '')
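
Stripping tags with chained replace calls breaks as soon as the markup changes slightly. A possible alternative, sketched here on markup copied from the output above, is to let Beautiful Soup return the text itself. The variable names today_low_text and today_low_value are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown above, wrapped in a ul for validity.
sample_html = (
    '<ul><li class="low"><em>25</em>℃[+2]</li>'
    '<li class="low"><em>28</em>℃[+3]</li></ul>'
)
low = BeautifulSoup(sample_html, "html.parser").select(".low")

# Full text of today's entry, tags removed.
today_low_text = low[0].get_text()

# Only the number inside the <em> tag, converted to an integer.
today_low_value = int(low[0].em.get_text())
```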

Image acquisition

Right-click the image on the website, copy its URL, and search for that URL in the soup with Ctrl+F.

The image turns out to sit inside an element with class pict.

pict = soup.select('.pict')
pict
[<p class="pict"><img alt="Cloudy and sometimes rain" border="0" src="https://s.yimg.jp/images/weather/general/next/size150/203_day.png"/>Cloudy and sometimes rain</p>,
 <p class="pict"><img alt="Cloudy then sunny" border="0" src="https://s.yimg.jp/images/weather/general/next/size150/266_day.png"/>Cloudy then sunny</p>,
 <div class="cmnMod pict">
 <ul>
 <li>
 <dl>
 <dt>Rain cloud radar</dt>
 <dd><a data-ylk="slk:zmradar; pos:1" href="//weather.yahoo.co.jp/weather/zoomradar/?lat=35.6965&amp;lon=139.4472&amp;z=10"><img alt="Movement of rain clouds" height="150" src="https://weather-pctr.c.yimg.jp/r/iwiz-weather/raincloud/1599021000/202010-0000-pf1300-20200902133000.gif?w=200&amp;h=150" width="200"/>
 </a></dd>
 </dl>
 </li><!--
 --><li>
 <dl>
 <dt>Weather map</dt>
 <dd><a data-ylk="slk:chart; pos:1" href="/weather/chart/"><img alt="Weather map" height="150" src="https://weather-pctr.c.yimg.jp/r/iwiz-weather/chart_v2/1599012878/WM_ChartA_20200902-090000.jpg?w=200&amp;h=150" width="200"/>
 </a></dd>
 </dl>
 </li><!--
 --><li>
 <dl>
 <dt>Meteorological satellite</dt>
 <dd><a data-ylk="slk:stlt; pos:1" href="/weather/satellite/"><img alt="Meteorological satellite" height="150" src="https://weather-pctr.c.yimg.jp/r/iwiz-weather/satellite_v2/1599022735/WM_H-JPN-IR_20200902-140000.jpg?w=200&amp;h=150" width="200"/>
 </a></dd>
 </dl>
 </li>
 </ul>
 </div>]

I want to get only the URL of the image. The URL is enclosed in double quotes ("), so split the string on that character and pick the list element that holds the URL.

sp = re.split('"', str(pict))
sp[7]
'https://s.yimg.jp/images/weather/general/next/size150/203_day.png'
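
Picking index 7 from the split only works while the attributes stay in the same order. As a hedged alternative, the src attribute can be read directly from the img tag. The markup below is copied from the first element of the output above; the names img_tag, img_url and img_alt are illustrative.

```python
from bs4 import BeautifulSoup

# First element of the .pict output shown above.
sample_html = (
    '<p class="pict"><img alt="Cloudy and sometimes rain" border="0" '
    'src="https://s.yimg.jp/images/weather/general/next/size150/203_day.png"/>'
    'Cloudy and sometimes rain</p>'
)
pict = BeautifulSoup(sample_html, "html.parser").select(".pict")

img_tag = pict[0].find("img")  # first img inside the first .pict element
img_url = img_tag["src"]       # attribute access works like a dict
img_alt = img_tag.get("alt")   # .get returns None if the attribute is missing
```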

Download the image from the URL and display it with PIL.

from PIL import Image
from io import BytesIO

img = requests.get(sp[7]).content
today_pict = Image.open(BytesIO(img))
today_pict
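
Image.open expects a file or a file-like object, which is why the downloaded bytes are wrapped in BytesIO. The sketch below shows the same pattern without a network call: the PNG bytes are generated in memory as a stand-in for requests.get(...).content, which is an assumption made so the example runs offline.

```python
from io import BytesIO

from PIL import Image

# Stand-in for the downloaded bytes: build a small PNG in memory.
buffer = BytesIO()
Image.new("RGB", (150, 150), "white").save(buffer, format="PNG")
img_bytes = buffer.getvalue()

# Same pattern as above: wrap raw bytes in BytesIO so PIL can read them.
picture = Image.open(BytesIO(img_bytes))
size = picture.size    # (width, height) in pixels
fmt = picture.format   # image format detected by PIL
```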


Another solution

You can also select only the img elements that sit directly inside a tags.

a_img = soup.select('a > img')

Split the result on double quotes and keep only the strings that contain the .jpg extension.

str_img = str(a_img).split('"')
l_in = [s for s in str_img if '.jpg' in s]
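
The same filtering can be sketched with attribute access instead of string splitting. The markup below is shortened from the output shown earlier (the href values are simplified for illustration), and urlparse is used so that the query string after the file name does not interfere with the extension check.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Two a > img pairs, shortened from the output shown earlier.
sample_html = (
    '<a href="/weather/chart/"><img alt="Weather map" '
    'src="https://weather-pctr.c.yimg.jp/r/iwiz-weather/chart_v2/'
    '1599012878/WM_ChartA_20200902-090000.jpg?w=200&amp;h=150"/></a>'
    '<a href="/weather/zoomradar/"><img alt="Movement of rain clouds" '
    'src="https://weather-pctr.c.yimg.jp/r/iwiz-weather/raincloud/'
    '1599021000/202010-0000-pf1300-20200902133000.gif?w=200&amp;h=150"/></a>'
)
a_img = BeautifulSoup(sample_html, "html.parser").select("a > img")

# Read each src attribute, then keep URLs whose path ends in .jpg.
srcs = [img["src"] for img in a_img]
jpg_urls = [s for s in srcs if urlparse(s).path.endswith(".jpg")]
```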

Save

today_pict.save("today_pict.png")
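
The working folder created in the preparation step can also serve as the save destination. A minimal sketch, assuming the same folder name as above and using a blank image as a stand-in for the downloaded picture:

```python
from pathlib import Path

from PIL import Image

# Same folder name as in the preparation step.
output_folder = Path("Working folder")
output_folder.mkdir(exist_ok=True)

# Stand-in for the downloaded picture (an assumption for this example).
today_pict = Image.new("RGB", (150, 150), "white")

# Build the destination path with pathlib and save there.
save_path = output_folder / "today_pict.png"
today_pict.save(save_path)
```

PIL infers the output format from the file extension, so no format argument is needed here.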

That's all.
