[PYTHON] Web scraping with BeautifulSoup4 (serial number page)

Web scraping with Beautiful Soup 4

I wrote a script that walks through pages whose URLs differ only by a serial number and builds a list of URLs for downloading everything at once, so here is a note of it.
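
As a minimal sketch of what "pages with serial numbers" means here: the page URLs share a common prefix and end in a zero-padded number. The helper name `build_page_url` and its `width` parameter are made up for illustration, and `http://hoge.com` is only a placeholder domain.

```python
def build_page_url(domain, index, width=2):
    """Return the URL of the page with the given serial number.

    Hypothetical helper: '0>' pads the number on the left with zeros
    up to `width` characters, so 1 becomes '01'.
    """
    return '{domain}/{index:0>{width}}/'.format(
        domain=domain, index=index, width=width)


if __name__ == '__main__':
    # Print a few serial-numbered URLs for a placeholder domain
    for i in (1, 9, 10, 100):
        print(build_page_url('http://hoge.com', i))
```

Numbers that are already wider than `width` are left unchanged, so the pattern keeps working past page 99.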

Installation

$ apt-get install python-lxml
$ pip install beautifulsoup4

Source

scraper.py


# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

try:
    # Python 3
    from urllib import request
except ImportError:
    # Python 2
    import urllib2 as request

from bs4 import BeautifulSoup
import codecs
import time

def getSoup(url):
    response = request.urlopen(url)
    body = response.read()
    # Parse HTML
    return BeautifulSoup(body, 'lxml')

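wait_sec = 3                    # pause between requests, in seconds
domain = 'http://hoge.com'      # placeholder domain
result_file = 'list.txt'
i = 1
while True:
    # Pages are numbered 01, 02, ...
    url = '{domain}/{index:0>2}/'.format(domain=domain, index=i)
    try:
        soup = getSoup(url)
    except IOError:
        # A failed request (e.g. 404) means there are no more pages
        break

    div = soup.find('div', attrs={'id': 'div_id'})
    if div is None:
        # The target container is missing, so stop here
        break
    all_a = div.find_all('a', attrs={'class': 'a_class'})
    # Collect the image URL inside each link, skipping links without an <img>
    src_list = [a.img['src'] for a in all_a if a.img is not None]
    with codecs.open(result_file, 'a', 'utf-8') as f:
        # End every URL with a newline so consecutive pages do not run together
        f.write(''.join(src + '\n' for src in src_list))
    print(i)
    i += 1

    time.sleep(wait_sec)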
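
To try the extraction step without touching the network, the same `find` / `find_all` calls can be run against an inline HTML string. This is only a sketch: the HTML is made up, `extract_image_urls` is a hypothetical helper name, and `html.parser` (bundled with Python) is used in place of `lxml` so that nothing beyond `beautifulsoup4` is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for one numbered page
SAMPLE_HTML = """
<html><body>
<div id="div_id">
  <a class="a_class" href="/full/1"><img src="http://hoge.com/img/001.jpg"></a>
  <a class="a_class" href="/full/2"><img src="http://hoge.com/img/002.jpg"></a>
  <a class="other" href="/about"><img src="http://hoge.com/img/banner.jpg"></a>
  <a class="a_class" href="/text-only">no image here</a>
</div>
</body></html>
"""


def extract_image_urls(html):
    """Return the src of each <img> inside a.a_class links under #div_id."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', attrs={'id': 'div_id'})
    if div is None:
        # No target container on this page
        return []
    links = div.find_all('a', attrs={'class': 'a_class'})
    # a.img is the first <img> inside the link, or None if there is none
    return [a.img['src'] for a in links if a.img is not None]


if __name__ == '__main__':
    for src in extract_image_urls(SAMPLE_HTML):
        print(src)
```

Links with a different class and links without an image are skipped, so only the two `a_class` image URLs are printed.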

Reference page

[Python: Scraping websites with BeautifulSoup4](http://momijiame.tumblr.com/post/114227737756/python-beautifulsoup4-%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%A6-web-%E3%82%B5%E3%82%A4%E3%83%88%E3%82%92%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0%E3%81%99%E3%82%8B)

Scraping with Python and Beautiful Soup
