Python: Scraping Part 2

# HTML scraping

## How to scrape

In the previous post, I used urllib and requests to get the elements of the web page.

Here, we will extract only the necessary data from them. To be precise, this extraction work is what is called scraping.

Scraping can be done in the following ways.

### Regular expression scraping

Treat the HTML or XML as a plain character string and extract the necessary parts. For example, the re module in the Python standard library lets you retrieve arbitrary strings fairly flexibly.
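As a minimal sketch of this approach, here is an example that pulls a title out of an HTML string with the re module (the HTML snippet is invented purely for illustration):

```python
import re

# An invented HTML fragment, used only for this illustration
html = "<html><head><title>Example Page</title></head><body></body></html>"

# Extract the text between <title> and </title> with a non-greedy group
match = re.search(r"<title>(.*?)</title>", html)
if match:
    print(match.group(1))  # Example Page
```

This works for simple cases, but regular expressions become fragile as the markup grows more complex, which is why a dedicated parser is usually preferred.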

### Scraping with a third-party library

This is the most common approach. Several libraries can scrape data from HTML and similar formats, and they make the job easy.

The typical libraries are introduced below.

## Elements / attributes

We will explain the terms used in XML and HTML using a page written in XML. XML is a markup language like HTML, but more extensible. As a sample, let's fetch the XML page of Yahoo! News.

```python
import requests

r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")
print(r.text)

>>>Output result
<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:blogChannel="http://backend.userland.com/blogChannelModule" version="2.0">
<channel>
<title>Yahoo!News topics-Major</title>
<link>https://news.yahoo.co.jp/</link>
<description>Yahoo!We provide the latest headlines featured in JAPAN News Topics.</description>
<language>ja</language>
<pubDate>Thu, 06 Dec 2018 19:42:33 +0900</pubDate>
<item>
<title>Measures against "brain fatigue" that you are not aware of</title>
<link>https://news.yahoo.co.jp/pickup/6305814</link>
<pubDate>Thu, 06 Dec 2018 19:35:20 +0900</pubDate>
<enclosure length="133" url="https://s.yimg.jp/images/icon/photo.gif" type="image/gif">
</enclosure>
<guid isPermaLink="false">yahoo/news/topics/6305814</guid>
</item>
......(The following is omitted)......
```

The code we obtained contains markup like `<title>(text)</title>`. The text enclosed by such a pair of tags is called an element, and `title` is the element name; `<title>` is the start tag and `</title>` is the end tag. You will also often see markup like `<html lang="ja">`. This means the `lang` attribute has the value `ja`, indicating that the language is Japanese.

## Scraping with Beautiful Soup (1)

Beautiful Soup is a simple, easy-to-learn scraping library. We will continue to explain how to use it with the XML page of Yahoo! News.

```python
# Import the library and module
from bs4 import BeautifulSoup
import requests

# Get the Yahoo! News RSS data
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")

# BeautifulSoup() cannot take a file name or URL directly
soup = BeautifulSoup(r.text, "xml")
```

The BeautifulSoup constructor parses the page you fetched. The first argument is the HTML character string, as str or bytes. The second argument specifies the parser, the program that performs the parsing. Here, the HTML string is broken down element by element and converted into a form that is easy to work with.

The parsers that can be used with Beautiful Soup are shown below. Choose the right parser for your purpose.

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/430767/ba06a91b-49fd-b27d-9385-56e1fe94ab4c.png)

## Scraping with Beautiful Soup (2)

Now that you have specified an appropriate parser, you are ready to parse the web page. Let's extract an arbitrary part of it.

There are several ways to specify the element to retrieve; here we will use the following two.

### Tag name (+ attribute)

If you pass a tag name, an attribute, or both to the find method on the parsed data, you get only the first element that matches. Likewise, the find_all method gets all matching elements as a list.

### CSS selector

If you pass a CSS selector to the select_one method on the parsed data, you get only the first element that matches. Likewise, the select method gets all matching elements as a list.

```python
import requests
from bs4 import BeautifulSoup

# Get the Yahoo! News RSS data
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")

# Parse as XML
soup = BeautifulSoup(r.text, "xml")

# Extract only the first title element
print(soup.find("title"))
print()

# Extract all title elements
print(soup.find_all("title"))

>>>Output result
<title>Yahoo!News topics-Major</title>

[<title>Yahoo!News topics-Major</title>, <title>Go Iyama wins 43rd term and sets new record</title>, <title>Is Mitsuki Takahata and Sakaguchi continuing dating?</title>, <title>Mieko Hanada remarried under 13 years old</title>, ....(The following is omitted)
```

A CSS selector is a notation used by CSS to specify which elements to style. For example, if you specify "body > h1", you get the h1 element that is a direct child of the body element.

```python
# (First half omitted)

# Extract only the very first h1 element inside the body element
print(soup.select_one("body > h1"))
print()

# Extract all h1 elements inside the body element
print(soup.select("body > h1"))
```

## Scraping with Beautiful Soup (3)

In the information acquired in the previous section, the tags were left as they were, because each item in the list still includes its tag. With text, you can retrieve only the text of each retrieved element.

```python
import requests
from bs4 import BeautifulSoup

# Get the Yahoo! News RSS data
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")

# Parse as XML
soup = BeautifulSoup(r.text, "xml")

# Extract the title elements
titles = soup.find_all("title")

# Take each element out of the list with a for statement
# text strips the tags and outputs only the text
for title in titles:
    print(title.text)

>>>Output result
Yahoo!News topics-Major
Investigate for explosion, gross negligence, etc.
NEWS Koyama news every.Get off
Mieko Hanada remarried under 13 years old
...
```

# Get the title of the photo

## Get the title of the photo (1)

So far we have scraped only a single web page, but in practice you will often scrape multiple pages, for example by following a "next page" link.

To scrape multiple pages, you first need to collect the URLs of all the pages you want to scrape.

The exercise web page has page numbers at the bottom, each linking to its page, so it seems good to retrieve those links. The URL of each link destination is described in the href attribute of the `<a>` element.

```python
for url in soup.find_all("a"):
    print(url.get("href"))
```

```python
import requests
from bs4 import BeautifulSoup

# Get Aidemy's exercise web page
authority = "http://scraping.aidemy.net"
r = requests.get(authority)

# Parse with lxml
soup = BeautifulSoup(r.text, "lxml")

# Get the <a> elements to find the page-transition links
urls = soup.find_all("a")

# ----- Collect the URLs to scrape into url_list -----
url_list = []

# Add each page's URL to url_list
for url in urls:
    url = authority + url.get("href")
    url_list.append(url)

# Output the list
print(url_list)
```

## Get the title of the photo (2)

In the previous section, we were able to list the URLs we want to fetch.

By repeating the scraping for each of the acquired URLs, you can get various information such as the photo names and ages.

Also, if you write the acquired information to a database or a file, you can use it for data processing.

```python
import urllib.request
import requests
from bs4 import BeautifulSoup

# Get Aidemy's exercise web page
authority = "http://scraping.aidemy.net"
r = requests.get(authority)

# Parse with lxml
soup = BeautifulSoup(r.text, "lxml")

# Get the <a> elements to find the page-transition links
urls = soup.find_all("a")

# ----- Collect the URLs to scrape into url_list -----
url_list = []

# Add each page's URL to url_list
for url in urls:
    url = authority + url.get("href")
    url_list.append(url)

# ----- Scrape the photo titles -----
# Define a scraping function
def scraping(url):
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # The photo titles are in h3 elements
    photos = soup.find_all("h3")
    photos_list = []
    for photo in photos:
        photo = photo.text
        photos_list.append(photo)
    return photos_list

for url in url_list:
    print(scraping(url))
```
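As a compact, self-contained recap of the lookup methods used above, here is a sketch that runs on an invented HTML string (the tags, titles, and the `/page2` link below are made up for illustration, not taken from the exercise page):

```python
from bs4 import BeautifulSoup

# Invented HTML fragment, used only for this illustration
html = """
<body>
  <h1>Photo gallery</h1>
  <h3>Photo A</h3>
  <h3>Photo B</h3>
  <a href="/page2">next page</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# find / find_all: look up by tag name
print(soup.find("h3").text)                   # Photo A
print([h.text for h in soup.find_all("h3")])  # ['Photo A', 'Photo B']

# select_one: look up by CSS selector (direct child of body)
print(soup.select_one("body > h1").text)      # Photo gallery

# Attribute access: the link URL lives in the href attribute
print(soup.find("a").get("href"))             # /page2
```

Working against a fixed string like this is a convenient way to test your selectors before pointing the same code at a live page.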