[PYTHON] Scraping 2 How to scrape

Aidemy 2020/9/30


Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share what I learn there, so I am summarizing it on Qiita. I am very happy that so many people read the previous summary article. Thank you! This is the second post on scraping. I hope you find it useful.

What to learn this time
・ How to scrape (see "Scraping 1" for the crawling that precedes it)
・ Scraping multiple pages


How to scrape

・ (Review) Scraping means fetching a Web page and extracting the necessary data from it.
・ There are two ways to scrape: using regular expressions (the re module) or using a third-party library. This article uses the more common approach, a third-party library.
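
For contrast, the regular-expression approach mentioned above might look like the following minimal sketch. The HTML string is a made-up sample rather than a real page, and the pattern is only an assumption about how such a page could be written. Patterns like this tend to break on real-world markup, which is one reason a library is the more common choice.

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = '<html lang="ja"><head><title>Sample page</title></head><body><h1>Hello</h1></body></html>'

# Non-greedy match of whatever sits between <title> and </title>.
match = re.search(r"<title>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)

# Fall back to None when the page has no title element.
title = match.group(1).strip() if match else None
print(title)
```

The rest of this article gets the same kind of result with Beautiful Soup, without writing a pattern by hand.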

About HTML and XML

・ XML is a <strong>markup language</strong>, like the HTML that Web pages are written in, and it is more extensible than HTML.
・ In HTML and XML, text is enclosed in markers such as <strong>&lt;title&gt;</strong>. The marker itself (&lt;title&gt;) is called a <strong>tag</strong>, and the whole unit, from the opening tag through the enclosed text to the closing tag, is called an <strong>element</strong>.
・ A notation such as &lt;html lang="ja"&gt; means that the <strong>lang attribute is ja</strong>, that is, the language is Japanese.
・ The "Beautiful Soup" library used for scraping in this article retrieves data such as the title by element name.</p>
<h2>Scraping with Beautiful Soup (preparation for parsing)</h2>
<p>・ Scraping is easy with <strong>BeautifulSoup(decoded web page, "parser")</strong>.
・ A <strong>parser</strong> is a program that analyzes (parses) character strings. Several are available, each with different characteristics, and you specify one of them. Examples include "html.parser", which requires no additional library; "lxml", which is fast; and "xml", which handles XML (both "lxml" and "xml" require the lxml package to be installed).</p>
<pre><code class="language-python">#Import requests to crawl and Beautiful Soup to scrape
from bs4 import BeautifulSoup
import requests

#Fetch the page
response = requests.get("https://www.google.co.jp")
#Parse it: response.text is the decoded page body.
#The page is HTML, so an HTML parser ("lxml") is specified here rather than "xml".
soup = BeautifulSoup(response.text, "lxml")
</code></pre>
<h2>Scraping with BeautifulSoup (extracting necessary data)</h2>
<p>・ The data you need can be extracted from the parsed data produced in the previous section. There are two ways to do this.
・ With the parsed data stored in the variable soup, <strong>soup.find("tag name")</strong> returns only the first element with that tag. Replacing find with <strong>find_all</strong> returns all matching elements as a list.
・ To search by class attribute, add <strong>class_="class name"</strong> as an argument.</p>
<p>・ With the parsed data stored in the variable soup, <strong>soup.select_one("CSS selector")</strong> returns only the first element that matches the selector. Replacing select_one with <strong>select</strong> returns all matching elements as a list.
・ A <strong>CSS selector</strong> is a way of identifying elements using CSS notation. It can also specify an element inside another element: for example, an h1 element directly inside the body element is "body &gt; h1".</p>
<p>・ As a handy trick, Chrome's developer tools let you copy an element or its CSS selector. This lets you locate the data you want visually, without printing and reading through the decoded page.</p>
<pre><code class="language-python">Google_title = soup.find("title")     #&lt;title&gt;Google&lt;/title&gt;
Google_h1 = soup.select("body > h1")  #[] (an empty list, because there is no h1 element directly under body)
</code></pre>
<p>・ Left as is, Google_title is output with the title tag still attached; using <strong>text</strong> returns only the text inside it.</p>
<pre><code class="language-python">print(Google_title.text) #Google
</code></pre>
<h2>Scraping multiple pages</h2>
<p>・ The method so far scrapes only one page at a time. To scrape multiple pages, <strong>collect the URLs of the other pages from the links on the top page (or a similar index page)</strong> and scrape every URL in a loop.
・ The URL of each other page is obtained by joining <strong>the URL of the top page and the href attribute (the link to each page)</strong> of each &lt;a&gt; element.</p>
<pre><code class="language-python">top = "http://scraping.aidemy.net"
r = requests.get(top)
soup = BeautifulSoup(r.text, "lxml")
url_lists = []
#Get the URLs of the other pages from the links
#(First get every a tag, read each href attribute with get, and append it to the top URL to form a full URL.)
urls = soup.find_all("a")
for url in urls:
    url = top + url.get("href")
    url_lists.append(url)
</code></pre>
<p>・ Once you have the URLs of the other pages, scrape them. As mentioned above, scraping is performed for every URL in a loop.
・ The following code scrapes the photo titles (written in h3 tags) from all the pages collected in the previous section and displays them as a single list.</p>
<pre><code class="language-python">photo_lists = []
#Fetch each page collected in the previous section, then scrape the photo titles with Beautiful Soup.
for url in url_lists:
    r2 = requests.get(url)
    soup = BeautifulSoup(r2.text, "lxml")
    photos = soup.find_all("h3")
    #Add each photo title to the list without the h3 tag.
    for photo in photos:
        photo_text = photo.text
        photo_lists.append(photo_text)
print(photo_lists) #['Minim incididunt pariatur', 'Voluptate',...(Abbreviation)]
</code></pre>
<h1>Summary</h1>
<p>・ When scraping a crawled page, first parse it with <strong>BeautifulSoup</strong>.
・ Any data can be extracted from the parsed data. Use <strong>find()</strong> or <strong>select_one()</strong> to extract it.
・ Appending <strong>text</strong> to the extracted data drops the tags and returns only the text content.
・ To scrape multiple pages at once, <strong>extract the links from the top page and join each one to the base URL</strong>; each resulting URL can then be scraped in turn.</p>
<p>That is all for this time.
Thank you for reading this far.</p>
</div> </div>
<div class="footer text-center" style="margin-top: 40px;"> <!-- <p> Licensed under cc by-sa 3.0 with attribution required. </p> --> </div>
</body> </html>