Text mining with Python: Scraping

I work with machine learning and related topics, and I somehow became interested in text mining, so I'm posting this as a memo of what I tried. This post covers scraping. The eventual goal is to analyze the scraped text, and I'll post about that as each part is finished.

Execution environment

When I thought about scraping with Python, the first things that came to mind were requests and [BeautifulSoup](http://kondou.com/BS4/), so I'm going to use that combination this time.

By the way, I'm usually a JavaScript developer (JSer), so I often use puppeteer for scraping. Anyway, let's actually start scraping.

First check if it works

Of course, you can write Python code directly in a .py file, but Jupyter Notebook has conveniences such as making the output easy to see, so I recommend using it for testing. This time I will test with Google Colaboratory, provided by Google. You don't need to install the libraries yourself; all you need to run it is a Google account.

** Open Colaboratory and create a new notebook ** Open Colaboratory in your web browser and create a new Python 3 notebook from File > New notebook.

** Import library **


import requests
from bs4 import BeautifulSoup

** Specify URL ** This time I will scrape the "Latest 10-line news" headline articles of Archi Future Web, a portal site for architecture x computation, which is the field I work in. (ArchiFuture)

Specify url

url = "http://www.archifuture-web.jp/headline/457.html"

** Visit page using requests ** Let's see if we can actually access the page.

Visit page

res = requests.get(url)

If you run this and then display res, you should see

<Response [200]>

A status code of 200 means the request succeeded. For the meaning of other HTTP response codes, see a reference on HTTP status codes.
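To look up what a numeric status code means without leaving Python, the standard library's `http.HTTPStatus` can be used. This is only a small sketch. `describe_status` is a hypothetical helper name, not part of requests. With requests, the numeric code itself is available as `res.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return f"{code} (unknown status)"


# Print a few common codes; with requests you would pass res.status_code
for code in (200, 404, 500):
    print(describe_status(code))
```

If you would rather stop immediately on an error response, requests also provides `res.raise_for_status()`, which raises an exception for 4xx and 5xx codes.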

If you want to see the contents of the page, display res.text:

res.text



'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja" dir="ltr" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta http-equiv="Content-Style-Type" content="text/css" />\n<meta http-equiv="Content-Script-Type" content="text/javascript" />\n<!-- [if lt IE]><meta http-equiv="imagetoolbar" content="no" /><![endif] -->\n<title>"Archi Future 2019" is a great success with the highest number of visitors in history | Headline | Architecture x Computation Portal Site\u3000Archi Future Web</title>\n<meta name="Description" content="Architecture x Computation portal site "Archi Future Web"." />\n<meta name="keywords" content="Architecture,Computation,Archi Future">\n<meta property="og:site_name" content="Architecture × Computationのポータルサイト\u3000Archi Future Web">\n<meta property="og:title" content=""Archi Future 2019" is a great success with the highest number of visitors ever">\n<meta property="og:type" content="article">\n<meta property="og:description" content="The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in. On the day of the event, the number of visitors was high despite the unfortunate weather of heavy rain and wind....">\n<meta property="og:url" content="http://www.archifuture-web.jp/headline/457.html" />\n<meta property="og:image" content="http://www.archifuture-web.jp/headline/img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " />\n<meta property="og:image:width"  content="700" />\n<meta property="og:image:height" content="467" />\n<meta property="og:locale" content="ja_JP"> <div class="page-title">
<p><img src="img/icon_new.gif" width="112" height="20" alt="Latest 10 lines news"/></p>
<h2>"Archi Future 2019" has the highest number of visitors ever<br />\r\n Collect and hold successfully</h2>
</div>
<p class="page-data">2019.10.28</p>

<p>The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.<br />\r\n On the day of the event, despite the unfortunate weather of heavy rain and wind<span style="font-size:12px;">、</span>The number of visitors is 5 compared to the previous time.4% increase<br />\r\n5,With 509 people, it was a great event to attract the highest number of visitors in history.<br />\r\n Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors<br />\r\Panel design by 5 people in n-long class<span style="font-size:12px;">I</span>Ska<span style="font-size:12px;">Tsu</span>Shi<span style="font-size:12px;">Yo</span>Is<span style="font-size:12px;">、</span>The venue expanded to 600 seats is full<span style="font-size:12px;">、</span>Panel design<span style="font-size:12px;">I</span><br />\r\nSkaTsuShiYoIs席をさらに100席増設するほどの盛況ぶりだ<span style="font-size:12px;">Tsu</span>Ta. Return to article list</a></li>
</ul>
</div> banner\u3000========-->\n\r\n<div id="premiumbanner" class="al_center clr">\r\n<a href="http://www.archifuture.jp/2019/" class="premiumbanner-left banner" target="_blank" id="premium-24"><img src="../img_banner/premium/img/8/2/8280d8bb17d4c09e872441a1ba21eae0.png " width="270" height="180" alt="Archi Future 2019"/></a>\r\n<a href="http://www.archifuture.jp/2019/" class="premiumbanner-right banner" target="_blank" id="premium-25"><img src="../img_banner/premium/img/6/a/6ad8af52988f0560acbb6c08377d79f3.png " width="270" height="180" alt="Archi Future 2019"/></a>\r\n</div>\r\n\n\n<!--========\u3000 Rectangle Super Banner\u3000========-->\n\r\n<p id="superbanner" class="al_center"><a href="http://www.archifuture.jp/2019/" class="banner" id="super-14"><img src="../img_banner/super/img/9/9/99ae81e84701bf687561a0ca026bdef0.png " width="600" height="90" alt="Archi Future 2019"/></a></p>\r\n\n\n<!-- /#mainContent --></div>\n\n<div id="sidebar">\n<!--========\u3000 advertising banner\u3000========-->\n\r\n<ul id="banner" class="clr">\r\n<li><a href="https://www.cradle.co.jp/" target="_blank" class="banner" id="default-5"><img src="../img_banner/default/img/2/f/2f1b60f601b0f99e6094e32d7fd0b26d.gif" width="270" height="80" alt="Software cradle"/></a></li>\r\n<li><a href="https://product.metamoji.com/gemba/eyacho/" target="_blank" class="banner" id="default-24"><img src="../img_banner/default/img/2/8/280e8426c1fb78ee0e67b2d009d7c9d2.gif" width="270" height="80" alt="MetaMoJi"/></a></li>\r\n<li><a href="https://www.izumi-soft.jp/product-category/bim-%E7%A9%BA%E8%AA%BF%E8%A8%AD%E5%82%99%E8%A8%AD%E8%A8%88/" target="_blank" class="banner" id="default-16"><img src="../img_banner/default/img/1/8/18ef602ddf1e9f1e3c5f00a7674725a2.gif" width="270" height="80" alt="イズミShiステム設計様"/></a></li>\r\n<li><a href="http://www.nyk-systems.co.jp/" target="_blank" class="banner" id="default-6"><img src="../img_banner/default/img/3/b/3b747d65472ce7be37b8235fc703432d.gif" width="270" height="80" 
alt="NYKShiステムズ様"/></a></li>\r\n<li><a href="http://www.pivot.co.jp/" target="_blank" class="banner" id="default-12"><img src="../img_banner/default/img/6/d/6d10409aeb0b2d23bd73b9ccc70cc08d.gif" width="270" height="80" alt="ArchitectureピボTsuト様"/></a></li>\r\n<li><a href="http://www.applicraft.com/" target="_blank" class="banner" id="default-20"><img src="../img_banner/default/img/9/0/90cc824aac1eda2ba2c37046e55dd79c.gif" width="270" height="80" alt="Appcraft"/></a></li>\r\n<li><a href="http://bit.ly/2Bw8tEc" target="_blank" class="banner" id="default-3"><img src="../img_banner/default/img/7/d/7dbe65f17a1bf153277ba5b466580556.jpg " width="270" height="80" alt="グラフIソフトジャパン様"/></a></li>\r\n<li><a href="https://autode.sk/2TXDSqE" target="_blank" class="banner" id="default-11"><img src="../img_banner/default/img/e/c/ecb06a6b95c9e79935b6a7df88384ab3.jpg " width="270" height="80" alt="Autodesk"/></a></li>\r\n<li><a href="https://licensecounter.jp/aec-collection-bim/" target="_blank" class="banner" id="default-22"><img src="../img_banner/default/img/1/0/10bcd30f085e6ce4dab4b824c64817a6.gif" width="270" height="80" alt="SB C&Mr. 
S"/></a></li>\r\n<li><a href="https://www.nvidia.com/ja-jp/design-visualization/industries/architecture-engineering-construction/?nvid=nv-int-pcjp12rrdsfrqr-44523" target="_blank" class="banner" id="default-21"><img src="../img_banner/default/img/d/5/d53b3fe7fec2bc10858a26f88556c8fb.jpg " width="270" height="80" alt="エヌビデIア様"/></a></li>\r\n<li><a href=" https://www.aanda.co.jp/Vectorworks2019/index.html?utm_source=af&utm_medium=banner&utm_campaign=bnr_20190921" target="_blank" class="banner" id="default-9"><img src="../img_banner/default/img/7/2/72745a704c3abe2513357559102be116.jpg " width="270" height="80" alt="A & A"/></a></li>\r\n<li><a href="http://j-bim.gloobe.jp/" target="_blank" class="banner" id="default-4"><img src="../img_banner/default/img/a/9/a9f022f44cac2878ac5936fcf4b26175.gif" width="270" height="80" alt="Fukui Computer Architect"/></a></li>\r\n<li><a href="http://www.env-simulation.com" target="_blank" class="banner" id="default-18"><img src="../img_banner/default/img/e/3/e3d7f1694a53705271dc5e751519d0d8.gif" width="270" height="80" alt="環境ShiミュレーShiYoン様"/></a></li>\r\n<li><a href="http://www.f-cadewa.com/" target="_blank" class="banner" id="default-8"><img src="../img_banner/default/img/f/0/f0a5e6fe83e322a2d62a2461855a6c2a.gif" width="270" height="80" alt="富士通四国インフォテTsuク様"/></a></li>\r\n<li><a href="https://www.photoruction.com/?utm_source=afw&utm_medium=banner&utm_campaign=201903" target="_blank" class="banner" id="default-23"><img src="../img_banner/default/img/2/8/28dba36bb5e98ff9a4514ae00e93844b.png " width="270" height="80" alt="フォトラクShiYoン様"/></a></li>\r\n</ul>\r\n<script type="text/javascript">\r\n<!--\r\nvar top_url = \'/\';\r\n//-->\r\n</script>\r\n<script type="text/javascript" src="../common/js/banner_track.js"></script>\r\n\n\n<!-- /#sidebar --></div>\n<!-- /#section --></div>\n<!-- /#content --></div>\n\n<script type="text/javascript">footer(\'../\');</script>\n\n<!-- /#container --></div>\n</body>\n</html>'

However, the HTML comes back as one tightly packed string, and it is hard to tell what is what. So let's use Beautiful Soup to parse the HTML.

** Use Beautiful Soup **

Parse HTML

soup = BeautifulSoup(res.text, 'html.parser')

Let's look at the result by displaying soup.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html dir="ltr" lang="ja" xml:lang="ja" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">


<div class="page-title">
<p><img alt="Latest 10 lines news" height="20" src="img/icon_new.gif" width="112"/></p>
<h2>"Archi Future 2019" has the highest number of visitors ever<br/>
Collected and held successfully</h2>
<p class="page-data">2019.10.28</p>
<p>The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.<br/>
On the day of the event, despite the unfortunate weather of heavy rain and wind<span style="font-size:12px;">、</span>The number of visitors is 5 compared to the previous time.4% increase<br/>
5,With 509 people, it was a great event to attract the highest number of visitors in history.<br/>
Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors<br/>
Panel design by 5 long class students<span style="font-size:12px;">I</span>Ska<span style="font-size:12px;">Tsu</span>Shi<span style="font-size:12px;">Yo</span>Is<span style="font-size:12px;">、</span>The venue expanded to 600 seats is full<span style="font-size:12px;">、</span>Panel design<span style="font-size:12px;">I</span><br/>
Scution is so successful that it will add another 100 seats.<span style="font-size:12px;">Tsu</span>Ta. Which course is the lecture / seminar?<br/>
Is almost full<span style="font-size:12px;">、</span>The exhibition hall is also visited by a large number of visitors<span style="font-size:12px;">、</span>The whole venue was very lively and a great success<br/>
It was held. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which<br/>
Sessie<span style="font-size:12px;">Yo</span>It was a fulfilling content that made me feel a new direction of architecture and a bright future.<br/>
The report of Archi Future 2019 will be introduced on this site in the future.<br/>
<a href="http://www.archifuture.jp/2019/" target="_blank"><p class="image al_center"><img alt="Top page of "Archi Future 2019" official site" height="400" src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " width="600"/></p><p class="caption">Top page of "Archi Future 2019" official site</p></a></p>



The HTML is now parsed and displayed in a readable form.
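To try the parsing step without hitting the real site, a tiny inline document is enough. The sketch below assumes a made-up `sample_html` string (not the real page) and shows that the parsed object can be pretty-printed and navigated by tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, loosely modelled on the structure seen above
sample_html = (
    "<html><head><title>Sample</title></head><body>"
    '<div class="page-title"><h2>Headline</h2></div>'
    '<p class="page-data">2019.10.28</p>'
    "</body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

title = soup.title.string               # text of the <title> tag
headline = soup.find("h2").get_text()   # text of the first <h2> tag

print(soup.prettify())  # indented view of the parsed tree
```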

Get the information you want

Now, finally, the main subject. This time we want the content of the 10-line article, so first find out where the article body is. You could find it by comparing the HTML with the text on the page, but on the web it is easier to use the browser's developer tools.


It looks something like this. (Hmm... no id or class is assigned...)

If an id or class were assigned, you could easily get the element by specifying it with a CSS selector, but there is none this time, so I will get all the p tags, one of which contains the article.
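For comparison, here is a minimal sketch of what selection by class, by id, and by tag name looks like with Beautiful Soup's CSS selector methods. The `sample_html` string is a made-up fragment that only borrows the class and id names visible in the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one p with a class, one plain p, one p with an id
sample_html = """
<div class="page-title"><h2>Headline</h2></div>
<p class="page-data">2019.10.28</p>
<p>Article body without id or class</p>
<p id="superbanner" class="al_center">banner</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

by_class = soup.select_one(".page-data").get_text()   # class selector
by_id = soup.select_one("#superbanner").get_text()    # id selector
all_p = [p.get_text() for p in soup.select("p")]      # every p tag, in document order

print(by_class, by_id, all_p)
```

The plain paragraph can only be reached through the tag-name selector, which is why the article body has to be picked out of the list of all p tags.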

Get p tag

p_tags = soup.select('p')

Acquisition result of p tag

[<p><img alt="Latest 10 lines news" height="20" src="img/icon_new.gif" width="112"/></p>,
 <p class="page-data">2019.10.28</p>,
 <p>The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.<br/>
On the day of the event, despite the unfortunate weather of heavy rain and wind<span style="font-size:12px;">、</span>The number of visitors is 5 compared to the previous time.4% increase<br/>
 5,With 509 people, it was a great event to attract the highest number of visitors in history.<br/>
Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors<br/>
Panel design by 5 long class students<span style="font-size:12px;">I</span>Ska<span style="font-size:12px;">Tsu</span>Shi<span style="font-size:12px;">Yo</span>Is<span style="font-size:12px;">、</span>The venue expanded to 600 seats is full<span style="font-size:12px;">、</span>Panel design<span style="font-size:12px;">I</span><br/>
Scution is so successful that it will add another 100 seats.<span style="font-size:12px;">Tsu</span>Ta. Which course is the lecture / seminar?<br/>
Is almost full<span style="font-size:12px;">、</span>The exhibition hall is also visited by a large number of visitors<span style="font-size:12px;">、</span>The whole venue was very lively and a great success<br/>
It was held. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which<br/>
Sessie<span style="font-size:12px;">Yo</span>It was a fulfilling content that made me feel a new direction of architecture and a bright future.<br/>
The report of Archi Future 2019 will be introduced on this site in the future.<br/>
 <a href="http://www.archifuture.jp/2019/" target="_blank"><p class="image al_center"><img alt="Top page of "Archi Future 2019" official site" height="400" src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " width="600"/></p><p class="caption">Top page of "Archi Future 2019" official site</p></a></p>,
 <p class="image al_center"><img alt="Top page of "Archi Future 2019" official site" height="400" src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " width="600"/></p>,
 <p class="caption">Top page of "Archi Future 2019" official site</p>,
 <p class="al_center" id="superbanner"><a class="banner" href="http://www.archifuture.jp/2019/" id="super-14"><img alt="Archi Future 2019" height="90" src="../img_banner/super/img/9/9/99ae81e84701bf687561a0ca026bdef0.png " width="600"/></a></p>]

The article body is at index 2 (counting from 0) in the list of p tags, so we will extract the text of that element.

Get articles

article = p_tags[2].get_text()

Article acquisition result

'The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.\r\n On the day of the event, the number of visitors was 5 compared to the previous time, despite the unfortunate weather of heavy rain and wind..4% increase\r\n5,With 509 people, it was a great event to attract the highest number of visitors in history.\r\n Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors\r\In the panel discussion by 5 people in the n-long class, the venue expanded to 600 seats was full and the panel day\r\n Scushion was so successful that it added 100 more seats. Which course is the lecture / seminar?\r\n was almost full, the exhibition hall was visited by a large number of visitors, and the entire venue was very lively, a great success.\r\It was held n. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which\r\The n-session was also a fulfilling content that made us feel a new direction of architecture and a bright future.\r\The report of nArchi Future 2019 will be introduced on this site in the future.\n\n\u3000 "Archi Future 2019" official site top page'
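A fixed index only works while every page has the same layout. As one possible alternative (an assumption on my part, not something verified against the whole site), the article body could be guessed as the p tag with the most text. `longest_paragraph` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def longest_paragraph(soup):
    """Return the <p> element with the longest text (a heuristic, not a guarantee)."""
    # max() raises ValueError if the page has no <p> tags at all
    return max(soup.select("p"), key=lambda p: len(p.get_text()))


# Hypothetical fragment with the same kinds of p tags as the real page
sample_html = (
    '<p><img alt="icon"/></p>'
    '<p class="page-data">2019.10.28</p>'
    "<p>This is the long article body.<br/>Second line.</p>"
    '<p class="caption">Caption</p>'
)

soup = BeautifulSoup(sample_html, "html.parser")
body = longest_paragraph(soup)
print(body.get_text())
```

This heuristic would pick the wrong element on a page where a caption or banner text is longer than the article, so it is worth checking the result on a few pages before relying on it.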

We're getting closer. Next, let's remove the unnecessary line-break characters.

Extract only the text of the article

lines = [line.strip() for line in article.splitlines()]  # Split into lines and strip surrounding whitespace
ten_lines_news = lines[0:10]  # Keep the 10 article lines and drop the trailing caption

Contents of 10-line news

['The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.',
 'On the day of the event, the number of visitors was 5 compared to the previous time, despite the unfortunate weather of heavy rain and wind..4% increase',
 '5,With 509 people, it was a great event to attract the highest number of visitors in history.',
 'Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors',
 'In the panel discussion by 5 people in the long class, the venue expanded to 600 seats was full, and the panel day',
 'Scassion was so successful that it added another 100 seats. Which course is the lecture / seminar?',
 'The exhibition hall was almost full, and the exhibition hall was very lively with a large number of visitors.',
 'It was held. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which',
 'The session was also fulfilling and made us feel a new direction of architecture and a bright future.',
 'The report of Archi Future 2019 will be introduced on this site in the future.']

It worked. The pleasing part is that the list has exactly 10 elements.
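The behaviour of splitlines and strip can be checked on a small string. The sketch below uses a made-up `raw` value containing the same kinds of characters as the scraped text (`\r\n`, an empty line, and a full-width space `\u3000`), and additionally filters out empty lines, which the slice-based approach does not do.

```python
# Hypothetical raw text with Windows line breaks, a blank line and a full-width space
raw = "first line\r\n second line \r\n\n\u3000caption"

# splitlines() handles both \n and \r\n; strip() also removes the full-width space
lines = [line.strip() for line in raw.splitlines()]

# Drop lines that became empty after stripping
non_empty = [line for line in lines if line]

print(lines)
print(non_empty)
```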

Finally, join the lines into a single text.

Combine into one text

ten_lines_news_text = "".join(ten_lines_news)  # Concatenate the 10 lines with no separator

Also collect information around the article

The real thrill of scraping is getting a lot of information at once. But when you do that, as things stand there is nothing to identify which article each piece of acquired text came from.

This time, as a two-pronged approach, I will get the date the article was posted and the number in the article's URL, and use the number as the article ID. As you can see from the output when all the p tags were acquired at once, the date element has a class assigned. Let's use that to get the date.

Get Post Date

date = soup.select('.page-data')[0].string

Acquisition result

'2019.10.28'
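The page-data element holds the date as text in the form seen in the page source (`<p class="page-data">2019.10.28</p>`). If a real date value is wanted later, for sorting or filtering, it can be converted with the standard library. A small sketch; `parse_post_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_post_date(raw: str) -> date:
    """Convert a 'YYYY.MM.DD' string such as '2019.10.28' into a date object."""
    return datetime.strptime(raw.strip(), "%Y.%m.%d").date()


post_date = parse_post_date("2019.10.28")
print(post_date.isoformat())  # ISO format sorts correctly even as plain text
```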


The rest is the article ID. Use the file name part of the article page URL "http://www.archifuture-web.jp/headline/457.html". (This time, to save trouble, let's just write the ID directly.)

Article ID

id = 457
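Instead of writing the number by hand, the ID can be taken from the file name part of the URL using only the standard library. This is a sketch that assumes every article URL ends in `<number>.html`, as the one above does. `article_id_from_url` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def article_id_from_url(url: str) -> int:
    """Return the numeric file name of an article URL, e.g. 457 for .../457.html."""
    path = urlparse(url).path               # "/headline/457.html"
    return int(PurePosixPath(path).stem)    # stem is "457"; int() fails if it is not numeric


article_id = article_id_from_url("http://www.archifuture-web.jp/headline/457.html")
print(article_id)
```

Using a name like `article_id` also avoids shadowing Python's built-in `id` function.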

Turn the processing into a function

Let's turn the process created so far into a function.

- Access the page
- Parse the HTML
- Get the p tags
- Get the article
- Extract only the text of the article
- Combine into one text

These steps are combined into one function that takes a URL and returns the text of the 10-line article.

Turn the processing into a function

def get_article(url):
    res = requests.get(url)  # Access the page
    soup = BeautifulSoup(res.text, 'html.parser')  # Parse the HTML

    # Get the article
    p_tags = soup.select('p')
    article = p_tags[2].get_text()
    lines = [line.strip() for line in article.splitlines()]  # Split into lines and strip whitespace
    ten_lines_news = lines[0:10]  # Keep only the 10 article lines
    # Store in one text
    ten_lines_news_text = "".join(ten_lines_news)

    date = soup.select('.page-data')[0].string  # Post date (not returned yet)
    id = 457  # Article ID (hard-coded for now, not returned yet)

    return ten_lines_news_text

Check that this also runs as a Python script file (.py). If it does, write the following steps in the script file.
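One way to make the function easier to check in a script is to separate the parsing from the network access, so the parsing part can be run on a fixed HTML string. The following is a sketch under that assumption, not the version used in this article. `parse_article`, its `article_id` parameter, and `sample_html` are hypothetical.

```python
from bs4 import BeautifulSoup


def parse_article(html: str, article_id: int) -> dict:
    """Extract id, date and joined article text from one article page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select("p")[2].get_text()  # same fixed index as in the steps above
    lines = [line.strip() for line in body.splitlines()]
    text = "".join(lines[0:10])            # keep at most the first 10 lines
    post_date = str(soup.select(".page-data")[0].string)
    return {"id": article_id, "date": post_date, "text": text}


# Hypothetical miniature page with the article at p index 2
sample_html = (
    '<p><img alt="icon"/></p>'
    '<p class="page-data">2019.10.28</p>'
    "<p>line one<br/>\r\nline two</p>"
)

record = parse_article(sample_html, 457)
print(record)
```

With requests, the same function could be called as `parse_article(requests.get(url).text, 457)`.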

Try saving to a CSV file

It would be a waste to fetch the information and then not keep it, so let's write it to a CSV file. There are several libraries that handle CSV data in Python, but this time I will use pandas. It's a library I personally like because it's very useful when working with row and column data.

First, let's prepare the CSV file to write to (this time I created it with the name txt_data.csv). ~~articles.csv would have been a better name...~~



Read and write files using pandas.

Write to csv

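import pandas as pd

csv_file = 'csv/txt_data.csv'
df = pd.read_csv(csv_file)  # The CSV file prepared above
text = get_article(url)  # Assumed: the original `value` is the text returned by get_article
results = pd.DataFrame([[id, date, text]], columns=['id', 'date', 'text'])  # One row, three columns
df = pd.concat([df, results], ignore_index=True)
df.to_csv(csv_file, index=False)
print("success writing to %s" % csv_file)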
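The snippet above expects the CSV file to exist already. As a variation, the sketch below creates the file on the first write and appends on later writes. `append_record` is a hypothetical helper, and the temporary directory is used only so the example can be run without touching real data.

```python
import os
import tempfile

import pandas as pd


def append_record(csv_file: str, record: dict) -> pd.DataFrame:
    """Append one {'id', 'date', 'text'} record to csv_file, creating the file if needed."""
    columns = ["id", "date", "text"]
    new_row = pd.DataFrame([record], columns=columns)
    if os.path.exists(csv_file):
        df = pd.concat([pd.read_csv(csv_file), new_row], ignore_index=True)
    else:
        df = new_row
    df.to_csv(csv_file, index=False)
    return df


# Demonstration in a temporary directory (hypothetical records)
demo_csv = os.path.join(tempfile.mkdtemp(), "txt_data.csv")
append_record(demo_csv, {"id": 457, "date": "2019.10.28", "text": "first article"})
result = append_record(demo_csv, {"id": 458, "date": "2019.11.01", "text": "second article"})
print(result)
```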

CSV data after writing

457,2019.10.28,The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in. On the day of the event, the number of visitors was 5 compared to the previous time, despite the unfortunate weather of heavy rain and wind..4% increase 5,With 509 people, it was a great event to attract the highest number of visitors in history. Diller Scofidio, a well-known US design firm+The keynote speech by Renfro and the panel discussion by the current location manager class of five major general contractors were full, and the panel discussion was so successful that the number of seats was increased by 100. All of the lectures and seminars were almost full, and the exhibition hall was visited by a large number of visitors, and the entire venue was very lively and was a great success. Every session, including the special dialogue 1 between Mr. Okada and Mr. Yamanashi and the special dialogue 2 between Mr. Toyota and Mr. Matsushima, was fulfilling and made us feel a new direction of architecture and a bright future. The report of Archi Future 2019 will be introduced on this site in the future.

Now the acquired information can be saved as CSV data. Next time, I will explain how to get all the articles posted so far and save their text.


Qiita: A general-purpose method for extracting only characters by scraping Python

