・ I have been studying web scraping recently and would like to help others who are studying it too.
・ This post is also an output exercise for my own learning.
・ It is written by someone less than a month into working with web scraping and Selenium, so please take it as a rough guide to what a beginner can typically manage.
・ I cannot give super-detailed explanations; I only understand things roughly myself, so I hope readers will bear that in mind.
・ I checked robots.txt and judged that there seems to be no problem, so I am running this code (a sketch of such a check is shown below).
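For reference, a robots.txt check can also be done in code with the standard library's urllib.robotparser. This is only a minimal sketch of the idea, not something the original workflow uses:

from urllib.robotparser import RobotFileParser

# Parse the site's robots.txt and ask whether the top page may be fetched
rp = RobotFileParser()
rp.set_url("https://www.nikkei.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.nikkei.com/"))  # True if crawling this path is allowed for any user agent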
newspaper.py
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create the lists needed for web scraping
elements_title = []
elements_url = []

# Web scraping process
def get_nikkei_news():
    url = "https://www.nikkei.com/"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")

    # Title processing
    title_list = soup.select(".k-card__block-link")
    for title in title_list:
        elements_title.append(title.text)

    # URL processing
    url_list = soup.select(".k-card__block-link")
    for i in url_list:
        urls = i.get("href")
        if "http" not in urls:
            urls = "https://www.nikkei.com" + urls
            elements_url.append(urls)
        else:
            elements_url.append(urls)

# Call the web scraping process
get_nikkei_news()

# pandas processing
df = pd.DataFrame({"news_title": elements_title,
                   "news_url": elements_url})
print(df)
df.to_csv("nikkei_test_select.csv", index=False, encoding="utf-8")
news_title news_url
0 FY2008 growth rate, minus 4%Mid-Taiwan Government Outlook https://www.nikkei.com/article/DGXMZO62026930Z...
1 US-Australia 2 plus 2, "serious concern" on China's power line https://www.nikkei.com/article/DGXMZO62026150Z...
2 dangerous "Xi Jinping politics" all negative, of the Soviet Union, which invites collision in the US spell https://www.nikkei.com/article/DGXMZO61984420Y...
3 Nuclear fuel reprocessing plant passed safety examination; operation started in FY2009 https://www.nikkei.com/article/DGXMZO62026760Z...
4 Suspended study abroad at Corona, suddenly entered job hunting. Trial for returning students https://www.nikkei.com/article/DGXMZO61953440X...
.. ... ...
70 Small rocket, small and medium-sized, etc. aiming to launch in the air https://www.nikkei.com/article/DGXMZO61958380X...
71 Marunaka preferentially hires technical intern trainees in their home country Aeon https://www.nikkei.com/article/DGXMZO62005790Y...
72 Strengthening border measures, aiming to resume international flights Naha Airport Building President https://www.nikkei.com/article/DGXMZO62017470Y...
73 Kanagawa Bank's President Kondo reforms loan screening for the first time https://www.nikkei.com/article/DGXMZO61933170X...
74 Toriten: Enjoy the taste of Oita's hospitality and eat while walking https://www.nikkei.com/article/DGXMZO56989060Z...
・ This is web scraping code that takes the titles and URLs of all the news items and advertisements displayed on the top page of the Nihon Keizai Shimbun (https://www.nikkei.com/) and writes them out in CSV format. I think it is fairly basic (probably) web scraping code using BeautifulSoup and requests.
- Empty lists are prepared to hold the news titles and URLs, and the function fills them with the extracted data. Extracting the titles was straightforward, but some of the URLs were missing the protocol part ("https://") at the beginning, so a conditional branch prepends "https://www.nikkei.com" to those. (Some of the URLs might still come out looking strange; nothing looked wrong when I checked visually, but if anything turns up I'll fix it.)
- The result is printed with print() and also written out in CSV format. However, because it is written in UTF-8, the characters look garbled when the file is opened on a Windows PC. I check the contents with Atom, Google Drive, Cloud9, and so on, so if you copy this code and want to check the output yourself, please keep that in mind (a workaround is sketched below).
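If you want the CSV to open cleanly in Excel on Windows, one common workaround (not part of the code above) is to write it with a byte order mark by passing the "utf-8-sig" encoding to to_csv:

df.to_csv("nikkei_test_select.csv", index=False, encoding="utf-8-sig")  # the BOM lets Excel detect UTF-8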
The following part of the function that does the web scraping is close to boilerplate, so I type it out by hand each time until it sticks. You paste the URL of the page you want to scrape into url, fetch the HTML from that URL, and pass it to BeautifulSoup so that the structure can be analyzed.
parts.py
url = "https://www.nikkei.com/"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
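One small addition that is often useful (not in the original snippet) is to fail fast when the request does not succeed; requests provides raise_for_status() for that, and a timeout avoids hanging forever. The timeout value below is just an assumed example:

response = requests.get(url, timeout=10)  # assumed timeout, not in the original code
response.raise_for_status()               # raises an exception on 4xx/5xx responses
html = response.text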
The code below extracts the titles and stores them in the list.
parts.py
title_list = soup.select(".k-card__block-link")
for title in title_list:
    elements_title.append(title.text)
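If the extracted titles ever contain extra leading or trailing whitespace, BeautifulSoup's get_text(strip=True) can be used in place of .text; this is just a small variant, not what the original code does:

for title in title_list:
    elements_title.append(title.get_text(strip=True))  # same as .text but with surrounding whitespace trimmed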
The following code extracts the URLs. Simply taking the href as-is was not always enough, because some links are relative, so an if statement checks for that case and fills in the missing part before the URL is stored in the list.
parts.py
url_list = soup.select(".k-card__block-link")
for i in url_list:
    urls = i.get("href")
    if "http" not in urls:
        urls = "https://www.nikkei.com" + urls
        elements_url.append(urls)
    else:
        elements_url.append(urls)
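As a side note, the standard library's urllib.parse.urljoin can do the same kind of completion without the manual string check; this is only an alternative sketch, not what the code above uses:

from urllib.parse import urljoin

for i in url_list:
    href = i.get("href")
    # urljoin resolves relative links against the base URL and leaves absolute ones untouched
    elements_url.append(urljoin("https://www.nikkei.com/", href))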
The last part uses a module called pandas to turn the data into CSV and write it out. In pandas you pass a dictionary, with the column names as keys and each column's elements as values, and the data comes out neatly in CSV format. pandas is convenient.
parts.py
df = pd.DataFrame({"news_title": elements_title,
                   "news_url": elements_url})
print(df)
df.to_csv("nikkei_test_select.csv", index=False, encoding="utf-8")