[Python] I made a web scraping script that automatically collects the news titles and URLs from Nikkei Inc.

Reason for publication

・ I've been studying web scraping recently and would like to help others who are studying it too.
・ It's also output practice for my own learning.
・ I wrote this less than a month after I first started touching web scraping and Selenium, so please treat it as a rough guide to what someone who has just started can usually achieve.
・ I can't write super-detailed explanations; I only understand things roughly myself, and I hope readers will keep that in mind.
・ I checked robots.txt and judged that there seems to be no problem with scraping this page, so I went ahead and implemented it.

Code created

newspaper.py


import requests
from bs4 import BeautifulSoup
import pandas as pd

# Lists to store the scraped titles and URLs
elements_title = []
elements_url = []

# Web scraping process
def get_nikkei_news():
    url = "https://www.nikkei.com/"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")

    # Extract the titles
    title_list = soup.select(".k-card__block-link")
    for title in title_list:
        elements_title.append(title.text)

    # Extract the URLs; some hrefs are relative, so prepend the domain
    url_list = soup.select(".k-card__block-link")
    for link in url_list:
        href = link.get("href")
        if href is None:
            continue
        if not href.startswith("http"):
            href = "https://www.nikkei.com" + href
        elements_url.append(href)

# Call the web scraping process
get_nikkei_news()

# Build a DataFrame with pandas and write it out as CSV
df = pd.DataFrame({"news_title": elements_title,
                   "news_url": elements_url})
print(df)
df.to_csv("nikkei_test_select.csv", index=False, encoding="utf-8")

Output result (console only)

              news_title                                           news_url
0 FY2008 growth rate, minus 4%Mid-Taiwan Government Outlook https://www.nikkei.com/article/DGXMZO62026930Z...
1 US-Australia 2 plus 2, "serious concern" on China's power line https://www.nikkei.com/article/DGXMZO62026150Z...
2 dangerous "Xi Jinping politics" all negative, of the Soviet Union, which invites collision in the US spell https://www.nikkei.com/article/DGXMZO61984420Y...
3 Nuclear fuel reprocessing plant passed safety examination; operation started in FY2009 https://www.nikkei.com/article/DGXMZO62026760Z...
4 Suspended study abroad at Corona, suddenly entered job hunting. Trial for returning students https://www.nikkei.com/article/DGXMZO61953440X...
..                         ...                                                ...
70 Small rocket, small and medium-sized, etc. aiming to launch in the air https://www.nikkei.com/article/DGXMZO61958380X...
71 Marunaka preferentially hires technical intern trainees in their home country Aeon https://www.nikkei.com/article/DGXMZO62005790Y...
72 Strengthening border measures, aiming to resume international flights Naha Airport Building President https://www.nikkei.com/article/DGXMZO62017470Y...
73 Kanagawa Bank's President Kondo reforms loan screening for the first time https://www.nikkei.com/article/DGXMZO61933170X...
74 Toriten: Enjoy the taste of Oita's hospitality and eat while walking https://www.nikkei.com/article/DGXMZO56989060Z...

Overview explanation

・ This is a web scraping script that takes the titles and URLs of all the news articles and ads displayed on the top page of Nikkei (Nihon Keizai Shimbun, https://www.nikkei.com/) and writes them out in CSV format. I think it's a fairly basic (probably) example of web scraping with BeautifulSoup and requests.

・ I prepare empty lists to store the news titles and URLs, and the scraping function fills them in. Extracting the titles went smoothly, but some of the URLs were missing the protocol part ("https://") at the beginning, so I added a conditional branch that prepends "https://www.nikkei.com" to those. (Some of the URLs might still come out strange. I didn't spot any when checking visually, but if there are any, I'll fix them (-_-;))

・ The result is printed with print() and also written out in CSV format. However, since the file is encoded in UTF-8, the characters get garbled when it is opened directly on a Windows PC. I check the contents with Atom, Google Drive, Cloud9, and the like, so please keep that in mind if you copy this code and check the output <m(__)m>
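If the garbling on Windows is a problem, one common workaround (my addition, not something the original script does) is to write the CSV with a UTF-8 byte-order mark via `encoding="utf-8-sig"`, which lets Excel detect the encoding. The file name here is made up for illustration:

```python
import pandas as pd

# Sample data standing in for the scraped results
df = pd.DataFrame({"news_title": ["sample title"],
                   "news_url": ["https://www.nikkei.com/article/XXXX"]})

# "utf-8-sig" prepends a BOM so Excel on Windows recognizes the file as UTF-8
df.to_csv("nikkei_test_select_excel.csv", index=False, encoding="utf-8-sig")
```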

Code explanation (roughly)

The following part of the scraping function is one I write out every time, almost like an incantation. You paste the URL of the page you want to scrape into url, fetch the HTML from that URL, and hand it to BeautifulSoup so that you can analyze its structure.

parts.py


    url = "https://www.nikkei.com/"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html,"html.parser")
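To see what BeautifulSoup does with the parsed page, here is a self-contained sketch using a tiny made-up HTML fragment (the class name matches the one used above, but the HTML itself is invented for illustration):

```python
from bs4 import BeautifulSoup

# A made-up fragment imitating the structure of a news card link
html = """
<a class="k-card__block-link" href="/article/AAA">Headline one</a>
<a class="k-card__block-link" href="https://www.nikkei.com/article/BBB">Headline two</a>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns all matching elements
links = soup.select(".k-card__block-link")
print([a.text for a in links])         # the visible titles
print([a.get("href") for a in links])  # the href attributes
```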

The code below extracts the titles and stores them in the list.

parts.py


    title_list = soup.select(".k-card__block-link")
    for title in title_list:
        elements_title.append(title.text)

The following code extracts the URLs. However, some of the extracted hrefs were incomplete (missing the protocol and domain), so an if statement branches on that case and fills in the missing part before the URL is stored in the list.

parts.py


    url_list = soup.select(".k-card__block-link")
    for link in url_list:
        href = link.get("href")
        if href is None:
            continue
        if not href.startswith("http"):
            href = "https://www.nikkei.com" + href
        elements_url.append(href)
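As an aside, the standard library's urllib.parse.urljoin can do this join for you, handling relative and absolute hrefs uniformly. This is an alternative sketch with made-up example hrefs, not what the code above uses:

```python
from urllib.parse import urljoin

base = "https://www.nikkei.com/"
# Hypothetical hrefs: one relative, one already absolute
hrefs = ["/article/AAA", "https://www.nikkei.com/article/BBB"]

# urljoin resolves relative paths against base and leaves absolute URLs alone
urls = [urljoin(base, h) for h in hrefs]
print(urls)
```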

The following uses the pandas module to shape the data and write it out in CSV format. I build a dictionary whose keys become the column names and whose values become the contents of each column, and pass it to DataFrame. The data comes out neatly in CSV format. Very convenient, pandas.

parts.py


df = pd.DataFrame({"news_title": elements_title,
                   "news_url": elements_url})
print(df)
df.to_csv("nikkei_test_select.csv", index=False, encoding="utf-8")
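The dict-to-DataFrame mapping can be checked in isolation; this is a minimal sketch with placeholder data (not the scraped results):

```python
import pandas as pd

# Keys become the column names; values become each column's contents
df = pd.DataFrame({"news_title": ["A", "B"],
                   "news_url": ["https://www.nikkei.com/article/AAA",
                                "https://www.nikkei.com/article/BBB"]})
print(list(df.columns))  # ['news_title', 'news_url']
print(len(df))           # 2 rows
```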
