[Python] I saved the scraped data to CSV!

Introduction

I recently learned about web scraping and put it into practice. This time, I searched for a keyword on "CiNii Articles - Search for Japanese Articles - National Institute of Informatics", retrieved the title, authors, and publication media of every paper that matched, and saved them to CSV. It was good practice for learning scraping, so I wrote this article. I hope it is useful for others who are learning scraping!

Code

Below is the code I wrote myself. The explanation is included as comments in the code, so please take a look. Also, I think your understanding will deepen if you actually visit the "CiNii Articles - Search for Japanese Articles - National Institute of Informatics" site and follow along while inspecting the HTML structure with Chrome's developer tools. I saved this code as "search.py".
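To get a feel for the page before reading the full script, a minimal fetch like the following sketch can help. The keyword test is just a placeholder; the h1.heading selector is the same one the script below relies on, and this simply prints the raw heading text that contains the result count:

import requests
from bs4 import BeautifulSoup

# Fetch one search page and print the heading that holds the result count.
res = requests.get('https://ci.nii.ac.jp/search?q=test&count=200')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.find_all('h1', {'class': 'heading'})[0].text)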

import sys
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re


def main():
    url = 'https://ci.nii.ac.jp/search?q={}&count=200'.format(sys.argv[1])
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "html.parser")

    # Check the number of search results.
    # The heading's text contains data like '\n Search results\n\t\n\t0\n\t'.
    search_count_result = soup.find_all("h1", {"class": "heading"})[0].text
    # Extract the number of results with a regular expression.
    pattern = '[0-9]+'
    result = re.search(pattern, search_count_result)

    # If there are no search results, stop here.
    search_count = int(result.group())
    if search_count == 0:
        print('There are no search results.')
        return

    print('Found {} search results.'.format(search_count))

    # Create a directory to store the data.
    try:
        os.makedirs(sys.argv[1])
        print("Created a new directory.")
    except FileExistsError:
        print("The directory already exists.")

    # Work out how many requests are needed to fetch all the search results.
    # Results are displayed 200 at a time, so this is a ceiling division by 200.
    if search_count // 200 == 0:
        times = 1
    elif search_count % 200 == 0:
        times = search_count // 200
    else:
        times = search_count // 200 + 1

    
    # Collect the titles, authors, and publication media in one pass.
    title_list = []
    author_list = []
    media_list = []

    # Translation table for stripping newline and tab characters.
    escape = str.maketrans({"\n": '', "\t": ''})
    for time in range(times):
        
        # Build the URL for this page of results.
        count = 1 + 200 * time
        # q={} is the search keyword; count=200 shows 200 results per page,
        # and start={} is the index of the first result to display.
        url = 'https://ci.nii.ac.jp/search?q={}&count=200&start={}'.format(sys.argv[1], count)
        print(url)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, "html.parser")


        # Loop over each paper entry on the page.
        for paper in soup.find_all("dl", {"class": "paper_class"}):
            
            # Get the title.
            title_list.append(paper.a.text.translate(escape))
            # Get the author(s).
            author_list.append(paper.find('p', {'class': "item_subData item_authordata"}).text.translate(escape))
            # Get the publication media.
            media_list.append(paper.find('p', {'class': "item_extraData item_journaldata"}).text.translate(escape))
    
    # Put the results in a DataFrame and save it as CSV.
    journal = pd.DataFrame({"Title": title_list, "Author": author_list, "Media": media_list})

    # Specify utf_8_sig encoding to prevent garbled characters (e.g., when opened in Excel).
    journal.to_csv(os.path.join(sys.argv[1], sys.argv[1] + '.csv'), encoding='utf_8_sig')
    print('Created the CSV file.')
    print(journal.head())


if __name__ == '__main__':
    # Run main() only when this module is executed directly.
    main()
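As an aside, the if/elif/else block above that computes times is a hand-rolled ceiling division. Assuming search_count is at least 1 (which the early return guarantees), the same value can be computed in one line:

import math

# Number of 200-result pages needed to cover all hits (ceiling division).
times = math.ceil(search_count / 200)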

Running it

Now let's actually run it. First, type the following into the terminal. This time I used machine learning as the search keyword; replace it with whatever keyword you want to search for. Since the script reads the keyword from sys.argv[1], a multi-word keyword needs to be wrapped in quotes.

python search.py "machine learning"

If all goes well, the terminal output looks like this: (screenshot: 2020-10-27 (3).png)

The contents of the CSV look like this: (screenshot: 2020-10-27 (4).png)
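If you would rather check the saved file from Python than open it in a spreadsheet, a quick read-back like this works (the path is hypothetical; it assumes the script was run with the keyword machine_learning):

import pandas as pd

# The directory and file name both come from the search keyword (sys.argv[1]).
df = pd.read_csv('machine_learning/machine_learning.csv', index_col=0, encoding='utf_8_sig')
print(df.head())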

Finally

How was that? I only learned scraping about three days ago, and while the code is still rough, it was relatively easy to implement. I still have a lot to study, so I will keep at it.
