[Python] From morphological analysis of CSV data to CSV output and graph display [GiNZA]

Collect CSV data

First, we need some CSV data. After some thought about what to collect, I decided to scrape the lyrics of Yorushika, one of my favorite artists.

Start by installing the modules required for scraping.

pip install requests
pip install bs4
pip install lxml
pip install pandas

Scraping!

I followed this article: https://qiita.com/yuuuusuke1997/items/122ca7597c909e73aad5#%E3%81%8A%E3%82%8F%E3%82%8A%E3%81%AB

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

#Create a table to hold the scraped lyrics
list_df = pd.DataFrame(columns=['lyrics'])

for page in range(10):
    try:
        #Base URL of the site
        base_url = 'https://www.uta-net.com'

        #Song list page for the artist (22653 is the artist ID used in the URL)
        artist = "22653"
        url = 'https://www.uta-net.com/artist/'+artist+'/0/' + str(page) + '/'
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        links = soup.find_all('td', class_='side td1')
        for link in links:
            a = base_url + (link.a.get('href'))

            #Lyrics detail page
            response = requests.get(a)
            soup = BeautifulSoup(response.text, 'lxml')
            song_lyrics = soup.find('div', itemprop='lyrics')
            #Wait 1 second to avoid overloading the server
            time.sleep(1)
            #Skip pages where the lyrics element was not found
            if song_lyrics is None:
                continue
            song_lyric = song_lyrics.text.replace('\n','')

            #Add the acquired lyrics to the table
            #(DataFrame.append was removed in pandas 2.0, so use pd.concat)
            tmp_se = pd.DataFrame([song_lyric], index=list_df.columns).T
            list_df = pd.concat([list_df, tmp_se], ignore_index=True)
    except Exception:
        #Report which page failed, then continue with the next page
        import traceback
        print(page)
        traceback.print_exc()

print(list_df)

#Save as CSV (overwrite, so re-running does not append duplicate rows and headers)
list_df.to_csv('list.csv', mode = 'w', encoding='utf_8_sig')
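
As a side note, here is a minimal sketch of another way to build the CSV: collect the lyrics in a plain Python list and create the DataFrame once at the end. The helper name `save_lyrics`, the temporary output path, and the sample strings are assumptions for illustration, not part of the script above.

```python
import os
import tempfile

import pandas as pd


def save_lyrics(lyrics, path):
    """Write one lyric per row under a single 'lyrics' column (hypothetical helper)."""
    df = pd.DataFrame({"lyrics": list(lyrics)})
    # mode="w" overwrites the file, so re-running never adds a second header row
    df.to_csv(path, mode="w", index=False, encoding="utf_8_sig")
    return df


# Placeholder strings standing in for scraped lyrics
sample_lyrics = ["歌詞サンプル1", "歌詞サンプル2"]
csv_path = os.path.join(tempfile.mkdtemp(), "list.csv")
save_lyrics(sample_lyrics, csv_path)

# Read the file back, matching the encoding used when writing
loaded = pd.read_csv(csv_path, encoding="utf_8_sig")
print(loaded["lyrics"].tolist())
```

Building the table once avoids growing a DataFrame row by row inside the loop, which gets slower as the number of songs increases.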

Install the packages needed for morphological analysis

Next, install GiNZA and the plotting libraries.

pip install "https://github.com/megagonlabs/ginza/releases/download/v1.0.2/ja_ginza_nopn-1.0.2.tgz"
pip install matplotlib
pip install wordcloud

Making matplotlib display Japanese

I followed this article: https://qiita.com/osakasho/items/7408d031ca0b2192422f
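
Before setting a font, it can help to check that matplotlib actually sees it. The following is a rough sketch under my own assumptions: `pick_font` is a hypothetical helper, and the candidate font names are examples that may not exist on your machine.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import font_manager


def pick_font(candidates):
    """Return the first candidate font name that matplotlib knows about, else None."""
    installed = {entry.name for entry in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return None


# Candidate names are assumptions; adjust them to the fonts on your machine
font_name = pick_font(["IPAexGothic", "Yu Gothic", "Noto Sans CJK JP"])
if font_name is not None:
    plt.rcParams["font.family"] = font_name
else:
    print("No Japanese font found; labels may render as empty boxes.")
```

If the function returns None, install a Japanese font first and refresh matplotlib's font cache as described in the linked article.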

Analysis and graph display!

# coding: utf-8
import collections

import matplotlib.pyplot as plt
import pandas as pd
import spacy
from wordcloud import WordCloud

#Load the GiNZA model
nlp = spacy.load('ja_ginza_nopn')

def ginza(word):
    """Return (all words, noun chunks, verb lemmas) extracted from the text."""
    doc = nlp(word)
    noun_ls = [chunk.text for chunk in doc.noun_chunks]
    verb_ls = [token.lemma_ for token in doc if token.pos_ == "VERB"]
    total_ls = noun_ls + verb_ls
    return total_ls, noun_ls, verb_ls


#---------------Read the CSV and set options---------------
csv_read_path = "list.csv"
df = pd.read_csv(csv_read_path)

target_categories = ["lyrics"]  #Columns to analyze
black_list = ["test"]  #Words to exclude from the count
#-----------------------------------------------------------



#---------------Morphological analysis---------------
for target in target_categories:
    total_voc = []#List that collects every extracted word
    for data in df[target]:
        try:
            word_ls, noun_ls, verb_ls = ginza(data)
        except Exception:#If the text cannot be analyzed, treat it as a single word.
            word_ls = [data]
        for w in word_ls:
            if w not in black_list:#Skip words on the blacklist.
                total_voc.append(w)

    print("Total number of words:", len(total_voc))

    #Count how often each word appears
    c = collections.Counter(total_voc)

    #Write the ranking to CSV, most frequent first
    c_data = c.most_common()
    csvdf = pd.DataFrame(c_data, columns=["word", "count"])
    filename = target + ".csv"
    csvdf.to_csv(filename, encoding='utf_8_sig')
    print("----------------------------")

    #Plot a quick bar chart of the top 10 words
    #Use a font that can render Japanese.
    plt.rcParams["font.family"] = "IPAexGothic"
    plt.title(target)
    plt.grid(True)
    graph_x_list = []
    graph_y_list = []
    for key, value in c.most_common(10):
        graph_x_list.append(key)
        graph_y_list.append(value)
    try:
        plt.bar(graph_x_list, graph_y_list)
        #Show the graph
        plt.show()
    except Exception:
        print(target, ": could not draw the data.")

    #Draw a word cloud
    #Path to a Japanese font (this one is for Windows; change it for your OS)
    font = 'C:/Windows/Fonts/YuGothM.ttc'
    wordcloud = WordCloud(background_color="white", width=1000, height=600, font_path=font)

    #Generate the cloud from the collected words, joined by spaces
    wordcloud.generate(" ".join(total_voc))
    wordcloud.to_file(target+'.png')

#-------------------------------------------------------------
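
The counting step is the core of the script, so here it is isolated as a small sketch that runs without GiNZA. The function name `top_words` and the sample word list are made up for illustration; only `collections.Counter` and its `most_common` method come from the script above.

```python
from collections import Counter


def top_words(words, n=10, black_list=()):
    """Return the n most frequent words as (word, count) pairs, skipping black_list."""
    counts = Counter(w for w in words if w not in black_list)
    return counts.most_common(n)


# Made-up sample words standing in for the output of the analysis step
sample = ["夜", "歩く", "夜", "test", "花", "夜", "花"]
ranking = top_words(sample, n=2, black_list=["test"])
print(ranking)
```

Passing `n` to `most_common` returns only the top entries, so no manual counter or `break` is needed when picking words for the chart.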

Graph results

Bar chart results

image.png
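
If you want to save the bar chart to a file instead of opening a window, something like the following sketch should work. The helper `save_bar_chart`, the file name, and the sample counts are my own assumptions for illustration, not values from the actual results.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to files only, no GUI window
import matplotlib.pyplot as plt


def save_bar_chart(pairs, title, path):
    """Draw (word, count) pairs as a bar chart and save it to path (hypothetical helper)."""
    labels = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    fig, ax = plt.subplots()
    ax.bar(labels, counts)
    ax.set_title(title)
    ax.grid(True)
    fig.savefig(path)
    plt.close(fig)  # free the figure so repeated calls do not pile up
    return path


# Made-up counts; ASCII labels are used so no Japanese font is required here
pairs = [("night", 3), ("flower", 2), ("walk", 1)]
out_path = save_bar_chart(pairs, "lyrics", os.path.join(tempfile.mkdtemp(), "lyrics_bar.png"))
print(out_path)
```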

Word Cloud results

image.png

The charts make the trends in the lyrics easy to understand.

Thank you for your hard work.
