[Python] Wouldn't it be great if you could grasp a company's characteristics with nlplot?

Motivation

I am currently a data analysis intern at EXIDEA Co., Ltd., which develops SEO writing tools. It has been four months since I started, and because of the coronavirus I have still never met anyone from the company in person, but thanks to regular online drinking parties and daily meetings I finally have a feel for what the company is like. I also hear the word **"recruiting"** a lot at recent monthly meetings, and I suspect many companies besides startups are putting effort into recruiting through Wantedly. **In this article, I use nlplot, a package that makes natural language easy to visualize, on our Wantedly story articles to take a fresh look at the company's characteristics and the message we want to convey to applicants.**

The source code is available on GitHub, so please feel free to take a look: https://github.com/yuuuusuke1997/Article_analysis

Environment

・macOS
・Python 3.7.6
・Jupyter Notebook
・zsh

Story flow

  1. [Data collection (scraping)](https://qiita.com/yuuuusuke1997/items/247eb06583ae8f653c2a#1-%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%8F%8E%E9%9B%86%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0)
  2. [Morphological analysis (MeCab)](https://qiita.com/yuuuusuke1997/items/247eb06583ae8f653c2a#2-%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90mecab)
  3. Visualization (nlplot)

1. Data collection (scraping)

1-1. Flow of scraping

In this scraping step, we navigate the web pages as shown below and collect only our company's articles. The scraping was carried out with Wantedly's permission; thank you for your understanding.

[Diagram: how the scraper moves between the Wantedly pages (IMG_0017.PNG)]

1-2. Advance preparation

Wantedly's web pages load the next batch of articles when you scroll to the bottom of the page, so we use Selenium, which automates browser operations, only in the few places where it is strictly necessary. To drive the browser you need a **driver for your browser** and the **Selenium library**. I am a Google Chrome user, so I downloaded ChromeDriver from here and placed it in the directory below. Replace the * under Users with your own user name as appropriate.

bash


$ cd /Users/*/documents/nlplot
$ ls
article_analysis.ipynb
chromedriver
post_articles.csv
user_dic.csv

Install the Selenium library with pip.

bash


$ pip install selenium
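
To quickly confirm that the driver and the library are wired up, a minimal smoke test like the following should open a browser, print the page title, and quit (a sketch; it assumes chromedriver sits in the current directory, as set up above):

python


from selenium import webdriver

# Launch Chrome via the chromedriver placed in the current directory
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.wantedly.com')
print(driver.title)  # Prints the page title if everything is working
driver.quit()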

If you want to learn more about Selenium, from installation to usage, you can refer to the article here. Now that we're ready, let's actually scrape.

1-3. Source code

article_analysis.ipynb


import json
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs4
from selenium import webdriver

base_url = 'https://www.wantedly.com'


def scrape_path(url):
    """
Get the URL of the space detail page from the story list page

    Parameters
    --------------
    url: str
URL of the story list page

    Returns
    ----------
    path_list: list of str
A list containing the URL of the space detail page
    """

    path_list = []

    response = requests.get(url)
    soup = bs4(response.text, 'lxml')
    time.sleep(3)

    # Get the contents of <script data-placeholder-key="wtd-ssr-placeholder">
    # Slice with .string[3:] to drop the leading '//' from the JSON string
    feeds = soup.find('script', {'data-placeholder-key': 'wtd-ssr-placeholder'}).string[3:]
    feed = json.loads(feeds)

    # Get 'spaces' from feed['body']
    feed_spaces = feed['body'][list(feed['body'].keys())[0]]['spaces']
    for i in feed_spaces:
        space_path = base_url + i['post_space_path']
        path_list.append(space_path)

    return path_list


path_list = scrape_path('https://www.wantedly.com/companies/exidea/feed')


def scrape_url(path_list):
    """
Get the URL of the story detail page from the space detail page

    Parameters
    --------------
    path_list: list of str
A list containing the URL of the space detail page

    Returns
    ----------
    url_list: list of str
List containing URLs for story detail pages
    """

    url_list = []

    # Launch Chrome (chromedriver is placed in the same directory as this file)
    driver = webdriver.Chrome('chromedriver')
    for feed_path in path_list:
        driver.get(feed_path)

        # Scroll to the bottom of the page; leave the loop once we can no longer scroll
        # Height before scrolling
        last_height = driver.execute_script("return document.body.scrollHeight")

        while True:
            #Scroll to the bottom of the page
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

            # Selenium runs faster than the new content can load, so force a wait
            time.sleep(3)

            #Height after scrolling
            new_height = driver.execute_script("return document.body.scrollHeight")

            # Keep scrolling until new_height stops changing (i.e. matches last_height)
            if new_height == last_height:
                break
            else:
                last_height = new_height
                continue

        soup = bs4(driver.page_source, 'lxml')
        time.sleep(3)
        # <div class="post-space-item" >Get the element of
        post_space = soup.find_all('div', class_='post-content')
        for post in post_space:
            # <"post-space-item">of<a>Get element
            url = base_url + post.a.get('href')
            url_list.append(url)

    url_list = list(set(url_list))

    # Close the browser
    driver.close()
    return url_list


url_list = scrape_url(path_list)


def get_text(url_list, wrong_name, correct_name):
    """
Get text from story details page

    Parameters
    --------------
    url_list: list of str
List containing URLs for story detail pages
    wrong_name: str
Wrong company name
    correct_name: str
Correct company name

    Returns
    ----------
    text_list: list of str
A list of stories
    """

    text_list = []

    for url in url_list:
        response = requests.get(url)
        soup = bs4(response.text, 'lxml')
        time.sleep(3)

        # <section class="article-description" data-post-id="○○○○○○">In<p>Get all elements
        articles = soup.find('section', class_='article-description').find_all('p')
        for article in articles:
            # Split on delimiters
            for text in re.split('[\n!?!?。]', article.text):
                # Preprocessing
                replaced_text = text.lower()  # Convert to lowercase
                replaced_text = re.sub(wrong_name, correct_name, replaced_text)  # Restore the company name's correct casing
                replaced_text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', '', replaced_text)  # Remove URLs
                replaced_text = re.sub('[0-9]', '', replaced_text)  # Remove digits
                replaced_text = re.sub('[,:;~%()-]', '', replaced_text)  # Remove half-width symbols ('-' placed last so it is not read as a range)
                replaced_text = re.sub('[,:;・~%()※""【】(笑)]', '', replaced_text)  # Remove full-width symbols, including (笑)
                replaced_text = re.sub('\u3000', '', replaced_text)  # Remove full-width spaces (\u3000)

                text_list.append(replaced_text)

    text_list = [x for x in text_list if x != '']
    return text_list


text_list = get_text(url_list, 'exidea', 'EXIDEA')
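
To make the cleanup steps in get_text concrete, here is what they do to a made-up sentence (a hypothetical input, not taken from an actual article):

python


import re

sample = 'exideaの記事はこちら → https://example.com/post (2020)'
cleaned = sample.lower()                                                # Convert to lowercase
cleaned = re.sub('exidea', 'EXIDEA', cleaned)                           # Restore the company name's casing
cleaned = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', '', cleaned)  # Remove URLs
cleaned = re.sub('[0-9]', '', cleaned)                                  # Remove digits
cleaned = re.sub('[,:;~%()-]', '', cleaned)                             # Remove half-width symbols
print(cleaned)  # -> 'EXIDEAの記事はこちら →  '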

Save the retrieved text in a CSV file.

article_analysis.ipynb


df_text = pd.DataFrame(text_list, columns=['text'])
df_text.to_csv('post_articles.csv', index=False)

[Screenshot: the saved DataFrame]

2. Morphological analysis (MeCab)

2-1. Flow to morphological analysis

  1. Install and configure MeCab itself
  2. Add the IPA dictionary
  3. Add the NEologd dictionary
  4. Create a user dictionary
  5. Finally, analysis

2-2. A short break

From here on I install MeCab and make various preparations. It did not go nearly as smoothly as I expected and nearly broke my spirit, so I hope this section helps keep your motivation up.

Why bother with such a tedious procedure in the first place? You may think a single $ brew install mecab would settle it, and it might. However, to get the morphological analysis results you want out of nlplot, you have to register company-specific terms, such as business division names and internal vocabulary, as proper nouns in a user dictionary whose character code is UTF-8. When I installed with brew for convenience, the character code came out as EUC-JP and I had to start over. So if you care about the output, try the steps below; if you just want something quick, install with brew by referring to the following.

Preparing the environment for using MeCab on Mac

2-3. Install and configure MeCab itself

Download **MeCab itself** and the **IPA dictionary** from the MeCab official website with the curl command. This time we install into the local environment. First, MeCab itself.

bash


# Create an installation directory for mecab in the local environment
$ mkdir /Users/*/opt/mecab
$ cd /Users/*/opt/mecab
# Download into the current directory, naming the file with the -o option
$ curl -Lo mecab-0.996.tar.gz 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE'
# Unpack the source archive
$ tar zxfv mecab-0.996.tar.gz
$ cd mecab-0.996
# Configure the build, specifying UTF-8 as the character code
$ ./configure --prefix=/Users/*/opt/mecab --with-charset=utf-8
# Build using the Makefile generated by configure
$ make
# Check that it works properly before installing
$ make check
# Install the binaries built by make into /Users/*/opt/mecab
$ make install

Done

If you are wondering what configure, make, and make install actually do, the article here may be helpful.

Now that it is installed, let's add it to the PATH so that the mecab command can be run.

bash


# Check your login shell
$ echo $SHELL
/bin/zsh
# Add the path to .zshrc
$ echo 'export PATH=/Users/*/opt/mecab/bin:$PATH' >>  ~/.zshrc

# Caution: write to the config file of your own login shell
# e.g. for bash: echo 'export PATH=/Users/*/opt/mecab/bin:$PATH' >> ~/.bash_profile

# Reload the shell settings
$ source ~/.zshrc
# Check that the path is set
$ which mecab
/Users/*/opt/mecab/bin/mecab

Done

Reference article: What is PATH?

2-4. Add the IPA dictionary

bash


# Move back to the base directory
$ cd /Users/*/opt/mecab
# Download into the current directory, naming the file with the -o option
$ curl -Lo mecab-ipadic-2.7.0-20070801.tar.gz 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM'
# Unpack the source archive
$ tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
# Configure the build, specifying UTF-8 as the character code
$ ./configure --prefix=/Users/*/opt/mecab --with-charset=utf-8
# Build using the Makefile generated by configure
$ make
# Install the dictionary built by make into /Users/*/opt/mecab
$ make install

Done

# Check the character code
# If it is EUC-JP, change it to UTF-8
$ mecab -P | grep config-charset
config-charset: EUC-JP
# Find the config file
$ find /Users -name dicrc
/Users/*/opt/mecab/mecab-ipadic-2.7.0-20070801/dicrc
$ vim /Users/*/opt/mecab/mecab-ipadic-2.7.0-20070801/dicrc
[Before change] config-charset = EUC-JP
[After change] config-charset = UTF-8

$ mecab
I'm quitting being human! JoJo!
I	noun,pronoun,general,*,*,*,I,me,me
wa	particle,binding particle,*,*,*,*,wa,ha,wa
human	noun,general,*,*,*,*,human,ningen,ningen
wo	particle,case particle,general,*,*,*,wo,wo,wo
quit	verb,independent,*,*,ichidan verb,base form,quit,yameru,yameru
zo	particle,sentence-final particle,*,*,*,*,zo,zo,zo
!	symbol,general,*,*,*,*,!,!,!
JoJo	noun,proper noun,organization,*,*,*,*
EOS

# Check the IPA dictionary directory
$ find /Users -name ipadic
/Users/*/opt/mecab/lib/mecab/dic/ipadic

2-5. Add the NEologd dictionary

bash


# Move back to the base directory
$ cd /Users/*/opt/mecab
# Clone the source code from GitHub
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
# Type "yes" at the prompt, then check the result
$ ./bin/install-mecab-ipadic-neologd -n

Done

# Check the character code
# If it is EUC-JP, change it to UTF-8
$ mecab -d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd -P | grep config-charset
config-charset: EUC-JP
# Find the config file
$ find /Users -name dicrc
/Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd/dicrc
$ vim /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd/dicrc
[Before change] config-charset = EUC-JP
[After change] config-charset = UTF-8

# Check the NEologd dictionary directory
$ find /Users -name mecab-ipadic-neologd
/Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd

$ echo "I'm quitting being human! JoJo!" | mecab -d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd
I'm quitting being human!	noun,proper noun,general,*,*,*,I'm quitting being human!,orehaningenwoyameruzo,orewaningenoyameruzo
JoJo	noun,general,*,*,*,*,*
EOS

Github Official: mecab-ipadic-neologd

bash


# Finally, pip install so that mecab can be used from Python 3
$ pip install mecab-python3
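
If you would rather check the dictionary character codes from Python than with mecab -P, the mecab-python3 binding exposes dictionary_info() (a small sketch; the -d path is the NEologd directory confirmed above):

python


import MeCab

tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd')
# Walk the linked list of loaded dictionaries and print each one's charset
info = tagger.dictionary_info()
while info:
    print(info.filename, info.charset)
    info = info.next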

2-6. Create a user dictionary

The user dictionary is where you register, with the meaning you give them, words that the system dictionary cannot handle.

First, create a CSV file that follows the format below for each word you want to add. A good workflow is to visualize once and, whenever a word catches your eye, add it to the CSV file.

bash


"""
format
Surface type,Left context ID,Right context ID,cost,Part of speech,Part of speech細分類1,Part of speech細分類2,Part of speech細分類3,Utilization type,Inflected form,Prototype,reading,pronunciation
"""

#csv file creation
$ echo 'Internship student,-1,-1,1,noun,General,*,*,*,*,*,*,*,Internship'"\n"'Core value,-1,-1,1,noun,General,*,*,*,*,*,*,*,Core value'"\n"'Meetup,-1,-1,1,noun,General,*,*,*,*,*,*,*,Meetup' > /Users/*/Documents/nlplot/user_dic.csv

#Check the character code of the csv file
$ file /Users/*/Documents/nlplot/user_dic.csv
/users/*/documents/nlplot/user_dic.csv: UTF-8 Unicode text

Next, compile the created csv file into a user dictionary.

bash


# Create a directory to store the user dictionary
$ mkdir /Users/*/opt/mecab/lib/mecab/dic/userdic

# -d  directory containing the system dictionary
# -u  where to save the user dictionary
# -f  character code of the CSV file
# -t  character code of the user dictionary; the last argument is the CSV file

# Compile the user dictionary
$ /Users/*/opt/mecab/libexec/mecab/mecab-dict-index \
-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd \
-u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic \
-f utf-8 -t utf-8 /Users/*/Documents/nlplot/user_dic.csv

# Confirm that userdic.dic was created
$ find /Users -name userdic.dic
/Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic

Now that we have installed mecab and created a user dictionary, we will move on to morphological analysis.
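Before that, it is worth a quick sanity check that the new entries are actually picked up (a sketch using one of the words registered above; the -d and -u paths are the ones from this article):

python


import MeCab

# Load NEologd as the system dictionary together with the compiled user dictionary
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd '
                      '-u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic')
# 'Core value' was registered as a single noun, so it should come back as one token
print(tagger.parse('Core value'))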

Reference article: How to add words

2-7. Finally, analysis

First, load the csv file created during scraping.

article_analysis.ipynb


df = pd.read_csv('post_articles.csv')
df.head()

[Screenshot: df.head() output]

Since nlplot takes its input as individual words, we run morphological analysis and keep only the nouns.

article_analysis.ipynb


import MeCab

def download_slothlib():
    """
    Load SlothLib and create a stop word list

    Returns
    ----------
    slothlib_stopwords: list of str
        A list containing stop words
    """

    slothlib_path = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    response = requests.get(slothlib_path)
    soup = bs4(response.content, 'html.parser')
    # The response is plain text, so the soup holds a single text node; split it into lines
    slothlib_stopwords = [line.strip() for line in soup]
    slothlib_stopwords = slothlib_stopwords[0].split('\r\n')
    slothlib_stopwords = [x for x in slothlib_stopwords if x != '']
    return slothlib_stopwords


stopwords = download_slothlib()


def add_stopwords():
    """
    Add custom stop words to the stop word list

    Returns
    ----------
    stopwords: list of str
        A list containing stop words
    """

    add_words = ['See', 'Company', "I'd love to", 'By all means', 'Story', '弊社', 'Human', 'What', 'article', 'Other than', 'Hmm', 'of', 'Me', 'Sa', 'like this']
    stopwords.extend(add_words)
    return stopwords


stopwords = add_stopwords()


def tokenize_text(text):
    """
    Extract only the nouns via morphological analysis

    Parameters
    --------------
    text: str
        Text stored in the dataframe

    Returns
    ----------
    nons_list: list of str
        A list containing only the nouns left after morphological analysis
    """

    # Specify the directories where the NEologd dictionary and the user dictionary are saved
    tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd -u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic')
    node = tagger.parseToNode(text)
    nons_list = []
    while node:
        # Keep the surface form when the first feature field is the noun POS tag
        # (with the Japanese dictionaries above, that field appears as '名詞')
        if node.feature.split(',')[0] in ['noun'] and node.surface not in stopwords:
            nons_list.append(node.surface)
        node = node.next
    return nons_list


df['words'] = df['text'].apply(tokenize_text)

article_analysis.ipynb


df.head()

[Screenshot: df.head() with the words column added]

3. Visualization (nlplot)

3-1. Advance preparation

bash


$ pip install nlplot

3-2. uni-gram

article_analysis.ipynb


import nlplot

# Pass the dataframe and specify the words column
npt = nlplot.NLPlot(df, target_col='words')

# top_n: treat the top 2 most frequent words as stop words; min_freq: frequency threshold for stop words
# Top 2 words: ['Company', 'jobs']
stopwords = npt.get_stopword(top_n=2, min_freq=0)

npt.bar_ngram(
    title='uni-gram',
    xaxis_label='word_count',
    yaxis_label='word',
    ngram=1,
    top_n=50,
    stopwords=stopwords,
    save=True
)

uni-gram.png

3-3. bi-gram

article_analysis.ipynb


npt.bar_ngram(
    title='bi-gram',
    xaxis_label='word_count',
    yaxis_label='word',
    ngram=2,
    top_n=50,
    stopwords=stopwords,
    save=True
)

bi-gram.png

3-4. tri-gram

article_analysis.ipynb


npt.bar_ngram(
    title='tri-gram',
    xaxis_label='word_count',
    yaxis_label='word',
    ngram=3,
    top_n=50,
    stopwords=stopwords,
    save=True
)

tri-gram.png

3-5. tree map

article_analysis.ipynb


npt.treemap(
    title='tree map',
    ngram=1,
    stopwords=stopwords,
    width=1200,
    height=800,
    save=True
)

tree-map.png

3-6. wordcloud

article_analysis.ipynb


npt.wordcloud(
    stopwords=stopwords,
    max_words=100,
    max_font_size=100,
    colormap='tab20_r',
    save=True
)

wordcloud.png

3-7. Co-occurrence network

article_analysis.ipynb


npt.build_graph(stopwords=stopwords, min_edge_frequency=13)

display(
    npt.node_df, npt.node_df.shape,
    npt.edge_df, npt.edge_df.shape
)

npt.co_network(
    title='All sentiment Co-occurrence network',
    color_palette='hls',
    save=True
)

Co-occurrence-network.png

3-8. sunburst chart

article_analysis.ipynb


npt.sunburst(
    title='All sentiment sunburst chart',
    colorscale=True,
    color_continuous_scale='Oryel',
    width=800,
    height=600,
    save=True
)

sunburst-chart.png

Reference article: The library "nlplot" that can easily visualize and analyze natural language has been released

Summary

Through these visualizations, I could feel anew how strongly EXIDEA's action guideline "The share" comes through. In particular, The share's values of Happy, Sincere, and Altruistic stand out in the articles, and I believe that is why I have been able to meet colleagues with whom I can talk openly about building the best working environment, about what we want to achieve, and about our worries. There is still little I can contribute to the company in my day-to-day work, but I want to make the most of what I can do right now, such as committing fully to the task at hand and sharing the results externally.

In conclusion

Through this article I was reminded of how important preprocessing is. I started out simply wanting to try nlplot, but when I visualized the text without preprocessing, proper nouns came out split into separate morphemes in the bi-grams and tri-grams and the results were a mess. Thanks to that detour, I think the biggest takeaway was the Linux-side knowledge I picked up while installing MeCab and building the user dictionary. Rather than leaving it as book knowledge, I want to keep applying the basics of actually moving my own hands to my future learning.

This article turned out long, but thank you for reading to the end. If you spot any mistakes, I would be very grateful if you could point them out in the comments.
