Web scraping with Python: making a word cloud from reviews

What this article covers

This article scrapes reviews from a website with Beautiful Soup, extracts words from them with MeCab, and turns the result into a word cloud to visualize what people are writing about.

What you will be able to do

For example, you can turn TripAdvisor reviews into a word cloud: a visualization of frequently used words, which roughly correspond to the topics that many people take the trouble to mention in their reviews. The idea is that it might be interesting to compare such visualizations, for example Tokyo Tower against Skytree, or Tokyo Tower against Tsutenkaku.

Tokyo_tower.jpg
From the reviews inside the red frame, we will make something like this.
wordcloud_TT.jpg

Sites and articles that I referred to

This was my first attempt at web scraping, and I got as far as making a word cloud. Let me first introduce the sites I referred to along the way.
・Scraping review sites to find out the number of words
・[For beginners] Try web scraping with Python
・Active engineers explain how to use MeCab in Python [for beginners]
・Use wordcloud on Windows with Anaconda / Jupyter (Tips)

Library installation

First, install the required libraries. (Skip this step if you already have them.) For reference, my environment is Windows 10 with Anaconda (Jupyter).

Beautiful Soup, requests, wordcloud
Launch Anaconda Prompt and install Beautiful Soup, requests, and wordcloud.

conda install beautifulsoup4
conda install requests
conda install -c conda-forge wordcloud

MeCab
Download "Binary package for MS-Windows" from the official site and install it. This package includes a dictionary from the start. Once you get used to MeCab, you can switch to another dictionary. During installation you will be asked for the character encoding; choose "UTF-8". Everything else can be left as it is.

After the installation is complete, set the environment variable.
・Search for "advanced system settings" (probably from the search box at the bottom left of the taskbar)
・Select "Environment Variables"
・Select the system environment variable "Path"
・Click "Edit" and select "New"
・Enter "C:\Program Files (x86)\MeCab\bin"
・Select OK and close the window

The article "Active engineers explain how to use MeCab in Python [for beginners]" also describes this procedure, so please take a look there as well.
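To confirm that the Path setting took effect, one option is to check from Python whether the mecab command can be found. The following is a minimal sketch using only the standard library. The helper name command_on_path is hypothetical, and the check assumes the executable in the bin folder is called mecab.

```python
import shutil


def command_on_path(name):
    """Return True if an executable called `name` can be found via PATH."""
    return shutil.which(name) is not None


# "mecab" is the assumed name of the executable in the MeCab bin folder.
if command_on_path("mecab"):
    print("mecab was found on PATH")
else:
    print("mecab was not found; check the Path environment variable")
```

If this reports that mecab was not found right after editing Path, reopening Anaconda Prompt or Jupyter is worth trying, because processes that are already running keep the old environment.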

The main part starts here

Now that the preparations are complete, let's write the code. First, the imports.

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

Once that is done, open the site you want to scrape. In this example, it is TripAdvisor's Tokyo Tower page. Check the following two points.
・The URL
・Where in the HTML the content you want to scrape (the reviews, this time) is located

The URL is visible as it is, so I will skip the explanation. For the second point, press F12 to launch the developer tools.
Untitled.jpg
A window like the one above will appear. From here you can see where and how the reviews are stored.

Checking is simple: press Ctrl + Shift + C and then click on a review in the page. The corresponding part of the HTML is then selected in the developer tools window.
Untitled2.jpg
You can see that the reviews are stored in q tags with the class "IRsGHoPm".

After that, specify this URL and this location in the code and run the scraping.

# List that stores the scraped reviews (one small DataFrame per review)
df_list = []
# Scrape 20 pages (the offset in the URL increases by 5 per page)
pages = range(0, 100, 5)

for page in pages:
    # The URL of the first page differs slightly from that of the second and later pages, so branch with if
    if page == 0:
        urlName = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'
    else:
        urlName = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews-or' + str(page) + '-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'

    url = requests.get(urlName)
    soup = BeautifulSoup(url.content, "html.parser")

    # Select the q tags with class 'IRsGHoPm' from the HTML
    review = soup.find_all('q', class_ = 'IRsGHoPm')

    # Store the extracted reviews in order
    for i in range(len(review)):
        _df = pd.DataFrame({'Number':i+1,
                            'review':[review[i].text]})

        df_list.append(_df)

At this point, df_list holds one small DataFrame per review. Concatenate them into a single DataFrame and check the contents:

df_review = pd.concat(df_list).reset_index(drop=True)
print(df_review.shape)
df_review

Untitled3.jpg
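As an aside, the branch between the first page and the later pages in the scraping loop can also be written as a small helper that builds the URL from the page offset. This is only a sketch of the same URL pattern used above: build_review_url, BASE, and SUFFIX are hypothetical names that do not appear in the original code, and nothing is downloaded here.

```python
# Both constants are cut out of the Tokyo Tower URLs used in the scraping loop.
BASE = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews'
SUFFIX = '-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'


def build_review_url(offset):
    """Return the review page URL for a given offset (0, 5, 10, ...)."""
    if offset == 0:
        # The first page has no "-or<offset>" part.
        return BASE + SUFFIX
    return BASE + '-or' + str(offset) + SUFFIX


# Same 20 offsets as range(0, 100, 5) in the loop.
urls = [build_review_url(offset) for offset in range(0, 100, 5)]
print(len(urls))
print(urls[1])
```

Printing a couple of the generated URLs and opening them in a browser is a quick way to check the pattern before running the full loop.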

Creating a word cloud

First, import MeCab and WordCloud

import MeCab
import matplotlib.pyplot as plt
from wordcloud import WordCloud

The code below follows the article "Scraping review sites to find out the number of words".

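# Prepare MeCab
tagger = MeCab.Tagger()
# Parse an empty string once before parseToNode (a commonly used workaround so that node.surface is read correctly)
tagger.parse('')

# Combine all the review text into one string
all_text = ""
for s in df_review['review']:
    all_text += s

node = tagger.parseToNode(all_text)

# Extract the nouns into a list
word_list = []
while node:
    # The first field of the feature string is the part of speech
    word_type = node.feature.split(',')[0]
    # '名詞' is the part-of-speech label for nouns in the Japanese dictionary output
    if word_type == '名詞':
        word_list.append(node.surface)
    node = node.next

# Join the list into a single space-separated string
word_chain = ' '.join(word_list)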
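Before drawing anything, it can help to look at the raw counts, since the word cloud sizes words by how often they occur. The sketch below uses a small hand-written token list in place of real MeCab output. The (surface, feature) pairs are made-up sample data written in the comma-separated feature style assumed for the bundled dictionary, and extract_nouns is a hypothetical helper name. It keeps the nouns in the same way as the loop above and counts them with collections.Counter.

```python
from collections import Counter

# Made-up (surface, feature) pairs standing in for MeCab nodes.
sample_tokens = [
    ('東京タワー', '名詞,固有名詞,一般,*,*,*,東京タワー'),
    ('から', '助詞,格助詞,一般,*,*,*,から'),
    ('スカイツリー', '名詞,固有名詞,一般,*,*,*,スカイツリー'),
    ('が', '助詞,格助詞,一般,*,*,*,が'),
    ('見える', '動詞,自立,*,*,一段,基本形,見える'),
    ('東京タワー', '名詞,固有名詞,一般,*,*,*,東京タワー'),
]


def extract_nouns(tokens):
    """Keep the surface forms whose first feature field is '名詞' (noun)."""
    return [surface for surface, feature in tokens
            if feature.split(',')[0] == '名詞']


nouns = extract_nouns(sample_tokens)
counts = Counter(nouns)

# Most frequent nouns first.
print(counts.most_common(2))
```

With the real data, passing word_list to Counter in the same way gives the counts behind the picture.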

All that is left is to run wordcloud.

# Stop words (words to exclude); empty for now
stopwords = ['']

# Create the word cloud (font_path points to a font that can display Japanese; the raw string keeps the backslashes literal)
W = WordCloud(width=500, height=300, background_color='lightblue', colormap='inferno', font_path=r'C:\Windows\Fonts\yumin.ttf', stopwords = set(stopwords)).generate(word_chain)

plt.figure(figsize = (15, 12))
plt.imshow(W)
plt.axis('off')
plt.show()

This produces the following.
wordcloud_TT.jpg

However, "の" (no), "こと" (koto), and "ため" (tame) are not needed, so I will remove them. This is where the stop words above come in.

# Stop words (words to exclude)
stopwords = ['の', 'こと', 'ため']

# Create the word cloud (font_path points to a font that can display Japanese; the raw string keeps the backslashes literal)
W = WordCloud(width=500, height=300, background_color='lightblue', colormap='inferno', font_path=r'C:\Windows\Fonts\yumin.ttf', stopwords = set(stopwords)).generate(word_chain)

plt.figure(figsize = (15, 12))
plt.imshow(W)
plt.axis('off')
plt.show()

The result is as follows.
Untitled4.jpg
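Another way to get the same effect, shown here only as a sketch, is to drop the unwanted words from the list before joining it into a string. The sample list is made up for illustration, and remove_stopwords is a hypothetical helper name that is not part of wordcloud.

```python
def remove_stopwords(words, stopwords):
    """Return the words that are not in the stop word collection."""
    stopword_set = set(stopwords)  # set lookup is fast for long lists
    return [word for word in words if word not in stopword_set]


# Made-up sample standing in for the noun list extracted with MeCab.
sample_words = ['東京タワー', 'の', '夜景', 'こと', 'ため', 'スカイツリー']

filtered = remove_stopwords(sample_words, ['の', 'こと', 'ため'])
sample_chain = ' '.join(filtered)
print(sample_chain)
```

Filtering the list yourself makes it easy to inspect exactly which words remain, at the cost of one extra step compared with passing stopwords to WordCloud.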

It feels like this tells us something, and at the same time nothing... Whether many people are comparing it with Skytree or many are saying "I can see Skytree", there seems to be no doubt that Skytree is on the minds of Tokyo Tower visitors. So I also made a word cloud from the Skytree reviews, shown below.

Untitled5.jpg

Words such as "elevator" and "ticket", which were not prominent for Tokyo Tower (the letters were not large), stand out here. Also, "Tokyo Tower" itself is not conspicuous. These points seem to be where Tokyo Tower and Skytree differ.
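The comparison above was done by eye, but the same idea can be expressed with counts. The sketch below compares the share of each word in two word lists and returns the words that are relatively more common in the first one. The two lists are tiny made-up samples, not the scraped reviews, and relative_freq and more_prominent_in are hypothetical helper names.

```python
from collections import Counter


def relative_freq(words):
    """Map each word to its share of all words in the list."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def more_prominent_in(words_a, words_b, top_n=3):
    """Words whose share in A exceeds their share in B, largest gap first."""
    freq_a = relative_freq(words_a)
    freq_b = relative_freq(words_b)
    gaps = {word: freq_a[word] - freq_b.get(word, 0.0) for word in freq_a}
    ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
    return [word for word, gap in ranked[:top_n] if gap > 0]


# Made-up samples for illustration only.
tower_words = ['夜景', 'スカイツリー', '夜景', '階段']
skytree_words = ['エレベーター', 'チケット', 'エレベーター', '夜景']

print(more_prominent_in(skytree_words, tower_words, top_n=2))
```

Using shares rather than raw counts keeps the comparison fair when one facility has many more reviews than the other.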

That's all.

Supplement

It may also be interesting to compare your own company with its competitors on company review sites such as OpenWork. Comparing similar facilities also seems likely to reveal something, for example a review comparison of the five major domes: Sapporo Dome, Tokyo Dome, Nagoya Dome, Osaka Dome, and Fukuoka Dome. Note that scraping OpenWork requires setting request headers. See the following article for details.
[Python] What to do when scraping 403 Forbidden: You don't have permission to access on this server
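As a rough illustration of what setting headers means: a request can carry a User-Agent header, and some sites respond with 403 Forbidden when it is missing or looks like a script. The sketch below uses the standard library's urllib.request so that it runs without extra packages, and it only builds the request without contacting any site. With the requests library used in this article, the same dictionary can be passed as requests.get(url, headers=HEADERS). The URL and the User-Agent value are placeholders, and which headers OpenWork actually expects is not covered here.

```python
from urllib.request import Request

# Placeholder values: both the URL and the User-Agent string are assumptions.
EXAMPLE_URL = 'https://example.com/reviews'
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; review-wordcloud-sketch)'}


def build_request(url, headers):
    """Create a request object that carries the given headers (nothing is sent)."""
    return Request(url, headers=headers)


req = build_request(EXAMPLE_URL, HEADERS)

# urllib stores header names with only the first letter capitalized.
print(req.get_header('User-agent'))
```

Before scraping any site, it is also worth reading its terms of use to check whether automated access is permitted.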
