I tried various things with Python: scraping (Beautiful Soup + Selenium + PhantomJS) and morphological analysis.

I've been programming in Python for about three months, on weekday nights and weekends, and I'm still having fun.

What I did recently

1. Morphological analysis

・ I wanted to grasp the overall flow: feed text into MeCab, keep only the nouns, count their frequencies, then add a user dictionary and run it again. So I tried it briefly.

・ It was finished almost immediately, so I won't go into the details.

[Sites that I referred to]
http://qiita.com/fantm21/items/d3d44f7d86f09acda86f
http://qiita.com/naoyu822/items/473756fb8e8bbdc4d734
http://www.mwsoft.jp/programming/munou/mecab_command.html
http://shimz.me/blog/d3-js/2711
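The "nouns only, then count" step described above can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: it assumes MeCab's default output format with an IPAdic-style dictionary (surface form, a tab, then comma-separated features whose first field is the part of speech), and it works on a hard-coded sample so that it runs without MeCab installed. The names `SAMPLE_MECAB_OUTPUT` and `count_nouns` are made up for this example.

```python
from collections import Counter

# Sample text in MeCab's default output format (assumed IPAdic-style features).
# With mecab-python3 installed, similar text would come from
# MeCab.Tagger().parse(text); here it is hard-coded for illustration.
SAMPLE_MECAB_OUTPUT = "\n".join([
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ",
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ",
    "EOS",
])


def count_nouns(mecab_output):
    """Count surface forms whose first feature field is 名詞 (noun)."""
    counts = Counter()
    for line in mecab_output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        if features.split(",")[0] == "名詞":
            counts[surface] += 1
    return counts


noun_counts = count_nouns(SAMPLE_MECAB_OUTPUT)
for word, n in noun_counts.most_common():
    print(word, n)
```

After adding a user dictionary, the same counting function can be run again on the new output to see how the noun frequencies change.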

2. Scraping

・ Scraping text and images comes up very often at work, so I wanted to study it to some extent. This time I started with a book: https://www.amazon.co.jp/dp/4873117615

・ First, I learned that Python + Beautiful Soup can quickly scrape a single page with a simple structure.

・ Next, I found that sites generated by JavaScript are hard to handle with that combination. Tools such as PhantomJS and CasperJS exist for this, and writing the scraper in JS gets the job done quickly as well.

・ After that, I found that Python can also scrape JS-generated sites by combining Selenium + PhantomJS.

・ Finally, when I wrote the pandas DataFrame at the end of the code out to CSV, I got stuck on a UnicodeEncodeError. I solved it by specifying the encoding in the call that converts the DataFrame to CSV, and with that I achieved what I set out to do for now.
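The encoding fix from the last bullet can be sketched on its own like this. It is only an illustration under assumptions: the sample rows are made up, `save_csv` is a hypothetical helper, and which default encoding triggers the UnicodeEncodeError depends on the Python and pandas versions in use, which I have not pinned down here.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, path, encoding="utf-8"):
    """Write df to path with an explicit encoding (hypothetical helper)."""
    df.to_csv(path, index=False, encoding=encoding)


# Sample rows containing non-ASCII text, made up for this illustration.
df = pd.DataFrame({"title": ["形態素解析", "スクレイピング"], "count": [3, 5]})

csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
save_csv(df, csv_path)

# Reading back with the same encoding round-trips the Japanese text.
restored = pd.read_csv(csv_path, encoding="utf-8")
print(restored)
```

Stating the encoding explicitly on both the write and the read side means the result does not depend on whatever default the environment happens to use.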

[Sites that I referred to]
http://doz13189.hatenablog.com/entry/2016/08/21/154219
http://zipsan.hatenablog.jp/entry/20150413/1428861548
http://qiita.com/okadate/items/7b9620a5e64b4e906c42
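For comparison with the Selenium version, the "single page with a simple structure" case from the first bullet needs nothing but Beautiful Soup. The following is a minimal sketch: the HTML, the class names `title` and `summary`, and the function `extract_articles` are all invented for illustration, and it uses Python's built-in `html.parser` so that no extra parser package is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a simple, statically rendered page.
SAMPLE_HTML = """
<html><body>
  <h3 class="title">First article</h3>
  <div class="summary">Summary of the <b>first</b> article.</div>
  <h3 class="title">Second article</h3>
  <div class="summary">Summary of the second article.</div>
</body></html>
"""


def extract_articles(html):
    """Return (title, summary) pairs found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [h3.get_text().strip() for h3 in soup.find_all("h3", class_="title")]
    # get_text() also collects text from nested tags such as <b>.
    summaries = [div.get_text().strip() for div in soup.find_all("div", class_="summary")]
    return list(zip(titles, summaries))


articles = extract_articles(SAMPLE_HTML)
for title, summary in articles:
    print(title, "|", summary)
```

On a real static page the HTML string would come from an HTTP request instead of a constant. For a JS-generated page this approach sees only the HTML before the scripts run, which is why a browser driven by Selenium is needed there.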

I mostly just copied and pasted from the sites I referred to and combined the pieces, but here is the source I ended up with.

scraping.py


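import time

import pandas as pd
from bs4 import BeautifulSoup  # the "lxml" parser below needs the lxml package installed
from selenium import webdriver

aaa = []
bbb = []
ccc = []

# Start the browser once and reuse it for every page.
# Note: webdriver.PhantomJS is no longer available in Selenium 4, so this needs an older Selenium.
driver = webdriver.PhantomJS()
try:
    for page in range(1, 2):  # Set the page limit as appropriate
        driver.get("https://www.~~=page=" + str(page))
        # page_source holds the HTML after the page's JavaScript has run
        soup = BeautifulSoup(driver.page_source, "lxml")

        for o in soup.find_all("h3", class_="hoge"):  # I see this a lot, but why does everyone call it hoge?
            aaa.append(o.string)

        for o1 in soup.find_all("h3", class_="hoge"):  # Why hoge?
            bbb.append(o1.string)

        for o2 in soup.find_all("div", class_="hoge"):  # What...?
            ccc.append(o2.get_text())

        time.sleep(3)  # Pause between pages so the site is not hammered
finally:
    driver.quit()  # Always shut the browser down, even if an error occurs

# The three lists must have the same length, or DataFrame() raises ValueError.
df = pd.DataFrame({"aaa": aaa, "bbb": bbb, "ccc": ccc})

print(df)
df.to_csv("hogehoge.csv", index=False, encoding="utf-8")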

There are still many parts I don't fully understand, but it works for now.

I will continue to study.
