[Python] Japanese text analysis using Janome, part1

This article is Day 7 of the Cloud Analytics Advent Calendar.

In this article, using the Cloudant data prepared on Day 3 and a notebook on Data Science Experience, I cover basic handling of Spark RDDs and run a simple aggregation over the Japanese sentences contained in the article titles using Janome. The article grew long, so I have split it into parts.

For the preparation of Node-RED and Cloudant, and for the data used this time, please refer to the Day 3 article. The notebook environment uses Data Science Experience (DSX); for details on DSX, refer to the Day 1 article.

Preparing to connect with Cloudant

Create a connection to the Cloudant database prepared earlier from add data assets, as shown in the figure below.

(Screenshot: add data assets)

When the right panel opens, select the Connections tab and create a new connection from Create Connection. (Screenshot: Create Connection)

Open the Create Connection screen and set the required items.

(Screenshot: setting each item, 1)

(Screenshot: setting each item, 2)

If you press the Create button, a connection to Cloudant's rss database will be created as shown below.

(Screenshot: the created Cloudant connection)

Cloudant is now ready. It's very easy.

Create Notebook

Next, create a new notebook for the analysis. Select add notebooks from the project screen to add one. (Screenshot: add notebooks)

This overlaps with what was explained on Day 1, but it is straightforward.

(Screenshot: notebook creation screen)

For Language you can also select the Python 2 series, but since we will be handling Japanese this time, I selected Python 3.5, which makes working with Unicode text easier.

For more on Data Science Experience, please refer to the Day 1 article.

Preparing Janome

The notebook will open automatically when you create it. To perform morphological analysis in Japanese, install Janome, a morphological analyzer available in Python.

Janome homepage: https://mocobeta.github.io/janome/

pip is available on Data Science Experience. Enter the following code in the first cell and execute it:

!pip install janome

For the basic usage of Jupyter such as code execution, refer to the article on the first day.

Running it installs Janome. (Screenshot: pip install output)

After installing Janome, let's run a simple test to check that Janome works normally. Enter the following code in a new cell and execute it:

from janome.tokenizer import Tokenizer
t = Tokenizer()
# A classic tokenizer test sentence: "plums and peaches are both in the peach family"
tokens = t.tokenize("すもももももももものうち")
for token in tokens:
    print(token)

Be careful with indentation when copying. If normal morphological analysis results are printed, it is working. (Screenshot: Janome test output)
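Each Token exposes attributes such as surface (the surface form) and part_of_speech (a comma-separated string), which the aggregation in part2 builds on. A minimal sketch of reading them:

# Print each token's surface form and its top-level part of speech
for token in t.tokenize("すもももももももものうち"):
    print(token.surface, token.part_of_speech.split(",")[0])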

Cloudant connection code

Once Janome is ready, we'll write the code to get the data from Cloudant. First, fill a cell with the Cloudant authentication data by calling up the Cloudant information registered earlier: open the find and add data menu in the upper right and, under Connections, find the news_rss connection you just registered. (Screenshot: find and add data menu)

If you press insert to code with a new cell selected, the cell is automatically filled in with the required information. (Screenshot: inserted credentials cell)
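For reference, the inserted cell defines a dictionary roughly like the following (values redacted here; the exact set of generated keys can vary, but host, username, and password are the ones used in the code below):

credentials_1 = {
  "host": "xxxx.cloudant.com",   # Cloudant host name
  "username": "xxxx",            # user to connect as
  "password": "xxxx"             # password
}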

Execute the filled-in cell to make the credentials_1 variable available. Next, create a DataFrame using Spark SQL. Enter the following code in a new cell and execute it:

from pyspark.sql import SQLContext

# Create an SQLContext from the existing SparkContext (sc)
sqlContext = SQLContext(sc)

# Load the rss database from Cloudant into a DataFrame
rss = sqlContext.read.format("com.cloudant.spark")\
    .option("cloudant.host", credentials_1["host"])\
    .option("cloudant.username", credentials_1["username"])\
    .option("cloudant.password", credentials_1["password"])\
    .load("rss")

This specifies the Cloudant data source format on sqlContext, passes the host name, connecting user, and password from the credentials above, and loads the data from the rss database.

If you get the exception Another instance of Derby may have already booted the database ..., restart the notebook kernel and re-execute the code from the first cell. Apache Derby appears to be used internally, and its connection handling sometimes does not go well ...

Run the following code to see the schema of the loaded data.

rss.printSchema()

(Screenshot: printSchema output)

Finally, let's check the contents of the data. Execute the following code to display the data.

rss.show()

(Screenshot: show() output)
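Since part2 will aggregate the Japanese text in the article titles, it can help to sanity-check just that column first. A minimal sketch, assuming the RSS documents carry a title field as prepared on Day 3:

# Show only the title column; truncate=False keeps long Japanese titles readable
rss.select("title").show(10, truncate=False)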

This is the end of part1. In part2, I will use PySpark and Janome to work with the Japanese text.
