Keyword extraction by MeCab (python)

The starting point is an Excel file exported from a certain DB. Each row is one record, and one field of each record holds free text. The theme this time is to extract the keywords used in that field, count the number of appearances of each keyword, and rank them by frequency.

The input and output are Excel files on Windows; the processing in between is done on a Mac.

What to prepare

Use your usual environment.

The data will be processed with pandas later, so I convert it to utf-8 and load it with pandas.

Export the xls file to csv

Do this from the Excel menu: test.xls -> test.csv

Convert the character encoding from sjis to utf-8

$ nkf -g test.csv
Shift_JIS
$ nkf -w test.csv > test_utf8.csv
$ nkf -g test_utf8.csv
UTF-8
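If nkf is not at hand, the same conversion can be sketched in Python 3 with only the standard library. The function name reencode_file is made up for this illustration, and the codec name cp932 (Python's name for the Windows variant of Shift_JIS) is an assumption about what Excel wrote; plain shift_jis may suit files from other sources better.

```python
def reencode_file(src_path, dst_path, src_encoding="cp932", dst_encoding="utf-8"):
    """Copy a text file, decoding with src_encoding and encoding with dst_encoding."""
    # newline="" keeps the original line endings untouched on both sides
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

Under these assumptions, reencode_file('test.csv', 'test_utf8.csv') would play the role of the nkf -w line above.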

Load csv with python

import pandas as pd 

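csv_file = 'test_utf8.csv'	# the utf-8 copy made with nkf above, not the original sjis file
# header=1: the second line of the file holds the column names
df = pd.read_csv(csv_file, encoding='utf-8', header=1)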
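To make explicit what header=1 does, here is a rough standard-library equivalent in Python 3: the first line of the file is skipped and the second line supplies the column names. The helper name read_comments and the sample text are invented for this sketch; the column name comment is the one used later in this article.

```python
import csv
import io

def read_comments(text, column="comment"):
    """Return the non-empty values of one column, treating line 2 as the header."""
    stream = io.StringIO(text, newline="")
    next(stream, None)                 # skip line 1, roughly what header=1 does
    reader = csv.DictReader(stream)    # line 2 becomes the column names
    return [row[column] for row in reader if row.get(column)]

# Made-up sample: one title line, then the header, then three records
sample = "exported from DB\nid,comment\n1,今日は晴れです\n2,\n3,本を読む\n"
print(read_comments(sample))
```

Record 2 has an empty comment and is dropped, which mirrors the way the counting code in this article skips cells that contain no text.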

Break down and count by noun and verb

install mecab

brew search mecab

pip search mecab
pip install mecab-python 

If the output ends with "... Successfully installed mecab-python-0.996", the installation is OK. MeCab can now be used from python (2.x series).

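# Note: the Japanese literals below need "# -*- coding: utf-8 -*-" on the first line of the script (python 2)
import MeCab

def count_word(df):
	e = df[u'comment']
	dic_n = {}	# noun -> number of appearances
	dic_v = {}	# verb -> number of appearances
	m = MeCab.Tagger('-Ochasen')	# Put the output in Chasen mode

	for s in e:
		# Skip cells that are not text (e.g. empty cells)
		if not isinstance(s, unicode):
			continue
		# Keep the encoded string in a variable while MeCab parses it (see the note in the reference section)
		s8 = s.encode('utf-8')
		print s8
		node = m.parseToNode(s8)
		while node:
			# The first feature field is the part of speech; Japanese dictionaries such as IPADIC give it in Japanese
			word = node.feature.split(',')[0]
			key = node.surface
			if word == '名詞':	# noun
				dic = dic_n
				print "<", key, "> (n)"
			elif word == '動詞':	# verb
				dic = dic_v
				print "<", key, "> (v)"
			else:
				node = node.next
				continue
			dic[key] = dic.get(key, 0) + 1
			node = node.next
	return dic_n, dic_v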
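For readers on Python 3, the counting step can be sketched with collections.Counter. To keep the example runnable without MeCab installed, it parses a hard-coded string laid out like MeCab's default output with the IPA dictionary (surface form, a tab, comma-separated features, and a final EOS line). That sample and the function name count_by_pos are assumptions for illustration, not the code used in this article. With MeCab available, the string would presumably come from MeCab.Tagger().parse(text).

```python
from collections import Counter

# Hand-written sample imitating MeCab's default output (IPA dictionary layout assumed)
SAMPLE_OUTPUT = (
    "本\t名詞,一般,*,*,*,*,本,ホン,ホン\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "読む\t動詞,自立,*,*,五段・マ行,基本形,読む,ヨム,ヨム\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "本\t名詞,一般,*,*,*,*,本,ホン,ホン\n"
    "EOS\n"
)

def count_by_pos(parsed, targets=("名詞", "動詞")):
    """Return {part of speech: Counter of surface forms} for the target parts of speech."""
    counts = {pos: Counter() for pos in targets}
    for line in parsed.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # sentence terminator or blank line
        surface, feature = line.split("\t", 1)
        pos = feature.split(",")[0]  # first feature field is the part of speech
        if pos in counts:
            counts[pos][surface] += 1
    return counts

counts = count_by_pos(SAMPLE_OUTPUT)
print(counts["名詞"].most_common())  # nouns, most frequent first
print(counts["動詞"].most_common())  # verbs, most frequent first
```

Counter.most_common() already returns the words in descending order of appearance, so no separate sort is needed for the ranking.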

Write to csv in descending order of appearance (utf-8)

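import csv

def write_to_csv(dic, csv_file):
	# Write "word,count" rows, most frequent first
	with open(csv_file, 'w') as f:
		writer = csv.writer(f, lineterminator='\n')
		# Sort by value (number of appearances), descending
		for k, v in sorted(dic.items(), key=lambda x: x[1], reverse=True):
			print k, v
			writer.writerow([k, v])

dic_n, dic_v = count_word(df)	# df is the DataFrame loaded from the csv earlier
write_to_csv(dic_n, 'test_dic_n_utf8.csv')
write_to_csv(dic_v, 'test_dic_v_utf8.csv')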

Convert to sjis

$ nkf -g test_dic_n_utf8.csv 
UTF-8
$ nkf -s test_dic_n_utf8.csv > test_dic_n_sjis.csv
$ nkf -g test_dic_n_sjis.csv 
Shift_JIS
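In Python 3 the nkf step can probably be skipped by writing the ranking in Shift_JIS directly. The sketch below assumes the counts are held in a plain dict and uses cp932 as the codec name. write_ranking is a hypothetical helper, and errors="replace" is a choice made here so that characters outside Shift_JIS become ? instead of raising an error.

```python
import csv

def write_ranking(counts, csv_path, encoding="cp932"):
    """Write word,count rows to csv_path, most frequent word first."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", encoding=encoding, errors="replace", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(ranked)
    return ranked
```

Something like write_ranking(noun_counts, 'test_dic_n_sjis.csv'), where noun_counts is a placeholder for the noun dictionary, would then produce a file that Excel on Windows can open as is.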

Convert to xls format

Open test_dic_n_sjis.csv in Excel and save it as xls.

That's all.

Reference site

http://qiita.com/tstomoki/items/f17c04bd18699a6465be
http://qiita.com/ysk_1031/items/7f0cfb7e9e4c4b9129c9
http://salinger.github.io/blog/2013/01/17/1/ [^1]

[^1]: This site carries the following caution: "If you want to handle Unicode strings in MeCab, you need to encode them once. If you write node = tagger.parseToNode(string.encode("utf-8")), the encoded string may be garbage collected during parsing and MeCab may behave strangely. There is no problem if you assign it to a variable first."
