[PYTHON] Try using the Chinese morphological analysis engine jieba

Chinese morphological analysis engine jieba

I tried out jieba using its Python version. [Implementations in other programming languages are also available](https://github.com/fxsjy/jieba#%E5%85%B6%E4%BB%96%E8%AF%AD%E8%A8%80%E5%AE%9E%E7%8E%B0).

Installation

$ pip install jieba

Text segmentation

>>> import jieba
>>> text = "我明天去东京大学上课。早上十点开始。"
# "I will attend a class at the University of Tokyo tomorrow. It starts at 10 o'clock in the morning."

jieba.cut and jieba.cut_for_search return generators; jieba.lcut and jieba.lcut_for_search return lists.
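The practical difference between a generator and a list can be sketched without jieba itself. The `toy_cut` and `toy_lcut` functions below are hypothetical stand-ins that simply split on whitespace; they are not part of jieba and only show that a generator is consumed by its first pass while a list can be reused.

```python
from typing import Iterator, List


def toy_cut(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a generator-returning segmenter (not jieba)."""
    for token in text.split():
        yield token


def toy_lcut(text: str) -> List[str]:
    """Hypothetical stand-in for the list-returning variant."""
    return list(toy_cut(text))


segments = toy_cut("a b c")
first_pass = list(segments)   # consumes the generator
second_pass = list(segments)  # nothing is left to yield

tokens = toy_lcut("a b c")    # a list can be iterated any number of times
print(first_pass, second_pass, tokens)
```

If the segments are needed more than once, using the list-returning variants, or wrapping the generator in list() as the transcripts here do, avoids the empty second pass.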

Accurate Mode

>>> segments = jieba.cut(text)
>>> list(segments)
['我', '明天', '去', '东京大学', '上课', '。', '早上', '十点', '开始', '。']
>>> segments = jieba.lcut(text)
>>> segments
['我', '明天', '去', '东京大学', '上课', '。', '早上', '十点', '开始', '。']

**University of Tokyo** (东京大学) is kept as a single word.

Full Mode

Set cut_all=True.

>>> segments = jieba.cut(text, cut_all=True)
>>> list(segments)
['我', '明天', '去', '东京', '东京大学', '大学', '学上', '上课', '。', '早上', '十点', '开始', '。']
>>> segments = jieba.lcut(text, cut_all=True)
>>> segments
['我', '明天', '去', '东京', '东京大学', '大学', '学上', '上课', '。', '早上', '十点', '开始', '。']
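The overlapping tokens in full mode are easier to understand with a small sketch: scan every start position and emit every dictionary word that begins there. The `full_mode` function, the toy dictionary, and the `max_len` limit below are assumptions for illustration, not jieba's implementation.

```python
def full_mode(text, dictionary, max_len=4):
    """Emit every dictionary word found in the text, overlaps included (toy sketch)."""
    tokens = []
    covered_until = 0  # index just past the last character covered by an emitted token
    for start in range(len(text)):
        found = False
        for end in range(start + 2, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in dictionary:
                tokens.append(word)
                covered_until = max(covered_until, end)
                found = True
        # Fall back to a single character only when no word covers this position.
        if not found and start >= covered_until:
            tokens.append(text[start])
            covered_until = start + 1
    return tokens


# Hypothetical mini dictionary, chosen only for illustration.
toy_dict = {"明天", "东京", "东京大学", "大学", "学上", "上课"}
result = full_mode("我明天去东京大学上课", toy_dict)
print(result)
```

Because every match is reported, the output contains 东京, 东京大学, and 大学 side by side, which is useful for recall but is not a partition of the sentence.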

Search Engine Mode

>>> segments = jieba.cut_for_search(text)
>>> list(segments)
['我', '明天', '去', '东京', '大学', '东京大学', '上课', '。', '早上', '十点', '开始', '。']
>>> segments = jieba.lcut_for_search(text)
>>> segments
['我', '明天', '去', '东京', '大学', '东京大学', '上课', '。', '早上', '十点', '开始', '。']
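Search engine mode can be thought of as accurate mode plus extra short words cut out of the long ones. The sketch below assumes the text has already been segmented, and adds any 2- and 3-character dictionary words found inside each longer token before the token itself. The `search_mode` function and the toy dictionary are hypothetical names used only for this example.

```python
def search_mode(tokens, dictionary):
    """Add shorter dictionary words found inside long tokens (toy sketch)."""
    result = []
    for word in tokens:
        for n in (2, 3):
            if len(word) > n:
                for i in range(len(word) - n + 1):
                    gram = word[i:i + n]
                    if gram in dictionary:
                        result.append(gram)
        result.append(word)  # the original token always follows its sub-words
    return result


# Tokens as accurate mode would return them; the dictionary is a toy assumption.
accurate_tokens = ["我", "明天", "去", "东京大学", "上课"]
toy_dict = {"东京", "大学"}
result = search_mode(accurate_tokens, toy_dict)
print(result)
```

Indexing both the long word and its parts lets a search for either 东京 or 大学 match a document that contains 东京大学.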

Keyword extraction

>>> import jieba.analyse
>>> text = '''
...The progress of globalization is constantly accelerating, the human race is facing the front, and the daily profits are sharply challenged. This is a trivial challenge, each kind of talent demanding power production, dedication dedication, joint conquest this trivial globalization problem. Under the background of this kind of background, he is a talented person who works as a leader, and is reluctantly assigned to the University of Tokyo. Infinite courage after our general courage, wisdom and assignment feeling, direct opposition to this trivial challenge.
...Academic scholarship, academic discipline, scholarship, scholarship, scholarship, scholarship. Opponents of the scholarship on the road. The University of Tokyo's unscrupulous national office, this is a trivial student, a scholar-provided long-term soil, a good place to build a society.
...The University of Tokyo is now in a straightforward position, and it is a unique scholarly point of eastern and western culture, uninterrupted development, an eye-opening world, and a unique flag. The future of the outpost, the future of the prospects, the University of Tokyo's aspirations, and the talented people of each ceremony. University of Tokyo, national world, culture, breakthrough of barriers, new area science research transcendental literary world limit, industry-government-academia collaboration exhibition. This is the first target, the demand for the neck, the excellence, the internationality, and the dual-purpose research student's institute, and the parallel exhibition....The University of Tokyo's decree, the University of Tokyo's power, world peace, humanity and welfare production, timeless offering. The modern social development, the demand for ourselves, the demand for the development of the era, the scholarship research, the new era. At the same time, the system reform is not possible or is not possible. At the same time as the reform of the education of the undergraduate students, the research student's institutional fundamental transformation, the messenger's knowledge, and the independent intentions can be realized. In addition to this, the reform of the personnel system for promoting demand, the equality of men and women, the equality of men and women, the qualitative meeting of the human resources, and the qualitative nature of the human resources. Unreasonable one-problem problem, promotion The above-mentioned reforming premise, the above-mentioned reforming premise, the social credibility, the credibility of the scholarship, the scholarship of the scholarship, the scholarship of the scholarship, and the scholarship of the scholarship.
...The University of Tokyo, which has been constantly in the process of success, has been developed by the University of Tokyo, and has been established by the University of Tokyo.
... '''

The text is the Chinese version of the University of Tokyo President's message.

Extraction by tf-idf value

>>> keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=False, allowPOS=())
>>> keywords
['University of Tokyo', 'Persistent', 'Confidence', 'Science', 'Challenge', 'Human talent', 'Physics', 'Knowledge', 'Graduate School', '爱', 'Science研究', 'Shinshin', 'Promotion', 'Globalization', 'reform', 'Kaken', 'This trivial', 'Powerful', 'Feeling of joy', 'Ritsu']

Looks good. The characters differ a little from Japanese kanji, but they are mostly readable.
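For readers unfamiliar with TF-IDF, the ranking idea can be sketched in a few lines: a word's score is its relative frequency in the text multiplied by an inverse document frequency (IDF) weight that is low for common words. The `tfidf_keywords` function, the IDF values, and the `default_idf` fallback below are assumptions for illustration; real IDF weights come from a large corpus.

```python
from collections import Counter


def tfidf_keywords(tokens, idf, default_idf, top_k=3):
    """Rank words by term frequency times IDF (toy sketch)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest score first; ties are broken by the word itself for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical segmented text and IDF table.
tokens = ["大学", "大学", "的", "的", "的", "改革", "研究"]
toy_idf = {"大学": 5.0, "的": 0.1, "改革": 6.0, "研究": 4.0}
keywords = tfidf_keywords(tokens, toy_idf, default_idf=4.5)
print(keywords)
```

Although 的 is the most frequent token here, its very low IDF pushes it out of the top results, which is why function words rarely appear among extracted keywords.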

Extraction based on TextRank

>>> keywords = jieba.analyse.textrank(text, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
>>> keywords
['Strategy', 'Knowledge', 'Shinshin', 'Science', 'Exhibition', 'demand', 'reform', 'Human talent', 'Promotion', 'Kaken', 'Challenge', 'Actual', 'Area', 'Will', 'society', 'Science研究', 'Mankind', 'culture', 'Physics', 'Courage']
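TextRank ranks words differently: it builds a graph of words that occur near each other and runs a PageRank-style iteration over it. The sketch below is a simplified version under stated assumptions: the `textrank` function name, the window size, the damping factor, and the iteration count are chosen for illustration, and the scores are left unnormalized.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=20):
    """Rank words by PageRank over a co-occurrence graph (toy sketch)."""
    # Undirected weighted graph: words that appear within `window` positions are linked.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        updated = {}
        for word in weights:
            # Each neighbor passes on its score in proportion to the edge weight.
            incoming = sum(
                weights[n][word] / sum(weights[n].values()) * scores[n]
                for n in weights[word]
            )
            updated[word] = (1 - damping) + damping * incoming
        scores = updated
    return sorted(scores, key=lambda w: (-scores[w], w))


# Hypothetical segmented text in which 大学 neighbors every other word.
ranking = textrank(["大学", "改革", "大学", "研究", "大学", "教育"])
print(ranking)
```

Unlike TF-IDF, this needs no external corpus statistics; a word ranks highly because it is connected to many other words in the same text.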

Other

jieba has many other features, such as customizing the dictionary and tagging parts of speech, so it is best to check the official repository for details. The first half of README.md is in Chinese, but the second half (https://github.com/fxsjy/jieba#jieba-1) is an English translation.

The author has nothing to do with the University of Tokyo.
