Collecting information from Twitter with Python (morphological analysis with MeCab)

About morphological analysis

The most common way to use Tweets collected from many users is to extract specific words contained in each Tweet and work with those words.

This time, we use MeCab, a morphological analyzer, to split text into individual words and extract the nouns, verbs, and adjectives.

MeCab output format

The output format depends on the option passed to MeCab:

* 'mecabrc': default format
* '-Ochasen': ChaSen-compatible format
* '-Owakati': word-separated output only
* '-Oyomi': readings only
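As a rough illustration of how these options are used: the option string is passed to MeCab.Tagger, and in '-Owakati' mode parse() returns one line of space-separated words. The sketch below is written for Python 3. The helper names split_wakati and wakati_tokens are hypothetical, and the MeCab call is kept inside a function so the string-handling part runs even where MeCab is not installed.

```python
def split_wakati(output):
    """Split '-Owakati' style output (space-separated words) into a list."""
    return output.split()


def wakati_tokens(text):
    """Tokenize text with MeCab in '-Owakati' mode.

    Assumes the MeCab Python binding and a dictionary are installed.
    """
    import MeCab  # imported lazily so split_wakati works without MeCab

    tagger = MeCab.Tagger("-Owakati")
    return split_wakati(tagger.parse(text))


# Hand-written example of what '-Owakati' output looks like (illustrative only)
sample_output = "ライ麦 畑 の つかまえ 役 \n"
print(split_wakati(sample_output))
```

With MeCab installed, wakati_tokens(text) would return the same kind of list directly from raw text.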

By default, each output line has the following format (\t stands for a tab character):

Surface form\tPart of speech,POS subcategory 1,POS subcategory 2,POS subcategory 3,Conjugation type,Conjugation form,Base form,Reading,Pronunciation
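To make the default format concrete, here is a minimal sketch that splits one output line into its surface form and named feature fields. It is written for Python 3 and does not call MeCab. The function name parse_default_line and the English field labels are my own, and the sample line is a hand-written example of typical IPADIC output, not captured from a real run.

```python
# English labels for the comma-separated feature fields (labels are assumptions)
FIELD_NAMES = [
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conjugation_type", "conjugation_form",
    "base_form", "reading", "pronunciation",
]


def parse_default_line(line):
    """Turn 'surface<TAB>feature,feature,...' into a dict.

    Returns None for lines without a tab, such as the final 'EOS' line.
    """
    surface, tab, feature = line.rstrip("\n").partition("\t")
    if not tab:
        return None
    values = feature.split(",")
    # Unknown words can carry fewer fields, so pad the rest with '*'
    values += ["*"] * (len(FIELD_NAMES) - len(values))
    entry = dict(zip(FIELD_NAMES, values))
    entry["surface"] = surface
    return entry


# Hand-written sample line in the default format (illustrative only)
sample_line = "なり\t動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ"
entry = parse_default_line(sample_line)
print(entry["surface"], entry["pos"], entry["base_form"])
```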

Sample code

The following program splits a sentence into words (surface forms, exactly as they appear in the text) and extracts four lists: all words, nouns, verbs, and adjectives.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Note: written for Python 2 (print statements, explicit str/unicode conversion)

import MeCab

### Constants
MECAB_MODE = 'mecabrc'
PARSE_TEXT_ENCODING = 'utf-8'

### Functions
def main():
    # Japanese sample sentence, reconstructed from the tokens in the output listing
    sample_u = u"ライ麦畑のつかまえ役、そういったものに僕はなりたいんだよ。馬鹿げてることは知ってるよ。でも、ほんとうになりたいものといったらそれしかないね。"
    words_dict = parse(sample_u)
    print "All:", ",".join(words_dict['all'])
    print "Nouns:", ",".join(words_dict['nouns'])
    print "Verbs:", ",".join(words_dict['verbs'])
    print "Adjs:", ",".join(words_dict['adjs'])
    return


def parse(unicode_string):
    tagger = MeCab.Tagger(MECAB_MODE)
    # Parsing misbehaves unless the input is a str (byte string),
    # so encode the unicode string first
    text = unicode_string.encode(PARSE_TEXT_ENCODING)
    node = tagger.parseToNode(text)

    words = []
    nouns = []
    verbs = []
    adjs = []
    while node:
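        # The first feature field is the part of speech
        pos = node.feature.split(",")[0]
        # Decode the surface form back to unicode
        word = node.surface.decode(PARSE_TEXT_ENCODING)
        # POS tags are in Japanese, assuming a Japanese dictionary such as IPADIC
        if pos == "名詞":  # noun
            nouns.append(word)
        elif pos == "動詞":  # verb
            verbs.append(word)
        elif pos == "形容詞":  # adjective
            adjs.append(word)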
        words.append(word)
        node = node.next
    parsed_words_dict = {
        # Drop the first and last nodes (BOS/EOS), whose surface is an empty string
        "all": words[1:-1],
        "nouns": nouns,
        "verbs": verbs,
        "adjs": adjs
        }
    return parsed_words_dict

### Execute
if __name__ == "__main__":
    main()

Output result

(twi-py)$ python tweet_parser.py
All: ライ麦,畑,の,つかまえ,役,、,そういった,もの,に,僕,は,なり,たい,ん,だ,よ,。,馬鹿げ,てる,こと,は,知っ,てる,よ,。,でも,、,ほんとう,に,なり,たい,もの,と,いっ,たら,それ,しか,ない,ね,。
Nouns: ライ麦,畑,役,もの,僕,ん,こと,ほんとう,もの,それ
Verbs: つかまえ,なり,馬鹿げ,てる,知っ,てる,なり,いっ
Adjs: ない

Conclusion

You can now extract words by passing a retrieved Tweet to parse().
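One way this might be used on a batch of Tweets is to count how often each noun appears. The sketch below is written for Python 3 and takes the parsing function as an argument, so it runs without MeCab. The names count_nouns and fake_parse are hypothetical, and fake_parse is only a stand-in that returns a dict of the same shape as parse().

```python
from collections import Counter


def count_nouns(tweets, parse):
    """Count nouns over many Tweets using a parse()-like function."""
    counts = Counter()
    for tweet in tweets:
        counts.update(parse(tweet)["nouns"])
    return counts


def fake_parse(text):
    """Stand-in for parse(): treats every whitespace-separated token as a noun."""
    tokens = text.split()
    return {"all": tokens, "nouns": tokens, "verbs": [], "adjs": []}


tweets = ["MeCab Python", "Python Twitter"]
print(count_nouns(tweets, fake_parse).most_common(1))  # [('Python', 2)]
```

In real use, the MeCab-based parse() from the sample code would be passed in place of fake_parse.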

This sample code uses the surface form from node.surface. If you want to normalize words whose endings change, such as verbs, you can use the base form (original form) included in node.feature instead.
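A minimal sketch of that normalization, assuming the default feature layout in which the base form is the seventh comma-separated field (index 6). It is written for Python 3 and works on plain strings rather than MeCab nodes. The function name base_form is hypothetical and the sample feature strings are hand-written examples.

```python
BASE_FORM_INDEX = 6  # position of the base form in the default feature string


def base_form(surface, feature):
    """Return the base form from a feature string, or the surface form
    when the base form is missing or given as '*'."""
    fields = feature.split(",")
    if len(fields) > BASE_FORM_INDEX and fields[BASE_FORM_INDEX] != "*":
        return fields[BASE_FORM_INDEX]
    return surface


# Inflected verb: the base form field holds the dictionary form
print(base_form("なり", "動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ"))
# Word with no base form recorded: fall back to the surface form
print(base_form("Twitter", "名詞,固有名詞,組織,*,*,*,*"))
```

Falling back to the surface form keeps words without a recorded base form in the result instead of turning them into '*'.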
