[python] Decompose the acquired Twitter timeline into morphemes with MeCab

Purpose

Each Twitter timeline is saved as a txt file, and the timelines for multiple users are stored in a single folder. The goal this time is to run morphological analysis on all of these files with MeCab.

Background / preparation

Get timeline

I obtained the timelines as described in a previous article: [python] Get Twitter timeline for multiple users.

Preparing for MeCab

For morphological analysis, I use the morphological analysis engine 'MeCab'. For installing MeCab on a Mac, I referred to an existing installation guide.

Implementation

  1. Get the list of file names in the folder as a Python list

  2. A function that creates a list of timelines from a filename list

  3. Morphological analysis function

  4. Morphological analysis of all files in the folder

1. Get the list of file names in the folder

The folder 'timelines' contains all the txt files you want to work with. Store these file names (strings) in the list 'file_names'.

import glob

# Collect the paths of all files in the timelines folder
file_names = []

files = glob.glob("./timelines/*")
for file in files:
    file_names.append(file)

The obtained file_names has the following form.

['./timelines/20191210_user0_***.txt',..,'./timelines/20191210_user199_***.txt']
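Note that glob does not guarantee any particular order for the returned paths. If the processing order matters, one option (a minimal tweak, not part of the original steps) is to sort the list:

file_names = sorted(glob.glob("./timelines/*"))  # process users in a fixed, sorted order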

2. A function that creates a list of timelines from a filename list

timelines.py



def timelines(file_list):
    timelines = []
    for file in file_list:
        # Read each timeline file; 'with' closes the file handle properly
        with open(file) as f:
            text = f.read()
        timelines.append([text])
    return timelines
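Before moving on, a quick sanity check (a minimal sketch using the file_names list from step 1) that every file was read:

the_timelines = timelines(file_names)
print(len(the_timelines))        # should equal the number of txt files
print(len(the_timelines[0][0]))  # character count of the first user's timeline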

3. Morphological analysis function

Define a function for morphological analysis. The argument is a string, and the return value is a list of morphological analysis results, with one [surface, part of speech, subclass 1, subclass 2, base form] entry per morpheme.

mecab_list.py


import MeCab

def mecab_list(text):
    tagger = MeCab.Tagger("-Ochasen")
    # Workaround for the Python 3 issue where node.surface cannot be obtained otherwise
    tagger.parse('')
    node = tagger.parseToNode(text)
    mecab_output = []
    while node:
        word = node.surface
        wclass = node.feature.split(',')
        if wclass[0] != 'BOS/EOS':
            # Unknown words may have no base form (7th feature field)
            if len(wclass) <= 6 or wclass[6] == '*':
                mecab_output.append([word, wclass[0], wclass[1], wclass[2], ""])
            else:
                mecab_output.append([word, wclass[0], wclass[1], wclass[2], wclass[6]])
        node = node.next
    return mecab_output

Let's check that the 'mecab_list' function works. The test sentence 「昨日飼い始めた猫はよく食べる。」 means "The cat I started keeping yesterday eats a lot."


print(mecab_list('昨日飼い始めた猫はよく食べる。'))
'''
result
[['昨日', '名詞', '副詞可能', '*', '昨日'], ['飼い', '動詞', '自立', '*', '飼う'], ['始め', '動詞', '非自立', '*', '始める'], ['た', '助動詞', '*', '*', 'た'], ['猫', '名詞', '一般', '*', '猫'], ['は', '助詞', '係助詞', '*', 'は'], ['よく', '副詞', '一般', '*', 'よく'], ['食べる', '動詞', '自立', '*', '食べる'], ['。', '記号', '句点', '*', '。']]
'''

There seems to be no problem.
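Since each element of the result is [surface, part of speech, subclass 1, subclass 2, base form], the output is easy to filter. As a small illustration (not part of the original steps), pulling out only the nouns (名詞):

nouns = [m[0] for m in mecab_list('昨日飼い始めた猫はよく食べる。') if m[1] == '名詞']
print(nouns)  # ['昨日', '猫']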

4. Morphological analysis of all files in the folder

mecab_results_list = []
the_timelines = timelines(file_names)

# Run morphological analysis on every user's timeline
for the_timeline in the_timelines:
    mecab_result = []
    for twt in the_timeline:
        mecab_result.append(mecab_list(twt))
    mecab_results_list.append(mecab_result)
print(mecab_results_list)
#result (excerpt)
[[[['w', '記号', 'アルファベット', '*', 'w'], ['まだ', '副詞', '助詞類接続', '*', 'まだ'],..,]]]]

I got the result I wanted.
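To keep track of which result belongs to which user, one option (an illustrative sketch, assuming the file_names list from step 1) is to pair the results with their file names:

# Map each source file name to its morphological analysis result
results_by_file = dict(zip(file_names, mecab_results_list))
for name in list(results_by_file)[:3]:
    print(name, len(results_by_file[name][0]))  # number of morphemes in each user's timeline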

Environment

macOS Catalina, Jupyter Notebook
