If you want to count words in Python, it's convenient to use Counter.

I came across it while playing with MeCab and thought it was nice, so here is a note.

Whether the input is text or CSV, every so often you want to write code that counts how many times each element occurs in a list that contains duplicates. If you implement it straightforwardly with a dictionary:


data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']

word_and_counts = {}
for word in data:
    if word_and_counts.has_key(word):
        word_and_counts[word] += 1
    else:
        word_and_counts[word] = 1
for w, c in sorted(word_and_counts.iteritems(), key=lambda x: x[1], reverse=True):
    print w, c  # =>
                #   aaa 2
                #   bbb 1
                #   ccc 1
                #   ddd 1

That is roughly how it turns out.

In cases like this, the collections module comes in handy. Here is the same thing reimplemented with collections.Counter.

from collections import Counter

data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
counter = Counter(data)
for word, cnt in counter.most_common():
    print word, cnt # =>
                    #   aaa 2
                    #   bbb 1
                    #   ccc 1
                    #   ddd 1

That is much more concise. It is also likely to be fast, since it ships with the standard library. On top of that, Counter supports several operators and other convenient methods.

from collections import Counter

dataA = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
dataB = ['aaa', 'bbb', 'bbb', 'bbb', 'abc']

counterA = Counter(dataA)
counterB = Counter(dataB)

counter = counterA + counterB  # Counts can be added together
counterA.subtract(counterB)  # Subtract counts element by element (modifies counterA in place)
counter.most_common(3)  # Get the top 3 elements (if the argument n is omitted, as in the example above, all elements are returned in descending order of count)
# ...and several others
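
To make the effect of these operations concrete, here is a small sketch (written for Python 3) that prints what each one produces for the same two lists. The `-`, `&` and `|` operators are not in the snippet above; I include them as examples of the other operations Counter offers.

```python
from collections import Counter

counter_a = Counter(['aaa', 'bbb', 'ccc', 'aaa', 'ddd'])
counter_b = Counter(['aaa', 'bbb', 'bbb', 'bbb', 'abc'])

# + adds the counts element by element
total = counter_a + counter_b
print(total['aaa'], total['bbb'], total['abc'])  # 3 4 1

# most_common(n) returns the n most frequent (element, count) pairs
print(total.most_common(2))  # [('bbb', 4), ('aaa', 3)]

# - subtracts counts and drops anything that is not positive
print(dict(counter_a - counter_b))  # {'aaa': 1, 'ccc': 1, 'ddd': 1}

# & keeps the minimum of each count, | keeps the maximum
print(dict(counter_a & counter_b))  # {'aaa': 1, 'bbb': 1}
print((counter_a | counter_b)['bbb'])  # 3

# subtract() works in place and keeps zero and negative counts
diff = counter_a.copy()
diff.subtract(counter_b)
print(diff['bbb'], diff['abc'])  # -2 -1
```

Note the difference between the `-` operator and subtract(): the operator returns a new Counter with only positive counts, while subtract() changes the Counter it is called on and can leave negative values behind.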

Any hashable object can be counted, so there are probably other good uses for it too.
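
For example, tuples are hashable, so adjacent word pairs can be counted in exactly the same way. A minimal sketch (Python 3) with a made-up word list:

```python
from collections import Counter

words = ['the', 'cat', 'saw', 'the', 'cat', 'run']

# zip(words, words[1:]) yields each adjacent pair as a tuple
pair_counts = Counter(zip(words, words[1:]))
print(pair_counts.most_common(1))  # [(('the', 'cat'), 2)]

# A string is an iterable of characters, so this counts letters
letter_counts = Counter('mississippi')
print(letter_counts['s'], letter_counts['i'], letter_counts['p'])  # 4 4 2
```

A list cannot be used as an element because it is not hashable; convert it to a tuple first.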

The collections module also has several other useful-looking classes, so it is worth reading through its documentation at least once.
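
As one example of those other classes, defaultdict(int) sits halfway between the hand-written dictionary loop and Counter: missing keys start at 0, so the has_key check disappears. A minimal sketch (Python 3), reusing the list from the first example:

```python
from collections import defaultdict

data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']

# int() returns 0, so every new key starts with a count of 0
word_counts = defaultdict(int)
for word in data:
    word_counts[word] += 1

# Sort by count, highest first (defaultdict has no most_common())
ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])  # ('aaa', 2)
```

For plain counting, Counter is still the shorter choice; defaultdict becomes more interesting when the default value is something other than a number, such as a list.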

Finally, here is the code I wrote with Counter to try MeCab on my downloaded Twitter archive.

# -*- coding: utf-8 -*-

from collections import Counter
import codecs
import json

import MeCab


# This feels like a bad hack, but I want to be able to redirect the output
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# codecs.open returns unicode
# The first line of the file has some extra text; this is throwaway test code and handling it is a hassle, so delete that line beforehand
_tweetfile = codecs.open('./data/js/tweets/2013_09.js', 'r', 'sjis')
tweets = json.load(_tweetfile)
# Encode, because MeCab only accepts str
texts = (tw['text'].encode('utf-8') for tw in tweets)

tagger = MeCab.Tagger('-Ochasen')
counter = Counter()
for text in texts:
    nodes = tagger.parseToNode(text)
    while nodes:
        if nodes.feature.split(',')[0] == '名詞':  # '名詞' (noun); assumes a Japanese dictionary such as IPADIC
            word = nodes.surface.decode('utf-8')
            counter[word] += 1
        nodes = nodes.next
for word, cnt in counter.most_common():
    print word, cnt

The noun check is crude and symbols slip into the results, but it works well enough for now. I'm happy.
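
One possible way to keep symbols out is to skip any surface that contains no letter or digit. The sketch below (Python 3) leaves MeCab out so that it runs on its own. The `parsed` list is made-up stand-in data for (surface, feature) pairs, not real MeCab output; it assumes Japanese part-of-speech tags such as '名詞' (noun). `is_countable_noun` is a hypothetical helper, and whether this filter suits your data is something to check.

```python
from collections import Counter

# Made-up stand-in for (node.surface, node.feature) pairs; not real MeCab output
parsed = [
    ('猫', '名詞,一般'),
    ('が', '助詞,格助詞'),
    ('@', '名詞,サ変接続'),
    ('猫', '名詞,一般'),
    ('Python', '名詞,固有名詞'),
]


def is_countable_noun(surface, feature):
    """Hypothetical helper: a noun whose surface has at least one letter or digit."""
    is_noun = feature.split(',')[0] == '名詞'
    has_alnum = any(ch.isalnum() for ch in surface)
    return is_noun and has_alnum


counter = Counter()
for surface, feature in parsed:
    if is_countable_noun(surface, feature):
        counter[surface] += 1  # a missing key counts as 0, so no setup is needed

print(counter.most_common())  # [('猫', 2), ('Python', 1)]
```

The `counter[surface] += 1` line works without any initialization because a Counter returns 0 for keys it has not seen yet, which is the same property the MeCab loop above relies on.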


I've collected tricks like this in another post, so take a look if you like: (Frequent idioms that make Python code a little cleaner just by remembering them)
