Extracting high-frequency words with NLTK (Python)

While reading the official NLTK (Natural Language Toolkit) documentation, I tried extracting the words that appear most often in a text. For now I got as far as listing the most frequent keywords in a sample text in descending order, so I am leaving these notes as a memo.

Development environment

NLTK installation

As with other libraries, start by installing it with pip.

$ pip install nltk

Extract high-frequency words

The general flow is as follows: 1) download the resources needed for word tokenization and part-of-speech tagging, 2) read the sample text and split it into words, 3) tag each word with its part of speech and keep only the nouns, and 4) display the three most frequently used words.
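The flow above can be sketched as one small function. This is only an illustration, not the code used later in this memo: the name top_nouns and the toy tokenizer and tagger are hypothetical stand-ins so that the sketch runs without any NLTK data, and collections.Counter stands in for the frequency count.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)


def top_nouns(raw: str,
              tokenize: Callable[[str], List[str]],
              tag: Callable[[List[str]], Iterable[Tagged]],
              n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent words tagged 'NN' in raw (hypothetical helper)."""
    tokens = [w.lower() for w in tokenize(raw)]                 # step 2: split and lowercase
    nouns = [word for word, pos in tag(tokens) if pos == "NN"]  # step 3: keep nouns only
    return Counter(nouns).most_common(n)                        # step 4: top n by count


# Toy stand-ins (assumptions for illustration only).
TOY_TAGS = {"cat": "NN", "dog": "NN", "sat": "VBD", "the": "DT"}


def toy_tokenize(text: str) -> List[str]:
    return text.split()


def toy_tag(tokens: List[str]) -> List[Tagged]:
    return [(t, TOY_TAGS.get(t, "XX")) for t in tokens]


result = top_nouns("The cat sat the dog the cat", toy_tokenize, toy_tag, n=2)
print(result)
```

With NLTK installed and its resources downloaded, nltk.word_tokenize and nltk.pos_tag could be passed in place of the toy functions.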

Download required features

nltk_test.py


import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

After importing nltk, download the resources used for word tokenization (punkt) and part-of-speech tagging (averaged_perceptron_tagger) with nltk.download(). Once they have been downloaded in an environment, they do not need to be downloaded again. Running the download a second time just prints a message such as Package punkt is already up-to-date!.
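If the repeated download messages are a nuisance, one possible approach is to look for each resource first and download it only when it is missing. The sketch below is an assumption on my part rather than part of the original script: ensure_nltk_resources is a hypothetical helper, and the data paths in RESOURCES are the ones I would expect for these two packages. Newer NLTK releases may ask for differently named resources (for example punkt_tab), so treat the names as something to verify against the error message you actually get.

```python
# Resource name -> expected path under the NLTK data directory (assumed layout).
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
}


def ensure_nltk_resources(resources=RESOURCES):
    """Download each resource only if nltk.data.find() cannot locate it.

    Returns the list of resource names that had to be downloaded.
    """
    import nltk  # imported here so the sketch can be loaded even without NLTK

    downloaded = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)             # raises LookupError when missing
        except LookupError:
            nltk.download(name, quiet=True)  # quiet=True suppresses progress output
            downloaded.append(name)
    return downloaded
```

Calling ensure_nltk_resources() at the top of nltk_test.py would then replace the two unconditional nltk.download() calls.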

Get sample text and convert it to word-separated

nltk_test.py


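# Read the sample text; the with-block closes the file afterwards.
with open('sample.txt') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)  # split the text into word tokens
text = nltk.Text(tokens)

# Lowercase every token so that "The" and "the" count as the same word.
tokens_l = [w.lower() for w in tokens]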

Prepare an English text, such as an essay or some other long passage, in advance as sample.txt. After reading it, split it into words with word_tokenize(). Then convert every token to lowercase so that the same word is counted as a single item regardless of capitalization.
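To see why the lowercasing step matters, here is a small self-contained comparison. The token list is made up for illustration, and collections.Counter is used as a stand-in for the counting so that nothing needs to be downloaded.

```python
from collections import Counter

# A made-up token list in which the same words appear with different capitalization.
tokens = ["The", "cat", "saw", "the", "other", "Cat"]

as_is = Counter(tokens)                       # "The" and "the" are separate keys
lowered = Counter(w.lower() for w in tokens)  # case differences are merged

print(as_is["the"], as_is["The"])      # counts before lowercasing
print(lowered["the"], lowered["cat"])  # counts after lowercasing
```

Without lowercasing, each capitalization variant gets its own count, which would split the frequency of a word that also appears at the start of sentences.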

Extract only nouns after getting part of speech

nltk_test.py


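pos = nltk.pos_tag(tokens_l)  # list of (word, tag) pairs
# 'NN' is a singular common noun; == avoids the substring test that `in ('NN')` performs
only_nn = [x for (x, y) in pos if y == 'NN']

freq = nltk.FreqDist(only_nn)  # count how often each noun occurs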

Each word is tagged with its part of speech, only the words tagged NN (singular common noun) are kept, and FreqDist computes their frequency distribution, that is, how many times each one occurs.
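One detail worth checking when filtering on tags: in Python, ('NN') is just the string 'NN', not a one-element tuple, so `y in ('NN')` is a substring test. The sketch below uses a hand-written tagged list (the 'N' tag is hypothetical, included only to expose the difference) and also shows one way to include plural and proper nouns, whose Penn Treebank tags (NNS, NNP, NNPS) all start with NN.

```python
# Hand-written (word, tag) pairs; the "N" tag is hypothetical, for illustration only.
tagged = [
    ("cat", "NN"),
    ("cats", "NNS"),
    ("london", "NNP"),
    ("n", "N"),
    ("runs", "VBZ"),
]

# ('NN') is the string 'NN', so this is a substring test and also accepts 'N'.
substring_match = [w for w, t in tagged if t in ('NN')]

# A trailing comma makes a real tuple, so only the exact tag 'NN' matches.
exact_match = [w for w, t in tagged if t in ('NN',)]

# Every noun tag starts with 'NN', so a prefix test keeps plurals and proper nouns too.
all_nouns = [w for w, t in tagged if t.startswith('NN')]

print(substring_match)
print(exact_match)
print(all_nouns)
```

With real tagger output the substring test tends to give the same result as the exact match, but only by accident, so an explicit tuple or == is the safer choice.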

Show top 3

nltk_test.py


print(freq.most_common(3))

The result is displayed with most_common(), which returns the words and their counts in descending order of frequency; passing 3 gives the top three.
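most_common() returns a list of (word, count) tuples, so the output can be formatted a little more readably than with a plain print. The following sketch uses collections.Counter, which FreqDist is built on, with a made-up word list so that it runs on its own.

```python
from collections import Counter

# Made-up noun list standing in for only_nn.
freq = Counter(["data", "model", "data", "language", "model", "data", "word"])

top3 = freq.most_common(3)  # list of (word, count) tuples, most frequent first

# Print one ranked line per word instead of the raw list.
for rank, (word, count) in enumerate(top3, start=1):
    print(f"{rank}. {word}: {count}")
```

Words with equal counts come out in the order in which they were first seen, so the third place here goes to whichever tied word appeared first.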
