[PYTHON] Natural Language Processing Case Study: Word Frequency in 'Anne with an E'

This article is an experiment in using natural language processing to analyze the wording of a novel, building a simple pipeline that makes it easy to examine word frequency. I recently encountered the amazing Netflix series Anne with an E and was captivated by the story. It's hard to find high-quality shows that pass the Bechdel Test, and this one is extremely empowering for girls. Inspired by Hugo Bowne-Anderson from DataCamp, I decided to apply my web-scraping and basic natural language processing skills to analyze the word frequency in the original book, ANNE OF GREEN GABLES.

[Image: anne-with-an-e.jpg]

In this project, I'll build a simple pipeline to visualize and analyze the word frequency in ANNE OF GREEN GABLES.

from bs4 import BeautifulSoup
import requests
import nltk

If this is your first time using nltk, don't forget to run the following line to install the NLTK data, as we'll need the stopwords corpus later to remove all the English stopwords.

nltk.download()

Simply run this line in your program and a GUI window will pop up; follow the instructions to install the data. The installation takes a couple of minutes, and after that you'll be good to go.
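Alternatively, if you'd rather skip the GUI (for example, on a headless server), you can download just the corpus this project actually needs:

nltk.download('stopwords')  # fetches only the stopwords corpus instead of the full data set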


r = requests.get('http://www.gutenberg.org/files/45/45-h/45-h.htm')
r.encoding = 'utf-8'
html = r.text
print(html[0:100])

This fetches the HTML version of the book and prints the first 100 characters to check that the request worked. Before extracting all the text from the HTML document, we'll need to create a BeautifulSoup object.

soup = BeautifulSoup(html, 'html.parser')  # specify a parser explicitly to avoid a warning
text = soup.get_text()
print(text[10000:12000])

Then we'll use nltk and a regular expression to tokenize the text into words:

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print(tokens[:8])

Now we can lowercase all the tokens, since the frequency distribution we'll compute later is case-sensitive.

words = []

for word in tokens:
    words.append(word.lower())
print(words[:8])

One last step before we can visualize the word frequency distribution is to remove all the stopwords from the words list.

sw = nltk.corpus.stopwords.words('english')
print(sw[:8])
words_ns = []
for word in words:
    if word not in sw:
        words_ns.append(word)
print(words_ns[:5])
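As a minor aside, sw is a plain Python list, so each "not in" test scans it linearly. That's fine for a single novel, but converting the stopwords to a set makes each lookup constant-time; an optional tweak:

sw_set = set(sw)  # set membership tests are O(1) instead of O(n)
words_ns = [word for word in words if word not in sw_set]
print(words_ns[:5])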

Finally, let's plot the frequency distribution and see what the 25 most frequent words in the book are.

%matplotlib inline
freqdist = nltk.FreqDist(words_ns)
freqdist.plot(25)

[Plot: frequency distribution of the 25 most common words]

Now we can see that the top two supporting characters in the novel are both female: Marilla and Diana. And it's no surprise that the Netflix series was named Anne with an 'E': Anne is a name repeated over 1,200 times.
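If you want to verify the counts behind that claim, FreqDist can be queried directly (the exact numbers depend on the Project Gutenberg edition you fetched):

print(freqdist.most_common(5))  # the five most frequent non-stopword tokens as (word, count) pairs
print(freqdist['anne'])  # how many times 'anne' appears in the tokenized text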
