[PYTHON] CCC: coding crash course (5) Find out the frequency of words and letters that appear in Steve Jobs speeches

Steve Jobs Speech

First, copy the text of the speech from the Stanford University page and save it as a text file with no paragraph breaks (stevejobs.txt in the code below). https://news.stanford.edu/2005/06/14/jobs-061505/


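# Read the whole speech into one string; "with" closes the file automatically.
# encoding="utf-8" is an assumption about how the file was saved.
with open("stevejobs.txt", "r", encoding="utf-8") as f:
    contents = f.read()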

You can create an "(almost) list of words", list_word, as follows:


list_word = contents.split(" ")

You can use this to count how many times a particular word, e.g. "the", appears.


count = 0
for w in list_word:
    if (w == "the"):
        count = count + 1

print(count)
# 91
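
As an aside, the built-in list.count method gives the same kind of count without an explicit loop. The sketch below uses a short made-up sample string (not the speech itself) so that it runs on its own. It also shows two things that can change the number: split(" ") splits on single spaces only, and the comparison with "the" is case-sensitive.

```python
# Hypothetical sample text standing in for the contents of stevejobs.txt.
sample = "the cat saw the dog\nand The bird  saw the cat"

# split(" ") splits on single spaces only, so a newline stays inside an
# element and two consecutive spaces produce an empty string.
words_space = sample.split(" ")

# split() with no argument splits on any run of whitespace instead.
words_any = sample.split()

# list.count does the same job as the loop with a counter variable.
count_the = words_any.count("the")
print(count_the)

# The comparison is case-sensitive, so "The" is only counted after lowercasing.
count_the_any_case = [w.lower() for w in words_any].count("the")
print(count_the_any_case)
```

Whether "The" should count as "the" depends on what you want to measure, so the lowercasing step is optional.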

I could not call this a true "list of words" above because splitting on spaces leaves punctuation such as "." attached to the last word of each sentence. Let's look at an example.


for w in list_word:
    if ('.' in w ):
        print(w)
        print('----')

# Output (excerpt):
"""
world.
----
college.
----
graduation.
----
life.
----
(many more follow)
"""

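One possible way to get closer to a real list of words is to strip punctuation from each element before counting. The following is a minimal sketch on a made-up sample string, not the method used in this post; it relies on string.punctuation, which covers ASCII punctuation only, so the curly quotes that appear in the speech text would need to be added separately.

```python
import string
from collections import Counter

# Hypothetical sample text standing in for the contents of stevejobs.txt.
sample = "Stay hungry. Stay foolish. The end, the beginning!"

# Strip leading/trailing ASCII punctuation from each token and lowercase it.
# Note: string.punctuation does not include curly quotes such as “ or ’.
words = [w.strip(string.punctuation).lower() for w in sample.split()]
words = [w for w in words if w]  # drop tokens that were pure punctuation

# Counter maps each distinct word to the number of times it appears.
word_counts = Counter(words)
print(word_counts["the"])
print(word_counts["hungry"])
```

Because the punctuation is removed first, "hungry." and "hungry" are counted as the same word.
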
You can create a "list of (sentence-like) strings", list_sentence, as follows:


list_sentence = contents.split(". ")

I cannot call this a "list of sentences" because the elements produced by this method do not end with "." and because it does not recognize characters such as "!", "?", and ":" as sentence breaks. Let's look at an example.


for l in list_sentence:
    if ("!" in l):
        print(l)
        print('----')

# Nothing is printed, so there seems to be no sentence containing "!".

for l in list_sentence:
    if ("?" in l):
        print(l)
        print('----')

# The output is as follows.
"""
So why did I drop out? It started before I was born
----
So my parents, who were on a waiting list, got a call in the middle of the night asking: “We have an unexpected baby boy; do you want him?” They said: “Of course.” My biological mother later found out that my mother had never graduated from college and that my father had never graduated from high school
----
How can you get fired from a company you started? Well, as Apple grew we hired someone who I thought was very talented to run the company with me, and for the first year or so things went well
----
When I was 17, I read a quote that went something like: “If you live each day as if it was your last, someday you’ll most certainly be right.” It made an impression on me, and since then, for the past 33 years, I have looked in the mirror every morning and asked myself: “If today were the last day of my life, would I want to do what I am about to do today?” And whenever the answer has been “No” for too many days in a row, I know I need to change something
----
"""

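If you also want "!" and "?" to end a sentence, one option is a regular expression split. The sketch below is an illustration on a made-up sample string, not part of the original code. It still misses cases like the ones in the output above, where a closing quotation mark follows the "?" or ".", and it does not handle abbreviations.

```python
import re

# Hypothetical sample text standing in for the contents of stevejobs.txt.
sample = "So why did I drop out? It started before I was born. Stay hungry! Stay foolish."

# Split on whitespace that directly follows ".", "!" or "?".
# The lookbehind (?<=...) keeps the punctuation attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", sample)

for s in sentences:
    print(s)
    print('----')
```

Unlike split(". "), every element keeps its final punctuation mark, so you can still tell questions from statements.
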
Letter frequency

Finally, let's look at the letter frequency.

First, make all letters lowercase.


contents_lower = contents.lower()

def count_char(char_target):
    # Count how many times char_target appears in contents_lower.
    count = 0
    for s in contents_lower:
        if s == char_target:
            count = count + 1
    return count

abc = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

list_abc = abc.split()
#list_abc = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

import numpy
list_count_letter = numpy.zeros(26) 

for i in range(26):
    letter = list_abc[i]
    N = count_char(letter)
    list_count_letter[i] = N
    print(letter, N)

# Output:
"""
a 772
b 132
c 219
d 417
e 1077
f 206
g 207
h 440
i 642
j 9
k 65
l 402
m 216
n 595
o 772
p 183
q 4
r 499
s 510
t 926
u 283
v 115
w 245
x 16
y 257
z 4
"""

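For reference, the same per-letter counts can be obtained with the built-in str.count method and string.ascii_lowercase instead of a hand-written loop and alphabet string. This is a minimal sketch on a made-up sample string, so the numbers are not those of the speech.

```python
import string

# Hypothetical sample text standing in for the contents of stevejobs.txt.
sample = "Stay Hungry. Stay Foolish."
sample_lower = sample.lower()

# string.ascii_lowercase is "abcdefghijklmnopqrstuvwxyz".
# str.count returns the number of occurrences of the given character.
counts = {letter: sample_lower.count(letter) for letter in string.ascii_lowercase}

for letter, n in counts.items():
    print(letter, n)
```

Letters that never appear still get an entry with a count of 0, just as in the loop over list_abc.
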
Sorting by frequency makes the result easier to read:

- (1st place) "e"
- (2nd place) "t"
- (3rd place) "a"
- (4th) "o" (tied with "a" at 772)
- (5th) "i"
- (6th-10th) "nsrhd"
- (11th-15th) "luywc"
- (16th-20th) "mgfpb"
- (21st-26th) "vkxjqz"

Surprisingly, "u" is only in 12th place.

The ranking comes from the following code, which prints the letters in ascending order of count:


index_sort = numpy.argsort(list_count_letter)

for i in range(26):
    k = index_sort[i]
    letter = list_abc[k]
    N = list_count_letter[k]
    print(letter, N)

"""
z 4.0
q 4.0
j 9.0
x 16.0
k 65.0
v 115.0
b 132.0
p 183.0
f 206.0
g 207.0
m 216.0
c 219.0
w 245.0
y 257.0
u 283.0
l 402.0
d 417.0
h 440.0
r 499.0
s 510.0
n 595.0
i 642.0
o 772.0
a 772.0
t 926.0
e 1077.0
"""

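The trailing ".0" in the output above appears because numpy.zeros creates a floating-point array by default. As an alternative sketch (not the code used in this post), collections.Counter from the standard library keeps integer counts and can list the letters from most to least frequent directly. The sample string is made up so that the example runs on its own.

```python
import string
from collections import Counter

# Hypothetical sample text standing in for the contents of stevejobs.txt.
sample = "Stay Hungry. Stay Foolish."

# Count only the letters a-z, after lowercasing; spaces and "." are skipped.
letter_counts = Counter(ch for ch in sample.lower() if ch in string.ascii_lowercase)

# most_common() returns (letter, count) pairs in descending order of count.
for letter, n in letter_counts.most_common():
    print(letter, n)
```

Note that letters which never occur are simply absent from the result, although looking them up (for example letter_counts["z"]) returns 0.
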