[PYTHON] Negative / Positive Analysis 2 Twitter Negative / Positive Analysis (1)

Aidemy 2020/10/30

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI and went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that so many people read my previous summary article. Thank you! This is the second post on negative/positive analysis. Nice to meet you.

What to learn this time
・ Twitter negative/positive analysis using RNN
・ Creating a database

RNN

- Negative/positive analysis can be performed with a polarity dictionary, but it is difficult to make judgments that follow the context. By using the "RNN (Recurrent Neural Network)" covered in "Topic Extraction 1 of Japanese Text", it becomes possible to make judgments that take the context into account.
- Context estimation with an RNN can also be done for Japanese, but this time we will use English.

Twitter Negative / Positive Analysis

- Twitter consists of short texts of at most 140 characters, so it is well suited to natural language processing analysis. This time we will use Twitter data about US airlines (USAirline).
- First, open the csv file in which the tweet data is recorded and extract all rows of the 'text' and 'airline_sentiment' columns from it. This is done with "loc[:, [column names]]".
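A minimal sketch of this loading step, assuming the csv is saved as "Tweets.csv" and read with pandas (the file name and the variable name tweetData are assumptions; the original code is only shown in the screenshots below):

```python
import pandas as pd

# Read the US airline tweet data (the file name "Tweets.csv" is a placeholder)
tweetData = pd.read_csv("Tweets.csv")

# Extract all rows of the 'text' and 'airline_sentiment' columns with loc[:, [column names]]
tweetData = tweetData.loc[:, ["text", "airline_sentiment"]]
print(tweetData.head())
```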

Creating a database

Delete frequent words

- When analyzing the relationships between words with an RNN, very frequent words such as "@" and "I", called "stop words", interfere with the analysis. This time we will delete these stop words.
- For the tweet data read in the previous section, first split each tweet into individual words, and then use "lower()" to convert everything to lowercase (normalization).
- Also, download the stopwords module of nltk and set the English stop word information in the variable "stops" with "set(stopwords.words("english"))".
- For each word w in words, only the words that are "not in stops (not stop words)" and "do not contain @ or flight" are returned as "meaningful_words".
- Of these conditions, the former can be written as "if not w in stops", and the latter as "not re.match('^[pattern]', w)".

・ Code![Screenshot 2020-10-27 18.04.28.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/693ead18-245f-1050-07de-fe398abc1569.png)
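Since the code is only available as a screenshot, here is a minimal sketch of what the tweet_to_words function might look like; the exact regular-expression pattern used to exclude "@" and "flight" is not visible here, so the pattern below is an assumption:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stops = set(stopwords.words("english"))  # English stop word set

def tweet_to_words(tweet):
    # Lowercase the tweet and split it into words (normalization)
    words = tweet.lower().split()
    # Keep only words that are not stop words and do not start with "@" or "flight"
    # (the regex pattern is an assumption based on the description above)
    meaningful_words = [w for w in words
                        if w not in stops and not re.match("^[@]|flight", w)]
    return " ".join(meaningful_words)
```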

Creating a database of words

- Next, put all the words of the tweets into a database. By doing this, you can check how often each word appears and tag the tweets as negative or positive all at once.
- "cleanTweet" is the text part of the tweet data with the "tweet_to_words" function created in the previous section applied to it. With "apply(lambda x: function)" you can apply a function to each element.
- Since each cleaned tweet stores its words separated by spaces, all tweets can be joined into a single string with "' '.join(list)". Furthermore, by applying "split()" to that string, you get a list that stores every word one by one.

・ Code![Screenshot 2020-10-27 18.38.27.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/a115f91f-0811-1a2f-bbf3-5bb47e0c5430.png)

・ Contents of words![Screenshot 2020-10-27 18.39.07.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/b9d684fe-1bf6-78f9-64cb-aba87ea7f537.png)
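A minimal sketch of this step, building on tweetData and tweet_to_words from the sketches above (the variable names cleanTweet, all_text, and words follow the description, but are assumptions as far as the screenshot is concerned):

```python
# Apply the cleaning function to every tweet text
cleanTweet = tweetData["text"].apply(lambda x: tweet_to_words(x))

# Join all cleaned tweets into one space-separated string, then split it into a flat word list
all_text = " ".join(cleanTweet)
words = all_text.split()
print(words[:10])
```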

Digitize words

- Add a numeric tag to each word based on how often it appears. We also create a new list in which the cleanTweet strings are converted to numbers.
- The number of times each word appears can be counted with "Counter()". Sort the words in descending order of the number of occurrences and create a dictionary of the form "{word: index number}" with "for ii, word in enumerate(vocab, 1)". When you pass a list to "enumerate()" and loop over it, the index goes into the first variable (ii) and the element into the second variable (word). In other words, this code creates a dictionary in which index numbers are assigned in order, starting from the most frequent word in "vocab".

- Next, convert the cleanTweet strings to numbers and create a new list. First, create an empty list "tweet_ints".
- For each line (each) of cleanTweet split into words (word), convert each word to the index number assigned just above and store the result in the empty list tweet_ints. As a result, tweet_ints holds, for each line (tweet), a list of index numbers.
- The conversion is done by looking up each word in the dictionary created just above.

・ Code![Screenshot 2020-10-27 20.36.39.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/71666eae-8acf-2146-218f-a1a04ee76b5f.png)

・ Result (only part)![Screenshot 2020-10-27 20.37.08.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/1cbd9513-e394-12a4-4687-ed90a9ec809e.png)
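A sketch of these two steps under the same assumptions as above (the dictionary name vocab_to_int is an assumption, since the screenshot is not reproduced here):

```python
from collections import Counter

# Count word frequencies and sort the words in descending order of occurrence
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)

# {word: index number}, with index 1 assigned to the most frequent word
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

# Convert each cleaned tweet into a list of index numbers
tweet_ints = []
for each in cleanTweet:
    tweet_ints.append([vocab_to_int[word] for word in each.split()])
```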

Negative and positive digitization

- The 'airline_sentiment' column of the tweetData extracted at the beginning of this chapter stores, for each tweet, whether it is negative, positive, or neutral. This information is converted into numbers, for example "negative = 0, positive = 1, neutral = 2". By turning it into numbers, it can be used for training.
- With each row of 'airline_sentiment' as each, an if expression of the form "0 if each == 'negative'" assigns the values, and the results are collected into a single np.array.

・ Code![Screenshot 2020-10-27 21.01.40.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/af152708-57ec-3735-fa90-16e3728c50f3.png)
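A minimal sketch of this label conversion (the variable name labels is an assumption):

```python
import numpy as np

# negative = 0, positive = 1, neutral = 2
labels = np.array([0 if each == "negative" else 1 if each == "positive" else 2
                   for each in tweetData["airline_sentiment"]])
```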

Align the number of columns

-The __ "tweet_ints" __ created in the previous section contains the index number of the word of each tweet, but the length of each list is different __. However, when learning __, it is necessary to make the lengths uniform. __ Also, delete the line where the number of words became 0 in the cleanTweet process __. -First, get the length of each list with __ "Counter ()" __. In the process immediately after, __ the length of the other list will be adjusted according to the one with the maximum __ length, so only the maximum one will be acquired with __ "max ()" __ and the variable "seq_len". Store in. -Next, __ delete the line with length 0 __. Instead of deleting it directly with drop () etc., tweet_ints For column (tweet), it is done by the method of getting only the one of len (tweet)> 0. Store the index of the row that satisfies the condition of len (tweet)> 0 in "__tweet_index" __, and store __labels and tweetData again accordingly __. Finally, you can store tweet_ints again according to the conditions. ・ __ Align the length of each list __. As mentioned above, __ adjust the length to the longest . If the length is not enough, fill in the digitized words on the line from the right and replace the shortage with 0. (For example, when the length of the list of __ [1,2,3] __ is set to 5 by this method, it becomes __ [0,0,3,2,1] __) -The "np.zeros ((len (tweet_ints), seq_len), dtype = int) " part of the code below is an array with all zero elements, with rows the length of tweet_ints and the maximum number of columns. Indicates that you are creating. When each __ row of tweet_ints is i and __ column is row, by putting row in the part where __ row is i and column is -len (row) __ in the array with 0 elements, The above complementary method is realized.

・ Code![Screenshot 2020-10-28 15.23.00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/69e4bdd0-0c2e-398f-ec5a-dee991526845.png)

・ Result (only part)![Screenshot 2020-10-27 22.52.30.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/907dc703-1da6-ec31-9c8a-5b964658ed11.png)
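A sketch of the filtering and padding step under the same assumptions as the sketches above (the array name features is an assumption):

```python
from collections import Counter
import numpy as np

# Length of each tweet in words; the longest one determines the number of columns
tweet_len = Counter([len(x) for x in tweet_ints])
seq_len = max(tweet_len)

# Keep only the rows that still contain at least one word after cleaning
tweet_index = [ii for ii, tweet in enumerate(tweet_ints) if len(tweet) > 0]
labels = labels[tweet_index]
tweetData = tweetData.iloc[tweet_index]
tweet_ints = [tweet for tweet in tweet_ints if len(tweet) > 0]

# Zero array with one row per tweet and seq_len columns; each row is filled from the right
features = np.zeros((len(tweet_ints), seq_len), dtype=int)
for i, row in enumerate(tweet_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
```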

Summary

- By using an RNN, it is possible to perform negative/positive analysis that takes the flow of the context into account.
- Words that appear very frequently in the data are called "stop words"; they interfere with the analysis, so they need to be deleted.
- By creating a database of the words in the Twitter data, the frequency of each word can be checked and the analysis can be carried out word by word.
- Word data is quantified by assigning IDs in order of frequency of appearance; using a polarity dictionary, the corresponding PN values are also stored. In addition, the negative/positive labels, which serve as the teacher labels when implementing a model, can be distinguished numerically.
- For the data used as training data, the column length of each list is made uniform.

That's all for this time. Thank you for reading to the end.
