Start studying: Saturday, December 7th
Teaching materials, etc.:
・Miyuki Oshige, "Details! Python 3 Introductory Note" (Sotec, 2017): 12/7 (Sat) - 12/19 (Thu), finished reading
・Progate Python course (5 courses in total): 12/19 (Thu) - 12/21 (Sat), finished
・Andreas C. Müller, Sarah Guido, "Introduction to Machine Learning with Python" (Japanese edition, O'Reilly Japan, 2017): 12/21 (Sat) - 12/23 (Sat)
・Kaggle: Real or Not? NLP with Disaster Tweets: 12/28 (Sat) - 1/3 (Fri), submission and adjustment
・Wes McKinney, "Python for Data Analysis" (Japanese edition, O'Reilly Japan, 2018): 1/4 (Wed) - 1/13 (Mon), finished reading
・Yasuki Saito, "Deep Learning from Zero" (O'Reilly Japan, 2016): 1/15 (Wed) - 1/20 (Mon)
・**François Chollet, "Deep Learning with Python and Keras" (Queep, 2018): 1/21 (Tue) ~**
Read up to p. 261, partway through Chapter 6, "Deep Learning for Text and Sequences."
Finished the tokenization I was struggling with yesterday.
Data preprocessing (natural language processing)
```python
# type: pandas.core.series.Series
# Convert to lowercase
X_l = X.str.lower()
# Replace unnecessary characters with half-width spaces.
# Note: Series.replace() matches whole values, so substring replacement
# needs the .str accessor; a character class covers all target symbols.
X_r = X_l.str.replace(r'[,.#!]', ' ', regex=True)
# Split each element on half-width spaces
X_s = X_r.str.split(' ')
```
Defined together as a function:

```python
def make_vector(df):
    # Lowercase, replace punctuation with spaces, then split on spaces
    X_l = df.str.lower()
    X_r = X_l.str.replace(r'[,.#!]', ' ', regex=True)
    X_s = X_r.str.split(' ')
    return X_s
```
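As a quick sanity check, the helper can be run on a toy Series (the sample tweets below are hypothetical stand-ins for the Kaggle data, and this sketch assumes pandas is installed):

```python
import pandas as pd

def make_vector(df):
    # Lowercase, replace punctuation with spaces, then split on spaces
    X_l = df.str.lower()
    X_r = X_l.str.replace(r'[,.#!]', ' ', regex=True)
    return X_r.str.split(' ')

# Hypothetical sample tweets, just to exercise the function
tweets = pd.Series(["Forest fire near La Ronge!",
                    "All residents asked to #evacuate"])
tokens = make_vector(tweets)
```

Note that splitting on `' '` keeps empty strings where punctuation was replaced; `split()` with no argument would collapse runs of whitespace instead.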
Now that the text retrieved from the dataset has been tokenized, all that remains is to train the defined model. (Under implementation.)
Incidentally, at first I tried pulling the entries out one by one and looping over them with a for statement, but that didn't work. I wondered whether the Series could be preprocessed as-is, without extracting each element, and after looking into it I found that it could. I wrote the preprocessing while referring to the official pandas documentation (API reference, Series), and it succeeded.
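The difference described above can be sketched on toy data (the strings and variable names here are hypothetical): the `.str` accessor applies a string operation to every element of the Series at once, so no explicit loop is needed.

```python
import pandas as pd

s = pd.Series(["Hello, World!", "NLP #rocks"])

# Element-by-element loop: works, but verbose and slow on large Series
looped = pd.Series([t.lower().replace(',', ' ') for t in s])

# Vectorized: the .str accessor broadcasts the operation over all elements
vectorized = s.str.lower().str.replace(',', ' ', regex=False)
```

Both produce the same result, but the vectorized form is the idiomatic pandas way.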