Use Python to play with LINE's talk history.
LINE is a communication app that lets you enjoy as many calls and messages as you like, 24 hours a day, anytime, anywhere.
We will assume that everyone reading this article knows LINE.
The goal is to collect the sending date and time, the sender's name, and the message content in a pandas DataFrame. Ideally it looks like this ↓
Datetime | Name | Content |
---|---|---|
2020/7/30 12:00 | I | Hello |
2020/7/30 12:01 | I | [stamp] |
... | ... | ... |
LINE talk history can be downloaded in txt format. When opened, it should look like the following. LINE has been updated frequently of late, so the format may have changed again.
2019.12.23 月曜日
15:16 Name joined the group.
15:16 Name joined the group.
15:16 Name Hello
15:16 Name [Stamp]
15:16 Name Thank you.
15:16 Name [Stamp]
\n is a line break.
- Date
yyyy.mm.dd X曜日\n (X曜日 is the three-character Japanese day of the week, for example 月曜日 for Monday)
- Ordinary talk
hh:mm Name Content\n
- Talk with line breaks
hh:mm Name Content 1\n
Content 2 ....\n
I will omit the other line types. Try them for yourself.
Read the txt file → Merge talks that contain line breaks into one line → Extract time, name, and content → Attach the date to each talk → Convert to a DataFrame
The script looks like this.
import pandas as pd

# Read the txt file line by line
f = open("line_--.txt", encoding="UTF-8")
line_data = f.readlines()
f.close()

# Concatenate row to the last element of list
def appending(list, row):
    row = list[-1] + row
    del list[-1]
    list.append(row)

data = []
for row in line_data:
    # Lines shorter than 10 characters: continuation of a talk with line breaks
    if len(row) < 10:
        row = row[:-1]
        appending(data, row)
    # Lines shorter than 15 characters
    elif len(row) < 15:
        # Time + Name + Content
        if row[2] == ":" and row[5] == " ":
            row = row[:-1]
            data.append(row)
        # Talk content after a line break
        else:
            row = row[:-1]
            appending(data, row)
    # Lines of 15 characters or more
    else:
        # Date ("曜日" is the Japanese "day of the week" suffix that ends a date line)
        if row[4] == "." and row[7] == "." and row[-3:-1] == "曜日":
            row = row[:10]
            data.append(row)
        # Time + Name + Content
        elif row[2] == ":" and row[5] == " ":
            row = row[:-1]
            data.append(row)
        # Talk content after a line break
        else:
            row = row[:-1]
            appending(data, row)

data2 = []
for row in data:
    # Assign the date to the variable date
    if row[4] == ".":
        date = row
    # Concatenate the date to time + name + content
    else:
        row = date + "." + row
        row = row.split(" ")
        if len(row) == 3:
            data2.append(row)

# Turn the list into a DataFrame
df = pd.DataFrame(data2, columns=["Datetime", "Name", "Content"])
# Convert the time to datetime type
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y.%m.%d.%H:%M")
Read line by line from the beginning.
f = open("line_--.txt", encoding="UTF-8")
line_data = f.readlines()
f.close()
line_data turns out to be a list.
print(type(line_data))
<class 'list'>
Concatenate the current row to the last element of list.
def appending(list, row):
    row = list[-1] + row
    del list[-1]
    list.append(row)
Let's try it.
thanks = ["always", "thank"]
appending(thanks, " you")
# thanks == ['always', 'thank you']
hh:mm Name Content\n
The shortest possible talk line has a one-character name and one character of content: 5 characters for the time, 2 spaces, 1 for the name, 1 for the content, and the newline, 10 characters in total. In other words, every line shorter than 10 characters is the continuation of a talk that contains a line break.
row = row[:-1]  # The last character is the newline \n, so delete it
appending(data, row)
With this process, this ↓
01:10 Happy New Year!\n
Nice to meet you!\n
↓ becomes this.
01:10 Happy New Year!Nice to meet you!
Was there a better example?
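The length rule and the merge can be tried in isolation. The following is only a minimal sketch: the sample strings and the one-character name `A` are made up for illustration, and `appending` is a shortened stand-in for the helper above with its parameter renamed so that it does not shadow the built-in `list`.

```python
def appending(rows, row):
    # Glue `row` onto the last element of `rows` (same effect as the helper above).
    rows[-1] = rows[-1] + row

# Shortest possible talk line: "hh:mm", a space, a 1-character name,
# a space, 1 character of content, and the trailing newline.
shortest = "01:10 A B\n"
shortest_len = len(shortest)

data = ["01:10 A Happy New Year!"]   # first line of a talk, newline already removed
continuation = "Thanks!\n"           # hypothetical second line of the same talk

if len(continuation) < shortest_len:
    # Too short to be a talk line, so it must belong to the previous talk.
    appending(data, continuation[:-1])

print(shortest_len)
print(data)
```

Note that the two parts are joined without any separator, so the original line break is not preserved.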
yyyy.mm.dd X曜日\n
The date line has 15 characters: 10 for the date, a space, 3 for the Japanese day of the week (for example 月曜日), and the newline. Lines shorter than 15 characters are therefore either an ordinary talk or the continuation of a talk that contains a line break.
01:10 Happy New Year!\n
In a talk line the 3rd character is ":" and the 6th character is " " (a space), so lines matching that pattern are extracted.
if row[2] == ":" and row[5] == " ":
    row = row[:-1]
    data.append(row)
For talks that include line breaks, perform the same processing as before.
yyyy.mm.dd X曜日\n
The 5th and 8th characters are "." and the last two characters before the newline are "曜日" (Japanese for "day of the week"). Lines that match are picked up as date lines.
if row[4] == "." and row[7] == "." and row[-3:-1] == "曜日":
    row = row[:10]
    data.append(row)
I dropped the day-of-the-week part for no particular reason. Of course, you can keep it.
Same as before.
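The three cases can be summarized in one small function. This is a sketch rather than the script itself: `classify` is a hypothetical helper name, and it assumes that the date line ends with the Japanese weekday suffix 曜日 (the "day of the week" mentioned above) and that the separator in talk lines is a single space.

```python
def classify(row):
    """Return 'date', 'talk', or 'continuation' for one raw line (newline included)."""
    # Date line "yyyy.mm.dd X曜日\n": dots at index 4 and 7, "曜日" right before "\n".
    # Assumption: the export uses the Japanese weekday suffix.
    if len(row) >= 15 and row[4] == "." and row[7] == "." and row[-3:-1] == "曜日":
        return "date"
    # Talk line "hh:mm Name Content\n": at least 10 characters long.
    if len(row) >= 10 and row[2] == ":" and row[5] == " ":
        return "talk"
    # Everything else is treated as the rest of the previous talk.
    return "continuation"

# Made-up sample lines for illustration.
for sample in ["2019.12.23 月曜日\n", "15:16 A Hello\n", "Bye\n"]:
    print(repr(sample), classify(sample))
```

Keeping the checks in one place makes it easier to adjust them when the export format changes.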
With the work so far, the "date" and "time name content" elements have been added to the list data. If data is empty, or if you get an error, blame a LINE update: the export format has probably changed.
Among the elements of data, one whose 5th character is "." is a date; anything else is talk content. Assign the date to the variable date.
if row[4] == ".":
    date = row
Otherwise, join the date and the "time name content" row with ".":
else:
    row = date + "." + row
The element row then becomes ↓.
yyyy.mm.dd.hh:mm Name Content
Finally, split it on " " and append the result to the list data2.
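Here is a small runnable sketch of this step on made-up data. One deliberate difference, which is my assumption and not something the article states: it passes `maxsplit=2` to `split`, so that content containing spaces stays in one piece. A plain `split(" ")` would cut such a row into more than three parts, and the length check would then drop it.

```python
# Made-up rows in the shape produced by the previous step.
data = [
    "2019.12.23",
    "15:16 A Hello",
    "15:17 B See you soon",
    "2019.12.24",
    "09:00 A Morning",
]

data2 = []
date = None
for row in data:
    if row[4] == ".":
        # A date row: remember it for the talks that follow.
        date = row
    elif date is not None:
        # "yyyy.mm.dd.hh:mm Name Content" -> [datetime, name, content]
        parts = (date + "." + row).split(" ", 2)
        if len(parts) == 3:
            data2.append(parts)

for parts in data2:
    print(parts)
```

A name that itself contains a space would still be split incorrectly; handling that needs more information than the position of the first two spaces.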
data2 is a list of lists. Turn it into a DataFrame with pandas and give the columns names. While I was at it, I converted the date and time to datetime type.
That's it. Thank you for your hard work. You can also export using to_csv.
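As a quick check of the last two steps, the sketch below builds the DataFrame from two made-up rows, converts the first column, and exports it as CSV. It writes to an in-memory buffer only to stay self-contained; passing a file path to `to_csv` works the same way.

```python
import io

import pandas as pd

# Made-up rows in the "yyyy.mm.dd.hh:mm", name, content layout.
data2 = [
    ["2020.07.30.12:00", "I", "Hello"],
    ["2020.07.30.12:01", "I", "[stamp]"],
]

df = pd.DataFrame(data2, columns=["Datetime", "Name", "Content"])
# Parse the combined date and time string into real timestamps.
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y.%m.%d.%H:%M")

# Export without the index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```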
What got me started was wondering whether LINE's talk history could be used as a natural language dataset. Learning human conversation, reading emotions from conversation... I think there are many ways to use it.
This time I didn't use a regular expression to extract the rows. If this code doesn't work for you (such as pulling out lines you don't want to extract), try the Python re package.
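For reference, a regular-expression version might look roughly like the following. It is a sketch under assumptions, not a drop-in replacement: the names `DATE_RE`, `TALK_RE`, and `parse` are hypothetical, the date line is assumed to end with the Japanese weekday suffix 曜日, and names are assumed to contain no spaces.

```python
import re

# "yyyy.mm.dd X曜日" (assumption: Japanese weekday suffix)
DATE_RE = re.compile(r"^(\d{4}\.\d{2}\.\d{2}) .曜日$")
# "hh:mm Name" optionally followed by " Content"
TALK_RE = re.compile(r"^(\d{2}:\d{2}) (\S+)(?: (.*))?$")

def parse(lines):
    """Turn raw export lines into [datetime string, name, content] rows."""
    rows, date = [], None
    for line in lines:
        line = line.rstrip("\n")
        date_match = DATE_RE.match(line)
        talk_match = TALK_RE.match(line)
        if date_match:
            date = date_match.group(1)
        elif talk_match and date is not None:
            clock, name, content = talk_match.groups()
            # Lines with a time but no separate content field are skipped.
            if content is not None:
                rows.append([f"{date}.{clock}", name, content])
        elif rows:
            # Anything else continues the previous talk.
            rows[-1][2] += line
    return rows

# Made-up sample input.
sample = [
    "2019.12.23 月曜日\n",
    "15:16 A Hello\n",
    "15:17 B Happy New Year!\n",
    "Nice to meet you!\n",
]
for row in parse(sample):
    print(row)
```

Like the index-based version, this cannot tell a continuation line that happens to start with something like "12:00 " from a new talk.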
I started working on this more than three months ago, but the format changed before I could publish the article. Everything I had written little by little went to waste, and I rewrote the code in tears. As a memorial, I will also leave the previous code here.
import pandas as pd

# Read txt file line by line
f = open("line_--.txt", encoding="UTF-8")
line_data = f.readlines()
f.close()

# Concatenate line feed data to the previous line
def appending(list, row):
    row = list[-1] + row
    del list[-1]
    list.append(row)

data = []
# Read from the 4th line
for row in line_data[3:]:
    # Concatenate lines less than 9 characters
    if len(row) < 9:
        row = row[:-1]
        appending(data, row)
    # When less than 13 characters
    elif len(row) < 13:
        # Time + Name + Content
        if row[2] == ":" and row[5] == "\t":
            row = row[:-1]
            data.append(row)
        # When the hour is a single digit
        elif row[1] == ":" and row[4] == "\t":
            row = row[:-1]
            data.append(row)
        # Concatenate to the previous line
        else:
            row = row[:-1]
            appending(data, row)
    # When 13 characters or more
    else:
        # Date
        if row[4] == "/" and row[-4] == "(" and row[-2] == ")":
            row = row[:-4]
            data.append(row)
        # Time + Name + Content
        elif row[2] == ":" and row[5] == "\t":
            row = row[:-1]
            data.append(row)
        # When the hour is a single digit
        elif row[1] == ":" and row[4] == "\t":
            row = row[:-1]
            data.append(row)
        # Concatenate to the previous line
        else:
            row = row[:-1]
            appending(data, row)

data2 = []
for row in data:
    # Assign the date to the variable date
    if row[4] == "/":
        date = row
    # Concatenate date to time + name + content
    else:
        row = date + " " + row
        row = row.split("\t")
        if len(row) == 3:
            # Strip the double quotes if the content is wrapped in them
            if row[2][0] == '"' and row[2][-1] == '"':
                row[2] = row[2][1:-1]
            data2.append(row)

df = pd.DataFrame(data2, columns=["Datetime", "Name", "Content"])
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y/%m/%d %H:%M")