[PYTHON] [Beginner] Process LINE talk history into a dataset

Process LINE talk history

Use Python to play with LINE's talk history.

What is LINE in the first place?

LINE is a new communication app that lets you enjoy as many calls and messages as you like, 24 hours a day, anytime, anywhere.

We will assume that everyone reading this article knows LINE.

Goal of this article

Collect each message's send date and time, the sender's name, and the message content in a pandas DataFrame. Ideally it looks like this ↓

Datetime Name Content
2020/7/30 12:00 I Hello
2020/7/30 12:01 I [stamp]
... ... ...

Format

Text file format

LINE talk history can be downloaded in txt format. When opened, it should look like this. LINE has been updated frequently lately, so the format may have changed again.

 2019.12.23 月曜日
 15:16 Name joined the group.
 15:16 Name joined the group.
 15:16 Name Hello
 15:16 Name [Stamp]
 15:16 Name Thank you.
 15:16 Name [Stamp]

\n is a line break.

- Date

 yyyy.mm.dd weekday\n

The weekday is written in Japanese (for example 月曜日), so it always ends in 曜日.

- Ordinary talk

 hh:mm Name Content\n

- Talk with line breaks

 hh:mm Name Content 1\n
 Content 2 ...\n

Other line types are omitted here. Check your own export to see them.

Implementation

Read the txt file → Merge talk content that spans several lines into one entry → Extract time, name, and content → Prepend the date to each talk → Convert to a DataFrame

Whole script

The script looks like this.

import pandas as pd

# Read txt file
with open("line_--.txt", encoding="UTF-8") as f:
    line_data = f.readlines()

# Define concatenation: append row to the last element of lines
def appending(lines, row):
    row = lines[-1] + row
    del lines[-1]
    lines.append(row)

data = []
for row in line_data:
    # Lines less than 10 characters: continuation of the previous talk
    if len(row) < 10:
        row = row[:-1]
        appending(data, row)
    # Lines less than 15 characters
    elif len(row) < 15:
        # Time + Name + Content
        if row[2] == ":" and row[5] == " ":
            row = row[:-1]
            data.append(row)
        # Talk content after line break
        else:
            row = row[:-1]
            appending(data, row)
    # Lines of 15 characters or more
    else:
        # Date (the Japanese weekday name ends in 曜日)
        if row[4] == "." and row[7] == "." and row[-3:-1] == "曜日":
            row = row[:10]
            data.append(row)
        # Time + Name + Content
        elif row[2] == ":" and row[5] == " ":
            row = row[:-1]
            data.append(row)
        # Talk content after line break
        else:
            row = row[:-1]
            appending(data, row)

data2 = []
for row in data:
    # Assign date to variable date
    if row[4] == ".":
        date = row
    # Concatenate date to time + name + content
    else:
        row = date + "." + row
        row = row.split(" ")
        # Keep only rows that split into exactly datetime, name, content
        if len(row) == 3:
            data2.append(row)

# Convert the list to a DataFrame
df = pd.DataFrame(data2, columns=["Datetime", "Name", "Content"])
# Convert the Datetime column to datetime type
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y.%m.%d.%H:%M")

Read txt file

Read the file from the beginning and store each line as one element.

with open("line_--.txt", encoding="UTF-8") as f:
    line_data = f.readlines()

As expected, line_data is a list.

print(type(line_data))
 <class 'list'>

Define concatenation

Concatenate the current row to the last element of the list.

def appending(lines, row):
    row = lines[-1] + row
    del lines[-1]
    lines.append(row)

Let's play.

thanks = ["always", "thank "]
appending(thanks, "you")
# thanks == ['always', 'thank you']
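One thing to keep in mind: indexing with [-1] raises an IndexError on an empty list, so the helper assumes that at least one element has already been stored. Below is a minimal sketch of a more forgiving variant. The name appending_safe is hypothetical and is not part of the original script.

```python
def appending_safe(lines, row):
    """Append row to the last element of lines, or start a new element if lines is empty."""
    if lines:
        lines[-1] += row   # same effect as the del/append pair, done in place
    else:
        lines.append(row)  # nothing to concatenate to yet


words = ["always", "thank "]
appending_safe(words, "you")
print(words)

empty = []
appending_safe(empty, "orphan line")
print(empty)
```

Whether an orphan line should be kept or dropped depends on what you want in the dataset; here it is kept as its own element.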

Lines less than 10 characters

 hh:mm Name Content\n

A talk line is shortest when the name and the content are one character each, which gives 10 characters including the newline. In other words, every line shorter than 10 characters is the part after a line break in a talk that contains line breaks.

row = row[:-1]  # The last character is the newline \n, so delete it
appending(data, row)

This process turns this ↓

 01:10 Happy New Year!\n
 Nice to meet you!\n

↓ into this. The pieces are joined as they are, with no separator.

 01:10 Happy New Year!Nice to meet you!

Was there a better example?
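Note that row[:-1] assumes every line ends with \n. The last line of a file may not, and in that case the slice removes a real character. The sketch below shows the difference, using str.rstrip as one possible alternative. This is a suggestion and not what the script in this article does.

```python
last_line = "Nice to meet you!"      # final line of a file, no trailing newline
normal_line = "Nice to meet you!\n"  # any other line

sliced_last = last_line[:-1]                # drops the "!" by mistake
stripped_last = last_line.rstrip("\n")      # leaves the text untouched
stripped_normal = normal_line.rstrip("\n")  # removes only the newline

print(repr(sliced_last))
print(repr(stripped_last))
print(repr(stripped_normal))
```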

Lines less than 15 characters

 yyyy.mm.dd weekday\n

The date line has 15 characters including the newline, because the Japanese weekday name (for example 月曜日) is three characters long. Lines shorter than 15 characters are therefore either ordinary talk or the part after a line break in a talk that contains line breaks.
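The two length thresholds can be checked by counting characters directly. This is a small sketch with made-up sample lines, assuming a zero-padded hh:mm time and a three-character Japanese weekday name.

```python
# yyyy.mm.dd (10) + space (1) + weekday (3) + newline (1)
date_line = "2019.12.23 月曜日\n"

# hh:mm (5) + space (1) + name (1) + space (1) + content (1) + newline (1)
shortest_talk = "01:10 A B\n"

# A short continuation line: anything under 10 characters cannot be a talk line
continuation = "OK!\n"

print(len(date_line), len(shortest_talk), len(continuation))
```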

Ordinary talk

 01:10 Happy New Year!\n

The 3rd character is ":" and the 6th character is " " (a space), so lines matching that pattern are extracted.

if row[2] == ":" and row[5] == " ":
    row = row[:-1]
    data.append(row)

For talks that include line breaks, perform the same processing as before.

Lines of 15 characters or more

Date

 yyyy.mm.dd weekday\n

The 5th and 8th characters are "." and the last two characters before the newline are "曜日", the ending of every Japanese weekday name. Use this to pick up the date line.

if row[4] == "." and row[7] == "." and row[-3:-1] == "曜日":
    row = row[:10]
    data.append(row)

I dropped the weekday part here. Of course you can keep it.

Other than date

Same as before.

With the work so far, the list data contains "date" elements and "time name content" elements. If data is empty, or if you get an error, the export format has probably changed again, so please blame the LINE update.
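The positional checks used so far can also be collected into small helper functions, which makes each rule easier to test on its own. The following is only a sketch: the function names are hypothetical, and it assumes the weekday in the export ends in 曜日.

```python
def is_date_line(row):
    """True for a date line: yyyy.mm.dd, a space, a weekday ending in 曜日, newline."""
    return (
        len(row) >= 15
        and row[4] == "."
        and row[7] == "."
        and row[-3:-1] == "曜日"
    )


def is_talk_line(row):
    """True for a talk line: hh:mm, a space, then name and content."""
    return len(row) >= 10 and row[2] == ":" and row[5] == " "


def classify(row):
    """Return 'date', 'talk', or 'continuation' for one raw line."""
    if is_date_line(row):
        return "date"
    if is_talk_line(row):
        return "talk"
    return "continuation"


for sample in ["2019.12.23 月曜日\n", "15:16 Name Hello\n", "Nice to meet you!\n"]:
    print(classify(sample), repr(sample))
```

The length checks sit inside the helpers so that short lines never reach an index that does not exist.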

Append date

Of the elements in data, those whose 5th character is "." are dates, and the rest are talk content. Assign each date to the variable date.

if row[4] == ".":
    date = row

Otherwise, prepend the date to the "time name content" row with "." as the separator,

else:
    row = date + "." + row

and the element row becomes ↓.

 yyyy.mm.dd.hh:mm Name Content

Finally, split on " " (a space) and append the result to the list data2. Rows that do not split into exactly three parts are skipped.
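Because the row is split on every space, content that itself contains spaces produces more than three parts and gets dropped. One possible workaround, sketched below, is to limit the number of splits with the maxsplit argument of str.split. This assumes names never contain spaces, and the sample data is made up.

```python
data = [
    "2019.12.23",
    "15:16 Name Hello",
    "15:17 Name See you soon",  # content contains spaces
]

data2 = []
date = None
for row in data:
    if row[4:5] == ".":
        date = row  # remember the most recent date
    elif date is not None:
        # Split at most twice: datetime, name, and everything else as content
        parts = (date + "." + row).split(" ", 2)
        if len(parts) == 3:
            data2.append(parts)

print(data2)
```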

Data frame

data2 is a list of lists. Convert it into a DataFrame with pandas and give it column names. While I was at it, I converted the date and time to the datetime type.

That's it. Thank you for your hard work. You can also export the result using to_csv.
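As a rough illustration of this last step, here is a minimal sketch that builds the DataFrame from a small hand-written data2 and writes it out as CSV. It writes to an in-memory buffer so that nothing is created on disk; the sample rows are made up.

```python
import io

import pandas as pd

data2 = [
    ["2019.12.23.15:16", "Name", "Hello"],
    ["2019.12.23.15:16", "Name", "[Stamp]"],
]

df = pd.DataFrame(data2, columns=["Datetime", "Name", "Content"])
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y.%m.%d.%H:%M")

# index=False leaves the row numbers out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a path instead of the buffer, for example df.to_csv("talk.csv", index=False), where the file name is just a placeholder.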

Finally

Can LINE's talk history be used as a natural language dataset? That question is what started this. Learning from human conversations, reading emotions from conversations ... I think there are many ways to use it.

This time I didn't use regular expressions to extract the rows. If this code doesn't work for you (for example, it picks up lines you don't want), try Python's re module.

I started working on this more than three months ago, but the format changed before I published this article. Everything I had written little by little went to waste, and I rewrote the code in tears. As a memorial, here is the previous code as well.

import pandas as pd

# Read txt file line by line
with open("line_--.txt", encoding="UTF-8") as f:
    line_data = f.readlines()

# Concatenate a continuation line to the previous element
def appending(lines, row):
    row = lines[-1] + row
    del lines[-1]
    lines.append(row)

data = []
# Read from the 4th line
for row in line_data[3:]:
    # Concatenate lines less than 9 characters
    if len(row) < 9:
        row = row[:-1]
        appending(data, row)
    # When less than 13 characters
    elif len(row) < 13:
        # Time + Name + Content
        if row[2] == ":" and row[5] == "\t":
            row = row[:-1]
            data.append(row)
        # When the hour is a single digit
        elif row[1] == ":" and row[4] == "\t":
            row = row[:-1]
            data.append(row)
        # Continuation line
        else:
            row = row[:-1]
            appending(data, row)
    # When 13 characters or more
    else:
        # Date
        if row[4] == "/" and row[-4] == "(" and row[-2] == ")":
            row = row[:-4]
            data.append(row)
        # Time + Name + Content
        elif row[2] == ":" and row[5] == "\t":
            row = row[:-1]
            data.append(row)
        # When the hour is a single digit
        elif row[1] == ":" and row[4] == "\t":
            row = row[:-1]
            data.append(row)
        # Continuation line
        else:
            row = row[:-1]
            appending(data, row)

data2 = []
for row in data:
    # Assign the date to the variable date
    if row[4] == "/":
        date = row
    # Concatenate date to time + name + content
    else:
        row = date + " " + row
        row = row.split("\t")
        if len(row) == 3:
            # Remove surrounding double quotes from the content
            # (startswith/endswith also work when the content is empty)
            if row[2].startswith('"') and row[2].endswith('"'):
                row[2] = row[2][1:-1]
            data2.append(row)

df = pd.DataFrame(data2, columns=["Datetime", "Name", "Content"])
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y/%m/%d %H:%M")
