[PYTHON] Is it possible to extract the person's profile information from the chat log?

Purpose

Environment and data

API for analysis

I don't have the knowledge to build this kind of analysis from scratch, so this time I used the COTOHA API released by NTT Communications.

Raw data

This time I use a chat log as the source data. In Japan, LINE is the mainstream chat app, and it has a chat log export function, so I analyze a chat log exported from LINE. I obtained the consent of the other person in advance.

The file was too big

We mostly chat about trivial things that barely touch on each other's profiles, but the exported file was still fairly large. Let's start by formatting it.

The sample below is not real data, but a LINE chat log export is structured like this.

Original file sample


2019/12/22 Sun
17:00 bowtin [Sticker]
17:01  hogehogekun [Sticker]
17:02 hogehogekun Let's eat ramen if you have free time today

2019/12/23 Mon
05:00 bowtin I'm sorry I slept
05:00  bowtin [Sticker]
08:35 hogehogekun do not forgive
   :
   :

First, I removed everything other than the message text itself, such as the date lines, timestamps, sender names, and sticker placeholders.

The result looks like this.

File after formatting


Let's eat ramen if you have free time today
do not forgive

Since each chat message now sits on its own line, the file is relatively easy to read. The formatted file had about 20,500 lines.
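
The formatting step can be sketched roughly as follows. This is only a minimal sketch, not the script actually used: it assumes each message line has the shape `HH:MM`, sender name, message, separated by whitespace, and that only one speaker's text messages are kept. The names `format_chat_log` and `speaker` are hypothetical.

```python
import re

# Assumed shape of a message line: "HH:MM <name> <message>"
MESSAGE_LINE = re.compile(r'^(\d{1,2}:\d{2})\s+(\S+)\s+(.*)$')


def format_chat_log(raw_lines, speaker):
    """Return only the message bodies written by `speaker` (hypothetical helper)."""
    messages = []
    for line in raw_lines:
        match = MESSAGE_LINE.match(line.strip())
        if match is None:
            continue  # date headers, blank lines, and anything else are skipped
        _, name, body = match.groups()
        if name != speaker or body == '[Sticker]':
            continue  # drop other speakers and sticker placeholders
        messages.append(body)
    return messages


sample = [
    '2019/12/22 Sun',
    '17:00 bowtin [Sticker]',
    '17:01  hogehogekun [Sticker]',
    "17:02 hogehogekun Let's eat ramen if you have free time today",
    '',
    '2019/12/23 Mon',
    "05:00 bowtin I'm sorry I slept",
    '08:35 hogehogekun do not forgive',
]
print(format_chat_log(sample, 'hogehogekun'))
```

A real export may contain messages that span several lines; this sketch would drop the continuation lines, so it would need extending for that case.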

Divide the file into about 500 lines

Even after formatting, the chat log was fairly large at about 20,500 lines. When I sent it to the API as-is, an error came back, so I split it into files of about 500 lines each. (I should have used glob ...)

filesplitter.py



with open(file=r'\path\to\file\sample_chatlog.txt', mode='r', encoding='utf-8') as old_file:
    lines = old_file.readlines()

# Write each run of 500 lines to its own file, named after the index of its first line
for i in range(0, len(lines), 500):
    # mode='w' so that re-running the script does not append duplicate lines
    with open(file=r'\path\to\file\splitted_file' + str(i) + '.txt', mode='w', encoding='utf-8') as new_file:
        new_file.writelines(lines[i:i + 500])

There is probably a better way to write this, but the goal was simply to split the file, so I'll go with it for now.
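
As for the glob idea mentioned above, one possible approach is to collect whatever split files actually exist instead of rebuilding their names from the numbering. The following is a sketch under assumptions: the helper name `list_split_files` is hypothetical, and the demo writes dummy files into a temporary directory so that it runs anywhere.

```python
import glob
import os
import re
import tempfile


def list_split_files(directory, prefix='splitted_file'):
    """Return the split files in `directory`, ordered by the number in their name."""
    paths = glob.glob(os.path.join(directory, prefix + '*.txt'))

    def number_of(path):
        found = re.search(r'(\d+)\.txt$', os.path.basename(path))
        return int(found.group(1)) if found else -1

    # Sort numerically; a plain string sort would put 1000 before 500
    return sorted(paths, key=number_of)


# Demo with dummy files created out of order
with tempfile.TemporaryDirectory() as demo_dir:
    for start in (1000, 0, 500):
        path = os.path.join(demo_dir, 'splitted_file' + str(start) + '.txt')
        with open(path, mode='w', encoding='utf-8') as f:
            f.write('dummy\n')
    names = [os.path.basename(p) for p in list_split_files(demo_dir)]
    print(names)
```

With this, the later processing loop would not need to know how many lines the original log had.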

User attribute estimation using COTOHA API

COTOHA API offers a variety of APIs, but this time I used "User attribute estimation". This API is still in beta (as of February 19, 2020).

Now, let's pass all the contents of the first file to the API.

Estimated result of the first file


{"civilstatus": "Unmarried", "earnings": "1M-3M", "gender": "Female", "hobby": ["INTERNET", "MUSIC", "PAINT", "TRAVEL", "TVGAME"], "moving": ["BUS", "WALKING"], "occupation": "College student"},

This is amazing. About 80% of it is correct.

Let's continue with the next file.

{"civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "GOURMET", "INTERNET", "MUSIC", "TRAVEL"], "location": "Kanto", "moving": ["RAILWAY", "WALKING"]},

This time I got a slightly different result. What is "earnings": "-1M"? Can annual income be negative?? Postscript: a commenter pointed out that this should probably be read as "0-1M" rather than "minus 1M". That may well be right! In other words, it likely means "less than 1M".
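
To make that reading concrete, here is a small sketch that splits an earnings label into a lower and an upper bound. It follows the commenter's interpretation, which is an assumption and not something confirmed by the API documentation, and it only considers the two label shapes seen in this article. The function name `parse_earnings` is hypothetical.

```python
def parse_earnings(label):
    """Split a range label such as "1M-3M" or "-1M" into (lower, upper).

    A missing bound is returned as None, so "-1M" becomes (None, "1M"),
    i.e. "up to 1M" rather than a negative amount.
    """
    lower, _, upper = label.partition('-')
    return (lower or None, upper or None)


print(parse_earnings('-1M'))    # (None, '1M')
print(parse_earnings('1M-3M'))  # ('1M', '3M')
```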

This response also included location information. Which attributes are returned seems to vary slightly depending on the input data.

After that, I simply passed the remaining files, about 40 in total, to the API. Each call returns a response like the ones above, so I stored all of them in a single file.

Here is the code I actually used.

main.py


import json
from time import sleep

import requests

# Basic settings for requests to the API
BASE_URL = 'https://api.ce-cotoha.com/hogehoge/'
CLIENT_ID = 'YOUR ID'
CLIENT_SECRET = 'YOUR SECRET'
TOKEN_SERVER_URL = 'https://api.ce-cotoha.com/hogehoge/'


# Get an API access token. (Access tokens seem to expire after a set period; apologies if I have that wrong.)
def authorization():
    payload = {
        'grantType': 'client_credentials',
        'clientId': CLIENT_ID,
        'clientSecret': CLIENT_SECRET
    }
    headers = {
        'content-type': 'application/json'
    }
    response = requests.post(TOKEN_SERVER_URL, data=json.dumps(payload), headers=headers)
    auth_info = response.json()

    return auth_info['access_token']


#A function that makes a request to the API (argument is a list of strings)
def make_request(original_string_list):
    headers = {
        'Content-Type': 'application/json;charset=UTF-8',
        'Authorization': 'Bearer ' + authorization()
    }

    payload = {
        'document': original_string_list,
        'type': 'kuzure'  # There seems to be a dedicated mode for informal, broken text such as chat logs
    }

    response = requests.post(BASE_URL, data=json.dumps(payload), headers=headers)

    jsonified_response = response.json()
    return jsonified_response['result']


if __name__ == '__main__':
    # Build the list of roughly 40 input file names (they follow the pattern: file name + number)
    file_list = ['splitted_file' + str(i) + '.txt' for i in range(0, 21000, 500)]

    # Take each file name in turn and read its contents
    for a_file in file_list:
        with open(file=(r'path\to\file' + '\\' + a_file), mode='r', encoding='utf-8') as file:
            lines = file.readlines()

        # Send the contents to the COTOHA API as-is and append the result to a file, one result per line
        with open(file=r'path\to\file\result.txt', mode='a+', encoding='utf-8') as file:
            file.write(json.dumps(make_request(lines)) + '\n')
        sleep(1)  # Wait 1 second between requests so the API is not flooded
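
Because make_request() calls authorization() every time, each file triggers a fresh token request. One possible refinement is to reuse the token until it is likely to have expired. The sketch below is not part of the original script: the class name `TokenCache` and the 60-second lifetime are hypothetical, and the real token lifetime should be checked against the API documentation. The demo uses a fake fetcher and a fake clock, so it needs no network access.

```python
import time


class TokenCache:
    """Reuse a token until an assumed lifetime has passed (hypothetical helper)."""

    def __init__(self, fetch_token, lifetime_seconds, clock=time.monotonic):
        self._fetch_token = fetch_token    # e.g. the authorization() function
        self._lifetime = lifetime_seconds  # assumption: not taken from the API docs
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._lifetime
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token


# Demo with a fake fetcher and a fake clock
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return 'token-' + str(len(calls))


cache = TokenCache(fake_fetch, lifetime_seconds=60, clock=lambda: fake_now[0])
first = cache.get()
second = cache.get()  # still fresh: no second fetch
fake_now[0] = 61.0
third = cache.get()   # lifetime has passed: fetched again
print(first, second, third)
```

In the script, the 'Authorization' header could then be built from `cache.get()` instead of calling authorization() directly.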

Result

Here is an excerpt from the list of results. Even for the same attribute, such as age, the estimates vary somewhat from file to file.

User attribute extraction result (partial excerpt).py


[
{"age": "20-29-year-old", "civilstatus": "Unmarried", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "INTERNET", "TVCOMMEDY"], "moving": ["OTHER", "WALKING"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "gender": "Female", "hobby": ["GOURMET", "INTERNET", "SMARTPHONE_GAME", "PAINT", "TVGAME"], "moving": ["CYCLING", "OTHER", "RAILWAY", "WALKING"]},
{"civilstatus": "Unmarried", "earnings": "1M-3M", "gender": "Female", "hobby": ["COOKING", "GOURMET", "INTERNET", "SHOPPING", "TRAVEL"], "moving": ["CYCLING"]},
{"age": "30-39 years old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "GOURMET", "PAINT"], "location": "Tokai", "moving": ["RAILWAY"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "DRAMA", "GYM", "SMARTPHONE_GAME", "MUSIC", "TVGAME"], "moving": ["RAILWAY", "WALKING"]}
]

It is hard to draw a conclusion from the raw list, so let's tally the results. I'll hard-code the results above as a list of Python dicts.

parse_result.py


dict_array = [
  {"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "DRAMA", "GYM", "SMARTPHONE_GAME", "MUSIC", "TVGAME"], "moving": ["RAILWAY", "WALKING"]},
  {"age": "30-39 years old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]}
  #The following is omitted
]


from collections import Counter
import itertools

# Counter.most_common() sorts by frequency, so its first entry is the mode
for key in ('age', 'location', 'gender', 'civilstatus', 'earnings'):
    print(Counter(data[key] for data in dict_array if key in data).most_common()[0][0])

# hobby holds a list per record, so flatten everything into one sequence before taking the mode
print(Counter(itertools.chain.from_iterable(data['hobby'] for data in dict_array if 'hobby' in data)).most_common()[0][0])

The modes come out as follows.

Mode


20-29-year-old
Kanto
Female
Unmarried
1M-3M
INTERNET
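
The mode alone hides how close the vote was. As a complement, the full counts per attribute can be tallied as in the sketch below. The helper name `tally` is hypothetical, and the two records are copied from the excerpt above and trimmed to a few keys.

```python
from collections import Counter


def tally(records, key):
    """Count every value of `key`, flattening list values such as "hobby"."""
    counter = Counter()
    for record in records:
        if key not in record:
            continue  # some responses do not contain this attribute
        value = record[key]
        counter.update(value if isinstance(value, list) else [value])
    return counter


# Two records from the excerpt, trimmed to a few keys
records = [
    {"age": "20-29-year-old", "earnings": "-1M", "hobby": ["COOKING", "FORTUNE", "GOURMET", "PAINT"], "location": "Tokai"},
    {"age": "30-39 years old", "earnings": "-1M", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]},
]
for key in ("age", "earnings", "location", "hobby"):
    print(key, tally(records, key).most_common(3))
```

Looking at the top few entries rather than only the first makes it easier to see when two values are nearly tied.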

Consideration

I think the accuracy is quite high. As far as I remember, we never discussed whether this person is married, and of course I never asked anything as blunt as "Are you a woman?" We may have touched on annual income a little.

By the way, I also tried this briefly on chats with other people, and the results were mostly correct.

In the future, it may be possible to infer a person's true profile to some extent from chat logs on, say, a matching app! By looking at deviations in the estimates, it might even be possible to tell when someone switches between different personas.

In conclusion, people reveal more of their profile in chat than you might expect. That said, I think it would be hard to profile people who maintain a thorough persona, such as VTubers and so-called nekama (men who pose as women online).
