An Introduction to a Python-Made Girlfriend ~ Tinder Automation Project ~ Episode 5

Table of contents

| Episode | What I did | Main events |
| --- | --- | --- |
| Episode 1 | Automatic right swipes | |
| Episode 2 | Automatic message sending | Matched a woman |
| Episode 3 | Turned the code into a library | Exchanged LINE with a matched woman |
| Episode 3.5 | Re-acquiring the access token | Tokens could no longer be obtained with the previous code |
| Episode 4 | Data collection | LINE replies stopped coming |
| Episode 5 | Data analysis: profile text | People I became friendly with tried to sell me information products |
| Episode 6 | Data analysis: images | A real-life female acquaintance calls me late at night (?) |

The code can be viewed on [GitHub].

The story so far

Recent situation

I was busy preparing for a conference, and before I knew it, more than two months had passed since the last article. The crawler has been running the whole time, though, so I have accumulated a lot of data since last time. As usual, I still don't have a girlfriend.

Data analysis

A lot of data has been collected. I swiped on 10,632 women, and 72 of them matched. That is fewer matches than I expected. Last time I saved the table data to a spreadsheet and the image data to Google Drive, so I started by downloading them. When downloading a spreadsheet you can choose among several file formats, but with csv or tsv, line breaks in the profile text and the commas foreigners often write in their profiles break the parsing, which is a pain. So I saved it in .xlsx format. There were also about 25,000 profile images, so the download took a long time. The analysis is done in a Jupyter notebook.

Note that this is an analysis of data collected with my own profile, so if you run it yourself, your results may differ.[^1]

Take a look at the data

First, let's look at the data of the matched people.

analytics.py


import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter


# Load the collected table data and index it by user id
filePath = "data/tinder.xlsx"
df = pd.read_excel(filePath)
df.set_index("id", inplace=True)

# People who matched
match = df[df["match"] == 1]

Hmm, are there people in here I don't even recognize??

Next is the data of the people who did not match.

analytics.py


unmatch = df[df["match"]==0]

At a glance, it looks like the people who matched write their profile text more carefully. Let's check. It is difficult to define a "well-written profile", but for now, let's simply look at the number of characters in the profile text.

analytics.py


%matplotlib inline
# Distribution of bio character counts: blue = no match, red = match
sns.distplot(unmatch["bio"].apply(lambda w: len(str(w))), color="b", bins=30)
sns.distplot(match["bio"].apply(lambda w: len(str(w))), color="r", bins=30)

The result is as follows. Red is those who matched, blue is those who did not.

bio-length.png

Sure enough, red has fewer bios near zero characters than blue. In fact, among the people who did not match there seem to be many accounts whose profile does not contain a single character. Accounts with a blank profile are unlikely to match even if you swipe right on them, so it seems better not to swipe at all.
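As a rough check on that claim, here is a minimal sketch (my addition) that computes the share of empty profiles in each group. It assumes empty bios were saved as blank cells, which pandas reads back as NaN and str() renders as "nan":

# Share of empty bios per group (assumption: empty bios come back from
# read_excel as NaN, which str() turns into the string "nan")
def empty_ratio(frame):
    bios = frame["bio"].apply(lambda w: str(w))
    return ((bios == "nan") | (bios.str.len() == 0)).mean()

print("matched:  ", empty_ratio(match))
print("unmatched:", empty_ratio(unmatch))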

Looking at the profile text in detail

Morphological analysis

First, let's look at the words contained in the profile text. Morphological analysis is performed on the bios using the morphological analysis engine MeCab [1] together with the extended dictionary mecab-ipadic-NEologd [2].

Installation

You can install MeCab with `pip install mecab-python3`. For installing mecab-ipadic-NEologd, the official README [3] is very well organized, so refer to that. There are various options to choose from, but if you just want it done quickly:

$git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git ~/neologd
$echo yes | ~/neologd/bin/install-mecab-ipadic-neologd -n -a

These two commands complete the installation.

Word segmentation

Call MeCab from Python to split the profile text word by word. Invoking MeCab plainly uses the standard dictionary, so specify NEologd as an option. The location of the dictionary can be obtained with echo `mecab-config --dicdir`"/mecab-ipadic-neologd".

mecab.py


import subprocess
import MeCab

# Locate the NEologd dictionary and pass it to MeCab via the -d option
cmd = 'echo `mecab-config --dicdir`"/mecab-ipadic-neologd"'
path = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        shell=True).communicate()[0].decode('utf-8').strip()
m = MeCab.Tagger("-d {0}".format(path))

# Thanks to NEologd, "ペンパイナッポーアッポーペン" (Pen-Pineapple-Apple-Pen) and
# "恋ダンス" (koi dance) are each recognized as a single proper noun.
print(m.parse("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。"))
#>> (output abridged; MeCab prints the full Japanese part-of-speech details)
# 彼女	noun, pronoun (she)
# は	particle
# ペンパイナッポーアッポーペン	noun, proper noun
# と	particle
# 恋ダンス	noun, proper noun
# を	particle
# 踊っ	verb (danced)
# た	auxiliary verb
# 。	symbol (period)
# EOS

Words often used by people who matched

Using this, let's extract the words contained in the bios of the matched and unmatched groups.

analytics.py


def getWord(df):
    retval = []
    for bio in df.bio:
        parse = m.parse(str(bio)).strip().split("\n")
        for p in parse:
            if "\t" not in p:
                continue
            word, desc = p.split("\t")
            # Keep content words; the intent was to exclude particles (助詞)
            # and auxiliary verbs (助動詞). Note MeCab outputs Japanese POS names.
            if desc.split(",")[0] in ("名詞", "動詞", "形容詞", "形容動詞", "連体詞", "副詞", "接続詞", "感動詞", "記号"):
                retval.append(word)
    return retval

bio_match = getWord(match)
bio_unmatch = getWord(unmatch)

Let's display the extracted words in order of frequency. First, the people who matched.

analytics.py


df_bio_match = pd.DataFrame.from_dict(
    Counter(bio_match), orient="index").reset_index().rename(columns={"index":"word",0:"count"})
sns.barplot(data=df_bio_match.sort_values(
    "count", ascending=False)[:20], x="word", y="count")
plt.xticks(rotation="vertical")

bio-match.png

The "tofu" (□) glyphs in the plot are some kind of blank character. Maybe a thin space? I also wondered who on earth was using ∇ (nabla); checking revealed it comes from emoticons like (・∇・). Next, the people who did not match.
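If you want to pin down exactly what the mystery character is, a small sketch (my addition) using Python's standard unicodedata module prints the codepoint and Unicode name of every non-ASCII character among the top words:

import unicodedata

# Identify "tofu" glyphs: print codepoint and Unicode name of the
# non-ASCII characters appearing in the 20 most frequent words
for word, _ in Counter(bio_match).most_common(20):
    for ch in word:
        if not ch.isascii():
            print(repr(ch), hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))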

analytics.py


df_bio_unmatch = pd.DataFrame.from_dict(
    Counter(bio_unmatch), orient="index").reset_index().rename(columns={"index":"word",0:"count"})
sns.barplot(data=df_bio_unmatch.sort_values(
    "count", ascending=False)[:20], x="word", y="count")
plt.xticks(rotation="vertical")

bio-unmatch.png

Is there a difference in tendency? Hard to say... For example, the people who matched tend not to put punctuation marks in their text. Also, while "like" written in kanji (好き) appears both among matches and non-matches, not a single person who wrote it in hiragana (すき) matched. ~~Is it a landmine?~~ Honestly, the number of matches is so small that this is probably within the margin of error, but it may be worth remembering.
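Out of curiosity, a quick sketch (my addition) that counts how many bios in each group contain 好き (kanji) versus すき (hiragana):

# Count bios containing "like" in kanji vs. hiragana, per group
for token in ("好き", "すき"):
    n_match = sum(token in str(b) for b in match.bio)
    n_unmatch = sum(token in str(b) for b in unmatch.bio)
    print(token, "- matched:", n_match, ", unmatched:", n_unmatch)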

Sentence vectorization

Finally, let's vectorize the profile text using Doc2Vec. A few years ago, Word2Vec, a neural network that turns words into vectors, became a big topic in the NLP area; Doc2Vec is an algorithm that applies the same idea to whole sentences instead of single words. For an explanation of Word2Vec, [4] was helpful; for Doc2Vec, [5] and [6]. The implementation uses a library called gensim [7]; install it with `pip install gensim`. For the concrete code, I referred to [8].

analytics.py


from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Split the data into training data and test data
df_train, df_test = train_test_split(df, random_state=8888)

# Split profile text into words with MeCab.
# The -Owakati option outputs words separated by spaces, without part-of-speech info.
m_wakati = MeCab.Tagger("-d {0} -Owakati".format(path))
bios = []
for bio in df_train.bio:
    bios.append(m_wakati.parse(str(bio)).strip())

# Convert the data into a format gensim can process
trainings = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(bios)]

# Train Doc2Vec
doc2vec = Doc2Vec(documents=trainings, dm=1, vector_size=300, window=4, min_count=3, workers=4)

# Vectors for the training data
X_train = np.array([doc2vec.docvecs[i] for i in range(df_train.shape[0])])

# Labels for the training data
y_train = df_train["match"]

# Vectors and labels for the test data
# (note: use the wakati tagger here, not the part-of-speech tagger)
X_test = np.array([doc2vec.infer_vector(m_wakati.parse(str(bio)).strip().split(" ")) for bio in df_test.bio])
y_test = df_test["match"]
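As a quick sanity check (my addition; assumes gensim 3.x, where docvecs.most_similar accepts a raw vector), you can re-infer the vector of a training document and confirm that its own tag ranks among the nearest neighbours:

# Re-infer the first training document's vector; its own tag (0) should
# appear near the top if training went reasonably well
vec = doc2vec.infer_vector(bios[0].split())
print(doc2vec.docvecs.most_similar([vec], topn=3))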

Let's visualize the vectorized text using PCA. First, the training data.

analytics.py


from sklearn.decomposition import PCA

pca = PCA()
X_reduced = pca.fit_transform(X_train)

plt.scatter(X_reduced[y_train==0][:,0], X_reduced[y_train==0][:,1], c="b", label="No Match")
plt.scatter(X_reduced[y_train==1][:,0], X_reduced[y_train==1][:,1], c="r", label="Match")
plt.legend()

pca_train.png

Among those who did not match, a certain number of people have a large second principal component, while for those who matched, the second principal component generally sits around 0. Let's also look at the test data.

analytics.py


X_test_reduced = pca.transform(X_test)

plt.scatter(X_test_reduced[y_test==0][:,0], X_test_reduced[y_test==0][:,1], c="b", label="No Match")
plt.scatter(X_test_reduced[y_test==1][:,0], X_test_reduced[y_test==1][:,1], c="r", label="Match")
plt.legend()

pca_test.png

Here too, the second principal component of the matched people is clustered near 0.
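One caveat worth checking (my addition): with 300-dimensional vectors, the first two principal components may capture only a small fraction of the total variance, so the 2D picture should not be over-interpreted. This can be inspected directly:

# Fraction of total variance captured by the first two components
print(pca.explained_variance_ratio_[:2])
print(pca.explained_variance_ratio_[:2].sum())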

Machine learning

Classification using SVM

Now that the sentences are vectorized, let's classify them with machine learning: a support vector machine on the profile vectors. The data this time is extremely imbalanced, so a classifier trained naively will draw a decision boundary that judges every profile as "no match". That is useless. Stepping back to what I actually wanted to do here: the goal was never high machine-learning accuracy, the goal was a girlfriend. Looking at each cell of the confusion matrix:

| | Description | Remarks |
| --- | --- | --- |
| TP | Predicts a match for someone who actually matches | This is what we are looking for |
| TN | Predicts no match for someone who does not actually match | Saves unnecessary right swipes |
| FP | Predicts a match for someone who does not actually match | One right swipe wasted |
| FN | Predicts no match for someone who actually matches | I miss my soulmate |

Obviously FN is the worst case and must be avoided at all costs. FP is also undesirable, but a few of them do little harm. So this task calls for recall that is as high as possible, while low precision and a low F-measure are acceptable. Of course, predicting "match" for every single case buys perfect recall at the price of destroying precision and the F-measure [^2], and avoiding exactly that is why machine learning was introduced; if recall drops, the whole exercise defeats its own purpose. So this time the strategy is to estimate the probability of matching with a regressor and set a fairly low threshold, weeding out only the profiles that are obviously hopeless. Anything merely suspicious still gets a right swipe. AUC is used as the evaluation metric.

analytics.py


from sklearn.svm import SVR
from sklearn.metrics import roc_auc_score

# A support vector *regressor* gives a continuous match score that can
# be thresholded later, instead of a hard 0/1 decision
model = SVR(C=100.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(roc_auc_score(y_test, y_pred))
#>>0.6196

AUC over 0.6! Not a bad result, is it? The specific threshold will be set after the image analysis is done. This article has gotten long, so that is it for today. Look forward to the next episode on profile images.
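For reference, a minimal sketch (my addition; the candidate thresholds are placeholders) of how recall could be inspected when the time comes to pick a threshold:

from sklearn.metrics import recall_score

# Recall at a few low candidate thresholds: the aim is to filter out
# only clearly hopeless profiles while keeping recall near 1
for thr in (0.01, 0.05, 0.10):
    y_hat = (y_pred >= thr).astype(int)
    print(thr, "recall:", recall_score(y_test, y_hat))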

Episode 6 is [here].

References

[1] https://taku910.github.io/mecab/
[2] https://github.com/neologd/mecab-ipadic-neologd
[3] https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md
[4] Yasuki Saito, "Deep Learning from Scratch 2: Natural Language Processing"
[5] https://kitayamalab.wordpress.com/2016/12/10/doc2vecparagraph-vector-algorithm/
[6] https://deepage.net/machine_learning/2017/01/08/doc2vec.html
[7] https://radimrehurek.com/gensim/index.html
[8] https://qiita.com/asian373asian/items/1be1bec7f2297b8326cf

[^1]: I would like to verify how the results would differ if the same experiment were run with a handsome guy's profile instead of mine. Then again, maybe I don't want to see that.
[^2]: And that is exactly what the swipe strategy used so far, the all-right swipe, achieved.
