[Python] I collected stats on popular users of a dating app and tried to build a machine learning model

Preface

Hello everyone.

Do you use dating apps? Things are actually going well with someone I matched with recently.

Incidentally, the app I was using lets you browse the profiles of other popular members. (It seems to show people who have received 100 or more likes.)

I was disappointed to see it.

"I didn't even get 100 likes..." **"I want to be a 100+ likes man too."**

That's what I found myself thinking, strongly.

And at the same time I wondered: how does one become a "100+ likes man"? With that in mind, I analyzed the data.

Data collection

I steadily entered the other members' data by hand (making heavy use of Google Docs transcription) and collected about 60 records.

The other members displayed were those close to my own age of 32. From here on, assume all the data is from men around 30.

Analysis

I analyzed the steadily collected data using Python libraries.

Features

The following selected items were used as input features:

- Number of likes
- Face (binary: whether the photo shows a face)

**Now let's look at the relationship between the number of likes and the features you're probably curious about.**

Annual income

Annual income, right out of the gate! In the end it's always annual income! Damn it! (I'm studying data science because I want a higher income. 10 million yen, please.)

So let's draw a scatter plot.


import matplotlib.pyplot as plt
plt.scatter(data['annual income'], data['Number of likes'], alpha=0.3)
# data is a DataFrame

(Scatter plot: vertical axis is the number of likes, horizontal axis is annual income)

**[Discussion]** It feels like being told: **"A man without at least 5 million yen isn't a man. A monthly income of 140,000 yen is out of the question."**

Surprisingly, though, there is hardly any correlation between annual income and the number of likes (earning more does not mean more likes).

Let's actually compute the correlation coefficient...


import pandas as pd
pd.DataFrame({"x": data['annual income'], "y": data['Number of likes']}).corr()

Annual income × likes correlation coefficient


         x        y
x  1.00000 -0.06363
y -0.06363  1.00000

It can be said that there is almost no correlation.

People who get a lot of likes probably have something going for them besides annual income. (*Though note this pool is limited to people earning 5 million yen or more.)

Which other features are involved? Let's take a look.

Educational background

The options for educational background were as follows: Junior college / Vocational school / Technical college graduate | High school graduate | University graduate | Graduate school graduate | Other

String values are awkward to work with, so


data['Educational background'] = data['Educational background'].replace({'Junior college/Vocational school/College graduate': 0, 'High school graduate': 1, 'University graduate': 2, 'Graduate school graduate': 3, 'Other': 4})

I label-encoded it as above.

Now let's draw a scatter plot.


plt.scatter(data['Educational background'], data['Number of likes'], alpha=0.3)

(Scatter plot: educational background vs. number of likes. Sorry, I didn't adjust the scale.)

The x-axis encoding is: Junior college / vocational school / technical college graduate = 0, High school graduate = 1, University graduate = 2, Graduate school graduate = 3, Other = 4.

**As expected, almost everyone is a university graduate or above...**

Here, instead of the correlation coefficient, we compute the correlation ratio, since this is a relationship between a quantitative variable and a qualitative one. Educational background can't be treated as a number (it's qualitative: we don't know how big the gap between a university graduate and a graduate-school graduate really is), so let's check whether the number of likes is biased across the categories.

A function I found somewhere:



import numpy as np

def corr_ratio(x, y):
  # Correlation ratio (eta squared): between-class variation / total variation
  total = ((x - x.mean()) ** 2).sum()
  between = sum(
      x[y == i].size * (x[y == i].mean() - x.mean()) ** 2
      for i in np.unique(y)
  )
  return between / total

# Calculate the correlation ratio
corr_ratio(data.loc[:, ["Number of likes"]].values, data.loc[:, ['Educational background']].values)

result


# 0.8820459777290447

There seems to be some correlation.

**[Discussion]** It feels like being told: **"At the very least, finish university."**

Height

Will you forgive me even if I'm short? I'm begging you here, so please don't go the way annual income did...

Let's take a look.

# Plot, excluding rows where height is NaN
plt.scatter(data.loc[data["height"].notnull()]["height"], data.loc[data["height"].notnull()]['Number of likes'], alpha=0.3)

(Scatter plot: height vs. number of likes)

...?

This makes the height distribution hard to see, so let's draw a histogram instead.

plt.hist(data["height"].astype(np.float32))

(Histogram of height, rounded to the nearest 5 cm)

For reference, the average height of Japanese men is about 170 cm... **A harsh world.**

Let's get the correlation coefficient here as well.

pd.DataFrame({"x":data['height'].astype(np.float32), "y":data['Number of likes']}).corr()

Height × likes correlation coefficient


          x         y
x  1.000000  0.073241
y  0.073241  1.000000

There is almost no correlation here either.

**[Discussion]** It feels like being told: **"I won't ask for 180 cm, but I'd like 175 cm."**

I'm 171 cm... ~~oh, come on~~

Body type

The body type options, encoded the same way, are: Slim = 0, Slightly thin = 1, Normal = 2, Muscular = 3, Slightly chubby = 4, Chubby = 5.

Let's plot after applying that replacement.

# Plot, excluding rows where body type is NaN
plt.scatter(data.loc[data["Body type"].notnull()]["Body type"], data.loc[data["Body type"].notnull()]['Number of likes'], alpha=0.3)

(Scatter plot: body type vs. number of likes. Again, sorry about the scale.)

Hmm. It looks like a normal distribution. Or rather, most people are 'Normal' and the other categories are sparse, so there doesn't appear to be much bias.

[2019/11/12 postscript-] Let's look at it again with a histogram.

plt.hist(data_original['Body type'].astype(np.float32))

(Histogram of body type. Scale, again, not adjusted.)

'Normal' is the most common, followed by 'Muscular'. [--2019/11/12 postscript]

Next, let's get the correlation ratio.


# Calculate the correlation ratio
corr_ratio(data.loc[:, ["Number of likes"]].values, data.loc[:, ['Body type']].values)

result

0.9457908220700801

That's quite a high number. Honestly, with so little data the reliability is questionable, but it seems the best-represented categories, **Normal and Muscular, are good.**

**[Discussion]** **Normal or muscular.** I'll aim for a normal body type.

Stopping here...

I think we've now seen most of the things you were curious about. Next, I'll try building a like-count predictor with machine learning.

Building a like-count predictor

Preprocessing

I label-encoded the data in order to run a regression, but for most items the assigned numbers say nothing about importance. For example, just because 'Other' was replaced with 4 earlier doesn't mean it's "better" than 'Graduate school graduate', which got 3.

So this time I used frequency encoding. I won't explain it in detail here, but the idea is: the more frequent a category, the larger its value. (This seems fairly reasonable under the hypothesis that there are options the highly-liked members consistently choose.)


def labeling(data):
  freq_encoding = {}
  for column in data.columns:
    # Skip likes, since it is the target variable; height and annual income are standardized later
    if column not in ['Number of likes', 'height', 'annual income']:
      # Size of each category
      counts = data.groupby(column).size()
      # Frequency of each category
      freq_encoding[column] = counts / len(data)
      data[column] = data[column].map(freq_encoding[column])
      freq_encoding[column] = freq_encoding[column].to_dict()
  return freq_encoding, data

# freq_encoding is kept so the same frequencies can be reused when encoding new data to predict on
freq_encoding, data = labeling(data)

Height and annual income were standardized.


def normalize(data):
  # Standardization
  # Height: the national average is said to be 171.5 cm with a standard deviation of 5.8
  data['height'] = (data['height'] - 171.5) / 5.8
  # No published standard deviation for annual income, so use the sample's own mean and std
  data['annual income'] = (data['annual income'] - data['annual income'].mean()) / data['annual income'].std()
  return data

data = normalize(data)

On top of that, let's look at the correlation coefficient between each feature and the number of likes. In this context, a correlation means there are options that the popular men all tend to choose.

(Heatmap of the correlation matrix between all features)

How does it look? Check the Likes column (the first column).

There is some correlation for:

- Background (educational background)
- WantAKids (whether they want children)
- Sociality
- Alcohol (drinking)

(The closer the value is to 1, the stronger the correlation.)

In short, imitate the popular men on these points and you may well get more likes!
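The correlation check above can be reproduced with pandas' `DataFrame.corr()`. Here's a minimal sketch on synthetic data; the column names (`Likes`, `Background`, `Sociality`, `Alcohol`) mirror the labels mentioned above, but the values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the member data (names match the article, values are random)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Likes': rng.integers(0, 300, 60),
    'Background': rng.random(60),
    'Sociality': rng.random(60),
    'Alcohol': rng.random(60),
})

# Pearson correlation of every feature against every other;
# the 'Likes' column shows each feature's correlation with the number of likes
corr = demo.corr()
likes_corr = corr['Likes'].sort_values(ascending=False)
print(likes_corr)
```

On the real DataFrame, sorting that column makes it easy to spot which options the popular members share.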

Feature selection

This skips ahead a bit, but as I removed various features, accuracy turned out best when the following were excluded:

- Presence or absence of a face photo
- Body type
- Annual income

Simply cutting features with a low correlation ratio / correlation coefficient wasn't enough, so it seems there's no way around actually trying feature subsets.

It's quite surprising that body type and annual income have nothing to do with the number of likes. (But don't forget: **the annual incomes here are all 5 million yen or more**!)
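The "actually try reducing the features" loop can be sketched like this: drop one feature at a time, retrain, and compare holdout R². Everything below (the data, the column names, the SVR settings) is illustrative, not the article's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Made-up data standing in for the real member DataFrame
rng = np.random.default_rng(42)
n = 60
demo = pd.DataFrame({
    'face': rng.integers(0, 2, n).astype(float),
    'Body type': rng.random(n),
    'annual income': rng.random(n),
    'Educational background': rng.random(n),
})
# Synthetic target: driven mostly by educational background
y = 50 + 100 * demo['Educational background'] + rng.normal(0, 5, n)

# Drop each feature in turn and record the holdout R^2 without it
scores = {}
for dropped in demo.columns:
    X = demo.drop(columns=dropped)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    scores[dropped] = SVR(kernel='rbf', C=100).fit(X_tr, y_tr).score(X_te, y_te)

for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'without {name}: R^2 = {r2:.3f}')
```

A feature whose removal leaves the score unchanged (or improves it) is a candidate for exclusion, which is essentially what happened with face, body type, and annual income.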

[2019/11/14 postscript-] 【Apology】 About the presence or absence of a face: the flag I actually recorded was "is a face visible in the first photo?" (I forgot that multiple photos can be registered.) So this does not mean faces don't matter. [--2019/11/14 postscript]

Model selection

This time I skipped deep learning, since I'm studying (classical) machine learning. Among linear regression (plain, Lasso, Ridge), decision-tree regression, and SVR, SVR performed best.

Also, although the holdout method is said to be a poor fit for a dataset this small, I got an accuracy of about 83%.

Overfitting is a real possibility, but I can't spend forever on this, so let's proceed at this accuracy.

[2019/11/12 postscript-] By the way, the train/test split is about 8:2. [--2019/11/12 postscript]
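The model comparison described here (plain/Lasso/Ridge linear regression, a decision tree, and SVR, evaluated with an 8:2 holdout) might be set up roughly like this. This is a generic sketch on synthetic data, not the author's actual code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# Synthetic features/target standing in for the ~60-row member dataset
rng = np.random.default_rng(7)
X = rng.random((60, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 5.0]) + rng.normal(0, 0.3, 60)

# 8:2 holdout split, as in the article
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'LinearRegression': LinearRegression(),
    'Lasso': Lasso(alpha=0.01),
    'Ridge': Ridge(alpha=1.0),
    'DecisionTree': DecisionTreeRegressor(max_depth=3, random_state=0),
    'SVR (rbf)': SVR(kernel='rbf', C=10),
}
# R^2 on the held-out 20% for each candidate model
results = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, r2 in results.items():
    print(f'{name}: R^2 = {r2:.3f}')
```

On real data this small, the winner can easily change with the random split, which is why the article hedges about the holdout method.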

I tried to predict my number of likes

I quit the app after less than a month, and my like count was about 80. (Sigh...)

I tried to see if I could predict my data correctly.

My data is below.


my_df = pd.DataFrame({
  'Number of likes': 80.0,
  'face': 'Yes',
  'blood type': 'O type',
  'Brothers and sisters': 'Eldest son',
  'Educational background': 'University graduate',
  'school name': 'None',
  'Occupation': 'IT related', # 2019/11/11: fixed; I had left this as 'company employee' from an experiment
  'annual income': '***',  # annual income doesn't matter anyway, so it's a secret
  'height': '171',
  'Body type': 'Normal',
  'Marriage history': 'Single (unmarried)',
  'Willingness to marry': 'I want to marry if I meet the right person',
  'Do you want children': "Don't know",
  'Housework / childcare': 'I want to participate actively',
  'Hope until we meet': 'I want to meet if we hit it off',
  'First date cost': 'Men pay everything',
  'Sociability': 'I like small groups',
  'Housemate': 'Living alone',
  'holiday': 'Saturday and Sunday',
  'sake': 'Drinks',
  'tobacco': 'Does not smoke',
  'name_alpha': 0
}, index=[0])

## Label encoding and frequency encoding are applied below...

# Drop the number of likes and convert to numpy
X = my_df.iloc[:, 1:].values

# Predict!!!
print(model_svr_rbf1.predict(X))

result

[73.22405579]

That's not far off at all!!!! (Machine learning is amazing.)

(2019/11/11: I forgot I had set the occupation to 'company employee' as an experiment (result: about 64 likes), so I corrected it to the 'IT related' I actually entered at the time. Does 'IT related' make a better impression!?)

Incidentally, when I added my own record to the training data afterwards, overall accuracy improved (about 83% → 86%), so there's a good chance the likes I got were a perfectly reasonable number (tears).

Playing with modified data

What if I were a graduate-school graduate?

...
  'Educational background': 'Graduate school graduate', # changed from 'University graduate'
...

#result
[207.56731856]

A terrifying educational background.

What if I were 180 cm tall?

...
  'height': '180', # changed from '171'
...

#result
[164.67592949]

Terrifying height.

Finally

The predictor ended up with a rough-and-ready accuracy of about 86%. Does that mean the number of likes really does change depending on your profile choices?

And annual income (at least within the 5-million-plus range) and body type (and I'm a normal body type anyway) turned out to be unrelated to likes. In other words, there may have been options I didn't choose that would have made the difference.

Going by these results, annual income doesn't matter, so I'll devote my future efforts to getting taller.
