[PYTHON] A programming beginner tries simple data analysis for the first time

Excuses for not updating

The update has been delayed. There are many reasons (or excuses), but mostly it comes down to the fact that I just wasn't writing much code. At the beginning of the year the volume of lessons and assignments settled down a bit, and at the same time I was going back and forth in my head about my graduation project and my career path after graduation, so I didn't make much progress. I did keep updating my blog every day, but honestly there just wasn't enough material worth posting on Qiita. I'm still a little lost, but for now my graduation project has been decided and I'm working hard on it.

I learned a little Python in class

During the class, I played around with some data in the cloud. I also learned about an amazing site called Kaggle. This post is my first attempt at pulling data from Kaggle myself after class and poking around in it.

Get data from Kaggle

https://www.kaggle.com/unsdsn/world-happiness#2019.csv I chose this dataset because it looked like I could find various correlations in it.

Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

Upload the required CSV to Google Drive in advance

Import the libraries I might need and load the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Read the CSV uploaded to Google Drive
df = pd.read_csv("/content/drive/My Drive/2019.csv")

I was impressed by how many other libraries there are besides these!

Check the number of records and whether there are any missing values

df.count()
Overall rank 156
Country or region 156
Score 156
GDP per capita 156
Social support 156
Healthy life expectancy 156
Freedom to make life choices 156
Generosity 156
Perceptions of corruption 156

156 records, no missing values. If I don't check this first, I don't know whether to display all the data or just the first few rows, so I tried it. I also wanted to avoid data with a lot of missing values, because handling that seemed confusing. (I think I'll need to take on that challenge eventually, but since this is my first time, this will do for now.)
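Incidentally, a minimal extra sketch like the one below (my own addition, not something covered in class) should also show the number of missing values per column:

#Count missing values per column (should be all zeros for this dataset)
print(df.isnull().sum())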

Try displaying only the first 20 rows

df.head(20)
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
6 Switzerland 7.480 1.452 1.526 1.052 0.572 0.263 0.343
7 Sweden 7.343 1.387 1.487 1.009 0.574 0.267 0.373
8 New Zealand 7.307 1.303 1.557 1.026 0.585 0.330 0.380
9 Canada 7.278 1.365 1.505 1.039 0.584 0.285 0.308
10 Austria 7.246 1.376 1.475 1.016 0.532 0.244 0.226
11 Australia 7.228 1.372 1.548 1.036 0.557 0.332 0.290
12 Costa Rica 7.167 1.034 1.441 0.963 0.558 0.144
13 Israel 7.139 1.276 1.455 1.029 0.371 0.261 0.082
14 Luxembourg 7.090 1.609 1.479 1.012 0.526 0.194 0.316
15 United Kingdom 7.054 1.333 1.538 0.996 0.450 0.348 0.278
16 Ireland 7.021 1.499 1.553 0.999 0.516 0.298 0.310
17 Germany 6.985 1.373 1.454 0.987 0.473 0.160 0.210
19 United States 6.892 1.433 1.457 0.874 0.454 0.280
20 Czech Republic 6.852 1.269 1.487 0.920 0.457 0.046 0.036

Japan isn't in the top 20
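As a small side check of my own (not part of the original steps), something like this should pull up Japan's row by country name:

#Look up Japan's row by country name (extra check, my own addition)
print(df[df["Country or region"] == "Japan"])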

Roughly display the summary statistics that seem useful

df.describe()
Overall rank Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
count 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000
mean 78.500000 5.407096 0.905147 1.208814 0.725244 0.392571 0.184846
std 45.177428 1.113120 0.398389 0.299191 0.242124 0.143289 0.095254
min 1.000000 2.853000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 39.750000 4.544500 0.602750 1.055750 0.547750 0.308000 0.108750
50% 78.500000 5.379500 0.960000 1.271500 0.789000 0.417000 0.177500
75% 117.250000 6.184500 1.232500 1.452500 0.881750 0.507250 0.248250
max 156.000000 7.769000 1.684000 1.624000 1.141000 0.631000 0.566000

Just from this, I get a rough sense of what the data is like.

Try finding the correlation coefficient between Score (happiness) and Social support (social welfare)


#Library preparation
import numpy as np
import pandas as pd

#Data set preparation

##Put the happiness score into its own variable
happy = df["Score"]

##Put social support into its own variable
social = df["Social support"]

#Get the correlation coefficient!
correlation = np.corrcoef(social, happy)
print(correlation)

[[1.         0.77705779]
 [0.77705779 1.        ]]

There it is~! The correlation coefficient is about 0.78, so social support (social welfare) has a strong correlation with happiness!!
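To see this visually (an extra sketch of my own, using the matplotlib imported earlier as plt), a quick scatter plot would look something like this:

#Scatter plot of social support vs. happiness score (extra visualization, my own addition)
plt.scatter(social, happy)
plt.xlabel("Social support")
plt.ylabel("Score")
plt.show()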

Try drawing a heat map


#Library preparation
import pandas as pd
import numpy as np

#This should get the correlation coefficients between all the columns!
corr_df = df.corr()
print(corr_df)

                              Overall rank  ...  Perceptions of corruption
Overall rank                      1.000000  ...                  -0.351959
Score                            -0.989096  ...                   0.385613
GDP per capita                   -0.801947  ...                   0.298920
Social support                   -0.767465  ...                   0.181899
Healthy life expectancy          -0.787411  ...                   0.295283
Freedom to make life choices     -0.546606  ...                   0.438843
Generosity                       -0.047993  ...                   0.326538
Perceptions of corruption        -0.351959  ...                   1.000000

[8 rows x 8 columns]

Sorry it's hard to read, but at least it came out!

#Library preparation
import seaborn as sns

#Draw a heat map of the correlation matrix
sns.heatmap(corr_df, cmap=sns.color_palette('cool', 5), annot=True, fmt='.2f', vmin=-1, vmax=1)
(Screenshot of the resulting heat map)

I completely forgot to exclude the Overall rank column, but I managed to get it out!
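For reference, a minimal sketch like this (my own addition, not what I actually ran) should exclude Overall rank before computing the correlations:

#Drop the Overall rank column before computing correlations (extra sketch, my own addition)
corr_df2 = df.drop(columns=["Overall rank"]).corr()
sns.heatmap(corr_df2, cmap=sns.color_palette('cool', 5), annot=True, fmt='.2f', vmin=-1, vmax=1)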

I'll keep studying on my own from here

Impressions of the class

I've always been interested in Python. The class was only four sessions in total, but it was interesting to learn various things about "data", not just Python. Beyond that, it was simply fun. I remember doing a little statistics in SPSS back in graduate school. I didn't use statistics in my master's thesis, so I only touched on it briefly, but it was genuinely interesting back then too. It brought back memories of when I was young lol

There are many things I want to try

Since this was my first attempt, I kept it rough and didn't think about anything too deep: rather than worrying about rigorous statistics, I just tried to get a feel for the data and visualize it. Compared with what specialists do, I'm sure there is a lot more that could be done. There are many things I'd like to try in Python, such as the factor analysis, principal component analysis, and logistic regression I used to do in SPSS. I couldn't get to any of that during the class due to lack of time and knowledge...
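For example, as a very rough sketch of where I'd like to go next (my own illustration, assuming scikit-learn is available; not something from the class), principal component analysis on the numeric columns might start out like this:

#Rough PCA sketch on the numeric columns (hypothetical, assuming scikit-learn is installed)
from sklearn.decomposition import PCA

features = df.drop(columns=["Overall rank", "Country or region", "Score"])
pca = PCA(n_components=2)
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)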

I'll keep at it steadily

The class itself is over, but I felt the potential of machine learning, and at the same time I was struck by how vague my understanding has been even though I'm in this industry. I'm sure studying it won't all be fun and interesting, but going forward I'd like to learn as much as I can. I'm not planning to put machine learning into the product I'm building, so how much can I really do outside the frame of my graduation project? It could easily end up as all talk and no action, and I'm already worried about that, but I wrote this Qiita post partly as a warning to myself, so I'd like to chip away at it little by little. I hope I can write up this kind of content bit by bit in the future.
