[PYTHON] A programming beginner tries simple data analysis for the first time

Excuses for not updating

The update has been delayed. There are many reasons (or excuses), but mostly it comes down to the fact that I just wasn't writing much code. At the beginning of the year the volume of lessons and assignments settled down a bit, and at the same time I was going back and forth in my head about my graduation project and my career path after graduation, so I didn't make much progress. I did keep updating my blog every day, but honestly there just wasn't enough material worth posting on Qiita. I'm still a little lost, but for now my graduation project has been decided and I'm working hard on it.

I learned a little Python in class

During the class, I played around with some data in the cloud. I also learned about an amazing site called Kaggle. This post is my first attempt at pulling data from Kaggle myself after class and poking around in it.

Get data from Kaggle

https://www.kaggle.com/unsdsn/world-happiness#2019.csv I chose this dataset because it looked like I could find various correlations in it.

Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

Upload the required CSV to Google Drive in advance

Import the libraries I might need and load the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Read the CSV uploaded to Google Drive
df = pd.read_csv("/content/drive/My Drive/2019.csv")

I was impressed by how many other libraries there are besides these!

Check the number of records and whether there are any missing values

df.count()
Overall rank 156
Country or region 156
Score 156
GDP per capita 156
Social support 156
Healthy life expectancy 156
Freedom to make life choices 156
Generosity 156
Perceptions of corruption 156

156 records, no missing values. If I don't check this first, I don't know whether to display all the data or just the first few rows, so I tried it. I also wanted to avoid data with a lot of missing values, because handling that seemed confusing. (I think I'll need to take on that challenge eventually, but since this is my first time, this will do for now.)
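Incidentally, a minimal extra sketch like the one below (my own addition, not something covered in class) should also show the number of missing values per column:

#Count missing values per column (should be all zeros for this dataset)
print(df.isnull().sum())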

Try displaying only the first 20 rows

df.head(20)
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
6 Switzerland 7.480 1.452 1.526 1.052 0.572 0.263 0.343
7 Sweden 7.343 1.387 1.487 1.009 0.574 0.267 0.373
8 New Zealand 7.307 1.303 1.557 1.026 0.585 0.330 0.380
9 Canada 7.278 1.365 1.505 1.039 0.584 0.285 0.308
10 Austria 7.246 1.376 1.475 1.016 0.532 0.244 0.226
11 Australia 7.228 1.372 1.548 1.036 0.557 0.332 0.290
12 Costa Rica 7.167 1.034 1.441 0.963 0.558 0.144
13 Israel 7.139 1.276 1.455 1.029 0.371 0.261 0.082
14 Luxembourg 7.090 1.609 1.479 1.012 0.526 0.194 0.316
15 United Kingdom 7.054 1.333 1.538 0.996 0.450 0.348 0.278
16 Ireland 7.021 1.499 1.553 0.999 0.516 0.298 0.310
17 Germany 6.985 1.373 1.454 0.987 0.473 0.160 0.210
19 United States 6.892 1.433 1.457 0.874 0.454 0.280
20 Czech Republic 6.852 1.269 1.487 0.920 0.457 0.046 0.036

Japan isn't in the top 20
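As a small side check of my own (not part of the original steps), something like this should pull up Japan's row by country name:

#Look up Japan's row by country name (extra check, my own addition)
print(df[df["Country or region"] == "Japan"])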

Roughly display the summary statistics that seem useful

df.describe()
Overall rank Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
count 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000
mean 78.500000 5.407096 0.905147 1.208814 0.725244 0.392571 0.184846
std 45.177428 1.113120 0.398389 0.299191 0.242124 0.143289 0.095254
min 1.000000 2.853000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 39.750000 4.544500 0.602750 1.055750 0.547750 0.308000 0.108750
50% 78.500000 5.379500 0.960000 1.271500 0.789000 0.417000 0.177500
75% 117.250000 6.184500 1.232500 1.452500 0.881750 0.507250 0.248250
max 156.000000 7.769000 1.684000 1.624000 1.141000 0.631000 0.566000

Just from this, I get a rough sense of what the data is like.

Try finding the correlation coefficient between Score (happiness) and Social support (social welfare)


#Library preparation
import numpy as np
import pandas as pd

#Data set preparation

##Put the happiness score into its own variable
happy = df["Score"]

##Put social support into its own variable
social = df["Social support"]

#Get the correlation coefficient!
correlation = np.corrcoef(social, happy)
print(correlation)

[[1.         0.77705779]
 [0.77705779 1.        ]]

There it is~! The correlation coefficient is about 0.78, so social support (social welfare) has a strong correlation with happiness!!
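To see this visually (an extra sketch of my own, using the matplotlib imported earlier as plt), a quick scatter plot would look something like this:

#Scatter plot of social support vs. happiness score (extra visualization, my own addition)
plt.scatter(social, happy)
plt.xlabel("Social support")
plt.ylabel("Score")
plt.show()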

Try drawing a heat map


#Library preparation
import pandas as pd
import numpy as np

#This should get the correlation coefficients between all the columns!
corr_df = df.corr()
print(corr_df)

                              Overall rank  ...  Perceptions of corruption
Overall rank                      1.000000  ...                  -0.351959
Score                            -0.989096  ...                   0.385613
GDP per capita                   -0.801947  ...                   0.298920
Social support                   -0.767465  ...                   0.181899
Healthy life expectancy          -0.787411  ...                   0.295283
Freedom to make life choices     -0.546606  ...                   0.438843
Generosity                       -0.047993  ...                   0.326538
Perceptions of corruption        -0.351959  ...                   1.000000

[8 rows x 8 columns]

Sorry it's hard to read, but at least it came out!

#Library preparation
import seaborn as sns

#Draw a heat map of the correlation matrix
sns.heatmap(corr_df, cmap=sns.color_palette('cool', 5), annot=True, fmt='.2f', vmin=-1, vmax=1)
(Screenshot of the resulting heat map)

I completely forgot to exclude the Overall rank column, but I managed to get it out!
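For reference, a minimal sketch like this (my own addition, not what I actually ran) should exclude Overall rank before computing the correlations:

#Drop the Overall rank column before computing correlations (extra sketch, my own addition)
corr_df2 = df.drop(columns=["Overall rank"]).corr()
sns.heatmap(corr_df2, cmap=sns.color_palette('cool', 5), annot=True, fmt='.2f', vmin=-1, vmax=1)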

I'll keep studying on my own from here

Impressions of the class

I've always been interested in Python. The class was only four sessions in total, but it was interesting to learn various things about "data", not just Python. Beyond that, it was simply fun. I remember doing a little statistics in SPSS back in graduate school. I didn't use statistics in my master's thesis, so I only touched on it briefly, but it was genuinely interesting back then too. It brought back memories of when I was young lol

There are many things I want to try

Since this was my first attempt, I kept it rough and didn't think about anything too deep: rather than worrying about rigorous statistics, I just tried to get a feel for the data and visualize it. Compared with what specialists do, I'm sure there is a lot more that could be done. There are many things I'd like to try in Python, such as the factor analysis, principal component analysis, and logistic regression I used to do in SPSS. I couldn't get to any of that during the class due to lack of time and knowledge...
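For example, as a very rough sketch of where I'd like to go next (my own illustration, assuming scikit-learn is available; not something from the class), principal component analysis on the numeric columns might start out like this:

#Rough PCA sketch on the numeric columns (hypothetical, assuming scikit-learn is installed)
from sklearn.decomposition import PCA

features = df.drop(columns=["Overall rank", "Country or region", "Score"])
pca = PCA(n_components=2)
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)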

I'll keep at it steadily

The class itself is over, but I felt the potential of machine learning, and at the same time I was struck by how vague my understanding has been even though I'm in this industry. I'm sure studying it won't all be fun and interesting, but going forward I'd like to learn as much as I can. I'm not planning to put machine learning into the product I'm building, so how much can I really do outside the frame of my graduation project? It could easily end up as all talk and no action, and I'm already worried about that, but I wrote this Qiita post partly as a warning to myself, so I'd like to chip away at it little by little. I hope I can write up this kind of content bit by bit in the future.
