Practical exercise of data analysis with Python ~ 2016 New Coder Survey Edition ~

Introduction

Practice working with data using Python and libraries such as numpy, pandas, and seaborn. The data comes from Kaggle; this time we will use the 2016 New Coder Survey. The dataset is described as follows. Simply put, it is data about people who are learning to code.

Free Code Camp is an open source community where you learn to code and build projects for nonprofits.

CodeNewbie.org is the most supportive community of people learning to code.

Together, we surveyed more than 15,000 people who are actively learning to code. We reached them through the twitter accounts and email lists of various organizations that help people learn to code.

Our goal was to understand these people's motivations in learning to code, how they're learning to code, their demographics, and their socioeconomic background.

As a premise, everything is run in an IPython notebook. The environment is pyenv: anaconda3-2.4.0 (Python 3.5.2 :: Anaconda 2.4.0).

If you are familiar with this field, I would appreciate it if you could take a kind look at the content and offer advice on anything you notice. Comments along the lines of "this is the analysis I would do with this data" would be very welcome, ideally with code!

Library import

I will load the ones that I think I will use.

import numpy as np
from numpy.random import randn
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data reading

I downloaded the data from 2016 New Coder Survey and put it in the same folder with the name "code_survey.csv".

survey_df = pd.read_csv('code_survey.csv')

Data overview

shape

survey_df.shape
(15620, 113)

I see, there are quite a few items. There are 15620 rows (the number of respondents) and 113 columns (survey questions).

info

You may also want to use info().

survey_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15620 entries, 0 to 15619
Columns: 113 entries, Age to StudentDebtOwe
dtypes: float64(85), object(28)
memory usage: 13.5+ MB

describe

survey_df.describe()

You can see information such as 'count', 'mean', 'std', 'min', '25%', '50%', '75%', and 'max' for each numeric column. The output is omitted here because there is too much of it.
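Since the full describe() output covers 113 columns, one option (my own addition, not in the original) is to describe only a few columns of interest:

survey_df[['Age', 'Income', 'HoursLearning']].describe()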

Column check

for col in survey_df.columns:
    print(col)

This displays all 113 items. Since this is practice, I will first pick out the columns to use (a small selection sketch follows the list below).

Gender: gender
HasChildren: whether the respondent has children
EmploymentStatus: current employment status
Age: age
Income: income
HoursLearning: hours spent learning per week
SchoolMajor: school major
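As a small sketch (my own addition, not in the original), these columns could be pulled out into a trimmed DataFrame up front; the rest of the article keeps using the full survey_df, so this is just illustrative.

# Columns used in this article
cols = ['Gender', 'HasChildren', 'EmploymentStatus', 'Age',
        'Income', 'HoursLearning', 'SchoolMajor']

# A trimmed copy for quick inspection
survey_subset = survey_df[cols].copy()
survey_subset.head()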

Overview of each item

Gender

countplot

Let's start with the gender data by plotting the counts per category. seaborn's countplot is handy for this.

sns.countplot('Gender', data=survey_df)

gender.png

In Japan the answers would probably be split simply into men and women, but there is more diversity here, which gives it an international feel.

seaborn.countplot

By the way, for a simple histogram there is plt.hist in matplotlib. (There is also plt.bar for bar charts, but plt.hist is easier when you want a histogram of the frequency distribution of the data.)

dataset = randn(100)
plt.hist(dataset)

(randn generates random numbers drawn from the standard normal distribution.)

plt.png

There are also various options.

# normed: normalize, alpha: transparency, color: color, bins: number of bins
plt.hist(dataset, normed=True, alpha=0.8, color='indianred', bins=15)

matplotlib.pyplot.hist

HasChildren

Similarly, try drawing with or without children using count plot.

sns.countplot('HasChildren', data=survey_df)

has_children.png

Since 0 and 1 are hard to interpret at a glance, let's convert them to 'No' (no children) and 'Yes' (has children).

survey_df.loc[survey_df['HasChildren'] == 0, 'HasChildren'] = 'No'
survey_df.loc[survey_df['HasChildren'] == 1, 'HasChildren'] = 'Yes'

pandas.DataFrame.loc

You can now convert.

df.map

Converting with map also works nicely.

survey_df['HasChildren'] = survey_df['HasChildren'].map({0: 'No', 1: 'Yes'})
sns.countplot('HasChildren', data=survey_df)

has_children_2.png

It's a little easier to understand!

EmploymentStatus

The current employment status is also shown with countplot.

sns.countplot('EmploymentStatus', data=survey_df)

employment_status.png

The category labels overlap and it's hard to read.

So let's change the axis.

countplot: changing the axis

sns.countplot(y='EmploymentStatus', data=survey_df)

employment_status_2.png

Easy to see!

Age

Again, try using count plot.

sns.countplot('Age', data=survey_df)

age.png

It's colorful and beautiful, but it's hard to see as a graph.

So let's smooth the graph.

kde plot

Use kernel density estimation (KDE). Calling it is easy.

sns.kdeplot(survey_df['Age'])

age_2.png

There are many people in their 20s and 30s, which is probably what you would expect. Still, the distribution has a tail that extends into the older age range.

Kernel density estimation

Now let's look a little more closely at kernel density estimation. (Wikipedia and other sites have proper explanations: [Kernel density estimation](https://ja.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5%BA%A6%E6%8E%A8%E5%AE%9A).)

dataset = randn(30)
plt.hist(dataset, alpha=0.5)
sns.rugplot(dataset)

The rug plot shows each sample point with a bar.

kernel.png

The idea is to place a kernel function on each sample point in this graph (a normal distribution is the easiest example to picture) and then add them all together.
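To make that concrete, here is a small hand-rolled sketch (my own addition, assuming Gaussian kernels and a hand-picked bandwidth) that places a normal density on each sample point and sums them:

from scipy.stats import norm

x_grid = np.linspace(dataset.min() - 2, dataset.max() + 2, 200)
bandwidth = 0.5  # chosen by hand just for illustration

# One Gaussian "bump" centered on each sample point
kernels = [norm(loc=point, scale=bandwidth).pdf(x_grid) for point in dataset]

# Sum the bumps and normalize so the area under the curve is 1
density = np.sum(kernels, axis=0)
density /= np.trapz(density, x_grid)

plt.plot(x_grid, density)
sns.rugplot(dataset)

This should roughly match the shape that sns.kdeplot produces below, up to the choice of bandwidth.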

sns.kdeplot(dataset)

kernel_2.png

In kernel density estimation, you need to decide on two things:

- Kernel function: how the influence of each sample point is spread
- Bandwidth: how widely the kernel function spreads

Kernel function

Various kernel functions can be used. The default is "gau" (the Gaussian, i.e. normal, distribution).

kernel_options = ["gau", "biw", "cos", "epa", "tri", "triw"]
for kernel in kernel_options:
    sns.kdeplot(dataset, kernel=kernel, label=kernel)

kernel_3.png

Bandwidth

Bandwidth can also be changed.

for bw in np.arange(0.5, 2, 0.25):
    sns.kdeplot(dataset, bw=bw, label=bw)

kernel_4.png

That wraps up the kernel density estimation digression; now let's get back to the survey data.

Income

Again, use kdeplot.

sns.kdeplot(survey_df['Income'])

income.png

The unit is dollars, so this is annual income.

Let's take a closer look at the data.

describe

survey_df['Income'].describe()
RuntimeWarning: Invalid value encountered in median
count      7329.000000
mean      44930.010506
std       35582.783216
min        6000.000000
25%                NaN
50%                NaN
75%                NaN
max      200000.000000
Name: Income, dtype: float64

It seems the problem of the quartiles coming back as NaN had already been fixed at the time of writing but was still waiting to be merged, so I will just wait for the next release without worrying about it: describe() returns RuntimeWarning: Invalid value encountered in median #13146
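As a stopgap (my own addition, not from the original article), dropping the NaN rows before calling describe() sidesteps the code path that produces the warning and the NaN quartiles:

survey_df['Income'].dropna().describe()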

boxplot

I would like to create a box-and-whisker plot (boxplot).

sns.boxplot(survey_df['Income'])

boxplot.png

From left to right, the vertical lines mark the lower whisker, the first quartile (Q1), the median, the third quartile (Q3), and the upper whisker. With IQR = Q3 - Q1, points that fall outside the range (Q1 - 1.5 * IQR) to (Q3 + 1.5 * IQR) are drawn as black dots, i.e. treated as outliers.
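To spell out the arithmetic (a small sketch of my own, not in the original), the quartiles and whisker bounds can be computed directly:

income = survey_df['Income'].dropna()

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Points outside these bounds are the black dots (outliers) in the box plot
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(q1, q3, iqr, lower_bound, upper_bound)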

It is also possible to draw the plot without marking any points as outliers.

sns.boxplot(survey_df['Income'], whis=np.inf)

boxplot_2.png

violinplot

There is also the violin plot, which adds the KDE shape to the boxplot information.

sns.violinplot(survey_df['Income'])

violinplot.png

The distribution is easier to understand!

HoursLearning

Let's take a look at the study time. First, let's do a kde plot.

sns.kdeplot(survey_df['HoursLearning'])

hours_kde.png

This is the number of hours spent learning per week. The spikes stand out, and they occur at round numbers: people tend to answer questionnaires with round figures, which is why this happens.
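To confirm that the spikes sit on round numbers, a quick check (my own addition) with value_counts:

# The most frequent answers should be the round figures that cause the spikes
survey_df['HoursLearning'].value_counts().head(10)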

Let's also draw a violin plot.

sns.violinplot(survey_df['HoursLearning'])

hours_violinplot.png

It reflects the characteristics of the kde plot.

There is also distplot, which draws the histogram and the KDE together. NaN values are dropped before use.

hours_learning = survey_df['HoursLearning']
hours_learning = hours_learning.dropna()
sns.distplot(hours_learning)

hours_dist.png

You can replace the histogram with a rug plot and pass various options. Convenient!

sns.distplot(hours_learning, rug=True, hist=False, kde_kws={'color':'indianred'})

hours_dist_2.png

SchoolMajor

kdeplot is useful for continuous values, but since this is a categorical variable, use countplot.

sns.countplot(y='SchoolMajor', data=survey_df)

major_countplot.png

It's hard to read; there are too many categories. I only want to see the top 10 or so.

collections.Counter

from collections import Counter
major_count = Counter(survey_df['SchoolMajor'])
major_count.most_common(10)

Count the values using the standard library's collections.Counter. Calling most_common(10) then gives you the top 10.

[(nan, 7170),
 ('Computer Science', 1387),
 ('Information Technology', 408),
 ('Business Administration', 284),
 ('Economics', 252),
 ('Electrical Engineering', 220),
 ('English', 204),
 ('Psychology', 187),
 ('Electrical and Electronics Engineering', 164),
 ('Software Engineering', 159)]

Let's display it on the graph.

X = []
Y = []
major_count_top10 = major_count.most_common(10)
for record in major_count_top10:
    X.append(record[0])
    Y.append(record[1])

# [nan, 'Computer Science', 'Information Technology', 'Business Administration', 'Economics', 'Electrical Engineering', 'English', 'Psychology', 'Electrical and Electronics Engineering', 'Software Engineering']
# [7170, 1387, 408, 284, 252, 220, 204, 187, 164, 159]

plt.barh(np.arange(10), Y)
plt.yticks(np.arange(10), X)

major_bar.png

I referred to this page: Bar Graph - Introduction to matplotlib.

plt.barh is the horizontal-axis version of plt.bar, and I added labels with yticks. That said, I don't want NaN to be displayed, and I want the bars sorted in reverse order.

X = []
Y = []
major_count_top10 = major_count.most_common(10)
major_count_top10.reverse()
for record in major_count_top10:
    # record[0] == record[0] -- see the note on this check below
    if record[0] == record[0]:
        X.append(record[0])
        Y.append(record[1])

# ['Software Engineering', 'Electrical and Electronics Engineering', 'Psychology', 'English', 'Electrical Engineering', 'Economics', 'Business Administration', 'Information Technology', 'Computer Science']
# [159, 164, 187, 204, 220, 252, 284, 408, 1387]

plt.barh(np.arange(9), Y)
plt.yticks(np.arange(9), X)

major_barplot_2.png

That produced the graph I had in mind!

[Addition] About if record[0] == record[0]:

NaN is the only value for which comparing it to itself returns False, so I used that to filter it out. (See also the link below.)

How to check for NaN in Python.

However, it is a little hard to read. The approach suggested by @shiracamus is easier to understand, and I will use it from now on.

if record[0] == record[0]:

I changed the check above to the following:

if pd.notnull(record[0]):
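A tiny demonstration (my own addition) of the behavior both checks rely on:

value = float('nan')

print(value == value)                  # False: NaN is the only value not equal to itself
print(pd.notnull(value))               # False: pandas treats it as missing
print(pd.notnull('Computer Science'))  # True for an ordinary string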

Data association

Gender and HasChildren

First, for simplicity, restrict Gender to male and female only.

male_female_df = survey_df.where((survey_df['Gender'] == 'male') | (survey_df['Gender'] == 'female'))

You can break the counts down by subgroup using countplot's hue argument.

countplot(hue)

sns.countplot('Gender', data=male_female_df, hue='HasChildren')

gender_haschildren_countplot.png

It looks like men and women have roughly the same proportion with children.

Gender and Age

Plots other than countplot can also be broken down by subgroup. Use FacetGrid.

sns.FacetGrid

fig = sns.FacetGrid(male_female_df, hue='Gender', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = male_female_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()

gender_age_facetgrid.png

Men are a little younger, aren't they?

EmploymentStatus and Gender

Since EmploymentStatus has many categories, I will use only the top few.

# male_female_df is survey_df with Gender narrowed down to male and female
# Get the top 5 EmploymentStatus values
from collections import Counter
employ_count = Counter(male_female_df['EmploymentStatus'])
employ_count_top = employ_count.most_common(5)
print(employ_count_top)
employ_list =[]

for record in employ_count_top:
    if record[0] == record[0]:
        employ_list.append(record[0])

def top_employ(status):
    return status in employ_list

# Use apply to keep only the rows whose EmploymentStatus is in employ_list
new_survey_df = male_female_df.loc[male_female_df['EmploymentStatus'].apply(top_employ)]

sns.countplot(y='EmploymentStatus', data=new_survey_df)

employment_status_only_3.png

Now there are only the top 3 items.
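As an aside (my own note, not in the original), pandas' isin can do the same row filtering without the helper function:

# Equivalent to the apply(top_employ) filtering above
new_survey_df = male_female_df.loc[male_female_df['EmploymentStatus'].isin(employ_list)]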

Let's break this down by gender using countplot's hue.

sns.countplot(y='EmploymentStatus', data=new_survey_df, hue='Gender')

employment_status_hue_gender.png

EmploymentStatus and HasChildren

First, convert HasChildren back to numbers: No -> 0, Yes -> 1.

new_survey_df['HasChildren'] = new_survey_df['HasChildren'].map({'No': 0, 'Yes': 1})

Use factorplot here. Since HasChildren is now 0/1, factorplot shows its mean per category, i.e. the proportion of respondents with children. Let's see how EmploymentStatus relates to having children.

factorplot

sns.factorplot('EmploymentStatus', 'HasChildren', data=new_survey_df, aspect=2)

factorplot_employment_haschildren.png

'Employed for wages' is a little higher, which makes sense.
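Since factorplot is plotting the mean of the 0/1 HasChildren column per category, the same numbers can be cross-checked directly with groupby (my own addition):

# Proportion of respondents with children, per employment status
new_survey_df.groupby('EmploymentStatus')['HasChildren'].mean()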

Factor plots can also be broken down by subgroup, so let's see whether there is a difference between men and women.

sns.factorplot('EmploymentStatus', 'HasChildren', data=new_survey_df, aspect=2, hue='Gender')

factorplot_employment_hue_gender.png

This is pretty interesting: the pattern turns out to be completely different for men and women. In light of employment circumstances, various explanations come to mind.

Age and HasChildren

lmplot

I would like to see the relationship with a regression line. Use lmplot for the regression line.

sns.lmplot('Age', 'HasChildren', data=new_survey_df)

lmplot_age_haschildren.png

lmplot can also be broken down by subgroup, so let's try that. First, stratify by EmploymentStatus.

sns.lmplot('Age', 'HasChildren', data=new_survey_df, hue='EmploymentStatus')

lmplot_hue_employmentstatus.png

Overall, 'Employed for wages' sits a little higher, but as we saw in the previous section, it becomes clearer if we split by gender.

So let's also stratify by Gender.

sns.lmplot('Age', 'HasChildren', data=new_survey_df, hue='Gender')

lmplot_hue_gender.png

Display multiple graphs side by side

You can also display multiple graphs side by side. I drew two graphs using subplots.

fig, (axis1, axis2) = plt.subplots(1, 2, sharey=True)
sns.regplot('HasChildren', 'Age', data=new_survey_df, ax=axis1)
sns.violinplot(y='Age', x='HasChildren', data=new_survey_df, ax=axis2)

subplots.png

regplot is the lower-level version of lmplot; for a simple regression like this they are equivalent. I used regplot because the functions that work with subplots seem limited to those that draw onto a matplotlib Axes object, and lmplot cannot be used that way.

The detailed explanation in "Plotting with seaborn using the matplotlib object-oriented interface" was easy to follow.

The "axes-level" functions described in that article can be used this way (regplot, boxplot, kdeplot, and many others).

It also introduces "figure-level" functions, and lmplot is one of them (lmplot, factorplot, jointplot, and one or two others).

Most functions are axes-level, with a handful being figure-level.

For figure-level layouts, FacetGrid seems to be the tool to reach for: Plotting on data-aware grids.
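As a small sketch (my own, using the same data as above) of the difference: axes-level functions accept an ax argument, while a figure-level function such as lmplot manages its own figure.

fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(12, 4))

# Axes-level functions draw onto whatever Axes you hand them
sns.kdeplot(new_survey_df['Age'].dropna(), ax=axis1)
sns.boxplot(y='Age', data=new_survey_df, ax=axis2)

# Figure-level: lmplot ignores existing Axes and creates its own figure (a FacetGrid)
sns.lmplot('Age', 'HasChildren', data=new_survey_df, hue='Gender')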

You can also arrange multiple plots with FacetGrid. I displayed the distribution of Age side by side for each EmploymentStatus.

fig = sns.FacetGrid(new_survey_df, col='EmploymentStatus', aspect=1.5)
fig.map(sns.distplot, 'Age')
oldest = new_survey_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()

facetgrid_age_haschildren.png

In conclusion

Pages I referred to for how to use Python along the way:

Better writing that I've wanted to know since I started Python
Python pandas data selection process in a little more detail <Part 2>

Reference

[Taken by 20,000 people worldwide] Practical Python Data Science: a recommended video course that is easy to follow, with detailed explanations. A nice point is that if you ask a question, an answer comes back by the next day.

[Introduction to Data Analysis with Python - Data Processing Using NumPy and pandas](https://www.amazon.co.jp/Python%E3%81%AB%E3%82%88%E3%82%8B%E3%83%87%E3%83%BC%E3%82%BF%E5%88%86%E6%9E%90%E5%85%A5%E9%96%80-%E2%80%95NumPy%E3%80%81pandas%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E5%87%A6%E7%90%86-Wes-McKinney/dp/4873116554): O'Reilly's data analysis book. It is well organized.

Start Python Club: a Python community. It is not specialized in data analysis but is broadly active around Python, so if you use Python I think it is fun to attend.
