Data analysis starting with python (data visualization 1)

Introduction

This is the first post from CEML (Clinical Engineer Machine Learning). In this post, I explain data analysis with Python for beginners. Source code: https://gitlab.com/ceml/qiita/-/blob/master/src/python/notebook/first_time_data_analysis.ipynb

Contents of this article

This article covers everything from reading the data to simple data analysis, using a freely available public dataset.

About the dataset

・ Provided by: UCI Machine Learning Repository (University of California, Irvine)
・ Contents: Test data of heart disease patients
・ URL: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
・ Only processed.cleveland.data from the above URL is used.

Analysis purpose

The dataset classifies each patient's condition into five classes. The goal of this analysis is to understand the characteristics of each class.

Download data

Open the URL above and download processed.cleveland.data from the Data Folder.
[Screenshot: download page]

Data reading

Import pandas and read the data with the pandas read_csv function. The column names are specified when reading the data: make a list of column names and pass it to the names argument of read_csv.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

columns_name = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak","slope","ca","thal","class"]
data = pd.read_csv("/Users/processed.cleveland.data", names=columns_name)
# Display the first 5 rows of the data
data.head()

The loaded data is shown below.
[Screenshot: first five rows of the data]
Here is a brief description of each column. Please refer to the data source for details.
・ age: age
・ sex: sex (1 = male; 0 = female)
・ cp: chest pain type
    1: typical angina
    2: atypical angina
    3: non-anginal pain
    4: asymptomatic
・ trestbps: resting blood pressure (in mm Hg on admission to the hospital)
・ chol: serum cholesterol in mg/dl
・ fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
・ restecg: resting electrocardiographic results
    0: normal
    1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    2: showing probable or definite left ventricular hypertrophy by Estes' criteria
・ thalach: maximum heart rate achieved
・ exang: exercise induced angina (1 = yes; 0 = no)
・ oldpeak: ST depression induced by exercise relative to rest
・ slope: the slope of the peak exercise ST segment
    1: upsloping
    2: flat
    3: downsloping
・ ca: number of major vessels (0-3) colored by fluoroscopy
・ thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
・ class: 0 to 4 (0 is normal; the larger the number, the worse the condition)
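The reading step can be tried without downloading anything. The following is a minimal sketch, assuming a two-row in-memory sample whose values are made up for illustration (they are not claimed to match the real file). It shows why the names argument matters for a file without a header row: without it, pandas treats the first data row as the header.

```python
import io

import pandas as pd

columns_name = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"]

# Hypothetical sample in the same headerless, comma-separated layout
# (illustrative values only).
text = (
    "54.0,1.0,4.0,130.0,250.0,0.0,2.0,150.0,0.0,1.2,2.0,0.0,3.0,0\n"
    "61.0,0.0,3.0,140.0,210.0,0.0,0.0,160.0,0.0,0.0,1.0,0.0,3.0,1\n"
)

# With names=..., both rows are kept as data.
demo = pd.read_csv(io.StringIO(text), names=columns_name)

# Without names=..., the first row is consumed as the header.
no_names = pd.read_csv(io.StringIO(text))

print(demo.shape)
print(len(no_names))
```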

Data preprocessing

As preprocessing, check the data type of each column and convert any non-numeric column to a numeric type. Missing values are recorded as "?", so replace them with null (NaN).

# Check the data type of each column
data.dtypes

# Replace "?" with NaN, then convert every column to float
data = data.replace("?",np.nan).astype("float")
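To see what this conversion does, here is a small sketch on a made-up two-column sample (the values are assumptions for illustration). It also shows one possible alternative, not used in this article: passing na_values="?" to read_csv so that the markers become NaN at read time.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample where "?" marks a missing value.
text = "1.0,3.0\n?,7.0\n2.0,?\n"

df = pd.read_csv(io.StringIO(text), names=["ca", "thal"])
# Because of "?", the columns are not read as numbers.
print(pd.api.types.is_numeric_dtype(df["ca"]))

# Same approach as in the article: replace the marker, then cast to float.
cleaned = df.replace("?", np.nan).astype("float")
print(cleaned.dtypes)

# Alternative: let read_csv treat "?" as missing while reading.
alt = pd.read_csv(io.StringIO(text), names=["ca", "thal"], na_values="?")
print(alt.equals(cleaned))
```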

Check the basic statistics and missing values of the data

Check each feature (variable)

#Calculate statistics
data.describe()
#Count missing values
data.isnull().sum()

These two calls are enough to see the summary statistics and the number of missing values for each column. The figures below show the results.
[Screenshot: summary statistics for each column]
[Screenshot: number of missing values for each column]
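As a small illustration of how the two outputs relate, the sketch below uses a made-up four-row frame (an assumption for this example). The count row of describe() counts only non-missing values, so it differs from the number of rows exactly where isnull().sum() is non-zero.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in "ca".
df = pd.DataFrame({
    "ca": [0.0, np.nan, 2.0, 1.0],
    "thal": [3.0, 7.0, 6.0, 3.0],
})

stats = df.describe()        # count, mean, std, min, quartiles, max
missing = df.isnull().sum()  # number of NaN values per column

# Rows minus the non-missing count gives the same numbers as `missing`.
print(len(df) - stats.loc["count"])
print(missing)
```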

Check each feature (variable) for each class

Now for the main topic. As stated above, the purpose of this analysis is to understand the characteristics of each class. For that, use the pandas groupby method.

# Group by the class column
class_group = data.groupby("class")


# To get statistics for a single class:
# class_group.get_group(0).describe()

# Show all columns when displaying a DataFrame (notebook)
pd.options.display.max_columns = None
# Display statistics for all classes
class_group.describe()

The statistics for all classes are shown below.
[Screenshot: statistics for all classes]
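The shape of the grouped result is easier to follow on a tiny example. The sketch below assumes a made-up frame with one feature and two classes. describe() on the grouped object returns one row per class, with columns indexed by (feature, statistic).

```python
import pandas as pd

# Hypothetical frame: one feature, two classes.
df = pd.DataFrame({
    "age": [50.0, 60.0, 55.0, 65.0, 70.0],
    "class": [0.0, 0.0, 1.0, 1.0, 1.0],
})

class_group = df.groupby("class")

# One row per class; columns are (feature, statistic) pairs.
per_class = class_group.describe()
print(per_class["age"][["count", "mean"]])

# Statistics for a single class only.
only_class_0 = class_group.get_group(0).describe()
print(only_class_0["age"])
```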

This is easy. Because this dataset has few features (variables) and only five classes, displaying the statistics for all classes is enough to check them. With many features or classes, however, displaying and checking everything becomes much harder.
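One possible way to keep the output manageable when there are many columns is to select only the features and statistics of interest. This is a sketch under assumptions, not part of the original notebook: the frame and the chosen columns are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with two features and two classes.
df = pd.DataFrame({
    "age": [50.0, 60.0, 55.0, 65.0],
    "chol": [200.0, 220.0, 250.0, 270.0],
    "class": [0.0, 0.0, 1.0, 1.0],
})

# Pick a subset of features, then compute only the mean and standard deviation.
summary = df.groupby("class")[["age", "chol"]].agg(["mean", "std"])
print(summary)
```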

Visualize data

Check the distribution of each feature (variable)

Check the distribution of the data with histograms.

data.hist(figsize=(20,10))
#Prevent the graphs from overlapping
plt.tight_layout() 
plt.show()
[Screenshot: histogram of each feature]

Display a histogram of each feature (variable) for each class

# Plot a single feature
# class_group["age"].hist(alpha=0.7)
# plt.legend([0,1,2,3,4])

# Plot all features
plt.figure(figsize=(20,10))
for n, name in enumerate(data.columns.drop("class")):
    plt.subplot(4,4,n+1)
    class_group[name].hist(alpha=0.7)
    plt.title(name,fontsize=13,x=0, y=0)
    plt.legend([0,1,2,3,4])
[Screenshot: histogram of each feature for each class]
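Overlaid histograms can be hard to compare when the classes differ in size or each class gets its own bin edges. The following is one possible adjustment, not the approach taken in this series: use shared bin edges and density=True so that each class is normalized. The data here is randomly generated for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: a large class 0 and a small class 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": np.concatenate([rng.normal(50, 8, 100), rng.normal(60, 8, 30)]),
    "class": [0.0] * 100 + [1.0] * 30,
})

# Compute the bin edges once from all values so every class shares them.
bins = np.histogram_bin_edges(df["age"], bins=10)

fig, ax = plt.subplots(figsize=(6, 4))
for label, values in df.groupby("class")["age"]:
    # density=True normalizes each class, so the small class stays visible.
    ax.hist(values, bins=bins, density=True, alpha=0.5,
            label=f"class {int(label)}")
ax.set_xlabel("age")
ax.set_ylabel("density")
ax.legend()
fig.tight_layout()
```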

Display the mean and standard deviation of each feature (variable) for each class as a bar graph

# Plot a single feature
# class_group.mean()["age"].plot.bar(yerr=class_group.std()["age"])

# Plot all features
# Compute the per-class mean and standard deviation once
class_mean = class_group.mean()
class_std = class_group.std()
plt.figure(figsize=(20,10))
for n, name in enumerate(data.columns.drop("class")):
    plt.subplot(4,4,n+1)
    class_mean[name].plot.bar(yerr=class_std[name], fontsize=8)
    plt.title(name,fontsize=13,x=0, y=0)
[Screenshot: mean and standard deviation of each feature for each class]
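The numbers behind such a bar graph can be checked on a small example. The sketch below assumes a made-up frame with one feature. The bar height is the per-class mean and the error bar (yerr) is the per-class sample standard deviation, which is what std() returns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame: one feature, two classes.
df = pd.DataFrame({
    "age": [50.0, 60.0, 55.0, 65.0, 75.0],
    "class": [0.0, 0.0, 1.0, 1.0, 1.0],
})

# Compute the mean and standard deviation per class in one call.
stats = df.groupby("class")["age"].agg(["mean", "std"])
print(stats)

# One bar per class, with the standard deviation as the error bar.
ax = stats["mean"].plot.bar(yerr=stats["std"], capsize=4)
ax.set_ylabel("age")
```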

This gives a rough visualization, but the per-class histograms are hard to read as they are. Next time, I will analyze the data using interactive graphs and 3D plots.

Data analysis starting with python (data visualization 2): https://qiita.com/CEML/items/e932684502764be09157
Data analysis starting with python (data visualization 3): https://qiita.com/CEML/items/71fbc7b8ab6a7576f514
