Checking the contents of your data is one of the most important steps in data analysis. This post introduces a way to check for missing values that even non-engineers can use.
Import pandas and load the dataset. This time, we'll use train.csv from Kaggle's House Prices: Advanced Regression Techniques competition.
House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques
import pandas as pd
data = pd.read_csv('../train.csv')
Assign the data you want to check to df. Here, we look at the train.csv data loaded above.
# How to check missing values
df = data  # Assign the dataset you want to check to df
total = df.isnull().sum()  # Number of missing values per column
percent = (df.isnull().mean() * 100).round(2)  # Share of missing values per column (%)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Ratio_of_NA(%)'])
# Named dtypes rather than type so the built-in type() is not shadowed
dtypes = df.dtypes.to_frame('Types')
missing_data = pd.concat([missing_data, dtypes], axis=1)
missing_data = missing_data.sort_values('Total', ascending=False)
print(missing_data.head(20))  # Top 20 columns by missing count
print()
print(set(missing_data['Types']))  # Data types present in the dataset
print()
print("---Categorical col---")
print(missing_data[missing_data['Types'] == "object"].index)
print()
print("---Numerical col---")
print(missing_data[missing_data['Types'] != "object"].index)
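To try the same summary without downloading train.csv, the steps can be wrapped in a small function and run on toy data. This is only a minimal sketch: the function name missing_summary and the toy DataFrame are made up for illustration and are not part of the Kaggle dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count, missing ratio (%), and dtype for each column."""
    total = df.isnull().sum()
    ratio = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({
        "Total": total,
        "Ratio_of_NA(%)": ratio,
        "Types": df.dtypes,
    })
    # Columns with the most missing values come first
    return summary.sort_values("Total", ascending=False)


# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 400.0],
    "street": ["Pave", None, None, None],
    "rooms": [3, 4, 5, 6],
})

summary = missing_summary(toy)
print(summary)
```

Passing any other DataFrame to missing_summary should give the same three columns, so the check can be reused across datasets.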
The code above tells you the percentage of missing values in each column. But sometimes, for example with time series datasets, you also want to know where the missing values are located. In that case, use a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
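# Jupyter Notebook magic; remove this line when running as a plain script
%matplotlib inline
df = data  # Assign the dataset you want to check to df
plt.figure(figsize=(16, 16))  # Figure size
plt.title("Missing Value")  # Title
sns.heatmap(df.isnull(), cbar=False)  # Heatmap of df.isnull() (True = missing)
plt.show()  # Needed when running outside a notebook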
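If you want the positions as numbers rather than a picture, the same df.isnull() result can be aggregated per row. The following is a rough sketch on a made-up daily series; the column names and dates are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily time series with a gap in the middle
toy = pd.DataFrame(
    {
        "temp": [20.1, np.nan, np.nan, 21.5, 22.0],
        "humidity": [55, 60, np.nan, 58, 57],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Number of missing cells in each row (each date)
missing_per_row = toy.isnull().sum(axis=1)

# Only the rows that contain at least one missing value
rows_with_na = toy[toy.isnull().any(axis=1)]

print(missing_per_row)
print(rows_with_na.index.tolist())
```

The index of rows_with_na lists the dates where values are missing, which corresponds to the marked bands in the heatmap.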
By assigning a different dataset to df in each piece of code, you can automatically determine whether each column is a text type or a numeric type and visualize its missing values.
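As an alternative way to split columns by type, pandas also provides select_dtypes. This is a minimal sketch on made-up data, not the method used above; note that it treats every non-numeric column (including datetime columns) as non-numeric, so the result may differ slightly from comparing the dtype against "object".

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "price": [100.0, 200.0],
    "rooms": [3, 4],
    "street": ["Pave", "Grvl"],
})

# Columns with a numeric dtype (integers and floats)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, such as text columns
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```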