[Python beginner's memo] Why and how to check for missing values (NaN) before data analysis

I'm new to Python and machine learning. I dove into data analysis with enthusiasm, got stuck because I had neglected to check for missing values, and am leaving this memo as a lesson learned.

Conclusion

- Before starting data analysis, check for missing values.
- If missing values are found, take some measure, such as overwriting the missing values with other data or excluding the rows that contain them from the analysis.

What happened

- When I took part in a competition on Kaggle, a data analysis contest platform, I analyzed a dataset too large to inspect by eye.
- I did not notice that it contained missing values (NaN), so NaN spread throughout my program and the errors would not stop.

What is a missing value?
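
pandas represents a missing numeric value as NaN ("Not a Number"), a special floating-point value. The following is a minimal sketch, using only the Python standard library, of why a single NaN can fill a program with NaN; the variable names are made up for illustration.

```python
import math

# NaN ("Not a Number") is a special floating-point value.
nan = float("nan")

# Any arithmetic involving NaN yields NaN, so one missing value
# can spread through every later calculation.
total = 1.0 + nan
mean_with_nan = sum([10.0, 20.0, nan]) / 3

# NaN is not equal to anything, including itself,
# so `==` cannot be used to detect it.
equals_itself = (nan == nan)

# math.isnan() is a reliable check for a single float.
print(math.isnan(total))          # True
print(math.isnan(mean_with_nan))  # True
print(equals_itself)              # False
```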

Countermeasures: recommendations at the start of data analysis

- ① First and foremost, check whether the data contains any missing values.
  - Use `isnull().any()`.
  - It tells you which columns of your DataFrame contain missing values.
  - Checking df_example as shown below reports `True` for Population and GDP, meaning those columns contain missing values. (I imagine this can happen when, for example, the exact population of a country such as North Korea is not known.)

# Example: suppose the CSV file contains basic statistical data for each country
import pandas as pd
df_example = pd.read_csv("hogehoge/example.csv")

# Show, for each column, whether it contains any missing value
print(df_example.isnull().any())
# Example output:
# Id            False
# Name          False
# Population     True
# GDP            True
# Region        False
# life_expct    False
# dtype: bool
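
Knowing how many values are missing, and in which rows, is often useful in addition to the per-column True/False check. The sketch below assumes a small in-memory DataFrame with made-up values in place of the CSV file; it uses `isnull().sum()` to count missing values per column and `isnull().any(axis=1)` to select the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data; the values are made up for illustration.
df_example = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Population": [1000.0, np.nan, 3000.0, np.nan],
    "GDP": [50.0, 60.0, np.nan, 80.0],
})

# Number of missing values in each column.
missing_counts = df_example.isnull().sum()
print(missing_counts)

# Names of the columns that contain at least one missing value.
has_missing = df_example.isnull().any()
columns_with_missing = has_missing[has_missing].index.tolist()
print(columns_with_missing)

# Rows that contain at least one missing value.
rows_with_missing = df_example[df_example.isnull().any(axis=1)]
print(rows_with_missing)
```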

- ② Replace the missing values in each column where they were found.
  - I will not cover here how to delete a column that consists entirely of NaN, or how to delete the rows themselves instead of replacing the missing values.

# Replace the missing values in a column found to contain them (here, Population) with 0
df_example.loc[df_example['Population'].isnull(), 'Population'] = 0
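
As a rough sketch of the two measures named in the conclusion (replacing missing values, or excluding the rows that contain them), pandas also provides `fillna()` and `dropna()`. The DataFrame below is a made-up stand-in for the real data, and the names `df_filled` and `df_dropped` are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are made up.
df_example = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Population": [1000.0, np.nan, 3000.0],
})

# Replacement: fillna(0) on one column gives the same result as
# assigning 0 through .loc with an isnull() mask.
df_filled = df_example.copy()
df_filled["Population"] = df_filled["Population"].fillna(0)

# Exclusion: dropna(subset=...) returns a new DataFrame without the rows
# whose Population is missing; df_example itself is left unchanged.
df_dropped = df_example.dropna(subset=["Population"])

print(df_filled)
print(df_dropped)
```

Which of the two is appropriate depends on the later analysis, as the caution below explains.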

Caution

- When replacing, consider whether the replacement value is appropriate and what you must keep in mind in later calculations.
- For example, if you replace the population with 0 as above, two cases are possible:
  - "This data is analyzed only to find the 30 most populous countries and their characteristics, so this is not a problem."
  - "Since we will compute the average population from this data, let's calculate over only the countries whose population is not 0, so that the numerator and denominator are both correct."
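
The two cases above can be illustrated with a small worked example. This is a sketch with made-up numbers: replacing NaN with 0 does not change which entries are largest, but it does pull the average down unless the zeros are excluded from both the numerator and the denominator.

```python
import numpy as np
import pandas as pd

# Made-up population values; the second entry is missing.
population = pd.Series([100.0, np.nan, 300.0], name="Population")

# Replace the missing value with 0, as in the article.
filled = population.fillna(0)

# Case 1: a ranking of the largest values is not affected by the zero.
top_two = filled.nlargest(2).tolist()

# Case 2: the average is distorted, because the zero counts in the denominator.
mean_including_zeros = filled.mean()               # (100 + 0 + 300) / 3

# Averaging only the non-zero entries keeps numerator and denominator consistent.
mean_excluding_zeros = filled[filled != 0].mean()  # (100 + 300) / 2

# For comparison: on the unfilled Series, pandas skips NaN by default.
mean_skipping_nan = population.mean()              # (100 + 300) / 2

print(top_two)
print(mean_including_zeros, mean_excluding_zeros, mean_skipping_nan)
```

Note that filtering on `!= 0` would also drop any genuine zeros, so it is only safe when 0 cannot occur as a real value, as is the case for a country's population.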

Summary

- When you are given data, it is important to check for missing values first instead of jumping straight into the analysis.

Reference

- pandas: determine whether missing values (NaN) are included and count them
- pandas: exclude (delete), replace (fill in), and extract missing values (NaN)

(that's all)


Supplement

- I wrote this article after missing values got into the input layer of a deep learning model and made all of the subsequent analysis useless.
- Besides checking for missing values, I think there are many other checks and data cleansing steps to do before analysis, such as drawing a histogram to look for outliers. As of March 24, 2020, I have not covered them in this article, but I would like to add them once I have looked into them.
