If you are a data engineer, or the person responsible for maintaining data, you probably check data for inconsistencies with various tools, or with ad hoc SQL. Lately I have been doing this kind of work a lot, especially when a new data feed starts and I need to look at its contents. That is where pandas_profiling comes in handy.
```bash
pip install pandas-profiling[notebook]
```
```python
import pandas as pd
import pandas_profiling as pdp
from sklearn.datasets import load_boston

# Load the Boston housing dataset into a DataFrame
# (note: load_boston was removed in scikit-learn 1.2)
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Turn off all of the correlation calculations
correlations = {name: {"calculate": False}
                for name in ("pearson", "spearman", "kendall", "phi_k", "cramers")}
profile = pdp.ProfileReport(df, correlations=correlations)
profile.to_file("profile.html")
```
Usually I just want to see the distribution of each column, so I pass an option that skips the correlation calculations. The report is also written out as an HTML file, which makes it easy to share with other people.
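If you want to skip even more of the heavy computation, pandas-profiling also has a minimal mode (`minimal=True`) that disables the expensive parts of the report in one flag. A small sketch, continuing from the snippet above:

```python
# Minimal mode turns off the costly computations (correlations,
# duplicate-row detection, interactions) with a single flag,
# which is handy for a first quick look at a large table.
profile = pdp.ProfileReport(df, minimal=True)
profile.to_file("profile_minimal.html")
```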
When you run it in a Jupyter notebook, a progress bar is displayed, so you can follow the processing status. The report shows the state of every column. I am particularly interested in missing values, and it is very useful that the report gives both the count and the percentage of missing values for each column.
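If you are working in a notebook anyway, you do not have to open the HTML file separately: the report object can render itself inline. A minimal sketch, reusing the `profile` object from above:

```python
# Render the report inline in the notebook instead of (or in addition to)
# writing profile.html; the same progress bars appear while it computes.
profile.to_notebook_iframe()
```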