If you are a data engineer, or the person responsible for maintaining data, you probably check data for inconsistencies with various tools, or with ad hoc SQL. Lately I have been doing this kind of work a lot, especially when a new data feed starts and I need to look at its contents. That is where pandas_profiling comes in handy.
```bash
pip install pandas-profiling[notebook]
```
```python
import pandas as pd
import pandas_profiling as pdp
from sklearn.datasets import load_boston

# Load the Boston housing dataset into a DataFrame
# (note: load_boston was removed in scikit-learn 1.2)
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Turn off all of the correlation calculations
correlations = {name: {"calculate": False}
                for name in ("pearson", "spearman", "kendall", "phi_k", "cramers")}
profile = pdp.ProfileReport(df, correlations=correlations)
profile.to_file("profile.html")
```
Usually I just want to see the distribution of each column, so I pass an option that skips the correlation calculations. The report is also written out as an HTML file, which makes it easy to share with other people.
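If you want to skip even more of the heavy computation, pandas-profiling also has a minimal mode (`minimal=True`) that disables the expensive parts of the report in one flag. A small sketch, continuing from the snippet above:

```python
# Minimal mode turns off the costly computations (correlations,
# duplicate-row detection, interactions) with a single flag,
# which is handy for a first quick look at a large table.
profile = pdp.ProfileReport(df, minimal=True)
profile.to_file("profile_minimal.html")
```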
When you run it in a Jupyter notebook, a progress bar is displayed, so you can follow the processing status. The report shows the state of every column. I am particularly interested in missing values, and it is very useful that the report gives both the count and the percentage of missing values for each column.
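If you are working in a notebook anyway, you do not have to open the HTML file separately: the report object can render itself inline. A minimal sketch, reusing the `profile` object from above:

```python
# Render the report inline in the notebook instead of (or in addition to)
# writing profile.html; the same progress bars appear while it computes.
profile.to_notebook_iframe()
```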