[PYTHON] [EDA] Introduction to Sweetviz (plus a comparison with pandas-profiling)

Until now I have used pandas-profiling for EDA, but I have started to see Sweetviz mentioned now and then, so I tried it out.

I used the Titanic dataset.

Table of contents

  1. What is EDA
  2. Sweetviz execution example
  3. Pandas-profiling execution example
  4. Implementation example
     4-1. Download Code and Data
     4-2. Environmental preparation
     4-3. Code Execution
  5. Finally

1. What is EDA?

EDA stands for **Exploratory Data Analysis**. When carrying out data analysis work such as machine learning, the following tasks are performed in order to understand the data:

- Data visualization
- Understanding the characteristics of the data
- Understanding the relationships within the data

There is a lot to say about EDA in detail, so I will skip it in this article. Please refer to the following articles:

- [Introduction to Data Scientists] Let's try basic operations of exploratory data analysis (EDA) using Python
- What is EDA (Exploratory Data Analysis)?
- Exploratory Data Analysis (EDA)

2. Sweetviz execution example

Sweetviz is a library that semi-automates many of the tasks involved in EDA. Let's go straight to an execution example with the Titanic data.

[Figure: all_number.png]

After execution, the HTML report shown above is created. Let's look at its contents in three parts.
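As a rough idea of what producing such a report can look like, here is a minimal sketch using Sweetviz's `compare` and `show_html` calls. The function name `build_sweetviz_report`, the output file name, and the CSV file names in the usage comment are assumptions for illustration, not taken from the article's repository.

```python
import pandas as pd


def build_sweetviz_report(train: pd.DataFrame, test: pd.DataFrame,
                          target: str = "Survived",
                          out_path: str = "sweetviz_titanic.html") -> str:
    """Compare training and inference data and write the report as HTML.

    Hypothetical helper: the name and defaults are chosen for this sketch.
    """
    # The objective variable has to exist in the training data.
    if target not in train.columns:
        raise ValueError(f"target column {target!r} not found in training data")

    # Imported lazily so the helper can be defined without Sweetviz installed.
    import sweetviz as sv

    # Each dataset is passed as [DataFrame, display name].
    report = sv.compare([train, "Train"], [test, "Test"], target_feat=target)
    report.show_html(out_path, open_browser=False)
    return out_path


# Usage (assumed Kaggle-style file names, adjust to your own paths):
# train_df = pd.read_csv("train.csv")
# test_df = pd.read_csv("test.csv")
# build_sweetviz_report(train_df, test_df)
```

If you only have a single dataset, Sweetviz also provides `sv.analyze(df, target_feat=...)`, which builds the same kind of report without the comparison.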

① Overview and correlation coefficients

In part ①, you can check the **characteristics of the data as a whole** and the **correlation coefficients**. One of the big advantages of Sweetviz overall is that you can view the **training data and inference data separately**.

[Figure: ①-1.png]

In this part of the figure, you can check summary information **for each of the training data and the inference data**.

You can also check the correlation coefficients by pressing the **Associations** button.

[Figure: associations.png]

The above is an example with the training data, but the same view is available for the inference data. By looking at how the correlation coefficients differ between the training data and the inference data, you may be able to guess whether their distributions differ.
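The idea of comparing correlation coefficients between the two datasets can also be sketched by hand with pandas. This is a simplification: it uses plain Pearson correlation on numeric columns only, which is not necessarily what the Associations view computes, and the function name `correlation_gap` and the tiny frames are made up for illustration.

```python
import pandas as pd


def correlation_gap(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between the two Pearson correlation matrices."""
    # Use only numeric columns that exist in both frames.
    train_num = train.select_dtypes("number")
    test_num = test.select_dtypes("number")
    shared = [c for c in train_num.columns if c in test_num.columns]
    return (train_num[shared].corr() - test_num[shared].corr()).abs()


# Tiny made-up frames (not the Titanic data): the relationship between
# "a" and "b" is reversed in the second one.
train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]})
test = pd.DataFrame({"a": [1, 2, 3, 4], "b": [8, 6, 4, 2]})

gap = correlation_gap(train, test)
print(gap)
```

Large values off the diagonal point at feature pairs whose relationship differs between the two datasets, which is a hint to look at those distributions more closely.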

② Outline of each feature

In part ②, you can check the following:

- Distribution of the objective variable (Survived)
- Distribution of the explanatory variables
- **Positive rate (ratio where the objective variable is 1)**
- **Comparison of the above three between training data and inference data**

[Figure: ②.png]

Seeing the distributions is to be expected, but being able to see the **positive rate** and the **comparison of training data and inference data** is very convenient. By looking at these, you can estimate the following:

- **Roughly how accurate the AI's predictions are likely to be**
- **Whether there is a problem with how the training and inference data were collected, or with how much there is** (the collection time and the users often differ between the two, but if the distributions are reasonably similar, you can judge that the data acquisition flow is fine)
- **Which values of an explanatory variable have a high positive rate**

You can predict quite a lot in advance, before implementing an AI algorithm and computing prediction accuracy or explainability measures such as LIME / SHAP. With those expectations in hand, you will not blindly trust the AI's results, and it becomes easier to interpret them.
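To make the two quantities concrete, the positive rate and the train/inference comparison can be computed by hand with pandas. This is only a sketch of the idea, not how Sweetviz computes them internally; the function names and the sample rows are made up.

```python
import pandas as pd


def positive_rate(df: pd.DataFrame, feature: str,
                  target: str = "Survived") -> pd.Series:
    """Share of rows with target == 1 for each value of `feature`."""
    return df.groupby(feature)[target].mean()


def compare_distribution(train: pd.DataFrame, test: pd.DataFrame,
                         feature: str) -> pd.DataFrame:
    """Relative frequency of each value in training vs. inference data."""
    return pd.DataFrame({
        "train": train[feature].value_counts(normalize=True),
        "test": test[feature].value_counts(normalize=True),
    }).fillna(0.0)


# Made-up rows in the shape of the Titanic data.
train = pd.DataFrame({"Sex": ["male", "male", "female", "female"],
                      "Survived": [0, 1, 1, 1]})
test = pd.DataFrame({"Sex": ["male", "male", "male", "female"]})

rate = positive_rate(train, "Sex")
dist = compare_distribution(train, test, "Sex")
print(rate)
print(dist)
```

In this toy data the share of `male` rows is 0.5 in training but 0.75 in inference, the kind of gap that would be worth checking before trusting a model.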

③ Details of each feature

In part ③, you can check somewhat more detailed information about each feature.

[Figure: ③.png]

For example, in addition to the information in ②, the following is displayed:

- Missing-value rate
- Features with a high correlation coefficient
- List of the most frequent values
- List of values in descending order

Nothing unusual here. pandas-profiling shows only the top 5 entries of the **list of values in descending order**, so Sweetviz is useful if you want to see a few more or if there are more than 5 outliers. Even so, the features already shown in ②,

- **Positive rate (ratio where the objective variable is 1)**
- **Comparison of training data and inference data**

seem to be the real **benefits of using Sweetviz**.
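For reference, the per-feature details listed above (missing-value rate, most frequent values, values in descending order) are easy to reproduce for a single column with pandas. The helper `feature_details` and the sample values are hypothetical; the point is only that `n` is not limited to 5.

```python
import pandas as pd


def feature_details(series: pd.Series, n: int = 10) -> dict:
    """Missing rate, most frequent values and largest values of one feature."""
    return {
        "missing_rate": series.isna().mean(),
        # value_counts() ignores missing values and sorts by frequency.
        "most_frequent": series.value_counts().head(n),
        "largest": series.dropna().sort_values(ascending=False).head(n),
    }


# Made-up fares with one missing value and one obvious outlier.
fare = pd.Series([7.25, 71.28, 7.25, None, 512.33, 8.05], name="Fare")

details = feature_details(fare, n=3)
print(details["missing_rate"])
print(details["most_frequent"])
print(details["largest"].tolist())
```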

Implementation example

The code and the HTML output by Sweetviz are in the GitHub repository below. You can just look at the HTML, and the code is also quite easy to run. https://github.com/yuomori0127/sweetviz_titanic

Official page: Sweetviz

3. Pandas-profiling execution example

Let's also look at **pandas-profiling**, another library for EDA. A Titanic implementation example for **pandas-profiling** is published on Colab.

https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/titanic/titanic.ipynb

[Figure: pandas.png]

Getting this much just by passing in the data is more than enough, I think. In fact, I have used it a lot.
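"Just by passing in the data" looks roughly like the sketch below, using the package's `ProfileReport` class and its `to_file` method. The helper name, title, and output path are assumptions for illustration. Note that newer releases of the package are published as ydata-profiling (import name `ydata_profiling`); the import here uses the name from this article.

```python
import pandas as pd


def build_profile_report(df: pd.DataFrame,
                         title: str = "Titanic profiling report",
                         out_path: str = "pandas_profiling_titanic.html") -> str:
    """Profile a DataFrame and write the report as HTML.

    Hypothetical helper: the name and defaults are chosen for this sketch.
    """
    if df.empty:
        raise ValueError("nothing to profile: the DataFrame is empty")

    # Imported lazily so the helper can be defined without the package installed.
    from pandas_profiling import ProfileReport

    ProfileReport(df, title=title).to_file(out_path)
    return out_path


# Usage (assumed file name, adjust to your own path):
# build_profile_report(pd.read_csv("train.csv"))
```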

A benefit of pandas-profiling that Sweetviz does not have is that **it suggests explanatory variables to delete during preprocessing**.

[Figure: recommend.png]

As shown in the figure, it flags explanatory variables with, for example:

- High cardinality (number of distinct values)
- Many missing values
- Many zeros
- A high correlation coefficient

It is a very convenient feature, absent from Sweetviz, because you get these suggestions without having to go through each chart and distribution and set thresholds yourself.
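To show what "setting the thresholds yourself" would involve, here is a hand-rolled sketch of the four checks with pandas. It is not pandas-profiling's implementation: the function name and every threshold value are hypothetical, and the correlation check uses plain Pearson correlation on numeric columns.

```python
import pandas as pd


def suggest_columns_to_review(df: pd.DataFrame,
                              max_cardinality: int = 50,
                              max_missing: float = 0.5,
                              max_zeros: float = 0.5,
                              max_corr: float = 0.9) -> dict:
    """Return {column: [reasons]} based on hypothetical thresholds."""
    reasons = {}

    def flag(col, why):
        reasons.setdefault(col, []).append(why)

    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        # Cardinality only matters for non-numeric (categorical/text) columns.
        if not numeric and s.nunique() > max_cardinality:
            flag(col, "high cardinality")
        if s.isna().mean() > max_missing:
            flag(col, "many missing values")
        if numeric and (s == 0).mean() > max_zeros:
            flag(col, "many zeros")

    # For each highly correlated pair, flag the later column.
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > max_corr:
                flag(b, f"highly correlated with {a}")
    return reasons


# Made-up rows; "FareCopy" is just "Fare" divided by 10.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Cabin": [None, None, None, "C85"],
    "Parch": [0, 0, 0, 1],
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "FareCopy": [1.0, 2.0, 3.0, 4.0],
})
suggestions = suggest_columns_to_review(df, max_cardinality=3)
print(suggestions)
```

The output is a list of columns to look at, not a list of columns to drop: whether a flagged column should really be removed still depends on the data.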

Official page: pandas-profiling

4. Comparison of Sweetviz and pandas-profiling

I made a comparison table of Sweetviz and pandas-profiling.

Both have the basic functions of EDA, but the details differ a little. I can't list every feature, so this is only a selection.

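| # | Comparison item | Sweetviz | pandas-profiling |
|---|---|---|---|
| 1 | Display of distribution | ○ | ○ |
| 2 | Display of basic statistics | ○ | ○ |
| 3 | Display of missing-value rate | ○ | ○ |
| 4 | Display of correlation coefficient | ○ | ○ |
| 5 | Data display in order of frequency | ○ | ○ |
| 6 | Data display in order of value | ○ | △ (only 5) |
| 7 | Display of positive rate | ○ | × |
| 8 | Comparison of training data and inference data | ○ | × |
| 9 | Suggestion of explanatory variables to delete | × | ○ |

(○: available, △: partially available, ×: not available)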

5. Which one should I use?

Personally, I recommend **Sweetviz**.

After all, the **display of the positive rate** and the **comparison of training data and inference data** are very convenient. The advantage of pandas-profiling is its **suggestion of explanatory variables to delete**. It is a great feature, of course, but **you can't just apply the suggestions without looking at the data and thinking it through yourself**, and I can't tell whether a given suggestion is valid, so in the end I don't rely on them. Still, I appreciate that it **can help prevent oversights**.

It's hard to say which is better, but both are easy to run, so please try both and use the one that suits you!
