[PYTHON] Calculate the probability of outliers on a boxplot

Summary of this article

- This article investigates how often "outliers" appear in a box plot when the data are assumed to follow a normal distribution.
- Definition of "outlier": a value less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range (extreme values are included).
- Under a normal distribution, the probability of an "outlier" is approximately 0.70%.

Motivation to write this article

- Box plots drawn during statistical analysis in business often show "outliers".
- I wanted to know how likely "outliers" are to appear when a particular distribution is assumed.

Introduction: Box plot and outliers

For explanations of box plots and of how outliers are defined in them, please refer to the following pages.
- Box plot - Wikipedia
- How to read the box plot
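As a quick illustration of the rule those pages describe, here is a minimal sketch that flags outliers in a small made-up dataset using the 1.5 × IQR rule. The data values and the helper name `find_outliers` are hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with `method='inclusive'`; plotting libraries may compute quartiles slightly differently.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) under the k * IQR rule."""
    # Quartiles by linear interpolation between data points
    q1, _median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1                 # interquartile range
    lower_bound = q1 - k * iqr    # values below this are outliers
    upper_bound = q3 + k * iqr    # values above this are outliers
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data for illustration
lower_bound, upper_bound, outliers = find_outliers(sample)
print('Lower Bound:', lower_bound)
print('Upper Bound:', upper_bound)
print('Outliers:', outliers)
```

In this toy dataset only the value 30 lies beyond the upper boundary, so it is the single point a box plot would draw as an outlier.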

Calculate the probability of outliers appearing in the normal distribution

Let's calculate the probability of outliers using the quantile function (ppf) and the cumulative distribution function (cdf) of the standard normal distribution.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

mu = 0  # mean of the standard normal distribution
sd = 1  # standard deviation of the standard normal distribution

q1_ideal = stats.norm.ppf(q=0.25, loc=mu, scale=sd)  # 1st quartile
q3_ideal = stats.norm.ppf(q=0.75, loc=mu, scale=sd)  # 3rd quartile
iqr_ideal = q3_ideal - q1_ideal  # interquartile range
lb_ideal = q1_ideal - 1.5 * iqr_ideal  # lower outlier boundary
ub_ideal = q3_ideal + 1.5 * iqr_ideal  # upper outlier boundary

print('Q1:', q1_ideal)
print('Q3:', q3_ideal)
print('IQR:', iqr_ideal)
print('Lower Bound:', lb_ideal)
print('Upper Bound:', ub_ideal)

print('Probability of lower outliers:', stats.norm.cdf(x=lb_ideal, loc=mu, scale=sd) * 100, '%')
print('Probability of upper outliers:', stats.norm.sf(x=ub_ideal, loc=mu, scale=sd) * 100, '%')
print('Probability of outliers:', (stats.norm.sf(x=ub_ideal, loc=mu, scale=sd) + stats.norm.cdf(x=lb_ideal, loc=mu, scale=sd)) * 100, '%')

>Q1: -0.674489750196
>Q3: 0.674489750196
>IQR: 1.34897950039
>Lower Bound: -2.69795900078
>Upper Bound: 2.69795900078

>Probability of lower outliers: 0.348830161964%
>Probability of upper outliers: 0.348830161964%
>Probability of outliers: 0.697660323928%

So, with a normal distribution, the probability of an outlier is about 0.7%. With 1000 samples, about 7 would be expected to be outliers. For comparison, about 0.27% of values fall outside ±3σ, so box plot outliers are roughly 2.6 times as common as values beyond 3σ.
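This probability does not depend on the mean or the standard deviation, because both boundaries shift and scale together with the distribution. The sketch below checks that and compares the result with the 3σ tail. It uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so that it runs without extra packages; the function name `outlier_probability` and the example parameters (mean 50, standard deviation 10) are hypothetical.

```python
from statistics import NormalDist

def outlier_probability(mu=0.0, sd=1.0, k=1.5):
    """Probability that a normal variable falls outside the k * IQR boundaries."""
    dist = NormalDist(mu, sd)
    q1 = dist.inv_cdf(0.25)                 # 1st quartile
    q3 = dist.inv_cdf(0.75)                 # 3rd quartile
    iqr = q3 - q1                           # interquartile range
    lower = dist.cdf(q1 - k * iqr)          # mass below the lower boundary
    upper = 1.0 - dist.cdf(q3 + k * iqr)    # mass above the upper boundary
    return lower + upper

p_standard = outlier_probability(0, 1)      # standard normal
p_shifted = outlier_probability(50, 10)     # arbitrary mean and standard deviation
p_three_sigma = 2 * NormalDist().cdf(-3)    # probability of falling outside ±3σ
expected_in_1000 = 1000 * p_standard        # expected outlier count in 1000 samples

print('Outlier probability (mu=0, sd=1):', p_standard * 100, '%')
print('Outlier probability (mu=50, sd=10):', p_shifted * 100, '%')
print('Outside 3 sigma:', p_three_sigma * 100, '%')
print('Expected outliers in 1000 samples:', expected_in_1000)
```

The two outlier probabilities agree up to floating-point rounding, which is why the calculation with the standard normal distribution is enough.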

Verify that it is actually the case

Let's use data randomly sampled from a normal distribution to see if this really happens.

# Data generation
n = 1000000  # number of samples
mu = 0  # mean
sd = 1  # standard deviation
data = np.random.normal(loc=mu, scale=sd, size=n)  # n draws from the normal distribution (np.random.normal assumed)

# Sample quartiles and outlier boundaries
q1 = stats.scoreatpercentile(data, 25)
med = stats.scoreatpercentile(data, 50)
q3 = stats.scoreatpercentile(data, 75)
iqr = q3 - q1
lb = q1 - 1.5 * iqr
ub = q3 + 1.5 * iqr

print('Q1:', q1)
print('Q2:', med)
print('Q3:', q3)
print('IQR:', iqr)
print('Lower Bound:', lb)
print('Upper Bound:', ub)

# Count the samples outside each boundary
n_lower = np.count_nonzero(data < lb)
n_upper = np.count_nonzero(data > ub)
print('Ratio of lower outliers to the total number of samples:', n_lower / n * 100, '%')
print('Ratio of upper outliers to the total number of samples:', n_upper / n * 100, '%')
print('Ratio of outliers to the total number of samples:', (n_lower + n_upper) / n * 100, '%')

>Q1: -0.674873830027
>Q2: -0.00106013590319
>Q3: 0.673290672641
>IQR: 1.34816450267
>Lower Bound: -2.69712058403
>Upper Bound: 2.69553742664

>Ratio of lower outliers to the total number of samples: 0.3554%
>Ratio of upper outliers to the total number of samples: 0.3478%
>Ratio of outliers to the total number of samples: 0.7032%

The percentage of outliers found by random sampling was about 0.70% (0.7032%), almost the same as the theoretical value of 0.6977% calculated from the cumulative distribution function of the normal distribution.
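One rough way to judge whether "almost the same" is within ordinary random fluctuation is the binomial standard error of a simulated proportion. The sketch below plugs in the two values reported in this article. It treats the outlier boundaries as fixed, which is only an approximation because the simulation estimates them from the sample, and the variable names are illustrative.

```python
import math

p_theory = 0.00697660323928   # theoretical outlier probability (0.6977%)
p_observed = 0.007032         # outlier ratio from the simulation (0.7032%)
n = 1000000                   # number of samples in the simulation

# Standard error of a proportion estimated from n independent draws
standard_error = math.sqrt(p_theory * (1 - p_theory) / n)

# Observed difference expressed in units of the standard error
z_score = (p_observed - p_theory) / standard_error

print('Standard error:', standard_error * 100, '%')
print('Difference:', (p_observed - p_theory) * 100, '%')
print('Difference in standard errors:', z_score)
```

The difference comes out to less than one standard error, which is consistent with sampling noise alone.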
