Automatically generate a frequency distribution table in one shot with Python

Introduction

In mathematics and statistics, you often see a table that lists classes, class values, frequencies, cumulative frequencies, relative frequencies, and cumulative relative frequencies. It looks like this:

class                         Class value  frequency  Cumulative frequency  Relative frequency  Cumulative relative frequency
0 or more and less than 3     1.5          1          1                     0.07143             0.0714
3 or more and less than 6     4.5          6          7                     0.42857             0.5000
6 or more and less than 9     7.5          2          9                     0.14286             0.6429
9 or more and less than 12    10.5         2          11                    0.14286             0.7857
12 or more and less than 15   13.5         3          14                    0.21429             1.0000
total                         -            14         -                     1.00000             -

I wrote this function because I couldn't find an existing Python function that produces this table in one shot.

Existing convenience functions

No single function builds the complete table, but the following convenience functions return parts of the information you need. The remaining values can be obtained with a few extra calculations.

# Get cumulative sums with NumPy's cumsum() (applied to the frequencies, this gives the cumulative frequency)
np.asarray(data).cumsum()

# Count how often each value appears with pandas' value_counts()
pd.Series(data).value_counts()
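As a rough illustration of how these two functions combine, the sketch below counts the frequency of each distinct value in the sample data used later in this article and then derives the cumulative and relative frequencies from those counts. The variable names (`freq`, `cum_freq`, `rel_freq`) are only chosen for this example, and the values are not yet grouped into classes.

```python
import numpy as np
import pandas as pd

data = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]

# Frequency of each distinct value, ordered by value rather than by count
freq = pd.Series(data).value_counts().sort_index()

# Cumulative frequency: running total of the frequencies
cum_freq = np.cumsum(freq.to_numpy())

# Relative frequency: each frequency divided by the sample size
rel_freq = freq / freq.sum()

print(freq.index.tolist())  # [0, 3, 5, 7, 10, 11, 14]
print(freq.tolist())        # [1, 2, 4, 2, 1, 1, 3]
print(cum_freq.tolist())    # [1, 3, 7, 9, 10, 11, 14]
```

What is still missing at this point is the grouping into classes, which is the part the rest of this article automates.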

Ingenuity for automation

Determining the number of classes and the width of classes

There is no fixed rule for choosing the number of classes or the class width. Sturges' formula gives a reasonable guideline, however, so I use it here.

Sturges' formula

A formula that gives a guideline for the number of classes when creating frequency distribution tables and histograms. With N as the sample size and k as the number of classes, it is calculated as follows.

$$k = \log_2 N + 1$$

The class width is then obtained by dividing the range of the data (the maximum value minus the minimum value) by k.

# Find the number of classes from Sturges' formula
class_size = 1 + np.log2(len(data))
class_size = int(round(class_size))

# Find the class width
class_width = (max(data) - min(data)) / class_size  # The numerator is the range and the denominator is the number of classes.
class_width = round(class_width)
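To make the calculation concrete, here is a small sketch that applies the same two steps to the sample data used later in this article. The helper name `sturges_classes` is made up for this example and is not part of any library.

```python
import numpy as np


def sturges_classes(data):
    """Return (number of classes, class width) suggested by Sturges' formula.

    Hypothetical helper for illustration only.
    """
    # k = log2(N) + 1, rounded to the nearest integer
    k = int(round(1 + np.log2(len(data))))
    # Class width = range of the data divided by k, rounded to an integer
    width = round((max(data) - min(data)) / k)
    return k, width


x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
k, width = sturges_classes(x)
print(k, width)  # 5 3
```

With N = 14 the formula gives about 4.8, which rounds to 5 classes, and the range 14 divided by 5 is 2.8, which rounds to a width of 3. Note that the rounded width can become 0 when the range of the data is small compared with the number of classes, so a real implementation may want to guard against that case.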

Because there is no fixed rule, you will sometimes want to set the class width to a convenient value such as 5, so the function supports that as well. To use the value given by Sturges' formula, pass None as the second argument. To use a width of your own, pass that value as the second argument, and the number of classes is adjusted accordingly.

def Frequency_Distribution(data, class_width):
    if class_width is None:
        # Find the number of classes from Sturges' formula
        class_size = int(round(1 + np.log2(len(data))))
        # Find the class width: the range divided by the number of classes, rounded to an integer
        class_width = round((max(data) - min(data)) / class_size)
    else:
        # Use the given width; classes start at 0, so the number of classes follows from the maximum value
        class_size = int(max(data) // class_width) + 1

Dynamically change class index

Each class is written as "... or more and less than ...". When creating a frequency distribution table, I want to use these classes as the index of the table, but typing them by hand for every input data set is tedious. Instead, the index labels can be generated with a list comprehension that combines the class width, the number of classes, and the % format operator.


class_width = 5  # Class width
class_size = 3  # Number of classes
# The factor of 2 generates spare labels; the full code trims the list to the number of rows
['%s or more and less than %s' % (w, w + class_width) for w in range(0, class_size * class_width * 2, class_width)]

# ['0 or more and less than 5', '5 or more and less than 10', '10 or more and less than 15', '15 or more and less than 20', '20 or more and less than 25', '25 or more and less than 30']
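The same idea extends to the class values (the midpoint of each class), which the full code computes from the same lower bounds. The following is a minimal sketch using f-strings instead of the % operator; the helper name `class_labels_and_values` is hypothetical, and it generates exactly `class_size` labels rather than spare ones.

```python
def class_labels_and_values(class_width, class_size):
    """Return index labels and class values (midpoints) for classes starting at 0.

    Hypothetical helper for illustration only.
    """
    lower_bounds = range(0, class_size * class_width, class_width)
    labels = [f'{w} or more and less than {w + class_width}' for w in lower_bounds]
    # The class value is the midpoint between the lower and upper bound
    values = [w + class_width / 2 for w in lower_bounds]
    return labels, values


labels, values = class_labels_and_values(class_width=5, class_size=3)
print(labels)
# ['0 or more and less than 5', '5 or more and less than 10', '10 or more and less than 15']
print(values)  # [2.5, 7.5, 12.5]
```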

Creating a table

All that remains is to add rows and columns and to update the column and index names with pandas.
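As a sketch of what these pandas steps can look like in isolation, the example below starts from the class frequencies shown in the table in the introduction and adds the remaining columns and the total row. The variable name `table` and the use of '-' for cells without a meaningful total are choices made for this example, not necessarily how the full code does it.

```python
import pandas as pd

# Frequencies per class, taken from the table in the introduction (class width 3)
table = pd.DataFrame(
    {'frequency': [1, 6, 2, 2, 3]},
    index=['0 or more and less than 3', '3 or more and less than 6',
           '6 or more and less than 9', '9 or more and less than 12',
           '12 or more and less than 15'],
)
table.index.name = 'class'

# Add a column at the front, then derived columns at the end
table.insert(0, 'Class value', [1.5, 4.5, 7.5, 10.5, 13.5])
table['Cumulative frequency'] = table['frequency'].cumsum()
table['Relative frequency'] = table['frequency'] / table['frequency'].sum()
table['Cumulative relative frequency'] = (
    table['Cumulative frequency'] / table['frequency'].sum()
)

# Add the total row; '-' marks cells where a total is not meaningful
# (this turns those columns into object dtype, which is fine for display only)
table.loc['total'] = ['-', table['frequency'].sum(), '-',
                      table['Relative frequency'].sum(), '-']

print(table)
```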

Whole code

import pandas as pd
import numpy as np

# Build a frequency distribution table
def Frequency_Distribution(data, class_width):
    # Find the number of classes from Sturges' formula
    class_size = 1 + np.log2(len(data))
    class_size = int(round(class_size))
    if class_width is None:
        # Find the class width: the numerator is the range and the denominator is the number of classes
        class_width = (max(data) - min(data)) / class_size
        class_width = round(class_width)  # Round to an integer
    else:
        # Use the given width; classes start at 0, so the number of classes follows from the maximum value
        class_size = int(max(data) // class_width) + 1
    # print('Number of classes:', class_size)
    # print('Class width:', class_width)
    
    # Assign each observation to a class
    # Convert each observation to its class index (0, 1, 2, ...)
    cut_data = []
    for row in data:
        cut = row // class_width
        cut_data.append(cut)

    # Count the frequency of each class
    Frequency_data = pd.Series(cut_data).value_counts()
    Frequency_data = pd.DataFrame(Frequency_data)
    # Sort by index and transpose once so that a column can be inserted at any position
    F_data = Frequency_data.sort_index().T
    # If a class has frequency 0, insert it into the data frame
    for i in range(0, max(F_data.columns)):
        if i not in F_data:
            F_data.insert(i, i, 0)
    F_data = F_data.T.sort_index()
    # Rename the index and columns
    F_data.index = ['%s or more and less than %s' % (w, w + class_width) for w in range(0, class_size * class_width * 2, class_width)][:len(F_data)]
    F_data.columns = ['frequency']

    # The class value is the midpoint of each class
    F_data.insert(0, 'Class value', [((w + (w + class_width)) / 2) for w in range(0, class_size * class_width * 2, class_width)][:len(F_data)])
    F_data['Cumulative frequency'] = F_data['frequency'].cumsum()
    F_data['Relative frequency'] = F_data['frequency'] / sum(F_data['frequency'])
    F_data['Cumulative relative frequency'] = F_data['Cumulative frequency'] / max(F_data['Cumulative frequency'])
    # Add the total row (None where a total is not meaningful)
    F_data.loc['total'] = [None, sum(F_data['frequency']), None, sum(F_data['Relative frequency']), None]

    return F_data

#Sample data
x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
Frequency_Distribution(x, None)

Result

class                         Class value  frequency  Cumulative frequency  Relative frequency  Cumulative relative frequency
0 or more and less than 3     1.5          1          1                     0.07143             0.0714
3 or more and less than 6     4.5          6          7                     0.42857             0.5000
6 or more and less than 9     7.5          2          9                     0.14286             0.6429
9 or more and less than 12    10.5         2          11                    0.14286             0.7857
12 or more and less than 15   13.5         3          14                    0.21429             1.0000
total                         -            14         -                     1.00000             -

Supplement (2020/10/4)

The following code, which @nkay suggested in a comment, is recommended because it is much more concise.


def Frequency_Distribution(data, class_width=None):
    data = np.asarray(data)
    if class_width is None:
        class_size = int(np.log2(data.size).round()) + 1
        class_width = round((data.max() - data.min()) / class_size)

    bins = np.arange(0, data.max()+class_width+1, class_width)
    hist = np.histogram(data, bins)[0]
    cumsum = hist.cumsum()

    return pd.DataFrame({'Class value': (bins[1:] + bins[:-1]) / 2,
                         'frequency': hist,
                         'Cumulative frequency': cumsum,
                         'Relative frequency': hist / cumsum[-1],
                         'Cumulative relative frequency': cumsum / cumsum[-1]},
                        index=pd.Index([f'{bins[i]} or more and less than {bins[i+1]}'
                                        for i in range(hist.size)],
                                       name='class'))


x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
Frequency_Distribution(x)
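To see what the binning step does when a class width is specified, the following sketch runs only the `np.arange` and `np.histogram` part of the function above on the sample data. The width of 5 is simply an example value.

```python
import numpy as np

x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
data = np.asarray(x)
class_width = 5  # example value, as if passed as the second argument

# Same bin-edge construction as in the function above
bins = np.arange(0, data.max() + class_width + 1, class_width)

# hist[i] counts the values with edges[i] <= value < edges[i+1];
# only the last bin also includes its right edge
hist, edges = np.histogram(data, bins)

print(edges.tolist())  # [0, 5, 10, 15]
print(hist.tolist())   # [3, 6, 5]
```

The three frequencies add up to the sample size of 14, so every observation falls into exactly one class.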

Reference

In creating the above code, I mainly referred to the following site.

Going to the Data Scientist Statistical Glossary
