Introduction

In mathematics and statistics, you often see a frequency distribution table that lists classes, class values, frequencies, cumulative frequencies, relative frequencies, and cumulative relative frequencies. It looks like this.

class	Class value	frequency	Cumulative frequency	Relative frequency	Cumulative relative frequency
0 or more and less than 3	1.5	1	1	0.07143	0.0714
3 or more and less than 6	4.5	6	7	0.42857	0.5000
6 or more and less than 9	7.5	2	9	0.14286	0.6429
9 or more and less than 12	10.5	2	11	0.14286	0.7857
12 or more and less than 15	13.5	3	14	0.21429	1.0000
total	-	14	-	1.00000	-

I wrote the function in this article because I could not find a Python function that produces this table in one shot.

Existing convenience functions

No single function builds the complete table, but the following convenience functions return parts of the information. With a few additional calculations, you can obtain every value the table needs.

#NumPy: cumsum() returns the running total, which is the cumulative frequency when applied to frequencies
np.cumsum(data)

#pandas: value_counts() counts how often each value appears
pd.Series(data).value_counts()
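As a rough sketch of how these two functions combine, the snippet below counts each distinct value and then derives the cumulative and relative frequencies from those counts. The variable names are chosen here for illustration, and the data is the sample list used later in this article. Note that this works on individual values, not on classes, which is why it only gives part of the table.

```python
import pandas as pd

# Sample observations (same values as the article's sample data)
data = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]

# Frequency of each distinct value, ordered by value rather than by count
freq = pd.Series(data).value_counts().sort_index()

# Running total of the frequencies = cumulative frequency
cum_freq = freq.cumsum()

# Dividing by the sample size gives the relative frequency
rel_freq = freq / freq.sum()

print(pd.DataFrame({'frequency': freq,
                    'Cumulative frequency': cum_freq,
                    'Relative frequency': rel_freq}))
```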

Techniques for automation

Determining the number of classes and the width of classes

There are no firm rules for choosing the number of classes or the class width. Sturges' formula gives a reasonable guideline, however, so I use it here.

**Sturges' formula**
A formula that gives a guideline for the number of classes when creating frequency distribution tables and histograms. With N the sample size and k the number of classes, k is calculated as follows. The class width is then the range of the data (maximum minus minimum) divided by k.

$$k = \log_2 N + 1$$
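As a worked example, take the 14-value sample data used later in this article (minimum 0, maximum 14). Rounding to the nearest integer, as the code in this article does, gives:

```latex
\begin{align*}
k &= \log_2 14 + 1 \approx 3.81 + 1 = 4.81 \;\Rightarrow\; k = 5 \\
\text{class width} &= \frac{14 - 0}{5} = 2.8 \;\Rightarrow\; 3
\end{align*}
```

These are the five classes of width 3 shown in the table in the introduction.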


import numpy as np

#Sample data (the same list as in the full code below)
data = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]

#Find the number of classes from Sturges' formula
class_size = 1 + np.log2(len(data))
class_size = int(round(class_size))

#Find the class width: the range (numerator) divided by the number of classes (denominator)
class_width = (max(data) - min(data)) / class_size
class_width = round(class_width) #Round to an integer

However, because there are no firm rules, you sometimes want a round class width such as 5, so the function supports that as well. To use the value derived from Sturges' formula, pass None as the second argument. To use an arbitrary width, pass that value as the second argument; the number of classes is then adjusted accordingly.

import numpy as np

def Frequency_Distribution(data, class_width):
    if class_width is None:
        #Find the number of classes from Sturges' formula
        class_size = int(round(1 + np.log2(len(data))))
        #Find the class width: the range divided by the number of classes
        class_width = (max(data) - min(data)) / class_size
        class_width = round(class_width) #Round to an integer
    else:
        #Classes start at 0, so the maximum falls in class number max // width
        class_size = max(data) // class_width + 1
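To make the effect of the second argument concrete, here is a minimal sketch of just the class-selection step. The helper name `resolve_classes` is hypothetical (it does not appear in the original code), and it assumes non-negative integer data with classes starting at 0.

```python
import math

def resolve_classes(data, class_width=None):
    """Return (number of classes, class width). Hypothetical helper for illustration."""
    if class_width is None:
        # Sturges' formula, rounded to the nearest integer
        class_size = int(round(1 + math.log2(len(data))))
        # Range divided by the number of classes, rounded to an integer
        class_width = round((max(data) - min(data)) / class_size)
    else:
        # Classes start at 0, so the maximum lies in class number max // width
        class_size = max(data) // class_width + 1
    return class_size, class_width

x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
print(resolve_classes(x))     # width derived from Sturges' formula
print(resolve_classes(x, 5))  # width fixed at 5
```

With the sample data, the first call yields 5 classes of width 3, and the second yields 3 classes of width 5 (0 to 5, 5 to 10, 10 to 15).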

Dynamically change class index

Each class has the form "a or more and less than b". When creating a frequency distribution table, I want to use these classes as the index labels, but typing them by hand for each input data set is tedious. Instead, the labels can be generated with a list comprehension that loops over the class boundaries, using the class width, the number of classes, and the % format operator.


class_width = 5 #Class width
class_size = 10 #Number of classes
#The range is doubled to produce more labels than needed; the full code trims the list to the table length
['%s or more and less than %s' % (w, w + class_width) for w in range(0, class_size * class_width * 2, class_width)]

# ['0 or more and less than 5', '5 or more and less than 10', ..., '95 or more and less than 100']
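The same idea can be sketched with f-strings, generating exactly as many labels as there are classes and computing the class values (class midpoints) alongside them. The function name `class_labels` is hypothetical, and the sketch assumes classes start at 0.

```python
def class_labels(class_width, class_size):
    """Return (labels, class values) for classes starting at 0. Hypothetical helper."""
    # Lower boundary of each class: 0, width, 2 * width, ...
    lower_edges = range(0, class_width * class_size, class_width)
    labels = [f'{w} or more and less than {w + class_width}' for w in lower_edges]
    # The class value is the midpoint of each class
    class_values = [w + class_width / 2 for w in lower_edges]
    return labels, class_values

labels, class_values = class_labels(class_width=3, class_size=5)
for label, value in zip(labels, class_values):
    print(label, value)
```

Producing only the labels that are needed avoids trimming the list afterwards, at the cost of requiring the exact number of classes up front.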

Creating a table

All you have to do now is add rows and columns and update column names and index names using pandas.

Whole code

import pandas as pd
import numpy as np

#Make a frequency distribution table
def Frequency_Distribution(data, class_width):
    #Find the number of classes from Sturges' formula
    class_size = 1 + np.log2(len(data))
    class_size = int(round(class_size))
    if class_width is None:
        #Find the class width: the range divided by the number of classes
        class_width = (max(data) - min(data)) / class_size
        class_width = round(class_width) #Round to an integer
    else:
        #Classes start at 0, so the maximum falls in class number max // width
        class_size = max(data) // class_width + 1
    # print('Number of classes:', class_size)
    # print('Class width:', class_width)
    
    #Assign each observation to a class
    #Floor division by the class width gives the class number (0, 1, 2, ...)
    cut_data = []
    for row in data:
        cut = row // class_width
        cut_data.append(cut)
        
    #Count the frequency
    Frequency_data = pd.Series(cut_data).value_counts()
    Frequency_data = pd.DataFrame(Frequency_data)
    #I want to sort by index and insert a row at any position, so I transpose it once.
    F_data = Frequency_data.sort_index().T
    #If a class has frequency 0, insert it into the data frame with a count of 0
    for i in range(0, max(F_data.columns)):
        if i not in F_data:
            F_data.insert(i, i, 0)
    F_data = F_data.T.sort_index()
    #Rename indexes and columns
    F_data.index = ['%s or more and less than %s' % (w, w + class_width) for w in range(0, class_size * class_width * 2, class_width)][:len(F_data)]
    F_data.columns = ['frequency']

    F_data.insert(0, 'Class value', [((w + (w + class_width)) / 2) for w in range(0, class_size * class_width * 2, class_width)][:len(F_data)])
    F_data['Cumulative frequency'] = F_data['frequency'].cumsum()
    F_data['Relative frequency'] = F_data['frequency'] / sum(F_data['frequency'])
    F_data['Cumulative relative frequency'] = F_data['Cumulative frequency'] / max(F_data['Cumulative frequency'])
    F_data.loc['total'] = [None, sum(F_data['frequency']), None, sum(F_data['Relative frequency']), None]

    return F_data

#Sample data
x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
Frequency_Distribution(x, None)

Result

class	Class value	frequency	Cumulative frequency	Relative frequency	Cumulative relative frequency
0 or more and less than 3	1.5	1	1	0.07143	0.0714
3 or more and less than 6	4.5	6	7	0.42857	0.5000
6 or more and less than 9	7.5	2	9	0.14286	0.6429
9 or more and less than 12	10.5	2	11	0.14286	0.7857
12 or more and less than 15	13.5	3	14	0.21429	1.0000
total	-	14	-	1.00000	-
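One way to cross-check the frequency column is to bin the same data with `pd.cut`, which the original code does not use. The sketch below assumes integer data, classes starting at 0, and a class width of 3; `right=False` makes each interval closed on the left and open on the right, matching "a or more and less than b".

```python
import pandas as pd

x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
class_width = 3

# Class boundaries 0, 3, 6, ... extended so the last boundary exceeds the maximum
edges = list(range(0, max(x) + class_width + 1, class_width))

# Assign each observation to an interval [a, b)
binned = pd.cut(x, bins=edges, right=False)

# Count observations per interval, in interval order
counts = pd.Series(binned).value_counts().sort_index()
print(counts)
```

The counts should agree with the frequency column of the table above.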

Supplement (2020/10/4)

The following code, posted in a comment by @nkay, is recommended because it does the same job much more concisely.


def Frequency_Distribution(data, class_width=None):
    data = np.asarray(data)
    if class_width is None:
        class_size = int(np.log2(data.size).round()) + 1
        class_width = round((data.max() - data.min()) / class_size)

    bins = np.arange(0, data.max()+class_width+1, class_width)
    hist = np.histogram(data, bins)[0]
    cumsum = hist.cumsum()

    return pd.DataFrame({'Class value': (bins[1:] + bins[:-1]) / 2,
                         'frequency': hist,
                         'Cumulative frequency': cumsum,
                         'Relative frequency': hist / cumsum[-1],
                         'Cumulative relative frequency': cumsum / cumsum[-1]},
                        index=pd.Index([f'{bins[i]} or more and less than {bins[i+1]}'
                                        for i in range(hist.size)],
                                       name='class'))


x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
Frequency_Distribution(x)
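One detail of `np.histogram` is worth keeping in mind when reading this version: every bin is half-open, [a, b), except the last one, which also includes its right edge. The small sketch below, with made-up values chosen only for illustration, shows the effect.

```python
import numpy as np

edges = [0, 5, 10, 15]
values = [0, 5, 10, 15]

# Bins are [0, 5), [5, 10) and [10, 15]; the last bin is closed on both sides
counts, _ = np.histogram(values, bins=edges)
print(counts)  # [1 1 2]
```

For integer data and an integer class width, the bins built with `np.arange(0, data.max()+class_width+1, class_width)` always end above the maximum, so no observation lands on that final edge and the "less than" labels stay accurate.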

Reference

In creating the above code, I mainly referred to the following sites.
Going to the Data Scientist Statistical Glossary

Automatically generate frequency distribution table in one shot with Python