How to count the number of occurrences of each element in the list in Python with weight

What you want to do

Given the following two lists, I want to count the occurrences of each element in `a`, weighting each occurrence by the corresponding value in `b`. The Python version is 3.7.5.

a = ["A", "B", "C", "A"]
b = [ 1 ,  1 ,  2 ,  2 ]

c = hoge(a, b)
print(c)

output


{"A": 3, "B": 1, "C": 2}  #I want this kind of output

#The keys and values may also be returned separately
# (["A", "B", "C"], [3, 1, 2])
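As a point of reference, a minimal pure-Python sketch of such a function might look like the following. The name weighted_count is hypothetical (it stands in for hoge above), and the sketch assumes that a and b have the same length.

```python
from collections import defaultdict

def weighted_count(elements, weights):
    """Sum the weights of each distinct element (hypothetical stand-in for hoge)."""
    totals = defaultdict(int)
    # zip pairs each element with its weight; assumes both lists have equal length
    for element, weight in zip(elements, weights):
        totals[element] += weight
    return dict(totals)

a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]
print(weighted_count(a, b))  # {'A': 3, 'B': 1, 'C': 2}
```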

Specific example of what you want to do

Suppose you want to count the total number of copies sold so far at a bookstore for each book. [^1] However, all I have is **multiple tables that have already been aggregated by month**. For the sake of simplicity, let's imagine the following two csv files.

■ 2020_01.csv

| Book name | Number of books sold |
|:--|--:|
| Book_A | 1 |
| Book_B | 2 |
| Book_C | 3 |

■ 2020_02.csv

| Book name | Number of books sold |
|:--|--:|
| Book_A | 2 |
| Book_C | 1 |
| Book_D | 3 |

Combining these two tables yields a counting problem with "elements" (book names) and "weights" (numbers sold), as described in "What you want to do".
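To make the connection concrete, here is a small sketch that flattens the two monthly tables into an "elements" list and a "weights" list. The tables are written inline as lists of tuples rather than read from the csv files, and the variable names are only for illustration.

```python
# Monthly tables written inline (stand-ins for 2020_01.csv and 2020_02.csv)
table_01 = [("Book_A", 1), ("Book_B", 2), ("Book_C", 3)]
table_02 = [("Book_A", 2), ("Book_C", 1), ("Book_D", 3)]

# Concatenate the rows, then split them into elements and weights
rows = table_01 + table_02
a = [name for name, _ in rows]
b = [sold for _, sold in rows]

print(a)  # ['Book_A', 'Book_B', 'Book_C', 'Book_A', 'Book_C', 'Book_D']
print(b)  # [1, 2, 3, 2, 1, 3]
```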

Method

I implemented this in the following three ways. I would be grateful if you could tell me which one is better, or suggest another method [^2].

  1. Join all the tables, create a label that uniquely corresponds to each book name, and do a weighted count with numpy.bincount.
  2. Create a collections.Counter object for each table and add up the Counter objects of all tables.
  3. Use a for statement to add elements to a dictionary and update the values.
  3'. Use reduce instead of the for statement.

1. Use numpy.bincount

numpy's bincount function can perform a weighted count through its weights argument. Reference: Meaning of weight in numpy.bincount

However, each element you pass to np.bincount **must be a non-negative integer**.

numpy.bincount(x, weights=None, minlength=0)
Count number of occurrences of each value in array of non-negative ints.

x : array_like, 1 dimension, nonnegative ints
    Input array.
weights : array_like, optional
    Weights, array of the same shape as x.
minlength : int, optional
    A minimum number of bins for the output array. New in version 1.6.0.
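To illustrate what the weights argument does, here is a pure-Python sketch of the same calculation. The function name bincount_weighted is hypothetical, the sketch ignores minlength, and it is only an illustration of the behavior, not how numpy implements it. It also shows why the weighted result comes back as floating-point values.

```python
def bincount_weighted(x, weights):
    """Pure-Python illustration of np.bincount(x, weights=weights)."""
    if any(label < 0 for label in x):
        raise ValueError("labels must be non-negative integers")
    # One bin per integer from 0 to max(x); bins with no labels stay at 0.0
    bins = [0.0] * (max(x) + 1 if x else 0)
    for label, weight in zip(x, weights):
        bins[label] += weight
    return bins

print(bincount_weighted([0, 1, 2, 0], [1, 1, 2, 2]))  # [3.0, 1.0, 2.0]
```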

Therefore, to use np.bincount, you first need integer labels that uniquely correspond to the book names. I used sklearn's LabelEncoder to create the labels.

code

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
                      ["Book_B", 2],
                      ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
                      ["Book_C", 1],
                      ["Book_D", 3]],
                     columns=["Name", "Count"])

#Join table
df_all = pd.concat([df_01, df_02])
#The contents are like this.
# |  | Name | Count |
# |--:|:--|--:|
# | 0 | Book_A | 1 |
# | 1 | Book_B | 2 |
# | 2 | Book_C | 3 |
# | 0 | Book_A | 2 |
# | 1 | Book_C | 1 |
# | 2 | Book_D | 3 |

#Label Encoding
le = LabelEncoder()
encoded = le.fit_transform(df_all['Name'].values)

#Add new Label column
df_all["Label"] = encoded

#Weighted count with np.bincount
#In addition to the Label column, pass the Count column as the weights. The result is a float array, so convert it to int.
count_result = np.bincount(df_all["Label"], weights=df_all["Count"]).astype(int)
#Get the Name corresponding to each label
name_result = le.inverse_transform(np.arange(len(count_result)))

#Create the dictionary you want in the end
result = dict(zip(name_result, count_result))
print(result)

output


{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}

Supplement

You can also create the labels with np.unique. Setting its return_inverse argument to True gives the same result as LabelEncoder's fit_transform, and the corresponding names (name_result above) are returned at the same time.

#Label encoding using np.unique
name_result, encoded = np.unique(df_all["Name"], return_inverse=True)
print(encoded)
print(name_result)

output


[0 1 2 0 2 3]
['Book_A' 'Book_B' 'Book_C' 'Book_D']
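For readers who want to see what this encoding amounts to, here is a pure-Python sketch of the same idea: sort the distinct names, then replace each name by its position in that sorted list. The names sketch_names, sketch_unique, and sketch_labels are only for illustration, and this is not how np.unique is implemented.

```python
sketch_names = ["Book_A", "Book_B", "Book_C", "Book_A", "Book_C", "Book_D"]

# Sorted distinct values play the role of name_result
sketch_unique = sorted(set(sketch_names))
# Map each name to its position in the sorted list
index_of = {name: i for i, name in enumerate(sketch_unique)}
# The integer label of every original entry plays the role of encoded
sketch_labels = [index_of[name] for name in sketch_names]

print(sketch_labels)  # [0, 1, 2, 0, 2, 3]
print(sketch_unique)  # ['Book_A', 'Book_B', 'Book_C', 'Book_D']
```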

In addition, a weighted count is also possible with a for loop instead of np.bincount [^3].

#Create a zero-filled array with one slot per unique name
unique_length = len(name_result)
count_result = np.zeros(unique_length, dtype=int)

#Extract only the rows whose encoded matches i in the table and calculate the sum of the Count values.
for i in range(unique_length):
    count_result[i] = df_all.iloc[encoded==i]["Count"].sum().astype(int)

result = dict(zip(name_result, count_result))
print(result)

output


{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}

2. Use collections.Counter

Overview of collections.Counter

The Counter class in the standard library's collections module is often introduced for **unweighted** counting.

from collections import Counter

a = ["A", "B", "C", "A"]

#Give Counter a list and do unweighted counting
counter = Counter(a)
print(counter)

#Access to elements is the same as a dictionary
print("A:", counter["A"])

output


Counter({'A': 2, 'B': 1, 'C': 1})
A: 2

Also, if the data has already been aggregated, as it has here, you can store it in a dictionary and pass that to Counter to create the object.

counter = Counter(dict([["Book_A", 1],
                        ["Book_B", 2],
                        ["Book_C", 3]]))
print(counter)

output


Counter({'Book_C': 3, 'Book_B': 2, 'Book_A': 1})

Calculation using Counter

Counter objects also support arithmetic operations. Reference: Various ways to check the number of occurrences of an element with Python Counter

The goal here can be achieved by taking their sum.

from collections import Counter

a = ["A", "B", "C", "A"]
b = ["C", "D"]

counter_a = Counter(a)
counter_b = Counter(b)

#Can be added with sum
counter_ab = sum([counter_a, counter_b], Counter())
print(counter_ab)

output


Counter({'A': 2, 'C': 2, 'B': 1, 'D': 1})
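As a side note, here is a short sketch of two other ways to combine Counter objects: the + operator and the update() method. Note that + returns a new Counter and keeps only positive counts, while update() adds the counts in place.

```python
from collections import Counter

counter_a = Counter(["A", "B", "C", "A"])
counter_b = Counter(["C", "D"])

# The + operator returns a new Counter and keeps only positive counts
print(counter_a + counter_b)  # Counter({'A': 2, 'C': 2, 'B': 1, 'D': 1})

# update() adds counts in place instead of building a new object
total = Counter()
for c in (counter_a, counter_b):
    total.update(c)
print(total == counter_a + counter_b)  # True
```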

code

import pandas as pd
from collections import Counter

#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
                      ["Book_B", 2],
                      ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
                      ["Book_C", 1],
                      ["Book_D", 3]],
                     columns=["Name", "Count"])

#Creating a Counter
counter_01 = Counter(dict(df_01[["Name", "Count"]].values))
counter_02 = Counter(dict(df_02[["Name", "Count"]].values))

#Calculate the sum
# *Supplement: the second argument of sum sets the initial value.
#Here an empty Counter is used as the initial value. The default is 0 (int).
result = sum([counter_01, counter_02], Counter())
print(result)

output


Counter({'Book_C': 4, 'Book_A': 3, 'Book_D': 3, 'Book_B': 2})

~~Apparently, the counts are sorted in descending order.~~
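For the original two-list form of the problem, a Counter can also accumulate the weights directly. This is only a sketch under the assumption that a and b have the same length; it relies on the fact that a Counter returns 0 for missing keys.

```python
from collections import Counter

a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]

counter = Counter()
# Missing keys count as 0 in a Counter, so += works for new and existing keys
for element, weight in zip(a, b):
    counter[element] += weight

print(dict(counter))  # {'A': 3, 'B': 1, 'C': 2}
```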

3. Add elements to the dictionary and update values with the for statement

Adding elements to the dictionary and updating values

If a dictionary literal contains the same key more than once, the last value given for that key wins.

print( {"A": 1, "B": 2, "C": 3, "A":10} )

output


{'A': 10, 'B': 2, 'C': 3}

Building on this, to update the count for a key you can **get the current value** from the existing dictionary, add **the new value** to it, and place the updated entry at the end. To append an entry after the contents of an existing dictionary, unpack the dictionary by prefixing the variable with \*\* (two stars). Reference: [Python] function arguments \* (star) and \*\* (double star)

#Existing dictionary
d = {"A": 1, "B": 2, "C": 3}

#Element to add value
k = "A"
v = 10
#Update
d = {**d, k: d[k]+v}    #Equivalent to {"A": 1, "B": 2, "C": 3, "A": 1+10}

print(d)

output


{'A': 11, 'B': 2, 'C': 3}

However, if you specify a key that does not exist in the dictionary, d[k] raises a KeyError, so a new key cannot be added this way. To handle that, use the dictionary's get() method, which lets you specify a default value to return when the key does not exist. Reference: Get value from key with get method of Python dictionary (key that does not exist is OK)

d = {"A": 1, "B": 2, "C": 3}

#Specify an existing key
print(d.get("A", "NO KEY"))
#Specify a key that does not exist
print(d.get("D", "NO KEY"))

output


1
NO KEY

Setting the default value to 0 lets you handle additions and updates in the same way. Putting this together, the following code performs a weighted count by adding and updating values in an initially empty dictionary.

code

import pandas as pd
from itertools import chain

#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
                      ["Book_B", 2],
                      ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
                      ["Book_C", 1],
                      ["Book_D", 3]],
                     columns=["Name", "Count"])
#Convert data frame to dictionary
data1 = dict(df_01[["Name", "Count"]].values)
data2 = dict(df_02[["Name", "Count"]].values)

#Function definition
chain_items = lambda data : chain.from_iterable( d.items() for d in data )  #Function that chains the (key, value) pairs of multiple dictionaries
add_elem = lambda acc, e : { **acc, e[0]: acc.get(e[0], 0) + e[1] }  #Function that adds an element to the dictionary or updates its value

#A function that receives multiple dictionaries (key: element, value: weight) and merges them
def merge_count(*data) :
    result = {}
    for e in chain_items(data) :
        result = add_elem(result, e)
    return result

print( merge_count(data1, data2) )

output


{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
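One point that may matter for execution speed: add_elem builds a brand-new dictionary with {**acc, ...} on every step, so each step copies everything accumulated so far. Here is a sketch of an in-place variant, assuming that mutating the result dictionary is acceptable. The name merge_count_inplace is hypothetical, and the input dictionaries are written out by hand instead of being built from the data frames.

```python
from itertools import chain

def merge_count_inplace(*data):
    """Merge dictionaries of {element: weight} by updating one dictionary in place."""
    result = {}
    for key, value in chain.from_iterable(d.items() for d in data):
        # get() returns 0 for unseen keys, so adding and updating share one line
        result[key] = result.get(key, 0) + value
    return result

data1 = {"Book_A": 1, "Book_B": 2, "Book_C": 3}
data2 = {"Book_A": 2, "Book_C": 1, "Book_D": 3}
print(merge_count_inplace(data1, data2))
# {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```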

3'. Use reduce instead of the for statement

With reduce, iterative processing is possible without writing a for statement. reduce takes the following arguments.

- First argument: a function that takes the result accumulated so far and the current value as its arguments.
- Second argument: an iterable object (list, generator, etc.).
- Third argument (optional): the initial value. If it is omitted, the first element of the iterable is used as the starting value.

from functools import reduce

func = lambda ans, x: ans * x
a = [1, 2, 3, 4]
start = 10

print(reduce(func, a, start))

output


240  #    10*1 = 10
     # -> 10*2 = 20
     # -> 20*3 = 60
     # -> 60*4 = 240
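As a quick check of how the optional third argument behaves, the following sketch runs the same multiplication with and without an initial value. The name multiply is only for illustration.

```python
from functools import reduce

multiply = lambda ans, x: ans * x

# Without a third argument, the first element is used as the starting value
print(reduce(multiply, [1, 2, 3, 4]))      # 24
# With a third argument, the calculation starts from that value
print(reduce(multiply, [1, 2, 3, 4], 10))  # 240

# An empty iterable with no initial value raises TypeError
try:
    reduce(multiply, [])
except TypeError as err:
    print("TypeError:", err)
```

This is why merge_count passes an empty dictionary as the third argument: the accumulation needs a dictionary to start from.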

Recreating the above merge_count using reduce gives:

from functools import reduce

merge_count = lambda *data : reduce( add_elem, chain_items(data), {} )    #Equivalent to merge_count above
print( merge_count(data1, data2) )

output


{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}

The following site was very helpful for reduce. Reference: Introduction to Functional Programming

Referenced page

- Meaning of weight in numpy.bincount
- [Category variable encoding](https://qiita.com/ground0state/items/f516b97c7a8641e474c4)
- [[Python] Enumeration of list elements, how to use collections.Counter](https://qiita.com/ellio08/items/259388b511e24625c0d7)
- [Various ways to check the number of occurrences of an element with Python Counter](https://www.headboost.jp/python-counter/)
- [[Python] function arguments \* (star) and \*\* (double star)](https://qiita.com/supersaiakujin/items/faee48d35f8d80daa1ac)
- [Introduction to Functional Programming](https://postd.cc/an-introduction-to-functional-programming/)

[^1]: I chose a simple concrete example to make the problem easier to convey, but in reality I used this to aggregate the morphological analysis results of multiple documents.
[^2]: In terms of execution speed, memory efficiency, and so on.
[^3]: With my own knowledge, I couldn't think of anything other than writing a for statement (excluding list comprehensions).
