# Thing you want to do

Given the following two lists, I want to count the occurrences of each element in `a`, weighted by the corresponding value in `b`. Python is 3.7.5.

```python
a = ["A", "B", "C", "A"]
b = [ 1 ,  1 ,  2 ,  2 ]

c = hoge(a, b)
print(c)
```

#### `output`

```
{"A": 3, "B": 1, "C": 2}  # I want this kind of output

# The key and value can also be separate:
# (["A", "B", "C"], [3, 1, 2])
```
Addendum: a simple implementation for the above problem was introduced in the comments.
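For reference, one minimal `hoge` might look like this (a sketch using a plain dict; the name `hoge` is the placeholder from the question, and this is essentially the dict-based approach of method 3 below):

```python
def hoge(a, b):
    """Count occurrences of each element of a, weighted by the values of b."""
    result = {}
    for key, weight in zip(a, b):
        # get(key, 0) returns 0 for keys not yet in the dictionary
        result[key] = result.get(key, 0) + weight
    return result

a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]
print(hoge(a, b))  # {'A': 3, 'B': 1, 'C': 2}
```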

# Specific example of what you want to do

Suppose you want to count, for each title, the number of copies a bookstore has sold so far.[^1] However, all I have is **multiple tables that have already been aggregated by month**. For simplicity, imagine the following two CSV files.

■ 2020_01.csv

| Book name | Number of books sold |
|:--|--:|
| Book_A | 1 |
| Book_B | 2 |
| Book_C | 3 |

■ 2020_02.csv

| Book name | Number of books sold |
|:--|--:|
| Book_A | 2 |
| Book_C | 1 |
| Book_D | 3 |

Combining these two tables turns this into exactly the counting problem with "elements" and "weights" described in "Thing you want to do".
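For concreteness, the two files could be read and stacked like this (a sketch: `StringIO` stands in for the actual CSV files, and the headers are shortened to `Name`/`Count` to match the code later in this article):

```python
import pandas as pd
from io import StringIO

# Inline stand-ins for 2020_01.csv and 2020_02.csv
csv_01 = "Name,Count\nBook_A,1\nBook_B,2\nBook_C,3\n"
csv_02 = "Name,Count\nBook_A,2\nBook_C,1\nBook_D,3\n"

# Read each month and stack the tables on top of each other
df_all = pd.concat([pd.read_csv(StringIO(c)) for c in (csv_01, csv_02)])
print(df_all)
```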

# Method

I implemented this in the following three ways. I would be grateful if you could tell me which one is better, or suggest another method.[^2]

1. Concatenate all the tables, create a `label` that uniquely corresponds to each book name, and do a weighted count with `numpy.bincount`.
2. Create a `collections.Counter` object for each table and add the `Counter` objects of all the tables together.
3. Use a for statement to add elements to a dictionary while updating their values. As a variant (3'), use `reduce` instead of the for statement.

Addendum: method 3', which was suggested in a comment, has been added.

## 1. Use numpy.bincount

You can do a weighted count of the input by using `numpy`'s `bincount` function. Reference: Meaning of weight in numpy.bincount

However, every element passed to `np.bincount` **must be a non-negative integer**.

> `numpy.bincount(x, weights=None, minlength=0)`
> Count number of occurrences of each value in array of non-negative ints.
>
> x : array_like, 1 dimension, nonnegative ints. Input array.
> weights : array_like, optional. Weights, array of the same shape as x.
> minlength : int, optional. A minimum number of bins for the output array. New in version 1.6.0.
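As a minimal illustration of the quoted signature (the labels and weights here are toy values of my own, not the bookstore data):

```python
import numpy as np

x = np.array([0, 1, 2, 0])          # non-negative integer labels
w = np.array([1.0, 1.0, 2.0, 2.0])  # weight of each element

print(np.bincount(x))             # [2 1 1]     (unweighted counts)
print(np.bincount(x, weights=w))  # [3. 1. 2.]  (sum of weights per bin)
```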

Therefore, to use `np.bincount`, I prepare a `label` that uniquely corresponds to each book name. I used `sklearn`'s `LabelEncoder` to create the `label`.

### code

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
                      ["Book_B", 2],
                      ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
                      ["Book_C", 1],
                      ["Book_D", 3]],
                     columns=["Name", "Count"])

# Concatenate the tables
df_all = pd.concat([df_01, df_02])
# The contents look like this:
# |   | Name   | Count |
# |--:|:-------|------:|
# | 0 | Book_A |     1 |
# | 1 | Book_B |     2 |
# | 2 | Book_C |     3 |
# | 0 | Book_A |     2 |
# | 1 | Book_C |     1 |
# | 2 | Book_D |     3 |

# Label encoding
le = LabelEncoder()
encoded = le.fit_transform(df_all["Name"].values)

# Add the new Label column
df_all["Label"] = encoded

# Weighted count with np.bincount
# Pass the Label column as input and the Count column as the weights.
# The result is a float array, so convert it back to int.
count_result = np.bincount(df_all["Label"], weights=df_all["Count"]).astype(int)
# Get the Name corresponding to each count
name_result = le.inverse_transform(range(len(count_result)))

# Build the dictionary we wanted in the end
result = dict(zip(name_result, count_result))
print(result)
```

#### `output`

``````
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
``````

### Supplement

You can also create the `label` using `np.unique`. Setting the `return_inverse` argument of `np.unique` to True gives the same result as `LabelEncoder`'s `fit_transform`. In addition, you get the corresponding names (`name_result` above) at the same time.

```python
# Label encoding with np.unique
name_result, encoded = np.unique(df_all["Name"], return_inverse=True)
print(encoded)
print(name_result)
```

#### `output`

```
[0 1 2 0 2 3]
['Book_A' 'Book_B' 'Book_C' 'Book_D']
```

A weighted count is also possible with a plain for loop, without using `np.bincount`.[^3]

```python
# Create a zero-filled array with the same length as the desired dictionary
unique_length = len(name_result)
count_result = np.zeros(unique_length, dtype=int)

# For each label i, extract the rows whose encoded value equals i
# and take the sum of their Count values.
for i in range(unique_length):
    count_result[i] = df_all.iloc[encoded == i]["Count"].sum()

result = dict(zip(name_result, count_result))
print(result)
```

#### `output`

```
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```

## 2. Use collections.Counter

### Overview of collections.Counter

The `Counter` class from the standard library module `collections` is often introduced for **unweighted** counting.

```python
from collections import Counter

a = ["A", "B", "C", "A"]

# Give Counter a list to do an unweighted count
counter = Counter(a)
print(counter)

# Elements are accessed the same way as in a dictionary
print("A:", counter["A"])
```

#### `output`

```
Counter({'A': 2, 'B': 1, 'C': 1})
A: 2
```

Also, when the data is already aggregated, as in this case, you can build a `Counter` by first putting the pairs into a dict and passing that.

```python
counter = Counter(dict([["Book_A", 1],
                        ["Book_B", 2],
                        ["Book_C", 3]]))
print(counter)
```

#### `output`

```
Counter({'Book_A': 1, 'Book_B': 2, 'Book_C': 3})
```

### Calculation using Counter

Incidentally, `Counter` objects support arithmetic operations. Reference: Various ways to check the number of occurrences of an element with Python Counter

It turns out that this time's goal can be achieved simply by taking the sum.

```python
from collections import Counter

a = ["A", "B", "C", "A"]
b = ["C", "D"]

counter_a = Counter(a)
counter_b = Counter(b)

# Counters can be added up with sum
counter_ab = sum([counter_a, counter_b], Counter())
print(counter_ab)
```

#### `output`

```
Counter({'A': 2, 'C': 2, 'B': 1, 'D': 1})
```

### code

```python
import pandas as pd
from collections import Counter

# Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
                      ["Book_B", 2],
                      ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
                      ["Book_C", 1],
                      ["Book_D", 3]],
                     columns=["Name", "Count"])

# Create a Counter per table
counter_01 = Counter(dict(df_01[["Name", "Count"]].values))
counter_02 = Counter(dict(df_02[["Name", "Count"]].values))

# Take the sum
# * Note: the second argument of sum is the initial value.
#   Here an empty Counter is used; the default is 0 (an int).
result = sum([counter_01, counter_02], Counter())
print(result)
```

#### `output`

```
Counter({'Book_C': 4, 'Book_A': 3, 'Book_D': 3, 'Book_B': 2})
```

~~Apparently the counts come out sorted in descending order.~~

Addendum: sometimes they were not sorted. It is a dictionary in the first place, so the order does not matter anyway.

## 3. Add elements to the dictionary and update values with the for statement

### Adding elements to the dictionary and updating values

If you give a dictionary literal multiple `value`s for the same `key`, the last `value` given wins.

```python
print( {"A": 1, "B": 2, "C": 3, "A": 10} )
```

#### `output`

```
{'A': 10, 'B': 2, 'C': 3}
```

Using this, to update the count for an existing `key`, you can **get the current value** from the dictionary, **add the new value** to it, and append the pair at the end. To append after an existing dictionary, you can unpack it by prefixing the variable with \*\* (two stars). Reference: [Python] function arguments \* (star) and \*\* (double star)

```python
# Existing dictionary
d = {"A": 1, "B": 2, "C": 3}

# Key and value to add
k = "A"
v = 10
# Update
d = {**d, k: d[k] + v}    # equivalent to {"A": 1, "B": 2, "C": 3, "A": 1 + 10}

print(d)
```

#### `output`

```
{'A': 11, 'B': 2, 'C': 3}
```

However, `d[k]` raises a `KeyError` for a `key` that is not in the dictionary, so new `key`s cannot be added this way as-is. Instead, we use the dictionary's `get()` method, whose second argument is the default value returned when the `key` does not exist. Reference: Get value from key with get method of Python dictionary (key that does not exist is OK)

```python
d = {"A": 1, "B": 2, "C": 3}

# Specify an existing key
print(d.get("A", "NO KEY"))
# Specify a key that does not exist
print(d.get("D", "NO KEY"))
```

#### `output`

```
1
NO KEY
```

By setting the default value to `0`, additions and updates can be handled in the same way. Putting the above together, the code that does the weighted count by adding/updating values in an initially empty dictionary is as follows.
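As an aside, the same default-to-zero behavior is available from `collections.defaultdict(int)` in the standard library, which removes the need for `get()` entirely. A minimal sketch (not the approach used in the code in this article):

```python
from collections import defaultdict

d = defaultdict(int)  # missing keys start at 0
for key, weight in [("A", 1), ("B", 2), ("A", 10)]:
    d[key] += weight  # no KeyError even on the first occurrence of a key

print(dict(d))  # {'A': 11, 'B': 2}
```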

### code

```python
import pandas as pd
from itertools import chain

# Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
                      ["Book_B", 2],
                      ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
                      ["Book_C", 1],
                      ["Book_D", 3]],
                     columns=["Name", "Count"])
# Convert the data frames to dictionaries
data1 = dict(df_01[["Name", "Count"]].values)
data2 = dict(df_02[["Name", "Count"]].values)

# Function definitions
# Chain the (key, value) pairs of multiple dictionaries into one iterable
chain_items = lambda data: chain.from_iterable(d.items() for d in data)
# Add a (key, value) pair to the dictionary, updating any existing value
add_elem = lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]}

# Merge multiple dictionaries whose keys are elements and values are weights
def merge_count(*data):
    result = {}
    for kv in chain_items(data):
        result = add_elem(result, kv)
    return result

print( merge_count(data1, data2) )
```

#### `output`

```
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```

### 3'. Use `reduce` instead of the for statement

With `reduce`, iterative processing is possible without writing a for statement. `reduce` takes the following arguments.

- First argument: a function that takes the result accumulated so far and the current value.
- Second argument: an iterable (list, generator, etc.).
- Third argument (optional): the initial value. If omitted, the first element of the iterable is used as the initial value.

```python
from functools import reduce

func = lambda ans, x: ans * x
a = [1, 2, 3, 4]
start = 10

print(reduce(func, a, start))
```

#### `output`

```
240  #    10*1 = 10
     # -> 10*2 = 20
     # -> 20*3 = 60
     # -> 60*4 = 240
```

Recreating the above `merge_count` with `reduce` gives:

```python
from functools import reduce

# Equivalent to merge_count above
merge_count = lambda *data: reduce(add_elem, chain_items(data), {})
print( merge_count(data1, data2) )
```

#### `output`

```
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```

The following site was very helpful for understanding `reduce`. Reference: Introduction to Functional Programming

# Referenced page

- Meaning of weight in numpy.bincount
- [Categorical variable encoding](https://qiita.com/ground0state/items/f516b97c7a8641e474c4)
- [[Python] Enumeration of list elements, how to use collections.Counter](https://qiita.com/ellio08/items/259388b511e24625c0d7)
- [Various ways to check the number of occurrences of an element with Python Counter](https://www.headboost.jp/python-counter/)
- [[Python] function arguments \* (star) and \*\* (double star)](https://qiita.com/supersaiakujin/items/faee48d35f8d80daa1ac)
- [Introduction to Functional Programming](https://postd.cc/an-introduction-to-functional-programming/)

[^1]: I used a convenient concrete example to make things easier to convey; in reality this was used to aggregate the morphological-analysis results of multiple documents.
[^2]: Execution speed, memory efficiency, etc.
[^3]: With my own knowledge I could not think of anything other than writing a for loop (list comprehensions aside).