[Python] [Recommendation] Summary of the advantages and disadvantages of content-based and collaborative filtering, and how to implement them

This is a learning memo summarizing the advantages, disadvantages, and implementation methods of collaborative filtering and content-based filtering, two major recommendation algorithms. I have built a collaborative filtering recommendation engine before, so this time I would like to build a content-based recommendation engine. (・ㅂ・)و

Advantages and disadvantages of content-based and collaborative filtering

■ List of strengths and weaknesses

| Category | Collaborative filtering | Content-based |
|:--|:-:|:-:|
| Diversity | o | x |
| Domain knowledge | o | x |
| Startup (cold-start) problem | x | △ |
| Number of users | x | o |
| Coverage | x | o |
| Similar items | x | o |
| Minority users | x | o |

(o = strength, x = weakness, △ = partially handled)

Description of the algorithms

■ What is collaborative filtering? A method that makes recommendations based on the behavior history of the users of items. Amazon's "Customers who bought this item also bought" feature is a famous example. In short, collaborative filtering recommends based on user behavior.

■ What is content-based filtering? A method that ranks items by the similarity of their feature vectors and recommends the closest ones. This is what happens when a gourmet site shows shops matching the keywords "Shinjuku / ethnic food" entered by the user. In short, content-based filtering recommends based on item characteristics.

Details of each characteristic

■ Diversity (Collaborative: o / Content-based: x) Content-based filtering never recommends information that is not contained in the item's own content, whereas collaborative filtering recommends through other users, so it can surface items the user does not yet know about.

■ Domain knowledge (Collaborative: o / Content-based: x) Collaborative filtering can make recommendations from other users' behavior history without any information or knowledge about the target items. Content-based filtering, by contrast, requires designing how to convert item features into feature vectors.

■ Startup (cold-start) problem (Collaborative: x / Content-based: △) Collaborative filtering cannot be used when a new system has no other users yet, or when user profiles are hard to obtain. Content-based filtering can still make recommendations in such situations, as long as the item features can be obtained.

■ Number of users (Collaborative: x / Content-based: o) Collaborative filtering cannot score items that nobody has rated yet. Content-based filtering, on the other hand, can make recommendations from item features even when there are no users.

■ Coverage (Collaborative: x / Content-based: o) As with the number of users, collaborative filtering cannot score items that nobody has rated yet, so its recommendations cannot cover the entire product catalog.

■ Similar items (Collaborative: x / Content-based: o) Collaborative filtering takes no account of item characteristics, so it cannot distinguish, for example, the same mug in different colors. With content-based filtering you can avoid this problem by cutting off items that are too similar.

■ Minority users (Collaborative: x / Content-based: o) With collaborative filtering, if the number of users of a particular item is extremely small, similar items cannot be predicted and nothing can be recommended. Content-based filtering can still make recommendations from the item's characteristics.

How to calculate similarity for content-based filtering

Once item features have been extracted and vectorized, their similarity can be calculated with cosine similarity. Compute the similarity between product X and each of products A–F, sort the results, and you have a content-based recommendation list.

(Figure: the cosine similarity formula, cosθ = A·B / (|A||B|). Reprinted from: Five most popular similarity measures implementation in python)
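As a quick numeric check of the formula, here is a minimal sketch with made-up three-dimensional vectors (the values are hypothetical, not taken from this article's data):

# Made-up example vectors (hypothetical values, for illustration only)
from math import sqrt

a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]

dot = sum(x * y for x, y in zip(a, b))  # A·B = 1.0 (one shared dimension)
norm_a = sqrt(sum(x ** 2 for x in a))   # |A| = √2
norm_b = sqrt(sum(y ** 2 for y in b))   # |B| = √2
print(dot / (norm_a * norm_b))          # => approximately 0.5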

Cosine Similarity


# -*- coding: utf-8 -*-
from math import sqrt


def similarity(tfidf1, tfidf2):
    """
    Get Cosine Similarity
    cosθ = A·B / (|A||B|)
    :param tfidf1: list[list[str, float]]
    :param tfidf2: list[list[str, float]]
    :rtype : float
    """
    tfidf2_dict = {key: value for key, value in tfidf2}

    # A·B: dot product over the words the two vectors share
    ab = 0
    for key, value in tfidf1:
        value2 = tfidf2_dict.get(key)
        if value2:
            ab += float(value * value2)

    # |A| and |B|: Euclidean norms of the two vectors
    a = sqrt(sum(v ** 2 for k, v in tfidf1))
    b = sqrt(sum(v ** 2 for k, v in tfidf2))

    return float(ab / (a * b))
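For example, to produce recommendations for an item X, score every candidate with similarity() and sort in descending order. A minimal sketch (the TF vectors below are hypothetical, in the same [word, weight] format the function expects):

# Hypothetical TF vectors in the same [word, weight] format (made-up values)
item_x = [['comedy', 0.5], ['Osaka', 0.3], ['radio', 0.2]]
candidates = {
    'A': [['comedy', 0.6], ['Tokyo', 0.4]],
    'B': [['Osaka', 0.7], ['radio', 0.3]],
    'C': [['baseball', 1.0]],
}

# Score each candidate against X and sort descending: top entries are the recommendations
ranking = sorted(
    ((name, similarity(item_x, tfidf)) for name, tfidf in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)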

How to implement content-based filtering

The Wikipedia articles for the comedy duos [Ikuyo Kuruyo](https://ja.wikipedia.org/wiki/%E4%BB%8A%E3%81%84%E3%81%8F%E3%82%88%E3%83%BB%E3%81%8F%E3%82%8B%E3%82%88) and [Korokoro Chikichiki Peppers](https://ja.wikipedia.org/wiki/%E3%82%B3%E3%83%AD%E3%82%B3%E3%83%AD%E3%83%81%E3%82%AD%E3%83%81%E3%82%AD%E3%83%9A%E3%83%83%E3%83%91%E3%83%BC%E3%82%BA) were morphologically analyzed to extract nouns, and each noun's frequency divided by the total number of nouns was used as the feature vector. This weighting approach is known as TF-IDF (strictly speaking, the frequency ratio is the TF part; the IDF part additionally down-weights words that are common across documents).
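Before the full example, here is a minimal sketch of how such a vector can be built, assuming the nouns have already been extracted by a morphological analyzer such as MeCab (the noun list below is hypothetical):

from collections import Counter

# Hypothetical nouns extracted from one article by morphological analysis
nouns = ['comedy', 'Osaka', 'comedy', 'Manzai', 'radio', 'comedy']

# Term frequency: occurrences of each noun / total number of nouns
# (the IDF weighting is omitted in this sketch)
counts = Counter(nouns)
total = float(len(nouns))
tf_vector = [[word, count / total] for word, count in counts.most_common()]
print(tf_vector)
# => [['comedy', 0.5], ['Osaka', 0.166...], ['Manzai', 0.166...], ['radio', 0.166...]]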

# -*- coding: utf-8 -*-
# (uses the similarity() function defined above)

# Ikuyo Kuruyo (TF values of frequent nouns)
ikuyo_kuruyo = [
    ['Kyoto', 0.131578947369],
    ['Manzai', 0.122807017544],
    ['comedy', 0.122807017544],
    ['radio', 0.105263157894],
    ['Yoshimoto Kogyo', 0.09649122807],
    ['Entertainer', 0.09649122807],
    ['Ikuyo', 0.0701754385966],
    ['combination', 0.0701754385966],
    ['Osaka', 0.0526315789474],
    ['Master', 0.0438596491229],
    ['Upward', 0.0438596491229],
    ['Kao', 0.0438596491229],
]


# Korokoro Chikichiki Peppers (TF values of frequent nouns)
chikichiki = [
    ['comedy', 0.169014084507],
    ['Nishino', 0.12676056338],
    ['Osaka', 0.112676056338],
    ['Nadal', 0.0845070422536],
    ['combination', 0.0845070422536],
    ['Winner', 0.0704225352114],
    ['Korokoro Chikichiki Peppers', 0.0704225352114],
    ['Neta', 0.0704225352114],
    ['Entertainer', 0.056338028169],
    ['Yoshimoto Kogyo', 0.056338028169],
    ['Manzai', 0.056338028169],
    ['King of Conte', 0.0422535211267],
]

# Calculate similarity
print(similarity(ikuyo_kuruyo, chikichiki))

"""
>>>Execution result
0.521405857242
"""

Benefits of implementing content-based filtering with Cosine Similarity

If you implement content-based filtering with feature vectors, changing the weighting only requires adding one element to each vector, which makes operation flexible. For example, let's add the feature "DVD" with a weight of 0.2 to both items to increase their similarity.


# Ikuyo Kuruyo
ikuyo_kuruyo = [
    ['DVD', 0.2],
    ['Kyoto', 0.131578947369],
    ['Manzai', 0.122807017544],
...
]


# Korokoro Chikichiki Peppers
chikichiki = [
    ['DVD', 0.2],
    ['comedy', 0.169014084507],
    ['Nishino', 0.12676056338],
...
]

# Calculate similarity
print(similarity(ikuyo_kuruyo, chikichiki))

"""
>>>Execution result
0.661462974013
"""

There is a library

A library called simple_tfidf_japanese is available on PyPI. It provides a TF-IDF function that morphologically analyzes text fetched from the web and counts nouns, as well as a cosine similarity calculation. The implementation is only 142 lines, so you can read through the source code in a short time.

Implementation with simple_tfidf_japanese


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
from simple_tfidf_japanese.tfidf import TFIDF

ikuyo_wikipedia_url = 'https://ja.wikipedia.org/wiki/%E4%BB%8A%E3%81%84%E3%81%8F%E3%82%88%E3%83%BB%E3%81%8F%E3%82%8B%E3%82%88'
korochiki_wikipedia_url = 'https://ja.wikipedia.org/wiki/%E3%82%B3%E3%83%AD%E3%82%B3%E3%83%AD%E3%83%81%E3%82%AD%E3%83%81%E3%82%AD%E3%83%9A%E3%83%83%E3%83%91%E3%83%BC%E3%82%BA'
ikuyo_tfidf = TFIDF.gen_web(ikuyo_wikipedia_url)
chiki_tfidf = TFIDF.gen_web(korochiki_wikipedia_url)

# Calculate similarity
print(TFIDF.similarity(ikuyo_tfidf, chiki_tfidf))
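Since TFIDF.gen_web() returns a TF-IDF vector, ranking several candidate pages against a reference page is just a loop and a sort. A minimal sketch reusing the URLs defined above (in practice you would append more candidate URLs):

from simple_tfidf_japanese.tfidf import TFIDF

reference_url = ikuyo_wikipedia_url          # page to find recommendations for
candidate_urls = [korochiki_wikipedia_url]   # add more candidate page URLs here

reference_tfidf = TFIDF.gen_web(reference_url)

# Score each candidate page and sort by similarity, best match first
scores = [(url, TFIDF.similarity(reference_tfidf, TFIDF.gen_web(url)))
          for url in candidate_urls]
scores.sort(key=lambda pair: pair[1], reverse=True)
print(scores)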

Similarity calculation and implementation for collaborative filtering

There are multiple ways to calculate similarity for collaborative filtering. This time, I will introduce examples using co-occurrence and the Jaccard index.

■ Sample data: which customers purchased which products

| Product | Purchasing customer IDs |
|:-:|:--|
| X | 1, 3, 5 |
| A | 2, 4, 5 |
| B | 1, 2, 3 |
| C | 2, 3, 4, 7 |
| D | 3 |
| E | 4, 6, 7 |

■ Co-occurrence: counts how many of the customers who bought product X also bought each other product. It is less accurate than the Jaccard index, but the calculation is simple and fast. For example, the co-occurrence value of product X and product A is "1", because only customer 5 purchased both products.
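A quick check of that worked example with Python sets:

# Customers who bought X and A (taken from the sample data above)
bought_x = {1, 3, 5}
bought_a = {2, 4, 5}
print(len(bought_x & bought_a))  # => 1: only customer 5 bought both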

■ How to calculate the Jaccard index

The Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|: here, the number of customers who bought both products divided by the number who bought either one.

Calculating co-occurrence and the Jaccard index


# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import unicode_literals
from collections import defaultdict


def jaccard(e1, e2):
    """
    Calculate the Jaccard index
    :param e1: list of int
    :param e2: list of int
    :rtype: float
    """
    set_e1 = set(e1)
    set_e2 = set(e2)
    return float(len(set_e1 & set_e2)) / float(len(set_e1 | set_e2))

# Customer IDs that purchased product X: 1, 3, 5
product_x = [1, 3, 5]
product_a = [2, 4, 5]
product_b = [1, 2, 3]
product_c = [2, 3, 4, 7]
product_d = [3]
product_e = [4, 6, 7]

# Purchase data for the other products
products = {
    'A': product_a,
    'B': product_b,
    'C': product_c,
    'D': product_d,
    'E': product_e,
}

# Calculate co-occurrence values with X
print("Co-occurrence")
r = defaultdict(int)

for key in products:
    overlap = set(product_x) & set(products[key])
    r[key] = len(overlap)
print(r)

# Calculate the Jaccard index with X
print("Jaccard Index")
r2 = defaultdict(float)
for key in products:
    r2[key] = jaccard(product_x, products[key])
print(r2)

Execution result


"""
>>> python cf.py
Co-occurrence
defaultdict(<type 'int'>, {u'A': 1, u'C': 1, u'B': 2, u'E': 0, u'D': 1})
Jaccard Index
defaultdict(<type 'float'>, {u'A': 0.2, u'C': 0.16666666666666666, u'B': 0.5, u'E': 0.0, u'D': 0.3333333333333333})
"""


■ Summary of results: Compared with co-occurrence, the Jaccard index distinguishes the similarity between product X and products A, C, and D in more detail: co-occurrence scores all three as 1, while the Jaccard index ranks them D (0.33) > A (0.2) > C (0.17). The calculation cost is correspondingly higher.
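Finally, to turn these scores into an actual recommendation list for product X, sort the products by their Jaccard index in descending order (a minimal sketch reusing r2 from the script above):

# Sort candidate products by Jaccard index with X, best first
recommendations = sorted(r2.items(), key=lambda pair: pair[1], reverse=True)
print(recommendations)
# => [('B', 0.5), ('D', 0.333...), ('A', 0.2), ('C', 0.166...), ('E', 0.0)]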

References

- Algorithm of recommender system
- I implemented collaborative filtering (recommendation) with redis and python
