Manipulating mongoDB with Python-Part 6: aggregate-

Scope of this article

In this article, I will describe how to use aggregate (aggregate function in SQL) after connecting to mongodb with Python. For information on how to start mongodb and install pymongo, please see the following articles. https://qiita.com/bc_yuuuuuki/items/2b92598434f6cc320112

Preparation data

For the preparation data, we will use the article information of Qiita that plunged into mongoDB in the following article [Python] Qiita article information is pushed into mongoDB

How to use aggregate

How to use aggregate of mongoDB does not come to me if you are accustomed to SQL. The table below is a comparison table of SQL and aggregate.

SQL aggregate
WHERE $match
GROUP BY $group
HAVING $match
SELECT $project
ORDER BY $sort
LIMIT $limit
SUM() $sum
COUNT() $sum

mongoDB operation class

I am making various classes that use mongoDB using pymongo.

mongo_sample.py


from pymongo import MongoClient

class MongoSample(object):

    def __init__(self, dbName, collectionName):
        self.client = MongoClient()
        self.db = self.client[dbName] #Set DB name
        self.collection = self.db.get_collection(collectionName)

    def aggregate(self, filter, **keyword):
        return self.collection.aggregate(filter, keyword)

I'm just creating a function for calling aggregate.

Get data from mongoDB

First is the code.

aggregate_sample.py


from mongo_sample import MongoSample
import pprint
# arg1:DB Name
# arg2:Collection Name
mongo = MongoSample("db", "qiita")

#Maximum value
pipeline = [
    {"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Maximum value-----------------------------")
pprint.pprint(list(results))

#minimum value
pipeline = [
    {"$group":{ "_id":"title","page_min_view":{"$min":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------minimum value-----------------------------")
pprint.pprint(list(results))

#Average value
pipeline = [
    {"$group":{ "_id":"average","page_average_view":{"$avg":"$page_views_count"}}}
]

#total
pipeline = [
    {"$group":{"_id":"page_total_count","total":{"$sum":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Average value-----------------------------")
pprint.pprint(list(results))

#Count the number of occurrences for each tag
pipeline = [
    { "$unwind": "$tag_list"}, 
    { "$group": { "_id": "$tag_list", "count": { "$sum":1}}},
    { "$sort": {"count": -1, "_id":1}}
]

results = mongo.aggregate(pipeline)
print("------------------------Aggregate value-----------------------------")
pprint.pprint(list(results))

What you are doing is not a big deal. The maximum value, minimum value, average value, and count for each tag are acquired.

pprint needs to be installed.

pip install pprint

We will compare each with the operation method of mongoDB.

Maximum / minimum / average / total

First, mongoDB commands The example is only the maximum value. If you change max to min, avg, or sum, it will be the minimum / average / maximum.

db.qiita.aggregate([{$group:{_id:"page_max_views",total:{$max:"$page_views_count"}}}])
pipeline = [
    {"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]

Execution result

[{'_id': 'title', 'page_max_view': 2461}]

This way, the result is that "_id" is fixed to "title" and the maximum value in all records is obtained.

However, I want to display the title of the article because I want to know which article is read the most.

mongoDB command

> db.qiita.aggregate([{$project:{title:1,page_views_count:1}},{$group:{_id:"$title", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
 {"_id": "Operating mongodb with Python-Part 2: find-", "total": 2461}
 {"_id": "Operating mongodb with Python-Part 3: update-", "total": 1137}
 {"_id": "Operating mongodb with Python-Part 4: insert-", "total": 1102}
 {"_id": "Various search conditions using pymongo (AND / OR / partial match / range search)", "total": 1019}
 (Omitted)

With this command, I could see the title of the article and the number of times the page was viewed. Obviously, it doesn't make much sense because it is grouped by article name. .. If it is the maximum value that does not need grouping, it seems good to sort with find and set limit.

Try to get the maximum value for each tag1.

> db.qiita.aggregate([{$group:{_id:"$tag1", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{ "_id" : "Python", "total" : 2461 }
{ "_id" : "Vagrant", "total" : 946 }
{ "_id" : "Java", "total" : 617 }
{ "_id" : "Hyperledger", "total" : 598 }
{ "_id" : "solidity", "total" : 363 }
{ "_id" : "Ethereum", "total" : 347 }
 {"_id": "blockchain", "total": 232}
{ "_id" : "Blockchain", "total" : 201 }
{ "_id" : "coverage", "total" : 199 }

Yes. I got it with a good feeling.

For the time being, I will change the python code as well.

# Maximum value
pipeline = [
    {"$group":{ "_id":"$tag1","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
 print ("------------------------ Maximum value --------------------- -------- ")
pprint.pprint(list(results))

Aggregation by tag

I would like to count how many articles are written for each tag. The aggregation uses an item called tag_list, which looks like this data:

> db.qiita.find({},{_id:0,tag_list:1})
{ "tag_list" : [ "Python", "MongoDB", "Python3", "pymongo" ] }
{ "tag_list" : [ "Python", "Python3" ] }
 {"tag_list": ["Python", "Python3", "Blockchain", "Blockchain", "Hyperledger-Iroha"]}
 {"tag_list": ["Blockchain", "Blockchain", "Hyperledger-Iroha"]}
{ "tag_list" : [ "Blockchain", "Ethereum", "Hyperledger", "Hyperledger-sawtooth" ] }
 {"tag_list": ["Blockchain", "Hyperledger", "Hyperledger-sawtooth"]}
 {"tag_list": ["Java", "blockchain", "Hyperledger", "Hyperledger-Iroha"]}
 {"tag_list": ["Blockchain", "Hyperledger", "Hyperledger-Iroha"]}
 {"tag_list": ["Java", "Ethereum", "Blockchain", "Hyperledger", "Hyperledger-Iroha"]}
 {"tag_list": ["Java", "blockchain", "Hyperledger", "Hyperledger-Iroha"]}
{ "tag_list" : [ "Hyperledger", "Hyperledger-Iroha", "Hyperledger-burrow", "Hyperledger-sawtooth", "Hyperledger-besu" ] }
{ "tag_list" : [ "Vagrant", "VirtualBox", "Hyper-V" ] }
 {"tag_list": ["Java", "Ethereum", "solidity", "blockchain", "web3j"]}
 {"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
 {"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
 {"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
 {"tag_list": ["Java", "Ethereum", "solidity", "blockchain", "web3j"]}
 {"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
 {"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
 {"tag_list": ["Ethereum", "blockchain"]}

It's quite annoying to aggregate the data stored in this format in SQL. ..

In mongoDB, by using something called unwind, it is possible to divide and aggregate LIST format data.

> db.qiita.aggregate( { $project:{tag_list:1}}, { $unwind: "$tag_list"}, { $group: { _id: "$tag_list", count: { $sum:1}}},{ $sort: {"count": -1, "_id":1}} )
 {"_id": "blockchain", "count": 16}
{ "_id" : "Ethereum", "count" : 11 }
{ "_id" : "Java", "count" : 10 }
{ "_id" : "Python", "count" : 9 }
{ "_id" : "Python3", "count" : 9 }
{ "_id" : "Hyperledger", "count" : 7 }
{ "_id" : "Hyperledger-Iroha", "count" : 7 }
{ "_id" : "MongoDB", "count" : 7 }
{ "_id" : "web3j", "count" : 7 }
{ "_id" : "solidity", "count" : 4 }
{ "_id" : "Blockchain", "count" : 3 }
{ "_id" : "Hyperledger-sawtooth", "count" : 3 }
{ "_id" : "Hyper-V", "count" : 1 }
{ "_id" : "Hyperledger-besu", "count" : 1 }
{ "_id" : "Hyperledger-burrow", "count" : 1 }
{ "_id" : "Vagrant", "count" : 1 }
{ "_id" : "VirtualBox", "count" : 1 }
{ "_id" : "coverage", "count" : 1 }
{ "_id" : "pymongo", "count" : 1 }
{ "_id" : "truffle", "count" : 1 }

The python code does not include "{" $ project ": {" tag_list ": 1}}". The result did not change with or without it. I'm not sure how to use this project.

Impressions

There are many parts that are difficult to understand if you are accustomed to SQL, but it seems that flexible aggregation can be done by using unwind etc.

Related article

-Operating mongodb with Python-Part 1- -Operating mongodb with Python-Part 2: find- -Operating mongodb with Python-Part 3: update- -Operating mongodb with Python-Part 4: insert- -Operating mongodb with Python-Part 5: delete- -Various search conditions using pymongo (AND / OR / partial match / range search)

Recommended Posts

Manipulating mongoDB with Python-Part 6: aggregate-
mongodb access with pymongo
[Note] Operate MongoDB with Python
Basics of touching MongoDB with MongoEngine
Manipulating strings with pandas group by
Read system environment variables with python-Part 1
Read system environment variables with python-Part 2