[PYTHON] Experiment to collect tweets for a long time (Program preparation (3))

Until last time

- [x] The Twitter side is more or less in shape now.
- [ ] So next up is MongoDB.

What you have to do

The highest-priority job on the MongoDB side of this program is **storing the received data in MongoDB**. In terms of the spec: **save the received data without losing any of it, and keep it for 3 months**.

For now, let me list the situations that could come up.

What can go wrong when inserting into the DB?

This is all I could come up with (I'm still a novice, so apologies for the rough technical reasoning).

  1. Losing the connection: everything is on localhost, so this probably isn't much of a worry.
  2. The DB itself going down: that calls for a separate countermeasure, so I'll pass on it this time.
  3. Format errors: the data comes straight from Twitter and is inserted with zero processing, so I think it can be trusted.
  4. The scary one: the writes not keeping up with the incoming stream.
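Of the four, scenario 1 is the only one I'd handle in code for now. A minimal retry sketch, assuming nothing beyond pymongo's `insert()`; the function name and the retry policy are my own, not from any library:

```python
import time

def insert_with_retry(collection, doc, retries=3, wait=1.0):
    """Insert doc, retrying a few times so a brief connection
    hiccup (scenario 1) does not lose a tweet."""
    for attempt in range(retries):
        try:
            return collection.insert(doc)
        except Exception:          # e.g. pymongo.errors.AutoReconnect
            if attempt == retries - 1:
                raise              # out of retries: let the caller log/save it
            time.sleep(wait)
```

Scenario 4, falling behind, can't be fixed by retrying; that's what the speed measurement later in this post is for.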

I believe the flow rate is manageable, but unlike the development machine, which has a relatively fast CPU and disk, the execution environment is a Celeron 2.41GHz with memory upgraded to 8GB. It also doubles as a NAS, so conditions are fairly harsh.

Based on the previous experiment, I assume about 2GB will pour in per day. As an hourly average that is 33MB/h, and at peak it should be doubled, i.e. 66MB/h.

……Hmm? Less than I expected?? Did I get the calculation wrong somewhere again?? I'll check it later. For a round, easy-to-handle number, call it 70MB/h. Since the average length of the JSON documents saved in SQLite is 7,000 bytes, that works out to *10,000 tweets/h*. ……Really? I can't shake the feeling there's a hole somewhere…
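A quick sanity check of the round numbers above (pure arithmetic, nothing pymongo-specific):

```python
avg_doc_bytes = 7000              # average JSON length observed in SQLite
hourly_bytes = 70 * 1000 * 1000   # the rounded 70MB/h working figure
tweets_per_hour = hourly_bytes // avg_doc_bytes
print(tweets_per_hour)  # -> 10000
```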

I will program it for the time being

For the time being, I wrote a quick check program using PyMongo.

pymongotest1.py


#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pymongo import MongoClient

Client = MongoClient()            # localhost, default port: no settings needed
db = Client.testdb                # DB name: testdb (created automatically)
Collection = db.testCollection    # collection (table) name: testCollection

Collection.insert({"test": "Tesuto"})

……Is this really all it takes? Wondering that, I installed pymongo and ran it. If you want to see the result in a GUI, there is a tool called Robomongo. It's free for normal use, so I installed it and gave it a quick run.

Execution result

The registered data is there. An "_id" field seems to be assigned automatically. Next, let's try registering more than one…

for i in range(0, 10):
	Collection.insert({"test": "Tesuto"})

That worked. Easier than I expected. Next, let's insert the JSON of an actual tweet. For source data, I picked up tweets stored in SQLite during the first check.

 Collection.insert({"created_at":"Sat Sep 24 15:35:21 +0000 2016", ...(Omitted because it is long)... })
NameError: name 'false' is not defined

I got an error. Read literally, it says `false` is not defined……which, on reflection, makes sense: this is raw JSON, where booleans are written `false`/`true`, names Python doesn't know. A quick google immediately turned up someone who had hit the same problem.
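What `json.loads()` fixes here, shown on a toy fragment (the two keys are real tweet fields, the values made up): JSON's lowercase `false`/`null` are not Python names, so pasting raw JSON as a dict literal raises `NameError`, while parsing the same text as a string works.

```python
import json

# JSON literals false/true/null are not valid Python names,
# so json.loads() is needed to map them to False/True/None.
obj = json.loads(r'''{"truncated": false, "coordinates": null}''')
print(obj)  # -> {'truncated': False, 'coordinates': None}
```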

import json

# (omitted)

raw_string = r'''{"created_at":"Sat Sep 24 15:35:21 +0000 2016", ...(Omitted because it is long)... }'''
json_object = json.loads(raw_string)
Collection.insert(json_object)

So something like this does the trick. I see, now it registers. Next: multiple inserts.

raw_string = r'''{"created_at":"Sat Sep 24 15:35:21 +0000 2016", ...(Omitted because it is long)... }'''
json_object = json.loads(raw_string)

for i in range(0, 10):
	Collection.insert(json_object)

I got an error: "pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection:"……so apparently you can't insert exactly the same document over and over.

for i in range(0, 10):
	raw_string = r'''{"created_at":"Sat Sep 24 15:35:21 +0000 2016", ...(Omitted because it is long)... }'''
	json_object = json.loads(raw_string)

	Collection.insert(json_object)

Done like this, it worked. So it seems the `_id` gets written into the dict itself when `insert()` is called (not at the `json.loads()` stage), which is why re-parsing a fresh dict on every iteration avoids the duplicate key.
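To double-check that it's `insert()` mutating the dict, and to reuse one parsed document without re-parsing it every loop, inserting a shallow copy each time also works. The `FakeCollection` below is a stand-in I wrote to mimic pymongo's in-place `_id` behavior without needing a running MongoDB:

```python
import json

raw_string = r'''{"test": "Tesuto"}'''
json_object = json.loads(raw_string)

class FakeCollection:
    """Stand-in mimicking pymongo insert(): it writes an _id
    into the very dict it is handed, hence the duplicate error."""
    def __init__(self):
        self.docs = []
    def insert(self, doc):
        doc.setdefault("_id", len(self.docs))  # real pymongo uses ObjectId()
        self.docs.append(doc)

col = FakeCollection()
for i in range(0, 10):
    col.insert(dict(json_object))  # shallow copy: json_object stays _id-free
```

With the real collection, `Collection.insert(dict(json_object))` inside the loop works the same way.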

Speed measurement

Now let's see how fast it actually goes. The speed check code is as follows.

MongoSpeed.py


#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pymongo import MongoClient
import json

import time    # for measuring elapsed time

Client = MongoClient()
db = Client.testdb
Collection = db.testCollection

start = time.time()   # start measurement
for i in range(0, 10000):    # 10,000 loops
	raw_string = r'''{"created_at":"Sat Sep 24 15:35:21 +0000 2016", ...(Omitted because it is long)... }'''
	json_object = json.loads(raw_string)
	Collection.insert(json_object)

elapsed_time = time.time() - start   # elapsed time = end minus start
print('Execution time:', elapsed_time * 1000, ' [ms]')

The JSON document used is on the longer side, about 10KB, containing 4 image URLs plus hashtags. The largest document collected last time was 25KB and the smallest about 2KB, so this is somewhat above the 7KB average — a reasonable stand-in. **In other words, if this finishes in under 1 hour, there is almost no problem.**
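The pass criterion above, as arithmetic: the 10,000 inserts stand for one hour's worth of tweets, so they get a full hour of wall time to finish in.

```python
budget_ms = 60 * 60 * 1000     # one hour, in milliseconds
inserts = 10000                # assumed hourly tweet volume
per_insert_budget_ms = budget_ms / inserts
print(per_insert_budget_ms)  # -> 360.0 ms allowed per insert
```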

(Python) >Python .\MongoSpeed.py

Execution time: 11719.62308883667 [ms]

(Python) >

What? (to be continued.)
