When writing a large amount of data to a DB, you often want to accumulate the writes in a batch object and commit them at regular intervals.
This is a Python note on ways to process an array in fixed-size chunks for that purpose. If you know a better way, I would appreciate it if you could point it out.
At the end, there is sample code for batch-writing to GCP's Firestore.
Update: Added "Method 4 (using slices)" following a suggestion (thanks to @shiracamus).
Method 1: keep a separate counter and commit whenever it reaches the batch size; whatever is left over is committed after the loop.

data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
q = []
batch_size = 3
batch_count = 0
for d in data:
    print(d)
    q.append(d)
    batch_count += 1
    if batch_count == batch_size:
        # commit a full batch and reset
        print("commit {}".format(q))
        q = []
        batch_count = 0
# commit the remainder
print("commit {}".format(q))
> python sample1.py
a
b
c
commit ['a', 'b', 'c']
d
e
f
commit ['d', 'e', 'f']
g
h
commit ['g', 'h']
Method 2: use enumerate and the modulo operator instead of a separate counter.

data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
q = []
batch_size = 3
for i, d in enumerate(data):
    print(d)
    q.append(d)
    if (i + 1) % batch_size == 0:
        # commit a full batch and reset
        print("commit {}".format(q))
        q = []
# commit the remainder
print("commit {}".format(q))
> python sample2.py
a
b
c
commit ['a', 'b', 'c']
d
e
f
commit ['d', 'e', 'f']
g
h
commit ['g', 'h']
Method 3: also checking for the last element lets you drop the trailing commit statement, which makes the code easier to read.
The weaknesses are that the data size must be known in advance, so this cannot be used with an arbitrary iterable, and the extra check on every element is inefficient (an iterator-friendly sketch follows the output below).
data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
q = []
last = len(data)
batch_size = 3
for i, d in enumerate(data):
    print(d)
    q.append(d)
    # commit on a full batch or on the last element
    if (i + 1) % batch_size == 0 or (i + 1) == last:
        print("commit {}".format(q))
        q = []
> python sample3.py
a
b
c
commit ['a', 'b', 'c']
d
e
f
commit ['d', 'e', 'f']
g
h
commit ['g', 'h']
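About the weakness mentioned above: if the input is an arbitrary iterable whose length is unknown, the chunking itself can be done lazily. The following is a minimal sketch of my own (not from the original post) using itertools.islice; the helper name chunked is made up for illustration.

from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from any iterable.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

for q in chunked(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], 3):
    print("commit {}".format(q))

This prints the same three commits as the samples above, but also works when the data is a generator.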
Python's enumerate lets you specify the starting number with its second argument, so Method 3 can be made a little simpler.
data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
q = []
last = len(data)
batch_size = 3
for n, d in enumerate(data, 1):
    print(d)
    q.append(d)
    # n is 1-based, so no "+ 1" is needed
    if n % batch_size == 0 or n == last:
        print("commit {}".format(q))
        q = []
> python sample3.py
a
b
c
commit ['a', 'b', 'c']
d
e
f
commit ['d', 'e', 'f']
g
h
commit ['g', 'h']
Method 4 (using slices): a neat approach (thanks to @shiracamus). If the batch object cannot take q as a list all at once, split it up with a for-in loop or the like, as the Firestore sample at the end does.
data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
batch_size = 3
for i in range(0, len(data), batch_size):
    # take a slice of at most batch_size items
    q = data[i:i+batch_size]
    print("commit", q)
> python sample4.py
commit ['a', 'b', 'c']
commit ['d', 'e', 'f']
commit ['g', 'h']
Finally, using Method 4, here is a sample that adds data to GCP's Firestore in batches of 500 (500 operations being the limit for a single Firestore batched write).
from google.cloud import firestore

db = firestore.Client()
collection = db.collection("<COLLECTION NAME>")
batch_size = 500  # a Firestore batched write allows up to 500 operations
batch = db.batch()
for i in range(0, len(data), batch_size):  # data: the rows to insert (defined elsewhere)
    for row in data[i:i + batch_size]:
        # queue a set() on an auto-ID document
        batch.set(collection.document(), row)
    print('committing...')
    batch.commit()
    batch = db.batch()
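Note that data is assumed to be defined elsewhere. For reference, since batch.set() takes a dict per document, a list of dicts works; the field names below are made up for illustration.

data = [
    {"name": "a", "value": 1},
    {"name": "b", "value": 2},
    # ...and so on for the rows to insert
]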