As my winter vacation project, I built the simple search engine I had been interested in for a long time. (It turned out pretty rough...)
The search query is read from standard input. For reference, my environment is: Mac OS X 10.9.5, Core i7 (64-bit), Python 2.7.9.
The book I used as a reference:
[Introduction to Large-Scale Service Technology [for Web Developers]](http://www.amazon.co.jp/gp/product/4774143073/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=247&creative=1211&creativeASIN=4774143073&linkCode=as2&tag=shimashimao06-22)
It covers a wide range of topics, not only search algorithms but also memory and disk, and it is an interesting read in its own right because it also includes failure stories based on real experience.
There are three main functions: data compression, inverted index creation, and search.
This time I compress the data with VBcode. There seem to be various other compression methods as well, such as PForDelta and Golomb coding. The reason for compressing in the first place, now that large-capacity storage can be bought cheaply, is to raise overall throughput.
According to the book above:

From the standpoint of throughput, there are many situations where it is faster to compress the data you handle. A computer has two kinds of load, CPU and I/O. While a process is waiting on I/O, the CPU cannot do work for that process. Compressing files puts extra load on the CPU, but it reduces I/O wait. Since the CPU is often idle while I/O is the bottleneck, overall throughput can be raised by reducing the I/O load through compression and letting the CPU pick up the work.

In addition, since compressed data must be decompressed before it can be used, I think compression should be applied only when the I/O savings outweigh the cost of compressing and decompressing, or with a compression method for which that is the case.
As for the method itself, VBcode compresses by removing redundant bytes and representing each value with the minimum number of bytes. For example, the integer 5 is 00000000 00000000 00000000 00000101 as a fixed-length 32-bit binary, but it can be expressed as the single byte 10000101 in VBcode. Since each byte is treated as one 8-bit block, the value 128 (10000000 in binary) plays a key role in the code: adding it sets the high bit, which marks the final byte of a number.
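As a small worked example using the same convention as the vbcode.py code below: 300 = 2 × 128 + 44, so it becomes the two bytes 00000010 10101100. The first byte carries 2 with the high bit clear (more bytes follow), and the second carries 44 with the high bit set to mark the end. Decoding simply accumulates n = 128 * n + byte and emits n whenever it reaches a byte with the high bit set.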
In addition, you can improve compression by sorting the integer values and encoding the gaps between consecutive values instead of the values themselves. This works because of the resulting distribution: small numbers appear frequently and large numbers become rare, and VBcode spends fewer bytes on small numbers.
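As a rough illustration of this gap (delta) encoding, here is a minimal sketch of my own with made-up doc IDs; it is separate from the index code shown later, which does the same thing with numpy.

# Minimal sketch of gap (delta) encoding before applying VBcode.
# The doc IDs below are made-up example values.
doc_ids = [824, 829, 215406]   # sorted posting list
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print gaps                     # [824, 5, 214577] -- mostly small numbers now

# Restoring the original values is a running (cumulative) sum,
# which is what np.cumsum does in the search code later.
restored = []
total = 0
for g in gaps:
    total += g
    restored.append(total)
print restored                 # [824, 829, 215406]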
This time I created the inverted index using bigrams. I think most people are already familiar with inverted indexes, so I will skip the explanation. In the implementation below, a posting list that includes term locations is stored in a dictionary keyed by term, in the form { term: {(doc_id1, doc_id2, ...): [[location_list1], [location_list2], ...]} }. (I suspect there is a cleaner way to structure this, so if you know one, please let me know.) The finished inverted index (dictionary) is saved to a file with cPickle.
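For example (with made-up doc IDs and character offsets), a single entry in that dictionary has the following shape before compression:

# Hypothetical entry for one bigram term; the doc IDs and character
# offsets are invented purely to show the shape of the structure.
index = {
    u'検索': {
        (3, 7): [        # tuple of doc_ids that contain the bigram
            [0, 12],     # character offsets of the bigram in doc 3
            [5],         # character offset of the bigram in doc 7
        ],
    },
}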
There is not much to explain about the data, so I will just list the details.
It can be downloaded from the publisher's (Gijutsu-Hyoron-sha) support page for the book. I use a file called 10000entries.txt, which contains the title data, and a folder called texts, which contains one document file per document_id. The title file is about 1.4M and the texts folder about 321K in total.
Sorry for the pretty dirty code, but I'll paste it below.
First, the code that compresses/decompresses with VBcode.
vbcode.py
# coding: utf-8
from struct import pack, unpack


def vbencode_number(number):
    # encode a single non-negative integer as a variable-byte string
    number = int(number)
    _bytes = []
    while 1:
        _bytes.insert(0, number % 128)
        if number < 128:
            break
        number /= 128
    _bytes[-1] += 128  # set the high bit of the last byte to mark the end of the number
    return pack('%dB' % len(_bytes), *_bytes)


def vbencode(numbers):
    # concatenate the variable-byte encodings of a sequence of integers
    bytestream = ''
    for number in numbers:
        _bytes = vbencode_number(number)
        bytestream += _bytes
    return bytestream


def vbdecode(bytestream):
    # decode a variable-byte string back into a list of integers
    numbers = []
    n = 0
    bytestream = unpack('%dB' % len(bytestream), bytestream)
    for i in xrange(len(bytestream)):
        if bytestream[i] < 128:
            n = 128 * n + bytestream[i]
        else:
            n = 128 * n + bytestream[i] - 128
            numbers.append(n)
            n = 0
    return numbers


if __name__ == '__main__':
    numbers = (1, 2, 128, 145)
    res = vbencode(numbers)
    print vbdecode(res)
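If everything is working, running vbcode.py directly should print [1, 2, 128, 145], showing that decoding the encoded byte stream recovers the original numbers.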
Next is the creation and storage of the inverted index.
index.py
# coding: utf-8
from __future__ import unicode_literals
from collections import defaultdict
from unicodedata import normalize
import MeCab
from vbcode import vbencode
import numpy as np
import os
import cPickle


def get_title_data(input_filename):
    # read "doc_id \t cat_id \t url \t title" lines into a {doc_id: title} dict
    titles = defaultdict(str)
    with open(input_filename, 'r', 10000) as f:
        for line in f:
            try:
                doc_id, cat_id, url, title = line.strip().split(b'\t')
                titles[int(doc_id)] = title
            except Exception, e:
                # some lines have no title column
                doc_id, cat_id, url = line.strip().split(b'\t')
                titles[int(doc_id)] = ''
    return titles


def create_index(doc_base_path, titles):
    main_index = defaultdict(dict)
    for doc_id in titles.keys():
        try:
            main_index = get_index_dic(doc_base_path, doc_id, main_index)
        except Exception, e:
            import traceback
            print traceback.format_exc()
            print e
    compressed_main_index = compress_postinglist(main_index)
    save_index(compressed_main_index)


def save_index(index):
    with open('index_bigram.pkl', 'w') as f:
        cPickle.dump(index, f)


def compress_postinglist(index):
    for key_word, posting_dic in index.items():
        # pair each doc_id with its location list: [(doc_id, [locations]), ...]
        tup_did_location = zip(posting_dic.keys()[0], posting_dic.values()[0])
        tup_did_location.sort(key=lambda x: x[0])
        doc_id_list = []
        lctn_pl = []
        for doc_id, location_list in tup_did_location:
            doc_id_list.append(doc_id)
            lctn_pl.append(vbencode(get_differences(location_list)))
        d_pl = get_differences(doc_id_list)
        # store gap-encoded, VB-compressed doc_ids as the key and the locations as the value
        index[key_word] = {vbencode(d_pl): lctn_pl}
    return index


def get_differences(sorted_list):
    # turn a sorted list of integers into the gaps between consecutive values
    if len(sorted_list) > 1:
        diff_list = (np.array(sorted_list) - np.array([0] + sorted_list[:-1])).tolist()
    else:
        diff_list = sorted_list
    return diff_list


def get_index_dic(doc_base_path, doc_id, main_index):
    with open(doc_base_path + '/' + str(doc_id), 'r') as f:
        doc = f.read().strip()
        doc = normalize('NFKC', doc.decode('utf-8')).lower()
        word_list, word_location = bi_gram_tokenize(doc)
        for word, location in zip(word_list, word_location):
            if main_index[word]:
                _keys = main_index[word].keys()[0]
                if doc_id in _keys:
                    # same document: append the location to its last location list
                    main_index[word][_keys][-1] += [location]
                else:
                    # new document: add a new location list and extend the doc_id tuple
                    main_index[word][_keys].append([location])
                    main_index[word] = {_keys + (doc_id,): main_index[word].values()[0]}
            else:
                main_index[word][(doc_id,)] = [[location]]
    return main_index


def mecab_tokenize(doc):
    # word-based tokenization with MeCab (not used by the bigram index, kept for reference)
    tagger = MeCab.Tagger(b'-Ochasen')
    node = tagger.parseToNode(doc.encode('utf-8'))
    word_list = []
    while node:
        ns = node.surface
        word_list.append(ns)
        node = node.next
    return word_list


def bi_gram_tokenize(doc):
    word_list = []
    word_location = []
    for i in xrange(len(doc)):
        term = doc[i:i+2].strip()
        if len(term) != 0:
            word_list.append(term)
            word_location.append(i)
    return word_list, word_location


def main(input_filename, doc_base_path):
    titles = get_title_data(input_filename)
    create_index(doc_base_path, titles)


if __name__ == '__main__':
    ifpath = os.getcwd() + '/../works/input'
    ifname = ifpath + '/10000entries.txt'
    doc_base_path = ifpath + '/texts'
    main(ifname, doc_base_path)
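For reference, running index.py from the directory that contains these scripts (with the data placed under ../works/input as in the __main__ block) reads every document, builds the bigram index, and writes it to index_bigram.pkl in the current directory.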
Finally, the code for the search part.
search.py
# coding: utf-8
from __future__ import unicode_literals
import os
from collections import defaultdict
import sys
from vbcode import vbdecode
import cPickle as pickle
import numpy as np
from unicodedata import normalize


def get_titles(input_filename):
    with open(input_filename, 'r', 10000) as f:
        titles = defaultdict(dict)
        for line in f:
            try:
                doc_id, cat_id, url, title = line.strip().split(b'\t')
                titles[int(doc_id)] = {'title': title, 'url': url}
            except Exception, e:
                doc_id, cat_id, url = line.strip().split(b'\t')
                titles[int(doc_id)] = {'title': '', 'url': url}
    return titles


def load_index(index_filename):
    with open(index_filename, 'r') as f:
        index = pickle.load(f)
    return index


def search(titles, index, query, text_path):
    query = normalize('NFKC', query.decode('utf-8'))
    for q in query.strip().split(' '):
        print "search:", query
        if len(q.strip()) == 0:
            continue
        bi_gram_qs = get_bigram_queries(q.strip())
        if not index[bi_gram_qs[0]]:
            continue
        # take the intersection of the doc_ids that contain every bigram
        doc_ids = set(get_decoded_data(index[bi_gram_qs[0]].keys()[0]))
        for bi_gram_q in bi_gram_qs[1:]:
            doc_ids = doc_ids & set(get_decoded_data(index[bi_gram_q].keys()[0]))
        doc_ids = list(doc_ids)
        if len(doc_ids) == 0:
            continue
        location_lists = []
        for bi_gram_q in bi_gram_qs:
            # find where each surviving doc_id sits in this bigram's posting list
            all_doc_ids = get_decoded_data(index[bi_gram_q].keys()[0])
            idx_list = [all_doc_ids.index(doc_id) for doc_id in doc_ids]
            # decode the location list of each of those documents for this bigram
            location_list = [get_decoded_data(index[bi_gram_q].values()[0][idx]) for idx in idx_list]
            location_lists.append(location_list)
        doc_id_idx_idx, locations4snippet = check_sequence(location_lists)
        squewed_doc_ids = (doc_ids[doc_id_idx] for doc_id_idx in doc_id_idx_idx)
        for doc_id, location in zip(squewed_doc_ids, locations4snippet):
            print "doc_id:", doc_id
            if doc_id in titles:
                print titles[doc_id]['title']
            # print a snippet of about 100 characters around the match
            if location < 50:
                start = 0
            else:
                start = location - 50
            end = location + 50
            with open(text_path + '/' + str(doc_id), 'r') as f:
                print f.read().strip().decode('utf-8')[start:end]


def get_bigram_queries(query):
    # split the query into bigrams; an odd trailing character reuses the previous one
    bi_gram_qs = []
    for i in xrange(0, len(query), 2):
        bi_gram_q = query[i:i+2].strip()
        if len(bi_gram_q) == 1 and i > 0:
            bi_gram_q = query[i-1:i+1]
        bi_gram_qs.append(bi_gram_q)
    return bi_gram_qs


def get_decoded_data(encoded_data):
    # decode the VB-encoded gaps and restore the original values with a cumulative sum
    decoded_data = vbdecode(encoded_data)
    return np.cumsum(decoded_data).tolist()


def check_sequence(location_lists):
    # keep only the documents in which the query bigrams appear at consecutive positions
    length = len(location_lists)
    doc_id_idx_idx = []
    locations4snippet = []
    for doc_i, location_list in enumerate(location_lists[0]):
        for location in location_list:
            if length == 1:
                # single-bigram query: any occurrence is a match
                return [doc_i], [location]
            elif check_next_node(1, doc_i, location, location_lists, length):
                doc_id_idx_idx.append(doc_i)
                locations4snippet.append(location)
            else:
                continue
    return doc_id_idx_idx, locations4snippet


def check_next_node(i, doc_i, location, location_lists, last_node):
    for next_location in location_lists[i][doc_i]:
        if location < next_location <= location + 2:
            if i < last_node - 1:
                return check_next_node(i + 1, doc_i, location + 2, location_lists, last_node)
            else:
                return True
        else:
            return False


if __name__ == '__main__':
    basepath = os.getcwd()
    index_filename = basepath + '/index_bigram.pkl'
    file_path = basepath + '/../works/input'
    ifname = file_path + '/10000entries.txt'
    text_path = file_path + '/texts'
    titles = get_titles(ifname)
    index = load_index(index_filename)
    print "get search query"
    query = raw_input()
    search(titles, index, query, text_path)
    print "done"
Creating the inverted index takes a tremendous amount of time. This time it was fine because the data fit in memory, but if it did not, I imagine buffering the index in memory and merging it into a database in batches as it is built. As for the database, since the index is a Python dictionary, I think MongoDB would be an easy fit.
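As a very rough sketch of that idea (not part of the code above; the database and collection names, document layout, flush threshold, and helper functions are all assumptions of mine), buffering postings in memory and flushing them into MongoDB with pymongo might look something like this:

# Hypothetical sketch only: buffer postings in memory and flush them to MongoDB
# in batches. The names, document layout, and threshold are invented for illustration.
from collections import defaultdict
from pymongo import MongoClient

BUFFER_LIMIT = 100000  # flush once this many postings have been buffered
collection = MongoClient()['search_engine']['inverted_index']

posting_buffer = defaultdict(list)
buffered_count = 0

def add_posting(term, doc_id, locations):
    global buffered_count
    posting_buffer[term].append([doc_id, locations])
    buffered_count += 1
    if buffered_count >= BUFFER_LIMIT:
        flush()

def flush():
    global buffered_count
    for term, postings in posting_buffer.items():
        # append the buffered postings to this term's document, creating it if needed
        collection.update_one({'term': term},
                              {'$push': {'postings': {'$each': postings}}},
                              upsert=True)
    posting_buffer.clear()
    buffered_count = 0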
Data compression and search technology are deep, specialized fields, but simply building something naive without thinking too hard was easier than I had originally imagined. Next I would like to take on ranking algorithms, which I did not get to this time.
If you notice any mistakes, I would appreciate it if you could point them out.