As my winter vacation project, I built the simple search engine I had been interested in for a long time. (It turned out pretty rough...)
The search query is read from standard input. For reference, my environment is: Mac OS X 10.9.5, Core i7 (64-bit), Python 2.7.9.
The book I used as a reference:
[Introduction to Large-Scale Service Technology [for Web Developers]](http://www.amazon.co.jp/gp/product/4774143073/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=247&creative=1211&creativeASIN=4774143073&linkCode=as2&tag=shimashimao06-22)
It covers a wide range of topics, not only search algorithms but also memory and disk, and it is an interesting read in its own right because it also includes failure stories based on real experience.
There are three main functions: data compression, inverted index creation, and search.
This time I compress the data with VBcode. There seem to be various other compression methods as well, such as PForDelta and Golomb coding. The reason for compressing in the first place, now that large-capacity storage can be bought cheaply, is to raise overall throughput.
According to the book above:

From the standpoint of throughput, there are many situations where it is faster to compress the data you handle. A computer has two kinds of load, CPU and I/O. While a process is waiting on I/O, the CPU cannot do work for that process. Compressing files puts extra load on the CPU, but it reduces I/O wait. Since the CPU is often idle while I/O is the bottleneck, overall throughput can be raised by reducing the I/O load through compression and letting the CPU pick up the work.

In addition, since compressed data must be decompressed before it can be used, I think compression should be applied only when the I/O savings outweigh the cost of compressing and decompressing, or with a compression method for which that is the case.
As for the method itself, VBcode compresses by removing redundant bytes and representing each value with the minimum number of bytes. For example, the integer 5 is 00000000 00000000 00000000 00000101 as a fixed-length 32-bit binary, but it can be expressed as the single byte 10000101 in VBcode. Since each byte is treated as one 8-bit block, the value 128 (10000000 in binary) plays a key role in the code: adding it sets the high bit, which marks the final byte of a number.
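As a small worked example using the same convention as the vbcode.py code below: 300 = 2 × 128 + 44, so it becomes the two bytes 00000010 10101100. The first byte carries 2 with the high bit clear (more bytes follow), and the second carries 44 with the high bit set to mark the end. Decoding simply accumulates n = 128 * n + byte and emits n whenever it reaches a byte with the high bit set.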
In addition, you can improve compression by sorting the integer values and encoding the gaps between consecutive values instead of the values themselves. This works because of the resulting distribution: small numbers appear frequently and large numbers become rare, and VBcode spends fewer bytes on small numbers.
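As a rough illustration of this gap (delta) encoding, here is a minimal sketch of my own with made-up doc IDs; it is separate from the index code shown later, which does the same thing with numpy.

# Minimal sketch of gap (delta) encoding before applying VBcode.
# The doc IDs below are made-up example values.
doc_ids = [824, 829, 215406]   # sorted posting list
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print gaps                     # [824, 5, 214577] -- mostly small numbers now

# Restoring the original values is a running (cumulative) sum,
# which is what np.cumsum does in the search code later.
restored = []
total = 0
for g in gaps:
    total += g
    restored.append(total)
print restored                 # [824, 829, 215406]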
This time I created the inverted index using bigrams. I think most people are already familiar with inverted indexes, so I will skip the explanation. In the implementation below, a posting list that includes term locations is stored in a dictionary keyed by term, in the form { term: {(doc_id1, doc_id2, ...): [[location_list1], [location_list2], ...]} }. (I suspect there is a cleaner way to structure this, so if you know one, please let me know.) The finished inverted index (dictionary) is saved to a file with cPickle.
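For example (with made-up doc IDs and character offsets), a single entry in that dictionary has the following shape before compression:

# Hypothetical entry for one bigram term; the doc IDs and character
# offsets are invented purely to show the shape of the structure.
index = {
    u'検索': {
        (3, 7): [        # tuple of doc_ids that contain the bigram
            [0, 12],     # character offsets of the bigram in doc 3
            [5],         # character offset of the bigram in doc 7
        ],
    },
}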
There is not much to explain about the data, so I will just list the details.
It can be downloaded from the publisher's (Gijutsu-Hyoron-sha) support page for the book. I use a file called 10000entries.txt, which contains the title data, and a folder called texts, which contains one document file per document_id. The title file is about 1.4M and the texts folder about 321K in total.
Sorry for the pretty dirty code, but I'll paste it below.
First, the code that compresses/decompresses with VBcode.
vbcode.py
# coding: utf-8
from struct import pack, unpack


def vbencode_number(number):
    # encode a single non-negative integer as a variable-byte string
    number = int(number)
    _bytes = []
    while 1:
        _bytes.insert(0, number % 128)
        if number < 128:
            break
        number /= 128
    _bytes[-1] += 128  # set the high bit of the last byte to mark the end of the number
    return pack('%dB' % len(_bytes), *_bytes)


def vbencode(numbers):
    # concatenate the variable-byte encodings of a sequence of integers
    bytestream = ''
    for number in numbers:
        _bytes = vbencode_number(number)
        bytestream += _bytes
    return bytestream


def vbdecode(bytestream):
    # decode a variable-byte string back into a list of integers
    numbers = []
    n = 0
    bytestream = unpack('%dB' % len(bytestream), bytestream)
    for i in xrange(len(bytestream)):
        if bytestream[i] < 128:
            n = 128 * n + bytestream[i]
        else:
            n = 128 * n + bytestream[i] - 128
            numbers.append(n)
            n = 0
    return numbers


if __name__ == '__main__':
    numbers = (1, 2, 128, 145)
    res = vbencode(numbers)
    print vbdecode(res)
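If everything is working, running vbcode.py directly should print [1, 2, 128, 145], showing that decoding the encoded byte stream recovers the original numbers.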
Next is the creation and storage of the inverted index.
index.py
# coding: utf-8
from __future__ import unicode_literals
from collections import defaultdict
from unicodedata import normalize
import MeCab
from vbcode import vbencode
import numpy as np
import os
import cPickle


def get_title_data(input_filename):
    # read "doc_id \t cat_id \t url \t title" lines into a {doc_id: title} dict
    titles = defaultdict(str)
    with open(input_filename, 'r', 10000) as f:
        for line in f:
            try:
                doc_id, cat_id, url, title = line.strip().split(b'\t')
                titles[int(doc_id)] = title
            except Exception, e:
                # some lines have no title column
                doc_id, cat_id, url = line.strip().split(b'\t')
                titles[int(doc_id)] = ''
    return titles


def create_index(doc_base_path, titles):
    main_index = defaultdict(dict)
    for doc_id in titles.keys():
        try:
            main_index = get_index_dic(doc_base_path, doc_id, main_index)
        except Exception, e:
            import traceback
            print traceback.format_exc()
            print e
    compressed_main_index = compress_postinglist(main_index)
    save_index(compressed_main_index)


def save_index(index):
    with open('index_bigram.pkl', 'w') as f:
        cPickle.dump(index, f)


def compress_postinglist(index):
    for key_word, posting_dic in index.items():
        # pair each doc_id with its location list: [(doc_id, [locations]), ...]
        tup_did_location = zip(posting_dic.keys()[0], posting_dic.values()[0])
        tup_did_location.sort(key=lambda x: x[0])
        doc_id_list = []
        lctn_pl = []
        for doc_id, location_list in tup_did_location:
            doc_id_list.append(doc_id)
            lctn_pl.append(vbencode(get_differences(location_list)))
        d_pl = get_differences(doc_id_list)
        # store gap-encoded, VB-compressed doc_ids as the key and the locations as the value
        index[key_word] = {vbencode(d_pl): lctn_pl}
    return index


def get_differences(sorted_list):
    # turn a sorted list of integers into the gaps between consecutive values
    if len(sorted_list) > 1:
        diff_list = (np.array(sorted_list) - np.array([0] + sorted_list[:-1])).tolist()
    else:
        diff_list = sorted_list
    return diff_list


def get_index_dic(doc_base_path, doc_id, main_index):
    with open(doc_base_path + '/' + str(doc_id), 'r') as f:
        doc = f.read().strip()
        doc = normalize('NFKC', doc.decode('utf-8')).lower()
        word_list, word_location = bi_gram_tokenize(doc)
        for word, location in zip(word_list, word_location):
            if main_index[word]:
                _keys = main_index[word].keys()[0]
                if doc_id in _keys:
                    # same document: append the location to its last location list
                    main_index[word][_keys][-1] += [location]
                else:
                    # new document: add a new location list and extend the doc_id tuple
                    main_index[word][_keys].append([location])
                    main_index[word] = {_keys + (doc_id,): main_index[word].values()[0]}
            else:
                main_index[word][(doc_id,)] = [[location]]
    return main_index


def mecab_tokenize(doc):
    # word-based tokenization with MeCab (not used by the bigram index, kept for reference)
    tagger = MeCab.Tagger(b'-Ochasen')
    node = tagger.parseToNode(doc.encode('utf-8'))
    word_list = []
    while node:
        ns = node.surface
        word_list.append(ns)
        node = node.next
    return word_list


def bi_gram_tokenize(doc):
    word_list = []
    word_location = []
    for i in xrange(len(doc)):
        term = doc[i:i+2].strip()
        if len(term) != 0:
            word_list.append(term)
            word_location.append(i)
    return word_list, word_location


def main(input_filename, doc_base_path):
    titles = get_title_data(input_filename)
    create_index(doc_base_path, titles)


if __name__ == '__main__':
    ifpath = os.getcwd() + '/../works/input'
    ifname = ifpath + '/10000entries.txt'
    doc_base_path = ifpath + '/texts'
    main(ifname, doc_base_path)
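For reference, running index.py from the directory that contains these scripts (with the data placed under ../works/input as in the __main__ block) reads every document, builds the bigram index, and writes it to index_bigram.pkl in the current directory.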
Finally, the code for the search part.
search.py
# coding: utf-8
from __future__ import unicode_literals
import os
from collections import defaultdict
import sys
from vbcode import vbdecode
import cPickle as pickle
import numpy as np
from unicodedata import normalize


def get_titles(input_filename):
    with open(input_filename, 'r', 10000) as f:
        titles = defaultdict(dict)
        for line in f:
            try:
                doc_id, cat_id, url, title = line.strip().split(b'\t')
                titles[int(doc_id)] = {'title': title, 'url': url}
            except Exception, e:
                doc_id, cat_id, url = line.strip().split(b'\t')
                titles[int(doc_id)] = {'title': '', 'url': url}
    return titles


def load_index(index_filename):
    with open(index_filename, 'r') as f:
        index = pickle.load(f)
    return index


def search(titles, index, query, text_path):
    query = normalize('NFKC', query.decode('utf-8'))
    for q in query.strip().split(' '):
        print "search:", query
        if len(q.strip()) == 0:
            continue
        bi_gram_qs = get_bigram_queries(q.strip())
        if not index[bi_gram_qs[0]]:
            continue
        # take the intersection of the doc_ids that contain every bigram
        doc_ids = set(get_decoded_data(index[bi_gram_qs[0]].keys()[0]))
        for bi_gram_q in bi_gram_qs[1:]:
            doc_ids = doc_ids & set(get_decoded_data(index[bi_gram_q].keys()[0]))
        doc_ids = list(doc_ids)
        if len(doc_ids) == 0:
            continue
        location_lists = []
        for bi_gram_q in bi_gram_qs:
            # find where each surviving doc_id sits in this bigram's posting list
            all_doc_ids = get_decoded_data(index[bi_gram_q].keys()[0])
            idx_list = [all_doc_ids.index(doc_id) for doc_id in doc_ids]
            # decode the location list of each of those documents for this bigram
            location_list = [get_decoded_data(index[bi_gram_q].values()[0][idx]) for idx in idx_list]
            location_lists.append(location_list)
        doc_id_idx_idx, locations4snippet = check_sequence(location_lists)
        squewed_doc_ids = (doc_ids[doc_id_idx] for doc_id_idx in doc_id_idx_idx)
        for doc_id, location in zip(squewed_doc_ids, locations4snippet):
            print "doc_id:", doc_id
            if doc_id in titles:
                print titles[doc_id]['title']
            # print a snippet of about 100 characters around the match
            if location < 50:
                start = 0
            else:
                start = location - 50
            end = location + 50
            with open(text_path + '/' + str(doc_id), 'r') as f:
                print f.read().strip().decode('utf-8')[start:end]


def get_bigram_queries(query):
    # split the query into bigrams; an odd trailing character reuses the previous one
    bi_gram_qs = []
    for i in xrange(0, len(query), 2):
        bi_gram_q = query[i:i+2].strip()
        if len(bi_gram_q) == 1 and i > 0:
            bi_gram_q = query[i-1:i+1]
        bi_gram_qs.append(bi_gram_q)
    return bi_gram_qs


def get_decoded_data(encoded_data):
    # decode the VB-encoded gaps and restore the original values with a cumulative sum
    decoded_data = vbdecode(encoded_data)
    return np.cumsum(decoded_data).tolist()


def check_sequence(location_lists):
    # keep only the documents in which the query bigrams appear at consecutive positions
    length = len(location_lists)
    doc_id_idx_idx = []
    locations4snippet = []
    for doc_i, location_list in enumerate(location_lists[0]):
        for location in location_list:
            if length == 1:
                # single-bigram query: any occurrence is a match
                return [doc_i], [location]
            elif check_next_node(1, doc_i, location, location_lists, length):
                doc_id_idx_idx.append(doc_i)
                locations4snippet.append(location)
            else:
                continue
    return doc_id_idx_idx, locations4snippet


def check_next_node(i, doc_i, location, location_lists, last_node):
    for next_location in location_lists[i][doc_i]:
        if location < next_location <= location + 2:
            if i < last_node - 1:
                return check_next_node(i + 1, doc_i, location + 2, location_lists, last_node)
            else:
                return True
        else:
            return False


if __name__ == '__main__':
    basepath = os.getcwd()
    index_filename = basepath + '/index_bigram.pkl'
    file_path = basepath + '/../works/input'
    ifname = file_path + '/10000entries.txt'
    text_path = file_path + '/texts'
    titles = get_titles(ifname)
    index = load_index(index_filename)
    print "get search query"
    query = raw_input()
    search(titles, index, query, text_path)
    print "done"
Creating the inverted index takes a tremendous amount of time. This time it was fine because the data fit in memory, but if it did not, I imagine buffering the index in memory and merging it into a database in batches as it is built. As for the database, since the index is a Python dictionary, I think MongoDB would be an easy fit.
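As a very rough sketch of that idea (not part of the code above; the database and collection names, document layout, flush threshold, and helper functions are all assumptions of mine), buffering postings in memory and flushing them into MongoDB with pymongo might look something like this:

# Hypothetical sketch only: buffer postings in memory and flush them to MongoDB
# in batches. The names, document layout, and threshold are invented for illustration.
from collections import defaultdict
from pymongo import MongoClient

BUFFER_LIMIT = 100000  # flush once this many postings have been buffered
collection = MongoClient()['search_engine']['inverted_index']

posting_buffer = defaultdict(list)
buffered_count = 0

def add_posting(term, doc_id, locations):
    global buffered_count
    posting_buffer[term].append([doc_id, locations])
    buffered_count += 1
    if buffered_count >= BUFFER_LIMIT:
        flush()

def flush():
    global buffered_count
    for term, postings in posting_buffer.items():
        # append the buffered postings to this term's document, creating it if needed
        collection.update_one({'term': term},
                              {'$push': {'postings': {'$each': postings}}},
                              upsert=True)
    posting_buffer.clear()
    buffered_count = 0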
Data compression and search technology are deep, specialized fields, but simply building something naive without thinking too hard was easier than I had originally imagined. Next I would like to take on ranking algorithms, which I did not get to this time.
If you notice any mistakes, I would appreciate it if you could point them out.