Building a search engine with Python

As a winter vacation project, I built the search engine I had been interested in for a long time. (Although it's a bit of a mess...)

Work environment

Since the search query is read from standard input, I will note the environment for reference: Mac OS X 10.9.5, Core i7 (64-bit), Python 2.7.9.

Reference material

[Introduction to Large-Scale Service Technology [for Web Developers]](http://www.amazon.co.jp/gp/product/4774143073/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=247&creative=1211&creativeASIN=4774143073&linkCode=as2&tag=shimashimao06-22)

It covers a wide range of topics, not only search algorithms but also memory and disk. It was also an interesting read in its own right, since it includes examples of failures drawn from real experience.

What I did

There are three main functions: data compression, inverted index creation, and search.

Creating code for data compression

This time I compress with VBcode (variable-byte code). There seem to be various other compression methods as well, such as PForDelta and Golomb coding. Now that large-capacity storage can be bought cheaply, the reason for compressing data in the first place is not to save space but to raise overall throughput.

The book above explains it as follows:

From the viewpoint of throughput, there are many situations where it is faster to compress the data being handled. A computer has two kinds of load: CPU and I/O. While a process is waiting on I/O, the CPU cannot work on that process. Compressing files puts more load on the CPU but reduces the I/O wait. Since the CPU is often idle while I/O is saturated, overall throughput can be raised by shifting load from I/O to the CPU through compression.

In addition, since compressed data has to be decompressed before use, I think compression is worth it only when the I/O savings outweigh the cost of compressing and decompressing, or when a compression method with a low enough cost is chosen.

As for the method itself, VBcode compresses by stripping redundant bytes and describing each integer with the minimum number of bytes. For example, the integer 5 is 00000000 00000000 00000000 00000101 as a fixed-length 32-bit binary code, but can be expressed as the single byte 10000101 in VBcode; likewise 300 = 2×128 + 44 becomes the two bytes 00000010 10101100. Since each 8-bit byte is one block, the key is to make good use of 128 (10000000 in binary) as the flag that marks the last byte of a number.

Alternatively, you can gain a further compression effect by sorting the integer values and encoding the gaps between adjacent numbers instead of the numbers themselves. This works because the resulting distribution is skewed: small gaps appear often and large gaps rarely, so most gaps fit into a single byte.
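As a small illustration (the numbers are made up, and vbencode / vbdecode are the functions from the vbcode.py listed later in this post), gap encoding and its inverse look like this:

# Made-up example of gap encoding: store the differences between
# successive sorted doc_ids instead of the doc_ids themselves.
from vbcode import vbencode, vbdecode
import numpy as np

doc_ids = [3, 7, 12, 1024]
gaps = [3, 4, 5, 1012]              # keep the first value, then the differences
encoded = vbencode(gaps)            # most gaps now fit into a single byte
restored = np.cumsum(vbdecode(encoded)).tolist()  # cumulative sum undoes the gaps
print restored                      # -> [3, 7, 12, 1024]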

Create and save inverted index

This time, I created an inverted index using bi-grams. I think most people are already familiar with inverted indexes, so I will omit the explanation. In the implementation below, a posting list that includes term locations is stored in a dictionary keyed by term, in the form { term: {(doc_id1, doc_id2,...): [[location_list1],[location_list2],...]} }. (I suspect there is a cleaner structure, so please let me know if you know one.) The created inverted index (dictionary) is saved to a file with cPickle.
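To make the structure concrete, a single uncompressed entry would look like the following sketch (the term and the numbers are made up for illustration):

# Made-up example of one uncompressed entry: the bi-gram "se" appears
# in doc 3 at positions 0 and 42, and in doc 17 at position 7.
index = {
    "se": { (3, 17): [ [0, 42], [7] ] }
}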

Document search

There is not much to explain here, so I will just list the code.

Data used

Please download it from the publisher's (Gijutsu-Hyoron-sha) support page. Use the file 10000entries.txt, which contains the title data, and the texts folder, which contains the document files split by document_id. The title file is 1.4M, and the texts folder as a whole is 321K.

Reference code

Sorry for the pretty dirty code, but I'll paste it below.

First, the code that compresses / decompresses with VBcode.

vbcode.py


# coding: utf-8

from struct import pack, unpack

def vbencode_number(number):
    number = int(number)
    _bytes = []
    while 1:
        _bytes.insert(0, number % 128)
        if number < 128: break
        number /= 128
    _bytes[-1] += 128 # mark the end of the number: set the high-order bit of the last byte as a terminator flag
    return pack('%dB' % len(_bytes), *_bytes)

def vbencode(numbers):
    bytestream = ''
    for number in numbers:
        _bytes = vbencode_number(number)
        bytestream += _bytes
    return bytestream

def vbdecode(bytestream):
    numbers = []
    n = 0
    bytestream = unpack('%dB' % len(bytestream), bytestream)
    for i in xrange(len(bytestream)):
        if bytestream[i] < 128:
            n = 128*n + bytestream[i]
        else:
            n = 128*n + bytestream[i] - 128
            numbers.append(n)
            n = 0
    return numbers

if __name__=='__main__':
    numbers = (1,2,128,145)
    res = vbencode(numbers)
    print vbdecode(res)

Next is the creation and storage of the inverted index.

index.py


# coding: utf-8

from __future__ import unicode_literals
from collections import defaultdict
from unicodedata import normalize
import MeCab
from vbcode import vbencode
import numpy as np
import os
import cPickle

def get_title_data(input_filename):
    titles = defaultdict(str)
    with open(input_filename, 'r', 10000) as f:
        for line in f:
            try:
                doc_id, cat_id, url, title = line.strip().split(b'\t')
                titles[int(doc_id)] = title
            except Exception, e:
                doc_id, cat_id, url = line.strip().split(b'\t')
                titles[int(doc_id)] = ''
    return titles

def create_index(doc_base_path,titles):
    main_index = defaultdict(dict)
    for doc_id in titles.keys():
        try:
            main_index = get_index_dic(doc_base_path, doc_id, main_index)
        except Exception, e:
            import traceback
            print traceback.format_exc()
            print e
    compressed_main_index = compress_postinglist(main_index)
    save_index(compressed_main_index)

def save_index(index):
    with open('index_bigram.pkl','w') as f:
        cPickle.dump(index, f)

def compress_postinglist(index):
    for key_word, posting_dic in index.items():
        tup_did_location = zip( posting_dic.keys()[0],posting_dic.values()[0] ) # [(doc_id, location_list), ...]
        tup_did_location.sort(key=lambda x:x[0]) # sort by doc_id

        doc_id_list = []
        lctn_pl = []
        for doc_id, location_list in tup_did_location:
            doc_id_list.append(doc_id)
            lctn_pl.append( vbencode( get_differences(location_list) ) )
        d_pl = get_differences(doc_id_list)
        index[key_word] = { vbencode(d_pl) : lctn_pl}
    return index

def get_differences(sorted_list):
    if len(sorted_list)>1:
        diff_list = ( np.array(sorted_list) - np.array([0]+sorted_list[:-1]) ).tolist()
    else:
        diff_list = sorted_list
    return diff_list

def get_index_dic(doc_base_path, doc_id, main_index):
    with open(doc_base_path+'/'+str(doc_id), 'r') as f:
        doc = f.read().strip()
        doc = normalize('NFKC',doc.decode('utf-8')).lower()
        word_list, word_location = bi_gram_tokenize(doc)

        for word, location in zip(word_list, word_location):
            if main_index[word]:
                _keys = main_index[word].keys()[0]
                if doc_id in _keys:
                    main_index[word][_keys][-1] += [location]
                else:
                    main_index[word][_keys].append( [location] )
                    main_index[word] = {_keys+(doc_id,): main_index[word].values()[0]}
            else:
                main_index[word][(doc_id,)] = [ [location] ]
    return main_index

def mecab_tokenize(doc):
    tagger = MeCab.Tagger(b'-Ochasen')
    node = tagger.parseToNode(doc.encode('utf-8'))
    word_list = []
    while node:
        ns = node.surface
        word_list.append(ns)
        node = node.next
    return word_list

def bi_gram_tokenize(doc):
    word_list = []
    word_location = []
    for i in xrange(len(doc)):
        term = doc[i:i+2].strip()
        if len(term)!=0:
            #print "term:",term
            word_list.append( term )
            word_location.append( i )
    return word_list, word_location

def main(input_filename, doc_base_path):
    titles = get_title_data(input_filename)
    create_index(doc_base_path,titles)

if __name__=='__main__':
    ifpath = os.getcwd()+'/../works/input'
    ifname = ifpath + '/10000entries.txt'
    doc_base_path = ifpath + '/texts'
    main(ifname, doc_base_path)

Finally, the code for the search part.

search.py


# coding: utf-8

from __future__ import unicode_literals
import os
from collections import defaultdict
import sys
from vbcode import vbdecode
import cPickle as pickle
import numpy as np
from unicodedata import normalize

def get_titles(input_filename):
    with open(input_filename, 'r', 10000) as f:
        titles = defaultdict(dict)
        for line in f:
            try:
                doc_id, cat_id, url, title = line.strip().split(b'\t')
                titles[int(doc_id)] = {'title': title, 'url': url}
            except Exception, e:
                doc_id, cat_id, url = line.strip().split(b'\t')
                titles[int(doc_id)] = {'title': '', 'url': url}
    return titles

def load_index(index_filename):
    with open(index_filename, 'r') as f:
        index = pickle.load(f)
    return index

def search(titles, index, query, text_path):
    query = normalize('NFKC',query.decode('utf-8'))
    for q in query.strip().split(' '):
        print "search:", query
        if len(q.strip())==0: continue

        bi_gram_qs = get_bigram_queries(q.strip())
        if not index[bi_gram_qs[0]]: continue

        # Get the intersection of doc_ids
        doc_ids = set( get_decoded_data(index[bi_gram_qs[0]].keys()[0]) )
        for bi_gram_q in bi_gram_qs[1:]:
            doc_ids = doc_ids & set( get_decoded_data(index[bi_gram_q].keys()[0]) )
        doc_ids = list(doc_ids)

        if len(doc_ids)==0: continue

        location_lists = []
        for bi_gram_q in bi_gram_qs:
            # Get the index list of doc_ids for this bi-gram
            all_doc_ids = get_decoded_data( index[bi_gram_q].keys()[0] )
            idx_list = [ all_doc_ids.index(doc_id) for doc_id in doc_ids ]
            # Get the locations for each document for this query term
            location_list = [ get_decoded_data(index[bi_gram_q].values()[0][idx]) for idx in idx_list ]
            location_lists.append( location_list )

        doc_id_idx_idx, locations4snippet = check_sequence(location_lists)
        #print "doc_id_idx_idx:", doc_id_idx_idx
        #print "doc_ids:", doc_ids
        matched_doc_ids = ( doc_ids[doc_id_idx] for doc_id_idx in doc_id_idx_idx )
        for doc_id, location in zip( matched_doc_ids, locations4snippet):
            print "doc_id:",doc_id
            if doc_id in titles:
                print titles[doc_id]['title']
                if location<50:
                    start = 0
                else:
                    start = location - 50
                end = location + 50
                with open(text_path+'/'+str(doc_id), 'r') as f:
                    print f.read().strip().decode('utf-8')[start:end]

def get_bigram_queries(query):
    bi_gram_qs = []
    for i in xrange(0, len(query), 2):
        bi_gram_q = query[i:i+2].strip()
        if len(bi_gram_q) ==1 and i>0:
            bi_gram_q = query[i-1:i+1]
        bi_gram_qs.append( bi_gram_q )
    return bi_gram_qs

def get_decoded_data(encoded_data):
    decoded_data = vbdecode(encoded_data)
    return np.cumsum(decoded_data).tolist()

def check_sequence(location_lists):
    length = len(location_lists)

    doc_id_idx_idx = []
    locations4snippet = []
    for doc_i, location_list in enumerate(location_lists[0]):
        # With a single bi-gram query every candidate document matches;
        # record its first occurrence for the snippet and move on.
        if length==1:
            doc_id_idx_idx.append(doc_i)
            locations4snippet.append(location_list[0] if location_list else 0)
            continue
        for location in location_list:
            if check_next_node(1, doc_i, location, location_lists, length):
                doc_id_idx_idx.append(doc_i)
                locations4snippet.append(location)
    return doc_id_idx_idx, locations4snippet

def check_next_node(i, doc_i, location, location_lists, last_node):
    # Check whether the i-th bi-gram appears within two characters after
    # the previous bi-gram in the same document.
    for next_location in location_lists[i][doc_i]:
        if location < next_location <= location+2:
            if i < last_node-1:
                return check_next_node(i+1, doc_i, location+2, location_lists, last_node)
            else:
                return True
    return False

if __name__=='__main__':
    basepath = os.getcwd()
    index_filename = basepath + '/index_bigram.pkl'
    file_path = basepath + '/../works/input'
    ifname = file_path + '/10000entries.txt'
    text_path = file_path + '/texts'

    titles = get_titles(ifname)
    index = load_index(index_filename)

    print "get search query"
    query = raw_input()
    search(titles, index, query, text_path)
    print "done"

Creating the inverted index takes a tremendous amount of time. It was fine this time because the data was small enough to fit in memory, but if it did not fit, I imagine building the index into a buffer and merging the buffered index into a database each time the buffer fills up. As for the database, I think MongoDB is easy to use when the index is built as a Python dictionary, as in this post. (A rough sketch of that idea follows.)
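As a rough sketch of that merging step (this is my own assumption, not code from this article; the database and collection names are hypothetical), buffered postings could be flushed into MongoDB with pymongo roughly like this:

# Rough sketch (assumption, not part of the article's code): flush a
# buffered index into MongoDB once it grows too large to keep in memory.
# Database/collection names ("search", "postings") are hypothetical.
from pymongo import MongoClient

def flush_buffer(buffer_index):
    coll = MongoClient()['search']['postings']
    for term, postings in buffer_index.items():
        # postings: a list of [doc_id, [locations]] pairs accumulated in memory
        coll.update_one({'_id': term},
                        {'$push': {'postings': {'$each': postings}}},
                        upsert=True)
    buffer_index.clear()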

Comments

Data compression and search technology are deep, specialized fields, but simply building something naive without worrying too much turned out to be easier than I had originally imagined. Next I would like to take on ranking algorithms, which I did not get to this time.

I would appreciate it if you could point out any mistakes.
