I'm r2en, a software engineer at white, inc. Our company does consulting work centered on new businesses, and the engineering team usually develops free, cloud-based tools for new businesses and new business development. I'm involved in everything from consulting to PoC development.
While looking for a tool that would help me understand source code interactively, both when handing development over from the PoC phase to the operation phase and when operating existing software, I found Sourcetrail, so I tried it out and am sharing my impressions here.
Sourcetrail is a tool that analyzes and visualizes source code so that developers can code productively without spending a lot of time understanding source code written by others.
TL;DR: My impression after actually using it is that it is easy to operate, there is little to learn about the tool, and the visualizations are relatively easy to read and use.
Even if it does not make a codebase completely understandable, it definitely helps with deciphering source code, so I think it is worth continuing to use.
Go to the Sourcetrail site (https://www.sourcetrail.com/) and click the download button.
You will be taken to the Sourcetrail GitHub page; download the build for your OS environment.
Unzip it and move it to the Applications folder.
[Click here for a description of the source code used](https://qiita.com/r2en/items/e4f8145f54d6c5b4e77e#%E4%BD%BF%E7%94%A8%E3%81%97%E3%81%9F%E3%82%BD%E3%83%BC%E3%82%B9%E3%82%B3%E3%83%BC%E3%83%89)
When started, the following screen will be displayed. Press New Project.
When Sourcetrail performs static analysis, files are automatically generated in the analyzed repository (directory). Enter the project name and the project location to use for this.
Select add source group
Select your programming language and click Next.
Enter the language environment, external modules, and so on here. Drag and drop is supported, so I dropped the repository directly onto Files & Directories to Index.
This is the screen you see when checking the dropped files with Show Files on the screen above.
It has been imported properly
Once you have filled everything in, select Create.
A screen for static analysis of the source code will be displayed, so select start.
The files seach_engine.srctrlbm, seach_engine.srctrldb, and seach_engine.srctrlprj are generated in the folder, as shown below.
Since Start In-Depth Indexing appears to run the static analysis again at a deeper level, select it.
Select start
Errors occurred, but that is because the external modules (numpy, pandas, etc.) were not referenced during the initial setup. They are not needed for this walkthrough, so I proceeded as is.
Let's see if we can display the components of the source code.
Files existing in the repository are displayed in alphabetical order
Modules existing in the repository are displayed in alphabetical order
Classes existing in the repository are displayed in alphabetical order
Functions existing in the repository are displayed in alphabetical order
Global variables existing in the repository are displayed in alphabetical order
Let's see whether a more detailed analysis of the source code and its dependencies is displayed.
First, let's look at the modules.
In app.py, a responder server is running, so the API object is held as a global variable and called from there. Normally the indexing would be done as a batch process rather than inside a real-time API,
but since this is a simple search engine, the data index is created only once at first startup, which is why it lives in a global variable.
Among the functions, there is a batch_process function that creates the index and the BoW, and a search_engine function that serves as the search API.
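To make that structure concrete, here is a minimal, self-contained sketch of the arrangement described above. It is not the actual app.py from the repository: it assumes responder and scikit-learn, inlines a tiny catalog, and collapses the pre-/post-processing steps into the handler.

```python
# Sketch only: the index and BoW model are built once at startup and held in
# globals, and the responder server exposes a single search endpoint.
import responder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

QUESTIONS = ["What is the animal with a long nose?",
             "Who is the king of beasts?",
             "What is the barking animal?"]
ANSWERS = ["elephant", "lion", "dog"]

api = responder.API()
bow, index = None, None  # populated once at startup, as described in the article


def batch_process():
    """Build the BoW vectorizer and the document index from the catalog data."""
    global bow, index
    bow = CountVectorizer()
    index = bow.fit_transform(QUESTIONS)


@api.route("/search")
def search_engine(req, resp):
    query = req.params.get("q", "")
    query_vec = bow.transform([query])               # query pre-processing
    scores = cosine_similarity(query_vec, index)[0]  # query engine (similarity search)
    best = scores.argmax()                           # query post-processing
    resp.media = {"question": QUESTIONS[best], "answer": ANSWERS[best]}


if __name__ == "__main__":
    batch_process()
    api.run()
```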
This time, search_engine in app.py is the basis of all processing, so let's select this function.
It can be seen that it is composed of the global variables bow and index and the functions query_engine, query_post_processor and query_pre_processor.
Select query_pre_processor to take a closer look
Unfortunately, although the CountVector and TfidfVector classes inherit from the BaseVector class, that dependency is difficult to see in this graph.
MecabMorphologicalAnalysis also inherits from the BaseMorphologicalAnalysis class, but that is not visible at first glance either.
Let's look at each of them in turn.
You can see that query_engine uses cosine_similarity
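As a reminder of what that dependency does, here is a hedged sketch of the kind of cosine-similarity scoring query_engine presumably relies on; the actual implementation lives in search/cosine_similarity.py and may differ.

```python
import numpy as np


def cosine_similarity(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Score one query vector against each row (document vector) of doc_vecs."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    # Guard against zero vectors so the division never produces NaN.
    return (doc_vecs @ query_vec) / np.where(norms == 0.0, 1.0, norms)


# Example: the index row with the highest score is the best-matching document.
# scores = cosine_similarity(query_vec, index); best = scores.argmax()
```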
You can see that query_post_processor is self-contained and does not use any other classes or functions.
Looking at CountVector, you can see that it inherits from BaseVector. You can also see that it is called from the indexer module's indexer function and from the query_pre_processor function.
Conversely, if you look at BaseVector, it is referenced from the count_vector and tfidf_vector modules, and you can see that it is the parent of the CountVector and TfidfVector classes.
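To spell out the hierarchy that the graph only hints at, here is a rough sketch of how such classes are typically laid out. The method names are my own guess, and the real word_embedding modules in the repository may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


class BaseVector:
    """Parent class shared by the concrete vectorizers (word_embedding/base_vector.py)."""
    vectorizer = None

    def fit_transform(self, documents):
        return self.vectorizer.fit_transform(documents)

    def transform(self, documents):
        return self.vectorizer.transform(documents)


class CountVector(BaseVector):
    """Count-based BoW converter, called from indexer and query_pre_processor."""
    def __init__(self):
        self.vectorizer = CountVectorizer()


class TfidfVector(BaseVector):
    """Tf-Idf-based BoW converter."""
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
```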
This part is not related to Sourcetrail itself, so read it only if you want a deeper understanding of the code being analyzed.
I chose this source code because I wanted a system with multiple files (modules) and some dependencies, so I could verify whether the tool would be usable on real projects, while keeping the structure simple and easy to understand. The code that fit those conditions happened to be a simple search engine.
When the user asks the API "What is the animal that barks?", the API compares the query against the question list it holds internally as data, such as "What is the animal with a long nose?", "What is a long-necked animal?", "Who is the king of beasts?", and "What is the barking animal?". It selects the stored question closest to the query and returns the corresponding answer, "dog", as the response.
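As a concrete picture of that round trip, a call against the running server might look like the sketch below; the endpoint path and query parameter name are hypothetical, so check app.py in the repository for the real ones.

```python
import requests

# Hypothetical route and parameter; responder serves on port 5042 by default.
response = requests.get(
    "http://127.0.0.1:5042/search",
    params={"q": "What is the animal that barks?"},
)
print(response.json())  # expected: the closest stored question plus the answer "dog"
```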
For the theoretical components of the search engine, refer to the following.
The data to be searched is preprocessed via product information management and turned into an index. When the user sends a search query, the query pre-processor applies the same preprocessing, the query engine searches for similar elements, the query post-processor formats the results, and the search results are returned as the response. It is a very simple, minimal search engine.
The chart is borrowed from the book AI Algorithm Marketing: Machine Learning for Automation.
It does not seem to be talked about much in the community, but for anyone working on search or recommendation it is a book with a lot to learn from, so I really recommend reading it.
Since the chart is quoted without permission, I will delete it immediately if the author or other related parties contact me.
Mapped onto a concrete implementation, it takes the following form.
The relevance tuning part is omitted this time because it varies depending on the business requirements.
The source code described here is available on GitHub; if you want to look at it in detail, see the link below.
https://github.com/r2en/simple_search_engine
```
├── app.py                              <- Responder API server: receives search requests and returns search results
├── indexer.py                          <- Preprocesses the catalog data (documents) and generates the index (vectorized preprocessed documents)
├── query_engine.py                     <- Matches and scores queries against the catalog data
├── query_pre_processor.py              <- Processes the query (the user's search request) into the index format, using almost the same preprocessing as the indexer
├── query_post_processor.py             <- Generates the search results from the query/catalog matching results
├── product_information_management.py   <- Manages the data group to be searched
├── catalog_data                        <- Stores the data to be searched
│   ├── answer.csv                      <- Stores the answer data returned for a query
│   └── question.csv                    <- Stores question data of the same kind as the query
├── morphological_analysis              <- Tools for morphological analysis of natural language
│   ├── base_morphological_analysis.py  <- Parent class of the morphological analyzers
│   ├── mecab_morphological_analysis.py <- MeCab-based morphological analyzer
│   └── __init__.py
├── search                              <- Searches for similarities between the vectorized query and catalog data
│   ├── cosine_similarity.py            <- Searches for similarities using cosine similarity
│   └── __init__.py
└── word_embedding                      <- Creates the data structures and indexes that make the query and catalog data searchable
    ├── base_vector.py                  <- Parent class of the vector converters
    ├── count_vector.py                 <- Count-based BoW vector converter
    ├── tfidf_vector.py                 <- Tf-Idf-based BoW vector converter
    └── __init__.py
```
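The morphological_analysis package above wraps MeCab behind the small class hierarchy mentioned earlier (BaseMorphologicalAnalysis as the parent, MecabMorphologicalAnalysis as the concrete analyzer). Here is a rough, hedged sketch of what such a wrapper usually looks like, assuming the mecab-python3 package; the actual class contents in the repository may differ.

```python
import MeCab


class BaseMorphologicalAnalysis:
    """Parent class of the morphological analyzers (base_morphological_analysis.py)."""

    def tokenize(self, text: str) -> list:
        raise NotImplementedError


class MecabMorphologicalAnalysis(BaseMorphologicalAnalysis):
    """MeCab-backed analyzer (mecab_morphological_analysis.py) that splits Japanese text into words."""

    def __init__(self):
        # -Owakati outputs space-separated surface forms (wakati-gaki).
        self.tagger = MeCab.Tagger("-Owakati")

    def tokenize(self, text: str) -> list:
        return self.tagger.parse(text).split()


# Example: MecabMorphologicalAnalysis().tokenize("鼻の長い動物は何？")
```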
My impression after actually using it is that it is easy to operate, there is little to learn about the tool, and the visualizations are relatively easy to read and use.
Even if it does not make a codebase completely understandable, it definitely helps with deciphering source code, so I think it is worth continuing to use.