Analyzing the source code of my own simple search engine written in Python with the code visualization tool "Sourcetrail"

image

image

I'm r2en, a software engineer at white, inc. Our company does consulting work centered on new businesses, and the engineering team usually develops free cloud-based tools for new business development. I am involved in everything from consulting to PoC development.

I was looking for a tool for understanding source code interactively, both for handing development over from the PoC phase to the operation phase and for operating existing software, and I found Sourcetrail. I tried it out, and this article shares what I found.

image

Overview

Sourcetrail is a tool that analyzes and visualizes source code so that developers can code productively instead of spending a lot of time understanding source code written by others.

TL;DR: Having actually used it, my impression is that it is easy to operate, there is little to learn about the tool itself, and the display is relatively easy to read.

I think it's worth continuing to use: even if it does not make the source code completely understandable, it definitely helps in deciphering it.

Sourcetrail installation method

Go to the Sourcetrail site at https://www.sourcetrail.com/ and click the download button.

image

This takes you to the Sourcetrail GitHub page. Download the package for your OS there.

image

Unzip it and move it to the Applications folder.

image

How to use Sourcetrail

[Click here for a description of the source code used](https://qiita.com/r2en/items/e4f8145f54d6c5b4e77e#%E4%BD%BF%E7%94%A8%E3%81%97%E3%81%9F%E3%82%BD%E3%83%BC%E3%82%B9%E3%82%B3%E3%83%BC%E3%83%89)

When started, the following screen will be displayed. Press New Project.

image

When Sourcetrail performs static analysis, it automatically generates files in the analyzed repository (directory). Enter the project name and project location to use for them.

Select add source group

image

Select your programming language and select next

image

Enter the language environment, external modules, modules, and so on here. Drag and drop is supported, so I dropped the repository as is onto Files & Directories to Index.

image

Clicking show files on the above screen lists the files that were dropped.

They have been imported properly.

image

Once the settings are filled in, select create.

image

A screen for static analysis of the source code will be displayed, so select start.

image

As the folder below shows, seach_engine.srctrlbm, seach_engine.srctrldb, and seach_engine.srctrlprj are generated.

image
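If you want to check from a script which of these project files ended up in a repository (for example before deciding whether to commit them), a small helper like the one below is enough. It is only a sketch: the three file suffixes are the ones seen above, while the function name sourcetrail_files and the constant SOURCETRAIL_SUFFIXES are made up for this example.

```python
from pathlib import Path

# Suffixes of the files Sourcetrail generated in this article.
SOURCETRAIL_SUFFIXES = {".srctrlbm", ".srctrldb", ".srctrlprj"}


def sourcetrail_files(project_dir):
    """Return the sorted names of Sourcetrail project files directly inside project_dir."""
    return sorted(
        path.name
        for path in Path(project_dir).iterdir()
        if path.is_file() and path.suffix in SOURCETRAIL_SUFFIXES
    )
```

Calling sourcetrail_files(".") from the repository root lists whatever Sourcetrail wrote there. Subdirectories are not searched.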

Select start in-depth indexing, since it appears to run a deeper static analysis.

image

Select start

image

An error occurred, but that is because the external modules (numpy, pandas, etc.) were not referenced in the initial settings. We don't need to go that far this time, so proceed as is.

image

Source code overview (modules, classes, functions, variables)

Let's see if we can display the components of the source code.

Files existing in the repository are displayed in alphabetical order

image

Modules existing in the repository are displayed in alphabetical order

image

Classes existing in the repository are displayed in alphabetical order

image

Functions existing in the repository are displayed in alphabetical order

image

Global variables existing in the repository are displayed in alphabetical order

image

Source code analysis (module)

Let's see whether a more detailed analysis of the source code and its dependencies is displayed.

First, look at the module.

app.py runs the responder server, so api is created as a global variable and called from there. Ideally the index would be built by batch processing rather than inside the real-time API, but since this is a simple search engine, the index of the data is created only once, at first startup, and therefore lives in global variables.

As for functions, there is batch_process, which creates index and bow, and search_engine, the search engine API.

image
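As a rough illustration of this build-once pattern, the sketch below fills the module-level index and bow in batch_process at startup and only reads them in search_engine. It is a minimal sketch, not the actual app.py: the responder server is left out, a whitespace tokenizer stands in for MeCab, and QUESTIONS, tokenize, and the term-overlap scoring are assumptions made for this example.

```python
# Hypothetical stand-in for catalog_data/question.csv.
QUESTIONS = [
    "what is the animal with a long nose",
    "what is a long-necked animal",
    "who is the king of beasts",
]

# Module-level state: empty at import time, filled once by batch_process().
bow = []    # vocabulary (bag-of-words terms), sorted
index = []  # one count vector per stored question


def tokenize(text):
    """Whitespace tokenizer (assumption; the real project uses MeCab)."""
    return text.lower().split()


def batch_process(questions):
    """Build the vocabulary and the count-vector index once, at startup."""
    global bow, index
    tokenized = [tokenize(question) for question in questions]
    bow = sorted({term for tokens in tokenized for term in tokens})
    index = [[tokens.count(term) for term in bow] for tokens in tokenized]


def search_engine(query):
    """Return the position of the stored question that shares the most terms."""
    tokens = tokenize(query)
    vector = [tokens.count(term) for term in bow]
    scores = [sum(q * d for q, d in zip(vector, row)) for row in index]
    return max(range(len(scores)), key=scores.__getitem__)


batch_process(QUESTIONS)  # runs once, when the module is loaded
```

After loading, search_engine("which animal has a long nose") returns 0, the position of the first stored question. The trade-off of this design is that the index is rebuilt on every restart, which is acceptable only while the catalog stays small.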

This time, search_engine in app.py is the basis of all processing, so let's select this function.

It can be seen that it is composed of the global variables bow and index and the functions query_engine, query_post_processor and query_pre_processor.

image

Select query_pre_processor to take a closer look

Unfortunately, although the CountVector and TfidfVector classes inherit from the BaseVector class, that dependency is hard to see in this chart.

MecabMorphologicalAnalysis also inherits from the BaseMorphologicalAnalysis class, but that is not apparent at first glance either.

image

Let's look at the other functions as well.

You can see that query_engine uses cosine_similarity

image
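For reference, cosine similarity scores two vectors by the cosine of the angle between them: the dot product divided by the product of their lengths. The function below is a minimal pure-Python sketch of that formula, not the repository's cosine_similarity module. The signature and the choice to return 0.0 for a zero vector are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and vectors with no terms in common score 0, which is why the measure suits comparing a short query with longer stored text.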

You can see that query_post_processor is self-contained and uses no other classes or functions.

image

Source code dependency (class)

Looking at CountVector, you can see that it inherits from BaseVector. You can also see that it is called from the indexer module, the indexer function, and the query_pre_processor function.

image

Conversely, looking at BaseVector, you can see that it is referenced from the count_vector and tfidf_vector modules and that it is the parent of the CountVector and TfidfVector classes.

image
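The inheritance relationship that Sourcetrail draws here can be pictured with the following sketch. Only the class names BaseVector, CountVector, and TfidfVector come from the repository. The fit and transform methods, the vocabulary and idf attributes, and the idf formula are assumptions made for illustration.

```python
import math
from abc import ABC, abstractmethod
from collections import Counter


class BaseVector(ABC):
    """Parent class: learns a vocabulary, leaves vectorization to subclasses."""

    def fit(self, documents):
        # documents is a list of token lists.
        self.vocabulary = sorted({term for doc in documents for term in doc})
        return self

    @abstractmethod
    def transform(self, tokens):
        """Turn one token list into a vector aligned with self.vocabulary."""


class CountVector(BaseVector):
    """Bag-of-words vector of raw term counts."""

    def transform(self, tokens):
        counts = Counter(tokens)
        return [counts[term] for term in self.vocabulary]


class TfidfVector(BaseVector):
    """Bag-of-words vector of term counts weighted by inverse document frequency."""

    def fit(self, documents):
        super().fit(documents)
        total = len(documents)
        # Every vocabulary term occurs in at least one document, so no division by zero.
        self.idf = [
            math.log(total / sum(1 for doc in documents if term in doc)) + 1.0
            for term in self.vocabulary
        ]
        return self

    def transform(self, tokens):
        counts = Counter(tokens)
        return [counts[term] * weight for term, weight in zip(self.vocabulary, self.idf)]
```

Because both subclasses share the fit and transform interface of the parent, a caller such as indexer or query_pre_processor can switch between them without other changes.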

Source code used

This part is not related to Sourcetrail itself, so read it only if you want a deeper understanding of what is being analyzed.

I chose this source code because, to verify whether Sourcetrail can be used in real projects, I wanted code that has some dependencies and spans multiple files (modules) yet has a simple, easy-to-understand structure. The code I had on hand that fit was this simple search engine.

Overview

When the user asks the API "What is the animal that screams?", the system looks through the list of questions that the API holds internally as data ("What is the animal with a long nose?", "What is a long-necked animal?", "Who is the king of beasts?", "What is a screaming animal?", "What is a screaming animal?"), selects the question closest to the query, and returns the corresponding answer, "dog", as the response.

Logical configuration

The logical components of the search engine follow the diagram below.

The data to be searched is preprocessed through product information management and turned into an index. When a search query arrives from a user request, the query preprocessor applies the same preprocessing, the query engine searches for similar elements, and the query postprocessor formats the results, which are returned as the response. It is a very simple, minimal search engine.

image

The chart is borrowed from the book Machine Learning for AI Algorithm Marketing Automation.

I don't think it has been talked about much, but for anyone working on search and recommendation there is a lot to learn from it, so I really recommend reading it.

It is quoted without permission, so I will remove it immediately if the author or related parties contact me.
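The flow described in this section (index generation, query preprocessing, query engine, query postprocessing) can be sketched end to end as follows. This is a simplified illustration, not the repository code: the stage names match the files in the repository, but CATALOG, preprocess, cosine, and search are hypothetical. The stored questions and every answer except "dog" are made up for the example, and a whitespace tokenizer stands in for MeCab.

```python
import math
from collections import Counter

# Hypothetical catalog; in the repository this data lives in question.csv and answer.csv.
CATALOG = [
    ("what is the animal with a long nose", "elephant"),
    ("what is a long-necked animal", "giraffe"),
    ("who is the king of beasts", "lion"),
    ("what is the animal that barks", "dog"),
]


def preprocess(text):
    """Preprocessing shared by catalog data and queries: lowercase, split, count."""
    return Counter(text.lower().replace("?", "").split())


def indexer(catalog):
    """Vectorize every stored question once."""
    return [preprocess(question) for question, _ in catalog]


def query_pre_processor(query):
    """Apply the same preprocessing to the query as the indexer applies to the catalog."""
    return preprocess(query)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def query_engine(query_vector, index):
    """Score the query against every indexed question, best match first."""
    scores = [(cosine(query_vector, doc), position) for position, doc in enumerate(index)]
    return sorted(scores, reverse=True)


def query_post_processor(ranked, catalog):
    """Format the best match as the response."""
    score, position = ranked[0]
    question, answer = catalog[position]
    return {"question": question, "answer": answer, "score": round(score, 3)}


INDEX = indexer(CATALOG)


def search(query):
    """Run the three query stages against the prebuilt index."""
    ranked = query_engine(query_pre_processor(query), INDEX)
    return query_post_processor(ranked, CATALOG)
```

With this catalog, search("What is the animal that barks?") returns a result whose "answer" is "dog". Sharing preprocess between the indexer and the query preprocessor is what keeps the query and the index in the same vector space.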

Directory structure

Made concrete, it takes the following form.

The relevance tuning part is omitted because it varies with business requirements.

The source code described here is posted on GitHub, so if you want to see it in detail, please visit the following.

https://github.com/r2en/simple_search_engine

    ├── app.py    <-responder API server; receives search requests and returns search results
    │                         
    │
    ├── indexer.py    <-Preprocesses catalog data (documents) and generates the index (vectorization of the preprocessed documents)
    │
    │
    ├── query_engine.py    <-Matching and scoring queries and catalog data
    │
    │    
    ├── query_pre_processor.py    <-Converts the query (user search request) into index format; almost the same preprocessing as indexer
    │
    │  
    ├── query_post_processor.py    <-Generate search results from the results of matching queries and catalog data
    │
    │  
    ├── product_information_management.py    <-Manipulate the data group to be searched
    │ 
    │
    ├── catalog_data    <-Stores the data to be searched
    │  │
    │  │
    │  ├── answer.csv    <-Stores the data of the answer requested by the query
    │  │
    │  │
    │  └── question.csv    <-Stores question data of the same kind as the query
    │  
    │  
    ├── morphological_analysis    <-A group of tools for morphological analyzers of natural language
    │  │
    │  ├── base_morphological_analysis.py    <-Base (parent) class for morphological analyzers
    │  │
    │  │
    │  ├── mecab_morphological_analysis.py    <-MeCab morphological analyzer
    │  │
    │  │
    │  └── __init__.py
    │  
    │      
    ├── search    <-Search for similarities between vectorized queries and catalog data
    │  │
    │  ├── cosine_similarity.py    <-Search for similarities by cosine similarity
    │  │
    │  │
    │  └── __init__.py
    │  
    │  
    └── word_embedding    <-Create data structures and indexes that allow you to search query and catalog data
       │
       │
       ├── base_vector.py    <-Base (parent) class for vector converters
       │
       │
       ├── count_vector.py    <-BoW count-based vector converter
       │
       │
       │
       ├── tfidf_vector.py    <-BoW TF-IDF-based vector converter
       │
       │
       └── __init__.py

Impressions

Having actually used it, my impression is that it is easy to operate, there is little to learn about the tool itself, and the display is relatively easy to read.

I think it's worth continuing to use: even if it does not make the source code completely understandable, it definitely helps in deciphering it.
