I'm r2en, a software engineer at white, inc. Our company does consulting work centered on new businesses, and the engineering team usually develops free, cloud-based tools for new businesses and new business development. I'm involved in everything from consulting to PoC development.
While looking for a tool that would help me understand source code interactively, both when handing development over from the PoC phase to the operation phase and when operating existing software, I found Sourcetrail, so I tried it out and am sharing my impressions here.
Sourcetrail is a tool that analyzes and visualizes source code so that developers can code productively without spending a lot of time understanding source code written by others.
TL;DR: My impression after actually using it is that it is easy to operate, there is little to learn about the tool, and the visualizations are relatively easy to read and use.
Even if it does not make a codebase completely understandable, it definitely helps with deciphering source code, so I think it is worth continuing to use.
Go to the Sourcetrail site (https://www.sourcetrail.com/) and click the download button.
You will be taken to the Sourcetrail GitHub page; download the build for your OS environment.
Unzip it and move it to the Applications folder.
[Click here for a description of the source code used](https://qiita.com/r2en/items/e4f8145f54d6c5b4e77e#%E4%BD%BF%E7%94%A8%E3%81%97%E3%81%9F%E3%82%BD%E3%83%BC%E3%82%B9%E3%82%B3%E3%83%BC%E3%83%89)
When started, the following screen will be displayed. Press New Project.
When Sourcetrail performs static analysis, files are automatically generated in the analyzed repository (directory). Enter the project name and the project location to use for this.
Select add source group
Select your programming language and click Next.
Enter the language environment, external modules, and so on here. Drag and drop is supported, so I dropped the repository directly onto Files & Directories to Index.
This is the screen you see when checking the dropped files with Show Files on the screen above.
It has been imported properly
Once you have filled everything in, select Create.
A screen for static analysis of the source code will be displayed, so select start.
The files seach_engine.srctrlbm, seach_engine.srctrldb, and seach_engine.srctrlprj are generated in the folder, as shown below.
Since Start In-Depth Indexing appears to run the static analysis again at a deeper level, select it.
Select start
Errors occurred, but that is because the external modules (numpy, pandas, etc.) were not referenced during the initial setup. They are not needed for this walkthrough, so I proceeded as is.
Let's see if we can display the components of the source code.
Files existing in the repository are displayed in alphabetical order
Modules existing in the repository are displayed in alphabetical order
Classes existing in the repository are displayed in alphabetical order
Functions existing in the repository are displayed in alphabetical order
Global variables existing in the repository are displayed in alphabetical order
Let's see whether a more detailed analysis of the source code and its dependencies is displayed.
First, let's look at the modules.
In app.py, a responder server is running, so the API object is held as a global variable and called from there. Normally the indexing would be done as a batch process rather than inside a real-time API,
but since this is a simple search engine, the data index is created only once at first startup, which is why it lives in a global variable.
Among the functions, there is a batch_process function that creates the index and the BoW, and a search_engine function that serves as the search API.
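To make that structure concrete, here is a minimal, self-contained sketch of the arrangement described above. It is not the actual app.py from the repository: it assumes responder and scikit-learn, inlines a tiny catalog, and collapses the pre-/post-processing steps into the handler.

```python
# Sketch only: the index and BoW model are built once at startup and held in
# globals, and the responder server exposes a single search endpoint.
import responder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

QUESTIONS = ["What is the animal with a long nose?",
             "Who is the king of beasts?",
             "What is the barking animal?"]
ANSWERS = ["elephant", "lion", "dog"]

api = responder.API()
bow, index = None, None  # populated once at startup, as described in the article


def batch_process():
    """Build the BoW vectorizer and the document index from the catalog data."""
    global bow, index
    bow = CountVectorizer()
    index = bow.fit_transform(QUESTIONS)


@api.route("/search")
def search_engine(req, resp):
    query = req.params.get("q", "")
    query_vec = bow.transform([query])               # query pre-processing
    scores = cosine_similarity(query_vec, index)[0]  # query engine (similarity search)
    best = scores.argmax()                           # query post-processing
    resp.media = {"question": QUESTIONS[best], "answer": ANSWERS[best]}


if __name__ == "__main__":
    batch_process()
    api.run()
```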
This time, search_engine in app.py is the basis of all processing, so let's select this function.
It can be seen that it is composed of the global variables bow and index and the functions query_engine, query_post_processor and query_pre_processor.
Select query_pre_processor to take a closer look
Unfortunately, although the CountVector and TfidfVector classes inherit from the BaseVector class, that dependency is difficult to see in this graph.
MecabMorphologicalAnalysis also inherits from the BaseMorphologicalAnalysis class, but that is not visible at first glance either.
Let's look at each of them in turn.
You can see that query_engine uses cosine_similarity
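As a reminder of what that dependency does, here is a hedged sketch of the kind of cosine-similarity scoring query_engine presumably relies on; the actual implementation lives in search/cosine_similarity.py and may differ.

```python
import numpy as np


def cosine_similarity(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Score one query vector against each row (document vector) of doc_vecs."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    # Guard against zero vectors so the division never produces NaN.
    return (doc_vecs @ query_vec) / np.where(norms == 0.0, 1.0, norms)


# Example: the index row with the highest score is the best-matching document.
# scores = cosine_similarity(query_vec, index); best = scores.argmax()
```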
You can see that query_post_processor is self-contained and does not use any other classes or functions.
Looking at CountVector, you can see that it inherits from BaseVector. You can also see that it is called from the indexer module's indexer function and from the query_pre_processor function.
Conversely, if you look at BaseVector, it is referenced from the count_vector and tfidf_vector modules, and you can see that it is the parent of the CountVector and TfidfVector classes.
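To spell out the hierarchy that the graph only hints at, here is a rough sketch of how such classes are typically laid out. The method names are my own guess, and the real word_embedding modules in the repository may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


class BaseVector:
    """Parent class shared by the concrete vectorizers (word_embedding/base_vector.py)."""
    vectorizer = None

    def fit_transform(self, documents):
        return self.vectorizer.fit_transform(documents)

    def transform(self, documents):
        return self.vectorizer.transform(documents)


class CountVector(BaseVector):
    """Count-based BoW converter, called from indexer and query_pre_processor."""
    def __init__(self):
        self.vectorizer = CountVectorizer()


class TfidfVector(BaseVector):
    """Tf-Idf-based BoW converter."""
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
```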
This part is not related to Sourcetrail itself, so read it only if you want a deeper understanding of the code being analyzed.
I chose this source code because I wanted a system with multiple files (modules) and some dependencies, so I could verify whether the tool would be usable on real projects, while keeping the structure simple and easy to understand. The code that fit those conditions happened to be a simple search engine.
When the user asks the API "What is the animal that barks?", the API compares the query against the question list it holds internally as data, such as "What is the animal with a long nose?", "What is a long-necked animal?", "Who is the king of beasts?", and "What is the barking animal?". It selects the stored question closest to the query and returns the corresponding answer, "dog", as the response.
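As a concrete picture of that round trip, a call against the running server might look like the sketch below; the endpoint path and query parameter name are hypothetical, so check app.py in the repository for the real ones.

```python
import requests

# Hypothetical route and parameter; responder serves on port 5042 by default.
response = requests.get(
    "http://127.0.0.1:5042/search",
    params={"q": "What is the animal that barks?"},
)
print(response.json())  # expected: the closest stored question plus the answer "dog"
```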
For the theoretical components of the search engine, refer to the following.
The data to be searched is preprocessed via product information management and turned into an index. When the user sends a search query, the query pre-processor applies the same preprocessing, the query engine searches for similar elements, the query post-processor formats the results, and the search results are returned as the response. It is a very simple, minimal search engine.
The chart is borrowed from the book AI Algorithm Marketing: Machine Learning for Automation.
It does not seem to be talked about much in the community, but for anyone working on search or recommendation it is a book with a lot to learn from, so I really recommend reading it.
Since the chart is quoted without permission, I will delete it immediately if the author or other related parties contact me.
Mapped onto a concrete implementation, it takes the following form.
The relevance tuning part is omitted this time because it varies depending on the business requirements.
The source code described here is available on GitHub; if you want to look at it in detail, see the link below.
https://github.com/r2en/simple_search_engine
```
├── app.py                              <- Responder API server: receives search requests and returns search results
├── indexer.py                          <- Preprocesses the catalog data (documents) and generates the index (vectorized preprocessed documents)
├── query_engine.py                     <- Matches and scores queries against the catalog data
├── query_pre_processor.py              <- Processes the query (the user's search request) into the index format, using almost the same preprocessing as the indexer
├── query_post_processor.py             <- Generates the search results from the query/catalog matching results
├── product_information_management.py   <- Manages the data group to be searched
├── catalog_data                        <- Stores the data to be searched
│   ├── answer.csv                      <- Stores the answer data returned for a query
│   └── question.csv                    <- Stores question data of the same kind as the query
├── morphological_analysis              <- Tools for morphological analysis of natural language
│   ├── base_morphological_analysis.py  <- Parent class of the morphological analyzers
│   ├── mecab_morphological_analysis.py <- MeCab-based morphological analyzer
│   └── __init__.py
├── search                              <- Searches for similarities between the vectorized query and catalog data
│   ├── cosine_similarity.py            <- Searches for similarities using cosine similarity
│   └── __init__.py
└── word_embedding                      <- Creates the data structures and indexes that make the query and catalog data searchable
    ├── base_vector.py                  <- Parent class of the vector converters
    ├── count_vector.py                 <- Count-based BoW vector converter
    ├── tfidf_vector.py                 <- Tf-Idf-based BoW vector converter
    └── __init__.py
```
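The morphological_analysis package above wraps MeCab behind the small class hierarchy mentioned earlier (BaseMorphologicalAnalysis as the parent, MecabMorphologicalAnalysis as the concrete analyzer). Here is a rough, hedged sketch of what such a wrapper usually looks like, assuming the mecab-python3 package; the actual class contents in the repository may differ.

```python
import MeCab


class BaseMorphologicalAnalysis:
    """Parent class of the morphological analyzers (base_morphological_analysis.py)."""

    def tokenize(self, text: str) -> list:
        raise NotImplementedError


class MecabMorphologicalAnalysis(BaseMorphologicalAnalysis):
    """MeCab-backed analyzer (mecab_morphological_analysis.py) that splits Japanese text into words."""

    def __init__(self):
        # -Owakati outputs space-separated surface forms (wakati-gaki).
        self.tagger = MeCab.Tagger("-Owakati")

    def tokenize(self, text: str) -> list:
        return self.tagger.parse(text).split()


# Example: MecabMorphologicalAnalysis().tokenize("鼻の長い動物は何？")
```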
My impression after actually using it is that it is easy to operate, there is little to learn about the tool, and the visualizations are relatively easy to read and use.
Even if it does not make a codebase completely understandable, it definitely helps with deciphering source code, so I think it is worth continuing to use.