[PYTHON] Detect lost search status from Google Chrome search query history

Introduction

Have you ever wasted time because the information you wanted was hard to find on the Web? If I am not careful, I end up searching endlessly. I suspect it is fairly common to get lost in a search, unable to reach the information you want no matter how many times you rephrase the query. If this lost state could be judged objectively, less time would be wasted, so I investigated whether it can be detected from information already at hand.

In this article, I used the search histories of several people to analyze how their search queries changed, and examined whether the lost state can be determined from that. In the end I could not clearly detect the lost state, but I came away thinking that the tendency could be captured by adding more variables to the analysis and looking at it from other perspectives.

The lost-in-search state

Assumption

I assumed that the amount of change in the search query is a useful signal for judging whether a search is lost. If you repeat a search while only recombining the same words, you will naturally keep hitting similar pages, and merely rewording the query is unlikely to surface a useful page. In other words, if the search query changes little from one attempt to the next, the search can be considered lost.

Existing technology for problem solving

Conversely, if you pick up new information about the target, or change the query based on a new idea, you are getting closer to the page you want and are not lost. To prevent the lost state, existing services try query expansion such as Google Suggest, and recommend search results matched to the user's preferences through collaborative filtering over other users' search information.

What this article examines

In this article, I instead checked whether the lost state can be detected, rather than eliminated. When you are searching for a specific phrase, existing technology works well, but I think the lost state that arises when you do not yet know the right search terms is still not handled well. As a first step, I wanted to detect when someone is lost so that the searcher can be prompted to get support from the people around them, so I investigated whether such detection is possible.

Analysis summary

The analysis followed the steps below.

[Figure: analysis flow (Kobito.Y1nW4f.png)]

  1. Get search query from Google Chrome
  2. Analyze the transition of search queries using Python
  3. Graph the results

Environment

・ Mac
・ Search history of Google Chrome
・ Python (3.6.0)

Analysis

Google Chrome search history

Location of search history data

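Google Chrome stores locally not only the history of viewed pages but also search queries and, for each URL, the date it was last visited. The history file is at the following location (the first path is for Mac, the second is the Windows path under the user's profile folder).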

~/Library/Application\ Support/Google/Chrome/Default/History
Local Settings/Application Data/Google/Chrome/User Data/Default/History

Browsing search history data

History is an SQLite3 database. SQLite is an RDBMS embedded in the application rather than run as a server. On a Mac you can browse it from the command line without any special preparation. Make a copy of the file before browsing so that you work on the copy instead of the file Chrome is using.

$ sqlite3 History

Since it accepts SQL statements much like Oracle and other databases, anyone who has used an RDBMS should have no trouble retrieving data. The most useful shell commands to remember are .schema, which prints schema information, .tables, which prints the list of tables, and .output, which changes the output destination. An external SQL file can be executed with .read followed by the file name.

[Reference]
http://qiita.com/northriver/items/3f48f27b60f6362d330c
http://l-w-i.net/t/sqlite/ext_001.txt
https://www.dbonline.jp/sqlite/sqlite_command/list.html

The database can also be browsed with a GUI by installing the following application, which is available for both Windows and Mac: DB Browser for SQLite
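Besides the command line and the GUI, the copy-then-browse step can be done from Python with the standard sqlite3 module. The following is only a minimal sketch: the helper names open_history_copy and list_tables are hypothetical ones I made up for illustration, and the path you pass in is whichever History location applies to your machine.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def open_history_copy(history_path):
    """Copy the History file to a temporary directory and open the copy.

    Working on a copy leaves the file that Chrome itself uses untouched.
    (Hypothetical helper, not part of any library.)
    """
    tmp_dir = Path(tempfile.mkdtemp())
    copy_path = tmp_dir / "History_copy"
    shutil.copy2(str(history_path), str(copy_path))
    return sqlite3.connect(str(copy_path))


def list_tables(conn):
    """Return the table names, similar to .tables in the sqlite3 shell."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

On a Mac, calling open_history_copy(Path.home() / "Library/Application Support/Google/Chrome/Default/History") and then list_tables on the returned connection should list tables such as urls, visits and keyword_search_terms, although the exact set depends on the Chrome version.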

About search history data

This time I used the keyword_search_terms table, which holds the search query history.
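As a rough illustration, the search queries can be pulled out in time order as follows. This is a sketch under assumptions: I am assuming that keyword_search_terms has the columns url_id and term, and that urls has id and last_visit_time. Column names can differ between Chrome versions, so check them with .schema first. The function name fetch_search_terms is hypothetical.

```python
import sqlite3


def fetch_search_terms(conn):
    """Return (lowercased search term, last_visit_time) pairs, oldest first.

    Assumed schema (verify with .schema on your own History copy):
      keyword_search_terms(url_id, term, ...)
      urls(id, last_visit_time, ...)
    """
    sql = """
        SELECT k.term, u.last_visit_time
        FROM keyword_search_terms AS k
        JOIN urls AS u ON u.id = k.url_id
        ORDER BY u.last_visit_time
    """
    # Lowercase in Python so the result does not depend on which
    # lowercase column a particular Chrome version provides.
    return [(term.lower(), visited) for term, visited in conn.execute(sql)]
```

Because urls keeps only the last visit time of each URL, a query that was repeated later appears once, at its most recent time. Joining against visits instead would be needed to recover every attempt.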

Detailed documentation on the data in History is scarce, so I describe below what I found by investigating.

・ About column names
Columns with the same name in different tables do not always mean the same thing, and columns that correspond to each other may have different names.
Example) urls.id = visits.url

・ About date and time information
Note that the base date and unit differ depending on the column.
visits.visit_time: microseconds since January 1, 1601
downloads.start_time: seconds since January 1, 1970
[Reference] http://www.forensicswiki.org/wiki/Google_Chrome

・ About transition
The format of some columns is undocumented, so many items cannot be interpreted. Although not used this time, visits.transition is a code that shows how the page was reached (for example, typed when the URL was entered directly, or link when a link on another page was followed). The transition code is held in the low 8 bits of the value and is obtained by ANDing the value with 0xFF.

[Reference] How to find the transition code value
https://groups.google.com/a/chromium.org/forum/#!topic/chromium-discuss/r7UQ2i98Lu4
[Reference] Meaning of transition code value
https://developer.chrome.com/extensions/history
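The date and transition formats described above can be decoded with a few lines of Python. This is a sketch, not a definitive decoder: it assumes visit_time counts microseconds from January 1, 1601 (UTC), and the mapping from code to name is my understanding of the transition types in the Chrome history API, so verify it against the references above. All function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed base dates for the two timestamp formats.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The low 8 bits of the transition value hold the core transition code.
CORE_MASK = 0xFF

# Assumed code-to-name mapping; check it against the Chrome history API docs.
CORE_TRANSITIONS = {
    0: "link",
    1: "typed",
    2: "auto_bookmark",
    3: "auto_subframe",
    4: "manual_subframe",
    5: "generated",
    6: "auto_toplevel",
    7: "form_submit",
    8: "reload",
    9: "keyword",
    10: "keyword_generated",
}


def chrome_time_to_datetime(microseconds):
    """Convert microseconds since 1601-01-01 to an aware UTC datetime."""
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


def unix_time_to_datetime(seconds):
    """Convert seconds since 1970-01-01 to an aware UTC datetime."""
    return UNIX_EPOCH + timedelta(seconds=seconds)


def core_transition(value):
    """Return (code, name) for the low 8 bits of a transition value."""
    code = value & CORE_MASK
    return code, CORE_TRANSITIONS.get(code, "unknown")
```

The upper bits of the transition value carry qualifier flags, which this sketch simply discards.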

Search efficiency measurement

About search score

For each search attempt, I measured how similar the query is to the previous search queries and calculated a score from it. The more words resemble earlier ones, the higher the score, which indicates that search efficiency is deteriorating.

To calculate the similarity between words I used difflib, which is part of the Python standard library. Two strings are compared with SequenceMatcher, and the similarity ranges from 0 (no match at all) to 1 (exact match). It can be used as follows.

>>> import difflib
>>> difflib.SequenceMatcher(None, 'python', 'python3').ratio()
0.9230769230769231

The following is an example of calculating the search score this way. The word pairs with the highest similarity are the ones connected by red lines. The score for the second search is 0.64.

The closer the search score is to 0, the less the query resembles earlier ones. I would like to use it as a basis for judging whether a search is lost.
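One possible way to compute such a score is sketched below. The exact formula is not spelled out in this article, so the following is an assumption made for illustration: each word of the current query is paired with its most similar word from all earlier queries, and the score is the average of those best similarities. The function names are hypothetical, and this sketch is not guaranteed to reproduce the 0.64 in the example above.

```python
import difflib


def word_similarity(a, b):
    """Similarity of two words, from 0.0 (no match) to 1.0 (exact match)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def search_scores(queries):
    """Return one score per query, in order.

    Assumed definition: the mean, over the words of the current query,
    of each word's best similarity to any word of the earlier queries.
    The first query scores 0.0 because there is nothing to compare with.
    """
    scores = []
    seen_words = []  # every word from the queries processed so far
    for query in queries:
        words = query.lower().split()
        if not seen_words or not words:
            scores.append(0.0)
        else:
            best = [
                max(word_similarity(word, prev) for prev in seen_words)
                for word in words
            ]
            scores.append(sum(best) / len(words))
        seen_words.extend(words)
    return scores
```

Passing the list of queries from one search session, for example search_scores(["python", "python3 install"]), returns a list of the same length whose values can be plotted against the attempt number.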

Experiment

Target data

I collected Chrome history from four people. For the history contents, I asked each of them to spend 10 minutes searching for a task I presented. In addition, I included arbitrary search history from their everyday use. One of the four is a senior colleague who is very quick at research tasks; the other three are junior colleagues.

Pre-processing

Only sequences of search queries that appeared to serve a single purpose were extracted and used as test data. Chrome's History holds each search query both as entered and in an all-lowercase form, and I used the lowercase form.

Below is an example.

postgresql mac installation
postgresql mac installation location
postgresql mac installation location specification
postgresql mac installation directory change

In the graph, the horizontal axis is the search attempt number and the vertical axis is the search score. The first attempt scores 0 because there is nothing to compare it with. In this example the query changes little from one attempt to the next, so the search score keeps rising.

Result

[Figure: search score (vertical axis) per search attempt (horizontal axis) for each participant (Kobito.4jPMyn.png)]

About the score of common tasks

The red line is the result for the common task. The senior reached the target information on the sixth search, but the other three ran out of time.

The senior's search score dropped sharply from the 4th to the 5th search, and the query itself turned out to have changed drastically at that point. Early in the search only abstract words were available, but as the search went on the senior found specific words to search for.

In contrast, A and C showed no large change in their search queries, which suggests they were stuck. In B's data (the blue line), the score dropped from the 5th to the 6th search, but the only change was rewriting the query from English to Japanese. Whether this should count as a stalled search is debatable.

About the score of normal search

Reading the graph, I got the impression that the senior, who has strong research skills, tends to have low search scores, but there is too little data, so further investigation is needed to confirm whether this holds.

Future work

・ Improvement of the score calculation method
The search score, which I assumed would reflect the tendency toward the lost state, still seems potentially useful, so I would like to improve how it is calculated.

・ Service using the search score
To help resolve the lost state, I would like to set a threshold on the search score and build a service that, when the score exceeds it, notifies the user and encourages them to work with others.
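The threshold idea could start from something as small as the sketch below. Everything in it is hypothetical: the function name, the threshold of 0.8 and the requirement of three consecutive high scores are placeholder values chosen for illustration, not values derived from the experiment.

```python
def first_lost_attempt(scores, threshold=0.8, run_length=3):
    """Return the 1-based attempt number at which to notify the user.

    A notification is due once the score has exceeded `threshold` for
    `run_length` consecutive attempts. Returns None if that never happens.
    Both default values are arbitrary placeholders.
    """
    run = 0  # number of consecutive attempts above the threshold so far
    for attempt, score in enumerate(scores, start=1):
        run = run + 1 if score > threshold else 0
        if run >= run_length:
            return attempt
    return None
```

Requiring several consecutive high scores, rather than a single one, is meant to avoid notifying the user after one minor rewording of a query.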
