I wanted to examine the characteristics of the titles of the top 10 sites that Google returns for a given keyword.
The code here was written in a Jupyter notebook, so you will need that installed if you want to follow along.
Below is the actual code. Since I have little experience with this, some of it may be inefficient; please bear that in mind.
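If any of the libraries used here are missing, they can be installed with pip (a typical setup; adjust to your own environment):
pip install requests beautifulsoup4 pandas janome scikit-learn numpy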
We will use only the following libraries, so import them all at the beginning.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from math import log
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.tokenfilter import POSStopFilter
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
Use BeautifulSoup to scrape Google search results.
#Request header
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
list_keywd = ['Keyword 1','Keyword 2']
input_num = 10
url = 'https://www.google.co.jp/search?num={}&q='.format(input_num) + ' '.join(list_keywd)
#Connect
response = requests.get(url, headers=headers)
#Check the HTTP status code (raises an exception on 4xx/5xx responses)
response.raise_for_status()
#Parse the retrieved HTML
soup = bs(response.content, 'html.parser')
#Get search result titles and links
ret_link = soup.select('.r > a')
#Select the h3 inside each link so that breadcrumb text is not picked up as a title
ret_link2 = soup.select('.r > a > h3')
title_list = []
url_list = []
r_list = []
cols = ['title','url']
for i in range(len(ret_link)):
    #Get the text part of the title
    title_txt = ret_link2[i].get_text()
    #Get only the link and strip the extra '/url?q=' prefix
    url_txt = ret_link[i].get('href').replace('/url?q=','')
    title_list.append(title_txt)
    url_list.append(url_txt)
    r_list.append([title_txt, url_txt])
#Display search results
df = pd.DataFrame(r_list,columns=cols)
df
The top 10 sites in the Google search results for the keyword were obtained as shown above.
Now that we have the titles of the top 10 sites, the next step is morphological analysis and word segmentation. Janome was used for the morphological analysis.
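The setup for the analyzer and the title dictionary is not shown above, so here is a minimal sketch of what the next loop assumes: an Analyzer a that keeps nouns by filtering out the other parts of speech with POSStopFilter, and a dict BLOG that maps each site index to its title from df, with N as the number of sites. The exact filter list is my assumption, not the original configuration.
#Assumed setup (not shown in the original post)
#Keep nouns by stopping every other part of speech
a = Analyzer(tokenizer=Tokenizer(),
             token_filters=[POSStopFilter(['記号', '助詞', '助動詞', '動詞',
                                           '形容詞', '副詞', '連体詞',
                                           '接続詞', '感動詞'])])
#BLOG maps each site index to its title; N is the number of documents
BLOG = {i: {"title": df['title'][i]} for i in range(len(df))}
N = len(BLOG)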
#Segment the nouns of each title and register them
work = []
WAKATI = []
for i in BLOG.keys():
    texts_flat = "".join(BLOG[i]["title"])
    tokens = a.analyze(texts_flat)
    work.append(' '.join([t.surface for t in tokens]))
    WAKATI.append(work[i].lower().split())
#Verification
for i in BLOG.keys():
    print("■WAKATI[{}]: {}".format(i, WAKATI[i]))
#Calculate word occurrence frequencies with scikit-learn
vectorizer = CountVectorizer()
#Bag-of-Words calculation
X = vectorizer.fit_transform(work)
#Note: in scikit-learn 1.2+ this method was replaced by get_feature_names_out()
WORDS = vectorizer.get_feature_names()
WORDS.sort()
print('=========================================')
print('All words')
print('=========================================')
print(WORDS)
The words were extracted as shown.
Next, I wrote functions to compute the tf, idf, and tf-idf values.
#Function definitions
def tf(t, d):
    #Frequency of term t within document d
    return d.count(t) / len(d)

def idf(t):
    #Count the documents that contain term t
    df = 0
    for wak in WAKATI:
        df += t in wak
    return log(N / df) + 1

def tfidf(t, d):
    return tf(t, d) * idf(t)

def highlight_negative(val):
    #Despite the name, this styles positive values in bold red
    if val > 0:
        return 'color: {0}; font-weight: bold'.format('red')
    else:
        return 'color: {0}'.format('black')
#End of function definitions
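Before using them, here is a quick sanity check of the formulas with made-up numbers (the token list and counts are hypothetical, not taken from the scraped data):
#Hypothetical sanity check of the formulas (values are made up)
#A term appearing once in a 5-token title has tf = 1/5
print(tf('mask', ['mask', 'a', 'b', 'c', 'd']))  #0.2
#A term found in 2 of N = 10 titles has idf = log(10/2) + 1
print(log(10 / 2) + 1)                           #2.6094379...
#Its tf-idf would then be 0.2 * 2.609... = 0.52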
First, let's look at the tf value.
#tf calculation
print('■ TF value for each site')
print('Frequency of appearance within one document')
ret = []
for i in range(N):
    ret.append([])
    d = WAKATI[i]
    for j in range(len(WORDS)):
        t = WORDS[j]
        if len(d) > 0:
            ret[-1].append(tf(t, d))
tf_ = pd.DataFrame(ret, columns=WORDS)
tf_.style.applymap(highlight_negative)
The tf values were obtained as shown below; words that actually appear in a title are shown in red. "Mask" and "cool" are used in many of the titles, so you may be able to guess the search keywords from the tf values alone.
The higher a word's idf value, the less often it appears in other titles, which makes it a rarer word. Conversely, the smaller the value, the more titles it is used in.
#idf calculation
ret = []
for i in range(len(WORDS)):
    t = WORDS[i]
    ret.append(idf(t))
idf_ = pd.DataFrame(ret, index=WORDS, columns=["IDF"])
idf_s = idf_.sort_values('IDF')
idf_s.style.applymap(highlight_negative)
As with the tf values, the idf values of "mask" and "cool", which appeared frequently, are naturally small. In this result, a value of 2.609438 corresponds to a word that appears on 2 sites (log(10/2) + 1), and 3.302585 to a word that appears on only 1 site (log(10/1) + 1).
The higher the tf-idf value, the more important a role the word plays in that title.
ret = []
for i in range(N):
    ret.append([])
    d = WAKATI[i]
    for j in range(len(WORDS)):
        t = WORDS[j]
        if len(d) > 0:
            ret[-1].append(tfidf(t, d))
tfidf_ = pd.DataFrame(ret, columns=WORDS)
tfidf_.style.applymap(highlight_negative)
The result looks like this: each row is a site, and the columns cover all the words. A word that does not appear on a site has the value 0. Looking at each site, words with large values can be said to play a big role in that site's title.
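As an extra step (my own addition, not part of the original analysis), you could pull the highest-scoring words for each site directly out of the tfidf_ DataFrame, for example:
#List the top 5 tf-idf words for each site
for i in range(N):
    top5 = tfidf_.iloc[i].sort_values(ascending=False).head(5)
    print('Site {}: {}'.format(i, ', '.join(top5.index)))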
You can examine the characteristics of titles in this way. It may be helpful when you are deciding which words to put into your own site's title.