Since I had been thinking about the above, I wrote a little bit of code for it, and I will share it here. (This time, it is assumed that the handle names are already known.)
As for scraping, there are basically various rules you have to follow, so please read up on them first.
Also, please be very careful when running this, since it puts load on the other party's server.
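One way to check what a site allows is its robots.txt. Here is a minimal sketch using the standard library's urllib.robotparser (the URL and path below are only examples, and passing robots.txt is not by itself permission to scrape):

import urllib.robotparser

# Minimal sketch: check robots.txt before fetching a URL.
# The target URL and the "*" user agent here are only examples.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.google.com/search?q=example"))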
The flow this time is:
Put the handle names in the first column of Excel
↓
Read them from Excel
↓
Search Google for each one
↓
Extract the information (account name) from the results
↓
Write the account names into the third column of Excel
4 modules to use
I created a scraype directory in Documents in my local environment.
Let's install them right away with pip. Of the four, time is part of the Python standard library, so only the other three actually need to be installed.
pip3 install requests
pip3 install openpyxl
pip3 install BeautifulSoup4
Then import them in the file created earlier.
handle_name_search.py
import requests
import openpyxl
from bs4 import BeautifulSoup as bs
import time
Set up the Excel file as above, with the handle names in column A, save it as handle.xlsx, and put it in the scraype folder.
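If you prefer to create the file from code rather than by hand, a minimal sketch with openpyxl could look like this (the handle names below are just placeholders):

import openpyxl

# Minimal sketch: create handle.xlsx with handle names in column A.
# The handle names are placeholders; use your own list.
wb = openpyxl.Workbook()
sheet1 = wb.active
sheet1.title = 'Sheet1'
handles = ['example_handle1', 'example_handle2']
for i, handle in enumerate(handles, start=1):
    sheet1.cell(row=i, column=1).value = handle
wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')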
Use the openpyxl module to load the file from the local environment and work with Sheet1.
handle_name_search.py
wb = openpyxl.load_workbook('/Users/{My local}/Documents/scraype/handle.xlsx')
sheet1=wb['Sheet1']
This completes the setup of Sheet1, so let's grab every cell in column A.
handle_name_search.py
for i in range(0,sheet1.max_row):
    print(sheet1.cell(row=i+1,column=1).value)
If you run this, every value in column A will be printed from Excel! Now let's search for each of them!
handle_name_search.py
req = requests.get("https//www.google.com/search?q=" + sheet1.cell(row = i+1, column=1).value)
handle_name_search.py
req = req.text
soup = bs(req,"html.parser")
tags = soup.find_all("div",class_="ZINbbc xpd O9g5cc uUPGi")
if(tags[0].find("div",class_="BNeawe vvjwJb AP7Wnd") != None):
    title = tags[0].find("div",class_="BNeawe vvjwJb AP7Wnd").string
What I struggled with here is that the class names you see when inspecting in Chrome are different from the classes bs4 actually receives, so I looked at soup.prettify() and confirmed that the Google search result title is inside an element with the class BNeawe vvjwJb AP7Wnd.
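If you want to see this for yourself, one simple way (just a sketch; the output filename is arbitrary) is to write the HTML that requests actually received to a file and search it for the title:

# Sketch: dump the HTML that requests received, then search it for the
# class names that bs4 actually sees (they differ from Chrome's DevTools view).
with open('google_result.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())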
handle_name_search.py
if "(@" in title:
title = title.split('(@')[0]
else:
if "- Twitter" in title:
title = title.split('-')[0]
if "✓" in title:
title = title.split('✓')[0]
handle_name_search.py
sheet1.cell(row=i+1,column=3).value = title
wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')
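To make it easier to see what the splitting above does, here is the same logic pulled out into a small function and applied to a couple of made-up example titles (the strings are purely illustrative):

# Sketch: the title-cleaning logic above as a standalone function.
def clean_title(title):
    if "(@" in title:
        return title.split('(@')[0]
    if "- Twitter" in title:
        title = title.split('-')[0]
    if "✓" in title:
        title = title.split('✓')[0]
    return title

# Made-up example titles, purely for illustration.
print(clean_title("Example Name (@example_handle) | Twitter"))   # -> Example Name
print(clean_title("Example Name ✓ - Twitter"))                    # -> Example Name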
handle_name_search.py
import requests
import openpyxl
from bs4 import BeautifulSoup as bs
import time
## Access the local handle.xlsx (handle names entered in the first column)
wb = openpyxl.load_workbook('/Users/{Your local name}/Documents/scraype/handle.xlsx')
sheet1=wb['Sheet1']

## Get the handle names from the first column of Excel and output the account names to the third column
for i in range(0,sheet1.max_row):
    time.sleep(1)
    print(sheet1.cell(row=i+1,column=1).value)
    req = requests.get("https://www.google.com/search?q=" + sheet1.cell(row=i+1,column=1).value)
    req = req.text
    soup = bs(req,"html.parser")
    tags = soup.find_all("div", class_="ZINbbc xpd O9g5cc uUPGi")
    if(tags[0].find("div",class_="BNeawe vvjwJb AP7Wnd") != None):
        title = tags[0].find("div", class_="BNeawe vvjwJb AP7Wnd").string
        if "(@" in title:
            title = title.split('(@')[0]
        else:
            if "- Twitter" in title:
                title = title.split('-')[0]
            if "✓" in title:
                title = title.split('✓')[0]
        print(title)
        sheet1.cell(row=i+1,column=3).value = title
        wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')

wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')
After that, run it and check: are the account names output in the third column?
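One thing the code above does not handle: if a handle name contains spaces or special characters, building the URL by simple string concatenation can produce a broken query. A safer variation (just a sketch) is to let requests do the URL encoding:

import requests

# Sketch: let requests URL-encode the query instead of concatenating strings.
handle = "example handle"  # placeholder value
req = requests.get("https://www.google.com/search", params={"q": handle})
print(req.url)  # the query string is percent-encoded automatically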
I put a sleep in the loop so as not to put too much load on the server, but please be sure to follow the rules when scraping.
That's it for scraping.