・ Mac ・ Python3
Create a fortnite directory on your desktop, then create an images folder (for the downloaded images) and scraping.py inside it.
fortnite
├── scraping.py
└── images
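The setup above can also be done from a terminal; a minimal sketch, run from your Desktop (directory names as in the article):

```shell
# Create the project directory, the images folder, and an empty scraping.py
mkdir -p fortnite/images
touch fortnite/scraping.py
```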
Build a virtual environment in the directory.
python3 -m venv .
source bin/activate
Install the required packages.
pip install beautifulsoup4
pip install requests
pip install lxml
We'll scrape Fortnite images from Yahoo's image search results: https://search.yahoo.co.jp/image/search?p=%E3%83%95%E3%82%A9%E3%83%BC%E3%83%88%E3%83%8A%E3%82%A4%E3%83%88&ei=UTF-8&b=1 Each results page shows 20 images (the b parameter is the 1-based offset of the first result), and paging through confirms there are well over 100 images in total. We'll scrape from here and store the files in the images folder.
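As a quick sanity check of the paging, here is how page numbers map onto the b parameter, assuming 20 results per page (the page size the script uses):

```python
# b is a 1-based offset: page 0 starts at b=1, page 1 at b=21, and so on
PAGE_SIZE = 20
offsets = [page * PAGE_SIZE + 1 for page in range(6)]
print(offsets)  # [1, 21, 41, 61, 81, 101]
```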
```python:scraping.py
from bs4 import BeautifulSoup
import requests
import os
import time


def main():
    # 20 images per page; page_key is the offset used to request the next page
    page_key = 0
    # Counter used to number the saved images
    num_m = 0
    for _ in range(6):
        URL = "https://search.yahoo.co.jp/image/search?p=%E3%83%95%E3%82%A9%E3%83%BC%E3%83%88%E3%83%8A%E3%82%A4%E3%83%88&ei=UTF-8&b={}".format(page_key + 1)
        res = requests.get(URL)
        res.encoding = res.apparent_encoding
        html_doc = res.text
        soup = BeautifulSoup(html_doc, "lxml")
        # Collect the src attribute of every img tag inside div.gridmodule
        img_urls = []
        for grid in soup.find_all("div", class_="gridmodule"):
            for img in grid.find_all("img"):
                img_urls.append(img.get("src"))
        for url in img_urls:
            image = requests.get(url)
            # Save with an absolute path based on this script's location
            save_path = os.path.join(
                os.path.dirname(os.path.abspath(__file__)),
                "images",
                "{}.jpeg".format(num_m),
            )
            with open(save_path, "wb") as f:
                f.write(image.content)
            num_m += 1
            # Stop the save loop once the 101st image has been written
            if num_m == 101:
                break
        else:
            # The inner loop finished without break: wait 1 second to avoid
            # loading the server, advance to the next page, and continue
            time.sleep(1)
            page_key += 20
            continue
        # The inner loop was broken out of, so stop the outer loop as well
        break


if __name__ == '__main__':
    main()
```
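The parsing step can be tried offline on an inline HTML snippet; a minimal sketch (the example.com URLs are placeholders, and gridmodule is the class the script targets; Yahoo's real markup may change over time):

```python
from bs4 import BeautifulSoup

# Toy page with the same structure the script looks for
html_doc = """
<div class="gridmodule">
  <img src="https://example.com/a.jpeg">
  <img src="https://example.com/b.jpeg">
</div>
"""
soup = BeautifulSoup(html_doc, "lxml")
img_urls = []
for grid in soup.find_all("div", class_="gridmodule"):
    for img in grid.find_all("img"):
        img_urls.append(img.get("src"))
print(img_urls)
```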
- Searching for a likely location of the image URLs with Google Chrome's "Inspect" tool confirmed that they sit inside div tags with the class gridmodule, so we scrape the img tags from there.
- The value of each img tag's src attribute is read with get('src').
- The src attribute is a URL, but it is only a str, so we fetch it with requests to obtain a Response object holding the response data. Response objects expose text, encoding, status_code, and content; content is the one we need, because it returns the response body in binary form. (Reference) How to use Requests (Python library)
- The file is opened with an absolute path and written in wb (binary write) mode. (Reference) About Python and os operations
- Once the for statement has saved the last image, break out of the inner for statement and then the outer one as well. (Reference) Python for loop break (break condition)
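The break-out-of-both-loops pattern from the last point can be seen in isolation; a minimal sketch with dummy counts matching the script (6 pages, 20 images each, stop at the 101st):

```python
saved = 0
for page in range(6):
    for item in range(20):
        saved += 1
        if saved == 101:
            break  # stop the inner loop at the 101st image
    else:
        # the inner loop completed without break: go on to the next page
        continue
    break  # the inner loop broke, so stop the outer loop too
print(saved)  # 101
```

The for/else clause runs only when the loop finishes without hitting break, which is what lets the trailing break distinguish "ran out of images on this page" from "reached the target count".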