This article shows how to convert the logo on the Google search top page to text with OCR and display that text in an HTML page.
The same method can turn English books published online as page images into HTML, which you can then read in Japanese using Chrome's page translation feature.
bash
#For step 1
pip install requests beautifulsoup4
#For step 2 (brew assumes Homebrew is installed)
brew install tesseract
pip install pyocr pillow
#For step 3
pip install jinja2
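pyocr does not do the recognition itself; it drives an OCR engine that is installed separately. Before running step 2, it can help to confirm that the `tesseract` command is reachable. The following is a minimal sketch using only the standard library. The helper name `tesseract_available` is made up for this example, and it checks only the command-line executable on PATH.

```python
import shutil


def tesseract_available(command="tesseract"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found on PATH")
    else:
        print("tesseract not found; install it before running step 2")
```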
** Step 1: Download logo image **
python
import requests
from bs4 import BeautifulSoup
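#Get html
url = 'https://www.google.com'
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, 'html.parser')
#Extract the logo <img> element (the 'hplogo' id depends on Google's current markup)
img = soup.find('img', {'id': 'hplogo'})
if img is None:
    raise SystemExit("No <img id='hplogo'> found in the page")
#Create URL for image (assumes src is a site-relative path)
img_url = 'https://www.google.com' + img['src']
#Download image
r = requests.get(img_url)
r.raise_for_status()
#Save image
with open('hplogo.jpg', 'wb') as file:
    file.write(r.content)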
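Building the image URL by string concatenation works only while `src` is a site-relative path such as `/images/...`. On other pages `src` may be an absolute or protocol-relative URL. A possible alternative, sketched here with the standard library's `urljoin`, handles all three cases. The helper name `resolve_image_url` and the sample paths are illustrative, not taken from Google's actual markup.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


if __name__ == "__main__":
    base = "https://www.google.com"
    # Site-relative path: joined onto the page's scheme and host
    print(resolve_image_url(base, "/images/logo.png"))
    # Absolute URL: returned unchanged
    print(resolve_image_url(base, "https://example.com/logo.png"))
    # Protocol-relative URL: inherits the page's scheme
    print(resolve_image_url(base, "//example.com/logo.png"))
```

In step 1 this would replace the line that assigns `img_url`, for example `img_url = resolve_image_url(url, img['src'])`.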
** Step 2: Convert logo image to text with OCR **
python
from PIL import Image
import pyocr
import pyocr.builders
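#Pick the first OCR tool pyocr can find (Tesseract, if installed)
tools = pyocr.get_available_tools()
if not tools:
    raise SystemExit('No OCR tool found; install tesseract first')
tool = tools[0]
#Use a builder that returns the recognized text as a plain string
builder = pyocr.builders.TextBuilder()
#Load image
img = Image.open('hplogo.jpg')
#Run OCR
result = tool.image_to_string(img, builder=builder)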
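A single-word logo comes back as clean text, but OCR output from a full book page often contains blank lines and uneven spacing. As a rough sketch, assuming that simple whitespace normalization is enough for your pages, the text could be tidied before it goes into the template. The function name `clean_ocr_text` is made up for this example.

```python
import re


def clean_ocr_text(text):
    """Collapse whitespace runs inside each line and drop empty lines."""
    cleaned = []
    for line in text.splitlines():
        # Replace any run of whitespace with one space, then trim the ends
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    raw = "  Google \n\n\n  Search   engine  "
    print(clean_ocr_text(raw))
```

Calling `result = clean_ocr_text(result)` after the OCR step would apply it to the text used in step 3.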
** Step 3: Display the text in HTML **
python
from jinja2 import Template
#Generate view
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
<title>The Farther Reaches Of Human Nature</title>
</head>
<body>
<h1>{{ result }}</h1>
</body>
</html>
'''
template = Template(html)
data = { 'result': result }
view = template.render(data)
#Save
with open('hplogo.html', 'w', encoding='utf-8') as f:
    f.write(view)
When you open the generated hplogo.html in your browser, you should see the text "Google".
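For the book use case mentioned at the start, the OCR text from several page images has to end up in one HTML file, and any `<` or `&` in the recognized text should be escaped so it cannot break the markup. The sketch below is one way to do that. To stay dependency-free it swaps Jinja2 for plain string formatting and `html.escape` from the standard library. The function name `build_book_html` and the sample page texts are hypothetical.

```python
from html import escape


def build_book_html(title, page_texts):
    """Render OCR text from several pages as one HTML document.

    Each page becomes a <section>; each non-empty line becomes a <p>.
    """
    sections = []
    for number, text in enumerate(page_texts, start=1):
        paragraphs = [
            f"    <p>{escape(line.strip())}</p>"
            for line in text.splitlines()
            if line.strip()
        ]
        sections.append(
            f'  <section id="page-{number}">\n'
            + "\n".join(paragraphs)
            + "\n  </section>"
        )
    body = "\n".join(sections)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical OCR results for two pages
    pages = ["Chapter 1\nA <first> line", "Chapter 2"]
    with open("book.html", "w", encoding="utf-8") as f:
        f.write(build_book_html("Sample book", pages))
```

The sketch keeps `lang="en"` on the `<html>` element, as in step 3, so the page declares its source language for the browser's translation feature.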