If you do any scraping, you have probably had a CAPTCHA appear and stop your program. (Only people who have will be reading this article anyway.) There are ways to avoid CAPTCHAs, such as making the crawler behave less like a bot or spreading requests across IP addresses, but this time I will simply solve the CAPTCHA head-on. Of course, as an engineer I want the program to solve it automatically rather than solving it by hand. Machine learning has high learning and setup costs, and I want something easier. A service called 2Captcha makes that possible. There are many similar services, so find the one that suits you best. I found a Python article but no Ruby article, so I wrote this one.
2Captcha is a service for getting past CAPTCHAs, and by using its API you can automate the authentication step. It is a paid service, but for reCAPTCHA v2 it costs only $2.99 per 1,000 requests. For the record, I have no financial relationship with 2Captcha and this is not a promotion.
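As a rough illustration of what "using the API" means here: a job is submitted to in.php, which replies with an ID, and the answer is then fetched from res.php using that ID. The sketch below only builds the two request URLs and parses an "OK|..." reply. The helper names (build_submit_url, build_result_url, parse_ok) are my own and not part of any 2Captcha library; the endpoints and parameter names are the ones used in crawler.rb in this article.

```ruby
require 'uri'

# Hypothetical helper: URL that submits a reCAPTCHA v2 job to 2Captcha.
# encode_www_form escapes the page URL so it is safe inside a query string.
def build_submit_url(api_key, sitekey, page_url)
  query = URI.encode_www_form(
    key: api_key,
    method: 'userrecaptcha',
    googlekey: sitekey,
    pageurl: page_url
  )
  "https://2captcha.com/in.php?#{query}"
end

# Hypothetical helper: URL that asks for the result of a submitted job.
def build_result_url(api_key, job_id)
  query = URI.encode_www_form(key: api_key, action: 'get', id: job_id)
  "https://2captcha.com/res.php?#{query}"
end

# Hypothetical helper: returns the part after "OK|", or nil for any other reply.
def parse_ok(body)
  body.to_s.strip[/\AOK\|(.+)\z/m, 1]
end
```

The same parse_ok works for both steps: after in.php it yields the job ID, and after res.php it yields the token to put into the page.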
chrome_remote is a library that lets you drive a Chrome instance from Ruby. Please refer to its explanation page and repository for detailed usage. A premise of scraping is to make CAPTCHAs less likely to appear in the first place. Unlike Selenium and similar tools, chrome_remote drives an ordinary Chrome as it is, so it should be harder to detect as a bot. (I want to verify the difference soon.)
In this article we will get past the reCAPTCHA demo page. For creating a 2Captcha account and obtaining an API key, please refer to an earlier author's article.
key.yaml
---
# The key name keeps the spelling "2Capthca" because crawler.rb reads it under that name
:2Capthca: YOUR_2CAPTCHA_API_KEY
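A minimal sketch of reading this file a little more defensively, assuming the key name is kept exactly as written in key.yaml. The helper name load_api_key is hypothetical; it takes the YAML text rather than a path so that it can be tried without a file on disk.

```ruby
require 'yaml'

# Hypothetical helper: extract the 2Captcha API key from YAML text.
# The key name is spelled "2Capthca" to match key.yaml and crawler.rb.
def load_api_key(yaml_text)
  # The key is written as a Ruby symbol (:2Capthca), so Symbol must be permitted.
  config = YAML.safe_load(yaml_text, permitted_classes: [Symbol]) || {}
  key = config[:"2Capthca"].to_s.strip
  raise ArgumentError, 'key.yaml has no :2Capthca entry' if key.empty?

  key
end

# Usage with the real file:
#   api_key = load_api_key(File.read('key.yaml'))
```

Failing early with a clear message is easier to debug than sending an empty key to the API. Quoting the value in key.yaml also keeps YAML from reading an all-digit key as a number.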
Start Chrome with the remote debugging port open (for Mac)
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 &
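Before running the crawler it can help to confirm that Chrome is really listening on the debugging port. This is a small sketch, assuming Chrome's DevTools HTTP endpoint /json/version on the port chosen in the launch command; the helper name devtools_version is my own and is not part of chrome_remote.

```ruby
require 'net/http'
require 'json'

# Hypothetical helper: returns Chrome's version info as a Hash,
# or nil if nothing usable answers on the debugging port.
def devtools_version(host: '127.0.0.1', port: 9222)
  # Passing nil as the proxy address keeps this local request off any HTTP proxy.
  http = Net::HTTP.new(host, port, nil)
  http.open_timeout = 2
  http.read_timeout = 2
  JSON.parse(http.get('/json/version').body)
rescue SystemCallError, SocketError, Timeout::Error, EOFError, TypeError, JSON::ParserError
  nil
end

# Usage:
#   info = devtools_version
#   puts(info ? "Chrome is listening: #{info['Browser']}" : 'Chrome is not reachable on port 9222')
```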
Gemfile
source "https://rubygems.org"
gem 'nokogiri'
gem 'chrome_remote'
bundle install
crawler.rb
require 'json'
require 'net/http'
require 'nokogiri'
require 'chrome_remote'
require 'yaml'

class CaptchaDetectedException < StandardError; end

class ChromeController
  SOLVED = 'CAPTCHA solved'
  NOT_SOLVED = 'CAPTCHA not solved'

  def initialize
    @chrome = ChromeRemote.client
    # Enable network and page events
    @chrome.send_cmd "Network.enable"
    @chrome.send_cmd "Page.enable"
  end

  # Open a page and deal with a CAPTCHA if one is shown
  def open(url)
    move_to url
    captcha_detect
  end

  def reload_page
    sleep 1
    @chrome.send_cmd "Page.reload", ignoreCache: false
    wait_event_fired
  end

  def execute_js(js)
    @chrome.send_cmd "Runtime.evaluate", expression: js
  end

  def wait_event_fired
    @chrome.wait_for "Page.loadEventFired"
  end

  # Navigate to a page
  def move_to(url)
    sleep 1
    @chrome.send_cmd "Page.navigate", url: url
    wait_event_fired
  end

  # Get the HTML of the current page
  def get_html
    response = execute_js 'document.getElementsByTagName("html")[0].innerHTML'
    '<html>' + response['result']['value'] + '</html>'
  end

  def captcha_detect
    bot_detect_cnt = 0
    begin
      html = get_html
      raise CaptchaDetectedException, 'CAPTCHA detected' if html.include?("captcha")
    rescue CaptchaDetectedException => e
      puts e.message
      bot_detect_cnt += 1
      puts "CAPTCHA solve attempt #{bot_detect_cnt}"
      doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
      return if captcha_solve(doc) == SOLVED
      reload_page
      retry if bot_detect_cnt < 3
      puts 'Could not solve the CAPTCHA. Exiting.'
      exit
    end
    puts 'No CAPTCHA was found'
  end

  def captcha_solve(doc)
    # A successful submission returns "OK|<request id>"
    id = request_id(doc)[/OK\|(\d+)/, 1]
    return false unless id
    solution = request_solution(id)
    return false unless solution
    submit_solution(solution)
    result = captcha_result
    puts result
    result
  end

  def request_id(doc)
    # Read the API key
    @key = YAML.load_file("key.yaml")[:"2Capthca"]
    # Get the value of the data-sitekey attribute
    googlekey = doc.at_css('#recaptcha-demo')["data-sitekey"]
    pageurl = execute_js("location.href")['result']['value']
    # encode_www_form escapes the page URL so it survives as a query parameter
    query = URI.encode_www_form(key: @key, method: "userrecaptcha", googlekey: googlekey, pageurl: pageurl)
    # Ask 2Captcha to solve the CAPTCHA
    fetch_url("https://2captcha.com/in.php?#{query}")
  end

  def request_solution(id)
    query = URI.encode_www_form(key: @key, action: "get", id: id)
    response_url = "https://2captcha.com/res.php?#{query}"
    sleep 15
    retry_cnt = 0
    begin
      sleep 5
      # Fetch the solution token
      response_str = fetch_url(response_url)
      # "CAPCHA_NOT_READY" is spelled this way in the 2Captcha API response
      raise 'CAPTCHA not solved yet' if response_str.include?('CAPCHA_NOT_READY')
    rescue => e
      puts e.message
      retry_cnt += 1
      puts "retry #{retry_cnt}"
      retry if retry_cnt < 10
      return false
    end
    response_str.slice(/OK\|(.*)/, 1)
  end

  def submit_solution(solution)
    # Put the solution token into the designated textarea.
    # to_json turns the token into a safely quoted JavaScript string literal.
    execute_js("document.getElementById('g-recaptcha-response').innerHTML = #{solution.to_json};")
    sleep 1
    # Click the submit button
    execute_js("document.getElementById('recaptcha-demo-submit').click();")
  end

  def captcha_result
    sleep 1
    html = get_html
    doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
    doc.at_css('.recaptcha-success') ? SOLVED : NOT_SOLVED
  end

  def fetch_url(url)
    sleep 1
    # Net::HTTP instead of shelling out to curl, so the URL is never interpreted by a shell
    Net::HTTP.get(URI(url))
  end
end

crawler = ChromeController.new
url = 'https://www.google.com/recaptcha/api2/demo'
crawler.open(url)
When you run the program, it will access the reCAPTCHA demo page and try to break through the CAPTCHA.
bundle exec ruby crawler.rb
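The waiting step in request_solution can also be written without raise and retry. The following is a sketch under my own assumptions, not the code used in crawler.rb: poll_solution and its fetch and sleeper parameters are hypothetical names, and the fetcher is passed in as a lambda so that the loop can be tried without calling the real API.

```ruby
# Hypothetical helper: call `fetch` until it returns "OK|<token>".
# Returns the token, or nil if the CAPTCHA is still unsolved after max_attempts.
def poll_solution(fetch, max_attempts: 10, interval: 5, sleeper: ->(sec) { sleep sec })
  max_attempts.times do
    body = fetch.call.to_s.strip
    return body.delete_prefix('OK|') if body.start_with?('OK|')
    # "CAPCHA_NOT_READY" (sic) means the job is still being worked on.
    raise "2Captcha returned an error: #{body}" unless body == 'CAPCHA_NOT_READY'

    sleeper.call(interval)
  end
  nil
end

# Example wiring, with response_url built as in request_solution:
#   token = poll_solution(-> { fetch_url(response_url) })
```

Unlike the loop in crawler.rb, any reply other than OK or CAPCHA_NOT_READY is treated here as a hard error rather than retried, which avoids waiting through all ten attempts when, for example, the API key is wrong.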
Depending on the purpose and manner of your scraping, and on how you handle the data you obtain, you risk violating copyright law or personal information protection law. I wish you all a happy scraping life.