I mainly work on crawler development. My job gives me plenty of opportunities to collect data from a variety of sites, and I struggle with access blocks every day. As a memo to myself, this post summarizes scraping techniques and countermeasures for avoiding access blocks. The scraping tools covered are Ruby's open-uri, the curl command, and Selenium. Needless to say, comply with the rules of the crawled site and all related laws.
There are various kinds of access blocks; avoid each of them with the appropriate technique.
IP
1: Access frequency. If you send requests to a site too frequently, you may be judged to be a robot. Solution: increase the sleep time. Since number of crawls × sleep ≈ total time required, take the longest sleep that still keeps the total time within what you can tolerate.
2: Request pace. If you send requests at exactly the same pace, you may be judged to be a robot. Solution: set a random sleep time (see the sketch after this list).
3: Per-IP blocking. If requests keep arriving at the same site from the same IP, that IP may be blocked. Solution: watch the number of consecutive connections and the interval per IP, and if necessary set up an SSH stepping-stone server and rotate IPs. Note that even different sites often share a CDN such as Akamai; if Akamai blocks your IP, both sites can become unreachable at once.
# ruby open-uri (socksify gem)
require 'socksify'
require 'open-uri'
Socksify::proxy("127.0.0.1", port) { URI.open(url).read }
# curl
curl -x socks5h://localhost:port url
# selenium (Chrome command-line option)
selenium_option = ["--proxy-server=socks5://127.0.0.1:port"]
4: IP reputation. Some sites block access from specific IP ranges outright. Solution: if you are crawling from a data center (IDCF, AWS, etc.), try changing the apparent provider of your IP by going through an SSH stepping-stone server.
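As a minimal sketch of points 1 and 2 combined (urls and the fetch method here are placeholders for your own crawl loop):
# Sleep a different, random interval before every request.
urls.each do |url|
  sleep(rand(5.0..15.0))   # 5-15 seconds; tune to what the site tolerates
  fetch(url)               # placeholder for your actual request code
end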
HTTP Header
1: Default HTTP headers. The server receiving the request recognizes an unnatural HTTP header as suspicious and blocks it if necessary. The default headers sent by open-uri, curl, and the like are obviously those of a bot. Solution: specify User-Agent, Accept-Language, etc. to match what a browser sends. You can capture the headers of a real browser access via Chrome → Developer Tools → Network tab → Copy → Copy as cURL.
2: Access frequency from the same user. If the server decides from the HTTP headers (or the IP) that the same user is over-accessing, it may block you. Solution: try rotating the User-Agent and other headers, as in the sketch below.
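Putting both points together, here is a minimal open-uri sketch that sends browser-like headers and rotates the User-Agent per request (the User-Agent strings are examples; copy current values from your own browser, and url is a placeholder):
require 'open-uri'

# Example User-Agent pool; in practice, copy real values from your browser.
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
]

html = URI.open(url,
  "User-Agent"      => USER_AGENTS.sample,        # rotate per request
  "Accept-Language" => "ja,en-US;q=0.9,en;q=0.8"  # what a browser would send
).read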
Captcha
Some sites can detect bot-like behavior and present a challenge that humans solve easily but bots do not. Solution 1: avoid triggering the Captcha test in the first place. Try various conditions: lengthen the sleep, change the IP, remove the webdriver property, and drop the headless option if you are using Selenium (see the Selenium sketch below).
Solution 2: solve the test automatically, using a captcha-solving API or image-recognition techniques based on machine learning / deep learning.
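Returning to Solution 1: with Ruby's selenium-webdriver gem, Chrome can be started without the headless flag and with the most obvious automation hint suppressed (a sketch; which flags actually avoid detection varies by Chrome version, and url is a placeholder):
require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
# Deliberately NOT adding --headless, so a real window is rendered.
# This flag keeps Chrome from exposing navigator.webdriver = true.
options.add_argument('--disable-blink-features=AutomationControlled')

driver = Selenium::WebDriver.for(:chrome, options: options)
driver.navigate.to url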
If you're using Selenium, the site can easily tell that the session is automated with code like the following (run on the site side).
var isAutomated = navigator.webdriver;
if (isAutomated) {
  blockAccess();
}
Run the following code to remove the webdriver property.
delete Object.getPrototypeOf(navigator).webdriver;
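With Ruby's selenium-webdriver, that snippet can be injected after each page load (a sketch; it only helps against checks that run after this script executes; registering the script before page load via the Chrome DevTools Protocol is more robust if your Selenium version supports it):
# Remove the webdriver property that the current page's scripts can see.
driver.execute_script("delete Object.getPrototypeOf(navigator).webdriver;")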
If you are running Selenium headless, the site may block you for that alone: headless mode is practically proof that no human is present, and it can be detected easily with code like the following. For sites that use bot-prevention services such as Distil or Geetest, headless won't get through at all, so if you want to run on a server you need a GUI.
navigator.permissions.query({name: 'notifications'}).then(function(permissionStatus) {
  if (Notification.permission === 'denied' && permissionStatus.state === 'prompt') {
    console.log("Headless Chrome");
  } else {
    console.log("Not Headless Chrome");
  }
});
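One common workaround for the server case (my own suggestion, not from the original text) is to run Chrome non-headless against a virtual display. In Ruby, the headless gem wraps Xvfb:
require 'headless'          # gem that manages an Xvfb virtual display
require 'selenium-webdriver'

headless = Headless.new     # start Xvfb; requires Xvfb installed on the server
headless.start

# Chrome launched WITHOUT --headless now renders into the virtual display.
driver = Selenium::WebDriver.for(:chrome)
driver.navigate.to url      # url is a placeholder

driver.quit
headless.destroy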
Look up the IP address for a domain name with the dig command: dig <domain name>. Example: dig www.armaniexchange.com. If the CDN is Akamai, you'll find an akamaiedge hostname in the ANSWER section.
Reverse lookup from an IP address with the dig command:
dig -x <IP address>
Check your global IP address, then reverse-resolve it with dig's -x option. The domain name in the ANSWER section tells you which provider you are on and whether port forwarding is possible: IDCF: idcfcloud.net / AWS EC2: amazonaws.com / FLET'S: mesh.ad.jp.
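The same check can be scripted with Ruby's standard Resolv library (a sketch; the IP is a documentation placeholder, substitute your own global IP):
require 'resolv'

ip = "203.0.113.10"          # placeholder: your global IP address
host = Resolv.getname(ip)    # reverse (PTR) lookup, same idea as dig -x;
                             # raises Resolv::ResolvError if no PTR record exists

case host
when /idcfcloud\.net\z/ then puts "IDCF"
when /amazonaws\.com\z/ then puts "AWS EC2"
when /mesh\.ad\.jp\z/   then puts "FLET'S"
else puts "unknown provider: #{host}"
end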
If you go through an SSH server as a stepping stone, the destination web server sees the connection as coming from the SSH server's IP address. Create a tunnel to the SSH server with SSH's dynamic forwarding feature and use it as a SOCKS proxy:
ssh -D <port number> <username>@<hostname>
For IP rotation you need to launch multiple SOCKS proxies and pick a port at random per request. The dynamic_ports method runs the Linux command pgrep -fal ssh to list the running ssh processes, extracts the -D port from each, and returns the SOCKS proxy ports as an array; pick a single one at random with Array#sample.
dynamic_ports.rb
def dynamic_ports
  # pgrep -fal prints "PID full-command-line" for each matching process
  lines = `pgrep -fal ssh`.split("\n")
  ports = []
  lines.each do |line|
    opts = line.split
    d_option_index = opts.find_index("-D")
    # The first token is always the PID, so a real -D never sits at index 0.
    # (find_index returns nil when the process has no -D option.)
    if d_option_index && d_option_index > 0
      port = opts[d_option_index + 1].to_i
      next if port <= 0   # skip malformed or missing port arguments
      ports << port
    end
  end
  ports
end
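Combining dynamic_ports with the socksify example from earlier, each request can leave through a randomly chosen tunnel (a minimal sketch; url is a placeholder):
require 'socksify'
require 'open-uri'

port = dynamic_ports.sample   # pick one running SOCKS proxy at random
raise "no ssh tunnels running" if port.nil?

html = Socksify::proxy("127.0.0.1", port) do
  URI.open(url).read          # this request goes out via the chosen tunnel
end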
Name resolution on the local machine:
curl --socks5 localhost:port url
Name resolution on the proxy server side:
curl --socks5-hostname localhost:port url
curl -x socks5h://localhost:port url
chrome_option = ["--proxy-server=socks5://127.0.0.1:port"]
Access blocks vary from site to site, and so do the countermeasures. I want to build up the technical skills to be able to say: there's no site I can't scrape!