This article is the day 11 entry of the Crawler / Web Scraping Advent Calendar 2016.
In March 2016, a book called Web Scraping with Python was published. I helped with it in a small way.
That book introduces the scraping tools selenium and phantomjs. When you build a scraper with them, phantomjs sometimes fails to shut down, depending on the environment. This article walks through one such case and its cause.
Consider the following code. It assumes that Python, Node, selenium, and phantomjs are already installed.
run.py:
from selenium.webdriver.phantomjs.webdriver import WebDriver
browser = WebDriver()
browser.close()
browser.quit()
print('Finished')
Run this run.py.
$ python run.py
Finished
If the environment is set up correctly, the script prints Finished to standard output and exits.
Now let's check whether a phantomjs process still exists. If you are affected by the problem, a phantomjs process will have been left behind.
$ ps aux | grep phantomjs
sximada 74272 100.0 0.0 2432804 2004 s006 S+ 4:41PM 0:00.01 grep --color phantomjs
sximada 74267 0.0 0.7 3647068 59976 s006 S 4:41PM 0:02.01 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp2vmrand3 --webdriver=50599
Sad, isn't it?
There are several similar posts on Stack Overflow about this issue, and some of the answers amount to "just run pkill phantomjs".
That seems like a fairly serious state of affairs.
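The ps aux | grep phantomjs check above can also be scripted. The following is a minimal sketch, assuming the 11-column ps aux layout shown in the output above; the function names find_phantomjs and running_phantomjs are made up for this illustration.

```python
import subprocess


def find_phantomjs(ps_output):
    """Return (pid, command) pairs for phantomjs processes in `ps aux` output."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # `ps aux` has 11 columns; COMMAND is last
        if len(fields) < 11:
            continue
        command = fields[10]
        if "phantomjs" not in command or command.startswith("grep "):
            continue  # skip unrelated processes and the grep itself
        found.append((int(fields[1]), command))
    return found


def running_phantomjs():
    """Run `ps aux` and report any phantomjs processes it lists."""
    result = subprocess.run(
        ["ps", "aux"], stdout=subprocess.PIPE, text=True, check=True
    )
    return find_phantomjs(result.stdout)
```

Calling running_phantomjs() after a scraper run returns an empty list when nothing was left behind, which is easier to assert on in a test than reading ps output by eye.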
The environment used above was set up as follows.
Installing the OS, Python, and Node is beside the point, so I will skip that explanation.
For selenium, a plain pip install selenium is enough.
The problem is phantomjs. Since it feels like a Node tool, it is tempting to run npm install phantomjs.
The home of phantomjs is https://github.com/ariya/phantomjs, but that is not what npm installs.
npm install phantomjs installs https://github.com/Medium/phantomjs instead.
As its repository description says, this is an NPM wrapper for installing phantomjs: a wrapper that lets you install and run phantomjs through npm.
In fact, if you do the following, you will probably run into the problem (although in some cases you will not). First, create package.json.
$ npm init .
Install phantomjs (github.com/Medium/phantomjs).
$ npm install phantomjs
zombie1.py:
from selenium.webdriver.phantomjs.webdriver import WebDriver
browser = WebDriver(executable_path='./node_modules/.bin/phantomjs') # MODIFIED
browser.close()
browser.quit()
print('Finished')
Run zombie1.py.
$ python zombie1.py
Finished
Check if the process remains.
(py3.5.2) $ ps aux | grep phantomjs
sximada 2426 0.0 0.0 2423392 408 s002 R+ 5:51PM 0:00.00 grep --color phantomjs
sximada 2421 0.0 0.6 3645988 46780 s002 S 5:51PM 0:01.56 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp8nobgxn7 --webdriver=50641
Yes, a process has been left behind.
What about phantomjs installed with homebrew?
$ brew install phantomjs
zombie2.py:
from selenium.webdriver.phantomjs.webdriver import WebDriver
browser = WebDriver(executable_path='/usr/local/bin/phantomjs')
browser.close()
browser.quit()
print('Finished')
Run zombie2.py.
$ python zombie2.py
Finished
Check if the process remains.
$ ps aux | grep phantomjs
sximada 3530 0.0 0.0 2432804 796 s002 R+ 6:11PM 0:00.00 grep --color phantomjs
$
There aren't any left. What's the difference?
Checking with the file command shows that the phantomjs installed by homebrew, /usr/local/bin/phantomjs, is an executable binary.
For convenience, I will call this one the binary version.
$ file /usr/local/bin/phantomjs
/usr/local/bin/phantomjs: Mach-O 64-bit executable x86_64
On the other hand, the file command shows that the phantomjs installed with npm is a text file. For convenience, I will call this one the npm version.
$ file node_modules/.bin/phantomjs
node_modules/.bin/phantomjs: a /usr/bin/env node script text executable, ASCII text
Its contents are the following Node.js script.
#!/usr/bin/env node
/**
* Script that will execute the downloaded phantomjs binary. stdio are
* forwarded to and from the child process.
*
* The following is for an ugly hack to avoid a problem where the installer
* finds the bin script npm creates during global installation.
*
* {NPM_INSTALL_MARKER}
*/
var path = require('path')
var spawn = require('child_process').spawn
var binPath = require(path.join(__dirname, '..', 'lib', 'phantomjs')).path
var args = process.argv.slice(2)
// For Node 0.6 compatibility, pipe the streams manually, instead of using
// `{ stdio: 'inherit' }`.
var cp = spawn(binPath, args)
cp.stdout.pipe(process.stdout)
cp.stderr.pipe(process.stderr)
process.stdin.pipe(cp.stdin)
cp.on('error', function (err) {
  console.error('Error executing phantom at', binPath)
  console.error(err.stack)
})
cp.on('exit', function(code){
  // Wait few ms for error to be printed.
  setTimeout(function(){
    process.exit(code)
  }, 20)
});
process.on('SIGTERM', function() {
  cp.kill('SIGTERM')
  process.exit(1)
})
The script runs the real binary as a child process with var cp = spawn(binPath, args).
There is a handler for SIGTERM near the end: when SIGTERM arrives, it sends SIGTERM to the child process and then exits.
Starting selenium with the binary version and with the npm version produces the following process trees.
Binary version:
$ pstree 3812
-+= 03812 sximada python zombie2.py
\--- 03815 sximada /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpu8trzjh0 --webdriver=50761
npm version:
$ pstree 3701
-+= 03701 sximada python zombie1.py
\-+- 03704 sximada node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp9c0y1sj7 --webdriver=50747
\--- 03705 sximada /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp9c0y1sj7 --webdriver=50747
Comparing the process tree with what is left after the script exits, it is the grandchild process at the bottom of the tree that remains.
$ pstree 4537
-+= 04537 sximada python zombie1.py
\-+- 04540 sximada node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406
\--- 04541 sximada /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406
$ ps aux | grep phantomjs
sximada 4554 0.0 0.0 2432804 632 s003 R+ 6:50PM 0:00.00 grep --color phantomjs
sximada 4541 0.0 0.6 3646488 47532 s002 S 6:49PM 0:05.84 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406
Let's start the wrapper by hand with a command like node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406 and then try kill -KILL on that process.
$ node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
[INFO - 2016-12-10T09:57:42.829Z] GhostDriver - Main - running on port 51448
Once the process has started, kill the node wrapper with SIGKILL.
$ ps -ef | grep phantom
501 4662 763 0 6:57PM ttys002 0:00.12 node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
501 4663 4662 0 6:57PM ttys002 0:01.73 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
501 4666 764 0 6:57PM ttys003 0:00.00 grep --color phantom
$ kill -KILL 4662
$ ps -ef | grep phantom
501 4663 1 0 6:57PM ttys002 0:03.63 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
501 4670 764 0 6:58PM ttys003 0:00.00 grep --color phantom
That reproduces it: the node wrapper is gone, and the phantomjs binary lives on with PID 1 as its new parent. This looks like the cause. Is selenium sending SIGKILL at shutdown?
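The same effect can be reproduced without phantomjs at all. The following is a minimal sketch, assuming a POSIX system and Python 3.7 or later. The WRAPPER script is a hypothetical Python stand-in for the npm wrapper (it is not the real script), and the helper names are made up for this illustration.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for the npm wrapper (not the real script): it starts
# a child, reports the child's PID, and on SIGTERM stops the child and exits.
WRAPPER = r"""
import signal, subprocess, sys, time

child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
stop_requested = False

def on_term(signum, frame):
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, on_term)
print(child.pid, flush=True)

while child.poll() is None:
    if stop_requested:
        child.terminate()
        child.wait()
        sys.exit(1)
    time.sleep(0.05)
"""


def is_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    return True


def grandchild_survives(sig):
    """Signal the wrapper with `sig` and report whether its child outlives it."""
    wrapper = subprocess.Popen(
        [sys.executable, "-c", WRAPPER], stdout=subprocess.PIPE, text=True
    )
    grandchild_pid = int(wrapper.stdout.readline())
    wrapper.send_signal(sig)
    wrapper.wait()
    wrapper.stdout.close()
    survived = is_alive(grandchild_pid)
    if survived:
        os.kill(grandchild_pid, signal.SIGKILL)  # clean up the orphan
    return survived


if __name__ == "__main__":
    print("after SIGKILL:", grandchild_survives(signal.SIGKILL))
    print("after SIGTERM:", grandchild_survives(signal.SIGTERM))
```

SIGKILL cannot be caught, so the wrapper gets no chance to run its handler and the process it started is simply orphaned. With SIGTERM the handler runs and takes the child down with it.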
From here, we will use the debugger via pdb.set_trace() to find out what Python is doing. Let's put pdb into zombie1.py and see how it behaves.
from selenium.webdriver.phantomjs.webdriver import WebDriver
browser = WebDriver(executable_path='./node_modules/.bin/phantomjs')
import pdb; pdb.set_trace()
browser.close()
browser.quit()
print('Finished')
Stepping through close(), the HTTP request turns out to be sent at selenium/webdriver/remote/remote_connection.py(470)_request().
464 if password_manager:
465 opener = url_request.build_opener(url_request.HTTPRedirectHandler(),
466 HttpErrorHandler(),
467 url_request.HTTPBasicAuthHandler(password_manager))
468 else:
469 opener = url_request.build_opener(url_request.HTTPRedirectHandler(),
470 HttpErrorHandler())
471 -> resp = opener.open(request, timeout=self._timeout)
472 statuscode = resp.code
The request being sent is as follows.
-> request = Request(url, data=body.encode('utf-8'), method=method)
(Pdb) p url
'http://127.0.0.1:51524/wd/hub/session/57277cb0-bec1-11e6-a0b1-31edd9b29650/window'
(Pdb) p body
'{"sessionId": "57277cb0-bec1-11e6-a0b1-31edd9b29650"}'
(Pdb) p method
'DELETE'
(Pdb)
Other than that, close() does not seem to do anything in particular.
What about quit()?
Inside self.service.stop(), which is called from selenium/webdriver/phantomjs/webdriver.py(76)quit(), there is a place where SIGTERM and SIGKILL are sent.
selenium/webdriver/common/service.py(154)stop():
(Pdb) list
149 stream.close()
150 except AttributeError:
151 pass
152 self.process.terminate()
153 self.process.kill()
154 -> self.process.wait()
155 self.process = None
156 except OSError:
157 # kill may not be available under windows environment
158 pass
159
As far as I can tell from the code, selenium sends SIGKILL immediately after sending SIGTERM.
But the npm version has a SIGTERM signal handler. Is this SIGKILL necessary in the first place?
As a test, comment out self.process.kill() and run the script again.
selenium/webdriver/common/service.py:
self.process.terminate()
# self.process.kill()  # commented out
self.process.wait()
Run it.
$ python zombie1.py
Finished
$ ps aux | grep phantomjs
sximada 5270 0.0 0.0 2424612 500 s002 R+ 7:34PM 0:00.00 grep --color phantomjs
$
No processes are left. So the grandchild process was surviving because self.process.kill() killed the child process.
Apparently, the npm version of phantomjs does receive the SIGTERM sent by self.process.terminate(), but the SIGKILL that follows right away kills it before its handler can forward SIGTERM to the grandchild process.
It feels like I fell into the gap between two components.
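One common way to avoid this kind of race is to give the process time to react to SIGTERM and to fall back to SIGKILL only on timeout. The following is a minimal sketch of that idea using only the standard library, assuming a POSIX system. It is not selenium's code, and stop_gracefully is a name made up for this illustration.

```python
import subprocess
import sys


def stop_gracefully(process, timeout=5.0):
    """Send SIGTERM first; use SIGKILL only if the process has not exited in time."""
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()


if __name__ == "__main__":
    sleeper = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print("return code:", stop_gracefully(sleeper))
```

With this ordering, a wrapper such as the npm script gets a chance to run its SIGTERM handler and stop its own child before anything harsher is sent.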
Things work fine if you do not go through npm, so avoid npm install phantomjs.
Instead, install it with homebrew or download it from http://phantomjs.org/download.html.
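To catch the npm wrapper before handing a path to selenium, a script can do a rough version of the file check used earlier. The following is a minimal sketch, assuming that a leading shebang (#!) is a good enough sign of a script wrapper. It does not identify the npm package specifically, and is_script_wrapper is a name made up for this illustration.

```python
import sys


def is_script_wrapper(path):
    """Return True if the file at `path` starts with a shebang (#!)."""
    with open(path, "rb") as f:
        return f.read(2) == b"#!"


if __name__ == "__main__":
    # Usage: python check_phantomjs.py /usr/local/bin/phantomjs ./node_modules/.bin/phantomjs
    for candidate in sys.argv[1:]:
        kind = "script wrapper" if is_script_wrapper(candidate) else "not a script"
        print(candidate, "->", kind)
```

Because open() follows symbolic links, this also works on entries under node_modules/.bin, which npm usually creates as links to the package's script.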
This felt like quite a detour. I hope no one else has to take the same one.