I used phantomjs from Python's selenium library and it became a zombie

This article is the Day 11 entry of the Crawler / Web Scraping Advent Calendar 2016.

In March 2016, a book called Web Scraping with Python was published. My contribution was small, but I helped with it.

Web Scraping with Python introduces selenium and phantomjs as scraping tools. When you build a scraper with them, depending on the environment, phantomjs sometimes cannot be shut down. In this article I would like to walk through one such case.

The phantomjs process remains for some reason

Run the following code. It is assumed that Python, Node, selenium, and phantomjs are already installed.

run.py

from selenium.webdriver.phantomjs.webdriver import WebDriver

browser = WebDriver()
browser.close()
browser.quit()
print('Finished')

Run this run.py.

$ python run.py
Finished

If your environment is set up correctly, the script prints Finished to standard output and exits.

Now let's check whether a phantomjs process still exists. If you are affected by the problem, a phantomjs process will have been left behind.

$ ps aux | grep phantomjs
sximada          74272 100.0  0.0  2432804   2004 s006  S+    4:41PM   0:00.01 grep --color phantomjs
sximada          74267   0.0  0.7  3647068  59976 s006  S     4:41PM   0:02.01 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp2vmrand3 --webdriver=50599

Sad, isn't it?

There are several similar posts on Stack Overflow about this issue, and some of the answers simply say to run pkill phantomjs. That gives a sense of how stubborn the problem is.
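The manual check with ps and grep can also be scripted. The following is a minimal sketch, assuming ps aux style output with the command in the eleventh column; the function name leftover_processes and the shortened cookie-file path in the sample are hypothetical, and the sample otherwise mirrors the transcript above.

```python
def leftover_processes(ps_output, name="phantomjs"):
    """Return (pid, command) pairs from `ps aux` output whose command mentions `name`.

    The grep process itself is skipped, as is the header line.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # the 11th field is the full command line
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        command = fields[10]
        if name in command and not command.startswith("grep"):
            found.append((int(fields[1]), command))
    return found


# Sample shaped like the transcript above (cookie-file path shortened).
SAMPLE = (
    "sximada 74272 100.0 0.0 2432804 2004 s006 S+ 4:41PM 0:00.01 grep --color phantomjs\n"
    "sximada 74267 0.0 0.7 3647068 59976 s006 S 4:41PM 0:02.01 "
    "/usr/local/bin/phantomjs --cookies-file=/tmp/cookies --webdriver=50599\n"
)
```

Calling leftover_processes(SAMPLE) returns a single entry, for PID 74267, which is the leftover phantomjs process.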

Building an environment that reproduces the problem

The above operating environment used the following.

Installing the OS, Python, and Node is beside the point, so I will skip the explanation. For selenium, a plain pip install selenium is enough.

The problem is phantomjs. Since it feels like a Node tool, it is tempting to run npm install phantomjs.

The upstream phantomjs repository is https://github.com/ariya/phantomjs, but that is not what npm installs. npm install phantomjs installs https://github.com/Medium/phantomjs. As its repository description says, this is an NPM wrapper for installing phantomjs: a wrapper that lets you install and run phantomjs through npm.

In fact, if you do the following, you will probably run into the problem (although in some cases you won't). First, create package.json.

$ npm init .

Install phantomjs (github.com/Medium/phantomjs).

$ npm install phantomjs

zombie1.py:

from selenium.webdriver.phantomjs.webdriver import WebDriver

browser = WebDriver(executable_path='./node_modules/.bin/phantomjs')  # MODIFIED
browser.close()
browser.quit()

print('Finished')

Run zombie1.py.

$ python zombie1.py
Finished

Check if the process remains.

(py3.5.2) $ ps aux | grep phantomjs
sximada           2426   0.0  0.0  2423392    408 s002  R+    5:51PM   0:00.00 grep --color phantomjs
sximada           2421   0.0  0.6  3645988  46780 s002  S     5:51PM   0:01.56 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp8nobgxn7 --webdriver=50641

Yes, a process has been left behind.

It does not occur with phantomjs installed with homebrew

What about phantomjs installed with homebrew?

$ brew install phantomjs

zombie2.py:

from selenium.webdriver.phantomjs.webdriver import WebDriver

browser = WebDriver(executable_path='/usr/local/bin/phantomjs')
browser.close()
browser.quit()

print('Finished')

Run zombie2.py.

$ python zombie2.py
Finished

Check if the process remains.

$ ps aux | grep phantomjs
sximada           3530   0.0  0.0  2432804    796 s002  R+    6:11PM   0:00.00 grep --color phantomjs
$

There aren't any left. What's the difference?

The phantomjs installed with homebrew is an executable binary file

If you check with the file command, the phantomjs that homebrew installed at /usr/local/bin/phantomjs is an executable binary file. For convenience, I will call this directly executed one the binary version.

$ file /usr/local/bin/phantomjs
/usr/local/bin/phantomjs: Mach-O 64-bit executable x86_64

The phantomjs installed with npm is a Node.js script

On the other hand, if you check the phantomjs installed with npm using the file command, it is a text file. For convenience, I will call this one the npm version.

$ file node_modules/.bin/phantomjs
node_modules/.bin/phantomjs: a /usr/bin/env node script text executable, ASCII text
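The same distinction can be checked from Python by looking for a shebang line. This is only a rough sketch of part of what the file command does, and the helper name shebang_interpreter is hypothetical.

```python
def shebang_interpreter(path):
    """Return the interpreter named on a '#!' first line, or None if there is none.

    A script such as the npm wrapper starts with '#!'; a native binary does not.
    """
    with open(path, "rb") as f:
        first_line = f.readline(256)
    if first_line.startswith(b"#!"):
        return first_line[2:].strip().decode("utf-8", errors="replace")
    return None
```

For a file whose first line is #!/usr/bin/env node, this returns the string /usr/bin/env node; for a Mach-O or ELF binary it returns None.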

Its contents are the following Node.js script.

#!/usr/bin/env node

/**
 * Script that will execute the downloaded phantomjs binary.  stdio are
 * forwarded to and from the child process.
 *
 * The following is for an ugly hack to avoid a problem where the installer
 * finds the bin script npm creates during global installation.
 *
 * {NPM_INSTALL_MARKER}
 */

var path = require('path')
var spawn = require('child_process').spawn

var binPath = require(path.join(__dirname, '..', 'lib', 'phantomjs')).path

var args = process.argv.slice(2)

// For Node 0.6 compatibility, pipe the streams manually, instead of using
// `{ stdio: 'inherit' }`.
var cp = spawn(binPath, args)
cp.stdout.pipe(process.stdout)
cp.stderr.pipe(process.stderr)
process.stdin.pipe(cp.stdin)

cp.on('error', function (err) {
  console.error('Error executing phantom at', binPath)
  console.error(err.stack)
})

cp.on('exit', function(code){
  // Wait few ms for error to be printed.
  setTimeout(function(){
    process.exit(code)
  }, 20)
});

process.on('SIGTERM', function() {
  cp.kill('SIGTERM')
  process.exit(1)
})

The script runs the real binary as a child process with var cp = spawn(binPath, args). Near the end there is a handler for SIGTERM: when SIGTERM arrives, it sends SIGTERM to the child process and then exits.

The binary version has a two-level process structure, and the npm version has a three-level process structure

If you start selenium using the binary version and the npm version, the process will have the following structure.

Binary version:

$ pstree 3812
-+= 03812 sximada python zombie2.py
 \--- 03815 sximada /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpu8trzjh0 --webdriver=50761

npm version:

$ pstree 3701
-+= 03701 sximada python zombie1.py
 \-+- 03704 sximada node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp9c0y1sj7 --webdriver=50747
   \--- 03705 sximada /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmp9c0y1sj7 --webdriver=50747

Looking at the processes, it is the grandchild process hanging at the bottom of this tree that gets left behind.

$ pstree 4537
-+= 04537 sximada python zombie1.py
 \-+- 04540 sximada node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406
   \--- 04541 sximada /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406
$ ps aux | grep phantomjs
sximada           4554   0.0  0.0  2432804    632 s003  R+    6:50PM   0:00.00 grep --color phantomjs
sximada           4541   0.0  0.6  3646488  47532 s002  S     6:49PM   0:05.84 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406
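The two-level and three-level structures can be expressed as (pid, ppid) pairs and walked programmatically. The sketch below uses the PIDs from the pstree output above; the function name descendants and the variable names are hypothetical.

```python
def descendants(rows, root_pid):
    """Return the PIDs below root_pid in breadth-first order.

    `rows` is an iterable of (pid, ppid) pairs, for example parsed from `ps -ef`.
    """
    children = {}
    for pid, ppid in rows:
        children.setdefault(ppid, []).append(pid)
    found = []
    queue = list(children.get(root_pid, []))
    while queue:
        pid = queue.pop(0)
        found.append(pid)
        queue.extend(children.get(pid, []))
    return found


# (pid, ppid) pairs taken from the pstree output above.
BINARY_VERSION = [(3815, 3812)]               # python -> phantomjs
NPM_VERSION = [(3704, 3701), (3705, 3704)]    # python -> node -> phantomjs
```

With the binary version, Python has one descendant (the phantomjs binary itself). With the npm version it has two: the node wrapper, and the phantomjs binary underneath it.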

Manually reproduce the phenomenon

Run node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpg1eq1xst --webdriver=51406 by hand, then try kill -KILL on that process.

$ node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
[INFO  - 2016-12-10T09:57:42.829Z] GhostDriver - Main - running on port 51448

When the process starts, kill it with SIGKILL.

$ ps -ef | grep phantom
  501  4662   763   0  6:57PM ttys002    0:00.12 node ./node_modules/.bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
  501  4663  4662   0  6:57PM ttys002    0:01.73 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
  501  4666   764   0  6:57PM ttys003    0:00.00 grep --color phantom
$ kill -KILL 4662
$ ps -ef | grep phantom
  501  4663     1   0  6:57PM ttys002    0:03.63 /usr/local/bin/phantomjs --cookies-file=/var/folders/hx/xp4thw0x7rj15r_2w57_wvfh0000gn/T/tmpoazqtmx7 --webdriver=51448
  501  4670   764   0  6:58PM ttys003    0:00.00 grep --color phantom

That reproduces it: the parent is gone, and the phantomjs binary has been re-parented to PID 1. So this is the cause. Is selenium sending SIGKILL at shutdown?
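The same effect can be reproduced without phantomjs. The following is a minimal sketch for POSIX systems, assuming Python 3.7 or later: a stand-in wrapper process starts a sleeping grandchild, the wrapper is killed with SIGKILL, and the grandchild is checked afterwards. All names here (WRAPPER_SOURCE, is_alive, orphan_demo) are hypothetical.

```python
import os
import signal
import subprocess
import sys

# Stand-in for the npm wrapper: it starts a grandchild, reports its PID, and waits.
WRAPPER_SOURCE = """
import subprocess, sys
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print(child.pid, flush=True)
child.wait()
"""


def is_alive(pid):
    """Return True if a process with this PID exists (signal 0 only checks)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def orphan_demo():
    """SIGKILL the wrapper and report whether its grandchild survived."""
    wrapper = subprocess.Popen(
        [sys.executable, "-c", WRAPPER_SOURCE], stdout=subprocess.PIPE, text=True
    )
    grandchild_pid = int(wrapper.stdout.readline())
    wrapper.kill()   # SIGKILL cannot be caught, so the wrapper forwards nothing
    wrapper.wait()
    wrapper.stdout.close()
    survived = is_alive(grandchild_pid)
    try:
        os.kill(grandchild_pid, signal.SIGTERM)  # clean up the orphan
    except ProcessLookupError:
        pass
    return survived
```

Because SIGKILL gives the wrapper no chance to run any handler, the grandchild keeps running after the wrapper is gone, just like the phantomjs binary in the transcript above.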

Behavior of selenium.webdriver.phantomjs.webdriver.WebDriver.close()

From here, we'll use the debugger via pdb.set_trace() to find out what Python is doing. Let's put pdb in zombie1.py and see how it works.

from selenium.webdriver.phantomjs.webdriver import WebDriver

browser = WebDriver(executable_path='./node_modules/.bin/phantomjs')
import pdb; pdb.set_trace()
browser.close()
browser.quit()

print('Finished')

Stepping through, it turns out that an HTTP request is sent at selenium/webdriver/remote/remote_connection.py(470)_request().

464                 if password_manager:
465                     opener = url_request.build_opener(url_request.HTTPRedirectHandler(),
466                                                       HttpErrorHandler(),
467                                                       url_request.HTTPBasicAuthHandler(password_manager))
468                 else:
469                     opener = url_request.build_opener(url_request.HTTPRedirectHandler(),
470                                                       HttpErrorHandler())
471  ->             resp = opener.open(request, timeout=self._timeout)
472                 statuscode = resp.code

The request being sent is as follows.

-> request = Request(url, data=body.encode('utf-8'), method=method)
(Pdb) p url
'http://127.0.0.1:51524/wd/hub/session/57277cb0-bec1-11e6-a0b1-31edd9b29650/window'
(Pdb) p body
'{"sessionId": "57277cb0-bec1-11e6-a0b1-31edd9b29650"}'
(Pdb) p method
'DELETE'
(Pdb)

Other than that, it doesn't seem to do anything in particular.
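In other words, close() amounts to a single HTTP DELETE against the local GhostDriver endpoint. The sketch below builds an equivalent request with the standard library, using the URL shape and body seen in the pdb session above. It is not selenium's own code, and the helper name build_close_window_request is hypothetical.

```python
import json
from urllib.request import Request


def build_close_window_request(port, session_id):
    """Build (but do not send) a DELETE request shaped like the one seen in pdb."""
    url = "http://127.0.0.1:{}/wd/hub/session/{}/window".format(port, session_id)
    body = json.dumps({"sessionId": session_id})
    return Request(url, data=body.encode("utf-8"), method="DELETE")
```

The request is only constructed here. Sending it would require a running phantomjs listening on that port.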

Behavior of selenium.webdriver.phantomjs.webdriver.WebDriver.quit()

What about quit()?

Inside self.service.stop(), called from selenium/webdriver/phantomjs/webdriver.py(76)quit(), there is a place where SIGTERM and SIGKILL are sent.

selenium/webdriver/common/service.py(154)stop():

(Pdb) list
149                             stream.close()
150                         except AttributeError:
151                             pass
152                     self.process.terminate()
153                     self.process.kill()
154  ->                 self.process.wait()
155                     self.process = None
156             except OSError:
157                 # kill may not be available under windows environment
158                 pass
159

As far as I can tell from the code, it sends SIGKILL immediately after sending SIGTERM.

But the npm version has a SIGTERM signal handler. Is this SIGKILL necessary in the first place? As a test, comment out self.process.kill() and run it again.

selenium/webdriver/common/service.py:

                self.process.terminate()
                # self.process.kill()   ##Comment out
                self.process.wait()

Now run it.

$ python  zombie1.py
Finished
$ ps aux | grep phantomjs
sximada           5270   0.0  0.0  2424612    500 s002  R+    7:34PM   0:00.00 grep --color phantomjs
$

There are no processes left. It seems that self.process.kill() was killing the child process and leaving the grandchild process behind.

The cause: the child process receives SIGKILL before its SIGTERM handler finishes

Apparently the npm version of phantomjs did receive the SIGTERM sent by self.process.terminate(), but it was killed by SIGKILL before its handler could forward SIGTERM to the grandchild process. It feels like I fell right into the gap between the two signals.
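One common way to avoid this kind of race is to give the process time to react to SIGTERM and to fall back to SIGKILL only if it is still alive. The following is a minimal sketch of that pattern with the standard library. It is not how selenium implements stop(), and the function name stop_gracefully and the five-second default are assumptions made for illustration.

```python
import subprocess


def stop_gracefully(process, timeout=5.0):
    """Send SIGTERM, wait up to `timeout` seconds, and only then send SIGKILL.

    `process` is a subprocess.Popen instance. Returns the process's return code.
    """
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process ignored or did not finish handling SIGTERM in time.
        process.kill()
        process.wait()
    return process.returncode
```

With a wrapper like the npm version, the wait gives the SIGTERM handler a chance to forward the signal to the grandchild before any SIGKILL is sent.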

This can be avoided by installing phantomjs without using npm

Things work fine as long as you don't go through npm, so don't run npm install phantomjs. Install it with homebrew instead, or download it from http://phantomjs.org/download.html.

I feel like I've taken a detour. I pray that no one will make the same detour.
