[Python] (Memo) Extracting only the part you want from a web page, converting it to a Sphinx page, and printing it as a PDF

I wanted to turn the healthy walking exercises on this page into a printable handout. I had been studying **Sphinx** as an experiment, so I decided to use it here. It may feel like forcing the tool onto the problem, and the process was probably far from efficient, but it turned out reasonably well, so I'm calling it good.

Deliverables (healthy exercises)

There are some blank pages mixed in, but never mind that.

What I used

Sphinx

http://sphinx-users.jp/

BeautifulSoup

http://tdoc.info/beautifulsoup/

There is also **HTMLParser**, but I didn't know how to use it, and BeautifulSoup was easier for me personally to understand, so I went with that.

pip install BeautifulSoup

I'll use this to extract the explanation images and text from the HTML of the page in question, and then shape them into a handout with Sphinx. More on that below.

Converting to a Sphinx page

Let's think about the general flow

First, the URL of each gymnastics page looks like

http://kaigoouen.net/program/gymnastics/gymnastics_{Gymnastics type number}_{page number}.html

so there are two places where numbers go (not that it matters, but does this site really make a separate HTML file for every page?).

Gymnastics type number

The healthy walking exercises come in three levels, hop, step, and jump, and it seems that **hop is 2, step is 3, and jump is 4** (incidentally, 1 appears to be some kind of introduction, so it isn't needed this time).

page number

The hop, step, and jump levels each span multiple pages, so this is simply the **page number** within the level.

Based on this, the flow should look something like:

  1. Select the type of gymnastics
  2. Select a page from the selected gymnastics
  3. Extract the desired part from the HTML of the selected page
  4. Move to the next page and return to step 2 (repeat until the last page)

Since I plan to build the hop, step, and jump handouts separately, this is roughly all that's needed.
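The numbered steps above can be sketched as a loop over page URLs. This is only a minimal sketch, assuming the URL pattern and type numbers described above (hop=2, step=3, jump=4); the real page counts appear later in the full script.

```python
# Minimal sketch of the crawl loop: build each page URL from the
# gymnastics-type number and the page number described above.
base_url = 'http://kaigoouen.net'

def page_urls(index, last_page):
    """Yield the URL of every page for one gymnastics type."""
    for page in range(1, last_page + 1):
        yield base_url + '/program/gymnastics/gymnastics_{index}_{page}.html'.format(
            index=index, page=page)

# e.g. the first three hop pages (hop is type 2)
urls = list(page_urls(2, 3))
print(urls[0])  # http://kaigoouen.net/program/gymnastics/gymnastics_2_1.html
```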

Let's think about the detailed flow

Now let's flesh out step 3 of the flow above.

Apparently, the part I want lives in a **div element of class box02**.

For example, a hop page looks like this:

```html
<div class="box02">
    <h2>
        <img alt="item name" src="text.gif" />
    </h2>
    <div class="box02_txt_hop">
        <div class="hop_box">
            <div class="hop_txt">Commentary</div>
            <div class="hop_image">
                <img alt="Reference photo" src="image.jpg" />
            </div>
        </div>
        <div class="hop_box">
            <!-- Item continuation -->
        </div>
    </div>
</div>
<div class="box02">
    <!-- Next item -->
</div>
```

So, what I want out of this is the item name (the `alt` text of the image inside `h2`), each piece of commentary (`hop_txt`), and its reference photo (`hop_image`).

That's all. Narrowing these down with the BeautifulSoup module wasn't too hard, since **find** and **findAll** can filter by element name or class name. After that, it's a matter of writing a Python script that rewrites the extracted parts into .rst format for Sphinx. Just keep at it with the documentation open.

For now, I used a heading for the **item name**, and the **table approach** because I wanted the commentary and image side by side.
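As a sketch of the .rst being emitted, here are Python 3 restatements of the table helpers from the full script at the end of this post (the example URL is made up):

```python
# Each (image, text) pair becomes its own small list-table, so the
# photo and its commentary end up side by side in the output.
def sphinx_image(src, width='150pt'):
    return '.. image:: %s\n          :width: %s' % (src, width)

def sphinx_listtable(s_img, s_txt):
    return '\n'.join([
        '.. list-table::',
        '',
        '   * - %s' % s_img,
        '     - | %s' % s_txt,
        '',
    ])

print(sphinx_listtable(sphinx_image('http://example.com/hop.jpg'), 'Commentary'))
```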

A little trouble

The HTML of the original page was weird

When extracting the commentary and images for each item, my first thought was to **walk down from the top by class name**. Inspecting the elements in Firefox suggested digging down through **box02 > box02_txt_hop > hop_box > hop_image / hop_txt**, so that's what I did. But then the last commentary of each item kept going missing. Looking more closely, for some reason only the last commentary of each item, both text and image, was **not wrapped in a hop_box and sat directly under box02_txt_hop** (reason unknown). So I decided to ignore hop_box entirely. In the first place, BeautifulSoup's find and findAll **search all descendants of the target element for the given conditions**, so there was never any need to descend one layer at a time anyway.
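That recursive descendant search is what saves the day here. The same behavior can be illustrated without any third-party dependency using the stdlib's ElementTree (BeautifulSoup's find/findAll behave the same way; the HTML below is a made-up miniature of the real page):

```python
# Searching from box02_txt_hop finds hop_txt divs at ANY depth, so the
# stray commentary sitting outside hop_box is still picked up.
import xml.etree.ElementTree as ET

html = """
<div class="box02_txt_hop">
    <div class="hop_box">
        <div class="hop_txt">first commentary</div>
    </div>
    <div class="hop_txt">last commentary, not wrapped in hop_box</div>
</div>
"""

root = ET.fromstring(html)
# './/div' matches all descendants, not just direct children
texts = [d.text for d in root.findall('.//div[@class="hop_txt"]')]
print(texts)  # both commentaries are found, regardless of nesting
```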

The table automatically becomes striped

If the cells had been text only it might even have looked nice, but with images mixed in I found the stripes hard to read. I couldn't figure out how to change the style (for PDF, that is; HTML is another story), so I worked around it by starting a new table for every row.

UnicodeDecodeError and UnicodeEncodeError

These errors came up constantly while writing the .rst files from the Python script.

type(text)

I fixed them one by one, checking types like this as I went.
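These errors are Python 2's str/unicode confusion. The discipline that works, restated here in Python 3 terms (bytes at the boundaries, text inside; the sample string is just illustrative):

```python
# Decode incoming bytes once, work on text, encode once on output.
raw = 'かかとからつけて・・・'.encode('utf-8')  # bytes, as fetched from the page
text = raw.decode('utf-8')                      # -> str (unicode) at the boundary
text = text.replace('・・・', '...')            # safe: pure text operations
data = text.encode('utf-8')                     # -> bytes again when writing the file
print(data.decode('utf-8'))
```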

Until printing as PDF

There seem to be two routes, **rst2pdf** and **latex**, but the former looked easier, so I went with **rst2pdf** this time.

And then I got stuck on one thing after another (maybe installing everything with pip was a mistake).

`make pdf` gives an error

ImportError: reportlab requires Python 2.7+ or 3.3+; 3.0-3.2 are not supported.

So, following a page I found, I ran

pip install -U reportlab==2.5



Then the error changed:

[ERROR] pdfbuilder.py:130 need more than 3 values to unpack
Traceback (most recent call last):
  File "/home/vagrant/www/public/hopstepjump/venv/lib/python2.6/site-packages/rst2pdf/pdfbuilder.py", line 78, in write
    docname, targetname, title, author = entry[:4]
ValueError: need more than 3 values to unpack
FAILED
build succeeded, 1048 warnings.

In short, the way I had written **conf.py** was wrong. At first I listed the documents I wanted as PDFs in **pdf_documents** like this:

```py
('hop','step','jump'),
```

thinking they would just be strung together, but no. Each entry has to be a 4-tuple:

```py
('docName', u'file name', u'The title that appears on the cover of the PDF', u'Author'),
```

(it was, in fact, written in the manual). These tuples go in a list, so to convert multiple documents to PDF, just add multiple tuples.
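A hypothetical pdf_documents that satisfies rst2pdf's unpacking would look like this (the names and titles here are invented):

```python
# conf.py sketch: one 4-tuple per PDF to build. rst2pdf's pdfbuilder
# unpacks the first four fields of each entry:
#   docname, targetname, title, author = entry[:4]
pdf_documents = [
    ('hop',  u'hop',  u'Hop exercises',  u'someone'),
    ('step', u'step', u'Step exercises', u'someone'),
    ('jump', u'jump', u'Jump exercises', u'someone'),
]

for entry in pdf_documents:
    docname, targetname, title, author = entry[:4]  # four fields: fine

# My first attempt, a single 3-tuple, fails exactly as the traceback showed:
try:
    docname, targetname, title, author = ('hop', 'step', 'jump')[:4]
except ValueError as e:
    print(e)  # not enough values to unpack
```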

Anyway, take two: `make pdf`.

IOError: decoder jpeg not available
identity=[ImageReader@0x3a3ce10 filename='/home/vagrant/www/public/hopstepjump/venv/lib/python2.6/site-packages/rst2pdf/images/image-missing.jpg']
FAILED
build succeeded.

Another new problem. JPEG doesn't work? After some digging, it turned out PIL itself isn't at fault; it needs **libjpeg**: http://kwmt27.net/index.php/2013/07/14/python-pil-error-decoder-jpeg-not-available/

So install that and reinstall PIL. Will yum do it?

yum search libjpeg

============================= N/S Matched: libjpeg =============================
libjpeg-turbo-devel.i686 : Headers for the libjpeg-turbo library
libjpeg-turbo-devel.x86_64 : Headers for the libjpeg-turbo library
libjpeg-turbo-static.x86_64 : Static version of the libjpeg-turbo library
libjpeg-turbo.x86_64 : A MMX/SSE2 accelerated library for manipulating JPEG image files
libjpeg-turbo.i686 : A MMX/SSE2 accelerated library for manipulating JPEG image files

I couldn't tell which one I needed, so I just tried `yum install libjpeg` (hoping yum would pick the right one for me).

Updated: libjpeg-turbo.x86_64 0:1.2.1-3.el6_5

Complete!

That seemed to work, so, following the page above:

pip install -I pillow

make pdf

IOError: decoder jpeg not available
identity=[ImageReader@0x221e150 filename='/home/vagrant/www/public/hopstepjump/venv/lib/python2.6/site-packages/rst2pdf/images/image-missing.jpg']
FAILED
build succeeded.

No progress... `yum list | grep libjpeg` shows it is installed properly. And Pillow and PIL are the same thing, aren't they?

So:

pip install pil

Still no change...

Is it this? http://all-rounder-biz.blogspot.jp/2013/06/macioerror-decoder-jpeg-not-available.html Wrong...

Then this? http://d.hatena.ne.jp/rougeref/20130116 No luck there either...

Well, looking at the output from installing PIL:


*** TKINTER support not available
*** JPEG support not available
--- ZLIB (PNG/ZIP) support available
*** FREETYPE2 support not available
*** LITTLECMS support not available
--------------------------------------------------------------------

So JPEG support really is unavailable. Error messages don't lie.

Is it here? http://dev-pao.blogspot.jp/2010/04/python-imaging-library-piljpeg.html No...

Next: http://d.hatena.ne.jp/rougeref/20130116

Oh, wait, maybe...

yum install libjpeg-devel

Then uninstall PIL and install it again:

--- JPEG support available

**The cause was that the -devel package had never been installed.** With the JPEG problem finally solved, `make pdf` again.

[ERROR] image.py:110 Missing image file: /home/vagrant/www/public/hopstepjump/http://kaigoouen.net/img/hop_pic_108.jpg かかとからつけて・・・つま先でけり出す、を繰り返しながら、歩きましょう。
line done
build succeeded.

Right, it's still not working, but **the path it's trying to load the image from is obviously wrong**. It should be fetching the image from the web, shouldn't it? I couldn't find anything by searching on the error message, so after a lot of trial and error it turned out that **a replace(...) call I was using was the culprit**. Some kind of bug? In any case, the fix was simply to stop using it.

Next, the output was garbled.

The suspicious part is

sh: fc-match: command not found
[ERROR] findfonts.py:208 Unknown font: DejaVu Sans Mono-Bold

which appears all over the log. What is **fc-match**? Googling says it is a command that ships with a library called **fontconfig**.

Then with yum

yum search fontconfig

=========================== N/S Matched: fontconfig ============================
fontconfig.i686 : Font configuration and customization library
fontconfig.x86_64 : Font configuration and customization library
fontconfig-devel.i686 : Font configuration and customization library
fontconfig-devel.x86_64 : Font configuration and customization library

Again there are plain and -devel packages. Not one to step in the same rut twice, I installed -devel. Running `make pdf` again, the earlier "command not found" was gone, but the line below it remained. Where does 'DejaVu Sans Mono-Bold' even come from? I've never heard of this font.

http://www.fontsquirrel.com/fonts/dejavu-sans-mono

Ah, so it's a Latin font after all. Does that mean my stylesheet isn't being read? **conf.py** looked fine, so I checked **ja.json**, my stylesheet. I thought I had written it exactly as the documentation says. After several hours of head-scratching, the cause turned out to be a single unclosed quotation mark. I also learned that JSON demands double quotes; single quotes won't do.
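Both JSON pitfalls in one snippet (the key and font name here are made up; what matters is the quoting):

```python
import json

good = '{"fontName": "SomeJapaneseFont"}'         # double quotes: valid JSON
bad_quotes = "{'fontName': 'SomeJapaneseFont'}"   # single quotes: rejected
bad_unclosed = '{"fontName": "SomeJapaneseFont}'  # unclosed quote: rejected

style = json.loads(good)
print(style['fontName'])

for bad in (bad_quotes, bad_unclosed):
    try:
        json.loads(bad)
    except ValueError as e:  # json.JSONDecodeError subclasses ValueError
        print('rejected:', e)
```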

With that, it finally started looking for the Japanese font I had configured... but **it couldn't find it**. In fact, the **fc-list command returned nothing at all**.

Moving the font file under /usr/share/fonts solved it and the garbled output disappeared, but then I don't understand what **pdf_font_path** in **conf.py** is for. It is currently empty, yet Japanese renders fine, and writing some other path there has no effect. What if I wanted to keep the fonts somewhere else? (Rewrite the **fontconfig** settings somewhere?)

The script ended up looking like this

hopstepjump.py


#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys,re,urllib2

from BeautifulSoup import BeautifulSoup


#### reference page ####
base_url = 'http://kaigoouen.net'

hopstepjump = {
	'hop':{
		'index':2,
		'last_page':25,
	},
	'step':{
		'index':3,
		'last_page':39,
	},
	'jump':{
		'index':4,
		'last_page':46,
	},
}


#### for make .rst files ####
br = u'\n'

page_break = u".. raw:: pdf%s   PageBreak" % (br*2)

def soup_to_sphinx(pg):

	p = 1

	while p <= pg['last_page']:

		url = base_url + '/program/gymnastics/gymnastics_{index}_{page}.html'.format(index=pg['index'],page=p)

		htmldata = urllib2.urlopen(url)

		soup = BeautifulSoup( unicode(htmldata.read(),'utf-8') )

		for box in soup.findAll('div',{'class':'box02'}):

			lessons = box.find('div',{'class':'box02_txt_%s' % choice})

			if lessons is not None:

				title = box.contents[1].contents[0]['alt']
				print( sphinx_head(title) )

				images = lessons.findAll('div',{'class':'%s_image' % choice})

				texts = lessons.findAll('div',{'class':'%s_txt' % choice})
				texts = iter(texts)

				for image in images:
					src = base_url + image.contents[0]['src']
					image = sphinx_image(src)

					text = texts.next().renderContents()
					text = sphinx_text(text)

					print( sphinx_listtable(image,text) )

				print(page_break)				

		htmldata.close()
		p += 1
	

def sphinx_head(txt):
	return br.join([br,txt,u"="*30+br]).encode('utf-8')


def sphinx_listtable(s_img,s_txt):
	table = u".. list-table::" + br
	image = u"   * - %s" % s_img
	text = u"     - | %s" % s_txt

	return br.join([table,image,text,br]).encode('utf-8')


def sphinx_image(src):
	option = u":width: 150pt"
	return u".. image:: %s" % src + br + u"          %s" % option


def sphinx_text(txt):
	text = txt.decode('utf-8').replace(u"<br />",br)
	if br in text:
		texts = text.splitlines()
		text = reduce(lambda x,y: x +  br + u"       | " + y,texts)
	return text



if __name__ == '__main__':

	try:
		choice = sys.argv[1]
		page = hopstepjump[choice]
	except IndexError:
		print("[error]: An option is required (one of 'hop', 'step', or 'jump').")
		exit("    - Example: 'python %s hop'" % sys.argv[0])
	except KeyError:
		print("[error]: Unknown option (must be 'hop', 'step', or 'jump').")
		exit("    - Example: 'python %s hop'" % sys.argv[0])

	soup_to_sphinx(page)

How to use

python hopstepjump.py hop > hop.rst

Producing a file of over 1000 lines with a single command is a great feeling. After that, build a Sphinx project around the output and it's just `make html` or `make pdf`.

GitHub

https://github.com/juniskw/hopstepjump/tree/no_replace_listtable

Incidentally, it's been starred by someone abroad whom I don't know. Is this some kind of prank?
