[PYTHON] Make a Kindle book from TeX files by converting mathematical formulas to images

Rakuten Kobo and iBooks support MathML and render it quite well, but Amazon Kindle, the largest platform, does not (perhaps for cost reasons; I do not expect it to be supported in the future). So to make a reflowable Kindle book with mathematics, you have no choice but to turn the formulas into images, like it or not.

When converting formulas to images I would at least like to use SVG, a vector format, but I am not sure whether Kindle supports it (is it unsupported in the iOS app?), so it seems safest to settle for PNG. However, it is not feasible to convert a TeX document containing hundreds of formulas to images by hand and then format it as epub3 or the like, so the purpose of this article is to automate that process.

Caveats

The environment is macOS. Please note that the variable naming and the handling of directories and so on may be quite amateurish. I may rewrite it a little more cleanly later.

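Tools used

pandoc, uplatex and dvipdfmx, ImageMagick (convert), Python with lxml and imagesize, and Kindle Previewer (or kindlegen)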

Rough procedure

  1. Write the document in TeX
  2. Convert TeX to HTML + MathJax with pandoc
  3. Convert the MathJax parts to PNG
  4. Convert HTML to epub3 with pandoc
  5. Unzip the epub3, rework it, and recompress it
  6. Load the epub3 into Kindle Previewer to create a mobi (kindlegen also works)

The code, for now

import subprocess, lxml.html, lxml.etree, imagesize, re, tempfile

def tex2html(source, target): # Adjust the pandoc options as needed
    subprocess.call(['pandoc',
                     '-s',
                     '-t', 'html5',
                     '--mathjax',
                     '--css=stylesheet.css',
                     '-o', target, source])    

def get_html(filename, encoding='utf-8', xml=False):
    # Use lxml.etree to parse XHTML and lxml.html to parse HTML
    with open(filename, 'r', encoding=encoding) as f:
        if xml:
            html = lxml.etree.parse(f).getroot()
        else:
            html = lxml.html.parse(f).getroot()
    return html

def write_html(html, filename, encoding='utf-8', xml=False):
    # Use lxml.etree to serialize XHTML and lxml.html to serialize HTML
    if xml:
        src = lxml.etree.tostring(html, 
                                  encoding=encoding,
                                  xml_declaration=True,
                                  doctype='<!DOCTYPE html>',
                                  method='xml',
                                  pretty_print=True).decode(encoding)
    else:
        src = lxml.html.tostring(html,
                                 encoding=encoding,
                                 doctype='<!DOCTYPE html>',
                                 pretty_print=True).decode(encoding)
    with open(filename, 'w', encoding=encoding) as f:
        f.write(src)

def convert2png(filename, newfilename, convert=True):
    html= get_html(filename)
    textemplate = r'''\documentclass[a0paper, uplatex]{jsarticle}
\usepackage[dvipdfmx]{graphicx}
\usepackage[margin=1cm]{geometry}
\usepackage{amsmath,amssymb}
\pagestyle{empty}
\begin{document}
\scalebox{4}{\parbox{.25\linewidth}{MATH}}
\end{document}
'''
    imgs = {} # Maps TeX code to its image file so the same image is not generated twice
    for span in html.xpath(r'//span[@class="math inline" or @class="math display"]'):
        tex = span.text
        if tex in imgs:
            imgsrc = imgs[tex]
        else:
            imgsrc = r'math{0:04d}.png'.format(len(imgs)+1)
            imgs[tex] = imgsrc
            if convert:
                # Write the one-formula document, then run TeX -> DVI -> PDF -> trimmed PNG
                with open('tmp.tex', 'w', encoding='utf-8') as texf:
                    texf.write(textemplate.replace('MATH', tex))
                subprocess.call('uplatex tmp.tex', shell=True)
                subprocess.call('dvipdfmx tmp.dvi', shell=True)
                subprocess.call('convert -trim tmp.pdf '+imgsrc, shell=True)
        span.tag = 'img'
        span.text = None
        span.attrib['src'] = imgsrc
        span.attrib['alt'] = tex
        width, height = imagesize.get(imgsrc)
        span.attrib['height'] = str(height)

    write_html(html, newfilename)
                           
def html2epub(source, target, css, cover):
    # Note: pandoc 2.0 and later replaced --epub-stylesheet with --css
    subprocess.call(['pandoc',
                     '-t', 'epub3',
                     '--toc',
                     '--epub-chapter-level=1',
                     '--epub-stylesheet', css,
                     '--epub-cover-image', cover,
                     '-o', target, source])

def extract_epub(filename, tmpdir):
    # -p avoids an error when tmpdir already exists
    subprocess.call('mkdir -p {0}'.format(tmpdir), shell=True)
    subprocess.call('unzip -d {0} {1}'.format(tmpdir, filename), shell=True)

def make_epub(filename, tmpdir):
    subprocess.os.chdir(tmpdir)
    # mimetype goes in first, stored uncompressed (-0) and without extra fields (-X)
    subprocess.call(r'zip -X0 ../{filename} mimetype;zip -XrD ../{filename} * -x mimetype'.format(filename=filename), shell=True)
    subprocess.os.chdir('../')
    
def make_epub_for_kindle(name, css, cover, convert=True):
    tmpdir = name + '_tmpdir'
    if not subprocess.os.path.isdir(tmpdir):
        subprocess.os.mkdir(tmpdir)
    subprocess.os.chdir(tmpdir)
    tex2html('../'+name+'.tex', name+'.html')
    convert2png(name+'.html', name+'2.html', convert)
    html2epub(name+'2.html', name+'0.epub', '../'+css, '../'+cover)
    epubdir = tempfile.TemporaryDirectory(dir='./')
    extract_epub(name+'0.epub', epubdir.name)
    ns = {'xhtml': 'http://www.w3.org/1999/xhtml'}
    for filename in [fn for fn in subprocess.os.listdir(epubdir.name) if re.match(r'ch\d{3}\.xhtml', fn)]:
        xhtml = get_html(epubdir.name+'/'+filename, xml=True)
        xhtml.attrib['{http://www.idpf.org/2007/ops}lang'] = 'ja'
        xhtml.attrib['lang'] = 'ja'
        for img in xhtml.xpath('//xhtml:img', namespaces=ns):
            height = int(img.attrib['height'])
            img.attrib['style'] = 'height: ' + str(round(height/40, 2)) + 'em;'
            del img.attrib['height']
        write_html(xhtml, epubdir.name+'/'+filename, xml=True)
    make_epub(name+'.epub', epubdir.name)
    epubdir.cleanup()
    subprocess.os.chdir('../')

Description of each function

tex2html(source, target)

Uses pandoc to convert a TeX file to HTML5. With the --mathjax option, formulas are output in the following forms (inline and display, respectively):

<span class="math inline">\(e^{\pi i} = -1\)</span>

<span class="math display">\[e^{\pi i} = -1\]</span>

These spans are then extracted with lxml and converted to images one by one.
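As a rough illustration of that extraction step, here is a minimal sketch that collects the TeX source from such spans. It uses the standard library's html.parser instead of lxml so that it runs without extra packages; the class name MathSpanCollector and the sample string are made up for this example and are not part of the script above.

```python
from html.parser import HTMLParser


class MathSpanCollector(HTMLParser):
    """Collect (kind, tex) pairs from pandoc's --mathjax spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.formulas = []   # list of (kind, tex) tuples in document order
        self._kind = None    # 'inline' or 'display' while inside a math span
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'math' in classes:
            self._kind = 'display' if 'display' in classes else 'inline'
            self._buf = []

    def handle_data(self, data):
        if self._kind is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._kind is not None:
            self.formulas.append((self._kind, ''.join(self._buf)))
            self._kind = None


sample = (
    r'<p>Euler: <span class="math inline">\(e^{\pi i} = -1\)</span></p>'
    r'<p><span class="math display">\[e^{\pi i} = -1\]</span></p>'
)
collector = MathSpanCollector()
collector.feed(sample)
collector.close()
for kind, tex in collector.formulas:
    print(kind, tex)
```

The collected text still carries the \( \) or \[ \] delimiters, which is what the script above pastes into its TeX template.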

get_html(filename, encoding='utf-8', xml=False)

Parses an HTML document with lxml and returns the root html element. HTML is assumed by default; to parse XHTML, pass xml=True.

write_html(html, filename, encoding='utf-8', xml=False)

Writes an HTML file, taking the html element as its argument.

convert2png(filename, newfilename, convert=True)

In the HTML document, parts such as

<span class="math inline">\(e^{\pi i} = -1\)</span>

are converted to images with ImageMagick's convert and replaced with img elements of the form

<img src="math0001.png" class="math inline" alt="\(e^{\pi i} = -1\)" height="30">

The height attribute is a fairly important value: it is used at the end to adjust the displayed height of the formula. It is read with a Python library called imagesize. If there are many formulas, the conversion takes time. With convert=False the image conversion step is skipped, which saves time when the images already exist and do not need to be regenerated, for example when you are only redoing the rework.
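The script also avoids generating the same image twice by remembering which TeX string maps to which file. The following is a small sketch of just that bookkeeping, separated from the rendering; the function name assign_image_names and the sample formulas are hypothetical, while the file name pattern is the one used in convert2png.

```python
def assign_image_names(formulas, pattern='math{0:04d}.png'):
    """Map each distinct TeX string to one image file name, in first-seen order."""
    names = {}
    for tex in formulas:
        if tex not in names:
            # Number images by how many distinct formulas have been seen so far
            names[tex] = pattern.format(len(names) + 1)
    return names


formulas = [r'\(x\)', r'\(y\)', r'\(x\)']
names = assign_image_names(formulas)
print(names)
```

Here the repeated formula reuses the first file name, so only two images would need to be rendered for three occurrences.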

Incidentally, the images are generated somewhat large (4 times the default size, probably about 40pt). Images created at the default size come out crushed and unreadable, so the approach is to render them large and scale them down.

Also, searching for ways to turn TeX into images brings up dvipng, but getting it to handle Japanese looked troublesome, so converting PDF to PNG with ImageMagick's convert is the easier route.
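For reference, the uplatex, dvipdfmx, convert chain can be written as argument lists rather than shell strings. This is only a sketch of the three commands the script runs; the helper name build_image_commands is an assumption of mine, and nothing is executed here.

```python
import shlex


def build_image_commands(tex_file, png_file):
    """Return the argv lists for TeX -> DVI -> PDF -> trimmed PNG (hypothetical helper)."""
    base = tex_file[:-4] if tex_file.endswith('.tex') else tex_file
    return [
        ['uplatex', tex_file],
        ['dvipdfmx', base + '.dvi'],
        ['convert', '-trim', base + '.pdf', png_file],
    ]


commands = build_image_commands('tmp.tex', 'math0001.png')
for argv in commands:
    # Each list could be passed to subprocess.call(argv) without shell=True
    print(' '.join(shlex.quote(arg) for arg in argv))
```

Passing lists to subprocess avoids quoting problems if a file name ever contains spaces.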

html2epub(source, target, css, cover)

Converts the HTML document whose formulas have been turned into images to epub with pandoc. css is the file name of the stylesheet to embed, and cover is the file name of the cover image to embed.

extract_epub(filename, tmpdir)

Unzips the epub created by html2epub into tmpdir so that it can be reworked. It is fairly well known that an epub3 file is just a zip archive.

make_epub(filename, tmpdir)

Packs the reworked files back into an epub (I referred to an article titled "Compress to EPUB using terminal").
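If you would rather not depend on the zip command, the same packing can probably be done with Python's standard zipfile module. This is a sketch of an alternative, not what the script above does: the function name pack_epub and the placeholder files are assumptions, and the only rule it implements is that mimetype comes first and is stored uncompressed.

```python
import os
import tempfile
import zipfile


def pack_epub(src_dir, epub_path):
    """Zip src_dir into epub_path with 'mimetype' first and uncompressed."""
    with zipfile.ZipFile(epub_path, 'w') as zf:
        zf.write(os.path.join(src_dir, 'mimetype'), 'mimetype',
                 compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(src_dir):
            for fn in sorted(files):
                full = os.path.join(root, fn)
                arcname = os.path.relpath(full, src_dir).replace(os.sep, '/')
                if arcname != 'mimetype':
                    zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)


# Tiny demonstration with a throwaway directory (file contents are placeholders)
with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, 'book')
    os.makedirs(os.path.join(src, 'META-INF'))
    with open(os.path.join(src, 'mimetype'), 'w') as f:
        f.write('application/epub+zip')
    with open(os.path.join(src, 'META-INF', 'container.xml'), 'w') as f:
        f.write('<container/>')
    out = os.path.join(work, 'book.epub')
    pack_epub(src, out)
    with zipfile.ZipFile(out) as zf:
        entries = [(info.filename, info.compress_type) for info in zf.infolist()]
print(entries)
```

The output archive is written outside src_dir so that the directory walk does not pick up the epub being created.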

make_epub_for_kindle(name, css, cover, convert=True)

This runs the whole series of steps in one go. It takes name.tex in the current directory as the source and finally generates name.epub. Because many files are generated along the way, it creates a directory called name_tmpdir and writes the generated files, including name.epub, there. Place the css and cover files in the same directory as name.tex.

What is the rework?

I create an epub with pandoc, unzip it, *rework* it, and pack it into an epub again. What this rework actually does is adjust the height of the formula images. If MathML were supported you would not have to worry about this, but once formulas are images you need to set their width or height, and if you specify it in absolute units (or do not specify it at all), the size of the formulas does not change even when the reader changes the font size in the epub viewer. To avoid this, the height has to be given in relative units, as in

<img src="math001.png" style="height: 1.0em;">

Specifying it relatively in the style attribute like this seems to be the only option (please let me know if there is a better way). Here is an excerpt of the code that sets it:

for img in xhtml.xpath('//xhtml:img', namespaces=ns):
    height = int(img.attrib['height'])
    img.attrib['style'] = 'height: ' + str(round(height/40, 2)) + 'em;'

Specifically, the value is the height of the actual PNG image divided by 40, with the unit em appended. Adjust the divisor as needed.

You might think that setting the style attribute at the html2epub() stage would make this rework unnecessary, but when pandoc converts HTML to epub it deletes the style attribute of img elements; that is how Pandoc behaves.

That said, there are other things worth modifying in the epub generated by pandoc, so I do not think this unzip-and-rezip step is wasted effort. For example, this rework is the only point at which the lang attribute and the epub:lang attribute can be added to the html element. There are quite a few other things that (some people) will want to modify, such as the id attribute of section elements.
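To make the arithmetic concrete, here is the same conversion as a small standalone function. The name height_to_em_style and the parameter px_per_em are made up for this sketch; 40 is the divisor used in the excerpt.

```python
def height_to_em_style(height_px, px_per_em=40):
    """Return a CSS style string giving the image height in em units."""
    return 'height: {0}em;'.format(round(height_px / px_per_em, 2))


print(height_to_em_style(40))  # an image 40 px tall is shown at 1.0em
print(height_to_em_style(30))
print(height_to_em_style(52))
```

A larger px_per_em makes every formula smaller relative to the surrounding text, so this is the one number to tune if the formulas look too big or too small.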
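As an illustration of the lang part of the rework, the sketch below sets the same two attributes on the root of a chapter file. It uses the standard library's xml.etree.ElementTree in place of lxml, and the chapter string is a made-up stand-in for a pandoc-generated ch001.xhtml; treat it as a rough equivalent, not the script's actual code path.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'
EPUB_NS = 'http://www.idpf.org/2007/ops'

# Serialize XHTML as the default namespace and use the epub: prefix
ET.register_namespace('', XHTML_NS)
ET.register_namespace('epub', EPUB_NS)

chapter = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>ch001</title></head>'
    '<body><section id="natural-number"><h1>Natural numbers</h1></section></body>'
    '</html>'
)

root = ET.fromstring(chapter)
root.set('lang', 'ja')
root.set('{%s}lang' % EPUB_NS, 'ja')  # same attribute name as in the script above

result = ET.tostring(root, encoding='unicode')
print(result)
```

Other per-chapter tweaks, such as rewriting section id attributes, would fit in the same place between parsing and serializing.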

stylesheet.css

Images whose class attribute is math display need display: block.

stylesheet.css


img.math.display{
    display: block;
    margin: .5em auto; /*Centered*/
}

img.math.inline{
    margin-left: .2em;
    margin-right: .2em;
}

The margins are a matter of taste. Please also adjust headings, line spacing, and so on yourself.

sample

sample.tex


\documentclass[uplatex]{jsbook}
\usepackage{amsmath,amssymb}
\begin{document}
\title{System of numbers}

\chapter{Natural numbers}

Let $\omega$ be the smallest set guaranteed by the axiom of infinity. Then $\emptyset\in\omega$, and for any $x\in\omega$,
\begin{align*}
  \sigma: x\mapsto x\cup\{x\}
\end{align*}
defines a map from $\omega$ to $\omega$.

\section{Peano's axioms}

$(\omega, \emptyset, \sigma)$ satisfies the so-called Peano axioms.

\chapter{Integers}

The key idea in constructing the integers is to regard a pair $(m, n)$ of natural numbers $m, n$ as the integer $m-n$.

\section{Equivalence relation}

Since $(1, 0)$ and $(2, 1)$ are regarded as the same integer, an equivalence relation is needed.

\end{document}

The epub is then built with:

make_epub_for_kindle('sample', 'stylesheet.css', 'cover.png')

Closing note

This time the starting point is a TeX file, but the same should probably be possible starting from Markdown, which is popular these days. However, because TeX is used to render the images, a TeX environment is still required.
