A memo on handling full-width double quotes in Python regular expressions

Overview

Purpose

  • I want to extract the string enclosed by full-width double quotation marks (“ and ”).
Example) He said “Hello World!”

From the sentence above, extract

Hello World!

Regular expressions

I built the pattern with reference to the following article.

Regular expression: An expression that matches only the contents of parentheses

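re.search(r"(?<=“).*?(?=”)", sentence)

The lookbehind `(?<=“)` and the lookahead `(?=”)` require the quotes on either side without including them in the match, and the non-greedy `.*?` stops at the first closing quote.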
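As a minimal runnable sketch of the pattern above: the helper name `extract_quoted` and the sample sentence are assumptions for illustration, not part of the original memo.

```python
import re

# Match the text between a full-width opening quote (U+201C) and the
# nearest closing quote (U+201D), excluding the quotes themselves.
QUOTED = re.compile(r"(?<=“).*?(?=”)")


def extract_quoted(sentence):
    """Return the first quoted span, or None if no quoted span is found."""
    match = QUOTED.search(sentence)
    return match.group() if match else None


sentence = "He said “Hello World!”"  # sample input, assumed for illustration
print(extract_quoted(sentence))  # Hello World!
```

Because `re.search` returns `None` when nothing matches, checking the match object before calling `.group()` avoids an `AttributeError` on sentences that contain only half-width quotes.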

Caution

At first I tried to unify everything to half-width double quotes using `jaconv`, Python's full-width/half-width conversion package, but that did not work.

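This is because jaconv.normalize converts the two full-width double quotes differently:

'”' => '"'
'“' => '``'

The closing quote becomes a half-width double quote, but the opening quote becomes two backticks, so the pair is not unified into one character.

jaconv 0.2.4 - PyPI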
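If unifying the quotes is still the goal, one possible alternative is to skip jaconv and map the two characters explicitly with the standard library's `str.translate`. This is only a sketch of a swapped-in technique, and `unify_double_quotes` is a hypothetical helper name, not something jaconv provides.

```python
# Hypothetical helper, not part of jaconv: map both full-width double
# quotes to the half-width double quote (U+0022).
QUOTE_TABLE = str.maketrans({"“": '"', "”": '"'})


def unify_double_quotes(text):
    """Replace U+201C and U+201D with U+0022; leave everything else as is."""
    return text.translate(QUOTE_TABLE)


print(unify_double_quotes("He said “Hello World!”"))  # He said "Hello World!"
```

After this step, opening and closing quotes can no longer be told apart, so a pattern that relies on the “ and ” pair would need to be rewritten for plain `"` characters.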

Summary

Be careful: it is hard to tell by eye whether a double quotation mark is full-width or half-width, or which code point it actually is.
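When the glyphs look alike, printing the code point and Unicode name can settle which character is actually in the text. The following is a small sketch using the standard `unicodedata` module; the helper name `describe` is an assumption for illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in ['"', "“", "”"]:
    print(describe(ch))
# U+0022 QUOTATION MARK
# U+201C LEFT DOUBLE QUOTATION MARK
# U+201D RIGHT DOUBLE QUOTATION MARK
```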

Full-width double quotes are a bad civilization!
