UTF-8 text processing in Python

The Python 2.x series is confusing because the str object and the unicode object are separate types. After some research, I ended up with the code below. The Python 3.x series seems easier because all text is handled as Unicode.
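
To make the difference between text and encoded bytes concrete, here is a minimal sketch. It assumes Python 3, where `str` is the Unicode type and `bytes` plays the role that `str` had in Python 2. The sample string and the variable names are illustrative and do not come from the script in this post.

```python
# Minimal sketch: text (Unicode code points) versus UTF-8 encoded bytes.
text = u'日本語'                     # the u'' prefix is accepted on Python 2 and 3.3+
encoded = text.encode('utf-8')     # text -> bytes
decoded = encoded.decode('utf-8')  # bytes -> text

print(len(text))        # 3 characters
print(len(encoded))     # 9 bytes, three per character in UTF-8
print(decoded == text)  # True: the round trip is lossless
```

Opening a file through codecs.open, as the script does, performs this decode on read and encode on write automatically.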

Environment: Mac OS X 10.6.8, Python 2.6.1

python


# coding: UTF-8

import codecs
import string
import re

f_in  = codecs.open('test.txt', 'r', 'utf-8')
f_out = codecs.open('test_out.txt', 'w', 'utf-8')

lines = f_in.readlines()  # Read all lines as unicode objects
lines2 = []
for line in lines:
	line = string.replace(line, u'sample text', u'a sample text')  # Replace text
	line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line)  # Regular expression replacement: insert a comma every 3 digits
	lines2.append(line)  # Collect the converted lines in a separate list

f_out.write(string.join(lines2, ''))  # Write
f_in.close()
f_out.close()

test.txt


This is sample text.
Insert a comma every 3 digits.
iPad mini 36800 yen

test_out.txt


This is a sample text.
Insert a comma every 3 digits.
iPad mini 36,800 yen
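
The regular expression that inserts the commas is compact, so here is a small sketch that isolates it and annotates each part. The helper name `add_commas` and the extra test strings are assumptions added for illustration; only the pattern itself comes from the script.

```python
import re

# (\d)          capture one digit ...
# (?=(\d{3})+   ... that is followed by one or more groups of three digits
# (?!\d))       ... with no further digit after those groups
THOUSANDS = re.compile(r'(\d)(?=(\d{3})+(?!\d))')


def add_commas(text):
    """Insert a comma after every digit that precedes complete three-digit groups."""
    return THOUSANDS.sub(r'\1,', text)


print(add_commas('iPad mini 36800 yen'))  # iPad mini 36,800 yen
print(add_commas('1234567'))              # 1,234,567
print(add_commas('123'))                  # 123
```

Note that the pattern treats every run of digits the same way, so digits after a decimal point are grouped as well (`0.12345` becomes `0.12,345`).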

Postscript: I rewrote the code so that it works with Python 3.3. It turns out that Python 3 can still use the codecs module; the only differences seem to be that replace is called as a method of the str object and that the u'' literal is not used.

python


from __future__ import unicode_literals

If you add this line, all string literals are treated as unicode even without the u'' prefix, so the same code also works on Python 2.6. That may be the best approach at the moment.

python


# coding: UTF-8
from __future__ import unicode_literals  # Treat all string literals as unicode; not required on Python 3
import codecs
import re

f_in  = codecs.open('test.txt', 'r', 'utf-8')
f_out = codecs.open('test_out.txt', 'w', 'utf-8')

lines = f_in.readlines()  # Read all lines
lines2 = []
for line in lines:
    line = line.replace('sample text', 'a sample text')  # Replace text
    line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line)  # Regular expression replacement: insert a comma every 3 digits
    lines2.append(line)  # Collect the converted lines in a separate list

f_out.write(''.join(lines2))  # Write
f_in.close()
f_out.close()
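
As a possible alternative to codecs.open, the sketch below uses io.open, which is available from Python 2.6 and is the same function as the built-in open on Python 3. This is not the approach used in this post, only a variant under stated assumptions: the function name `convert` is hypothetical, and the replacement pair is inferred from the sample files shown earlier. The with statements close both files even if an error occurs.

```python
import io
import re


def convert(src_path, dst_path):
    """Read src_path as UTF-8, apply both replacements, and write dst_path as UTF-8."""
    with io.open(src_path, 'r', encoding='utf-8') as f_in:
        with io.open(dst_path, 'w', encoding='utf-8') as f_out:
            for line in f_in:  # iterate line by line instead of readlines()
                # Replacement pair assumed from the sample input and output
                line = line.replace(u'sample text', u'a sample text')
                # Insert a comma every 3 digits
                line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line)
                f_out.write(line)
```

Calling `convert('test.txt', 'test_out.txt')` would process the sample file from this post in the same way as the scripts above.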
