Python string manipulation master

This article summarizes the basic string operations in Python. It may not offer much to experienced users, but ...

(Addition 2018.12.23: The print syntax has been updated for Python 3. If the code does not work when copied and pasted into Python 2, put from __future__ import print_function at the top of your code.)

Python string = immutable

Python strings are immutable, so even a partial rewrite produces a new string object. For example, the replace method, which performs string replacement, returns another string object with the replaced content.
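
As a minimal sketch of what immutability means in practice (the variable names here are only illustrative), assigning to an index raises a TypeError, while replace leaves the original string untouched and returns a new one.

```python
s = 'Python'

# In-place modification is not allowed: item assignment raises TypeError.
try:
    s[0] = 'J'
    assigned = True
except TypeError:
    assigned = False

# replace() builds and returns a new string; the original is unchanged.
t = s.replace('P', 'J')

print(assigned)  # => False
print(s)         # => 'Python'
print(t)         # => 'Jython'
```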

Concatenation

Use the + operator for concatenation.

a = 'Python'
b = '2.7'
c = a + b
print(c)  # => 'Python2.7'

The + operators are evaluated from left to right, so you can chain as many strings as you like.

a = 'Python'
b = ' is '
c = 'fancy'
print(a + b + c)  # => 'Python is fancy'

You can also join strings with the join method and a list / tuple. As an aside, Ruby's join is an Array method (it takes the separator string as an argument), whereas Python's join is a str method (it takes a list / tuple as an argument). People with Ruby experience need to be careful.

strings = ['dog', 'cat', 'penguin']
print(','.join(strings))  #=> 'dog,cat,penguin'
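
One thing that may trip you up: join only accepts strings. The sketch below (with made-up values) converts each element with str first and confirms that passing integers directly raises a TypeError.

```python
values = (1, 2, 3)

# join() only accepts strings, so convert each element first.
joined = ','.join(str(v) for v in values)
print(joined)  # => '1,2,3'

# Passing non-string elements directly raises TypeError.
try:
    ','.join(values)
    raised = False
except TypeError:
    raised = True
print(raised)  # => True
```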

Repetition

To repeat the same content, apply the * operator with an integer; this generates a string repeated the specified number of times.

s = 'dog?'
print(s * 3)  #=> 'dog?dog?dog?'

Value embedding

There are three ways to embed the value of a variable in a string. There may be others that I don't know of.

sprintf style: '%s, %s' % ('Hello', 'World')
Extended sprintf style: '%(a)s, %(b)s' % dict(a='Hello', b='World')
Use format method: '{0}, {1}'.format('Hello', 'World')

(Note) I'm not sure of the official name of the second one, so I took the liberty of calling it the extended sprintf style.

sprintf style

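If you apply the % operator to a string with a value or a tuple on the right side, the values are expanded as follows. (Multiple values must be given as a tuple; a list is treated as a single value.)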

a = 'Python'
b = 'a programming language'
print('%s is %s' % (a, b))  # => 'Python is a programming language'

c = 'World'
print('Hello, %s!' % c)  # => 'Hello, World!'

You need exactly as many values as there are expansion symbols (such as %s) in the string; no more and no fewer. If there is only one expansion symbol, the value after % does not need to be a tuple. (A one-element tuple is also expanded.) In the above example, the template string of the first print call contains two %s expansion symbols, so the tuple given after % also has two elements. If you want a literal % character in the template string, write %%, that is, two '%' characters.

The format specifiers include the following. If you are not sure which one to use, %s is a reasonable default. For a full explanation of how to write format specifiers, I'll defer to Wikipedia's printf page.

- %s - Expand as a string
- %d - Expand as an integer
- %f - Expand as a decimal number
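
The specifiers also accept a width and precision, as in C's printf. The following is a small sketch with arbitrary sample values, not an exhaustive list of the options.

```python
padded = '%5d' % 42          # right-aligned in a field of width 5
zero_padded = '%05d' % 42    # padded with zeros instead of spaces
rounded = '%.2f' % 3.14159   # two digits after the decimal point
left = '%-6s|' % 'ab'        # left-aligned in a field of width 6
percent = '%d%%' % 100       # %% produces a literal %

print(padded)       # => '   42'
print(zero_padded)  # => '00042'
print(rounded)      # => '3.14'
print(left)         # => 'ab    |'
print(percent)      # => '100%'
```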

When you want to expand a tuple as a string like '(1, 2, 3)', wrap it in a one-element tuple.

tuple_var = (1, 2, 3)
print('tuple_var is: %s' % (tuple_var,))

If you don't do this, Python complains with a TypeError, because there is only one placeholder to fill but three values were given.

Extended sprintf style

"Extended sprintf style" is a name I made up myself (^^;

Put a dict key in parentheses after the % in the format string, and give the dict object on the right side of the % operator. This is useful when you embed the same value repeatedly or when you already have a dict variable.

v = dict(first='Michael', family='Jackson')
print('He is %(first)s, %(first)s %(family)s.' % v)

Using the format method

The format method has its own dedicated template language.

print('{0}, {1}'.format('Hello', 'World'))  #=> 'Hello, World'

For details, see the "Format Specification Mini-Language" section of the Python documentation.
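
To give a feel for that template language, here is a short sketch showing a reused positional field, named fields, and two format specifications. The sample values are arbitrary.

```python
reused = '{0} {1} {0}'.format('a', 'b')  # a positional field can be reused
named = '{first} {family}'.format(first='Michael', family='Jackson')
aligned = '{:>8.3f}'.format(3.14159)     # width 8, right-aligned, 3 decimals
grouped = '{:,}'.format(1234567)         # thousands separator

print(reused)   # => 'a b a'
print(named)    # => 'Michael Jackson'
print(aligned)  # => '   3.142'
print(grouped)  # => '1,234,567'
```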

Replace

s = 'Today is Monday.'
ss = s.replace('Monday', 'Sunday')
print(ss)  # => 'Today is Sunday.'
s2 = 'Hello Hello'
ss2 = s2.replace('Hello', 'Bye')  # Without a third argument, every occurrence is replaced
print(ss2)  # => 'Bye Bye'
s3 = 'World World'
ss3 = s3.replace('World', 'Hello', 1)  # The third argument limits the number of replacements
print(ss3)  # => 'Hello World'

To replace text that matches a pattern, use the sub function of the re (regular expression) module.

import re
s = 'Hello World'
print(re.sub(r"[a-z]", "A", s))  #=> 'HAAAA WAAAA'
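
re.sub can do more than substitute a fixed string. As a rough sketch, the replacement can refer back to captured groups, can be a function, and can be limited with the count argument.

```python
import re

s = 'Hello World'

# \1 and \2 refer to the captured groups, so this swaps the two words.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', s)

# The replacement can also be a function that receives the match object.
upper_vowels = re.sub(r'[aeiou]', lambda m: m.group(0).upper(), s)

# count limits how many matches are replaced.
first_two = re.sub(r'[a-z]', 'A', s, count=2)

print(swapped)       # => 'World Hello'
print(upper_vowels)  # => 'HEllO WOrld'
print(first_two)     # => 'HAAlo World'
```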

Get the Nth character

s = 'abc'
n = 1  # we want the 1st character, 'a'
print(s[n-1])  # => 'a' (indexes are 0-based, so subtract 1)

s2 = 'xyz'
print(s2[-1])  # => 'z' (last character)

Get substring (take out M characters from the Nth character)

s = "This is a pen."
n = 1
m = 4
print(s[n-1:n-1+m])  # 'This'
print(s[0:4])  # 'This'
print(s[-4:-1])  # 'pen'
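
The same slicing can be wrapped in a small helper. Note that substring below is a hypothetical function written for this article, not something built into Python. The sketch also shows that either end of a slice can be omitted and that out-of-range slices are clamped rather than raising an error.

```python
def substring(s, n, m):
    """Return m characters starting at the n-th character (1-based).

    Hypothetical helper for illustration only.
    """
    start = n - 1
    return s[start:start + m]

s = 'This is a pen.'
first = substring(s, 1, 4)
pen = substring(s, 11, 3)
clamped = substring(s, 11, 100)  # slices past the end are clamped, no error

print(first)    # => 'This'
print(pen)      # => 'pen'
print(clamped)  # => 'pen.'
print(s[:4])    # => 'This' (start omitted)
print(s[-4:])   # => 'pen.' (end omitted)
```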

Search

Use find. If you want to search from the end, use rfind. find returns the 0-based index of the first match if the string is found, or -1 if it is not found.

s = 'abcabcabc'
index = s.find('b')  # index is 1 (2nd character)

You can specify the position to start the search with the second argument.

s = 'abcabcabc'
index = s.find('b', 2)  # index is 4 (5th character)

You can find all the targets in the string with the following code.

s = 'abcabcabc'
target = 'b'
index = -1
while True:
	index = s.find(target, index + 1)
	if index == -1:
		break
	print('start=%d' % index)
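
If you need the positions as data rather than printed output, the same loop can be wrapped in a generator. find_all is a name chosen here for illustration. Because the search restarts at index + 1, overlapping matches are reported as well.

```python
def find_all(s, target):
    """Yield every index at which target starts in s (overlaps included)."""
    index = s.find(target)
    while index != -1:
        yield index
        index = s.find(target, index + 1)

positions = list(find_all('abcabcabc', 'b'))
overlapping = list(find_all('aaa', 'aa'))

print(positions)    # => [1, 4, 7]
print(overlapping)  # => [0, 1]
```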

Process one character at a time

Since the string type is also iterable, it can be processed with for as follows. If you want a list of characters, you can use list(strvalue).

for c in 'aiueo':
	print(c)

print(list('hoge'))  # => ['h', 'o', 'g', 'e']

You can also retrieve the characters by index.

s = 'aiueo'
for i in range(len(s)):
	c = s[i]
	print(c)
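
When you need both the index and the character, the built-in enumerate may be a tidier alternative to range(len(s)). A minimal sketch:

```python
s = 'aiueo'
pairs = []
for i, c in enumerate(s):
    pairs.append((i, c))  # collect (index, character) pairs

print(pairs)  # => [(0, 'a'), (1, 'i'), (2, 'u'), (3, 'e'), (4, 'o')]
```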

Remove whitespace at both ends

You can use strip, lstrip, and rstrip. strip returns a string with spaces, tab characters, and line breaks (\r and \n) removed from both ends. lstrip applies the same processing to the left end only, and rstrip applies it to the right end only.

s = ' x '
print('A' + s.strip() + 'B')  # => 'AxB'
print('A' + s.lstrip() + 'B')  # => 'Ax B'
print('A' + s.rstrip() + 'B') # => 'A xB'

Delete line breaks (processing equivalent to perl or ruby chomp)

You can use rstrip. However, rstrip with no argument also removes trailing spaces, so when a line ends with a space followed by a line break and you want to delete only the line break, specify the characters to delete as the argument.

line = 'hoge\n'
msg = line.rstrip() + 'moge'
print(msg)  # => 'hogemoge'

with open('./test.txt') as fh:
	for line in fh:
		no_line_break_line = line.rstrip()
		#Do something


# Delete only line breaks without removing spaces
line_with_space = 'line \n'  # I don't want to remove the whitespace before the line break
print(line_with_space.rstrip('\n'))  # => 'line '
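
Two details are worth checking when you use rstrip as a chomp substitute. The sketch below uses made-up sample lines: Windows-style line endings need '\r\n' as the argument, and because the argument is treated as a set of characters, every trailing line break is removed, not just one.

```python
crlf_line = 'line \r\n'
only_lf_removed = crlf_line.rstrip('\n')    # the '\r' is left behind
both_removed = crlf_line.rstrip('\r\n')     # removes both '\r' and '\n'

print(repr(only_lf_removed))  # => 'line \r'
print(repr(both_removed))     # => 'line '

# The argument is a set of characters, so all trailing '\n' are removed,
# whereas chomp removes at most one line break.
multi = 'line\n\n'.rstrip('\n')
print(repr(multi))  # => 'line'
```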

Convert to all uppercase

Use the upper() method.

print('hello'.upper())  # => 'HELLO'

Make all lowercase

Use the lower() method.

print('BIG'.lower())  # => 'big'

Find out if a string is included as a substring

s = 'abc'
print('b' in s)  #=> True
print('x' in s)  #=> False

Count the number of times a string appears as a substring

You can do it yourself using the find method introduced earlier, but there is a convenient method called count.

s = 'aaabbc'
print(s.count('b'))  #=> 2
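
Note that count only counts non-overlapping occurrences, and it accepts optional start and end positions just like find. A small sketch with arbitrary strings:

```python
non_overlapping = 'aaaa'.count('aa')        # 'aa' + 'aa', overlaps are not counted
from_index = 'abcabcabc'.count('b', 2)      # search from index 2 onwards
in_range = 'abcabcabc'.count('b', 2, 5)     # search only within s[2:5]

print(non_overlapping)  # => 2
print(from_index)       # => 2
print(in_range)         # => 1
```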

Convert int to string

v = 1
print(str(v))
print('%d' % v)

Convert float to string

f = 1.234
print(str(f))  #=> '1.234'
print('%f' % f)  #=> '1.234000'

Convert list to string, convert tuple to string

There are times when you want to express it as a character string in a debug print, etc.

v = [1,2,3]
print(str(v))  #=> '[1, 2, 3]'
print('%s' % v)  #=> '[1, 2, 3]'

If you try to display a single tuple with %s, Python interprets the tuple as the list of values for the template, and you get an error.

v = (1, 2, 3)
print(str(v))  # => '(1, 2, 3)' (good example)
# print('%s' % v)  # You might expect '(1, 2, 3)', but this raises a TypeError
print('%s' % (v,))  # => '(1, 2, 3)' (good example)

You can also assemble the string yourself using join and the like.

v = [1,2,3]
print('<' + ('/'.join([ str(item) for item in v ])) + '>')  #=> '<1/2/3>'

The same is true for tuple objects.

Convert dict to string

There are times when you want to express it as a character string in a debug print, etc.

v = dict(a=1, b=2)
print(str(v))  #=> "{'a': 1, 'b': 2}"
print('%s' % v)  #=> "{'a': 1, 'b': 2}"

You can also use keys, a list comprehension, and join to generate the string in a one-liner.

v = dict(a=1, b=2)
print('<' + ', '.join([ '%s=%s' % (k, v[k]) for k in v.keys() ]) + '>')  #=> '<a=1, b=2>'
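
As a variation, items() gives you each key together with its value, and sorted() makes the output order independent of insertion order. This is only one possible way to write it.

```python
v = dict(b=2, a=1)

# items() yields (key, value) pairs; sorted() orders them by key.
text = '<' + ', '.join('%s=%s' % (key, value) for key, value in sorted(v.items())) + '>'
print(text)  # => '<a=1, b=2>'
```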

Convert bytes to a unicode string

Data read from a file or socket (opened in binary mode) is a raw byte string, so unless you interpret it as a unicode string you cannot operate on it character by character. In the Python 2 series (2.7 etc.), str (byte string) and unicode (character string) are distinct types, and where multibyte characters are expected as input, such as in web applications, it is better to handle text as unicode objects. Use the decode() method to interpret a byte string as a unicode string in the specified encoding.

In the Python 3 series, the str type is the character string type (corresponding to the Python 2 unicode type), and the bytes type is the byte string type (corresponding to the Python 2 str type).

with open('utf8_content_file.txt', 'rb') as fh:  # 'rb' means binary mode
    byte_content = fh.read()  # Read everything; still a byte sequence at this point
    print(len(byte_content))  # Number of bytes
    unicode_string = byte_content.decode('utf-8')  # Interpret the bytes as UTF-8 encoded characters
    print(len(unicode_string))  # Number of characters

In Python 3, the default encoding of the decode() method is utf-8, so if you know that the byte string to be interpreted is UTF-8, you can omit the encoding.

bytes_data = b'\xe3\x83\x90\xe3\x82\xa4\xe3\x83\x88\xe5\x88\x97'
print(bytes_data.decode())  # => 'バイト列' ("byte sequence" in Japanese)

The encodings that are often used in Japanese are listed below.

- utf_8 - UTF-8 (also known as: utf-8, U8, utf8, cp65001)
- shift_jis - Shift JIS (also known as: csshiftjis, shiftjis, sjis, s_jis)
- cp932 - Shift JIS (Extended Shift JIS) (also known as: 932, ms932, mskanji, ms-kanji)
- euc_jp - EUC-JP (also known as: eucjp, ujis, u-jis)
- iso2022_jp - JIS (ISO-2022-JP) (also known as: csiso2022jp, iso2022jp, iso-2022-jp)

Other encodings supported by Python can be found on the codecs module page: https://docs.python.org/ja/3/library/codecs.html
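
What happens when the encoding passed to decode does not match the data? The sketch below (Python 3, with a sample string chosen for illustration) encodes text as Shift JIS and then tries to decode it as UTF-8. By default this raises UnicodeDecodeError; with errors='replace', undecodable bytes become U+FFFD instead.

```python
text = 'バイト列'
sjis_bytes = text.encode('shift_jis')

# Decoding with the matching codec restores the original string.
restored = sjis_bytes.decode('shift_jis')
print(restored == text)  # => True

# Decoding with the wrong codec raises UnicodeDecodeError ...
try:
    sjis_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # => True

# ... unless an error handler is given; 'replace' substitutes U+FFFD.
lossy = sjis_bytes.decode('utf-8', errors='replace')
print('\ufffd' in lossy)  # => True
```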

Convert a unicode string to bytes

Conversely, when writing to a file or socket (opened in binary mode), the string must be converted to a byte string. In that case, use the encode() method of the unicode object.

unicode_string = u'String of multibyte characters'
with open('./utf8_content_file.txt', 'wb') as fh:  # Open for writing in binary mode
    byte_content = unicode_string.encode('utf-8')  # Get the byte string for the UTF-8 encoding
    fh.write(byte_content)  # Write the byte string

Likewise, if you do not pass an encoding to the encode() method, in Python 3 it behaves as if you had passed utf-8.

str_data = 'バイト列'  # "byte sequence" in Japanese
print(str_data.encode())  # => b'\xe3\x83\x90\xe3\x82\xa4\xe3\x83\x88\xe5\x88\x97'
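
The number of bytes depends on the encoding, which is why the character count and the byte count differ. As a rough illustration in Python 3, the same four characters take 12 bytes in UTF-8 and 8 bytes in Shift JIS, and cannot be encoded as ASCII at all.

```python
text = 'バイト列'
utf8 = text.encode('utf-8')
sjis = text.encode('shift_jis')

print(len(text))  # => 4 (characters)
print(len(utf8))  # => 12 (bytes)
print(len(sjis))  # => 8 (bytes)

# Characters that the target encoding cannot represent raise UnicodeEncodeError.
try:
    text.encode('ascii')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # => False
```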

Use a template engine

Template engines are so feature-rich that I will only list a few major libraries here.

jinja2
mako

jinja2 is probably the most widely used.

Reference links

- 7.1. string — General string operations (Python2.7)
- 6.1. string — General string operations (Python3.5.1)