A memo to keep for the next time someone asks "why are you using a for loop?" when processing text in pandas.
It collects information useful for preprocessing Japanese text.
If you know a better way to do any of this, I would appreciate hearing about it.
Execution environment
pandas 0.25.3
TL;DR

--Simple processing is already implemented as methods under `df["column name"].str`
--For anything not implemented in pandas, use `df["column name"].apply()`
The sample data is store information for women's fashion brands scraped from their websites. The company name, brand name, store name, and address are stored in a CSV file.
Because it was scraped from multiple sites, the data is not normalized: half-width and full-width characters and whitespace are mixed, and the postal code may or may not be present.
The table below is an example of the data; the execution results in this article use it.
company | brand | location | address |
---|---|---|---|
pal | BONbazaar | Bombazaar Tokyo Dome City LaQua | 1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F |
world | index | Emio style | 1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio Style 1F |
pal | Whim Gazette | Marunouchi store | 〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F |
stripe | SEVENDAYS=SUNDAY | Aeon Mall Urawa Misono 2F | 5-50-1, Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono |
pal | mystic | Funabashi store | 〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 LaLaport TOKYO-BAY LaLaport 3 |
pal | pual ce cin | Ofuna Lumine Wing Store | 〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 Lumine Wing 4F |
stripe | Green Parks | sara Shapo Koiwa store | 7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F |
pal | Discoat | Discoat Petit Ikebukuro Shopping Park | 〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F |
adastoria | niko and... | Atre Kawagoe | 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F |
pal | CIAOPANIC TYPY | Kameari store | 〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F |
Since only one column (the `address` column) is used for the explanation, most of what follows operates on a Series, but DataFrame processing is also described where possible.
Basically, `Series.str` exposes the Python string methods and [regular expression operations](https://docs.python.org/ja/3/library/re.html), so you can use those directly.

--What follows is an excerpt of frequently used methods
--If you want to see all of them, please refer to the official documentation
--Reference: Series.str search results - Working with text data — pandas 1.0.1 documentation
strip
Delete the whitespace characters at the beginning and end of the character string.
df['strip']=df['address'].str.strip()
Of course, `Series.str.lstrip`, which strips only the leading whitespace, and `Series.str.rstrip`, which strips only the trailing whitespace, are also implemented.
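Since `Series.str.strip` simply applies Python's built-in string methods element-wise, the behavior can be checked with plain strings. A minimal sketch (the sample text is made up):

```python
# Plain-Python equivalents of Series.str.strip / lstrip / rstrip
text = "  1-1 Kasuga, Bunkyo-ku, Tokyo  "

stripped = text.strip()   # removes whitespace at both ends
left = text.lstrip()      # removes leading whitespace only
right = text.rstrip()     # removes trailing whitespace only

print(stripped)  # -> "1-1 Kasuga, Bunkyo-ku, Tokyo"
```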
split, rsplit
--Splits the string by the specified separator and returns the result as a *list*
--With `expand=True`, it splits into multiple columns
--Note that the number of columns matches the row with the most splits
--In the example below the data is split into three columns; rows whose text cannot be split into three are padded with `None`
df['address'].str.split(expand=True)
0 | 1 | 2 |
---|---|---|
1-1 Kasuga, Bunkyo-ku, Tokyo-1 | LaQua Building | 2F |
1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 | Emio style | 1F |
〒100-6390 | 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 | Marunouchi Building B1F |
5-50-1 Misono, Midori-ku, Saitama-shi, Saitama | Aeon Mall Urawa Misono | None |
〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 | LaLaport TOKYO-BAY LaLaport 3 | None |
〒247-0056 | 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 | Lumine Wing 4F |
7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa | 1F | None |
〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP | B1F | None |
Wakitamachi, Kawagoe City, Saitama Prefecture | 105 | Atre Kawagoe 4F |
〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F | None | None |
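What `expand=True` does can be sketched in plain Python. This is a hypothetical re-implementation for illustration, not pandas internals: split every string, then pad shorter rows with `None` up to the widest row.

```python
# Hypothetical sketch of expand=True: pad short rows with None
addresses = [
    "1-1 Kasuga LaQua 2F",   # splits into 4 tokens
    "Aeon Mall",             # splits into 2 tokens
]
rows = [a.split() for a in addresses]
width = max(len(r) for r in rows)
padded = [r + [None] * (width - len(r)) for r in rows]
print(padded[1])  # -> ['Aeon', 'Mall', None, None]
```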
rsplit

--Splits from the right side
--You can limit the number of splits by specifying `n`
--For example, with `n=1` the string is split only once, so the number of columns is fixed at 2

df['address'].str.rsplit(expand=True, n=1)
0 | 1 |
---|---|
1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building | 2F |
1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio style | 1F |
〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 | Marunouchi Building B1F |
5-50-1 Misono, Midori-ku, Saitama-shi, Saitama | Aeon Mall Urawa Misono |
〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 | LaLaport TOKYO-BAY LaLaport 3 |
〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 | Lumine Wing 4F |
7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa | 1F |
〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP | B1F |
105 Wakitamachi, Kawagoe City, Saitama Prefecture | Atre Kawagoe 4F |
〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F | None |
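pandas's `n` corresponds to `maxsplit` in Python's `str.rsplit`; with one split from the right you always get at most two pieces. A plain-string sketch (the address is made up):

```python
# str.rsplit with maxsplit=1 splits only once, starting from the right
address = "1-1 Kasuga LaQua Building 2F"
parts = address.rsplit(maxsplit=1)
print(parts)  # -> ['1-1 Kasuga LaQua Building', '2F']
```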
find
--Same behavior as Python's `str.find`
--Returns the position of the substring if it is present, **-1** if it is not
df['address'].str.find('Tokyo')
>> 0 0
1 0
2 10
3 -1
4 -1
5 -1
6 0
7 9
8 -1
9 9
--Since the return value is not a bool, you can **not** filter rows like this:
--df[df['address'].str.find('Tokyo')]
--df.query('address.str.find("Tokyo")')
--If you just want to know whether a string is included, as in `'hoge' in 'hogehoge'`, use `contains` instead.
# That said, you can still select rows containing "Tokyo" by comparing with -1
df.query('address.str.find("Tokyo")!=-1')
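The -1 convention is the same as the built-in `str.find`, which is why the comparison with -1 is needed. A plain-string sketch (sample strings are made up):

```python
# str.find returns an index, or -1 when not found -- not a bool
hit = "Tokyo Dome City".find("Tokyo")    # found at position 0
miss = "Misono, Saitama".find("Tokyo")   # not found
print(hit, miss)  # -> 0 -1

# Because 0 is falsy and -1 is truthy, filter with != -1, never by truthiness
found = [s for s in ["Tokyo Dome City", "Misono, Saitama"] if s.find("Tokyo") != -1]
```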
normalize (Series.str.normalize)

--Character normalization, equivalent to `unicodedata.normalize`
--Mainly converts full-width numbers and symbols to half-width, and half-width katakana to full-width. Specify 'NFKC' (Normal Form KC) for `form`.
# string
import unicodedata
unicodedata.normalize('NFKC', '１２３！？＠＃ﾊﾝｶｸｶﾀｶﾅ')
>> '123!?@#ハンカクカタカナ'
# pandas
df['normalize'] = df['address'].str.normalize(form='NFKC')
Reference: Character list normalized by Python unicodedata.normalize ('NFKC', x)
findall
--Equivalent to `re.findall()`
--Unlike `str.find()`, regular expressions can be used here
--Returns all matching substrings
df['address'].str.findall('(.{2}Ward)')
>> 0 [Bunkyo Ward]
1 [Shinjuku ward]
2 [Daita Ward]
3 [City Midori Ward]
4 []
5 []
6 [Togawa Ward]
7 [Toshima ward]
8 []
9 [Katsushika]
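`Series.str.findall` applies `re.findall` to each element, returning every match rather than just the first. A sketch with a made-up pattern and address:

```python
import re

# re.findall returns ALL matches as a list (here: every run of digits)
address = "〒171-8532 1-29-1 Minamiikebukuro"
numbers = re.findall(r"[0-9]+", address)
print(numbers)  # -> ['171', '8532', '1', '29', '1']
```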
contains
--Note the trailing **s**: contain**s**
--Functionally equivalent to `re.search()`
--Since it returns bool values, it can be used to filter the data
df['address'].str.contains('.{2}Ward')
>> 0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 False
9 True
# Display only the rows containing "○○ Ward"
df.query('address.str.contains(".{2}Ward")')['address']
>>0 1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F
1 Takadanobaba, Shinjuku-ku, Tokyo 1-35-3 Emio Style 1F
2 〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono
6 7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F
7 〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F
9 〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F
By converting the bool values to int, you can create a column that is 1 for rows containing a given string.
This is convenient for feature engineering.
# Create a tokyo_flg column flagging rows that contain "Tokyo"
df['tokyo_flg'] = df['address'].str.contains("Tokyo").astype(int)
df['tokyo_flg']
>> 0 1
1 1
2 1
3 0
4 0
5 0
6 1
7 1
8 0
9 1
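The bool-to-int trick relies on `int(True) == 1`; element-wise it looks like the following plain-Python sketch of `contains(...).astype(int)` (sample data made up):

```python
import re

addresses = ["Kasuga, Bunkyo-ku, Tokyo", "Misono, Saitama"]
# bool(re.search(...)) mirrors str.contains; int() turns True/False into 1/0
flags = [int(bool(re.search("Tokyo", a))) for a in addresses]
print(flags)  # -> [1, 0]
```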
extract (Series.str.extract)

--Returns the matched pattern as columns
--Rows with no match become NaN, so apply `dropna()` if you don't need them
--If you use named groups, the group name becomes the column name as-is
--The named-group syntax is `(?P<name>...)`
df['address'].str.extract('(Tokyo|Kanagawa Prefecture)([^Ward city]+[Ward city])').dropna()
df['address'].str.extract('(?P<pref>Tokyo|Kanagawa Prefecture)(?P<city>[^Ward city]+[Ward city])').dropna()
pref | city | |
---|---|---|
0 | Tokyo | Bunkyo Ward |
1 | Tokyo | Shinjuku ward |
2 | Tokyo | Chiyoda Ward |
5 | Kanagawa Prefecture | Kamakura city |
6 | Tokyo | Edogawa Ward |
7 | Tokyo | Toshima ward |
9 | Tokyo | Katsushika |
This lets you build a table of Tokyo's 23 wards and Kanagawa prefecture's cities.
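The named groups behave exactly like `re` named groups; `groupdict()` shows how the names map to values. A sketch with a simplified, made-up pattern:

```python
import re

# (?P<name>...) names each capture; pandas turns the names into column names
pattern = r"(?P<pref>Tokyo|Kanagawa) (?P<city>\w+)"
m = re.search(pattern, "Tokyo Bunkyo")
print(m.groupdict())  # -> {'pref': 'Tokyo', 'city': 'Bunkyo'}
```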
replace
--Equivalent to `re.sub()`
--`Series.str.replace(pat, repl)` converts strings matching `pat` into `repl`
--Often used to remove unnecessary substrings on a rule basis
# Delete the zip code
df['address'] = df['address'].str.replace(r"〒[0-9]{3}\-[0-9]{4}", "")
Incidentally, unlike `Series.str.replace`, `Series.replace` can also be passed a dictionary.
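The same zip-code removal can be done with `re.sub` directly; the pattern mirrors the one above, and the sample string is made up:

```python
import re

address = "〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo"
# Remove the postal code and any whitespace that follows it
cleaned = re.sub(r"〒[0-9]{3}-[0-9]{4}\s*", "", address)
print(cleaned)  # -> "2 Marunouchi, Chiyoda-ku, Tokyo"
```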
If you want to use your own functions or packaged functions (neologdn, mecab, etc.), use `Series.apply`.
By passing a function as `Series.apply(func)`, you can run that function on each element of the Series. You can also pass a lambda.
# Insert 'Street address' at the beginning of each string
df['address'].apply(lambda x: 'Street address' + x)
The rest of this article covers actual text preprocessing.
neologdn 0.4
A Japanese text normalization package that can normalize long vowels and tildes, which the standard library's normalize alone cannot handle.
If you run it before analyzing with mecab or before extracting strings with regular expressions, the regular expressions you have to write become simpler, so for Japanese text it is best to run it first.
import neologdn
df['neologdn'] = df['address'].apply(neologdn.normalize)
# You can also use DataFrame.apply with a lambda
df['neologdn'] = df.apply(lambda x: neologdn.normalize(x['address']), axis=1)
mecab-python3
mecab-python3 0.996.3
If you only want the word-segmentation (wakati) result, specify `-Owakati`.
import MeCab
# Specify the dictionary path with `-d`
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/ipadic/')
df['neologdn'].apply(tagger.parse)
In this case the trailing newline `\n` is included, so if you want to remove it, define your own function or use a lambda.
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/ipadic/')
#Define function
def my_parser(text):
res = tagger.parse(text)
return res.strip()
df['neologdn'].apply(my_parser)
#You don't have to declare a function if you use a lambda function
df['neologdn'].apply(lambda x : tagger.parse(x).strip())
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama City, Saitama Prefecture 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F
Sudachipy
SudachiDict-core 20190718
SudachiPy 0.4.2
With SudachiPy, if you want a word-segmented result like mecab above, you need a function that extracts only the surface forms from the analysis result objects.
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
def sudachi_tokenize(text):
res = tokenizer_obj.tokenize(text, mode)
return ' '.join([m.surface() for m in res])
df['address'].apply(sudachi_tokenize)
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F
By the way, unlike mecab's ipadic, Sudachi's `SplitMode.C` seems to keep addresses together as a single token.
(Perhaps it treats prefecture + city/ward/town/village as a named entity?)
Sudachi has three split units (SplitMode), so let's customize the earlier `sudachi_tokenize` function so that the mode can be specified.
Since `Series.apply` can forward extra arguments, you can add a parameter on the function side and specify it on the `apply` side.
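Conceptually, `Series.apply(func, kwarg=value)` just forwards the extra keyword arguments to `func` for every element. A plain-Python sketch of that forwarding (the function names here are made up stand-ins, not SudachiPy code):

```python
def tokenize_with_mode(text, mode="A"):
    # Dummy stand-in for a real tokenizer: just tags the mode
    return f"[{mode}] {text}"

def apply_like(values, func, **kwargs):
    # What Series.apply does with extra keyword arguments, conceptually
    return [func(v, **kwargs) for v in values]

out = apply_like(["Kasuga", "Misono"], tokenize_with_mode, mode="C")
print(out)  # -> ['[C] Kasuga', '[C] Misono']
```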
def sudachi_tokenize_with_mode(text, mode):
res = tokenizer_obj.tokenize(text, mode)
return ' '.join([m.surface() for m in res])
df['address'].apply(sudachi_tokenize_with_mode, mode=tokenizer.Tokenizer.SplitMode.A)
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama City, Saitama Prefecture 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F
With `SplitMode.A`, the result is almost the same as mecab's.
Sudachi also has a normalization feature that corrects variant spellings.
Let's return `normalized_form` at the same time and turn the result into a DataFrame.
Since `Series.apply` has no expand option, we run it with `result_type='expand'` in `DataFrame.apply`.
def sudachi_tokenize_multi(text):
res = tokenizer_obj.tokenize(text, mode)
return ' '.join([m.surface() for m in res]), ' '.join([m.normalized_form() for m in res])
df.apply(lambda x: sudachi_tokenize_multi(x['neologdn']), axis=1, result_type='expand')
0 | 1 | |
---|---|---|
0 | 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F | 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 f |
1 | 1 Takadanobaba, Shinjuku-ku, Tokyo- 35 -3 Emio Style 1 F | 1 Takadanobaba, Shinjuku-ku, Tokyo- 35 -3 Emio Style 1 f |
2 | 2 Marunouchi, Chiyoda-ku, Tokyo- 4 -1 Marunouchi Building B 1 F | 2 Marunouchi, Chiyoda-ku, Tokyo- 4 -1 Marunouchi Building b 1 f |
3 | 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono | 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono |
4 | 2 Hamacho, Funabashi City, Chiba Prefecture- 1 -1 LaLaport TOKYO-BAY LaLaport 3 | 2 Hamacho, Funabashi City, Chiba Prefecture- 1 -1 LaLaport Tokyo-Bay LaLaport 3 |
5 | 1 Ofuna, Kamakura City, Kanagawa Prefecture- 4 -1 Lumine Wing 4 F | 1 Ofuna, Kamakura City, Kanagawa Prefecture- 4 -1 Lumine Wing 4 f |
6 | 7 Minamikoiwa, Edogawa-ku, Tokyo- 24 -15 Shapo Koiwa 1 F | 7 Minamikoiwa, Edogawa-ku, Tokyo- 24 -15 Shappo Koiwa 1 f |
7 | 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F | 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP b 1 f |
8 | 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F | 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 f |
9 | 3 Kameari, Katsushika-ku, Tokyo- 49 -3 Ario Kameari 2 F | 3 Kameari, Katsushika-ku, Tokyo- 49 -3 Ario Kameari 2 f |
For addresses I did not expect much effect, but *Chapo* was normalized to *Shappo*, and *TOKYO-BAY* to *Tokyo-Bay*. Alphabet characters such as F are also lowercased, for some reason.
As for my personal preprocessing order: if you use SudachiPy, which has its own normalization, it may be better to tokenize first and then delete or fill in characters.
As I wrote at the beginning, I would appreciate hearing about better processing methods.