[Python] Japanese text preprocessing in pandas without for statements

A memo on processing text in pandas without resorting to for statements, so nobody has to ask "why are you using a for loop here?"

It collects techniques useful for preprocessing Japanese text.

I would appreciate it if you could tell me if there is a better processing method.

Execution environment

pandas 0.25.3

TL;DR

- Simple processing is available as methods on `df["column name"].str`
- For anything not implemented in pandas, use `df["column name"].apply()`

Sample data

Store information for ladies' fashion brands scraped from their websites. The company name, brand name, store name, and address are stored in a csv.

Because it is scraped from multiple websites, things like half-width vs. full-width characters and whitespace are not unified. The zip code may or may not be included.

The table below shows an example of the data; the execution results that follow also use it.

company brand location address
pal BONbazaar Bombazaar Tokyo Dome City LaQua 1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F
world index Emio style 1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio Style 1F
pal Whim Gazette Marunouchi store 〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F
stripe SEVENDAYS=SUNDAY Aeon Mall Urawa Misono 2F 5-50-1, Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono
pal mystic Funabashi store 〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 LaLaport TOKYO-BAY LaLaport 3
pal pual ce cin Ofuna Lumine Wing Store 〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 Lumine Wing 4F
stripe Green Parks sara Shapo Koiwa store 7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F
pal Discoat Discoat Petit Ikebukuro Shopping Park 〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F
adastoria niko and... Atre Kawagoe 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F
pal CIAOPANIC TYPY Kameari store 〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F

Since only one column (the `address` column) is needed for the explanation, mostly Series operations are introduced, but DataFrame operations are also described where possible.

processing

Basically, `Series.str` lets you call Python's string methods and [regular expression operations](https://docs.python.org/3/library/re.html), so you can use these directly.

- Only frequently used items are excerpted here
- To check all the methods, please refer to the official documentation
- Reference: Series.str search results - Working with text data — pandas 1.0.1 documentation

String methods

strip

Series.str.strip

Delete the whitespace characters at the beginning and end of the character string.

df['strip']=df['address'].str.strip()

Of course, `Series.str.lstrip`, which strips only the beginning, and `Series.str.rstrip`, which strips only the end, are also implemented.
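As a minimal sketch of the three variants (the Series here is made up for illustration):

```python
import pandas as pd

# Hypothetical Series with leading and trailing whitespace
s = pd.Series(['  Tokyo  ', '\tOsaka\n'])

both = s.str.strip()    # whitespace removed from both ends
left = s.str.lstrip()   # only leading whitespace removed
right = s.str.rstrip()  # only trailing whitespace removed

print(both.tolist())   # ['Tokyo', 'Osaka']
```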

split, rsplit

Series.str.split

- Splits the string by the specified separator and returns it as a *list*
- With `expand=True`, it splits into multiple columns
  - Note that the number of columns matches the row with the most splits
  - In the example below the data is split into three columns; rows that cannot be split into three are padded with `None`

df['address'].str.split(expand=True)
0 1 2
1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F
1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio style 1F
〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F
5-50-1 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono None
〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 LaLaport TOKYO-BAY LaLaport 3 None
〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 Lumine Wing 4F
7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F None
〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F None
Wakitamachi, Kawagoe City, Saitama Prefecture 105 Atre Kawagoe 4F
〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F None None
df['address'].str.rsplit(expand=True, n=1)
0 1
1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F
1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio style 1F
〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F
5-50-1 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono
〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 LaLaport TOKYO-BAY LaLaport 3
〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 Lumine Wing 4F
7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F
〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F
105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F
〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F None
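The padding behavior above can be sketched with a tiny made-up Series; the column names assigned at the end are illustrative assumptions:

```python
import pandas as pd

# Made-up data: the second row splits into only two tokens
s = pd.Series(['a b c', 'x y'])

parts = s.str.split(expand=True)  # shorter rows are padded with None
parts.columns = ['first', 'second', 'third']  # rename the generated 0, 1, 2 columns

print(parts)
```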

find

Series.str.find

- Same function as `str.find`
- Returns the position of the substring if it is contained, **-1** if it is not

df['address'].str.find('Tokyo')
>> 0    0
1    0
2    10
3    -1
4    -1
5    -1
6    0
7    9
8    -1
9    9

- Since the return value is not a bool, the following usages do **not** work for filtering the data:
  - `df[df['address'].str.find('Tokyo')]`
  - `df.query('address.str.find("Tokyo")')`
- If you just want to know whether a string is contained (like `hoge in hogehoge`), you should use `contains` instead

# That said, you can still select rows containing "Tokyo" like this
df.query('address.str.find("Tokyo")!=-1')

normalize

Series.str.normalize

- Character normalization
- Equivalent to `unicodedata.normalize`

Mainly, full-width alphanumerics and symbols are converted to half-width, and half-width katakana is converted to full-width. Specify `'NFKC'` (Normalization Form KC) for `form`.

# string
import unicodedata
unicodedata.normalize('NFKC', '１２３！？＠＃ ﾊﾝｶｸｶﾀｶﾅ')
>> '123!?@# ハンカクカタカナ'

# pandas
df['normalize'] = df['address'].str.normalize(form='NFKC')

Reference: Character list normalized by Python unicodedata.normalize ('NFKC', x)

Regular expression methods

findall

Series.str.findall

- Equivalent to `re.findall()`
- Unlike `str.find()`, regular expressions can be used here
- Returns all matching words

df['address'].str.findall('(.{2}Ward)')
>> 0    [Bunkyo Ward]
1    [Shinjuku ward]
2    [Daita Ward]
3    [City Midori Ward]
4       []
5       []
6    [Togawa Ward]
7    [Toshima ward]
8       []
9    [Katsushika]

contains

Series.str.contains

- Note the trailing **s**: it is contain**s**
- Functionally equivalent to `re.search()`
- Since it returns bool values, it can be used to filter the data

df['address'].str.contains('.{2}Ward')
>> 0     True
1     True
2     True
3     True
4    False
5    False
6     True
7     True
8    False
9     True

# Display only rows containing "○○ Ward"
df.query('address.str.contains(".{2}Ward")')['address']
>>0 1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F
1 Takadanobaba, Shinjuku-ku, Tokyo 1-35-3 Emio Style 1F
2    〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono
6 7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F
7             〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F
9               〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F
Flag columns that contain strings

By converting the bool values to int, you can create a column that is 1 for rows containing a certain string. This is convenient when creating features for machine learning.

# Create a tokyo_flg column flagging rows that contain "Tokyo"
df['tokyo_flg'] = df['address'].str.contains("Tokyo").astype(int)
df['tokyo_flg']
>> 0    1
1    1
2    1
3    0
4    0
5    0
6    1
7    1
8    0
9    1

extract

Series.str.extract

- Returns the matched pattern
- Rows with no match become NaN, so apply `dropna()` if they are not needed
- If you use named groups, the group names become the column names as-is
- A named group is written `(?P<name>...)`

df['address'].str.extract('(Tokyo|Kanagawa Prefecture)([^Ward city]+[Ward city])').dropna()

df['address'].str.extract('(?P<pref>Tokyo|Kanagawa Prefecture)(?P<city>[^Ward city]+[Ward city])').dropna()
pref city
0 Tokyo Bunkyo Ward
1 Tokyo Shinjuku ward
2 Tokyo Chiyoda Ward
5 Kanagawa Prefecture Kamakura city
6 Tokyo Edogawa Ward
7 Tokyo Toshima ward
9 Tokyo Katsushika

You can create tables for Tokyo's 23 wards and Kanagawa prefecture's XX city.
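Since `extract` returns a DataFrame, the named-group columns can be joined straight back onto the original frame. A sketch with made-up romanized addresses (the regex and column names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'address': ['Tokyo Chiyoda-ku Marunouchi',
                               'Kanagawa-ken Kamakura-shi Ofuna']})

# Named groups become the column names of the returned DataFrame
parts = df['address'].str.extract(r'(?P<pref>Tokyo|Kanagawa-ken)\s(?P<city>\S+)')
df = df.join(parts)  # attach pref/city back onto the original frame

print(df[['pref', 'city']])
```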

replace

Series.str.replace

- Equivalent to `re.sub()`
- `Series.str.replace(pat, repl)` converts strings matching `pat` to `repl`
- Often used to remove unnecessary strings on a rule basis

# Remove zip codes
df['address'] = df['address'].str.replace(r'〒[0-9]{3}-[0-9]{4}', '')

By the way, unlike `Series.str.replace`, `Series.replace` can also be passed a dictionary.
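The difference can be sketched with made-up values: `Series.str.replace` substitutes substrings (or regex matches) inside each string, while `Series.replace` maps whole values and accepts a dict:

```python
import pandas as pd

s = pd.Series(['Tokyo-to', 'Osaka-fu'])

# str.replace: substring/regex substitution inside each string
sub = s.str.replace('-to', '', regex=True)

# Series.replace: whole-value mapping via a dict
mapped = s.replace({'Tokyo-to': 'Tokyo', 'Osaka-fu': 'Osaka'})

print(sub.tolist())     # ['Tokyo', 'Osaka-fu']
print(mapped.tolist())  # ['Tokyo', 'Osaka']
```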

Do something that pandas doesn't have

If you want to use your own functions or packaged functions (neologdn, mecab, etc.)

Use `Series.apply`. By passing a function like `Series.apply(func)`, the function is applied to each element of the Series. You can also pass a lambda.

# Insert 'Street address' at the beginning of each text
df['address'].apply(lambda x:  'Street address' + x)

The actual text preprocessing will be summarized below.

neologdn (text preprocessing)

neologdn 0.4

A Japanese text normalization package that can normalize things like long vowels and tildes, which the standard library's normalize alone cannot handle.

If you run it before analyzing with mecab or before extracting strings with regular expressions, the regular expressions you need to write become simpler, so for Japanese text it is better to execute it first.

import neologdn
df['neologdn'] = df['address'].apply(neologdn.normalize)

# You can also use DataFrame.apply with a lambda
df['neologdn'] = df.apply(lambda x: neologdn.normalize(x['address']), axis=1)

Word-separation

mecab-python3

mecab-python3 0.996.3

If you want only the word-separated result, specify -Owakati.

import MeCab

# Specify the dictionary path with -d
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/ipadic/')
df['neologdn'].apply(tagger.parse)

In this case a trailing newline `\n` is also attached, so if you want to remove it, define your own function or use a lambda.

tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/ipadic/')

# Define a function
def my_parser(text):
    res = tagger.parse(text)
    return res.strip()

df['neologdn'].apply(my_parser)


# With a lambda, you don't need to declare a separate function
df['neologdn'].apply(lambda x : tagger.parse(x).strip())
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama City, Saitama Prefecture 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F

SudachiPy

SudachiDict-core 20190718
SudachiPy 0.4.2

In SudachiPy, if you want to get the word-separation result like the above mecab, you need to create a function that returns only the surface layer from the analysis result object.

from sudachipy import tokenizer
from sudachipy import dictionary


tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

def sudachi_tokenize(text):
    res = tokenizer_obj.tokenize(text, mode)
    return ' '.join([m.surface() for m in res])

df['address'].apply(sudachi_tokenize)
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F

By the way, unlike mecab's ipadic, Sudachi's SplitMode.C seems to keep addresses together. (Perhaps it treats prefecture + city/ward/town/village as a named entity?)

Pass mode as an argument

In addition to the above results, Sudachi has three split units (SplitMode), so let's customize the earlier sudachi_tokenize function so the mode can be specified.

Since `Series.apply` can forward additional arguments, you can add a parameter on the function side and specify it on the `apply` side.

def sudachi_tokenize_with_mode(text, mode):
    res = tokenizer_obj.tokenize(text, mode)
    return ' '.join([m.surface() for m in res])

df['address'].apply(sudachi_tokenize_with_mode, mode=tokenizer.Tokenizer.SplitMode.A)
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama City, Saitama Prefecture 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F

With SplitMode.A, the result is almost the same as the result of mecab.

Use expand

Sudachi has a normalization feature that corrects variant spellings (for example, normalizing *シュミレーション* to *シミュレーション*).

Consider returning normalized_form at the same time and making it a DataFrame.

Since `Series.apply` does not have an expand option, try executing it with `DataFrame.apply` and `result_type='expand'`.

def sudachi_tokenize_multi(text):
    res = tokenizer_obj.tokenize(text, mode)
    return ' '.join([m.surface() for m in res]), ' '.join([m.normalized_form() for m in res])

df.apply(lambda x: sudachi_tokenize_multi(x['neologdn']), axis=1, result_type='expand')
0 1
0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 f
1 1 Takadanobaba, Shinjuku-ku, Tokyo- 35 -3 Emio Style 1 F 1 Takadanobaba, Shinjuku-ku, Tokyo- 35 -3 Emio Style 1 f
2 2 Marunouchi, Chiyoda-ku, Tokyo- 4 -1 Marunouchi Building B 1 F 2 Marunouchi, Chiyoda-ku, Tokyo- 4 -1 Marunouchi Building b 1 f
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono
4 2 Hamacho, Funabashi City, Chiba Prefecture- 1 -1 LaLaport TOKYO-BAY LaLaport 3 2 Hamacho, Funabashi City, Chiba Prefecture- 1 -1 LaLaport Tokyo-Bay LaLaport 3
5 1 Ofuna, Kamakura City, Kanagawa Prefecture- 4 -1 Lumine Wing 4 F 1 Ofuna, Kamakura City, Kanagawa Prefecture- 4 -1 Lumine Wing 4 f
6 7 Minamikoiwa, Edogawa-ku, Tokyo- 24 -15 Shapo Koiwa 1 F 7 Minamikoiwa, Edogawa-ku, Tokyo- 24 -15 Shappo Koiwa 1 f
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP b 1 f
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 f
9 3 Kameari, Katsushika-ku, Tokyo- 49 -3 Ario Kameari 2 F 3 Kameari, Katsushika-ku, Tokyo- 49 -3 Ario Kameari 2 f

For addresses I did not expect much effect, but *Shapo* was converted to *Shappo*, and *TOKYO-BAY* was converted to *Tokyo-Bay*. For some reason alphabetic characters like F are also lowercased.

Other

My personal preprocessing order is:

  1. neologdn
  2. Remove unnecessary strings with regular expressions and fill in missing values
  3. Divide with mecab

However, SudachiPy also has a normalization function, so with it you may prefer to tokenize first and then delete or fill in characters.
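Step 2 of the order above can be sketched with pandas alone (steps 1 and 3 need neologdn and mecab installed, so they are omitted here); the data and patterns are illustrative assumptions:

```python
import pandas as pd

# Made-up addresses with a zip code, extra whitespace, and a missing value
s = pd.Series(['〒100-0001  Chiyoda-ku', 'Shinjuku-ku', None])

cleaned = (s.str.replace(r'〒[0-9]{3}-[0-9]{4}\s*', '', regex=True)  # drop zip codes
            .str.replace(r'\s+', ' ', regex=True)                    # collapse whitespace
            .str.strip()
            .fillna(''))                                             # fill missing values

print(cleaned.tolist())  # ['Chiyoda-ku', 'Shinjuku-ku', '']
```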

As I wrote at the beginning, I would appreciate it if you could tell me if there is a better processing method.
