A memo to keep for the next time someone asks "why are you using a for loop?" when processing text in pandas.
It collects information useful for preprocessing Japanese text.
If you know a better way to do any of this, I would appreciate hearing about it.
Execution environment
pandas 0.25.3
TL;DR

--Simple processing is already implemented as methods under `df["column name"].str`
--For anything not implemented in pandas, use `df["column name"].apply()`
The sample data is store information for women's fashion brands scraped from their websites. The company name, brand name, store name, and address are stored in a CSV file.
Because it was scraped from multiple sites, the data is not normalized: half-width and full-width characters and whitespace are mixed, and the postal code may or may not be present.
The table below is an example of the data; the execution results in this article use it.
company | brand | location | address |
---|---|---|---|
pal | BONbazaar | Bombazaar Tokyo Dome City LaQua | 1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F |
world | index | Emio style | 1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio Style 1F |
pal | Whim Gazette | Marunouchi store | 〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F |
stripe | SEVENDAYS=SUNDAY | Aeon Mall Urawa Misono 2F | 5-50-1, Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono |
pal | mystic | Funabashi store | 〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 LaLaport TOKYO-BAY LaLaport 3 |
pal | pual ce cin | Ofuna Lumine Wing Store | 〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 Lumine Wing 4F |
stripe | Green Parks | sara Shapo Koiwa store | 7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F |
pal | Discoat | Discoat Petit Ikebukuro Shopping Park | 〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F |
adastoria | niko and... | Atre Kawagoe | 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F |
pal | CIAOPANIC TYPY | Kameari store | 〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F |
Since only one column (the `address` column) is used for the explanation, most of what follows operates on a Series, but DataFrame processing is also described where possible.
Basically, `Series.str` exposes the Python string methods and [regular expression operations](https://docs.python.org/ja/3/library/re.html), so you can use those directly.

--What follows is an excerpt of frequently used methods
--If you want to see all of them, please refer to the official documentation
--Reference: Series.str search results - Working with text data — pandas 1.0.1 documentation
strip
Delete the whitespace characters at the beginning and end of the character string.
df['strip']=df['address'].str.strip()
Of course, `Series.str.lstrip`, which strips only the leading whitespace, and `Series.str.rstrip`, which strips only the trailing whitespace, are also implemented.
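Since `Series.str.strip` simply applies Python's built-in string methods element-wise, the behavior can be checked with plain strings. A minimal sketch (the sample text is made up):

```python
# Plain-Python equivalents of Series.str.strip / lstrip / rstrip
text = "  1-1 Kasuga, Bunkyo-ku, Tokyo  "

stripped = text.strip()   # removes whitespace at both ends
left = text.lstrip()      # removes leading whitespace only
right = text.rstrip()     # removes trailing whitespace only

print(stripped)  # -> "1-1 Kasuga, Bunkyo-ku, Tokyo"
```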
split, rsplit
--Splits the string by the specified separator and returns the result as a *list*
--With `expand=True`, it splits into multiple columns
--Note that the number of columns matches the row with the most splits
--In the example below the data is split into three columns; rows whose text cannot be split into three are padded with `None`
df['address'].str.split(expand=True)
0 | 1 | 2 |
---|---|---|
1-1 Kasuga, Bunkyo-ku, Tokyo-1 | LaQua Building | 2F |
1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 | Emio style | 1F |
〒100-6390 | 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 | Marunouchi Building B1F |
5-50-1 Misono, Midori-ku, Saitama-shi, Saitama | Aeon Mall Urawa Misono | None |
〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 | LaLaport TOKYO-BAY LaLaport 3 | None |
〒247-0056 | 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 | Lumine Wing 4F |
7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa | 1F | None |
〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP | B1F | None |
Wakitamachi, Kawagoe City, Saitama Prefecture | 105 | Atre Kawagoe 4F |
〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F | None | None |
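What `expand=True` does can be sketched in plain Python. This is a hypothetical re-implementation for illustration, not pandas internals: split every string, then pad shorter rows with `None` up to the widest row.

```python
# Hypothetical sketch of expand=True: pad short rows with None
addresses = [
    "1-1 Kasuga LaQua 2F",   # splits into 4 tokens
    "Aeon Mall",             # splits into 2 tokens
]
rows = [a.split() for a in addresses]
width = max(len(r) for r in rows)
padded = [r + [None] * (width - len(r)) for r in rows]
print(padded[1])  # -> ['Aeon', 'Mall', None, None]
```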
rsplit

--Splits from the right side
--You can limit the number of splits by specifying `n`
--For example, with `n=1` the string is split only once, so the number of columns is fixed at 2

df['address'].str.rsplit(expand=True, n=1)
0 | 1 |
---|---|
1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building | 2F |
1 Takadanobaba, Shinjuku-ku, Tokyo-35-3 Emio style | 1F |
〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 | Marunouchi Building B1F |
5-50-1 Misono, Midori-ku, Saitama-shi, Saitama | Aeon Mall Urawa Misono |
〒273-0012 2 Hamacho, Funabashi City, Chiba Prefecture-1-1 | LaLaport TOKYO-BAY LaLaport 3 |
〒247-0056 1 Ofuna, Kamakura City, Kanagawa Prefecture-4-1 | Lumine Wing 4F |
7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa | 1F |
〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP | B1F |
105 Wakitamachi, Kawagoe City, Saitama Prefecture | Atre Kawagoe 4F |
〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F | None |
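pandas's `n` corresponds to `maxsplit` in Python's `str.rsplit`; with one split from the right you always get at most two pieces. A plain-string sketch (the address is made up):

```python
# str.rsplit with maxsplit=1 splits only once, starting from the right
address = "1-1 Kasuga LaQua Building 2F"
parts = address.rsplit(maxsplit=1)
print(parts)  # -> ['1-1 Kasuga LaQua Building', '2F']
```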
find
--Same behavior as Python's `str.find`
--Returns the position of the substring if it is present, **-1** if it is not
df['address'].str.find('Tokyo')
>> 0 0
1 0
2 10
3 -1
4 -1
5 -1
6 0
7 9
8 -1
9 9
--Since the return value is not a bool, you can **not** filter rows like this:
--df[df['address'].str.find('Tokyo')]
--df.query('address.str.find("Tokyo")')
--If you just want to know whether a string is included, as in `'hoge' in 'hogehoge'`, use `contains` instead.
# That said, you can still select rows containing "Tokyo" by comparing with -1
df.query('address.str.find("Tokyo")!=-1')
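The -1 convention is the same as the built-in `str.find`, which is why the comparison with -1 is needed. A plain-string sketch (sample strings are made up):

```python
# str.find returns an index, or -1 when not found -- not a bool
hit = "Tokyo Dome City".find("Tokyo")    # found at position 0
miss = "Misono, Saitama".find("Tokyo")   # not found
print(hit, miss)  # -> 0 -1

# Because 0 is falsy and -1 is truthy, filter with != -1, never by truthiness
found = [s for s in ["Tokyo Dome City", "Misono, Saitama"] if s.find("Tokyo") != -1]
```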
normalize (Series.str.normalize)

--Character normalization, equivalent to `unicodedata.normalize`
--Mainly converts full-width numbers and symbols to half-width, and half-width katakana to full-width. Specify 'NFKC' (Normal Form KC) for `form`.
# string
import unicodedata
unicodedata.normalize('NFKC', '１２３！？＠＃ﾊﾝｶｸｶﾀｶﾅ')
>> '123!?@#ハンカクカタカナ'
# pandas
df['normalize'] = df['address'].str.normalize(form='NFKC')
Reference: Character list normalized by Python unicodedata.normalize ('NFKC', x)
findall
--Equivalent to `re.findall()`
--Unlike `str.find()`, regular expressions can be used here
--Returns all matching substrings
df['address'].str.findall('(.{2}Ward)')
>> 0 [Bunkyo Ward]
1 [Shinjuku ward]
2 [Daita Ward]
3 [City Midori Ward]
4 []
5 []
6 [Togawa Ward]
7 [Toshima ward]
8 []
9 [Katsushika]
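`Series.str.findall` applies `re.findall` to each element, returning every match rather than just the first. A sketch with a made-up pattern and address:

```python
import re

# re.findall returns ALL matches as a list (here: every run of digits)
address = "〒171-8532 1-29-1 Minamiikebukuro"
numbers = re.findall(r"[0-9]+", address)
print(numbers)  # -> ['171', '8532', '1', '29', '1']
```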
contains
--Note the trailing **s**: contain**s**
--Functionally equivalent to `re.search()`
--Since it returns bool values, it can be used to filter the data
df['address'].str.contains('.{2}Ward')
>> 0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 False
9 True
# Display only the rows containing "○○ Ward"
df.query('address.str.contains(".{2}Ward")')['address']
>>0 1-1 Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2F
1 Takadanobaba, Shinjuku-ku, Tokyo 1-35-3 Emio Style 1F
2 〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo-4-1 Marunouchi Building B1F
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono
6 7 Minamikoiwa, Edogawa-ku, Tokyo-24-15 Shapo Koiwa 1F
7 〒171-8532 1 Minamiikebukuro, Toshima-ku, Tokyo-29-1 Ikebukuro SP B1F
9 〒125-0061 3 Kameari, Katsushika-ku, Tokyo-49-3 Ario Kameari 2F
By converting the bool values to int, you can create a column that is 1 for rows containing a given string.
This is convenient for feature engineering.
# Create a tokyo_flg column flagging rows that contain "Tokyo"
df['tokyo_flg'] = df['address'].str.contains("Tokyo").astype(int)
df['tokyo_flg']
>> 0 1
1 1
2 1
3 0
4 0
5 0
6 1
7 1
8 0
9 1
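The bool-to-int trick relies on `int(True) == 1`; element-wise it looks like the following plain-Python sketch of `contains(...).astype(int)` (sample data made up):

```python
import re

addresses = ["Kasuga, Bunkyo-ku, Tokyo", "Misono, Saitama"]
# bool(re.search(...)) mirrors str.contains; int() turns True/False into 1/0
flags = [int(bool(re.search("Tokyo", a))) for a in addresses]
print(flags)  # -> [1, 0]
```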
extract (Series.str.extract)

--Returns the matched pattern as columns
--Rows with no match become NaN, so apply `dropna()` if you don't need them
--If you use named groups, the group name becomes the column name as-is
--The named-group syntax is `(?P<name>...)`
df['address'].str.extract('(Tokyo|Kanagawa Prefecture)([^Ward city]+[Ward city])').dropna()
df['address'].str.extract('(?P<pref>Tokyo|Kanagawa Prefecture)(?P<city>[^Ward city]+[Ward city])').dropna()
pref | city | |
---|---|---|
0 | Tokyo | Bunkyo Ward |
1 | Tokyo | Shinjuku ward |
2 | Tokyo | Chiyoda Ward |
5 | Kanagawa Prefecture | Kamakura city |
6 | Tokyo | Edogawa Ward |
7 | Tokyo | Toshima ward |
9 | Tokyo | Katsushika |
This lets you build a table of Tokyo's 23 wards and Kanagawa prefecture's cities.
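The named groups behave exactly like `re` named groups; `groupdict()` shows how the names map to values. A sketch with a simplified, made-up pattern:

```python
import re

# (?P<name>...) names each capture; pandas turns the names into column names
pattern = r"(?P<pref>Tokyo|Kanagawa) (?P<city>\w+)"
m = re.search(pattern, "Tokyo Bunkyo")
print(m.groupdict())  # -> {'pref': 'Tokyo', 'city': 'Bunkyo'}
```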
replace
--Equivalent to `re.sub()`
--`Series.str.replace(pat, repl)` converts strings matching `pat` into `repl`
--Often used to remove unnecessary substrings on a rule basis
# Delete the zip code
df['address'] = df['address'].str.replace(r"〒[0-9]{3}\-[0-9]{4}", "")
Incidentally, unlike `Series.str.replace`, `Series.replace` can also be passed a dictionary.
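The same zip-code removal can be done with `re.sub` directly; the pattern mirrors the one above, and the sample string is made up:

```python
import re

address = "〒100-6390 2 Marunouchi, Chiyoda-ku, Tokyo"
# Remove the postal code and any whitespace that follows it
cleaned = re.sub(r"〒[0-9]{3}-[0-9]{4}\s*", "", address)
print(cleaned)  # -> "2 Marunouchi, Chiyoda-ku, Tokyo"
```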
If you want to use your own functions or packaged functions (neologdn, mecab, etc.), use `Series.apply`.
By passing a function as `Series.apply(func)`, you can run that function on each element of the Series. You can also pass a lambda.
# Insert 'Street address' at the beginning of each string
df['address'].apply(lambda x: 'Street address' + x)
The rest of this article covers actual text preprocessing.
neologdn 0.4
A Japanese text normalization package that can normalize long vowels and tildes, which the standard library's normalize alone cannot handle.
If you run it before analyzing with mecab or before extracting strings with regular expressions, the regular expressions you have to write become simpler, so for Japanese text it is best to run it first.
import neologdn
df['neologdn'] = df['address'].apply(neologdn.normalize)
# You can also use DataFrame.apply with a lambda
df['neologdn'] = df.apply(lambda x: neologdn.normalize(x['address']), axis=1)
mecab-python3
mecab-python3 0.996.3
If you only want the word-segmentation (wakati) result, specify `-Owakati`.
import MeCab
# Specify the dictionary path with `-d`
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/ipadic/')
df['neologdn'].apply(tagger.parse)
In this case the trailing newline `\n` is included, so if you want to remove it, define your own function or use a lambda.
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/ipadic/')
#Define function
def my_parser(text):
res = tagger.parse(text)
return res.strip()
df['neologdn'].apply(my_parser)
#You don't have to declare a function if you use a lambda function
df['neologdn'].apply(lambda x : tagger.parse(x).strip())
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama City, Saitama Prefecture 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F
Sudachipy
SudachiDict-core 20190718
SudachiPy 0.4.2
With SudachiPy, if you want a word-segmented result like mecab above, you need a function that extracts only the surface forms from the analysis result objects.
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
def sudachi_tokenize(text):
res = tokenizer_obj.tokenize(text, mode)
return ' '.join([m.surface() for m in res])
df['address'].apply(sudachi_tokenize)
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama-shi, Saitama 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F
By the way, unlike mecab's ipadic, Sudachi's `SplitMode.C` seems to keep addresses together as a single token.
(Perhaps it treats prefecture + city/ward/town/village as a named entity?)
Sudachi has three split units (SplitMode), so let's customize the earlier `sudachi_tokenize` function so that the mode can be specified.
Since `Series.apply` can forward extra arguments, you can add a parameter on the function side and specify it on the `apply` side.
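Conceptually, `Series.apply(func, kwarg=value)` just forwards the extra keyword arguments to `func` for every element. A plain-Python sketch of that forwarding (the function names here are made up stand-ins, not SudachiPy code):

```python
def tokenize_with_mode(text, mode="A"):
    # Dummy stand-in for a real tokenizer: just tags the mode
    return f"[{mode}] {text}"

def apply_like(values, func, **kwargs):
    # What Series.apply does with extra keyword arguments, conceptually
    return [func(v, **kwargs) for v in values]

out = apply_like(["Kasuga", "Misono"], tokenize_with_mode, mode="C")
print(out)  # -> ['[C] Kasuga', '[C] Misono']
```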
def sudachi_tokenize_with_mode(text, mode):
res = tokenizer_obj.tokenize(text, mode)
return ' '.join([m.surface() for m in res])
df['address'].apply(sudachi_tokenize_with_mode, mode=tokenizer.Tokenizer.SplitMode.A)
>>0 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F
1 Takadanobaba, Shinjuku-ku, Tokyo 1- 35 -3 Emio Style 1 F
2 Marunouchi, Chiyoda-ku, Tokyo 2- 4 -1 Marunouchi Building B 1 F
3 5-50 Misono, Midori-ku, Saitama City, Saitama Prefecture 1 Aeon Mall Urawa Misono
4 Hamacho, Funabashi City, Chiba Prefecture 2- 1 -1 LaLaport TOKYO-BAY LaLaport 3
5 Ofuna, Kamakura City, Kanagawa Prefecture 1- 4 -1 Lumine Wing 4 F
6 Minamikoiwa, Edogawa-ku, Tokyo 7- 24 -15 Shapo Koiwa 1 F
7 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F
8 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 F
9 Kameari, Katsushika-ku, Tokyo 3- 49 -3 Ario Kameari 2 F
With `SplitMode.A`, the result is almost the same as mecab's.
Sudachi also has a normalization feature that corrects variant spellings.
Let's return `normalized_form` at the same time and turn the result into a DataFrame.
Since `Series.apply` has no expand option, we run it with `result_type='expand'` in `DataFrame.apply`.
def sudachi_tokenize_multi(text):
res = tokenizer_obj.tokenize(text, mode)
return ' '.join([m.surface() for m in res]), ' '.join([m.normalized_form() for m in res])
df.apply(lambda x: sudachi_tokenize_multi(x['neologdn']), axis=1, result_type='expand')
0 | 1 | |
---|---|---|
0 | 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 F | 1-chome, Kasuga, Bunkyo-ku, Tokyo-1 LaQua Building 2 f |
1 | 1 Takadanobaba, Shinjuku-ku, Tokyo- 35 -3 Emio Style 1 F | 1 Takadanobaba, Shinjuku-ku, Tokyo- 35 -3 Emio Style 1 f |
2 | 2 Marunouchi, Chiyoda-ku, Tokyo- 4 -1 Marunouchi Building B 1 F | 2 Marunouchi, Chiyoda-ku, Tokyo- 4 -1 Marunouchi Building b 1 f |
3 | 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono | 5-50 Misono, Midori-ku, Saitama-shi, Saitama Aeon Mall Urawa Misono |
4 | 2 Hamacho, Funabashi City, Chiba Prefecture- 1 -1 LaLaport TOKYO-BAY LaLaport 3 | 2 Hamacho, Funabashi City, Chiba Prefecture- 1 -1 LaLaport Tokyo-Bay LaLaport 3 |
5 | 1 Ofuna, Kamakura City, Kanagawa Prefecture- 4 -1 Lumine Wing 4 F | 1 Ofuna, Kamakura City, Kanagawa Prefecture- 4 -1 Lumine Wing 4 f |
6 | 7 Minamikoiwa, Edogawa-ku, Tokyo- 24 -15 Shapo Koiwa 1 F | 7 Minamikoiwa, Edogawa-ku, Tokyo- 24 -15 Shappo Koiwa 1 f |
7 | 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP B 1 F | 1 Minamiikebukuro, Toshima-ku, Tokyo- 29 -1 Ikebukuro SP b 1 f |
8 | 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4F | 105 Wakitamachi, Kawagoe City, Saitama Prefecture Atre Kawagoe 4 f |
9 | 3 Kameari, Katsushika-ku, Tokyo- 49 -3 Ario Kameari 2 F | 3 Kameari, Katsushika-ku, Tokyo- 49 -3 Ario Kameari 2 f |
For addresses I did not expect much effect, but *Chapo* was normalized to *Shappo*, and *TOKYO-BAY* to *Tokyo-Bay*. Alphabet characters such as F are also lowercased, for some reason.
As for my personal preprocessing order: if you use SudachiPy, which has its own normalization, it may be better to tokenize first and then delete or fill in characters.
As I wrote at the beginning, I would appreciate hearing about better processing methods.