That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 2]

We will solve the Python problems of Data Science 100 Knock (Structured Data Processing). The model answers use pandas for the data processing, but as a learning exercise we will process the data with NumPy structured arrays instead.

◀ Previous article (#1) / ▶ Next article (#3)

Introduction

As a study of NumPy structured arrays, I will work through the Python problems of Data Science 100 Knock (Structured Data Processing).

Many people doing data science in Python are probably pandas lovers, but in fact you can do the same things with NumPy, without **pandas**, and NumPy is usually faster. As a pandas lover myself, I am still not used to NumPy operations, so this time I would like to try graduating from pandas by working through this "Data Science 100 Knock" with NumPy. As a rule, I will not vectorize functions with np.vectorize() or np.frompyfunc().
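For reference, here is a minimal sketch of the kind of wrapper that rule excludes (the sample array is made up for illustration): np.vectorize() merely loops a Python-level function over the array, so it adds convenience rather than speed.

import numpy as np

# The kind of helper deliberately avoided in this series:
# np.vectorize() just wraps a Python function in an element-wise loop.
is_s14 = np.vectorize(lambda s: s.startswith('S14'))
is_s14(np.array(['S14010', 'S13002']))
# array([ True, False])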

This time I will do questions 10 through 16. The theme seems to be conditional indexing with strings. The initial data was read in as follows (explicit data type specification is put off for now).

import numpy as np
import pandas as pd

# For the model answers
df_store = pd.read_csv('data/store.csv')
df_customer = pd.read_csv('data/customer.csv')

# The data we will work with
arr_store = np.genfromtxt(
    'data/store.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_customer = np.genfromtxt(
    'data/customer.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
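Since names=True makes np.genfromtxt() build a structured dtype from the CSV header, a quick sanity check looks like this (the field names match the dtype shown in the outputs further down):

# genfromtxt with names=True returns a structured array;
# each CSV column becomes a named field
arr_store.dtype.names
# ('store_cd', 'store_name', 'prefecture_cd', 'prefecture', 'address',
#  'address_kana', 'tel_no', 'longitude', 'latitude', 'floor_area')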

P_010

P-010: From the store data frame (df_store), extract all columns of the rows whose store code (store_cd) begins with "S14", and display only the first 10 rows.

Use np.char.startswith() to check whether the beginning of each string matches. Pass the array of strings as the first argument and the prefix to search for as the second.

In[010]


arr_store[np.char.startswith(arr_store['store_cd'], 'S14')][:10]

Functions like np.char.xxx() are convenient, but NumPy is not particularly good at string operations, so when speed matters, a plain Python for loop is often the better choice. In that case it is actually faster to first convert the NumPy array to a list.

In[010]


arr_store[[item[:3] == 'S14'
           for item in arr_store['store_cd'].tolist()]][:10]

If you only need to look at the first few characters, there is an even easier and faster way. Checking the store code (store_cd) column,

arr_store['store_cd']
# array(['S12014', 'S13002', 'S14010', 'S14033', 'S14036', 'S13051',
#        ...
#        'S13003', 'S12053', 'S13037', 'S14024', 'S14006'], dtype='<U6')

you can see that they are all 6-character strings. Since only the first 3 characters are needed, reread the column with the '<U3' data type. The 4th and subsequent characters are then simply not read and are discarded, as shown below.

arr_store['store_cd'].astype('<U3')
# array(['S12', 'S13', 'S14', 'S14', 'S14', 'S13', 'S13', 'S14', 'S13',
#        ...
#        'S14', 'S13', 'S12', 'S13', 'S12', 'S13', 'S14', 'S14'],
#       dtype='<U3')

Comparing this with "S14" pulls out exactly the rows we want, so the answer is:

In[010]


arr_store[arr_store['store_cd'].astype('<U3') == 'S14'][:10]

The output is as follows.

Out[010]


array([('S14010', 'Kikuna store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市港北区菊名一丁目', 'Kanagawa Ken Yokohama Shikou Hokukuki Kunai Choume', '045-123-4032', 139.6326, 35.50049, 1732.),
       ('S14033', 'Akuwa store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区阿久和西一丁目', 'Kanagawa Ken Yokohama Seya Kakuwanishi Itchoume', '045-123-4043', 139.4961, 35.45918, 1495.),
       ('S14036', 'Sagamihara Chuo store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture相模原市中央二丁目', 'Kanagawa Kensagamihara Shichuo Unichoume', '042-123-4045', 139.3716, 35.57327, 1679.),
       ('S14040', 'Nagatsuta store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市緑区長津田みなみ台五丁目', 'Ken Kanagawa Yokohama Midori Ward Hall Ivy Minami Daigochoume', '045-123-4046', 139.4994, 35.52398, 1548.),
       ('S14050', 'Akuwanishi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区阿久和西一丁目', 'Kanagawa Ken Yokohama Seya Kakuwanishi Itchoume', '045-123-4053', 139.4961, 35.45918, 1830.),
       ('S14028', 'Futatsubashi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区二ツ橋町', 'Ken Kanagawa Yokohama Seya Ward Office Futatsubashicho', '045-123-4042', 139.4963, 35.46304, 1574.),
       ('S14012', 'Honmokuwada store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市中区本牧和田', 'Kanagawa Ken Yokohama Shinakakuhon Mokuwada', '045-123-4034', 139.6582, 35.42156, 1341.),
       ('S14046', 'Kitayamada store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区北山田一丁目', 'Ken Kanagawa Yokohama Tsuzuki Ward Hall Tsuzuki Ward Hall', '045-123-4049', 139.5916, 35.56189,  831.),
       ('S14022', 'Zushi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture逗子市逗子一丁目', 'Kanagawa Kenzushi Shizushi Ichoume', '046-123-4036', 139.5789, 35.29642, 1838.),
       ('S14011', 'Hiyoshihoncho store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市港北区日吉本町四丁目', 'Kanagawa Ken Yokohama Shiko Hoku Hiyoshi Honcho Yonchome', '045-123-4033', 139.6316, 35.54655,  890.)],
      dtype=[('store_cd', '<U6'), ('store_name', '<U6'), ('prefecture_cd', '<i4'), ('prefecture', '<U4'), ('address', '<U19'), ('address_kana', '<U30'), ('tel_no', '<U12'), ('longitude', '<f8'), ('latitude', '<f8'), ('floor_area', '<f8')])

Let's compare the speeds in various ways.

Time[010]


# Model answer
%timeit df_store.query("store_cd.str.startswith('S14')", engine='python').head(10)
# 3.46 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Other pandas approaches
%timeit df_store[df_store['store_cd'].str.startswith('S14')][:10]
# 876 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df_store.loc[[index for index, item in enumerate(df_store['store_cd']) if item[:3] == 'S14']][:10]
# 732 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Approaches using NumPy
%timeit arr_store[np.char.startswith(arr_store['store_cd'], 'S14')][:10]
# 54.3 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit arr_store[[item[:3] == 'S14' for item in arr_store['store_cd'].tolist()]][:10]
# 22.8 µs ± 377 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit arr_store[arr_store['store_cd'].astype('<U3') == 'S14'][:10]
# 5.55 µs ± 91.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

String-based conditional indexing with pd.DataFrame.query() is notorious for being slow.

P_011

P-011: From the customer data frame (df_customer), extract all columns of the rows whose customer ID (customer_id) ends in 1, and display only the first 10 rows.

It's the same as the previous question; use the np.char.endswith() function.

In[011]


arr_customer[np.char.endswith(arr_customer['customer_id'], '1')][:10]

If you are concerned about speed, use a for loop.

In[011]


arr_customer[[item[-1] == '1'
              for item in arr_customer['customer_id']]][:10]

Let's think of a faster way. The .astype() trick cannot be used to look at the last character of a string, but in this example there is another way to get high-speed processing. Checking the customer ID (customer_id) column,

arr_customer['customer_id']
# array(['CS021313000114', 'CS037613000071', 'CS031415000172', ...,
#        'CS012403000043', 'CS033512000184', 'CS009213000022'], dtype='<U14')

you can see that **every row has the same number of characters** (14). We just need to find the rows ending in "1". To do this, first convert the array arr_customer['customer_id'] to raw bytes with the .tobytes() method, then reread that buffer, originally read with the '<U14' data type, as '<U1' using the np.frombuffer() function. Reshaping the result into a 14-column array with the .reshape() method gives:

# Break every customer ID down into individual characters
np.frombuffer(arr_customer['customer_id'].tobytes(), dtype='<U1')
# array(['C', 'S', '0', ..., '0', '2', '2'], dtype='<U1')

# Back to 14 characters per row
np.frombuffer(arr_customer['customer_id'].tobytes(), dtype='<U1').reshape(len(arr_customer), -1)
# array([['C', 'S', '0', ..., '1', '1', '4'],
#        ['C', 'S', '0', ..., '0', '7', '1'],
#        ...,
#        ['C', 'S', '0', ..., '1', '8', '4'],
#        ['C', 'S', '0', ..., '0', '2', '2']], dtype='<U1')

The rows we want are simply the rows of this array whose last column is "1", so the answer is as follows. This technique is quite widely applicable.

In[011]


arr_customer[np.frombuffer(arr_customer['customer_id'].tobytes(), dtype='<U1')
             .reshape(len(arr_customer), -1)[:, -1]
             == '1'][:10]

The output is as follows.

Out[011]


array([('CS037613000071', 'Masahiko Hexagon', 9, 'unknown', '1952-04-01', 66, '136-0076', 'Minamisuna, Koto-ku, Tokyo**********', 'S13037', 20150414, '0-00000000-0'),
       ('CS028811000001', 'Kaori Horii', 1, 'Female', '1933-03-27', 86, '245-0016', 'Izumi-cho, Izumi-ku, Yokohama-shi, Kanagawa**********', 'S14028', 20160115, '0-00000000-0'),
       ('CS040412000191', 'Ikue Kawai', 1, 'Female', '1977-01-05', 42, '226-0021', 'Kitahassakucho, Midori-ku, Yokohama-shi, Kanagawa**********', 'S14040', 20151101, '1-20091025-4'),
       ('CS028314000011', 'Kosuge Aoi', 1, 'Female', '1983-11-26', 35, '246-0038', 'Miyazawa, Seya Ward, Yokohama City, Kanagawa Prefecture**********', 'S14028', 20151123, '1-20080426-5'),
       ('CS039212000051', 'Erika Fujishima', 1, 'Female', '1997-02-03', 22, '166-0001', 'Asagayakita, Suginami-ku, Tokyo**********', 'S13039', 20171121, '1-20100215-4'),
       ('CS015412000111', 'Natsuki Matsui', 1, 'Female', '1972-10-04', 46, '136-0071', 'Kameido, Koto-ku, Tokyo**********', 'S13015', 20150629, '0-00000000-0'),
       ('CS004702000041', 'Hiroshi Nojima', 0, 'male', '1943-08-24', 75, '176-0022', 'Koyama, Nerima-ku, Tokyo**********', 'S13004', 20170218, '0-00000000-0'),
       ('CS041515000001', 'Chinatsu Kurita', 1, 'Female', '1967-01-02', 52, '206-0001', 'Wada, Tama City, Tokyo**********', 'S13041', 20160422, 'E-20100803-F'),
       ('CS029313000221', 'Hikari Hojo', 1, 'Female', '1987-06-19', 31, '279-0011', 'Mihama, Urayasu City, Chiba Prefecture**********', 'S12029', 20180810, '0-00000000-0'),
       ('CS034312000071', 'Nao Mochizuki', 1, 'Female', '1980-09-20', 38, '213-0026', 'Hisasue, Takatsu-ku, Kawasaki-shi, Kanagawa**********', 'S14034', 20160106, '0-00000000-0')],
      dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])

Let's compare the speed.

Time[011]


%timeit df_customer.query("customer_id.str.endswith('1')", engine='python').head(10)
# 12.2 ms ± 454 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_customer[np.char.endswith(arr_customer['customer_id'], '1')][:10]
# 20.7 ms ± 847 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr_customer[[item[-1] == '1' for item in arr_customer['customer_id']]][:10]
# 9.44 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_customer[np.frombuffer(arr_customer['customer_id'].tobytes(), dtype='<U1').reshape(len(arr_customer), -1)[:, -1] == '1'][:10]
# 1.83 ms ± 77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.char.xxx() can apparently be even slower than pd.DataFrame.query() in some cases...

P_012

P-012: From the store data frame (df_store), display all columns for the stores located in Yokohama City only.

In[012]


arr_store[np.char.find(arr_store['address'], '横浜市') >= 0]

Or

In[012]


arr_store[['横浜市' in item
           for item in arr_store['address'].tolist()]]

The address field has a variable length, but whenever "横浜市" (Yokohama City) appears, it occupies the 5th through 7th characters. So after trimming each address to its first 7 characters with .astype('<U7'), we build an array of just the last 3 of those characters and compare, as below.

In[012]


arr_store[np.frombuffer(arr_store['address'].astype('<U7').view('<U1')
                        .reshape(len(arr_store), -1)[:, 4:].tobytes(),
                        dtype='<U3')
          == '横浜市']

This also works.
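As an aside, the .view('<U1') used here and the np.frombuffer(... .tobytes(), dtype='<U1') pattern from P-011 are two spellings of the same thing: a character-by-character view of a fixed-width Unicode column. A minimal check of that claim (just a sketch):

# Both give an (n_rows, 7) array of single characters for the trimmed addresses
chars_view = arr_store['address'].astype('<U7').view('<U1').reshape(len(arr_store), -1)
chars_buf = (np.frombuffer(arr_store['address'].astype('<U7').tobytes(), dtype='<U1')
             .reshape(len(arr_store), -1))
assert (chars_view == chars_buf).all()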

Out[012]


array([('S14010', 'Kikuna store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市港北区菊名一丁目', 'Kanagawa Ken Yokohama Shikou Hokukuki Kunai Choume', '045-123-4032', 139.6326, 35.50049, 1732.),
       ('S14033', 'Akuwa store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区阿久和西一丁目', 'Kanagawa Ken Yokohama Seya Kakuwanishi Itchoume', '045-123-4043', 139.4961, 35.45918, 1495.),
       ('S14040', 'Nagatsuta store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市緑区長津田みなみ台五丁目', 'Ken Kanagawa Yokohama Midori Ward Hall Ivy Minami Daigochoume', '045-123-4046', 139.4994, 35.52398, 1548.),
       ('S14050', 'Akuwanishi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区阿久和西一丁目', 'Kanagawa Ken Yokohama Seya Kakuwanishi Itchoume', '045-123-4053', 139.4961, 35.45918, 1830.),
       ('S14028', 'Futatsubashi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区二ツ橋町', 'Ken Kanagawa Yokohama Seya Ward Office Futatsubashicho', '045-123-4042', 139.4963, 35.46304, 1574.),
       ('S14012', 'Honmokuwada store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市中区本牧和田', 'Kanagawa Ken Yokohama Shinakakuhon Mokuwada', '045-123-4034', 139.6582, 35.42156, 1341.),
       ('S14046', 'Kitayamada store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区北山田一丁目', 'Ken Kanagawa Yokohama Tsuzuki Ward Hall Tsuzuki Ward Hall', '045-123-4049', 139.5916, 35.56189,  831.),
       ('S14011', 'Hiyoshihoncho store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市港北区日吉本町四丁目', 'Kanagawa Ken Yokohama Shiko Hoku Hiyoshi Honcho Yonchome', '045-123-4033', 139.6316, 35.54655,  890.),
       ('S14048', 'Nakagawa Chuo store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区中川中央二丁目', 'Ken Kanagawa Yokohama Shitsuzuki Kunakagawa Chuo Unichoume', '045-123-4051', 139.5758, 35.54912, 1657.),
       ('S14042', 'Shinyamashita store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市中区新山下二丁目', 'Kanagawa Ken Yokohama Shinakakushin Yamashitani Chome', '045-123-4047', 139.6593, 35.43894, 1044.),
       ('S14006', 'Kuzugaya store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区葛が谷', 'Kanagawa Ken Yokohama Tsuzuki Ward Hall', '045-123-4031', 139.5633, 35.53573, 1886.)],
      dtype=[('store_cd', '<U6'), ('store_name', '<U6'), ('prefecture_cd', '<i4'), ('prefecture', '<U4'), ('address', '<U19'), ('address_kana', '<U30'), ('tel_no', '<U12'), ('longitude', '<f8'), ('latitude', '<f8'), ('floor_area', '<f8')])

P_013

P-013: From the customer data frame (df_customer), extract all columns of the rows whose status code (status_cd) begins with one of the letters A through F, and display only the first 10 rows.

This one is a bit of a pain. The model answer searches with a regular expression, but for now let's do it without one. With np.char.startswith() it looks like this.

In[013]


arr_customer[
    np.char.startswith(arr_customer['status_cd'],
                       np.array(list('ABCDEF'))[:, None])
    .any(0)][:10]

That feels pretty clunky. A for loop is easier.

In[013]


arr_customer[[item[0] in set('ABCDEF')
              for item in arr_customer['status_cd'].tolist()]][:10]

Or read only the first character of each string and use np.in1d(). It's fast and easy.

In[013]


arr_customer[np.in1d(arr_customer['status_cd'].astype('<U1'),
                     np.fromiter('ABCDEF', dtype='<U1'))][:10]

By the way, the search characters are passed as a NumPy array here, but the speed hardly changed even when passing list('ABCDEF') instead.
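For reference, that plain-list variant would be written as follows (equivalent, since np.in1d() converts its second argument to an array itself):

arr_customer[np.in1d(arr_customer['status_cd'].astype('<U1'),
                     list('ABCDEF'))][:10]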

Out[013]


array([('CS031415000172', 'Kimiko Utada', 1, 'Female', '1976-10-04', 42, '151-0053', 'Yoyogi, Shibuya-ku, Tokyo**********', 'S13031', 20150529, 'D-20100325-C'),
       ('CS015414000103', 'Yoko Okuno', 1, 'Female', '1977-08-09', 41, '136-0073', 'Kitasuna, Koto-ku, Tokyo**********', 'S13015', 20150722, 'B-20100609-B'),
       ('CS011215000048', 'Saya Ashida', 1, 'Female', '1992-02-01', 27, '223-0062', 'Hiyoshihoncho, Kohoku Ward, Yokohama City, Kanagawa Prefecture**********', 'S14011', 20150228, 'C-20100421-9'),
       ('CS029415000023', 'Riho Umeda', 1, 'Female', '1976-01-17', 43, '279-0043', 'Fujimi, Urayasu City, Chiba Prefecture**********', 'S12029', 20150610, 'D-20100918-E'),
       ('CS035415000029', 'Maki Terazawa', 9, 'unknown', '1977-09-27', 41, '158-0096', 'Tamagawadai, Setagaya-ku, Tokyo**********', 'S13035', 20141220, 'F-20101029-F'),
       ('CS031415000106', 'Yumiko Uno', 1, 'Female', '1970-02-26', 49, '151-0053', 'Yoyogi, Shibuya-ku, Tokyo**********', 'S13031', 20150201, 'F-20100511-E'),
       ('CS029215000025', 'Miho Ishikura', 1, 'Female', '1993-09-28', 25, '279-0022', 'Imagawa, Urayasu City, Chiba Prefecture**********', 'S12029', 20150708, 'B-20100820-C'),
       ('CS033605000005', 'Yuta Inomata', 0, 'male', '1955-12-05', 63, '246-0031', 'Seya, Seya Ward, Yokohama City, Kanagawa Prefecture**********', 'S14033', 20150425, 'F-20100917-E'),
       ('CS033415000229', 'Nanami Itagaki', 1, 'Female', '1977-11-07', 41, '246-0021', 'Futatsubashi-cho, Seya-ku, Yokohama-shi, Kanagawa**********', 'S14033', 20150712, 'F-20100326-E'),
       ('CS008415000145', 'Mao Kuroya', 1, 'Female', '1977-06-27', 41, '157-0067', 'Kitami, Setagaya-ku, Tokyo**********', 'S13008', 20150829, 'F-20100622-F')],
      dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])

P_014

P-014: From the customer data frame (df_customer), extract all columns of the rows whose status code (status_cd) ends with one of the digits 1 through 9, and display only the first 10 rows.

In[014]


arr_customer[[item[-1] in set('123456789')
              for item in arr_customer['status_cd'].tolist()]][:10]

In[014]


arr_customer[np.in1d(np.frombuffer(arr_customer['status_cd'].tobytes(),
                                   dtype='<U1')
                     .reshape(len(arr_customer), -1)[:, -1],
                     np.fromiter('123456789', dtype='<U1'))][:10]

Out[014]


array([('CS001215000145', 'Miki Tazaki', 1, 'Female', '1995-03-29', 24, '144-0055', 'Nakarokugo, Ota-ku, Tokyo**********', 'S13001', 20170605, '6-20090929-2'),
       ('CS033513000180', 'Haruka Anzai', 1, 'Female', '1962-07-11', 56, '241-0823', 'Zenbu-cho, Asahi-ku, Yokohama-shi, Kanagawa**********', 'S14033', 20150728, '6-20080506-5'),
       ('CS011215000048', 'Saya Ashida', 1, 'Female', '1992-02-01', 27, '223-0062', 'Hiyoshihoncho, Kohoku Ward, Yokohama City, Kanagawa Prefecture**********', 'S14011', 20150228, 'C-20100421-9'),
       ('CS040412000191', 'Ikue Kawai', 1, 'Female', '1977-01-05', 42, '226-0021', 'Kitahassakucho, Midori-ku, Yokohama-shi, Kanagawa**********', 'S14040', 20151101, '1-20091025-4'),
       ('CS009315000023', 'Fumiyo Kohinata', 1, 'Female', '1980-04-15', 38, '154-0012', 'Komazawa, Setagaya-ku, Tokyo**********', 'S13009', 20150319, '5-20080322-1'),
       ('CS015315000033', 'Fukushi Rinako', 1, 'Female', '1983-03-17', 36, '135-0043', 'Shiohama, Koto-ku, Tokyo**********', 'S13015', 20141024, '4-20080219-3'),
       ('CS023513000066', 'Kobe Sora', 1, 'Female', '1961-12-17', 57, '210-0005', 'Higashida-cho, Kawasaki-ku, Kawasaki-shi, Kanagawa**********', 'S14023', 20150915, '5-20100524-9'),
       ('CS035513000134', 'Miho Ichikawa', 1, 'Female', '1960-03-27', 59, '156-0053', 'Sakura, Setagaya-ku, Tokyo**********', 'S13035', 20150227, '8-20100711-9'),
       ('CS001515000263', 'Takamatsu summer sky', 1, 'Female', '1962-11-09', 56, '144-0051', 'Nishikamata, Ota-ku, Tokyo**********', 'S13001', 20160812, '1-20100804-1'),
       ('CS040314000027', 'Kimimaro Tsuruta', 9, 'unknown', '1986-03-26', 33, '226-0027', 'Nagatsuta, Midori-ku, Yokohama-shi, Kanagawa**********', 'S14040', 20150122, '2-20080426-4')],
      dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])

P_015

P-015: From the customer data frame (df_customer), extract all columns of the rows whose status code (status_cd) begins with one of the letters A through F and ends with one of the digits 1 through 9, and display only the first 10 rows.

Even with multiple conditions, a for loop handles it easily.

In[015]


arr_customer[[item[0] in set('ABCDEF') and item[-1] in set('123456789')
              for item in arr_customer['status_cd'].tolist()]][:10]

The approach of splitting the string and using np.in1d() becomes tedious to write once there are multiple conditions. Probably no one would bother to write code like this. (Is there a better way? A small sketch follows after the timing comparison.)

In[015]


status_cd_split = np.frombuffer(arr_customer['status_cd'].tobytes(),
                                dtype='<U1').reshape(len(arr_customer), -1)
arr_customer[np.in1d(status_cd_split[:, 0],
                     np.fromiter('ABCDEF', dtype='<U1'))
             &
             np.in1d(status_cd_split[:, -1],
                     np.fromiter('123456789', dtype='<U1'))][:10]

Out[015]


array([('CS011215000048', 'Saya Ashida', 1, 'Female', '1992-02-01', 27, '223-0062', 'Hiyoshihoncho, Kohoku Ward, Yokohama City, Kanagawa Prefecture**********', 'S14011', 20150228, 'C-20100421-9'),
       ('CS022513000105', 'Kimiko Shimamura', 1, 'Female', '1962-03-12', 57, '249-0002', 'Yamanone, Zushi City, Kanagawa Prefecture**********', 'S14022', 20150320, 'A-20091115-7'),
       ('CS001515000096', 'Yoko Mizuno', 9, 'unknown', '1960-11-29', 58, '144-0053', 'Kamatahoncho, Ota-ku, Tokyo**********', 'S13001', 20150614, 'A-20100724-7'),
       ('CS013615000053', 'Nishiwaki Kii', 1, 'Female', '1953-10-18', 65, '261-0026', 'Makuharinishi, Mihama Ward, Chiba City, Chiba Prefecture**********', 'S12013', 20150128, 'B-20100329-6'),
       ('CS020412000161', 'Kaoru Komiya', 1, 'Female', '1974-05-21', 44, '174-0042', 'Higashisakashita, Itabashi-ku, Tokyo**********', 'S13020', 20150822, 'B-20081021-3'),
       ('CS001215000097', 'Asami Takenaka', 1, 'Female', '1990-07-25', 28, '146-0095', 'Tamagawa, Ota-ku, Tokyo**********', 'S13001', 20170315, 'A-20100211-2'),
       ('CS035212000007', 'Erika Uchimura', 1, 'Female', '1990-12-04', 28, '152-0023', 'Yakumo, Meguro-ku, Tokyo**********', 'S13035', 20151013, 'B-20101018-6'),
       ('CS002515000386', 'Ko Noda', 1, 'Female', '1963-05-30', 55, '185-0013', 'Nishikoigakubo, Kokubunji-shi, Tokyo**********', 'S13002', 20160410, 'C-20100127-8'),
       ('CS001615000372', 'Inagaki Suzuka', 1, 'Female', '1956-10-29', 62, '144-0035', 'Minamikamata, Ota-ku, Tokyo**********', 'S13001', 20170403, 'A-20100104-1'),
       ('CS032512000121', 'Tomoyo Matsui', 1, 'Female', '1962-09-04', 56, '210-0011', 'Fujimi, Kawasaki Ward, Kawasaki City, Kanagawa Prefecture**********', 'S13032', 20150727, 'A-20100103-5')],
      dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])

It's tedious to write, but faster than a for loop.

Time[015]


%timeit df_customer.query("status_cd.str.contains('^[A-F].*[1-9]$', regex=True)", engine='python').head(10)
# 31 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr_customer[[item[0] in set('ABCDEF') and item[-1] in set('123456789') for item in arr_customer['status_cd'].tolist()]][:10]
# 16.7 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
status_cd_split = np.frombuffer(arr_customer['status_cd'].tobytes(), dtype='<U1').reshape(len(arr_customer), -1)
arr_customer[np.in1d(status_cd_split[:, 0], np.fromiter('ABCDEF', dtype='<U1'))
             & np.in1d(status_cd_split[:, -1], np.fromiter('123456789', dtype='<U1'))][:10]
# 3.94 ms ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
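Coming back to the earlier question of whether there is a better way: one slightly simpler spelling, sketched below, combines the .astype('<U1') trick from P-013 for the first character with the byte-view reshape for the last character only (I have not timed this variant):

# First character via the cheap truncating astype, last character via the byte view
first_char = arr_customer['status_cd'].astype('<U1')
last_char = (np.frombuffer(arr_customer['status_cd'].tobytes(), dtype='<U1')
             .reshape(len(arr_customer), -1)[:, -1])
arr_customer[np.in1d(first_char, np.fromiter('ABCDEF', dtype='<U1'))
             & np.in1d(last_char, np.fromiter('123456789', dtype='<U1'))][:10]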

P_016

P-016: From the store data frame (df_store), display all data whose phone number (tel_no) matches the pattern 3 digits-3 digits-4 digits.

Is this finally the point where the regular expression module re comes in?

In[016]


import re

arr_store[[bool(re.fullmatch(r'[0-9]{3}-[0-9]{3}-[0-9]{4}', item))
           for item in arr_store['tel_no'].tolist()]]

In this case, looking at the data, every phone number has exactly 10 digits and there are no malformed values, so the problem can be restated as "rows whose 4th and 8th characters are '-'". Reading the problem flexibly in light of the data is one valid way to solve it.

In[016]


tel_no_split = np.frombuffer(arr_store['tel_no'].tobytes(),
                             dtype='<U1').reshape(len(arr_store), -1)
arr_store[(tel_no_split[:, 3] == '-') & (tel_no_split[:, 7] == '-')]

Out[016]


array([('S12014', 'Chigusadai store', 12, 'Chiba', 'Chiba千葉市稲毛区千草台一丁目', 'Chiba Ken Chiba Shiinagekuchigusadai Itchoume', '043-123-4003', 140.118 , 35.63559, 1698.),
       ('S13002', 'Kokubunji store', 13, 'Tokyo', 'Tokyo国分寺市本多二丁目', 'Tokyo Kokubunji Kokubunji', '042-123-4008', 139.4802, 35.70566, 1735.),
       ('S14010', 'Kikuna store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市港北区菊名一丁目', 'Kanagawa Ken Yokohama Shikou Hokukukikunai Choume', '045-123-4032', 139.6326, 35.50049, 1732.),
       ('S14033', 'Akuwa store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区阿久和西一丁目', 'Kanagawa Ken Yokohama Seya Kakuwanishi Itchoume', '045-123-4043', 139.4961, 35.45918, 1495.),
       ('S14036', 'Sagamihara Chuo store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture相模原市中央二丁目', 'Kanagawa Kensagamihara Shichuo Unichoume', '042-123-4045', 139.3716, 35.57327, 1679.),
       ('S14040', 'Nagatsuta store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市緑区長津田みなみ台五丁目', 'Ken Kanagawa Yokohama Midori Ward Hall Ivy Minami Daigochoume', '045-123-4046', 139.4994, 35.52398, 1548.),
       ('S14050', 'Akuwanishi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区阿久和西一丁目', 'Kanagawa Ken Yokohama Seya Kakuwanishi Itchoume', '045-123-4053', 139.4961, 35.45918, 1830.),
       ('S13052', 'Morino store', 13, 'Tokyo', 'Tokyo町田市森野三丁目', 'Tokyo Tomachida Shimorino Sanchoume', '042-123-4030', 139.4383, 35.55293, 1087.),
       ('S14028', 'Futatsubashi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市瀬谷区二ツ橋町', 'Ken Kanagawa Yokohama Seya Ward Office Futatsubashicho', '045-123-4042', 139.4963, 35.46304, 1574.),
       ('S14012', 'Honmokuwada store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市中区本牧和田', 'Kanagawa Ken Yokohama Shinakakuhon Mokuwada', '045-123-4034', 139.6582, 35.42156, 1341.),
       ('S14046', 'Kitayamada store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区北山田一丁目', 'Ken Kanagawa Yokohama Tsuzuki Ward Hall Tsuzuki Ward Hall', '045-123-4049', 139.5916, 35.56189,  831.),
       ('S14022', 'Zushi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture逗子市逗子一丁目', 'Kanagawa Kenzushi Shizushi Ichoume', '046-123-4036', 139.5789, 35.29642, 1838.),
       ('S14011', 'Hiyoshihoncho store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市港北区日吉本町四丁目', 'Kanagawa Ken Yokohama Shiko Hoku Hiyoshi Honcho Yonchome', '045-123-4033', 139.6316, 35.54655,  890.),
       ('S13016', 'Koganei store', 13, 'Tokyo', 'Tokyo小金井市本町一丁目', 'Tokyo Koganei Hongcho Ichoume', '042-123-4015', 139.5094, 35.70018, 1399.),
       ('S14034', 'Kawasaki Nogawa store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture川崎市宮前区野川', 'Kanagawa Kenkawa Sakimiyama Ekunogawa', '044-123-4044', 139.5998, 35.57693, 1318.),
       ('S14048', 'Nakagawa Chuo store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区中川中央二丁目', 'Ken Kanagawa Yokohama Shitsuzuki Kunakagawa Chuo Unichoume', '045-123-4051', 139.5758, 35.54912, 1657.),
       ('S12007', 'Sakura store', 12, 'Chiba', 'Chiba佐倉市上志津', 'Chiba Ken Sakura Shikami Shizu', '043-123-4001', 140.1452, 35.71872, 1895.),
       ('S14026', 'Tsujido West Coast Store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture藤沢市辻堂西海岸二丁目', 'Kanagawa Ken Fujisawa Shitsuji Dounishi Kaigan Nichome', '046-123-4040', 139.4466, 35.32464, 1732.),
       ('S13041', 'Hachioji store', 13, 'Tokyo', 'Tokyo八王子市大塚', 'Tokyo Hachioji Ujisio Otsuka', '042-123-4026', 139.4235, 35.63787,  810.),
       ('S14049', 'Kawasaki Daishi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture川崎市川崎区中瀬三丁目', 'Kanagawa Ken Kawasaki Kawasaki Kunakaze Sanchoume', '044-123-4052', 139.7327, 35.53759,  962.),
       ('S14023', 'Kawasaki store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture川崎市川崎区本町二丁目', 'Kanagawa Ken Kawasaki Kawasaki Kuhoncho Nichome', '044-123-4037', 139.7028, 35.53599, 1804.),
       ('S13018', 'Kiyose store', 13, 'Tokyo', 'Tokyo清瀬市松山一丁目', 'Tokyo Tokiyoshi Matsuyamai', '042-123-4017', 139.5178, 35.76885, 1220.),
       ('S14027', 'Minamifujisawa store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture藤沢市南藤沢', 'Kanagawa Ken Fujisawa Shiminami Fujisawa', '046-123-4041', 139.4896, 35.33762, 1521.),
       ('S14021', 'Isehara store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture伊勢原市伊勢原四丁目', 'Kanagawa Ken Isehara Shiisehara Yonchoume', '046-123-4035', 139.3129, 35.40169,  962.),
       ('S14047', 'Sagamihara store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture相模原市千代田六丁目', 'Kanagawa Kensagami Harashi Chiyoda Rokuchoume', '042-123-4050', 139.3748, 35.55959, 1047.),
       ('S12013', 'Narashino store', 12, 'Chiba', 'Chiba習志野市芝園一丁目', 'Chiba Kennarashi no Shishi Bazono Ichoume', '047-123-4002', 140.022 , 35.66122,  808.),
       ('S14042', 'Shinyamashita store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市中区新山下二丁目', 'Kanagawa Ken Yokohama Shinakakushin Yamashitani Chome', '045-123-4047', 139.6593, 35.43894, 1044.),
       ('S12030', 'Yawata store', 12, 'Chiba', 'Chiba市川市八幡三丁目', 'Cibaken Ichikawa Shiyawata Sanchoume', '047-123-4005', 139.924 , 35.72318, 1162.),
       ('S14025', 'Yamato store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture大和市下和田', 'Ken Kanagawa Yamato Shishimo Wada', '046-123-4039', 139.468 , 35.43414, 1011.),
       ('S14045', 'Atsugi store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture厚木市中町二丁目', 'Kanagawa Ken Atsugi Shinakacho Nichoume', '046-123-4048', 139.3651, 35.44182,  980.),
       ('S12029', 'Higashino store', 12, 'Chiba', 'Chiba浦安市東野一丁目', 'Chiba Ken Urayasushi Higashino Itchoume', '047-123-4004', 139.8968, 35.65086, 1101.),
       ('S12053', 'Takasu store', 12, 'Chiba', 'Chiba浦安市高洲五丁目', 'Chibaken Urayasushitakasugochome', '047-123-4006', 139.9176, 35.63755, 1555.),
       ('S14024', 'Sanda store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture川崎市多摩区三田四丁目', 'Kanagawa Ken Kawasaki Takuma Tayon Chome', '044-123-4038', 139.5424, 35.6077 ,  972.),
       ('S14006', 'Kuzugaya store', 14, 'Kanagawa Prefecture', 'Kanagawa Prefecture横浜市都筑区葛が谷', 'Kanagawa Ken Yokohama Tsuzuki Ward Hall', '045-123-4031', 139.5633, 35.53573, 1886.)],
      dtype=[('store_cd', '<U6'), ('store_name', '<U6'), ('prefecture_cd', '<i4'), ('prefecture', '<U4'), ('address', '<U19'), ('address_kana', '<U30'), ('tel_no', '<U12'), ('longitude', '<f8'), ('latitude', '<f8'), ('floor_area', '<f8')])

Why does this problem ask us to display every row, though?
