[PYTHON] Memory-saving conversion of log data to sequential category features in chronological order

Introduction

Hello, this is Kawamoto, a machine learning engineer. It has been so cold lately that I caught a cold. In this post, I will write about how to preprocess log data that requires time-series information while keeping memory consumption low.

What we want to do

Suppose you have a dataset of per-user behavior logs like the following.

    userid  itemid  categoryid  timestamp
0        0       3           1 2019-01-04
1        0       4           1 2019-01-08
2        0       4           1 2019-01-19
3        0       5           1 2019-01-02
4        0       7           2 2019-01-17
5        0       8           2 2019-01-07
6        1       0           0 2019-01-06
7        1       1           0 2019-01-14
8        1       2           0 2019-01-20
9        1       6           2 2019-01-01
10       1       7           2 2019-01-12
11       1       8           2 2019-01-18
12       2       3           1 2019-01-16
13       2       4           1 2019-01-15
14       2       5           1 2019-01-10
15       2       5           1 2019-01-13
16       2       6           2 2019-01-03
17       2       7           2 2019-01-05
18       2       8           2 2019-01-11
19       2       8           2 2019-01-21
20       2       9           3 2019-01-09
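
As a reference, the sample DataFrame above can be reconstructed directly from the table (all timestamps are the 2019 dates shown):

```python
import pandas as pd

# Recreate the sample behavior log shown above
df = pd.DataFrame({
    'userid':     [0]*6 + [1]*6 + [2]*9,
    'itemid':     [3, 4, 4, 5, 7, 8, 0, 1, 2, 6, 7, 8, 3, 4, 5, 5, 6, 7, 8, 8, 9],
    'categoryid': [1, 1, 1, 1, 2, 2, 0, 0, 0, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 3],
    'timestamp':  pd.to_datetime([
        '2019-01-04', '2019-01-08', '2019-01-19', '2019-01-02', '2019-01-17', '2019-01-07',
        '2019-01-06', '2019-01-14', '2019-01-20', '2019-01-01', '2019-01-12', '2019-01-18',
        '2019-01-16', '2019-01-15', '2019-01-10', '2019-01-13', '2019-01-03', '2019-01-05',
        '2019-01-11', '2019-01-21', '2019-01-09']),
})
```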

From these logs, we want to create variable-length sequence data sorted by time for each user, like this:

Itemid (in chronological order) that each user contacted
[[5, 3, 8, 4, 7, 4], 
 [6, 0, 7, 1, 8, 2], 
 [6, 7, 9, 5, 8, 5, 4, 3, 8]]
Category id (in chronological order) that each user contacted
[[1, 1, 2, 1, 2, 1], 
 [2, 0, 2, 0, 2, 0], 
 [2, 2, 3, 1, 2, 1, 1, 1, 2]]

We also want to create categorical features that take the time series into account. Here we assume the latest record is used for both itemid and categoryid:

The latest itemid that each user has contacted
[4, 2, 8]
The latest category id that each user has contacted
[1, 0, 2]

Once you have such lists, the sequence data can, for example, be padded and fed to Keras (functional API) as sequential inputs:

import tensorflow as tf

# Here df is the grouped DataFrame in which each cell holds a per-user list
inputs = []
inputs.append(tf.keras.preprocessing.sequence.pad_sequences(
    df['itemid'].values.tolist(), padding='post', truncating='post', maxlen=10))
inputs.append(tf.keras.preprocessing.sequence.pad_sequences(
    df['categoryid'].values.tolist(), padding='post', truncating='post', maxlen=10))
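
If TensorFlow is not at hand, the effect of that call can be sketched in plain Python: post-truncate, then post-pad with zeros up to `maxlen=10`, as in the snippet above. (This is an illustrative stand-in, not the Keras implementation itself.)

```python
def pad_post(seqs, maxlen, value=0):
    # Truncate each sequence to maxlen, then pad at the end with `value`
    return [s[:maxlen] + [value] * (maxlen - len(s[:maxlen])) for s in seqs]

itemid_lists = [[5, 3, 8, 4, 7, 4],
                [6, 0, 7, 1, 8, 2],
                [6, 7, 9, 5, 8, 5, 4, 3, 8]]
print(pad_post(itemid_lists, maxlen=10))
# [[5, 3, 8, 4, 7, 4, 0, 0, 0, 0],
#  [6, 0, 7, 1, 8, 2, 0, 0, 0, 0],
#  [6, 7, 9, 5, 8, 5, 4, 3, 8, 0]]
```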

How to write in Pandas

For the sequence data, you can write the following in pandas:

import pprint

# Sort by user ID, then by timestamp
df = df.sort_values(by=['userid','timestamp'])
# Group by user and aggregate each column into a list
df = df.groupby('userid').agg(list).reset_index(drop=False)

print('Itemid (in chronological order) that each user contacted')
pprint.pprint(df['itemid'].values.tolist())
print('Category id (in chronological order) that each user contacted')
pprint.pprint(df['categoryid'].values.tolist())

This gives the result below.

Itemid (in chronological order) that each user contacted
[[5, 3, 8, 4, 7, 4], 
 [6, 0, 7, 1, 8, 2], 
 [6, 7, 9, 5, 8, 5, 4, 3, 8]]
Category id (in chronological order) that each user contacted
[[1, 1, 2, 1, 2, 1], 
 [2, 0, 2, 0, 2, 0], 
 [2, 2, 3, 1, 2, 1, 1, 1, 2]]

Similarly, for the categorical data,

# From the original (ungrouped) DataFrame, take each user's row with the latest timestamp
df_cate = df.loc[df.groupby('userid')['timestamp'].idxmax()]

print(df_cate)
print('The latest itemid that each user has contacted')
pprint.pprint(df_cate['itemid'].values.tolist())
print('The latest category id that each user has contacted')
pprint.pprint(df_cate['categoryid'].values.tolist())

gives the result below.

The latest itemid that each user has contacted
[4, 2, 8]
The latest category id that each user has contacted
[1, 0, 2]

Possible problems with Pandas

If the dataset grows large, the pandas approach above will run out of memory and one-shot conversion becomes impossible. If the dataset itself does not fit in memory, pandas cannot handle it at all. And if you split the dataset and process it in chunks, you must somehow carry over the time-series information from records that are not in the current chunk.
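
To make the chunked-reading scenario concrete, pandas can stream a CSV in fixed-size chunks via `chunksize`, holding only one chunk in memory at a time. This sketch uses an in-memory stand-in for a large log file (the file contents here are hypothetical); the point is that per-user time-series state must be carried across chunks externally, which is what the method below provides:

```python
import io
import pandas as pd

# A stand-in for a large log file on disk (hypothetical contents)
csv_data = io.StringIO(
    "userid,itemid,timestamp\n"
    "0,3,2019-01-04\n"
    "1,0,2019-01-06\n"
    "0,5,2019-01-02\n"
    "1,6,2019-01-01\n"
)

# read_csv(chunksize=...) returns an iterator of DataFrames,
# so only `chunksize` rows are held in memory at a time
for i, chunk in enumerate(pd.read_csv(csv_data, parse_dates=['timestamp'], chunksize=2)):
    # Each chunk is an ordinary DataFrame; time-series state from
    # earlier chunks has to be accumulated outside this loop
    print(i, len(chunk))
```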

What we want to do, restated

Given the background above, we needed a method that:

- **reduces memory consumption during preprocessing**
- **supports reading the dataset in chunks (split reading)**

while still producing sequence data in chronological order. Below, I describe the specific method.

Method

The idea is to build a list that holds the time-series information, and from it create:

- sequence features that respect the time series
- category features that respect the time series

I describe how to create each, and then explain how to handle a dataset that has been split into chunks.

Creating the list to sort

First, create a list on which chronological operations can be performed. If `item` is the value you want as per-user, chronologically ordered sequence data, and `timestamp` is its time-series information, create a three-dimensional list of the following form:

[[[item,timestamp],[item,timestamp]...[item,timestamp]],
 [[item,timestamp],[item,timestamp]...[item,timestamp]],
 ...
 [[item,timestamp],[item,timestamp]...[item,timestamp]]]

Here, the first dimension uses the user id as its index.

The process is as follows.

def create_list(df, user_index_col, sort_col, target_col, user_num):
    """
    :param df: input DataFrame
    :param user_index_col: user ID column
    :param sort_col: column containing the sort key
    :param target_col: column whose values you want as sequence data
    :param user_num: number of users (obtain from an encoder etc.)
    """
    inputs = [[] for _ in range(user_num)]
    for _, user_index, sort_value, target_value in df[[user_index_col, sort_col, target_col]].itertuples():
        inputs[user_index].append([target_value, sort_value])

    return inputs

Running this on the original dataset,

itemid_inputs = create_list(df, user_index_col='userid', sort_col='timestamp', target_col='itemid', user_num=3)
categoryid_inputs = create_list(df, user_index_col='userid', sort_col='timestamp', target_col='categoryid', user_num=3)

print('itemid')
pprint.pprint(itemid_inputs)

print('categoryid')
pprint.pprint(categoryid_inputs)

A list like the one below will be created.

itemid
[[[3, Timestamp('2019-01-04 00:00:00')],
  [4, Timestamp('2019-01-08 00:00:00')],
  [4, Timestamp('2019-01-19 00:00:00')],
  [5, Timestamp('2019-01-02 00:00:00')],
  [7, Timestamp('2019-01-17 00:00:00')],
  [8, Timestamp('2019-01-07 00:00:00')]],
 [[0, Timestamp('2019-01-06 00:00:00')],
  [1, Timestamp('2019-01-14 00:00:00')],
  [2, Timestamp('2019-01-20 00:00:00')],
  [6, Timestamp('2019-01-01 00:00:00')],
  [7, Timestamp('2019-01-12 00:00:00')],
  [8, Timestamp('2019-01-18 00:00:00')]],
 [[3, Timestamp('2019-01-16 00:00:00')],
  [4, Timestamp('2019-01-15 00:00:00')],
  [5, Timestamp('2019-01-10 00:00:00')],
  [5, Timestamp('2019-01-13 00:00:00')],
  [6, Timestamp('2019-01-03 00:00:00')],
  [7, Timestamp('2019-01-05 00:00:00')],
  [8, Timestamp('2019-01-11 00:00:00')],
  [8, Timestamp('2019-01-21 00:00:00')],
  [9, Timestamp('2019-01-09 00:00:00')]]]
categoryid
[[[1, Timestamp('2019-01-04 00:00:00')],
  [1, Timestamp('2019-01-08 00:00:00')],
  [1, Timestamp('2019-01-19 00:00:00')],
  [1, Timestamp('2019-01-02 00:00:00')],
  [2, Timestamp('2019-01-17 00:00:00')],
  [2, Timestamp('2019-01-07 00:00:00')]],
 [[0, Timestamp('2019-01-06 00:00:00')],
  [0, Timestamp('2019-01-14 00:00:00')],
  [0, Timestamp('2019-01-20 00:00:00')],
  [2, Timestamp('2019-01-01 00:00:00')],
  [2, Timestamp('2019-01-12 00:00:00')],
  [2, Timestamp('2019-01-18 00:00:00')]],
 [[1, Timestamp('2019-01-16 00:00:00')],
  [1, Timestamp('2019-01-15 00:00:00')],
  [1, Timestamp('2019-01-10 00:00:00')],
  [1, Timestamp('2019-01-13 00:00:00')],
  [2, Timestamp('2019-01-03 00:00:00')],
  [2, Timestamp('2019-01-05 00:00:00')],
  [2, Timestamp('2019-01-11 00:00:00')],
  [2, Timestamp('2019-01-21 00:00:00')],
  [3, Timestamp('2019-01-09 00:00:00')]]]

Sort in chronological order

Next, add the process of sorting the created list in chronological order.

def sort_list(inputs, is_descending):
    """
    :param inputs: list created by create_list
    :param is_descending: whether to sort in descending order
    """
    return [sorted(i_input, key=lambda i: i[1], reverse=is_descending) for i_input in inputs]

When this process is performed,

itemid_inputs = sort_list(itemid_inputs, is_descending=False)
categoryid_inputs = sort_list(categoryid_inputs, is_descending=False)

print('itemid')
pprint.pprint(itemid_inputs)

print('categoryid')
pprint.pprint(categoryid_inputs)

The list is sorted in chronological order as shown below.

itemid
[[[5, Timestamp('2019-01-02 00:00:00')],
  [3, Timestamp('2019-01-04 00:00:00')],
  [8, Timestamp('2019-01-07 00:00:00')],
  [4, Timestamp('2019-01-08 00:00:00')],
  [7, Timestamp('2019-01-17 00:00:00')],
  [4, Timestamp('2019-01-19 00:00:00')]],
 [[6, Timestamp('2019-01-01 00:00:00')],
  [0, Timestamp('2019-01-06 00:00:00')],
  [7, Timestamp('2019-01-12 00:00:00')],
  [1, Timestamp('2019-01-14 00:00:00')],
  [8, Timestamp('2019-01-18 00:00:00')],
  [2, Timestamp('2019-01-20 00:00:00')]],
 [[6, Timestamp('2019-01-03 00:00:00')],
  [7, Timestamp('2019-01-05 00:00:00')],
  [9, Timestamp('2019-01-09 00:00:00')],
  [5, Timestamp('2019-01-10 00:00:00')],
  [8, Timestamp('2019-01-11 00:00:00')],
  [5, Timestamp('2019-01-13 00:00:00')],
  [4, Timestamp('2019-01-15 00:00:00')],
  [3, Timestamp('2019-01-16 00:00:00')],
  [8, Timestamp('2019-01-21 00:00:00')]]]
categoryid
[[[1, Timestamp('2019-01-02 00:00:00')],
  [1, Timestamp('2019-01-04 00:00:00')],
  [2, Timestamp('2019-01-07 00:00:00')],
  [1, Timestamp('2019-01-08 00:00:00')],
  [2, Timestamp('2019-01-17 00:00:00')],
  [1, Timestamp('2019-01-19 00:00:00')]],
 [[2, Timestamp('2019-01-01 00:00:00')],
  [0, Timestamp('2019-01-06 00:00:00')],
  [2, Timestamp('2019-01-12 00:00:00')],
  [0, Timestamp('2019-01-14 00:00:00')],
  [2, Timestamp('2019-01-18 00:00:00')],
  [0, Timestamp('2019-01-20 00:00:00')]],
 [[2, Timestamp('2019-01-03 00:00:00')],
  [2, Timestamp('2019-01-05 00:00:00')],
  [3, Timestamp('2019-01-09 00:00:00')],
  [1, Timestamp('2019-01-10 00:00:00')],
  [2, Timestamp('2019-01-11 00:00:00')],
  [1, Timestamp('2019-01-13 00:00:00')],
  [1, Timestamp('2019-01-15 00:00:00')],
  [1, Timestamp('2019-01-16 00:00:00')],
  [2, Timestamp('2019-01-21 00:00:00')]]]

Creation of series data considering time series

First, the process for creating variable-length series features (sequential features) from the list created above is as follows.

def create_sequential(inputs):
    # Drop the timestamps, keeping only the values in sorted order
    return [[i[0] for i in i_input] for i_input in inputs]

When you do this,

print('Itemid (in chronological order) that each user contacted')
pprint.pprint(create_sequential(itemid_inputs))

print('Category id (in chronological order) that each user contacted')
pprint.pprint(create_sequential(categoryid_inputs))

You can get the result you were looking for.

Itemid (in chronological order) that each user contacted
[[5, 3, 8, 4, 7, 4], 
 [6, 0, 7, 1, 8, 2], 
 [6, 7, 9, 5, 8, 5, 4, 3, 8]]

Category id (in chronological order) that each user contacted
[[1, 1, 2, 1, 2, 1], 
 [2, 0, 2, 0, 2, 0], 
 [2, 2, 3, 1, 2, 1, 1, 1, 2]]

Creation of category data considering time series

Next, the process to get the latest record of each user as a categorical variable from the list created above is as follows.

def create_category(inputs, n=-1):
    """
    :param inputs: sorted list created above
    :param n: which position of the chronological list to keep (-1 = latest)
    """
    # Drop the timestamps and keep only the n-th value in chronological order
    return [[i[0] for i in i_input][n] for i_input in inputs]

When you do this,

print('The latest itemid that each user has contacted')
pprint.pprint(create_category(itemid_inputs, -1))

print('The latest category id that each user has contacted')
pprint.pprint(create_category(categoryid_inputs, -1))

You can get the result you were looking for as follows:

The latest itemid that each user has contacted
[4, 2, 8]

The latest category id that each user has contacted
[1, 0, 2]

Processing summary

The functions that were separated above for explanation can be combined as follows.


def create_features(
        df, user_index_col, sort_col, target_col, user_num, is_descending, is_sequence, n=-1):
    """
    :param df: input DataFrame
    :param user_index_col: user ID column
    :param sort_col: column containing the sort key
    :param target_col: column whose values you want as features
    :param user_num: number of users (obtain from an encoder etc.)
    :param is_descending: whether to sort in descending order
    :param is_sequence: whether to return sequence features (otherwise a single category)
    :param n: which position of the sorted list to keep (category only)
    """
    # Build [value, sort_key] pairs per user
    inputs = [[] for _ in range(user_num)]
    for _, user_index, sort_value, target_value in df[[user_index_col, sort_col, target_col]].itertuples():
        inputs[user_index].append([target_value, sort_value])

    # Sort each user's list by the sort key
    inputs = [sorted(i_input, key=lambda i: i[1], reverse=is_descending) for i_input in inputs]

    if is_sequence:
        return [[i[0] for i in i_input] for i_input in inputs]
    else:
        return [[i[0] for i in i_input][n] for i_input in inputs]
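
As a quick self-contained check, the combined function can be applied to the sample data from the top of the post (the DataFrame and the function are repeated here so the snippet runs on its own):

```python
import pandas as pd

def create_features(df, user_index_col, sort_col, target_col, user_num,
                    is_descending, is_sequence, n=-1):
    # Accumulate [value, sort_key] pairs per user, sort, then project
    inputs = [[] for _ in range(user_num)]
    for _, user_index, sort_value, target_value in df[[user_index_col, sort_col, target_col]].itertuples():
        inputs[user_index].append([target_value, sort_value])
    inputs = [sorted(i_input, key=lambda i: i[1], reverse=is_descending) for i_input in inputs]
    if is_sequence:
        return [[i[0] for i in i_input] for i_input in inputs]
    return [[i[0] for i in i_input][n] for i_input in inputs]

df = pd.DataFrame({
    'userid': [0]*6 + [1]*6 + [2]*9,
    'itemid': [3, 4, 4, 5, 7, 8, 0, 1, 2, 6, 7, 8, 3, 4, 5, 5, 6, 7, 8, 8, 9],
    'timestamp': pd.to_datetime([
        '2019-01-04', '2019-01-08', '2019-01-19', '2019-01-02', '2019-01-17', '2019-01-07',
        '2019-01-06', '2019-01-14', '2019-01-20', '2019-01-01', '2019-01-12', '2019-01-18',
        '2019-01-16', '2019-01-15', '2019-01-10', '2019-01-13', '2019-01-03', '2019-01-05',
        '2019-01-11', '2019-01-21', '2019-01-09']),
})

seqs = create_features(df, 'userid', 'timestamp', 'itemid', user_num=3,
                       is_descending=False, is_sequence=True)
latest = create_features(df, 'userid', 'timestamp', 'itemid', user_num=3,
                         is_descending=False, is_sequence=False, n=-1)
print(seqs)    # [[5, 3, 8, 4, 7, 4], [6, 0, 7, 1, 8, 2], [6, 7, 9, 5, 8, 5, 4, 3, 8]]
print(latest)  # [4, 2, 8]
```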

How to divide and read data

This is the part I most wanted to write about. If you build a list that holds the time-series information as described above, the same approach also works when the data cannot all fit in memory and you must split the dataset and process it chunk by chunk.

As an example, suppose the original DataFrame has been split into three parts and stored in a dictionary as shown below. (This is unlikely in practice, but it serves as an illustration.)

{'df1':    userid  itemid  categoryid  timestamp
0       0       3           1 2019-01-04
1       0       4           1 2019-01-08
2       0       4           1 2019-01-19
3       0       5           1 2019-01-02
4       0       7           2 2019-01-17
5       0       8           2 2019-01-07
6       1       0           0 2019-01-06,
 'df2':     userid  itemid  categoryid  timestamp
7        1       1           0 2019-01-14
8        1       2           0 2019-01-20
9        1       6           2 2019-01-01
10       1       7           2 2019-01-12
11       1       8           2 2019-01-18
12       2       3           1 2019-01-16
13       2       4           1 2019-01-15,
 'df3':     userid  itemid  categoryid  timestamp
14       2       5           1 2019-01-10
15       2       5           1 2019-01-13
16       2       6           2 2019-01-03
17       2       7           2 2019-01-05
18       2       8           2 2019-01-11
19       2       8           2 2019-01-21
20       2       9           3 2019-01-09}

Because the time-series information is kept in the list, we only need to modify the function to loop over the chunks, for example:


def create_features_by_datasets(
        df_dict, user_index_col, sort_col, target_col, user_num, is_descending, is_sequence, n=-1):
    """Same as create_features, but accepts a dict of DataFrame chunks."""
    inputs = [[] for _ in range(user_num)]

    # Accumulate records from each chunk of the dataset
    for df in df_dict.values():
        for _, user_index, sort_value, target_value in df[[user_index_col, sort_col, target_col]].itertuples():
            inputs[user_index].append([target_value, sort_value])

    inputs = [sorted(i_input, key=lambda i: i[1], reverse=is_descending) for i_input in inputs]

    if is_sequence:
        return [[i[0] for i in i_input] for i_input in inputs]
    else:
        return [[i[0] for i in i_input][n] for i_input in inputs]

If you do the following,

print('Itemid (in chronological order) that each user contacted')
pprint.pprint(create_features_by_datasets(df_dict, user_index_col='userid', sort_col='timestamp', target_col='itemid', user_num=3, is_descending=False, is_sequence=True))
print('The latest itemid that each user has contacted')
pprint.pprint(create_features_by_datasets(df_dict, user_index_col='userid', sort_col='timestamp', target_col='itemid', user_num=3, is_descending=False, is_sequence=False))

The result is the same as above.

Itemid (in chronological order) that each user contacted
[[5, 3, 8, 4, 7, 4], 
 [6, 0, 7, 1, 8, 2],
 [6, 7, 9, 5, 8, 5, 4, 3, 8]]

The latest itemid that each user has contacted
[4, 2, 8]

Sorting by other than time series information

So far the sort key has been the time-series information, but you can just as easily sort by another column, or in descending order, simply by changing the arguments passed to the functions above.

For example, in the following dataset

    userid  itemid  categoryid     score
0        0       3           1  0.730968
1        0       3           1  0.889117
2        0       3           1  0.714828
3        0       4           1  0.430737
4        0       5           1  0.734746
5        0       7           2  0.412346
6        1       0           0  0.660430
7        1       3           1  0.095672
8        1       4           1  0.985072
9        1       5           1  0.629274
10       1       6           2  0.617733
11       1       7           2  0.636219
12       1       8           2  0.246769
13       1       8           2  0.020140
14       2       0           0  0.812525
15       2       1           0  0.671100
16       2       2           0  0.174011
17       2       2           0  0.164321
18       2       3           1  0.783329
19       2       4           1  0.068837
20       2       5           1  0.265281

with a score column instead of a timestamp, creating sequence data and category data in descending order of score looks like this:

print('Score order (itemid)')
pprint.pprint(create_features(df, user_index_col='userid', sort_col='score', target_col='itemid', user_num=3, is_descending=True, is_sequence=True))
print('Maximum score (itemid)')
pprint.pprint(create_features(df, user_index_col='userid', sort_col='score', target_col='itemid', user_num=3, is_descending=True, is_sequence=False, n=0))

The result is as follows.

Score order (itemid)
[[3, 5, 3, 3, 4, 7], 
 [4, 0, 7, 5, 6, 8, 3, 8], 
 [0, 3, 1, 5, 2, 2, 4]]

Maximum score (itemid)
[3, 4, 0]

In closing

In this post, I described a memory-saving way to convert log data into sequence features and categorical features while preserving time-series information. If you know a better way, please let me know in the comments. Thank you for reading.
