This is day 11 of the TensorFlow 2.0 Advent Calendar 2019.
In this article I would like to summarize how to preprocess text with the tf.data.Dataset API.
The explanation proceeds in the following order.
Since the explanation (and the code) is fairly long, if you just want a bird's-eye view of the code, you can refer to it [here](https://github.com/tokusumi/nlp-dnn-baselines/blob/master/nlp-dnn-baselines/preprocess/tfdata.py).
(Note that the content of this article has not been fully verified. The code runs, but I am not sure that every part of it actually contributes to better performance. I will update it from time to time, so please treat it as a reference only.)
The following Advent Calendar articles are related; I hope they are helpful as well.

- Day 3: A basic introduction to the tf.data.Dataset API (The story of the strong dataset function that can be used with TensorFlow)
- Day 7: The procedure for splitting the livedoor news corpus with MeCab and handling it with the tf.data.Dataset API (Separate livedoor news corpus using Mecab and tf.data: /masahikoofjoyto/items/b444262405ad7371c78a)
- Day 10: Speeding up map by parallelizing with joblib. This article introduces the parallelization that tf.data's .map() itself provides, so I would like to verify which is faster (or rather, it seems the two can be combined). ([[TF2.0 application] A case where general-purpose Data Augmentation was parallelized and realized at high speed with the strong data set function of the TF example](https://qiita.com/Suguru_Toyohara/items/528447a73fc6dd20ea57))
I think a typical training workflow looks like this.
If you run steps 1 through 4 on the whole dataset at once, you will run out of resources as the dataset grows. (Images in particular are often several GB, so even step 1, just reading the data, may not fit in memory at once.) Therefore it is better to split the data into batches (for example, a few images at a time) and run steps 1 through 4 on each batch, repeating until the data is exhausted. This is called pipeline processing.
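As a rough, hypothetical sketch (the data and operations here are illustrative, not taken from the article), a batched tf.data pipeline looks like this:

```python
import tensorflow as tf

# A minimal sketch of pipeline processing (illustrative data, not from
# the article): read, preprocess, and batch, then consume batch by batch.
texts = ["First line of text", "Second line", "Third line", "Fourth line"]
dataset = (
    tf.data.Dataset.from_tensor_slices(texts)   # read the data
    .map(lambda line: tf.strings.lower(line))   # preprocess each element
    .batch(2)                                   # group into batches of 2
)
for batch in dataset:                           # consume one batch at a time
    print(batch.numpy())
```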
With a naive pipeline, this sequence of steps leaves wasted idle time in the overhead portions, as illustrated in https://www.tensorflow.org/guide/data_performance.
The tf.data.Dataset API provides the following features to overlap that overhead and reduce the unnecessary waiting time.

- prefetch: run CPU work and GPU/TPU work in parallel
- map: run preprocessing in parallel
- read_file: read files in parallel
These features are described later. First, let's look at text preprocessing itself to get used to the tf.data.Dataset API.
Now let's preprocess text using the tf.data.Dataset API. The order may vary, but I think the standard text preprocessing flow is as follows.
2.1. load

First, create a dataset loader. The processing flow is as follows.
Datasets keep getting larger these days, so it is rare for the data to already be on the local disk from the start. Instead, one of the following is typical.

- Download from external storage
- Download from cloud storage
- Fetch from a database

Here is an example of simply retrieving data from external storage (no authentication required). The code below downloads the text files cowper.txt, derby.txt, and butler.txt to the local disk. (We use this English text data because it is easy to download, but in practice the preprocessing below is intended for Japanese.) The function returns the list of local file paths it downloaded. If you swap in an appropriate download method and produce the same kind of output, you can follow the same procedure as below.
from typing import List

import tensorflow as tf

def download_file(directory_url: str, file_names: List[str]) -> List[str]:
    """Download each file and return the list of local file paths."""
    file_paths = [
        tf.keras.utils.get_file(file_name, directory_url + file_name)
        for file_name in file_names
    ]
    return file_paths
# download dataset in local disk
directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = download_file(directory_url, file_names)
The rest of the loading process is summarized below. After this you have a Dataset that iterates over (text, label) pairs.
def load_dataset(file_paths: List[str], file_names: List[str], BUFFER_SIZE=1000):
    # Create a Dataset of the files to load
    files = tf.data.Dataset.list_files(file_paths)
    # Apply a map function to each file
    # (labeling_map_fn is described later under "Read data & labeling")
    datasets = files.interleave(
        labeling_map_fn(file_names),
    )
    # Shuffle the data
    all_labeled_data = datasets.shuffle(
        BUFFER_SIZE, reshuffle_each_iteration=False
    )
    return all_labeled_data
datasets = load_dataset(file_paths, file_names)
text, label = next(iter(datasets))
print(text)
# <tf.Tensor: id=99928, shape=(), dtype=string, numpy=b'Comes furious on, but speeds not, kept aloof'>
print(label)
# <tf.Tensor: id=99929, shape=(), dtype=int64, numpy=0>
We will look at the processing in detail.
`files`, created by tf.data.Dataset.list_files, is a Dataset instance whose values are local disk paths, as shown below. It is a bit of a hassle, but to check the contents of a Dataset instance you have to iterate over it. Even more annoyingly, you then get the actual value with the `.numpy()` method.
print(files)
# <DatasetV1Adapter shapes: (), types: tf.string>
next(iter(files))
# <tf.Tensor: id=99804, shape=(), dtype=string, numpy=b'/Users/username/.keras/datasets/cowper.txt'>
next(iter(files)).numpy()
# b'/Users/username/.keras/datasets/cowper.txt'
.interleave() applies a map function to each element of the dataset, then flattens and combines the results. In this usage we first define a map function that reads a text file and returns a Dataset iterating over it line by line. Passing that to `.interleave()` then gives us, instead of a separate Dataset per file, a single flat Dataset that iterates line by line over all of the files.
Reference: Official documentation
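As a toy illustration (not from the article), interleave flattens the per-element Datasets produced by the map function into a single stream:

```python
import tensorflow as tf

# Each element is mapped to a small Dataset; interleave flattens them.
base = tf.data.Dataset.from_tensor_slices([10, 20, 30])
flat = base.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(2),
    cycle_length=1,  # consume the inner Datasets one after another
)
print([v.numpy() for v in flat])  # [10, 10, 20, 20, 30, 30]
```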
As the name suggests, .shuffle() shuffles the Dataset. During iteration it draws elements at random from a buffer of buffer_size elements; once more than buffer_size elements have been consumed, it draws from the next buffer_size elements. A larger buffer_size therefore gives a more thorough shuffle, but it also consumes correspondingly more resources, so there is a trade-off.
Also, if you set `reshuffle_each_iteration=False`, the data is shuffled into the same order no matter how many times you restart iteration. Since the default is True, after simply calling `.shuffle()`, every `next(iter(dataset))` or `for data in dataset:` iterates in a different order. For better or worse, be aware of this.
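A small sketch of this behaviour (the printed orders are only examples):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# Default (reshuffle_each_iteration=True): each pass is shuffled again,
# so two iterations usually give different orders.
reshuffled = dataset.shuffle(5)
print([v.numpy() for v in reshuffled])  # e.g. [3, 0, 4, 1, 2]
print([v.numpy() for v in reshuffled])  # likely a different order

# reshuffle_each_iteration=False: the shuffled order is fixed,
# so every iteration yields the same order.
fixed = dataset.shuffle(5, reshuffle_each_iteration=False)
print([v.numpy() for v in fixed])  # e.g. [2, 4, 0, 3, 1]
print([v.numpy() for v in fixed])  # same order as above
```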
Next I will show how to read .txt files where the file name is the label and each line is one text example. I think this is standard processing, but please adapt it to your data format as needed.
Here we obtain a flat Dataset of texts and labels by passing the following map function to `.interleave()`.
def labeling_map_fn(file_names):
    def _get_label(datasets):
        """
        Parse the file name from the dataset value (a file path)
        and use its index in file_names as the label ID.
        """
        filename = datasets.numpy().decode().rsplit('/', 1)[-1]
        label = file_names.index(filename)
        return label

    def _labeler(example, label):
        """Attach the label to a dataset element."""
        return tf.cast(example, tf.string), tf.cast(label, tf.int64)

    def _labeling_map_fn(file_path: str):
        """Main map function."""
        # Read the text file line by line
        datasets = tf.data.TextLineDataset(file_path)
        # Convert the file path into a label ID
        label = tf.py_function(_get_label, inp=[file_path], Tout=tf.int64)
        # Attach the label ID to every element of the Dataset
        labeled_dataset = datasets.map(lambda ex: _labeler(ex, label))
        return labeled_dataset

    return _labeling_map_fn
Along the way we use a function called tf.py_function (doc). This is needed because the map functions of the Dataset API receive Tensor objects as arguments. A graph-mode Tensor object does not let you read its value directly from Python, but if you wrap the function with tf.py_function, the argument is passed as the same kind of value you get from `next(iter(dataset))`, so you can access it with `.numpy()` and write familiar Python code.
However, tf.py_function seems to have some performance drawbacks, so I would like to avoid it as much as possible.
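As a minimal illustration (not the article's code), tf.py_function lets an ordinary Python function receive eager tensors, so `.numpy()` works inside it:

```python
import tensorflow as tf

def to_upper(text):
    # `text` arrives here as an EagerTensor, so .numpy() is available.
    return text.numpy().decode().upper()

dataset = tf.data.Dataset.from_tensor_slices(["hello", "world"])
dataset = dataset.map(
    lambda t: tf.py_function(to_upper, inp=[t], Tout=tf.string)
)
print([v.numpy() for v in dataset])  # [b'HELLO', b'WORLD']
```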
2.2. standardize & 2.3. tokenize

Here we perform several processing steps at once, and I assume you will do them with an existing Python library. TensorFlow does provide quite a few text operations, but they are fairly hard to use here, so I assume that functions already written in Python are used as-is. At the very least, TensorFlow alone cannot do Japanese word segmentation, so for Japanese this step is essential.
janome is convenient: it is a morphological analyzer implemented in pure Python that you can install with nothing but pip install. It lets you flexibly build a standardization pipeline called an Analyzer, as shown below.
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import (
    RegexReplaceCharFilter  # string replacement
)
from janome.tokenfilter import (
    CompoundNounFilter,  # merge compound nouns
    POSStopFilter,       # remove specific parts of speech
    LowerCaseFilter      # convert to lowercase
)

def janome_tokenizer():
    # standardize texts
    char_filters = [RegexReplaceCharFilter(u'Janome', u'janome')]
    tokenizer = Tokenizer()
    # POSStopFilter takes Japanese POS names: '記号' (symbols), '助詞' (particles)
    token_filters = [CompoundNounFilter(), POSStopFilter(['記号', '助詞']), LowerCaseFilter()]
    analyze = Analyzer(char_filters, tokenizer, token_filters).analyze

    def _tokenizer(text, label):
        tokenized_text = " ".join([wakati.surface for wakati in analyze(text.numpy().decode())])
        return tokenized_text, label

    return _tokenizer
This alone standardizes and tokenizes the text, as follows.
text, _ = janome_tokenizer()(tf.constant('Janome is a morphological analyzer. Easy to use.'), 0)
print(text)
# 'janome morphological analyzer easy to use.'
Call the function above from the Dataset API. To do so, again convert it with tf.py_function; you need to specify the output types. Then you can apply it to the dataset with `.map()`.
def tokenize_map_fn(tokenizer):
    """
    convert python function for tf.data map
    """
    def _tokenize_map_fn(text: str, label: int):
        return tf.py_function(tokenizer, inp=[text, label], Tout=(tf.string, tf.int64))
    return _tokenize_map_fn

datasets = datasets.map(tokenize_map_fn(janome_tokenizer()))
2.4. encode

Use the tensorflow_datasets.text API to encode (convert strings to IDs).
In particular, `tfds.features.text.Tokenizer()` and `tfds.features.text.TokenTextEncoder` are useful for encoding.
First you need to build a vocabulary. If you already have one, you can skip this step.
Here we build the vocabulary from the training data: use `tfds.features.text.Tokenizer()` to extract tokens and a set() to remove duplicates.
from typing import Set

import tensorflow_datasets as tfds

def get_vocabulary(datasets) -> Set[str]:
    tokenizer = tfds.features.text.Tokenizer().tokenize

    def _tokenize_map_fn(text, label):
        def _tokenize(text, label):
            return tokenizer(text.numpy()), label
        return tf.py_function(_tokenize, inp=[text, label], Tout=(tf.string, tf.int64))

    dataset = datasets.map(_tokenize_map_fn)
    vocab = {g.decode() for f, _ in dataset for g in f.numpy()}
    return vocab
vocab_set = get_vocabulary(datasets)
print(vocab_set)
# {'indomitable', 'suspicion', 'wer', ... }
encode

Here we use `tfds.features.text.TokenTextEncoder()` to convert each token in the vocabulary into an ID. Use the following encode_map_fn() with `datasets.map()`.
def encoder(vocabulary_set: Set[str]):
    """
    encode text to numbers. must set vocabulary_set
    """
    encoder = tfds.features.text.TokenTextEncoder(vocabulary_set).encode

    def _encode(text: str, label: int):
        encoded_text = encoder(text.numpy())
        return encoded_text, label

    return _encode

def encode_map_fn(encoder):
    """
    convert python function for tf.data map
    """
    def _encode_map_fn(text: str, label: int):
        return tf.py_function(encoder, inp=[text, label], Tout=(tf.int64, tf.int64))
    return _encode_map_fn
datasets = datasets.map(encode_map_fn(encoder(vocab_set)))
print(next(iter(datasets))[0].numpy())
# [111, 1211, 4, 10101]
2.5. split

Split the dataset into train and test. You can skip this if the data is already split. With the Dataset API, splitting a dataset is very easy to implement, as shown below.
def split_train_test(data, TEST_SIZE: int, BUFFER_SIZE: int, SEED=123):
    """
    TEST_SIZE = number of test examples
    note: because of reshuffle_each_iteration=True (the default),
    train_data is reshuffled every time you reuse it.
    """
    train_data = data.skip(TEST_SIZE).shuffle(BUFFER_SIZE, seed=SEED)
    test_data = data.take(TEST_SIZE)
    return train_data, test_data
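For example, a usage sketch (the sizes here are illustrative values, not ones fixed by the article):

```python
# Hold out TEST_SIZE examples for test and shuffle the remaining train data.
TEST_SIZE = 1000
BUFFER_SIZE = 10000
train_data, test_data = split_train_test(datasets, TEST_SIZE, BUFFER_SIZE)
```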
2.6. padding & 2.7. batch

With the tf.data.Dataset API, padding and batching can be done at the same time. Here BATCH_SIZE is the batch size (and epochs, used later in training, is the number of epochs). A few things to keep in mind:

- If you set `drop_remainder=True`, the last incomplete batch of an iteration (the one that does not reach BATCH_SIZE) is dropped.
- You can specify the padding size (= maximum length) with padded_shapes. If you omit this argument, each batch is padded to the longest element in that batch.
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=True)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=False)
Here, max_len can either be obtained from the dataset, as shown below, or simply hard-coded.
Most models need a maximum token length. Here we compute it from the dataset; if you decide to hard-code it, you can skip the following step.
def get_max_len(datasets) -> int:
    tokenizer = tfds.features.text.Tokenizer().tokenize

    def _get_len_map_fn(text: str, label: int):
        def _get_len(text: str):
            return len(tokenizer(text.numpy()))
        return tf.py_function(_get_len, inp=[text, ], Tout=tf.int32)

    dataset = datasets.map(_get_len_map_fn)
    max_len = max({f.numpy() for f in dataset})
    return max_len
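A short usage sketch (my own illustration, computing the length over the whole dataset before batching):

```python
# Compute the maximum token length once and reuse it as `max_len`
# in the padded_batch() calls above.
max_len = get_max_len(datasets)
```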
So far we have walked through the implementation using the tf.data.Dataset API in the flow above.
At training time, simply pass the data to the `.fit()` method as shown below.
model.fit(train_data,
          epochs=epochs,
          validation_data=test_data)
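For reference, here is a minimal sketch of a model that accepts these padded batches; the article itself does not define a model, so everything here (layer choices and sizes) is an assumption:

```python
import tensorflow as tf

# Hypothetical minimal model (not from the article). vocab_set comes from
# get_vocabulary() above; +2 accounts for the padding ID (0) and the OOV ID
# used by tfds.features.text.TokenTextEncoder.
vocab_size = len(vocab_set) + 2
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(3, activation='softmax'),  # 3 files -> 3 label IDs
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```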
As explained at the beginning, a naive sequence of preprocessing steps leaves unnecessary waiting time in the overhead portions, as illustrated in https://www.tensorflow.org/guide/data_performance.
The tf.data.Dataset API provides the following features to overlap that overhead and reduce the unnecessary waiting time.

- prefetch: run CPU work and GPU/TPU work in parallel
- map: run preprocessing in parallel
- read_file: read files in parallel
Reference: Optimizing input pipelines with tf.data

prefetch

prefetch runs the producer work on the CPU in parallel with the consumer work on the GPU/TPU. The buffer size is tuned automatically by tf.data.experimental.AUTOTUNE. https://www.tensorflow.org/guide/data_performance
There is no hassle: just add the following call at the end of the pipeline. (In this article we do this for both train_data and test_data.)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
map

The map function can also be parallelized.
This too is tuned automatically by tf.data.experimental.AUTOTUNE.
Also, if it is still too slow, you can call `.batch()` first and pass whole batches to the map function; see the sketch after the snippet below.
https://www.tensorflow.org/guide/data_performance
Just add an argument to the `.map()` method as shown below.
dataset = dataset.map(map_func, num_parallel_calls=tf.data.experimental.AUTOTUNE)
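The "batch first" idea mentioned above is known as vectorized mapping. A rough sketch, assuming the map function can operate on whole batches of tensors at once:

```python
# Sketch of vectorized mapping (assumption: the map function works on
# whole batches). Batching first reduces the per-element call overhead.
dataset = tf.data.Dataset.range(1000)
dataset = dataset.batch(256)
dataset = dataset.map(lambda x: x * 2,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
```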
read file

Even when reading multiple files, the reads can be parallelized so that several files are read at the same time. I/O is especially likely to be the bottleneck when loading data from remote storage. (In this article the data is read from the local disk, so it may not help much here.)
https://www.tensorflow.org/guide/data_performance
You need to add arguments to the `.interleave()` method as shown below.
dataset = files.interleave(
    tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
cache

Changing topic a little, `.cache()` is also effective for improving performance.
Written as follows, the data is cached in memory.
dataset = dataset.cache()
If you pass a string as an argument as shown below, it will be saved in a file instead of in memory.
dataset = dataset.cache('tfdata')
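Putting the performance pieces together, one common ordering (an assumption on my part, not something prescribed by the article) is to cache after the expensive map, shuffle and batch afterwards, and prefetch last:

```python
# Sketch: map (done earlier) -> cache -> shuffle -> padded_batch -> prefetch.
# BUFFER_SIZE, BATCH_SIZE and max_len are the values used earlier.
dataset = (
    datasets
    .cache()
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, padded_shapes=([max_len], []))
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```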
This has been a long article, but I have shown how to preprocess text using the tf.data.Dataset API. The complete code is available [here](https://github.com/tokusumi/nlp-dnn-baselines/blob/master/nlp-dnn-baselines/preprocess/tfdata.py). In particular, we covered an introduction to the tf.data.Dataset API, a procedure for text preprocessing, and tips for improving performance. The explanation became long, but thank you for reading to the end! I hope it is helpful.