This is day 11 of the TensorFlow 2.0 Advent Calendar 2019.
In this article I would like to summarize how to preprocess text with the tf.data.Dataset API.
The explanation proceeds in the following order.
Since the explanation (and the code) is fairly long, if you just want a bird's-eye view of the code, you can refer to it [here](https://github.com/tokusumi/nlp-dnn-baselines/blob/master/nlp-dnn-baselines/preprocess/tfdata.py).
(Note that the content of this article has not been fully verified. The code runs, but I am not sure that every part of it actually contributes to better performance. I will update it from time to time, so please treat it as a reference only.)
The following Advent Calendar articles are related; I hope they are helpful as well.

- Day 3: A basic introduction to the tf.data.Dataset API (The story of the strong dataset function that can be used with TensorFlow)
- Day 7: The procedure for splitting the livedoor news corpus with MeCab and handling it with the tf.data.Dataset API (Separate livedoor news corpus using Mecab and tf.data: /masahikoofjoyto/items/b444262405ad7371c78a)
- Day 10: Speeding up map by parallelizing with joblib. This article introduces the parallelization that tf.data's .map() itself provides, so I would like to verify which is faster (or rather, it seems the two can be combined). ([[TF2.0 application] A case where general-purpose Data Augmentation was parallelized and realized at high speed with the strong data set function of the TF example](https://qiita.com/Suguru_Toyohara/items/528447a73fc6dd20ea57))
I think a typical training workflow looks like this.
If you run steps 1 through 4 on the whole dataset at once, you will run out of resources as the dataset grows. (Images in particular are often several GB, so even step 1, just reading the data, may not fit in memory at once.) Therefore it is better to split the data into batches (for example, a few images at a time) and run steps 1 through 4 on each batch, repeating until the data is exhausted. This is called pipeline processing.
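As a rough, hypothetical sketch (the data and operations here are illustrative, not taken from the article), a batched tf.data pipeline looks like this:

```python
import tensorflow as tf

# A minimal sketch of pipeline processing (illustrative data, not from
# the article): read, preprocess, and batch, then consume batch by batch.
texts = ["First line of text", "Second line", "Third line", "Fourth line"]
dataset = (
    tf.data.Dataset.from_tensor_slices(texts)   # read the data
    .map(lambda line: tf.strings.lower(line))   # preprocess each element
    .batch(2)                                   # group into batches of 2
)
for batch in dataset:                           # consume one batch at a time
    print(batch.numpy())
```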
With a naive pipeline, this sequence of steps leaves wasted idle time in the overhead portions, as illustrated in https://www.tensorflow.org/guide/data_performance.
The tf.data.Dataset API provides the following features to overlap that overhead and reduce the unnecessary waiting time.

- prefetch: run CPU work and GPU/TPU work in parallel
- map: run preprocessing in parallel
- read_file: read files in parallel
These features are described later. First, let's look at text preprocessing itself to get used to the tf.data.Dataset API.
Now let's preprocess text using the tf.data.Dataset API. The order may vary, but I think the standard text preprocessing flow is as follows.
2.1. load

First, create a dataset loader. The processing flow is as follows.
Datasets keep getting larger these days, so it is rare for the data to already be on the local disk from the start. Instead, one of the following is typical.

- Download from external storage
- Download from cloud storage
- Fetch from a database

Here is an example of simply retrieving data from external storage (no authentication required). The code below downloads the text files cowper.txt, derby.txt, and butler.txt to the local disk. (We use this English text data because it is easy to download, but in practice the preprocessing below is intended for Japanese.) The function returns the list of local file paths it downloaded. If you swap in an appropriate download method and produce the same kind of output, you can follow the same procedure as below.
from typing import List

import tensorflow as tf

def download_file(directory_url: str, file_names: List[str]) -> List[str]:
    """Download each file and return the list of local file paths."""
    file_paths = [
        tf.keras.utils.get_file(file_name, directory_url + file_name)
        for file_name in file_names
    ]
    return file_paths
# download dataset in local disk
directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = download_file(directory_url, file_names)
The rest of the loading process is summarized below. After this you have a Dataset that iterates over (text, label) pairs.
def load_dataset(file_paths: List[str], file_names: List[str], BUFFER_SIZE=1000):
    # Create a Dataset of the files to load
    files = tf.data.Dataset.list_files(file_paths)
    # Apply a map function to each file
    # (labeling_map_fn is described later under "Read data & labeling")
    datasets = files.interleave(
        labeling_map_fn(file_names),
    )
    # Shuffle the data
    all_labeled_data = datasets.shuffle(
        BUFFER_SIZE, reshuffle_each_iteration=False
    )
    return all_labeled_data
datasets = load_dataset(file_paths, file_names)
text, label = next(iter(datasets))
print(text)
# <tf.Tensor: id=99928, shape=(), dtype=string, numpy=b'Comes furious on, but speeds not, kept aloof'>
print(label)
# <tf.Tensor: id=99929, shape=(), dtype=int64, numpy=0>
We will look at the processing in detail.
`files`, created by tf.data.Dataset.list_files, is a Dataset instance whose values are local disk paths, as shown below. It is a bit of a hassle, but to check the contents of a Dataset instance you have to iterate over it. Even more annoyingly, you then get the actual value with the `.numpy()` method.
print(files)
# <DatasetV1Adapter shapes: (), types: tf.string>
next(iter(files))
# <tf.Tensor: id=99804, shape=(), dtype=string, numpy=b'/Users/username/.keras/datasets/cowper.txt'>
next(iter(files)).numpy()
# b'/Users/username/.keras/datasets/cowper.txt'
.interleave() applies a map function to each element of the dataset, then flattens and combines the results. In this usage we first define a map function that reads a text file and returns a Dataset iterating over it line by line. Passing that to `.interleave()` then gives us, instead of a separate Dataset per file, a single flat Dataset that iterates line by line over all of the files.
Reference: Official documentation
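As a toy illustration (not from the article), interleave flattens the per-element Datasets produced by the map function into a single stream:

```python
import tensorflow as tf

# Each element is mapped to a small Dataset; interleave flattens them.
base = tf.data.Dataset.from_tensor_slices([10, 20, 30])
flat = base.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(2),
    cycle_length=1,  # consume the inner Datasets one after another
)
print([v.numpy() for v in flat])  # [10, 10, 20, 20, 30, 30]
```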
As the name suggests, .shuffle() shuffles the Dataset. During iteration it draws elements at random from a buffer of buffer_size elements; once more than buffer_size elements have been consumed, it draws from the next buffer_size elements. A larger buffer_size therefore gives a more thorough shuffle, but it also consumes correspondingly more resources, so there is a trade-off.
Also, if you set `reshuffle_each_iteration=False`, the data is shuffled into the same order no matter how many times you restart iteration. Since the default is True, after simply calling `.shuffle()`, every `next(iter(dataset))` or `for data in dataset:` iterates in a different order. For better or worse, be aware of this.
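A small sketch of this behaviour (the printed orders are only examples):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# Default (reshuffle_each_iteration=True): each pass is shuffled again,
# so two iterations usually give different orders.
reshuffled = dataset.shuffle(5)
print([v.numpy() for v in reshuffled])  # e.g. [3, 0, 4, 1, 2]
print([v.numpy() for v in reshuffled])  # likely a different order

# reshuffle_each_iteration=False: the shuffled order is fixed,
# so every iteration yields the same order.
fixed = dataset.shuffle(5, reshuffle_each_iteration=False)
print([v.numpy() for v in fixed])  # e.g. [2, 4, 0, 3, 1]
print([v.numpy() for v in fixed])  # same order as above
```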
Next I will show how to read .txt files where the file name is the label and each line is one text example. I think this is standard processing, but please adapt it to your data format as needed.
Here we obtain a flat Dataset of texts and labels by passing the following map function to `.interleave()`.
def labeling_map_fn(file_names):
    def _get_label(datasets):
        """
        Parse the file name from the dataset value (a file path)
        and use its index in file_names as the label ID.
        """
        filename = datasets.numpy().decode().rsplit('/', 1)[-1]
        label = file_names.index(filename)
        return label

    def _labeler(example, label):
        """Attach the label to a dataset element."""
        return tf.cast(example, tf.string), tf.cast(label, tf.int64)

    def _labeling_map_fn(file_path: str):
        """Main map function."""
        # Read the text file line by line
        datasets = tf.data.TextLineDataset(file_path)
        # Convert the file path into a label ID
        label = tf.py_function(_get_label, inp=[file_path], Tout=tf.int64)
        # Attach the label ID to every element of the Dataset
        labeled_dataset = datasets.map(lambda ex: _labeler(ex, label))
        return labeled_dataset

    return _labeling_map_fn
Along the way we use a function called tf.py_function (doc). This is needed because the map functions of the Dataset API receive Tensor objects as arguments. A graph-mode Tensor object does not let you read its value directly from Python, but if you wrap the function with tf.py_function, the argument is passed as the same kind of value you get from `next(iter(dataset))`, so you can access it with `.numpy()` and write familiar Python code.
However, tf.py_function seems to have some performance drawbacks, so I would like to avoid it as much as possible.
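As a minimal illustration (not the article's code), tf.py_function lets an ordinary Python function receive eager tensors, so `.numpy()` works inside it:

```python
import tensorflow as tf

def to_upper(text):
    # `text` arrives here as an EagerTensor, so .numpy() is available.
    return text.numpy().decode().upper()

dataset = tf.data.Dataset.from_tensor_slices(["hello", "world"])
dataset = dataset.map(
    lambda t: tf.py_function(to_upper, inp=[t], Tout=tf.string)
)
print([v.numpy() for v in dataset])  # [b'HELLO', b'WORLD']
```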
2.2. standardize & 2.3. tokenize

Here we perform several processing steps at once, and I assume you will do them with an existing Python library. TensorFlow does provide quite a few text operations, but they are fairly hard to use here, so I assume that functions already written in Python are used as-is. At the very least, TensorFlow alone cannot do Japanese word segmentation, so for Japanese this step is essential.
janome is convenient: it is a morphological analyzer implemented in pure Python that you can install with nothing but pip install. It lets you flexibly build a standardization pipeline called an Analyzer, as shown below.
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import (
    RegexReplaceCharFilter  # string replacement
)
from janome.tokenfilter import (
    CompoundNounFilter,  # merge compound nouns
    POSStopFilter,       # remove specific parts of speech
    LowerCaseFilter      # convert to lowercase
)

def janome_tokenizer():
    # standardize texts
    char_filters = [RegexReplaceCharFilter(u'Janome', u'janome')]
    tokenizer = Tokenizer()
    # POSStopFilter takes Japanese POS names: '記号' (symbols), '助詞' (particles)
    token_filters = [CompoundNounFilter(), POSStopFilter(['記号', '助詞']), LowerCaseFilter()]
    analyze = Analyzer(char_filters, tokenizer, token_filters).analyze

    def _tokenizer(text, label):
        tokenized_text = " ".join([wakati.surface for wakati in analyze(text.numpy().decode())])
        return tokenized_text, label

    return _tokenizer
This alone standardizes and tokenizes the text, as follows.
text, _ = janome_tokenizer()(tf.constant('Janome is a morphological analyzer. Easy to use.'), 0)
print(text)
# 'janome morphological analyzer easy to use.'
Call the function above from the Dataset API. To do so, again convert it with tf.py_function; you need to specify the output types. Then you can apply it to the dataset with `.map()`.
def tokenize_map_fn(tokenizer):
    """
    convert python function for tf.data map
    """
    def _tokenize_map_fn(text: str, label: int):
        return tf.py_function(tokenizer, inp=[text, label], Tout=(tf.string, tf.int64))
    return _tokenize_map_fn

datasets = datasets.map(tokenize_map_fn(janome_tokenizer()))
2.4. encode

Use the tensorflow_datasets.text API to encode (convert strings to IDs).
In particular, `tfds.features.text.Tokenizer()` and `tfds.features.text.TokenTextEncoder` are useful for encoding.
First you need to build a vocabulary. If you already have one, you can skip this step.
Here we build the vocabulary from the training data: use `tfds.features.text.Tokenizer()` to extract tokens and a set() to remove duplicates.
from typing import Set

import tensorflow_datasets as tfds

def get_vocabulary(datasets) -> Set[str]:
    tokenizer = tfds.features.text.Tokenizer().tokenize

    def _tokenize_map_fn(text, label):
        def _tokenize(text, label):
            return tokenizer(text.numpy()), label
        return tf.py_function(_tokenize, inp=[text, label], Tout=(tf.string, tf.int64))

    dataset = datasets.map(_tokenize_map_fn)
    vocab = {g.decode() for f, _ in dataset for g in f.numpy()}
    return vocab
vocab_set = get_vocabulary(datasets)
print(vocab_set)
# {'indomitable', 'suspicion', 'wer', ... }
encode

Here we use `tfds.features.text.TokenTextEncoder()` to convert each token in the vocabulary into an ID. Use the following encode_map_fn() with `datasets.map()`.
def encoder(vocabulary_set: Set[str]):
    """
    encode text to numbers. must set vocabulary_set
    """
    encoder = tfds.features.text.TokenTextEncoder(vocabulary_set).encode

    def _encode(text: str, label: int):
        encoded_text = encoder(text.numpy())
        return encoded_text, label

    return _encode

def encode_map_fn(encoder):
    """
    convert python function for tf.data map
    """
    def _encode_map_fn(text: str, label: int):
        return tf.py_function(encoder, inp=[text, label], Tout=(tf.int64, tf.int64))
    return _encode_map_fn
datasets = datasets.map(encode_map_fn(encoder(vocab_set)))
print(next(iter(datasets))[0].numpy())
# [111, 1211, 4, 10101]
2.5. split

Split the dataset into train and test. You can skip this if the data is already split. With the Dataset API, splitting a dataset is very easy to implement, as shown below.
def split_train_test(data, TEST_SIZE: int, BUFFER_SIZE: int, SEED=123):
    """
    TEST_SIZE = number of test examples
    note: because of reshuffle_each_iteration=True (the default),
    train_data is reshuffled every time you reuse it.
    """
    train_data = data.skip(TEST_SIZE).shuffle(BUFFER_SIZE, seed=SEED)
    test_data = data.take(TEST_SIZE)
    return train_data, test_data
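For example, a usage sketch (the sizes here are illustrative values, not ones fixed by the article):

```python
# Hold out TEST_SIZE examples for test and shuffle the remaining train data.
TEST_SIZE = 1000
BUFFER_SIZE = 10000
train_data, test_data = split_train_test(datasets, TEST_SIZE, BUFFER_SIZE)
```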
2.6. padding & 2.7. batch

With the tf.data.Dataset API, padding and batching can be done at the same time. Here BATCH_SIZE is the batch size (and epochs, used later in training, is the number of epochs). A few things to keep in mind:

- If you set `drop_remainder=True`, the last incomplete batch of an iteration (the one that does not reach BATCH_SIZE) is dropped.
- You can specify the padding size (= maximum length) with padded_shapes. If you omit this argument, each batch is padded to the longest element in that batch.
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=True)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=False)
Here, max_len can either be obtained from the dataset, as shown below, or simply hard-coded.
Most models need a maximum token length. Here we compute it from the dataset; if you decide to hard-code it, you can skip the following step.
def get_max_len(datasets) -> int:
    tokenizer = tfds.features.text.Tokenizer().tokenize

    def _get_len_map_fn(text: str, label: int):
        def _get_len(text: str):
            return len(tokenizer(text.numpy()))
        return tf.py_function(_get_len, inp=[text, ], Tout=tf.int32)

    dataset = datasets.map(_get_len_map_fn)
    max_len = max({f.numpy() for f in dataset})
    return max_len
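A short usage sketch (my own illustration, computing the length over the whole dataset before batching):

```python
# Compute the maximum token length once and reuse it as `max_len`
# in the padded_batch() calls above.
max_len = get_max_len(datasets)
```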
So far we have walked through the implementation using the tf.data.Dataset API in the flow above.
At training time, simply pass the data to the `.fit()` method as shown below.
model.fit(train_data,
          epochs=epochs,
          validation_data=test_data)
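For reference, here is a minimal sketch of a model that accepts these padded batches; the article itself does not define a model, so everything here (layer choices and sizes) is an assumption:

```python
import tensorflow as tf

# Hypothetical minimal model (not from the article). vocab_set comes from
# get_vocabulary() above; +2 accounts for the padding ID (0) and the OOV ID
# used by tfds.features.text.TokenTextEncoder.
vocab_size = len(vocab_set) + 2
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(3, activation='softmax'),  # 3 files -> 3 label IDs
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```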
As explained at the beginning, a naive sequence of preprocessing steps leaves unnecessary waiting time in the overhead portions, as illustrated in https://www.tensorflow.org/guide/data_performance.
The tf.data.Dataset API provides the following features to overlap that overhead and reduce the unnecessary waiting time.

- prefetch: run CPU work and GPU/TPU work in parallel
- map: run preprocessing in parallel
- read_file: read files in parallel
Reference: Optimizing input pipelines with tf.data

prefetch

prefetch runs the producer work on the CPU in parallel with the consumer work on the GPU/TPU. The buffer size is tuned automatically by tf.data.experimental.AUTOTUNE. https://www.tensorflow.org/guide/data_performance
There is no hassle: just add the following call at the end of the pipeline. (In this article we do this for both train_data and test_data.)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
map

The map function can also be parallelized.
This too is tuned automatically by tf.data.experimental.AUTOTUNE.
Also, if it is still too slow, you can call `.batch()` first and pass whole batches to the map function; see the sketch after the snippet below.
https://www.tensorflow.org/guide/data_performance
Just add an argument to the `.map()` method as shown below.
dataset = dataset.map(map_func, num_parallel_calls=tf.data.experimental.AUTOTUNE)
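The "batch first" idea mentioned above is known as vectorized mapping. A rough sketch, assuming the map function can operate on whole batches of tensors at once:

```python
# Sketch of vectorized mapping (assumption: the map function works on
# whole batches). Batching first reduces the per-element call overhead.
dataset = tf.data.Dataset.range(1000)
dataset = dataset.batch(256)
dataset = dataset.map(lambda x: x * 2,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
```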
read file

Even when reading multiple files, the reads can be parallelized so that several files are read at the same time. I/O is especially likely to be the bottleneck when loading data from remote storage. (In this article the data is read from the local disk, so it may not help much here.)
https://www.tensorflow.org/guide/data_performance
You need to add arguments to the `.interleave()` method as shown below.
dataset = files.interleave(
    tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
cache

Changing topic a little, `.cache()` is also effective for improving performance.
Written as follows, the data is cached in memory.
dataset = dataset.cache()
If you pass a string as an argument as shown below, it will be saved in a file instead of in memory.
dataset = dataset.cache('tfdata')
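Putting the performance pieces together, one common ordering (an assumption on my part, not something prescribed by the article) is to cache after the expensive map, shuffle and batch afterwards, and prefetch last:

```python
# Sketch: map (done earlier) -> cache -> shuffle -> padded_batch -> prefetch.
# BUFFER_SIZE, BATCH_SIZE and max_len are the values used earlier.
dataset = (
    datasets
    .cache()
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, padded_shapes=([max_len], []))
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```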
This has been a long article, but I have shown how to preprocess text using the tf.data.Dataset API. The complete code is available [here](https://github.com/tokusumi/nlp-dnn-baselines/blob/master/nlp-dnn-baselines/preprocess/tfdata.py). In particular, we covered an introduction to the tf.data.Dataset API, a procedure for text preprocessing, and tips for improving performance. The explanation became long, but thank you for reading to the end! I hope it is helpful.