[PYTHON] Summary of how to preprocess text (natural language processing) with the tf.data.Dataset API

This is the article for Day 11 of the TensorFlow 2.0 Advent Calendar 2019.

I would like to summarize how to preprocess text using the tf.data.Dataset API.

In this article, we will explain the following, in order:

  1. What the tf.data.Dataset API is and why it is useful
  2. The actual procedure for text preprocessing
  3. Tips for improving performance

The explanation is long (so is the code...), so if you just want a bird's-eye view of the code, you can refer to it [here](https://github.com/tokusumi/nlp-dnn-baselines/blob/master/nlp-dnn-baselines/preprocess/tfdata.py).

(Note that the content of this article has not been fully verified. The code works, but I am not sure whether some parts actually contribute to better performance. I will update it from time to time, but please treat it as a reference only.)

The following related articles from the Advent Calendar may also be helpful.

--Day 3: A basic introduction to the tf.data.Dataset API (The story of the powerful Dataset feature that can be used with TensorFlow)
--Day 7: A procedure for segmenting the livedoor news corpus with MeCab and feeding it to the tf.data.Dataset API ([Separate the livedoor news corpus using MeCab and tf.data](https://qiita.com/masahikoofjoyto/items/b444262405ad7371c78a))

--Day 10: Speeds up map by parallelizing it with joblib. In this article I introduce the parallelization that tf.data's .map provides on its own, so I would like to verify which is faster (or rather, it seems the two can be combined). ([[TF2.0 application] A case where general-purpose data augmentation was parallelized and sped up with TF's powerful Dataset feature](https://qiita.com/Suguru_Toyohara/items/528447a73fc6dd20ea57))

1. tf.data.Dataset API

A typical training flow looks like this:

  1. Load data: read from local storage, memory, or cloud storage
  2. Preprocess: on the CPU
  3. Transfer the data to the training device: pass it to the GPU or TPU
  4. Train: on the GPU or TPU

As the dataset grows, you will run out of resources if you run steps 1 through 4 on the whole dataset at once (image datasets in particular are often several gigabytes, so even step 1, just reading the data, cannot be done in one pass). Instead, the data is split into batches (for example, a few images at a time) and steps 1 through 4 are run repeatedly, one batch at a time. This is called pipeline processing.
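For example, here is a minimal sketch (toy in-memory data, not from the article) of such a batched pipeline with tf.data:

import tensorflow as tf

# Toy pipeline: read -> preprocess -> batch -> prefetch, consumed batch by batch
dataset = (
    tf.data.Dataset.range(1000)                # 1. read data (toy in-memory source)
    .map(lambda x: x * 2,                      # 2. preprocessing on the CPU
         num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(32)                                 # split into batches
    .prefetch(tf.data.experimental.AUTOTUNE)   # 3. overlap with the consumer (GPU/TPU)
)
for batch in dataset:                          # 4. a training loop would consume batches here
    pass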

With a naive pipeline, this sequence of steps causes wasted idle time in the overhead portions, as illustrated in the figure below.
(Figure: idle time in a naive pipeline, from https://www.tensorflow.org/guide/data_performance)

The tf.data.Dataset API provides the following features to parallelize this overhead and reduce unnecessary waiting time:

--prefetch: runs CPU work and GPU/TPU work in parallel
--map: parallelizes preprocessing
--read_file: parallelizes file reads

These are described later. First, let's walk through text preprocessing to see how the tf.data.Dataset API is used.

2. Text preprocessing flow

Now let's preprocess text with the tf.data.Dataset API. The order may vary, but I think a standard text preprocessing flow looks like this:

  1. load: load and shuffle the text
  2. standardize: remove stop words, replace strings, convert to lowercase, etc.
  3. tokenize: split into words (for Japanese)
  4. encode: replace tokens with IDs
  5. split: split the data into train and test sets
  6. padding: zero-pad
  7. batch: fetch the data in batches

2.1. load

First of all, we create a dataset loader. The processing flow is as follows.

  1. Download the data to local disk
  2. Point to the data on local disk
  3. Label the data
  4. Shuffle the data

Download data to local disk

Datasets keep getting larger these days, so the data is rarely on the local disk from the start. The following cases are typical.

--Download from external storage
--Download from cloud storage
--Fetch from a database

Here is an example of simply retrieving data from external storage (no authentication required). The code below downloads the text files cowper.txt, derby.txt, and butler.txt to the local disk. (We use this English text data because it is easy to download; in practice the pipeline is intended for Japanese preprocessing.) The function returns a list of the downloaded local disk paths. If you swap in an appropriate download method and produce output of the same shape, the rest of the procedure below works unchanged.

from typing import List, Set  # Set is used in later snippets

import tensorflow as tf


def download_file(directory_url: str, file_names: List[str]) -> List[str]:
    file_paths = [
        tf.keras.utils.get_file(file_name, directory_url + file_name)
        for file_name in file_names
    ]
    return file_paths

# download dataset in local disk
directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = download_file(directory_url, file_names)

Point to the local disk data, label it, and shuffle it

The rest of the process is summarized below. The result is a Dataset that iterates over (text, label) pairs.

def load_dataset(file_paths: List[str], file_names: List[str], BUFFER_SIZE=1000):
    # Specify multiple files to load
    files = tf.data.Dataset.list_files(file_paths)
    # Apply a map function to each file (labeling_map_fn is described later, under "reading & labeling data")
    datasets = files.interleave(
        labeling_map_fn(file_names),
    )
    # Shuffle the data
    all_labeled_data = datasets.shuffle(
        BUFFER_SIZE, reshuffle_each_iteration=False
    )
    return all_labeled_data

datasets = load_dataset(file_paths, file_names)
text, label = next(iter(datasets))
print(text)
# <tf.Tensor: id=99928, shape=(), dtype=string, numpy=b'Comes furious on, but speeds not, kept aloof'>
print(label)
# <tf.Tensor: id=99929, shape=(), dtype=int64, numpy=0>

We will look at the processing in detail.

tf.data.Dataset.list_files(): specify multiple files to load

The `files` object created by tf.data.Dataset.list_files is a Dataset instance whose values are local disk paths, as shown below. It is a bit of a hassle, but you have to iterate over the Dataset instance to check its contents. Even more annoyingly, to get the raw value you need the `.numpy()` method.

print(files)
# <DatasetV1Adapter shapes: (), types: tf.string>

next(iter(files))
# <tf.Tensor: id=99804, shape=(), dtype=string, numpy=b'/Users/username/.keras/datasets/cowper.txt'>

next(iter(files)).numpy()
# b'/Users/username/.keras/datasets/cowper.txt'

.interleave(): apply a map function to each file and return a flat Dataset

This applies the map function to the dataset, then flattens and combines the results. In this usage, we first define a map function that reads a text file and returns a Dataset iterating over it line by line. Passing that to `.interleave()` does not create a separate Dataset per file; instead it creates one flat Dataset that iterates line by line over all the files.

Reference: Official documentation
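As a toy illustration (my own example, not from the article), the flattening behavior of `.interleave()` can be seen with a tiny in-memory dataset:

import tensorflow as tf

# Two "files", each represented here as a list of lines
files = tf.data.Dataset.from_tensor_slices([["a1", "a2"], ["b1", "b2"]])

# The map function turns one "file" into a Dataset of its lines;
# interleave then flattens the per-file Datasets into a single Dataset
flat = files.interleave(
    lambda lines: tf.data.Dataset.from_tensor_slices(lines),
    cycle_length=2,
)
print([line.numpy() for line in flat])
# [b'a1', b'b1', b'a2', b'b2']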

.shuffle(): shuffle the data

As the name suggests, this shuffles the Dataset. During iteration, elements are drawn at random from a buffer of buffer_size elements. Once iteration has consumed more than buffer_size elements, drawing continues from the next buffer_size elements, so a large buffer_size gives better randomness. However, a large buffer_size also consumes correspondingly more resources, so it is a trade-off.

Also, if you set `reshuffle_each_iteration=False`, the data is shuffled into the same order no matter how many times you restart iteration. Since the default is True, after simply calling `.shuffle()`, every `next(iter(dataset))` or `for data in dataset:` iterates in a different order. This is neither good nor bad in itself, but be aware of it.
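As a small illustration (toy data, my own example) of the difference:

import tensorflow as tf

ds = tf.data.Dataset.range(5)

# Default (reshuffle_each_iteration=True): each pass over the data uses a new order
reshuffled = ds.shuffle(5)
print([int(x) for x in reshuffled], [int(x) for x in reshuffled])  # two (likely different) orders

# reshuffle_each_iteration=False: every pass uses the same shuffled order
fixed = ds.shuffle(5, reshuffle_each_iteration=False)
print([int(x) for x in fixed], [int(x) for x in fixed])  # the same order twice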

labeling_map_fn: reading & labeling data

Here I show how to read .txt files where the file name serves as the label and each line is one text example. I think this is fairly standard, but please adapt it to your data format as needed.

Here, we obtain a flat Dataset of texts and labels by passing the following map function to `.interleave()`:

  1. For each file, read it with `tf.data.TextLineDataset()` to create a Dataset instance.
  2. Use `.map(_labeler)` to attach the label ID derived from the file name.

def labeling_map_fn(file_names):
    def _get_label(datasets):
        """
dataset value(file path)Parse the filename from
        file_Let the index number of names be label ID
        """
        filename = datasets.numpy().decode().rsplit('/', 1)[-1]
        label = file_names.index(filename)
        return label

    def _labeler(example, label):
        """Add label to dataset"""
        return tf.cast(example, tf.string), tf.cast(label, tf.int64)

    def _labeling_map_fn(file_path: str):
        """main map function"""
        #Read line by line from a text file
        datasets = tf.data.TextLineDataset(file_path)
        #Convert file path to label ID
        label = tf.py_function(_get_label, inp=[file_path], Tout=tf.int64)
        #Add label ID to Dataset
        labeled_dataset = datasets.map(lambda ex: _labeler(ex, label))
        return labeled_dataset
    return _labeling_map_fn

Along the way, I use a function called tf.py_function (doc). This is because the map functions of the Dataset API receive Tensor objects as arguments. You cannot look at a Tensor object's value directly from Python, but if you wrap the function with tf.py_function, the argument is passed as the same kind of value you get from `next(iter(dataset))`, so you can read it with `.numpy()` and write familiar Python code. However, it seems to have some performance drawbacks, so I would like to avoid it as much as possible.
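For reference, here is a minimal standalone sketch (my own toy example) of wrapping a plain Python function with tf.py_function so it can run inside `.map()`:

def _to_upper(text):
    # Inside tf.py_function we receive an eager tensor, so .numpy() is available
    return text.numpy().decode().upper()

ds = tf.data.Dataset.from_tensor_slices([b"hello", b"world"])
ds = ds.map(lambda t: tf.py_function(_to_upper, inp=[t], Tout=tf.string))
print(next(iter(ds)).numpy())
# b'HELLO'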

2.2. standardize & 2.3. tokenize

Here, several processes are performed at once. The assumption is that you use a Python library or code you have written yourself. TensorFlow does provide various text-processing operations, but they are fairly hard to work with, so here we assume you reuse processing written in plain Python as-is. At the very least, TensorFlow alone cannot do Japanese word segmentation, so for Japanese this step is essential.

Example (using janome)

janome is convenient: it is a morphological analyzer implemented in Python and can be installed with just pip install. As shown below, you can flexibly build a standardization pipeline called an Analyzer.

from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import (
    RegexReplaceCharFilter #String replacement
)
from janome.tokenfilter import (
    CompoundNounFilter, #Compound nounization
    POSStopFilter, #Remove specific part of speech
    LowerCaseFilter #Convert to lowercase
)

def janome_tokenizer():
    # standarize texts
    char_filters = [RegexReplaceCharFilter(u'Janome', u'janome')]
    tokenizer = Tokenizer()
    token_filters = [CompoundNounFilter(), POSStopFilter(['記号', '助詞']), LowerCaseFilter()]  # drop symbols (記号) and particles (助詞)
    analyze = Analyzer(char_filters, tokenizer, token_filters).analyze

    def _tokenizer(text, label):
        tokenized_text = " ".join([wakati.surface for wakati in analyze(text.numpy().decode())])
        return tokenized_text, label
    return _tokenizer

This alone standardizes and tokenizes the text, as follows.

text, _ = janome_tokenizer()(tf.constant('The serpentine is a morphological analyzer. Easy to Use.'), 0)
print(text)
# 'janome morphological analyzer easy to use.'

Wrap with tf.py_function

Now call the above function from the Dataset API. To do so, we again convert it with tf.py_function. You need to specify the output types. Then you can apply the function by passing it to the dataset with `.map()`.

def tokenize_map_fn(tokenizer):
    """
    convert python function for tf.data map
    """
    def _tokenize_map_fn(text: str, label: int):
        return tf.py_function(tokenizer, inp=[text, label], Tout=(tf.string, tf.int64))
    return _tokenize_map_fn

datasets = datasets.map(tokenize_map_fn(janome_tokenizer()))

2.4. encode

Use the tensorflow_datasets text API to encode the text (convert strings to IDs). In particular, `tfds.features.text.Tokenizer()` and `tfds.features.text.TokenTextEncoder` are useful for encoding.

Create vocabulary

First, you need to build a vocabulary. If you have already built one, you can skip this step. Here we build the vocabulary from the training data, using `tfds.features.text.Tokenizer()` to get tokens and a set() to remove duplicates.

import tensorflow_datasets as tfds

def get_vocabulary(datasets) -> Set[str]:
    tokenizer = tfds.features.text.Tokenizer().tokenize

    def _tokenize_map_fn(text, label):
        def _tokenize(text, label):
            return tokenizer(text.numpy()), label
        return tf.py_function(_tokenize, inp=[text, label], Tout=(tf.string, tf.int64))

    dataset = datasets.map(_tokenize_map_fn)
    vocab = {g.decode() for f, _ in dataset for g in f.numpy()}
    return vocab

vocab_set = get_vocabulary(datasets)
print(vocab_set)
# {'indomitable', 'suspicion', 'wer', ... }

encode

Here, we use `tfds.features.text.TokenTextEncoder()` to convert tokens contained in the vocabulary into IDs. Pass the following `encode_map_fn()` to `datasets.map()`.

def encoder(vocabulary_set: Set[str]):
    """
    encode text to numbers. must set vocabulary_set
    """
    encoder = tfds.features.text.TokenTextEncoder(vocabulary_set).encode

    def _encode(text: str, label: int):
        encoded_text = encoder(text.numpy())
        return encoded_text, label
    return _encode

def encode_map_fn(encoder):
    """
    convert python function for tf.data map
    """
    def _encode_map_fn(text: str, label: int):
        return tf.py_function(encoder, inp=[text, label], Tout=(tf.int64, tf.int64))
    return _encode_map_fn

datasets = datasets.map(encode_map_fn(encoder(vocab_set)))
print(next(iter(datasets))[0].numpy())
# [111, 1211, 4, 10101]

2.5. split

Divide the dataset into train and test sets. You can skip this if the data is already split. With the Dataset API, splitting a dataset is very easy to implement, as follows.

def split_train_test(data, TEST_SIZE: int, BUFFER_SIZE: int, SEED=123):
    """
    TEST_SIZE =Number of test data
    note: because of reshuffle_each_iteration = True (default),
    train_data is reshuffled if you reuse train_data.
    """
    train_data = data.skip(TEST_SIZE).shuffle(BUFFER_SIZE, seed=SEED)
    test_data = data.take(TEST_SIZE)
    return train_data, test_data
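For example (the values here are illustrative, not from the article):

# Illustrative: hold out 3,000 examples for test and shuffle the rest with a buffer of 10,000
train_data, test_data = split_train_test(datasets, TEST_SIZE=3000, BUFFER_SIZE=10000)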

2.6. padding & 2.7. batch

With the tf.data.Dataset API, padding and batching can be done at the same time. As before, epochs is the number of epochs and BATCH_SIZE is the batch size. Here are some things to keep in mind:

--If you set `drop_remainder=True`, the final batch of the iteration that does not reach the batch size is dropped.
--You can specify the padding size (= maximum length) with padded_shapes. If you omit this argument, each batch is padded to the maximum length within that batch.

train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=True)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=False)

Here, max_len can be computed from the dataset as shown below, or simply hard-coded.

Get maximum document length

Most models require a maximum token length. Here we compute it from the dataset. If you decide to hard-code it, you can skip the following.

def get_max_len(datasets) -> int:
    tokenizer = tfds.features.text.Tokenizer().tokenize

    def _get_len_map_fn(text: str, label: int):
        def _get_len(text: str):
            return len(tokenizer(text.numpy()))
        return tf.py_function(_get_len, inp=[text, ], Tout=tf.int32)

    dataset = datasets.map(_get_len_map_fn)
    max_len = max({f.numpy() for f in dataset})
    return max_len
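A usage sketch (my addition): compute the maximum length on the dataset while its texts are still strings, i.e. before applying encode.

# Hypothetical usage: run on the tokenized (still string-valued) dataset
max_len = get_max_len(datasets)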

Summary of text preprocessing flow

We walked through an implementation with the tf.data.Dataset API following this flow (an assembled sketch follows the list):

  1. load: load and shuffle the text
  2. standardize: remove stop words, replace strings, convert to lowercase, etc.
  3. tokenize: split into words (for Japanese)
  4. encode: replace tokens with IDs
  5. split: split the data into train and test sets
  6. padding: zero-pad
  7. batch: fetch the data in batches
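Putting the pieces together, here is a sketch of how the functions defined above can be chained (BATCH_SIZE, TEST_SIZE, and BUFFER_SIZE are illustrative values, not from the article):

BATCH_SIZE = 64
TEST_SIZE = 3000
BUFFER_SIZE = 10000

# 1. load
datasets = load_dataset(file_paths, file_names)
# 2./3. standardize & tokenize
datasets = datasets.map(tokenize_map_fn(janome_tokenizer()))
# vocabulary and maximum length are computed while the texts are still strings
vocab_set = get_vocabulary(datasets)
max_len = get_max_len(datasets)
# 4. encode
datasets = datasets.map(encode_map_fn(encoder(vocab_set)))
# 5. split
train_data, test_data = split_train_test(datasets, TEST_SIZE, BUFFER_SIZE)
# 6./7. padding & batch
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=True)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([max_len], []), drop_remainder=False)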

At training time, just pass the result to the `.fit()` method as shown below.

model.fit(train_data,
      epochs=epochs,
      validation_data=test_data
)
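The article does not define the model itself. As a minimal sketch (my assumption, not the author's model), a small embedding-plus-pooling classifier would accept these padded batches:

# Hypothetical minimal model; vocab_set comes from the preprocessing above
vocab_size = len(vocab_set) + 2  # TokenTextEncoder reserves extra IDs for padding and out-of-vocabulary
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(3, activation='softmax'),  # 3 classes, one per source file
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])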

3. Tips for improving performance

As explained at the beginning, a naive sequence of preprocessing steps causes unnecessary waiting time in the overhead portions, as illustrated in the figure below.
(Figure: idle time in a naive pipeline, from https://www.tensorflow.org/guide/data_performance)

The tf.data.Dataset API provides the following features to parallelize this overhead and reduce unnecessary waiting time:

--prefetch: runs CPU work and GPU/TPU work in parallel
--map: parallelizes preprocessing
--read_file: parallelizes file reads

Reference: Optimizing input pipelines with tf.data

prefetch

With prefetch, CPU work and GPU/TPU work are executed in parallel. The buffer size is tuned automatically with tf.data.experimental.AUTOTUNE.
(Figure: pipelining with prefetch, from https://www.tensorflow.org/guide/data_performance)

There is nothing complicated about it: just add the following at the end of the pipeline. (In this article we apply it to both train_data and test_data.)

dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

map

The map function can also be parallelized. This too is tuned automatically with tf.data.experimental.AUTOTUNE. If it is still too slow, you can also call `.batch()` first and apply the map to whole batches.
(Figure: parallel map, from https://www.tensorflow.org/guide/data_performance)

Just add an argument to the `.map()` method as shown below.

dataset = dataset.map(map_func, num_parallel_calls=tf.data.experimental.AUTOTUNE)

read file

When reading multiple files, the reads can also be parallelized and performed concurrently. I/O is likely to be the bottleneck, especially when reading data from remote storage. (In this article the data is read from local disk, so the effect may be small.)

(Figure: parallel file reads, from https://www.tensorflow.org/guide/data_performance)

Add arguments to the `.interleave()` method as shown below.

dataset = files.interleave(
    tf.data.TFRecordDataset,
    # cycle_length is the number of files read concurrently (FLAGS.num_parallel_reads is a placeholder)
    cycle_length=FLAGS.num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

cache

Though a slight change of topic, `.cache()` is also effective for improving performance. Written as follows, the data is cached in memory.

dataset = dataset.cache()

If you pass a string as an argument as shown below, it will be saved in a file instead of in memory.

dataset = dataset.cache('tfdata')
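Where you place `.cache()` matters, since it stores the results of the transformations that come before it. A common pattern (my assumption; the article does not show this) is to cache right after the expensive preprocessing and before shuffle, batch, and prefetch:

dataset = (
    dataset
    .cache()  # cache the already-preprocessed examples
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, padded_shapes=([max_len], []))
    .prefetch(tf.data.experimental.AUTOTUNE)
)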

Summary

It has been a long article, but I have shown how to preprocess text using the tf.data.Dataset API. You can find the complete code [here](https://github.com/tokusumi/nlp-dnn-baselines/blob/master/nlp-dnn-baselines/preprocess/tfdata.py). In particular, we covered an introduction to the tf.data.Dataset API, the text preprocessing procedure, and tips for improving performance. The explanation got long, but thank you for reading to the end! I hope you find it helpful!
