[PYTHON] Quickly create data for sequence labeling (part-of-speech tagging)

This is easy with the Brown Corpus, which is distributed as part of NLTK's nltk_data (if it is not installed yet, nltk.download('brown') fetches it). To create data for part-of-speech tagging, just call tagged_sents(). If you pass categories, you get only the sentences from that domain (besides news, the categories include reviews, fiction, romance, mystery, and others).

import nltk
from nltk.corpus import brown

corpus = brown.tagged_sents(categories='news')

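def dataset(N=100):
    """Return (words, tags) pairs for the first N tagged sentences."""
    d = []
    for tagged_sent in corpus[:N]:
        # Strip the tags to get the plain word sequence (the model input)
        untagged_sent = nltk.tag.untag(tagged_sent)
        # Keep only the tags, in order (the label sequence)
        pos_sequence = [pos for (word, pos) in tagged_sent]
        d.append((untagged_sent, pos_sequence))
    return d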

if __name__ == "__main__":
    # Bind the result to a different name so the dataset() function is not shadowed
    data = dataset()
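
To make the shape of each returned pair concrete, here is a minimal sketch that runs without NLTK. The sentence is a hand-written stand-in for one element of brown.tagged_sents() (its tags follow the Brown tagset, but it is not copied from the corpus), and split_tagged_sentence is a hypothetical helper that does what the loop body above does for a single sentence.

```python
from typing import List, Tuple

# Hand-written stand-in for one tagged sentence (assumption: illustrative
# only, not actual corpus output). Each item is a (word, tag) pair.
tagged_sent: List[Tuple[str, str]] = [
    ("The", "AT"),
    ("jury", "NN"),
    ("said", "VBD"),
    (".", "."),
]


def split_tagged_sentence(sent: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split (word, tag) pairs into two parallel lists."""
    words = [word for (word, _tag) in sent]   # input sequence
    tags = [tag for (_word, tag) in sent]     # label sequence
    return words, tags


words, tags = split_tagged_sentence(tagged_sent)
print(words)  # ['The', 'jury', 'said', '.']
print(tags)   # ['AT', 'NN', 'VBD', '.']
```

The two lists always have the same length, so position i in the word list lines up with position i in the tag list, which is the form most sequence labeling code expects.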