[PYTHON] Quickly create data for sequence labeling (part-of-speech tagging)

This is easy with the Brown Corpus, which is distributed as part of NLTK's nltk_data. To get data for part-of-speech tagging, just call tagged_sents(), which returns each sentence as a list of (word, tag) pairs. If you pass categories, you can restrict the data to a single domain (besides news, there are reviews, fiction, romance, mystery, and others).

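import nltk
from nltk.corpus import brown

# Requires the Brown Corpus; if it is missing, run nltk.download('brown') once.
corpus = brown.tagged_sents(categories='news')

def dataset(N=100):
    """Return (words, POS tags) pairs for the first N tagged sentences."""
    d = []
    for tagged_sent in corpus[:N]:
        untagged_sent = nltk.tag.untag(tagged_sent)  # words only
        pos_sequence = [pos for (word, pos) in tagged_sent]  # tags only
        d.append((untagged_sent, pos_sequence))
    return d

if __name__ == "__main__":
    data = dataset()  # a separate name, so the function is not overwritten
    print(data[0])    # first (words, tags) pair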
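
To check the shape of the resulting data without downloading the corpus, the same word/tag split can be tried on a small hand-written sample. This is only a sketch: `sample_tagged_sents` and `split_tagged_sent` are hypothetical names introduced here, and the sample sentences are made up to imitate the (word, tag) format, not taken from the Brown Corpus.

```python
# Hand-written sample in the same shape as tagged_sents() output:
# a list of sentences, each a list of (word, tag) pairs.
# These sentences are illustrative only, not real corpus data.
sample_tagged_sents = [
    [("The", "AT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
    [("She", "PPS"), ("ran", "VBD"), (".", ".")],
]

def split_tagged_sent(tagged_sent):
    """Split one tagged sentence into parallel word and tag lists."""
    if not tagged_sent:
        return [], []
    words, tags = zip(*tagged_sent)  # transpose the pairs into two tuples
    return list(words), list(tags)

pairs = [split_tagged_sent(s) for s in sample_tagged_sents]
for words, tags in pairs:
    # Each word must line up with exactly one tag.
    assert len(words) == len(tags)
    print(words, tags)
```

Each element of `pairs` has the same (words, tags) layout that dataset() builds, so it can stand in for real data when testing downstream code.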
