A module for generating word N-grams in Python

Purpose

The program takes an arbitrary text file as input and generates N-grams from it. This time we will generate word N-grams, that is, N-grams whose units are words rather than characters.
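
As a rough illustration of the difference between character N-grams and word N-grams, the sketch below builds both with a sliding window. The function names `char_ngrams` and `word_ngrams` and the sample inputs are assumptions made for this example; they are not part of the program in this article, and the word list is assumed to be tokenized already.

```python
def char_ngrams(text, n):
    """Character N-grams: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(words, n):
    """Word N-grams: every run of n consecutive words, kept as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Illustrative inputs only
print(char_ngrams("rocket", 2))
print(word_ngrams(["the", "rocket", "was", "reused"], 2))
```

Word N-grams need the text split into words first, which is why the program relies on a morphological analyzer for that step.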

Dataset

Example: a news article

We generate N-grams for the following article. The article is assumed to be saved as ./data/news.txt, relative to the directory containing the program.

The result can be said to have overturned the conventional wisdom of space development, and it is attracting attention as an epoch-making technology that reduces launch costs. At a press conference held at the Kennedy Space Center in Florida after the successful launch, SpaceX CEO Elon Musk expressed his joy at the success of the experiment, saying, "I was able to prove that the rocket can be returned." He commented that the company will next run tests on the ground to check that the returned rocket is in normal condition and, if there are no problems, launch the same rocket again next month or the month after. "The rocket could be reused thousands of times in the future, but at present I think it can be reused 10 to 20 times. Reuse of all rockets, including other rockets, will be the norm in the future," he said.

Program

text2bow is a function that converts text into a list of words (morphemes). Specify mod="file" when the input is a file path, and mod="str" when the input is a string. (When the program is used as a module, the string mode is probably the more common one.)

ngram.py


#!/usr/bin/env python3

# Requires Python 3.7+ and the mecab command on PATH

import subprocess
import sys


# text -> list of words (morphemes)
def text2bow(obj, mod):

    # mod="file": obj is a file path, mod="str": obj is a string
    if mod == "file":
        with open(obj, encoding="utf-8") as f:
            text = f.read()
    elif mod == "str":
        text = obj
    else:
        print("error: mod must be 'file' or 'str'", file=sys.stderr)
        sys.exit(1)

    # Pass the text to mecab on stdin; -Owakati prints space-separated words
    morp = subprocess.run(["mecab", "-Owakati"], input=text,
                          capture_output=True, encoding="utf-8", check=True)

    # Splitting on whitespace also drops newlines and empty strings
    bow = morp.stdout.split()

    return bow


# N-gram generation
def gen_Ngram(words, N):

    ngram = []

    # Join each window of N consecutive words into one string
    for i in range(len(words) - N + 1):
        ngram.append("".join(words[i:i + N]))

    return ngram


# output: one N-gram per line
def output_Ngram(ngram):

    for gram in ngram:
        print(gram)


def main():

    argvs = sys.argv

    # input: file
    bow = text2bow(argvs[2], mod="file")

    # input: string
    #bow = text2bow(obj="This is a program that generates N-grams.", mod="str")

    ngram = gen_Ngram(bow, int(argvs[1]))

    output_Ngram(ngram)


if __name__ == "__main__":

    main()
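
The mod switch in text2bow can also be looked at in isolation. The following is a minimal sketch, not the article's code: the helper name `read_input` is hypothetical, tokenization with MeCab is left out, and an unknown mode raises ValueError instead of exiting, on the assumption that a caller importing the module would rather handle the error itself.

```python
import os
import tempfile


def read_input(obj, mod):
    """Return the raw text for obj, read as a file path or taken as a string.

    Hypothetical helper: it mirrors only the mod="file" / mod="str" switch.
    """
    if mod == "file":
        with open(obj, encoding="utf-8") as f:
            return f.read()
    if mod == "str":
        return obj
    raise ValueError("mod must be 'file' or 'str', got %r" % (mod,))


# Usage: the same text arrives through either mode.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("rockets can be reused")
path = tmp.name

from_file = read_input(path, mod="file")
from_str = read_input("rockets can be reused", mod="str")
os.remove(path)
print(from_file == from_str)
```

Raising an exception keeps the decision about what to do with a bad mode in the calling code, which matters more for module use than for the command-line script.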

Execution method

For now, the program assumes that a text file is passed as input. (To process a string from another program, import ngram.py and call its functions; the only thing to watch is the mod value passed to text2bow.) Run it as follows.

ngram.py


$ python ngram.py N textfile

- N: any positive integer (e.g. N = 2 for 2-grams)
- textfile: path of the input text file
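
The two arguments above could also be validated before any processing starts. The sketch below uses the standard argparse module for this; it is an assumption about how one might harden the command line, not something the article's program does, and the names `positive_int` and `parse_args` are made up for the example.

```python
import argparse


def positive_int(value):
    """argparse type: accept only integers that are 1 or greater."""
    n = int(value)  # a ValueError here is reported by argparse as a usage error
    if n < 1:
        raise argparse.ArgumentTypeError("N must be 1 or greater")
    return n


def parse_args(argv):
    parser = argparse.ArgumentParser(
        prog="ngram.py",
        description="Generate word N-grams from a text file.")
    parser.add_argument("N", type=positive_int,
                        help="size of the N-gram, e.g. 2 for 2-grams")
    parser.add_argument("textfile", help="path of the input text file")
    return parser.parse_args(argv)


# Same arguments as the command line: python ngram.py 2 data/news.txt
args = parse_args(["2", "data/news.txt"])
print(args.N, args.textfile)
```

With this in place, a missing or non-numeric N produces a usage message instead of an IndexError or ValueError traceback.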

Run

Output the 2-grams of the news article above.

ngram.py


$ python ngram.py 2 data/news.txt

Output result

Space exploration Of development Common sense Common sense Overturn Overturned Tato Tomo Can also be said ...

If you get output like the above, the program is working.

Summary

This time, I wrote a program that generates word N-grams in Python. To use it as a module, import the program and call its functions. I wrote it with reuse in mind, so it should be easy to import and use.
