[PYTHON] A script that runs morphological analysis on the page at a specified URL

This is an experiment in running morphological analysis on the page at whatever URL you enter. I tried to strip the HTML tags with a regular expression, but I could not get them removed.
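One possible reason the regular expression seems to leave markup behind is that it only deletes the tags themselves, so the contents of `<script>` and `<style>` elements stay in the text. The following is a minimal sketch of an alternative, not part of the original script: it uses the Python 3 standard library's `html.parser` (the script below targets Python 2, where the module is named `HTMLParser`). The names `TextExtractor` and `strip_html` are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Under these assumptions, the result of `strip_html(...)` could be passed to `parseToNode` in place of the regex output. Note that `strip_html` expects a decoded string, so the downloaded bytes would need to be decoded first.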

urlmecab.py



#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import sys
import MeCab
import re


def Mecab_file(search_url):
	# Download the page at the given URL
	req = urllib.urlopen(search_url)
	dlText = req.read()

	mt = MeCab.Tagger("mecabrc")
	# Delete anything that looks like an HTML tag
	p = re.compile(r"<[^>]*?>")
	sus = p.sub("", dlText)

	node = mt.parseToNode(sus)
	words = {}

	# Walk the node list and count the surfaces whose part-of-speech ID is in range
	while node:
		word = node.surface
		if word and node.posid >=36 and node.posid <=67:
			if word not in words:
				words[word] = 0
			words[word] += 1
		node = node.next

	# Print the (word, count) pairs in descending order of the word
	word_items = words.items()
	word_items.sort()
	word_items.reverse()
	for word, count in word_items:
		print word, count


# Keep asking for URLs; an empty line ends the loop
while True:
	search_url = raw_input(u"input URL: ")
	if search_url:
		Mecab_file(search_url)
	else:
		break

Only nouns are extracted, using MeCab's part-of-speech ID.

if word and node.posid >=36 and node.posid <=67:

Changing this condition opens up a lot of room for experimentation. The script keeps looping for as long as you keep entering URLs; pressing Enter on an empty line ends the loop. The URL must be entered in full, starting with http://.
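To make that condition easier to play with, the ID range can be turned into a parameter. The following is a rough sketch in Python 3, not the original author's code: `count_surfaces`, `iter_tokens`, and `NOUN_POSID_RANGE` are hypothetical names, and the sample tokens are made-up data so the sketch runs without MeCab installed. What each ID means depends on the dictionary in use, so the 36 to 67 range is taken from the script as-is.

```python
from collections import Counter

# ID range the original script treats as nouns
NOUN_POSID_RANGE = (36, 67)


def count_surfaces(tokens, posid_range=NOUN_POSID_RANGE):
    """Count surfaces whose part-of-speech ID is inside posid_range (inclusive).

    tokens is any iterable of (surface, posid) pairs.
    """
    low, high = posid_range
    counts = Counter()
    for surface, posid in tokens:
        if surface and low <= posid <= high:
            counts[surface] += 1
    return counts


def iter_tokens(node):
    """Yield (surface, posid) pairs from a linked node list,
    mirroring the while loop in the script."""
    while node:
        yield node.surface, node.posid
        node = node.next


# Made-up sample data; the IDs are illustrative only
sample = [("猫", 38), ("が", 13), ("猫", 38), ("走る", 31), ("東京", 46)]

# most_common() orders the result by count, highest first
for word, count in count_surfaces(sample).most_common():
    print(word, count)
```

With MeCab available, something like `count_surfaces(iter_tokens(mt.parseToNode(text)), (31, 33))` would count a different ID range without touching the loop itself. Unlike the script, this sketch orders the output by frequency rather than by word.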
