[Python] Summarize the titles of Hatena hot entries ("hotentori") in one line and see the present state of the Web

Now that Google Reader has shut down and services like SmartNews and Gunosy, which deliver recommended content even while you sleep, are attracting attention, I wanted to make something smart of my own.

So I wrote a program that pulls the titles of popular entries from Hatena Bookmark ("Hateb") and summarizes them in one line.

Yes this.

Summary http://xiidec.appspot.com/markov.html

If you use this ...

The truth of Japan is that the lazy cat lion can ride the elite course in this country.

Like this

What to do with the story of wanting productivity as to why highly educated discriminatory remarks are required.

The hot topics of the moment get mixed together and summarized in a single line.

Regarding Ayumi Hamasaki, the reactor did not reach enough, and discriminatory remarks about core meltdown continued.

You can see the current state of the Web in one line!

How it works

  1. On the server side (Python), fetch the RSS feed of Hatena Bookmark's popular entries.
  2. Split the text into words with TinySegmenter, a morphological analyzer that runs in JavaScript.
  3. Reconstruct the words with a Markov chain, an algorithm often used for bots.

It's almost like this.

It runs on a free Google App Engine server. The feed-fetching mechanism is almost the same as in my earlier article where feedparser automatically picked up cat images. The fetched titles are passed to the client.

The client then splits the received string into words with a magical JavaScript library called TinySegmenter.

今日はいい天気ですね。 ("It's nice weather today.") ↓ 今日 | は | いい | 天気 | です | ね | 。

Such an image.

Then the words are reconstructed using an algorithm called a Markov chain. For details, see the [Markov chain article on Wikipedia](http://en.wikipedia.org/wiki/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96), though reading it probably won't make things much clearer. Roughly, it works like this:

today → wa → nice → weather → desu → ne → 。
I → wa → cat → de → aru → 。
name → wa → mada → nai → 。
oyayuzuri → no → muteppou → de → kodomo → no → toki → kara → son → bakari → 。

Suppose there are multiple sentences.

  1. Pick the first word at random. → "today"
  2. The only word that follows "today" is "wa". → "today wa"
  3. The words that follow "wa" are "nice", "cat" and "mada". Choose one at random. → "today wa cat"
  4. The only word that follows "cat" is "de". → "today wa cat de"
  5. The words that follow "de" are "aru" and "kodomo". Choose at random again. → "today wa cat de kodomo"
  6. After "kodomo" comes "no". → "today wa cat de kodomo no"
  7. After "no" come "muteppou" and "toki". Choose "muteppou". → "today wa cat de kodomo no muteppou"
  8. After "muteppou" comes "de"; this time "aru" is chosen. → "today wa cat de kodomo no muteppou de aru"
  9. "。" follows "aru", so the sentence ends. → "today wa cat de kodomo no muteppou de aru。"

And something like that comes out. The Markov chain proper goes a bit deeper, with difficult formulas, but whatever the theory says, let's just hope the result turns out like this.
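The dictionary-and-walk procedure above can be sketched in Python. This is a toy version for illustration only: it splits on spaces instead of using a morphological analyzer, and the sample text is the example sentences romanized.

```python
import random

def make_dic(text):
    """Map each word to the list of words seen right after it.
    "_BOS_" marks sentence starts, "_EOS_" sentence ends."""
    dic = {"_BOS_": []}
    for sentence in text.split("."):
        words = sentence.split()
        if not words:
            continue
        dic["_BOS_"].append(words[0])            # this word can start a sentence
        for now, nxt in zip(words, words[1:] + ["_EOS_"]):
            dic.setdefault(now, []).append(nxt)  # record the transition
    return dic

def generate(dic):
    """Walk from a random sentence start, picking a random successor each step."""
    word = random.choice(dic["_BOS_"])
    out = []
    while word != "_EOS_":
        out.append(word)
        word = random.choice(dic[word])
    return " ".join(out) + "."

text = ("today wa nice weather desu ne. I wa cat de aru. "
        "name wa mada nai. oyayuzuri no muteppou de kodomo no toki kara son bakari.")
print(generate(make_dic(text)))
```

Run it a few times and sentences like "today wa cat de aru." fall out, exactly as in the walkthrough.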

Source

This is the source on the server side.

markov.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
import webapp2
import os
from google.appengine.ext.webapp import template
from xml.etree.ElementTree import *
import re

import urllib

class Markov(webapp2.RequestHandler):
	def get(self):
		mes=""
		if self.request.get('mode')=="2ch":
			mes=self.get_2ch()
		else:
			mes=self.get_hotentry_title()
		
		template_values={
		'mes':mes
		}
		path = os.path.join(os.path.dirname(__file__), 'html/markov.html')
		self.response.out.write(template.render(path, template_values))
		
	def get_hotentry_title(self):
		# Fetch the RSS of Hatena Bookmark's popular entries and
		# collect the cleaned-up titles, one per line.
		titles = ""
		tree = parse(urllib.urlopen('http://feeds.feedburner.com/hatena/b/hotentry'))
		for i in tree.findall('./{http://purl.org/rss/1.0/}item'):
			# Strip trailing noise such as " - XX blog" from each title.
			titles += re.sub("[-:|/|:].{1,30}$","",i.find('{http://purl.org/rss/1.0/}title').text) + "\n"
		return titles

	def get_2ch(self):
		# Extra mode: fetch the thread list of 2ch's poverty board instead.
		titles = ""
		response = urllib.urlopen('http://engawa.2ch.net/poverty/subject.txt')
		html = unicode(response.read(), "cp932", 'ignore').encode("utf-8")
		for line in html.split("\n"):
			if line != "":
				# Each line is "1234.dat<>Thread title (123)"; keep the title
				# and drop the trailing reply count.
				titles += re.sub("\(.*?\)$","",line.split("<>", 2)[1]) + "\n"
		return titles
		
app = webapp2.WSGIApplication([
	('/markov.html', Markov)
], debug=True)

The get method of the Markov class runs when a user accesses the page. get_hotentry_title() fetches the list of popular entries and passes it to markov.html. ElementTree is used to parse the RSS; using feedparser on GAE seemed like more hassle than it was worth.
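ElementTree needs namespace-qualified tag names to find the RSS 1.0 items, which is why the curly-brace prefix appears in get_hotentry_title(). A minimal sketch, using an invented two-item feed rather than real Hatena data:

```python
# Sketch of namespace-qualified lookups with ElementTree. The sample
# feed below is a made-up stand-in for the real Hatena hotentry RSS.
from xml.etree.ElementTree import fromstring

RSS_NS = '{http://purl.org/rss/1.0/}'

sample = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <item><title>First title</title></item>
  <item><title>Second title</title></item>
</rdf:RDF>"""

def extract_titles(xml_text):
    """Collect the text of every item's title; tags must carry the namespace."""
    root = fromstring(xml_text)
    return [item.find(RSS_NS + 'title').text
            for item in root.findall('./' + RSS_NS + 'item')]

print(extract_titles(sample))  # ['First title', 'Second title']
```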

get_2ch() is a bonus feature. Instead of hot entries, it picks up thread titles from 2ch. Append "?mode=2ch" to the URL to switch to 2ch mode. If you extend it so that parameters change which information is fetched, the possibilities expand.
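The parsing inside get_2ch() can be sketched on its own. Each line of a 2ch subject.txt has the form "dat-name<>Thread title (reply count)"; the sample line below is invented, not real board data:

```python
# Sketch of the subject.txt line parsing used in get_2ch().
import re

def parse_subject_line(line):
    """Return the thread title with the trailing reply count stripped."""
    title = line.split("<>", 2)[1]         # drop the "1234.dat" part
    return re.sub(r"\(.*?\)$", "", title)  # drop the trailing "(123)"

print(parse_subject_line("1372000000.dat<>Interesting thread (123)"))
```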

re.sub("[-:|/|:].{1,30}$", "", ~~~)

This mysterious-looking re.sub call removes unneeded noise.

◯◯ The only clear way to do 100 △△ selections - XX blog

It deletes the trailing "- XX blog" from titles like this to keep them simple.
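A quick check of that cleanup regex: a separator character followed by up to 30 trailing characters is removed. The sample title is invented, and for simplicity this sketch uses only the ASCII separators (the real pattern also includes full-width ones):

```python
# Title cleanup as in markov.py: strip a short trailing segment
# introduced by a separator such as '-', ':', '|' or '/'.
import re

def strip_suffix(title):
    """Remove a separator plus up to 30 trailing characters."""
    return re.sub(r"[-:|/].{1,30}$", "", title)

print(strip_suffix("The only clear way to do 100 selections - XX blog"))
```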

Next is the client side.

markov.html



<html>
    <head>
        <meta charset="UTF-8">
        <title>Summary-kun</title>
        <link rel="stylesheet" href="http://code.jquery.com/mobile/1.1.0/jquery.mobile-1.1.0.min.css" />
        <script type="text/javascript" src="http://code.jquery.com/jquery-1.7.1.min.js"></script>
        <script type="text/javascript" src="http://code.jquery.com/mobile/1.1.0/jquery.mobile-1.1.0.min.js"></script>
        <script type="text/javascript" src="jscss/tiny_segmenter-0.1.js" charset="UTF-8"></script>
    </head>
    <body>
        <script type="text/javascript">
	var segmenter
	$(function(){
		segmenter = new TinySegmenter();//Instance generation
	})
	//Run
	function doAction(){
		var wkIn=$("#txtIN").val();//Input text
		var dict=makeDic(wkIn);//Build the word-transition dictionary (segmentation happens inside)
		var wkbest=doShuffle(dict);
		for(var i=0;i<=10;i++){
			var wkOut=doShuffle(dict).replace(/\n/g,"");
			if(Math.abs(40-wkOut.length)<Math.abs(40-wkbest.length)){
				wkbest=wkOut;
			}
		}
		
		$("#txtOUT").val(wkbest);//Output
		
	}
	//Shuffle sentences
	function doShuffle(wkDic){
		var wkNowWord=""
		var wkStr=""
		wkNowWord=wkDic["_BOS_"][Math.floor( Math.random() * wkDic["_BOS_"].length )];
		wkStr+=wkNowWord;
		while(wkNowWord != "_EOS_"){
			wkNowWord=wkDic[wkNowWord][Math.floor( Math.random() * wkDic[wkNowWord].length )];
			wkStr+=wkNowWord;
		}
		wkStr=wkStr.replace(/_EOS_$/,"。")
		return wkStr;
	}
	//Add to dictionary
	function makeDic(wkStr){
		wkStr=nonoise(wkStr);
		var wkLines= wkStr.split("。");
		var wkDict=new Object();
		for(var i =0;i<=wkLines.length-1;i++){
			var wkWords=segmenter.segment(wkLines[i]);
			if(! wkDict["_BOS_"] ){wkDict["_BOS_"]=new Array();}
			if(wkWords[0]){wkDict["_BOS_"].push(wkWords[0])};//Beginning of sentence

			for(var w=0;w<=wkWords.length-1;w++){
				var wkNowWord=wkWords[w];//Now word
				var wkNextWord=wkWords[w+1];//Next word
				if(wkNextWord==undefined){//End of sentence
					wkNextWord="_EOS_"
				}
				if(! wkDict[wkNowWord] ){
					wkDict[wkNowWord]=new Array();
				}
				wkDict[wkNowWord].push(wkNextWord);
				if(wkNowWord=="、"){//The word after "、" can also start a sentence
					wkDict["_BOS_"].push(wkNextWord);
				}
			}
			
		}
		return wkDict;
	}
	
	//Noise removal
	function nonoise(wkStr){
		wkStr=wkStr.replace(/\n/g,"。");
		wkStr=wkStr.replace(/[\?\!?!]/g,"。");
		wkStr=wkStr.replace(/[-||:: ・]/g,"。");
		wkStr=wkStr.replace(/[「」()\(\)\[\]【】]/g," ");
		return wkStr;
	}	
</script>
<div data-role="page" id="first">
	<div data-role="content">	

        <p>To summarize the topical articles on the net in one line ...</p>
					<p><textarea cols="60" rows="8" name="txtIN" id="txtIN"  style="max-height:200px;">{{ mes }}</textarea></p>
        <input type="button" name="" value="Generate" onClick="doAction()"><br>
        <textarea cols="60" rows="8" name="txtOUT" id="txtOUT"></textarea>
        <p></p>

</div>
</div>
</body>
</html>

It looks like a mess, but the flow is simple. doAction() is the main function. The string received from the server is decomposed into pieces by segmenter.segment(), and makeDic() builds a dictionary of which word can follow which. doShuffle() then mixes them back into a sentence; it is run about ten times and the string closest to 40 characters is adopted.
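That final "closest to 40 characters" selection can be written as a small stand-alone helper. A Python sketch for illustration only, not code the page actually runs:

```python
# Pick, from several generated candidates, the one whose length
# is closest to the target (40 characters, as in doAction()).
def pick_best(candidates, target=40):
    """Return the candidate string whose length is closest to target."""
    return min(candidates, key=lambda s: abs(target - len(s)))

print(pick_best(["too short.", "x" * 40, "y" * 90]))  # the 40-character one wins
```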

Complete.

It seems you could improve it in various ways, for example by changing what information is fetched from the Web, or by tuning the evaluation criteria for the mixed-up sentences to your taste.

Summary

Not very practical.
