[PYTHON] That's why I analyze the homepages of each political party

Purpose

It has been a year since the ban on online election campaigning was lifted, and a few uses that defy the Public Offices Election Act have already appeared. How is everyone doing? This time, I will analyze the homepages of each political party, find out which words they use, and extract the parties with similar characteristics.

Procedure

  1. Download the homepage of each political party.
  2. Run morphological analysis on the downloaded pages with MeCab.
  3. Compute a score for each word with tf-idf.
  4. Measure the distance between documents by cosine similarity to find the distance between political parties.
  5. Visualize the relationships with GraphViz.
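
Step 5 can be illustrated without any extra dependency by writing GraphViz DOT text directly. This is only a minimal sketch: the function name `similarity_to_dot`, the `threshold` parameter, the party names, and the similarity values are all hypothetical, and the actual script uses pydot and may build its graph differently.

```python
def similarity_to_dot(names, similarity, threshold=0.3):
    """Build an undirected GraphViz graph in DOT syntax.

    names      -- list of node labels (hypothetical party names here)
    similarity -- square matrix of pairwise cosine similarities
    threshold  -- assumed cutoff: pairs below it get no edge
    """
    lines = ["graph parties {"]
    for name in names:
        lines.append('  "{}";'.format(name))
    # Visit each unordered pair once and connect the similar ones.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = similarity[i][j]
            if score >= threshold:
                lines.append('  "{}" -- "{}" [label="{:.2f}"];'.format(
                    names[i], names[j], score))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical toy values, not results from the real analysis.
names = ["Party A", "Party B", "Party C"]
similarity = [
    [1.00, 0.62, 0.10],
    [0.62, 1.00, 0.35],
    [0.10, 0.35, 1.00],
]
dot_text = similarity_to_dot(names, similarity)
print(dot_text)
```

Saving the output to a file such as parties.dot and running `dot -Tpng parties.dot -o parties.png` would render it as an image.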

Result

Homepage analysis of each political party in 2014: http://needtec.sakura.ne.jp/analyze_election/page/analyzehp/2014

Source code: https://github.com/mima3/analyze_election/tree/master/script_comp_manifesto

Download the source code and run the following scripts in order.

# Download each homepage and store it in the DB
python create_parties_db.py parties_hp_2014.sqlite party_hp_json_2014.json

# Run morphological analysis and count the words
python create_parties_tokens.py parties_hp_2014.sqlite

# Calculate tf-idf and cosine similarity, and write the results to JSON and PNG
python create_tf_idf_report.py parties_hp_2014.sqlite party_hp_result_2014.json party_hp_result_2014.png "ms ui gothic"

To run the scripts, you need the following libraries: nltk, lxml, MeCab, urllib2, pydot. (urllib2 ships with Python 2 and needs no separate installation.)

Commentary

Analysis of sentences by tf-idf

The tf-idf value of word x in document y is computed as follows.

tf = number of times word x appears in document y / number of words in document y
idf = 1.0 + log(total number of documents / number of documents in which word x appears)
tf-idf = tf × idf

Words that appear in many documents have a lower importance and a lower score, and words that appear only in a specific document have a higher importance and a higher score.
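
The formula above can be written out in a few lines of plain Python. This is a minimal sketch rather than the code in the repository: the function name `tf_idf` and the toy token lists are hypothetical, and it assumes each document has already been split into words (for example by MeCab).

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per tokenized document.

    tf  = occurrences of the word in the document / words in the document
    idf = 1.0 + log(total documents / documents containing the word)
    """
    total_docs = len(documents)
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens))
            * (1.0 + math.log(total_docs / doc_freq[word]))
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy data: each list stands in for one party's tokenized homepage.
docs = [
    ["tax", "economy", "tax"],
    ["economy", "defense"],
    ["welfare", "economy"],
]
result = tf_idf(docs)
for word, score in sorted(result[0].items()):
    print(word, round(score, 3))
```

In this toy data, "economy" appears in every document, so its idf is exactly 1.0, while "tax" appears in only one document and therefore scores higher there.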

Measuring document distance by cosine similarity

Suppose document 1 contains the words (A, B, C) with TF-IDF values (0.1, 0.2, 0.3), and document 2 contains the words (C, D, E) with TF-IDF values (0.4, 0.5, 0.6).

Treating the TF-IDF of any word absent from a document as 0, build a TF-IDF vector over all words for each document.

For (A, B, C, D, E), document 1 becomes (0.1, 0.2, 0.3, 0, 0) and document 2 becomes (0, 0, 0.4, 0.5, 0.6).

The cosine of the angle between the two vectors represents how similar the documents are. For two identical documents, the angle is 0 degrees and the cosine is 1.

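In Python, this can be computed with nltk.cluster.util.cosine_distance. Note that in current NLTK releases this function returns 1 minus the cosine, so identical documents give 0; check what your installed version returns.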
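
The worked example with words A to E can be reproduced in plain Python, without NLTK. This is a sketch under assumptions: the function name `cosine_similarity` and the dict-based representation are hypothetical choices for illustration, not the repository's code.

```python
import math


def cosine_similarity(scores_a, scores_b):
    """Cosine of the angle between two {word: tf-idf} mappings.

    Words missing from one mapping are treated as having a score of 0.
    """
    vocabulary = sorted(set(scores_a) | set(scores_b))
    vec_a = [scores_a.get(word, 0.0) for word in vocabulary]
    vec_b = [scores_b.get(word, 0.0) for word in vocabulary]

    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty document has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# The two example documents from the text.
doc1 = {"A": 0.1, "B": 0.2, "C": 0.3}
doc2 = {"C": 0.4, "D": 0.5, "E": 0.6}
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 3))
```

Only the shared word C contributes to the dot product, so the two documents come out as only weakly similar, while comparing a document with itself yields 1.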
