100 Language Processing Knock is a collection of problems published by Tohoku University's Inui-Okazaki Laboratory, aimed at learning programming, data analysis, and research skills in a fun way while tackling practical issues.
So far I have solved "Chapter 4: Morphological Analysis", "Chapter 5: Dependency Analysis", "Chapter 8: Machine Learning", and [Chapter 9: Vector Space Method (I)](http://qiita.com/Masaaki_Inaba/items/74bf3a91347bd424556a). This time I continue with "Chapter 10: Vector Space Method (II)".
In Chapter 9 we implemented word2vec ourselves; in Chapter 10 we will do various things using a publicly available word2vec library. This is the final chapter.
In Chapters 8 and 9 I was such an NLP beginner that I could not even understand the problem statements, so I titled those posts "For those who do not understand the meaning of the problem statement" and explained my interpretation of each problem along with the code. I don't think this chapter is that difficult, so I dropped that subtitle and went back to the style of Chapters 4 and 5, where I simply post the code.
There were three terms I couldn't understand or remember without looking them up:

- k-means clustering
- Clustering by Ward's method
- Visualization by t-SNE

I'll summarize them briefly first.
Clustering is a method that lets a machine classify data without any human supervision. There are non-hierarchical and hierarchical clustering, whose representative algorithms are k-means (non-hierarchical) and Ward's method (hierarchical). I think articles such as [One of the most important, most commonly used and most difficult analysis methods, "cluster analysis"](http://business.nikkeibp.co.jp/atclbdt/15/258678/071500002/?ST=print) summarize the main points concisely.
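As a quick, concrete illustration of the non-hierarchical side, here is a minimal toy sketch of k-means with spark.mllib (the same KMeans API used for problem 97 below; the SparkContext sc is assumed to be set up as in Model.scala):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Six unlabeled 2-D points: three near the origin, three near (9, 9)
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9), Vectors.dense(8.8, 9.0)))
// Ask for k = 2 clusters with at most 20 iterations; no labels are given
val toyModel = KMeans.train(points, 2, 20)
// Each point gets a cluster id (0 or 1) derived purely from the data
points.collect.foreach(p => println(p, toyModel.predict(p)))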
t-SNE is another dimensionality reduction method, like the principal component analysis and singular value decomposition mentioned in Chapter 9, but it compresses the dimensions so that the result is easy for humans to interpret, typically down to about two or three dimensions, which makes it well suited to visualization. "Introduction of dimensional compression method using t-SNE" was easy to understand.
With that out of the way, I'll solve problems 90-99 below and finish with some general remarks.
Apply word2vec to the corpus created in problem 81 and learn word vectors. Furthermore, convert the format of the learned word vectors and run the programs from problems 86-89.
Use word2vec from spark.mllib.
Main.scala
package nlp100_10
import java.io.{File, PrintWriter}
import scala.io.Source
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import nlp100_10.Model._
object Main {
  def main(args: Array[String]) {
    println("86.Display word vector")
    println(model.transform("United_States").toArray.mkString(" "))

    println("87.Word similarity")
    println(multiplyVec(model.transform("United_States"), model.transform("U.S")))

    println("88.10 words with high similarity")
    wordSynonyms("England", 10).foreach(println)

    println("89.Analogy by additive construct")
    println(vectorSynonyms(analogyWord("Spain", "Madrid", "Athens")).head._1)
  }
}
Model.scala
package nlp100_10
import java.io.{File, PrintWriter}
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{Normalizer, Word2Vec, Word2VecModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import nlp100_9.WikiRDD._
object Model {
val RAW_FILE = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/enwiki-20150112-400-r10-105752.txt"
val COMBINED_WORDS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/combined_words.txt"
val CORPUS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/corpus"
val MODEL_PATH = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/model"
val WORD_SIMILARITY_SET1 = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/set1.tab"
val WORD_SIMILARITY_SET2 = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/set2.tab"
val WORD_SIMILARITY_COMBINED = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/combined.tab"
val COUNTRY_CLUSTERS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/clusters"
val COUNTRY_VECTORS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/country_vectors.txt"
val sc = new SparkContext(new SparkConf().setAppName("NLP100").setMaster("local[*]"))
val model: Word2VecModel = {
  if (!new File(CORPUS).exists) {
    // Create the corpus if it does not exist yet
    sc.textFile(RAW_FILE).cleansData.replaceCombinedWord(COMBINED_WORDS).saveAsTextFile(CORPUS)
  }
  val model: Word2VecModel = if (new File(MODEL_PATH).exists) {
    Word2VecModel.load(sc, MODEL_PATH)
  } else {
    val input = sc.textFile(CORPUS).map(_.split(" ").toVector)
    // Build and train the model
    val m = new Word2Vec()
      .setVectorSize(300)
      .setNumPartitions(20)
      .setMinCount(1000)
      .fit(input)
    m.save(sc, MODEL_PATH)
    m
  }
  // Normalize each word vector so that the dot product equals cosine similarity
  val normalizer = new Normalizer
  new Word2VecModel(
    model.getVectors.map { case (key, array) =>
      key -> normalizer.transform(toVector(array)).toArray.map(_.toFloat)
    })
}
/* Various calculation methods */
def plusVec(vec1: Vector, vec2: Vector): Vector = Vectors.dense(vec1.toArray.zip(vec2.toArray).map { case (v1, v2) => v1 + v2 })
def minusVec(vec1: Vector, vec2: Vector): Vector = Vectors.dense(vec1.toArray.zip(vec2.toArray).map { case (v1, v2) => v1 - v2 })
def multiplyVec(vec1: Vector, vec2: Vector): Double = vec1.toArray.zip(vec2.toArray).map { case (v1, v2) => v1 * v2 }.sum
def analogyVec(vec1: Vector, vec2: Vector, vec3: Vector) = plusVec(minusVec(vec1, vec2), vec3)
def analogyWord(word1: String, word2: String, word3: String) = plusVec(minusVec(model.transform(word1), model.transform(word2)), model.transform(word3))
def toVector(a: Array[Float]): Vector = Vectors.dense(a.map(_.toDouble))
def vectorSynonyms(vector: Vector, num: Int = 10): List[(String, Double)] = {
  // Dot product against every word vector (= cosine similarity, since the vectors are normalized), top `num`
  model.getVectors.map { case (k, array) => k -> multiplyVec(vector, toVector(array)) }.toList.sortBy(_._2).reverse.slice(0, num)
}
def wordSynonyms(word: String, num: Int = 10): List[(String, Double)] = {
  vectorSynonyms(model.transform(word), num)
}
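As an aside, spark.mllib's Word2VecModel also has a built-in findSynonyms method, which should give roughly the same result as the hand-rolled wordSynonyms above (except that findSynonyms excludes the query word itself); vectorSynonyms is kept here because it is reused for the analogy vectors in problem 89. A minimal usage sketch:

// Roughly equivalent to wordSynonyms("England", 10), using the built-in API
model.findSynonyms("England", 10).foreach(println)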
Output result
86.Display word vector
-0.05195711553096771 -0.02188839577138424 -0.02766110934317112 ...
87.Word similarity
0.7993002007168585
88.10 words with high similarity
(England,0.999999995906603)
(Scotland,0.8266511212525927)
(Wales,0.8146345041068417)
(London,0.7710435879598873)
(Australia,0.7684126888668479)
(Ireland,0.7508965993753893)
(Hampshire,0.7350064189984341)
(Lancashire,0.7295800707042573)
(Yorkshire,0.7289047527357796)
(Sydney,0.7255715511987988)
89.Analogy by additive construct
Greece
Perfect!
Download the evaluation data for the word analogy task. In this data, a line starting with ":" indicates a section name; for example, the line ": capital-common-countries" marks the beginning of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation cases included in the section "family" and save them to a file.
The word-analogy evaluation data linked from the problem statement seems to be broken, so I downloaded it from here instead (the URL in the code below).
Model.scala
def analogies: List[List[String]] = {
  val URL = "https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt"
  val ANALOGY_DATA = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/analogy.txt"
  if (new File(ANALOGY_DATA).exists) {
    Source.fromFile(ANALOGY_DATA).getLines().toList.map(_.split(" ").toList)
  } else {
    // Cut out only the family section
    var families = Source.fromURL(URL).getLines.toList
    families = families.slice(families.indexOf(": family") + 1, families.length)
    families = families.slice(0, families.indexWhere(_.startsWith(": ")))
    // Write it out to a file
    val file = new PrintWriter(ANALOGY_DATA)
    file.write(families.mkString("\n"))
    file.close()
    families.map(_.split(" ").toList)
  }
}
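Each extracted case is four space-separated words read as "A is to B as C is to D"; the family section consists of pairs such as the boy/girl, brother/sister, and father/mother ones mentioned under problem 93 below, e.g. a line like:

boy girl brother sister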
For each case in the evaluation data created in problem 91, compute vec(word in the second column) - vec(word in the first column) + vec(word in the third column), and find the word whose vector has the highest similarity to that vector, together with its similarity.
Append the obtained word and similarity to the end of each case. Apply this program to the word vectors created in 85 and the word vectors created in 90.
I record correct (true) / incorrect (false) answers in result.
Main.scala
var result = List[Boolean]()
analogies.foreach { words =>
  try {
    val actualAnswer = vectorSynonyms(analogyWord(words.head, words(1), words(2)), 1).head._1
    println("%s\t-\t%s\t+\t%s\t=\t%s\t%s".format(words.head, words(1), words(2), words(3), actualAnswer))
    result :+= (words(3) == actualAnswer)
  } catch {
    case e: IllegalStateException => () // skip words that are not in the model's vocabulary
  }
}
Using the data created in problem 92, find the accuracy of each model on the analogy task.
Main.scala
println(result.count(x => x) / result.length.toDouble)
Correct answer rate
0.020512820512820513
The model hasn't learned this well... It seems it can't capture gender differences such as boy/girl, father/mother, and brother/sister, so we end up with boy - girl + brother = brother.
Read the evaluation data of The WordSimilarity-353 Test Collection and create a program that computes the similarity between the words in the first and second columns and appends the similarity value to the end of each line. Apply this program to the word vectors created in 85 and the word vectors created in 90.
Main.scala
val (human: List[Double], machine: List[Double]) = wordSimilarity(WORD_SIMILARITY_COMBINED).unzip
Model.scala
def wordSimilarity(fileName: String): List[(Double, Double)] = {
  Source.fromFile(fileName).getLines.map { line =>
    try {
      val words = line.split("\t")
      // (human similarity judgement, model similarity)
      List(words(2).toDouble, multiplyVec(model.transform(words(0)), model.transform(words(1))))
    } catch {
      case e: IllegalStateException => Nil
      case e: NumberFormatException => Nil
    }
  }.filter(_.nonEmpty).toList.map(x => (x.head, x.last))
}
Using the data created in problem 94, calculate the Spearman correlation coefficient between the similarity ranking output by each model and the ranking of human similarity judgments.
Main.scala
val diff = rank(human).zip(rank(machine)) // convert the two similarity lists into rank lists and pair them up
  .map(x => math.pow(x._1 - x._2, 2))     // square the difference of each pair of ranks
println(spearman(diff))
Model.scala
// Convert a list of similarities to a list of ranks
def rank(words: List[Double]): List[Int] = {
  val ranking = words.sorted.zipWithIndex.map(x => (x._1, x._2 + 1)).toMap
  words.map(ranking)
}
// Spearman's rank correlation coefficient. Ties are not given averaged ranks as in the textbook
// definition, so the result differs slightly (e.g. a: 4.5, b: 4.5 ends up as a: 4, b: 5).
def spearman(diff: List[Double]) = 1 - (6 * diff.sum) / (math.pow(diff.length, 3) - diff.length)
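For reference, what spearman computes is the standard (tie-free) form of Spearman's rank correlation coefficient, where d_i is the difference between the two ranks of case i and n is the number of cases:

ρ = 1 - 6 Σ d_i² / (n(n² - 1)) = 1 - 6 Σ d_i² / (n³ - n)

which is exactly 1 - (6 * diff.sum) / (math.pow(diff.length, 3) - diff.length) above.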
result
combined: 0.39186961031541434 (Low correlation)
set1: 0.2655867166143566 (Low correlation)
set2: 0.4190924041068492 (Correlated)
Extract only the vectors for country names from the word2vec learning results.
Main.scala
// Handle compound words (e.g. "United States" => "United_States")
val countryNames = Source.fromFile(COMBINED_WORDS).getLines.map(line => line.replace(" ", "_")).toList
val countryVectors: Map[String, Vector] = model.getVectors
  .filter(x => countryNames.indexOf(x._1) >= 0) // keep only country names
  .map { case (key, array) => key -> Vectors.dense(array.map(y => y.toDouble)) } // convert String -> Array[Float] entries to String -> Vector
Execute k-means clustering on the word vectors from problem 96 with the number of clusters k = 5.
Main.scala
val countryRdd: RDD[Vector] = sc.parallelize(countryVectors.values.toList) // convert the Map values to an RDD
val clusters = if (new File(COUNTRY_CLUSTERS).exists) {
  // Load the k-means clustering model if it has already been saved to a file
  KMeansModel.load(sc, COUNTRY_CLUSTERS)
} else {
  // Otherwise train a new k-means clustering model and save it
  val clusters: KMeansModel = KMeans.train(countryRdd, 5, 100)
  clusters.save(sc, COUNTRY_CLUSTERS)
  clusters
}
// Check that the clustering has 5 center points
clusters.clusterCenters.foreach(println)
// Check which cluster each country belongs to
countryVectors.keys.zip(clusters.predict(countryRdd).collect)
  .toList.sortBy(_._2).foreach(println)
Sample output
[0.04466799285174126,0.04456286245424832,-0.01976845185050652, ...
[0.03334749694396224,0.015676170529332012,-0.03916260437108576, ...
[-0.014139431890928082,-0.0038628893671557307,-0.04137489525601268, ...
[0.03492516125058473,0.024117531810506163,-0.029571880074923465, ...
[0.043189115822315216,0.02963972231373191,-0.03933139890432358, ...
(Morocco,0)
(Macedonia,0)
Omission
(Sudan,0)
(Chile,1)
(Indonesia,1)
Omission
(Czech_Republic,1)
(Jordan,2)
(Jersey,2)
Omission
(Bermuda,2)
(Lebanon,3)
(France,3)
Omission
(Denmark,3)
(India,4)
(Pakistan,4)
Omission
(China,4)
- 98. Ward's method clustering: Perform hierarchical clustering by Ward's method on the word vectors from problem 96. Furthermore, visualize the clustering result as a dendrogram.
- 99. Visualization by t-SNE: Visualize the vector space of the word vectors from problem 96 with t-SNE.
It seems that Spark's spark.mllib does not support Ward's method (Clustering - spark.mllib). There are only two problems left, and it's a hassle to find another library. Apparently there is Weka, which runs on the JVM, but I'll pass on it because it doesn't feel like something from the last decade. So I'll do these with scipy, following the Kitanosaka memorandum that I always rely on.
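One step the Scala snippets above don't show is writing out COUNTRY_VECTORS (the country_vectors.txt that the Python script below reads). Presumably the country vectors from problem 96 were dumped one per line as "name value1 ... value300"; a minimal sketch of that export, under that assumption and using the PrintWriter already imported in Main.scala, might look like:

// Hypothetical export step (not shown in the original code): one "name v1 v2 ... v300" line per country
val writer = new PrintWriter(COUNTRY_VECTORS)
countryVectors.foreach { case (name, vector) =>
  writer.println((name +: vector.toArray.map(_.toString)).mkString(" "))
}
writer.close()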
Main.py
from scipy.cluster.hierarchy import ward, dendrogram
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
# Restore the word vectors from the file
# (each line has the country name in the first column and the 300 values of the vector
#  representing that country in columns 2 to 301, separated by spaces)
COUNTRY_VECTORS = '/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/country_vectors.txt'
with open(COUNTRY_VECTORS) as input_handler:
    lines = [line.split() for line in input_handler]
country_names = [line[0] for line in lines]
country_vectors = [[float(value) for value in line[1:]] for line in lines]

# Clustering with Ward's method
ward_result = ward(country_vectors)
# Display the result as a dendrogram
dendrogram(ward_result, labels=country_names)
plt.show()

# t-SNE
t_sne_result = TSNE().fit_transform(country_vectors)
# Display the 2-D embedding as a scatter plot labelled with country names
fig, ax = plt.subplots()
ax.scatter(t_sne_result[:, 0], t_sne_result[:, 1])
for index, label in enumerate(country_names):
    ax.annotate(label, xy=(t_sne_result[index, 0], t_sne_result[index, 1]))
plt.show()
Easy. But I can hardly read the country names... Since I can barely read them, the result isn't very convincing; here let's just confirm that scipy.cluster.hierarchy and sklearn.manifold can be used for Ward clustering, dendrogram visualization, and t-SNE visualization.
Finally, I'll briefly summarize what I learned throughout the series. These are just the impressions of a beginner who has dabbled a little in language processing, so don't take them as authoritative explanations. (Corrections and comments are welcome.)
- Even seemingly difficult machine learning and statistics terms are not scary if you focus on their purpose and their inputs/outputs and treat them as black-box tools.
- Spark is indispensable for ETL processing of big data, but Python-based libraries offer a richer set of machine learning algorithms.
- Using a library for machine learning and natural language processing is not difficult.
- But the distance between "managing to use a library" and "learning and analyzing with high accuracy" is endless... (What do you need to do to close that distance?)
- It is difficult to test whether the analysis results obtained from machine learning are right or wrong. You cannot check that the expected result and the actual result match exactly, as you would in a normal(?) application. In the end, the validity of a learning result can only be judged by whether it is useful, and to judge whether it is useful, the purpose of the analysis must be clear. This chapter is fine because it is practice, but if you just cluster things vaguely with no purpose, you end up with a "So what?" result.