100 Language Processing Knock is a collection of problems published by Tohoku University's Inui-Okazaki Laboratory, aimed at learning programming, data analysis, and research skills in a fun way while tackling practical issues.
So far I have solved "Chapter 4: Morphological Analysis", "Chapter 5: Dependency Analysis", "Chapter 8: Machine Learning", and [Chapter 9: Vector Space Method (I)](http://qiita.com/Masaaki_Inaba/items/74bf3a91347bd424556a). This time I continue with "Chapter 10: Vector Space Method (II)".
In Chapter 9 we implemented word2vec ourselves; in Chapter 10 we will do various things using a publicly available word2vec library. This is the final chapter.
In Chapters 8 and 9 I was such an NLP beginner that I could not even understand the problem statements, so I titled those posts "For those who do not understand the meaning of the problem statement" and explained my interpretation of each problem along with the code. I don't think this chapter is that difficult, so I dropped that subtitle and went back to the style of Chapters 4 and 5, where I simply post the code.
There were three terms I couldn't understand or remember without looking them up:

- k-means clustering
- Clustering by Ward's method
- Visualization by t-SNE

I'll summarize them briefly first.
Clustering is a method that lets a machine classify data without any human supervision. There are non-hierarchical and hierarchical clustering, whose representative algorithms are k-means (non-hierarchical) and Ward's method (hierarchical). I think articles such as [One of the most important, most commonly used and most difficult analysis methods, "cluster analysis"](http://business.nikkeibp.co.jp/atclbdt/15/258678/071500002/?ST=print) summarize the main points concisely.
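As a quick, concrete illustration of the non-hierarchical side, here is a minimal toy sketch of k-means with spark.mllib (the same KMeans API used for problem 97 below; the SparkContext sc is assumed to be set up as in Model.scala):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Six unlabeled 2-D points: three near the origin, three near (9, 9)
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9), Vectors.dense(8.8, 9.0)))
// Ask for k = 2 clusters with at most 20 iterations; no labels are given
val toyModel = KMeans.train(points, 2, 20)
// Each point gets a cluster id (0 or 1) derived purely from the data
points.collect.foreach(p => println(p, toyModel.predict(p)))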
t-SNE is another dimensionality reduction method, like the principal component analysis and singular value decomposition mentioned in Chapter 9, but it compresses the dimensions so that the result is easy for humans to interpret, typically down to about two or three dimensions, which makes it well suited to visualization. "Introduction of dimensional compression method using t-SNE" was easy to understand.
With that out of the way, I'll solve problems 90-99 below and finish with some general remarks.
Apply word2vec to the corpus created in problem 81 and learn word vectors. Furthermore, convert the format of the learned word vectors and run the programs from problems 86-89.
Use word2vec from spark.mllib.
Main.scala
package nlp100_10
import java.io.{File, PrintWriter}
import scala.io.Source
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import nlp100_10.Model._
object Main {
  def main(args: Array[String]) {
    println("86.Display word vector")
    println(model.transform("United_States").toArray.mkString(" "))

    println("87.Word similarity")
    println(multiplyVec(model.transform("United_States"), model.transform("U.S")))

    println("88.10 words with high similarity")
    wordSynonyms("England", 10).foreach(println)

    println("89.Analogy by additive construct")
    println(vectorSynonyms(analogyWord("Spain", "Madrid", "Athens")).head._1)
  }
}
Model.scala
package nlp100_10
import java.io.{File, PrintWriter}
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{Normalizer, Word2Vec, Word2VecModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import nlp100_9.WikiRDD._
object Model {
val RAW_FILE = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/enwiki-20150112-400-r10-105752.txt"
val COMBINED_WORDS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/combined_words.txt"
val CORPUS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/corpus"
val MODEL_PATH = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/model"
val WORD_SIMILARITY_SET1 = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/set1.tab"
val WORD_SIMILARITY_SET2 = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/set2.tab"
val WORD_SIMILARITY_COMBINED = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/combined.tab"
val COUNTRY_CLUSTERS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/clusters"
val COUNTRY_VECTORS = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/country_vectors.txt"
val sc = new SparkContext(new SparkConf().setAppName("NLP100").setMaster("local[*]"))
val model: Word2VecModel = {
  if (!new File(CORPUS).exists) {
    // Create the corpus if it does not exist yet
    sc.textFile(RAW_FILE).cleansData.replaceCombinedWord(COMBINED_WORDS).saveAsTextFile(CORPUS)
  }
  val model: Word2VecModel = if (new File(MODEL_PATH).exists) {
    Word2VecModel.load(sc, MODEL_PATH)
  } else {
    val input = sc.textFile(CORPUS).map(_.split(" ").toVector)
    // Build and train the model
    val m = new Word2Vec()
      .setVectorSize(300)
      .setNumPartitions(20)
      .setMinCount(1000)
      .fit(input)
    m.save(sc, MODEL_PATH)
    m
  }
  // Normalize each word vector so that the dot product equals cosine similarity
  val normalizer = new Normalizer
  new Word2VecModel(
    model.getVectors.map { case (key, array) =>
      key -> normalizer.transform(toVector(array)).toArray.map(_.toFloat)
    })
}
/* Various calculation methods */
def plusVec(vec1: Vector, vec2: Vector): Vector = Vectors.dense(vec1.toArray.zip(vec2.toArray).map { case (v1, v2) => v1 + v2 })
def minusVec(vec1: Vector, vec2: Vector): Vector = Vectors.dense(vec1.toArray.zip(vec2.toArray).map { case (v1, v2) => v1 - v2 })
def multiplyVec(vec1: Vector, vec2: Vector): Double = vec1.toArray.zip(vec2.toArray).map { case (v1, v2) => v1 * v2 }.sum
def analogyVec(vec1: Vector, vec2: Vector, vec3: Vector) = plusVec(minusVec(vec1, vec2), vec3)
def analogyWord(word1: String, word2: String, word3: String) = plusVec(minusVec(model.transform(word1), model.transform(word2)), model.transform(word3))
def toVector(a: Array[Float]): Vector = Vectors.dense(a.map(_.toDouble))
def vectorSynonyms(vector: Vector, num: Int = 10): List[(String, Double)] = {
  // Dot product against every word vector (= cosine similarity, since the vectors are normalized), top `num`
  model.getVectors.map { case (k, array) => k -> multiplyVec(vector, toVector(array)) }.toList.sortBy(_._2).reverse.slice(0, num)
}
def wordSynonyms(word: String, num: Int = 10): List[(String, Double)] = {
  vectorSynonyms(model.transform(word), num)
}
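As an aside, spark.mllib's Word2VecModel also has a built-in findSynonyms method, which should give roughly the same result as the hand-rolled wordSynonyms above (except that findSynonyms excludes the query word itself); vectorSynonyms is kept here because it is reused for the analogy vectors in problem 89. A minimal usage sketch:

// Roughly equivalent to wordSynonyms("England", 10), using the built-in API
model.findSynonyms("England", 10).foreach(println)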
Output result
86.Display word vector
-0.05195711553096771 -0.02188839577138424 -0.02766110934317112 ...
87.Word similarity
0.7993002007168585
88.10 words with high similarity
(England,0.999999995906603)
(Scotland,0.8266511212525927)
(Wales,0.8146345041068417)
(London,0.7710435879598873)
(Australia,0.7684126888668479)
(Ireland,0.7508965993753893)
(Hampshire,0.7350064189984341)
(Lancashire,0.7295800707042573)
(Yorkshire,0.7289047527357796)
(Sydney,0.7255715511987988)
89.Analogy by additive construct
Greece
Perfect!
Download the evaluation data for the word analogy task. In this data, a line starting with ":" indicates a section name; for example, the line ": capital-common-countries" marks the beginning of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation cases included in the section "family" and save them to a file.
The word-analogy evaluation data linked from the problem statement seems to be broken, so I downloaded it from here instead (the URL in the code below).
Model.scala
def analogies: List[List[String]] = {
  val URL = "https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt"
  val ANALOGY_DATA = "/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/analogy.txt"
  if (new File(ANALOGY_DATA).exists) {
    Source.fromFile(ANALOGY_DATA).getLines().toList.map(_.split(" ").toList)
  } else {
    // Cut out only the family section
    var families = Source.fromURL(URL).getLines.toList
    families = families.slice(families.indexOf(": family") + 1, families.length)
    families = families.slice(0, families.indexWhere(_.startsWith(": ")))
    // Write it out to a file
    val file = new PrintWriter(ANALOGY_DATA)
    file.write(families.mkString("\n"))
    file.close()
    families.map(_.split(" ").toList)
  }
}
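Each extracted case is four space-separated words read as "A is to B as C is to D"; the family section consists of pairs such as the boy/girl, brother/sister, and father/mother ones mentioned under problem 93 below, e.g. a line like:

boy girl brother sister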
For each case in the evaluation data created in problem 91, compute vec(word in the second column) - vec(word in the first column) + vec(word in the third column), and find the word whose vector has the highest similarity to that vector, together with its similarity.
Append the obtained word and similarity to the end of each case. Apply this program to the word vectors created in 85 and the word vectors created in 90.
I record correct (true) / incorrect (false) answers in result.
Main.scala
var result = List[Boolean]()
analogies.foreach { words =>
  try {
    val actualAnswer = vectorSynonyms(analogyWord(words.head, words(1), words(2)), 1).head._1
    println("%s\t-\t%s\t+\t%s\t=\t%s\t%s".format(words.head, words(1), words(2), words(3), actualAnswer))
    result :+= (words(3) == actualAnswer)
  } catch {
    case e: IllegalStateException => () // skip words that are not in the model's vocabulary
  }
}
Using the data created in problem 92, find the accuracy of each model on the analogy task.
Main.scala
println(result.count(x => x) / result.length.toDouble)
Correct answer rate
0.020512820512820513
The model hasn't learned this well... It seems it can't capture gender differences such as boy/girl, father/mother, and brother/sister, so we end up with boy - girl + brother = brother.
Read the evaluation data of The WordSimilarity-353 Test Collection and create a program that computes the similarity between the words in the first and second columns and appends the similarity value to the end of each line. Apply this program to the word vectors created in 85 and the word vectors created in 90.
Main.scala
val (human: List[Double], machine: List[Double]) = wordSimilarity(WORD_SIMILARITY_COMBINED).unzip
Model.scala
def wordSimilarity(fileName: String): List[(Double, Double)] = {
  Source.fromFile(fileName).getLines.map { line =>
    try {
      val words = line.split("\t")
      // (human similarity judgement, model similarity)
      List(words(2).toDouble, multiplyVec(model.transform(words(0)), model.transform(words(1))))
    } catch {
      case e: IllegalStateException => Nil
      case e: NumberFormatException => Nil
    }
  }.filter(_.nonEmpty).toList.map(x => (x.head, x.last))
}
Using the data created in problem 94, calculate the Spearman correlation coefficient between the similarity ranking output by each model and the ranking of human similarity judgments.
Main.scala
val diff = rank(human).zip(rank(machine)) // convert the two similarity lists into rank lists and pair them up
  .map(x => math.pow(x._1 - x._2, 2))     // square the difference of each pair of ranks
println(spearman(diff))
Model.scala
// Convert a list of similarities to a list of ranks
def rank(words: List[Double]): List[Int] = {
  val ranking = words.sorted.zipWithIndex.map(x => (x._1, x._2 + 1)).toMap
  words.map(ranking)
}
// Spearman's rank correlation coefficient. Ties are not given averaged ranks as in the textbook
// definition, so the result differs slightly (e.g. a: 4.5, b: 4.5 ends up as a: 4, b: 5).
def spearman(diff: List[Double]) = 1 - (6 * diff.sum) / (math.pow(diff.length, 3) - diff.length)
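For reference, what spearman computes is the standard (tie-free) form of Spearman's rank correlation coefficient, where d_i is the difference between the two ranks of case i and n is the number of cases:

ρ = 1 - 6 Σ d_i² / (n(n² - 1)) = 1 - 6 Σ d_i² / (n³ - n)

which is exactly 1 - (6 * diff.sum) / (math.pow(diff.length, 3) - diff.length) above.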
result
combined: 0.39186961031541434 (Low correlation)
set1: 0.2655867166143566 (Low correlation)
set2: 0.4190924041068492 (Correlated)
Extract only the vectors for country names from the word2vec learning results.
Main.scala
// Handle compound words (e.g. "United States" => "United_States")
val countryNames = Source.fromFile(COMBINED_WORDS).getLines.map(line => line.replace(" ", "_")).toList
val countryVectors: Map[String, Vector] = model.getVectors
  .filter(x => countryNames.indexOf(x._1) >= 0) // keep only country names
  .map { case (key, array) => key -> Vectors.dense(array.map(y => y.toDouble)) } // convert String -> Array[Float] entries to String -> Vector
Execute k-means clustering on the word vectors from problem 96 with the number of clusters k = 5.
Main.scala
val countryRdd: RDD[Vector] = sc.parallelize(countryVectors.values.toList) // convert the Map values to an RDD
val clusters = if (new File(COUNTRY_CLUSTERS).exists) {
  // Load the k-means clustering model if it has already been saved to a file
  KMeansModel.load(sc, COUNTRY_CLUSTERS)
} else {
  // Otherwise train a new k-means clustering model and save it
  val clusters: KMeansModel = KMeans.train(countryRdd, 5, 100)
  clusters.save(sc, COUNTRY_CLUSTERS)
  clusters
}
// Check that the clustering has 5 center points
clusters.clusterCenters.foreach(println)
// Check which cluster each country belongs to
countryVectors.keys.zip(clusters.predict(countryRdd).collect)
  .toList.sortBy(_._2).foreach(println)
Sample output
[0.04466799285174126,0.04456286245424832,-0.01976845185050652, ...
[0.03334749694396224,0.015676170529332012,-0.03916260437108576, ...
[-0.014139431890928082,-0.0038628893671557307,-0.04137489525601268, ...
[0.03492516125058473,0.024117531810506163,-0.029571880074923465, ...
[0.043189115822315216,0.02963972231373191,-0.03933139890432358, ...
(Morocco,0)
(Macedonia,0)
Omission
(Sudan,0)
(Chile,1)
(Indonesia,1)
Omission
(Czech_Republic,1)
(Jordan,2)
(Jersey,2)
Omission
(Bermuda,2)
(Lebanon,3)
(France,3)
Omission
(Denmark,3)
(India,4)
(Pakistan,4)
Omission
(China,4)
- 98. Ward's method clustering: Perform hierarchical clustering by Ward's method on the word vectors from problem 96. Furthermore, visualize the clustering result as a dendrogram.
- 99. Visualization by t-SNE: Visualize the vector space of the word vectors from problem 96 with t-SNE.
It seems that Spark's spark.mllib does not support Ward's method (Clustering - spark.mllib). There are only two problems left, and it's a hassle to find another library. Apparently there is Weka, which runs on the JVM, but I'll pass on it because it doesn't feel like something from the last decade. So I'll do these with scipy, following the Kitanosaka memorandum that I always rely on.
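One step the Scala snippets above don't show is writing out COUNTRY_VECTORS (the country_vectors.txt that the Python script below reads). Presumably the country vectors from problem 96 were dumped one per line as "name value1 ... value300"; a minimal sketch of that export, under that assumption and using the PrintWriter already imported in Main.scala, might look like:

// Hypothetical export step (not shown in the original code): one "name v1 v2 ... v300" line per country
val writer = new PrintWriter(COUNTRY_VECTORS)
countryVectors.foreach { case (name, vector) =>
  writer.println((name +: vector.toArray.map(_.toString)).mkString(" "))
}
writer.close()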
Main.py
from scipy.cluster.hierarchy import ward, dendrogram
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
# Restore the word vectors from the file
# (each line has the country name in the first column and the 300 values of the vector
#  representing that country in columns 2 to 301, separated by spaces)
COUNTRY_VECTORS = '/Users/inaba/Dropbox/NLP100-Spark/src/main/resources/country_vectors.txt'
with open(COUNTRY_VECTORS) as input_handler:
    lines = [line.split() for line in input_handler]
country_names = [line[0] for line in lines]
country_vectors = [[float(value) for value in line[1:]] for line in lines]

# Clustering with Ward's method
ward_result = ward(country_vectors)
# Display the result as a dendrogram
dendrogram(ward_result, labels=country_names)
plt.show()

# t-SNE
t_sne_result = TSNE().fit_transform(country_vectors)
# Display the 2-D embedding as a scatter plot labelled with country names
fig, ax = plt.subplots()
ax.scatter(t_sne_result[:, 0], t_sne_result[:, 1])
for index, label in enumerate(country_names):
    ax.annotate(label, xy=(t_sne_result[index, 0], t_sne_result[index, 1]))
plt.show()
Easy. But I can hardly read the country names... Since I can barely read them, the result isn't very convincing; here let's just confirm that scipy.cluster.hierarchy and sklearn.manifold can be used for Ward clustering, dendrogram visualization, and t-SNE visualization.
Finally, I'll briefly summarize what I learned throughout the series. These are just the impressions of a beginner who has dabbled a little in language processing, so don't take them as authoritative explanations. (Corrections and comments are welcome.)
- Even seemingly difficult machine learning and statistics terms are not scary if you focus on their purpose and their inputs/outputs and treat them as black-box tools.
- Spark is indispensable for ETL processing of big data, but Python-based libraries offer a richer set of machine learning algorithms.
- Using a library for machine learning and natural language processing is not difficult.
- But the distance between "managing to use a library" and "learning and analyzing with high accuracy" is endless... (What do you need to do to close that distance?)
- It is difficult to test whether the analysis results obtained from machine learning are right or wrong. You cannot check that the expected result and the actual result match exactly, as you would in a normal(?) application. In the end, the validity of a learning result can only be judged by whether it is useful, and to judge whether it is useful, the purpose of the analysis must be clear. This chapter is fine because it is practice, but if you just cluster things vaguely with no purpose, you end up with a "So what?" result.