[PYTHON] I checked the image of Science University on Twitter with Word2Vec.

To do

--Get Tweets with Twitter API
--Add words to MeCab
--Use Word2Vec

Get Tweets with Twitter API

I used a Ruby script of my own.

# gem install twitter
require "twitter"

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = ""
  config.consumer_secret     = ""
  config.access_token        = ""
  config.access_token_secret = ""
end

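# Search for matching tweets and take at most 10,000 of them
@result = client.search("Science University").take(10000)

# Write one tweet per line, with all whitespace (including line breaks) removed
File.open("tus.csv", 'w') do |file|
  @result.each do |tweet|
    file.puts(tweet.text.gsub(/\s/, ""))
  end
end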

This produces tus.csv with one tweet per line. There are already many articles on how to obtain the API tokens, so I will skip that here.

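Add words to MeCab

Register "Science University" as a noun in a MeCab user dictionary so that it is treated as a single word when the tweets are split.
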
cd /usr/local/lib/mecab/dic
mkdir userdic
cd userdic
touch tus.csv
echo 'Science University,,,1,noun,General,*,*,*,*,Rikadai,Rikadai,Rikadai' >> tus.csv

#compile
/usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index \
-d /usr/local/lib/mecab/dic/ipadic \
-u tus.dic \
-f utf-8 \
-t utf-8 tus.csv

# Press Enter to run the compile. The output looks like this:

reading tus.csv ... 1
emitting double-array: 100% |###########################################| 

done!

# Register the user dictionary with MeCab

vi /usr/local/etc/mecabrc
# Add this line, pointing at the generated dictionary:
userdic = /usr/local/lib/mecab/dic/userdic/tus.dic
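
For reference, the row written by the echo command above follows the 13-column layout of the IPA dictionary: surface form, left and right context IDs, cost, six part-of-speech and conjugation columns, then base form, reading, and pronunciation. The following is a minimal sketch that builds the same row in Python. The helper name `userdic_row` and its parameter names are my own labels for the columns, not part of MeCab, and the sketch assumes no field contains a comma.

```python
def userdic_row(surface, base, reading, pronunciation,
                pos="noun", pos_detail="General", cost=1):
    """Build one 13-column user-dictionary row (hypothetical helper)."""
    fields = [
        surface,        # surface form as it appears in the text
        "",             # left context ID (left empty, as in the echo line)
        "",             # right context ID (left empty, as in the echo line)
        str(cost),      # word cost; a lower cost makes the word more likely to be chosen
        pos,            # part of speech
        pos_detail,     # part-of-speech subcategory 1
        "*", "*",       # part-of-speech subcategories 2 and 3 (unused here)
        "*", "*",       # conjugation type and conjugation form (unused for nouns)
        base,           # base form
        reading,        # reading
        pronunciation,  # pronunciation
    ]
    # Plain join: this sketch assumes no field contains a comma
    return ",".join(fields)


row = userdic_row("Science University", "Rikadai", "Rikadai", "Rikadai")
print(row)
```

Appending the printed row to tus.csv should be equivalent to the echo line above, as long as the values are the same.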

Use Word2Vec

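# coding: UTF-8
import MeCab

# tus.csv has one tweet per line and no header row, so read it as plain text
with open('/Users/Hiroto/git/scripts/tus.csv', encoding='utf-8') as f:
    tweets = [line.strip() for line in f if line.strip()]

# Create a word-separated file
mt = MeCab.Tagger("-Owakati")  # create the tagger once instead of once per tweet
with open('tus_wakati.txt', 'w', encoding='utf-8') as f:
    for tweet in tweets:
        # parse() returns the space-separated words followed by a line break
        f.write(mt.parse(tweet))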

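# word2vec
from gensim.models import word2vec

# Read the word-separated file as a stream of tokens
data = word2vec.Text8Corpus('tus_wakati.txt')
# Train 100-dimensional word vectors
# (gensim 4.0 and later renamed the size argument to vector_size)
model = word2vec.Word2Vec(data, size=100)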
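
As far as I know, Text8Corpus was written for the text8 dataset, which is one long stream of words, so it cuts the input into fixed-size chunks rather than at line breaks and the boundaries between tweets are lost. Since tus_wakati.txt holds one tweet per line, an alternative is to hand Word2Vec one token list per tweet. A minimal sketch of that idea follows; `tweets_to_sentences` is a hypothetical helper name and the sample lines are made up.

```python
def tweets_to_sentences(lines):
    """Turn word-separated lines into one token list per tweet."""
    sentences = []
    for line in lines:
        tokens = line.split()  # MeCab -Owakati separates words with spaces
        if tokens:             # skip empty lines
            sentences.append(tokens)
    return sentences


# Made-up sample standing in for the contents of tus_wakati.txt
sample = ["Science University is dark \n", "\n", "zebra secret meeting\n"]
sentences = tweets_to_sentences(sample)
print(sentences)
```

The resulting list could be passed to word2vec.Word2Vec in place of the Text8Corpus object. gensim also ships word2vec.LineSentence, which reads a one-sentence-per-line file directly.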

Similarity of the subject

# On gensim 4.0 and later, call model.wv.most_similar instead
out = model.most_similar(positive=[u'Science University'], topn=100)
for x in out:
    print(x[0], x[1])

Word, degree of similarity:
Ne 0.9801737666130066
U 0.9679325222969055
world 0.9637500643730164
inequality 0.9604602456092834
Yeah 0.9603763818740845
So 0.9602923393249512
is 0.9574853181838989
That kind of 0.9568058252334595
Lol 0.9534944295883179
darkness 0.9462004899978638
0.9435620307922363
0.9433774948120117
Raw 0.942541241645813
From 0.9420970678329468
Good 0.9348764419555664
Yo 0.9348678588867188
0.9291704893112183
Feeling 0.929074764251709
Me 0.9288586378097534
together 0.9273968935012817
Twitter 0.9265207052230835
Is 0.9249017238616943
Secret meeting 0.9227114915847778
Teru 0.9216452836990356
To go 0.9207674264907837
God 0.9192628264427185
Good luck 0.918117880821228
Ah ~ 0.9180813431739807
Disagreeable 0.9164369106292725
reason 0.9164099097251892
Waka 0.9158462882041931
Understood 0.915264368057251
) 0.913904070854187
Is 0.9111155867576599
Delicious 0.9105844497680664
Nana 0.9098367691040039
Man 0.909660816192627
Shit 0.9095121622085571
so 0.907973051071167
If 0.906628429889679
meaning 0.9065468311309814
Sophia 0.905195415019989
Or 0.9034873247146606
Guy 0.9014643430709839
Go 0.8999437689781189
What 0.8993074893951416
Drink 0.8984052538871765
march 0.8983776569366455
Say 0.8976813554763794
Ta 0.8964160680770874
Often 0.896243691444397
eat 0.8960259556770325
want to see 0.8957585096359253
Child 0.8946411609649658
nice to meet you 0.8943185806274414
Want 0.8941484689712524
Stunning 0.893967866897583
zebra 0.8935203552246094
Too 0.8934850692749023
you 0.8934849500656128
illumination 0.8927890062332153
go 0.8927274942398071
Ichi 0.8926646709442139
Is 0.8919773697853088
arithmetic 0.8915943503379822
( 0.8915064930915833
why 0.8907312154769897
Humanities 0.8906354904174805
Hmm 0.8897289037704468
- 0.8896894454956055
Yeah 0.8896220922470093
Department 0.8895649313926697
K 0.8881763219833374
Thoughts 0.8881138563156128
I don't know 0.8880779147148132
school 0.8879990577697754
But 0.8878818154335022
Incident 0.8878498077392578
Please 0.8875197172164917
Know 0.8871732354164124
Iwa 0.8870071172714233
Personality 0.8869134187698364
Hey 0.8867558240890503
Soukei 0.8866025805473328
I'd love to 0.8860080242156982
I wonder 0.8857483267784119
But 0.8853344321250916
Stop 0.8850265145301819
age 0.8849031925201416
k 0.884624719619751
which one 0.8840593695640564
Or 0.8840340971946716
Live 0.883965253829956
Discount 0.8836942911148071
By all means 0.8836302757263184
Crying 0.8831743597984314
yumalaonvae 0.883036196231842
o 0.8830046653747559
Note 0.8829131126403809
why 0.8827589154243469

**Inequality** and **darkness** do sound like Science University. But what are "secret meeting" and "zebra" doing here ...?
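
The scores in the list above are cosine similarities between word vectors, which is what most_similar reports. Here is a small sketch of that calculation in plain Python. The 3-dimensional vectors are toy values made up for illustration and are not taken from the trained model, which uses 100 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, not taken from the trained model
target = [1.0, 2.0, 0.0]
close = [2.0, 4.0, 0.1]   # points in almost the same direction as target
far = [-2.0, 1.0, 0.0]    # orthogonal to target

print(cosine_similarity(target, close))  # just below 1
print(cosine_similarity(target, far))    # 0.0
```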

Summary

--It probably isn't working well because I didn't remove the noise from the tweets.
--The number of tweets acquired is small (1696 this time).
--Even though the Ruby code asks for 10,000, only 1696 tweets came back, probably because the standard search API only reaches back about a week.
--I was hoping words like "professional" or "unit" would come up.
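
The noise mentioned in the first point could be stripped before the tweets are split into words. Below is a minimal sketch using regular expressions. Which patterns count as noise (the leading RT marker, URLs, @mentions, hashtags) is my assumption and not something this post specifies, and `clean_tweet` is a hypothetical name. It also assumes the raw tweet text, before the Ruby script's gsub removes whitespace, because URLs and mentions are delimited by whitespace.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@[A-Za-z0-9_]+:?")
HASHTAG = re.compile(r"#\S+")
RETWEET = re.compile(r"^RT\s+")


def clean_tweet(text):
    """Remove the retweet marker, URLs, mentions and hashtags, then squeeze spaces."""
    text = RETWEET.sub("", text)
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub("", text)
    return " ".join(text.split())


# Made-up sample tweets
raw = "RT @someone: Science University is dark https://example.com/x #tus"
tweets = [raw, "Science University is dark", "zebra"]

# dict.fromkeys drops duplicates (such as retweets) while keeping the order
cleaned = list(dict.fromkeys(clean_tweet(t) for t in tweets))
print(cleaned)
```

Dropping duplicates after cleaning means a retweet and its original count once, so a heavily retweeted post does not dominate the small corpus.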
