[PYTHON] Program for Twitter Trend Analysis (Personal Note)

Hello. I am currently studying Twitter trends ** as the title suggests in my second year student at Komazawa University ** GMS Faculty **. ** This article introduces research, code, and potential references.

Originally, I was interested in Twitter trends, and since I was in the first grade, I have been using Python, Twitter API, and MeCab, but it was ** a primitive thing that morphologically analyzes and aggregates word by word **. Besides, I was playing with language and location information and appearing kanji () ↓ Then, in the same way as N-gram, for example, we recorded every 2-12 vocabulary and aggregated all ** simple trend analysis **. As a side note, I made a lot of manual rules about where the particles don't come and what the auxiliary verbs are, as in the Twitter trend. This is the beginning of the second grade ↓ After that, when it came to what to study, I had the idea of defining and modeling a stationary trend that fluctuates throughout the day, but I was wondering what it would be like. Then, a teacher at a certain university pointed out that the points and rules were in the hands of others. It's about the summer of the second grade ↓ I was wandering a little, but https://pj.ninjal.ac.jp/corpus_center/goihyo.html ** I arrived at the National Institute for Japanese Language and Language Unidic, Classification Vocabulary, Semantic Classification **, and tried to do semantic normalization and clustering of words. It was quite good, but the correspondence table of multiple dictionaries? It was supposed to be a homonym when reverse lookup, but the meaning was referred to by an extreme person, so some people got involved, but the problem is I don't think so. Also, there is an unknown word in Unidic, and there is a problem that the trend is not taken into consideration. Is it about October in the second grade? ↓ When I was able to normalize the meaning, I used CaboCha because I had been thinking from the previous stage that I wanted to use the dependency to create a trend with the dependency of the meaning. ** This was a hassle and I couldn't do it on the seminar server or AWS, so I had a hard time installing it on my Windows and binding it with subprocess (since Python has only 32bit). Since the position to separate Unidic and Cabocha (Can Cabocha also be specified in the dictionary?) Was different, we matched them in the intersection in the way of the least common multiple. ↓ As a result, we were able to create a semantic trend based on the dependency of meaning. ** However, since it says "It should be a homonym, but the meaning is referred to by an extreme person", I made a correction, and that is "Hit in the campaign! There is a method called "follow and tweet", and in "simple trend analysis", there was a method to delete using duplication of expressed trends, but I decided to use that as well. ↓ So far, we have been able to analyze "meaning trends based on the dependency of meaning" and "simple expression trends", and now I think that meaning trends are difficult to separate, unlike expressions. I mean, it's a deep psychology or something big, and I think it's important to make it concrete in this research. Independent? Aggregating by semantic trend accompanying the expression trend, ** estimating the expression trend from the meaning trend, how much both can be used properly, how to combine ** are future issues. Now in November of my second year, I am doing my best for my graduation research. I wrote it because I had a paragraph. (Maybe I should have written it earlier)

Nowadays, the expression trend is a list of characters, and the meaning trend is a list of meanings in which characters are replaced with meanings.

** If you want to use it, please let me know even on Twitter. I was worried about that. Please assume that you own the copyright. ** ** I also wanted to write it in a backup sense. The environment is Windows 10-64bit Python 3 (Miscellaneous) https://twitter.com/kenkensz9

First, it is a program for estimating the expression trend. custam_freq_sentece.txt is the full text obtained for parsing custam_freq_tue.txt is a trend candidate. custam_freq.txt is a trend. custam_freq_new.txt is the program that puts out the longest trend by excluding duplicates from the trend. In addition, trends are changed every hour. The part of freshtime = int (time.time () * 1000)-200000. This should be changed according to the speed of tweet acquisition. Also, there is a word list badword that will not be processed if this word is included in the tweet in the first place, and this may not be the version of the program, but management is poor, so please feel free to use it.

hyousyutu_trend.py



# coding: utf-8
import tweepy
import datetime
import re
import itertools
import collections
from pytz import timezone
import time
import MeCab
#import threading
#from multiprocessing import Pool
import os
#import multiprocessing
import concurrent.futures
import urllib.parse
#import
import pdb; pdb.set_trace()
import gc
import sys
import emoji

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
authapp = tweepy.AppAuthHandler(consumer_key,consumer_secret)
apiapp = tweepy.API(authapp)
#Authentication

m_owaka = MeCab.Tagger("-Owakati")
m_ocha = MeCab.Tagger("-Ochasen")
#mecab morpheme decomposition definition


lang_dict="{'en': 'English', 'und': 'unknown', 'is': 'Icelandic', 'ay': 'Aymara', 'ga': 'Irish', 'az': 'Azerbaigen', 'as': 'Assamese', 'aa': 'Afar', 'ab': 'Aphazian', 'af': 'Afrikaans', 'am': 'Amharic', 'ar': 'Arabic', 'sq': 'Albanian', 'hy': 'Armenian', 'it': 'Italian', 'yi': 'Yiddish', 'iu': 'Inuktitut', 'ik': 'Inupia', 'ia': 'Interlingua', 'ie': 'Interlingue', 'in': 'Indonesian', 'ug': 'Uyghur', 'cy': 'Welsh', 'vo': 'Volapuk', 'wo': 'Wolof', 'uk': 'Ukrainian', 'uz': 'Uzbek', 'ur': 'Urdu', 'et': 'Estonian', 'eo': 'Esperanto', 'or': 'Orian', 'oc': 'Okitan', 'nl': 'Dutch', 'om': 'Oromo', 'kk': 'Kazakh', 'ks': 'Kashmir', 'ca': 'Catalan', 'gl': 'Galician', 'ko': 'Korean', 'kn': 'Kannada', 'km': 'Cambodian', 'rw': 'Kyawanda', 'el': 'Greek language', 'ky': 'Kyrgyz', 'rn': 'Kirundi', 'gn': 'Guarani', 'qu': 'Quetua', 'gu': 'Gujarati', 'kl': 'Greenlandic', 'ku': 'Kurdish', 'ckb': '中央Kurdish', 'hr': 'Croatian', 'gd': 'Gaelic', 'gv': 'Gaelic', 'xh': 'Xhosa', 'co': 'Corsican', 'sm': 'Samoan', 'sg': 'Sangho', 'sa': 'Sanskrit', 'ss': 'Swati', 'jv': 'Javanese', 'ka': 'Georgian', 'sn': 'Shona', 'sd': 'Sindhi', 'si': 'Sinhala', 'sv': 'Swedish', 'su': 'Sudanese', 'zu': 'Zulu', 'es': 'Spanish', 'sk': 'Slovak', 'sl': 'Slovenian', 'sw': 'Swahili', 'tn': 'Setswana', 'st': 'Seto', 'sr': 'Serbian', 'sh': 'セルボCroatian', 'so': 'Somali', 'th': 'Thai', 'tl': 'Tagalog', 'tg': 'Tajik', 'tt': 'Tatar', 'ta': 'Tamil', 'cs': 'Czech language', 'ti': 'Tigrinya', 'bo': 'Tibetan', 'zh': 'Chinese', 'ts': 'Zonga', 'te': 'Telugu', 'da': 'Danish', 'de': 'German', 'tw': 'Twi', 'tk': 'Turkmen', 'tr': 'Turkish', 'to': 'Tongan', 'na': 'Nauruan', 'ja': 'Japanese', 'ne': 'Nepali', 'no': 'Norwegian', 'ht': 'Haitian', 'ha': 'Hausa', 'be': 'White Russian', 'ba': 'Bashkir', 'ps': 'Pasito', 'eu': 'Basque', 'hu': 'Hungarian', 'pa': 'Punjabi', 'bi': 'Bislama', 'bh': 'Bihari', 'my': 'Burmese', 'hi': 'Hindi', 'fj': 'Fijian', 'fi': 'Finnish', 'dz': 'Bhutanese', 'fo': 'Faroese', 'fr': 'French', 'fy': 'Frisian', 'bg': 'Bulgarian', 'br': 'Breton', 'vi': 'Vietnamese', 'iw': 'Hebrew', 'fa': 'Persian', 'bn': 'Bengali', 'pl': 'Polish language', 'pt': 'Portuguese', 'mi': 'Maori', 'mk': 'Macedonian', 'mg': 'Malagasy', 'mr': 'Malata', 'ml': 'Malayalam', 'mt': 'Maltese', 'ms': 'Malay', 'mo': 'Moldavian', 'mn': 'Mongolian', 'yo': 'Yoruba', 'lo': 'Laota', 'la': 'Latin', 'lv': 'Latvian', 'lt': 'Lithuanian', 'ln': 'Lingala', 'li': 'Limburgish', 'ro': 'Romanian', 'rm': 'Rate romance', 'ru': 'Russian'}"
lang_dict=eval(lang_dict)
lang_dict_inv = {v:k for k, v in lang_dict.items()}
#Language dictionary


all=[]
#List initialization
if os.path.exists('custam_freq_tue.txt'):
  alll=open("custam_freq_tue.txt","r",encoding="utf-8-sig")
  alll=alll.read()
  all=eval(alll)
  del alll
#all=[]
#Ready to export

#freq_write=open("custam_freq.txt","w",encoding="utf-8-sig")
sent_write=open("custam_freq_sentece.txt","a",encoding="utf-8-sig", errors='ignore')
#Ready to export

use_lang=["Japanese"]
use_type=["tweet"]
#config

uselang=""
for k in use_lang:
 k_key=lang_dict_inv[k]
 uselang=uselang+" lang:"+k_key
#config preparation


def inita(f,k):
  suball=[]
  small=[]
  for s in k:
    if not int(f)==int(s[1]):
     #print("------",f)
     suball.append(small)
     small=[]
    #print(s[0],s[1])
    small.append(s)
    f=s[1]
  suball.append(small)
  #If 2 is included
  return suball


def notwo(al):
 micro=[]
 final=[]
 kaburilist=[]
 for fg in al:
  kaburilist=[] 
  if len(fg)>1:
   for v in itertools.combinations(fg, 2):
    micro=[]
    for s in v:
     micro.append(s[0])
    micro=sorted(micro,key=len,reverse=False)
    kaburi=len(set(micro[0]) & set(micro[1]))
    per=kaburi*100//len(micro[1])
    #print(s[1],per,kaburi,len(micro[0]),len(micro[1]),"m",micro)
    if per>50:
     kaburilist.append(micro[0])
     kaburilist.append(micro[1])
    else:
     final.append([micro[0],s[1]])
     #print("fin1",micro[0],s[1])
    if micro[0] in micro[1]:
     pass
     #print(micro[0],micro[1])
     #print("included"*5)
     #if micro[0] in kaburilist:
     # kaburilist.remove(micro[0])
  else:
   pass
   #print(fg[0][1],fg[0][0])
   final.append([fg[0][0],fg[0][1]])
   #print("fin3",fg[0][0],fg[0][1])
  #if kaburilist:
   #longword=max(kaburilist,key=len)
   #final.append([longword,s[1]])
   ##print("fin2",longword,s[1])
   #kaburilist.remove(longword)
   #kaburilist=list(set(kaburilist))
   #for k in kaburilist:
   # if k in final:
   #  final.remove(k)
   #  #print("finremove1",k)
 return final

def siage(fin):
 fin=list(map(list, set(map(tuple, fin))))
 finallen = sorted(fin, key=lambda x: len(x[0]))
 finallendic=dict(finallen)
 finalword=[]
 for f in finallen:
  finalword.append(f[0])
 #print("f1",finalword)

 notwo=[]
 for v in itertools.combinations(finalword, 2):
  #print(v)
  if v[0] in v[1]:
   #print("in")
   if v[0] in finalword:
    finalword.remove(v[0])

 #print("f2",finalword)
 finall=[]
 for f in finalword:
  finall.append([f,finallendic[f]])
 finall = sorted(finall, key=lambda x: int(x[1]), reverse=True)
 #print("final",finall)
 kk=open("custam_freq_new.txt", 'w', errors='ignore')
 kk.write(str(finall))
 kk.close()


def eval_pattern(use_t):
 tw=0
 rp=0
 rt=0
 if "tweet" in use_t:
  tw=1
 if "retweet" in use_t:
  rt=1
 if "reply" in use_t:
  rp=1
 sword=""
 if tw==1:
  sword="filter:safe OR -filter:safe"
  if rp==0:
   sword=sword+" exclude:replies"
  if rt==0:
   sword=sword+" exclude:retweets"
 elif tw==0:
  if rp==1 and rt ==1:
   sword="filter:reply OR filter:retweets"
  elif rp==0 and rt ==0:
   print("NO")
   sys.exit()
  elif rt==1:
   sword="filter:retweets"
  elif rp==1:
   sword="filter:replies"
 return sword
pat=eval_pattern(use_type)+" "+uselang

#config read function and execution

def a(n):
 return n+1
def f(k):
 k = list(map(a, k))
 return k
def g(n,m):
 b=[]
 for _ in range(n):
  m=f(m)
  b.append(m)
 return b
#Serial number list generation

def validate(text):
    if re.search(r'(.)\1{1,}', text):
     return False
    elif re.search(r'(..)\1{1,}', text):
     return False
    elif re.search(r'(...)\1{1,}', text):
     return False
    elif re.search(r'(...)\1{1,}', text):
     return False
    elif re.search(r'(....)\1{1,}', text):
     return False
    else:
     return True
#Function to check for duplicates

def eval_what_nosp(c,i):
   no_term=[]
   no_start=[]
   no_in=[]
   koyu_meisi=[]
   if re.findall(r"[「」、。)(『』&@_;【/<>,!】\/@]", c[0]):
     no_term.append(i)
     no_start.append(i)
     no_in.append(i)
   if len(c) == 4:
    if "suffix" in c[3]:
     no_start.append(i)
    if "Proper noun" in c[3]:
     koyu_meisi.append(i)
    if c[3]=="noun-Non-independent-General":
     no_term.append(i)
     no_start.append(i)
     no_in.append(i)
    if "Particle" in c[3]:
     no_term.append(i)
     no_start.append(i)
     #no_in.append(i)
    if c[3]=="Particle-Attributive":
     no_start.append(i)
    if c[3]=="Particle":
     no_start.append(i)
    if "O" in c[2]:
     if c[3]=="noun-Change connection":
      no_term.append(i)
      no_start.append(i)
      no_in.append(i)
   if len(c) == 6:
    if c[4]=="Sahen Suru":
     no_start.append(i)
    if c[3]=="verb-Non-independent":
     no_start.append(i)
    if "suffix" in c[3]:
     no_start.append(i)
    if c[3]=="Auxiliary verb":
     if c[2]=="Ta":
      no_start.append(i)
      no_in.append(i)
    if c[3]=="Auxiliary verb":
     if c[2]=="Absent":
      no_start.append(i)
    if c[3]=="Auxiliary verb":
     if "Continuous use" in c[5]:
      no_term.append(i)
      no_start.append(i)
    if c[2]=="To do":
     if c[3]=="verb-Independence":
       if c[5]=="Continuous form":
        no_start.append(i)
        no_in.append(i)
    if c[2]=="Become":
     if c[3]=="verb-Independence":
      no_start.append(i)
      no_in.append(i)
    if c[2]=="Teru":
     if c[3]=="verb-Non-independent":
      no_start.append(i)
      no_in.append(i)
    if c[2]=="is":
     if c[3]=="Auxiliary verb":
      no_start.append(i)
      no_in.append(i)
    if c[2]=="Chau":
     if c[3]=="verb-Non-independent":
      no_start.append(i)
      no_in.append(i)
    if c[2]=="is there":
     if c[3]=="verb-Independence":
      no_term.append(i)
      no_start.append(i)
      no_in.append(i)
    if c[2]=="Auxiliary verb":
     if c[3]=="Special da":
      no_term.append(i)
      no_start.append(i)
      no_in.append(i)
    if c[2]=="Trout":
     if c[3]=="Auxiliary verb":
      no_term.append(i)
      no_start.append(i)
      no_in.append(i)
    if "Continuous use" in c[5]:
      no_term.append(i)
    if c[5]=="Word connection":
     no_start.append(i)
    if c[2]=="Give me":
     if c[3]=="verb-Non-independent":
      no_start.append(i)
      no_in.append(i)
   x=""
   y=""
   z=""
   koyu=""
   if no_term:
    x=no_term[0]
   if no_start:
    y=no_start[0]
   if no_in:
    z=no_in[0]
   if koyu_meisi:
    koyu=koyu_meisi[0]
    #print("koyu",koyu)
    koyu=int(koyu)
   return x,y,z,koyu


small=[]
nodouble=[]
seq=""
def process(ty,tw,un,tagg):
 global all
 global seq
 global small
 global nodouble
 tw=tw.replace("\n"," ")
 sent_write.write(str(tw))
 sent_write.write("\n")
 parselist=m_owaka.parse(tw)
 parsesplit=parselist.split()
 parseocha=m_ocha.parse(tw)
 l = [x.strip() for x in parseocha[0:len(parseocha)-5].split('\n')]
 nodouble=[]
 no_term=[]
 no_start=[]
 no_in=[]
 km_l=[]
 for i, block in enumerate(l):
  c=block.split('\t')
  #sent_write.write("\n")
  #sent_write.write(str(c))
  #sent_write.write("\n")
  #print(str(c))
  ha,hi,hu,km=eval_what_nosp(c,i)
  no_term.append(ha)
  no_start.append(hi)
  no_in.append(hu)
  km_l.append(km)
  #Completed writing
 if km_l[0]:
  for r in km_l:
   strin=parsesplit[r]
   if not strin in nodouble:
    all.append([strin,un])
    nodouble.append(strin)
 for s in range(2,8):
  #A chain of 2 to 8.
  #Important because you can improve accuracy instead of making it heavier
  num=g(len(parsesplit)-s+1,range(-1,s-1))
  for nr in num:
    #2 for one sentence-All streets of 8 chains
    #print(no_term)
   if not len(set(nr) & set(no_in)):
    if not nr[-1] in no_term:
     if not nr[0] in no_start:
      small=[]
      #print(str(parsesplit))
      for nr2 in nr:
       #print(nr2,parsesplit[nr2])
      #Add word to small at the position indexed by the array inside
       small.append(parsesplit[nr2])
      seq="".join(small)
      judge_whole=0
      bad_direct_word=["Like","\'mat","I\'mat"]
      #if "" in seq:
      # judge_whole=1
      #if "" in seq:
      # judge_whole=1
      for bd  in bad_direct_word:
       if seq==bd:
        judge_whole=1
        break
      parselist=m_owaka.parse(seq)
      l = [x.strip() for x in parseocha[0:len(parseocha)-5].split('\n')]
      for n in range(len(l)):
       if len(l[n].split("\t"))==6:
        if l[n].split("\t")[3]=="verb-Independence":
         if len(l[n+1].split("\t"))==6:
          if l[n+1].split("\t")[3]:
           judge_whole=1
           break
      if judge_whole==0:
       if validate(seq) and len(seq) > 3 and not re.findall(r'[「」、。『』/\\/@]', seq):
        if not  seq in nodouble:
         #Continuous avoidance
         all.append([seq,un])
         nodouble.append(seq)
         #print("Added successfully",seq)
         #Do not aggregate the same word twice
        else:
         #print("Already included",seq)
         pass
       else:
        #print("Exclusion",seq)
        pass
     else:
      #print("The beginning is no_is start",seq)
      pass
    else:
      #print("The end is no_term",seq)
      pass
    #print("\n")
 #print(parsesplit)
 #print(l)
 if tagg:
       print("tagg",tagg)
       for sta in tagg:
        all.append(["#"+str(sta),un])
 #Include tag

N=1
#Number of tweets acquired


def print_varsize():
    import types
    print("{}{: >15}{}{: >10}{}".format('|','Variable Name','|','  Size','|'))
    print(" -------------------------- ")
    for k, v in globals().items():
        if hasattr(v, 'size') and not k.startswith('_') and not isinstance(v,types.ModuleType):
            print("{}{: >15}{}{: >10}{}".format('|',k,'|',str(v.size),'|'))
        elif hasattr(v, '__len__') and not k.startswith('_') and not isinstance(v,types.ModuleType):
            print("{}{: >15}{}{: >10}{}".format('|',k,'|',str(len(v)),'|'))






def collect_count():
  global all
  global deadline
  hh=[]
  tueall=[]
  #print("alllll",all)
  freshtime=int(time.time()*1000)-200000
  deadline=-1
  #import pdb; pdb.set_trace()
  #print(N_time)
  print(len(N_time))
  for b in N_time:
   if int(b[1]) < freshtime:
     deadline=b[0]
  print("dead",deadline)
  dellist=[]
  if not deadline ==-1:
   for b in N_time:
    print("b",b)
    if int(b[0]) < int(deadline):
      dellist.append(b)
  for d in dellist:
   N_time.remove(d)
  #print(N_time)
  #import pdb; pdb.set_trace()
  #time.sleep(2)
  #import pdb; pdb.set_trace()
  for a in all:
   if int(a[1]) > freshtime:
    #Number of tweets you want to get/45*Subtract the value of 1000. Now 5000/45*1000=112000
    tueall.append(a[0])
    #print("tuealllappend"*10)
    #print(tueall)
   else:
    all.remove(a)
    #print("allremove",a)
  #import pdb; pdb.set_trace()
  c = collections.Counter(tueall)
  c=c.most_common()
  #print("c",c)
  #print(c)
  for r in c:
   if r and r[1]>1:
    hh.append([str(r[0]),str(r[1])])
  k=str(hh).replace("[]","")
  freq_write=open("custam_freq.txt","w",encoding="utf-8-sig", errors='ignore')
  freq_write.write(str(k))
  #import pdb; pdb.set_trace()
  oldunix=N_time[0][1]
  newunix=N_time[-1][1]
  dato=str(datetime.datetime.fromtimestamp(oldunix/1000)).replace(":","-")
  datn=str(datetime.datetime.fromtimestamp(newunix/1000)).replace(":","-")
  dato=dato.replace(" ","_")
  datn=datn.replace(" ","_")
  #print(dato,datn)
  #import pdb; pdb.set_trace()
  freq_writea=open("trenddata/custam_freq-"+dato+"-"+datn+"--"+str(len(N_time))+".txt","w",encoding="utf-8-sig", errors='ignore')
  freq_writea.write(str(k))
  #import pdb; pdb.set_trace()
  freq_write_tue=open("custam_freq_tue.txt","w",encoding="utf-8-sig", errors='ignore')
  freq_write_tue.write(str(all))
  #print(c)

def remove_emoji(src_str):
    return ''.join(c for c in src_str if c not in emoji.UNICODE_EMOJI)

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')

def get_tag(tw,text_content):
 taglist=[]
 entities=eval(str(tw.entities))["hashtags"]
 for e in entities:
  text=e["text"]
  taglist.append(text)
 for _ in range(len(taglist)+2):
  for s in taglist:
   text_content=re.sub(s,"",text_content)
   #text_content=re.sub(r"#(.+?)+ ","",text_content)
 return taglist,text_content

def get_time(id):
 two_raw=format(int(id),'016b').zfill(64)
 unixtime = int(two_raw[:-22],2) + 1288834974657
 unixtime_th = datetime.datetime.fromtimestamp(unixtime/1000)
 tim = str(unixtime_th).replace(" ","_")[:-3]
 return tim,unixtime

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), '')
N_time=[]
def gather(tweet,type,tweet_type,removed_text):
 global N
 global N_all
 global lagtime
 global all_time
 global all
 global auth
 global N_time
 if get_time(tweet.id):
  tim,unix=get_time(tweet.id)
 else:
  exit
 #Get detailed tweet time
 #original_text=tweet.text
 nowtime=time.time()
 tweet_pertime=str(round(N/(nowtime-all_time),1))
 lag=str(round(nowtime-unix/1000,1))
 #Calculate lag
 lang=lang_dict[tweet.lang]
 print(N_all,N,tweet_pertime,"/s","+"+lag,tim,type,tweet_type,lang)
 #Information display.(All tweets, processed tweets, processing speed, lag, real time, acquisition route, tweet type, language)
 print(removed_text.replace("\n"," "))
 taglist,tag_removed_text=get_tag(tweet,removed_text)
 #import pdb; pdb.set_trace()
 #print(type(tweet))
 #import pdb; pdb.set_trace()
 #Exclude tags
 noemoji=remove_emoji(tag_removed_text)
 try:
  process(tweet_type,tag_removed_text,unix,taglist)
  N_time.append([N,unix])
  print("trt",tag_removed_text)
 except Exception as pe:
   print("process error")
   print(pe)
   #import pdb; pdb.set_trace()
 #Send to actual processing
 surplus=N%1000
 if surplus==0:
   #sumprocess()
   try:
    collect_count()
   except Exception as eeee:
    print(eeee)
   #exit
   #Let's count
   cft_read=open("custam_freq.txt","r",encoding="utf-8-sig")
   cft_read=cft_read.read()
   cft_read=eval(cft_read)
   max_freq=cft_read[0][1]
   #Maximum value
   allen=inita(max_freq,cft_read)
   #Make a list of trends with the same frequency.
   finf=notwo(allen)
   #Find and remove duplicate strings and trends
   siage(finf)
   #New it_Write as freq
   print_varsize()
   #Display memory information
 N=N+1
#streaming body


def judge_tweet_type(tweet):
 text = re.sub("https?://[\w/:%#\$&\?\(\)~\.=\+\-]+","",tweet.text)
 if tweet.in_reply_to_status_id_str :
  text=re.sub(r"@[a-zA-Z0-9_]* ","",text)
  text=re.sub(r"@[a-zA-Z0-9_]","",text)
  return "reply",text
 else:
  head= str(tweet.text).split(":")
  if len(head) >= 2 and "RT" in head[0]:
   text=re.sub(r"RT @[a-zA-Z0-9_]*: ","",text)
   return "retwe",text
  else:
   return "tweet",text



badword=["Question box","Let's throw marshmallows","I get a question","Participation in the war","Delivery","@","Follow","Application","Smartphone RPG","Gacha","S4live","campaign","Drift spirits","Present","Cooperative live","We are accepting consultations completely free of charge","Omikuji","Chance to win","GET","get","shindanmaker","Hit","lottery"]
N_all=0
def gather_pre(tweet,type):
    global N_all
    N_all=N_all+1
    #Count all tweets passing through here
    go=0
    for b in badword:
     if  b in tweet.text:
      go=1
      break
    #Judge whether there is a bad word in the text, GO judgment because it is not included in 0
    if go == 0:
      if tweet.lang=="ja":
        tweet_type,removed_text=judge_tweet_type(tweet)
        #Determine tweet type
        if tweet_type=="tweet":
         try:
          gather(tweet,type,tweet_type,removed_text)
          #print(type(tweet))
          #Send to gather processing.
         except Exception as eee:
          #gather("Ah","Ah","Ah","Ah")
          #import pdb; pdb.set_trace()
          pass
lagtime=0


def search(last_id):
 #print(pat)
 global pat
 time_search =time.time()
 for status in apiapp.search(q=pat,count="100",result_type="recent",since_id=last_id):
   #Get newer tweets than the last tweet you got with search
   gather_pre(status,"search")
#search body

interval = 2.16
#search call interval
#min2

trysearch=0
#search number of calls

class StreamingListener(tweepy.StreamListener):
    def on_status(self, status):
        global time_search
        global trysearch
        gather_pre(status,"stream")
        time_stream=time.time()
        time_stream-time_search % interval
        if time_stream-time_search-interval>interval*trysearch:
           #For a certain period of time(interbal)Execute search every time.
           last_id=status.id
           #executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
           #executor.submit(search(last_id))
           #When trying parallel processing
           search(last_id)
           trysearch=trysearch+1
#streaming body

def carry():
 listener = StreamingListener()
 streaming = tweepy.Stream(auth, listener)
 streaming.sample()
#stream call function

time_search =time.time()
#The time when the search was last executed, but defined before the stream

executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
#Parallel definition

all_time=time.time()
#Execution start time definition

try:
 carry()
except Exception as e:
 import pdb; pdb.set_trace()
 print(e)
 #import pdb; pdb.set_trace()
 pass
 #except Exception as ee:
  #print(ee)
  #import pdb; pdb.set_trace()

#carry body and error handling

Below is a program of meaning trends, but I can recommend it with confidence because it is being highly acclaimed (). The part I bring from the thesaurus is not something I wrote, but I will leave it.

imi_trend.py



from bs4 import BeautifulSoup
import collections
import concurrent.futures
import datetime
import emoji
import itertools
import MeCab
from nltk import Tree
import os
from pathlib import Path
from pytz import timezone
import re
import spacy
import subprocess
import sys
import time
import tweepy
import unidic2ud
import unidic2ud.cabocha as CaboCha
from urllib.error import HTTPError, URLError
from urllib.parse import quote_plus
from urllib.request import urlopen
m=MeCab.Tagger("-d ./unidic-cwj-2.3.0")

os.remove("bunrui01.csv")
os.remove("all_tweet_text.txt")
os.remove("all_kakari_imi.txt")

bunrui01open=open("bunrui01.csv","a",encoding="utf-8")
textopen=open("all_tweet_text.txt","a",encoding="utf-8")
akiopen=open("all_kakari_imi.txt","a",encoding="utf-8")

catedic={}
with open('categori.txt') as f:

 a=f.read()
 aa=a.split("\n")
 b=[]
 bunrui01open.write(",,,")
 for i, j in enumerate(aa):
  catedic[j]=i
  bunrui01open.write(str(j))
  bunrui01open.write(",")
 bunrui01open.write("\n")
 print(catedic)
with open('./BunruiNo_LemmaID_ansi_user.csv') as f:
 a=f.read()
 aa=a.split(",\n")
 b=[]
 for bb in aa:
  if len(bb.split(","))==2:
   b.append(bb.split(","))
 word_origin_num_to_cate=dict(b)
with open('./cate_rank2.csv') as f:
 a=f.read()
 aa=a.split("\n")
 b=[]
 for bb in aa:
  if len(bb.split(","))==2:
   b.append(bb.split(","))
 cate_rank=dict(b)

class Synonym:

    def getSy(self, word, target_url, css_selector):

        try:
            #Encoded because the URL to access contains Japanese
            self.__url = target_url + quote_plus(word, encoding='utf-8')

            #Access and parse
            self.__html = urlopen(self.__url)
            self.__soup = BeautifulSoup(self.__html, "lxml")

            result = self.__soup.select_one(css_selector).text

            return result
        except HTTPError as e:
            print(e.reason)
        except URLError as e:
            print(e.reason)
sy = Synonym()
alist = ["Selection"]

#Use "Japanese Thesaurus Associative Thesaurus" to search
target = "https://renso-ruigo.com/word/"
selector = "#content > div.word_t_field > div"
#for item in alist:
#    print(sy.getSy(item, target, selector))

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
authapp = tweepy.AppAuthHandler(consumer_key,consumer_secret)
apiapp = tweepy.API(authapp)
#Authentication(Here api)

def remove_emoji(src_str):
    return ''.join(c for c in src_str if c not in emoji.UNICODE_EMOJI)

def get_tag(tw,text_content):
 taglist=[]
 entities=eval(str(tw.entities))["hashtags"]
 for e in entities:
  text=e["text"]
  taglist.append(text)
 for _ in range(len(taglist)+2):
  for s in taglist:
   text_content=re.sub(s,"",text_content)
   #text_content=re.sub(r"#(.+?)+ ","",text_content)
 return taglist,text_content

def get_swap_dict(d):
    return {v: k for k, v in d.items()}

def xcut(asub,a):
 asub.append(a[0])
 a=a[1:len(a)]
 return asub,a

def ycut(asub,a):
 asub.append(a[0])
 a=a[1:len(a)]
 return asub,a

def bunruikugiri(lastx,lasty):
 hoge=[]
 #import pdb; pdb.set_trace()
 editx=[]
 edity=[]
 for _ in range(500):
  edity,lasty=ycut(edity,lasty)
  #target=sum(edity)
  for _ in range(500):
   target=sum(edity)
   #rint("sum",sum(editx),"target",target)
   if sum(editx)<target:
    editx,lastx=xcut(editx,lastx)
   elif sum(editx)>target:
    edity,lasty=ycut(edity,lasty)
   else:
    hoge.append(editx)
    editx=[]
    edity=[]
    if lastx==[] and lasty==[]:
     return hoge
    break

all_appear_cate=[]
all_unfound_word=[]
all_kumiawase=[]
nn=1
all_kakari_imi=[]
def process(tw,ty):
 global nn
 wordnum_toword={}
 catenum_wordnum={}
 word_origin_num=[]
 mozisu=[]
 try:
  tw=re.sub("https?://[\w/:%#\$&\?\(\)~\.=\+\-]+","",tw)
  tw=tw.replace("#","")
  tw=tw.replace(",","")
  tw=tw.replace("\u3000","") #Important for matching the number of characters
  tw=re.sub(re.compile("[!-/:-@[-`{-~]"), '', tw)
  parseocha=m.parse(tw)
  print(tw)
  l = [x.strip() for x in parseocha[0:len(parseocha)-5].split('\n')]
  bunrui_miti_sentence=[]
  for i, block in enumerate(l):
   if len(block.split('\t')) > 1:
    c=block.split('\t')
    d=c[1].split(",")
    #Word processing process
    print(d,len(d))
    if len(d)>9:
        if d[10] in ["To do"]:
          word_origin_num.append(d[10])
          bunrui_miti_sentence.append(d[8])
          mozisu.append(len(d[8]))
        elif d[-1] in word_origin_num_to_cate:
         word_origin_num.append(int(d[-1]))
         wordnum_toword[int(d[-1])]=d[8]
         bunrui_miti_sentence.append(word_origin_num_to_cate[str(d[-1])])
         mozisu.append(len(d[8]))
        else:
          #print("nai",d[8])
          #Display of unknown words
          all_unfound_word.append(d[10])
          bunrui_miti_sentence.append(d[8])
          mozisu.append(len(c[0]))
    else:
        mozisu.append(len(c[0]))
        all_unfound_word.append(c[0])
        bunrui_miti_sentence.append(c[0])
        #else:
        #  mozisu.append(l[])
  #print("kouho",word_origin_num,"\n")
  #Words to original numbers
  #print(tw)
  #If you look at sentences made with semantic classification and unknown words
  for s in bunrui_miti_sentence:
    print(s," ",end="")
  print("\n")
  stn=0
  cmd = "echo "+str(tw)+" | cabocha -f1"
  cmdtree="echo "+str(tw)+" | cabocha "
  proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,shell=True)
  proctree = subprocess.Popen(cmdtree, stdout=subprocess.PIPE, stderr=subprocess.PIPE,shell=True)
  proc=proc.communicate()[0].decode('cp932')
  proctree=proctree.communicate()[0].decode('cp932')
  print(proctree)
  proclist=proc.split("\n")
  #print(proc)
  #f1 information
  #print(proclist)
  #Listing information
  procnumlist=[]
  wordlis=[]
  eachword=""
  num=0
  for p in proclist:
   if p[0]=="*":
    f=p.split(" ")[1]
    t=p.split(" ")[2].replace("D","")
    procnumlist.append([f,t])
    if eachword:
     wordlis.append([num,eachword])
     num=num+1
     eachword=""
   elif p=="EOS\r":
     wordlis.append([num,eachword])
     num=num+1
     eachword=""
     break
   else:
    #print("aaaaa",p.split("\t")[0])
    eachword=eachword+p.split("\t")[0]
  tunagari_num_dict=dict(procnumlist)

  print(tunagari_num_dict)
  bunsetu_num_word=dict(wordlis)
  #print(bunsetu_num_word)
  bunsetu_mozisu=[] 
  for v in bunsetu_num_word.values():
   bunsetu_mozisu.append(len(v))
  if sum(bunsetu_mozisu) != sum(mozisu):
    return
  #print("mozisu",mozisu)
  #print("bunsetumozi",bunsetu_mozisu)
  res=bunruikugiri(mozisu,bunsetu_mozisu)
  #print("res",res)
  nnn=0
  small_cateandcharlist=[]
  big_cateandcharlist=[]
  for gc in res:
    for _ in range(len(gc)):
     print(bunrui_miti_sentence[nnn],end="  ")
     if bunrui_miti_sentence[nnn] in list(catedic.keys()):
       small_cateandcharlist.append(bunrui_miti_sentence[nnn])
     nnn=nnn+1
    #Unknown words and particles are considered to be the same, so the mecabne gold dictionary can be used.
    if small_cateandcharlist==[]:
     big_cateandcharlist.append(["null"])
    else:
     big_cateandcharlist.append(small_cateandcharlist)
    small_cateandcharlist=[]
    print("\n")
  #print("bcacl",big_cateandcharlist)
  twewtnai_kakari_imi=[]
  if len(big_cateandcharlist)>1 and len(big_cateandcharlist)==len(bunsetu_num_word):
  #Dependencies and morphological analysis delimiters do not match
   for kk, vv in tunagari_num_dict.items():
     if vv != "-1":
      for aaw in big_cateandcharlist[int(kk)]:
       for bbw in big_cateandcharlist[int(vv)]:
        twewtnai_kakari_imi.append([aaw,bbw])
        if not "Rank symbol" in str([aaw,bbw]):
          if not "null" in str([aaw,bbw]):
           if not "Number sign" in str([aaw,bbw]):
            if not "Things" in str([aaw,bbw]):
             all_kakari_imi.append(str([aaw,bbw]))
             akiopen.write(str([aaw,bbw]))
     else:
       break
  else:
    return
  akiopen.write("\n")
  akiopen.write(str(bunrui_miti_sentence))
  akiopen.write("\n")
  akiopen.write(str(tw))
  akiopen.write("\n")
  print("tki",twewtnai_kakari_imi)
  tweetnai_cate=[]
  word_cate_num=[]
  for k in word_origin_num:
    if str(k) in word_origin_num_to_cate:
     ram=word_origin_num_to_cate[str(k)]
     print(ram,cate_rank[ram],end="")
     tweetnai_cate.append(ram)
     all_appear_cate.append(ram)
     word_cate_num.append(catedic[ram])
     catenum_wordnum[catedic[ram]]=int(k)
     stn=stn+1
    else:
     if k in ["To do"]:
      all_appear_cate.append(k)
      tweetnai_cate.append(k)
  print("\n")
  #print(tweetnai_cate)
  #import pdb; pdb.set_trace()
  for k in tweetnai_cate:
   if k in catedic:
    aac=catedic[k]
  #print("gyaku",word_cate_num)
  #print("wt",wordnum_toword)
  #print("cw",catenum_wordnum)
  bunrui01open.write(str(tw))
  bunrui01open.write(",")
  bunrui01open.write(str(tim))
  bunrui01open.write(",")
  bunrui01open.write(str(unix))
  bunrui01open.write(",")
  ps=0
  for tt in list(range(544)):
    if int(tt) in word_cate_num:
     a=catenum_wordnum[tt]
     #Word number from the sword
     bunrui01open.write(str(wordnum_toword[a]))
     #Word from word number
     bunrui01open.write(",")
     ps=ps+1
    else:
     bunrui01open.write("0,")
  bunrui01open.write("end")
  bunrui01open.write("\n")
  textopen.write(str(nn))
  textopen.write(" ")
  textopen.write(tw)
  textopen.write("\n")
  nn=nn+1
  #Put all the streets
  for k in list(itertools.combinations(tweetnai_cate,2)):
   all_kumiawase.append(k)

 except Exception as ee:
  print(ee)
  import pdb; pdb.set_trace()
  pass

def judge_tweet_type(tweet):
 if tweet.in_reply_to_status_id_str:
  return "reply"
 else:
  head= str(tweet.text).split(":")
 if len(head) >= 2 and "RT" in head[0]:
  return "retwe"
 else:
  return "tweet"
#Judging whether it is a rip, retweet, or tweet

def get_time(id):
 two_raw=format(int(id),'016b').zfill(64)
 unixtime = int(two_raw[:-22],2) + 1288834974657
 unixtime_th = datetime.datetime.fromtimestamp(unixtime/1000)
 tim = str(unixtime_th).replace(" ","_")[:-3]
 return tim,unixtime
#Tweet time from id

N=1
def gather(tweet,type,tweet_typea):
 global all_appear_cate
 global N
 global all_time
 global tim
 global unix
 tim,unix=get_time(tweet.id)
 original_text=tweet.text.replace("\n","")
 taglist,original_text=get_tag(tweet,original_text)
 nowtime=time.time()
 tweet_pertime=str(round(N/(nowtime-all_time),1))
 lag=str(round(nowtime-unix/1000,1))
 #lang=lang_dict[tweet.lang]
 try:
  process(remove_emoji(original_text),tweet_typea,)
 except Exception as e:
  print(e)
  #import pdb; pdb.set_trace()
  pass
 print(N,tweet_pertime,"/s","+"+lag,tim,type,tweet_typea)
 N=N+1
 if N%500==0:
   ccdd=collections.Counter(all_appear_cate).most_common()
   for a in ccdd:
    print(a)
   #ccdd=collections.Counter(all_unfound_word).most_common()
   #for a in ccdd:
   # print("Absent",a)
   ccdd=collections.Counter(all_kumiawase).most_common(300)
   for a in ccdd:
    print(a)
   ccdd=collections.Counter(all_kakari_imi).most_common(300)
   for a in ccdd:
    print("all_kakari_imi",a)
   #import pdb; pdb.set_trace() 
#All tweets of stream and search are collected

def pre_gather(tw,ty):
 #print(ty)
# if  "http://utabami.com/TodaysTwitterLife" in tw.text:
  print(tw.text)
  if ty=="stream":
   tweet_type=judge_tweet_type(tw)
   if tw.lang=="ja" and tweet_type=="tweet":
    gather(tw,ty,tweet_type)
  elif ty=="search":
    gather(tw,ty,"tweet")

def search(last_id):
 time_search =time.time()
 for status in apiapp.search(q="filter:safe OR -filter:safe -filter:retweets -filter:replies lang:ja",count="100",result_type="recent",since_id=last_id):
   pre_gather(status,"search")
#search body

class StreamingListener(tweepy.StreamListener):
    def on_status(self, status):
        global time_search
        global trysearch
        pre_gather(status,"stream")
        time_stream=time.time()
        time_stream-time_search % interval
        if time_stream-time_search-interval>interval*trysearch:
           last_id=status.id
           #executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
           #executor.submit(search(last_id))
           search(last_id)
           trysearch=trysearch+1
#streaming body

def carry():
 listener = StreamingListener()
 streaming = tweepy.Stream(auth, listener)
 streaming.sample()


interval = 2.1
trysearch=0
time_search =time.time()
#executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
all_time=time.time()
try:
 #executor.submit(carry)
 carry()
except Exception as er:
 print(er)
 import pdb; pdb.set_trace()
 pass


bandicam 2020-11-16 17-53-41-545.jpg bandicam 2020-11-16 17-53-18-841.jpg

For every 500 tweets Number of occurrences of simple meaning Number of occurrences of 2-gram meaning The number of occurrences of a meaningful continuous? Also for all Tweet text, Unidic analysis information, CaboCha dependency, replacement with semantic classification, etc.

bunrui01.csv-The horizontal axis is the meaning classification of 544, the vertical axis is the tweet, 0 does not exist, 1 is the csv to write so that it is the corresponding word all_tweet_text-Processed tweets and what number they are all_kakari_imi-Dependent meaning pair, meaning classification replacement, text categori.txt-A txt that describes the semantic classification of 544 and creates a dictionary with index at runtime. For more information on BunruiNo_LemmaID_ansi_user.csv, see https://pj.ninjal.ac.jp/corpus_center/goihyo.html As you can see, it is a correspondence table of word original numbers and meaning classifications. cate_rank2.csv-A dictionary of the order of appearance of semantic classifications created at one time.

I will explain the other variables later,

It's for my own memo, and people who understand it will do their best to understand it, so I will do this.

Recommended Posts

Program for Twitter Trend Analysis (Personal Note)
Analyzing Twitter Data | Trend Analysis
Completely personal note
Flask's personal note # 2
Flask's personal note # 1
[CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis
Folder structure for analysis
Program for studying statistics
Negative / Positive Analysis 2 Twitter Negative / Positive Analysis (1)
(Personal note) Sankey diagram
Source analysis for Django--INSTALLED_APPS
Easy Twitter Posting Program
Negative / Positive Analysis 3 Twitter Negative / Positive Analysis (2)