Doc2Vec has caught my interest lately, so I have been trying various things with it. I came across the following article and decided to try the same idea myself. → I searched for similar MTG cards with Doc2Vec
Word2Vec represents each word as a vector, whereas Doc2Vec (Paragraph2Vec) treats a document as a collection of words and assigns a vector to the whole document. This makes it possible to compute similarity between documents and to do vector arithmetic on them.
For example, you can compute the similarity between news articles, between resumes, or between books, and of course between a person's profile and a book. Anything made of text qualifies.
Source: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
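To make "similarity between documents" concrete, here is a minimal sketch of ranking documents by cosine similarity, which is the measure gensim's `most_similar` ranks by. The vectors and the names `doc_a`, `doc_b`, `doc_c` are made up for illustration, and the `most_similar` function below is my own toy version, not the library's.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(doc_vectors, target, topn=3):
    """Rank every other document by cosine similarity to `target`."""
    scores = [
        (name, cosine_similarity(doc_vectors[target], vec))
        for name, vec in doc_vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional document vectors
# (a real model uses hundreds of dimensions).
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

ranking = most_similar(doc_vectors, "doc_a")
print(ranking)
```

Documents whose vectors point in nearly the same direction score close to 1, and unrelated ones score close to 0.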
Card texts from Blizzard's Hearthstone (http://us.battle.net/hearthstone/ja/), 922 cards in total
Card sample
The corpus is scraped from 4Gamer's Hearthstone card list. I used Scrapy this time. Create a project with the following command.
terminal
$ scrapy startproject hearth_stone
The following files will be created.
terminal
$ tree hearth_stone/
hearth_stone
├── hearth_stone
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
4 directories, 6 files
First, define the data structure to be output in `items.py`. I only need the name and text this time, but I scrape every attribute of the card in case I want to use them later.
items.py
# -*- coding: utf-8 -*-
import scrapy


class HearthStoneItem(scrapy.Item):
    name = scrapy.Field()
    rarity = scrapy.Field()
    ruby = scrapy.Field()
    type = scrapy.Field()
    hero = scrapy.Field()
    race = scrapy.Field()
    text = scrapy.Field()
    mana = scrapy.Field()
    attack = scrapy.Field()
    health = scrapy.Field()
Next, create a Spider (crawler).
terminal
# For example, to create a new spider:
# scrapy genspider mydomain mydomain.com
$ scrapy genspider hearthstone 4gamer.net
The following files will be created.
spiders/hearthstone.py
# -*- coding: utf-8 -*-
import scrapy


class HearthstoneSpider(scrapy.Spider):
    name = "hearthstone"
    allowed_domains = ["4gamer.net"]
    start_urls = ['http://4gamer.net/']

    def parse(self, response):
        pass
Rewrite this for 4Gamer's site.
spiders/hearthstone.py
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup

from ..items import HearthStoneItem


class HearthStoneSpider(scrapy.Spider):
    # Must match the name passed to `scrapy crawl`
    name = "hearthstone"
    allowed_domains = ["4gamer.net"]
    start_urls = ['http://www.4gamer.net/games/209/G020915/FC20140702001/']

    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        # Each card is a <div> inside the element with id="UNIT_LIST"
        for card in soup.find("div", id="UNIT_LIST").find_all("div"):
            item = HearthStoneItem()
            item['name'] = card.find("span", class_="name").string
            item['rarity'] = card.find("span", class_="rarity").string
            item['ruby'] = card.find("span", class_="ruby").string
            item['type'] = card.find("span", class_="type").string
            item['hero'] = card.find("span", class_="class").string
            item['race'] = card.find("span", class_="race").string
            item['text'] = card.find("span", class_="card_comment").find("p").string
            item['mana'] = card.find("span", class_="mana").string
            item['attack'] = card.find("span", class_="attack").string
            item['health'] = card.find("span", class_="health").string
            yield item
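The spider stores every field as the raw string taken from the page, so numeric fields stay strings, spell cards have no attack or health, and names can carry stray whitespace. Below is a minimal sketch of a normalisation step, written as plain functions on a dict so it runs without Scrapy. The helper names `to_int_or_none` and `clean_card` and the sample record are hypothetical and not part of the original project.

```python
def to_int_or_none(value):
    """Convert a scraped numeric string such as "4" to int; keep None for missing values."""
    if value is None:
        return None
    value = value.strip()
    return int(value) if value.isdigit() else None


def clean_card(raw):
    """Return a normalised copy of one scraped card (a plain dict here)."""
    card = {}
    for key, value in raw.items():
        # Strip stray whitespace from string fields, leave None untouched
        card[key] = value.strip() if isinstance(value, str) else value
    for key in ("mana", "attack", "health"):
        card[key] = to_int_or_none(raw.get(key))
    return card


# Hypothetical raw record shaped like one item yielded by the spider
raw = {"ruby": "Acidmaw ", "mana": "7", "attack": "4", "health": None}
cleaned = clean_card(raw)
print(cleaned)
```

In a Scrapy project, something like this could be called from `process_item` in `pipelines.py`. For the Doc2Vec experiment here only the name and text are needed, so the step is optional.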
Now that everything is ready, run the crawler.
You can specify the output file name with -o (the file type is determined automatically from the extension).
terminal
$ scrapy crawl hearthstone -o hearth_stone.json
If the crawl succeeds, a file like the following is created. It looks like gibberish, but the text is simply Unicode-escaped, so there is nothing wrong with it.
hearth_stone.json
[
{"attack": "4", "ruby": "Abomination", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "5", "name": "\u6d9c\u308c\u3057\u3082\u306e", "race": "-", "health": "4", "text": "\u6311\u767a\uff06\u65ad\u672b\u9b54\uff1a\u5168\u3066\u306e\u30ad\u30e3\u30e9\u30af\u30bf\u30fc\u306b2\u30c0\u30e1\u30fc\u30b8\u3092\u4e0e\u3048\u308b\u3002"},
{"attack": "1", "ruby": "Abusive Sergeant", "rarity": "\u30b3\u30e2\u30f3", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "1", "name": "\u9b3c\u8ecd\u66f9", "race": "-", "health": "1", "text": "\u96c4\u53eb\u3073\uff1a\u3053\u306e\u30bf\u30fc\u30f3\u306e\u9593\u3001\u30df\u30cb\u30aa\u30f31\u4f53\u306b\u653b\u6483\u529b\uff0b2\u3092\u4ed8\u4e0e\u3059\u308b\u3002"},
{"attack": "3", "ruby": "Acidic Swamp Ooze", "rarity": "\u30d5\u30ea\u30fc", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "2", "name": "\u9178\u6027\u6cbc\u30a6\u30fc\u30ba", "race": "-", "health": "2", "text": "\u96c4\u53eb\u3073\uff1a\u6575\u306e\u6b66\u5668\u3092\u7834\u58ca\u3059\u308b\u3002"},
{"attack": "4", "ruby": "Acidmaw ", "rarity": "\u30ec\u30b8\u30a7\u30f3\u30c9", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30cf\u30f3\u30bf\u30fc", "mana": "7", "name": "\u30a2\u30b7\u30c3\u30c9\u30e2\u30fc", "race": "\u7363", "health": "2", "text": "\u81ea\u5206\u4ee5\u5916\u306e\u30df\u30cb\u30aa\u30f3\u304c\u30c0\u30e1\u30fc\u30b8\u3092\u53d7\u3051\u308b\u5ea6\u3001\u305d\u306e\u30df\u30cb\u30aa\u30f3\u3092\u7834\u58ca\u3059\u308b\u3002"},
{"attack": "1", "ruby": "Acolyte of Pain", "rarity": "\u30b3\u30e2\u30f3", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "3", "name": "\u82e6\u75db\u306e\u4f8d\u796d", "race": "-", "health": "3", "text": "\u3053\u306e\u30df\u30cb\u30aa\u30f3\u304c\u30c0\u30e1\u30fc\u30b8\u3092\u53d7\u3051\u308b\u5ea6\u3001\u30ab\u30fc\u30c9\u30921\u679a\u5f15\u304f\u3002"},
{"attack": "3", "ruby": "Al'Akir the Windlord", "rarity": "\u30ec\u30b8\u30a7\u30f3\u30c9", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30b7\u30e3\u30fc\u30de\u30f3", "mana": "8", "name": "\u98a8\u306e\u738b\u30a2\u30e9\u30ad\u30a2", "race": "-", "health": "5", "text": "\u75be\u98a8\u3001\u7a81\u6483\u3001\u8056\u306a\u308b\u76fe\u3001\u6311\u767a"},
{"attack": "0", "ruby": "Alarm-o-Bot", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "3", "name": "\u30a2\u30e9\u30fc\u30e0\u30ed\u30dc", "race": "\u30e1\u30ab", "health": "3", "text": "\u81ea\u5206\u306e\u30bf\u30fc\u30f3\u306e\u958b\u59cb\u6642\u3001\u3053\u306e\u30df\u30cb\u30aa\u30f3\u3092\u3001\u81ea\u5206\u306e\u624b\u672d\u306e\u30e9\u30f3\u30c0\u30e0\u306a\u30df\u30cb\u30aa\u30f3\u3068\u5165\u308c\u66ff\u3048\u308b"},
{"attack": "3", "ruby": "Aldor Peacekeeper", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30d1\u30e9\u30c7\u30a3\u30f3", "mana": "3", "name": "\u30a2\u30eb\u30c0\u30fc\u306e\u5e73\u548c\u306e\u756a\u4eba", "race": "-", "health": "3", "text": "\u96c4\u53eb\u3073\uff1a\u6575\u306e\u30df\u30cb\u30aa\u30f31\u4f53\u306e\u653b\u6483\u529b\u30921\u306b\u5909\u3048\u308b\u3002"},
{"attack": "8", "ruby": "Alexstrasza", "rarity": "\u30ec\u30b8\u30a7\u30f3\u30c9", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "9", "name": "\u30a2\u30ec\u30af\u30b9\u30c8\u30e9\u30fc\u30b6", "race": "\u30c9\u30e9\u30b4\u30f3", "health": "8", "text": "\u96c4\u53eb\u3073\uff1a\u30d2\u30fc\u30ed\u30fc1\u4eba\u306e\u6b8b\u308a\u4f53\u529b\u309215\u306b\u3059\u308b\u3002"},
{"attack": "2", "ruby": "Alexstrasza's Champion ", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30a6\u30a9\u30ea\u30a2\u30fc", "mana": "2", "name": "\u30a2\u30ec\u30af\u30b9\u30c8\u30e9\u30fc\u30b6\u306e\u52c7\u8005", "race": "-", "health": "3", "text": "\u96c4\u53eb\u3073\uff1a\u81ea\u5206\u306e\u624b\u672d\u306b\u30c9\u30e9\u30b4\u30f3\u30ab\u30fc\u30c9\u304c\u3042\u308b\u5834\u5408\u3001\u653b\u6483\u529b\uff0b1\u3068\u7a81\u6483\u3092\u5f97\u308b\u3002"},
{"attack": "2", "ruby": "Amani Berserker", "rarity": "\u30b3\u30e2\u30f3", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "2", "name": "\u30a2\u30de\u30cb\u306e\u72c2\u6226\u58eb", "race": "-", "health": "3", "text": "\u6fc0\u6012\uff1a\u653b\u6483\u529b\uff0b3\u3002"},
{"attack": null, "ruby": "Ancestor's Call", "rarity": "\u30a8\u30d4\u30c3\u30af", "type": "\u546a\u6587", "hero": "\u30b7\u30e3\u30fc\u30de\u30f3", "mana": "4", "name": "\u7956\u970a\u306e\u58f0", "race": "-", "health": null, "text": "\u5404\u30d7\u30ec\u30a4\u30e4\u30fc\u306e\u624b\u672d\u304b\u3089\u3001\u30e9\u30f3\u30c0\u30e0\u306a\u30df\u30cb\u30aa\u30f31\u4f53\u3092\u305d\u308c\u305e\u308c\u306e\u9663\u5730\u306b\u8ffd\u52a0\u3059\u308b\u3002"},
{"attack": null, "ruby": "Ancestral Healing", "rarity": "\u30b3\u30e2\u30f3", "type": "\u546a\u6587", "hero": "\u30b7\u30e3\u30fc\u30de\u30f3", "mana": "0", "name": "\u7956\u970a\u306e\u7652\u3057", "race": "-", "health": null, "text": "\u30df\u30cb\u30aa\u30f31\u4f53\u306e\u4f53\u529b\u3092\u4e0a\u9650\u307e\u3067\u56de\u5fa9\u3057\u3001\u6311\u767a\u3092\u4ed8\u4e0e\u3059\u308b\u3002"},
{"attack": null, "ruby": "Ancestral Knowledge ", "rarity": "\u30b3\u30e2\u30f3", "type": "\u546a\u6587", "hero": "\u30b7\u30e3\u30fc\u30de\u30f3", "mana": "2", "name": "\u7956\u970a\u306e\u77e5\u8b58", "race": "-", "health": null, "text": "\u30ab\u30fc\u30c9\u30922\u679a\u5f15\u304f\u3002\u30aa\u30fc\u30d0\u30fc\u30ed\u30fc\u30c9\uff1a (2)"},
{"attack": null, "ruby": "Ancestral Spirit", "rarity": "\u30ec\u30a2", "type": "\u546a\u6587", "hero": "\u30b7\u30e3\u30fc\u30de\u30f3", "mana": "2", "name": "\u7956\u970a\u306e\u5c0e\u304d", "race": "-", "health": null, "text": "\u30df\u30cb\u30aa\u30f31\u4f53\u306b\u3001\u300c\u65ad\u672b\u9b54\uff1a\u3053\u306e\u30df\u30cb\u30aa\u30f3\u3092\u518d\u5ea6\u53ec\u559a\u3059\u308b\u300d\u3092\u4ed8\u4e0e\u3059\u308b\u3002"},
{"attack": "5", "ruby": "Ancient Brewmaster", "rarity": "\u30b3\u30e2\u30f3", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "4", "name": "\u8001\u7df4\u306e\u9152\u9020\u5927\u5e2b", "race": "-", "health": "4", "text": "\u96c4\u53eb\u3073\uff1a\u5473\u65b9\u306e\u30df\u30cb\u30aa\u30f31\u4f53\u3092\u6226\u5834\u304b\u3089\u81ea\u5206\u306e\u624b\u672d\u306b\u623b\u3059\u3002"},
{"attack": "2", "ruby": "Ancient Mage", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "4", "name": "\u8001\u7df4\u306e\u30e1\u30a4\u30b8", "race": "-", "health": "5", "text": "\u96c4\u53eb\u3073\uff1a\u96a3\u63a5\u3059\u308b\u30df\u30cb\u30aa\u30f3\u306b\u3001\u546a\u6587\u30c0\u30e1\u30fc\u30b8\uff0b1\u3092\u4ed8\u4e0e\u3059\u308b\u3002"},
{"attack": "5", "ruby": "Ancient of Lore", "rarity": "\u30a8\u30d4\u30c3\u30af", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30c9\u30eb\u30a4\u30c9", "mana": "7", "name": "\u77e5\u8b58\u306e\u53e4\u4ee3\u6a39", "race": "-", "health": "5", "text": "\u9078\u629e\uff1a\u30ab\u30fc\u30c9\u30921\u679a\u5f15\u304f\u3002\u307e\u305f\u306f\u3001\u4f53\u529b\u30925\u56de\u5fa9\u3059\u308b\u3002"},
{"attack": "5", "ruby": "Ancient of War", "rarity": "\u30a8\u30d4\u30c3\u30af", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30c9\u30eb\u30a4\u30c9", "mana": "7", "name": "\u6226\u306e\u53e4\u4ee3\u6a39", "race": "-", "health": "5", "text": "\u9078\u629e\uff1a\u653b\u6483\u529b\uff0b5\u3002\u307e\u305f\u306f\u3001\u4f53\u529b\uff0b5\u3068\u6311\u767a\u3002"},
{"attack": "4", "ruby": "Ancient Watcher", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "2", "name": "\u53e4\u4ee3\u306e\u756a\u4eba", "race": "-", "health": "5", "text": "\u653b\u6483\u3067\u304d\u306a\u3044\u3002"},
{"attack": "1", "ruby": "Angry Chicken", "rarity": "\u30ec\u30a2", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u4e2d\u7acb", "mana": "1", "name": "\u30a2\u30f3\u30b0\u30ea\u30fc\u30c1\u30ad\u30f3", "race": "\u7363", "health": "1", "text": "\u6fc0\u6012\uff1a\u653b\u6483\u529b\uff0b5\u3002"},
{"attack": "9", "ruby": "Anima Golem", "rarity": "\u30a8\u30d4\u30c3\u30af", "type": "\u30df\u30cb\u30aa\u30f3", "hero": "\u30a6\u30a9\u30fc\u30ed\u30c3\u30af", "mana": "6", "name": "\u30a2\u30cb\u30de\u30fb\u30b4\u30fc\u30ec\u30e0", "race": "\u30e1\u30ab", "health": "9", "text": "\u6bce\u30bf\u30fc\u30f3\u306e\u7d42\u4e86\u6642\u306b\u3001\u3053\u306e\u30df\u30cb\u30aa\u30f3\u304c\u81ea\u5206\u306e\u552f\u4e00\u306e\u30df\u30cb\u30aa\u30f3\u3067\u3042\u308b\u5834\u5408\u3001\u3053\u306e\u30df\u30cb\u30aa\u30f3\u3092\u7834\u58ca\u3059\u308b\u3002"},
# (rest omitted)
]
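The `\uXXXX` sequences in the file are ordinary JSON escapes, and any JSON parser turns them back into Japanese text. A small sketch with Python's standard `json` module, using two field values copied from the output above:

```python
import json

# Two fields copied from the output above, still in escaped form
escaped = '{"rarity": "\\u30ec\\u30a2", "type": "\\u30df\\u30cb\\u30aa\\u30f3"}'

card = json.loads(escaped)
print(card["rarity"], card["type"])

# ensure_ascii=False writes the characters themselves instead of \uXXXX escapes
readable = json.dumps(card, ensure_ascii=False)
print(readable)
```

If you would rather have the feed file itself be readable, Scrapy has a `FEED_EXPORT_ENCODING` setting. Setting it to `utf-8` in `settings.py` should have that effect, though I have not tried it with this project.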
Use the resulting corpus to train doc2vec and build a model. The script is run from the following directory structure.
terminal
doc2vec
├── card2vec.py
└── hearth_stone  # (Scrapy Project)
    ├── hearth_stone
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-35.pyc
    │   │   ├── items.cpython-35.pyc
    │   │   └── settings.cpython-35.pyc
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       ├── __pycache__
    │       │   └── __init__.cpython-35.pyc
    │       └── hearthstone.py
    ├── hearth_stone.json
    └── scrapy.cfg
Create a model of doc2vec with the following code.
card2vec.py
# -*- coding: utf-8 -*-
import json
import os

import MeCab
from gensim.models import doc2vec


def load_json(target_game_name):
    # Build the input data: card names and card texts
    names = []
    text = ""
    texts = []
    # Have MeCab output space-separated tokens (wakati-gaki)
    mecab = MeCab.Tagger("-Owakati")
    json_path = target_game_name + "/" + target_game_name + ".json"
    # Tokenize each card text and join the results into one string, one card per line
    with open(json_path, "r") as file:
        card_dict = json.load(file)
        for card in card_dict:
            if card["name"] not in names:
                names.append(card["name"])
                mecab_result = mecab.parse(card["text"])
                if not mecab_result:
                    # Keep an empty line so line numbers stay aligned with names
                    text += "\n"
                    texts.append("")
                else:
                    text += mecab_result
                    texts.append(card["text"])
    with open(target_game_name + ".txt", "w", encoding="utf-8") as file:
        file.write(text)
    return names, texts


def generate_doc2vec_model(target_game_name):
    print("Training Start")
    # Read the card texts (one document per line)
    card_text = doc2vec.TaggedLineDocument(target_game_name + ".txt")
    # Train (these parameter names are those of gensim before 4.0)
    model = doc2vec.Doc2Vec(card_text, size=300, window=8, min_count=1,
                            workers=4, iter=400, dbow_words=1, negative=5)
    # Save the model
    model.save(target_game_name + ".model")
    print("Training Finish")
    return model


if __name__ == '__main__':
    TARGET_GAME_NAME = "hearth_stone"
    names, texts = load_json(TARGET_GAME_NAME)
    if os.path.isfile(TARGET_GAME_NAME + ".model"):
        model = doc2vec.Doc2Vec.load(TARGET_GAME_NAME + ".model")
    else:
        model = generate_doc2vec_model(TARGET_GAME_NAME)
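The reason `load_json` writes exactly one line per card is that `TaggedLineDocument` tags each line with its line number, so position `i` in `names` has to correspond to line `i` of the text file. Here is a minimal sketch of that alignment. It imitates the line-tagging behaviour with a stand-in so it runs without gensim or MeCab; the card names and pre-tokenized lines are made up.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so the sketch runs without gensim
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tagged_lines(text):
    """Yield one document per line, tagged with its 0-based line number."""
    for line_no, line in enumerate(text.splitlines()):
        yield TaggedDoc(words=line.split(), tags=[line_no])


# Hypothetical card names and already tokenized texts
# (space-separated tokens, one card per line, as MeCab's wakati mode produces)
names = ["card_a", "card_b", "card_c"]
corpus = "summon a minion\n\ndraw a card\n"

docs = list(tagged_lines(corpus))
for name, doc in zip(names, docs):
    print(name, doc.tags, doc.words)
```

The empty second line stands for a card whose text could not be tokenized. Writing a blank line for it, rather than skipping it, keeps every later card on the line that matches its index in `names`.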
card2vec.py (continued)
    # (continues inside the `if __name__ == '__main__':` block)
    # Name of the card to find similar cards for;
    # it must match a "name" value in hearth_stone.json
    TARGET_CARD_NAME = "Hogger"
    card_index = names.index(TARGET_CARD_NAME)
    # Get a list of (document index, similarity) tuples (top 10 by similarity)
    similar_docs = model.docvecs.most_similar(card_index)
    print(names[card_index])
    print(texts[card_index])
    print("--------------------is similar to--------------------")
    for similar_doc in similar_docs:
        print(names[similar_doc[0]] + " " + str(similar_doc[1]))
        print(texts[similar_doc[0]], "\n")
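A note for anyone running this on a current gensim: the script uses the API from before gensim 4.0. In 4.0 the `size` argument was renamed `vector_size`, `iter` was renamed `epochs`, and `model.docvecs` became `model.dv`. A minimal sketch of translating the keyword arguments follows; the helper `to_gensim4_kwargs` is hypothetical and only renames keys, it does not call gensim.

```python
# Keyword arguments renamed in gensim 4.0 (old name -> new name)
RENAMED_KWARGS = {"size": "vector_size", "iter": "epochs"}


def to_gensim4_kwargs(kwargs):
    """Return a copy of `kwargs` with pre-4.0 Doc2Vec argument names updated."""
    return {RENAMED_KWARGS.get(key, key): value for key, value in kwargs.items()}


# The training parameters used in card2vec.py
old_kwargs = dict(size=300, window=8, min_count=1, workers=4,
                  iter=400, dbow_words=1, negative=5)

new_kwargs = to_gensim4_kwargs(old_kwargs)
print(new_kwargs)
```

With gensim 4.x the training call should then be `doc2vec.Doc2Vec(card_text, **new_kwargs)` and the lookup `model.dv.most_similar(card_index)`. I have not re-run the experiment on 4.x, so treat this as a starting point.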
Now give a card name as input and print the similar cards. I will start with my personal favorite card.
terminal
$ python card2vec.py
Hogger
At the end of your turn, summon a 2/2 Gnoll with Taunt.
--------------------is similar to--------------------
Hogger, Doom of Elwynn 0.9119920134544373
Whenever this minion takes damage, summon a 2/2 Gnoll with Taunt.
Obsidian Destroyer 0.8980860114097595
At the end of your turn, summon a 1/1 Scarab with Taunt.
Summoning Stone 0.8811841011047363
Whenever you cast a spell, summon a random minion of the same cost.
Mirror Image 0.8686900734901428
Summon two 0/2 minions with Taunt.
Ysera 0.8627046346664429
At the end of your turn, add a Dream Card to your hand.
# (rest omitted)
The card most similar to the input "Hogger" was "Hogger, Doom of Elwynn". It is interesting that the other Hogger card came out on top from the text alone. The 2nd and 3rd cards also have text of the form "when ~, summon ****", so the features are being captured fairly accurately.
terminal
Iron Juggernaut
Battlecry: Add a "Burrowing Mine" to a random position in the enemy's deck. The "Burrowing Mine" explodes when drawn, dealing 10 damage.
--------------------is similar to--------------------
Beneath the Grounds 0.8792235851287842
Add 3 "Ambush" cards to random positions in the enemy's deck. When an "Ambush" is drawn, summon a 4/4 Nerubian on your side.
Elise Starseeker 0.8761336803436279
Battlecry: Add a "Map to the Golden Monkey" to a random position in your deck.
Demolisher 0.8525710105895996
At the start of your turn, deal 2 damage to a random enemy.
Ancient Shade 0.8471906185150146
Battlecry: Add an "Ancient Curse" to a random position in your deck. When you draw the "Ancient Curse", you take 7 damage.
Mukla, Tyrant of the Vale 0.8456800580024719
Battlecry: Add 2 "Bananas" to your hand.
# (rest omitted)
Here too, the pattern "add **** to a random position in the enemy's deck; when **** is drawn, ~" is picked up solidly. "Elise Starseeker" and "Ancient Shade", which insert cards at random positions in your own deck, also rank near the top.
terminal
Lord Jaraxxus
Battlecry: Destroy your hero and replace it with Lord Jaraxxus.
--------------------is similar to--------------------
Majordomo Executus 0.8202652335166931
Deathrattle: Replace your hero with Ragnaros the Firelord.
Commanding Shout 0.8199278712272644
Your minions can't be reduced below 1 Health this turn. Draw a card.
Elise Starseeker 0.8164669275283813
Battlecry: Add a "Map to the Golden Monkey" to a random position in your deck.
Tuskarr Jouster 0.8136664032936096
Battlecry: Reveal a minion in each player's deck. If yours costs more, restore 7 Health to your hero.
Deadly Shot 0.8121815323829651
Destroy a random enemy minion.
# (rest omitted)
"Lord Jaraxxus" is a card with the unusual effect of replacing your hero. The only cards with this effect are "Lord Jaraxxus" and "Majordomo Executus". Since the similarity between the two is high in this result, the features seem to be captured fairly well.
As shown above, even short texts such as Hearthstone card texts can be vectorized well with doc2vec. The results were reasonably accurate even though the corpus contains only 922 cards. For the Doc2Vec parameters, I referred to the paper listed in the references. As an aside, I also tried this with Yu-Gi-Oh cards, but I am not familiar with Yu-Gi-Oh and could not tell whether the results were actually similar, so I gave up. The source code and the model are on GitHub, so please try it if you are interested. https://github.com/GuiltyMorishita/card2vec
References
Distributed Representations of Sentences and Documents
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation