[PYTHON] I made a downloader for distributed word representations

Distributed representations of words are commonly used in modern natural language processing. Recently, many pretrained models have been released, so it is often unnecessary to spend the time and money to train them yourself. However, even though these models are publicly available, finding and downloading them still takes a surprising amount of time.

To eliminate this hassle, I made a downloader for distributed word representations. Its name is **chakin**: chakki-works/chakin (stars keep me motivated m(__)m)

chakin's main features are that it is written in Python, can be installed with pip, lets you go from search to download in one stop, and supports 23 sets of vectors (as of May 29, 2017). We plan to increase the number of supported vectors in the future.

Let's see how to use it.

How to use chakin

Installation is easy. Type the following command using pip:

$ pip install chakin

Once installed, it is ready to use. Downloading a dataset takes only three lines of code. This time, let's download a Japanese distributed representation dataset. First, launch Python:

$ python

After launching Python, import chakin. You can then search for pretrained models by specifying the language (Japanese in this case) in the search method:

>>> import chakin
>>> chakin.search(lang='Japanese')
                         Name  Dimension     Corpus VocabularySize               Method  Language
6                fastText(ja)        300  Wikipedia           580K             fastText  Japanese
22  word2vec.Wiki-NEologd.50d         50  Wikipedia           335K   word2vec + NEologd  Japanese

Currently, you can only search by language. This is one of the areas where I would like to improve usability in the future.

Once you find the dataset you want, pass its index to the download method. This time, I specified **22**, the index of "word2vec.Wiki-NEologd.50d":

>>> chakin.download(number=22, save_dir='./')
Test: 100% ||               | Time: 0:00:02  60.7 MiB/s
'./latest-ja-word2vec-gensim-model.zip'
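
The downloaded archive here is a gensim Word2Vec model, so you can load it with gensim after unzipping it. Below is a minimal sketch, not part of chakin itself; the model file name inside the archive (word2vec.gensim.model) is an assumption, so check the extracted contents before running it.

```python
import zipfile

from gensim.models import Word2Vec

# Unzip the archive downloaded by chakin (file name taken from the output above).
with zipfile.ZipFile('latest-ja-word2vec-gensim-model.zip') as zf:
    zf.extractall('./ja-word2vec')

# Load the gensim model. The model file name is an assumption; adjust it to
# match the actual file in the extracted directory.
model = Word2Vec.load('./ja-word2vec/word2vec.gensim.model')

# Look up the 50-dimensional vector for a word and list similar words
# (assuming the word appears in the model's vocabulary).
print(model.wv['日本'])
print(model.wv.most_similar('日本', topn=5))
```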

That's all for how to use it.

Supported vectors

chakin currently supports the vectors listed below. We will keep adding more in the future, so please give it a try.

| Name | Dimension | Corpus | Vocabulary Size | Method | Language |
|---|---|---|---|---|---|
| fastText(ar) | 300 | Wikipedia | 610K | fastText | Arabic |
| fastText(de) | 300 | Wikipedia | 2.3M | fastText | German |
| fastText(en) | 300 | Wikipedia | 2.5M | fastText | English |
| fastText(es) | 300 | Wikipedia | 985K | fastText | Spanish |
| fastText(fr) | 300 | Wikipedia | 1.2M | fastText | French |
| fastText(it) | 300 | Wikipedia | 871K | fastText | Italian |
| fastText(ja) | 300 | Wikipedia | 580K | fastText | Japanese |
| fastText(ko) | 300 | Wikipedia | 880K | fastText | Korean |
| fastText(pt) | 300 | Wikipedia | 592K | fastText | Portuguese |
| fastText(ru) | 300 | Wikipedia | 1.9M | fastText | Russian |
| fastText(zh) | 300 | Wikipedia | 330K | fastText | Chinese |
| GloVe.6B.50d | 50 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
| GloVe.6B.100d | 100 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
| GloVe.6B.200d | 200 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
| GloVe.6B.300d | 300 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
| GloVe.42B.300d | 300 | Common Crawl(42B) | 1.9M | GloVe | English |
| GloVe.840B.300d | 300 | Common Crawl(840B) | 2.2M | GloVe | English |
| GloVe.Twitter.25d | 25 | Twitter(27B) | 1.2M | GloVe | English |
| GloVe.Twitter.50d | 50 | Twitter(27B) | 1.2M | GloVe | English |
| GloVe.Twitter.100d | 100 | Twitter(27B) | 1.2M | GloVe | English |
| GloVe.Twitter.200d | 200 | Twitter(27B) | 1.2M | GloVe | English |
| word2vec.GoogleNews | 300 | Google News(100B) | 3.0M | word2vec | English |
| word2vec.Wiki-NEologd.50d | 50 | Wikipedia | 335K | word2vec + NEologd | Japanese |
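
Many of the vectors above (word2vec.GoogleNews and the fastText .vec files) are distributed in word2vec format, which gensim can read directly. As a minimal sketch, not part of chakin itself, here is one way to load the word2vec.GoogleNews vectors after downloading them; the file name below is the one Google distributes, so adjust the path to wherever chakin saved the file.

```python
from gensim.models import KeyedVectors

# Load the GoogleNews vectors in word2vec binary format. The file name is an
# assumption based on the official distribution; change it to match your download.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# Inspect a word vector and its nearest neighbors.
print(vectors['language'][:10])   # first 10 of the 300 dimensions
print(vectors.most_similar('language', topn=5))
```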

In conclusion

Pretrained distributed representations of words are common and important in natural language processing, but finding them yourself is surprisingly troublesome. In this article, I introduced a downloader I made to eliminate that trouble. I hope you find it useful.

I also tweet about machine learning and natural language processing on my Twitter account: @Hironsan

If you are interested in this area, please follow me there.
