Use Python and word2vec (pretrained) with Azure Databricks

Purpose of this article

I want to run processing with a pretrained word2vec model on Azure Databricks. I had been using word2vec from Python in my local environment and assumed the same code would work with copy and paste, but I got stuck, so I am writing down what I learned.

The conclusion first:

・Upload the trained model to Blob Storage, mount the container on Databricks, and load the model from there.
・Note that if you do not go through a with open statement when loading, you will get a "File not found" error.

word2vec overview

Word2vec can handle word similarity mathematically

As the name implies, it converts words into vectors. It is an essential technology for natural language processing: by replacing a plain character string with a vector, a word can be handled mathematically.

"Rice" "Machine learning" "Deep learning"   ↓     ↓        ↓ image.png

This makes it possible to calculate the similarity between words mathematically, as a distance in vector space. For example, it can be defined mathematically that the words "machine learning" and "deep learning" have similar meanings.
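As an illustration, similarity between word vectors is usually measured with cosine similarity. A minimal sketch in plain Python (the three-dimensional toy vectors are made up for illustration; real word2vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen only to illustrate the idea
machine_learning = [0.9, 0.8, 0.1]
deep_learning    = [0.8, 0.9, 0.2]
rice             = [0.1, 0.0, 0.9]

print(cosine_similarity(machine_learning, deep_learning))  # close to 1.0
print(cosine_similarity(machine_learning, rice))           # much smaller
```

The closer the cosine similarity is to 1, the more similar the two words are in the vector space.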

Training it yourself is hard

The basic premise of word2vec is the idea that the meaning of a word is formed by the words around it. This is called the "distributional hypothesis".

To put it plainly, the meaning of a word can be inferred by looking at the words around it.

For example, suppose you have the following sentence:

・The technology of [machine learning] is indispensable for realizing artificial intelligence.

Even if the word "machine learning" is unknown, you can guess that it is probably a technology related to artificial intelligence.

Similarly, there may be a sentence like this:

・The technology called [deep learning] has dramatically accelerated research on artificial intelligence.

By learning from a large number of such sentences, it becomes possible to infer the meaning of unknown words. You can also see that [machine learning] and [deep learning], which have similar surrounding words, appear to be semantically similar.
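The distributional hypothesis can be sketched with a toy co-occurrence count: words that share context words end up with overlapping context profiles. (The two sentences and the window size below are made up for illustration; real word2vec uses skip-gram or CBOW neural training over huge corpora, not raw counts.)

```python
from collections import Counter

# The two example sentences from the text, tokenized by hand
sentences = [
    ["machine_learning", "is", "indispensable", "for", "artificial_intelligence"],
    ["deep_learning", "accelerated", "research", "on", "artificial_intelligence"],
]

def context_counts(target, sentences, window=4):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for w in sent[lo:hi] if w != target)
    return counts

ml = context_counts("machine_learning", sentences)
dl = context_counts("deep_learning", sentences)

# Both targets share the context word "artificial_intelligence",
# which is what makes them look semantically related.
print(set(ml) & set(dl))  # {'artificial_intelligence'}
```

With only two sentences the overlap is a single word, but over millions of sentences these context profiles become dense vectors, and overlapping profiles translate into nearby vectors.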

However, this kind of training requires reading a huge number of documents, and the cost is high. So the standard approach is to start from a pretrained model.

Steps to use word2vec with Azure Databricks


Preparation

  1. Create Azure Databricks resources
  2. Create a container with Storage Account
  3. Download the trained model and store it in a container


Execution

  1. Mount the container on Azure Databricks
  2. Load the model with gensim
  3. Run word2vec

Preparation

1. Create Azure Databricks resources

There is nothing special to watch out for; create it from the Azure portal as usual.

2. Create a container with Storage Account

Again, there is nothing special to watch out for; just create a container. The public access level can be Private.

3. Download the trained model and store it in a container

Download a trained model by referring to the following article. (I used the fastText vectors.)

List of ready-to-use word embed vectors https://qiita.com/Hironsan/items/8f7d35f0a36e0f99752c

Upload the downloaded "model.vec" file to the created container.

Execution

From here, operations on the Databricks notebook.

1. Mount the container on Azure Databricks

This article was very easy to understand.

Analyze the data in the Blob with a query! https://tech-blog.cloud-config.jp/2020-04-30-databricks-for-ml/

python


mount_name= "(Arbitrary mount destination directory name)"
storage_account_name = "(Storage account name)"
container_name = "(Container name)"
storage_account_access_key = "(Storage account access key)"

mount_point = "/mnt/" + mount_name
source = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net"
conf_key = "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net"


mounted = dbutils.fs.mount(
  source=source,
  mount_point = mount_point,
  extra_configs = {conf_key: storage_account_access_key}
)
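For reference, with hypothetical placeholder values the string concatenation above produces the following WASB source URL and config key (this is pure string construction and runs anywhere; dbutils.fs.mount itself only exists inside a Databricks notebook):

```python
# Hypothetical values, for illustration only
mount_name = "mymount"
storage_account_name = "mystorage"
container_name = "mycontainer"

mount_point = "/mnt/" + mount_name
source = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net"
conf_key = "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net"

print(mount_point)  # /mnt/mymount
print(source)       # wasbs://mycontainer@mystorage.blob.core.windows.net
print(conf_key)     # fs.azure.account.key.mystorage.blob.core.windows.net
```

After mounting, the container's contents are visible to Spark APIs under /mnt/(mount destination directory name).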

2. Load the model with gensim

python


import gensim

# Load directly from the mount point -- this fails (see below)
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format("/mnt/(mount destination directory name)/model.vec", binary=False)

Running the above raises an error, even though the mount itself succeeded. (Locally, the same code works fine.)

FileNotFoundError: [Errno 2] No such file or directory:

The cause is that gensim reads the path with ordinary local file I/O, which sees DBFS only through the /dbfs prefix. So open the file yourself with a with open statement, receive it as a file object f_read, and pass that to gensim:

python


import gensim

# Open via the local file API path (note the /dbfs prefix) and pass the file object
with open("/dbfs/mnt/(mount destination directory name)/model.vec", "r") as f_read:
  word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(f_read, binary=False)

Databricks File System (DBFS)-Local File api https://docs.microsoft.com/ja-jp/azure/databricks/data/databricks-file-system#local-file-apis
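The difference between the two attempts is the path namespace: Spark APIs see the mount at /mnt/..., while ordinary Python file I/O sees it through the local FUSE mount under /dbfs/. A small helper to convert between the two (a sketch based on the local file API docs linked above; the example paths are placeholders):

```python
def to_local_path(dbfs_path: str) -> str:
    """Convert a DBFS path to the path visible to local file APIs like open()."""
    if dbfs_path.startswith("dbfs:/"):
        return "/dbfs/" + dbfs_path[len("dbfs:/"):]
    if dbfs_path.startswith("/mnt/"):
        return "/dbfs" + dbfs_path
    return dbfs_path  # assume it is already a local path

print(to_local_path("dbfs:/mnt/mymount/model.vec"))  # /dbfs/mnt/mymount/model.vec
print(to_local_path("/mnt/mymount/model.vec"))       # /dbfs/mnt/mymount/model.vec
```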

This time it was a success.

3. Run word2vec

Let's try it out by outputting the words closest to "Japanese".

python


word2vec_model.most_similar(positive=['Japanese'])

Out[3]:
[('Chinese', 0.7151615619659424),
 ('Japanese', 0.5991291999816895),
 ('Foreign', 0.5666396617889404),
 ('Japanese', 0.5619238018989563),
 ('Korean', 0.5443094968795776),
 ('Overseas Chinese', 0.5377858877182007),
 ('Resident in Japan', 0.5263140201568604),
 ('Chinese', 0.5200497508049011),
 ('Residence', 0.5198684930801392),
 ('International Student', 0.5194666981697083)]
