Using Python and MeCab with Azure Databricks

Purpose of this article

I want to run simple natural language processing (morphological analysis and a little more) with MeCab as a pre-processing step for Azure Data Factory. It would be convenient to implement it as a function that can later be called from other services such as Logic Apps, so I considered two implementation options.

  1. Azure Functions (Using Python and MeCab with Azure Functions)
  2. Azure Databricks (this article)

Azure Functions seems sufficient for now, but heavier processing such as machine learning may be needed in the future, and I also wanted to understand Databricks as a service, so I tried Databricks as well.

To state the conclusions first:

· For beginners with Azure Databricks, the following Microsoft Learn path (free) is easy to follow:
  Run Data Engineering with Azure Databricks https://docs.microsoft.com/ja-jp/learn/paths/data-engineering-with-databricks/
· MeCab can be used by installing "mecab-python3" on the cluster from PyPI.
· Everything can be done from a browser through the Azure portal and Databricks; no local environment setup is required.

There is a lot I do not fully understand yet, so please point out any mistakes. I will correct and extend the article as needed.

Databricks overview

Databricks is an Apache Spark-based analytics platform. Computing resources can be scaled out and distributed as needed.

Billing system

Some parts are a little hard to follow, but broadly you are charged for the following two things.

· Virtual machines (VMs) provisioned in the cluster
· Databricks units (DBUs), based on the selected VM instance

There are also small charges for managed disks, blob storage, and public IP addresses.

Azure Databricks pricing https://azure.microsoft.com/ja-jp/pricing/details/databricks/

Note that with the 14-day "trial version", DBU charges are waived, but VMs are still billed as usual.
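
To make the two-part charge concrete, here is a minimal sketch of how I picture the hourly bill: a VM part plus a DBU part, with the DBU part dropped during the trial. The function name and every rate in it are made up for illustration; real values have to be looked up on the pricing page.

```python
def hourly_cluster_cost(nodes, vm_rate, dbu_per_node, dbu_price, trial=False):
    """Rough hourly cost of a cluster: VM charge plus DBU charge.

    All arguments are hypothetical inputs, not real Azure prices.
    """
    vm_cost = nodes * vm_rate
    # During the trial, only the DBU part is waived; VMs are still billed.
    dbu_cost = 0.0 if trial else nodes * dbu_per_node * dbu_price
    return vm_cost + dbu_cost


# Made-up rates, chosen only so the arithmetic is easy to follow.
regular = hourly_cluster_cost(nodes=2, vm_rate=0.5, dbu_per_node=0.75, dbu_price=0.5)
trial = hourly_cluster_cost(nodes=2, vm_rate=0.5, dbu_per_node=0.75, dbu_price=0.5, trial=True)
print(regular, trial)
```

With these made-up numbers the trial only removes the DBU share of the bill, so a cluster that is left running still costs money.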

With Databricks itself (not Azure Databricks), you can try the service free for 14 days, including computing resources. The interface is the same for Azure Databricks and Databricks, so this is another way to try it out. https://databricks.com/try-databricks

Language

You can choose Python, Scala, SQL, or R when you create a notebook. With Databricks magic commands you can also mix multiple languages in one notebook: for example, a cell that starts with %python is executed as Python.
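
As a rough illustration of mixing languages, the sketch below shows what a small notebook looks like when it is exported as a Python source file, as far as I understand the export format: cells are separated by "# COMMAND ----------" lines, and cells in another language appear as "# MAGIC" comments. The cell contents are my own examples.

```python
# Databricks notebook source
# Cell 1: plain Python (the notebook's default language is assumed to be Python).
greeting = "hello from python"
print(greeting)

# COMMAND ----------

# MAGIC %sql
# MAGIC -- Cell 2: runs as SQL because the cell starts with the %sql magic command.
# MAGIC SELECT 1 AS one

# COMMAND ----------

# MAGIC %scala
# MAGIC // Cell 3: runs as Scala.
# MAGIC println("hello from scala")
```

In the notebook UI you would simply type %sql or %scala on the first line of a cell; the "# MAGIC" prefix only appears in the exported file.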

Create Databricks from Azure Portal

Searching for Databricks in the Azure portal and creating the resource in the usual way is straightforward.

I wondered whether to set the pricing tier to Standard or Premium, but it appears the tier can be changed later while keeping the notebooks, users, and cluster configuration, so there is no need to worry too much. Premium adds stronger access control, authentication, and audit logging.

Azure Databricks workspace upgrade or downgrade https://docs.microsoft.com/ja-jp/azure/databricks/administration-guide/account-settings/account#upgrade-or-downgrade-an-azure-databricks-workspace

Also, as mentioned above, if you select the trial version and leave it running, you will still be charged the full VM fee, so be careful. (Only the DBU charge is waived.)

Create a cluster with Databricks

After deploying Databricks, go to the resource and launch the workspace. On the Databricks screen, select Clusters and then Create Cluster.

Create the cluster by choosing the type and number of VMs to provision.
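
I created the cluster from the UI, but the same choices can also be written down as data. The sketch below builds a request body of the kind the Databricks Clusters REST API (clusters/create) accepts, as far as I understand it. The cluster name, the runtime version placeholder, and the auto-termination value are assumptions for illustration, and nothing is actually sent.

```python
import json


def build_cluster_spec(name, node_type_id, num_workers, spark_version,
                       autotermination_minutes=30):
    """Build a cluster definition as a plain dict (no request is sent)."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,   # placeholder; pick a real runtime version
        "node_type_id": node_type_id,     # type of VM to provision
        "num_workers": num_workers,       # number of worker VMs
        # Stop the cluster when idle so VMs do not keep accruing charges.
        "autotermination_minutes": autotermination_minutes,
    }


spec = build_cluster_spec(
    name="mecab-cluster",                 # hypothetical name
    node_type_id="Standard_DS3_v2",
    num_workers=1,
    spark_version="<runtime-version>",    # left as a placeholder on purpose
)
print(json.dumps(spec, indent=2))
```

Setting an auto-termination time is worth considering, since VMs are billed for as long as the cluster is up.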

Install MeCab on a Databricks cluster

Libraries can be installed from the details screen of the cluster you created.

From there, you can install packages from PyPI and other sources.

The installation succeeded.
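
For reference, the PyPI install done in the UI above can also be described as a request body for the Databricks Libraries REST API (libraries/install), as far as I understand its format. This is only a sketch: the helper name and the cluster ID are hypothetical, and the code just builds the payload without calling anything.

```python
def build_library_install_request(cluster_id, packages):
    """Build a payload that asks a cluster to install PyPI packages."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"pypi": {"package": name}} for name in packages],
    }


request_body = build_library_install_request(
    cluster_id="<your-cluster-id>",       # placeholder, not a real ID
    packages=["mecab-python3"],
)
print(request_body)
```

Depending on the mecab-python3 version, a dictionary package such as unidic-lite may need to be added to the package list as well.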

Create and use Notebook from Workspace

Create a Python notebook from Workspace > Create > Notebook. After that, simply import MeCab to run morphological analysis.
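
Since the screenshot is not reproduced here, this is a minimal sketch of the kind of code such a notebook cell could contain. It assumes mecab-python3 and a usable dictionary are installed on the cluster. The helper names are my own, and the sample text is hand-written in MeCab's default "surface, tab, comma-separated features" layout so that the parsing part can be tried without MeCab.

```python
def parse_mecab_output(raw):
    """Turn MeCab's default text output into (surface, features) pairs."""
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":     # "EOS" marks the end of a sentence
            continue
        surface, _, feature_str = line.partition("\t")
        tokens.append((surface, feature_str.split(",")))
    return tokens


def analyze(text):
    """Run MeCab on text and return the parsed tokens."""
    import MeCab                          # provided by the mecab-python3 package
    tagger = MeCab.Tagger()
    return parse_mecab_output(tagger.parse(text))


# Hand-written sample, not real MeCab output.
sample = "すもも\t名詞,一般\nも\t助詞,係助詞\nEOS\n"
print(parse_mecab_output(sample))
```

In a notebook cell you would call analyze("すもももももももものうち"); the exact feature columns depend on which dictionary is installed.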

Summary

Compared with using Python on Azure Functions, setup was very easy because everything was done in the browser. It is also easier to manage with multiple people, since nobody has to keep a local environment in sync.

Cost

The default instance type is "DS3 v2"; its current price is listed on the pricing page linked below. You are charged for the time (measured in minutes) that the instance is up.

The cluster scales out under heavy load. For example, doubling the number of compute nodes (workers) doubles the charge for them, because both the VM and DBU costs double.
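
The scaling effect can be checked with a few lines of arithmetic. This is only a sketch under simplifying assumptions: the function name and the combined per-worker rate (VM plus DBU) are made up, and the driver node is ignored.

```python
def worker_run_cost(minutes, workers, rate_per_worker_hour):
    """Cost of the worker nodes for one run, based on minutes of uptime."""
    return workers * rate_per_worker_hour * minutes / 60


# Hypothetical combined rate (VM + DBU) of 1.0 per worker per hour.
base = worker_run_cost(minutes=30, workers=2, rate_per_worker_hour=1.0)
scaled = worker_run_cost(minutes=30, workers=4, rate_per_worker_hour=1.0)
print(base, scaled, scaled / base)
```

Doubling the workers doubles the result for the same uptime, which is why a cluster that scales out under load can cost noticeably more than expected.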

Azure Databricks pricing https://azure.microsoft.com/ja-jp/pricing/details/databricks/
