[PYTHON] Databricks

Databricks is a service for building applications that process large amounts of data in parallel. It was developed by the creators of Apache Spark as a managed service for Spark. I've been studying it for the past few days, so I'll write down the points that caught my attention.

In a nutshell, Apache Spark automatically turns code that processes tabular data into parallel jobs and executes them across a cluster. Developers can process huge datasets in parallel as if they were writing pandas code in Jupyter. A machine learning library is also included, so everything from data preprocessing to analysis and prediction can be done within Spark.
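For example, here is a minimal PySpark sketch (the table and column names are made up) showing how DataFrame code reads much like pandas while Spark takes care of distributing the work:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook, spark is already provided; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2023-01-01", "tokyo", 100), ("2023-01-01", "osaka", 80)],
    ["date", "city", "sales"],
)

# Looks like a pandas-style aggregation, but Spark executes it in parallel on the cluster.
df.groupBy("city").agg(F.sum("sales").alias("total_sales")).show()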

Also, the difference from its predecessor (?) Hadoop is that storage is separated out. By specializing in data processing, Spark can be combined with a variety of external storage systems.

Main sources of information

Below, I'll record the points about Databricks that I found hard to understand.

CLI

The CLI itself isn't confusing, but working without it is quite tedious, so installing it is essential. Install it with pip3, then set the connection information with databricks configure and it's ready to use.

# Setup
$ pip3 install databricks-cli
$ databricks configure --token
Databricks Host (should begin with https://): https://hogehoge.cloud.databricks.com/
Token: (Enter the token created by GUI)

# Check that it works
$ databricks fs ls

The databricks command has various subcommands; for databricks fs, a dbfs shorthand command is also installed from the start.

Secret

Passwords and other credentials are stored as Secrets, grouped by scope.

Create scope

databricks secrets create-scope --scope astro_snowflake

List scopes

databricks secrets list-scopes

Add secrets to scope

databricks secrets put --scope hoge_scope --key User
databricks secrets put --scope hoge_scope --key Password

List secrets

databricks secrets list --scope astro_snowflake
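To use these secrets from a notebook, they can be read with dbutils.secrets.get. A small sketch, reusing the scope and key names from the CLI examples above:

# Read the secrets registered above; the values are redacted if printed in notebook output.
user = dbutils.secrets.get(scope="hoge_scope", key="User")
password = dbutils.secrets.get(scope="hoge_scope", key="Password")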

Two file storage areas: DBFS and Workspace

Databricks uses two kinds of file storage: DBFS and the Workspace.

Note that the Workspace cannot be accessed directly from your programs!
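DBFS, on the other hand, can be handled directly from notebook code via dbutils.fs. A minimal sketch (the file path is just an example):

# List the DBFS root, write a small file, and read it back.
display(dbutils.fs.ls("/"))
dbutils.fs.put("/tmp/hello.txt", "hello", True)
print(dbutils.fs.head("/tmp/hello.txt"))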

Relationship between SQL Table and DataFrame

Databricks lets you access tables from Python or SQL. To use the same data from another language, you need to register it as a Table that is visible from SQL.

There are global tables, which can be accessed from anywhere, and local tables, which can only be accessed from the same notebook.

Register the Python DataFrame as a local table called "temp_table_name". You can now refer to it from SQL.

df.createOrReplaceTempView("temp_table_name")
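Once registered, the view can be queried from SQL; from Python the same thing works through spark.sql (in a notebook, a %sql cell would also do):

# Query the temp view registered above via Spark SQL.
spark.sql("SELECT * FROM temp_table_name").show()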

Register the Python DataFrame as a global table called "global_table_name". You can now refer to it from SQL. The global table can be referenced from Data in the Web UI.

df.write.format("parquet").saveAsTable("global_table_name")

Read the table registered with the name "temp_table_name" as a Python DataFrame.

temp_table = spark.table("temp_table_name")

The global table is stored on DBFS. You can confirm where it is saved from the location shown by DESCRIBE DETAIL.

DESCRIBE DETAIL `global_table_name`

For example: dbfs:/user/hive/warehouse/global_table_name
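The same check can be run from Python by wrapping the statement in spark.sql; a small sketch:

# DESCRIBE DETAIL returns a DataFrame; the location column shows where the table lives on DBFS.
spark.sql("DESCRIBE DETAIL global_table_name").select("location").show(truncate=False)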

Notebook magic commands

https://docs.databricks.com/notebooks/notebooks-use.html

Magic commands such as %python, %sql, %scala, %r, %md, %sh, %fs, and %run switch the language of a single cell or run shell, DBFS, and other utility operations from within a notebook.
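For example, assuming the notebook's default language is Python, a single cell can be switched to SQL just by starting it with the magic command (the table name reuses the example above):

%sql
-- This cell runs as SQL even though the notebook's default language is Python.
SELECT * FROM temp_table_name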
