[PYTHON] [Old article] Data Science Experience (DSX) is now available on the Lite plan (free indefinitely) on IBM Cloud, so I tried it out ★ 2017/11 Update

image

Warning: This article was first posted in June 2017, and as of February 2019 it is already out of date. The article itself is left as is for archival purposes, but please do not rely on its contents. Alternative articles include:

- A quick introduction to the AI platform "Watson Studio" announced at IBM's Think 2018


[Note] This article was uploaded in 2017/06 with the title "The Free version of Data Science Experience (DSX) is now available on Bluemix, so I tried it." However, because the UI changed in 2017/11 with the introduction of the Lite account and updates to DSX / WML, I reviewed the text and retook the screenshots. The content of the article is almost the same as before. Changed parts are marked with :new:.

Introduction

Hello! On 2017/06/01, **the Data Science Experience icon appeared on IBM Cloud!** Even if I get excited about that, I suspect most people will react coolly with "What is Data Science Experience?" (sigh). There are already some articles on Data Science Experience (DSX) on Qiita, but now that it has been registered in the IBM Cloud catalog, I would like to briefly introduce what it is.

What is Data Science Experience (DSX)?

(For skilled Qiita readers, I think the following description is the quickest.) In short, it is a SaaS service that provides, as a set, the development and execution environments for the open source data science tools below, which have been gaining momentum recently. The assumed users are teams of data scientists who can code. (For those who dislike coding, :new: SPSS is also available on DSX! :-))

- Scala / Python on Jupyter Notebook (*)
- R on RStudio
- Spark Cluster
- Brunel (visualization) / Apache Toree (Spark and Jupyter integration), etc.

In addition, it comes with the following:

- Articles, tutorials, and open data for study
- Collaboration functions for analysis teams
- GitHub integration for notebooks

(*) Since team development is possible, perhaps it actually uses JupyterHub? I don't know what is used internally.

image

What is good about it?

Well, at present **the point is that it is a SaaS service that integrates open source software**, so you could build a similar environment yourself. Still, I think it has the following advantages.

- (Because it is SaaS) There is no need to arrange infrastructure or set up the environment in the first place.
- No knowledge of infrastructure settings, such as integrating Jupyter with Spark, is required.
- Therefore, you can start writing code immediately (or try things out and study immediately).
- The multilingual (polyglot) environment eliminates the need for analysis teams to "unify tools and environments".
- No need to build and operate a Spark Cluster environment (this is quite a lot of work).
- Easy integration with services on IBM Cloud such as dashDB and Object Storage.
- You can easily publish your notebooks to GitHub.

DSX seems to be particularly focused on **"increasing the productivity of the analysis team"**. Each data scientist has favorite languages and tools they are good at, such as "I want to do it in R" or "Well, it's the age of AI, so Python." If you analyze on your own, you can use whatever you like, but that does not hold when you do analysis as a **"team"** and as **"work"**. Unless the languages and tools are unified, it is inconvenient for the team to evaluate and share analysis results. On the other hand, being told "this analysis work will be done in xxx" is quite painful and demoralizing. DSX seems to be aiming for an environment where team members can **each analyze with their favorite language and tools** and still **collaborate** on the deliverables. (This can also be inferred from the fact that the pricing is set not per user but per five users.)

History of DSX on IBM Cloud

Data Science Experience itself was offered in 2016 as a standalone SaaS service, independent of Bluemix, but only as a 30-day trial (that is, it could not be used after the trial period ended). :new: After that, it was listed in the Bluemix catalog and a Free version was provided in 2017/06. Then, in 2017/11, along with the name change from Bluemix to IBM Cloud, the Lite plan, which is free indefinitely, was introduced, and DSX and WML are also available on the Lite plan. The point is that (although resources are limited) **the Lite plan lets you try it for free, indefinitely**, so I think it is a good place to start "studying Jupyter / Python / Scala + Spark". (Tutorials for studying and sample notebooks are also available.)

By the way, the resources available in the Lite plan are as follows. They are small, but I think they are sufficient for studying. (The Lite plan has the same functionality as the paid Enterprise version; only the available machine resources and the number of Spark Clusters differ.)

Data Science Experience

image

Trying it out

Below, as an introduction to the functions of DSX in the free environment, I will go from creating a project to running an existing notebook that explains Python / Spark. In DSX, resources such as notebooks and data are collected, managed, and shared in a management unit called a "project".

image

First, create a DSX service instance on IBM Cloud

Log in to IBM Cloud and select Data Science Experience from the catalog. image

On the next screen, give the service a name of your choice, select the Lite plan, and then click "Create". Warning: for the Lite plan, set **"Deployment area" to "Southern United States" (US South)**. As of November 2017, only "Southern United States" is available for Lite plans. (That seems reasonable, since it is the region with the largest selection of services.)

image

When the screen changes, click "Get Started".

image

Select the IBM Cloud organization and space to be used with DSX and click "Continue". (The defaults should be fine.)

image

Wait for a while, and when the status becomes Done, click "Get Started".

image

Introducing the menu

Below is the initial screen of DSX. :new: The 2017/11 update made it look cooler.

- This panel is displayed by clicking "Get Started" in the upper right.

image

- ① The center of operations: creating projects and setting up data sources
- ② Links to documents and various settings
- ③ Shortcut icons

The menu in ① is as follows.

image

- Projects: access to created projects and notebooks
- Tools: access to Jupyter and RStudio
- Data Services: definition of various data sources such as databases and storage

:new: Although still in beta, SPSS Modeler and Stream Designer have also been added.

The bottom of the screen:

image

- ④ Recently used projects
- ⑤ Community resources include many blog articles and tutorials, so you can start studying right away from here.
- ⑥ Click here to contact DSX support. (I have never used it.)

image

Creating a project

Click "Create Project" from the shortcuts in ③.

image

Enter a project name of your choice in the Name field.

image

**To use DSX, instances of ① Spark and ② Object Storage are required.** You can also create these for free with the Lite plan. If they are not yet defined, you can create them right away by clicking the links on this panel; after creating them, click "Reload" and specify the instances to use. (If they are already defined, just select them.)

[If the account does not have an instance] image

After specifying the instances, click "Create".

image

The project is ready. It is still empty, but you can see that the structure is such that notebooks and data assets are stored in the project. From here you can create new notebooks and machine learning models.

image

Creating a new notebook

Create a new notebook by clicking "Add notebooks" in the upper right.

image

Enter a name of your choice in Name, select the language and Spark version, and click "Create Notebook". Here I chose the latest versions, Python 3.5 / Spark 2.1.

image

As a result, the familiar Jupyter Notebook environment is created as shown below. The menu and color scheme at the top differ from the open source Jupyter Notebook, but since it is Jupyter underneath, those who already have Jupyter experience will not get lost.

image

By the way, the following menu items in the upper right are DSX-specific functions.

image

| # | Explanation |
|---|---|
| ① | Publish the notebook to GitHub |
| ② | Share the notebook via direct link, Twitter, or LinkedIn |
| ③ | Schedule notebook runs |
| ④ | Insert a project token (*) |
| ⑤ | Information about this notebook, such as environment and creation date |
| ⑥ | Save notebook versions (up to 10) |
| ⑦ | Add comments |
| ⑧ | Connect files or data sources |
| ⑨ | Search bookmarks and community resources |

(*) A project token is authentication information for accessing data. See here for more information.

Once the notebook opens, all you have to do is start coding. As shown below, the Spark Context has already been initialized, and numpy, pandas, matplotlib, and other standard Python libraries for data science are ready to use. By the way, seaborn was not included, but I was able to install it with `!pip install seaborn`. In this way, it is easy to add a library that is missing.
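As a rough way to check this for yourself, the following sketch lists which of the libraries mentioned above are importable in the current kernel and whether a variable named `sc` has been predefined. It uses only the Python standard library. The helper name `missing_libraries` is made up for this illustration, and `sc` as the variable name for the Spark Context is an assumption here rather than something stated in this article.

```python
import importlib.util


def missing_libraries(names):
    """Return the names from `names` that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Libraries mentioned in this article; seaborn was the one that had to be added.
wanted = ["numpy", "pandas", "matplotlib", "seaborn"]
missing = missing_libraries(wanted)

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All of the listed libraries are available.")

# Assumption: the notebook exposes its Spark Context under the name `sc`.
print("Spark Context predefined:", globals().get("sc") is not None)
```

Anything reported as not installed can then be added with pip from within the notebook.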

image

Using the notebooks prepared in advance

For someone who is just starting to study, it is hard to begin from nothing, but DSX has many notebooks (in English) that let you study by reading the explanations and actually running the code. Let's try running an existing notebook on using Spark with Python.

Search for "Apache Spark Lab" in Community Notebooks and you will find the following three-part set of notebooks. Double-click Part 1 to open it.

image

A notebook with explanations will open as shown below. Select "Copy" from the icon in the upper right.

image

Select the project name and the Spark environment to use, and click "Create Notebook".

image

After a short wait, the notebook is copied to your environment and becomes usable, as shown below.

image

As preparation before execution, clear any previous output that remains: "Cell" - "All Output" - "Clear".

image

After that, just execute the cells while reading the explanations. I think it is good for studying because you can immediately try what you have just read. (By the way, to step through the cells, use the following button or "Shift + Enter".)

image

The contents of this notebook are beyond the scope of this article, so I will omit them, but there are various other notebooks, so you can choose the theme you are interested in and study in the same way.

That concludes "I tried it out".

To collaborate as a team

To collaborate with multiple members on a single project, follow the steps below. As far as I could tell from trying it, this also works with Lite accounts.

  1. Click "User Invitation" in the "Administration" - "Accounts" - "Users" panel in the upper right menu of IBM Cloud.

image

  2. Enter the email address of the user you want to invite, set the appropriate access rights, and then click the "Invite User" button.

image

  3. The following email is sent to the invited member, who accepts the invitation with "Join Now" and signs up for IBM Cloud.

image

image

image

  4. When the invited member logs in to IBM Cloud, the inviter's DSX-related services are available as shown below. (However, the project cannot be used yet.)

image

  5. The invited member signs up at the DSX site (https://datascience.ibm.com/). This associates the IBM Cloud account with a DSX account.

image

Since the member already has an ID for logging in to IBM Cloud, sign up via "Already have an IBM Cloud account?" at the bottom right. At this point, however, nothing is visible yet, because the inviter has not shared the project.

image

  6. The inviting administrator opens the project to be shared and, under "Add new collaborators", adds the invited member with "Add", choosing the appropriate permissions. Once the ID has been added as a collaborator, click the "Invite" button.

image

image

  7. After the above operation, the new member is notified and the project becomes visible.

image

image

Both an IBM Cloud account and a DSX account are involved here, which makes things complicated, so please refer to the document "Set up an enterprise account".

Note that the notebook is locked while someone is editing it so that multiple people do not update the same notebook.
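To illustrate the idea only, here is a minimal sketch of an exclusive edit lock. How DSX actually implements locking is not described in this article, so the class `NotebookLock`, its methods, and the user names are all hypothetical.

```python
class NotebookLock:
    """Hypothetical exclusive edit lock for a single notebook."""

    def __init__(self):
        self.editor = None  # user currently editing, or None if unlocked

    def acquire(self, user):
        """Give `user` the lock if nobody else holds it; return True on success."""
        if self.editor is None or self.editor == user:
            self.editor = user
            return True
        return False

    def release(self, user):
        """Release the lock, but only if `user` is the one holding it."""
        if self.editor == user:
            self.editor = None


lock = NotebookLock()
print(lock.acquire("alice"))  # alice opens the notebook for editing
print(lock.acquire("bob"))    # bob tries while alice is still editing
lock.release("alice")
print(lock.acquire("bob"))    # bob tries again after alice has finished
```

The second attempt is refused because the lock already has a holder, which is the behavior the note above describes: only one person at a time can update a given notebook.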

In fact, DSX also has an on-premises version

Although not covered in this article, DSX also comes as DSX Local, which runs in a private cloud, and DSX Desktop, which can be used on a desktop (open beta as of June 2017). If you are interested, please search the DSX documentation or the Internet.

image

Integration with Watson Machine Learning is also progressing

DSX and WML are separate services on IBM Cloud, but their integration is steadily progressing. If you do data science / predictive analytics on IBM Cloud, you will probably use both. Watson Machine Learning is also available for free with the Lite plan, so please try it.
