A note about connecting Spark to OpenStack Swift-based IBM Object Storage

Spark supports various data sources, but the lowest hurdle for getting started is often to drop files into object storage as they are. Big data workloads like Spark usually assume log files, yet ordinary business data cannot be avoided either, and that means CSV first. This is a note from experimenting a little around CSV. (On IBM Data Scientist Experience, with Python 2 and Spark 2.0)

Preparation

Upload the CSV files to Swift-based IBM Object Storage using the "Drop your file here or browse your files to add a new file" area on the right side of the screen. I uploaded baseball.csv, cars.csv, and whiskey.csv. (Screenshot: the uploaded files listed in Object Storage)

I tried to read the data from Object Storage

First I tried the code that DSX inserts ("Insert to code"). (Screenshot: the inserted code)

The inserted code does two things: it sets the Hadoop configuration, and it loads the CSV data into a DataFrame.

(1) Set the Hadoop configuration from Spark. The configuration parameters listed in "Accessing OpenStack Swift from Spark" are set to match the file uploaded to IBM Object Storage.

test.py


# @hidden_cell
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_xxxxxxxxxxxxxxxxxxcxxxxxcc(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage V3 using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', '4dbbbca3a8ec42eea9349120fb91dcf9')
    hconf.set(prefix + '.username', 'xxxxxxxxxcxxxxxxcccxxxxxccc')
    hconf.set(prefix + '.password', 'Xxxxxxxxxxxxxxxxx')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', True)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_xxxxxxxxxxxxxxxxxxcxxxxxcc(name)

(2) Load the CSV data into a Spark DataFrame

test.py


df_data_1 = sqlContext.read.format('com.databricks.spark.csv')\
    .options(header='true', inferschema='true')\
    .load("swift://PredictiveAnalyticsProject2." + name + "/whiskey.csv")

This uses the spark-csv package (com.databricks.spark.csv), which was necessary before Spark 2.0. Since Spark 2.0, Spark DataFrames can handle CSV directly and read.csv() can be used instead, which is convenient. (Screenshot: the read.csv() version)
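As a minimal sketch of the Spark 2.0 style (assuming the same container, the name set above, and the spark SparkSession that DSX notebooks provide):


# A sketch only: read the same CSV with the built-in reader (Spark 2.0+)
df_data_1 = spark.read.csv(
    "swift://PredictiveAnalyticsProject2." + name + "/whiskey.csv",
    header=True,       # the first line is a header
    inferSchema=True)  # infer column types from the data
df_data_1.show(5)      # peek at the first rows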

I tried to write data to Object Storage

I wrote the contents of the Spark DataFrame loaded from whiskey.csv back to Object Storage under the name whiskey_new.csv.

(1) Output with the Spark DataFrame's write. Simply calling write.csv() is enough, as sketched below. (Screenshot: the write.csv() call)
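A minimal sketch of that call (assuming df_data_1 and name from the steps above; the mode argument is discussed next):


# A sketch only: write the DataFrame back to Object Storage as CSV
df_data_1.write.csv(
    "swift://PredictiveAnalyticsProject2." + name + "/whiskey_new.csv",
    mode='overwrite',  # one of the Save Modes: replace any existing output
    header=True)       # include a header line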

As for mode, the Save Modes shown here can be used. Looking at the output, it was split into multiple part files rather than a single CSV, as in the screenshot below. I think this is because the write is processed by multiple nodes. (Is that a problem? When this output is read back into Spark with read.csv, the whole directory is treated as one CSV, so there seems to be no issue.) (Screenshot: the part files in Object Storage) Still, I am not the only one who wants a single output file; there is a Q&A asking whether it can be written as one file: http://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv
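The usual workaround from that Q&A is to coalesce the DataFrame to one partition before writing; a hedged sketch (reasonable only for small data, since the whole write then runs through a single task; the whiskey_new_single.csv name is just for illustration):


# A sketch only: coalesce(1) leaves a single partition, so Spark
# writes a single part file; avoid this for large data
df_data_1.coalesce(1).write.csv(
    "swift://PredictiveAnalyticsProject2." + name + "/whiskey_new_single.csv",
    mode='overwrite', header=True)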

(2) I tried the REST API without using Spark. It is probably not suitable for handling large amounts of data, but with the following code you can PUT a single CSV to Object Storage. (I just changed the GET in the code that DSX inserts for creating a pandas DataFrame into a PUT. The first API call, a POST, obtains the authentication information, and the upload destination url2 is assembled from the response.)

put_sample.py


import json
import requests

def put_object_storage_file_with_credentials_xxxxxxxxxx(container, filename, indata):
    # POST to the Keystone V3 endpoint to obtain an authentication token
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_1825cd3bc875420fc629ccfd22c22e20433a7ac9',
            'domain': {'id': '07e33cca1abe47d293b86de49f1aa8bc'},
            'password': 'xxxxxxxxxx'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # Find the public object-store endpoint for the dallas region and
    # assemble the upload URL (url2) from it
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == 'dallas':
                    url2 = ''.join([e2['url'], '/', container, '/', filename])
    # PUT the data using the token returned in the response headers
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.put(url=url2, headers=headers2, data=indata)
    print resp2
    return resp2.content


# df_data_2 is a pandas DataFrame; to_csv(index=False) turns it into one CSV string
put_object_storage_file_with_credentials_xxxxxxxxxx('PredictiveAnalyticsProject2', 'whiskey_new_s.csv', df_data_2.to_csv(index=False))

Postscript

In the same spirit as the REST API PUT above, there should be a way to use Python's swiftclient, but it did not seem to work in the IBM Data Scientist Experience environment. See "Using IBM Object Storage in Bluemix, with Python". (Screenshot: the swiftclient attempt)
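For reference, a rough sketch of what the swiftclient approach would look like (untested here, since it did not work for me in DSX; the credential values are the same placeholders as in the REST example above):


# A sketch only: python-swiftclient against the Keystone V3 endpoint
from swiftclient.client import Connection

conn = Connection(
    authurl='https://identity.open.softlayer.com/v3',
    user='member_1825cd3bc875420fc629ccfd22c22e20433a7ac9',
    key='xxxxxxxxxx',  # placeholder password, as above
    auth_version='3',
    os_options={'project_id': '4dbbbca3a8ec42eea9349120fb91dcf9',
                'user_domain_id': '07e33cca1abe47d293b86de49f1aa8bc',
                'region_name': 'dallas'})
conn.put_object('PredictiveAnalyticsProject2', 'whiskey_new_s.csv',
                contents=df_data_2.to_csv(index=False))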
