(Note) Basic statistics on Python & Pandas on IBM DSX

I tried Python & Pandas

This is a note of the script I expect to run every time I analyze data in Python from now on. It runs on Python 2 with Spark 2.0 in IBM's Data Science Experience (DSX) environment, although nothing here actually needs Spark. Because real analysis work involves a large number of fields, I looked for an approach that avoids hard-coding field names (column names) in the script, so the analysis can be done efficiently. I also try expanding categorical data into flag columns, which is the equivalent of "field reorganization" in SPSS Modeler and is needed when preparing data for machine learning. I did not cover missing values this time, so I will leave that for another occasion. (The data has already been loaded into df_wiskey, the DataFrame used in this article.)
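The idea of not hard-coding column names can be sketched as follows. This is a minimal illustration, not the article's own code: `df_sample` and its columns are hypothetical stand-ins for df_wiskey, and `select_dtypes` is used here in place of the `dtypes == ...` comparison that appears in the cells below.

```python
import pandas as pd

# Hypothetical stand-in for df_wiskey; the real columns and values are assumptions.
df_sample = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Country': ['Scotland', 'USA', 'Japan'],
    'Rating': [90.0, 85.5, 88.0],
    'Price': [60.0, 35.0, 120.0],
})

# Pick columns by dtype, so no column name is written into the analysis code.
numeric_cols = df_sample.select_dtypes(include=['number']).columns.tolist()
category_cols = df_sample.select_dtypes(exclude=['number']).columns.tolist()

print(numeric_cols)
print(category_cols)
```

Every later step can then loop over `numeric_cols` or `category_cols` instead of naming fields one by one.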

#First, check the contents of the DataFrame
df_wiskey.head(10)
#Next, check the data types of the columns (fields) (the types are left fairly rough this time)
df_wiskey.dtypes
#Basic statistics of numerical data
df_wiskey.describe()
#Graph the distribution of numerical data

#Put matplotlib in inline mode
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

for x in df_wiskey.columns[df_wiskey.dtypes == 'float64']:
    xdesc = df_wiskey[x].describe()
    plt.hist(df_wiskey[x] , range=(xdesc['min'], xdesc['max']) )
    plt.title( x )
    plt.show()
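#Correlation between each pair of numerical variables
# (numeric columns are selected explicitly, because recent pandas versions raise an error on non-numeric columns)
df_wiskey.select_dtypes(include=['number']).corr()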
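When there are many fields, the full correlation matrix gets hard to read. One possible follow-up, sketched here under assumptions, is to list each pair of variables once and sort by the strength of the correlation. `df_sample` and its column names are hypothetical; the real df_wiskey columns are not shown in the article.

```python
import pandas as pd

# Hypothetical numeric data standing in for df_wiskey.
df_sample = pd.DataFrame({
    'Rating': [1.0, 2.0, 3.0, 4.0],
    'Price': [2.0, 4.0, 6.0, 8.0],
    'Age': [4.0, 1.0, 3.0, 2.0],
})

corr = df_sample.select_dtypes(include=['number']).corr()
cols = corr.columns

# Walk the upper triangle so each pair appears once and self-pairs are skipped.
rows = []
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        rows.append((cols[i], cols[j], corr.iloc[i, j]))

pairs = pd.DataFrame(rows, columns=['var1', 'var2', 'corr'])
pairs['abs_corr'] = pairs['corr'].abs()
# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values('abs_corr', ascending=False).reset_index(drop=True)
print(pairs)
```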
#Data other than numerical data
df_wiskey[df_wiskey.columns[df_wiskey.dtypes == 'object']].head(5) 
#Count how often each value appears in the non-numerical columns (assumed to be categorical values)
for x in df_wiskey.columns[df_wiskey.dtypes == 'object']:
    valcal = df_wiskey[x].value_counts()
    #print(...) with a single argument behaves the same on Python 2 and Python 3
    print('-- ' + x + ' -----------------------------------')
    print(valcal.head(10))
    print('--------------------------------------------')
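Raw counts are easier to compare across columns when they sit next to proportions. The sketch below is one way to do that with `value_counts(normalize=True)`; `df_sample` is a hypothetical stand-in for df_wiskey, and the `count` / `share` column labels are names chosen only for this example.

```python
import pandas as pd

# Hypothetical categorical data standing in for df_wiskey.
df_sample = pd.DataFrame({'Country': ['Scotland', 'Scotland', 'USA', 'Japan']})

for col in df_sample.select_dtypes(exclude=['number']).columns:
    counts = df_sample[col].value_counts()
    shares = df_sample[col].value_counts(normalize=True)
    # Put the count and the share of each value side by side, most frequent first.
    summary = pd.DataFrame({'count': counts, 'share': shares})
    summary = summary.sort_values('count', ascending=False)
    print('-- ' + col)
    print(summary.head(10))
```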
#Cross tabulation between categorical columns (simple, though the display looks a little odd)
import pandas as pd
pd.crosstab(df_wiskey.Country, df_wiskey.Category)
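If row and column totals would help when reading the cross tabulation, `pd.crosstab` accepts `margins=True`. A minimal sketch, using a hypothetical `df_sample` in place of df_wiskey:

```python
import pandas as pd

# Hypothetical data standing in for df_wiskey.
df_sample = pd.DataFrame({
    'Country': ['Scotland', 'Scotland', 'USA', 'USA', 'Japan'],
    'Category': ['Single Malt', 'Blended', 'Bourbon', 'Bourbon', 'Single Malt'],
})

# margins=True adds an "All" row and column holding the totals.
ct = pd.crosstab(df_sample['Country'], df_sample['Category'], margins=True)
print(ct)
```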
#Heatmap of Country vs Category (Bourbon is concentrated in the USA; Single Malt covers most countries)
df_wiskey_pd = df_wiskey.pivot_table(columns='Country', index='Category', values='Name', aggfunc='count')
plt.imshow(df_wiskey_pd , aspect= 'auto' ,interpolation='nearest')
plt.colorbar()
plt.xticks(range(df_wiskey_pd.shape[1]), df_wiskey_pd.columns , rotation='vertical')
plt.yticks(range(df_wiskey_pd.shape[0]), df_wiskey_pd.index)
plt.show()
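A count-based pivot table leaves NaN wherever a Country/Category combination never occurs, and those cells are drawn as blanks in the heatmap. If zeros are preferred, `pivot_table` takes a `fill_value` argument. This is a sketch with a hypothetical `df_sample`, not the article's data:

```python
import pandas as pd

# Hypothetical data standing in for df_wiskey.
df_sample = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Country': ['Scotland', 'USA', 'USA', 'Japan'],
    'Category': ['Single Malt', 'Bourbon', 'Bourbon', 'Single Malt'],
})

# fill_value=0 replaces the NaN of combinations that never occur.
counts = df_sample.pivot_table(index='Category', columns='Country',
                               values='Name', aggfunc='count', fill_value=0)
print(counts)
```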
#Expand the Country column into one T/F flag field per country so it can be fed into a modeling technique
# (The column names are Country_XXXXXXXX)
for x in df_wiskey.groupby('Country').count().index:
    x1 = 'Country_' + x
    df_wiskey[x1] = 'F'
    #Where the Country column equals x, change Country_<x> to T
    df_wiskey.loc[df_wiskey.Country == x, x1] = 'T'
#Display only the first 3 rows
df_wiskey.head(3)
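The same flag columns can probably be built without an explicit loop by using `pd.get_dummies`. This is an alternative sketch, not the method used above: `df_sample` is a hypothetical stand-in for df_wiskey, and the conversion to 'T'/'F' strings is added only to mirror the article's output format.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for df_wiskey.
df_sample = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Country': ['Scotland', 'USA', 'Scotland'],
})

# One indicator column per country, named Country_<value>.
dummies = pd.get_dummies(df_sample['Country'], prefix='Country')

# get_dummies yields 0/1 or booleans depending on the pandas version,
# so cast to bool first and then map to 'T' / 'F'.
flags = pd.DataFrame(
    np.where(dummies.astype(bool), 'T', 'F'),
    index=dummies.index,
    columns=dummies.columns,
)

df_flagged = pd.concat([df_sample, flags], axis=1)
print(df_flagged.head(3))
```

Many modeling libraries accept the 0/1 indicators directly, in which case the 'T'/'F' mapping step can be dropped.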

Postscript

Data Science Experience notebooks turn out to be pretty easy to use.
