pandas is a data analysis library whose development began in the financial industry around 2008. Its author, Wes McKinney, worked at the well-known hedge fund AQR Capital Management. For that reason, it offers a number of powerful features as a practical analysis tool for financial and economic data.
We will use pandas to analyze a dataset obtained from Yahoo! Finance. This time we use daily closing prices for several stocks and for the S&P 500 Index (identified as SPX).
pandas has input and output functions for formats such as CSV and JSON.
Function | Description |
---|---|
read_csv | Read comma-delimited (',') data |
read_table | Read delimited data; the default delimiter is a tab ('\t') |
read_json | Read JSON data |
read_msgpack | Read msgpack data (removed in pandas 1.0) |
read_pickle | Read pickled (binary) Python objects |
Each of these readers has a paired to_XXX method on the data frame (for example, to_csv and to_json), so data can be written back out in the same formats. It is convenient not to have to call a CSV or JSON parser yourself.
import pandas as pd
stock = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
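As a rough illustration of the paired readers and writers, the sketch below round-trips a small frame through CSV text. The three-row `sample` frame is a hypothetical stand-in for stock_px.csv, and `io.StringIO` is used only so that nothing is written to disk.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for stock_px.csv
sample = pd.DataFrame(
    {"AAPL": [7.40, 7.45, 7.45], "SPX": [909.03, 908.59, 929.01]},
    index=pd.to_datetime(["2003-01-02", "2003-01-03", "2003-01-06"]),
)

# With no path argument, to_csv returns the CSV as a string
csv_text = sample.to_csv()

# Read it back with the same options used for stock_px.csv
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

# to_json follows the same pattern and returns a JSON string
json_text = sample.to_json(orient="split")
```

Passing a file name to `to_csv` or `to_json` instead writes the output to disk.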
When data is read from CSV, an index is created for it automatically. You can also build a new object that conforms to a different, more suitable index.
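A minimal sketch of re-indexing, assuming a hypothetical price series in which one trading day is missing. `reindex` returns a new object aligned to the new index and leaves the original untouched.

```python
import pandas as pd

# Hypothetical prices with no entry for 2003-01-06
prices = pd.Series(
    [7.40, 7.45, 7.43],
    index=pd.to_datetime(["2003-01-02", "2003-01-03", "2003-01-07"]),
)

# New index covering every business day in the range
business_days = pd.date_range("2003-01-02", "2003-01-07", freq="B")

# Dates absent from the original series become NaN in the new object
aligned = prices.reindex(business_days)

# Optionally carry the last known price forward into the gap
filled = aligned.ffill()
```

Whether forward-filling is appropriate depends on the analysis; it is shown here only as one option.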
Another feature of pandas is that it handles missing values well. Real-world data is rarely clean and complete, so the descriptive statistics on pandas objects exclude missing values by default. You can also set a threshold for how many missing values to tolerate, or fill the gaps with a specified value.
Computing summary statistics, aggregating, and grouping by index level are all straightforward.
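The missing-value handling described here can be sketched on a small hypothetical frame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": [7.0, 8.0, 9.0],
    }
)

# Statistics skip NaN by default: the mean of "a" uses only 1.0 and 3.0
col_means = df.mean()

# Threshold: keep only rows with at least 2 non-missing values
kept = df.dropna(thresh=2)

# Fill the remaining gaps with a chosen value
filled = df.fillna(0.0)
```

Here the middle row has a single non-missing value, so `dropna(thresh=2)` drops it and keeps the other two.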
stock.head(10)  # Show only the first 10 rows
# =>
# AAPL MSFT XOM SPX
# 2003-01-02 7.40 21.11 29.22 909.03
# 2003-01-03 7.45 21.14 29.24 908.59
# 2003-01-06 7.45 21.52 29.96 929.01
# 2003-01-07 7.43 21.93 28.95 922.93
# 2003-01-08 7.28 21.31 28.83 909.93
# 2003-01-09 7.34 21.93 29.44 927.57
# 2003-01-10 7.36 21.97 29.03 927.57
# 2003-01-13 7.32 22.16 28.91 926.26
# 2003-01-14 7.30 22.39 29.17 931.66
# 2003-01-15 7.22 22.11 28.77 918.22
stock['AAPL'].sum()  # Sum
# => 277892.75
stock['AAPL'].mean()  # Arithmetic mean
# => 125.51614724480578
stock['AAPL'].median()  # Median
# => 91.45500000000001
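Grouping by index level is not shown in the calls above, so here is a small sketch of it. The two-level index (`year`, `quarter`) and the values are hypothetical.

```python
import pandas as pd

# Hypothetical series with a two-level index
index = pd.MultiIndex.from_tuples(
    [(2003, "Q1"), (2003, "Q2"), (2004, "Q1"), (2004, "Q2")],
    names=["year", "quarter"],
)
closes = pd.Series([7.0, 9.0, 11.0, 15.0], index=index)

# Aggregate over the "year" level of the index
yearly_mean = closes.groupby(level="year").mean()

# describe() gathers count, mean, std, min, quartiles and max in one call
summary = closes.describe()
```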
Let's compute, year by year, how strongly each stock's daily returns correlate with those of SPX.
rets = stock.pct_change().dropna()  # Daily returns (percent change of the closing price)
spx_corr = lambda x: x.corrwith(x['SPX'])
stock_by_year = rets.groupby(lambda x: x.year)  # Group rows by the year of the index date
result_1 = stock_by_year.apply(spx_corr)  # Yearly correlation of daily returns with SPX
print(result_1)
# => AAPL MSFT XOM SPX
# 2003 0.541124 0.745174 0.661265 1
# 2004 0.374283 0.588531 0.557742 1
# 2005 0.467540 0.562374 0.631010 1
# 2006 0.428267 0.406126 0.518514 1
# 2007 0.508118 0.658770 0.786264 1
# 2008 0.681434 0.804626 0.828303 1
# 2009 0.707103 0.654902 0.797921 1
# 2010 0.710105 0.730118 0.839057 1
# 2011 0.691931 0.800996 0.859975 1
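import matplotlib.pyplot as plt

plt.figure()  # Create a new figure
result_1.plot()  # Plot with matplotlib
plt.savefig("image.png")  # Save before show(); saving afterwards can produce a blank image
plt.show()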
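To make the two steps behind the yearly table concrete, here is a sketch on a tiny hypothetical price table. The column names `STOCK` and `INVERSE` and all values are invented. `pct_change` turns prices into day-over-day returns, and `corrwith` correlates every column with the chosen one.

```python
import pandas as pd

# Hypothetical prices: STOCK moves with SPX, INVERSE moves against it
prices = pd.DataFrame(
    {
        "STOCK": [10.0, 11.0, 9.9, 10.89],
        "INVERSE": [10.0, 9.0, 9.9, 8.91],
        "SPX": [100.0, 110.0, 99.0, 108.9],
    }
)

# (today - yesterday) / yesterday; the first row has no previous day, so drop it
rets = prices.pct_change().dropna()

# Correlation of every column's returns with the SPX returns
corr = rets.corrwith(rets["SPX"])
```

In this constructed case the correlation is close to 1 for `STOCK` and close to -1 for `INVERSE`. Real return series fall somewhere in between.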
Next, find the yearly correlation between two columns.
result_2 = stock_by_year.apply(lambda g: g['AAPL'].corr(g['MSFT'])) #Correlation between Apple and Microsoft
print( result_2 )
# =>
# 2003 0.480868
# 2004 0.259024
# 2005 0.300093
# 2006 0.161735
# 2007 0.417738
# 2008 0.611901
# 2009 0.432738
# 2010 0.571946
# 2011 0.581987
plt.figure()
result_2.plot()
plt.savefig("image2.png")  # Save before show(); saving afterwards can produce a blank image
plt.show()
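As a sanity check on what `corr` computes, the sketch below compares it with the Pearson formula written out by hand: covariance divided by the product of the standard deviations. The two return series are hypothetical.

```python
import pandas as pd

# Hypothetical daily returns for two stocks
x = pd.Series([0.01, -0.02, 0.03, 0.00])
y = pd.Series([0.02, -0.01, 0.02, 0.01])

# Pearson correlation by hand: cov(x, y) / (std(x) * std(y))
manual = x.cov(y) / (x.std() * y.std())

# The same quantity from the built-in method (Pearson is the default)
builtin = x.corr(y)
```

The two values agree up to floating-point rounding, and the result always lies between -1 and 1.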
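Fit a linear regression to each year's data by ordinary least squares.
import statsmodels.api as sm

def regression(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # Copy so that adding a column does not touch the group's data
    X['intercept'] = 1.  # Constant column for the intercept term
    result = sm.OLS(Y, X).fit()
    return result.params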
result_3 = stock_by_year.apply(regression, 'AAPL', ['SPX'])
print(result_3)
# => SPX intercept
# 2003 1.195406 0.000710
# 2004 1.363463 0.004201
# 2005 1.766415 0.003246
# 2006 1.645496 0.000080
# 2007 1.198761 0.003438
# 2008 0.968016 -0.001110
# 2009 0.879103 0.002954
# 2010 1.052608 0.001261
# 2011 0.806605 0.001514
plt.figure()
result_3.plot()
plt.savefig("image3.png")  # Save before show(); saving afterwards can produce a blank image
plt.show()
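For readers without statsmodels, the same least-squares problem can be sketched with `numpy.linalg.lstsq`. This is a substitute technique, not the article's `sm.OLS` call. The data is synthetic and noise-free, generated from an assumed slope of 1.2 and intercept of 0.001, so the fit should recover those values.

```python
import numpy as np

# Synthetic "index" returns (hypothetical, seeded for reproducibility)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.01, size=250)

# Synthetic "stock" returns built from an assumed slope and intercept, no noise
y = 1.2 * x + 0.001

# Design matrix: the regressor plus a column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# Solve min ||X @ coef - y||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = coef
```

The column of ones plays the same role as the `intercept` column added in `regression`.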
Reference: Introduction to Data Analysis with Python: Data Processing Using NumPy and pandas (O'Reilly Japan) http://www.oreilly.co.jp/books/9784873116556/