[PYTHON] Analyze stock prices using pandas data aggregation and group operations

Splitting data into categories and applying a function to each category is called aggregation or transformation. These operations are among the most important parts of the data analysis workflow, and pandas provides powerful group operations that can be used intuitively.

A well-known paper by Hadley Wickham, author of many packages for the R language, The Split-Apply-Combine Strategy for Data Analysis (PDF), describes the group operation process as "split-apply-combine". Pandas uses this model as the basis for its group aggregation operations: the data is first split by one or more keys, a function is then applied to each group, and finally the results of those function applications are combined into an object holding the result.
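As a minimal sketch of the split-apply-combine idea (with made-up data, unrelated to the stock prices used below):

```python
import pandas as pd

# Made-up daily prices for two hypothetical tickers
df = pd.DataFrame({
    'ticker': ['A', 'A', 'B', 'B'],
    'price':  [100, 110, 50, 45],
})

# Split by ticker, apply a mean to each group, combine the results
mean_price = df.groupby('ticker')['price'].mean()
print(mean_price)
#=> A    105.0
#   B     47.5
```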

I previously tried getting stock prices in Japan with Ruby. Using the data collected there, I would like to try group operations on actual stock prices with pandas.

Group the stock prices of several companies by year and apply the function

A variable grouped with groupby in pandas is a GroupBy object. The apply method splits the data into pieces that are easier to work with, applies a function to each piece, and then combines the results.

import pandas as pd

#Pick up the stock prices of several companies
#NTT DATA
stock_9613 = pd.read_csv('stock_9613.csv',
                         parse_dates=True, index_col=0)
# DTS
stock_9682 = pd.read_csv('stock_9682.csv',
                         parse_dates=True, index_col=0)
#IT Holdings
stock_3626 = pd.read_csv('stock_3626.csv',
                         parse_dates=True, index_col=0)
# NSD
stock_9759 = pd.read_csv('stock_9759.csv',
                         parse_dates=True, index_col=0)

#Extract closing prices after 2010 into one data frame
#(.loc replaces the deprecated .ix indexer)
df = pd.DataFrame([
    stock_9613.loc['2010-01-01':, 'closing price'],
    stock_9682.loc['2010-01-01':, 'closing price'],
    stock_3626.loc['2010-01-01':, 'closing price'],
    stock_9759.loc['2010-01-01':, 'closing price']
], index=['NTT DATA', 'DTS', 'IT Holdings', 'NSD']).T

#=> Date       NTT DATA   DTS  IT Holdings  NSD
# (Omission)
# 2015-01-05     4530  2553  1811  1779
# 2015-01-06     4375  2476  1748  1755
# 2015-01-07     4300  2459  1748  1754
# 2015-01-08     4350  2481  1815  1775
# 2015-01-09     4330  2478  1805  1756
# 2015-01-13     4345  2480  1813  1766
# 2015-01-14     4260  2485  1809  1770
# 2015-01-15     4340  2473  1839  1790
# 2015-01-16     4295  2458  1821  1791

The stock prices of each company since 2010 have been obtained. The companies listed here often collaborate with each other, but how strongly are their stock prices actually correlated in the market? Out of curiosity, let's find the annual correlation coefficients with NTT DATA.

Find the correlation coefficient between grouped stock prices

#Compute the daily rate of change (returns)
rets = df.pct_change().dropna()
#Group by year
by_year = rets.groupby(lambda x: x.year)
#Define an anonymous function that computes each column's correlation with NTT DATA
vol_corr = lambda x: x.corrwith(x['NTT DATA'])
#Apply the function to each group
result1 = by_year.apply(vol_corr)
print(result1)

#=>   NTT DATA       DTS  IT Holdings       NSD
# 2010        1  0.346437     0.492006  0.443910
# 2011        1  0.485108     0.575495  0.619912
# 2012        1  0.261388     0.268531  0.212315
# 2013        1  0.277970     0.358796  0.408304
# 2014        1  0.381762     0.404376  0.385258
# 2015        1  0.631186     0.799621  0.770759
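As a quick sanity check on what corrwith computes, here is a small sketch with made-up data (the column names are hypothetical): it returns the correlation of every column with the given Series.

```python
import pandas as pd

# Made-up returns: 'y' moves exactly with 'x', 'z' moves exactly opposite
df = pd.DataFrame({
    'x': [0.01, -0.02, 0.03, -0.01],
    'y': [0.02, -0.04, 0.06, -0.02],
    'z': [-0.01, 0.02, -0.03, 0.01],
})

corr = df.corrwith(df['x'])
#=> x    1.0
#   y    1.0
#   z   -1.0
```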

Let's visualize it with matplotlib.

[Figure 1.png: annual correlation of each company's returns with NTT DATA]
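The figure can be reproduced roughly as follows. This is a sketch that rebuilds result1 from the values printed above so that it runs standalone; in practice you would plot result1 directly, dropping the constant NTT DATA column.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for saving to a file
import matplotlib.pyplot as plt

# Yearly correlations with NTT DATA, copied from the result1 output above
result1 = pd.DataFrame({
    'DTS':         [0.346437, 0.485108, 0.261388, 0.277970, 0.381762, 0.631186],
    'IT Holdings': [0.492006, 0.575495, 0.268531, 0.358796, 0.404376, 0.799621],
    'NSD':         [0.443910, 0.619912, 0.212315, 0.408304, 0.385258, 0.770759],
}, index=[2010, 2011, 2012, 2013, 2014, 2015])

ax = result1.plot(marker='o')   # one line per company
ax.set_ylabel('correlation with NTT DATA')
plt.savefig('1.png')
```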

You can also use the apply method to find the correlation between individual columns. For example, let's find the correlation of DTS's stock price with NTT DATA's.

#Apply an anonymous function to find the correlation coefficient between one column and another
result2 = by_year.apply(lambda g: g['DTS'].corr(g['NTT DATA']))
print(result2)

#=>
# 2010    0.346437
# 2011    0.485108
# 2012    0.261388
# 2013    0.277970
# 2014    0.381762
# 2015    0.631186

Define a slightly more complex function and apply it to grouped stock prices

The same works for ordinary named functions. For example, let's compute an ordinary least squares (OLS) linear regression for each group.

import statsmodels.api as sm

#Create your own function for linear regression
def regression(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  #Copy to avoid modifying the grouped data
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()  #OLS from the statsmodels econometrics library
    return result.params  #Return the fitted parameters

#Apply the linear regression function to the grouped stock prices
result3 = by_year.apply(regression, 'DTS', ['NTT DATA'])
print(result3)

#=>   NTT DATA  intercept
# 2010  0.313685   0.000773
# 2011  0.509025  -0.000057
# 2012  0.360677   0.000705
# 2013  0.238903   0.002063
# 2014  0.395362   0.001214
# 2015  0.418843  -0.002459

Only half a month of 2015 has passed, so we cannot conclude much about that year, but for the time being we have results for each year. Applying functions to grouped data in this way lets you explore the analysis from various angles, which is very convenient.

Summary

Being able to apply an arbitrary function with the apply method opens up many possibilities. The applied function can be written freely by the analyst; the only rule is that it must return a pandas object or a scalar value.
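A small sketch of that rule with made-up data: a function returning a scalar is combined into a Series, while a function returning a pandas object is combined into a larger object.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 4.0]})

# Returning a scalar: apply combines the results into a Series
totals = df.groupby('key')['val'].apply(lambda g: g.sum())
#=> a    3.0
#   b    4.0

# Returning a Series: apply combines the results into a DataFrame
stats = df.groupby('key').apply(
    lambda g: pd.Series({'min': g['val'].min(), 'max': g['val'].max()}))
#=>    min  max
# a    1.0  2.0
# b    4.0  4.0
```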

The source code for this article is here.

reference

Python for Data Analysis: Data Wrangling with NumPy and pandas (Japanese edition) http://www.oreilly.co.jp/books/9784873116556/
