I want to cross-tabulate the value variables with two categorical variables in pandas and find the total unique value of the value variables.
SELECT date,carrier,count(DISTINCT host) as cnt
FROM apachelog
GROUP BY date,carrier;
Count the unique value of the variable host for each date and carrier variable (3 categories: Android, iOS, Other)
date | Android | iOS | Other |
---|---|---|---|
2015/04/01 | 5000 | 1300 | 15 |
2015/04/02 | 5200 | 1230 | 16 |
2015/04/03 | 5300 | 1100 | 17 |
... | ... | ... | ... |
Average value of column A for each cell of crosstab (column B x column C)
python
pd.pivot_table(df,value='A',index='B',columns='C')
Number of columns A for each cell in crosstab (column B x column C)
python
pd.pivot_table(df,value='A',index='B',columns='C',aggfunc = 'count')
Total of column A for each cell of crosstab (column B x column C)
python
pd.pivot_table(df,value='A',index='B',columns='C',aggfunc = 'sum')
So what about the number of unique values in column A for each cell? Use a lambda expression. How to write a lambda expression is lambda argument 1, argument 2 ...: Processing
pivot_unique.py
import pandas as pd
#Keep the data in a pandas dataframe
df = pd.DataFrame(apachlog)
#In the following method, the value will include the na value.
#Data frame in advance.info()A hand that can be used when you know that there is no na
pv = pd.pivot_table(df, values='host',index = 'date',columns = 'carrier',
aggfunc = lambda x:len(x.unique()))
#The following method is the value excluding the na value.
pv = pd.pivot_table(df, values='host',index = 'date',columns = 'carrier',
aggfunc = lambda x:x.nunique())
You can export pandas dataframes in tsv format with the to_csv method.
output.py
fout = 'output.tsv'
f = open(fout,'w',encoding = 'utf-8')
pv.to_csv(fout,sep = '\t')
f.close()
use_groupby.py
#Create a group-by-object by specifying the x-axis and y-axis
grouped = df.groupby(['date','carrier'])
#Group by to a data frame does not produce any results, it just creates a GroupBy object for the data frame with the aggregate key.
#Aggregate with column "host" for grouped object.
#The function to aggregate is agg(Specify this content with a lambda expression)
#count that na is also included in the unique value
grouped.host.agg(lambda x: len(x.unique()))
#Count not including na
grouped.host.agg(lambda x: x.nunique())
The aggregated result of the groupby method is a "vertical holding" (= long type) table.
date | carrier | |
---|---|---|
2015/04/01 | Android | 5000 |
2015/04/01 | iOS | 1300 |
2015/04/01 | Other | 15 |
2015/04/02 | Android | 5200 |
... | ... | ... |
If you want to expand "vertical holding" (long type) to cross type (?) (= Wide type), use the *** unstack method *** as shown below to display the above aggregated image. You can get it.
#g = grouped.host.agg(lambda x: len(x.unique()))
g = grouped.host.agg(lambda x: x.nunique()))
g.unstack()
-How to use <Python, pandas> pivot_table and plot -Error in pivot_table method in O'Reilly "Introduction to Data Analysis with Python" -Donpisha's question on stack_overflow -O'Reilly [Introduction to Data Analysis with Python] Chapter 9 Data Aggregation and Group Calculation
Recommended Posts