[PYTHON] Count unique values with a pandas crosstab

Purpose

I want to cross-tabulate a value variable against two categorical variables in pandas and count the number of unique values of the value variable in each cell.

In SQL, this would be written as follows:

SELECT date,carrier,count(DISTINCT host) as cnt
FROM apachelog
GROUP BY date,carrier;

Aggregate image

Count the unique values of the variable host for each combination of date and carrier (3 categories: Android, iOS, Other).

date        Android  iOS   Other
2015/04/01  5000     1300  15
2015/04/02  5200     1230  16
2015/04/03  5300     1100  17
...         ...      ...   ...

Code in Python 3 (1): Use the pivot_table method

Average value of column A for each cell of crosstab (column B x column C)

python


pd.pivot_table(df, values='A', index='B', columns='C')

Count of column A values for each cell of crosstab (column B x column C)

python


pd.pivot_table(df, values='A', index='B', columns='C', aggfunc='count')

Total of column A for each cell of crosstab (column B x column C)

python


pd.pivot_table(df, values='A', index='B', columns='C', aggfunc='sum')
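As a quick check of the three calls above, here is a minimal sketch on a made-up six-row frame. The data and the names mean_tbl, count_tbl and sum_tbl are assumptions for illustration only; they are not part of the original log.

```python
import pandas as pd

# Hypothetical toy data: A is the value column, B and C are the categories.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": ["x", "x", "x", "y", "y", "y"],
    "C": ["p", "p", "q", "p", "q", "q"],
})

# With no aggfunc, pivot_table averages the values in each cell.
mean_tbl = pd.pivot_table(df, values="A", index="B", columns="C")
count_tbl = pd.pivot_table(df, values="A", index="B", columns="C", aggfunc="count")
sum_tbl = pd.pivot_table(df, values="A", index="B", columns="C", aggfunc="sum")

print(mean_tbl)   # cell (x, p) holds the mean of 1 and 2
print(count_tbl)  # cell (y, q) holds the count of rows 5 and 6
print(sum_tbl)    # cell (y, q) holds 5 + 6
```

Note that the keyword is values (plural); only the aggfunc argument changes between the three tables.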

So what about the number of unique values in column A for each cell? Pass a lambda expression as aggfunc. A lambda expression is written as lambda argument1, argument2, ...: expression
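To make the lambda syntax concrete, the sketch below defines the same unique-count function twice, once as a lambda and once with def. The sample Series and the names count_unique and count_unique_def are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: four host names, three of them distinct.
hosts = pd.Series(["h1", "h2", "h1", "h3"])

# lambda <arguments>: <expression>
count_unique = lambda x: len(x.unique())

# The same function written with def, for comparison.
def count_unique_def(x):
    return len(x.unique())

print(count_unique(hosts))      # 3
print(count_unique_def(hosts))  # 3
```

Either form can be passed as aggfunc; the lambda simply saves naming a one-line function.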

pivot_unique.py


import pandas as pd

#Hold the data in a pandas DataFrame (apachelog is assumed to be the already-loaded log data)
df = pd.DataFrame(apachelog)

#With the following method, NA is counted as one of the unique values.
#Use it when you have checked beforehand with df.info() that the data contains no NA.
pv = pd.pivot_table(df, values='host', index='date', columns='carrier',
                    aggfunc=lambda x: len(x.unique()))

#With the following method, NA is excluded from the unique values.
pv = pd.pivot_table(df, values='host', index='date', columns='carrier',
                    aggfunc=lambda x: x.nunique())
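The difference between the two aggregation functions only shows up when host contains NA. The following sketch uses a hypothetical eight-row log (the frame log and the names pv_with_na and pv_without_na are assumptions, not the real Apache log) in which one iOS row has no host.

```python
import numpy as np
import pandas as pd

# Hypothetical mini log; one iOS row on 2015/04/01 has a missing host.
log = pd.DataFrame({
    "date": ["2015/04/01"] * 5 + ["2015/04/02"] * 3,
    "carrier": ["Android", "Android", "Android", "iOS", "iOS",
                "Android", "iOS", "iOS"],
    "host": ["h1", "h1", "h2", "h3", np.nan, "h1", "h4", "h5"],
})

# len(x.unique()) treats NaN as one more distinct value.
pv_with_na = pd.pivot_table(log, values="host", index="date", columns="carrier",
                            aggfunc=lambda x: len(x.unique()))

# x.nunique() skips NaN by default.
pv_without_na = pd.pivot_table(log, values="host", index="date", columns="carrier",
                               aggfunc=lambda x: x.nunique())

print(pv_with_na)     # iOS on 2015/04/01 is 2 (h3 and NaN)
print(pv_without_na)  # iOS on 2015/04/01 is 1 (h3 only)
```

All other cells are identical in the two tables, which is why the first form is safe only when the column is known to contain no NA.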

(Bonus) Export to a file

You can export a pandas DataFrame in TSV format with the to_csv method by specifying a tab as the separator.

output.py


fout = 'output.tsv'

#to_csv opens and closes the file itself when given a path, so no separate open() call is needed
pv.to_csv(fout, sep='\t', encoding='utf-8')
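To see what the exported TSV looks like, the sketch below writes a small table to an in-memory buffer instead of a file and reads it back. The frame pv_demo reuses the first two rows of the aggregate image above; the buffer and the names pv_demo and restored are assumptions for illustration.

```python
import io

import pandas as pd

# Figures taken from the first two rows of the aggregate image above.
pv_demo = pd.DataFrame(
    {"Android": [5000, 5200], "iOS": [1300, 1230], "Other": [15, 16]},
    index=pd.Index(["2015/04/01", "2015/04/02"], name="date"),
)

buf = io.StringIO()            # in-memory stand-in for output.tsv
pv_demo.to_csv(buf, sep="\t")  # same call as for a file path
print(buf.getvalue())

# Read the TSV back, using the date column as the index again.
buf.seek(0)
restored = pd.read_csv(buf, sep="\t", index_col="date")
print(restored)
```

The index name (date) becomes the first header cell, so the file can be read back with index_col="date".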

Code in Python 3 (2): Use the groupby method

use_groupby.py


#Create a GroupBy object by specifying the keys that become the row and column axes
grouped = df.groupby(['date','carrier'])

#Calling groupby on a DataFrame does not compute anything by itself; it only creates a GroupBy object keyed on the aggregation keys.

#Aggregate the "host" column of the grouped object.
#The aggregation function is passed to agg (here as a lambda expression).
#Count in which NA is also included as a unique value
grouped.host.agg(lambda x: len(x.unique()))

#Count that does not include NA
grouped.host.agg(lambda x: x.nunique())

The aggregated result of the groupby method is a "vertical" (= long-format) table.

date        carrier
2015/04/01  Android  5000
2015/04/01  iOS      1300
2015/04/01  Other    15
2015/04/02  Android  5200
...         ...      ...

If you want to expand the long-format table into a cross table (= wide format), use the unstack method as follows. This gives the aggregate image shown at the top.

#g = grouped.host.agg(lambda x: len(x.unique()))
g = grouped.host.agg(lambda x: x.nunique())

g.unstack()
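The long-to-wide step can be sketched end to end on a hypothetical seven-row log (the frame log and all variable names below are assumptions for illustration). It also shows two details not covered above: a date/carrier combination with no rows becomes NaN after unstack unless fill_value is given, and pd.crosstab, called here with values and aggfunc, appears to give the same wide table in a single call.

```python
import pandas as pd

# Hypothetical mini log; there is no "Other" row on 2015/04/02.
log = pd.DataFrame({
    "date": ["2015/04/01"] * 4 + ["2015/04/02"] * 3,
    "carrier": ["Android", "Android", "iOS", "Other", "Android", "iOS", "iOS"],
    "host": ["h1", "h1", "h2", "h3", "h1", "h2", "h4"],
})

# Long format: one row per (date, carrier) pair, indexed by a MultiIndex.
long_counts = log.groupby(["date", "carrier"])["host"].nunique()

# Wide format: the inner index level (carrier) becomes the columns.
wide_counts = long_counts.unstack()              # missing pair -> NaN
wide_filled = long_counts.unstack(fill_value=0)  # missing pair -> 0

# The same wide table built directly with pd.crosstab.
ct = pd.crosstab(log["date"], log["carrier"],
                 values=log["host"], aggfunc=pd.Series.nunique)

print(long_counts)
print(wide_filled)
print(ct)
```

Using fill_value=0 keeps the counts as integers; without it the column that contains NaN is converted to float.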
