[PYTHON] Count unique values with a pandas crosstab

Purpose

I want to cross-tabulate a value variable against two categorical variables in pandas and count the number of unique values of the value variable in each cell.

In SQL, this would be written as follows:

SELECT date,carrier,count(DISTINCT host) as cnt
FROM apachelog
GROUP BY date,carrier;

Expected output

Count the unique values of the variable host for each combination of date and carrier (3 categories: Android, iOS, Other).

date	Android	iOS	Other
2015/04/01	5000	1300	15
2015/04/02	5200	1230	16
2015/04/03	5300	1100	17
...	...	...	...

Code in Python 3 (1): Use the pivot_table method

Average value of column A for each cell of the crosstab (column B x column C)

`python`


pd.pivot_table(df, values='A', index='B', columns='C')

Number of values in column A for each cell of the crosstab (column B x column C)

`python`


pd.pivot_table(df, values='A', index='B', columns='C', aggfunc='count')

Total of column A for each cell of the crosstab (column B x column C)

`python`


pd.pivot_table(df, values='A', index='B', columns='C', aggfunc='sum')
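
To see the three aggregations side by side, here is a minimal sketch on a toy DataFrame. The data and the names mean_tbl, count_tbl, and sum_tbl are made up for illustration; only the column names A, B, and C follow the text above.

```python
import pandas as pd

# Hypothetical toy data: A is the value column, B and C are the categories.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["x", "x", "x", "y", "y"],
    "C": ["p", "p", "q", "p", "q"],
})

# With no aggfunc given, pivot_table computes the mean of A per cell.
mean_tbl = pd.pivot_table(df, values="A", index="B", columns="C")
# Number of A values per cell.
count_tbl = pd.pivot_table(df, values="A", index="B", columns="C", aggfunc="count")
# Total of A per cell.
sum_tbl = pd.pivot_table(df, values="A", index="B", columns="C", aggfunc="sum")

print(mean_tbl)
print(count_tbl)
print(sum_tbl)
```

The cell (B = x, C = p) contains the values 1 and 2, so it shows 1.5, 2, and 3 in the three tables respectively.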

So what about the number of unique values in column A for each cell? Pass a lambda expression as aggfunc. A lambda expression is written as lambda argument1, argument2, ...: expression.

`pivot_unique.py`


import pandas as pd

# Load the log data into a pandas DataFrame
# (apachelog is assumed to hold the parsed access-log records)
df = pd.DataFrame(apachelog)

# This version counts NA as one of the unique values.
# It is fine to use when df.info() has already confirmed that there are no NA values.
pv = pd.pivot_table(df, values='host', index='date', columns='carrier',
                    aggfunc=lambda x: len(x.unique()))

# This version excludes NA from the count.
pv = pd.pivot_table(df, values='host', index='date', columns='carrier',
                    aggfunc=lambda x: x.nunique())
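
The difference between the two lambdas only shows up when host contains missing values. The following is a minimal sketch on a made-up miniature log (the rows and the names with_na, without_na, and ct are assumptions for illustration). It also shows pd.crosstab, which accepts values and aggfunc and should give the same counts as the nunique version.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature log; one Android row on 2015/04/01 has no host.
df = pd.DataFrame({
    "date": ["2015/04/01"] * 4 + ["2015/04/02"] * 3,
    "carrier": ["Android", "Android", "Android", "iOS",
                "Android", "iOS", "iOS"],
    "host": ["a", "a", np.nan, "b", "c", "d", "e"],
})

# NaN is counted as one more unique value.
with_na = pd.pivot_table(df, values="host", index="date", columns="carrier",
                         aggfunc=lambda x: len(x.unique()))

# NaN is ignored.
without_na = pd.pivot_table(df, values="host", index="date", columns="carrier",
                            aggfunc=lambda x: x.nunique())

# The same count via pd.crosstab, naming the aggregation as a string.
ct = pd.crosstab(df["date"], df["carrier"], values=df["host"], aggfunc="nunique")

print(with_na)
print(without_na)
print(ct)
```

For the cell (2015/04/01, Android), the hosts are a, a, and NaN, so the first table reports 2 and the other two report 1.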

(Bonus) Export to a file

You can export pandas dataframes in tsv format with the to_csv method.

`output.py`


fout = 'output.tsv'

# to_csv opens and closes the file itself when given a path,
# so a separate open()/close() is not needed
pv.to_csv(fout, sep='\t', encoding='utf-8')
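
To check what the export produces without touching the disk, the table can be written to an in-memory buffer and read back. This is only a sketch: the small pv below is a stand-in built from the first two rows of the expected output table, and restored is a hypothetical name.

```python
import io

import pandas as pd

# Stand-in for the pivot table, using figures from the expected output above.
pv = pd.DataFrame(
    {"Android": [5000, 5200], "iOS": [1300, 1230]},
    index=pd.Index(["2015/04/01", "2015/04/02"], name="date"),
)

# Write tab-separated text to a buffer instead of a file.
buf = io.StringIO()
pv.to_csv(buf, sep="\t")

# Read it back, using the date column as the index again.
buf.seek(0)
restored = pd.read_csv(buf, sep="\t", index_col="date")
print(restored)
```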

Code in Python 3 (2): Use the groupby method

`use_groupby.py`


# Create a GroupBy object, specifying the row key (date) and the column key (carrier)
grouped = df.groupby(['date', 'carrier'])

# Calling groupby on a DataFrame does not compute anything yet;
# it only creates a GroupBy object that holds the grouping keys.

# Aggregate the "host" column of the grouped object.
# The aggregation function is passed to agg() (here as a lambda expression).
# Count unique values, with NA counted as one of them
grouped.host.agg(lambda x: len(x.unique()))

# Count unique values, excluding NA
grouped.host.agg(lambda x: x.nunique())
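
The grouped column also has a nunique method of its own with a dropna parameter, so both counts can be obtained without a lambda. A minimal sketch on made-up data (the rows and the names excl_na and incl_na are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature log; one Android row on 2015/04/01 has no host.
df = pd.DataFrame({
    "date": ["2015/04/01"] * 4 + ["2015/04/02"] * 2,
    "carrier": ["Android", "Android", "Android", "iOS", "Android", "iOS"],
    "host": ["a", "a", np.nan, "b", "c", "d"],
})

grouped = df.groupby(["date", "carrier"])

# NaN is ignored (the default).
excl_na = grouped["host"].nunique()

# NaN is counted as one more unique value.
incl_na = grouped["host"].nunique(dropna=False)

print(excl_na)
print(incl_na)
```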

The result aggregated with the groupby method is a long-format ("vertically held") table.

date	carrier
2015/04/01	Android	5000
2015/04/01	iOS	1300
2015/04/01	Other	15
2015/04/02	Android	5200
...	...	...

If you want to expand the long format into a crosstab (wide format), use the unstack method as shown below. This gives the cross-tabulated table shown at the top of this article.

# g = grouped.host.agg(lambda x: len(x.unique()))
g = grouped.host.agg(lambda x: x.nunique())

g.unstack()
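
When some date and carrier combination never occurs in the data, unstack leaves that cell as NaN. The sketch below, on made-up data, passes fill_value=0 so that such cells show 0 instead; whether 0 is the right filler depends on how the table will be used.

```python
import pandas as pd

# Hypothetical data: there is no iOS access on 2015/04/02.
df = pd.DataFrame({
    "date": ["2015/04/01", "2015/04/01", "2015/04/02"],
    "carrier": ["Android", "iOS", "Android"],
    "host": ["a", "b", "c"],
})

# Long format: one row per (date, carrier) pair.
g = df.groupby(["date", "carrier"])["host"].nunique()

# Wide format: carrier becomes the columns; missing pairs are filled with 0.
wide = g.unstack(fill_value=0)
print(wide)
```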

Execution environment

anaconda3-4.0.0
OS X El Capitan 10.11.3

