[PYTHON] Draw a graph by processing with Pandas groupby

Introduction

I have started working on Kaggle's Titanic competition. Before imputing missing values or tuning hyperparameters, I want to take a closer look at the data. I wanted to quickly aggregate the loaded data by the value of `Survived` and draw a graph, but it did not go well, because I did not understand Pandas's `GroupBy`.

There are already many examples of graph drawing on the web, but I wrote this article in the hope that describing the path of my own understanding would be useful for beginners.

Goal

Draw a graph like the one below.

Figure: Number of survived, dead, and unknown passengers per ticket symbol

In this graph, the horizontal axis is the `Ticket` symbol, and the vertical axis stacks the number of people who survived (`s`), died (`d`), or whose outcome is unknown (`nan`). The bars are sorted in descending order by the total number of people. For example, the ticket symbol `CA. 2343` on the far left has 11 people in total: 4 are unknown and the remaining 7 died.

I want to draw such a graph quickly.

Read data

Read the data and count how many rows share each symbol in the `Ticket` column.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")
total_data = pd.concat([train_data, test_data])  # Concatenate train_data and test_data

ticket_freq = total_data["Ticket"].value_counts()
ticket_freq
#output
CA. 2343        11
CA 2144          8
1601             8
S.O.C. 14879     7
3101295          7
                ..
350404           1
248706           1
367655           1
W./C. 14260      1
350047           1
Name: Ticket, Length: 929, dtype: int64

There are 11 people with ticket `CA. 2343`, 8 with `CA 2144`, and so on.
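
As a minimal sketch of this step, the toy frames below (hypothetical data, not the Kaggle files) stand in for `train.csv`, which has a `Survived` column, and `test.csv`, which does not. Concatenating them leaves `Survived` as `NaN` for the test rows, and those rows are what later show up as "unknown".

```python
import pandas as pd

# Hypothetical stand-ins for train.csv / test.csv (not the real Kaggle data).
train = pd.DataFrame({"Ticket": ["A 1", "A 1", "B 2"], "Survived": [0, 1, 0]})
test = pd.DataFrame({"Ticket": ["A 1", "C 3"]})  # no Survived column

# Rows coming from `test` get NaN in Survived because the column is missing there.
total = pd.concat([train, test])

# value_counts() counts rows per distinct ticket, most frequent first.
ticket_freq = total["Ticket"].value_counts()
print(ticket_freq)
```

Passing `ignore_index=True` to `pd.concat` would also renumber the rows, which this walkthrough does not need.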



# Create data for graphs
## Group by groupby

First, group `total_data` by ticket symbol.

```python
total_data_ticket = total_data.groupby("Ticket")
total_data_ticket
#output
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F5A14327C8>
```

A drawback of `groupby` is that it does not display the contents of the data. For now, just keep in mind that the data has been **grouped**, and move on.
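
If you do want to look inside the grouped object, a few standard accessors help. The sketch below uses a small hypothetical frame in place of `total_data`; the variable names are mine, not from the original code.

```python
import pandas as pd

# Hypothetical toy data standing in for total_data.
df = pd.DataFrame({
    "Ticket": ["A 1", "A 1", "B 2", "C 3"],
    "Survived": [0.0, 1.0, 0.0, None],
})
grouped = df.groupby("Ticket")

# Number of rows in each group.
sizes = grouped.size()
print(sizes)

# The rows that belong to one particular group.
a1_rows = grouped.get_group("A 1")
print(a1_rows)

# Iterating over the GroupBy yields (key, sub-DataFrame) pairs.
for key, sub in grouped:
    print(key, len(sub))
```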



## Extract only survival information
Next, extract the survival information (`Survived`).

```python
total_data_ticket = total_data.groupby("Ticket")["Survived"]
total_data_ticket
#output
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001F5A1437B48>
```

The data is not displayed here either.

Count by survival, death, unknown

Then use `value_counts()` to count the number of rows for each value of `Survived`. Setting `dropna=False` makes it count N/A as well.

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False)
total_data_ticket
#output
Ticket       Survived
110152       1.0         3
110413       1.0         2
             0.0         1
110465       0.0         2
110469       NaN         1
                        ..
W.E.P. 5734  NaN         1
             0.0         1
W/C 14208    0.0         1
WE/P 5735    0.0         1
             1.0         1
Name: Survived, Length: 1093, dtype: int64
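
To see what `dropna=False` changes, the following sketch runs both variants on a small hypothetical frame. With the default (`dropna=True`) the rows whose `Survived` is missing are simply left out of the counts.

```python
import pandas as pd

# Hypothetical toy data: two tickets, two rows with unknown survival.
df = pd.DataFrame({
    "Ticket": ["A 1", "A 1", "A 1", "B 2", "B 2"],
    "Survived": [0.0, 0.0, None, 1.0, None],
})
by_ticket = df.groupby("Ticket")["Survived"]

with_nan = by_ticket.value_counts(dropna=False)  # NaN gets its own count
without_nan = by_ticket.value_counts()           # dropna=True is the default

print(with_nan)
print(without_nan)
```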

Change the shape of data

To draw the graph, reshape the data so that the survived, died, and unknown counts each become a column. Use `unstack()`.

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False).unstack()
total_data_ticket
#output
Survived	NaN	0.0	1.0
Ticket			
110152	NaN	NaN	3.0
110413	NaN	1.0	2.0
110465	NaN	2.0	NaN
110469	1.0	NaN	NaN
110489	1.0	NaN	NaN
...	...	...	...
W./C. 6608	1.0	4.0	NaN
W./C. 6609	NaN	1.0	NaN
W.E.P. 5734	1.0	1.0	NaN
W/C 14208	NaN	1.0	NaN
WE/P 5735	NaN	1.0	1.0
929 rows × 3 columns
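
The reshaping is easier to follow on a tiny example. The sketch below builds a hypothetical count Series indexed by (`Ticket`, `Survived`), leaving out the `NaN` label to keep it simple, and shows how `unstack()` moves the inner index level into columns. Combinations that do not occur become `NaN`.

```python
import pandas as pd

# Hypothetical counts indexed by (Ticket, Survived), shaped like the value_counts() result.
counts = pd.Series(
    [3, 1, 2, 4],
    index=pd.MultiIndex.from_tuples(
        [("A 1", 1.0), ("B 2", 0.0), ("B 2", 1.0), ("C 3", 0.0)],
        names=["Ticket", "Survived"],
    ),
)

# unstack() pivots the innermost index level (Survived) into columns.
wide = counts.unstack()
print(wide)
```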

Draw a graph

Replace N/A with numbers

Looking at the output above, the values still contain `NaN`, so replace `NaN` with 0.

total_data_ticket.fillna(0, inplace=True)
total_data_ticket
#output
Survived	NaN	0.0	1.0
Ticket			
110152	0.0	0.0	3.0
110413	0.0	1.0	2.0
110465	0.0	2.0	0.0
110469	1.0	0.0	0.0
110489	1.0	0.0	0.0
...	...	...	...
W./C. 6608	1.0	4.0	0.0
W./C. 6609	0.0	1.0	0.0
W.E.P. 5734	1.0	1.0	0.0
W/C 14208	0.0	1.0	0.0
WE/P 5735	0.0	1.0	1.0
929 rows × 3 columns
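
The counts above are floats (`3.0` rather than `3`) because a column that contains `NaN` cannot stay an integer column. As an optional variation, and not what the article's code does, the sketch below fills the gaps without `inplace=True` and casts back to integers. The frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table with one missing combination.
wide = pd.DataFrame(
    {"d": [np.nan, 1.0], "s": [3.0, 2.0]},
    index=pd.Index(["A 1", "B 2"], name="Ticket"),
)

# fillna(0) returns a new DataFrame; astype(int) turns 3.0 into 3.
filled = wide.fillna(0).astype(int)
print(filled)
```

Because `fillna` returns a new object here, `wide` itself is left unchanged.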

Change column name

The column names are `NaN`, `0.0`, and `1.0`, which is awkward, so rename the columns.

total_data_ticket.columns = ["nan", "d", "s"]
total_data_ticket
#output
	nan	d	s
Ticket			
110152	0.0	0.0	3.0
110413	0.0	1.0	2.0
110465	0.0	2.0	0.0
110469	1.0	0.0	0.0
110489	1.0	0.0	0.0
...	...	...	...
W./C. 6608	1.0	4.0	0.0
W./C. 6609	0.0	1.0	0.0
W.E.P. 5734	1.0	1.0	0.0
W/C 14208	0.0	1.0	0.0
WE/P 5735	0.0	1.0	1.0
929 rows × 3 columns
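
Assigning a list to `columns` relies on the columns being in exactly the order shown above. I am not certain that order is the same in every pandas version, so here is a sketch that derives each name from the column's own label instead. `label_for` is a hypothetical helper, and the frame deliberately uses a different column order.

```python
import pandas as pd

# Hypothetical frame whose columns are the raw Survived values, in a different order.
wide = pd.DataFrame(
    [[1.0, 2.0, 0.0]],
    columns=[0.0, 1.0, float("nan")],
    index=pd.Index(["A 1"], name="Ticket"),
)

def label_for(value):
    """Map a raw Survived value to a short column name (hypothetical helper)."""
    if pd.isna(value):
        return "nan"
    return {0.0: "d", 1.0: "s"}[value]

# Build the new names from the labels, so the column order does not matter.
wide.columns = [label_for(c) for c in wide.columns]
print(wide)
```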

Calculate the total number of people per row

I want to sort by the total number of people in descending order, so I calculate the total and store it in a new column. I use `sum()` to calculate the total; because the values are added across the columns of each row, I use `sum(axis=1)`.

total_data_ticket["count"] = total_data_ticket.sum(axis=1)
total_data_ticket
#output
	nan	d	s	count
Ticket				
110152	0.0	0.0	3.0	3.0
110413	0.0	1.0	2.0	3.0
110465	0.0	2.0	0.0	2.0
110469	1.0	0.0	0.0	1.0
110489	1.0	0.0	0.0	1.0
...	...	...	...	...
W./C. 6608	1.0	4.0	0.0	5.0
W./C. 6609	0.0	1.0	0.0	1.0
W.E.P. 5734	1.0	1.0	0.0	2.0
W/C 14208	0.0	1.0	0.0	1.0
WE/P 5735	0.0	1.0	1.0	2.0
929 rows × 4 columns
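
The direction of `axis` is easy to mix up, so here is a small sketch on hypothetical counts that compares the two directions.

```python
import pandas as pd

# Hypothetical counts for two tickets.
counts = pd.DataFrame(
    {"nan": [0, 1], "d": [1, 4], "s": [2, 0]},
    index=pd.Index(["A 1", "B 2"], name="Ticket"),
)

per_ticket = counts.sum(axis=1)  # one total per row (per ticket)
per_status = counts.sum(axis=0)  # one total per column (per status)

print(per_ticket)
print(per_status)
```

One caveat: once the `count` column exists, running the same `sum(axis=1)` assignment again would add `count` itself into the total.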

Now you are ready to draw the graph.

Draw a graph

Filter by the number of people and sort in descending order

The code is shown first and explained in order.

total_data_ticket[total_data_ticket["count"] > 3].sort_values("count", ascending=False)[["nan", "d", "s"]].plot.bar(figsize=(15,10),stacked=True)

| Code | Contents |
| --- | --- |
| `total_data_ticket[total_data_ticket["count"] > 3]` | Keep only the rows where `"count"` is greater than 3 |
| `.sort_values("count", ascending=False)` | Sort by `"count"` in descending order |
| `[["nan", "d", "s"]]` | Extract only the three columns on the left (`"count"` is not needed in the plot) |
| `.plot.bar(figsize=(15,10),stacked=True)` | Draw a bar graph with the specified size, using stacked bars |
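
The same chain can be split into named steps, which makes each intermediate result easy to inspect. This is only a sketch on hypothetical counts. The `Agg` backend line is my assumption so that it runs without a display, and the smaller `figsize` is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import pandas as pd

# Hypothetical counts for four tickets.
counts = pd.DataFrame(
    {"nan": [4, 0, 1, 0], "d": [7, 1, 4, 2], "s": [0, 2, 0, 4]},
    index=pd.Index(["A 1", "B 2", "C 3", "D 4"], name="Ticket"),
)
counts["count"] = counts.sum(axis=1)

# Step 1: keep tickets shared by more than 3 people.
big = counts[counts["count"] > 3]
# Step 2: put the largest groups first.
ordered = big.sort_values("count", ascending=False)
# Step 3: drop the helper column and draw stacked bars.
ax = ordered[["nan", "d", "s"]].plot.bar(stacked=True, figsize=(6, 4))
ax.set_ylabel("Number of people")
```

In a notebook the figure is shown automatically; in a script you would call `ax.figure.savefig(...)` or `plt.show()` afterwards.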

Now you can draw the graph shown at the beginning.

Figure: Number of survived, dead, and unknown passengers per ticket symbol (shown again)

Looking at this, one can guess that the people with `CA. 2343` and `CA 2144` tickets have `Survived = 0` ...

Whole code

Finally, the whole code is shown.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")
total_data = pd.concat([train_data, test_data])

ticket_freq = total_data["Ticket"].value_counts()
ticket_freq

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False).unstack()

total_data_ticket.fillna(0, inplace=True)
total_data_ticket.columns = ["nan", "d", "s"]
total_data_ticket["count"] = total_data_ticket.sum(axis=1)
total_data_ticket[total_data_ticket["count"] > 3].sort_values("count", ascending=False)[["nan", "d", "s"]].plot.bar(figsize=(15,10),stacked=True)

In conclusion

Using this technique, I will also examine other non-numeric data such as `Embarked`, `Cabin`, and the surnames and titles in `Name`.
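
To reuse the idea on other columns, the steps can be wrapped in a function. The sketch below is one possible shape, not the code used in this article. `survival_counts` is a hypothetical helper, and it swaps in `pd.crosstab` for the `groupby` / `value_counts` / `unstack` chain after turning `Survived` into text labels first.

```python
import pandas as pd

def survival_counts(df, column):
    """Count survived/died/unknown per value of `column` (hypothetical helper)."""
    # Turn 0/1/NaN into the labels used in this article.
    status = df["Survived"].map({0.0: "d", 1.0: "s"}).fillna("nan")
    table = pd.crosstab(df[column], status)
    # Make sure all three columns exist even if a status never occurs.
    table = table.reindex(columns=["nan", "d", "s"], fill_value=0)
    table["count"] = table.sum(axis=1)
    return table.sort_values("count", ascending=False)

# Hypothetical toy data, not the real Titanic records.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "S", "C", "Q"],
    "Survived": [0, 1, 1, None, None, 0],
})
result = survival_counts(toy, "Embarked")
print(result)
```

The result can then be plotted with `result[["nan", "d", "s"]].plot.bar(stacked=True)`, as in the main example.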

