[Python] How do you visualize 4 or more variables?

# Purpose of this article

What if you want to see the relationships between many variables at once in your data analysis?

A pair plot is the typical choice, but I wanted something more compact. While wondering whether everything could be packed into a single figure, I recently learned about the **Sankey Diagram**, so I tried drawing one.



**Addendum:**
<font color="red">Please read the Postscript at the end of the article first.</font>


# How to use Plotly Sankey Diagram


It seems this can be done with Plotly, so first I will copy the sample code from the [official site](https://plot.ly/python/sankey-diagram/) and check that it works.
```python
import plotly.graph_objects as go

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = ["A1", "A2", "B1", "B2", "C1", "C2"],
      color = "blue"
    ),
    link = dict(
      source = [0, 1, 0, 2, 3, 3], # indices correspond to labels, eg A1, A2, A1, B1, ...
      target = [2, 3, 3, 4, 4, 5],
      value = [8, 4, 2, 8, 4, 2]
  ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```

image.png

It worked! It's nice that hovering the mouse over a part of the diagram shows its details.

The code is longer and looks harder than matplotlib or seaborn code, but the important parts are just the following:



```python
label = ["A1", "A2", "B1", "B2", "C1", "C2"]
source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]
```

For example, in the Sankey Diagram above, the link `source: A1, target: B2, 2.00` corresponds to the orange entries of the three lists in the figure below. It means that a flow of `2` goes from `label[0]` to `label[3]`.

image01.png

In other words, if you can create lists that specify the start node and end node of each link and the amount that flows through it, you can draw a Sankey Diagram!
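As a small illustration of this idea, the sketch below converts links written with node names into the index-based lists shown above. The helper name `links_to_indices` is hypothetical and is not part of Plotly. It only reproduces the lists from the sample.

```python
def links_to_indices(labels, named_links):
    """Convert (source_name, target_name, value) triples to index lists.

    Hypothetical helper for illustration only, not a Plotly function.
    """
    index_of = {name: i for i, name in enumerate(labels)}
    source = [index_of[s] for s, _, _ in named_links]
    target = [index_of[t] for _, t, _ in named_links]
    value = [v for _, _, v in named_links]
    return source, target, value


labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
named_links = [
    ("A1", "B1", 8),
    ("A2", "B2", 4),
    ("A1", "B2", 2),
    ("B1", "C1", 8),
    ("B2", "C1", 4),
    ("B2", "C2", 2),
]

source, target, value = links_to_indices(labels, named_links)
print(source)  # [0, 1, 0, 2, 3, 3]
print(target)  # [2, 3, 3, 4, 4, 5]
print(value)   # [8, 4, 2, 8, 4, 2]
```

Writing the links by name first makes it easier to check by eye that each flow connects the nodes you intended.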

# Draw a Sankey Diagram from a data frame

Now for the main topic: creating a Sankey Diagram from a data frame.

To show the result first, I created the following figure from the Titanic data.

image.png

## Commentary

Load the libraries.

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
```

Download the data. The `!` prefix runs a shell command from a notebook such as Google Colab.

```python
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv
```

Read the data. This time only the categorical variables and the integer-valued variables are plotted, so narrow the column names down to those.

```python
filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')

# Columns to visualize, in the order they appear in the diagram.
cate_list = ["Survived", "Pclass", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]

n = len(cate_list)
```

Then create `label_list`, which holds one `variable=value` label per node.

```python
label_list = []
source_list = []
target_list = []
value_list = []

for cate in cate_list:
    tmp_label_list = []
    for v in df[cate].unique():
        lab = "{0}={1}".format(cate, v)
        tmp_label_list.append(lab)
    # Sort once per variable, after all of its labels are collected.
    tmp_label_list.sort()
    label_list.extend(tmp_label_list)
```
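One thing to be aware of with the label loop above is that the labels are sorted as strings, so a column with values of 10 or more would be ordered lexicographically (`10` before `2`). As far as I can tell, the Titanic columns used here only contain single-digit values, so the result is not affected. As a minimal sketch, assuming numeric order is what you want, the values can be sorted before they are formatted. `make_labels` is a hypothetical helper name.

```python
def make_labels(column_name, values):
    """Build "name=value" labels in the natural order of the values.

    Hypothetical helper: sorts the raw values first, then formats them,
    so numeric columns are not ordered lexicographically.
    """
    return ["{0}={1}".format(column_name, v) for v in sorted(set(values))]


values = [2, 10, 0, 1, 2]

# Sorting the formatted strings puts "n=10" before "n=2".
as_strings = sorted({"n={0}".format(v) for v in values})
print(as_strings)  # ['n=0', 'n=1', 'n=10', 'n=2']

# Sorting the raw values first keeps numeric order.
as_numbers = make_labels("n", values)
print(as_numbers)  # ['n=0', 'n=1', 'n=2', 'n=10']
```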

Next, create the three lists that describe the links.

```python
for i in range(n-1):
    # Each variable is linked only to the next one in cate_list.
    source_cate = cate_list[i]
    target_cate = cate_list[i+1]

    for sc in df[source_cate].unique():
        for tc in df[target_cate].unique():
            # Number of rows that have both values.
            v = int(((df[source_cate] == sc) & (df[target_cate] == tc)).sum())
            source_lab = "{0}={1}".format(source_cate, sc)
            target_lab = "{0}={1}".format(target_cate, tc)

            source_list.append(source_lab)
            target_list.append(target_lab)
            value_list.append(v)
```
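The nested loops above count the rows for every pair of categories. As an alternative sketch, pandas can produce the same kind of counts with `groupby(...).size()`. The tiny data frame below is made up for illustration and is not the Titanic file. Note one difference: pairs that never occur are simply absent, instead of appearing with a value of 0.

```python
import pandas as pd

# Made-up toy data, not the Titanic file.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Pclass":   [3, 3, 1, 1, 3],
})

# Count rows for each (Survived, Pclass) pair that actually occurs.
counts = (
    toy.groupby(["Survived", "Pclass"])
       .size()
       .reset_index(name="n")
)

# Turn each counted pair into a (source label, target label, value) triple.
links = [
    ("Survived={0}".format(s), "Pclass={0}".format(p), int(c))
    for s, p, c in zip(counts["Survived"], counts["Pclass"], counts["n"])
]
print(links)
# [('Survived=0', 'Pclass=3', 2), ('Survived=1', 'Pclass=1', 2), ('Survived=1', 'Pclass=3', 1)]
```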

Finally, `source_list` and `target_list` must be specified by index, so look each label up in `label_list` and convert it.

```python
source_list = [label_list.index(si) for si in source_list]
target_list = [label_list.index(ti) for ti in target_list]
```

All you have to do now is run the same code as the sample.

```python
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = label_list,
      color = "blue"
    ),
    link = dict(
      source = source_list,
      target = target_list,
      value = value_list
  ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```

image.png

That's all!

## Code list

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')

# Columns to visualize, in the order they appear in the diagram.
cate_list = ["Survived", "Pclass", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]

n = len(cate_list)

label_list = []
source_list = []
target_list = []
value_list = []

# One "variable=value" label per node.
for cate in cate_list:
    tmp_label_list = []
    for v in df[cate].unique():
        lab = "{0}={1}".format(cate, v)
        tmp_label_list.append(lab)
    tmp_label_list.sort()
    label_list.extend(tmp_label_list)

# One link per pair of values in adjacent variables.
for i in range(n-1):
    source_cate = cate_list[i]
    target_cate = cate_list[i+1]

    for sc in df[source_cate].unique():
        for tc in df[target_cate].unique():
            # Number of rows that have both values.
            v = int(((df[source_cate] == sc) & (df[target_cate] == tc)).sum())
            source_lab = "{0}={1}".format(source_cate, sc)
            target_lab = "{0}={1}".format(target_cate, tc)

            source_list.append(source_lab)
            target_list.append(target_lab)
            value_list.append(v)

# Plotly expects node indices, not label strings.
source_list = [label_list.index(si) for si in source_list]
target_list = [label_list.index(ti) for ti in target_list]

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = label_list,
      color = "blue"
    ),
    link = dict(
      source = source_list,
      target = target_list,
      value = value_list
  ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```

# Afterword

Actually, I only noticed this once I wrote the code myself: this figure only shows the relationship between each variable and the ones directly before and after it. In the example above, you can see the relationship between `Survived` and `Pclass`, and between `Pclass` and `Sex`, but not between `Survived` and `Sex`. Up to three variables could probably be expressed by using color and so on, but beyond that it seems impossible.

(Wait, so this isn't really a visualization of four or more dimensions...?)
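To make this limitation concrete, here is a small sketch with two made-up data sets (not the Titanic data). Their adjacent-pair counts are identical, so they would give the same Sankey Diagram, yet the relationship between the first and third columns differs. The helper `pair_counts` is a hypothetical name used only for this illustration.

```python
from collections import Counter

# Two made-up data sets with columns (Survived, Pclass, Sex).
data_a = [(1, 1, "female"), (0, 1, "male")]
data_b = [(1, 1, "male"), (0, 1, "female")]


def pair_counts(rows, i, j):
    """Count how often each pair of values in columns i and j occurs."""
    return Counter((row[i], row[j]) for row in rows)


# Adjacent pairs, which are all the Sankey Diagram draws, are identical.
same_first_pair = pair_counts(data_a, 0, 1) == pair_counts(data_b, 0, 1)
same_second_pair = pair_counts(data_a, 1, 2) == pair_counts(data_b, 1, 2)
print(same_first_pair, same_second_pair)  # True True

# The non-adjacent pair (Survived, Sex) differs between the two.
same_outer_pair = pair_counts(data_a, 0, 2) == pair_counts(data_b, 0, 2)
print(same_outer_pair)  # False
```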

If you know a better way, please let us know in the comments.

# Postscript

I did a lot of work above, but when I read the docs, there turned out to be an easier way.

```python
import plotly.express as px

fig = px.parallel_categories(df, dimensions=cate_list, color='Survived')
fig.show()
```

plot03.gif

It's amazing that you can move things around to change the order of the variables and the order of their elements!

That's all!

# References

- Plotly: Sankey Diagram in Python
- Plotly: Basic Parallel Category Diagram with plotly.express
- CS109: A Titanic Probability
