Introduction

In this article I, someone who has never used Graphviz before, use **Graphviz** in **jupyter-notebook**. There is plenty of information online about how to use Graphviz, and two articles in particular served as my starting points.

Either of those two articles will get you a working setup, so if one of them looks fine to you, feel free to follow it. Here I combine the two approaches a little to build the environment.

What is Graphviz

First, a brief look at Graphviz. This section summarizes the following sites.

Beginning with "Graphviz", a tool for converting text data into graph images
Summary of how to draw a graph in Graphviz and dot language

Graphviz is short for Graph Visualization Software, a tool for creating graphs. It converts a text file written in a data description language called the dot language into an image file. It is also used to draw decision trees in machine learning, and it is available on several platforms (Windows, Mac, Linux). Incidentally, dot is the Graphviz program that draws directed graphs.

Source: Statistical text analysis (6) -Word network analysis-

The figure above shows the difference between a directed graph and an undirected graph. In words, the difference is whether an edge has a specific direction from one vertex to another, that is, whether the figure draws it as an arrow. For example, hyperlinks form a directed graph and a train route map forms an undirected graph: clicking a hyperlink does not take you back to the original page (ignoring the back button), whereas on a train line you can travel in both directions.
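
The same distinction shows up in the dot language itself: a directed graph is declared with `digraph` and its edges are written with `->`, while an undirected graph is declared with `graph` and uses `--`. The following is a minimal sketch that only builds the text; the helper `to_dot` and the page names are made up for illustration and are not part of Graphviz.

```python
def to_dot(edges, directed=True):
    """Return dot-language source for a list of (tail, head) edges."""
    keyword = "digraph" if directed else "graph"
    connector = "->" if directed else "--"
    lines = [f"{keyword} G {{"]
    for tail, head in edges:
        lines.append(f'    "{tail}" {connector} "{head}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edges: hyperlinks between pages, or stations on a line
links = [("PageA", "PageB"), ("PageB", "PageC")]

directed_src = to_dot(links, directed=True)     # arrows, like hyperlinks
undirected_src = to_dot(links, directed=False)  # plain lines, like a route map

print(directed_src)
print(undirected_src)
```

Saving either string to a `.gv` file and passing it to the `dot` program would render it as an image, assuming Graphviz is installed.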

Graphviz installation

I tried two installation methods this time. Note that when I went to the download page of the official website, it looked different from the one described on the sites above.

Downloading the installer from the official website
Downloading the zip file

These are the two methods.

Download from the official website

First, open the official website. Click Download, then click "Graphviz Windows packages" under Windows. This takes you to a GitHub page, where you select the file "2.38 yaml". Opening the link given in the URL entry of that file starts the download automatically. After that, run the installer to install Graphviz. You can refer to this site for the installer steps. Whether to check "everyone" depends on how many accounts use your computer.

Download the zip file

You can download the zip file from this site. Once it is downloaded, all you have to do is unzip it.
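
Whichever method you choose, it can help to check whether the `dot` executable is reachable from Python before continuing. The sketch below uses only the standard library; the helper name `find_dot` is made up for illustration. If it prints the fallback message, Graphviz may still be installed, but its `bin` directory is probably not on `PATH`, and the absolute path to the executable has to be given explicitly.

```python
import shutil


def find_dot(candidates=("dot",)):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # on Windows this also resolves dot.exe
        if path:
            return path
    return None


dot_path = find_dot()
print(dot_path or "dot was not found on PATH; use its absolute path instead")
```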

Implementation

This time, we will visualize a decision tree trained on the Titanic survival dataset. The code is based on [this article](https://qiita.com/5sigma_AAA/items/0c23907da9330681147b#%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%AE%9F%E8%A3%85). I will write it split into cells, Jupyter style.

First, load the required libraries. (Replace the user part of the path with your account name.)

from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, accuracy_score

import pandas as pd

from sklearn.tree import export_graphviz
import pydotplus 
from io import StringIO
from IPython.display import Image

#Data reading
train = pd.read_csv("/Users/user/jupyter/train.csv")

Real-world datasets usually contain missing values, so check for them first.

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2

Age, Cabin, and Embarked have missing values. I will ignore Cabin because it is not used this time. Missing Age values are filled with the median of the whole Age column, and missing Embarked values are filled with the mode, "S".

train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna("S")
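
The choice of fill value matters when a column is skewed, because the median and the mean can differ noticeably. A small sketch with made-up toy data (not the Titanic file) shows the same `fillna` pattern and how the mode of a string column can be looked up instead of being hard-coded:

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "Age": [20.0, None, 30.0, 70.0],
    "Embarked": ["S", "C", None, "S"],
})

age_median = toy["Age"].median()           # middle value, ignores NaN
age_mean = toy["Age"].mean()               # pulled upward by the 70.0 outlier
embarked_mode = toy["Embarked"].mode()[0]  # most frequent value, ignores NaN

filled = toy.assign(
    Age=toy["Age"].fillna(age_median),
    Embarked=toy["Embarked"].fillna(embarked_mode),
)
print(age_median, age_mean, embarked_mode)
print(filled)
```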

The model needs numeric input, so we convert the string columns 'Sex' and 'Embarked' to integers. Here the numbers are assigned by hand, but the conversion can also be done with dummy variables.

#Conversion of categorical variables
train['Sex'] = train['Sex'].apply(lambda x: 1 if x == 'male' else 0)
train['Embarked'] = train['Embarked'].map( {'S': 0 , 'C':1 , 'Q':2}).astype(int)
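
For comparison, the dummy-variable approach mentioned above can be sketched with `pandas.get_dummies`. The toy frame here is made up for illustration. Instead of one integer column per feature, it produces one 0/1 column per category, which avoids implying an order such as S < C < Q.

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```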

Drop the columns that will not be used.

train = train.drop(['Cabin','Name','PassengerId','Ticket'],axis =1)

Preprocessing is done, so check for missing values once more.

train.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0

All missing values are gone. Next, split the data into a training set and a test set, with the test set taking 30% of the total.

#Divided into training data and test data
train_X = train.drop('Survived',axis = 1)
train_y = train.Survived
(train_X , test_X , train_y , test_y) = train_test_split(train_X, train_y , test_size = 0.3 , random_state = 0)

Build the model and train it. The maximum tree depth is 3.

#train
model = DecisionTreeClassifier(max_depth=3,random_state = 0)
model.fit(train_X , train_y)

# accuracy
pred = model.predict(test_X)
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
auc(fpr, tpr)  # AUC is computed here but not displayed
print("accuracy=", accuracy_score(test_y, pred))

accuracy=0.8208955223880597
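
One point about the ROC step: when hard 0/1 predictions are passed to `roc_curve`, the curve has only a single operating point, so the AUC is usually computed from class probabilities instead. The following sketch illustrates this with synthetic data from `make_classification`, not the Titanic data. The variable names are chosen for illustration, and no particular score should be expected.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any binary classification set works here)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

labels = clf.predict(X_test)              # hard 0/1 predictions
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

acc = accuracy_score(y_test, labels)
auc_from_labels = roc_auc_score(y_test, labels)  # single operating point
auc_from_scores = roc_auc_score(y_test, scores)  # one point per distinct score

print("accuracy =", acc)
print("AUC from labels =", auc_from_labels)
print("AUC from probabilities =", auc_from_scores)
```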

From here, we draw the tree with Graphviz.

# Buffer that lets the dot source be written as if to a file
dot_data = StringIO()

export_graphviz(
    model, out_file=dot_data,
    feature_names=list(train_X.columns),  # a plain list is accepted across scikit-learn versions
    class_names=["Death", "Survival"]
)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# Specify the absolute path to dot.exe in the directory where Graphviz was installed or unzipped
graph.progs = {'dot': u"C:\\Users\\user\\anaconda3\\bin\\release\\bin\\dot.exe"}
# Visualize in the notebook
Image(graph.create_png())

I was able to draw a decision tree.
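
If the image does not appear, it can help to look at the dot source first, since that step does not need the Graphviz executable at all. A minimal sketch, assuming a tiny made-up dataset and a made-up feature name `x_val`: passing `out_file=None` makes `export_graphviz` return the source as a string.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical one-feature dataset that a depth-1 tree separates perfectly
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
tiny_model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# out_file=None returns the dot source instead of writing it anywhere
dot_source = export_graphviz(
    tiny_model,
    out_file=None,
    feature_names=["x_val"],
    class_names=["Death", "Survival"],
)
print(dot_source)
```

The printed text is a `digraph`, so the tree is a directed graph whose arrows run from each split to its child nodes. If this step works but `create_png()` fails, the path to the `dot` executable is the likely cause.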

Conclusion

This time I used Jupyter, but the same can of course be done in other IDEs. Incidentally, the hardest part was finding the installer on the official website.

[PYTHON] Using Graphviz with Jupyter Notebook