In this article I try out **Graphviz** in **Jupyter Notebook** as someone who has never used Graphviz before. There is plenty of information online about how to set it up, and I consulted two such articles. Either setup works, so if you read one of them and it looks fine to you, follow it. Here I mix the two a little to build the environment.
First, a brief look at Graphviz. What follows is my summary of the sites I consulted.
Graphviz is short for Graph Visualization Software, a tool for drawing graphs. A text file written in a description language called DOT can be converted and output as an image file. It is also used to draw decision trees in machine learning. It is available on Windows, Mac, and Linux. Incidentally, dot itself is the program that lays out directed graphs.
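As a rough illustration of that text-to-image workflow, the sketch below writes a tiny DOT description to a file and builds the command line that would render it. The file names and the helpers `build_render_command` and `render_sample` are made up for this example. `dot -Tpng input -o output` is the usual way to call the dot program, and the command is only run if `dot` is found on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# A minimal directed graph in the DOT language (contents are illustrative).
DOT_SOURCE = """digraph sample {
    A -> B;
    B -> C;
}
"""


def build_render_command(dot_file, png_file):
    """Return the dot command line that renders dot_file as a PNG image."""
    return ["dot", "-Tpng", str(dot_file), "-o", str(png_file)]


def render_sample(directory):
    """Write DOT_SOURCE to disk and render it if the dot program is installed."""
    dot_file = Path(directory) / "sample.dot"
    png_file = Path(directory) / "sample.png"
    dot_file.write_text(DOT_SOURCE, encoding="utf-8")
    command = build_render_command(dot_file, png_file)
    if shutil.which("dot") is not None:
        # Only attempt rendering when Graphviz is actually available.
        subprocess.run(command, check=True)
    return dot_file, command


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dot_file, command = render_sample(tmp)
        print(" ".join(command))
```

The rest of this article never calls dot by hand like this; pydotplus does the equivalent step from inside the notebook.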
Source: Statistical text analysis (6) -Word network analysis-
The figure above shows the difference between a directed graph and an undirected graph. In words, the question is whether the connection from one vertex to another has a specific direction, that is, whether the edges in the figure are drawn as arrows. In other words, it is about whether the relationship between two vertices runs one way only. For example, hyperlinks form a directed graph, and a train route map is an undirected graph. Clicking a hyperlink does not give you a link back to the original page (don't think about the back button, lol), whereas on a train route you can travel in either direction.
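The same distinction shows up directly in the DOT language: directed graphs use the `digraph` keyword with `->` edges, and undirected graphs use `graph` with `--` edges. The helper `to_dot`, the page names, and the station names below are all made up for illustration; this is only a sketch of how the two kinds of graph are written.

```python
def to_dot(name, edges, directed):
    """Render an edge list as DOT text.

    Directed graphs use the `digraph` keyword with `->` edges;
    undirected graphs use `graph` with `--` edges.
    """
    keyword = "digraph" if directed else "graph"
    connector = "->" if directed else "--"
    lines = [f"{keyword} {name} {{"]
    for source, target in edges:
        lines.append(f'    "{source}" {connector} "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hyperlinks point one way: page A links to page B, not the reverse.
links = to_dot("links", [("PageA", "PageB"), ("PageB", "PageC")], directed=True)

# Railway connections can be travelled in either direction.
route = to_dot("route", [("Tokyo", "Shinagawa"), ("Shinagawa", "Yokohama")], directed=False)

print(links)
print(route)
```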
I tried two installation methods this time: the installer and the zip file. Note that when I opened the download page of the official website, it looked different from the one shown in the articles mentioned above.
First, open the official website and click Download. Then click Graphviz Windows packages under Windows. This takes you to a GitHub page, where you select the file "2.38 yaml". Open the URL listed in that file and the download starts automatically. After that, run the installer to start the installation. Whether to check "Everyone" depends on how many accounts there are on your computer.
You can also download the zip file from the linked site. Once it is downloaded, all you have to do is unzip it.
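Whichever method you use, it can help to confirm that the dot program is reachable before moving on. The sketch below is one possible check, not part of the original setup. The helper `find_dot` and its `extra_dirs` argument are hypothetical; `shutil.which` and `dot -V` (which prints the Graphviz version) are standard.

```python
import shutil
import subprocess


def find_dot(extra_dirs=()):
    """Return the path to the dot executable, or None if it cannot be found.

    extra_dirs is a hypothetical list of folders to search in addition to
    the PATH, for example the bin folder of an unzipped Graphviz download.
    """
    found = shutil.which("dot")
    if found is not None:
        return found
    for directory in extra_dirs:
        found = shutil.which("dot", path=str(directory))
        if found is not None:
            return found
    return None


dot_path = find_dot()
if dot_path is None:
    print("dot was not found; check the install location or the PATH")
else:
    # `dot -V` reports the Graphviz version (written to stderr).
    result = subprocess.run([dot_path, "-V"], capture_output=True, text=True)
    print(dot_path, (result.stderr or result.stdout).strip())
```

If the zip file was unzipped somewhere that is not on the PATH, passing that folder's bin directory through `extra_dirs` gives you the absolute path needed later when drawing the tree.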
This time, we will visualize a decision tree trained on the Titanic survival dataset. The code is based on the code [here](https://qiita.com/5sigma_AAA/items/0c23907da9330681147b#%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%AE%9F%E8%A3%85). I will split it into Jupyter-style cells.
First, load the required libraries and read the data. (Replace the user part of the path with your own account name.)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, accuracy_score
import pandas as pd
from sklearn.tree import export_graphviz
import pydotplus
from io import StringIO
from IPython.display import Image
# Read the data
train = pd.read_csv("/Users/user/jupyter/train.csv")
Real-world datasets usually contain missing values, so check for them first.
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
Age, Cabin, and Embarked have missing values. I will ignore Cabin because it is not used this time. Missing Age values are filled with the median of the Age column, and missing Embarked values are filled with the mode, "S".
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna("S")
The model needs numeric input, so we convert the string columns 'Sex' and 'Embarked' to numbers. Here the numbers are assigned by hand, but the columns could also be converted using dummy variables.
#Conversion of categorical variables
train['Sex'] = train['Sex'].apply(lambda x: 1 if x == 'male' else 0)
train['Embarked'] = train['Embarked'].map( {'S': 0 , 'C':1 , 'Q':2}).astype(int)
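For comparison, here is a minimal sketch of the dummy-variable route using `pd.get_dummies`. The small `sample` frame is made up so the example runs on its own; it is not the Titanic data, and the rest of this article keeps using the hand-assigned numbers.

```python
import pandas as pd

# A tiny stand-in for the Titanic columns (values are illustrative only).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category instead of hand-assigned integers.
encoded = pd.get_dummies(sample, columns=["Sex", "Embarked"])
print(list(encoded.columns))
```

Dummy variables avoid implying an order such as S < C < Q, at the cost of more columns, which also means more feature names appearing in the drawn tree.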
Drop the columns that are not used.
train = train.drop(['Cabin','Name','PassengerId','Ticket'],axis =1)
Now that preprocessing is done, check once more for missing values.
train.isnull().sum()
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
The missing values are gone. Next, we split the data into a training set and a test set. The test set is 30% of the total.
#Divided into training data and test data
train_X = train.drop('Survived',axis = 1)
train_y = train.Survived
(train_X , test_X , train_y , test_y) = train_test_split(train_X, train_y , test_size = 0.3 , random_state = 0)
Build the model and train it. The maximum depth of the tree is 3.
#train
model = DecisionTreeClassifier(max_depth=3,random_state = 0)
model.fit(train_X , train_y)
#accuracy
pred = model.predict(test_X)
fpr, tpr, thresholds = roc_curve(test_y , pred,pos_label = 1)
auc(fpr,tpr)
print("accuracy=", accuracy_score(test_y, pred))
accuracy=0.8208955223880597
From here, I will draw with Graphviz.
#Process to treat the character string like a file object
dot_data = StringIO()
export_graphviz(
model, out_file=dot_data,
feature_names=list(train_X.columns),  # a plain list of column names
class_names=["Death", "Survival"]
)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
#Specify the absolute path to dot.exe inside the folder where you put Graphviz
graph.progs = {'dot': u"C:\\Users\\user\\anaconda3\\bin\\release\\bin\\dot.exe"}
#Visualize in notebook
Image(graph.create_png())
I was able to draw a decision tree.
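As a side note, the StringIO step above is one way to capture the DOT text. `export_graphviz` also returns the DOT source as a string when `out_file=None`, which can be handy for inspecting what gets passed to Graphviz. The toy data and the name `toy_model` below are made up so the sketch runs on its own; it is not the Titanic model.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data (made up for illustration): one feature, two classes.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

toy_model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_text = export_graphviz(
    toy_model,
    out_file=None,
    feature_names=["x"],
    class_names=["Death", "Survival"],
)
print(dot_text.splitlines()[0])
```

The returned string can be passed to `pydotplus.graph_from_dot_data` in the same way as `dot_data.getvalue()`.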
This time I used Jupyter, but of course the same thing is possible in other IDEs. By the way, the hardest part was finding the installation file on the official website.