A user is currently viewing the decision tree using the following code. Is there a way that we can export some calculated fields as output too?
For example, is it possible to display the sum of an input attribute at each node, i.e. the sum of feature 1 from the 'X' data array in the leaves of the tree?
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
alg = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, max_leaf_nodes=10)
alg.fit(X, y)  # the tree must be fitted before it can be exported
## View tree
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(alg, out_file=None, node_ids=True, proportion=True,
                                class_names=True, filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph  # in a notebook, this displays the rendered tree
To do that, define a function that returns the values of one feature for the samples that pass through a given node. Here node is the index of the tree node you want values for, and feature is the column (or feature) of X to read.
def node_feature_values(X, clf, node=0, feature=0, require_leaf=False):
    """Return an array of values from the input array X.

    The values are limited to
      1. samples that passed through the node with index `node`
      2. and taken from the column `feature`.

    clf must be a fitted DecisionTreeClassifier.
    """
    # ids of the leaf nodes (find_leaves is a separate helper)
    leaf_ids = find_leaves(X, clf)
    if require_leaf and node not in leaf_ids:
        print("require_leaf is set, "
              "select one of these nodes:\n{}".format(leaf_ids))
        return None

    # a sparse array that contains node assignment by sample
    node_indicator = clf.decision_path(X)
    node_array = node_indicator.toarray()

    # which samples at least passed through the node
    samples_in_node_mask = node_array[:, node] == 1

    return X[samples_in_node_mask, feature]
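The function above calls find_leaves, which is not defined in this post. A minimal sketch, assuming the helper should return the ids of the leaves that the samples in X land in, can use the classifier's apply method. The function body, the clf name and the random_state setting are assumptions for illustration, not the original author's code.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier


def find_leaves(X, clf):
    """Return the set of leaf node ids reached by the samples in X.

    Assumed helper: clf must be a fitted DecisionTreeClassifier.
    """
    # apply() gives, for each sample, the id of the leaf it ends up in
    return {int(node_id) for node_id in clf.apply(X)}


iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2,
                             max_leaf_nodes=10, random_state=0).fit(X, y)

leaf_ids = find_leaves(X, clf)
print(sorted(leaf_ids))
```

Because the set is built from the samples in X, a leaf that no sample reaches would not be listed; with the training data every leaf is reached.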
Applied to the example above, it returns the following:
values_arr = node_feature_values(X, alg, node=12, feature=0, require_leaf=True)
array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7,
5.8, 6.4, 6.5, 7.7, 7.7, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.1, 6.4,
7.4, 7.9, 6.4, 7.7, 6.3, 6.4, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7,
6.3, 6.5, 6.2, 5.9])
With this array you can compute any statistic you like, including the sum of the first feature (column 0 of 'X') for the samples in a leaf of the tree.
print("There are {} total samples in this node, "
"{}% of the total".format(len(values_arr), len(values_arr) / float(len(X))*100))
print("Feature Sum: {}".format(values_arr.sum()))
There are 43 total samples in this node, 28.666666666666668% of the total
Feature Sum: 286.69999999999993
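If the goal is a sum at every node rather than one node at a time, one possible sketch multiplies the transposed decision_path matrix by a feature column and then writes the result into the DOT labels. The names node_sums, leaf_sums, add_sum and dot_with_sums are illustrative. The regular expression assumes that export_graphviz starts each node line with the node id followed by [label=", which is worth checking against your scikit-learn version.

```python
import re

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2,
                             max_leaf_nodes=10, random_state=0).fit(X, y)

feature = 0  # column of X to aggregate

# decision_path returns a sparse (n_samples, n_nodes) indicator matrix;
# its transpose times a feature column is the feature sum at every node.
node_indicator = clf.decision_path(X)
node_sums = node_indicator.T @ X[:, feature]

# Leaves are the nodes without children (children_left == -1).
is_leaf = clf.tree_.children_left == -1
leaf_sums = {n: float(node_sums[n])
             for n in range(clf.tree_.node_count) if is_leaf[n]}

dot_data = export_graphviz(clf, out_file=None, node_ids=True,
                           filled=True, rounded=True)


def add_sum(match):
    """Prefix a node label with the feature sum for that node."""
    node_id = int(match.group(1))
    # '\\n' is a literal backslash-n, which DOT renders as a line break
    return '{} [label="sum(X[:, {}]) = {:.1f}\\n'.format(
        node_id, feature, node_sums[node_id])


# Assumption: node lines look like: 0 [label="node #0...
dot_with_sums = re.sub(r'^(\d+) \[label="', add_sum, dot_data,
                       flags=re.MULTILINE)
print(leaf_sums)
```

dot_with_sums can be passed to graphviz.Source in the same way as dot_data.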