How to explore a decision tree built using scikit learn - python

I am building a decision tree using
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
This all works fine. However, how do I then explore the decision tree?
For example, how do I find which entries from X_train appear in a particular leaf?

You need to use the predict method.
After training the tree, you feed the X values to predict their output.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
tree = clf.fit(iris.data, iris.target)
tree.predict(iris.data)
output:
>>> tree.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
To get details on the tree structure, we can use tree_.__getstate__()
Tree structure translated into an "ASCII art" picture
               0
        _______________
        1             2
               _______________
               3            12
           _________    _________
           4       7    13     16
          ___    _____  _____
          5 6    8   9  14 15
                   _____
                   10 11
tree structure as an array.
In [38]: tree.tree_.__getstate__()['nodes']
Out[38]:
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0),
(-1, -1, -2, -2.0, 0.0, 50, 50.0),
(3, 12, 3, 1.75, 0.5, 100, 100.0),
(4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0),
(5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0),
(-1, -1, -2, -2.0, 0.0, 47, 47.0),
(-1, -1, -2, -2.0, 0.0, 1, 1.0),
(8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0),
(-1, -1, -2, -2.0, 0.0, 3, 3.0),
(10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0),
(-1, -1, -2, -2.0, 0.0, 2, 2.0),
(-1, -1, -2, -2.0, 0.0, 1, 1.0),
(13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0),
(14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0),
(-1, -1, -2, -2.0, 0.0, 2, 2.0),
(-1, -1, -2, -2.0, 0.0, 1, 1.0),
(-1, -1, -2, -2.0, 0.0, 43, 43.0)],
dtype=[('left_child', '<i8'), ('right_child', '<i8'),
('feature', '<i8'), ('threshold', '<f8'),
('impurity', '<f8'), ('n_node_samples', '<i8'),
('weighted_n_node_samples', '<f8')])
Where:
The first node [0] is the root node.
Internal nodes have left_child and right_child referring to nodes with positive indices that are greater than the current node's index.
Leaves have -1 for both the left and right child.
Nodes 1, 5, 6, 8, 10, 11, 14, 15 and 16 are leaves.
The node array is built in depth-first order.
The feature field tells us which of the iris.data features is tested at the node to decide which branch a sample follows.
The threshold field is the value that feature is compared against: samples with feature <= threshold go left, the rest go right.
impurity reaches 0 at the leaves here, since in this fully grown tree all the samples in a leaf belong to the same class.
n_node_samples tells us how many training samples reach each node.
Using this information we could track each sample X to the leaf where it eventually lands by following the classification rules and thresholds in a script. Additionally, n_node_samples would allow us to write unit tests ensuring that each node gets the correct number of samples. Then, using the output of tree.predict, we could map each leaf to the associated class.
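The tracking script described above can be sketched as follows. This is only an illustration, not part of the original answer: the helper name find_leaf is made up, and the cast to float32 is there because scikit-learn converts inputs to float32 internally before comparing them with the thresholds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
t = clf.tree_

def find_leaf(sample):
    """Follow the split rules from the root until a leaf is reached."""
    node = 0
    while t.children_left[node] != -1:           # -1 marks a leaf
        if sample[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]          # rule satisfied: go left
        else:
            node = t.children_right[node]         # otherwise: go right
    return node

# scikit-learn compares float32 feature values against the thresholds
X32 = iris.data.astype(np.float32)
leaf_ids = np.array([find_leaf(x) for x in X32])

# number of training samples that land in each leaf
counts = {int(leaf): int(np.sum(leaf_ids == leaf)) for leaf in np.unique(leaf_ids)}
print(counts)

# rows of the training data that end up in one particular leaf
some_leaf = int(leaf_ids[0])
print(np.flatnonzero(leaf_ids == some_leaf))
```

The counts can be checked against t.n_node_samples, and leaf_ids against clf.apply(iris.data), which returns the same leaf indices directly.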

NOTE: This is not an answer, only a hint on possible solutions.
I encountered a similar problem recently in my project. My goal is to extract the corresponding chain of decisions for some particular samples. I think your problem is a subset of mine, since you only need to record the last step in the decision chain.
So far, the only viable solution seems to be writing a custom predict method in Python that keeps track of the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out of the box (as far as I know). To make it worse, it is a wrapper around a C implementation, which is pretty hard to customize.
Customization is fine for my problem, since I'm dealing with an unbalanced dataset, and the samples I care about (positive ones) are rare. So I can filter them out first using sklearn's predict and then get the decision chain using my custom code.
However, this may not work for you if you have a large dataset: if you parse the tree and predict in pure Python, it will run at Python speed and will not (easily) scale. You may have to fall back to customizing the C implementation.
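For readers on a newer scikit-learn release (0.18 onward, as far as I know), the estimator exposes a decision_path method that returns the chain of visited nodes without custom traversal code. The sketch below assumes such a version; the choice of sample rows is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# a few arbitrary rows, one from each class
samples = iris.data[[0, 60, 120]]

# sparse indicator matrix: entry (i, j) is nonzero if sample i visits node j
node_indicator = clf.decision_path(samples)
leaf_ids = clf.apply(samples)

paths = []
for i in range(samples.shape[0]):
    start, end = node_indicator.indptr[i], node_indicator.indptr[i + 1]
    path = node_indicator.indices[start:end].tolist()
    paths.append(path)
    print("sample %d visits nodes %s and ends in leaf %d" % (i, path, leaf_ids[i]))
```

Since this runs in the compiled code path, it should scale much better than a pure Python traversal.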

I've changed a bit what Dr. Drew posted.
The following code, given a data frame and the decision tree after being fitted, returns:
rules_list: a list of rules
values_path: a list of entries (entries for each class going through the path)
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
def get_rules(dtc, df):
    rules_list = []
    values_path = []
    values = dtc.tree_.value

    def RevTraverseTree(tree, node, rules, pathValues):
        '''
        Traverse an skl decision tree from a node (presumably a leaf node)
        up to the top, building the decision rules. The rules and pathValues
        should be input as empty lists, which will be modified in place.
        Each rule is a string: "<column name> <= threshold" for a left
        branch or "<column name> > threshold" for a right branch.
        The "tree" is a nested list of simplified tree attributes:
        [split feature, split threshold, left node, right node]
        '''
        # now find the node as either a left or right child of something
        # first try to find it as a left node
        try:
            prevnode = tree[2].index(node)
            leftright = '<='
            pathValues.append(values[prevnode])
        except ValueError:
            # failed, so find it as a right node - if this also raises,
            # the tree structure is inconsistent
            prevnode = tree[3].index(node)
            leftright = '>'
            pathValues.append(values[prevnode])
        # now let's get the rule that caused prevnode to -> node
        p1 = df.columns[tree[0][prevnode]]
        p2 = tree[1][prevnode]
        rules.append(str(p1) + ' ' + leftright + ' ' + str(p2))
        # if we've not yet reached the top, go up the tree one more step
        if prevnode != 0:
            RevTraverseTree(tree, prevnode, rules, pathValues)

    # get the nodes which are leaves
    leaves = dtc.tree_.children_left == -1
    leaves = np.arange(0, dtc.tree_.node_count)[leaves]
    # build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
    thistree = [dtc.tree_.feature.tolist()]
    thistree.append(dtc.tree_.threshold.tolist())
    thistree.append(dtc.tree_.children_left.tolist())
    thistree.append(dtc.tree_.children_right.tolist())
    # get the decision rules for each leaf node & apply them
    for (ind, nod) in enumerate(leaves):
        # get the decision rules
        rules = []
        pathValues = []
        RevTraverseTree(thistree, nod, rules, pathValues)
        # add the leaf's own values, then order everything from root to leaf
        pathValues.insert(0, values[nod])
        pathValues = list(reversed(pathValues))
        rules = list(reversed(rules))
        rules_list.append(rules)
        values_path.append(pathValues)
    return (rules_list, values_path)
An example follows:
from sklearn.model_selection import train_test_split
df = pd.read_csv('df.csv')
X = df[df.columns[:-1]]
y = df['classification']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtc = DecisionTreeClassifier(max_depth=2)
dtc.fit(X_train, y_train)
The fitted decision tree is shown in the original figure (caption: Decision Tree with width 2).
At this point, just calling the function:
get_rules(dtc, df)
This is what the function returns:
rules = [
['first <= 63.5', 'first <= 43.5'],
['first <= 63.5', 'first > 43.5'],
['first > 63.5', 'second <= 19.700000762939453'],
['first > 63.5', 'second > 19.700000762939453']
]
values = [
[array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])],
[array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])],
[array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])],
[array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])]
]
Note that, in values, each path also includes the values of the leaf itself as its last element.

The below code should produce a plot of your top ten features:
import numpy as np
import matplotlib.pyplot as plt
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
# Show at most ten features (fewer if the data has fewer columns)
top_n = min(10, len(importances))
top = indices[:top_n]
# Print the feature ranking
print("Feature ranking:")
for f in range(top_n):
    print("%d. feature %d (%f)" % (f + 1, top[f], importances[top[f]]))
# Plot the feature importances of the tree
# (a single tree has no per-estimator spread, so no error bars are drawn)
plt.figure()
plt.title("Feature importances")
plt.bar(range(top_n), importances[top], color="r", align="center")
plt.xticks(range(top_n), top)
plt.xlim([-1, top_n])
plt.show()
Taken from here and modified slightly to fit the DecisionTreeClassifier.
This doesn't exactly help you explore the tree, but it does tell you about the tree.

This code will do exactly what you want. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds in each column boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to an element in leaves, which has the node IDs for the leaves.
import numpy as np
# X_train is assumed to be a NumPy array here
n = X_train.shape[0]
# get the nodes which are leaves
leaves = clf.tree_.children_left == -1
leaves = np.arange(0, clf.tree_.node_count)[leaves]
# loop through each leaf and figure out the data in it
leaf_observations = np.zeros((n, len(leaves)), dtype=bool)
# build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
thistree = [clf.tree_.feature.tolist()]
thistree.append(clf.tree_.threshold.tolist())
thistree.append(clf.tree_.children_left.tolist())
thistree.append(clf.tree_.children_right.tolist())
# get the decision rules for each leaf node & apply them
for (ind, nod) in enumerate(leaves):
    # get the decision rules in numeric list form
    rules = []
    RevTraverseTree(thistree, nod, rules)
    # convert & apply to the data by sequentially &ing the rules
    thisnode = np.ones(n, dtype=bool)
    for rule in rules:
        if rule[1] == 1:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] > rule[2])
        else:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] <= rule[2])
    # get the observations that obey all the rules - they are the ones in this leaf node
    leaf_observations[:, ind] = thisnode
This needs the helper function defined here, which recursively traverses the tree starting from a specified node to build the decision rules.
def RevTraverseTree(tree, node, rules):
    '''
    Traverse an skl decision tree from a node (presumably a leaf node)
    up to the top, building the decision rules. The rules should be
    input as an empty list, which will be modified in place. The result
    is a list of tuples: (feature, direction (left=-1, right=1), threshold).
    The "tree" is a nested list of simplified tree attributes:
    [split feature, split threshold, left node, right node]
    '''
    # now find the node as either a left or right child of something
    # first try to find it as a left node
    try:
        prevnode = tree[2].index(node)
        leftright = -1
    except ValueError:
        # failed, so find it as a right node - if this also raises,
        # the tree structure is inconsistent
        prevnode = tree[3].index(node)
        leftright = 1
    # now let's get the rule that caused prevnode to -> node
    rules.append((tree[0][prevnode], leftright, tree[1][prevnode]))
    # if we've not yet reached the top, go up the tree one more step
    if prevnode != 0:
        RevTraverseTree(tree, prevnode, rules)

I think an easy option would be to use the apply method of the trained decision tree. Train the tree, apply it to the training data, and build a lookup table from the returned leaf indices:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
# apply training data to decision tree
leaf_indices = clf.apply(iris.data)
lookup = {}
# build lookup table
for i, leaf_index in enumerate(leaf_indices):
    try:
        lookup[leaf_index].append(iris.data[i])
    except KeyError:
        lookup[leaf_index] = []
        lookup[leaf_index].append(iris.data[i])
# test
unknown_sample = [[4., 3.1, 6.1, 1.2]]
index = clf.apply(unknown_sample)
print(lookup[index[0]])
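If row positions are more useful than the rows themselves (for example, to index back into a DataFrame), the same lookup idea can store indices instead. This is a small variation on the code above, not the answerer's original; the name rows_in_leaf is made up for illustration.

```python
from collections import defaultdict

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# leaf index for every training row
leaf_indices = clf.apply(iris.data)

# map each leaf's node id to the positions of the rows that fall into it
rows_in_leaf = defaultdict(list)
for row, leaf in enumerate(leaf_indices):
    rows_in_leaf[int(leaf)].append(row)

for leaf, rows in sorted(rows_in_leaf.items()):
    print("leaf %d holds %d training rows" % (leaf, len(rows)))
```

Indexing with iris.data[rows_in_leaf[leaf]] then recovers the entries for a given leaf.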

Have you tried dumping your decision tree into a Graphviz .dot file [1] and then loading it with graph_tool [2]?
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from graph_tool.all import *
iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
tree.export_graphviz(clf, out_file='tree.dot')
# load graph with graph_tool and explore structure as you please
g = load_graph('tree.dot')
for v in g.vertices():
    for e in v.out_edges():
        print(e)
    for w in v.out_neighbours():
        print(w)
[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
[2] https://graph-tool.skewed.de/

Related

How to retain node ordering when converting graph from networkx to pytorch geometric?

Question: How to retain the node ordering/labels when converting a graph from networkx to pytorch geometric?
Code: (to be run in Google Colab)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import torch
from torch.nn import Linear
import torch.nn.functional as F
torch.__version__
# install pytorch geometric
!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
from torch_geometric.nn import GCNConv
from torch_geometric.utils.convert import to_networkx, from_networkx
# Make the networkx graph
G = nx.Graph()
# Add some cars
G.add_nodes_from([
('Ford', {'y': 0, 'Name': 'Ford'}),
('Lexus', {'y': 1, 'Name': 'Lexus'}),
('Peugot', {'y': 2, 'Name': 'Peugot'}),
('Mitsubushi', {'y': 3, 'Name': 'Mitsubishi'}),
('Mazda', {'y': 4, 'Name': 'Mazda'}),
])
# Relabel the nodes
remapping = {x[0]: i for i, x in enumerate(G.nodes(data = True))}
G = nx.relabel_nodes(G, remapping, copy=False)
# Add some edges --> A = [(0, 1, 0, 1, 1), (1, 0, 1, 1, 0), (0, 1, 0, 0, 1), (1, 1, 0, 0, 0), (1, 0, 1, 0, 0)] as the adjacency matrix
G.add_edges_from([
(0, 1), (0, 3), (0, 4),
(1, 2), (1, 3),
(2, 1), (2, 4),
(3, 0), (3, 1),
(4, 0), (4, 2)
])
# Convert the graph into PyTorch geometric
pyg_graph = from_networkx(G)
pyg_graph.edge_index
When I print the edge indices in the last line of the code, I get different answers each time I run it. Most importantly, I am looking to consistently get the same (correct) answer whereby each node numbering is retained from networkx:
tensor([[0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
[4, 2, 4, 2, 3, 0, 1, 1, 4, 0, 1, 3]])
The form of this edge index tensor is:
the first list contains the node ids of the source node
the second list contains the node ids of the target node
For the node ids to be retained, we would expect node 0 to appear three times in the first (source) list instead of just twice.
Is there any way for me to force PyTorch Geometric to copy over the node ids?
Thanks
[EDIT] One possible work-around I have is using the following bit of code which is able to produce edge index and weight tensors for PyTorch geometric
# Create a dictionary of the mappings from company --> node id
mapping_dict = {x: i for i, x in enumerate(list(G.nodes()))}
# Get the number of nodes
num_nodes = len(mapping_dict)
# Now create a source, target, and edge list for PyTorch geometric graph
edge_source_list = []
edge_target_list = []
edge_weight_list = []
# iterate through all the edges
for e in G.edges():
    # first element of tuple is appended to source edge list
    edge_source_list.append(mapping_dict[e[0]])
    # last element of tuple is appended to target edge list
    edge_target_list.append(mapping_dict[e[1]])
    # add the edge weight to the edge weight list
    edge_weight_list.append(1)
# now create full edge lists for pytorch geometric - undirected edges need to be defined in both directions
full_source_list = edge_source_list + edge_target_list # full source list
full_target_list = edge_target_list + edge_source_list # full target list
full_weight_list = edge_weight_list + edge_weight_list # full edge weight list
print(len(edge_source_list), len(edge_target_list), len(full_source_list))
# now convert these to torch tensors
edge_index_tensor = torch.LongTensor( np.concatenate([ [np.array(full_source_list)], [np.array(full_target_list)]] ))
edge_weight_tensor = torch.FloatTensor(np.array(full_weight_list))
It seems this issue was resolved in the comments (the solution proposed by @Sparky05 is to use copy=True, which is the default for nx.relabel_nodes), but below is the explanation for why the node order changes.
When copy=False is passed, nx.relabel_nodes will re-add the nodes to the graph in the order they appear in the set of keys of remapping dict. The relevant lines in the code are here:
def _relabel_inplace(G, mapping):
    old_labels = set(mapping.keys())
    new_labels = set(mapping.values())
    if len(old_labels & new_labels) > 0:
        # skip codes for labels sets that overlap
    else:
        # non-overlapping label sets
        nodes = old_labels
    # skip lines
    for old in nodes:  # this is now in the set order
By using set the order of nodes is modified, so to preserve the order the non-overlapping label sets should be treated as:
    else:
        # non-overlapping label sets
        nodes = mapping.keys()
A related PR is submitted here.
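The copy=True workaround mentioned above can be checked with networkx alone, without PyTorch Geometric. The sketch below is an illustration under that assumption: it relabels a copy, then builds the two-directional source and target lists by hand, much like the question's own workaround.

```python
import networkx as nx

# same cars as in the question, without the extra attributes
G = nx.Graph()
G.add_nodes_from(["Ford", "Lexus", "Peugot", "Mitsubishi", "Mazda"])

# relabel into a copy: nodes are added in the graph's own iteration order
remapping = {name: i for i, name in enumerate(G.nodes())}
H = nx.relabel_nodes(G, remapping, copy=True)
H.add_edges_from([(0, 1), (0, 3), (0, 4), (1, 2), (1, 3), (2, 4)])

node_order = list(H.nodes())
print(node_order)

# undirected edges listed in both directions, as an edge index expects
edges = list(H.edges())
sources = [u for u, v in edges] + [v for u, v in edges]
targets = [v for u, v in edges] + [u for u, v in edges]
print(sources)
print(targets)
```

Node 0 has three neighbours, so it appears three times among the sources, which is the property the question was checking for.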

How to pass edge attribute as the edge distance to nx.closeness_centrality()?

Suppose I have a graph defined by this matrix:
import numpy as np
import networkx as nx
test = np.array([[0, 0, 4, 0],
                 [0, 0, 6, 0],
                 [4, 6, 0, 10],
                 [0, 0, 10, 0]])
test_nx = nx.from_numpy_array(test)
Next, I'd like to compute the weighted closeness centrality for each node of this graph.
nx.closeness_centrality(test_nx, distance="edges")
I get:
{0: 0.6, 1: 0.6, 2: 1.0, 3: 0.6}
However, this is clearly not considering edge weights. I'm guessing the reason is I'm not passing the "distance" argument properly.
According to the docs:
closeness_centrality(G, u=None, distance=None, normalized=True)
distance (edge attribute key, optional (default=None)) – Use the
specified edge attribute as the edge distance in shortest path
calculations
Can anyone advise me how to pass edge weights to this function? My desired output would be a dictionary of closeness centrality values (one per node) which considers that these edges have weights and they are not simply binary.
If you look at the edges using this:
print(test_nx.edges(data=True))
# output: [(0, 2, {'weight': 4}), (1, 2, {'weight': 6}), (2, 3, {'weight': 10})]
you can see that the key used to save the edge weight is weight. The right distance key will be this one.
nx.closeness_centrality(test_nx, distance="weight")
# output {0: 0.10714285714285714, 1: 0.09375, 2: 0.15, 3: 0.075}
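To see where these numbers come from, the weighted closeness can be recomputed by hand: for each node it is (n - 1) divided by the sum of its shortest weighted distances to all other nodes. The sketch below does this in plain Python with Floyd-Warshall; it assumes a connected graph and is only a cross-check, not how networkx implements it.

```python
INF = float("inf")

# the adjacency matrix from the question; 0 means "no edge"
weights = [[0, 0, 4, 0],
           [0, 0, 6, 0],
           [4, 6, 0, 10],
           [0, 0, 10, 0]]
n = len(weights)

# initial distances: 0 on the diagonal, edge weight where an edge exists
dist = [[0 if i == j else (weights[i][j] if weights[i][j] else INF)
         for j in range(n)] for i in range(n)]

# Floyd-Warshall: allow each node k in turn as an intermediate stop
for k in range(n):
    for i in range(n):
        for j in range(n):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

# closeness = (n - 1) / total shortest-path distance (connected graph assumed)
closeness = {i: (n - 1) / sum(dist[i]) for i in range(n)}
print(closeness)
```

For node 0 the distances are 4 (to node 2), 10 (to node 1) and 14 (to node 3), so the closeness is 3 / 28, which is about 0.107 and matches the networkx output.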

compare two arrays to make an accuracy of KNN prediction

I have two arrays from which I have to find the accuracy of my prediction.
predictions = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
y_test = [1, 0, 0, 1, 0, 1, 0, 1, 1, 1]
so in this case, the accuracy is = (8/10)*100 = 80%
I have written a method to do this task. Here is my code, but I don't get the accuracy of 80% in this case.
def getAccuracy(y_test, predictions):
    correct = 0
    for x in range(len(y_test)):
        if y_test[x] is predictions[x]:
            correct += 1
    return (correct/len(y_test)) * 100.0
Thanks for helping me.
Your code should work as long as the numbers in the arrays fall in the range of small integers that the Python interpreter caches and reuses instead of creating new objects (-5 to 256 in CPython). This is because you used is, which is an identity check and not an equality check. So you are comparing memory addresses, which are only equal for that specific range of numbers. Use == instead and it will always work.
For a more Pythonic solution you can also take a look at list comprehensions:
assert len(predictions) == len(y_test), "Unequal arrays"
accuracy = sum([p == y for p, y in zip(predictions, y_test)]) / len(predictions) * 100
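The difference between is and == can be seen with two equal integers outside the cached range. The snippet below is a small illustration; the result of the is comparison is a CPython implementation detail, so it is printed rather than relied upon. The accuracy helper is a hypothetical rewrite using ==.

```python
def accuracy(y_true, y_pred):
    """Percentage of positions where the two sequences hold equal values."""
    if len(y_true) != len(y_pred):
        raise ValueError("Unequal arrays")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

predictions = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
y_test = [1, 0, 0, 1, 0, 1, 0, 1, 1, 1]
result = accuracy(y_test, predictions)
print(result)

# two equal values built at runtime, outside the small-integer cache
a = int("1000")
b = int("1000")
print(a == b)   # equality: compares values
print(a is b)   # identity: compares objects, typically False in CPython
```

With the lists from the question, 8 of the 10 positions match, giving 80.0.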
If you want 80.0 as the result for your example, your code already produces it.
Your code gives 80.0 as you wanted; however, you should use == instead of is, because is checks object identity rather than equality.
def getAccuracy(y_test, predictions):
    n = len(y_test)
    correct = 0
    for x in range(n):
        if y_test[x] == predictions[x]:
            correct += 1
    return (correct/n) * 100.0
predictions = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
y_test = [1, 0, 0, 1, 0, 1, 0, 1, 1, 1]
print(getAccuracy(y_test, predictions))
80.0
Here's an implementation using Numpy:
import numpy as np
n = len(y_test)
100*np.sum(np.isclose(predictions, y_test))/n
or if you convert your lists to numpy arrays, then
100*np.sum(predictions == y_test)/n

How to produce different random results with DEAP?

I am using the DEAP library to maximize a metric, and I noticed that whenever I restart the algorithm (which is supposed to create a random list of binary values - 1s and 0s) it is producing the same initial values.
I became suspicious and copied their basic DEAP example here and re-ran the algorithms again:
import array, random
from deap import creator, base, tools, algorithms
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", array.array, typecode='b', fitness=creator.FitnessMax)
toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, 10)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
def evalOneMax(individual):
    return sum(individual),
toolbox.register("evaluate", evalOneMax)
toolbox.register("mate", tools.cxTwoPoints)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)
population = toolbox.population(n=10)
NGEN=40
for gen in range(NGEN):
    offspring = algorithms.varAnd(population, toolbox, cxpb=0.5, mutpb=0.1)
    fits = toolbox.map(toolbox.evaluate, offspring)
    for fit, ind in zip(fits, offspring):
        ind.fitness.values = fit
    population = offspring
The code above is exactly their example, but with the population and individual size reduced to 10. I ran the algorithm 5 times and it produced exact copies of each other. I also added a print statement to get the below output:
>python testGA.py
[1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
Starting the Evolution Algorithm...
Evaluating Individual: [0, 1, 0, 1, 0, 1, 1, 1, 1, 0]
Evaluating Individual: [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
Evaluating Individual: [0, 0, 1, 0, 0, 1, 1, 0, 0, 1]
Evaluating Individual: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Evaluating Individual: [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
Evaluating Individual: [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]
Evaluating Individual: [0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
Evaluating Individual: [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
Evaluating Individual: [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
Evaluating Individual: [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
This output is generated every time I call the function - In that order. They are exactly identical.
I have read that I shouldn't have to seed the random.randint function, and I tested it by writing a basic script that just prints out a list of 10 random ints ranged 0 to 1. This worked fine; it just seems to produce the same values when I feed it through DEAP.
Is this normal? How can I ensure that, when I run the algorithm, I get different 'individuals' every time?
EDIT:
Sorry for the late reply, here is the full source I am using:
import random, sys
from deap import creator, base, tools
class Max():
    def __init__(self):
        creator.create("FitnessMax", base.Fitness, weights=(1.0,))
        creator.create("Individual", list, fitness=creator.FitnessMax)
        INDIVIDUAL_SIZE = 10
        self.toolbox = base.Toolbox()
        self.toolbox.register("attr_bool", random.randint, 0, 1)
        self.toolbox.register("individual", tools.initRepeat, creator.Individual, self.toolbox.attr_bool, n=INDIVIDUAL_SIZE)
        self.toolbox.register("population", tools.initRepeat, list, self.toolbox.individual)
        self.toolbox.register("mate", tools.cxTwoPoints)
        self.toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
        self.toolbox.register("select", tools.selTournament, tournsize=3)
        self.toolbox.register("evaluate", self.evaluate)
        print self.main()
    def evaluate(self, individual):
        # Some debug code
        print 'Evaluating Individual: ' + str(individual)
        return sum(individual),
    def main(self):
        CXPB, MUTPB, NGEN = 0.5, 0.2, 40
        random.seed(64)
        pop = self.toolbox.population(n=10)
        print "Starting the Evolution Algorithm..."
        fitnesses = list(map(self.toolbox.evaluate, pop))
        for ind, fit in zip(pop, fitnesses):
            ind.fitness.values = fit
        # ----------------------------------------------------------
        # Killing the program here - just want to see the population created
        sys.exit()
        print "Evaluated %i individuals" % (len(pop))
        for g in range(NGEN):
            print "-- Generation %i --" % (g)
            # Select the next generation individuals
            offspring = self.toolbox.select(pop, len(pop))
            # Clone the selected individuals
            offspring = list(map(self.toolbox.clone, offspring))
            # Apply crossover and mutation on the offspring
            for child1, child2 in zip(offspring[::2], offspring[1::2]):
                if random.random() < CXPB:
                    self.toolbox.mate(child1, child2)
                    del child1.fitness.values
                    del child2.fitness.values
            for mutant in offspring:
                if random.random() < MUTPB:
                    self.toolbox.mutate(mutant)
                    del mutant.fitness.values
            # Evaluate the individuals with an invalid fitness
            invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
            fitnesses = map(self.toolbox.evaluate, invalid_ind)
            for ind, fit in zip(invalid_ind, fitnesses):
                ind.fitness.values = fit
            print "\tEvaluated %i individuals" % (len(pop))
            pop[:] = offspring
            fits = [ind.fitness.values[0] for ind in pop]
            length = len(pop)
            mean = sum(fits) / length
            sum2 = sum(x*x for x in fits)
            std = abs(sum2 / length - mean**2)**0.5
            print "\tMin %s" % (min(fits))
            print "\tMax %s" % (max(fits))
            print "\tAvg %s" % (mean)
            print "\tStd %s" % (std)
class R_Test:
    def __init__(self):
        print str([random.randint(0, 1) for i in range(10)])
if __name__ == '__main__':
    #rt = R_Test()
    mx = Max()
The R_Test class is there to test the random generation in Python. I read here that the seed is dynamically called if not given in Python, and I wanted to test this.
How I have been executing the above code has been as such:
> python testGA.py
... the 10 outputs
> python testGA.py
... the exact same outputs
> python testGA.py
... the exact same outputs
> python testGA.py
... the exact same outputs
> python testGA.py
... the exact same outputs
Obviously 5 times isn't exactly a strenuous test, but the fact that all 10 values are the same 5 times in a row raised a red flag.
The problem is that you specify a seed for the random number generator in your main function. Simply comment out the line random.seed(64) and you will get different results every time you execute your program.
In the DEAP example files, a specific seed is set because we also use these examples as integration tests. If the output of an example changes after a modification in the framework's base code, we want to know. It also allows us to benchmark the time required by each example and provide a ballpark estimate to our users. The results of these benchmarks are available online at http://deap.gel.ulaval.ca/speed/.
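The effect of the fixed seed can be reproduced with the standard library alone, without DEAP. The sketch below is only an illustration; make_individual is a made-up stand-in for the toolbox's individual generator.

```python
import random

def make_individual(size=10):
    """Stand-in for a DEAP individual: a list of random bits."""
    return [random.randint(0, 1) for _ in range(size)]

# same seed, same sequence: this is what random.seed(64) in main() causes
random.seed(64)
first = make_individual()
random.seed(64)
second = make_individual()
print(first == second)

# reseeding without an argument uses OS entropy (or the clock),
# so the result will usually differ from run to run
random.seed()
third = make_individual()
print(third)
```

Keeping the seed is still useful when a run has to be reproducible, for example while debugging an evaluation function.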

Feature selection with LinearSVC

When I try running the following code with my data (from this example)
X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)
I get:
"Invalid threshold: all features are discarded"
I tried specifying my own threshold:
clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf.fit(X,y)
X_new = clf.transform(X, threshold=my_threshold)
but I either get:
An array X_new of the same size as X, this is whenever my_threshold is one of:
'mean'
'median'
Or the "Invalid threshold" error (e.g. when passing scalar values to threshold)
I can't post the entire matrix X, but below are a few stats of the data:
> X.shape
Out: (29,312)
> np.mean(X, axis=1)
Out:
array([-0.30517191, -0.1147345 , 0.03674294, -0.15926932, -0.05034101,
-0.06357734, -0.08781186, -0.12865185, 0.14172452, 0.33640029,
0.06778798, -0.00217696, 0.09097335, -0.17915627, 0.03701893,
-0.1361117 , 0.13132006, 0.14406628, -0.05081956, 0.20777349,
-0.06028931, 0.03541849, -0.07100492, 0.05740661, -0.38585413,
0.31837905, 0.14076042, 0.1182338 , -0.06903557])
> np.std(X, axis=1)
Out:
array([ 1.3267662 , 0.75313658, 0.81796146, 0.79814621, 0.59175161,
0.73149726, 0.8087903 , 0.59901198, 1.13414141, 1.02433752,
0.99884428, 1.11139231, 0.89254901, 1.92760784, 0.57181158,
1.01322265, 0.66705546, 0.70248779, 1.17107696, 0.88254386,
1.06930436, 0.91769016, 0.92915593, 0.84569395, 1.59371779,
0.71257806, 0.94307434, 0.95083782, 0.88996455])
y = array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
0, 0, 0, 0, 0, 0])
This is all with scikit-learn 0.14.
You should first check whether your SVM model trains well before trying to use it as the basis for a transformation. It is possible that your C parameter is too small, which causes sklearn to train a trivial model whose coefficients are all zero, and this leads to the removal of all features. You can check this either by running classification tests on your data, or at least by printing the learned coefficients (clf.coef_).
It would be a good idea to run a grid search for the C that generalizes best, and then use that value for the transformation.
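The coefficient check suggested above could look like the sketch below. It runs on synthetic data from make_classification, since the original X cannot be posted, so the counts it prints say nothing about the asker's data; the candidate C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic stand-in for the real data (assumption: 3 classes, 30 features)
X, y = make_classification(n_samples=60, n_features=30, n_informative=5,
                           n_classes=3, random_state=0)

nonzero_counts = {}
for C in (0.001, 0.01, 0.1, 1.0):
    clf = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    # a feature survives if at least one class gives it a nonzero weight
    kept = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
    nonzero_counts[C] = len(kept)
    print("C=%g keeps %d of %d features" % (C, len(kept), X.shape[1]))
```

If the smallest values of C keep no features at all, that is the situation in which the "all features are discarded" error would be expected.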
