I have programmed this ID3 algorithm, and for some reason the predicted value always comes back as None. I cannot figure out why the code never enters the if statement in the predict function, but I have narrowed the problem down to that area.
I have tried changing the predict function multiple times and debugging, but cannot work out why the issue persists and the feature value is not in tree[root_node]. Can someone please help with this?
def predict(tree, instance):
    if not isinstance(tree, dict):
        return tree
    else:
        root_node = next(iter(tree))
        feat_val = instance[root_node]
        if feat_val in tree[root_node]:
            return predict(tree[root_node][feat_val], instance)
        else:
            return None
def evaluate(tree, test_data_m, label):
    correct_predict = 0
    wrong_predict = 0
    for index, row in test_data_m.iterrows():  # for each row in the dataset
        result = predict(tree, test_data_m.loc[index])
        if result == test_data_m[label][index]:
            correct_predict += 1  # increase correct count
        else:
            wrong_predict += 1  # increase incorrect count
    accuracy = correct_predict / (correct_predict + wrong_predict)
    return accuracy
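Since predict only ever returns None from the else branch, the first thing to check is what the failing membership test actually compared. Below is a hedged debugging sketch (my own variant, not the asker's code); it assumes the tree is nested dicts of the form {feature_name: {feature_value: subtree}} and prints both sides of the miss. A common cause is a dtype mismatch, e.g. the tree was built with string keys while the test row holds numbers, or the test row contains a category never seen in training.

```python
def predict_debug(tree, instance):
    if not isinstance(tree, dict):      # leaf: return the stored label
        return tree
    root_node = next(iter(tree))        # feature this subtree splits on
    feat_val = instance[root_node]
    if feat_val in tree[root_node]:
        return predict_debug(tree[root_node][feat_val], instance)
    # Show the mismatch: '3' (str) vs 3 (int), or an unseen category,
    # is the usual culprit.
    print(f"miss: {feat_val!r} ({type(feat_val).__name__}) "
          f"not among keys {list(tree[root_node])!r}")
    return None

tree = {'colour': {'red': 'apple', 'yellow': 'banana'}}
predict_debug(tree, {'colour': 'red'})    # -> 'apple'
predict_debug(tree, {'colour': 3})        # prints the miss, -> None
```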
I want to make a loop that adds new columns to a dataframe. Each time it adds a new column, I want to generate the column's values with a lambda function. The function I wish to call in the lambda is calcOnOff(). This function has 4 parameters:
v3proba, the value of another column on this same row
on_to_off, the current value of the inner loop iterator
off_to_on, the current value of the outer loop iterator
prevOnOff, the value of this same column on the previous row.
Here is my code below
import pandas as pd

# I create a simple dataframe
dataExample = {'Name': ['Karan','Rohit','Sahil','Aryan','dex'], 'v3proba': [0.23,0.42,0.51,0.4,0.7]}
dfExample = pd.DataFrame(dataExample)

# func to be applied on each new column of the dataframe
def calcOnOff(v3proba, on_to_off, off_to_on, prevOnOff):
    if prevOnOff == "OFF" and (v3proba * 100) >= off_to_on:
        return "ON"
    elif prevOnOff == "OFF" and (v3proba * 100) < off_to_on:
        return "OFF"
    elif prevOnOff == "ON" and (v3proba * 100) < on_to_off:
        return "OFF"
    elif prevOnOff == "ON" and (v3proba * 100) >= on_to_off:
        return "ON"
    else:
        return "ERROR"

# my iterators
off_to_on = 50
on_to_off = 49

# loops to generate new columns and populate col values
for off_to_on in range(50, 90):
    for on_to_off in range(10, 49):
        dfExample[str(off_to_on) + '-' + str(on_to_off)] = dfExample.apply(
            lambda row: calcOnOff(row['v3proba'], on_to_off, off_to_on,
                                  row[str(off_to_on) + '-' + str(on_to_off)].shift()),
            axis=1)

dfExample
The expected output would be a table with around 1500 columns that looks like this:
I think the problem in my algorithm is how to handle the first row, as .shift() will look for a nonexistent row?
Any idea what I am doing wrong?
Preliminary remarks
You can't address a column before it's created, so the expression row[f'{off_to_on}-{on_to_off}'].shift() won't work; you'll get a KeyError there.
I guess you want row[...].shift() to shift down one row along the column. It doesn't work like that: row[...] returns the value contained in a single cell, not the column.
It's not clear what the previous state should be for the very first row. What is the value of the prevOnOff parameter in this case?
How to fill in the column taking into account previous calculations
Let's use generators for this purpose. They can keep the inner state, so we can reuse a previously calculated value to get the next one.
But first, let me clarify the logic of calcOnOff. As far as I can see, it returns On if proba >= threshold and Off otherwise, where threshold is on_off if the previous state was On and off_on otherwise. So we can rewrite it like this:
def calcOnOff(proba, on_off, off_on, previous):
    threshold = on_off if previous == 'On' else off_on
    return 'On' if proba >= threshold else 'Off'
Next, let's transform previous to boolean and calcOnOff into a generator:
def calc_on_off(on_off, off_on, prev='Off'):
    prev = prev == 'On'
    proba = yield
    while True:
        proba = yield 'On' if (prev := proba >= (on_off if prev else off_on)) else 'Off'
Here I made the assumption that the initial state is Off (the default value of prev), and treat the previous value as On if prev == True and Off otherwise.
Now I suggest using itertools.product to generate the parameter pairs on_off and off_on. For each pair of these values we create an individual generator:
calc = calc_on_off(on_off, off_on).send
calc(None) # push calc to the first yield
This we can apply to the 100 * df['v3proba']:
proba = 100*df['v3proba']
df[...] = proba.apply(calc)
Full code
import pandas as pd
from itertools import product

data = {
    'Name': ['Karan','Rohit','Sahil','Aryan','dex'],
    'v3proba': [0.23,0.42,0.51,0.4,0.7]
}
df = pd.DataFrame(data)

def calc_on_off(on_off, off_on, prev='Off'):
    prev = prev == 'On'
    proba = yield
    while True:
        prev = proba >= (on_off if prev else off_on)
        proba = yield 'On' if prev else 'Off'

proba = 100*df.v3proba
on_off = range(10, 50)
off_on = range(50, 90)

for state in product(on_off, off_on):
    calc = calc_on_off(*state).send
    calc(None)
    name = '{1}-{0}'.format(*state)  # 0: on_off, 1: off_on
    df[name] = proba.apply(calc)
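As a sanity check, the generator can be driven by hand for a single threshold pair. The pair on_off=49, off_on=50 and the probabilities below are taken from the example frame; driving it this way is my addition, not part of the original answer:

```python
def calc_on_off(on_off, off_on, prev='Off'):
    prev = prev == 'On'
    proba = yield                  # priming yield: first send() delivers a value
    while True:
        prev = proba >= (on_off if prev else off_on)
        proba = yield 'On' if prev else 'Off'

calc = calc_on_off(49, 50).send
calc(None)                         # advance the generator to its first yield
states = [calc(100 * p) for p in (0.23, 0.42, 0.51, 0.40, 0.70)]
# states -> ['Off', 'Off', 'On', 'Off', 'On']
```

Note the state carries over: 0.40 is compared against on_off=49 (because the previous state was On), not against off_on=50.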
Update: Comparing with the provided expected result
P.S. No Generators
What if I don't want to use generators? Then we have to keep the intermediate state outside the function somehow. Let's do it with globals:
def calc_on_off(proba):
    # get data outside
    global prev, on_off, off_on
    threshold = on_off if (prev == 'On') else off_on
    # save data outside
    prev = 'On' if proba >= threshold else 'Off'
    return prev

default_state = 'Off'
proba = 100*df.v3proba
r_on_off = range(10, 50)
r_off_on = range(50, 90)

for on_off, off_on in product(r_on_off, r_off_on):
    prev = default_state
    df[f'{off_on}-{on_off}'] = proba.apply(calc_on_off)
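A middle ground between the generator and the globals, not part of the original answer: a closure can hold the state with nonlocal while keeping the one-argument signature that Series.apply needs. Each column then just needs a fresh make_calc(...) so the state resets.

```python
def make_calc(on_off, off_on, prev='Off'):
    def calc(proba):
        nonlocal prev              # state survives between calls to calc
        threshold = on_off if prev == 'On' else off_on
        prev = 'On' if proba >= threshold else 'Off'
        return prev
    return calc

calc = make_calc(49, 50)
states = [calc(p) for p in (23, 42, 51, 40, 70)]
# states -> ['Off', 'Off', 'On', 'Off', 'On']
```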
I'm evaluating chess positions; the implementation isn't really relevant. I've inserted print statements to check how many paths I'm able to prune, but nothing is printed, meaning I don't actually prune anything.
I've understood the algorithm and followed the pseudo-code to the letter. Does anyone have any idea what's going wrong?
def alphabeta(self, node, depth, white, alpha, beta):
    ch = Chessgen()
    if depth == 0 or self.is_end(node):
        return self.stockfish_evaluation(node.board)
    if white:
        value = Cp(-10000)
        for child in ch.chessgen(node):
            value = max(value, self.alphabeta(child, depth - 1, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                print("Pruned white")
                break
        return value
    else:
        value = Cp(10000)
        for child in ch.chessgen(node):
            value = min(value, self.alphabeta(child, depth - 1, True, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                print("Pruned black")
                break
        return value
What is your pseudo-code?
The one I found gives slightly different code:
As I do not have your full code, I cannot run it:
def alphabeta(self, node, depth, white, alpha, beta):
    ch = Chessgen()  # can you do the init somewhere else to speed up the code?
    if depth == 0 or self.is_end(node):
        return self.stockfish_evaluation(node.board)
    if white:
        value = Cp(-10000)
        for child in ch.chessgen(node):
            value = max(value, self.alphabeta(child, depth - 1, False, alpha, beta))
            if value >= beta:
                print("Pruned white")
                return value
            alpha = max(alpha, value)
        return value
    else:
        value = Cp(10000)
        for child in ch.chessgen(node):
            value = min(value, self.alphabeta(child, depth - 1, True, alpha, beta))
            if value <= alpha:
                print("Pruned black")
                return value
            beta = min(beta, value)
        return value
A full working simple chess program can be found here:
https://andreasstckl.medium.com/writing-a-chess-program-in-one-day-30daff4610ec
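The cutoff logic itself can be checked without any chess machinery. Here is a self-contained toy version (my own stand-in, not the asker's classes): inner nodes are plain lists, leaves are ints, and on this classic three-branch tree a prune really does fire on the last branch.

```python
def alphabeta(node, depth, maximizing, alpha, beta, log):
    if depth == 0 or not isinstance(node, list):
        return node                      # leaf: static evaluation
    if maximizing:
        value = -10**9
        for child in node:
            value = max(value, alphabeta(child, depth - 1, False, alpha, beta, log))
            alpha = max(alpha, value)
            if alpha >= beta:
                log.append('pruned max')  # cutoff: remaining siblings skipped
                break
        return value
    value = 10**9
    for child in node:
        value = min(value, alphabeta(child, depth - 1, True, alpha, beta, log))
        beta = min(beta, value)
        if beta <= alpha:
            log.append('pruned min')
            break
    return value

log = []
tree = [[3, 5], [6, 9], [1, 2]]
best = alphabeta(tree, 2, True, -10**9, 10**9, log)
# best -> 6, log -> ['pruned min']  (leaf 2 is never evaluated:
# after seeing 1, beta=1 <= alpha=6)
```

If a self-contained version like this prunes but the full engine does not, the problem usually sits in how alpha and beta are threaded through the recursive calls rather than in the cutoff test itself.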
I am currently making a decision tree classifier using Gini impurity and information gain, splitting the tree on the best attribute (the one with the most gain) each time. However, it keeps picking the same attribute every time and simply adjusting the value for its question. This results in a very low accuracy, usually around 30%, as it only ever takes the very first attribute into account.
Finding the best split
# Used to find the best split for data among all attributes
# Used to find the best split for data among all attributes
def split(r):
    max_ig = 0
    max_att = 0
    max_att_val = 0
    i = 0
    curr_gini = gini_index(r)
    n_att = len(att)
    for c in range(n_att):
        if c == 3:
            continue
        c_vals = get_column(r, c)
        while i < len(c_vals):
            # Value of the current attribute that is being tested
            curr_att_val = r[i][c]
            true, false = fork(r, c, curr_att_val)
            ig = gain(true, false, curr_gini)
            if ig > max_ig:
                max_ig = ig
                max_att = c
                max_att_val = r[i][c]
            i += 1
    return max_ig, max_att, max_att_val
Comparing values to split the data into true and false streams
# Used to compare and test if the current row is greater than or equal to the test value
# in order to split up the data
def compare(r, test_c, test_val):
    if r[test_c].isdigit():
        return r[test_c] == test_val
    elif float(r[test_c]) >= float(test_val):
        return True
    else:
        return False

# Splits the data into two lists for the true/false results of the compare test
def fork(r, c, test_val):
    true = []
    false = []
    for row in r:
        if compare(row, c, test_val):
            true.append(row)
        else:
            false.append(row)
    return true, false
Iterate through tree
def rec_tree(r):
    ig, att, curr_att_val = split(r)
    if ig == 0:
        return Leaf(r)
    true_rows, false_rows = fork(r, att, curr_att_val)
    true_branch = rec_tree(true_rows)
    false_branch = rec_tree(false_rows)
    return Node(att, curr_att_val, true_branch, false_branch)
The working solution I have was to change the split function as follows. To be completely honest I wasn't able to see what was wrong at first, but it turns out the fix is that i is reset to 0 inside the loop over attributes. In the original version i was initialised once before the loop, so after scanning the first attribute's values the condition while i < len(c_vals) was already false for every remaining attribute, and they were never considered.
The working function is as follows:
def split(r):
    max_ig = 0
    max_att = 0
    max_att_val = 0
    # calculates gini for the rows provided
    curr_gini = gini_index(r)
    no_att = len(r[0])
    # Goes through the different attributes
    for c in range(no_att):
        # Skip the label column (beer style)
        if c == 3:
            continue
        column_vals = get_column(r, c)
        i = 0
        while i < len(column_vals):
            # value we want to check
            att_val = r[i][c]
            # Use the attribute value to fork the data to true and false streams
            true, false = fork(r, c, att_val)
            # Calculate the information gain
            ig = gain(true, false, curr_gini)
            # If this gain is the highest found then mark this as the best choice
            if ig > max_ig:
                max_ig = ig
                max_att = c
                max_att_val = r[i][c]
            i += 1
    return max_ig, max_att, max_att_val
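The difference is easiest to see in isolation. Here is a miniature of the bug (my own toy example, unrelated to the actual dataset): a while-loop counter initialised once outside a for loop is never reset, so only the first outer iteration does any work.

```python
def scan_broken(columns):
    visited = []
    i = 0                      # initialised once -- the bug
    for c in columns:
        while i < len(c):
            visited.append(c[i])
            i += 1
    return visited

def scan_fixed(columns):
    visited = []
    for c in columns:
        i = 0                  # reset for every column -- the fix
        while i < len(c):
            visited.append(c[i])
            i += 1
    return visited

cols = [[1, 2], [3, 4]]
# scan_broken(cols) -> [1, 2]        (second column silently skipped)
# scan_fixed(cols)  -> [1, 2, 3, 4]
```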
I want to enumerate the binary series generated with the code below (just copy-paste it to see what I'm trying to do). I used a global variable but still cannot find a way to pass the values of the counters (nn, nx, ny). Please don't mind whether there is a better way to produce the same series; I just want to know how to pass the counter values through these recursions in order to number the output as in the image at the head of this post. Thanks.
def ConcatenateString(saccum, nn):
    if len(saccum) < 4:
        biset = [1, 0]
        for a in biset:
            if a == 1:
                prevstring = saccum
                newsaccum = saccum + str(a)
                nx = nn + 1
                print(nx, newsaccum)
                ConcatenateString(newsaccum, nx)
            else:
                newsaccum = prevstring + str(a)
                ny = nx + 1
                print(ny, newsaccum)
                ConcatenateString(newsaccum, ny)
        nn = ny
    return (nn)

##MAIN
newstring = str("")
nc = 0
ConcatenateString(newstring, nc)
You should pass nn to the function and get it back in order to continue counting:
nn = ConcatenateString(newsaccum, nn)
def ConcatenateString(saccum, nn):
    if len(saccum) < 4:
        biset = [1, 0]
        for a in biset:
            if a == 1:
                prevstring = saccum
                newsaccum = saccum + str(a)
                nn += 1
                print(nn, newsaccum)
                nn = ConcatenateString(newsaccum, nn)
            else:
                newsaccum = prevstring + str(a)
                nn += 1
                print(nn, newsaccum)
                nn = ConcatenateString(newsaccum, nn)
    return nn

ConcatenateString("", 0)
EDIT: Reduced version.
def ConcatenateString(saccum, nn):
    if len(saccum) < 4:
        biset = [1, 0]
        for a in biset:
            if a == 1:
                prevstring = saccum
                newsaccum = saccum + str(a)
            else:
                newsaccum = prevstring + str(a)
            nn += 1
            print(nn, newsaccum)
            nn = ConcatenateString(newsaccum, nn)
    return nn

ConcatenateString("", 0)
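To see the counter threading end to end, here is the same pattern with the print swapped for list collection (my variant, for checking only; I also dropped prevstring, which always equals saccum inside the loop). Every binary string of length 1 to 4 gets exactly one number, so the top-level call returns 2 + 4 + 8 + 16 = 30.

```python
def concatenate(saccum, nn, out):
    if len(saccum) < 4:
        for a in (1, 0):
            newsaccum = saccum + str(a)
            nn += 1                                  # number this string...
            out.append((nn, newsaccum))
            nn = concatenate(newsaccum, nn, out)     # ...and keep counting below it
    return nn

out = []
total = concatenate("", 0, out)
# total -> 30; out[0] -> (1, '1'); out[-1] -> (30, '0000')
```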
I was trying to understand the effective difference between these two pieces of code. They were both written for an assignment I got at school, but only the first one works as it should. I've been unable to understand what goes wrong in the second one, so I'd be fantastically grateful if someone could shed some light on this problem.
First code:
def classify(self, obj):
    if sum([c[0].classify(obj) * c[1] for c in self.classifiers]) > 0:
        return 1
    else:
        return -1

def update_weights(self, best_error, best_classifier):
    w = self.data_weights
    for index in range(len(self.data_weights)):
        if self.standard.classify(self.data[index]) == best_classifier.classify(self.data[index]):
            s = -1
        else:
            s = 1
        self.data_weights[index] = self.data_weights[index] * math.exp(s * error_to_alpha(best_error))
Second code:
def classify(self, obj):
    score = 0
    for c, alpha in self.classifiers:
        score += alpha * c.classify(obj)
    if score > 0:
        return 1
    else:
        return -1

def update_weights(self, best_error, best_classifier):
    alpha = error_to_alpha(best_error)
    for d, w in zip(self.data, self.data_weights):
        if self.standard.classify(d) == best_classifier.classify(d):
            w *= w * math.exp(alpha)
        else:
            w *= w * math.exp(-1.0 * alpha)
The second doesn't modify the weights.
In the first you explicitly modify the weights array with the line
self.data_weights[index] = ...
but in the second you are only modifying w:
w *= ...
(and you have an extra factor of w). In the second case, w is a variable that is initialised from data_weights, but it is a new variable. It is not the same thing as the array entry, and changing its value does not change the array itself.
So when you later go to look at data_weights in the second case, it will not have been updated.
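A minimal fix for the second version, assuming data_weights is a plain list: write the updated value back by index (here via enumerate) instead of rebinding the loop variable. The sign convention mirrors the first version (shrink weights of correctly classified points); agree is a hypothetical stand-in for the two .classify calls.

```python
import math

def update_weights(data, data_weights, alpha, agree):
    for i, d in enumerate(data):
        s = -1.0 if agree(d) else 1.0           # shrink correct, grow incorrect
        data_weights[i] *= math.exp(s * alpha)  # writes into the list itself

weights = [1.0, 1.0]
update_weights([0, 1], weights, math.log(2), lambda d: d == 0)
# weights -> approximately [0.5, 2.0]
```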