Errors with Python Classification and Regression Trees

I'm learning how to use decision trees in Python. I modified an example from this site to import a CSV file instead of using the iris dataset:
http://machinelearningmastery.com/get-your-hands-dirty-with-scikit-learn-now/
Code:
import numpy as np
import urllib
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import datasets
from sklearn import metrics
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
#print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:7]
y = dataset[:,8]
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print model
Error:
Traceback (most recent call last):
File "DatasetTest2.py", line 24, in <module>
model.fit(dataset.data, dataset.target)
AttributeError: 'numpy.ndarray' object has no attribute 'target'
I am not sure why this error is occurring. If I use the iris dataset from the example, it works just fine. Eventually, I need to be able to run decision trees on CSV files.
I've also tried the following code that also results in the same error:
# Import Python Modules
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np
#Import Data
raw_data = pd.read_csv("DataTest1.csv")
dataset = raw_data.as_matrix()
#print dataset.shape
#print dataset
# separate the data from the target attributes
X = dataset[:,[2,3,4,7,10]]
y = dataset[:,[1]]
#print X
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print model

The dataset object imported in that example is not a plain table of data. It is a special object that comes with attributes like data and target so that it can be used as shown in the example. If you have your own data, you need to decide what to use as data and what to use as target. From your example, it looks like you want model.fit(X, y).
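For example, a minimal sketch of that fix for the first snippet (assuming the Pima CSV has already been downloaded locally as pima-indians-diabetes.csv; the dataset has eight feature columns and the target in column 8) fits on the sliced arrays directly:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# load the CSV file as a numpy array
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# separate the features from the target column
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit a CART model on X and y, not on dataset.data / dataset.target
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)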

Related

UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names

I wrote a Flask program that takes length and width measurements from the user to predict the fish species, but as soon as I submit the input it shows this warning:
UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df=pd.read_csv('Fish.csv')
df.head()
X = df.drop('Species', axis=1)
y = df['Species']
cols = X.columns
index = X.index
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
from sklearn.ensemble import RandomForestClassifier
random=RandomForestClassifier()
random.fit(X_train,y_train)
y_pred=random.predict(X_test)
from sklearn.metrics import accuracy_score
score=accuracy_score(y_test,y_pred)
# Create a Pickle file
import pickle
pickle_out = open("model.pkl","wb")
pickle.dump(logistic_model, pickle_out)
pickle_out.close()
logistic_model.predict([[242.0,23.2,25.4,30.0,11.5200,4.0200]])
import numpy as np
import pickle
import pandas as pd
from flask import Flask, request, jsonify, render_template
app=Flask(__name__)
pickle_in = open("model.pkl","rb")
random = pickle.load(pickle_in)
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=["POST"])
def predict():
    """
    For rendering results on the HTML GUI
    """
    int_features = [x for x in request.form.values()]
    final_features = [np.array(int_features)]
    prediction = random.predict(final_features)
    return render_template('index.html', prediction_text='The fish belongs to species {}'.format(str(prediction)))

if __name__ == '__main__':
    app.run()
Data Set
https://www.kaggle.com/datasets/aungpyaeap/fish-market
I also faced the same warning:
UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names.
This warning is raised because the DataFrame X_train passed to model.fit() carries feature (column) names, but the array or list you later pass to predict() does not, so the model is asked to predict on unnamed values even though it was fitted on named columns.
Hope this helps beginners when making predictions on unseen data with a fitted model.
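As an illustration (a minimal sketch, assuming cols holds the training column names and random is the fitted classifier from the question), one way to avoid the warning is to keep the feature names on the row you predict on:
import pandas as pd
# the same sample values used in the question, wrapped in a DataFrame with the training columns
sample = [242.0, 23.2, 25.4, 30.0, 11.5200, 4.0200]
sample_df = pd.DataFrame([sample], columns=cols)
prediction = random.predict(sample_df)  # no feature-name warning, since the names match training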
Your X and y are pandas DataFrames. Before fitting the random forest classifier, convert them to numpy arrays:
X = X.values
y = y.values
After this, do the train/test split:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
Now fit the model (the code is the same as yours):
from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier()
random.fit(X_train,y_train)
y_pred=random.predict(X_test)
In the Flask app you pass the input as a numpy array, but during training you used pandas DataFrames; that mismatch is why the warning was raised. After training on numpy arrays, it should work properly.

how to give input to loaded .pkl model in python

I have a Random Forest model saved in a .pkl file.
I have loaded the .pkl model, but now I need to feed it the test data and check the prediction accuracy.
How do I give input to the .pkl model?
import pickle
def read_from_pickle(RF):
    with open(RF, 'rb') as file:
        try:
            while True:
                yield pickle.load(file)
        except EOFError:
            pass
This is the code I used to load the model.
Next, how do I give it input?
This solution uses a Random Forest regressor; my model was for dynamic price prediction.
import pandas as pd
import numpy as np
from sklearn import pipeline, preprocessing,metrics,model_selection,ensemble,linear_model
from sklearn_pandas import DataFrameMapper
from sklearn.metrics import mean_squared_error
# First we loaded these libraries, then we loaded the dataset; all the cleaning work was done after that
data.to_csv("Pune_hpp.csv",index=False)
mapper = DataFrameMapper([
(['area_type','size','new_total_sqft','bath','balcony',], preprocessing.StandardScaler()),
# (['area_type','size'],preprocessing.OneHotEncoder())
],df_out=True)
# Here we created two pipelines, because we compare two algorithms using MSE and RMSE; the algorithms are set up below
pipeline_obj_LR=pipeline.Pipeline([
('mapper',mapper),
("model",linear_model.LinearRegression())
])
pipeline_obj=pipeline.Pipeline([
('mapper',mapper),
("model",ensemble.RandomForestRegressor())
])
X = ['area_type', 'size', 'new_total_sqft', 'bath', 'balcony']  # X holds the INPUT columns
Y = ['price']  # Y is the OUTPUT column
# Here the comparison process starts
pipeline_obj_LR.fit(data[X], data[Y])  # linear regression
pipeline_obj.fit(data[X], data[Y])     # random forest
pipeline_obj.predict(data[X])          # a quick prediction check
predict = pipeline_obj_LR.predict(data[X])
# Below is the actual way to compare the two and see which algorithm fits best
# Mean squared error and root mean squared error for the linear regression
print('MSE using linear_regression: ', mean_squared_error(data[Y], predict))
print('RMSE using linear_regression: ', mean_squared_error(data[Y], predict)**(0.5))
# Mean squared error and root mean squared error for the random forest regressor
predict = pipeline_obj.predict(data[X])
print('MSE using randomforestregression: ', mean_squared_error(data[Y], predict))
print('RMSE using randomforestregression: ', mean_squared_error(data[Y], predict)**(0.5))
# I went with the random forest and saved it with joblib because my dataset was huge,
# joblib is easy to implement, and it needs very few lines of code; note that I did not dump pipeline_obj_LR.
# This is how the model is written to and read back from the .pkl file:
import joblib
joblib.dump(pipeline_obj,'dynamic_price_pred.pkl')
modelReload=joblib.load('dynamic_price_pred.pkl')
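To actually give input to the reloaded model (which was the original question), a minimal sketch, using hypothetical numeric feature values and assuming the same X column list defined above, would be:
import pandas as pd
# one hypothetical row with the columns the pipeline was trained on
new_row = pd.DataFrame([{
    'area_type': 1,          # hypothetical encoded value
    'size': 3,               # hypothetical encoded value
    'new_total_sqft': 1200.0,
    'bath': 2,
    'balcony': 1,
}])
predicted_price = modelReload.predict(new_row[X])
print(predicted_price)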

KeyError:"['class']" not found in axis

I found a tutorial about the decision tree algorithm using the PyXLL add-in for Excel and tried to run it. I get an error: KeyError: "['class'] not found in axis".
from pyxll import xl_func
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import os
@xl_func("float, int, int: object")
def ml_get_zoo_tree_2(train_size=0.75, max_depth=5, random_state=245245):
    # Load the zoo data
    dataset = pd.read_csv(os.path.join(os.path.dirname(__file__), "zoo.csv"))
    # Drop the animal names since this is not a good feature to split the data on
    dataset = dataset.drop("animal_name", axis=1)
    # Split the data into a training and a testing set
    features = dataset.drop("class", axis=1)
    targets = dataset["class"]
    train_features, test_features, train_targets, test_targets = \
        train_test_split(features, targets, train_size=train_size, random_state=random_state)
    # Train the model
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
    tree = tree.fit(train_features, train_targets)
    # Add the feature names to the tree for use in the predict function
    tree._feature_names = features.columns
    return tree
If I remove lines 17 and 18 (the "class" lines), I get the error NameError: name 'features' is not defined, and when I also remove features I get an error saying the target has to be defined.
You need the correct dataset to go with that tutorial. You can download it (and the code) from https://github.com/pyxll/pyxll-examples/tree/master/machine-learning.
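If you want to check quickly whether the CSV you are loading matches the tutorial, a small sketch (assuming the zoo.csv file used above) is to print its columns; the KeyError simply means the file you loaded has no "class" column:
import pandas as pd
dataset = pd.read_csv("zoo.csv")
print(dataset.columns.tolist())  # should include "class" for the tutorial code to work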

gplearn library for generating new lines of data from given dataset

I am using the gplearn library (genetic programming) to generate new rules from a given dataset. Currently I have 11 rows of data with 24 columns (features) that I give as input to the SymbolicRegressor method to get new rules. However, I am getting only one rule. With crossover, shouldn't I get 11 new rules if I give 11 lines of data as input? If I am doing it wrong, what is the right way to do it?
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor
from gplearn.genetic import SymbolicRegressor
data = pd.read_csv("D:/Subjects/Thesis/snort_rules/ransomware_dataset.csv")
x_train = data.iloc[:,0:23]
y_train = data.iloc[:,:-1]
gp = SymbolicRegressor(population_size=11,
generations=2, stopping_criteria=0.01,
p_crossover=0.8, p_subtree_mutation=0.1,
p_hoist_mutation=0.05, p_point_mutation=0.05,
max_samples=0.9, verbose=1,
parsimony_coefficient=0.01, random_state=0)
gp.fit(x_train, y_train)
print(gp._program)
The output is :
X7/(X15*(-X16*X20 - X19 + X2))

Trying to print out the decision tree for a forest from scikit-learn ensemble

I am trying to print out the decision trees of a forest from the scikit-learn ensemble module. For a single DecisionTreeClassifier, for example, I would use:
from sklearn import tree
clf = tree.DecisionTreeClassifier( criterion ='entropy', max_depth = 3,
min_samples_leaf =
clf = clf.fit( X_train, y_train) #Input this to analyze the training set.
import pydot, StringIO
dot_data = StringIO.StringIO()
tree.export_graphviz( clf, out_file = dot_data,
feature_names =[' age', 'sex', 'first_class', 'second_class', 'third_class'])
graph = pydot.graph_from_dot_data( dot_data.getvalue())
graph.write_png('visualtree.png')
from IPython.core.display import Image
Image(filename='visualtree.png')
I tried a similar approach for Random Forest Regressor (see below and got an error)
# Fit regression model
from sklearn.ensemble import RandomForestRegressor
rfr_1 = RandomForestRegressor(n_estimators=10, max_depth=5)
rfr_1.fit(X, y)
from sklearn.ensemble import*
import pydot, StringIO
dot_data = StringIO.StringIO()
ensemble.export_graphviz( rf1, out_file = dot_data,
feature_names =[' Temperature', 'Translator Bacteria'])
graph = pydot.graph_from_dot_data( dot_data.getvalue())
graph.write_png('fish.png')
from IPython.core.display import Image
Image( filename ='fish.png')
File "randomforestregressor.py", line 45, in
ensemble.export_graphviz( rf1, out_file = dot_data,
NameError: name 'ensemble' is not defined
How would I accomplish this? thanks!
The error message is pretty explicit:
File "randomforestregressor.py", line 45, in ensemble.export_graphviz( rf1, out_file = dot_data,
NameError: name 'ensemble' is not defined
You access a variable named ensemble at line 45 of your script, but you never define such a variable. In your case you probably intended that variable to point to the sklearn.ensemble package:
from sklearn import ensemble
However, if you do this you will likely get an AttributeError, as the sklearn.ensemble package does not have an export_graphviz function.
Instead, what you probably want is to generate one image per tree in the forest, by iterating over the elements of the rfr_1.estimators_ list and calling the export_graphviz function of the sklearn.tree package on each of those trees.
However in practice displaying the trees of a forest is very often useless. Practitioners typically build random forests with hundreds or thousands of trees to get a good predictive accuracy. In such cases, visually inspecting that many trees is impractical.
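For example, a minimal sketch (assuming rfr_1 is the fitted RandomForestRegressor from the question) that writes one Graphviz .dot file per tree:
from sklearn import tree
# export each individual tree of the fitted forest to its own .dot file
for i, estimator in enumerate(rfr_1.estimators_):
    tree.export_graphviz(estimator,
                         out_file="tree_{}.dot".format(i),
                         feature_names=['Temperature', 'Translator Bacteria'])
# each .dot file can then be rendered with graphviz, e.g. dot -Tpng tree_0.dot -o tree_0.png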
