How can I get the final tree model? - python

Given this model:
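import numpy as np
import pandas as pd
import graphviz
import xgboost
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier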
X, y = make_classification(n_samples=1000, n_features=10,n_informative=3, n_redundant=5, random_state=42)
df = pd.DataFrame(data=X)
df.columns = 'X' + (df.columns+1).astype(str)
df[df.columns[-3:]] = df[df.columns[-3:]].astype(int)
df['Y'] = y
X_train, X_test, y_train, y_test = train_test_split(df.drop('Y', axis=1), df['Y'], test_size=0.3, random_state=42)
n_negative_class = y_train.value_counts().sort_index()[0]
n_positive_class = y_train.value_counts().sort_index()[1]
xgb = XGBClassifier(random_state=42, n_estimators=50,
                    scale_pos_weight=n_negative_class/n_positive_class,
                    use_label_encoder=False)
xgb.fit(X_train, y_train, eval_metric="auc")
y_train_scores = xgb.predict_proba(X_test)[:,1]
xgboost.to_graphviz(xgb, num_trees=49)
How can I plot the final tree used in xgb.predict_proba(X_test)[:,1]? Is it necessarily the last one (since each XGBoost tree learns from the previous trees)? Or does XGBoost choose one tree among those 50 estimators based on the loss or the eval_metric?
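As far as I know, a boosted model does not pick a single "final" tree: the predicted probability is the logistic transform of the summed outputs of all trees (or of the trees up to the best iteration when early stopping is used). The sketch below illustrates this in plain NumPy with made-up numbers; tree_outputs and base_margin are hypothetical values, not something read from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs (log-odds contributions) of 4 trees for 3 samples.
tree_outputs = np.array([
    [0.40, -0.30,  0.10],
    [0.25, -0.20,  0.05],
    [0.10, -0.15, -0.02],
    [0.05, -0.05,  0.01],
])
base_margin = 0.0  # assumed starting score in log-odds space

# The ensemble prediction uses the sum over ALL trees.
prob_all_trees = sigmoid(base_margin + tree_outputs.sum(axis=0))

# The last tree on its own is only a small correction to that sum.
prob_last_tree_only = sigmoid(base_margin + tree_outputs[-1])

# Prediction after each boosting round (row k uses trees 0..k).
prob_per_round = sigmoid(base_margin + np.cumsum(tree_outputs, axis=0))
```

So plotting num_trees=49 shows only the last of the 50 additive trees, not a tree that produces the prediction by itself.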

Related

f_importances function in sklearn

I found this question, which seems to address my problem (Determining the most contributing features for SVM classifier in sklearn).
However, as my understanding of Python is limited, I need some help.
I have a dependent variable, 'Group', which has two levels: 'Group1' and 'Group2'.
This is the code I found, adapted to my data:
import pandas as pd
df = pd.read_csv('C:/Users/myPC/OneDrive/Desktop/analysis/dataframe6.csv')
X = df.drop('Group', axis=1)
y = df['Group']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
from matplotlib import pyplot as plt
from sklearn import svm
def f_importances(coef, names):
    imp = coef
    imp,names = zip(*sorted(zip(imp,names)))
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
    plt.show()
features_names = ['input1', 'input2']
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
f_importances(svclassifier.coef_, features_names)
It produces just a blank plot.
I think there is something I should change in features_names = ['input1', 'input2'] but I am not sure what.
The code you used to plot expects a one-dimensional array. The attribute coef_, according to the documentation will be:
coef_ ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)
Weights assigned to the features when kernel="linear".
Using an example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
np.random.seed(123)
df = pd.DataFrame(np.random.uniform(0,1,(400,3)),columns=['input1','input2','input3'])
df['Group'] = np.random.choice(['Group1','Group2'],400)
X = df.drop('Group', axis=1)
y = df['Group']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
We check the shape of the array:
print(svclassifier.coef_.shape)
(1, 3)
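A quick way to see why that shape matters, using made-up coefficient values (coef_2d is a hypothetical stand-in for svclassifier.coef_): zipping a 2-D array with the names pairs the whole row with the first name only, whereas indexing the first row gives one coefficient per feature.

```python
import numpy as np

names = ['input1', 'input2', 'input3']
# Hypothetical values with the same shape as coef_ above: (1, 3)
coef_2d = np.array([[0.3, -1.2, 0.7]])

# Iterating a 2-D array yields rows, so only one (row, name) pair is produced.
pairs_2d = list(zip(coef_2d, names))

# Taking the first row yields one (coefficient, name) pair per feature.
pairs_1d = sorted(zip(coef_2d[0], names))
sorted_names = [name for _, name in pairs_1d]
```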
Because you have only two classes, there is only one row. We can do:
from matplotlib import pyplot as plt
from sklearn import svm
def f_importances(coef, names):
    imp = coef
    imp,names = zip(*sorted(zip(imp,names)))
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
    plt.show()
features_names = ['input1', 'input2','input3']
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
f_importances(svclassifier.coef_[0], features_names)
This is the plot I got :

assign looped regression rsquared to object

I have a function with a regression loop built in. I want to assign the rsquareds from each iteration to an object that I can print out later.
Here's part of the function (including the regression), shortened for brevity:
cuts = [stats, stats_po, stats_ic, stats_id, stats_h, stats_a, stats_bos, stats_bkn, stats_nyk, stats_phi, stats_tor, stats_chi, stats_cle, stats_det, stats_ind, stats_mil, stats_den, stats_min, stats_okc, stats_por, stats_uta, stats_gsw, stats_lac, stats_lal, stats_phx, stats_sac, stats_atl, stats_cha, stats_mia, stats_orl, stats_was, stats_dal, stats_hou, stats_mem, stats_nop, stats_sas, stats_o1, stats_o2, stats_d1, stats_d2, stats_l25]
def process_cuts(c):
    c = c.dropna(axis=0,how='all')
    n = c.team.str.rsplit(" ",n=1, expand=True)
    c['city'] = n[0]
    c['team_name']=n[1]
    c['team_name']=c['team_name'].str.replace('Trailblazers','Blazers')
    c['team_name']=c['team_name'].str.replace('Bobcats','Hornets')
    for z in ['Points','Steals','Blocks','Assists','OReb','DefReb','Turnovers','FieldGoals','ThreeShots','FTP', 'Fouls','FTMiss','FGMiss','FreeThrows']:
        y = mergered[z]
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        from sklearn.linear_model import LinearRegression
        regressor = LinearRegression()
        regressor.fit(X_train, y_train)
        coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
        print(coeff_df)
        y_pred = regressor.predict(X_test)
        df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
        from sklearn import metrics
        rsquared = 'Rsquared:' + ' ' + z, metrics.r2_score(y_test,y_pred)
cuts_diffs = list(map(process_cuts, cuts))
I want to store the R-squared values for each y and print them out for each data cut.
I appreciate your help.
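One possible approach (a sketch, not tested on the data above) is to collect the scores in a dictionary inside the function and return it, so that mapping the function over the cuts yields one dictionary per cut. The example uses synthetic data and a plain NumPy least-squares fit as a stand-in for LinearRegression and metrics.r2_score; fit_targets and r_squared are hypothetical helper names, and R-squared is computed on the training data here rather than on a held-out split.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_targets(X, targets):
    """Fit one least-squares model per target and return {target name: R^2}."""
    design = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    scores = {}
    for name, y in targets.items():
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        scores[name] = r_squared(y, design @ coef)
    return scores

rng = np.random.default_rng(0)
cuts = []  # two synthetic "data cuts", each with its own X and targets
for _ in range(2):
    X = rng.normal(size=(50, 2))
    targets = {
        'Points': 3.0 * X[:, 0] - X[:, 1] + 1.0,            # exactly linear
        'Steals': X[:, 0] + rng.normal(scale=1.0, size=50),  # noisy
    }
    cuts.append((X, targets))

# One dictionary of scores per cut, kept for printing later.
all_scores = [fit_targets(X, targets) for X, targets in cuts]
for i, scores in enumerate(all_scores):
    for name, value in scores.items():
        print(f"cut {i} Rsquared {name}: {value:.3f}")
```

In the original function the same idea would amount to creating the dictionary before the for loop, assigning scores[z] inside it, and returning the dictionary at the end.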

What does the error mean and how to fix it - "ValueError: query data dimension must match training data dimension"

I am trying to write the code for K-NN.
Below is my code. I know that the issue is in predict(), but I am not able to figure out how to fix it.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('UniversalBank.csv')
X = dataset.iloc[:,[ 1,2,3,5,6,7,8,10,11,12,13]].values #,
y = dataset.iloc[:,9].values
#Splitting the dataset to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state= 0)
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fitting the classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)
#Predicting the test results
y_pred = classifier.predict(X_test)
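That error generally means the array passed to predict() has a different number of columns than the array passed to fit(). In the snippet above both arrays come from the same split, so it may be worth checking the shapes right before the call, especially if predict() is later invoked on a single new record. A minimal NumPy sketch of such a check follows; check_query_shape is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def check_query_shape(X_train, X_query):
    """Return the query as a 2-D array, raising if its feature count differs."""
    X_query = np.atleast_2d(X_query)
    if X_query.shape[1] != X_train.shape[1]:
        raise ValueError(
            f"query has {X_query.shape[1]} features, "
            f"training data has {X_train.shape[1]}"
        )
    return X_query

# 11 feature columns, matching the column selection in the code above.
X_train = np.zeros((10, 11))

# A single record with 11 values is promoted to shape (1, 11).
ok = check_query_shape(X_train, np.zeros(11))

# A record with only 5 features is rejected.
try:
    check_query_shape(X_train, np.zeros((1, 5)))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```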

How to increase the model accuracy of multiple linear regression

This is the custom code
#Custom model for multiple linear regression
import numpy as np
import pandas as pd
dataset = pd.read_csv("50s.csv")
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,4:5].values
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
x[:,3] = lb.fit_transform(x[:,3])
from sklearn.preprocessing import OneHotEncoder
on = OneHotEncoder(categorical_features=[3])
x = on.fit_transform(x).toarray()
x = x[:,1:]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/5, random_state=0)
con = np.matrix(X_train)
z = np.matrix(y_train)
#training model
result1 = con.transpose()*con
result1 = np.linalg.inv(result1)
p = con.transpose()*z
f = result1*p
l = []
for i in range(len(X_test)):
    temp = f[0]*X_test[i][0] + f[1]*X_test[i][1] +f[2]*X_test[i][2]+f[3]*X_test[i][3]+f[4]*X_test[i][4]
    l.append(temp)
import matplotlib.pyplot as plt
plt.scatter(y_test,l)
plt.show()
Then I created a model with scikit-learn and compared its results with y_test and l (the predicted values from the code above).
The comparisons are as follows:
for i in range(len(prediction)):
    print(y_test[i],prediction[i],l[i],sep=' ')
103282.38 103015.20159795816 [[116862.44205399]]
144259.4 132582.27760816005 [[118661.40080974]]
146121.95 132447.73845175043 [[124952.97891882]]
77798.83 71976.09851258533 [[60680.01036438]]
These are the comparisons between y_test, the scikit-learn model predictions, and the custom code predictions.
Please help me improve the accuracy of the model.
blue: custom model predictions
yellow: scikit-learn model predictions
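One likely cause of the gap (an assumption, since the scikit-learn code is not shown) is that the custom normal-equation fit has no intercept term, whereas LinearRegression fits one by default. The sketch below compares the two on synthetic data; adding a column of ones to the design matrix is what supplies the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 10.0  # true model has intercept 10

def normal_equation(A, y):
    """Solve (A^T A) beta = A^T y without forming an explicit inverse."""
    return np.linalg.solve(A.T @ A, A.T @ y)

# No intercept column, as in the custom code above.
beta_no_intercept = normal_equation(X, y)
pred_no_intercept = X @ beta_no_intercept

# With a leading column of ones, the first coefficient is the intercept.
X1 = np.column_stack([np.ones(len(X)), X])
beta_with_intercept = normal_equation(X1, y)
pred_with_intercept = X1 @ beta_with_intercept

# Mean squared error of each fit on the same data.
err_no = np.mean((y - pred_no_intercept) ** 2)
err_with = np.mean((y - pred_with_intercept) ** 2)
```

In the custom code this would mean stacking a column of ones onto X_train and X_test before building con, and using all resulting coefficients in the prediction.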

Tensorflow: how to create feature_columns for numpy matrix input

I'm using tensorflow 1.8.0, python 3.6.5.
The data is iris data set. Here is the code:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import tensorflow as tf
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
input_train=tf.estimator.inputs.numpy_input_fn(x=X_train,
    y=y_train, num_epochs=100, shuffle=False)
classifier_model = tf.estimator.DNNClassifier(hidden_units=[10,
    20, 10], n_classes=3, feature_columns=??)
Here is my problem: how do I set up the feature_columns for a numpy matrix?
If I convert X and y to pandas.DataFrame, I can use the following code for the feature_columns, and it works in the DNNClassifier model.
features = X.columns
feature_columns = [tf.feature_column.numeric_column(key=key) for key in features]
You can wrap your numpy ndarray in a dictionary, pass it to numpy_input_fn as the input x, and then use the key of that dictionary to define your feature column. Also note that because each sample in X_train has 4 features, you need to specify the shape parameter when defining tf.feature_column.numeric_column. Here is the completed code:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import tensorflow as tf
iris = load_iris()
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
input_train = tf.estimator.inputs.numpy_input_fn(
    x = {'x': X_train},
    y = y_train,
    num_epochs = 100,
    shuffle = False)
feature_columns = [tf.feature_column.numeric_column(key='x', shape=(X_train.shape[1],))]
classifier_model = tf.estimator.DNNClassifier(
    hidden_units=[10, 20, 10],
    n_classes=3,
    feature_columns=feature_columns)
