May I know why I get this error message:
NameError: name 'X_train_std' is not defined
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std,
                      y_combined, classifier=lr,
                      test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
lr.predict_proba(X_test_std[0, :])
weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)
weights = np.array(weights)
plt.plot(params, weights[:, 0],
         label='petal length')
plt.plot(params, weights[:, 1], linestyle='--',
         label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
plt.show()
Please see these links:
https://www.freecodecamp.org/forum/t/how-to-modify-my-python-logistic-regression/265795
https://bytes.com/topic/python/answers/972352-why-i-get-x_train_std-not-defined#post3821849
https://www.researchgate.net/post/Why_I_get_the_X_train_std_is_not_defined
Well, X_train_std is not defined/declared. You need to declare the variable and give it a value before using it.
Like:
X_train_std = 3
You didn't copy enough of the sample code. Somewhere above this, there is likely a call to train_test_split.
Basically, to do what you want, you need a set of X variables and your Y variable (what will be predicted). You normally split them into a training set and a test set, and in addition, many algorithms work better on standardized data (zero mean, unit standard deviation), which is probably what the _std in your variable names means.
The code that comes before your snippet probably looks something like:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
my_df = pd.DataFrame(....this is your data for the test...)
X = my_df[[X_variable_column_names_here]]
Y = my_df[Y_variable_column_name]
# hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
# standardize: fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
Edit: It looks from the axis labels on your plot that you're trying to do logistic regression against the Iris dataset. The fully worked example is here:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
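For reference, here is a minimal sketch of the missing setup using the built-in Iris data. The variable names are chosen to match your snippet; the 70/30 split is an assumption that happens to yield the 105 training and 45 test rows your test_idx=range(105, 150) implies:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length, petal width
y = iris.target
# 105 training rows, 45 test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# fit the scaler on the training data only, reuse it for the test data
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# stacked arrays used by plot_decision_regions
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))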
Related
Is it possible to change the threshold of a DecisionTreeClassifier? I'm studying the precision/recall trade-off and want to change the threshold to favor recall. I'm studying Hands-On ML, but there it uses the SGDClassifier; at some point it uses cross_val_predict() with the method="decision_function" argument, but this does not exist for the DecisionTreeClassifier. I'm using a pipeline and cross-validation.
My study is with this dataset:
https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction
features = df_heart.drop(['output'], axis=1).copy()
labels = df_heart.output
# split
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    train_size=0.7,
                                                    random_state=42,
                                                    stratify=features["sex"])
# categorical features
cat = ['sex', 'tipo_de_dor', 'ang_indz_exerc', 'num_vasos', 'acuc_sang_jejum', 'eletrc_desc', 'pico_ST_exerc', 'talassemia']
# treatment of categorical variables
t = [('cat', OneHotEncoder(handle_unknown='ignore'), cat)]
preprocessor = ColumnTransformer(transformers=t, remainder='passthrough')
# pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('clf', DecisionTreeClassifier(min_samples_leaf=8, random_state=42))])
pipe.fit(X_train, y_train)
valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)
y_train_pred = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)
conf_mat = confusion_matrix(y_train, y_train_pred)
ConfusionMatrixDisplay(confusion_matrix=conf_mat,
                       display_labels=pipe['clf'].classes_).plot()
plt.grid(False)
plt.show()
threshold = 0  # this is only to support the graph
idx = (thresholds >= threshold).argmax()  # first index >= threshold
plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
plt.title('Precision x Recall', fontsize=14)
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-.5, 1.5, 0, 1.1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.show()
valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)
y_score = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)
precisions, recalls, thresholds = precision_recall_curve(y_train, y_score)
threshold = 0.75  # this is only to support the graph
idx = (thresholds >= threshold).argmax()
plt.figure(figsize=(6, 5))
plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")
plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
         label="Point at threshold " + str(threshold))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")
plt.show()
When I check the arrays generated by the precision_recall_curve() function, I see that they only contain 3 elements. Is this correct behavior? When I run cross_val_predict() for an SGDClassifier, for example, as in the book, without the method='decision_function' argument and use the output in precision_recall_curve(), it generates arrays with 3 elements; if I use method='decision_function', it generates arrays with many elements.
My main question is how to choose the threshold for the DecisionTreeClassifier, and whether there is a way to generate the Precision x Recall curve with several points; I only manage to get these three points and cannot work out how to improve the recall.
In short: move the threshold to improve recall, and understand how to do that with a DecisionTreeClassifier.
This topic usually falls under the name "model calibration." scikit-learn supports a few kinds of probability calibration which could be informative to read about as well.
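For reference, here is a minimal sketch of what probability calibration looks like in scikit-learn, using CalibratedClassifierCV to wrap a tree; the synthetic data and parameters are purely illustrative:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, random_state=42)
# wrap the tree so its predicted probabilities are calibrated via CV
base = DecisionTreeClassifier(max_depth=5, random_state=42)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated probabilities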
One way to "change the threshold" in a DecisionTreeClassifier is to invoke .predict_proba(X) and observe one or more metrics over a range of possible thresholds:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score
import numpy as np
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
prob_pred = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
thresholds = np.arange(0.0, 1.0, step=0.01)
recall_scores = [recall_score(y_test, prob_pred > t) for t in thresholds]
precis_scores = [precision_score(y_test, prob_pred > t) for t in thresholds]
Now we have a set of thresholds between 0.0 and 1.0, and we've computed precision and recall over each threshold (side note: this problem is less well-defined for multilabel or multiclass prediction, where these metrics are usually averaged over each class or similar).
Then we'll plot:
fig, ax = plt.subplots(1, 1)
ax.plot(thresholds, recall_scores, label="Recall @ t")
ax.plot(thresholds, precis_scores, label="Precision @ t")
ax.axvline(0.5, c="gray", linestyle="--", label="Default Threshold")
ax.set_xlabel("Threshold")
ax.set_ylabel("Metric @ Threshold")
ax.set_box_aspect(1)
ax.legend()
plt.show()
This produces a figure with both metrics plotted against the threshold.
This shows us that the default threshold of 0.5 used by .predict() may not be the best in all circumstances. In fact, there is a range of thresholds where precision and recall are fairly close, with each particular threshold slightly favoring one over the other. In this case: lowering the threshold slightly will tend to favor recall, while increasing it will tend to favor precision.
In practice: the threshold appropriate for the problem comes down to domain knowledge since there's always a trade-off between precision and recall.
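For example, if the domain dictates a minimum acceptable precision, one way to pick the operating point is to maximize recall subject to that floor. This is a sketch reusing the thresholds/recall_scores/precis_scores/prob_pred names from above; min_precision is an assumed requirement, not anything prescribed:
min_precision = 0.90  # assumed domain requirement, purely illustrative
candidates = [(t, r) for t, r, p in zip(thresholds, recall_scores, precis_scores)
              if p >= min_precision]
if candidates:
    # among thresholds meeting the precision floor, take the best recall
    best_threshold, best_recall = max(candidates, key=lambda tr: tr[1])
    # classify with the chosen threshold instead of the default 0.5
    y_pred = (prob_pred >= best_threshold).astype(int)
    print(best_threshold, best_recall)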
I'm trying to create a kNN algorithm for stock prediction, with at least 80% correct predictions on the test data. I have a problem with the StandardScaler from sklearn: for some reason it says that there is a "typo" in the word "Scaler", which I find weird. Does someone know how to solve this issue? If you find more mistakes in the code, please tell me how to fix them, but I think it should be mostly correct (some of it might be wrong). I want the polynomial line to show around a week into the future. I use data from a private API key from Marketstack.com, provided in JSON format. The data consists of EOD (end of day) data with a limit of 1000 days in descending order.
# Exports API data to a csv file on my hardware and then I import the csv data after it's sorted
df.to_csv('Test_Sample.csv', index=False)
dataframe = pd.read_csv('Test_Sample.csv')
dataframe['symbol']=dataframe['symbol'].astype(float)
dataframe['exchange']=dataframe['exchange'].astype(float)
dataframe['date']=dataframe['date'].astype(float)
dataframe.info()
X = df.iloc[:, :-1].values
Y = df.iloc[:, 4].values
# 80% training data, 20% testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Scale train and test data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #Here is the mistake, under scaler (Error code: 'Typo in the word scaler')
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Classify data
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
# Train and test result
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))
# Scatter all the data points in a figure
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(X, Y, color='blue')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Financial Instrument Predicted Price')
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, Y)
plt.plot(X, poly.fit_transform(X), color='red')
plt.show()
ValueError: could not convert string to float: 'AAPL'
You don't have a typo; in the comments you said:
ValueError: could not convert string to float: 'AAPL'
The error is actually clear: you have a string in your dataset, and you are trying to normalize/standardize your data. For most algorithms you need to encode your strings as numbers. Since you did not provide any data sample, you can check your dataframe before splitting with
dataframe.info()
to see if it contains strings.
Edit: Check whether your first row is supposed to be your header; if so, you can do the following:
dataframe = pd.read_csv('Test_Sample.csv', header = 0)
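If you want to keep a string column like the ticker symbol instead of dropping it, one option is to encode it as integer codes, e.g. with pandas.factorize. This is a sketch assuming 'symbol' is the offending column:
import pandas as pd
dataframe = pd.read_csv('Test_Sample.csv', header=0)
# encode each distinct string ('AAPL', ...) as an integer code
dataframe['symbol'], symbol_labels = pd.factorize(dataframe['symbol'])
# alternatively, keep only the numeric columns before scaling
numeric_only = dataframe.select_dtypes(include='number')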
I have 26 observations for a simple linear regression, but when I split the data into 70% for training and 30% for testing, the results on the test data (R squared / p-value) are usually not good. Is it because the test sample is too small? Are 8 or 9 observations not enough? What should I do? There is no random state, so the algorithm chooses the data randomly.
I am also wondering how to choose between OLS and M-estimation (which is more resistant to outliers, which I have in my data; see below, because Variable B is impacted by variables other than A) for my dataset.
This is the code I have done so far; I am also looking to do cross-validation on the training data.
Is it possible given the number of observations I have?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\PL32_PMM_03_09_2018_SP_Level.xlsx",'Sheet1')
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
print(data1)
X = data1.iloc[0:len(data1),1].values.reshape(-1, 1)
Y = data1.iloc[0:len(data1),2].values.reshape(-1, 1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.33)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
X2 = sm.add_constant(X_train)
est = sm.OLS(Y_train, X2)
est2 = est.fit()
print(est2.summary())
X3 = sm.add_constant(X_test)
est3 = sm.OLS(Y_test, X3)
est4 = est3.fit()
print(est4.summary())
This is an example of the data I have. My goal is not to build a good predictive model but to describe the impact of variable A on B. Also, when analyzing the whole dataset together, the results are always better than when splitting the data.
Variable A Variable B
87.000 573.000
90.000 99.000
258.000 339.000
180.000 618.000
0 69.000
90.000 621.000
90.000 231.000
210.000 345.000
255.000 255.000
0 0
213.000 372.000
405.000 405.000
162.000 162.000
405.000 405.000
0 186.000
105.000 252.000
474.000 501.000
531.000 531.000
549.000 549.000
525.000 525.000
360.000 660.000
546.000 546.000
645.000 645.000
561.000 600.000
978.000 1.104.000
960.000 960.000
Also, I plotted the results using sklearn and analyzed them with statsmodels. Can I assume that the plotted results correspond to the statsmodels values, or is there something to change in the code?
Y = df["Column name"]
X = df[["All other Columns"]]
# fixing random_state makes the train/test split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=465)
Good luck
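On the cross-validation part of the question: with 26 observations it is possible, but each fold is tiny, so leave-one-out or a small k is the usual choice. A minimal sketch, assuming X and Y as defined in the question (the scoring choice is illustrative; R squared is not meaningful on single-observation folds, so MSE is used):
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# 5-fold CV: each test fold has roughly 5 of the 26 observations
scores = cross_val_score(regressor, X, Y.ravel(), cv=5,
                         scoring='neg_mean_squared_error')
print(scores.mean(), scores.std())
# leave-one-out: 26 folds of a single observation each
loo_scores = cross_val_score(regressor, X, Y.ravel(), cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error')
print(-loo_scores.mean())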
So, basically, I'm using a RF for descriptive modelling as follows:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import class_weight
class_weights = class_weight.compute_class_weight('balanced', np.unique(y), y)
class_weights = dict(enumerate(class_weights))
class_weights
{0: 0.5561096747856852, 1: 4.955559597429368}
clf = RandomForestClassifier(class_weight=class_weights, random_state=0)
cross_val_score(clf, X, y, cv=10, scoring='f1').mean()
And plotting the variable importances as:
import matplotlib.pyplot as plt
def plot_importances(clf, features, n):
    importances = clf.feature_importances_
    indices = np.argsort(importances)[::-1]
    if n:
        indices = indices[:n]
    plt.figure(figsize=(10, 5))
    plt.title("Feature importances")
    plt.bar(range(len(indices)), importances[indices], align='center')
    plt.xticks(range(len(indices)), features[indices], rotation=90)
    plt.xlim([-1, len(indices)])
    plt.show()
    return features[indices]
imp = plot_importances(clf, X.columns, 30)
I was expecting the variable importances to be the same across multiple runs. However, they change whenever I re-run the notebook.
I don't understand why that is. Is it somehow related to the cross_val_score method?
I cannot reproduce the problem. For me the variable importances do remain the same across multiple runs when I produce data with:
from sklearn.datasets import make_classification
import pandas as pd
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
X = pd.DataFrame(X)
Also, changing the data to have an uneven class weighting by selecting only the first 750 y/X data points does not lead to differences in importances.
What data do you use?
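For reference, a minimal sketch of that check. One thing worth noting: cross_val_score fits clones of the estimator, so clf itself has to be fitted explicitly before feature_importances_ is available; with a fixed random_state the importances are then reproducible:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
X = pd.DataFrame(X)
def fitted_importances():
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X, y)  # fit explicitly; cross_val_score only fits clones
    return clf.feature_importances_
# with a fixed random_state the importances are identical across runs
print(np.allclose(fitted_importances(), fitted_importances()))  # True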
I am new to Python and want to plot the CV error for each fold and for each degree of polynomial. The code below calculates error values for different polynomial degrees and for each fold. Kindly guide me in this regard.
from sklearn.cross_validation import KFold
kf = KFold(len(dF), n_folds=5)
e_test = []
orders = [2, 3]
dims = [6, 10]
for i, order in enumerate(orders):
    dF = getDataByDegree(d, order)
    error = []
    wTemp = np.empty(dims[i])
    wTemp.fill(0.001)
    for train_index, test_index in kf:
        x_train, x_test = dF[train_index], data['l'][train_index]
        y_train, y_test = dF[test_index], data['l'][test_index]
        w, x_error = gradientDes(wTemp, x_train, x_test)
        y_error = errorfun(w, y_train, y_test)
        error.insert(i, y_error[0])
    e_test.insert(i, error)
fig, ax = plt.subplots()
for i in range(1, len(orders)):
    ax.plot(orders, values[i], lw=2, label='Test Error - Fold %s' % str(int(i) + 1))
plt.show()
You are looking for what sklearn calls a validation curve. The validation_curve function lets you explore a range of values for a model hyperparameter while doing the CV for you.
See this example if you want to plot the errors.
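Here is a minimal sketch of validation_curve for a polynomial degree, using a pipeline so the degree becomes a tunable hyperparameter; the synthetic data and parameter range are purely illustrative:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
model = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = [1, 2, 3, 4, 5]
# scores have shape (n_degrees, n_folds)
train_scores, test_scores = validation_curve(
    model, X, y,
    param_name='polynomialfeatures__degree', param_range=degrees,
    cv=5, scoring='neg_mean_squared_error')
# one curve per fold, as in the original question
for fold in range(test_scores.shape[1]):
    plt.plot(degrees, -test_scores[:, fold], lw=2,
             label='Test Error - Fold %d' % (fold + 1))
plt.xlabel('Polynomial degree')
plt.ylabel('CV error')
plt.legend()
plt.show()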