I would like to plot y_test and the predictions in a scatter plot.
I am using logistic regression as the model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Spam'])
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=27)
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
pred_log = lr.predict(X_test)
I have tried the following:
## Plot the model
plt.scatter(y_test, pred_log)
plt.xlabel("True Values")
plt.ylabel("Predictions")
and I got a plot with only four points, which I do not think is what I should expect.
y_test has shape (250,), and pred_log is also (250,).
Am I considering the wrong variables to plot, or are they right?
I have no idea what a plot with those four points means. I would have expected more dots in the plot, but maybe I am wrong.
Please let me know if you need more info. Thanks
I think you know LogisticRegression is a classification algorithm. In binary classification it predicts whether the class is 0 or 1, so a scatter plot of true labels against predicted labels can only ever show four distinct points. If you want to visualize how the model performs, you should consider a confusion matrix; you can't use a scatter plot to visualize classification results.
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test and pred_log come from your code above
cm = confusion_matrix(y_test, pred_log)
sns.heatmap(cm, annot=True)
The confusion matrix shows how many labels are predicted correctly and how many are wrong. From it you can calculate how accurate the model is, and derive different metrics such as precision, recall, and F1 score.
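For example, a minimal sketch (reusing y_test and pred_log from the question) that computes those metrics with scikit-learn:
from sklearn.metrics import accuracy_score, classification_report

# Overall accuracy, plus per-class precision, recall and F1
print("Accuracy:", accuracy_score(y_test, pred_log))
print(classification_report(y_test, pred_log))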
I am running the following code, and the graph for the training dataset is giving an error.
import pandas as pd
import numpy as np
df = pd.read_csv('11.csv')
df.head()
AT V AP RH PE
0 8.34 40.77 1010.84 90.01 480.48
1 23.64 58.49 1011.40 74.20 445.75
2 29.74 56.90 1007.15 41.91 438.76
3 19.07 49.69 1007.22 76.79 453.09
4 11.80 40.66 1017.13 97.20 464.43
x = df.drop(['PE'], axis = 1).values
y = df['PE'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
ml = LinearRegression()
ml.fit(x_train, y_train)
y_pred = ml.predict(x_test)
print(y_pred)
import matplotlib.pyplot as plt
plt.scatter(x_train, y_train, color = 'red')
plt.plot(x_train, ml.predict(x_test), color = 'green')
plt.show()
Please help me reshape the 2D array to 1D for plotting the graphs. The code fails with:
ValueError: x and y must be the same size
EDIT: Now that your question has its format fixed, I'm spotting a few errors, with a theme of using 1D linear regression code to plot your multiple regression problem.
plt.scatter(x_train, y_train, color = 'red'): You're trying to plot multiple variables (AT, V, AP, RH) on one axis using x_train. You cannot do this, since this is multiple linear regression. (For example, one can't fit pressure and volume on the x-axis against temperature on the y-axis; what would the x-axis represent? It doesn't make sense.) You cannot plot what you are trying to plot, and I cannot give you suggestions since I don't know what you're trying to show. You can try one variable at a time, e.g. plt.scatter(x_train[:, 0], y_train, color='red') for AT (x_train is a NumPy array here, so select columns by position). Or you could use a different colour for each variable on the same graph, though I don't recommend this since your x-axis could be in different units.
plt.plot(x_train, ml.predict(x_test)): You should be using y_test for your x-input, e.g. plt.plot(y_test, ml.predict(x_test)). This is a problem with the length of your data, not the width/columns like the error above. Though if my suggestion isn't what you wanted (it's a little strange to plot y_test against your y predictions), you might be (incorrectly) carrying over assumptions and code from 1D linear regression while working with multiple linear regression; a potential theme in these errors.
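For instance, a minimal sketch of that actual-vs-predicted plot, reusing ml, x_test, and y_test from the question:
import matplotlib.pyplot as plt

y_pred = ml.predict(x_test)

# One point per test sample; points near the diagonal are accurate predictions
plt.scatter(y_test, y_pred, color='green')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')  # ideal y = x line
plt.xlabel('Actual PE')
plt.ylabel('Predicted PE')
plt.show()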
I have a dataframe with 36540 rows. The objective is to predict y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate=0.1, objectif='reg:linear', max_depth=5,
                          n_estimators=1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost which doesn't exist), however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So making a test train split random will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
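For the parameter tuning mentioned above, a minimal sketch with scikit-learn's RandomizedSearchCV (the search space below is only an assumption to adapt; estimator, X_train, and y_train come from the example):
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; tune the ranges to your data
param_distributions = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 500, 1000],
}
search = RandomizedSearchCV(estimator, param_distributions, n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)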
I want to draw a decision boundary for extracted features with binary label columns, and plot it to separate the two classes. I tried logistic regression to make the decision boundary, but in the rendered plot the data points belonging to the two classes are not well separated. When I tried to make a scatter plot or decision boundary using an SVM, I got a memory error.
Is there any way I can take a sample to make the scatter plot when I use an SVM? How can I get a correct decision boundary for binary classification? Any ideas?
data:
I have a 265x16 dataframe of features; an example data snippet can be seen on GitHub.
what I tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
## load features
df = pd.read_csv('binary_clf_feats.csv')
X_feats = df.iloc[:, 2:11].values
y_label = df['price_status'].values
# np.random.seed(234) returns None, so pass the seed to random_state directly
X_train, X_test, y_train, y_test = train_test_split(X_feats, y_label, test_size=0.2, random_state=234)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
parameters = log_reg.coef_[0]
parameter0 = log_reg.intercept_
# Plotting the decision boundary
fig = plt.figure(figsize=(10,7))
x_values = [np.min(X_train) - 50, np.max(X_train) + 50]
y_values = np.dot((-1. / parameters[1]), (np.dot(parameters[0], x_values) + parameter0))
colors = ['red' if l == 0 else 'blue' for l in y_train]
plt.scatter(X_train[:, 0], X_train[:, 1], label='Logistic regression', color=colors)
plt.plot(x_values, y_values, label='Decision Boundary')
plt.show()
but this approach gave me the following plot:
I expect the blue and red data points to be well separated. How can I manipulate my feature data to get a correct scatter plot or SVM plot? Any better idea to make this happen? Thanks.
Looking at your dataset, you have more than 2 features. In general, 2D-plotting more than 2 features is not possible and not standard practice. You need to ask yourself what you would actually be visualizing even if it were possible.
Try not to focus on plotting, but on increasing your model accuracy first. A few improvements:
- scale values
- bin values
- combine features / drop features
- generate new features
- try other models: with a decision tree it is easy to explain which features lead to which decision (see the sketch below)
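As a small sketch of two of these suggestions combined (scaling the values and trying a decision tree), assuming the X_train, X_test, y_train, y_test split from your code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Scale the features, then fit a shallow, easy-to-explain tree
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))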
I have a function that imports a random forest classifier from scikit-learn, fits it with data, and finally displays accuracy, kappa, and the confusion matrix. Everything works except printing the confusion matrix. I do not get any error, but the confusion matrix does not print.
I have tried calling print(cm) and it works, but it does not print in the usual pandas dataframe style, which is what I am looking for.
Here's the code
import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

def rf_clf(X, y, test_size=0.3, random_state=42):
    """This function splits the data into train and test and fits a random forest classifier
    to the data provided, analysing its errors (Accuracy and Kappa). Also, as this is classification,
    the function will output a confusion matrix."""
    # Split data into train and test, as well as predictors (X) and targets (y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
    # Instantiate the random forest classifier
    base_model = RandomForestClassifier(random_state=random_state)
    # Train the model
    base_model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = base_model.predict(X_test)
    # Print Accuracy and Kappa
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    print("Kappa:", metrics.cohen_kappa_score(y_test, y_pred))
    # Create confusion matrix
    labs = [y_test[i][0] for i in range(len(y_test))]
    cm = pd.DataFrame(confusion_matrix(labs, y_pred))
    cm  # here is the issue. Kinda works with print(cm)
Import metrics from sklearn at the beginning.
from sklearn import metrics
Use this when you want to show the confusion matrix.
# Get and show confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
With this you should see the confusion matrix as raw text.
If you want to show the confusion matrix with colours, do it this other way:
Import
from sklearn.metrics import confusion_matrix
import pandas as pd
import seaborn as sns; sns.set()
Use it that way:
cm = confusion_matrix(y_test, y_pred)
cmat_df = pd.DataFrame(cm, index=class_names, columns=class_names)  # class_names: list of your label names
ax = sns.heatmap(cmat_df, square=True, annot=True, cbar=False)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
Hope for the best!
I'm teaching myself some more tricks with Python and scikit-learn, and I'm trying to plot a linear regression model. My code can be seen below, but the console gives the following error: x and y must be the same size. Additionally, my program makes it to the end of the code, but nothing gets plotted.
To fix the size error, the first thing that came to mind was testing the length of x and y with something like len(x) == len(y). But as far as I can tell, my data seems to be the same length. Maybe the error is referring to something other than length (if so, I'm not sure what). Would really appreciate any help.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create linear regression object
regr = linear_model.LinearRegression()
#load csv file with pandas
df = pd.read_csv("pokemon.csv")
#remove all string columns
df = df.drop(['Name','Type_1','Type_2','isLegendary','Color','Pr_Male','hasGender','Egg_Group_1','Egg_Group_2','hasMegaEvolution','Body_Style'], axis=1)
y= df.Catch_Rate
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)  # note: df still contains the Catch_Rate target as a feature here
# Train the model using the training sets
regr.fit(x_train, y_train)
# Make predictions using the testing set
pokemon_y_pred = regr.predict(x_test)
print (pokemon_y_pred)
# Plot outputs
plt.title("Linear Regression Model of Catch Rate")
plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, pokemon_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
This is referring to the fact that your x-variable has more than one dimension; plot and scatter only work for 2D plots, and it seems that your x_test has multiple features while y_test and pokemon_y_pred are one-dimensional.
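As a minimal sketch of a plot that does work here, reusing y_test and pokemon_y_pred from the code above, plot the actual values against the predicted ones instead of the multi-feature x_test:
import matplotlib.pyplot as plt

# Both arguments are 1D and the same length, so scatter accepts them
plt.title("Linear Regression Model of Catch Rate")
plt.scatter(y_test, pokemon_y_pred, color='black')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='blue', linewidth=3)
plt.xlabel("Actual Catch Rate")
plt.ylabel("Predicted Catch Rate")
plt.show()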
This error occurs because there are comparatively more columns in x_test than in y_test, so the two arrays contain different numbers of values; that is why there is a size problem.
For a 2D plot, each y value needs exactly one corresponding x value.