Linear regression plot is really bad - python

observations that are different from each other so i run regression again but for only one cluster.But it also came out wrong What exactly is wrong here? I'll also have to point out that i am still new to this (linerear regression etc.) so my understanding of all this is still bad. How can i fix this plot and please if it's possible try to explain why it's wrong.
Code :
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(0)
kmeans.cluster_centers_
kmeans.labels_
n,y_test = train_test_split(X, Y, test_size = 0.4, random_state = 0)
plt.scatter(X.iloc[:, 1], Y)
plt.show()

You're performing multiple linear regression, since you have 2 input features ('Age', 'Annual Income (k$)') that try to predict the output feature ('Spending Score (1-100)'). You need to plot this data in 3D, in order to properly visualize the regression.
Even though I can't test your code without the data, something like this should work (after training the model):
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.iloc[:, 0], X.iloc[:, 1], Y)
ax.plot(X.iloc[:, 0], X.iloc[:, 1], y_pred, color='red')
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')

Related

Trying to plot outliers using DBSCAN

I have never been great with Python plotting concepts, and now I'm still apparently missing something new.
Here is my code.
import pandas as pd
import matplotlib.pyplot as plt
import sys
from numpy import genfromtxt
from sklearn.cluster import DBSCAN
data = pd.read_csv('C:\\Users\\path_here\\wine.csv')
data
# Reading in 2D Feature Space
model = DBSCAN(eps=0.9, min_samples=10).fit(data)
array_flavanoids = data.iloc[:, 2]
# Slicing array
array_colorintensity = data.iloc[:, 3]
# Scatter plot function
colors = model.labels_
plt.scatter(array_flavanoids, array_colorintensity, c=colors, marker='o')
plt.xlabel('Concentration of flavanoids', fontsize=16)
plt.ylabel('Color intensity', fontsize=16)
plt.title('Concentration of flavanoids vs Color intensity', fontsize=20)
plt.show()
Here is my result.
I am expecting the outliers to be in a different color than the non-outliers. So, something like this.
Maybe one color for outliers and another for non-outliers. I am just trying to learn the concept in this exercise. I am trying to follow the example from this link.
https://towardsdatascience.com/outlier-detection-python-cd22e6a12098
I am using this data source.
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
I am testing different data sets.
I got this to work.
from sklearn.cluster import DBSCAN
def dbscan(X, eps, min_samples):
ss = StandardScaler()
X = ss.fit_transform(X)
db = DBSCAN(eps=eps, min_samples=min_samples)
db.fit(X)
y_pred = db.fit_predict(X)
plt.scatter(X[:,0], X[:,1],c=y_pred, cmap='Paired')
plt.title("DBSCAN")
dbscan(data, eps=.5, min_samples=5)
I found this to be a great resource.
https://medium.com/#plog397/functions-to-plot-kmeans-hierarchical-and-dbscan-clustering-c4146ed69744

Polynomial Regression plot not showing correctly

I run this code for polynomial regression using sklearn but my plot is not what i was expecting. As you can see here i'm not getting a smooth line but it's jumping from one point to another. From my understanding i have to sort X, but when i do that all i get is an empty plot with a linear line.
import operator
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.formula.api as smf
df = pd.read_csv('D:\Mall_Customers.csv', usecols = ['Age', 'Annual Income (k$)','Spending Score (1-100)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, max_iter=100)
y_kmeans= kmeans.fit_predict(x)
mydict = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
dictlist = []
for key, value in mydict.items():
temp = [key,value]
dictlist.append(temp)
df0 = df[df.index.isin(mydict[0].tolist())]
X = df0[['Age', 'Annual Income (k$)']]
Y = df0['Spending Score (1-100)']
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, Y)
y_poly_pred = model.predict(X_poly)
r2 = r2_score(Y,y_poly_pred)
print(r2)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression(fit_intercept = False))
model.fit(X,Y)
plt.scatter(X.iloc[:, 1], Y, color='red')
plt.plot(X, Y, color='blue')
plt.xlabel('Age. Annual income')
plt.ylabel('Spending Score')
plt.show()
TLDR; the data is not linear dependent.
The reason the graph got so messy is because you plotted the X (train data) with the Y (the actual prediction data) and the fact that you were plotting this data while:
the data was messy and not really linear dependent
is what made the result this messy graph.
I suggest you to:
split to the train data into train, test and then after you train the model check the error with the test and maybe create 2 plots, 1 with the model results according to the test data and one with the actual result for the test data.
and change plot code to this:
.
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Why do I get nested clusters in kmeans when I use normalized data while I get non-overlapping clusters when I use non normalized data?

I am currently following a course on the basics of machine learning provided by IBM. After the teacher finished building the model, I noticed that he does not use the normalized data to fit the model, but rather uses regular data and in the end he gets a good cluster and non-overlapping clusters. But when I tried to use the normalized data to train the model, I got a catastrophe and I got nested clusters, as the code and image show. Why did the process of normalization lead to that? Although it is always good "as I know" to use normalization in mathematical basis algorithms.
code does not use normalized data
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
cust_df = pd.read_csv('D:\machine learning\Cust_Segmentation.csv')
cust_df.head()
df = cust_df.drop('Address', axis = 1)
X = df.values[:, 1:]
X = np.nan_to_num(X)
from sklearn.preprocessing import StandardScaler
norm_featur = StandardScaler().fit_transform(X)
clusterNum = 3
kmeans = KMeans(init = 'k-means++', n_clusters = clusterNum, n_init = 12)
kmeans.fit(X)
k_means_labels = kmeans.labels_
df['cluster'] = kmeans.labels_
k_means_cluster_centers = kmeans.cluster_centers_
area = np.pi * ( X[:, 1])**2
plt.scatter(X[:, 0], X[:, 3], s=area, c=kmeans.labels_.astype(np.float), alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
CLUSTERS WITH OUT USING NORMALIZATION
code using normalized data
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
cust_df = pd.read_csv('D:\machine learning\Cust_Segmentation.csv')
cust_df.head()
df = cust_df.drop('Address', axis = 1)
X = df.values[:, 1:]
X = np.nan_to_num(X)
from sklearn.preprocessing import StandardScaler
norm_feature = StandardScaler().fit_transform(X)
clusterNum = 3
kmeans = KMeans(init = 'k-means++', n_clusters = clusterNum, n_init = 12)
kmeans.fit(norm_feature)
k_means_labels = kmeans.labels_
df['cluster'] = kmeans.labels_
k_means_cluster_centers = kmeans.cluster_centers_
area = np.pi * ( norm_feature[:, 1])**2
plt.scatter(norm_feature[:, 0], norm_feature[:, 3], s=area, c=kmeans.labels_.astype(np.float),
alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
CLUSTER AFTER NORMALIZATION
Income and age are on fairly different scales here. In your first plot, a difference of ~100 in income is about the same as a difference of ~10 in age. But in k-means, that difference in income is considered 10x larger. The vertical axis easily dominates the clustering.
This is probably 'wrong', unless you happen to believe that a change of 1 in income is 'the same as' a change of in 10 age, for purposes of figuring out what's similar. This is why you standardize, which makes a different assumption: that they are equally important.
Your second plot doesn't quite make sense; k-means can't produce 'overlapping' clusters. The problem is that you have only plotted 2 of the 4 (?) dimensions you clustered on. You can't plot 4D data, but I suspect that if you applied PCA to the result to reduce to 2 dimensions first and plotted it, you'd see separated clusters.

Plot only Class 1 vs Baseline in Lift-curve and Cumulative-gains-chart in scikitplot

I am working on a problem of propensity modeling for an ad campaign. My data set consists of users who have historically clicked on the ads and those who have not clicked.
To measure the performance of my model, I plotting cumulative gains and lift charts using sklearn. Below is the code for the same:
import matplotlib.pyplot as plt
import scikitplot as skplt
Y_test_pred_ = model.predict_proba(X_test_df)[:]
skplt.metrics.plot_cumulative_gain(Y_test, Y_test_pred_)
plt.show()
skplt.metrics.plot_lift_curve(Y_test, Y_test_pred_)
plt.show()
The plot I am getting is showing graphs for both - class 0 users and class 1 users
I need to plot only the class 1 curve against the baseline curve.
Is there a way I can do that?
You can use the kds package for the same.
For Cummulative Gains Plot:
# pip install kds
import kds
kds.metrics.plot_cumulative_gain(y_test, y_prob)
For Lift Chart:
import kds
kds.metrics.plot_lift(y_test, y_prob)
Example
# REPRODUCABLE EXAMPLE
# Load Dataset and train-test split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.33,random_state=3)
clf = tree.DecisionTreeClassifier(max_depth=1,random_state=3)
clf = clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)
# CUMMULATIVE GAIN PLOT
import kds
kds.metrics.plot_cumulative_gain(y_test, y_prob[:,1])
# LIFT PLOT
kds.metrics.plot_lift(y_test, y_prob[:,1])
I can explain code if needed:
Args:
df : dataframe containing one score column and one target column
score : string containing the name of the score column
target : string containing the name of the target column
title : string containing the name of the graph that will be generated
def get_cum_gains(df, score, target, title):
df1 = df[[score,target]].dropna()
fpr, tpr, thresholds = roc_curve(df1[target], df1[score])
ppr=(tpr*df[target].sum()+fpr*(df[target].count()-
df[target].sum()))/df[target].count()
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(ppr, tpr, label='')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.grid(b=True, which='both', color='0.65',linestyle='-')
plt.xlabel('%Population')
plt.ylabel('%Target')
plt.title(title+'Cumulative Gains Chart')
plt.legend(loc="lower right")
plt.subplot(1,2,2)
plt.plot(ppr, tpr/ppr, label='')
plt.plot([0, 1], [1, 1], 'k--')
plt.grid(b=True, which='both', color='0.65',linestyle='-')
plt.xlabel('%Population')
plt.ylabel('Lift')
plt.title(title+'Lift Curve')
This is a bit hacky, but it does what you want. The point is to get
access to access to ax variable that matplotlib create. Then
manipulate it to delete the undesired plot.
# Some dummy data to work with
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
X, y = load_breast_cancer(return_X_y=True)
# ploting
import scikitplot as skplt
import matplotlib.pyplot as plt
# classify
clf = LogisticRegression(solver='liblinear', random_state=42).fit(X, y)
# classifier's output probabilities for the two classes
y_preds_probas = clf.predict_proba(X)
# get access to the figure and axes
fig, ax = plt.subplots()
# ax=ax creates the plot on the same ax we just initialized.
skplt.metrics.plot_lift_curve(y, y_preds_probas, ax=ax)
## Now the solution to your problem.
del ax.lines[0] # delete the desired class plot
ax.legend().set_visible(False) # hide the legend
ax.legend().get_texts()[0].set_text("Cancer") # turn the legend back on
plt.show()
You might have to mess around with ax.lines[1] etc to
delete exactly what you want of course.

Python sklearn poly regression

I'm stuck solving this issue for two days now. I have some datapoints I put in a scatter plot and get this:
Which is nice, but now I also want to add a regression line, so I had a look at this example from sklearn and changed the code to this
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
degrees = [3, 4, 5]
X = combined[['WPI score']]
y = combined[['CPI score']]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
ax = plt.subplot(1, len(degrees), i + 1)
plt.setp(ax, xticks=(), yticks=())
polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features), ("linear_regression", linear_regression)])
pipeline.fit(X, y)
# Evaluate the models using crossvalidation
scores = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
X_test = X #np.linspace(0, 1, len(combined))
plt.plot(X, pipeline.predict(X_test), label="Model")
plt.scatter(X, y, label="CPI-WPI")
plt.xlabel("X")
plt.ylabel("y")
plt.legend(loc="best")
plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(degrees[i], -scores.mean(), scores.std()))
plt.savefig(pic_path + 'multi.png', bbox_inches='tight')
plt.show()
which has the following output:
Note that X and y are both DataFrames of size (151, 1). I can post the content of X and y too, if necessary.
What I want is a nice smooth line, but I seem not to be able to figure out, how to do this.
[Edit]
The question here is: How do I get a single smooth, curvy polynomial line instead of multiple ones with seemingly random pattern.
[Edit 2]
The problem is, when I use the linspace like this:
X_test = np.linspace(1, 4, 151)
X_test = X_test[:, np.newaxis]
I get a even more random pattern:
The trick was to set the code like following:
X_test = np.linspace(min(X['GPI score']), max(X['GPI score']), X.shape[0])
X_test = X_test[:, np.newaxis]
plt.plot(X_test, pipeline.predict(X_test), label="Model")
Which yields the following result (a much nicer, single smooth line)

Categories