I am new to machine learning with Python. I have managed to draw the straight decision boundary for logistic regression using matplotlib, but I am having some difficulty plotting a curved boundary to illustrate overfitting on a sample dataset.
I am trying to build a logistic regression model with regularization, and to use the regularization to control overfitting on my data set.
I am aware of the sklearn library; however, I prefer writing the code myself.
The test data sample I am working on is given below:
x=np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650')
y=np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0')
The decision boundary I am expecting is given in the graph below:
Any help would be appreciated.
I could plot a straight decision boundary using the code below:
# plot of x 2D
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(X[pos[0],0], X[pos[0],1], 'ro')
plt.plot(X[neg[0],0], X[neg[0],1], 'bo')
plt.xlim([min(X[:,0]),max(X[:,0])])
plt.ylim([min(X[:,1]),max(X[:,1])])
plt.show()
# plot of the decision boundary
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(x[pos[0],1], x[pos[0],2], 'ro')
plt.plot(x[neg[0],1], x[neg[0],2], 'bo')
plt.xlim([x[:, 1].min()-2 , x[:, 1].max()+2])
plt.ylim([x[:, 2].min()-2 , x[:, 2].max()+2])
plot_x = [min(x[:,1])-2, max(x[:,1])+2] # take a larger range so the decision line extends past the data
plot_y = (-1/theta_NM[2])*(theta_NM[1]*plot_x +theta_NM[0])
plt.plot(plot_x, plot_y)
And my decision boundary looks like this:
In an ideal scenario the above decision boundary is good, but I would like to plot a curved decision boundary that fits my training data very well (and would therefore overfit my test data), something similar to what is shown in the first plot.
This can be done by gridding the feature space, setting each grid point to the value of the closest data point, and then running a contour plot on this grid.
There are numerous variations, such as setting each grid point to a distance-weighted average of the data values, or smoothing the final contour, etc.; a sketch of the distance-weighted variant is given after the example.
Here's an example for finding the initial contour:
import numpy as np
import matplotlib.pyplot as plt
# get the data as numpy arrays
xys = np.array(np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650'))
vals = np.array(np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0'))[:,0]
N = len(vals)
# some basic spatial stuff
xs = np.linspace(min(xys[:,0])-2, max(xys[:,0])+1, 10)
ys = np.linspace(min(xys[:,1])-100, max(xys[:,1])+100, 10)
xr = max(xys[:,0]) - min(xys[:,0]) # ranges so distances can weight x and y equally
yr = max(xys[:,1]) - min(xys[:,1])
X, Y = np.meshgrid(xs, ys) # meshgrid for contour and distance calcs
# set each gridpoint to the value of the closest data point:
Z = np.zeros((X.shape[0], X.shape[1], N))
for n in range(N):
    # stack squared (range-normalized) distances to each data point
    Z[:,:,n] = ((X-xys[n,0])/xr)**2 + ((Y-xys[n,1])/yr)**2
z = np.argmin(Z, axis=2) # index of the data point closest to each grid point
v = vals[z]              # set each grid value to that data point's label
# do the contour plot (use only the level 0.5 since values are 0 and 1)
plt.contour(X, Y, v, cmap=plt.cm.gray, levels=[.5]) # contour the data point values
# now plot the data points
pos=np.where(vals==1)
neg=np.where(vals==0)
plt.plot(xys[pos,0], xys[pos,1], 'ro')
plt.plot(xys[neg,0], xys[neg,1], 'bo')
plt.show()
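For the distance-weighted variant mentioned above, a minimal sketch (reusing xys, vals, X, Y, Z, pos and neg from the example; the inverse-distance weighting is just one possible choice):
plt.figure()
eps = 1e-9                              # avoid division by zero at the data points
W = 1.0 / (Z + eps)                     # inverse squared-distance weights
v_weighted = (W * vals).sum(axis=2) / W.sum(axis=2)  # weighted average of the 0/1 labels
plt.contour(X, Y, v_weighted, cmap=plt.cm.gray, levels=[.5])
plt.plot(xys[pos,0], xys[pos,1], 'ro')
plt.plot(xys[neg,0], xys[neg,1], 'bo')
plt.show()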
Related
I have a number of samples of a variable, and I would like to use these samples to plot the probability distribution of the variable. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]
# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(min(X).item(), max(X).item(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
# -- plot density
plt.plot( X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()
Below is the resulting output:
As you can see, the value on the y axis is above one. Hence, the y axis is NOT showing the probability distribution. I further plotted the histogram for this data:
# -- plot hist
n_bins = 40
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()
and the result is below:
which makes sense, as the bin heights sum to one: 0.025*40 = 1
I'm having a hard time understanding why my kde plot is not a distribution. How can I fix this? Is there a normalization step that I'm missing?
First, if you extend the limits of your X_plot axis (e.g. X_plot = np.linspace(-1, 1, ...)), you'll see that your KDE estimates a rather tall Gaussian, and the area under the curve is still 1.
Density values over 1 are perfectly legal, since the assumed distribution is continuous: there are no real probabilities for exact points, and you should not treat your Y values as such; the estimated probability for an interval is the corresponding area under the curve.
Sample code to verify the estimated probability of hitting the 0 to 0.004 range (roughly the same width as your histogram bins):
import scipy.integrate as integrate
interval = np.linspace(0, 0.004, 1000)[:, np.newaxis]
log_dens = kde.score_samples(interval)
print(integrate.trapz(np.exp(log_dens), interval[:,0]))
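Similarly, a quick sketch to verify the total area mentioned in the first point, integrating over a range wide enough to cover the whole estimated Gaussian:
wide = np.linspace(-1, 1, 4000)[:, np.newaxis]
print(integrate.trapz(np.exp(kde.score_samples(wide)), wide[:, 0]))  # should be close to 1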
Second, once you check the area under the curve you'll see that your current hyperparameters are not yielding a very accurate estimate; reducing the bandwidth or choosing a different kernel might help.
You can also apply a grid search to find the least inaccurate kernel and bandwidth, though this will take a good amount of time unless you reduce your sample size; also, choosing a narrow bandwidth may result in undersmoothing.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KernelDensity(), {'kernel':['gaussian', 'tophat'],'bandwidth': np.logspace(-2, 0, 10)}, cv=5, n_jobs=-1)
grid.fit(X)
print(f"best hyperparameters: {grid.best_params_}")
kde = grid.best_estimator_
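A quick follow-up sketch (reusing X_plot from your code) to re-plot the density with the tuned estimator:
log_dens = kde.score_samples(X_plot)
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.show()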
I have found the best number of clusters and the cluster assignment for each data point. Now, how can I plot a scatter plot based on the centers and clusters to see the data?
This is my dataset, and this is the code I am using:
x = df_diabetes_normalizado['Glicose']
y = df_diabetes_normalizado['Massa Corporal']
Cluster = df_diabetes_normalizado['clusters']
centers = np.random.randn(1, 2)
fig = plt.figure(figsize=(14,9))
ax = fig.add_subplot(111)
scatter = ax.scatter(x,y,c=Cluster,s=50)
for i, j in centers:
    ax.scatter(i, j, s=50, c='red', marker='+')
ax.set_xlabel('x')
ax.set_ylabel('y')
fig.show()
However, the plot is confusing to me.
Could you please give me some guidance on how to fix my script to generate the correct scatter plot based on the centers and the cluster distribution?
Because you're plotting the wrong variable: your dependent variable should be 'Classe' (1/0, presumably for diabetic or not), not 'clusters'. In more detail:
1) clusters is merely an integer telling you how many clusters exhibit those characteristics, not whether each cluster is in Classe==0 or 1.
Cluster = df_diabetes_normalizado['clusters']
...
scatter = ax.scatter(x,y,c=Cluster, ...)
Your plot is wrongly using color to show c=Cluster, i.e. the cluster number; you're not plotting Classe anywhere. Plot Classe instead. (You might choose to use the cluster for the marker size, so larger clusters plot larger.)
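For instance, a minimal sketch of the corrected call (assuming the 'Classe' column exists as described above):
Classe = df_diabetes_normalizado['Classe']
scatter = ax.scatter(x, y, c=Classe, s=50)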
2) 'Generate the correct scatterplot [of two variables]' is not well-defined; clearly you have 8 variables ('Numero Gravida', 'Glicose', 'Pressao', ..., 'Idade') and your dependent variable ('Classe') is a function of all 8 of them, not just the two you arbitrarily picked to plot: x='Glicose' and y='Massa Corporal'.
Assuming you don't want to do a 3D or n-dimensional plot, you either do:
some dimensionality reduction with PCA (Principal Component Analysis), then plot the two or three most important pseudo-variables (see e.g. this example...; a short PCA sketch is also given after the iris example below)
or else build a model based on a custom cluster distance function.
If you post an MCVE for your dataset and tell us what sort of plot you actually want, then we can post code.
Example using iris dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:,0:2]
y = iris.target
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
assignments = kmeans.labels_ # this is the CLUSTERS column in your case
plt.figure(figsize=(12,8))
classes = np.unique(assignments)
colors= ['r','b','k','y'] # 4 CLUSTERS SO 4 COLORS HERE
for s, l in enumerate(classes):
    xs = X[:, 0]
    ys = X[:, 1]
    plt.scatter(xs[assignments == s], ys[assignments == s], c=colors[s])  # color based on cluster assignment
plt.plot(kmeans.cluster_centers_[0][0], kmeans.cluster_centers_[0][1], 'ro', markersize=16, alpha=0.5)
plt.plot(kmeans.cluster_centers_[1][0], kmeans.cluster_centers_[1][1], 'bo', markersize=16, alpha=0.5)
plt.plot(kmeans.cluster_centers_[2][0], kmeans.cluster_centers_[2][1], 'ko', markersize=16, alpha=0.5)
plt.plot(kmeans.cluster_centers_[3][0], kmeans.cluster_centers_[3][1], 'yo', markersize=16, alpha=0.5)
plt.grid()
plt.show()
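For the PCA route mentioned above, a minimal sketch on the same iris data (sklearn's PCA is an added assumption here): cluster on all features, project everything onto the two main components, and color by cluster:
from sklearn.decomposition import PCA
X_all = iris.data                                # all 4 features (analogous to your 8 columns)
km = KMeans(n_clusters=4).fit(X_all)             # cluster in the full feature space
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_all)                  # project the data onto the two main components
centers_2d = pca.transform(km.cluster_centers_)  # project the cluster centers the same way
plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, s=30)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], c='red', marker='+', s=200)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()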
I'm facing a silly problem while plotting a graph from a regression function calculated using scikit-learn. After constructing the function I need to plot a graph that shows X and Y from the original data together with the points calculated from my function. The problem is that my function is not a straight line: despite being linear in its parameters, it uses a Fourier series to give the right shape to my curve, and when I try to plot the line using:
ax.plot(df['GDPercapita'], modelp1.predict(df1), color='k')
I got a graph like this (first figure, "Plot"), but the true graph is supposed to be a line following those black points (second figure, "Dots to be connected").
I'm generating the graph using the following code:
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp1.predict(df1),color='k') #this line is changed to get the first pic.
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show(block=True)
Does anyone have an idea about what to do?
POST DISCUSSION EDIT:
Ok, so first things first:
The data can be download at: http://www.est.ufmg.br/~marcosop/est171-ML/dados/worldDevelopmentIndicators.csv
I had to generate new data using a Fourier expansion, with normalized values of GDPercapita, in order to perform an exhaustive optimization of the regression function used to predict LifeExpectancy; the number of p parameters that generates the best regression function turned out to be p=22.
Now I have to generate a polynomial function from the prediction points of the regression function with p=22, to show how the best regression function compares to a polynomial function of degree 22.
To generate the prediction I use the following code:
from sklearn import linear_model
modelp22 = linear_model.LinearRegression()
modelp22.fit(xp22,y_train)
df22 = df[p22]
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp22.predict(df22),color='k')
ax.set_xlabel('GDPercapita')
ax.set_ylabel('LifeExpectancy')
plt.show(block=True)
Now I need to use the prediction points to create a polynomial function and plot a graph with the original data (first scatter), the prediction points (second scatter), and the polynomial function (a curve) to show their visual relation.
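A minimal sketch of that last step (assuming df, df22 and modelp22 exist as in the snippet above; np.polyfit/np.poly1d and the normalization of x are additions here). The key detail is sorting by GDPercapita before drawing the line; otherwise matplotlib connects the points in dataframe order and produces the zigzag from the first picture:
import numpy as np
import matplotlib.pyplot as plt
x_data = df['GDPercapita'].values
y_pred = modelp22.predict(df22)
# normalize x before fitting a high-degree polynomial, otherwise the fit is badly conditioned
x_norm = (x_data - x_data.mean()) / x_data.std()
coeffs = np.polyfit(x_norm, y_pred, deg=22)
poly = np.poly1d(coeffs)
order = np.argsort(x_data)  # sort so the curve is drawn left to right
fig, ax = plt.subplots()
ax.scatter(x_data, df['LifeExpectancy'], edgecolors=(0, 0, 0))  # original data
ax.scatter(x_data, y_pred, color='k')                           # prediction points
ax.plot(x_data[order], poly(x_norm[order]), color='r')          # degree-22 polynomial curve
ax.set_xlabel('GDPercapita')
ax.set_ylabel('LifeExpectancy')
plt.show(block=True)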
My lab uses what our PI calls "modified Bland–Altman plots" to analyze regression quality. The code I wrote using Seaborn only handles discrete data, and I'd like to generalize it.
A Bland–Altman plot compares the difference between two measures to their average. The "modification" is that the x-axis is, instead of the average, the ground truth value. The y-axis is the difference between the predicted and true values. In effect, the modified B–A plot can be seen as the plot of residuals from the line y=x—i.e. the line predicted=truth.
The code to generate this plot, as well as an example, is given below.
import numpy as np
import seaborn as sns

def modified_bland_altman_plot(predicted, truth):
    predicted = np.asarray(predicted)
    truth = np.asarray(truth, dtype=int)  # casting to int is a hack for stripplot
    diff = predicted - truth
    ax = sns.stripplot(truth, diff, jitter=True)
    ax.set(xlabel='truth', ylabel='difference from truth', title="Modified Bland-Altman Plot")
    # Plot a horizontal line at 0
    ax.axhline(0, ls=":", c=".2")
    return ax
Admittedly, this example has terrible bias in its prediction, shown by the downward slope.
I'm curious about two things:
Is there a generally accepted name for these "modified Bland–Altman plots"?
How can one create these for non-discrete data? We use stripplot, which requires discrete data. I know that seaborn has the residplot function, but it doesn't take a custom function for the line from which residuals are measured, e.g. predicted=true. Instead, it measures from the best-fit line it computes.
It seems you're looking for a standard scatter plot here:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(1)
def modified_bland_altman_plot(predicted, truth):
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    diff = predicted - truth
    fig, ax = plt.subplots()
    ax.scatter(truth, diff, s=9, c=truth, cmap="rainbow")
    ax.set_xlabel('truth')
    ax.set_ylabel('difference from truth')
    ax.set_title("Modified Bland-Altman Plot")
    # Plot a horizontal line at 0
    ax.axhline(0, ls=":", c=".2")
    return ax
x = np.random.rayleigh(scale=10, size=201)
y = np.random.normal(size=len(x))+10-x/10.
modified_bland_altman_plot(y, x)
plt.show()
Let's say I have a large data set that I can manipulate in some sort of analysis, such as looking at the values in a probability distribution.
Now that I have this large data set, I want to compare known, actual data to it; primarily, how many of the values in my data set have the same value or property as the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I have made shows how the points could be visually projected onto a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.
If I were to try to retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges, and then setting it all up so that I can plot it. Any idea how to gather this data and project it onto a cumulative distribution?
The question is pretty chaotic, with lots of irrelevant information while staying vague on the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assume we have a distribution x, which we divide into bins; we can compute the histogram h using numpy.histogram. The probability to find a value in each bin is then given by h/h.sum().
Having a value v=0.77, of which we want to know the probability according to the distribution, we can find out the bin in which it would belong by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
import numpy as np; np.random.seed(0)
x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())
ind = np.searchsorted(bins, 0.77, side="right")
print(prob[ind-1])  # probability of the bin containing 0.77
So the probability to sample a value in the bin around 0.77 is roughly 5-6%.
A different option would be to interpolate the histogram between the bin centers, so as to find the probability.
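As a quick sketch of that idea (reusing bins and prob from the snippet above):
centers = 0.5 * (bins[1:] + bins[:-1])  # bin centers
print(np.interp(0.77, centers, prob))   # interpolated probability at 0.77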
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())
points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$']
colors = ["k", "crimson" , "gold"]
labels = list("ABC")
kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize="2", mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)
for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in the scatter distribution
    ax.plot(p[0], p[1], **kw)
    # plot points in the bar histogram: find the bin in which to plot the point
    # and shift by half the bin width to plot it in the middle of the bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in the cumulative histogram: interpolate, such that the point is on the curve
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0], yi, **kw)
ax.legend()
plt.tight_layout()
plt.show()