I found the best number of clusters and the cluster assignment for each data point.
Now, how can I draw a scatter plot based on the centers and clusters to see the data?
This is my dataset.
This is the code I am using.
x = df_diabetes_normalizado['Glicose']
y = df_diabetes_normalizado['Massa Corporal']
Cluster = df_diabetes_normalizado['clusters']
centers = np.random.randn(1, 2)
fig = plt.figure(figsize=(14,9))
ax = fig.add_subplot(111)
scatter = ax.scatter(x,y,c=Cluster,s=50)
for i,j in centers:
    ax.scatter(i,j,s=50,c='red',marker='+')
ax.set_xlabel('x')
ax.set_ylabel('y')
fig.show()
However, the plot is very confusing to me.
Could you please give me some guidance on how I can fix my script to generate the correct scatter plot based on the centers and the cluster distribution?
Because
You're plotting the wrong variable: your dependent variable should be 'Classe' (1/0, presumably for diabetic or not) Not 'clusters', which is merely an integer telling you how many clusters exhibit those characteristics, not whether they're in Classe==0 or 1.
clearly you have 8 variables ('Numero Gravida', 'Glicose', 'Pressao', ..., 'Idade') and your dependent variable ('Classe') is a function of all 8 of them, not just the two you arbitrarily picked to plot: x='Glicose' and y='Massa Corporal'.
1) clusters is merely an integer telling you how many clusters exhibit those characteristics, not whether each cluster is in Classe==0 or 1.
Cluster = df_diabetes_normalizado['clusters']
...
scatter = ax.scatter(x,y,c=Cluster, ...)
Your plot is wrongly using color to show c=Cluster i.e. the number of clusters, you're not plotting Classe anywhere. Plot Classe instead. (You might choose to use size=Clusters, so larger clusters plot larger)
2) 'Generate the correct scatterplot [of two variables]' is not well-defined; clearly you have 8 variables ('Numero Gravida', 'Glicose', 'Pressao', ..., 'Idade') and your dependent variable ('Classe') is a function of all 8 of them, not just the two you arbitrarily picked to plot: x='Glicose' and y='Massa Corporal'.
Assuming you don't want to do a 3D or n-dimensional plot, you either do:
some dimensional reduction with PCA (Principal Component Analysis), then plot the most important two/three pseudovariables (see e.g. this example...)
or else build a model based on a custom cluster distance function.
If you post an MCVE for your dataset and tell us what sort of plot you actually want, then we can post code.
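The PCA option can be sketched as follows. This is only an illustration under assumptions: the feature matrix, the labels and the centers are random stand-ins (not the diabetes data), and PCA is done with a plain NumPy SVD rather than a library class. The point it shows is that the cluster centers have to be projected with the same mean and components as the data before both are plotted together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 8 normalized feature columns (100 rows).
features = rng.normal(size=(100, 8))
# Hypothetical cluster labels (0..2) and cluster centers in the 8-D space.
labels = rng.integers(0, 3, size=100)
centers = np.array([features[labels == k].mean(axis=0) for k in range(3)])

# PCA via SVD: center the data, then keep the top two right singular vectors.
mean = features.mean(axis=0)
_, _, vt = np.linalg.svd(features - mean, full_matrices=False)
components = vt[:2]                              # shape (2, 8)

# Project the points and the centers with the same mean and components.
points_2d = (features - mean) @ components.T     # shape (100, 2)
centers_2d = (centers - mean) @ components.T     # shape (3, 2)
```

The two columns of points_2d can then go to ax.scatter with c=labels, and centers_2d can be drawn on top with a distinct marker.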
Example using iris dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:,0:2]
y = iris.target
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
assignments = kmeans.labels_ # this is the CLUSTERS column in your case
plt.figure(figsize=(12,8))
classes = np.unique(assignments)
colors= ['r','b','k','y'] # 4 CLUSTERS SO 4 COLORS HERE
for s,l in enumerate(classes):
    xs = X[:,0]
    ys = X[:,1]
    plt.scatter(xs[assignments==s], ys[assignments==s], c = colors[s]) # color based on group
plt.plot(kmeans.cluster_centers_[0][0], kmeans.cluster_centers_[0][1], 'ro',markersize=16, alpha = 0.5, label='')
plt.plot(kmeans.cluster_centers_[1][0], kmeans.cluster_centers_[1][1], 'bo',markersize=16, alpha = 0.5)
plt.plot(kmeans.cluster_centers_[2][0], kmeans.cluster_centers_[2][1], 'ko',markersize=16, alpha = 0.5)
plt.plot(kmeans.cluster_centers_[3][0], kmeans.cluster_centers_[3][1], 'yo',markersize=16, alpha = 0.5)
plt.grid()
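In the question the centers come from np.random.randn(1, 2), so the red marker has no relation to the data. If only the column of cluster labels is available and the fitted model is not, one possible substitute is the mean position of each cluster in the plotted plane. The sketch below uses small made-up arrays in place of the 'Glicose', 'Massa Corporal' and 'clusters' columns; the resulting means match k-means centroids only approximately, and only if the clustering was fitted on exactly these two columns.

```python
import numpy as np

# Hypothetical stand-ins for the two plotted columns and the cluster column.
x = np.array([0.1, 0.2, 0.9, 1.0, 0.15, 0.95])
y = np.array([0.3, 0.4, 0.8, 0.7, 0.35, 0.75])
cluster = np.array([0, 0, 1, 1, 0, 1])

points = np.column_stack([x, y])

# One row per cluster: the mean position of the points carrying that label.
centers = np.array([points[cluster == k].mean(axis=0)
                    for k in np.unique(cluster)])
```

Each row of centers can then be passed to ax.scatter in the loop from the question instead of the random values.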
Related
I have a Dataset with 3 labels and 27 features. I was trying to use the PCA on it and reduce the dimensions to 2. The results are a bit confusing. Honestly, I didn't expect too good results, but I got the first picture and I was very surprised.
Since I have three labels, I thought that I got my three classes pretty clear. However, when I apply the colors, I get the following picture:
I am a bit puzzled that the three classes are totally mixed across three clearly separated groups. I also tried it in 3D and the result looks exactly the same.
Is there any error in my code or does anyone know a reason why this could happen?
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import (StandardScaler, MaxAbsScaler, RobustScaler,
                                   Normalizer, QuantileTransformer, PowerTransformer, MinMaxScaler)
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
Dataset = pd.read_csv("...", header=0)
feature_spalten = ['...']
x = Dataset[feature_spalten]
y = Dataset.Classifier
sc = StandardScaler()
x = sc.fit_transform(x)
p = PCA()
p.fit(x)
x_transformed = p.transform(x)
plt.figure()
plt.scatter(x_transformed[:, 0], x_transformed[:, 1])
plt.figure()
for label in y.unique():
    x_transformed_filtered = x_transformed[y == label, :]
    plt.scatter(x_transformed_filtered[:, 0], x_transformed_filtered[:, 1],
                label=label, s = 25)
plt.legend()
plt.show()
This is suggestive that your data is clustered in high dimensional space, with each cluster comprised of instances with an assortment of the labels.
The objective of PCA is to find a lower-dimensional projection that preserves the variance of the data. The following hypothetical example shows how linearly separable two-dimensional data (with three clusters) can be projected to one dimension, with the clusters in the projection not corresponding to labels (red versus blue).
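A numerical version of that hypothetical example might look like this. The data are invented for illustration: three clusters spread along x, with a label that depends only on the sign of y, and PCA computed with a NumPy SVD. Because almost all the variance lies along x, the first principal component keeps the three clusters and discards the direction that separates the labels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three clusters spread along x; the label depends only on the sign of y.
cluster_x = np.repeat([-5.0, 0.0, 5.0], 50)
x = cluster_x + rng.normal(scale=0.3, size=150)
y = rng.uniform(-1.0, 1.0, size=150)
data = np.column_stack([x, y])
label = (y > 0).astype(int)      # "red" vs "blue", separable by the line y = 0

# First principal component = direction of largest variance (nearly the x axis here).
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]
projected = centered @ pc1       # one-dimensional projection

# Check that each of the three groups in the projection contains both labels.
group = np.repeat([0, 1, 2], 50)
mixed = [set(label[group == g].tolist()) == {0, 1} for g in range(3)]
```

Here mixed holds one flag per cluster telling whether both labels occur in it, which is the same pattern as in the colored PCA plot from the question.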
Let's say I have a large data set that I can manipulate in some sort of analysis, for example by looking at values in a probability distribution.
Now that I have this large data set, I then want to compare known, actual data to it. Primarily, how many of the values in my data set have the same value or property with the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I have made is how visually the points could possibly be projected on a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.
If I were to try to retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges and then having it all set up to where I can plot it. Any idea on how to gather this data and project this onto a cumulative distribution?
The question is pretty chaotic, with lots of irrelevant information, while staying vague on the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assuming we have a distribution x, which we divide into bins. We can compute the histogram h, using numpy.histogram. The probability to find a value in each bin is then given by h/h.sum().
Having a value v=0.77, of which we want to know the probability according to the distribution, we can find out the bin in which it would belong by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
import numpy as np; np.random.seed(0)
x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())
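ind = np.searchsorted(bins, 0.77, side="right") - 1  # index of the bin that contains 0.77
print(prob[ind])
So prob[ind] is the estimated probability of sampling a value in the bin around 0.77; for this sample it should be close to the roughly 5.7% expected for a unit-scale Rayleigh distribution.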
A different option would be to interpolate the histogram between the bin centers to find the probability.
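A minimal sketch of that interpolation, assuming linear interpolation with np.interp between the bin centers (other interpolation schemes would work as well) and a freshly seeded sample rather than the exact one used above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(size=1000)

bins = np.linspace(0, 4, 41)
h, _ = np.histogram(x, bins=bins)
prob = h / h.sum()

# Bin centers: midpoint of each pair of neighbouring edges.
centers = 0.5 * (bins[:-1] + bins[1:])

# Linearly interpolate the per-bin probability at v = 0.77.
v = 0.77
prob_interp = np.interp(v, centers, prob)
```

The result lies between the probabilities of the two bins whose centers surround v, so it changes smoothly as v moves instead of jumping at the bin edges.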
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())
points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$']  # plain str literals; the ur'' prefix is Python 2 only
colors = ["k", "crimson" , "gold"]
labels = list("ABC")
kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize=2, mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)
for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    # plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)
ax.legend()
plt.tight_layout()
plt.show()
I have two sets of data, one containing around 11 million data points and the other around 5000. I would like to plot them both on one histogram, but because of the difference in size I need to normalise the frequency so I can plot them on the same figure. Below I have simulated what I have done with my data to be able to plot them. I have used normed=True.
from numpy.random import randn
import matplotlib.pyplot as plt
import random
datalist1=[]
for x in range(1,50000):
    datalist1.append(random.uniform(1,2))
datalist2=randn(5000000)
fig= plt.figure(1)
plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', normed=True)
plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',normed=True)
plt.xlabel("Value")
plt.ylabel("Normalised Frequency")
plt.legend()
plt.show()
Can you please tell me if this is a good way to get around this issue? I would like the tallest bar of each histogram to be scaled to 1 (or 100%).
The normed=True setting (spelled density=True in current Matplotlib) normalizes the histogram to an area of 1. That lets the histogram be read as an estimate of a probability density function.
In short, it actually makes sense not to normalize on the peak but on the area.
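To make the area argument concrete, here is a small check using np.histogram with density=True on two samples of very different size (the sample sizes and distributions are arbitrary stand-ins for the data in the question). Both histograms integrate to 1, which is what makes them comparable on one axis regardless of how many points each contains.

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.uniform(1, 2, size=5_000)      # stand-in for the small data set
large = rng.normal(size=500_000)           # stand-in for the large data set

areas = []
for sample in (small, large):
    density, edges = np.histogram(sample, bins=20, density=True)
    # Area under the histogram = sum of bar height times bar width.
    areas.append(np.sum(density * np.diff(edges)))
```

Both entries of areas equal 1 up to rounding, while the peak heights differ because the two distributions have different widths.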
But if you really want to normalize by height you can modify the polygon data of the histogram:
# density=True replaces normed=True, which was removed in Matplotlib 3.1
h = plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', density=True)
p = h[2][0]
p.xy[:,1] /= p.xy[:, 1].max()
h = plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',density=True)
p = h[2][0]
p.xy[:,1] /= p.xy[:, 1].max()
This solution feels a bit hackish, but at least it's quick and dirty :)
I am new to machine learning with python. I've managed to draw the straight decision boundary for logistic regression using matplotlib. However, I am facing a bit of difficulty in plotting a curve line to understand the case of overfitting using some sample dataset.
I am trying to build a logistic regression model using regularization and use regularization to control overfitting my data set.
I am aware of the sklearn library; however, I prefer to write the code myself.
The test data sample I am working on is given below:
x=np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650')
y=np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0')
The decision boundary I am expecting is given in the graph below:
Any help would be appreciated.
I could plot a straight decision boundary using the code below:
# plot of x 2D
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(X[pos[0],0], X[pos[0],1], 'ro')
plt.plot(X[neg[0],0], X[neg[0],1], 'bo')
plt.xlim([min(X[:,0]),max(X[:,0])])
plt.ylim([min(X[:,1]),max(X[:,1])])
plt.show()
# plot of the decision boundary
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(x[pos[0],1], x[pos[0],2], 'ro')
plt.plot(x[neg[0],1], x[neg[0],2], 'bo')
plt.xlim([x[:, 1].min()-2 , x[:, 1].max()+2])
plt.ylim([x[:, 2].min()-2 , x[:, 2].max()+2])
plot_x = [min(x[:,1])-2, max(x[:,1])+2] # Takes a larger decision line
plot_y = (-1/theta_NM[2])*(theta_NM[1]*plot_x +theta_NM[0])
plt.plot(plot_x, plot_y)
And my decision boundary looks like this:
In an ideal scenario the above decision boundary is good, but I would like to plot a curved decision boundary that fits my training data very well but overfits, so that it performs poorly on test data, something similar to what is shown in the first plot.
This can be done by gridding the parameter space and setting each grid point to the value of the closest point. Then running a contour plot on this grid.
But there are numerous variations, such as setting it to a value of a distance-weighted average; or smoothing the final contour; etc.
Here's an example for finding the initial contour:
import numpy as np
import matplotlib.pyplot as plt
# get the data as numpy arrays
xys = np.array(np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650'))
vals = np.array(np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0'))[:,0]
N = len(vals)
# some basic spatial stuff
xs = np.linspace(min(xys[:,0])-2, max(xys[:,0])+1, 10)
ys = np.linspace(min(xys[:,1])-100, max(xys[:,1])+100, 10)
xr = max(xys[:,0]) - min(xys[:,0]) # ranges so distances can weight x and y equally
yr = max(xys[:,1]) - min(xys[:,1])
X, Y = np.meshgrid(xs, ys) # meshgrid for contour and distance calcs
# set each gridpoint to the value of the closest data point:
Z = np.zeros(X.shape + (N,))  # meshgrid output has shape (len(ys), len(xs))
for n in range(N):
    Z[:,:,n] = ((X-xys[n,0])/xr)**2 + ((Y-xys[n,1])/yr)**2 # stack arrays of distances to each points
z = np.argmin(Z, axis=2) # which data point is the closest to each grid point
v = vals[z] # set the grid value to the data point value
# do the contour plot (use only the level 0.5 since values are 0 and 1)
plt.contour(X, Y, v, cmap=plt.cm.gray, levels=[.5]) # contour the data point values
# now plot the data points
pos=np.where(vals==1)
neg=np.where(vals==0)
plt.plot(xys[pos,0], xys[pos,1], 'ro')
plt.plot(xys[neg,0], xys[neg,1], 'bo')
plt.show()
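The distance-weighted variation mentioned above can be sketched in the same setting. The inverse-squared-distance weighting and the small constant eps are assumptions chosen for illustration, not part of the nearest-point example; any decreasing weight function would give a similar but differently smoothed boundary.

```python
import numpy as np

xys = np.array([[2, 300], [4, 600], [7, 300], [5, 500], [5, 400], [6, 400],
                [3, 400], [4, 500], [1, 200], [3, 400], [7, 700], [3, 550],
                [2.5, 650]], dtype=float)
vals = np.array([0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0], dtype=float)

xs = np.linspace(xys[:, 0].min() - 2, xys[:, 0].max() + 1, 50)
ys = np.linspace(xys[:, 1].min() - 100, xys[:, 1].max() + 100, 50)
X, Y = np.meshgrid(xs, ys)

# Ranges so that x and y contribute equally to the distance.
xr = np.ptp(xys[:, 0])
yr = np.ptp(xys[:, 1])

# Squared scaled distance from every grid point to every data point: shape (50, 50, N).
d2 = ((X[..., None] - xys[:, 0]) / xr) ** 2 + ((Y[..., None] - xys[:, 1]) / yr) ** 2

eps = 1e-9                       # assumed constant, avoids division by zero
w = 1.0 / (d2 + eps)             # inverse-squared-distance weights
v = (w * vals).sum(axis=-1) / w.sum(axis=-1)   # weighted average of the 0/1 labels
```

Passing v to plt.contour(X, Y, v, levels=[.5]) draws the boundary where the weighted vote switches from 0 to 1.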
I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95 percentile and have the range [0, 95_percentile] on the x-axis?
There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
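A small made-up sample shows why the median absolute deviation is preferred. The numbers are invented for illustration: adding a single extreme value inflates the standard deviation enormously, while the MAD barely moves.

```python
import numpy as np

clean = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.05, 0.95])
with_outlier = np.append(clean, 100.0)

def mad(values):
    """Median absolute deviation from the median."""
    med = np.median(values)
    return np.median(np.abs(values - med))

# Standard deviation before and after adding the outlier.
std_clean, std_out = clean.std(), with_outlier.std()
# Median absolute deviation before and after adding the outlier.
mad_clean, mad_out = mad(clean), mad(with_outlier)
```

A threshold expressed in units of the MAD therefore still reflects the spread of the bulk of the data, which is what the modified z-score relies on.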
Here's a function that implements one of the more common outlier tests.
import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D input as a column of single-dimension observations
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    # Euclidean distance of each observation from the median
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()
If you aren't fussed about rejecting outliers as mentioned by Joe and it is purely aesthetic reasons for doing this, you could just set your plot's x axis limits:
plt.xlim(min_x_data_value,max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min,max) works to set limits on the y axis also.
I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()
I usually pass the data through the function np.clip. If you have some reasonable estimate of the maximum and minimum value of your data, just use that. If you don't have a reasonable estimate, the histogram of the clipped data will show you the size of the tails, and if the outliers are really just outliers the tails should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.
Based on this, you could probably write some code that finds good bounds which keep the tails below some tolerance.
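One way such code might look, as a sketch only: the helper clip_bounds and its tail_fraction parameter are hypothetical names, and using quantiles to pick the bounds is an assumption about what "minimizing the tails" should mean here.

```python
import numpy as np

def clip_bounds(data, tail_fraction=0.001):
    """Return (low, high) so that roughly `tail_fraction` of the data lies beyond each bound."""
    low, high = np.quantile(data, [tail_fraction, 1.0 - tail_fraction])
    return low, high

rng = np.random.default_rng(0)
data = rng.normal(3, size=100_000)

low, high = clip_bounds(data)
clipped = np.clip(data, low, high)

# Fraction of points that were moved onto one of the bounds.
outside = np.mean((data < low) | (data > high))
```

The clipped array can be passed to plt.hist as before; raising tail_fraction trims more aggressively, lowering it keeps more of the tails.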
In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()