Cluster datapoints using kmeans sklearn in python

I am using the following python code to cluster my datapoints using kmeans.
data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21], [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]])
kmeans_clustering = KMeans( n_clusters = 3 )
idx = kmeans_clustering.fit_predict( data )
#use t-sne
X = TSNE(n_components=2).fit_transform( data )
fig = plt.figure(1)
plt.clf()
#plot graph
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
plt.scatter(X[:,0], X[:,1], c=colors[kmeans_clustering.labels_])
plt.title('K-Means (t-SNE)')
plt.show()
However, the plot I get is wrong: all the points end up at practically the same location.
What am I doing wrong in my code? I want to see the k-means clusters separately in my scatter plot.
EDIT
The t-SNE values I get are as follows.
[[ 1.12758575e-04 9.30458337e-05]
[ -1.82559784e-04 -1.06657936e-04]
[ -9.56485652e-05 -2.38951623e-04]
[ 5.56515580e-05 -4.42453191e-07]
[ -1.42039677e-04 -5.62548119e-05]]

Set the perplexity parameter of TSNE. Its default value is 30, which seems to be too high for a dataset of only five points, even though the documentation states that t-SNE is quite insensitive to this parameter:
The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.
X = TSNE(n_components=2, perplexity=2.0).fit_transform( data )
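For reference, here is a minimal end-to-end sketch of the question's script with the imports it needs and the lower perplexity applied. The `n_init` and `random_state` values are assumptions added only to make the run repeatable; they are not part of the original post. As far as I know, recent scikit-learn releases also reject a perplexity that is not smaller than the number of samples, so the default of 30 may not run at all on five points.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21],
                 [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]], dtype=float)

# n_init and random_state are assumptions, added for repeatable results
kmeans_clustering = KMeans(n_clusters=3, n_init=10, random_state=0)
idx = kmeans_clustering.fit_predict(data)

# a small perplexity, because the dataset has only five points
X = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(data)


def plot_clusters(points, labels):
    """Scatter the 2D embedding, one color per k-means label."""
    colors = np.array(list('bgrcmyk'))
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=colors[labels])
    ax.set_title('K-Means (t-SNE)')
    return ax


if __name__ == "__main__":
    plot_clusters(X, idx)
    plt.show()
```

The embedding coordinates differ from run to run unless `random_state` is fixed, so only the grouping of the points is meaningful, not their absolute positions.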

You could also use PCA (Principal Components Analysis) instead of t-SNE to plot your clusters:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21],
                 [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]])
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(data)
# project the 5-dimensional points onto the first two principal components
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
data_reduced = pd.DataFrame(data_reduced)
ax = data_reduced.plot(kind='scatter', x=0, y=1, c=labels, cmap='rainbow')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('Projection of the clustering on the axes of the PCA')
# label every point with the cluster it was assigned to
for x, y, label in zip(data_reduced[0], data_reduced[1], kmeans.labels_):
    ax.annotate('Cluster {0}'.format(label), (x, y))
plt.show()

Related

How to hide x-axis range where no data point is presented for Bokeh chart line

I need to visualize the trend of a collected dataset using Bokeh, and I know this can be done with Bokeh's line function.
However, I ran into an issue when the x-axis is of datetime type: I would like the axis to skip the ranges where no data point is present. For the data in my sample code, the next datetime after '10:30:00' should be '14:00:00', but in the attached screenshot '11:00:00' is still shown. Likewise, '16:30:00' should be followed by '19:00:00' rather than '17:00:00'.
I added a made-up image to illustrate my intention; please look carefully at the red segment, the x-axis and its labels.
Is there any way to trim the x-axis where no data point is present? The screenshot and sample code are attached below. Thanks.
#! /usr/bin/env python
import numpy as np
import pandas as pd
from datetime import datetime
import time
from bokeh.io import output_file, show, save
from bokeh.plotting import figure
from bokeh.plotting import ColumnDataSource
from bokeh.layouts import gridplot
from bokeh.models import LinearAxis, Range1d
from bokeh.models.widgets import Tabs, Panel
from bokeh.models import HoverTool
from bokeh.models import CrosshairTool
def get_data():
    df = pd.DataFrame(np.array([['08:00:00', 11], ['08:30:00', 15],
                                ['09:00:00', 13], ['09:30:00', 17],
                                ['10:00:00', 15], ['10:30:00', 19],
                                ['14:00:00', 17], ['14:30:00', 13],
                                ['15:00:00', 15], ['15:30:00', 11],
                                ['16:00:00', 13], ['16:30:00', 17],
                                ['19:00:00', 15], ['19:30:00', 19],
                                ['20:00:00', 17], ['20:30:00', 13],
                                ['21:00:00', 15], ['21:30:00', 11],
                                ['22:00:00', 13], ['22:30:00', 17]]),
                      columns=['time', 'number'])
    # np.array stores the mixed rows as strings, so convert the counts back to integers
    df['number'] = df['number'].astype(int)
    column_data_source = ColumnDataSource(data={
        'x': pd.to_datetime(df['time'], format='%H:%M:%S'),
        'x0': pd.Series([x.strftime('%H:%M') for x in pd.to_datetime(df['time'], format='%H:%M:%S')]),
        'y_number': df['number'],
    })
    return column_data_source
def plot_figure(cds):
    plot = figure(plot_width=1200, plot_height=600,
                  x_axis_type='datetime',
                  y_range=(10, 20))
    cross = CrosshairTool()
    cross.line_color = 'white'
    cross.line_alpha = 0.7
    plot.add_tools(cross)
    plot.title.text = 'Number of Cars Collected at Different Time'
    plot.background_fill_color = 'black'
    plot.xgrid.grid_line_color = None
    plot.ygrid.grid_line_color = None
    plot.xaxis.axis_label = 'time'
    line = plot.line(x='x', y='y_number', source=cds, color='white', legend_label='number')
    # tooltips refer to data source columns with the @ prefix
    plot.add_tools(HoverTool(renderers=[line], tooltips=[('time', '@x0'), ('number', '@y_number')], mode='vline'))
    plot.legend.location = 'bottom_left'
    # plot.legend.orientation = 'horizontal'
    plot.legend.label_text_color = 'white'
    plot.legend.background_fill_color = 'black'
    plot.legend.background_fill_alpha = 0.1
    show(plot)
if __name__ == "__main__":
    cds = get_data()
    plot_figure(cds)

As I wrote in the comment, the trick is to resample your data. I use mean() so that a missing index is filled with np.nan.
Don't be confused by how I create the DataFrame. Somehow creating the DataFrame the way you did didn't work for me.
data = [['08:00:00', 11], ['08:30:00', 15],
['09:00:00', 13], ['09:30:00', 17],
['10:00:00', 15], ['10:30:00', 19],
['14:00:00', 17], ['14:30:00', 13],
['15:00:00', 15], ['15:30:00', 11],
['16:00:00', 13], ['16:30:00', 17],
['19:00:00', 15], ['19:30:00', 19],
['20:00:00', 17], ['20:30:00', 13],
['21:00:00', 15], ['21:30:00', 11],
['22:00:00', 13], ['22:30:00', 17]]
time=[]
number=[]
for item in data:
    time.append(item[0])
    number.append(item[1])
df = pd.DataFrame({'x':time, 'number':number})
df['x'] = pd.to_datetime(df['x'], format='%H:%M:%S')
df = df.set_index('x')
# one row per 30 minutes; slots without data become NaN
df = df.resample('30min').mean()
df['time'] = df.index.strftime('%H:%M')
df
Output
Comment
Your data doesn't contain any information about the date. So this line
pd.to_datetime(df['x'], format='%H:%M:%S')
falls back to the default date 1900-01-01. If you zoom out of the figure, this may be confusing and unwanted.
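To make the effect of the resampling step concrete, here is a small pandas-only sketch. It rebuilds the data from the question, attaches a date to the times so that nothing falls back to 1900-01-01, and counts the empty half-hour slots. The date `2024-01-01` and the variable names are assumptions chosen for illustration. The NaN rows are what should make a line glyph break instead of bridging the gaps.

```python
import pandas as pd

times = ['08:00:00', '08:30:00', '09:00:00', '09:30:00', '10:00:00', '10:30:00',
         '14:00:00', '14:30:00', '15:00:00', '15:30:00', '16:00:00', '16:30:00',
         '19:00:00', '19:30:00', '20:00:00', '20:30:00', '21:00:00', '21:30:00',
         '22:00:00', '22:30:00']
numbers = [11, 15, 13, 17, 15, 19, 17, 13, 15, 11,
           13, 17, 15, 19, 17, 13, 15, 11, 13, 17]

# hypothetical date, used only so the timestamps do not default to 1900-01-01
day = '2024-01-01'
index = pd.to_datetime([day + ' ' + t for t in times])
df = pd.DataFrame({'number': numbers}, index=index)

# one row per half hour between the first and last timestamp;
# slots without a measurement get NaN
resampled = df.resample('30min').mean()
resampled['time'] = resampled.index.strftime('%H:%M')

n_rows = len(resampled)
n_missing = int(resampled['number'].isna().sum())
print(n_rows, n_missing)  # 30 10
```

Note that this only interrupts the line; the datetime axis still spans the empty ranges.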

How to change seaborn regplot scatterplot to lineplot?

I am trying to change the scatterplot into a lineplot. I tried plot.lines[0].set_linestyle("-"), but this only affects the regression line, which is already drawn as a line.
I understand that if I used sns.lineplot there is a setting in there to turn on regression; however, I am trying to do this using regplot.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [10, 30, 60, 90, 60, 30]})
plot = sns.regplot(x="x", y="y", data=df, ci=65)
plt.show()
The reason I want to change the scatterplot to a lineplot is that it is otherwise hard to see what's going on with large datasets.
To clarify, I am trying to display a lineplot instead of a scatterplot for the original data.

This looks a bit odd, but I'm guessing it's what you want?
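import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [10, 30, 60, 90, 60, 30]})
# draw only the regression line and its confidence band, without the scatter points
sns.regplot(x="x", y="y", data=df, ci=65, scatter=False)
# draw the original data as a line on the same axes
sns.lineplot(x="x", y="y", data=df)
plt.show()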

Python - legend values duplicate

I'm plotting a matrix, as shown below, and the legend repeats over and over again. I've tried using numpoints = 1 and this didn't seem to have any effect. Any hints?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 8) # set default figure size, 10in by 8in
data = pd.read_csv('data/assg-03-data.csv', names=['exam1', 'exam2', 'admitted'])
x = data[['exam1', 'exam2']].as_matrix()
y = data.admitted.as_matrix()
# plot the visualization of the exam scores here
no_admit = np.where(y == 0)
admit = np.where(y == 1)
from pylab import *
# plot the example figure
plt.figure()
# plot the points in our two categories, y=0 and y=1, using markers to indicate
# the category or output
plt.plot(x[no_admit,0], x[no_admit,1],'yo', label = 'Not admitted', markersize=8, markeredgewidth=1)
plt.plot(x[admit,0], x[admit,1], 'r^', label = 'Admitted', markersize=8, markeredgewidth=1)
# add some labels and titles
plt.xlabel('$Exam 1 score$')
plt.ylabel('$Exam 2 score$')
plt.title('Admit/No Admit as a function of Exam Scores')
plt.legend()

It's nearly impossible to understand the problem if you don't put an example of data format especially if one is not familiar with pandas.
However, assuming your input has this format:
x=pd.DataFrame(np.array([np.arange(10),np.arange(10)**2]).T,columns=['exam1','exam2']).as_matrix()
y=pd.DataFrame(np.arange(10)%2).as_matrix()
>>x
array([[ 0, 0],
[ 1, 1],
[ 2, 4],
[ 3, 9],
[ 4, 16],
[ 5, 25],
[ 6, 36],
[ 7, 49],
[ 8, 64],
[ 9, 81]])
>> y
array([[0],
[1],
[0],
[1],
[0],
[1],
[0],
[1],
[0],
[1]])
The duplicated legend entries come from the indexing: np.where returns a tuple of index arrays, so x[no_admit,0] is a 2D array rather than a vector, and plt.plot then draws each column as a separate line with its own legend entry. It would not happen with vectors (1D arrays).
For my example this works (not sure if it is the cleanest form):
plt.plot(x[no_admit,0][0], x[no_admit,1][0],'yo', label = 'Not admitted', markersize=8, markeredgewidth=1)
plt.plot(x[admit,0][0], x[admit,1][0], 'r^', label = 'Admitted', markersize=8, markeredgewidth=1)
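An alternative that avoids the extra `[0]` is to index with boolean masks instead of the tuples returned by `np.where`, so that every selection is a plain 1D array. The sketch below uses made-up stand-in data (the real CSV is not available) and the names `admit` and `no_admit` from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical stand-in for the exam scores and the admitted flag
x = np.column_stack([np.arange(10), np.arange(10) ** 2])
y = np.arange(10) % 2

# indexing with the tuple from np.where yields a 2D array with a single row
shape_with_where = x[np.where(y == 0), 0].shape  # (1, 5)

# boolean masks select rows directly and yield 1D arrays
no_admit = y == 0
admit = y == 1

fig, ax = plt.subplots()
ax.plot(x[no_admit, 0], x[no_admit, 1], 'yo', label='Not admitted', markersize=8)
ax.plot(x[admit, 0], x[admit, 1], 'r^', label='Admitted', markersize=8)
ax.set_xlabel('Exam 1 score')
ax.set_ylabel('Exam 2 score')
ax.legend()

handles, labels = ax.get_legend_handles_labels()
print(labels)  # ['Not admitted', 'Admitted']
```

Each `plot` call now creates exactly one line, so the legend has one entry per category.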

How to make grid of the irregular data?

I have the numpy arrays of longitudes, latitudes, and the data.
I want to plot this data as a raster image using numpy, scipy, and matplotlib.
import numpy as np
from matplotlib.mlab import griddata
import matplotlib.pyplot as plt
longitudes = np.array([[139.79391479492188, 140.51760864257812, 141.19119262695312, 141.82083129882812, 142.41165161132812],
[139.79225158691406, 140.51416015625, 141.18606567382812, 141.8140869140625, 142.40338134765625],
[139.78591918945312, 140.50637817382812, 141.17694091796875, 141.80377197265625, 142.3919677734375],
[139.78387451171875, 140.50253295898438, 141.17147827148438, 141.79678344726562, 142.38360595703125],
[139.77781677246094, 140.4949951171875, 141.16250610351562, 141.78646850585938, 142.37196350097656]],dtype=float)
latitudes = np.array([[55.61929702758789, 55.621070861816406, 55.61888122558594, 55.613487243652344, 55.60547637939453],
[55.53120040893555, 55.532840728759766, 55.53053665161133, 55.525047302246094, 55.5169677734375],
[55.44305419921875, 55.444580078125, 55.44219207763672, 55.43663024902344, 55.42848587036133],
[55.35470199584961, 55.356109619140625, 55.353614807128906, 55.34796905517578, 55.33975601196289],
[55.26683807373047, 55.268131256103516, 55.26553726196289, 55.25981140136719, 55.25152587890625]],dtype=float)
data = np.array([[10, 10, 10, 10, 10],
[20, 20, 20, 20, 20],
[30, 30, 30, 30, 30],
[40, 40, 40, 40, 40],
[50, 50, 50, 50, 50]],dtype=float)
x = longitudes.ravel()
y = latitudes.ravel()
z = data.ravel()
xMin, xMax = np.min(x), np.max(x)
yMin, yMax = np.min(y), np.max(y)
xi = np.arange(xMin, xMax, 0.005)  ## chosen spacing of 0.005
yi = np.arange(yMin, yMax, 0.005)  ## chosen spacing of 0.005
The data are not exactly on a grid, and I could not work out how to proceed from here:
zi_matplotlib = griddata(x, y, z, xi, yi, interp='linear')
from scipy.interpolate import griddata ##Using scipy method
zi_scipy = griddata((x, y), z, (xi, yi), method='nearest')
plt.imshow(????)
Any ideas or solutions, please?

You can use interpolation to convert the distorted grid into a regular grid. The interpolation fits the original data points and returns a function that can be evaluated at any point of your choosing, and in this case, you would choose a regular grid of points.
Here's an example:
import numpy as np
from scipy.interpolate import interp2d
import matplotlib.pyplot as plt
# your data here, as posted in the question
f = interp2d(lon, lat, data, kind="cubic", bounds_error=False)
dlon, dlat = 1.2, .2
xlon = np.linspace(min(lon.flat), max(lon.flat), 20)
xlat = np.linspace(min(lat.flat), max(lat.flat), 20)
# the next few lines are because there seems to be a bug in interp2d
# instead one would just want to use r = interp2d(X.flat, Y.flat) (where X,Y are as below)
# but for the version of scipy I'm using ('0.13.3'), this throws an exception.
r = np.zeros((len(xlon), len(xlat)))
for i, rlat in enumerate(xlat):
    for j, rlon in enumerate(xlon):
        r[i,j] = f(rlon, rlat)
X, Y = np.meshgrid(xlon, xlat)
plt.imshow(r, interpolation="nearest", origin="lower", extent=[min(xlon), max(xlon), min(xlat), max(xlat)], aspect=6.)
plt.scatter(lon.flat, lat.flat, color='k')
plt.show()
Here, I left the mesh fairly coarse (20x20) and used interpolation="nearest" so you could still see the colored squares representing each of the interpolated values, computed, of course, on a regular grid (created using the two linspace calls). Note also the use of origin="lower", which gives the image and the scatter plot the same orientation.
To interpret this, the main thing to notice is the change of values from left to right. The data were specified as constant along each horizontal row of points, but because the points where they were specified are warped, the interpolated values slowly change as you move across. For example, the lowest scatter point on the right should have approximately the same color as the highest one towards the left. Another indication is that there is not much color change between any of the two leftmost pairs, but a lot between the two rightmost, where the warping is largest.
Note that the interpolation could be evaluated at any points, not only on a regular grid, which is used here just for imshow as per the original question. Also note that I used bounds_error=False so I could evaluate a few points slightly outside of the original dataset, but be very careful with this, as values outside of the original data quickly become unreasonable because the cubics are evaluated beyond the region where they were fit.
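Since interp2d has been deprecated and, as far as I know, removed in recent SciPy releases, the same idea can also be sketched with scipy.interpolate.griddata, which the question already imports. This is a swapped-in technique (linear interpolation on scattered points), not the cubic fit used above. The warped 5x5 grid is a hypothetical stand-in for the posted longitudes and latitudes, and the spacings 0.05 and 0.01 are assumptions. Note that np.arange takes a step, whereas the third argument of np.linspace is a number of points.

```python
import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt

# hypothetical, slightly warped 5x5 grid standing in for the question's arrays
col, row = np.meshgrid(np.arange(5.0), np.arange(5.0))
lon = 139.8 + 0.65 * col - 0.004 * row
lat = 55.62 - 0.088 * row - 0.003 * (col - 1.0) ** 2
data = 10.0 * (row + 1.0)

# scattered sample locations as an (N, 2) array, with one value per location
points = np.column_stack([lon.ravel(), lat.ravel()])
values = data.ravel()

# regular target grid; np.arange takes a spacing, np.linspace takes a count
xi = np.arange(lon.min(), lon.max(), 0.05)
yi = np.arange(lat.min(), lat.max(), 0.01)
XI, YI = np.meshgrid(xi, yi)

# linear interpolation; cells outside the convex hull of the samples are NaN
zi = griddata(points, values, (XI, YI), method='linear')


def show_raster():
    """Draw the regular grid as an image with the sample points on top."""
    fig, ax = plt.subplots()
    image = ax.imshow(zi, origin='lower', aspect='auto',
                      extent=[xi.min(), xi.max(), yi.min(), yi.max()])
    ax.scatter(lon.ravel(), lat.ravel(), color='k', s=10)
    fig.colorbar(image, ax=ax)
    return ax


if __name__ == "__main__":
    show_raster()
    plt.show()
```

Because the method is linear, the interpolated values stay within the range of the input data, which avoids the runaway values mentioned above for cubics evaluated outside the fitted region.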

Assuming that longitudes and latitudes are equally spaced, you can use imshow directly as it features interpolation:
import numpy as np
import matplotlib.pyplot as plt
longitudes = np.array([[139.79391479492188, 140.51760864257812, 141.19119262695312, 141.82083129882812, 142.41165161132812],
[139.79225158691406, 140.51416015625, 141.18606567382812, 141.8140869140625, 142.40338134765625],
[139.78591918945312, 140.50637817382812, 141.17694091796875, 141.80377197265625, 142.3919677734375],
[139.78387451171875, 140.50253295898438, 141.17147827148438, 141.79678344726562, 142.38360595703125],
[139.77781677246094, 140.4949951171875, 141.16250610351562, 141.78646850585938, 142.37196350097656]],dtype=float)
latitudes = np.array([[55.61929702758789, 55.621070861816406, 55.61888122558594, 55.613487243652344, 55.60547637939453],
[55.53120040893555, 55.532840728759766, 55.53053665161133, 55.525047302246094, 55.5169677734375],
[55.44305419921875, 55.444580078125, 55.44219207763672, 55.43663024902344, 55.42848587036133],
[55.35470199584961, 55.356109619140625, 55.353614807128906, 55.34796905517578, 55.33975601196289],
[55.26683807373047, 55.268131256103516, 55.26553726196289, 55.25981140136719, 55.25152587890625]],dtype=float)
data = np.array([[10, 10, 10, 10, 10],
[20, 20, 20, 20, 20],
[30, 30, 30, 30, 30],
[40, 40, 40, 40, 40],
[50, 50, 50, 50, 50]],dtype=float)
extent = (longitudes[0,0], longitudes[0,-1], latitudes[0,0], latitudes[-1,0])
plt.imshow(data, interpolation='bilinear', extent=extent, aspect='auto')
plt.show()
I'm aware that this does not exactly answer your question. But I think it is an easy solution to the underlying problem.
Edit
I just realized that your data is in fact not exactly a grid, but almost. You have to decide if you still want to use my solution...
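If the goal is only to draw the almost-regular grid without resampling it, one option is matplotlib's pcolormesh, which accepts 2D coordinate arrays and therefore follows the warped cell positions. This is a minimal sketch, not the method from the answer above; the grid below is a hypothetical stand-in for the posted longitudes and latitudes, and `shading='gouraud'` is one possible choice that requires the coordinate arrays to have the same shape as the data.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical, slightly warped 5x5 grid standing in for the question's arrays
col, row = np.meshgrid(np.arange(5.0), np.arange(5.0))
longitudes = 139.8 + 0.65 * col - 0.004 * row
latitudes = 55.62 - 0.088 * row - 0.003 * (col - 1.0) ** 2
data = 10.0 * (row + 1.0)

fig, ax = plt.subplots()
# the 2D coordinate arrays place every data value at its true position
mesh = ax.pcolormesh(longitudes, latitudes, data, shading='gouraud')
fig.colorbar(mesh, ax=ax, label='data')
ax.scatter(longitudes.ravel(), latitudes.ravel(), color='k', s=10)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')

if __name__ == "__main__":
    plt.show()
```

Unlike imshow with an extent, this does not assume equal spacing, but the result is a vector-style mesh rather than a raster on a regular grid.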

Here's an example of a scatter 3d plot using your data, breaking out each set of lat/long data in its own series with respective colored markers.
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
longitudes = np.array([[139.79391479492188, 140.51760864257812, 141.19119262695312, 141.82083129882812, 142.41165161132812],
[139.79225158691406, 140.51416015625, 141.18606567382812, 141.8140869140625, 142.40338134765625],
[139.78591918945312, 140.50637817382812, 141.17694091796875, 141.80377197265625, 142.3919677734375],
[139.78387451171875, 140.50253295898438, 141.17147827148438, 141.79678344726562, 142.38360595703125],
[139.77781677246094, 140.4949951171875, 141.16250610351562, 141.78646850585938, 142.37196350097656]],dtype=float)
latitudes = np.array([[55.61929702758789, 55.621070861816406, 55.61888122558594, 55.613487243652344, 55.60547637939453],
[55.53120040893555, 55.532840728759766, 55.53053665161133, 55.525047302246094, 55.5169677734375],
[55.44305419921875, 55.444580078125, 55.44219207763672, 55.43663024902344, 55.42848587036133],
[55.35470199584961, 55.356109619140625, 55.353614807128906, 55.34796905517578, 55.33975601196289],
[55.26683807373047, 55.268131256103516, 55.26553726196289, 55.25981140136719, 55.25152587890625]],dtype=float)
data = np.array([[10, 10, 10, 10, 10],
[20, 20, 20, 20, 20],
[30, 30, 30, 30, 30],
[40, 40, 40, 40, 40],
[50, 50, 50, 50, 50]],dtype=float)
colors = ['r','g','b','k','k']
markers = ['o','o','o','o','^']
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for i in range(5):
    ax.scatter(longitudes[i], latitudes[i], data[i], c=colors[i], marker=markers[i])
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_zlabel('Data')
plt.show()
Which results in an image like

Python display specific values on x-axis using matplotlib

I'm querying data from a simple sqlite3 DB, pulling a list of the number of connections per port observed on my system. I'm trying to graph this as a simple bar chart using matplotlib.
Thus far, I'm using the following code:
import matplotlib as mpl
mpl.use('Agg') # force no x11
import matplotlib.pyplot as plt
import sqlite3
con = sqlite3.connect('test.db')
cur = con.cursor()
cur.execute('''
SELECT dst_port, count(dst_port) as count from logs
where dst_port != 0
group by dst_port
order by count desc;
'''
)
data = cur.fetchall()
dst_ports, dst_port_count = zip(*data)
#dst_ports = [22, 53223, 40959, 80, 3389, 23, 443, 35829, 8080, 4899, 21320, 445, 3128, 44783, 4491, 9981, 8001, 21, 1080, 8081, 3306, 8002, 8090]
#dst_port_count = [5005, 145, 117, 41, 34, 21, 17, 16, 15, 11, 11, 8, 8, 8, 6, 6, 4, 3, 3, 3, 1, 1, 1]
print(dst_ports)
print(dst_port_count)
fig = plt.figure()
# aesthetics and data
plt.grid()
plt.bar(dst_ports, dst_port_count, align='center')
#plt.xticks(dst_ports)
# labels
plt.title('Number of connections to port')
plt.xlabel('Destination Port')
plt.ylabel('Connection Attempts')
# save figure
fig.savefig('temp.png')
When I run the above, the data is successfully retrieved from the DB and a graph is generated. However, the graph isn't what I was expecting. For example, the x-axis covers the whole numeric range of port values. I'm looking for it to display only the values in dst_ports. I've tried using xticks, but this doesn't work either.
I've included some sample data in the above code, commented out, which may be useful.
In addition, here is an example of the graph output from the above code:
And also a graph when using xticks:

You need to create some xdata by np.arange():
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt
dst_ports = [22, 53223, 40959, 80, 3389, 23, 443, 35829, 8080, 4899, 21320, 445, 3128, 44783, 4491, 9981, 8001, 21, 1080, 8081, 3306, 8002, 8090]
dst_port_count = [5005, 145, 117, 41, 34, 21, 17, 16, 15, 11, 11, 8, 8, 8, 6, 6, 4, 3, 3, 3, 1, 1, 1]
fig = plt.figure(figsize=(12, 4))
# aesthetics and data
plt.grid()
x = np.arange(1, len(dst_ports)+1)
plt.bar(x, dst_port_count, align='center')
plt.xticks(x, dst_ports, rotation=45)
# labels
plt.title('Number of connections to port')
plt.xlabel('Destination Port')
plt.ylabel('Connection Attempts')
Here is the output:
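A variation on the same idea, assuming a matplotlib version with categorical axis support (2.1 or later, to my knowledge): convert the port numbers to strings so that matplotlib treats them as categories and spaces the bars evenly by itself. The name `port_labels` is introduced here for illustration.

```python
import matplotlib.pyplot as plt

dst_ports = [22, 53223, 40959, 80, 3389, 23, 443, 35829, 8080, 4899, 21320, 445,
             3128, 44783, 4491, 9981, 8001, 21, 1080, 8081, 3306, 8002, 8090]
dst_port_count = [5005, 145, 117, 41, 34, 21, 17, 16, 15, 11, 11, 8,
                  8, 8, 6, 6, 4, 3, 3, 3, 1, 1, 1]

# strings are plotted as categories: one evenly spaced slot per port,
# in the order given, instead of positions on a numeric axis
port_labels = [str(port) for port in dst_ports]

fig, ax = plt.subplots(figsize=(12, 4))
ax.grid()
bars = ax.bar(port_labels, dst_port_count, align='center')
ax.tick_params(axis='x', labelrotation=45)
ax.set_title('Number of connections to port')
ax.set_xlabel('Destination Port')
ax.set_ylabel('Connection Attempts')
fig.tight_layout()

if __name__ == "__main__":
    fig.savefig('temp.png')
```

Because port 22 dominates the counts, a logarithmic y-axis (`ax.set_yscale('log')`) may make the smaller bars easier to read.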
