I was working on clustering a lot of data, which has two different clusters.
The first type is a 6-dimensional cluster whereas the second type is a 12-dimensional cluster. For now I have decided to use kmeans (as it seems the most intuitive clustering algorithm for the start).
The question is how can I map these clusters on a 2d plot so that I can infer whether kmeans is working or not. I would like to use matplotlib, but any other python package is fine.
Cluster 1 is a cluster made up of these data types (int,float,float,int,float,int)
Cluster 2 is a cluster made up of 12 float types.
Trying to get an output similar to this
Any tips will be useful.
Well after searching internet and getting lots of weird comment less solutions. I was able to figure out how to do it. Here's the code if you are trying to do something similar. It contains codes from various sources and a lot of them written/edited by me. I hope its easier to understand than others out there.
The function was based on kmeans2 from scipy which returns the centroid_list and label_list. The kmeansdata is the numpy array passed to kmeans2 for clustering and the num_clusters denotes the number of clusters passed to kmeans2.
The code writes back a new png file ensuring it doesn't overwrite something else. Also plots only 50 clusters (If you have 1000's of clusters, then dont try to output all of them)
(It was written for python2.7, should work for other versions too I guess.)
import numpy
import colorsys
import random
import os
from matplotlib.mlab import PCA as mlabPCA
from matplotlib import pyplot as plt
def get_colors(num_colors):
"""
Function to generate a list of randomly generated colors
The function first generates 256 different colors and then
we randomly select the number of colors required from it
num_colors -> Number of colors to generate
colors -> Consists of 256 different colors
random_colors -> Randomly returns required(num_color) colors
"""
colors = []
random_colors = []
# Generate 256 different colors and choose num_clors randomly
for i in numpy.arange(0., 360., 360. / 256.):
hue = i / 360.
lightness = (50 + numpy.random.rand() * 10) / 100.
saturation = (90 + numpy.random.rand() * 10) / 100.
colors.append(colorsys.hls_to_rgb(hue, lightness, saturation))
for i in range(0, num_colors):
random_colors.append(colors[random.randint(0, len(colors) - 1)])
return random_colors
def random_centroid_selector(total_clusters , clusters_plotted):
"""
Function to generate a list of randomly selected
centroids to plot on the output png
total_clusters -> Total number of clusters
clusters_plotted -> Number of clusters to plot
random_list -> Contains the index of clusters
to be plotted
"""
random_list = []
for i in range(0 , clusters_plotted):
random_list.append(random.randint(0, total_clusters - 1))
return random_list
def plot_cluster(kmeansdata, centroid_list, label_list , num_cluster):
"""
Function to convert the n-dimensional cluster to
2-dimensional cluster and plotting 50 random clusters
file%d.png -> file where the output is stored indexed
by first available file index
e.g. file1.png , file2.png ...
"""
mlab_pca = mlabPCA(kmeansdata)
cutoff = mlab_pca.fracs[1]
users_2d = mlab_pca.project(kmeansdata, minfrac=cutoff)
centroids_2d = mlab_pca.project(centroid_list, minfrac=cutoff)
colors = get_colors(num_cluster)
plt.figure()
plt.xlim([users_2d[:, 0].min() - 3, users_2d[:, 0].max() + 3])
plt.ylim([users_2d[:, 1].min() - 3, users_2d[:, 1].max() + 3])
# Plotting 50 clusters only for now
random_list = random_centroid_selector(num_cluster , 50)
# Plotting only the centroids which were randomly_selected
# Centroids are represented as a large 'o' marker
for i, position in enumerate(centroids_2d):
if i in random_list:
plt.scatter(centroids_2d[i, 0], centroids_2d[i, 1], marker='o', c=colors[i], s=100)
# Plotting only the points whose centers were plotted
# Points are represented as a small '+' marker
for i, position in enumerate(label_list):
if position in random_list:
plt.scatter(users_2d[i, 0], users_2d[i, 1] , marker='+' , c=colors[position])
filename = "name"
i = 0
while True:
if os.path.isfile(filename + str(i) + ".png") == False:
#new index found write file and return
plt.savefig(filename + str(i) + ".png")
break
else:
#Changing index to next number
i = i + 1
return
plot_cluster(X[:], kmean.cluster_centers_, kmean.labels_, clusters)
Related
In my work I have the task to read in a CSV file and do calculations with it. The CSV file consists of 9 different columns and about 150 lines with different values acquired from sensors. First the horizontal acceleration was determined, from which the distance was derived by double integration. This represents the lower plot of the two plots in the picture. The upper plot represents the so-called force data. The orange graph shows the plot over the 9th column of the CSV file and the blue graph shows the plot over the 7th column of the CSV file.
As you can see I have drawn two vertical lines in the lower plot in the picture. These lines represent the x-value, which in the upper plot is the global minimum of the orange function and the intersection with the blue function. Now I want to do the following, but I need some help: While I want the intersection point between the first vertical line and the graph to be (0,0), i.e. the function has to be moved down. How do I achieve this? Furthermore, the piece of the function before this first intersection point (shown in purple) should be omitted, so that the function really only starts at this point. How can I do this?
In the following picture I try to demonstrate how I would like to do that:
If you need my code, here you can see it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld
import scipy.integrate as inte
from scipy.signal import find_peaks
import pandas as pd
import os
# Loading of the values
print(os.path.realpath(__file__))
a,b = os.path.split(os.path.realpath(__file__))
print(os.chdir(a))
print(os.chdir('..'))
print(os.chdir('..'))
path=os.getcwd()
path=path+"\\Data\\1 Fabienne\\Test1\\left foot\\50cm"
print(path)
dataListStride = ld.loadData(path)
indexStrideData = 0
strideData = dataListStride[indexStrideData]
#%%Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
a = ((m.cos(m.radians(yAngle)))*yAcceleration)-((m.sin(m.radians(yAngle)))*xAcceleration)
return a
resultsHorizontal = list()
for i in range (len(strideData)):
strideData_yAngle = strideData.to_numpy()[i, 2]
strideData_xAcceleration = strideData.to_numpy()[i, 4]
strideData_yAcceleration = strideData.to_numpy()[i, 5]
resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)
#plt.plot(x_values, resultsHorizontal)
#%%
#x-axis "convert" into time: 100 Hertz makes 0.01 seconds
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor
#Calculation of the global high and low points
heel_one=pd.Series(strideData.iloc[:,7])
plt.scatter(heel_one.idxmax()*scale_factor,heel_one.max(), color='red')
plt.scatter(heel_one.idxmin()*scale_factor,heel_one.min(), color='blue')
heel_two=pd.Series(strideData.iloc[:,9])
plt.scatter(heel_two.idxmax()*scale_factor,heel_two.max(), color='orange')
plt.scatter(heel_two.idxmin()*scale_factor,heel_two.min(), color='green')#!
#Plot of force data
plt.plot(x_values[:-1],strideData.iloc[:,7]) #force heel
plt.plot(x_values[:-1],strideData.iloc[:,9]) #force toe
# while - loop to calculate the point of intersection with the blue function
i = heel_one.idxmax()
while strideData.iloc[i,7] > strideData.iloc[i,9]:
i = i-1
# Length calculation between global minimum orange function and intersection with blue function
laenge=(i-heel_two.idxmin())*scale_factor
print(laenge)
#%% Integration of horizontal acceleration
velocity = inte.cumtrapz(resultsHorizontal,x_values)
plt.plot(x_values[:-1], velocity)
#%% Integration of the velocity
s = inte.cumtrapz(velocity, x_values[:-1])
plt.plot(x_values[:-2],s)
I hope it's clear what I want to do. Thanks for helping me!
I didn't dig all the way through your code, but the following tricks may be useful.
Say you have x and y values:
x = np.linspace(0,3,100)
y = x**2
Now, you only want the values corresponding to, say, .5 < x < 1.5. First, create a boolean mask for the arrays as follows:
mask = np.logical_and(.5 < x, x < 1.5)
(If this seems magical, then run x < 1.5 in your interpreter and observe the results).
Then use this mask to select your desired x and y values:
x_masked = x[mask]
y_masked = y[mask]
Then, you can translate all these values so that the first x,y pair is at the origin:
x_translated = x_masked - x_masked[0]
y_translated = y_masked - y_masked[0]
Is this the type of thing you were looking for?
Using matplotlib in Python I'm plotting anywhere between 20 and 50 lines. Using matplotlib's sliding colour scales these become indistinguishable after a certain number of lines are plotted (well before 20).
While I've seen a few example of code in Matlab and C# to create colour maps of an arbitrary number of colours which are maximally distinguishable from one another I can't find anything for Python.
Can anyone point me in the direction of something in Python that will do this?
Cheers
I loved the idea of palette created by #xuancong84 and modified his code a bit to make it not depending on alpha channel. I drop it here for others to use, thank you #xuancong84!
import math
import numpy as np
from matplotlib.colors import ListedColormap
from matplotlib.cm import hsv
def generate_colormap(number_of_distinct_colors: int = 80):
if number_of_distinct_colors == 0:
number_of_distinct_colors = 80
number_of_shades = 7
number_of_distinct_colors_with_multiply_of_shades = int(math.ceil(number_of_distinct_colors / number_of_shades) * number_of_shades)
# Create an array with uniformly drawn floats taken from <0, 1) partition
linearly_distributed_nums = np.arange(number_of_distinct_colors_with_multiply_of_shades) / number_of_distinct_colors_with_multiply_of_shades
# We are going to reorganise monotonically growing numbers in such way that there will be single array with saw-like pattern
# but each saw tooth is slightly higher than the one before
# First divide linearly_distributed_nums into number_of_shades sub-arrays containing linearly distributed numbers
arr_by_shade_rows = linearly_distributed_nums.reshape(number_of_shades, number_of_distinct_colors_with_multiply_of_shades // number_of_shades)
# Transpose the above matrix (columns become rows) - as a result each row contains saw tooth with values slightly higher than row above
arr_by_shade_columns = arr_by_shade_rows.T
# Keep number of saw teeth for later
number_of_partitions = arr_by_shade_columns.shape[0]
# Flatten the above matrix - join each row into single array
nums_distributed_like_rising_saw = arr_by_shade_columns.reshape(-1)
# HSV colour map is cyclic (https://matplotlib.org/tutorials/colors/colormaps.html#cyclic), we'll use this property
initial_cm = hsv(nums_distributed_like_rising_saw)
lower_partitions_half = number_of_partitions // 2
upper_partitions_half = number_of_partitions - lower_partitions_half
# Modify lower half in such way that colours towards beginning of partition are darker
# First colours are affected more, colours closer to the middle are affected less
lower_half = lower_partitions_half * number_of_shades
for i in range(3):
initial_cm[0:lower_half, i] *= np.arange(0.2, 1, 0.8/lower_half)
# Modify second half in such way that colours towards end of partition are less intense and brighter
# Colours closer to the middle are affected less, colours closer to the end are affected more
for i in range(3):
for j in range(upper_partitions_half):
modifier = np.ones(number_of_shades) - initial_cm[lower_half + j * number_of_shades: lower_half + (j + 1) * number_of_shades, i]
modifier = j * modifier / upper_partitions_half
initial_cm[lower_half + j * number_of_shades: lower_half + (j + 1) * number_of_shades, i] += modifier
return ListedColormap(initial_cm)
These are the colours I get:
from matplotlib import pyplot as plt
import numpy as np
N = 16
M = 7
H = np.arange(N*M).reshape([N,M])
fig = plt.figure(figsize=(10, 10))
ax = plt.pcolor(H, cmap=generate_colormap(N*M))
plt.show()
Recently, I also encountered the same problem. So I created the following simple Python code to generate visually distinguishable colors for jupyter notebook matplotlib. It is not quite maximally perceptually distinguishable, but it works better than most built-in colormaps in matplotlib.
The algorithm splits the HSV scale into 2 chunks, 1st chunk with increasing RGB value, 2nd chunk with decreasing alpha so that the color can blends into the white background.
Take note that if you are using any toolkit other than jupyter notebook, you must make sure the background is white, otherwise, alpha blending will be different and the resultant color will also be different.
Furthermore, the distinctiveness of color highly depends on your computer screen, projector, etc. A color palette distinguishable on one screen does not necessarily imply on another. You must physically test it out if you want to use for presentation.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
def generate_colormap(N):
arr = np.arange(N)/N
N_up = int(math.ceil(N/7)*7)
arr.resize(N_up)
arr = arr.reshape(7,N_up//7).T.reshape(-1)
ret = matplotlib.cm.hsv(arr)
n = ret[:,3].size
a = n//2
b = n-a
for i in range(3):
ret[0:n//2,i] *= np.arange(0.2,1,0.8/a)
ret[n//2:,3] *= np.arange(1,0.1,-0.9/b)
# print(ret)
return ret
N = 16
H = np.arange(N*N).reshape([N,N])
fig = plt.figure(figsize=(10, 10))
ax = plt.pcolor(H, cmap=ListedColormap(generate_colormap(N*N)))
The goal here is to color value above a certain threshold into one color and values below this threshold into another color. The code below tries to just separate it into two histographs but it only looks balanced if the threshold is at 50%. I'm assuming I must play around with the discreetlevel variable.
finalutilityrange is some vector with a bunch of values(you must generate it to test the code), which I am trying to graph. The value deter is the value that determines whether they will be blue or red. discreetlevel is just the amount of bins I would want.
import random
import numpy as np
import matplotlib.pyplot as plt
discreetlevel = 10
deter = 2
for x in range(0,len(finalutilityrange)):
if finalutilityrange[x-1]>=deter:
piraterange.append(finalutilityrange[x-1])
else:
nonpiraterange.append(finalutilityrange[x-1])
plt.hist(piraterange,bins=discreetlevel,normed=False,cumulative=False,color = 'b')
plt.hist(nonpiraterange,bins=discreetlevel),normed=False,cumulative=False,color = 'r')
plt.title("Histogram")
plt.xlabel("Utlity")
plt.ylabel("Probability")
plt.show()
This solution is a bit more complex than #user2699's. I am just presenting it for completeness. You have full control over the patch objects that hist returns, so if you can ensure that the threshold you are using is exactly on a bin edge, it is easy to change to color of selected patches. You can do this because hist can accept a sequence of bin edges as the bins parameter.
import numpy as np
from matplotlib import pyplot as plt
# Make sample data
finalutilityrange = np.random.randn(100)
discreetlevel = 10
deter = 0.2
# Manually create `discreetlevel` bins anchored to `deter`
binsAbove = round(discreetlevel * np.count_nonzero(finalutilityrange > deter) / finalutilityrange.size)
binsBelow = discreetlevel - binsAbove
binwidth = max((finalutilityrange.max() - deter) / binsAbove,
(deter - finalutilityrange.min()) / binsBelow)
bins = np.concatenate([
np.arange(deter - binsBelow * binwidth, deter, binwidth),
np.arange(deter, deter + (binsAbove + 0.5) * binwidth, binwidth)
])
# Use the bins to make a single histogram
h, bins, patches = plt.hist(finalutilityrange, bins, color='b')
# Change the appropriate patches to red
plt.setp([p for p, b in zip(patches, bins) if b >= deter], color='r')
The result is a homogenous histogram with bins of different colors:
The bins may be a tad wider than if you did not anchor to deter. Either the first or last bin will generally go a little past the edge of the data.
This answer doesn't address your code since it isn't self-contained, but for what you're trying to do the default histogram should work (assuming numpy/pyplot is loaded)
x = randn(100)
idx = x < 0.2 # Threshold to separate values
hist([x[idx], x[~idx]], color=['b', 'r'])
Explanation:
first line just generates some random data to test,
creates an index for where the data is below some threshold, this can be negated with ~ to find where it's above the threshold
Last line plots the histogram. The command takes a list of separate groups to plot, which doesn't make a big difference here but if normed=True it will
There's more the hist plot can do, so look over the documentation before you accidentally implement it yourself.
Just as above do:
x = np.random.randn(100)
threshold_x = 0.2 # Threshold to separate values
x_lower, x_upper = (
[_ for _ in x if _ < threshold_x],
[_ for _ in x if _ >= threshold_x]
)
hist([x_lower, x_upper], color=['b', 'r'])
I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there are equal number of elements in each bin. I tried using matplotlibs hist function but that lets me decide number of bins. How do I go about plotting this?(Normal plot and hist work but its not what is needed)
I have 10000 entries. Only 200 have values greater than 0 and lie between 0.0005 and 0.2. This distribution isnt even as 0.2 only one element has whereas 2000 approx have value 0.0005. So plotting it was an issue as the bins had to be of unequal width with equal number of elements
The task does not make much sense to me, but the following code does, what i understood as the thing to do.
I also think the last lines of the code are what you really wanted to do. Using different bin-widths to improve visualization (but don't target the distribution of equal amount of samples within each bin)! I used astroml's hist with method='blocks' (astropy supports this too)
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, what i think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES/10) + 3
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N): # N number of particles
particle_position[i] = np.random.uniform(0, xmax) # Initialize in spatial domain
particle_velocity[i] = np.random.normal(0, 5) # Initialize in velocity space
for i in range(maxtime):
plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np
SIZE = 100
SAMPLE_SIZE = 10
def select_subset(seq, size):
"""selects a subset of the data using ...
"""
return seq[:size]
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
select_subset(points_y, SAMPLE_SIZE),
marker="o", color="red")
plt.show()
It uses plt.scatter twice; once on the full data set, the other on the sample points.
You will have to decide how you want to select the sample of points - it is isolated in the select_subset function..
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np
import random
SIZE = 100
SAMPLE_SIZE = 10
def make_mask(data_size, sample_size):
mask = np.array([True] * sample_size + [False ] * (data_size - sample_size))
np.random.shuffle(mask)
return mask
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)
plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on a subset of the data points (the ones not selected in the sample), and a second time on the sampled subset, and draws each subset with its own marker. It is efficient & leaves the original data intact.
The code below does what you want. I have selected a random set v_sub_index of N_sub indices in the correct range (0 to N) and draw those (with _sub suffix) from the larger samples particle_position and particle_velocity. Please note that you don't have to loop to generate random samples. Numpy has great functionality for that without having to use for loops.
import numpy as np
import matplotlib.pyplot as pl
N = 100
xmax = 1.
v_sigma = 2.5 / 2. # 95% of the samples contained within 0, 5
v_mean = 2.5 # mean at 2.5
N_sub = 10
v_sub_index = np.random.randint(0, N, N_sub)
particle_position = np.random.rand (N) * xmax
particle_velocity = np.random.randn(N)
particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])
particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)
pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub , particle_velocity_sub , color='r', marker='^')
pl.show()