How to plot coarse-grained average of a set of data points? - python

I have a set of discrete 2-dimensional data points, each with a measured value associated with it. I would like to make a scatter plot with the points colored by their measured values. But the points are so dense that differently colored points overlap each other, which is not good for visualization. So I am wondering whether I could color each point by the coarse-grained average of the measured values of the points near it. Does anyone know how to implement this in Python?
Thanks!

I have done it using sklearn.neighbors.RadiusNeighborsRegressor(); the idea is to take the average of the values of the neighbors within a specified radius. Suppose the coordinates of the data points are in the list temp_coors and the values associated with these points are in coloring; then coloring can be coarse-grained in the following way:
from sklearn.neighbors import RadiusNeighborsRegressor

# Replace each value by the average over all neighbors within smoothing_radius.
r_neigh = RadiusNeighborsRegressor(radius=smoothing_radius, weights='uniform')
r_neigh.fit(temp_coors, coloring)
coloring = r_neigh.predict(temp_coors)
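For context, here is a self-contained sketch of the full workflow on random stand-in data (the point coordinates, values, and smoothing_radius below are placeholders to adapt to your own data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsRegressor

# Stand-in data: dense 2-D points with a noisy measured value.
rng = np.random.default_rng(0)
temp_coors = rng.uniform(0, 10, size=(5000, 2))
coloring = np.sin(temp_coors[:, 0]) + rng.normal(scale=0.5, size=5000)

smoothing_radius = 0.5  # tune to your point density
r_neigh = RadiusNeighborsRegressor(radius=smoothing_radius, weights='uniform')
r_neigh.fit(temp_coors, coloring)
smoothed = r_neigh.predict(temp_coors)

plt.scatter(temp_coors[:, 0], temp_coors[:, 1], c=smoothed, s=5)
plt.colorbar(label='coarse-grained value')
plt.show()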

Related

Clustering on evenly spaced grid points

I have a 50 by 50 grid of evenly spaced (x, y) points. Each of these points has a third scalar value. This can be visualized using a contour plot, which I have added. I am interested in the regions indicated by the red circles. These regions of low "Z-values" are what I want to extract from this data.
2D contour plot of 50 x 50 evenly spaced grid points:
I want to do this by using clustering (machine learning), which can be lightning quick when applied correctly. The problem, however, is that the points are evenly spaced, and therefore the density of the dataset is equal everywhere.
I have tried using a DBSCAN algorithm with a custom distance metric which takes into account the Z value of each point. I have defined the distance between two points as follows:
import numpy as np

def custom_distance(point1, point2):
    # Euclidean distance in the x-y plane, scaled by the average Z of the two points.
    average_Z = (point1[2] + point2[2]) / 2
    distance = np.sqrt(np.square(point1[0] - point2[0]) + np.square(point1[1] - point2[1]))
    distance = distance * average_Z
    return distance
This essentially determines the Euclidean distance between two points in the x-y plane and multiplies it by the average of the two points' Z values. In the picture below I have tested this distance function in a DBSCAN algorithm. Each point in this 50 by 50 grid has a Z value of 1, except for four clusters that I have randomly placed, whose points each have a Z value of 10. The algorithm is able to find the clusters in the data based on their Z value, as can be seen below.
DBSCAN clustering result using scalar value distance determination:
Encouraged by these results, I tried to apply it to my actual data, only to be disappointed. Since the x and y values of my data are very large, I have simply scaled them to the range 0 to 49. The Z values I have left untouched. The results of the clustering can be seen in the image below:
Clustering result on original data:
This does not come close to what I want and what I was expecting. For some reason the clusters that are found are of rectangular shape and the light regions of low Z values that I am interested in are not extracted with this approach.
Is there any way I can make the DBSCAN algorithm work in this way? I suspect that the reason it is currently not working has something to do with the differences in scale of the x, y and Z values. I am also open to tips or recommendations on other approaches to defining and finding the lighter regions in the data.
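For reference, a callable metric like the custom_distance above can be passed straight to scikit-learn's DBSCAN. A minimal sketch on stand-in grid data (eps and min_samples are placeholder values that would need tuning, and a callable metric is slow on large grids):

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for the 50 x 50 grid: Z is 1 everywhere except one small patch at 10.
xx, yy = np.meshgrid(np.arange(50), np.arange(50))
zz = np.ones_like(xx, dtype=float)
zz[10:15, 10:15] = 10.0

# Rows of [x, y, z], as expected by custom_distance (defined above).
XYZ = np.column_stack([xx.ravel(), yy.ravel(), zz.ravel()])
labels = DBSCAN(eps=1.5, min_samples=5, metric=custom_distance).fit_predict(XYZ)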

Pathway of lowest values between 2 points in 2D heatmap

I was wondering if I could get some concept ideas from you all before spending too much time on this.
I have a (X,Y,Z) heatmap file showing the energy (Z value) of multiple XY coordinates.
X,Y,Z
-8.000000,0.000000,30
-7.920000,0.000000,30
-7.840000,0.000000,30
-7.760000,0.000000,30
-7.680000,0.000000,30
(...)
7.680000,25.000000,30
7.760000,25.000000,30
7.840000,25.000000,30
7.920000,25.000000,30
8.000000,25.000000,30
I would like to determine possible pathways between 2 points in the XY space. These pathways should consist of a series of XY coordinates with the lowest Z values necessary in order to connect the selected regions.
I appreciate any suggestions on how to approach this.
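One way to frame this is as a shortest-path problem on the grid, with Z as the cost of stepping through a point. A minimal sketch using networkx on the file format shown above (the filename, the endpoint coordinates, and the edge-weighting scheme are assumptions to adapt; it also assumes every (X, Y) combination appears in the file):

import numpy as np
import networkx as nx

# Load the X,Y,Z heatmap (placeholder filename), skipping the header row.
data = np.loadtxt('heatmap.csv', delimiter=',', skiprows=1)
z = {(x, y): val for x, y, val in data}

xs = np.unique(data[:, 0])
ys = np.unique(data[:, 1])

# Build a grid graph; each edge costs the average Z of its two endpoints,
# so a shortest path is a path through the lowest values.
G = nx.Graph()
for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        if i + 1 < len(xs):
            G.add_edge((x, y), (xs[i + 1], y), weight=(z[(x, y)] + z[(xs[i + 1], y)]) / 2)
        if j + 1 < len(ys):
            G.add_edge((x, y), (x, ys[j + 1]), weight=(z[(x, y)] + z[(x, ys[j + 1])]) / 2)

# Lowest-cost pathway between two corner points from the sample above (Dijkstra).
path = nx.shortest_path(G, source=(-8.0, 0.0), target=(8.0, 25.0), weight='weight')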

Python contour plot vs pcolormesh for probability map

So I have two arrays of coordinates that I need to plot, and at each of these points there is a probability of some event happening, so each point has a value ranging from 0 to 1. My idea was to assign these probabilities to their respective (x, y) coordinates and display the result as a heatmap. The code to plot this is as follows:
plt.pcolormesh(xcoord,ycoord,des_mag)
plt.show()
Where xcoord and ycoord are arrays. I could only make this run by making des_mag a 2D array, in this case a 2000x2000 array with entries only on the diagonal, since xcoord and ycoord each contain 2000 coordinates. All the des_mag values range from 0 to 1. When I run this, the output is simply a graph with a solid background and one tiny grid point in the corner with a different color. I'm 95% confident the issue is my lack of understanding of what I need to pass to the plot, but I can't seem to find many examples that clarify the issue. If anyone has any suggestions it would be greatly appreciated.
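One common fix is to interpolate the scattered probabilities onto a regular grid before calling pcolormesh. A minimal sketch on stand-in data (the grid size, interpolation method, and random arrays are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

# Stand-in arrays; replace with the real xcoord, ycoord, des_mag.
rng = np.random.default_rng(0)
xcoord = rng.uniform(0, 10, 2000)
ycoord = rng.uniform(0, 10, 2000)
des_mag = rng.uniform(0, 1, 2000)

# Interpolate the scattered probabilities onto a regular 200 x 200 grid.
xi = np.linspace(xcoord.min(), xcoord.max(), 200)
yi = np.linspace(ycoord.min(), ycoord.max(), 200)
XI, YI = np.meshgrid(xi, yi)
ZI = griddata((xcoord, ycoord), des_mag, (XI, YI), method='linear')

plt.pcolormesh(XI, YI, ZI, shading='auto', vmin=0, vmax=1)
plt.colorbar(label='probability')
plt.show()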

How to plot overlapping clusters in python

I am trying to plot a visualization of the clusters obtained from the Fuzzy C-means clustering algorithm. With crisp clusters like those obtained through k-means, it is easy to visualize the result with a normal scatter plot such as the one produced by matplotlib. Is there a recommended way to plot fuzzy clusters so that the overlaps are visible? If yes, how?
One option would be to divide your data into two groups: points that belong to a cluster with degree of belonging >= X, and those below X. Call the points with degree of belonging >= X the crisp groups. For those below X you would make a group for each of your clusters; call these the fuzzy groups. Every fuzzy group would contain all of the data points not in the crisp groups.
Now, when you go to plot, assign a color to each of your clusters; say you have three clusters A, B, and C, assigned the colors blue, green, and red. Plot the crisp groups at 100% opacity in their group's color, and then for each of the fuzzy groups look at the degree of belonging of the points and plot them at some scaled-back opacity in their cluster's color.
Since you would have to assign a color to each fuzzy group as a whole, it may be best to "bin" them like a histogram by degree of belonging, or you could skip the groups altogether and just plot each point separately.
e.g. say we have 2 clusters A and B, and
data = [(0.8,0.2),(0.5,0.5),(0.65,0.35),(0.25,0.75)]
where data represents the degrees of belonging (A, B) for each of our points (whose coordinates I won't list, but assume point n can be represented by ptn). Then if X is .7 we would get crisp_A = [pt1] and crisp_B = [pt4], with fuzzy_A = [pt2, pt3] and fuzzy_B = [pt2, pt3]. Plot crisp_A and crisp_B in full colors, and then scale the opacity (or use a colormap such as cm.hsv) for fuzzy_A and fuzzy_B according to their respective degrees of belonging.
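Here is a minimal sketch of that idea with matplotlib, using stand-in coordinates for pt1..pt4 and the memberships from the example above (the threshold X and the cluster colors are just placeholders):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

# Stand-in coordinates for pt1..pt4 and their (A, B) degrees of belonging.
pts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0]])
membership = np.array([[0.8, 0.2], [0.5, 0.5], [0.65, 0.35], [0.25, 0.75]])
colors = ['blue', 'green']   # one color per cluster (A, B)
X = 0.7                      # crispness threshold

# Points that are crisp in some cluster; everything else goes in the fuzzy groups.
in_a_crisp_group = (membership >= X).any(axis=1)
for c, color in enumerate(colors):
    deg = membership[:, c]
    crisp = deg >= X
    # Crisp members at full opacity.
    plt.scatter(pts[crisp, 0], pts[crisp, 1], color=color)
    # Fuzzy members faded according to their degree of belonging to this cluster.
    fuzzy = ~in_a_crisp_group
    faded = [to_rgba(color, alpha=d) for d in deg[fuzzy]]
    plt.scatter(pts[fuzzy, 0], pts[fuzzy, 1], c=faded)
plt.show()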

Finding n nearest data points to grid locations

I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i].
I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows.
[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]
XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of Xstep.
I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points.
The only way I can think of is to loop over every grid point and, inside that, have an embedded for loop going through all (so many!) data points, calculating the Euclidean distance, and picking out the 20 closest ones.
for i in range(XG.shape[0]):
    for j in range(YG.shape[0]):
        for k in range(ZG.shape[0]):
            # Squared Euclidean distance from every data point to this grid point.
            Distance = np.zeros(XD.shape)
            for a in range(XD.shape[0]):
                Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
            # Indices of the 20 closest data points to grid point (i, j, k).
            B = np.zeros(20, int)
            for a in range(20):
                indx = np.argmin(Distance)
                B[a] = indx
                Distance[indx] = np.inf
This would give me an array, B, of the indices of the data points closest to the grid point. However, I feel this would take far too long, since it goes through every data point for every grid point.
I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.
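For reference, this kind of bulk nearest-neighbor lookup is what spatial indexes such as k-d trees are built for. A minimal sketch using scipy.spatial.cKDTree on stand-in data (the random arrays and unit-cube grid below are placeholders for the real XD, YD, ZD and XG, YG, ZG):

import numpy as np
from scipy.spatial import cKDTree

# Stand-in data points; replace with the real XD, YD, ZD.
rng = np.random.default_rng(0)
XD, YD, ZD = rng.uniform(0, 1, size=(3, 100000))

# Stand-in grid: 100 x 100 x 100 points spanning the unit cube.
XG = YG = ZG = np.linspace(0, 1, 100)
grid = np.array(np.meshgrid(XG, YG, ZG, indexing='ij')).reshape(3, -1).T

tree = cKDTree(np.column_stack([XD, YD, ZD]))   # build the spatial index once
dist, idx = tree.query(grid, k=20)              # 20 nearest data points per grid point
# idx[m] holds the indices of the 20 data points closest to grid point m.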
Have a look at a seemingly similar but 2D problem and see if you cannot improve on it with ideas from there.
Off the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point you'll quickly locate points in those three arrays and start from there.
Also, you don't necessarily need the exact Euclidean distance, since you are only interested in relative distances; an approximation such as
abs(deltaX) + abs(deltaY) + abs(deltaZ)
saves the expensive powers and square roots.
No need to iterate over your data points for each grid location: Your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.
Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:
1. Allocate each data point to its closest center point.
2. Reposition each center point to the coordinates that would minimize the average distance from "its" points (i.e., to the "centroid" of that data subset).
3. Some data points now have a different closest center point. Repeat steps 1 and 2 until you converge (or near enough).
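The loop described above is essentially Lloyd's algorithm, i.e. k-means. A minimal sketch using scikit-learn on stand-in data (the number of centers and the random points are placeholders):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in data; replace with np.column_stack([XD, YD, ZD]).
rng = np.random.default_rng(0)
points = rng.uniform(0, 1, size=(100000, 3))

# MiniBatchKMeans repeats the assign/reposition steps above until convergence.
km = MiniBatchKMeans(n_clusters=500, random_state=0).fit(points)
centers = km.cluster_centers_    # the repositioned center points
assignment = km.labels_          # index of the closest center for each data point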
