I have a graph with 4 curves, and I need to draw an approximation of one of them over a particular region using a quadratic, ax^2 + bx + c.
I have a collection of up to 1000 data points (x, y).
My questions:
1. Is there an algorithm for selecting three appropriate data points?
Purpose: with three data points I can solve for a, b and c, and then compute y for any x from the equation.
Problem: with the points I am currently selecting, the value of c comes out too large, and evaluating ax^2 + bx + c then gives values outside the range of the data.
2. Is there an algorithm that draws an approximate curve for an existing quadratic relationship?
3. Another approach: we have the x and y points, so we could reduce each y by 1% or 2% and draw the graph from the reduced values. In that case, what would be an appropriate reduction for y?
The implementation language is Python, using any available library.
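Rather than picking three points by hand, one common option is a least-squares fit over all points in the region of interest. The sketch below assumes NumPy is available; `fit_quadratic`, `x_min` and `x_max` are hypothetical names introduced only for illustration, and the data is synthetic.

```python
import numpy as np

def fit_quadratic(x, y, x_min, x_max):
    """Least-squares fit of a*x**2 + b*x + c to the points with x_min <= x <= x_max."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = (x >= x_min) & (x <= x_max)
    # np.polyfit returns coefficients from the highest power down: [a, b, c]
    a, b, c = np.polyfit(x[mask], y[mask], deg=2)
    return a, b, c

# Synthetic example: a noisy quadratic sampled at 1000 points
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 1000)
y = 2.0 * x**2 - 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)

a, b, c = fit_quadratic(x, y, x_min=2.0, x_max=6.0)
y_fit = np.polyval([a, b, c], x)  # evaluate the fitted curve for plotting
```

Because every point in the region contributes, a single noisy sample is much less likely to push c (or the curve) out of range than with an exact three-point solve.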
I have a 50 by 50 grid of evenly spaced (x,y) points. Each of these points has a third scalar value. This can be visualized using a contour plot, which I have added. I am interested in the regions indicated by the red circles. These regions of low "Z-values" are what I want to extract from this data.
2D contour plot of 50 x 50 evenly spaced grid points:
I want to do this by using clustering (machine learning), which can be lightning quick when applied correctly. The problem is, however, that the points are evenly spaced together and therefore the density of the entire dataset is equal everywhere.
I have tried using a DBSCAN algorithm with a custom distance metric which takes into account the Z values of each point. I have defined the distance between two points as follows:
    import numpy as np

    def custom_distance(point1, point2):
        # mean of the two Z values
        average_Z = (point1[2] + point2[2]) / 2
        # Euclidean distance in the (x, y) plane
        distance = np.sqrt(np.square(point1[0] - point2[0]) + np.square(point1[1] - point2[1]))
        # scale the planar distance by the mean Z value
        distance = distance * average_Z
        return distance
This essentially computes the Euclidean distance between two points and multiplies it by the average of the two points' Z values. In the picture below I have tested this distance function in a DBSCAN algorithm. Each point in this 50 by 50 grid has a Z value of 1, except for four clusters that I have randomly placed, whose points each have a Z value of 10. The algorithm is able to find the clusters in the data based on their Z value, as can be seen below.
DBSCAN clustering result using scalar value distance determination:
Encouraged by these results, I tried to apply it to my actual data, only to be disappointed. Since the x and y values of my data are very large, I have simply scaled them to the range 0 to 49. The Z values I have left untouched. The results of the clustering can be seen in the image below:
Clustering result on original data:
This does not come close to what I want and what I was expecting. For some reason the clusters that are found are of rectangular shape and the light regions of low Z values that I am interested in are not extracted with this approach.
Is there any way I can make the DBSCAN algorithm work in this way? I suspect the reason that it is currently not working has something to do with the differences in scale between the x, y and Z values. I am also open to tips or recommendations on other approaches for defining and finding the lighter regions in the data.
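Since the points lie on a regular grid, one alternative to DBSCAN (not taken from the post above) is to threshold the Z values and label the connected low-value regions. The sketch below assumes SciPy is available and uses a synthetic grid; `low_value_regions` and the threshold value are hypothetical choices for illustration.

```python
import numpy as np
from scipy import ndimage

def low_value_regions(z, threshold):
    """Label connected regions of a 2D grid where z is below the threshold."""
    mask = z < threshold
    # ndimage.label assigns 1, 2, ... to each connected region; 0 is background
    labels, n_regions = ndimage.label(mask)
    return labels, n_regions

# Synthetic 50 x 50 grid: high background with two low-valued patches
z = np.full((50, 50), 10.0)
z[5:10, 5:10] = 1.0
z[30:40, 20:25] = 1.0

labels, n_regions = low_value_regions(z, threshold=5.0)
```

On real data the threshold would have to be chosen from the Z distribution (for example a low percentile), and some smoothing beforehand may help if the field is noisy.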
How to design a simple code to automatically quantify a 2D rough surface based on given scatter points geometrically? For example, to use a number, r=0 for a smooth surface, r=1 for a very rough surface and the surface is in between smooth and rough when 0 < r < 1.
To more explicitly illustrate this question, the attached figure below is used to show several sketches of 2D rough surfaces. The dots are the scattered points with given coordinates. Accordingly, every two adjacent dots can be connected and a normal vector of each segment can be computed (marked with arrow). I would like to design a function like
def roughness(x, y):
...
return r
where x and y are sequences of coordinates of each scatter point. For example, in case (a), x=[0,1,2,3,4,5,6], y=[0,1,0,1,0,1,0]; in case (b), x=[0,1,2,3,4,5], y=[0,0,0,0,0,0]. When we call the function roughness(x, y), we will get r=1 (very rough) for case (a) and r=0 (smooth) for case (b). Maybe r=0.5 (medium) for case (d). The question is refined to what appropriate components do we need to put inside the function roughness?
Some initial thoughts:
Roughness of a surface is a local concept, which we only consider within a specific range of area, i.e. only with several local points around the location of interest. To use mean of local normal vectors? This may fail: (a) and (b) are with the same mean, (0,1), but (a) is rough surface and (b) is smooth surface. To use variance of local normal vectors? This may also fail: (c) and (d) are with the same variance, but (c) is rougher than (d).
Maybe something like this:
    import numpy as np

    def roughness(x, y):
        # direction angle of each segment between successive points
        t = np.arctan2(np.diff(y), np.diff(x))
        # sine of the angle change between adjacent segments: sin(t[i+1] - t[i])
        ts = np.sin(t)
        tc = np.cos(t)
        dt = ts[1:] * tc[:-1] - tc[1:] * ts[:-1]
        # mean of squares
        return np.sum(dt**2) / len(dt)
Would that give you something like what you're asking for?
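As a quick check of this idea, the sketch below repeats the function and evaluates it on cases (a) and (b) from the question. The names `r_a` and `r_b` are introduced only for this example.

```python
import numpy as np

def roughness(x, y):
    # direction angle of each segment between successive points
    t = np.arctan2(np.diff(y), np.diff(x))
    ts = np.sin(t)
    tc = np.cos(t)
    # sin(t[i+1] - t[i]) for each pair of adjacent segments
    dt = ts[1:] * tc[:-1] - tc[1:] * ts[:-1]
    # mean of squares, so the result lies between 0 and 1
    return np.sum(dt**2) / len(dt)

# case (a): zigzag with 90 degree turns at every point
r_a = roughness([0, 1, 2, 3, 4, 5, 6], [0, 1, 0, 1, 0, 1, 0])
# case (b): flat line
r_b = roughness([0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 0, 0])
```

Case (a) comes out at 1 (up to floating-point rounding) and case (b) at 0. One caveat: because the measure uses the sine of the turning angle, turns sharper than 90 degrees score lower again, so it may not rank very spiky profiles the way you expect.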
Maybe you should consider a protocol definition:
1) geometric definition of the surface first
2) grant unto that geometric surface intrinsic properties.
2.a) step function can be based on quadratic curve between two peaks or two troughs with their concatenated point as the focus of the 'roughness quadratic' using the slope to define roughness in analogy to the science behind road speed-bumps.
2.b) elliptical objects can be defined by a combination of deformation analysis with centered circles on the incongruity within the body. This can be solved in many ways analogous to step functions.
2.c) flat lines: select points that deviate from the mean and do a Newtonian around with a window of 5-20 concatenated points, or whatever is clever.
3) define a proper threshold that fits whatever intuition you are defining as "roughness", or apply the conventions of any professional field to your liking.
This branched approach might be quicker to program, but I am certain this solution can be refactored into a Euclidean construct of 3-point ellipticals, if someone is up for a geometry problem.
The mathematical definitions of many surface parameters can be found here, which can be easily put into numpy:
https://www.keyence.com/ss/products/microscope/roughness/surface/parameters.jsp
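As an illustration of how such parameters translate to NumPy, here is a minimal sketch of two widely used ones: the arithmetic mean deviation (Ra) and the root-mean-square deviation (Rq) of a height profile about its mean line. The function name `ra_rq` is hypothetical, and the profile is assumed to be already levelled.

```python
import numpy as np

def ra_rq(heights):
    """Return (Ra, Rq) of a profile measured about its mean line."""
    h = np.asarray(heights, dtype=float)
    dev = h - h.mean()              # deviation from the mean line
    ra = np.mean(np.abs(dev))       # arithmetic mean deviation
    rq = np.sqrt(np.mean(dev**2))   # root-mean-square deviation
    return ra, rq

ra_zigzag, rq_zigzag = ra_rq([0, 1, 0, 1])
ra_flat, rq_flat = ra_rq([0, 0, 0, 0])
```

These values carry the units of the heights and are not confined to the 0 to 1 range asked for in the question, so a normalisation step would still be needed.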
Image (d) shows a challenge: basically you want to flatten the shape before doing the calculation. This requires prior knowledge of the type of geometry you want to fit. I found an app Gwyddion that can do this in 3D, but it can only interface with Python 2.7, not 3.
If you know which base shape lies underneath:
1) fit the known shape
2) calculate the arc distance between each two points
3) remap the numbers by subtracting 1) from the original data and assigning new coordinates according to 2)
4) perform normal 2D/3D roughness calculations
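A minimal 2D sketch of these steps, under the assumption that the base shape can be described by a low-degree polynomial. The function `flatten_profile` is hypothetical, and it uses vertical residuals rather than distances along the normal of the fitted shape, which is a simplification.

```python
import numpy as np

def flatten_profile(x, y, deg=2):
    """Fit a polynomial base shape and return (arc length, residual height)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # step 1: fit the assumed base shape
    base = np.polyval(np.polyfit(x, y, deg), x)
    # step 2: cumulative arc length along the fitted shape
    seg = np.hypot(np.diff(x), np.diff(base))
    s = np.concatenate(([0.0], np.cumsum(seg)))
    # step 3: subtract the fit; the new coordinates are (s, resid)
    resid = y - base
    return s, resid

# Example: small alternating ripple on top of a parabola
x = np.linspace(-1.0, 1.0, 201)
ripple = 0.01 * (-1.0) ** np.arange(x.size)
s, resid = flatten_profile(x, x**2 + ripple, deg=2)
# step 4: an ordinary roughness measure on the flattened data, here the RMS height
rms = np.sqrt(np.mean(resid**2))
```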
I have working code that plots a bivariate Gaussian distribution. The distribution is produced by adjusting the covariance (COV) matrix to account for specific variables. Specifically, a radius is applied to every XY coordinate. The COV matrix is then adjusted by a scaling factor to expand the radius in the x-direction and contract it in the y-direction. The direction of this is measured by theta. The output is expressed as a probability density function (PDF).
I have normalised the PDF values. However, I'm calling a separate PDF for each frame. As such, the maximum value changes and hence the probability will be transformed differently for each frame.
Question: Following @Prasanth's suggestion, is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Below is the function I'm currently using to normalise the PDF for a single frame.
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
Is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Indeed it is possible. In your case you probably need to rescale your arrays between two values, say -1 and 1, before plotting, so that the minimum becomes -1, the maximum becomes 1 and the intermediate values are scaled accordingly.
You could also choose 0 and 1, or any other pair, as minimum and maximum, but let's go with -1 and 1 so that the middle value is 0.
To do this, in your code replace:
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
with:
renormPDF = PDFs[0]-PDFs[1]
renormPDF -= renormPDF.min()
normPDF = (renormPDF * 2 / renormPDF.max()) -1
These three lines ensure that normPDF.min() == -1 and normPDF.max() == 1.
Now when plotting the animation the axis on the right of your image does not change.
Your problem is to find the largest values that PDFs[0].max() and PDFs[1].max() take over all frames.
Why don't you run plotmvs on all your planned frames in order to find the absolute maximum for PDFs[0] and PDFs[1] and then run your animation with these absolute maxima to normalize your plots? This way, the colorbar will be the same for all frames.
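A minimal sketch of that idea, assuming the per-frame difference arrays (PDFs[0] - PDFs[1] for each frame) have already been computed and all share the same shape. The function `normalize_frames` and the sample data are hypothetical and only illustrate the scaling.

```python
import numpy as np

def normalize_frames(diff_frames):
    """Rescale a stack of frames to [-1, 1] using one global minimum and maximum."""
    frames = np.asarray(diff_frames, dtype=float)
    lo, hi = frames.min(), frames.max()
    if hi == lo:
        # all values identical: nothing to scale
        return np.zeros_like(frames)
    return 2.0 * (frames - lo) / (hi - lo) - 1.0

# Two tiny stand-in frames; in practice each would be PDFs[0] - PDFs[1] for one frame
frames = [[[0.0, 1.0], [2.0, 3.0]],
          [[4.0, 5.0], [6.0, 8.0]]]
norm = normalize_frames(frames)
```

Because the same minimum and maximum are used for every frame, a fixed colour scale (for example vmin=-1, vmax=1) stays meaningful throughout the animation, and individual frames no longer have to reach the extremes.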
I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i].
I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows.
[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]
XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of Xstep.
I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points.
The only way I can think of is to have a for loop that goes through each grid point, with a nested for loop going through all (so many!) data points, calculating the Euclidean distance and picking out the 20 closest ones.
    for i in range(XG.shape[0]):
        for j in range(YG.shape[0]):
            for k in range(ZG.shape[0]):
                # squared distance from this grid point to every data point
                Distance = np.zeros(XD.shape[0])
                for a in range(XD.shape[0]):
                    Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
                # indices of the 20 closest data points
                B = np.zeros(20, dtype=int)
                for a in range(20):
                    indx = np.argmin(Distance)
                    B[a] = indx
                    Distance[indx] = np.inf
This would give me an array, B, of the indices of the data points closest to the grid point. I feel like this would take too long to go through each data point at each grid point.
I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.
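One standard way to organise the data points first is a k-d tree, which answers k-nearest-neighbour queries without scanning every point. The sketch below assumes SciPy is available and uses small random stand-ins for XD, YD, ZD and the grid; the sizes are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

# Small random stand-ins for the real data arrays
rng = np.random.default_rng(0)
XD, YD, ZD = rng.random((3, 5000))
data = np.column_stack((XD, YD, ZD))

# Build the tree once over all data points
tree = cKDTree(data)

# A 10 x 10 x 10 grid flattened to an (N, 3) array of query points
XG = YG = ZG = np.linspace(0.0, 1.0, 10)
XGrid, YGrid, ZGrid = np.meshgrid(XG, YG, ZG, indexing="ij")
grid_points = np.column_stack((XGrid.ravel(), YGrid.ravel(), ZGrid.ravel()))

# Distances to, and indices of, the 20 nearest data points for every grid point
dist, idx = tree.query(grid_points, k=20)

# B[i, j, k] holds the 20 data indices closest to grid point (XG[i], YG[j], ZG[k])
B = idx.reshape(XGrid.shape + (20,))
```

For a 100x100x100 grid the index array alone has 20 million entries, so querying the grid in chunks may be needed to keep memory under control.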
Have a look at a seemingly similar, but 2D, problem and see whether you can improve on this with ideas from there.
Off the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point, you can quickly locate points in those three arrays and start from there.
Also, you don't really need the true Euclidean distance, since you are only interested in relative distance. The squared distance, which your loop already uses, preserves the ordering without a square root. A cheaper approximation is the Manhattan distance, although it can rank neighbours slightly differently:
    abs(deltaX) + abs(deltaY) + abs(deltaZ)
That saves on the expensive powers and square roots...
No need to iterate over your data points for each grid location: Your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.
Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:
1. Allocate each data point to its closest center point.
2. Reposition each center point to the coordinates that would minimize the average distance from "its" points (to the "centroid" of the data subset).
3. Some data points now have a different closest center point. Repeat steps 1. and 2. until you converge (or near enough).
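The loop described in these steps is essentially Lloyd's k-means iteration. A minimal NumPy sketch follows; the function `lloyd`, its parameters and the test blobs are hypothetical, and note that the centroid strictly minimises the mean squared distance rather than the mean distance.

```python
import numpy as np

def lloyd(points, k, n_iter=50, seed=0):
    """Plain Lloyd iteration: assign points to centers, then move centers to centroids."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # step 1: index of the closest center for every point
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: move each center to the centroid of its points
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        # step 3: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment against the returned centers
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

# Two well-separated blobs in 3D
rng = np.random.default_rng(1)
pts = np.vstack((rng.normal(0.0, 0.1, (100, 3)), rng.normal(10.0, 0.1, (100, 3))))
centers, labels = lloyd(pts, k=2)
```

The pairwise distance array has shape (number of points, k), so for millions of points the assignment step would need to be done in chunks or with a tree.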
Good Morning,
I am implementing a Cressman filter for doing distance-weighted averages in NumPy. I use a Ball Tree implementation (thanks to Jake VanderPlas) to return a list of locations for each point in a request array. The query array (q) has shape [n,3] and holds the x,y,z of each point at which I want a weighted average of the points stored in the tree. The code wrapped around the tree returns points within a certain distance, so I get an array of variable-length arrays.
I use a where to find non-empty entries (i.e. positions where there were at least some points within the radius of influence), creating the isgood array.
I then loop over all query points to return the weighted average of the values self.z (note that this can either be dims=1 or dims=2 to allow multiple co-gridding).
The thing that complicates using map or other quicker methods is the non-uniformity of the lengths of the arrays within self.distances and self.locations. I am still fairly green to NumPy/Python, but I cannot think of a way to do this array-wise (i.e. without reverting to loops).
    self.locations, self.distances = self.tree.query_radius(q, r, return_distance=True)
    t2 = time()
    if debug:
        print("Removing voids")
    # indices of query points with at least one neighbour inside the radius
    isgood = np.where(np.array([len(x) for x in self.locations]) != 0)[0]
    interpol = np.full((len(self.locations),) + np.shape(self.z[0]), np.nan)
    for dist, ix, posn, roi in zip(self.distances[isgood], self.locations[isgood], isgood, r[isgood]):
        # posn is the index of this query point in the full output array
        interpol[posn] = np.average(self.z[ix], weights=(roi**2 - dist**2) / (roi**2 + dist**2), axis=0)
So, any hints on how to speed up the loop?
For a typical case, mapping weather radar data from a (range, azimuth, elevation) grid to a Cartesian grid with 240x240x34 points and 4 variables, it takes 99 s to query the tree (written by Jake in C and Cython; this is the hard step, as you need to search the data!) and 100 s to do the calculation, which in my opinion is slow. Where is my overhead? Is np.average efficient, or, since it is called millions of times, is there a speedup to be gained here? Would I gain by using float32 rather than the default float64, or even by scaling to ints (though it would be very hard to avoid wrap-around in the weighting)? Any hints gratefully received!
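One way to avoid the Python-level loop over ragged arrays is to flatten them and let np.bincount do the grouped sums. The sketch below handles a single 1D variable; `cressman_average` and its arguments are hypothetical stand-ins for self.z, self.locations, self.distances and r, and it has not been timed against the loop above.

```python
import numpy as np

def cressman_average(z, locations, distances, r):
    """Cressman-weighted mean of z for each query point, NaN where there are no neighbours."""
    n = len(locations)
    counts = np.array([len(ix) for ix in locations])
    out = np.full(n, np.nan)
    if counts.sum() == 0:
        return out
    # flatten the ragged neighbour lists into single 1D arrays
    flat_ix = np.concatenate(list(locations)).astype(int)
    flat_d = np.concatenate(list(distances)).astype(float)
    # query-point index and squared radius of influence for every neighbour entry
    group = np.repeat(np.arange(n), counts)
    roi2 = np.repeat(np.asarray(r, dtype=float), counts) ** 2
    w = (roi2 - flat_d**2) / (roi2 + flat_d**2)
    # grouped sums of w*z and w per query point
    num = np.bincount(group, weights=w * z[flat_ix], minlength=n)
    den = np.bincount(group, weights=w, minlength=n)
    good = den > 0
    out[good] = num[good] / den[good]
    return out

z = np.array([1.0, 2.0, 3.0, 4.0])
locations = [np.array([0, 1]), np.array([], dtype=int), np.array([2])]
distances = [np.array([0.0, 1.0]), np.array([]), np.array([0.5])]
result = cressman_average(z, locations, distances, r=[2.0, 2.0, 2.0])
```

For several variables at once, the same grouping could be applied per column, or np.add.reduceat could be used on the flattened weighted values.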
You can find a discussion about the relative merits of the Cressman scheme vs using a Gaussian weight function at:
http://www.flame.org/~cdoswell/publications/radar_oa_00.pdf
The key is to match the smoothing parameter to the data (I recommend using a value close to the average spacing between data points). Once you know the smoothing parameter, you can set an "influence radius" equal to the radius where the weight function falls to 0.01 (or whatever).
How important is speed? If you wish, rather than calling an exponential function to determine the weight, you can make up a discrete table of weights for some fixed number of radius increments, which speeds up the calculation considerably. Ideally, you should have data outside the grid boundaries that can be used in the mapping of the values surrounding the gridpoints (even on the boundary points of the grid). Note this is NOT a true interpolation scheme: it won't return the observed values at the data points exactly. Like the Cressman scheme, it's a low-pass filter.
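A minimal sketch of the lookup-table idea, assuming a Gaussian weight of the form w(d) = exp(-(d / sigma)^2), where sigma plays the role of the smoothing parameter. That functional form, the function names and the table size are all assumptions made for illustration, not taken from the paper linked above.

```python
import numpy as np

def influence_radius(sigma, cutoff=0.01):
    """Distance at which exp(-(d / sigma)**2) falls to the cutoff value."""
    return sigma * np.sqrt(-np.log(cutoff))

def build_weight_table(sigma, cutoff=0.01, n_bins=1000):
    """Tabulate the weight at n_bins evenly spaced distances from 0 to the influence radius."""
    radius = influence_radius(sigma, cutoff)
    d = np.linspace(0.0, radius, n_bins)
    return radius, np.exp(-(d / sigma) ** 2)

def lookup_weights(dist, radius, table):
    """Nearest-bin table lookup; points beyond the influence radius get weight 0."""
    dist = np.asarray(dist, dtype=float)
    idx = np.rint(dist / radius * (len(table) - 1)).astype(int)
    idx = np.clip(idx, 0, len(table) - 1)
    return np.where(dist <= radius, table[idx], 0.0)

radius, table = build_weight_table(sigma=1.0)
w = lookup_weights([0.0, 0.5, 10.0], radius, table)
```

Whether the table actually beats a vectorised np.exp call depends on the array sizes involved, so it is worth timing both before committing to one.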