I am attempting to do the opposite of this: Given a 2D image of (continuous) intensities, generate a set of irregularly spaced accumulation points, i.e., points that irregularly cover the 2D map, being closer to each other in the areas of high intensity (but without overlap!).
My first try was "weighted" k-means. As I didn't find a working implementation of weighted k-means, the way I introduce the weights consists of repeating the points with high intensities. Here is my code:
import numpy as np
from sklearn.cluster import KMeans

def accumulation_points_finder(x, y, data, n_points, method, cut_value):
    # computing the rms (estimate_rms is a helper defined elsewhere)
    rms = estimate_rms(data)
    # structuring the data
    X, Y = np.meshgrid(x, y, sparse=False)
    if cut_value > 0.:
        mask = data > cut_value
        # applying the mask
        X = X[mask]; Y = Y[mask]; data = data[mask]
        _data = np.array([X, Y, data])
    else:
        X = X.ravel(); Y = Y.ravel(); data = data.ravel()
        _data = np.array([X, Y, data])

    if method == 'weighted_kmeans':
        res = []
        for i in range(len(data)):
            # repeat each pixel in proportion to its intensity
            w = int(np.ceil(data[i]/rms))
            res.extend([[X[i], Y[i]]]*w)
        res = np.asarray(res)
        # kmeans object instantiation
        kmeans = KMeans(init='k-means++', n_clusters=n_points, n_init=25)
        # performing kmeans clustering
        kmeans.fit(res)
        # returning just (x,y) positions
        return kmeans.cluster_centers_
Here are two different results: 1) Making use of all the data pixels. 2) Making use of only pixels above some threshold (RMS).
As you can see, the points seem to be more regularly spaced than concentrated in the areas of high intensity.
So my question is whether there exists a better (deterministic, if possible) method for computing such accumulation points.
Partition the data using quadtrees (https://en.wikipedia.org/wiki/Quadtree) into units of equal variance (or possibly by concentration value instead), using a defined threshold, then keep one point per unit (the centroid). There will be more subdivisions in areas with rapidly changing values and fewer in the background areas.
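The quadtree idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `quadtree_points`, the parameters `var_threshold` and `min_size`, and the synthetic test image are all assumptions made for the example, and the "centroid" is taken to be the geometric centre of each block.

```python
import numpy as np

def quadtree_points(data, var_threshold, min_size=2):
    """Split `data` recursively until each block's variance is at most
    `var_threshold` (or the block is `min_size` pixels or smaller along
    one axis), then return one (row, col) centre per block."""
    points = []

    def split(r0, r1, c0, c1):
        block = data[r0:r1, c0:c1]
        small = (r1 - r0) <= min_size or (c1 - c0) <= min_size
        if small or block.var() <= var_threshold:
            # keep the geometric centre of the block as its representative
            points.append(((r0 + r1 - 1) / 2.0, (c0 + c1 - 1) / 2.0))
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        split(r0, rm, c0, cm)
        split(r0, rm, cm, c1)
        split(rm, r1, c0, cm)
        split(rm, r1, cm, c1)

    split(0, data.shape[0], 0, data.shape[1])
    return np.array(points)

# hypothetical test image: one Gaussian blob on a flat background
yy, xx = np.mgrid[0:64, 0:64]
image = np.exp(-((xx - 20)**2 + (yy - 40)**2) / (2 * 6.0**2))
pts = quadtree_points(image, var_threshold=1e-3)
print(len(pts), "points")
```

The procedure is deterministic, and the number of points is controlled indirectly through the threshold rather than set directly.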
Related
I am trying to calculate the area of a shape enclosed by a large set of unordered points in python. I have a 2D array of points which I can plot as a scatterplot like this.
There are several ways to calculate the area enclosed by points, but these all assume ordered points, such as here and here. This method calculates the area of unordered points, but it doesn't appear to work for complex shapes, as seen here. How would I calculate this area from unordered points in python?
Sample data looks like this:
[[225.93459 -27.25677 ]
[226.98128 -32.001945]
[223.3623 -34.119724]
[225.84741 -34.416553]]
From pen and paper one can see that this shape contains an area of ~12 (unitless) but putting these coordinates into one of the algorithms linked to previously returns an area of ~0.78.
First, note that the phrase 'unordered points' in the question "How would I calculate this area from unordered points in python?" usually means, in the context of area calculation, points on a contour that encloses the area to be calculated.
But the data sample provided in the question is not a set of contour points. It is a cloud of points which, when visualized as a scatterplot, forms a visually perceivable area.
This is why the linked algorithms for calculating areas from 'unordered points' do not apply to what the question is actually about.
In other words, the actual title of the question I will answer below will be:
Calculate the visually perceivable area a cloud of (x,y) points is forming when visualized as a scatterplot
One of the possible options is mentioned in a comment to the question:
Honestly, you might consider taking THAT graph as a bitmap, and counting the number of non-white pixels in it. That is probably as close as you can get. – Tim Roberts
Given an image that exactly covers the plotted points (without any margin), you can express the area of the image rectangle in the units of the underlying (x,y) data by calculating the area TA of the bounding rectangle from the list of points P ( P = [(x1,y1), (x2,y2), ...] ) as follows:
X = [x for x,y in P]
Y = [y for x,y in P]
TA = (max(X)-min(X))*(max(Y)-min(Y))
Assuming N_white is the number of white pixels in an image of N pixels, the actual area A covered by non-white pixels, expressed in the units used in the list of points P, is:
A = TA*(N-N_white)/N
Another approach, which uses only the list of points P with (x,y) coordinates (without creating an image), consists of the following steps:
1. Decide which area Ap a single point covers and calculate half the side length h2 of a square with this area around that point ( h2 = 0.5*sqrt(Ap) ).
2. Create a list R with rectangles around all points in the list P: R = [(x-h2, y+h2, x+h2, y-h2) for x,y in P]
3. Use the code provided through a link listed in the stackoverflow question "Area of Union Of Rectangles using Segment Trees" to calculate the total area covered by the rectangles in the list R.
The advantage of this approach over the graphical one based on the scatterplot is that the choice of the area covered by a point directly controls the precision/resolution/granularity of the area calculation.
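The steps above can be sketched without the segment-tree code. The example below swaps in plain coordinate compression (a boolean grid over all distinct rectangle edges) instead of a segment tree, so it is only suitable for a modest number of points because memory grows quadratically. The function name `union_area_of_squares` and the parameter `point_area` (Ap in the text) are hypothetical.

```python
import numpy as np

def union_area_of_squares(points, point_area):
    """Area of the union of axis-aligned squares of area `point_area`
    centred on each point, via coordinate compression."""
    h2 = 0.5 * np.sqrt(point_area)           # half the side length
    pts = np.asarray(points, dtype=float)
    x0, x1 = pts[:, 0] - h2, pts[:, 0] + h2
    y0, y1 = pts[:, 1] - h2, pts[:, 1] + h2
    # all distinct edge coordinates define a (non-uniform) grid
    xs = np.unique(np.concatenate([x0, x1]))
    ys = np.unique(np.concatenate([y0, y1]))
    covered = np.zeros((len(xs) - 1, len(ys) - 1), dtype=bool)
    for a, b, c, d in zip(x0, x1, y0, y1):
        i0, i1 = np.searchsorted(xs, [a, b])
        j0, j1 = np.searchsorted(ys, [c, d])
        covered[i0:i1, j0:j1] = True          # mark grid cells inside this square
    cell_areas = np.outer(np.diff(xs), np.diff(ys))
    return float(cell_areas[covered].sum())

# two overlapping unit squares centred at (0, 0) and (0.5, 0)
print(union_area_of_squares([(0, 0), (0.5, 0)], 1.0))
```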
Given a 2D array of points, the area covered by the points can be calculated from the return value of the same hist2d() function (matplotlib.pyplot.hist2d()) that is used to draw the density plot.
The 'trick' is to set the cmin parameter of the function to 1 ( cmin=1 ), then count the numpy.nan values in the returned array and relate that count to the total number of array values.
In other words, everything needed for the area calculation is already available once the plot is created, because the histogram function returns the bin counts.
Below is a ready-to-use function for the area calculation, along with a demonstration of its usage:
def area_of_points(points, grid_size = [1000, 1000]):
    """
    Returns the area covered by N 2D-points provided in a 'points' array
    points = [ (x1,y1), (x2,y2), ... , (xN, yN) ]
    'grid_size' gives the number of grid cells in x and y direction the
    'points' bounding box is divided into for calculation of the area.
    Larger 'grid_size' values mean smaller grid cells, higher precision
    of the area calculation and longer runtime.
    area_of_points() requires installed matplotlib module. """
    import matplotlib.pyplot as plt
    import numpy as np
    pts_x = [x for x,y in points]
    pts_y = [y for x,y in points]
    pts_bb_area = (max(pts_x)-min(pts_x))*(max(pts_y)-min(pts_y))
    h2D,_,_,_ = plt.hist2d( pts_x, pts_y, bins = grid_size, cmin=1)
    numberOfWhiteBins = np.count_nonzero(np.isnan(h2D))
    numberOfAll2Dbins = h2D.shape[0]*h2D.shape[1]
    areaFactor = 1.0 - numberOfWhiteBins/numberOfAll2Dbins
    pts_pts_area = areaFactor * pts_bb_area
    print(f'Areas: b-box = {pts_bb_area:8.4f}, points = {pts_pts_area:8.4f}')
    plt.show()
    return pts_pts_area
#:def area_of_points(points, grid_size = [1000, 1000])
import numpy as np
np.random.seed(12345)
x = np.random.normal(size=100000)
y = x + np.random.normal(size=100000)
pts = [[xi,yi] for xi,yi in zip(x,y)]
print(area_of_points(pts))
# ^-- prints: Areas: b-box = 114.5797, points = 7.8001
# ^-- prints: 7.800126875291629
The above code creates the following scatterplot:
Notice that the printed output Areas: b-box = 114.5797, points = 7.8001 and the area value returned by the function, 7.800126875291629, give the area in the units in which the x,y coordinates in the array of points are specified.
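If no plot is needed, the same bin-counting idea can be sketched with numpy.histogram2d alone. This is an assumed variant, not the function above: the name `area_of_points_np` is hypothetical, and it counts bins with at least one point instead of counting NaN bins.

```python
import numpy as np

def area_of_points_np(points, bins=(1000, 1000)):
    """Bounding-box area scaled by the fraction of occupied histogram bins."""
    pts = np.asarray(points, dtype=float)
    counts, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins)
    bb_area = (xedges[-1] - xedges[0]) * (yedges[-1] - yedges[0])
    return bb_area * np.count_nonzero(counts) / counts.size

# two points on the diagonal of the unit square, 2x2 grid: half the box is occupied
print(area_of_points_np([(0, 0), (1, 1)], bins=(2, 2)))
```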
Instead of calling a function, you can apply the same idea inline and experiment with the histogram parameters, calculating the area of whatever the plot shows.
Below is code that changes the displayed plot while using the same underlying point data:
import numpy as np
np.random.seed(12345)
x = np.random.normal(size=100000)
y = x + np.random.normal(size=100000)
pts = [[xi,yi] for xi,yi in zip(x,y)]
pts_values_example = \
[[0.53005, 2.79209],
[0.73751, 0.18978],
... ,
[-0.6633, -2.0404],
[1.51470, 0.86644]]
# ---
pts_x = [x for x,y in pts]
pts_y = [y for x,y in pts]
pts_bb_area = (max(pts_x)-min(pts_x))*(max(pts_y)-min(pts_y))
# ---
import matplotlib.pyplot as plt
bins = [320, 300] # resolution of the grid (for the scatter plot)
# ^-- resolution of precision for the calculation of area
pltRetVal = plt.hist2d( pts_x, pts_y, bins = bins, cmin=1, cmax=15 )
plt.colorbar() # display the colorbar (for a 2d density histogram)
plt.show()
# ---
h2D, xedges1D, yedges1D, h2DhistogramObject = pltRetVal
numberOfWhiteBins = np.count_nonzero(np.isnan(h2D))
numberOfAll2Dbins = (len(xedges1D)-1)*(len(yedges1D)-1)
areaFactor = 1.0 - numberOfWhiteBins/numberOfAll2Dbins
area = areaFactor * pts_bb_area
print(f'Areas: b-box = {pts_bb_area:8.4f}, points = {area:8.4f}')
# prints "Areas: b-box = 114.5797, points = 20.7174"
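This creates the following scatterplot:
Notice that the calculated area is now larger because the coarser grid (fewer, larger bins) marks more of the bounding box as covered. Note also that cmax=15 sets every bin holding more than 15 points to NaN, so those bins are counted as empty in this calculation.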
I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves belongs to the family A/x+B. As of now I cut out each of the curves by hand and curve fit, but I would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning, but didn't know where to start or what packages to use. I am using python, but can also use C++; in fact I hope to transfer it to C++ by the end. Where do you think I should start? Is it worth it to use unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest of them. I would simply take the first N-values with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then simply take the points that most fit that curve and then leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs for each input. This allows you to count how many curves are associated with a specific X-value (which could be a time unit). In other words, you have to have external information to separate points locally. You can then reorder the points at each X-value by increasing Y-value, and now you have your separate curves defined in discrete points.
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
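The lookup-table idea for a discrete X-axis can be sketched as below. It rests on two loudly stated assumptions that the question does not confirm: every curve has exactly one sample at every X-value, and the curves never cross. The function name `separate_by_rank` and the synthetic data are hypothetical.

```python
import numpy as np
from collections import defaultdict

def separate_by_rank(x, y):
    """Group y-values by x, sort each group, and assign the k-th smallest
    value at each x to curve k."""
    table = defaultdict(list)                 # lookup table: x -> list of y
    for xi, yi in zip(x, y):
        table[xi].append(yi)
    n_curves = max(len(v) for v in table.values())
    curves = [[] for _ in range(n_curves)]
    for xi in sorted(table):
        for rank, yi in enumerate(sorted(table[xi])):
            curves[rank].append((xi, yi))
    return curves

# three non-crossing A/x + B curves sampled on the same integer grid, then shuffled
grid = np.arange(1, 6)
params = [(10.0, 1.0), (20.0, 5.0), (40.0, 10.0)]
xs = np.tile(grid, len(params))
ys = np.concatenate([a / grid + b for a, b in params])
order = np.random.default_rng(0).permutation(len(xs))
curves = separate_by_rank(xs[order], ys[order])
print(len(curves), "curves recovered")
```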
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()

xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line: # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0: # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)

##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b

parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])

    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)

    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')

        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])

        # now the model as a line plot
        axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
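For readers without the data file, the separation and fitting steps can be sketched on synthetic data with numpy only. This is an assumed variant of the code above, not the same code: because A/x + B is linear in A and B, it uses linear least squares instead of curve_fit. The helper names and the synthetic parameters are hypothetical.

```python
import numpy as np

def split_on_jumps(x, y, jump=50.0):
    """Start a new curve whenever y rises by more than `jump`
    (the threshold must exceed the noise level, as in the code above)."""
    breaks = np.flatnonzero(np.diff(y) > jump) + 1
    return list(zip(np.split(x, breaks), np.split(y, breaks)))

def fit_a_over_x_plus_b(x, y):
    """Least-squares fit of y = a/x + b, which is linear in a and b."""
    design = np.column_stack([1.0 / x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    return a, b

# three decreasing curves appended one after another, as in the data file
x = np.linspace(1.0, 10.0, 40)
true_params = [(100.0, 5.0), (300.0, 20.0), (600.0, 50.0)]
xs = np.concatenate([x for _ in true_params])
ys = np.concatenate([a / x + b for a, b in true_params])

fits = [fit_a_over_x_plus_b(cx, cy) for cx, cy in split_on_jumps(xs, ys)]
for a, b in fits:
    print(f"a = {a:.3f}, b = {b:.3f}")
```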
The idea:
create N naive, easy to calculate, sufficiently precise(for clustering), approximations. Then "classify" each data-point to the closest such approximation.
This is done like this:
The approximations are analytical approximations using these two equations I derived:
where (x1,y1) and (x2,y2) are coordinates of two points on the curve.
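Judging from the lines that compute A and B in the code further down, the two equations appear to be the following. This is a reconstruction under that assumption, and it models each curve as y = A/(x + B), which differs slightly from the A/x + B family named in the question:

```latex
% Both points lie on y = A / (x + B), so B can be written in two ways:
%   B = A / y_1 - x_1 = A / y_2 - x_2
% Eliminating B gives A, and back-substitution gives B.
\begin{align}
  A &= \frac{y_1 \, y_2 \, (x_1 - x_2)}{y_2 - y_1} \\
  B &= \frac{A}{y_1} - x_1
\end{align}
```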
To get these two points I assumed that (1) the first points (according to the x-axis) are distributed equally between the different real curves, and (2) the first 2 points of each real curve are all smaller or all bigger than the first 2 points of each other real curve. Thus sorting them and dividing into N groups will successfully cluster the first 2*N points. If these assumptions are false you can still manually classify the first 2 points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster rest of the points to each point's closest approximation. Closest meaning with the smallest error.
Edit: A stronger approach for the initial approximation could be by calculating A and B for a couple of pairs of points and using their mean A and B as the approximation. And maybe even possibly doing K-means on these points/approximations.
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
Example:
raw series:
separated series:
I have written code to detect the edges of a flame in an image, and I can get the edge line. It consists of many pixel points stored in an array (data in my code). Now, based on data, I would like to calculate the length of the edge. The idea is to calculate the distance between each pair of consecutive points in data and sum them all to get the length. I am really stuck on this. Please help me, many thanks.
Here is a processed image:
Here is the original image that was converted to the processed image. I load it in the code to compare the result:
import cv2
import matplotlib.pyplot as plt

if __name__ == '__main__':
    path = '1897_1.jpg' #processed image
    pic = cv2.imread(path)
    original = cv2.imread('1897_2.jpg') #original image
    img2 = cv2.flip(original, 1)
    b,g,r = cv2.split(pic)
    img4 = cv2.flip(b, 1)
    h,w = img4.shape
    data = []
    th_val = 20
    for i in range(h):
        for j in range(w):
            val = img4[i, j]
            if (val >= th_val):
                data.append(j)
                break
    b1 = range(len(data))
    b2 = len(data)
    result = [b2]
    print (b2)
    plt.figure(figsize = (10, 8))
    plt.subplot(121)
    plt.imshow(img4)
    plt.plot(data, b1)
    plt.axis('off');
    plt.subplot(122)
    plt.plot(data, b1)
    plt.imshow(img2)
    plt.axis('off')
I came up with a very simple solution. It is far from optimal, but it works for this example and is a good starting point. Unfortunately, it does not work well for the blue channel, where the curve is not smooth, but it works for the green and red channels.
data contains the horizontal coordinate of the first pixel in each row that exceeds the threshold. Consecutive entries are therefore separated by a 1 pixel step on the vertical axis and by data[i+1] - data[i] on the horizontal axis. These two values can be considered the two legs (catheti) of a right triangle, and the hypotenuse is the distance we want to calculate. So, here is the solution:
length = 0
for i in range(0, len(data)-1):
    cathetus = data[i+1]-data[i]
    # square root via **0.5 (writing **1/2 would divide by 2 instead)
    hypotenuse = (cathetus**2 + 1**2)**0.5
    length += hypotenuse
print(length)
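The same sum can be written in vectorized form with numpy. A minimal sketch, assuming as above that consecutive entries of data are exactly one row apart; the function name `edge_length` is hypothetical.

```python
import numpy as np

def edge_length(data):
    """Sum of hypotenuses between consecutive edge points,
    with a fixed vertical step of 1 pixel between rows."""
    steps = np.diff(np.asarray(data, dtype=float))   # horizontal steps
    return float(np.hypot(steps, 1.0).sum())

print(edge_length([0, 1, 2]))   # two diagonal steps of length sqrt(2) each
```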
Update
I have come up with two solutions: a hardcoded one and one that uses a library function. Let us start with the first one. The mean is a rather good approximator for signal + noise. When you do not have very strong noise or missing data, you may use this approach. In the example below we select the points with x in [1,2,3], calculate the mean y for these points, and assign the mean to coordinate x=2. Next we select the points with x in [2,3,4], and so on. As a result, we obtain the list mean_data with y coordinates and mean_x with x coordinates. We can then calculate the length with the approach described above. You may also increase the strength of the smoothing by averaging over 4 or more points from data.
mean_data = []
mean_x = range(1,len(data)-1)
for i in range(0,len(data)-2):
    mean_d = (data[i] + data[i+1] + data[i+2])/3
    mean_data.append(mean_d)
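The averaging window can be generalized with numpy.convolve. This is a sketch under the assumption of an odd window size; the function name `moving_average` and its `window` parameter are hypothetical.

```python
import numpy as np

def moving_average(values, window=3):
    """Centred moving average; returns the x positions and smoothed values."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(values, kernel, mode="valid")
    offset = (window - 1) // 2                 # index of the first centred sample
    positions = np.arange(offset, offset + len(smoothed))
    return positions, smoothed

mean_x, mean_data = moving_average([1, 2, 3, 4, 5], window=3)
print(mean_x, mean_data)
```

With window=3 this reproduces the three-point loop above; a larger window smooths more but shortens the result by window - 1 samples.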
Another approach is to use the smoothing tools from the scipy package. One of them is shown below. When calculating the length you will have to account for the new x axis, xnew.
from scipy.interpolate import make_interp_spline
import numpy as np
#transform to np.arrays initial data
b1_ = np.array(b1)
data_ = np.array(data)
# create new x with more data points
xnew = np.linspace(b1_.min(),b1_.max(),50) #50 is a number of points in between
# cubic interpolating spline evaluated on the new grid
smoothed_data = make_interp_spline(b1_, data_, k=3)(xnew)
We're working with a dataset of spoken numbers. The wavefiles are converted to MFCC values. Each row (wavfile) consists of around 20 to 40 (depending on the length of the soundfile) arrays, with 13 floatvalues in each array. The goal of the task is to identify 10 spoken numbers. Because we don't have labels we want to cluster them in 10 groups using a learning method.
The code looks like this:
def kmeans(data, k=3, normalize=False, limit= 500):
    """Basic k-means clustering algorithm.
    """
    # optionally normalize the data. k-means will perform poorly or strangely if the dimensions
    # don't have the same ranges.
    if normalize:
        stats = (data.mean(axis=0), data.std(axis=0))
        data = (data - stats[0]) / stats[1]

    # pick the first k points to be the centers. this also ensures that each group has at least
    # one point.
    centers = data[:k]

    for i in range(limit):
        # core of clustering algorithm...
        # first, use broadcasting to calculate the distance from each point to each center, then
        # classify based on the minimum distance.
        classifications = np.argmin(((data[:, :, None] - centers.T[None, :, :])**2).sum(axis=1), axis=1)
        # next, calculate the new centers for each cluster.
        new_centers = np.array([data[classifications == j, :].mean(axis=0) for j in range(k)])

        # if the centers aren't moving anymore it is time to stop.
        if (new_centers == centers).all():
            break
        else:
            centers = new_centers
    else:
        # this will not execute if the for loop exits on a break.
        raise RuntimeError(f"Clustering algorithm did not complete within {limit} iterations")

    # if data was normalized, the cluster group centers are no longer scaled the same way the original
    # data is scaled.
    if normalize:
        centers = centers * stats[1] + stats[0]

    print(f"Clustering completed after {i} iterations")
    return classifications, centers
classifications, centers = kmeans(speechdata, k=5)
plt.figure(figsize=(12, 8))
plt.scatter(x=speechdata[:, 0], y=speechdata[:, 1], s=100, c=classifications)
plt.scatter(x=centers[:, 0], y=centers[:, 1], s=500, c='k', marker='^')
The line "classifications, centers = kmeans(speechdata, k=5)" gives me an error: IndexError: too many indices for array.
How do I transform my array of arrays, which have varying lengths (one row has shape (20,13) and another might have (38,13)), so that I can cluster them?
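One simple option, offered here only as a sketch, is to reduce every utterance to a fixed-length summary vector before clustering, for example the per-coefficient mean and standard deviation over all frames. This discards the temporal order of the frames, so it may not separate spoken digits well; padding or resampling to a common length are alternatives. The function name `summarize_mfcc` and the random stand-in data are hypothetical.

```python
import numpy as np

def summarize_mfcc(utterances):
    """Map each (n_frames, 13) MFCC array to one 26-value vector:
    the mean and the standard deviation of every coefficient over time."""
    return np.array([
        np.concatenate([u.mean(axis=0), u.std(axis=0)])
        for u in utterances
    ])

# stand-in data: three utterances with different numbers of frames
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(n, 13)) for n in (20, 38, 27)]
features = summarize_mfcc(utterances)
print(features.shape)   # one fixed-length row per utterance
```

The resulting 2D array has the regular shape that the kmeans function above indexes with data[:, :, None].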
I am working on a project in which I am trying to model the movement of an object in a kymograph. In order to do so, I fit a curve to each line of pixels in an image, and append the location of the vertex to approximately model the location of the object in the image. Below is a sample image.
As you can see, early in the time series (at the top of the image) the position of the object is nicely focused and easily modeled with a Gaussian curve. However, closer to the end of the time series (at the bottom of the image), the peak is much more diffuse. I suspect that the data at the bottom of the image will be fit much more closely by a curve modeling a Poisson distribution (image below, right) while the data at the top/middle of the image will be fit much more closely by a Gaussian or polynomial curve (image below, left).
Is there any way to, for each line of pixels, fit more than one curve to the same data and then score each for a least-squares fit? This way, I could (hopefully) switch models midway through an image to accommodate changing behaviors of the object that I am trying to track. My current code is below:
from PIL import Image

def populateData(picture) :
    """Open an image and populate a list of lists with the grayscale value"""
    im = Image.open(picture)
    size = im.size
    width = size[0]
    height = size[1]
    allPixels = list(im.getdata())
    pixelList = [allPixels[width*i :
                           width * (i+1)] for i in range(height)]
    return(pixelList)

rawData = populateData("testTop.tif")

import numpy as np
from scipy.optimize import curve_fit

def findVertex(listOfRows) :
    xList = []
    for row in listOfRows :
        x = np.arange(len(row))
        # Gaussian with amplitude a, centre x0 and width s
        ffunc = lambda x, a, x0, s: a*np.exp(-0.5*(x-x0)**2/s**2)
        p, _ = curve_fit(ffunc, x, row, p0=[100,5,2])
        x0 = p[1]
        xList.append(x0)
    xArray = np.array(xList)
    return(xArray)

xValues = findVertex(rawData)

def buildRows(listOfRows) :
    yArray = np.arange(len(listOfRows))
    return(yArray)

yValues = buildRows(rawData)

from matplotlib import pyplot as plt

# load the image as a single grayscale (float) layer for display
image = np.asarray(Image.open("testTop.tif").convert("F"))
fig = plt.figure()
axes = fig.add_subplot(111)
axes.imshow(image)
axes.plot(xValues, yValues, 'k-')
axes.set_title('testLine')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('time')
plt.show()
EDIT:
This is the file I used as an input (testTop.tif)
You will need to work out some measure of goodness of fit between the fit and your data, for example the sum of the squared differences between your current fit (a Gaussian) and your data, divided by the variance.
sumerrsq = 0.
for i in range(yValues.shape[0]):
    sumerrsq += np.power(yValues[i] - xValues[i],2)
goodfit = np.sqrt(sumerrsq/var)
I think you can use the second output from curve_fit (the covariance) to get the variance,
p, pcov = curve_fit(ffunc, x, row, p0=[100,5,2])
var = np.diag(pcov)
You can then check the value of goodfit and if it is not sufficient, switch to a different distribution. In using a different distribution, you may need to use a different estimation of error (this assumes the errors are normally distributed).
Note, without the data (and not being sure what array was which) I couldn't test any of this code...
According to the curve_fit docs:
To compute one standard deviation errors on the parameters use perr =
np.sqrt(np.diag(pcov)).
So if that's the value you're trying to compare, then you could take that second returned value from curve_fit (the one you are currently assigning to _), use it to calculate perr as above, and compare the error between multiple curves.
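Fitting several candidate models to one row and scoring each by its sum of squared residuals could look like the sketch below. It is an illustration under stated assumptions, not a tested solution for the kymograph: the parabola stands in for whatever second model is wanted, the helper names and the synthetic row are hypothetical, and a plain SSE comparison is only fair when the candidates have the same number of parameters (both have three here).

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, x0, s):
    return a * np.exp(-0.5 * (x - x0)**2 / s**2)

def parabola(x, c0, c1, c2):
    return c0 + c1 * x + c2 * x**2

def best_model(x, row, candidates):
    """Fit every (name, func, p0) candidate to one row and return
    (sse, name, params) for the candidate with the smallest SSE."""
    results = []
    for name, func, p0 in candidates:
        try:
            p, _ = curve_fit(func, x, row, p0=p0)
        except RuntimeError:
            continue                      # this model did not converge
        sse = float(np.sum((row - func(x, *p))**2))
        results.append((sse, name, p))
    return min(results, key=lambda r: r[0])

# synthetic row: a noisy Gaussian peak centred at x = 12
rng = np.random.default_rng(1)
x = np.arange(30, dtype=float)
row = gaussian(x, 100.0, 12.0, 3.0) + rng.normal(scale=1.0, size=x.size)

candidates = [
    ("gaussian", gaussian, [100, 10, 2]),
    ("parabola", parabola, [1, 1, 1]),
]
sse, name, params = best_model(x, row, candidates)
print(name, sse)
```

Calling best_model once per row of pixels would allow the chosen model to change partway down the image.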
I would suggest you work with a 2D fit model. A 1d Gaussian distribution is the basis but the mean and variance depend on position and time. You then would fit the model against the 2d image data.
In case you want to stay with your approach, it looks like it's just the starting value for mean and variance which you need to tweak in order to get a better fit for the lines with large times.
To your question, you can model any score function you want, so you could do something like:
def score(x,y):
    if x < 10:
        return x**2 - y
    else:
        return x - y
So in order to work with two different models in different ranges, follow this example.