How to cluster continuous peak widths in Python

I have sample data on different prices of a single product and how many times each price has been entered in the database. My objective is to determine the optimal price range(s) for the product based on its frequency in the data set.
This is the price-frequency plot:
I am using Python 3.7.1 in Jupyter. I tried the peak_widths function from scipy.signal and got the peak widths along with their start and end points. I then wrote an if-else loop, but I think I'm going wrong somewhere: I'm getting only a single cluster.
I used this to get the peak widths:
import matplotlib.pyplot as plt
from scipy import signal

# count_106: frequency at each price; index_106: indices of the detected peaks
pkwdh_106 = signal.peak_widths(count_106, index_106, rel_height=1)

plt.plot(index_106, count_106[index_106], "o")
plt.plot(count_106)
plt.hlines(*pkwdh_106[1:])
plt.xlabel('Price')
plt.ylabel('Frequency')
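(For reference, count_106 is the frequency series and index_106 holds the peak positions; they would typically come from something like the sketch below. The data here is made up and this is not the asker's actual code.)

import numpy as np
from scipy import signal

count_106 = np.array([0, 1, 2, 1, 0, 1, 3, 1, 0, 2, 1, 0])   # dummy frequency per price bin
index_106, _ = signal.find_peaks(count_106)                   # indices of the local maxima
widths, heights, left_ips, right_ips = signal.peak_widths(count_106, index_106, rel_height=1)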
This is the peaks' information
Width of each peak / maxima : [7. 5. 4. 7. 2.88888889]
Y level of each width : [1. 1. 1. 1. 2.]
Starting point of each width : [ 8. 15. 20. 36. 40.]
Ending point of each width : [15. 20. 24. 43. 42.88888889]
This is the loop that I tried but something is missing here.
sum_width = 0
name = 'cluster(' + str(i) + ')'
name = []
start = len(start_x)  # Length of starting array is 5
end = len(end_x)      # Length of ending array is 5

for i in range(end-1):
    # Iterating till 4 END elements
    for j in range(start-1):
        # Iterating till 4 START elements
        if end_x[i] >= start_x[j+1]:
            # If ending of 1st width is greater than or equal to starting of 2nd width
            sum_width = sum_width + width[i]
            # Add starting of 1st width length to the total sum variable
            name.append(width[i])
            # Adding the width lengths to an array
            print('If loop total width - ', sum_width)
            print('If loop ', name)
        else:
            sum_width = sum_width + width[i]
            # If ending of 1st width is lesser than the starting of 2nd width
            name.append(width[i])
            # Add starting of 1st width length to the total sum variable
            print('Else loop total width - ', sum_width)
            print('Else loop ', name)
            break
        break
    print(sum_width)
This is the actual output that I am getting -
If loop total width - 7.0
If loop [7.0]
7.0
If loop total width - 12.0
If loop [7.0, 5.0]
12.0
If loop total width - 16.0
If loop [7.0, 5.0, 4.0]
16.0
If loop total width - 23.0
If loop [7.0, 5.0, 4.0, 7.0]
23.0
I expect to get two clusters like -
[7.0, 5.0, 4.0] [7.0, 2.88888889]

Currently your code appends the width to the name list regardless of whether the condition (end_x[i] >= start_x[j+1]) is true or false. It then immediately breaks out of the inner loop, making that loop redundant.
Some changes need to be made:
You only need to do one pass through the end_x and start_x lists, so only one for-loop is needed.
You'll need to keep a separate list for each cluster.
start_x = [8., 15., 20., 36., 40.]
end_x = [15., 20., 24., 43., 42.88888889]
width = [7., 5., 4., 7., 2.88888889]

sum_width = 0
clusters = [[width[0]]]  # initialise with first width

for i in range(1, len(end_x)):
    print(f'iter {i}: (start={start_x[i]}, end={end_x[i]}); width {width[i]}')
    if end_x[i-1] < start_x[i]:
        print(f' doing a reset ({end_x[i-1]} < {start_x[i]})')
        sum_width = 0
        clusters.append([])
    sum_width += width[i]
    print(' total width:', sum_width)
    clusters[-1].append(width[i])  # append width into last cluster

print(clusters)
# [[7.0, 5.0, 4.0], [7.0, 2.88888889]]
Here, we iterate from the second element to the end of the list, compare the previous end_x element with the current start_x element, and start a new cluster whenever the values don't overlap.
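If the end goal is the price range(s) themselves rather than the width sums, the same overlap test can be used to merge the start/end values directly. A sketch using the start_x/end_x arrays above (the merging logic is my addition, not part of the answer):

start_x = [8., 15., 20., 36., 40.]
end_x = [15., 20., 24., 43., 42.88888889]

ranges = [[start_x[0], end_x[0]]]          # initialise with the first interval
for i in range(1, len(start_x)):
    if end_x[i-1] < start_x[i]:            # gap -> start a new price range
        ranges.append([start_x[i], end_x[i]])
    else:                                  # overlap -> extend the current range
        ranges[-1][1] = max(ranges[-1][1], end_x[i])

print(ranges)
# [[8.0, 24.0], [36.0, 43.0]]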

Related

How to ignore specific numbers in a numpy moving average?

Let's say I have a simple numpy array:
a = np.array([1,2,3,4,5,6,7])
I can calculate the moving average of a window with size 3 simply like:
np.convolve(a,np.ones(3),'valid') / 3
which would yield
array([2., 3., 4., 5., 6.])
Now, I would like to take a moving average but exclude anytime the number '2' appears. In other words, for the first 3 numbers, originally, it would be (1 + 2 + 3) / 3 = 2. Now, I would like to do (1 + 3) / 2 = 2. How can I specify a user-defined number to ignore and calculate the running mean without including this user-defined number? I would like to keep this to some sort of numpy function without bringing in pandas.
You could replace the unwanted values with 0 using a mask and separately compute the number of valid items, then compute the ratio:
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])
mask = a != 2
num = np.convolve(np.where(mask, a, 0), np.ones(3), 'valid')
denom = np.convolve(mask, np.ones(3), 'valid')
out = num / denom
Output:
array([2. , 3.5, 4. , 5. , 6. ])
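To make the ignored value a parameter, the same mask-and-normalise idea can be wrapped in a small helper (the function name and signature are mine, not part of the answer; note the division yields nan if an entire window consists of ignored values):

import numpy as np

def moving_average_ignoring(a, window, ignore):
    # exclude entries equal to `ignore` from both the windowed sum and the count
    mask = a != ignore
    num = np.convolve(np.where(mask, a, 0), np.ones(window), 'valid')
    denom = np.convolve(mask, np.ones(window), 'valid')
    return num / denom

a = np.array([1, 2, 3, 4, 5, 6, 7])
print(moving_average_ignoring(a, 3, 2))
# [2.  3.5 4.  5.  6. ]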

Finding the N maximum indices of a numpy array whose corresponding values in another array are greater than M

I have 3 Numpy arrays each of length 107952899.
Let's say:
1. Time = [2.14579526e+08 2.14579626e+08 2.14579726e+08 ...1.10098692e+10 1.10098693e+10]
2. Speed = [0.66 0.66 0.66 .............0.06024864 0.06014756]
3. Brak_press = [0.3, 0.3, 0.3 .............. 0.3, 0.3]
What it means: each index value in Time corresponds to the same index value in the Speed and Brake arrays.
Time              Speed   Brake
2.14579526e+08    0.66    0.3
...               ...     ...
Requirement
No 1: I want to find the indices in the Speed array whose values are greater than 20.
No 2: For those indices, what are the values in the Brake array?
No 3: Now I want to find the indices of the top N maximum values in the Brake array and store them in another list/array.
So finally, if I take one index from the top N maximum indices and use it in the Brake & Speed arrays, it must hold that
Brake[idx] is a valid value and, more importantly, Speed[idx] > 20.
General summary
Simply, what I need is to find the indices of the N maximum Brake values whose corresponding Speed values are greater than 20.
What I tried
speed_20 = np.where(Speed > 20)    # I got the indices as a tuple
brake_values = Brake[speed_20]     # the Brake values corresponding to the speed_20 indices
After that I tried argsort/argpartition, but none of the results matched my requirement.
Request
I believe there is a better method to do this; kindly shed some light.
(I converted the above numpy arrays to a pandas DataFrame and it works fine, but due to memory concerns I prefer to do it with numpy operations.)
You are almost there. This should do what you want:
speed_20 = np.where(Speed > 20)[0]
sort = np.argsort(-Brake[speed_20])
result = speed_20[sort[:N]]
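Since the real arrays have roughly 10^8 elements and only the top N are needed, np.argpartition can avoid sorting everything; a variant of the same idea (my addition, shown on small dummy data):

import numpy as np

Speed = np.array([0.66, 25.0, 30.0, 0.06, 22.0])   # dummy data
Brake = np.array([0.3, 0.7, 0.2, 0.9, 0.5])
N = 2

speed_20 = np.where(Speed > 20)[0]            # indices where Speed > 20
vals = Brake[speed_20]
top = np.argpartition(-vals, N - 1)[:N]       # positions of the N largest Brake values, unordered
top = top[np.argsort(-vals[top])]             # order them by descending Brake
result = speed_20[top]

print(result)          # [1 4]
print(Brake[result])   # [0.7 0.5]
print(Speed[result])   # [25. 22.]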
Maybe this is an option you can consider, using NumPy.
First create a multidimensional matrix (I changed the values so it's easier to follow):
Time = [ 2, 1, 5, 4, 3]
Speed = [ 10, 20, 40, 30, 50]
Brak_press = [0.1, 0.3, 0.5, 0.4, 0.2]
data = np.array([Time, Speed, Brak_press]).transpose()
So data are stored as:
print(data)
# [[ 2. 10. 0.1]
# [ 1. 20. 0.3]
# [ 5. 40. 0.5]
# [ 4. 30. 0.4]
# [ 3. 50. 0.2]]
To extract speed greater than 20:
data[data[:,1] > 20]
# [[ 5. 40. 0.5]
# [ 4. 30. 0.4]
# [ 3. 50. 0.2]]
To get the n greatest Brak_press:
n = 2
data[data[:,2].argsort()[::-1][:n]]
# [[ 5. 40. 0.5]
# [ 4. 30. 0.4]]
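Combining both steps, filter the rows with speed greater than 20 first and then take the n largest Brak_press values among them (my combination of the two snippets above):

valid = data[data[:,1] > 20]                     # rows with Speed > 20
top_n = valid[valid[:,2].argsort()[::-1][:n]]    # n largest Brak_press among those rows
# [[ 5.  40.   0.5]
#  [ 4.  30.   0.4]]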

The step size problem in the linspace function

import numpy as np
z=np.linspace(10,20,5)
z1=np.linspace(10,20,5,endpoint=False)
print(z)
print(z1)
The first z is printed out:
[ 10. 12.5 15. 17.5 20.]
The Second z1 is printed out:
[ 10. 12. 14. 16. 18.]
My confusion: for z, with endpoint=True, the number of equally spaced samples to generate is num=5, that is, five numbers, so there are four steps, and it's easy to calculate that each step is 2.5.
But for z1, with endpoint=False, the definition of endpoint says 20 is excluded from the sequence, yet the sequence still has 5 numbers and 4 steps. Why is the last number 18 and not 19, or something else?
Your intuition seems to be that if endpoint=False is specified, then the last element of the returned array should be 1 less than the stop value.
Suppose we implement things that way. What does the following linspace call return?
numpy.linspace(0, 0.5, 5, endpoint=False)
Is it going to end at -0.5, counting down? That wouldn't make much sense.
numpy.linspace always divides the interval from start to stop into equally-sized chunks, and it always returns an array of length num. The difference between endpoint=True and endpoint=False is that with endpoint=False, it makes one extra chunk to compensate for leaving out the right endpoint. The step from 16 to 18 is the same size as the step from 18 to the omitted 20.
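You can confirm the two step sizes directly by asking linspace to return them with retstep=True (a quick check, not part of the original answer):

import numpy as np

z, step = np.linspace(10, 20, 5, retstep=True)
z1, step1 = np.linspace(10, 20, 5, endpoint=False, retstep=True)
print(step)    # 2.5 -> (20 - 10) / (5 - 1)
print(step1)   # 2.0 -> (20 - 10) / 5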
It would be helpful to look at the source here. This will allow you to walk through what the function is doing, which will give you a better understanding of the output of linspace.
Two parameters are used to calculate the step size: div and delta. However, the difference between endpoint=True, and endpoint=False, is that div is equal to num-1 if endpoint=True, and num if endpoint=False.
div = (num - 1) if endpoint else num
Here are the other relevant pieces of the source (very slimmed down):
delta = stop - start
y = _nx.arange(0, num, dtype=dt)
# ...
if num > 1:
    step = delta / div
    # ...
    y = y * step
# ...
y += start
If we walk through each of these the outputs make a lot more sense.
endpoint=True:
div = num - 1 # 4
y = _nx.arange(0, num, dtype=dt) # [0, 1, 2, 3, 4]
delta = 20 - 10 # 10
step = delta / div # 2.5
y = y * step # [0.0, 2.5, 5.0, 7.5, 10.0]
y += start # [10.0, 12.5, 15.0, 17.5, 20.0]
endpoint=False:
div = num # 5
y = _nx.arange(0, num, dtype=dt) # [0, 1, 2, 3, 4]
delta = 20 - 10 # 10
step = delta / div # 2
y = y * step # [0.0, 2.0, 4.0, 6.0, 8.0]
y += start # [10.0, 12.0, 14.0, 16.0, 18.0]

Assign numpy array of points to a 2D square grid

I'm going beyond my previous question because of speed problems. I have an array of lat/lon coordinates of points, and I would like to assign them to an index code derived from a 2D square grid of equal-size cells. This is an example of how it would work. Let's call points my first array, containing the coordinates (call them [x y] pairs) of six points:
points = [[ 1.5 1.5]
[ 1.1 1.1]
[ 2.2 2.2]
[ 1.3 1.3]
[ 3.4 1.4]
[ 2. 1.5]]
Then I have another array containing the coordinates of the vertices of a grid of two cells in the form [minx,miny,maxx,maxy]; let's call it bounds:
bounds = [[ 0. 0. 2. 2.]
[ 2. 2. 3. 3.]]
I would like to find which points fall in which cell, and then assign a code derived from the bounds array index (in this case the first cell has code 0, the second 1, and so on...). Since the cells are squares, the easiest way to check whether a point is in a cell is to evaluate:
x > minx & x < maxx & y > miny & y < maxy
So that the resulting array would appear as:
results = [0 0 1 0 NaN NaN]
where NaN means that the point is outside all cells. In my real case this is on the order of 10^6 points and 10^4 cells. Is there a way to do this kind of thing quickly using numpy arrays?
EDIT: to clarify, the expected results array means that the first point is inside the first cell (index 0 of the bounds array), as is the second, the third point is inside the second cell of the bounds array, and so on...
Here is a vectorized approach to your problem. It should speed things up significantly.
import numpy as np

def findCells(points, bounds):
    # make sure points is n by 2 (pool.map might send us 1D arrays)
    points = points.reshape((-1, 2))

    # check for each point if all coordinates are in bounds
    # dimension 0 is bound
    # dimension 1 is point
    allInBounds = (points[:, 0] > bounds[:, None, 0])
    allInBounds &= (points[:, 1] > bounds[:, None, 1])
    allInBounds &= (points[:, 0] < bounds[:, None, 2])
    allInBounds &= (points[:, 1] < bounds[:, None, 3])

    # now find out the positions of all nonzero (i.e. true) values
    # nz[0] contains the indices along dim 0 (bound)
    # nz[1] contains the indices along dim 1 (point)
    nz = np.nonzero(allInBounds)

    # initialize the result with all nan
    r = np.full(points.shape[0], np.nan)
    # now use nz[1] to index point position and nz[0] to tell which cell the
    # point belongs to
    r[nz[1]] = nz[0]
    return r

def findCellsParallel(points, bounds, chunksize=100):
    import multiprocessing as mp
    from functools import partial

    func = partial(findCells, bounds=bounds)
    # using python3 you could also do 'with mp.Pool() as p:'
    p = mp.Pool()
    try:
        return np.hstack(p.map(func, points, chunksize))
    finally:
        p.close()

def main():
    nPoints = int(1e6)
    nBounds = int(1e4)

    # points = np.array([[ 1.5, 1.5],
    #                    [ 1.1, 1.1],
    #                    [ 2.2, 2.2],
    #                    [ 1.3, 1.3],
    #                    [ 3.4, 1.4],
    #                    [ 2. , 1.5]])
    points = np.random.random([nPoints, 2])

    # bounds = np.array([[0, 0, 2, 2],
    #                    [2, 2, 3, 3]])
    # bounds = np.array([[0, 0, 1.4, 1.4],
    #                    [1.4, 1.4, 2, 2],
    #                    [2, 2, 3, 3]])
    bounds = np.sort(np.random.random([nBounds, 2, 2]), 1).reshape(nBounds, 4)

    r = findCellsParallel(points, bounds)

    print(points[:10])
    for bIdx in np.unique(r[:10]):
        if np.isnan(bIdx):
            continue
        print("{}: {}".format(bIdx, bounds[int(bIdx)]))  # cast to int for indexing
    print(r[:10])

if __name__ == "__main__":
    main()
Edit:
Trying it with your amount of data gave me a MemoryError. You can avoid that and even speed things up a little more if you use multiprocessing.Pool with its map function, see updated code.
Result:
>time python test.py
[[ 0.69083585 0.19840985]
[ 0.31732711 0.80462512]
[ 0.30542996 0.08569184]
[ 0.72582609 0.46687164]
[ 0.50534322 0.35530554]
[ 0.93581095 0.36375539]
[ 0.66226118 0.62573407]
[ 0.08941219 0.05944215]
[ 0.43015872 0.95306899]
[ 0.43171644 0.74393729]]
9935.0: [ 0.31584562 0.18404152 0.98215445 0.83625487]
9963.0: [ 0.00526106 0.017255 0.33177741 0.9894455 ]
9989.0: [ 0.17328876 0.08181912 0.33170444 0.23493507]
9992.0: [ 0.34548987 0.15906761 0.92277442 0.9972481 ]
9993.0: [ 0.12448765 0.5404578 0.33981119 0.906822 ]
9996.0: [ 0.41198261 0.50958195 0.62843379 0.82677092]
9999.0: [ 0.437169 0.17833114 0.91096133 0.70713434]
[ 9999. 9993. 9989. 9999. 9999. 9935. 9999. 9963. 9992. 9996.]
real 0m 24.352s
user 3m 4.919s
sys 0m 1.464s
You can use a nested loop to check the condition and yield the results as a generator:
points = [[1.5, 1.5],
          [1.1, 1.1],
          [2.2, 2.2],
          [1.3, 1.3],
          [3.4, 1.4],
          [2. , 1.5]]

bounds = [[0., 0., 2., 2.],
          [2., 2., 3., 3.]]

def pos(p, b):
    for x, y in p:
        flag = False
        for index, dis in enumerate(b):
            minx, miny, maxx, maxy = dis
            if x > minx and x < maxx and y > miny and y < maxy:
                flag = True
                yield index
        if not flag:
            yield 'NaN'

print(list(pos(points, bounds)))
result :
[0, 0, 1, 0, 'NaN', 'NaN']
I would do it like this:
import numpy as np

points = np.random.rand(10, 2)
xmin = [0.25, 0.5]
ymin = [0.25, 0.5]
results = np.zeros(len(points))

for i in range(len(xmin)):
    bool_index_array = np.greater(points, [xmin[i], ymin[i]])
    print("boolean index of (x,y) greater (xmin, ymin): ", bool_index_array)
    indices_of_true_true = np.where(bool_index_array[:, 0] * bool_index_array[:, 1] == 1)[0]
    print("indices of [True,True]: ", indices_of_true_true)
    results[indices_of_true_true] += 1

print("results: ", results)
[out]: [ 1. 1. 1. 2. 0. 0. 1. 1. 1. 1.]
This uses the lower boundaries to categorize your points into the groups:
1 (if xmin[0] < x <= xmin[1] & ymin[0] < y <= ymin[1])
2 (if x > xmin[1] & y > ymin[1])
0 if none of the conditions above are fulfilled

Vectorize Operations in Numpy for Two Dependent Arrays

I have an n x n numpy array that contains all pairwise distances and another 1 x n array that contains some scoring metric.
Example:
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
array([[ 0. , 3.2, 4.1, 8.8, 0.6],
[ 3.2, 0. , 1.5, 9. , 5. ],
[ 4.1, 1.5, 0. , 9.9, 10. ],
[ 8.8, 9. , 9.9, 0. , 1.1],
[ 0.6, 5. , 10. , 1.1, 0. ]])
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
array([ 19. , 1.3, 4.8, 6.2, 5.7])
So, note that the ith element of the score array corresponds to the ith row of the distance array.
What I need to do is vectorize this process:
For the ith value in the score array, find all other values that are larger than the ith value and note their indices
Then, in the ith row of the distance array, get all of the distances with the same indices as noted in step 1. above and return the smallest distance
In the case where the ith value in the score array is the largest, then the smallest distance is set as the largest distance found in the distance array
Here is an un-vectorized version:
n = score.shape[0]
min_dist = np.full(n, np.max(dists))

for i in range(score.shape[0]):
    inx = np.where(score > score[i])
    if len(inx[0]) > 0:
        min_dist[i] = np.min(dists[i, inx])

min_dist
array([ 10. , 1.5, 4.1, 8.8, 0.6])
This works but is pretty inefficient in terms of speed and my arrays are expected to be much, much larger. I am hoping to improve the efficiency by using faster vectorized operations to achieve the same result.
Update: Based on Oliver W.'s answer, I came up with my own version that doesn't require making a copy of the distance array:
def new_method(dists, score):
    mask = score > score.reshape(-1, 1)
    return np.ma.masked_array(dists, mask=~mask).min(axis=1).filled(dists.max())
One could in theory make it a one-liner but it's already a bit challenging to read to the untrained eye.
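For reference, that one-liner would just be the same masked-array expression inlined (dists and score as defined above):

min_dist = np.ma.masked_array(
    dists, mask=~(score > score.reshape(-1, 1))
).min(axis=1).filled(dists.max())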
One possible vectorized solution is given below.
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
def your_method(dists, score):
    dim = score.shape[0]
    min_dist = np.full(dim, np.max(dists))
    for i in range(dim):
        inx = np.where(score > score[i])
        if len(inx[0]) > 0:
            min_dist[i] = np.min(dists[i, inx])
    return min_dist

def vectorized_method_v1(dists, score):
    mask = score > score.reshape(-1, 1)
    dists2 = dists.copy()  # get rid of this in case the dists array can be changed
    dists2[np.logical_not(mask)] = dists.max()
    return dists2.min(axis=1)
The speed gain is not so impressive for these small arrays (~factor of 3 on my machine), so I'll demonstrate with a larger set:
dists = scipy.spatial.distance.squareform(np.random.random(50*99))
score = np.random.random(dists.shape[0])
print(dists.shape)
%timeit your_method(dists, score)
%timeit vectorized_method_v1(dists, score)
(100, 100)
100 loops, best of 3: 2.98 ms per loop
10000 loops, best of 3: 125 µs per loop
That is close to a factor of 24.
