Pythonic way to remove elements from Numpy array closer than threshold

What is the best way to remove the minimal number of elements from a sorted Numpy array so that the minimum distance between the remaining elements is always greater than a certain threshold?
For example, if the threshold is 1, the following sequence [0.1, 0.5, 1.1, 2.5, 3.] will become [0.1, 1.1, 2.5]. The 0.5 is removed because it is too close to 0.1 but then 1.1 is preserved because it is far enough from 0.1.
My current code:
import numpy as np

MIN_DISTANCE = 1
a = np.array([0.1, 0.5, 1.1, 2.5, 3.])

for i in range(len(a) - 1):
    if a[i + 1] - a[i] < MIN_DISTANCE:
        a[i + 1] = a[i]

a = np.unique(a)
a
array([0.1, 1.1, 2.5])
Is there a more efficient way to do so?
Note that my question is similar to "Remove values from numpy array closer to each other" but not exactly the same.

You could use numpy.ufunc.accumulate to iterate through adjacent pairs of the array instead of the for loop.
The numpy.add.accumulate example or itertools.accumulate probably shows best what it does.
Along with numpy.frompyfunc, your condition can be applied as a ufunc (universal function).
Code (with an extended array to cross-check some additional cases, but it works with your array as well):
import numpy as np

MIN_DISTANCE = 1
a = np.array([0.1, 0.5, 0.6, 0.7, 1.1, 2.5, 3., 4., 6., 6.1])
print("original: \n" + str(a))

def my_py_function(arr1, arr2):
    # keep the running value arr1 when arr2 is too close to it
    if arr2 - arr1 < MIN_DISTANCE:
        arr2 = arr1
    return arr2

my_np_function = np.frompyfunc(my_py_function, 2, 1)
my_np_function.accumulate(a, dtype=object, out=a).astype(float)
print("complete: \n" + str(a))

a = np.unique(a)
print("unique: \n" + str(a))
Result:
original:
[0.1 0.5 0.6 0.7 1.1 2.5 3. 4. 6. 6.1]
complete:
[0.1 0.1 0.1 0.1 1.1 2.5 2.5 4. 6. 6. ]
unique:
[0.1 1.1 2.5 4. 6. ]
Concerning execution time, timeit shows a break-even point at an array length of about 20.
Your code is much faster (relatively) for your array length of 5, whereas for array lengths well above 20 the accumulate option speeds up considerably (roughly 35% less time at an array length of 300).
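If you want to reproduce that comparison, a rough timeit sketch along these lines should be close (the helper names and random test data below are my own, not from the original benchmark):

import timeit
import numpy as np

MIN_DISTANCE = 1

def clip_close(prev, cur):
    # collapse cur onto prev when the gap is below the threshold
    return prev if cur - prev < MIN_DISTANCE else cur

clip_close_ufunc = np.frompyfunc(clip_close, 2, 1)

def loop_version(a):
    # original for-loop approach on a copy of the input
    a = a.copy()
    for i in range(len(a) - 1):
        if a[i + 1] - a[i] < MIN_DISTANCE:
            a[i + 1] = a[i]
    return np.unique(a)

def accumulate_version(a):
    # frompyfunc/accumulate approach
    acc = clip_close_ufunc.accumulate(a, dtype=object).astype(float)
    return np.unique(acc)

for n in (5, 20, 300):
    a = np.sort(np.random.uniform(0, n, n))
    t_loop = timeit.timeit(lambda: loop_version(a), number=1000)
    t_acc = timeit.timeit(lambda: accumulate_version(a), number=1000)
    print(f"n={n}: loop {t_loop:.4f}s, accumulate {t_acc:.4f}s")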

Related

How to append to a list but skip a line?

I am trying to store values of length 150 into a list/array, but I want to skip to a new line on each iteration. This is what I have, which doesn't work. freq_data_1 has size (150,), which I try to append to freq_data; the append happens, but when I try to skip to the next line, it won't work. Any suggestions?
import numpy as np
import matplotlib.pyplot as plt
from scipy import pi
from scipy.fftpack import fft

freq_data = []
freq_data_2 = []

for i in range(len(video_samples)):
    freq_data_1 = fft(video_samples[i, :])
    freq_data.append(freq_data_1[i])

freq_data_2 = '\n'.join(freq_data)
My video_samples is an array of shape (4000, 150), meaning I have 4000 signals, each 150 time steps long. I want my output to be the same size as this, but storing the frequency output.
video_samples is a collection of signals with slightly varying frequency for each signal/row, e.g.:
Input:
[0.775 0.3223 0.4613 0.2619 0.4012 0.567
0.908 0.4223 0.5128 0.489 0.318 0.187]
The first row is one of my signals of length 6. The second row is another signal of length 6. Each of these signals represents a frequency with added noise.
I wish to take each row separately, use the FFT on it to obtain the frequency of that signal and then store it in a matrix where each row would represent the FFT of that signal.
Another guess...
import numpy as np  # it's not actually necessary for this snippet

def fft(lst):  # just for example
    return [x * 2 for x in lst]

# 2d array, just a guess
video_samples = [
    [0.775, 0.3223, 0.4613, 0.2619, 0.4012, 0.567],
    [0.908, 0.4223, 0.5128, 0.489, 0.318, 0.187],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9, 0.1, 0.2, 0.3]
]
video_samples = np.array(video_samples, dtype='float')  # 2d list to ndarray, just for example

print('video samples (input?): \n', video_samples)

matrix1 = []
matrix2 = []

for s1, s2 in zip(video_samples[::2], video_samples[1::2]):
    matrix1.append(fft(s1))
    matrix2.append(fft(s2))

matrix1 = np.array(matrix1, dtype='float')  # just for example
matrix2 = np.array(matrix2, dtype='float')  # just for example

print('\nmatrix1:\n', matrix1)
print('\nmatrix2:\n', matrix2)
Output:
video samples (input?):
[[0.775 0.3223 0.4613 0.2619 0.4012 0.567 ]
[0.908 0.4223 0.5128 0.489 0.318 0.187 ]
[0.1 0.2 0.3 0.4 0.5 0.6 ]
[0.7 0.8 0.9 0.1 0.2 0.3 ]]
matrix1:
[[1.55 0.6446 0.9226 0.5238 0.8024 1.134 ]
[0.2 0.4 0.6 0.8 1. 1.2 ]]
matrix2:
[[1.816 0.8446 1.0256 0.978 0.636 0.374 ]
[1.4 1.6 1.8 0.2 0.4 0.6 ]]
Five people over two (or more?) days couldn't work out what you mean. Amazing.
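For what it's worth, if the goal really is just to take the FFT of each row of video_samples and store the results in a matrix of the same shape, a minimal sketch (using random stand-in data of the (4000, 150) shape described in the question) could be:

import numpy as np
from scipy.fftpack import fft

# stand-in for the real (4000, 150) data from the question
video_samples = np.random.rand(4000, 150)

# fft applied along axis 1 transforms each row (signal) independently,
# so freq_data has the same shape as video_samples but holds complex values
freq_data = fft(video_samples, axis=1)

print(freq_data.shape)  # (4000, 150)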

Dynamically normalise 2D numpy array

I have a 2D numpy array "signals" of shape (100000, 1024). Each row contains the traces of amplitude of a signal, which I want to normalise to be within 0-1.
The signals each have different amplitudes, so I can't just divide by one common factor, so I was wondering if there's a way to normalise each of the signals so that each value within them is between 0-1?
Let's say that the signals look something like [[0,1,2,3,5,8,2,1],[0,2,5,10,7,4,2,1]] and I want them to become [[0,0.125,0.25,0.375,0.625,1,0.25,0.125],[0,0.2,0.5,1,0.7,0.4,0.2,0.1]].
Is there a way to do it without looping over all 100,000 signals, as this will surely be slow?
Thanks!
An easy thing to do would be to generate a new numpy array with the max values by axis and divide by it:
import numpy as np
a = np.array([[0,1,2,3,5,8,2,1],[0,2,5,10,7,4,2,1]])
b = np.max(a, axis = 1)
print(a / b[:,np.newaxis])
output:
[[0. 0.125 0.25 0.375 0.625 1. 0.25 0.125]
[0. 0.2 0.5 1. 0.7 0.4 0.2 0.1 ]]
Adding a little benchmark to show just how significant the performance difference between the two solutions is:
import numpy as np
import timeit

arr = np.arange(1024).reshape(128, 8)

def using_list_comp():
    return np.array([s / np.max(s) for s in arr])

def using_vectorized_max_div():
    return arr / arr.max(axis=1)[:, np.newaxis]

result1 = using_list_comp()
result2 = using_vectorized_max_div()
print("Results equal:", (result1 == result2).all())

time1 = timeit.timeit('using_list_comp()', globals=globals(), number=1000)
time2 = timeit.timeit('using_vectorized_max_div()', globals=globals(), number=1000)
print(time1)
print(time2)
print(time1 / time2)
On my machine the output is:
Results equal: True
0.9873569
0.010177099999999939
97.01750989967731
Almost a 100x difference!
Another solution is to use normalize:
from sklearn.preprocessing import normalize
data = [[0,1,2,3,5,8,2,1],[0,2,5,10,7,4,2,1]]
normalize(data, axis=1, norm='max')
result:
array([[0. , 0.125, 0.25 , 0.375, 0.625, 1. , 0.25 , 0.125],
[0. , 0.2 , 0.5 , 1. , 0.7 , 0.4 , 0.2 , 0.1 ]])
Please note the norm='max' argument; the default value is 'l2'.
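To see why that argument matters, here's a small sketch (reusing the same data) of what you would get with the default 'l2' norm versus norm='max':

from sklearn.preprocessing import normalize

data = [[0, 1, 2, 3, 5, 8, 2, 1], [0, 2, 5, 10, 7, 4, 2, 1]]

# default norm='l2' divides each row by its Euclidean length,
# so the row maxima are generally no longer 1
print(normalize(data, axis=1))

# norm='max' divides each row by its maximum, which is what the question asks for
print(normalize(data, axis=1, norm='max'))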

To find N maximum indices of a numpy array whose corresponding values are greater than M in another array

I have 3 Numpy arrays each of length 107952899.
Let's say:
1. Time = [2.14579526e+08 2.14579626e+08 2.14579726e+08 ...1.10098692e+10 1.10098693e+10]
2. Speed = [0.66 0.66 0.66 .............0.06024864 0.06014756]
3. Brak_press = [0.3, 0.3, 0.3 .............. 0.3, 0.3]
What it means
Each index value in Time corresponds to the same index value in the Speed & Brake arrays.
Time Speed Brake
2.14579526e+08 0.66 0.3
.
.
Requirement
No. 1: I want to find the indices in the Speed array whose values are greater than 20.
No. 2: For those indices, what are the values in the Brake array?
No. 3: Now I want to find the indices of the top N maximum values in the Brake array & store them in another list/array.
So finally, if I take one index from the top N maximum indices and use it in the Brake & Speed arrays, it must show:
Brake[idx] = a valid value & more importantly Speed[idx] = a value > 20
General Summary
Simply, what I need is to find the indices of the N maximum brake points whose corresponding speed values are greater than 20.
What I tried
speed_20 = np.where(Speed > 20)  # I got indices as a tuple
brake_values = Brake[speed_20]  # Found the Brake values corresponding to the speed_20 indices
After that I tried argsort/argpartition, but none of the results matched my requirement.
Request
I believe there must be a better method to do this. Kindly shed some light.
(I converted the above np arrays to a pandas df and it works fine, but due to memory concerns I prefer to do it using numpy operations.)
You are almost there. This should do what you want:
speed_20 = np.where(Speed > 20)[0]
sort = np.argsort(-Brake[speed_20])
result = speed_20[sort[:N]]
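Since the arrays are over 10^8 elements long, np.argpartition may also be worth considering: it selects the top N without fully sorting the filtered brake values. A sketch of that variant (assuming Speed, Brake and N are defined as in the question) could be:

import numpy as np

# indices where the speed condition holds
speed_20 = np.where(Speed > 20)[0]
brake_values = Brake[speed_20]

# argpartition moves the N largest brake values into the last N positions without a full sort;
# those N positions are then sorted in descending order of brake value
top_n_unsorted = np.argpartition(brake_values, -N)[-N:]
top_n = top_n_unsorted[np.argsort(-brake_values[top_n_unsorted])]

# map back to indices into the original Time/Speed/Brake arrays
result = speed_20[top_n]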
Maybe this is an option you can consider, using NumPy.
First create a multidimensional matrix (I changed the values so it's easier to follow):
Time = [ 2, 1, 5, 4, 3]
Speed = [ 10, 20, 40, 30, 50]
Brak_press = [0.1, 0.3, 0.5, 0.4, 0.2]
data = np.array([Time, Speed, Brak_press]).transpose()
So data are stored as:
print(data)
# [[ 2. 10. 0.1]
# [ 1. 20. 0.3]
# [ 5. 40. 0.5]
# [ 4. 30. 0.4]
# [ 3. 50. 0.2]]
To extract speed greater than 20:
data[data[:,1] > 20]
# [[ 5. 40. 0.5]
# [ 4. 30. 0.4]
# [ 3. 50. 0.2]]
To get the n greatest Brak_press:
n = 2
data[data[:,2].argsort()[::-1][:n]]
# [[ 5. 40. 0.5]
# [ 4. 30. 0.4]]
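Combining the two steps, filtering by speed first and then taking the n rows with the largest Brak_press, could look like this (same data array as above):

# keep rows with speed > 20, then pick the n rows with the largest brake pressure
filtered = data[data[:, 1] > 20]
n = 2
top_n_rows = filtered[filtered[:, 2].argsort()[::-1][:n]]
print(top_n_rows)
# [[ 5. 40.  0.5]
#  [ 4. 30.  0.4]]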

numpy interpolation to increase a vector size

Hi, I have to increase the number of points inside a vector in order to enlarge it to a fixed size. For example:
for this simple vector:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> len(a)
# 6
now, I want to get a vector of size 11, taking the vector a as the base; the result will be
# array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
EDIT 1
What I need is a function that takes the base vector and the number of values the resulting vector must have, and returns a new vector with size equal to that parameter. Something like:
def enlargeVector(vector, size):
    .....
    return newVector
to use like:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> b = enlargeVector(a, 200)
>>> len(b)
# 200
and b contains the results of linear, cubic, or whatever interpolation method.
There are many methods to do this within scipy.interpolate. My favourite is UnivariateSpline, which produces an order-k spline guaranteed to be continuously differentiable k-1 times.
To use it:
import numpy as np
from scipy.interpolate import UnivariateSpline

a = np.array([0, 1, 2, 3, 4, 5])

old_indices = np.arange(0, len(a))
new_length = 11
new_indices = np.linspace(0, len(a) - 1, new_length)

spl = UnivariateSpline(old_indices, a, k=3, s=0)
new_array = spl(new_indices)
The s is a smoothing factor that you should set to 0 in this case (since the data are exact).
Note that for the problem you have specified (since a just increases monotonically by 1), this is overkill, since the np.linspace call already gives the desired output.
EDIT: clarified that the length is arbitrary
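For a purely linear resize, a sketch of the enlargeVector wrapper asked for in the question could also be built on np.interp (this wrapper is my own, not from the answer above):

import numpy as np

def enlargeVector(vector, size):
    # map the new sample positions onto the old index range and interpolate linearly
    old_indices = np.arange(len(vector))
    new_indices = np.linspace(0, len(vector) - 1, size)
    return np.interp(new_indices, old_indices, vector)

a = np.array([0, 1, 2, 3, 4, 5])
print(enlargeVector(a, 11))
# [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]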
As AGML pointed out, there are tools to do this, but how about a pure numpy solution:
In [20]: a = np.arange(6)
In [21]: temp = np.dstack((a[:-1], a[:-1] + np.diff(a) / 2.0)).ravel()
In [22]: temp
Out[22]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In [23]: np.hstack((temp, [a[-1]]))
Out[23]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

Vectorize Operations in Numpy for Two Dependent Arrays

I have an n x n numpy array that contains all pairwise distances and another 1 x n array that contains some scoring metric.
Example:
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
array([[ 0. , 3.2, 4.1, 8.8, 0.6],
[ 3.2, 0. , 1.5, 9. , 5. ],
[ 4.1, 1.5, 0. , 9.9, 10. ],
[ 8.8, 9. , 9.9, 0. , 1.1],
[ 0.6, 5. , 10. , 1.1, 0. ]])
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
array([ 19. , 1.3, 4.8, 6.2, 5.7])
So, note that the ith element of the score array corresponds to the ith row of the distance array.
What I need to do is vectorize this process:
For the ith value in the score array, find all other values that are larger than the ith value and note their indices
Then, in the ith row of the distance array, get all of the distances with the same indices as noted in step 1. above and return the smallest distance
In the case where the ith value in the score array is the largest, then the smallest distance is set as the largest distance found in the distance array
Here is an un-vectorized version:
n = score.shape[0]
min_dist = np.full(n, np.max(dists))
for i in range(score.shape[0]):
inx = numpy.where(score > score[i])
if len(inx[0]) > 0:
min_dist[i] = np.min(dists[i, inx])
min_dist
array([ 10. , 1.5, 4.1, 8.8, 0.6])
This works but is pretty inefficient in terms of speed and my arrays are expected to be much, much larger. I am hoping to improve the efficiency by using faster vectorized operations to achieve the same result.
Update: Based on Oliver W.'s answer, I came up with my own that doesn't require making a copy of the distance array
def new_method(dists, score):
    mask = score > score.reshape(-1, 1)
    return np.ma.masked_array(dists, mask=~mask).min(axis=1).filled(dists.max())
One could in theory make it a one-liner but it's already a bit challenging to read to the untrained eye.
One possible vectorized solution is given below.
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
def your_method(dists, score):
    dim = score.shape[0]
    min_dist = np.full(dim, np.max(dists))
    for i in range(dim):
        inx = np.where(score > score[i])
        if len(inx[0]) > 0:
            min_dist[i] = np.min(dists[i, inx])
    return min_dist

def vectorized_method_v1(dists, score):
    mask = score > score.reshape(-1, 1)
    dists2 = dists.copy()  # get rid of this in case the dists array can be changed
    dists2[np.logical_not(mask)] = dists.max()
    return dists2.min(axis=1)
The speed gain is not so impressive for these small arrays (~factor of 3 on my machine), so I'll demonstrate with a larger set:
dists = scipy.spatial.distance.squareform(np.random.random(50*99))
score = np.random.random(dists.shape[0])
print(dists.shape)
%timeit your_method(dists, score)
%timeit vectorized_method_v1(dists, score)
(100, 100)
100 loops, best of 3: 2.98 ms per loop
10000 loops, best of 3: 125 µs per loop
Which is close to a factor of 24.
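As a quick sanity check (reusing the dists, score, your_method, vectorized_method_v1 and new_method definitions from above), all three approaches can be compared directly:

print(np.allclose(your_method(dists, score), vectorized_method_v1(dists, score)))  # True
print(np.allclose(your_method(dists, score), new_method(dists, score)))            # True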
