Related
I have a signal where I want to find the average height of the values. This is done by finding the zero crossings and calculating the max and min between each zero crossing, then averaging these values.
My problem occurs when I want to use np.where() to find where the signal is crossing zero. When I use np.where() I get the result in a tuple, but I want it in an array where I can count the amount of times zero is crossed.
I am new to Python and coming from Matlab it is a bit confusing with all the different classes. As you can see, I get an error because nu = len(zero_u) gives 1 as a result, because the whole array is written in a tuple as one element.
Any ideas how to go around this?
The code looks like this:
import numpy as np
def averageheight(f):
rms = np.std(f)
f = f + (rms * 10**-6)
# Find zero crossing
fsign = np.sign(f)
fdiff = np.diff(fsign)
zero_u = np.asarray(np.where(fdiff > 0)) + 1
zero_d = np.asarray(np.where(fdiff < 0)) + 1
nu = len(zero_u)
nd = len(zero_d)
value_max = np.zeros((nu, 1))
value_min = np.zeros((nu, 1))
imaxvec = np.zeros((nu, 1))
iminvec = np.zeros((nu, 1))
if (nu > 2) and (nd > 2):
if zero_u[0] > zero_d[0]:
zero_d[0] = []
nu = len(zero_u)
nd = len(zero_d)
ncross = np.fmin(nu, nd)
# Find Maxima:
for ic in range(0, ncross - 1):
up = int(zero_u[ic])
down = int(zero_d[ic])
fvec = f[up:down]
value_max[ic] = np.amax(fvec)
index_max = value_max.argmax()
imaxvec[ic] = up + index_max - 1
# Find Minima:
for ic in range(0, ncross - 2):
down = int(zero_d[ic])
up = int(zero_u[ic+1])
fvec = f[down:up]
value_min[ic] = np.amin(fvec)
index_min = value_min.argmin()
iminvec[ic] = down + index_min - 1
# Remove spurious values, bumps and zero_d
thr = rms/3
maxfind = np.where(value_max < thr)
for i in range(0, len(maxfind)):
imaxfind = np.where(value_max == maxfind[i])
imaxvec[imaxfind] = 0
value_max[imaxfind] = 0
minfind = np.where(value_min > -thr)
for j in range(0, len(minfind)):
iminfind = np.where(value_min == minfind[j])
value_min[iminfind] = 0
iminvec[iminfind] = 0
# Find Average Height
avh = np.mean(value_max) - np.mean(value_min)
else:
avh = 0
return avh
np.where, and np.nonzero even more so, clearly explains that it returns a tuple, with one array for each dimension of the condition array:
In [71]: arr = np.random.randint(-5,5,10)
In [72]: arr
Out[72]: array([ 3, 4, 2, -3, -1, 0, -5, 4, 2, -3])
In [73]: arr.shape
Out[73]: (10,)
In [74]: np.where(arr>=0)
Out[74]: (array([0, 1, 2, 5, 7, 8]),)
In [75]: arr[_]
Out[75]: array([3, 4, 2, 0, 4, 2])
That Out[74] tuple can be used directly as an index.
You can also extract the array from the tuple:
In [76]: np.where(arr>=0)[0]
Out[76]: array([0, 1, 2, 5, 7, 8])
That, I think is a better choice than the np.asarray(np.where(...))
This convention for where becomes clearer when we use it on a 2d array
In [77]: arr2 = arr.reshape(2,5)
In [78]: np.where(arr2>=0)
Out[78]: (array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 2, 3]))
In [79]: arr2[_]
Out[79]: array([3, 4, 2, 0, 4, 2])
Again we are indexing with a tuple. arr2[1,3] is really arr2[(1,3)]. The values in [] indexing brackets are actually passed to the indexing function as a tuple of values.
np.argwhere applies transpose to the result of where, producing an array:
In [80]: np.transpose(np.where(arr2>=0))
Out[80]:
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 2],
[1, 3]])
That's the same indexing arrays, but arranged in a 2d column matrix.
If you need the count of where without the actual values, a slightly faster function is
In [81]: np.count_nonzero(arr>=0)
Out[81]: 6
In fact np.nonzero uses the count to first determine the size of the arrays that it will return.
I want to get the rank of each element, so I use argsort in numpy:
np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
it give the same element the different rank, can I get the same rank like:
array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't mind a dependency on scipy, you can use scipy.stats.rankdata, with method='min':
In [14]: a
Out[14]: array([1, 1, 1, 2, 2, 3, 3, 3, 3])
In [15]: from scipy.stats import rankdata
In [16]: rankdata(a, method='min')
Out[16]: array([1, 1, 1, 4, 4, 6, 6, 6, 6])
Note that rankdata starts the ranks at 1. To start at 0, subtract 1 from the result:
In [17]: rankdata(a, method='min') - 1
Out[17]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't want the scipy dependency, you can use numpy.unique to compute the ranking. Here's a function that computes the same result as rankdata(x, method='min') - 1:
import numpy as np
def rankmin(x):
u, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
csum = np.zeros_like(counts)
csum[1:] = counts[:-1].cumsum()
return csum[inv]
For example,
In [137]: x = np.array([60, 10, 0, 30, 20, 40, 50])
In [138]: rankdata(x, method='min') - 1
Out[138]: array([6, 1, 0, 3, 2, 4, 5])
In [139]: rankmin(x)
Out[139]: array([6, 1, 0, 3, 2, 4, 5])
In [140]: a = np.array([1,1,1,2,2,3,3,3,3])
In [141]: rankdata(a, method='min') - 1
Out[141]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
In [142]: rankmin(a)
Out[142]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
By the way, a single call to argsort() does not give ranks. You can find an assortment of approaches to ranking in the question Rank items in an array using Python/NumPy, including how to do it using argsort().
Alternatively, pandas series has a rank method which does what you need with the min method:
import pandas as pd
pd.Series((1,1,1,2,2,3,3,3,3)).rank(method="min")
# 0 1
# 1 1
# 2 1
# 3 4
# 4 4
# 5 6
# 6 6
# 7 6
# 8 6
# dtype: float64
With focus on performance, here's an approach -
def rank_repeat_based(arr):
idx = np.concatenate(([0],np.flatnonzero(np.diff(arr))+1,[arr.size]))
return np.repeat(idx[:-1],np.diff(idx))
For a generic case with the elements in input array not already sorted, we would need to use argsort() to keep track of the positions. So, we would have a modified version, like so -
def rank_repeat_based_generic(arr):
sidx = np.argsort(arr,kind='mergesort')
idx = np.concatenate(([0],np.flatnonzero(np.diff(arr[sidx]))+1,[arr.size]))
return np.repeat(idx[:-1],np.diff(idx))[sidx.argsort()]
Runtime test
Testing out all the approaches listed thus far to solve the problem on a large dataset.
Sorted array case :
In [96]: arr = np.sort(np.random.randint(1,100,(10000)))
In [97]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 635 µs per loop
In [98]: %timeit rankmin(arr)
1000 loops, best of 3: 495 µs per loop
In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 826 µs per loop
In [100]: %timeit rank_repeat_based(arr)
10000 loops, best of 3: 200 µs per loop
Unsorted case :
In [106]: arr = np.random.randint(1,100,(10000))
In [107]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 963 µs per loop
In [108]: %timeit rankmin(arr)
1000 loops, best of 3: 869 µs per loop
In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 1.17 ms per loop
In [110]: %timeit rank_repeat_based_generic(arr)
1000 loops, best of 3: 1.76 ms per loop
I've written a function for the same purpose. It uses pure python and numpy only. Please have a look. I put comments as well.
def my_argsort(array):
# this type conversion let us work with python lists and pandas series
array = np.array(array)
# create mapping for unique values
# it's a dictionary where keys are values from the array and
# values are desired indices
unique_values = list(set(array))
mapping = dict(zip(unique_values, np.argsort(unique_values)))
# apply mapping to our array
# np.vectorize works similar map(), and can work with dictionaries
array = np.vectorize(mapping.get)(array)
return array
Hope that helps.
Complex solutions are unnecessary for this problem.
> ary = np.sort([1, 1, 1, 2, 2, 3, 3, 3, 3]) # or anything; must be sorted.
> a = np.diff().cumsum(); a
array([0, 0, 1, 1, 2, 2, 2, 2])
> b = np.r_[0, a]; b # ties get first open rank
array([0, 0, 0, 1, 1, 2, 2, 2, 2])
> c = np.flatnonzero(ary[1:] != ary[:-1])
> np.r_[0, 1 + c][b] # ties get last open rank
array([0, 0, 0, 3, 3, 5, 5, 5])
How can I convert the duplicate elements in a array 'data' into 0? It has to be done row-wise.
data = np.array([[1,8,3,3,4],
[1,8,9,9,4]])
The answer should be as follows:
ans = array([[1,8,3,0,4],
[1,8,9,0,4]])
Approach #1
One approach with np.unique -
# Find out the unique elements and their starting positions
unq_data, idx = np.unique(data,return_index=True)
# Find out the positions for each unique element, their duplicate positions
dup_idx = np.setdiff1d(np.arange(data.size),idx)
# Set those duplicate positioned elemnents to 0s
data[dup_idx] = 0
Sample run -
In [46]: data
Out[46]: array([1, 8, 3, 3, 4, 1, 3, 3, 9, 4])
In [47]: unq_data, idx = np.unique(data,return_index=True)
...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
...: data[dup_idx] = 0
...:
In [48]: data
Out[48]: array([1, 8, 3, 0, 4, 0, 0, 0, 9, 0])
Approach #2
You can also use sorting and differentiation as a faster approach -
# Get indices for sorted data
sort_idx = np.argsort(data)
# Get duplicate indices and set those in data to 0s
dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
data[dup_idx] = 0
Runtime tests -
In [110]: data = np.random.randint(0,100,(10000))
...: data1 = data.copy()
...: data2 = data.copy()
...:
In [111]: def func1(data):
...: unq_data, idx = np.unique(data,return_index=True)
...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
...: data[dup_idx] = 0
...:
...: def func2(data):
...: sort_idx = np.argsort(data)
...: dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
...: data[dup_idx] = 0
...:
In [112]: %timeit func1(data1)
1000 loops, best of 3: 1.36 ms per loop
In [113]: %timeit func2(data2)
1000 loops, best of 3: 467 µs per loop
Extending to a 2D case :
Approach #2 could be extended to work for a 2D array case, avoiding any loop like so -
# Get indices for sorted data
sort_idx = np.argsort(data,axis=1)
# Get sorted linear indices
row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
sort_lin_idx = sort_idx[:,1::] + row_offset
# Get duplicate linear indices and set those in data as 0s
dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
data.ravel()[dup_lin_idx] = 0
Sample run -
In [6]: data
Out[6]:
array([[1, 8, 3, 3, 4, 0, 3, 3],
[1, 8, 9, 9, 4, 8, 7, 9],
[1, 8, 9, 9, 4, 8, 7, 3]])
In [7]: sort_idx = np.argsort(data,axis=1)
...: row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
...: sort_lin_idx = sort_idx[:,1::] + row_offset
...: dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
...: data.ravel()[dup_lin_idx] = 0
...:
In [8]: data
Out[8]:
array([[1, 8, 3, 0, 4, 0, 0, 0],
[1, 8, 9, 0, 4, 0, 7, 0],
[1, 8, 9, 0, 4, 0, 7, 3]])
Here's a simple pure-Python way to do it:
seen = set()
for i, x in enumerate(data):
if x in seen:
data[i] = 0
else:
seen.add(x)
You could use a nested for loop, where you compare each element of the array to every other element to check for duplicate records. Syntax might be a bit off as I am not really familiar with numpy.
for x in range(0, len(data))
for y in range(x+1, len(data))
if(data[x] == data[y])
data[x] = 0
#Divakar has it almost right, but there are a few things that can be further optimized, but don't really fit in a comment. To begin:
rows, cols = data.shape
The first operation is to sort the array to identify the duplicates. Since we will want to undo the sorting, we need to use np.argsort, but if you want to make sure that it is the first occurrence of each repeated value that is kept, you need to use a stable sorting algorithm:
sort_idx = data.argsort(axis=1, kind='mergesort')
Once we have the indices to sort data, to get a sorted copy of the array it is faster to use the indices than to re-sort the array:
sorted_data = data[np.arange(rows)[:, None], sort_idx]
While the principle is similar to that in using np.diff, it is typically faster to use boolean operations. We want an array full of False where the first occurrences of each value happen, and True where the duplicates are:
sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
sorted_data[:, :-1] == sorted_data[:, 1:]),
axis=1)
We now use that mask to set all the duplicates to zero:
sorted_data[sorted_mask] = 0
And we finally undo the sorting. To revert a permutation you can sort the indices that define it, i.e. you could do:
invert_idx = sort_idx.argsort(axis=1, kind='mergesort')
ans = sorted_data[np.arange(rows)[:, None], invert_idx]
But it is more efficient to use assignment, i.e.:
ans = np.empty_like(data)
ans[np.arange(rows), sort_idx] = sorted_data
Putting it all together:
def zero_dups(data):
rows, cols = data.shape
sort_idx = data.argsort(axis=1, kind='mergesort')
sorted_data = data[np.arange(rows)[:, None], sort_idx]
sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
sorted_data[:, :-1] == sorted_data[:, 1:]),
axis=1)
sorted_data[sorted_mask] = 0
ans = np.empty_like(data)
ans[np.arange(rows)[:, None], sort_idx] = sorted_data
return ans
Given a numpy array 'x' and a hop size 'N', I have to create a function that will return a numpy.ndarray with the values of 'x' that fit the hop size, for example, if x = [0,1,2,3,4,5,6,7,8,9] and N = 2, the function would return output = [0,2,4,6,8]. So far I have thought of the following:
def hopSamples(x,N)
i = 0
n = len(x)
output = numpy.ndarray([])
while i<n:
output.append(x[i])
i = i+N
return output
but it gives errors. How can I manage this? I am just starting python so I am sure there will be plenty of errors, so any help will be very much appreciated!
You can use slicing:
In [14]: arr = np.arange(0, 10)
In [15]: arr
Out[15]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [16]: arr[::2]
Out[16]: array([0, 2, 4, 6, 8])
Thus, your function would simply look like this:
def hopSamples1(x, N):
return x[::N]
If you're insistent on declaring an empty array beforehand and filling it using a loop, you can alter your function a bit to do one of the following.
You can initializes an empty array and extend it by another cell with each iteration of the loop. Note that a new array is created and returned each time.
def hopSamples2(x, N):
i = 0
n = len(x)
output = np.empty(shape = 0, dtype = x.dtype)
while i < n:
output = np.append(output, x[i])
i += N
return output
An alternative implementation would be creating the entire array beforehand, but setting the values into its cells one by one.
def hopSamples3(x, N):
i = 0
n = len(x)
m = n / N
output = np.ndarray(shape = m, dtype = x.dtype)
while i < m:
output[i] = x[i * N]
i += 1
return output
A simple benchmark test shows that using slicing is the quickest approach while extending the array one by one is the slowest:
In [146]: %time hopSamples1(arr, 2)
CPU times: user 21 µs, sys: 3 µs, total: 24 µs
Wall time: 28.8 µs
Out[146]: array([0, 2, 4, 6, 8])
In [147]: %time hopSamples2(arr, 2)
CPU times: user 241 µs, sys: 29 µs, total: 270 µs
Wall time: 230 µs
Out[147]: array([0, 2, 4, 6, 8])
In [148]: %time hopSamples3(arr, 2)
CPU times: user 35 µs, sys: 5 µs, total: 40 µs
Wall time: 45.8 µs
Out[148]: array([0, 2, 4, 6, 8])
import numpy as np
a = np.array([0,1,2,3,4,5,6,7,8,9,10])
print "Please input a step number: "
N = int(raw_input())
b = a[::N]
print "b is: ", b
use numpy slicing, basically, start:stop:step:
In [20]: xs
Out[20]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [21]: xs[::2]
Out[21]: array([0, 2, 4, 6, 8])
I want to find the minimum-sized 2-dimensional ndarray within an ndarray that contains all values meeting a condition.
For example:
Let's say I have the array
x = np.array([[1, 1, 5, 3, 11, 1],
[1, 2, 15, 19, 21, 33],
[1, 8, 17, 22, 21, 31],
[3, 5, 6, 11, 23, 19]])
and call f(x, x % 2 == 0)
Then the return value of the program would be the array
[[2, 15, 19]
[8, 17, 22]
[5, 6, 11]]
Because it is the smallest rectangular array that includes all the even numbers (the condition).
I've found a way to find all the indices for which the condition is true by using np.argwhere and then slicing from the minimum to maximum indices from the original array, and I've done it using a for loop but I was wondering if there was a more efficient way to do it using numpy or scipy.
My current method:
def f(arr, cond_arr):
indices = np.argwhere(cond_arr)
min = np.amin(indices, axis = 0) #get first row, col meeting cond
max = np.amax(indices, axis = 0) #get last row, col meeting cond
return arr[min[0]:max[0] + 1, min[1] : max[1] + 1]
The function is pretty efficient already - but you can do better.
Instead of checking the condition for every row/column and then finding the minimum and maximum, we can collapse the condition into each axis (using reduction with the logical OR) and find the first/last indices:
def f2(arr, cond_arr):
c0 = np.where(np.logical_or.reduce(cond_arr, axis=0))[0]
c1 = np.where(np.logical_or.reduce(cond_arr, axis=1))[0]
return arr[c0[0]:c0[-1] + 1, c1[0]:c1[-1] + 1]
How it works:
With the example data cond_array looks like this:
>>> (x%2==0).astype(int)
array([[0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 0, 0]])
This are the column conditions:
>>> np.logical_or.reduce(cond_arr, axis=0).astype(int)
array([0, 1, 1, 1, 0, 0])
And this the row conditions:
>>> np.logical_or.reduce(cond_arr, axis=).astype(int)
array([0, 1, 1, 1])
Now we only need to find the first/last nonzero element for each of the two arrays.
Is it really faster?
%timeit f(x, x%2 == 0) # 10000 loops, best of 3: 24.6 µs per loop
%timeit f2(x, x%2 == 0) # 100000 loops, best of 3: 12.6 µs per loop
Well, a little bit... but it really shines with larger arrays:
x = np.random.randn(1000, 1000)
c = np.zeros((1000, 1000), dtype=bool)
c[400:600, 400:600] = True
%timeit f(x,c) # 100 loops, best of 3: 5.28 ms per loop
%timeit f2(x,c) # 1000 loops, best of 3: 225 µs per loop
Finally, this version has slightly more overhead but is generic over the number of dimensions:
def f3(arr, cond_arr):
s = []
for a in range(arr.ndim):
c = np.where(np.logical_or.reduce(cond_arr, axis=a))[0]
s.append(slice(c[0], c[-1] + 1))
return arr[s]