Digitize numpy array - python

I have two vectors:
time_vec = np.array([0.2,0.23,0.3,0.4,0.5,...., 28....])
values_vec = np.array([500,200,220,250,200,...., 218....])
time_vec.shape == values_vec.shape
Now, I want to bin the values into 0.5-second intervals and take the mean of the values in each bin. So, for example:
value_vec = np.array(mean_of(500,200,220,250,200), mean_of(next values in next 0.5 second interval))
Is there a numpy method I am missing that bins the values and takes the mean of each bin?

You may use np.ufunc.reduceat. You just need to find the break points, i.e. the indices where floor(t / .5) changes:
say for:
>>> t
array([ 0. , 0.025 , 0.2125, 0.2375, 0.2625, 0.3375, 0.475 , 0.6875, 0.7 , 0.7375, 0.8 , 0.9 ,
0.925 , 1.05 , 1.1375, 1.15 , 1.1625, 1.1875, 1.1875, 1.225 ])
>>> b
array([ 0.8144, 0.3734, 1.4734, 0.6307, -0.611 , -0.8762, 1.6064, 0.3863, -0.0103, -1.6889, -0.4328, -0.7373,
1.7856, 0.8938, -1.1574, -0.4029, -0.4352, -0.4412, -1.7819, -0.3298])
the break points are:
>>> i = np.r_[0, 1 + np.nonzero(np.diff(np.floor(t / .5)))[0]]
>>> i
array([ 0, 7, 13])
and the sum over each interval is:
>>> np.add.reduceat(b, i)
array([ 3.411 , -0.6975, -3.6545])
and the mean is the sum divided by the length of each interval:
>>> np.add.reduceat(b, i) / np.diff(np.r_[i, len(b)])
array([ 0.4873, -0.1162, -0.5221])
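Applied to the arrays in the question, a minimal sketch would be (the sample numbers here are made up for illustration, and time_vec is assumed to be sorted):
import numpy as np
time_vec = np.array([0.2, 0.23, 0.3, 0.4, 0.5, 0.7, 1.1, 1.2])
values_vec = np.array([500, 200, 220, 250, 200, 180, 300, 320])
# break points: indices where floor(t / 0.5) moves to the next 0.5 s interval
i = np.r_[0, 1 + np.nonzero(np.diff(np.floor(time_vec / 0.5)))[0]]
# mean per interval = sum per interval / number of samples per interval
mean_vec = np.add.reduceat(values_vec, i) / np.diff(np.r_[i, len(values_vec)])
# mean_vec -> array([292.5, 190. , 310. ])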

You can pass a weights= parameter to np.histogram to compute the summed values within each time bin, then normalize by the bin count:
# 0.5 second time bins to average within
tmin = time_vec.min()
tmax = time_vec.max()
# extend the last edge past tmax so samples in the final partial interval are not dropped
bins = np.arange(tmin - (tmin % 0.5), tmax - (tmax % 0.5) + 1.0, 0.5)
# summed values within each bin
bin_sums, edges = np.histogram(time_vec, bins=bins, weights=values_vec)
# number of values within each bin
bin_counts, edges = np.histogram(time_vec, bins=bins)
# average value within each bin
bin_means = bin_sums / bin_counts
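One caveat: any 0.5 s bin that contains no samples has bin_counts == 0, so the final division emits a warning and produces NaN there. If that can happen with your data, a small guard (a sketch, not part of the original answer) is:
# leave empty bins as NaN instead of dividing by zero
bin_means = np.full(bin_sums.shape, np.nan)
np.divide(bin_sums, bin_counts, out=bin_means, where=bin_counts > 0)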

You can use np.bincount, which is supposedly pretty efficient for such binning operations. Here's an implementation based on it to solve our case -
# Find indices where 0.5 intervals shifts onto next ones
A = time_vec*2
idx = np.searchsorted(A,np.arange(1,int(np.ceil(A.max()))),'right')
# Setup ID array such that all 0.5 intervals are ID-ed same
out = np.zeros((A.size),dtype=int)
out[idx[idx < A.size]] = 1
ID = out.cumsum()
# Finally use bincount to sum and count elements of same IDs
# and thus get mean values per ID
mean_vec = np.bincount(ID,values_vec)/np.bincount(ID)
Sample run -
In [189]: time_vec
Out[189]:
array([ 0.2 , 0.23, 0.3 , 0.4 , 0.5 , 0.7 , 0.8 , 0.92, 0.95,
1. , 1.11, 1.5 , 2. , 2.3 , 2.5 , 4.5 ])
In [190]: values_vec
Out[190]: array([36, 11, 93, 32, 72, 75, 26, 28, 77, 31, 60, 77, 76, 32, 6, 85])
In [191]: ID
Out[191]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5], dtype=int32)
In [192]: mean_vec
Out[192]: array([ 48.8, 47.4, 68.5, 76. , 19. , 85. ])
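An equivalent sketch of my own (not part of the answer above) builds the ID array directly from right-closed 0.5 s intervals and re-numbers it with np.unique; for the sample data it reproduces the same ID and mean_vec, though values that sit exactly on a boundary (e.g. t = 0.5) can be sensitive to floating-point representation:
# label each sample with its interval (k*0.5, (k+1)*0.5], then drop gaps from empty intervals
raw_id = np.ceil(time_vec / 0.5).astype(int) - 1
ID = np.unique(raw_id, return_inverse=True)[1]
mean_vec = np.bincount(ID, values_vec) / np.bincount(ID)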

Related

Numpy function that utilizes index as it iterates Python

How can I write numpy code that takes in an array and calculates the running percentage of entries that are positive, working from the start of the array to the end? Going through a, for the first and second indices the calculation is positive_count/index * 100: since 12 is positive it is 1/1 * 100 = 100, then 2/2 * 100 = 100, until it reaches the negative value at the third index, where it becomes 2/3 * 100. The percentage has gone down since only 2 out of the 3 entries checked so far were positive. How can I do this and get the Expected Output below, preferably without a for loop?
import numpy as np
a = np.array([12, 23,-12 ,2 ,-1 ,-44, 8, -9, 1.45])
b = np.array([-12.2, -1.45, 0.74, -88])
Expected Output
[100, 100, 66.6, 75, 60, 50, 57.1, 50, 55.5]
[0, 0, 33.3, 25]
You can use:
(a > 0).cumsum() / np.arange(1, len(a)+1) * 100
array([100. , 100. , 66.66666667, 75. ,
60. , 50. , 57.14285714, 50. ,
55.55555556])
(a > 0).cumsum() gives the running count of positives while np.arange(1, len(a)+1) gives the running count of total elements. Dividing them gives the percentage.
Similarly for b:
(b > 0).cumsum() / np.arange(1, len(b)+1) * 100
array([ 0. , 0. , 33.33333333, 25. ])
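Wrapped into a small helper (the name is my own), the same idea works for any 1-D input; note that np.round gives 66.7 / 55.6 where the expected output shows the truncated 66.6 / 55.5:
import numpy as np

def running_pct_positive(x):
    # cumulative count of positives divided by cumulative count of elements
    x = np.asarray(x)
    return (x > 0).cumsum() / np.arange(1, x.size + 1) * 100

a = np.array([12, 23, -12, 2, -1, -44, 8, -9, 1.45])
print(np.round(running_pct_positive(a), 1))
# [100.  100.   66.7  75.   60.   50.   57.1  50.   55.6]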

How to set threshold values in a numpy array?

I have an array of values and I want to set specific values to integers. Anything below 0.95 set to 0, anything above 1.6 set to 2. How can I set everything between 0.95 and 1.6 to 1?
n1_binary = np.where(n1_img_resize < 0.95, 0, n1_img_resize)
n1_binary = np.where(n1_binary > 1.6, 2, n1_binary)
Like this in one line using np.where:
n1_binary = np.where((n1_binary > 0.95) & (n1_binary <= 1.6), 1, n1_binary)
Check below example:
In [652]: a = np.array([0.99, 1.23, 1.7, 9])
In [653]: a = np.where((a > 0.95) & (a <= 1.6), 1, a)
In [654]: a
Out[654]: array([1. , 1. , 1.7, 9. ])
Try this:
a = np.array([0.3, 5, 7, 2])
a[a < 0.95] = 0
a[a > 1.6] = 2
a[(a >= 0.95) & (a <= 1.6)] = 1
This is very clear and concise, and tells exactly what you are doing. a now is:
[0.0, 2.0, 2.0, 2.0]
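Another hedged option (not from either answer) is np.digitize, which maps the three ranges straight to the labels 0, 1, 2 in one call, assuming the same 0.95 / 1.6 cut-offs:
import numpy as np
a = np.array([0.3, 0.99, 1.23, 1.7, 9])
# 0 for x <= 0.95, 1 for 0.95 < x <= 1.6, 2 for x > 1.6
labels = np.digitize(a, [0.95, 1.6], right=True)
# labels -> array([0, 1, 1, 2, 2])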

Finding percentage change with Numpy

I'm writing a function to find the percentage change using Numpy and function calls. So far what I got is:
def change(a,b):
    answer = (np.subtract(a[b+1], a[b])) / a[b+1] * 100
    return answer
print(change(a,0))
"a" is the array I have made and b will be the index/numbers I am trying to calculate.
For example:
My Array is
[[1, 2, 3, 5, 7],
 [1, 4, 5, 6, 7],
 [5, 8, 9, 10, 32],
 [3, 5, 6, 13, 11]]
How would I calculate the percentage change between 1 and 2 (= 0.5), or 1 and 4 (= 0.75), or 5 and 7, etc.?
Note: I know how mathematically to get the change, I'm not sure how to do this in python/ numpy.
If I understand correctly, that you're trying to find percent change in each row, then you can do:
>>> np.diff(a) / a[:,1:] * 100
Which gives you:
array([[ 50. , 33.33333333, 40. , 28.57142857],
[ 75. , 20. , 16.66666667, 14.28571429],
[ 37.5 , 11.11111111, 10. , 68.75 ],
[ 40. , 16.66666667, 53.84615385, -18.18181818]])
I know you have asked this question with Numpy in mind and got answers above:
import numpy as np
np.diff(a) / a[:,1:]
I attempted to solve this with Pandas, for those who have the same question but want to use Pandas instead of NumPy:
import pandas as pd
data = [[1,2,3,4,5],
[1,4,5,6,7],
[5,8,9,10,32],
[3,5,6,13,11]]
df = pd.DataFrame(data)
df_change = df.rolling(1,axis=1).sum().pct_change(axis=1)
print(df_change)
I suggest simply shifting the array. The computation then basically becomes a one-liner.
import numpy as np
arr = np.array(
[
[1, 2, 3, 5, 7],
[1, 4, 5, 6, 7],
[5, 8, 9, 10, 32],
[3, 5, 6, 13, 11],
]
)
# Percentage change from row to row
pct_chg_row = arr[1:] / arr[:-1] - 1
[[ 0. 1. 0.66666667 0.2 0. ]
[ 4. 1. 0.8 0.66666667 3.57142857]
[-0.4 -0.375 -0.33333333 0.3 -0.65625 ]]
# Percentage change from column to column
pct_chg_col = arr[:, 1::] / arr[:, 0:-1] - 1
[[ 1. 0.5 0.66666667 0.4 ]
[ 3. 0.25 0.2 0.16666667]
[ 0.6 0.125 0.11111111 2.2 ]
[ 0.66666667 0.2 1.16666667 -0.15384615]]
You can easily generalize the task so that you are not limited to computing the change from one row/column to the next, but can compute the change over n rows/columns.
n = 2
pct_chg_row_generalized = arr[n:] / arr[:-n] - 1
[[4. 3. 2. 1. 3.57142857]
[2. 0.25 0.2 1.16666667 0.57142857]]
pct_chg_col_generalized = arr[:, n:] / arr[:, :-n] - 1
[[2. 1.5 1.33333333]
[4. 0.5 0.4 ]
[0.8 0.25 2.55555556]
[1. 1.6 0.83333333]]
If the output array must have the same shape as the input array, you need to make sure to insert the appropriate number of np.nan.
out_row = np.full_like(arr, np.nan, dtype=float)
out_row[n:] = arr[n:] / arr[:-n] - 1
[[ nan nan nan nan nan]
[ nan nan nan nan nan]
[4. 3. 2. 1. 3.57142857]
[2. 0.25 0.2 1.16666667 0.57142857]]
out_col = np.full_like(arr, np.nan, dtype=float)
out_col[:, n:] = arr[:, n:] / arr[:, :-n] - 1
[[ nan nan 2. 1.5 1.33333333]
[ nan nan 4. 0.5 0.4 ]
[ nan nan 0.8 0.25 2.55555556]
[ nan nan 1. 1.6 0.83333333]]
Finally, a small function for the general 2D case might look like this:
def np_pct_chg(arr: np.ndarray, n: int = 1, axis: int = 0) -> np.ndarray:
    out = np.full_like(arr, np.nan, dtype=float)
    if axis == 0:
        out[n:] = arr[n:] / arr[:-n] - 1
    elif axis == 1:
        out[:, n:] = arr[:, n:] / arr[:, :-n] - 1
    return out
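For example, calling it with the array defined above reproduces the generalized column-wise result shown earlier:
print(np_pct_chg(arr, n=2, axis=1))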
The accepted answer is close but incorrect if you're trying to take % difference from left to right.
You should get the following percent difference:
1,2,3,5,7 --> 100%, 50%, 66.66%, 40%
check for yourself: https://www.calculatorsoup.com/calculators/algebra/percent-change-calculator.php
Going by what Josmoor98 said, you can use np.diff(a) / a[:,:-1] * 100 to get the percent difference from left to right, which will give you the correct answer.
array([[100. , 50. , 66.66666667, 40. ],
[300. , 25. , 20. , 16.66666667],
[ 60. , 12.5 , 11.11111111, 220. ],
[ 66.66666667, 20. , 116.66666667, -15.38461538]])
import numpy as np
a = np.array([[1,2,3,5,7],
[1,4,5,6,7],
[5,8,9,10,32],
[3,5,6,13,11]])
np.array([(i[:-1]/i[1:]) for i in a])
Combine all your arrays.
Then make a DataFrame from them.
df = pd.DataFrame(data=array you made)
Use the pct_change() method on the DataFrame. It will calculate the % change for all rows in the DataFrame.
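A minimal sketch of that suggestion with the array from the question; note that pct_change measures change relative to the previous column, so the first column is NaN and the remaining values match the left-to-right percentages shown above:
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 5, 7],
                   [1, 4, 5, 6, 7],
                   [5, 8, 9, 10, 32],
                   [3, 5, 6, 13, 11]])
print(df.pct_change(axis=1) * 100)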

Cosine Similarity

I was reading and came across this formula:
The formula is for cosine similarity. I thought this looked interesting and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entries inside the matrix are the ratings that person u has given to item i, based on row u and column i. I want to calculate this cosine similarity for the matrix between items (columns). This should yield a 5 x 5 matrix, I believe. I tried to do
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.
Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
Remarks:
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could do something like this: M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v as
1 − u·v / (||u||_2 ||v||_2)
where ||·||_2 is the 2-norm of its argument, and u·v is the dot product of u and v.
Array transposition is performed through the .T attribute.
Demo:
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
Out[278]:
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
Out[280]:
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
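If SciPy is not available, a hedged NumPy-only sketch of the same adjusted cosine similarity (it assumes no item column has zero norm after mean subtraction, which holds for this M):
import numpy as np
M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])
X = M - M.mean(axis=1, keepdims=True)   # subtract each user's mean rating
norms = np.linalg.norm(X, axis=0)       # 2-norm of every item column
similarity_matrix = (X.T @ X) / np.outer(norms, norms)
# matches the pdist-based result above (up to rounding)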

numpy how find local minimum in neighborhood on 1darray

I've got a list of sorted samples. They're sorted by their sample time, where each sample is taken one second after the previous one.
I'd like to find the minimum value in a neighborhood of a specified size.
For example, given a neighborhood size of 2 and the following samples:
samples = [ 5, 12.3, 12.3, 7, 2, 6, 9, 10, 5, 9, 17, 2 ]
I'd expect the following output: [5, 2, 5, 2]
What would be the best way to achieve this in numpy / scipy?
Edited: Explained the reasoning behind the min values:
5 - the 2 number window next to it are [12.3 12.3]. 5 is smaller
2 - to the left [12.3, 7] to the right [6 9]. 2 is the min
5 - to the left [9 10] to the right [9 17]. 5 is the min
notice that 9 isn't a min, as there's a 2-wide window to its left and to its right containing a smaller value (2)
Use scipy's argrelextrema:
>>> import numpy as np
>>> from scipy.signal import argrelextrema
>>> data = np.array([ 5, 12.3, 12.3, 7, 2, 6, 9, 10, 5, 9, 17, 2 ])
>>> radius = 2 # number of elements to the left and right to compare to
>>> argrelextrema(data, np.less, order=radius)
(array([4, 8]),)
This suggests that the numbers at positions 4 and 8 (2 and 5) are the smallest ones within a neighbourhood of size 2. The numbers at the boundaries (5 and 2) are not detected, since argrelextrema only supports clip or wrap boundary conditions. As for your question, I guess you are interested in them too. To detect them, it is easy to add reflect boundary conditions first:
>>> new_data = np.pad(data, radius, mode='reflect')
>>> new_data
array([ 12.3, 12.3, 5. , 12.3, 12.3, 7. , 2. , 6. , 9. ,
10. , 5. , 9. , 17. , 2. , 17. , 9. ])
With the data padded with the corresponding boundary conditions, we can now apply the previous extrema detector:
>>> arg_minimas = argrelextrema(new_data, np.less, order=radius)[0] - radius
>>> arg_minimas
array([ 0, 4, 8, 11])
This returns the positions where the local extrema (minima in this case, since np.less) occur within a sliding window of radius=2.
NOTE the -radius to correct for the +radius index offset introduced by padding the array with reflect boundary conditions via np.pad.
EDIT: if you are interested in the values and not in the positions, it is straightforward:
>>> data[arg_minimas]
array([ 5., 2., 5., 2.])
It seems, basically, you are finding local minima in a sliding window, but that sliding window slides in such a manner that the end of the previous window acts as the start of the new window. For such a specific problem, suggested in this solution is a vectorized approach that uses broadcasting -
import numpy as np
# Inputs
N = 2
samples = [ 5, 12.3, 12.3, 7, 2, 6, 9, 10, 5, 9, 17, 2 ]
# Convert input list to a numpy array
S = np.asarray(samples)
# Calculate the number of Infs to be appended at the end
append_endlen = int(2*N*np.ceil((S.size+1)/(2*N))-1 - S.size)
# Append Infs at the start and end of the input array
S1 = np.concatenate((np.repeat(np.inf,N),S,np.repeat(np.inf,append_endlen)),0)
# Number of sliding windows
num_windows = int((S1.size-1)/(2*N))
# Get windowed values from input array into rows.
# Thus, get minimum from each row to get the desired local minimum.
indexed_vals = S1[np.arange(num_windows)[:,None]*2*N + np.arange(2*N+1)]
out = indexed_vals.min(1)
Sample runs
Run # 1: Original input data
In [105]: S # Input array
Out[105]:
array([ 5. , 12.3, 12.3, 7. , 2. , 6. , 9. , 10. , 5. ,
9. , 17. , 2. ])
In [106]: N # Window radius
Out[106]: 2
In [107]: out # Output array
Out[107]: array([ 5., 2., 5., 2.])
Run # 2: Modified input data, Window radius = 2
In [101]: S # Input array
Out[101]:
array([ 5. , 12.3, 12.3, 7. , 2. , 6. , 9. , 10. , 5. ,
9. , 17. , 2. , 0. , -3. , 7. , 99. , 1. , 0. ,
-4. , -2. ])
In [102]: N # Window radius
Out[102]: 2
In [103]: out # Output array
Out[103]: array([ 5., 2., 5., -3., -4., -4.])
Run # 3: Modified input data, Window radius = 3
In [97]: S # Input array
Out[97]:
array([ 5. , 12.3, 12.3, 7. , 2. , 6. , 9. , 10. , 5. ,
9. , 17. , 2. , 0. , -3. , 7. , 99. , 1. , 0. ,
-4. , -2. ])
In [98]: N # Window radius
Out[98]: 3
In [99]: out # Output array
Out[99]: array([ 5., 2., -3., -4.])
>>> import numpy as np
>>> a = np.array(samples)
>>> [a[max(i-2,0):i+2].min() for i in range(1, a.size)]
[5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 2.0, 2.0]
As Divakar pointed out in the comments, this is what a sliding window yields. If you want to remove duplicates, that can be done separately, as sketched below.
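A hedged sketch of that separate de-duplication step, collapsing runs of equal consecutive window minima:
import numpy as np
a = np.array([5, 12.3, 12.3, 7, 2, 6, 9, 10, 5, 9, 17, 2])
mins = np.array([a[max(i - 2, 0):i + 2].min() for i in range(1, a.size)])
# keep only the first value of each run of equal consecutive minima
result = mins[np.r_[True, mins[1:] != mins[:-1]]]
# result -> array([5., 2., 5., 2.])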
This will look through each window, find the minimum value, and add it to a list if the window's minimum value isn't equal to the most recently added value.
samples = [5, 12.3, 12.3, 7, 2, 6, 9, 10, 5, 9, 17, 2]
neighborhood = 2
minima = []
for i in range(len(samples)):
    window = samples[max(0, i - neighborhood):i + neighborhood + 1]
    windowMin = min(window)
    if minima == [] or windowMin != minima[-1]:
        minima.append(windowMin)
This gives the output you described:
print(minima)
> [5, 2, 5, 2]
However, #imaluengo's answer is better since it will include both of two consecutive equal minimum values if they have different indices in the original list!
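On NumPy 1.20+ there is also sliding_window_view, so a hedged alternative sketch of the whole procedure (centered windows padded with +inf, followed by the same consecutive-duplicate collapse as above) is:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

samples = np.array([5, 12.3, 12.3, 7, 2, 6, 9, 10, 5, 9, 17, 2])
radius = 2
# pad with +inf so the windows at the borders keep the full width 2*radius + 1
padded = np.pad(samples, radius, constant_values=np.inf)
window_mins = sliding_window_view(padded, 2 * radius + 1).min(axis=1)
result = window_mins[np.r_[True, window_mins[1:] != window_mins[:-1]]]
# result -> array([5., 2., 5., 2.])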
