Add indexes to index list around index value in numpy - python

I have the following arrays:
time = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6, 6e-6, 7e-6, 8e-6, 9e-6, 10e-6]
signal = [0, 10, 3, 2, 1, 0, 10, 2, 2, 5]
and I want to remove (from both arrays) any datapoints that are above a threshold value, with a given padding width
threshold = 9
padding = 3e-6
so any index whose signal value is above 9, or whose time falls within the padding window around such a point, should be removed from both arrays. Note: the padding windows can overlap if two data points above the threshold fall within the same window.
example output
time_out = [4e-6, 5e-6, 9e-6, 10e-6]
signal_out = [2, 1, 2, 5]
EDIT: this post is very similar, but it handles only a single index of an array, whereas I need to do it at multiple indices (the points above the threshold at e.g. time=2e-6 and time=7e-6): https://stackoverflow.com/a/66695205/12728698

Let's try this one. The idea is to build a boolean mask that is True where a sample lies outside every padding window around the above-threshold points. I divided the padding by 3 since, IIUC, the padding is a window spanning 3 samples, so we only need to drop each above-threshold signal and its 2 adjacent values.
import numpy as np

time_arr = np.array(time)
signal_arr = np.array(signal)
# window limits around each above-threshold sample
llim = time_arr[signal_arr > threshold, None] - padding/3
ulim = time_arr[signal_arr > threshold, None] + padding/3
# keep only the samples that fall outside every window
msk = ((llim > time_arr) | (ulim < time_arr)).all(axis=0)
time_out = time_arr[msk]
signal_out = signal_arr[msk]
Another option is to use numpy.roll to get the adjacent values to create a boolean mask:
comp = signal_arr <= threshold
# keep a sample only if it and both of its neighbours are at or below the threshold
# (np.roll wraps around, so the first and last samples treat each other as neighbours)
msk = np.roll(comp, 1) & comp & np.roll(comp, -1)
time_out = time_arr[msk]
signal_out = signal_arr[msk]
Output:
array([4.e-06, 5.e-06, 9.e-06, 1.e-05])
array([2, 1, 2, 5])
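For reference, here is a minimal self-contained sketch (my own wrapper, not part of the answer) that packages the broadcast-mask idea into a function; the name remove_padded and the half_width parameter are illustrative, and half_width = padding/3 reproduces the expected output above:

import numpy as np

def remove_padded(time, signal, threshold, half_width):
    """Drop samples above `threshold`, plus everything within +/- `half_width` of them in time."""
    time_arr = np.asarray(time)
    signal_arr = np.asarray(signal)
    centers = time_arr[signal_arr > threshold, None]
    keep = ((centers - half_width > time_arr) | (centers + half_width < time_arr)).all(axis=0)
    return time_arr[keep], signal_arr[keep]

time = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6, 6e-6, 7e-6, 8e-6, 9e-6, 10e-6]
signal = [0, 10, 3, 2, 1, 0, 10, 2, 2, 5]

time_out, signal_out = remove_padded(time, signal, threshold=9, half_width=1e-6)
print(time_out)    # [4.e-06 5.e-06 9.e-06 1.e-05]
print(signal_out)  # [2 1 2 5]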

Related

Binary matrix aa (containing only 0 and 1): why isn't sum(sum(aa)) equal to sum(sum(aa>0))?

I have a binary mask named crop_mask, which contains only 0s and 1s. Why isn't sum(sum(aa1)) equal to sum(sum(aa2))?
aa1 = crop_mask
aa2 = (aa1>0)
print(sum(sum(aa1)), sum(sum(aa2)))
This might be a minor issue, but I am just so confused now. Thanks for any help. I made a screenshot of the result in the attached figure.
updated screenshot
By definition the sum should be the same.
The only thing I can think of is that the dtype of your array (assuming you are using a numpy array) is not int or float.
Did you check that the "True"s in aa2 match the "1" in aa1?
EDIT:
dtype = np.uint8 limits the maximum value of the column sum to 255 (2^8 - 1). So sum(sum(a)) --> sum([0, 160, 0, ...]) (160 is the remainder of 4000/256)
aa1 = aa1.astype(int) will solve your issue
a = np.zeros((4000, 4000)).astype(np.uint8)
a[:,1] = 1
a[:,4] = 1
b = (a > 0)
sum(sum(b)) # 8000
sum(sum(a)) # 320
a = a.astype(int)
sum(sum(a)) #8000
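As a side note (not part of the original answer): numpy's own reductions promote small integer dtypes to a platform-sized accumulator, so a.sum() avoids the overflow even without the astype cast, unlike the nested builtin sum(sum(a)):

import numpy as np

a = np.zeros((4000, 4000), dtype=np.uint8)
a[:, 1] = 1
a[:, 4] = 1

# ndarray.sum / np.sum use a platform-sized integer accumulator for uint8 input,
# so no wrap-around occurs here
print(a.sum())                # 8000
print(a.sum(dtype=np.int64))  # 8000, with the accumulator dtype made explicit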
Assuming your cropmask is indeed a 2-dimensional ndarray with only 1s and 0s, this works:
import numpy as np
cropmask = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]], np.uint8)
x = (cropmask > 0)
print(sum(sum(cropmask)), sum(sum(x)))
Result:
6 6
The most likely cause here is that you're wrong and your cropmask doesn't actually contain only 1s and 0s.
Have you tried:
print(sum(sum(np.logical_and((0 != crop_mask), (1 != crop_mask)))))
If that comes up greater than 0, there's something else in there.

How can I quickly fancy-reorder a flattened "jagged" numpy array

So I have lots of data in a single, flat array that is grouped into irregularly sized chunks. The sizes of these chunks are given in another array. What I need to do is rearrange the chunks based on a third index array (think fancy indexing)
These chunks are always >= 3 long, usually 4, but technically unbounded, so it's not feasible to pad up to a max length and mask. Also, due to technical reasons I only have access to numpy, so nothing like scipy or pandas.
Just to make it easier to read, the data in this example is grouped in an obvious way. In the real data, the numbers can be anything and do not follow this pattern.
[EDIT] Updated with less confusing data
data = np.array([1,2,3,4, 11,12,13, 21,22,23,24, 31,32,33,34, 41,42,43, 51,52,53,54])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])
The expected output in this case would be
np.array([1,2,3,4, 51,52,53,54, 41,42,43, 51,52,53,54, 21,22,23,24, 11,12,13])
Since the real data can be millions long, I'm hoping for some kind of numpy magic that can do this without python loops.
Approach #1
Here's a vectorized one based on creating a regular array and masking -
def chunk_rearrange(data, chunkSizes, newOrder):
    m = chunkSizes[:,None] > np.arange(chunkSizes.max())
    d1 = np.empty(m.shape, dtype=data.dtype)
    d1[m] = data
    return d1[newOrder][m[newOrder]]
Output for given sample -
In [4]: chunk_rearrange(data, chunkSizes, newOrder)
Out[4]: array([ 1,  2,  3,  4, 51, 52, 53, 54, 41, 42, 43, 51, 52, 53, 54, 21, 22, 23, 24, 11, 12, 13])
Approach #2
Another vectorized one based on cumsum and with smaller footprint for those very-ragged chunksizes -
def chunk_rearrange_cumsum(data, chunkSizes, newOrder):
    # Setup ID array that will hold specific values at those interval starts,
    # such that a final cumsum would lead us to the indices which when indexed
    # by the input array gives us the re-arranged o/p
    idar = np.ones(len(data), dtype=int)

    # New chunk lengths
    newlens = chunkSizes[newOrder]

    # Original chunk intervals
    c = np.r_[0, chunkSizes[:-1].cumsum()]

    # Indices from original order that form the interval starts in new arrangement
    d1 = c[newOrder]

    # Starts of chunks in new arrangement where those from d1 are to be assigned
    c2 = np.r_[0, newlens[:-1].cumsum()]

    # Offset required for the starts in new arrangement for final cumsum to work
    diffs = np.diff(d1) + 1 - np.diff(c2)
    idar[c2[1:]] = diffs
    idar[0] = d1[0]

    # Final cumsum and indexing leads to desired new arrangement
    out = data[idar.cumsum()]
    return out
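As a quick sanity check (my addition, not from the answer), both helpers reproduce the expected output from the question on the sample arrays:

import numpy as np

data = np.array([1,2,3,4, 11,12,13, 21,22,23,24, 31,32,33,34, 41,42,43, 51,52,53,54])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])

expected = np.array([1,2,3,4, 51,52,53,54, 41,42,43, 51,52,53,54, 21,22,23,24, 11,12,13])
assert np.array_equal(chunk_rearrange(data, chunkSizes, newOrder), expected)
assert np.array_equal(chunk_rearrange_cumsum(data, chunkSizes, newOrder), expected)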
You can use np.split to create views into your data array corresponding to the chunkSizes, if you build up the indices with np.cumsum. You can then reorder the views according to the newOrder indices using fancy indexing. This should be reasonably efficient since the data is only copied to the new array when you call np.concatenate on the reordered views:
import numpy as np
data = np.array([0,0,0,0, 1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4, 5,5,5,5])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])
cumIndices = np.cumsum(chunkSizes)
splitArray = np.array(np.split(data, cumIndices[:-1]), dtype=object)  # dtype=object, since the chunks are ragged
targetArray = np.concatenate(splitArray[newOrder])
# >>> targetArray
# array([0, 0, 0, 0, 5, 5, 5, 5, 4, 4, 4, 5, 5, 5, 5, 2, 2, 2, 2, 1, 1, 1])

Rolling difference in Pandas

Does anyone know an efficient function/method, such as pandas.rolling_mean, that would calculate the rolling difference of an array?
This is my closest solution:
roll_diff = pd.Series(values).diff(periods=1)
However, it only calculates single-step rolling difference. Ideally the step size would be editable (i.e. difference between current time step and n last steps).
I've also written this, but for larger arrays, it is quite slow:
def roll_diff(values, step):
    diff = []
    for i in np.arange(step, len(values)-1):
        pers_window = np.arange(i-1, i-step-1, -1)
        diff.append(np.abs(values[i] - np.mean(values[pers_window])))
    diff = np.pad(diff, (0, step+1), 'constant')
    return diff
What about:
import pandas
x = pandas.DataFrame(
    {'x_1': [0, 1, 2, 3, 0, 1, 2, 500]},
    index=[0, 1, 2, 3, 4, 5, 6, 7])
x['x_1'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
In general you can replace the lambda function with your own function. Note that in this case the first item will be NaN.
Update
Defining the following:
n_steps = 2

def my_fun(x):
    return x.iloc[-1] - x.iloc[0]

x['x_1'].rolling(window=n_steps).apply(my_fun)
you can compute the difference across a window of n_steps values.
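For the plain "current value minus the value a few rows back" case, Series.diff gives the same numbers directly; a small comparison of my own (note that a window of n_steps rows spans n_steps - 1 steps, hence the offset):

import pandas as pd

x = pd.DataFrame({'x_1': [0, 1, 2, 3, 0, 1, 2, 500]})
n_steps = 2

rolled = x['x_1'].rolling(window=n_steps).apply(lambda w: w.iloc[-1] - w.iloc[0])
direct = x['x_1'].diff(periods=n_steps - 1)

# Both give the 1-step difference, with NaN for the first n_steps - 1 rows
print(rolled.equals(direct))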
You can do the same thing as in https://stackoverflow.com/a/48345749/1011724 if you work directly on the underlying numpy array:
import numpy as np
diff_kernel = np.array([1,-1])
np.convolve(rs, diff_kernel, 'same')
where rs is your pandas series
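For completeness, a small self-contained usage sketch of this approach (my own, reusing the sample numbers from the earlier answers):

import numpy as np
import pandas as pd

rs = pd.Series([0, 1, 2, 3, 0, 1, 2, 500])
diff_kernel = np.array([1, -1])

# mode='same' keeps the output the same length as the input; the element at the
# array boundary is an edge artifact rather than a true difference
diffs = np.convolve(rs.to_numpy(), diff_kernel, 'same')
print(diffs)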
This should work:
import numpy as np
x = np.array([1, 3, 6, 1, -5, 6, 4, 1, 6])
def running_diff(arr, N):
    return np.array([arr[i] - arr[i-N] for i in range(N, len(arr))])
running_diff(x, 4) # array([-6, 3, -2, 0, 11])
For a given pd.Series, you will have to define what you want for the first few items. The below example just returns the initial series values.
s_roll_diff = np.hstack((s.values[:4], running_diff(s.values, 4)))
This works because you can assign a np.array directly to a pd.DataFrame column, e.g. for a column s, df['s_roll_diff'] = np.hstack((df.s.values[:4], running_diff(df.s.values, 4)))
If you got KeyError: 0, try with iloc:
import pandas
x = pandas.DataFrame(
    {'x_1': [0, 1, 2, 3, 0, 1, 2, 500]},
    index=[0, 1, 2, 3, 4, 5, 6, 7])
x['x_1'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
Applying numpy.diff:
import pandas as pd
import numpy as np
x = pd.DataFrame({'x_1': [0, 1, 2, 3, 0, 1, 2, 500]})
print(x)
>>> x_1
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 500
print(x['x_1'].rolling(window=2).apply(np.diff))
>>>0 NaN
1 1.0
2 1.0
3 1.0
4 -3.0
5 1.0
6 1.0
7 498.0
Name: x_1, dtype: float64
If you have unevenly-spaced intervals, or temporal gaps in your data, and you want to use a rolling window of time frequencies, rather than number of periods, you can easily end up in a situation where x.iloc[-1] - x.iloc[0] doesn't return the result you expect. Pandas can construct windows with exactly 1 point, so x.iloc[-1] == x.iloc[0] and the diff is always 0.
Sometimes this is the desired outcome, but other times you might want to use the last-known value from before the start of each window.
A general solution (perhaps not so efficient) is to first artificially construct an evenly-spaced series, interpolate or fill data as needed (e.g. using Series.ffill), and then use the .rolling() techniques described in other answers.
# Data with temporal gaps
y = pd.Series(..., index=DatetimeIndex(...))
# Your desired frequency
freq = '1M'
# Construct a new Index with this frequency, using your data ranges
idx_artificial = pd.date_range(y.index.min(), y.index.max(), freq=freq)
# Artificially expand the data to the evenly-spaced index
# New data points will be inserted with null/NaN values
y_artificial = y.reindex(idx_artificial)
# Fill the empty values with last-known value
# This part will vary depending on your needs
y_artificial.ffill(inplace=True)
# Now compute the diffs, using the forward-filled artificially-spaced data
y_diff = y_artificial.rolling(window=freq).apply(lambda x: x.iat[-1] - x.iat[0])
And here are some helper functions to implement the above, for your copy-paste pleasure (warning: lightly-tested code written by a complete stranger, use with caution):
def date_range_from_index(index, freq=None, start=None, end=None, **kwargs):
    if start is None:
        start = index.min()
    if end is None:
        end = index.max()
    if freq is None:
        try:
            freq = index.freq
        except AttributeError:
            freq = None
        if freq is None:
            raise ValueError('Frequency not provided and input has no set frequency.')
    return pd.date_range(start, end, freq=freq, **kwargs)

def fill_dtindex(y, freq=None, start=None, end=None, fill=None):
    new_index = date_range_from_index(y.index, freq=freq, start=start, end=end)
    y = y.reindex(new_index)
    if fill is not None:
        if isinstance(fill, str):
            y = y.fillna(method=fill)
        else:
            y = y.fillna(fill)
    return y
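A possible usage sketch of these helpers (my own illustration, with made-up data):

import pandas as pd

# Hypothetical unevenly-spaced data with gaps
idx = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-05', '2021-01-09'])
y = pd.Series([1.0, 2.0, 5.0, 9.0], index=idx)

# Expand to a regular daily grid, forward-filling the gaps
y_filled = fill_dtindex(y, freq='1D', fill='ffill')

# One-step (here: one-day) rolling difference on the regularised series
print(y_filled.diff())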

Map numbers to their percentiles

I would like to apply the result of numpy.percentile to its argument, i.e., map every number in the input vector to its quantile.
E.g., if v=np.array([1,2,3,4]), and I want just two quantiles (bigger and smaller than the median), I would get np.array([0,0,1,1]) telling me that the first two elements of v are smaller than the median and the last two are bigger than the median.
Note that I am interested in, say, deciles, not just the median!
IOW, @PaulPanzer hit the nail on the head:
np.digitize(v,np.percentile(v,quantiles))
thanks!
(v > np.percentile(v, 50)).astype(int)
Out[93]:
array([0, 0, 1, 1])
Use np.digitize:
perc = np.percentile(data, q)
indices = np.digitize(data, perc)
Example q = [25,50,75], data = np.arange(8):
indices
# array([0, 0, 1, 1, 2, 2, 3, 3])
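For the decile case mentioned in the question, the same np.digitize pattern applies; a small sketch (the variable names are my own):

import numpy as np

v = np.random.default_rng(0).normal(size=1000)

# 9 inner cut points at the 10th, 20th, ..., 90th percentiles
cuts = np.percentile(v, np.arange(10, 100, 10))

# each value is mapped to its decile index, 0 through 9
deciles = np.digitize(v, cuts)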

Randomize part of an array

I'm working on a project involving binary patterns (here np.arrays of 0 and 1).
I'd like to modify a random subset of these and return several altered versions of the pattern where a given fraction of the values have been changed (like map a function to a random subset of an array of fixed size)
ex: take the pattern [0 0 1 0 1] and rate 0.2, return [[0 1 1 0 1] [1 0 1 0 1]]
It seems possible by using auxiliary arrays and iterating with a condition, but is there a "clean" way to do that?
Thanks in advance!
The map function works on boolean arrays too. You could add the subsample logic to your function, like so:
import numpy as np
rate = 0.2
f = lambda x: np.random.choice((True, x),1,p=[rate,1-rate])[0]
a = np.array([0,0,1,0,1], dtype='bool')
np.array(list(map(f, a)))
# This will output array a with on average 20% of the elements changed to "1"
# it can be slightly more or less than 20%, by chance.
Or you could rewrite a map function, like so:
import numpy as np
def map_bitarray(f, b, rate):
    '''
    maps function f on a random subset of b
    :param f: the function, should take a binary array of size <= len(b)
    :param b: the binary array
    :param rate: the fraction of elements that will be replaced
    :return: the modified binary array
    '''
    c = np.copy(b)
    num_elem = len(c)
    idx = np.random.choice(range(num_elem), int(num_elem*rate), replace=False)
    c[idx] = f(c[idx])
    return c
f = lambda x: True
b = np.array([0,0,1,0,1], dtype='bool')
map_bitarray(f, b, 0.2)
# This will output array b with exactly 20% of the elements changed to "1"
import numpy as np

rate = 0.2
repeats = 5
seed = [0, 0, 1, 0, 1]
realizations = np.tile(seed, [repeats, 1]) ^ np.random.binomial(1, rate, [repeats, len(seed)])
Use np.tile() to generate a matrix from the seed row.
np.random.binomial() to generate a binomial mask matrix with your requested rate.
Apply the mask with the xor binary operator ^
EDIT:
Based on @Jared Goguen's comments, if you want to change exactly 20% of the bits, you can build the mask by choosing the elements to change at random:
seed = [1, 0, 1, 0, 1]
rate = 0.2
repeats = 10
mask_list = []
for _ in range(repeats):
    y = np.zeros(len(seed), np.int32)
    # replace=False so the same position is not picked twice
    y[np.random.choice(len(seed), int(rate*len(seed)), replace=False)] = 1
    mask_list.append(y)

mask = np.vstack(mask_list)
realizations = np.tile(seed, [repeats, 1]) ^ mask
So, there's already an answer that provides sequences where each element has a random transition probability. However, it seems like you might want an exact fraction of the elements to change instead. For example, [1, 0, 0, 1, 0] can change to [1, 1, 0, 1, 0] or [0, 0, 0, 1, 0], but not [1, 1, 1, 1, 0].
The premise, based off of xvan's answer, uses the bit-wise xor operator ^. When a bit is xor'd with 0, it's value will not change. When a bit is xor'd with 1, it will flip. From your question, it seems like you want to change len(seq)*rate number of bits in the sequence. First create mask which contains len(seq)*rate number of 1's. To get an altered sequence, xor the original sequence with a shuffled version of mask.
Here's a simple, inefficient implementation:
import numpy as np

def edit_sequence(seq, rate, count):
    length = len(seq)
    change = int(length * rate)
    mask = [0]*(length - change) + [1]*change
    return [seq ^ np.random.permutation(mask) for _ in range(count)]

rate = 0.2
seq = np.array([0, 0, 1, 0, 1])
print(edit_sequence(seq, rate, 5))
# [0, 0, 1, 0, 0]
# [0, 1, 1, 0, 1]
# [1, 0, 1, 0, 1]
# [0, 1, 1, 0, 1]
# [0, 0, 0, 0, 1]
I don't really know much about NumPy, so maybe someone with more experience can make this efficient, but the approach seems solid.
Edit: Here's a version that times about 30% faster:
def edit_sequence(seq, rate, count):
    mask = np.zeros(len(seq), dtype=int)
    mask[:int(len(seq)*rate)] = 1
    output = []
    for _ in range(count):
        np.random.shuffle(mask)
        output.append(seq ^ mask)
    return output
It appears that this updated version scales very well with the size of seq and the value of count. Using dtype=bool in seq and mask yields another 50% improvement in the timing.
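A sketch of that boolean-dtype variant (my own rendering of the suggestion above, not the answerer's code):

import numpy as np

def edit_sequence_bool(seq, rate, count):
    # same exact-fraction idea, but with boolean arrays for the xor
    seq = np.asarray(seq, dtype=bool)
    mask = np.zeros(seq.size, dtype=bool)
    mask[:int(seq.size * rate)] = True
    output = []
    for _ in range(count):
        np.random.shuffle(mask)
        output.append(seq ^ mask)
    return output

print(edit_sequence_bool([0, 0, 1, 0, 1], 0.2, 3))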
