I am trying to create a central tendency operator (like the mean or the median) which would follow this logic:
For a given array, return the value closest to zero if all values have the same sign and zero otherwise
In other words:
if all values > 0 return min(array)
if all values < 0 return max(array)
else return 0
Here is the most optimized implementation I managed to do:
def zero_min(x):
    if len(x) == 1:
        return x[0]
    else:
        tmin = np.min(x)
        tmax = np.max(x)
        return (tmin if tmin == abs(tmin) else tmax) if tmin*tmax > 0 else 0
The issue is that I want it to be very efficient in order to use it in a rolling window (using pandas.Series.rolling) on 8.5M values of type float64, like this:
df = df.rolling(timedelta(seconds=5)).apply(zero_min, raw=True)
But this function is painfully slow to execute: for a 5s window it takes 33.34s, while pandas.Series.rolling.mean takes 0.15s and pandas.Series.rolling.median takes 1.01s (and the median should take longer to compute, as it is a more complex operation).
Would you know how to optimize it so that it is at least as fast as the median?
I guess I would have to use matrix calculation or code the operation in C but I don't know how to do that.
You can reproduce the data to process using
import random
from datetime import datetime, timedelta
import pandas as pd
n = 8467200
df = pd.Series([random.random() for i in range(n)], index=pd.date_range(datetime.now(), datetime.now() + timedelta(seconds=n-1), freq='1S'))
Avoid using apply. You can do something like this:
min_val = df['some_col'].rolling(timedelta(seconds=5), min_periods=1).min()
max_val = df['some_col'].rolling(timedelta(seconds=5), min_periods=1).max()
# perform the logics on these series
df['new_col'] = np.select((min_val.gt(0) | min_val.eq(max_val), max_val < 0),
                          (min_val, max_val), 0)
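As a rough check of the idea above, here is a minimal sketch that compares the rolling min/max approach with the apply-based zero_min on a small random series. The names zero_min_rolling and s, the "5s" window string, and the test data are illustrative assumptions and not part of the original code.

```python
import numpy as np
import pandas as pd

def zero_min(x):
    # Reference implementation from the question (x is a NumPy array).
    if len(x) == 1:
        return x[0]
    tmin = np.min(x)
    tmax = np.max(x)
    return (tmin if tmin == abs(tmin) else tmax) if tmin * tmax > 0 else 0

def zero_min_rolling(s, window):
    # Hypothetical helper: built-in rolling min/max plus np.select.
    roll = s.rolling(window, min_periods=1)
    min_val = roll.min().to_numpy()
    max_val = roll.max().to_numpy()
    out = np.select(
        [(min_val > 0) | (min_val == max_val), max_val < 0],
        [min_val, max_val],
        0.0,
    )
    return pd.Series(out, index=s.index)

# Small synthetic series with one sample per second (assumed test data).
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=1000, freq=pd.Timedelta(seconds=1))
s = pd.Series(rng.uniform(-1, 1, 1000), index=idx)

fast = zero_min_rolling(s, "5s")
slow = s.rolling("5s").apply(zero_min, raw=True)
print(np.allclose(fast.to_numpy(), slow.to_numpy()))
```

The min_val == max_val condition covers windows holding a single value (or identical values), which is the len(x) == 1 branch of zero_min.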
I currently use something like the following bit of code to run a comparison:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
counter = 1
for number in list_of_numbers:
    if (high >= number) \
            and (counter < lookback):
        counter += 1
    else:
        break
The resulting counter value will be 7. However, this is very taxing on large data arrays, so I looked for a solution and came up with np.argmax(), but there seems to be an issue. For example, the following:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
np_list = np.array(list_of_numbers)
high = 29980.0
print(np.argmax(np_list > high) + 1)
This outputs 1, just as argmax is supposed to, but I want it to output 7. Is there another method that will give me the same output as the if statement?
You can get a boolean array for where high >= number using NumPy:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
boolean_arr = np.less_equal(np.array(list_of_numbers), high)
Then find the first False entry in that array, which corresponds to the break condition in your code. Furthermore, to account for the counter, you can use np.cumsum on the boolean array and find the first position where the running total reaches the specified lookback. The result is then the smaller of break_arr and lookback_lim:
break_arr = np.where(boolean_arr == False)[0][0] + 1
lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
result = min(break_arr, lookback_lim)
If list_of_numbers has no value larger than the specified high limit (for break_arr), or the specified lookback exceeds every value in np.cumsum(boolean_arr) (for lookback_lim), the code above will fail with an np.where-related error like the following:
IndexError: index 0 is out of bounds for axis 0 with size 0
Which can be handled by try-except or if statements, e.g.:
try:
    break_arr = np.where(boolean_arr == False)[0][0] + 1
except IndexError:
    break_arr = len(boolean_arr) + 1
try:
    lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
except IndexError:
    lookback_lim = len(boolean_arr) + 1
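For comparison, here is a minimal sketch of the same logic without try-except, checked against the original loop. The function names count_loop and count_vectorized are hypothetical, and the sketch assumes lookback >= 1.

```python
import numpy as np

def count_loop(numbers, high, lookback):
    # The for-loop from the question, wrapped in a function.
    counter = 1
    for number in numbers:
        if high >= number and counter < lookback:
            counter += 1
        else:
            break
    return counter

def count_vectorized(numbers, high, lookback):
    # Assumes lookback >= 1.
    ok = np.asarray(numbers) <= high
    # Position of the first value above `high`, or len(ok) if there is none.
    first_fail = len(ok) if ok.all() else int(np.argmin(ok))
    return min(first_fail + 1, lookback)

list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
print(count_loop(list_of_numbers, 29980.0, 10))        # 7
print(count_vectorized(list_of_numbers, 29980.0, 10))  # 7
```

np.argmin on a boolean array returns the index of the first False, so the ok.all() guard is only needed for the case where no value exceeds high.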
You have your less-than sign backwards, no? The following should work like the for-loop:
print(np.min([np.sum(np.array(list_of_numbers) < high) + 1, lookback]))
A look back can be accomplished using shift. A cumcount can be used to get a running total. A query can be used as a filter
I have a time series s stored as a pandas.Series and I need to find when the value tracked by the time series changes by at least x.
In pseudocode:
print s(0)
s* = s(0)
for all t in ]0, t_max]:
    if |s(t) - s*| > x:
        s* = s(t)
        print s*
Naively, this can be coded in Python as follows:
import pandas as pd
def find_changes(s, x):
    changes = []
    s_last = None
    for index, value in s.items():  # Series.iteritems() was removed in pandas 2.0
        if s_last is None:
            s_last = value
        if value-s_last > x or s_last-value > x:
            changes.append((index, value))  # record (index, value) pairs
            s_last = value
    return changes
My data set is large, so I can't just use the method above. Moreover, I cannot use Cython or Numba due to limitations of the framework I will run this on. I can (and plan to) use pandas and NumPy.
I'm looking for some guidance on what NumPy vectorized/optimized methods to use and how.
Thanks!
EDIT: Changed code to match pseudocode.
I don't know if I am understanding you correctly, but here is how I interpreted the problem:
import pandas as pd
import numpy as np
# Our series of data.
data = pd.DataFrame(np.random.rand(10), columns = ['value'])
# The threshold.
threshold = .33
# For each point t, grab t - 1.
data['value_shifted'] = data['value'].shift(1)
# Absolute difference of t and t - 1.
data['abs_change'] = abs(data['value'] - data['value_shifted'])
# Test against the threshold.
data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)
print(data)
Giving:
      value  value_shifted  abs_change  change_exceeds_threshold
0  0.005382            NaN         NaN                         0
1  0.060954       0.005382    0.055573                         0
2  0.090456       0.060954    0.029502                         0
3  0.603118       0.090456    0.512661                         1
4  0.178681       0.603118    0.424436                         1
5  0.597814       0.178681    0.419133                         1
6  0.976092       0.597814    0.378278                         1
7  0.660010       0.976092    0.316082                         0
8  0.805768       0.660010    0.145758                         0
9  0.698369       0.805768    0.107400                         0
I don't think the pseudocode can be vectorized, because the next state of s* depends on the previous state. Here is a pure Python solution (a single pass):
import random
import pandas as pd
s = [random.randint(0,100) for _ in range(100)]
res = [] # record changes
thres = 20
ss = s[0]
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])
# each row of res holds two entries, so two column names are needed
df = pd.DataFrame(res, columns=['index', 'value'])
I think there's no way to run faster than O(N) in this case.
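As an illustration of the single-pass idea applied directly to a pandas.Series, here is a minimal sketch. The function name find_changes_scan and the sample data are assumptions made for this example. The loop runs over the underlying NumPy array to avoid per-element pandas overhead, but it is still plain Python and still O(N).

```python
import numpy as np
import pandas as pd

def find_changes_scan(s, x):
    # Hypothetical helper: returns the sub-series of points that differ
    # by more than x from the last recorded value (s* in the pseudocode).
    values = s.to_numpy()
    if len(values) == 0:
        return s.iloc[:0]
    keep = []
    last = values[0]  # s* starts at s(0)
    for i in range(1, len(values)):
        if abs(values[i] - last) > x:
            last = values[i]
            keep.append(i)
    return s.iloc[keep]

s = pd.Series([0.0, 0.5, 1.5, 1.6, 3.0, 2.9, 0.0])
changes = find_changes_scan(s, 1.0)
print(changes.index.tolist())  # [2, 4, 6]
print(changes.tolist())        # [1.5, 3.0, 0.0]
```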
I am currently using Python and NumPy to calculate correlations between two lists, data_0 and data_1. The lists contain sorted times t0 and t1, respectively.
I want to calculate all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Doing so, as the lists are sorted, I am selecting a subarray of data_1 of the form data_1[index_min: index_max].
So I in fact need to find two indexes to get what I want.
And what's interesting is that when I go to the next time_0, as data_0 is also sorted, I just need to find the new index_min / index_max such that new_index_min >= index_min / new_index_max >= index_max.
This means that I don't need to scan all of data_1 again from scratch.
I have implemented such a solution without the NumPy methods (just with a while loop). It gives me the same results as before, but it is not as fast as before (15 times slower!).
Since it should normally require less computation, I think there should be a way to make it faster using NumPy methods, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am super clear, so if you have any questions, do not hesitate to ask.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np
def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
        split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1
def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
    return index_min, index_max
ref = np.sort(np.random.randint(0,20000,(10000,)))
data = np.sort(np.random.randint(0,20000,(10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))
print('checking')
for d,mn,mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d+span
    assert mx == 0 or ref[mx-1] <= d+span
print('ok')
It works by:
- indirectly sorting both sets together
- finding, for each time in one set, the preceding time in the other
  (this is done using maximum.accumulate)
- applying the preceding steps twice; the second time, the times in one set are shifted by span
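A different technique that is sometimes used for this kind of problem is a binary search with np.searchsorted, which NumPy provides for sorted arrays. The sketch below is not the argsort approach above. The function name window_bounds and the sample data are assumptions, and the bounds follow the question's loop (delta_time >= 0 and delta_time < time_max).

```python
import numpy as np

def window_bounds(data_0, data_1, time_max):
    # For each t0, data_1[index_min:index_max] holds the t1 values
    # with 0 <= t1 - t0 < time_max. Both inputs must be sorted.
    index_min = np.searchsorted(data_1, data_0, side="left")
    index_max = np.searchsorted(data_1, data_0 + time_max, side="left")
    return index_min, index_max

# Integer sample data (assumed) so the brute-force check is exact.
rng = np.random.default_rng(1)
data_0 = np.sort(rng.integers(0, 20000, 1000))
data_1 = np.sort(rng.integers(0, 20000, 1000))
time_max = 50

index_min, index_max = window_bounds(data_0, data_1, time_max)
counts = index_max - index_min  # number of matching events per t0
print(counts.sum())
```

If strict inequality on the lower bound (0 < t1 - t0) is required, passing side="right" for index_min should give that behaviour.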
I am trying to group on two columns to get an aggregated value and then test that value to see if it is greater or smaller than a threshold. What I have:
SEGMENT = df.groupby(['Col_1','Col_2'])['Number'].apply(lambda x: '1_5' if sum(x) <6 else '6+')
It is slow. Is there a fundamental error in this approach? Thanks.
Edit:
SEGMENT = df.groupby(['Col_1','Col_2'])['Number'].sum().apply(lambda x: '1_5' if x <6 else '6+')
This speeds it up 3x.
You can do a transform and use it as a boolean mask:
g = df.groupby(['Col_1','Col_2'])
mask = g["Number"].transform("sum") < 6
df[mask] # with group sum smaller than 6
df[~mask] # with group sum greater or equal 6
You can also use filter:
g.filter(lambda x: x["Number"].sum() >= 6)
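If the goal is the '1_5' / '6+' label itself rather than a filtered frame, one option is to replace the Python lambda with np.where on the group sums. This is a minimal sketch with made-up sample data. The names sums, segment and row_segment are illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up frame using the column names from the question.
df = pd.DataFrame({
    "Col_1": ["a", "a", "a", "b", "b"],
    "Col_2": ["x", "x", "y", "x", "x"],
    "Number": [2, 3, 7, 1, 6],
})

# One label per group: built-in sum, then a vectorized comparison.
sums = df.groupby(["Col_1", "Col_2"])["Number"].sum()
segment = pd.Series(np.where(sums < 6, "1_5", "6+"), index=sums.index, name="SEGMENT")

# One label per original row, using transform to broadcast the group sum.
group_sum = df.groupby(["Col_1", "Col_2"])["Number"].transform("sum")
row_segment = np.where(group_sum < 6, "1_5", "6+")

print(segment)
print(row_segment)
```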
data is a matrix containing 2500 time series of a measurement. I need to average each time series over time, discarding data points that were recorded around a spike (in the interval tspike-10*dt ... tspike+10*dt). The number of spike times varies per neuron and is stored in a dictionary with 2500 entries. My current code iterates over neurons and spike times and sets the masked values to NaN. Then bottleneck.nanmean() is called. However, this code is too slow in its current version, and I am wondering whether there is a faster solution. Thanks!
import bottleneck
import numpy as np
from numpy.random import rand, randint
t = 1
dt = 1e-4
N = 2500
dtbin = 10*dt
# the shape must be made of integers, and ones has to come from NumPy
data = np.ones((N, int(round(t/dt))), dtype=np.float32)
times = np.arange(0,t,dt)
spiketimes = dict.fromkeys(np.arange(N))
for key in spiketimes:
    spiketimes[key] = rand(randint(100))
means = np.empty(N)
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    if len(spike_times) > 0:
        for spike_time in spike_times:
            start=max(spike_time-dtbin,0)
            end=min(spike_time+dtbin,t)
            idx = np.all([times>=start,times<=end],0)
            datarow[idx] = np.nan  # np.NaN was removed in NumPy 2.0
    means[i] = bottleneck.nanmean(datarow)
The vast majority of the processing time in your code comes from this line:
idx = np.all([times>=start,times<=end],0)
This is because for each spike, you are comparing every value in times against start and end. Since you have uniform time steps in this example (and I presume this is true in your data as well), it is much faster to simply compute the start and end indexes:
# This replaces the last loop in your example:
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    if len(spike_times) > 0:
        for spike_time in spike_times:
            start=max(spike_time-dtbin,0)
            end=min(spike_time+dtbin,t)
            #idx = np.all([times>=start,times<=end],0)
            #datarow[idx] = np.nan
            datarow[int(start/dt):int(end/dt)] = np.nan
    ## replaced this with equivalent for testing
    means[i] = datarow[~np.isnan(datarow)].mean()
This reduces the run time for me from ~100s to ~1.5s.
You can also shave off a bit more time by vectorizing the loop over spike_times. The effect of this will depend on the characteristics of your data (should be most effective for high spike rates):
kernel = np.ones(20, dtype=bool)
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    mask = np.zeros(len(datarow), dtype=bool)
    indexes = (spike_times / dt).astype(int)
    mask[indexes] = True
    mask = np.convolve(mask, kernel)[10:-9]
    means[i] = datarow[~mask].mean()
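Another way to build the per-row mask without looping over spikes is a difference array followed by a cumulative sum. This is a swapped-in technique and not the convolution above. The function name spike_mask and its parameters are assumptions, and the sketch assumes uniform time steps as in the example.

```python
import numpy as np

def spike_mask(n_samples, spike_times, dt, half_width):
    # True for samples whose index lies in [start, stop) of any spike window.
    spike_times = np.asarray(spike_times, dtype=float)
    start = np.clip(((spike_times - half_width) / dt).astype(int), 0, n_samples)
    stop = np.clip(((spike_times + half_width) / dt).astype(int), 0, n_samples)
    diff = np.zeros(n_samples + 1, dtype=int)
    np.add.at(diff, start, 1)   # +1 where a window opens
    np.add.at(diff, stop, -1)   # -1 where a window closes
    return np.cumsum(diff[:-1]) > 0

dt = 1e-4
n_samples = 10000
rng = np.random.default_rng(0)
spike_times = rng.random(50)          # made-up spike times in [0, 1)
row = rng.random(n_samples).astype(np.float32)

mask = spike_mask(n_samples, spike_times, dt, 10 * dt)
mean_without_spikes = row[~mask].mean()
print(mask.sum(), mean_without_spikes)
```

np.add.at is used instead of plain fancy-index assignment so that several spikes falling on the same sample are all counted.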
Instead of using nanmean, you could just index the values you need and use mean:
means[i] = datarow[ (times<start) | (times>end) ].mean()
If I misunderstood and you do need your indexing, you might try
means[i] = datarow[np.logical_not( np.all([times>=start,times<=end],0) )].mean()
Also, in the code you probably do not want to use if len(spike_times) > 0 (I assume you remove the spike time at each iteration, or else that statement will always be true and you'll have an infinite loop); just use for spike_time in spike_times.