Rolling difference in Pandas - python

Does anyone know an efficient function/method, such as pandas.rolling_mean, that would calculate the rolling difference of an array?
This is my closest solution:
roll_diff = pd.Series(values).diff(periods=1)
However, this only calculates a single-step rolling difference. Ideally the step size would be adjustable (i.e. the difference between the current time step and the n previous steps).
I've also written this, but for larger arrays, it is quite slow:
import numpy as np

def roll_diff(values, step):
    diff = []
    for i in np.arange(step, len(values)-1):
        pers_window = np.arange(i-1, i-step-1, -1)
        diff.append(np.abs(values[i] - np.mean(values[pers_window])))
    diff = np.pad(diff, (0, step+1), 'constant')
    return diff
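For reference, the loop above can also be written without an explicit Python loop, using a rolling mean (a sketch of the same logic, assuming the absolute difference from the mean of the previous step values is what is wanted):
s = pd.Series(values)
# mean of the previous `step` values, aligned to the current position
prev_mean = s.rolling(step).mean().shift(1)
roll_diff_vec = (s - prev_mean).abs()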

What about:
import pandas
x = pandas.DataFrame(
    {'x_1': [0, 1, 2, 3, 0, 1, 2, 500]},
    index=[0, 1, 2, 3, 4, 5, 6, 7])
x['x_1'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
In general, you can replace the lambda function with your own function. Note that in this case the first item will be NaN.
Update
Defining the following:
n_steps = 2

def my_fun(x):
    return x.iloc[-1] - x.iloc[0]

x['x_1'].rolling(window=n_steps).apply(my_fun)
you can compute the difference between the first and last value in each window of n_steps observations.
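The same result can be obtained without apply by using Series.diff directly, since a window of n_steps observations spans n_steps - 1 periods (a sketch):
x['x_1'].diff(periods=n_steps - 1)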

You can do the same thing as in https://stackoverflow.com/a/48345749/1011724 if you work directly on the underlying numpy array:
import numpy as np
diff_kernel = np.array([1,-1])
np.convolve(rs, diff_kernel, 'same')
where rs is your pandas Series.
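For example (a minimal sketch; note that with mode='same' the first element is the boundary value rs[0] rather than NaN):
import numpy as np
import pandas as pd

rs = pd.Series([0, 1, 2, 3, 0, 1, 2, 500])
diff_kernel = np.array([1, -1])
# result[i] = rs[i] - rs[i-1] for i >= 1; result[0] is just rs[0]
np.convolve(rs, diff_kernel, 'same')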

This should work:
import numpy as np
x = np.array([1, 3, 6, 1, -5, 6, 4, 1, 6])
def running_diff(arr, N):
    return np.array([arr[i] - arr[i-N] for i in range(N, len(arr))])
running_diff(x, 4) # array([-6, 3, -2, 0, 11])
For a given pd.Series, you will have to define what you want for the first few items. The below example just returns the initial series values.
s_roll_diff = np.hstack((s.values[:4], running_diff(s.values, 4)))
This works because you can assign an np.array directly to a pd.DataFrame column, e.g. for a column s: df['s_roll_diff'] = np.hstack((df.s.values[:4], running_diff(df.s.values, 4)))
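If you would rather keep NaNs for the undefined positions instead of repeating the initial values, an alternative (a sketch) is:
s_roll_diff = pd.Series(
    np.concatenate((np.full(4, np.nan), running_diff(s.values, 4))),
    index=s.index,
)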

If you get KeyError: 0, try using iloc:
import pandas
x = pandas.DataFrame(
    {'x_1': [0, 1, 2, 3, 0, 1, 2, 500]},
    index=[0, 1, 2, 3, 4, 5, 6, 7])
x['x_1'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])

Applying numpy.diff:
import pandas as pd
import numpy as np
x = pd.DataFrame(
    {'x_1': [0, 1, 2, 3, 0, 1, 2, 500]})
print(x)
   x_1
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7  500
print(x['x_1'].rolling(window=2).apply(np.diff))
0      NaN
1      1.0
2      1.0
3      1.0
4     -3.0
5      1.0
6      1.0
7    498.0
Name: x_1, dtype: float64
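Since a two-element window difference is just a one-period difference, the same column can also be obtained without rolling at all (a sketch):
x['x_1'].diff()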

If you have unevenly-spaced intervals, or temporal gaps in your data, and you want to use a rolling window of time frequencies, rather than number of periods, you can easily end up in a situation where x.iloc[-1] - x.iloc[0] doesn't return the result you expect. Pandas can construct windows with exactly 1 point, so x.iloc[-1] == x.iloc[0] and the diff is always 0.
Sometimes this is the desired outcome, but other times you might want to use the last-known value from before the start of each window.
A general solution (perhaps not so efficient) is to first artificially construct an evenly-spaced series, interpolate or fill data as needed (e.g. using Series.ffill), and then use the .rolling() techniques described in other answers.
# Data with temporal gaps
y = pd.Series(..., index=pd.DatetimeIndex(...))
# Your desired frequency
freq = '1M'
# Construct a new Index with this frequency, using your data ranges
idx_artificial = pd.date_range(y.index.min(), y.index.max(), freq=freq)
# Artificially expand the data to the evenly-spaced index
# New data points will be inserted with null/NaN values
y_artificial = y.reindex(idx_artificial)
# Fill the empty values with last-known value
# This part will vary depending on your needs
y_artificial.ffill(inplace=True)
# Now compute the diffs, using the forward-filled artificially-spaced data
y_diff = y_artificial.rolling(window=freq).apply(lambda x: x.iat[-1] - x.iat[0])
And here are some helper functions to implement the above, for your copy-paste pleasure (warning: lightly-tested code written by a complete stranger, use with caution):
def date_range_from_index(index, freq=None, start=None, end=None, **kwargs):
    if start is None:
        start = index.min()
    if end is None:
        end = index.max()
    if freq is None:
        try:
            freq = index.freq
        except AttributeError:
            freq = None
        if freq is None:
            raise ValueError('Frequency not provided and input has no set frequency.')
    return pd.date_range(start, end, freq=freq, **kwargs)

def fill_dtindex(y, freq=None, start=None, end=None, fill=None):
    new_index = date_range_from_index(y.index, freq=freq, start=start, end=end)
    y = y.reindex(new_index)
    if fill is not None:
        if isinstance(fill, str):
            y = y.fillna(method=fill)
        else:
            y = y.fillna(fill)
    return y
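For example, the helpers might be used like this (a sketch; the daily fill frequency and 30-day window are placeholder values, not from the original answer):
# forward-fill onto an evenly-spaced daily index, then roll a 30-day time window
y_filled = fill_dtindex(y, freq='1D', fill='ffill')
y_diff = y_filled.rolling('30D').apply(lambda x: x.iat[-1] - x.iat[0])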

Related

Add indexes to index list around index value in numpy

I have the following arrays:
time = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6, 6e-6, 7e-6, 8e-6, 9e-6, 10e-6]
signal = [0, 10, 3, 2, 1, 0, 10, 2, 2, 5]
and I want to remove (from both arrays) any datapoints that are above a threshold value, with a given padding width
threshold = 9
padding = 3e-6
so any data points that are above 9 in the signal array, or that fall within the padding window around such a point in the time array, should be removed from both arrays. Note: the windows can overlap if two above-threshold points lie within the padding of each other.
example output
time_out = [4e-6, 5e-6, 9e-6, 10e-6]
signal_out = [2, 1, 2, 5]
EDIT: this post is very similar, but it only handles a single index of an array, whereas I need to do it at multiple indices (e.g. around time=2e-6 and time=7e-6): https://stackoverflow.com/a/66695205/12728698
Let's try this one. The idea is to create a boolean mask which is True wherever a sample lies outside the padding window of every above-threshold signal. I divided the padding by 3 since, if I understand correctly, the padding is a window of 3 samples, so we only need to drop the above-threshold signals and their two adjacent values.
time_arr = np.array(time)
signal_arr = np.array(signal)
llim = time_arr[signal_arr>threshold, None] - padding/3
ulim = time_arr[signal_arr>threshold, None] + padding/3
msk = ((llim > time_arr) | (ulim < time_arr)).all(axis=0)
time_out = time_arr[msk]
signal_out = signal_arr[msk]
Another option is to use numpy.roll to get the adjacent values to create a boolean mask:
comp = signal_arr<=threshold
msk = np.roll(comp, 1) & comp & np.roll(comp, -1)
time_out = time_arr[msk]
signal_out = signal_arr[msk]
Output:
array([4.e-06, 5.e-06, 9.e-06, 1.e-05])
array([2, 1, 2, 5])
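For reuse, the first approach can be wrapped in a small helper (an illustrative sketch; the function name is made up, not from the original answer):
import numpy as np

def drop_around_threshold(time, signal, threshold, padding):
    time_arr = np.asarray(time)
    signal_arr = np.asarray(signal)
    half = padding / 3  # one sample spacing on each side, as in the answer above
    llim = time_arr[signal_arr > threshold, None] - half
    ulim = time_arr[signal_arr > threshold, None] + half
    # keep samples that lie outside every window around an above-threshold sample
    msk = ((llim > time_arr) | (ulim < time_arr)).all(axis=0)
    return time_arr[msk], signal_arr[msk]

time_out, signal_out = drop_around_threshold(time, signal, threshold, padding)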

Use information of two arrays to create a third one

I have two numpy arrays and want to create a third one with the information in these two.
Here is a simple example:
have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])
solution = np.array([[1, 1, 3, 4], [5, 5, 5, 8]])
I want to use the "use" array, which tells me how many times the first element of each row in the "have" array should be repeated.
So the 2 in "use" means that the new array "solution" should start with two 1s in its first row. Similarly, the 3 in "use" means that the second row should start with three 5s. The rest of "have" should stay the same.
It is important that the solution works with the "use" array (or numpy arrays in general).
Do you have some ideas?
If the data structures are small and performance is not an issue, you can do it as simply as:
np.array([[a[0]]*b[0] + list(a[b[0]:]) for a, b in zip(have, use)])
Simply iterate through have and replace the values based on use.
Use:
for i in range(use.shape[0]):
    have[i, :use[i, 0]] = np.repeat(have[i, 0], use[i, 0])
Using only numpy operations:
First create a boolean mask of the same size as have, where mask[i, j] is True if j < use[i, 0] and False otherwise. So mask is True exactly at the positions that should be replaced by the first-column value. Now use np.where to do the replacement.
n, m = have.shape
mask = np.repeat(np.arange(m)[None, :], n, axis=0) < use
have = np.where(mask, have[:, 0:1], have)
Output:
>>> have
array([[1, 1, 3, 4],
       [5, 5, 5, 8]])
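As an aside, the explicit np.repeat is not strictly needed, because broadcasting builds the same mask directly (a sketch):
n, m = have.shape
mask = np.arange(m) < use            # (m,) broadcast against (n, 1) gives an (n, m) mask
have = np.where(mask, have[:, 0:1], have)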
If performance matters, you can use np.apply_along_axis().
import numpy as np
have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])
def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    res = np.concatenate([res, arr[rep+1:]])
    return res
solution = np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))
Update:
As #hpaulj said, the apply_along_axis method above is actually not as efficient as I expected; I had misunderstood it. Reference: numpy np.apply_along_axis function speed up?.
However, I ran some timings of the current methods:
import numpy as np
from timeit import timeit
def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    res = np.concatenate([res, arr[rep + 1:]])
    return res

def test(row, col, run):
    have = np.random.randint(0, 100, size=(row, col))
    use = np.random.randint(0, col, size=(row, 1))
    d = locals()
    d.update(globals())
    # method by me
    t1 = timeit("np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))", number=run, globals=d)
    # method by #quantummind
    t2 = timeit("np.array([[a[0]] * b[0] + list(a[b[0]:]) for a, b in zip(have, use)])", number=run, globals=d)
    # method by #Amit Vikram Singh
    t3 = timeit(
        "np.where(np.repeat(np.arange(have.shape[1])[None, :], have.shape[0], axis=0) < use, have[:, 0:1], have)",
        number=run, globals=d
    )
    print(f"{t1:8.6f}, {t2:8.6f}, {t3:8.6f}")
test(1000, 10, 10)
test(100, 100, 10)
test(10, 1000, 10)
test(1000000, 10, 1)
test(100000, 100, 1)
test(10000, 1000, 1)
test(1000, 10000, 1)
test(100, 100000, 1)
test(10, 1000000, 1)
results:
0.062488, 0.028484, 0.000408
0.010787, 0.013811, 0.000270
0.001057, 0.009146, 0.000216
6.146863, 3.210017, 0.044232
0.585289, 1.186013, 0.034110
0.091086, 0.961570, 0.026294
0.039448, 0.917052, 0.022553
0.028719, 0.919377, 0.022751
0.035121, 1.027036, 0.025216
It shows that the np.where method proposed by #Amit Vikram Singh (the third column) is consistently the fastest, even when the arrays are huge.

Optimize the python function with numpy without using the for loop

I have the following python function:
def npnearest(u: np.ndarray, X: np.ndarray, Y: np.ndarray, distance: 'callable' = npdistance):
    '''
    Finds x1 so that x1 is in X and u and x1 have a minimal distance (according to the
    provided distance function) compared to all other data points in X. Returns the label of x1.

    Args:
        u (np.ndarray): The vector (ndim=1) we want to classify
        X (np.ndarray): A matrix (ndim=2) with training data points (vectors)
        Y (np.ndarray): A vector containing the label of each data point in X
        distance (callable): A function that receives two inputs and defines the distance function used

    Returns:
        int: The label of the data point which is closest to `u`
    '''
    xbest = None
    ybest = None
    dbest = float('inf')
    for x, y in zip(X, Y):
        d = distance(u, x)
        if d < dbest:
            ybest = y
            xbest = x
            dbest = d
    return ybest
where npdistance simply gives the distance between two points, i.e.
def npdistance(x1, x2):
    return np.sum((x1 - x2)**2)
I want to optimize npnearest by performing nearest neighbor search directly in numpy. This means that the function cannot use for/while loops.
Thanks
Since you don't need to use that exact function, you can simply change the sum to work over a particular axis. This returns a new array of distances, and you can call argmin to get the index of the minimum value. Use that index to look up your label:
import numpy as np
def npdistance_idx(x1, x2):
    return np.argmin(np.sum((x1-x2)**2, axis=1))
Y = ["label 0", "label 1", "label 2", "label 3"]
u = np.array([[1, 5.5]])
X = np.array([[1,2], [1, 5], [0, 0], [7, 7]])
idx = npdistance_idx(X, u)
print(Y[idx]) # label 1
Numpy supports vectorized operations (broadcasting). This means you can pass in arrays and the operations will be applied to entire arrays in an optimized way (SIMD: single instruction, multiple data). You can then get the index of the array minimum using .argmin(). Hope this helps.
In [9]: numbers = np.arange(10); numbers
Out[9]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [10]: numbers -= 5; numbers
Out[10]: array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])
In [11]: numbers = np.power(numbers, 2); numbers
Out[11]: array([25, 16, 9, 4, 1, 0, 1, 4, 9, 16])
In [12]: numbers.argmin()
Out[12]: 5
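Putting the two answers together, a loop-free version of npnearest might look like this (a sketch; the name npnearest_vec is mine, and it assumes the squared Euclidean distance from npdistance in the question):
import numpy as np

def npnearest_vec(u, X, Y):
    # squared Euclidean distance from u to every row of X, via broadcasting
    d = np.sum((X - u) ** 2, axis=1)
    # label of the closest training point
    return Y[np.argmin(d)]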

parallelize zonal computation on numpy array

I am trying to compute the mode over all cells of the same zone (same value) in a numpy array. I give an example of the code below. In this example the sequential approach works fine, but the multiprocessed approach does nothing, and I cannot find my mistake.
Does someone see my error?
I would like to parallelize the computation because my real array is a 10k * 10k array with 1M zones.
import numpy as np
import scipy.stats as ss
import multiprocessing as mp
def zone_mode(i, a, b, output):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output[to_extract] = ss.mode(val)[0][0]
    return output

def zone_mode0(i, a, b):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output = ss.mode(val)[0][0]
    return output
np.random.seed(1)
zone = np.array([[1, 1, 1, 2, 3],
                 [1, 1, 2, 2, 3],
                 [4, 2, 2, 3, 3],
                 [4, 4, 5, 5, 3],
                 [4, 6, 6, 5, 5],
                 [6, 6, 6, 5, 5]])
values = np.random.randint(8, size=zone.shape)
output = np.zeros_like(zone).astype(float)
for i in np.unique(zone):
    output = zone_mode(i, zone, values, output)
# for multiprocessing
zone0 = zone - 1
pool = mp.Pool(mp.cpu_count() - 1)
results = [pool.apply(zone_mode0, args=(u, zone0, values)) for u in np.unique(zone0)]
pool.close()
output = results[zone0]
For positive integers in the arrays zone and values, we can use np.bincount. The basic idea is to treat zone and values as row and column indices on a 2D grid, map each pair to its linear-index equivalent, and use those linear indices as bins for a binned count with np.bincount. The per-zone argmax of the counts gives the mode of each zone, which is then mapped back onto the zone grid by indexing with zone.
Hence, the solution would be -
m = zone.max()+1
n = values.max()+1
ids = zone*n + values
c = np.bincount(ids.ravel(),minlength=m*n).reshape(-1,n).argmax(1)
out = c[zone]
For sparse data (well-spread integers in the input arrays), we can use a sparse matrix to get the argmax IDs c. Hence, with SciPy's sparse matrix:
from scipy.sparse import coo_matrix
data = np.ones(zone.size,dtype=int)
r,c = zone.ravel(),values.ravel()
c = coo_matrix((data,(r,c))).argmax(1).A1
For slight perf. boost, specify the shape -
c = coo_matrix((data,(r,c)),shape=(m,n)).argmax(1).A1
Solving for generic values
We will make use of pandas.factorize, like so -
import pandas as pd
ids,unq = pd.factorize(values.flat)
v = ids.reshape(values.shape)
# .. same steps as earlier with bincount, using v in place of values
out = unq[c[zone]]
Note that in tie cases this picks an arbitrary element from values; if you want ties to resolve to the first (smallest) value, use pd.factorize(values.flat, sort=True).
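For convenience, the generic-values variant can be collected into a single helper (a lightly adapted sketch of the steps above; the function name zonal_mode is mine):
import numpy as np
import pandas as pd

def zonal_mode(zone, values):
    # factorize so the bincount trick also works for non-integer values
    ids, unq = pd.factorize(values.flat, sort=True)
    v = ids.reshape(values.shape)
    m = zone.max() + 1      # zones assumed to be non-negative integers
    n = v.max() + 1         # number of distinct values
    # count (zone, value) pairs via their linear index, take the per-zone argmax
    c = np.bincount((zone * n + v).ravel(), minlength=m * n).reshape(-1, n).argmax(1)
    # map each zone's most frequent value id back onto the grid
    return unq[c[zone]]

out = zonal_mode(zone, values)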

Map numbers to their percentiles

I would like to apply the result of numpy.percentile to its argument, i.e., map every number in the input vector to its quantile.
E.g., if v=np.array([1,2,3,4]), and I want just two quantiles (bigger and smaller than the median), I would get np.array([0,0,1,1]) telling me that the first two elements of v are smaller than the median and the last two are bigger than the median.
Note that I am interested in, say, deciles, not just the median!
In other words, #PaulPanzer hit the nail on the head:
np.digitize(v,np.percentile(v,quantiles))
thanks!
(v > np.percentile(v, 50)).astype(int)
Out[93]:
array([0, 0, 1, 1])
Use np.digitize:
perc = np.percentile(data, q)
indices = np.digitize(data, perc)
Example q = [25,50,75], data = np.arange(8):
indices
# array([0, 0, 1, 1, 2, 2, 3, 3])
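For deciles specifically (as asked above), the same pattern works with the 10th through 90th percentiles (a sketch):
import numpy as np

v = np.array([1, 2, 3, 4])
deciles = np.percentile(v, np.arange(10, 100, 10))   # 10th, 20th, ..., 90th percentiles
np.digitize(v, deciles)                              # decile bin (0-9) for each element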
