rolling window in python of positive values in a list - python

What is a pythonic way to calculate the mean of a list, but only considering the positive values?
So if I have the values
[1,2,3,4,5,-1,4,2,3] and I want to calculate the rolling mean of three values, it is basically calculating the rolling average of [1,2,3,4,5,'nan',4,2,3].
And that becomes
[nan,2,3,4,4.5,4.5,3,3,nan] where the first and the last nan are due to the missing elements.
The 2 = mean ([1,2,3])
the 3 = mean ([2,3,4])
but the 4.5 = mean ([4,5,nan])=mean ([4,5])
and so on. So it is important that when there are negative values they are excluded, but the division is by the number of positive values only.
I tried:
def RollingPositiveAverage(listA, nElements):
    listB = [element for element in listA if element > 0]
    return pd.rolling_mean(listB, 3)
but listB has elements missing, so the window positions no longer line up. I tried to substitute those elements with nan, but then the mean itself becomes nan.
Is there any nice and elegant way to solve this?
Thanks

Since you are using Pandas:
import numpy as np
import pandas as pd
def RollingPositiveAverage(listA, window=3):
    s = pd.Series(listA)
    s[s < 0] = np.nan
    result = s.rolling(window, center=True, min_periods=1).mean()
    result.iloc[:window // 2] = np.nan
    result.iloc[-(window // 2):] = np.nan
    return result  # or result.values or list(result) if you prefer an array or list
print(RollingPositiveAverage([1, 2, 3, 4, 5, -1, 4, 2, 3]))
Output:
0    NaN
1    2.0
2    3.0
3    4.0
4    4.5
5    4.5
6    3.0
7    3.0
8    NaN
dtype: float64
Plain Python version:
import math
def RollingPositiveAverage(listA, window=3):
    result = [math.nan] * (window // 2)
    for win in zip(*(listA[i:] for i in range(window))):
        win = tuple(v for v in win if v >= 0)
        result.append(float(sum(win)) / max(len(win), 1))  # max() guards against all-negative windows
    result.extend([math.nan] * (window // 2))
    return result
print(RollingPositiveAverage([1, 2, 3, 4, 5, -1, 4, 2, 3]))
Output:
[nan, 2.0, 3.0, 4.0, 4.5, 4.5, 3.0, 3.0, nan]

Get the rolling summations, get the count of valid elements in each window by taking rolling summations of the mask of positive elements, and simply divide the two to get the average values. For the rolling summations, we could use np.convolve.
Hence, the implementation -
def rolling_mean(a, W=3):
    a = np.asarray(a)  # convert to array
    k = np.ones(W)     # kernel for convolution
    # Mask of positive numbers and get clipped array
    m = a >= 0
    a_clipped = np.where(m, a, 0)
    # Get rolling windowed summations and divide by the rolling valid counts
    return np.convolve(a_clipped, k, 'same') / np.convolve(m, k, 'same')
Extending to the specific case of NaN-padding at the boundaries -
def rolling_mean_pad(a, W=3):
    hW = (W - 1) // 2  # half window size for padding
    a = np.asarray(a)  # convert to array
    k = np.ones(W)     # kernel for convolution
    # Mask of positive numbers and get clipped array
    m = a >= 0
    a_clipped = np.where(m, a, 0)
    # Get rolling windowed summations and divide by the rolling valid counts
    out = np.convolve(a_clipped, k, 'same') / np.convolve(m, k, 'same')
    out[:hW] = np.nan
    out[-hW:] = np.nan
    return out
Sample run -
In [54]: a
Out[54]: array([ 1, 2, 3, 4, 5, -1, 4, 2, 3])
In [55]: rolling_mean_pad(a, W=3)
Out[55]: array([ nan, 2. , 3. , 4. , 4.5, 4.5, 3. , 3. , nan])

Related

Filling parts of a list without a loop

I have the following list or numpy array
ll=[7.2,0,0,0,0,0,6.5,0,0,-8.1,0,0,0,0]
and an additional list indicating the positions of non-zeros
i=[0,6,9]
I would like to make two new lists out of them, one filling the zeros and one counting in between, for this short example:
a=[7.2,7.2,7.2,7.2,7.2,7.2,6.5,6.5,6.5,-8.1,-8.1,-8.1,-8.1,-8.1]
b=[0,1,2,3,4,5,0,1,2,0,1,2,3,4]
Is there a way to do that without a for loop to speed things up, as the list ll is quite long in my case?
Array a is the result of a forward fill, and array b holds, for each position, the offset since the last non-zero element.
pandas has a forward fill function, but it should be easy enough to compute with numpy and there are many sources on how to do this.
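For comparison, the pandas route mentioned above is roughly a one-liner; a minimal sketch, treating the zeros as missing values:
import numpy as np
import pandas as pd

ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
a = pd.Series(ll).replace(0, np.nan).ffill().to_numpy()
# array([ 7.2,  7.2,  7.2,  7.2,  7.2,  7.2,  6.5,  6.5,  6.5, -8.1, -8.1, -8.1, -8.1, -8.1])
Doing the same in plain numpy: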
ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
a = np.array(ll)
# find zero elements; keep each non-zero element's own index, 0 elsewhere
# (this works because the first element is non-zero, as in the example)
mask = a == 0
idx = np.where(~mask, np.arange(mask.size), 0)
# do the fill
a[np.maximum.accumulate(idx)]
output:
array([ 7.2,  7.2,  7.2,  7.2,  7.2,  7.2,  6.5,  6.5,  6.5, -8.1, -8.1,
       -8.1, -8.1, -8.1])
More information about forward fill is found here:
Most efficient way to forward-fill NaN values in numpy array
Finding the consecutive zeros in a numpy array
To compute array b you can reuse the forward-fill indices and combine them with a single np.arange:
fill_mask = np.maximum.accumulate(idx)
np.arange(len(fill_mask)) - fill_mask
output:
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 0, 1, 2, 3, 4])
So...
import numpy as np
ll = np.array([7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0])
i = np.array([0, 6, 9])
counts = np.append(
    np.diff(i),       # difference between consecutive elements of i (one element shorter than i)
    len(ll) - i[-1],  # plus the length of the last run
)
repeated = np.repeat(ll[i], counts)
repeated becomes
[ 7.2 7.2 7.2 7.2 7.2 7.2 6.5 6.5 6.5 -8.1 -8.1 -8.1 -8.1 -8.1]
b could be computed with
b = np.concatenate([np.arange(c) for c in counts])
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]
but that involves a loop in the form of that list comprehension; perhaps someone Numpyier could implement it without a Python loop.
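One loop-free possibility (a sketch, not from the original answer) is to subtract the repeated run-start indices from a single np.arange, reusing counts from above:
b = np.arange(len(ll)) - np.repeat(i, counts)
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]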

How to find first occurrence of a significant difference in values of a pandas dataframe?

In a Pandas DataFrame, how would I find the first occurrence of a large difference between two values at two adjacent indices?
As an example, if I have a DataFrame column A with data [1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1], I would want index holding 1.5, which would be 5. In my code below, it would give me the index holding 7.2, because 15 - 7.2 > 7 - 1.5.
idx = df['A'].diff().idxmax() - 1
How should I fix this problem, so I get the index of the first 'large difference' occurrence?
The main issue is of course how you define a "large difference". Your solution is pretty good for getting the largest difference; it can be improved by using .diff(-1) and absolute values, as shown by Jezrael:
differences = df['A'].diff(-1).abs()
Using absolute values matters if your values are not sorted, in which case you can get negative differences.
Then, you should probably do some clustering on these values and get the smallest index of the cluster with the largest values. Jezrael already showed a heuristic using the largest quartile; however, with only a slight modification of your example, it no longer works:
df = pd.DataFrame({'A': [1, 1.05, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1]})
differences = df['A'].diff(-1).abs()
idx = differences.index[differences >= differences.quantile(.75)][0]
print(idx, differences[idx])
This returns 1 0.1499999999999999
Here are 3 other heuristics that might work better for you:
If you have a value above which you consider a difference to be “large” (e.g. 1.5):
idx = differences.index[differences >= 1.5][0]
If you know how many large values there are, you can select those and get the smallest index (e.g. 2):
idx = differences.nlargest(2).index.min()
If you know all small values are grouped together (as are all the 0.1 in your example), you can filter what's larger than the mean (or the mean + 1 standard deviation if your “large” values are very close to the smaller ones).
idx = differences.index[differences >= differences.mean()][0]
This is because contrarily to the median, your few large differences will pull the mean up significantly.
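A quick check of that claim, using the (rounded) differences from the original example; a small sketch:
import pandas as pd

differences = pd.Series([0.1, 0.1, 0.1, 0.1, 0.1, 5.5, 0.1, 0.1, 7.8, 0.1])
print(differences.mean())                                       # ~1.41, far above the median of 0.1
print(differences.index[differences >= differences.mean()][0])  # 5, the first large jump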
If you really want to go for proper clustering, you can use the KMeans algorithm from scikit-learn:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(differences.values[:-1].reshape(-1, 1))  # drop the trailing NaN
clusters = pd.Series(kmeans.labels_, index=differences.index[:-1])
idx = clusters.index[clusters.eq(np.squeeze(kmeans.cluster_centers_).argmax())][0]
This classifies the data into 2 classes, and then gets the classification into a pandas Series. We then filter this series’ index by selecting only the cluster that has the highest values, and finally get the first element of this filtered index.
One idea is to filter by Series.quantile: build a Series of differences with diff(-1) and absolute values, then take the first index that passes the threshold:
df = pd.DataFrame({'A':[1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1]})
x = df['A'].diff(-1).abs()
print (x)
0     0.1
1     0.1
2     0.1
3     0.1
4     0.1
5     5.5
6     0.1
7     0.1
8     7.8
9     0.1
10    NaN
Name: A, dtype: float64
idx = x.index[x >= x.quantile(.75)]
print (idx)
Int64Index([5, 7, 8], dtype='int64')
print (idx[0])
5
If you have a NumPy array (which you can get from any DataFrame column), you can use numpy.argwhere with a threshold:
import numpy as np

a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])
diff = np.diff(a)
threshold = 2  # set your threshold
max_index = np.argwhere(diff > threshold)
# array([[5], [8]]) -> the first "large difference" is at index 5
References:
https://numpy.org/doc/stable/reference/generated/numpy.diff.html
https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html
More Info:
pandas.Series.diff calculates diff[i] = a[i] - a[i-1], so the first element is NaN and the result has the same length as the input.
numpy.diff calculates diff[i] = a[i+1] - a[i] and returns an array one element shorter.
That is why the jump before 7 shows up at index 5 with numpy.diff but at index 6 with pandas' diff.
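A small sketch illustrating the difference (printed values rounded):
import numpy as np
import pandas as pd

a = [1, 1.1, 1.2, 1.3, 1.4, 1.5, 7]
print(pd.Series(a).diff().values)  # [nan 0.1 0.1 0.1 0.1 0.1 5.5] -> same length, NaN first, jump at index 6
print(np.diff(a))                  # [0.1 0.1 0.1 0.1 0.1 5.5]     -> one element shorter, jump at index 5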
def shift(a):
    a_r = np.roll(a, 1)   # right shift
    a_l = np.roll(a, -1)  # left shift
    return np.stack([a_l, a_r], axis=1)
a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])
diff = abs(shift(a) - a.reshape(-1, 1))
diff = diff[1:-1]
indices = diff.argmax(axis=0) - 2
a[indices]
array([7. , 1.5])

Normalization: how to avoid zero standard deviation

Have the following task:
Normalize the matrix by columns. From each value in a column, subtract the column average and divide by the column standard deviation. Your output should not contain NaN (caused by division by zero). Replace NaNs with 1. Don't use if or while/for.
I am working with numpy, so I wrote the following code:
def normalize(matrix: np.array) -> np.array:
    res = (matrix - np.mean(matrix, axis=0)) / np.std(matrix, axis=0, dtype=np.float64)
    return res

matrix = np.array([[1, 4, 4200], [0, 10, 5000], [1, 2, 1000]])
assert np.allclose(
    normalize(matrix),
    np.array([[ 0.7071, -0.39223,  0.46291],
              [-1.4142,  1.37281,  0.92582],
              [ 0.7071, -0.98058, -1.38873]])
)
The answer is right.
However, my question is: how do I avoid division by zero? If I have a column of identical numbers, the standard deviation is 0 and I get NaN values in the result. How do I solve it? Would be grateful!
Your task specifies to avoid nan in the output and replace nan that occur with 1. It does not specify that intermediate results may not contain nan. A valid solution can be to use numpy.nan_to_num on res before returning:
import numpy as np

def normalize(matrix: np.array) -> np.array:
    res = (matrix - np.mean(matrix, axis=0)) / np.std(matrix, axis=0, dtype=np.float64)
    return np.nan_to_num(res, False, 1.0)  # positional args: copy=False, nan=1.0

matrix = np.array([[2, 4, 4200], [2, 10, 5000], [2, 2, 1000]])
print(normalize(matrix))
yields:
[[ 1.         -0.39223227  0.46291005]
 [ 1.          1.37281295  0.9258201 ]
 [ 1.         -0.98058068 -1.38873015]]
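An alternative sketch (not from the answer above) that avoids the division by zero up front by substituting 1 for any zero standard deviation and then writing the required 1s directly:
import numpy as np

def normalize(matrix: np.array) -> np.array:
    std = np.std(matrix, axis=0, dtype=np.float64)
    zero_std = std == 0                  # columns with no spread would divide by zero
    res = (matrix - np.mean(matrix, axis=0)) / np.where(zero_std, 1.0, std)
    res[:, zero_std] = 1.0               # the task asks for 1 where NaN would have appeared
    return res

print(normalize(np.array([[2, 4, 4200], [2, 10, 5000], [2, 2, 1000]])))  # same result as above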

Scipy poisson distribution with an upper limit

I am generating a random number using scipy stats.
I used the Poisson distribution.
Below is an example:
import scipy.stats as sct

A = 2.5
Pos = sct.poisson.rvs(A, size=20)
When I print Pos, I got the following numbers:
array([1, 3, 2, 3, 1, 2, 1, 2, 2, 3, 6, 0, 0, 4, 0, 1, 1, 3, 1, 5])
You can see from the array that some numbers, such as 6, are generated.
What I want to do is to limit the biggest number (let's say to 5), i.e. any random number generated using sct.poisson.rvs should be less than or equal to 5.
How can I tweak my code to achieve this?
By the way, I am using this in Pandas Dataframe.
I think the solution is quite simple (assuming I understood your issue correctly):
# for repeatability:
import numpy as np
np.random.seed(0)
from scipy.stats import poisson, uniform
sample_size = 20
maxval = 5
mu = 2.5
cutoff = poisson.cdf(maxval, mu)
# generate uniform distribution [0, cutoff):
u = uniform.rvs(scale=cutoff, size=sample_size)
# convert to Poisson:
truncated_poisson = poisson.ppf(u, mu)
Then print(truncated_poisson):
[2. 3. 3. 2. 2. 3. 2. 4. 5. 2. 4. 2. 3. 4. 0. 1. 0. 4. 3. 4.]
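poisson.ppf returns floats; if integer counts are needed, e.g. for the DataFrame column mentioned in the question, a cast is enough (a small sketch; the column name below is just a placeholder):
import pandas as pd

df = pd.DataFrame({'events': truncated_poisson.astype(int)})  # 'events' is a hypothetical column name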
What you want could be called the truncated Poisson distribution, except that in the common usage of this term, truncation happens from below instead of from above (example). The easiest, even if not always the most efficient, way to sample a truncated distribution is to double the requested array size and keep only the elements that fall in the desired range; if there are not enough, double the size again, etc. As shown below:
import scipy.stats as sct

def truncated_Poisson(mu, max_value, size):
    temp_size = size
    while True:
        temp_size *= 2
        temp = sct.poisson.rvs(mu, size=temp_size)
        truncated = temp[temp <= max_value]
        if len(truncated) >= size:
            return truncated[:size]

mu = 2.5
max_value = 5
print(truncated_Poisson(mu, max_value, 20))
Typical output: [0 1 4 5 0 2 3 2 2 2 5 2 3 3 3 3 4 1 0 3].
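As a rough sanity check (not part of the original answer), the acceptance rate of this rejection scheme equals P(X <= max_value), which is high here, so the first doubling almost always suffices:
from scipy.stats import poisson

print(poisson.cdf(5, 2.5))  # ~0.958, i.e. about 96% of the raw draws are kept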

Way of easily finding the average of every nth element over a window of size k in a pandas.Series? (not the rolling mean)

The motivation here is to take a time series and get the average activity throughout a sub-period (day, week).
It is possible to reshape an array and take the mean over the y axis to achieve this, similar to this answer (but using axis=2):
Averaging over every n elements of a numpy array
but I'm looking for something which can handle arrays of length N % k != 0 and does not solve the issue by reshaping and padding with ones or zeros (e.g. numpy.resize), i.e. takes the average over the existing data only.
E.g Start with a sequence [2,2,3,2,2,3,2,2,3,6] of length N=10 which is not divisible by k=3. What I want is to take the average over columns of a reshaped array with mis-matched dimensions:
In:  [[2,2,3],
      [2,2,3],
      [2,2,3],
      [6]], k = 3
Out: [3,2,3]
Instead of:
In:  [[2,2,3],
      [2,2,3],
      [2,2,3],
      [6,0,0]], k = 3
Out: [3,1.5,2.25]
Thank you.
You can use a masked array to pad with special values that are ignored when computing the mean, rather than padding with zeros that would be included in the sum.
import numpy as np

# example sequence from the question
in_arr = np.array([2, 2, 3, 2, 2, 3, 2, 2, 3, 6])
k = 3

# how long the array needs to be to be divisible by k
padded_len = (len(in_arr) + (k - 1)) // k * k

# create a np.ma.MaskedArray with the padded entries masked
padded = np.ma.empty(padded_len)
padded[:len(in_arr)] = in_arr
padded[len(in_arr):] = np.ma.masked

# now we can treat it as an array divisible by k:
mean = padded.reshape((-1, k)).mean(axis=0)

# if you need to remove the masked-ness
assert not np.ma.is_masked(mean), "in_arr was too short to calculate all means"
mean = mean.data
You can easily do it by padding, reshaping and calculating by how many elements to divide each row:
>>> import numpy as np
>>> a = np.array([2,2,3,2,2,3,2,2,3,6])
>>> k = 3
Pad data
>>> b = np.pad(a, (0, k - a.size%k), mode='constant').reshape(-1, k)
>>> b
array([[2, 2, 3],
       [2, 2, 3],
       [2, 2, 3],
       [6, 0, 0]])
Then create a mask:
>>> c = a.size // k # 3
>>> d = (np.arange(k) + c * k) < a.size # [True, False, False]
The first part of d will create an array that contains [9, 10, 11], and compare it to the size of a (10), generating the mentioned boolean mask.
And divide it:
>>> b.sum(0) / (c + 1.0 * d)
array([ 3., 2., 3.])
The above will divide the first column by 4 (c + 1 * True) and the rest by 3. This is vectorized numpy, thus, it scales very well to large arrays.
Everything can be written shorter, I just show all the steps to make it more clear.
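For reference, one even shorter loop-free variant (a sketch, not part of the original answer) divides per-column sums by per-column counts via np.bincount:
import numpy as np

a = np.array([2, 2, 3, 2, 2, 3, 2, 2, 3, 6])
k = 3
col = np.arange(a.size) % k                           # column index of each element
out = np.bincount(col, weights=a) / np.bincount(col)  # per-column sum / per-column count
# array([3., 2., 3.])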
Flatten the list In by unpacking and chaining. Create a new list that arranges the flattened list lst by columns, then use the map function to calculate the average of each column:
from itertools import chain

In = [[2, 2, 3], [2, 2, 3], [2, 2, 3], [6]]
lst = list(chain(*In))   # materialize the chain so it can be sliced
k = 3
In_by_cols = [lst[i::k] for i in range(k)]
# [[2, 2, 2, 6], [2, 2, 2], [3, 3, 3]]
Out = list(map(lambda x: sum(x) / float(len(x)), In_by_cols))
# [3.0, 2.0, 3.0]
Using float on the length of each sublist will provide a more accurate result on python 2.x as it won't do integer truncation.
