How to roll Series elements on non-overlapping intervals - python

I'd like to calculate a rolling sum of elements, the way R's rollapply does:
s = pd.Series([1,2,3,4,5,6])
As a result I'd like to receive a new Series with the sum of elements over non-overlapping intervals (window size is 2):
3
7
11
The pandas Series.rolling method works differently, producing sums over overlapping intervals. Please tell me how to do what I want.

You can try grouping consecutive pairs by integer-dividing the index:
s.groupby(s.index//2).sum()
0 3
1 7
2 11
dtype: int64

Here is another solution, using a list comprehension over non-overlapping slices:
import numpy as np
s = pd.Series([1,2,3,4,5,6])
pd.Series([np.sum(s[x:x + 2]) for x in range(0, len(s), 2)])
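If the Series length is a multiple of the window size, a vectorized sketch using NumPy reshaping should also work (the window variable and the divisibility requirement are assumptions for illustration, not part of the answers above):
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
window = 2
# Reshape the underlying array into rows of length `window`, then sum each row;
# this assumes len(s) % window == 0.
pd.Series(s.values.reshape(-1, window).sum(axis=1))
# 0     3
# 1     7
# 2    11
# dtype: int64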

Related

Why does pandas.DataFrame.sum(axis=0) return the sum of values in each column, when axis=0 represents rows?

In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) is called.
But it returns the sum of values in each column, and vice versa. Why?
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
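As a quick check of that reading, a minimal sketch using the question's own frame:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
# Summing across the columns (axis=1) gives one total per row,
# which is the output the question expected from axis=0.
df.sum(axis=1)
# 0     3
# 1     6
# 2     9
# 3    12
# 4    15
# dtype: int64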
I was reading the source code of the pandas project, and I think this comes from NumPy, where the axis parameter is used the same way (0 sums vertically and 1 horizontally); additionally, pandas uses NumPy under the hood to perform the sum.
In this link you can check that pandas uses the numpy.cumsum function to perform the sum.
And this link is for the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, the 'anant' answer is a good approach: interpret the sum as over the axis instead of across it. So when 0 is specified you are computing the sum over the rows (iterating over the index, to be more compliant with the pandas docs). When axis is 1 you are iterating over the columns.

python pandas calculate a mean

I have a data frame like this:
pk_dcdata threshold last_ep diff
window
1 11075761 0.00001 4 3
1 11075768 0.00001 7 6
2 11075769 0.00001 1 -1
2 11075770 0.00001 1 -1
3 11075771 0.00001 1 0
3 11075768 0.00001 7 6
I want to calculate the mean of the column 'diff' grouped by the index 'window', and save the means into a new list. E.g. for window = 1 the mean is (3+6)/2, next for window = 2 it is (-1-1)/2, and so on.
Expected outcome: list = [4.5, -1, 3]
I tried to use 'rolling_mean' but don't know how to set the moving length. Because the dataset is big, I hope to find a fast way to get the result.
Don't use list as a variable name, because it shadows the Python built-in.
You need to aggregate by mean per index level and finally convert the Series to a list:
L = df.groupby(level=0)['diff'].mean().tolist()
#alternative
#L = df.groupby('window')['diff'].mean().tolist()
print (L)
[4.5, -1.0, 3.0]
The alternative works in pandas 0.20.0+; check the docs.
You can use groupby(): let's say your dataframe is called df
avg_diff = df['diff'].groupby(level=0).mean()
This will provide you with a Series holding the mean per window.
If you then want to put it in a list you can do it like this:
my_list = avg_diff.tolist()
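For reference, a self-contained sketch reproducing the question's data as shown (the index values are taken from the table above):
import pandas as pd

df = pd.DataFrame(
    {"pk_dcdata": [11075761, 11075768, 11075769, 11075770, 11075771, 11075768],
     "threshold": [0.00001] * 6,
     "last_ep": [4, 7, 1, 1, 1, 7],
     "diff": [3, 6, -1, -1, 0, 6]},
    index=pd.Index([1, 1, 2, 2, 3, 3], name="window"),
)

# Mean of 'diff' per window, converted to a plain list.
L = df.groupby(level=0)["diff"].mean().tolist()
print(L)  # [4.5, -1.0, 3.0]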

How to count the longest uninterrupted sequence in pandas

Let's say I have a pd.Series like the one below:
s = pd.Series([False, True, False, True, True, True, False, False])
0 False
1 True
2 False
3 True
4 True
5 True
6 False
7 False
dtype: bool
I want to know how long the longest True sequence is; in this example, it is 3.
I tried it in a stupid way.
s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count += 1
    else:
        if count > max_count:
            max_count = count
        count = 0
print(max_count)
It will print 3, but for a Series of all True values it will print 0.
Option 1
Use the series itself to mask the cumulative sum of its negation, then use value_counts:
(~s).cumsum()[s].value_counts().max()
3
explanation
(~s).cumsum() is a pretty standard way to produce distinct True/False groups
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 4
dtype: int64
But you can see that the group we care about is represented by the 2s and there are four of them. That's because the group is initiated by the first False (which becomes True with (~s)). Therefore, we mask this cumulative sum with the boolean mask we started with.
(~s).cumsum()[s]
1 1
3 2
4 2
5 2
dtype: int64
Now we see the three 2s pop out and we just have to use a method to extract them. I used value_counts and max.
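Put together as a runnable sketch (assuming the series contains at least one True; on an all-False series value_counts would be empty and max would fail):
import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])
# (~s).cumsum() increments at every False, so consecutive True values share a
# label; masking with s keeps only the True positions before counting.
run_labels = (~s).cumsum()[s]
print(run_labels.value_counts().max())  # 3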
Option 2
Use factorize and bincount
a = s.values
b = pd.factorize((~a).cumsum())[0]
np.bincount(b[a]).max()
3
explanation
This is a similar explanation as for option 1. The main difference is in how I found the max. I use pd.factorize to tokenize the values into integers ranging from 0 to the total number of unique values. Given the actual values we had in (~a).cumsum(), we didn't strictly need this part. I used it because it's a general-purpose tool that could be used on arbitrary group names.
After pd.factorize I use those integer values in np.bincount which accumulates the total number of times each integer is used. Then take the maximum.
Option 3
As stated in the explanation of option 2, this also works:
a = s.values
np.bincount((~a).cumsum()[a]).max()
3
I think this could work
pd.Series(s.index[~s].values).diff().max()-1
Out[57]: 3.0
Also, outside pandas, we can fall back to Python's groupby:
from itertools import groupby
max([len(list(group)) for key, group in groupby(s.tolist())])
Out[73]: 3
Update:
from itertools import compress
max(list(compress([len(list(group)) for key, group in groupby(s.tolist())],[key for key, group in groupby(s.tolist())])))
Out[84]: 3
You can use (inspired by @piRSquared's answer):
s.groupby((~s).cumsum()).sum().max()
Out[513]: 3.0
Another option is to use a lambda function:
s.to_frame().apply(lambda x: s.loc[x.name:].idxmin() - x.name, axis=1).max()
Out[429]: 3
Edit: As piRSquared mentioned, my previous solution needs two False values appended, at the beginning and at the end of the series. piRSquared kindly gave an answer based on that.
(np.diff(np.flatnonzero(np.append(True, np.append(~s.values, True)))) - 1).max()
My original trial is
(np.diff(s.where(~s).dropna().index.values) - 1).max()
(This will not give the correct answer if the longest True run starts at the beginning or ends at the end of the series, as pointed out by piRSquared. Please use the solution above given by piRSquared. This remains only for explanation.)
Explanation:
This finds the indices of the False entries, and from the gaps between those indices we can find the longest run of True.
s.where(s == False).dropna().index.values finds all the indices of False
array([0, 2, 6, 7])
We know that Trues live between the Falses. Thus, we can use
np.diff to find the gaps between these indices.
array([2, 4, 1])
Subtract 1 at the end, as the Trues lie strictly between these indices.
Find the maximum of the difference.
Your code was actually very close. It becomes perfect with a minor fix:
count = 0
maxCount = 0
for item in s:
    if item:
        count += 1
        if count > maxCount:
            maxCount = count
    else:
        count = 0
print(maxCount)
I'm not exactly sure how to do it with pandas, but what about using itertools.groupby?
>>> import pandas as pd
>>> from itertools import groupby
>>> s = pd.Series([False, True, False, True, True, True, False, False])
>>> max(sum(1 for _ in g) for k, g in groupby(s) if k)
3

Comparing values within the same dataframe column

Is there any way to compare values within the same column of a pandas DataFrame?
The task at hand is something like this:
import pandas as pd
data = pd.DataFrame({"A": [0, -5, 2, 3, -3, -4, -4, -2, -1, 5, 6, 7, 3, -1]})
I need to find the maximum number of consecutive same-sign values (equivalently, consecutive Boolean values, since the sign can be encoded as True/False). The above data should yield 5, because there are 5 consecutive negative integers: [-3, -4, -4, -2, -1].
If possible, I was hoping to avoid using a loop, because the number of data points in the column may well be in the millions.
I've tried using data.A.rolling() and its variants, but can't seem to figure out any way to do this in a vectorized fashion.
Any suggestions?
Here's a NumPy approach that computes the max interval lengths for the positive and negative values -
import numpy as np

def max_interval_lens(arr):
    # Store mask of positive values
    pos_mask = arr >= 0
    # Get indices of shifts
    idx = np.r_[0, np.flatnonzero(pos_mask[1:] != pos_mask[:-1]) + 1, arr.size]
    # Lengths of the intervals between shifts
    lens = np.diff(idx)
    s = int(pos_mask[0])
    maxs = [0, 0]
    if len(lens) == 1:
        maxs[1 - s] = lens[0]
    else:
        maxs = lens[1 - s::2].max(), lens[s::2].max()
    return maxs  # Positive, negative max lens
Sample run -
In [227]: data
Out[227]:
A
0 0
1 -5
2 2
3 3
4 -3
5 -4
6 -4
7 -2
8 -1
9 5
10 6
11 7
12 3
13 -1
In [228]: max_interval_lens(data['A'].values)
Out[228]: (4, 5)
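For comparison, a pandas sketch of the same idea, reusing the run-labelling trick from the previous question (note it reports only the single longest same-sign run, not the positive/negative pair):
import pandas as pd

data = pd.DataFrame({"A": [0, -5, 2, 3, -3, -4, -4, -2, -1, 5, 6, 7, 3, -1]})
sign = data["A"] >= 0                   # True for non-negative, False for negative
runs = (sign != sign.shift()).cumsum()  # new label every time the sign flips
print(sign.groupby(runs).size().max())  # 5, the run [-3, -4, -4, -2, -1]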

Subtract subgroup averages from individuals without resorting to a for loop

I have a dataframe with a number of columns, two of which are grouping variables.
>>> df2
Groupvar1 Groupvar2 x y z
0 A 1 0.726317 0.574514 0.700475
1 A 2 0.422089 0.798931 0.191157
2 A 3 0.888318 0.658061 0.686496
....
13 B 2 0.978920 0.764266 0.673941
14 B 3 0.759589 0.162488 0.698958
and I want to make a new dataframe which holds the difference between each data point in the original df and the mean corresponding to its subgroup.
So to begin with, I make the new df with the grouped averages:
>>> grp_vars = ['Groupvar1','Groupvar2']
>>> df2_grp = df2.groupby(grp_vars)
>>> df2_grp_avg = df2_grp.mean()
>>> df2_grp_avg
x y z
Groupvar1 Groupvar2
A 1 0.364533 0.645237 0.886286
2 0.325533 0.500077 0.246287
3 0.796326 0.496950 0.510085
4 0.774854 0.688732 0.487547
B 1 0.743783 0.452482 0.612006
2 0.575687 0.396902 0.446126
3 0.473152 0.476379 0.508060
4 0.434320 0.406458 0.382187
and in the new dataframe I want to keep the deltas, defined as:
delta = individual value - average value of the subgroup this individual is a member of
Now, it's clear to me how to do this the hard way (a for loop), but I suppose there must be a more elegant solution. Appreciate any advice on finding that more elegant solution. TIA.
Use the .groupby(...).transform function:
>>> demean = lambda df: df - df.mean()
>>> df.groupby(['Groupvar1', 'Groupvar2']).transform(demean)
and then pd.concat the result with the original data-frame.
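A minimal sketch of the full recipe, with made-up data laid out like the question's (the column values are placeholders):
import numpy as np
import pandas as pd

np.random.seed(0)
df2 = pd.DataFrame({
    "Groupvar1": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Groupvar2": [1, 1, 2, 2, 1, 1, 2, 2],
    "x": np.random.rand(8),
    "y": np.random.rand(8),
    "z": np.random.rand(8),
})

# Subtract each subgroup's mean from its members, then reattach the group keys.
deltas = df2.groupby(["Groupvar1", "Groupvar2"]).transform(lambda g: g - g.mean())
result = pd.concat([df2[["Groupvar1", "Groupvar2"]], deltas], axis=1)
print(result)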
