How do you make combined rolling groups in Pandas - python

How do you get rolling groups in Pandas? I need group (1,2), then group (2,3), then group (3,4), etc. The best I can do is group (1,2), then group (3,4). The idea: I take group 1 and add its values to group 2. The next iteration is group (2,3): I take group 2's newly updated values and add them to group 3's original values. I then take group 3's newly updated values and add them to group 4's original values.
The most important part (don't get stuck on adding the values in exactly the right order, my post is just an example): I want to update a group, say group 2, with group 1's values; then in the next transform I want those new group 2 values to update the next group, which is 3; then in the next transform or apply I want those new group 3 values so I can update group 4. I hope that makes sense. Here is the data:
num group
1 1
2 1
2 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 4
df=pd.read_clipboard()
I want my first group to be the following; group 2 has had its values increased by group 1's values:
1 1
2 1
3 1
4 1
6 2
8 2
10 2
14 2
My second group will then be the new values modified by adding group 1, and group 3 will have its original values increased by group 2's new values:
6 2
8 2
10 2
14 2
15 3
18 3
21 3
26 3
My third group will be group 3's new values, and group 4 will have its original values increased by group 3's new values, in order:
15 3
18 3
21 3
26 3
29 4
33 4
36 4
42 4
I tried
df.groupby(np.arange(len(df)) // 4)
except it only splits into groups (1,2) and then the next group is (3,4). I need (1,2), (2,3), (3,4). This is because I process group 1 to make group 2's values, then use group 2 to create group 3's values, then use group 3 to make group 4's values. Any help on this would be appreciated. I made a simple example because I don't need help with what I'm doing inside the groups, I just need to know how to group like that.
Again, this is just an example, and I'm not trying to test anyone on the arithmetic. The key point is the cascading update: group 1's values update group 2; in the next transform, the newly updated group 2 values update group 3; and in the next transform or apply, the new group 3 values update group 4. I hope that makes sense.
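For reference, a minimal sketch of the cascading update described above; this is an illustration only, assuming every group has the same number of rows and the columns are named num and group as in the example:
import pandas as pd

def cascade_groups(df):
    # Add each group's already-updated values to the next group, position by position.
    out = df.copy()
    labels = out['group'].unique()
    for prev, curr in zip(labels, labels[1:]):
        prev_vals = out.loc[out['group'] == prev, 'num'].to_numpy()
        mask = out['group'] == curr
        out.loc[mask, 'num'] = out.loc[mask, 'num'].to_numpy() + prev_vals
    return out

df_cascaded = cascade_groups(df)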

I would do:
s = df.group.drop_duplicates()
l = [df.loc[df.group.isin([x, y])] for x, y in zip(s.iloc[1:], s.shift().iloc[1:])]
Update
df['num'] = df['num'].groupby(df.groupby('group').cumcount()).cumsum()
s = df.group.drop_duplicates()
l = [df.loc[df.group.isin([x, y])] for x, y in zip(s.iloc[1:], s.shift().iloc[1:])]
l[0]
num group
0 1 1
1 2 1
2 2 1
3 4 1
4 6 2
5 8 2
6 9 2
7 12 2

df['grp'] = df.apply(lambda x: x.iloc[::4]).groupby('num').cumsum()
df['grp'] = df['grp'].ffill()
print(df)
num group grp
0 1 1 1.0
1 2 1 1.0
2 3 1 1.0
3 4 1 1.0
4 5 2 2.0
5 6 2 2.0
6 7 2 2.0
7 8 2 2.0
8 9 3 3.0
9 10 3 3.0
10 11 3 3.0
11 12 3 3.0
12 13 4 4.0
13 14 4 4.0
14 15 4 4.0
15 16 4 4.0

Prerequisites:
df["group_sub"] = df.groupby("group").cumcount()  # position of each row within its group
dfprev = df["num"]
for i in range(1, df.group.nunique()):
    dfprev += df["num"].groupby(df["group_sub"]).shift(i).fillna(0)
df.drop("group_sub", axis=1, inplace=True)
You can do:
df_series = [df.loc[df.group.isin(df.group.unique()[i:i+2])] for i in range(df.group.nunique() - 1)]
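Each element of df_series then holds two consecutive groups, i.e. (1, 2), (2, 3), (3, 4); a quick way to check (illustrative only):
for pair in df_series:
    print(pair['group'].unique())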

Related

Length of a group within a group (apply groupby after a groupby)

I am facing the following problem:
I have groups (by ID), and for each of those groups I need to apply the following logic: if the distances between locations within a group are within 3 meters, they need to be added together, so a new (sub-)group is created (the code for creating such a group is shown below). What I want is the number of detections within a distance group, i.e. the length of that group.
This all worked, but after applying it to the ID groups it gives me an error.
The code is as follows:
def group_nearby_peaks(df, col, cutoff=-3.00):
    """
    This function groups nearby peaks based on location.
    When peaks are within 3 meters from each other they will be added together.
    """
    min_location_between_groups = cutoff
    df = df.sort_values('Location')
    return (
        df.assign(
            location_diff=lambda d: d['Location'].diff(-1).fillna(-9999),
            NOD=lambda d: d[col]
            .groupby(d["location_diff"].shift().lt(min_location_between_groups).cumsum())
            .transform(len)
        )
    )
def find_relative_difference(df, peak_col, difference_col):
    def relative_differences_per_ID(ID_df):
        return (
            spoortak_df.pipe(find_difference_peaks)
            .loc[lambda d: d[peak_col]]
            .pipe(group_nearby_peaks, difference_col)
        )
    return df.groupby('ID').apply(relative_differences_per_ID)
The error I get is the following:
ValueError: No objects to concatenate
With the following example dataframe, I expect this result.
ID Location
0 1 12.0
1 1 14.0
2 1 15.0
3 1 17.5
4 1 25.0
5 1 30.0
6 1 31.0
7 1 34.0
8 1 36.0
9 1 37.0
10 2 8.0
11 2 14.0
12 2 15.0
13 2 17.5
14 2 50.0
15 2 55.0
16 2 58.0
17 2 59.0
18 2 60.0
19 2 70.0
Expected result:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 5
Create group IDs s for Locations within 3 meters of each other. Locations more than 3 meters from the previous one are forced into a new ID, while the others keep the same ID. Finally, group by ID and s and count.
s = df.groupby('ID').Location.diff().fillna(0).abs().gt(3).cumsum()
df.groupby(['ID', s]).ID.count().reset_index(name='Number of detections').drop('Location', axis=1)
Out[190]:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 4
7 2 1
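For intuition, the intermediate grouper s can be inspected alongside the data; it increments whenever the gap to the previous Location within an ID exceeds 3 meters (a quick check reusing the df and s defined above):
print(pd.concat([df, s.rename('subgroup')], axis=1).head(10))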

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use it to locate the value in the same column but in the next row, and add that value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for the row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax:
i = np.arange(len(frame))
j = np.argmax(frame.values, axis=1)
frame['next'] = frame.shift(-1).values[i, j]
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
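An equivalent, arguably more readable sketch of the same idea using idxmax and shift; this is not part of the original answer and assumes the frame from the question:
max_col = frame.idxmax(axis=1)    # column label of each row's maximum
next_rows = frame.shift(-1)       # values from the following row
frame['Next'] = [next_rows.loc[idx, col] for idx, col in max_col.items()]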

Slicing Pandas Dataframe according to number of lines

I suppose this is something rather simple, but I can't find out how to do it. I've been searching tutorials and Stack Overflow.
Suppose I have a dataframe df looking like this:
Group Id_In_Group SomeQuantity
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36
I would like to select only the lines belonging to groups that have at least 4 objects (so there are at least 4 rows sharing the same "Group" number) and for which the 4th SomeQuantity, when the group is sorted by ascending SomeQuantity, is greater than 20 (for example).
In the given dataframe, for example, it would only return the 3rd group, since it has 5 (>= 4) members and its 4th SomeQuantity (after sorting) is 22 (>= 20), so it should construct the dataframe:
Group Id_In_Group SomeQuantity
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
(whether or not it is sorted by SomeQuantity does not matter).
Could somebody be kind enough to help me? :)
I would use .groupby() + .filter() methods:
In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].max() >= 20)
Out[66]:
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
A slightly different approach using map, value_counts, groupby, filter:
(df[df.Group.map(df.Group.value_counts().ge(4))]
 .groupby('Group')
 .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts to compute the total counts of the distinct elements present in the Group column.
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map, which functions like a dictionary (the index becomes the keys and the series elements become the values), to map these results back to the original DF.
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then we check for the elements having a value of 4 or more, which is our threshold, and take only that subset of the entire DF.
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 18
10 4 3 14
11 4 4 7
In order to use the groupby.filter operation on this, we must make sure that a single boolean value is returned for each grouped key when we perform the sorting and compare the fourth element against the threshold of 20.
np.any ensures the comparison is reduced to a single boolean for each group.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
.groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we compare the fourth element via .iloc[3] (0-based indexing) and keep the groups that pass.
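Putting the last two steps together, the per-group boolean that filter consumes can be previewed directly (illustrative only, reusing df from the question):
(df[df.Group.map(df.Group.value_counts().ge(4))]
 .groupby('Group')['SomeQuantity']
 .apply(lambda x: x.sort_values().iloc[3] >= 20))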
This is how I have worked through your question, warts and all. I'm sure there are much nicer ways to do this.
Find groups with "4 objects in the group"
import collections
groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3}); groups
Out:[3, 4]
Use these groups to filter to a new df containing these groups:
df2 = df[df.Group.isin(groups)]
"4th SomeQuantity (after sorting) is 22 (>=20)"
df3 = df2.sort_values(by='SomeQuantity',ascending=False)
df3.groupby('Group').filter(lambda grp: any(grp.sort_values('SomeQuantity').iloc[3] >= 20)).sort_index()
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
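For completeness, a sketch that folds both conditions into a single filter call; this is not from the answers above, just an alternative formulation:
df.groupby('Group').filter(
    lambda g: len(g) >= 4 and g['SomeQuantity'].nsmallest(4).iloc[-1] >= 20
)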

How to access individual elements within a rolling window on a dataframe

I have a dataframe with the quarterly U.S. GDP as column values. I would like to look at the values, 3 at a time, and find the index where the GDP fell for the next two consecutive quarters. This means I need to compare individual elements within df['GDP'] with each other, in groups of 3.
Here's an example dataframe.
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
df
GDP
0 4
1 4
2 4
3 1
4 4
5 4
6 8
7 2
8 3
9 9
I'm using df.rolling().apply(find_recession), but I don't know how I can access individual elements of the rolling window within my find_recession() function.
gdp['Recession_rolling'] = gdp['GDP'].rolling(window=3).apply(find_recession_start)
How can I access individual elements within the rolling window, so I can make a comparison such as gdp_val_2 < gdp_val_1 < gdp_val?
The .rolling().apply() will go through the entire dataframe, 3 values at a time, so let's take a look at one particular window, which starts at index location 6:
GDP
6 8 # <- gdp_val
7 2 # <- gdp_val_1
8 3 # <- gdp_val_2
How can I access gdp_val, gdp_val_1, and gdp_val_2 within the current window?
Using a lambda expression within .apply() will pass an array into the custom function (find_recession_start), and so I can just access the elements as I would any list/array e.g. arr[0], arr[1], arr[2]
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
def my_func(arr):
    # return 1 when the window is strictly decreasing (two consecutive falls), else 0
    if (arr[2] < arr[1]) & (arr[1] < arr[0]):
        return 1
    else:
        return 0
df['Result'] = df.rolling(window=3).apply(lambda x: my_func(x))
df
GDP Result
0 8 NaN
1 0 NaN
2 8 0.0
3 1 0.0
4 9 0.0
5 7 0.0
6 9 0.0
7 8 0.0
8 3 1.0
9 9 0.0
The short answer is: you can't, but you can use your knowledge about the structure of the dataframe/series.
You know the size of the window, you know the current index - therefore, you can output the shift relative to the current index:
Let's say this is your gdp:
In [627]: gdp
Out[627]:
0 8
1 0
2 0
3 4
4 0
5 3
6 6
7 2
8 5
9 5
dtype: int64
The naive approach is just to return the (argmin() - 2) and add it to the current index:
In [630]: gdp.rolling(window=3).apply(lambda win: win.argmin() - 2) + gdp.index
Out[630]:
0 NaN
1 NaN
2 1.0
3 1.0
4 2.0
5 4.0
6 4.0
7 7.0
8 7.0
9 7.0
dtype: float64
The naive approach won't always return the correct result, since you can't predict which index argmin() returns when there are ties, or when there is a rise in the middle of the window. But you get the idea.
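If the goal is just to flag the quarters where GDP falls for the next two consecutive quarters, here is a vectorised sketch that avoids rolling().apply() entirely (illustration only, assuming gdp is the GDP Series from above):
fell = gdp.diff() < 0                        # True where this quarter is below the previous one
next1 = fell.shift(-1, fill_value=False)     # the next quarter falls
next2 = fell.shift(-2, fill_value=False)     # the quarter after that falls as well
recession_starts = gdp.index[next1 & next2]  # indices where a two-quarter decline begins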

Select particular rows from inside groups in pandas dataframe

Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
Namely, this is the logic: look inside each group, if there is more than 1 distinct value present in the level column, return the first row in that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I attempted was combining groupby statements with creating sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None
df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to make any huge changes to get the last row instead.
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None
df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
In [121]:
print df.groupby('group').\
agg(lambda x: x.values[0] if (x.values!=x.values[0]).any() else np.nan).\
dropna()
level
group
1 10
3 9
For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.
Finally, dropna().
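A compact sketch of the same logic using transform('nunique'), which keeps the whole first (or last) row of each qualifying group; this is not from the answers above, shown only for comparison:
mask = df.groupby('group')['level'].transform('nunique') > 1  # groups with more than one distinct level
out_first = df[mask].groupby('group').head(1)                 # first row of each qualifying group
out_last = df[mask].groupby('group').tail(1)                  # last row instead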
