Length of a group within a group (apply groupby after a groupby) - python

I am facing the following problem:
I have groups (by ID), and for each of those groups I need to apply the following code: if the distances between locations within a group are within 3 meters, they need to be added together, hence a new group is created (the code that creates such a group is shown below). What I want is the number of detections within each distance group, hence the length of that group.
This all worked on its own, but applying it to the ID groups gives me an error.
The code is as follows:
def group_nearby_peaks(df, col, cutoff=-3.00):
    """
    This function groups nearby peaks based on location.
    When peaks are within 3 meters from each other they will be added together.
    """
    min_location_between_groups = cutoff
    df = df.sort_values('Location')
    return (
        df.assign(
            location_diff=lambda d: d['Location'].diff(-1).fillna(-9999),
            NOD=lambda d: d[col]
            .groupby(d["location_diff"].shift().lt(min_location_between_groups).cumsum())
            .transform(len)
        )
    )

def find_relative_difference(df, peak_col, difference_col):
    def relative_differences_per_ID(ID_df):
        return (
            spoortak_df.pipe(find_difference_peaks)
            .loc[lambda d: d[peak_col]]
            .pipe(group_nearby_peaks, difference_col)
        )
    return df.groupby('ID').apply(relative_differences_per_ID)
def find_relative_difference(df, peak_col, difference_col):
def relative_differences_per_ID(ID_df):
return (
spoortak_df.pipe(find_difference_peaks)
.loc[lambda d: d[peak_col]]
.pipe(group_nearby_peaks, difference_col)
)
return df.groupby('ID').apply(relative_differences_per_ID)
The error I get is the following:
ValueError: No objects to concatenate
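From what I can tell, this exact message is raised by pd.concat when it receives an empty sequence, which is what groupby(...).apply does internally when it ends up with no per-group results to stitch back together. A one-line reproduction of just the message:
import pandas as pd
pd.concat([])  # ValueError: No objects to concatenate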
With the following example dataframe, I expect this result.
ID Location
0 1 12.0
1 1 14.0
2 1 15.0
3 1 17.5
4 1 25.0
5 1 30.0
6 1 31.0
7 1 34.0
8 1 36.0
9 1 37.0
10 2 8.0
11 2 14.0
12 2 15.0
13 2 17.5
14 2 50.0
15 2 55.0
16 2 58.0
17 2 59.0
18 2 60.0
19 2 70.0
Expected result:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 5

Create a group key s for Locations within 3 meters of each other. Rows that are more than 3 meters from the previous row start a new group ID, while the others share the ID of their neighbours. Finally, group by ID and s and count:
s = df.groupby('ID').Location.diff().fillna(0).abs().gt(3).cumsum()
df.groupby(['ID', s]).ID.count().reset_index(name='Number of detections').drop(columns='Location')
Out[190]:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 4
7 2 1
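For reference, a self-contained sketch of this answer on the example data (the frame is rebuilt from the table in the question; note the last two rows here differ from the expected output above because 60.0 to 70.0 is a 10 m gap, so 70.0 starts its own group):
import pandas as pd

df = pd.DataFrame({
    'ID': [1] * 10 + [2] * 10,
    'Location': [12.0, 14.0, 15.0, 17.5, 25.0, 30.0, 31.0, 34.0, 36.0, 37.0,
                 8.0, 14.0, 15.0, 17.5, 50.0, 55.0, 58.0, 59.0, 60.0, 70.0],
})

# A gap of more than 3 m to the previous row (within an ID) starts a new group.
s = df.groupby('ID').Location.diff().fillna(0).abs().gt(3).cumsum()

out = (df.groupby(['ID', s]).ID.count()
         .reset_index(name='Number of detections')
         .drop(columns='Location'))
print(out)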

Related

Sum all values row-wise conditionally grouped by id

My end goal is to sum the minutes in column min, but only from the initial period through the final period, grouped by id.
I have thousands of ids, and not all of them have the same number of min rows between initial and final.
Periods are sorted in a "journey" fashion; each record represents a period of time for its id.
Pseudocode:
Iterate the rows and sum all values in column "min"
if the summed stretch starts where periods == "initial" and ends where periods == "final"
Example with 2 ids:
id  periods     min
1   period_x    10
1   initial     2
1   progress    3
1   progress_1  4
1   final       5
2   period_y    10
2   period_z    2
2   initial     3
2   progress_1  20
2   final       3
Desired output:
id  periods     min  sum
1   period_x    10   14
1   initial     2    14
1   progress    3    14
1   progress_1  4    14
1   final       5    14
2   period_y    10   26
2   period_z    2    26
2   initial     3    26
2   progress_1  20   26
2   final       3    26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df.id.where(df.zone_name.isin(L)).groupby(df['if']).transform('sum')
But this doesn't count what is in between initial and final.
Create groups using cumsum, return the sum of group 1, and then apply that sum to the entire column. "Group 1" is anything per id that falls between initial and final:
import numpy as np
df['grp'] = df['periods'].isin(['initial','final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform(max)
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
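A self-contained version of the above for reference (a sketch; the frame is rebuilt from the example table):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'periods': ['period_x', 'initial', 'progress', 'progress_1', 'final',
                'period_y', 'period_z', 'initial', 'progress_1', 'final'],
    'min': [10, 2, 3, 4, 5, 10, 2, 3, 20, 3],
})

# Rows from 'initial' through 'final' end up in group 1 within each id;
# anything before 'initial' stays in group 0.
df['grp'] = df['periods'].isin(['initial', 'final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())

# Sum 'min' only inside group 1, then broadcast that per-id sum to every row.
df['sum'] = np.where(df['grp'].eq(1),
                     df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform('max')
print(df)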

fill nan values with values from another row with common values in two or more columns [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one  two  three
1    1    10
1    1    nan
1    1    nan
1    2    nan
1    2    20
1    2    nan
1    3    nan
1    3    nan
I want to use the keys of columns one and two: where those keys match and column three is not entirely NaN, impute the missing values from a row with the same keys that does have a value in column three.
Here is my desired result:
one  two  three
1    1    10
1    1    10
1    1    10
1    2    20
1    2    20
1    2    20
1    3    nan
1    3    nan
You can see that the key pair (1, 3) still contains no value, because no existing value is available for it.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried a forward fill, which gives me a rather strange result where it forward-fills column two instead. I am using this code for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group, which needs apply with a lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three'].apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if there are multiple values per group and NaN needs to be replaced by some constant, e.g. the mean per group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three'].apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
You can sort the data by the column with missing values, then groupby and forward-fill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
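For completeness, a self-contained sketch of the fill-per-group approach (the frame is rebuilt from the example; two grouped passes, ffill then bfill, avoid the per-group lambda and leave all-NaN groups untouched):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3],
    'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Forward- then backward-fill within each (one, two) group; the (1, 3)
# group has no value at all, so it stays NaN.
df['three'] = df.groupby(['one', 'two'], sort=False)['three'].ffill()
df['three'] = df.groupby(['one', 'two'], sort=False)['three'].bfill()
print(df)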

Adding values of a pandas column only at specific indices

I am working on a pandas dataframe, something like the one below:
id vals
0 1 11
1 1 5.5
2 1 -2
3 1 8
4 2 3
5 2 4
6 2 19
7 2 20
The above is just a small part of the df. The vals are grouped by id, and there will always be an equal number of vals per id; in the case above that is 4 values each for id = 1 and id = 2.
What I am trying to achieve is to add the value at index 0 to the value at index 4, then the value at index 1 to the value at index 5, and so on.
The following is the expected df/series, say df2:
total
0 14
1 9.5
2 17
3 28
Real df has hundreds of id and not just two as above.
groupby() can be used, but I don't get how to access the specific indices from each group.
Please let me know if anything is unclear.
Group by the modulo of the df.index values and take the sum of vals:
In [805]: df.groupby(df.index % 4).vals.sum()
Out[805]:
0 14.0
1 9.5
2 17.0
3 28.0
Name: vals, dtype: float64
Since there are exactly 4 values per ID, we can simply reshape the underlying 1D array into a 2D array and sum along the appropriate axis (axis=0 in this case):
pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Sample run -
In [192]: df
Out[192]:
id vals
0 1 11.0
1 1 5.5
2 1 -2.0
3 1 8.0
4 2 3.0
5 2 4.0
6 2 19.0
7 2 20.0
In [193]: pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Out[193]:
total
0 14.0
1 9.5
2 17.0
3 28.0
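If the group size of 4 should not be hard-coded, a variation (a sketch, not taken from the answers above) is to group on the position of each row within its id, which pandas exposes as groupby(...).cumcount():
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2, 2, 2, 2],
    'vals': [11, 5.5, -2, 8, 3, 4, 19, 20],
})

# cumcount() numbers rows 0..n-1 within each id, so summing vals by that
# position adds the k-th value of every id together.
pos = df.groupby('id').cumcount()
print(df.groupby(pos).vals.sum())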

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from column b, grouped by the values in column a. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where every row with the same 'a' value gets the same total.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique items in the index. Use transform to produce a like-indexed column, the same length as the original data frame, without performing any merges:
df['total'] = df.groupby('a')['b'].transform(sum)
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
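A minimal self-contained run of the same idea; the string alias 'sum' works as well and dispatches to pandas' built-in grouped sum:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

# transform broadcasts each group's sum back to every row of that group,
# so the result has the same length as df.
df['total'] = df.groupby('a')['b'].transform('sum')
print(df)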
