My end goal is to sum the minutes only from "initial" to "final" in the periods column, grouped by id.
I have thousands of ids, and they do not all have the same number of min rows between initial and final.
Periods are sorted in a "journey" fashion: each record represents a period of time for its id.
Pseudocode:
iterate rows and sum the values in column "min"
where the run starts at periods == "initial" and ends at periods == "final"
Example with 2 ids:
   id     periods  min
0   1    period_x   10
1   1     initial    2
2   1    progress    3
3   1  progress_1    4
4   1       final    5
5   2    period_y   10
6   2    period_z    2
7   2     initial    3
8   2  progress_1   20
9   2       final    3
Desired output:
   id     periods  min  sum
0   1    period_x   10   14
1   1     initial    2   14
2   1    progress    3   14
3   1  progress_1    4   14
4   1       final    5   14
5   2    period_y   10   26
6   2    period_z    2   26
7   2     initial    3   26
8   2  progress_1   20   26
9   2       final    3   26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df['min'].where(df.periods.isin(L)).groupby(df['id']).transform('sum')
But this only sums the initial and final rows themselves (7 instead of 14 for id 1); it doesn't count what is in between.
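For a self-contained run of the answer below, the example frame can be rebuilt from the question's table (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'periods': ['period_x', 'initial', 'progress', 'progress_1', 'final',
                'period_y', 'period_z', 'initial', 'progress_1', 'final'],
    'min': [10, 2, 3, 4, 5, 10, 2, 3, 20, 3],
})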
Create groups using cumsum and then return the sum of group 1, then apply that sum to the entire column. "Group 1" is anything per id that is between initial and final:
import numpy as np

# flag the boundary rows and cumulative-sum per id; the cumsum puts 'final'
# into group 2, so np.where forces the 'final' row itself back into group 1
df['grp'] = df['periods'].isin(['initial', 'final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())
# sum 'min' only inside group 1 (initial..final), NaN elsewhere
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
# broadcast the group-1 sum to every row of the id
df['sum'] = df.groupby('id')['sum'].transform('max')
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
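An alternative sketch builds a membership mask directly, assuming each id has exactly one initial row followed by one final row:
# True from the 'initial' row onward, per id
started = df['periods'].eq('initial').astype(int).groupby(df['id']).cumsum().eq(1)
# True strictly after the 'final' row, per id
after_final = df['periods'].eq('final').astype(int).groupby(df['id']).cumsum().eq(1) & df['periods'].ne('final')
# rows between initial and final, boundaries included
between = started & ~after_final
df['sum'] = df['min'].where(between).groupby(df['id']).transform('sum')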
Related
I am trying to find the average of a value over each run of the same column_a value, before it switches to another value.
An example of the dataframe:
    column_a  value_b  sum_c  count_d_  avg_e
0          0       10     10         1
1          0       20     30         2
2          0       30     60         3     20
3          1       10     10         1
4          1       20     30         2
5          1       30     60         3     20
6          0       10     10         1
7          0       20     30         2     15
8          1       10     10         1
9          1       20     30         2
10         1       30     60         3     20
11         0       10     10         1
12         0       20
However, for the last row, sum and count are unavailable, so the avg cannot be calculated for it.
part of the code...
# sum and avg for each section
for i, row in df.iloc[0:-1].iterrows():
    if df['column_a'][i] == 0:
        sum = sum + df['value_b'][i]
        df['sum_c'][i] = sum
        count = count + 1
        df['count_d'][i] = count
    else:
        sum = 0
        count = 0
        df['sum_c'][i] = sum
        df['count_d'][i] = count

totcount = 0
for m, row in df.iloc[0:-1].iterrows():
    if df.loc[m, 'column_a'] == 0:
        if df.loc[m + 1, 'sum_c'] == 0:
            totcount = df.loc[m, 'count_d']
            avg_e = df.loc[m, 'sum_c'] / totcount
            df.loc[m, 'avg_e'] = avg_e
I have tried using df.iloc[0:].iterrows() instead, but it produces an error.
You can rewrite your full code with a cumulative sum to build the groups, then groupby.cumcount, groupby.transform('mean'), and masking with where.
# mask with True for the last row of each successive group; it is reversed
# so that the cumsum below numbers the groups from the bottom up
# (later operations align on the index, so the order does not matter)
m = df['column_a'].ne(df['column_a'].shift(-1))[::-1]
# make a grouper: one id per run of equal column_a values
group = m.cumsum()
# for each group
g = df.groupby(group)['value_b']
# running sum within each group
df['sum_c'] = g.cumsum()
# running count within each group (1-based)
df['count_d_'] = g.cumcount().add(1)
# compute the mean and assign it only to the last row of each group
df['avg_e'] = g.transform('mean').where(m)
Output:
column_a value_b sum_c count_d_ avg_e
0 0 10 10 1 NaN
1 0 20 30 2 NaN
2 0 30 60 3 20.0
3 1 10 10 1 NaN
4 1 20 30 2 NaN
5 1 30 60 3 20.0
6 0 10 10 1 NaN
7 0 20 30 2 15.0
8 1 10 10 1 NaN
9 1 20 30 2 NaN
10 1 30 60 3 20.0
11 0 10 10 1 NaN
12 0 20 30 2 15.0
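For a self-contained run, the input frame can be rebuilt from the question's two raw columns (sum_c, count_d_, and avg_e are the columns the code derives):
import pandas as pd

df = pd.DataFrame({
    'column_a': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
    'value_b':  [10, 20, 30, 10, 20, 30, 10, 20, 10, 20, 30, 10, 20],
})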
It is the expected behavior of df.iloc[0:-1] to return all rows except the last one. When slicing, remember that the stop index you provide is not included in the returned range; since -1 is the index of the last row, [0:-1] excludes it.
The solution given by @mozway is more elegant anyway, but if for any reason you still want to use iterrows(), you can use df.iloc[0:].
The error you got when you did is likely due to df.loc[m+1, 'sum_c']: at the last row, m+1 is out of bounds and raises an error.
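A quick illustration of the slicing behavior:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
print(df.iloc[0:-1])  # rows 0 and 1; the stop index -1 (the last row) is excluded
print(df.iloc[0:])    # all three rows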
I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How can I replace that 0 with the difference to the next lower price in the same group?
For example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, nan, and I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd

data = [
    [1, 'a', 1, 1, 5], [2, 'a', 1, 1, 5], [3, 'a', 1, 1, 4], [4, 'a', 1, 1, 2],
    [5, 'b', 1, 2, 6], [6, 'b', 1, 2, 6], [7, 'b', 1, 2, 3],
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .diff(-1))
I think you can first remove duplicates across all the columns used for the groupby plus price, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname', 'beds', 'baths', 'price', 'difference_price']], how='left')
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back-fills the 0 values within each group, which avoids wrong output for one-row groups (where a value could otherwise leak in from another group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
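To see what the lambda does step by step, here is a toy series standing in for one sorted group of prices:
import numpy as np
import pandas as pd

s = pd.Series([5, 5, 4, 2])                    # one group, already sorted descending
print(s.diff(-1))                              # 0.0, 1.0, 2.0, NaN
print(s.diff(-1).replace(0, np.nan).bfill())   # 1.0, 1.0, 2.0, NaN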
I am facing the following problem:
I have groups (by ID), and for each of those groups I need to apply this logic: if locations within a group are within 3 meters of each other, they need to be added together, creating a new (distance) group (the code for creating a group is shown below). What I want is the number of detections within each distance group, i.e. the length of the group.
This all worked on its own, but after applying it to the ID groups it gives me an error.
The code is as follows:
def group_nearby_peaks(df, col, cutoff=-3.00):
    """
    This function groups nearby peaks based on location.
    When peaks are within 3 meters from each other they will be added together.
    """
    min_location_between_groups = cutoff
    df = df.sort_values('Location')
    return (
        df.assign(
            location_diff=lambda d: d['Location'].diff(-1).fillna(-9999),
            NOD=lambda d: d[col]
            .groupby(d["location_diff"].shift().lt(min_location_between_groups).cumsum())
            .transform(len)
        )
    )

def find_relative_difference(df, peak_col, difference_col):
    def relative_differences_per_ID(ID_df):
        # note: this closure references the outer spoortak_df, not its ID_df argument
        return (
            spoortak_df.pipe(find_difference_peaks)
            .loc[lambda d: d[peak_col]]
            .pipe(group_nearby_peaks, difference_col)
        )
    return df.groupby('ID').apply(relative_differences_per_ID)
The error I get is the following:
ValueError: No objects to concatenate
With the following example dataframe, I expect this result.
ID Location
0 1 12.0
1 1 14.0
2 1 15.0
3 1 17.5
4 1 25.0
5 1 30.0
6 1 31.0
7 1 34.0
8 1 36.0
9 1 37.0
10 2 8.0
11 2 14.0
12 2 15.0
13 2 17.5
14 2 50.0
15 2 55.0
16 2 58.0
17 2 59.0
18 2 60.0
19 2 70.0
Expected result:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 5
Create a group id s for Locations within 3 meters of each other. Jumps of more than 3 meters start a new group, so isolated locations end up in single-row groups while nearby locations share a group id. Finally, group by ID and s and count:
s = df.groupby('ID').Location.diff().fillna(0).abs().gt(3).cumsum()
df.groupby(['ID', s]).ID.count().reset_index(name='Number of detections').drop(columns='Location')
Out[190]:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 4
7 2 1
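A quick look at how the grouper s is built, on a toy slice of the ID 1 locations:
import pandas as pd

loc = pd.Series([12.0, 14.0, 15.0, 17.5, 25.0, 30.0, 31.0])
new_group = loc.diff().fillna(0).abs().gt(3)  # True where the gap to the previous row exceeds 3 m
print(new_group.cumsum())                     # 0 0 0 0 1 2 2 -> three distance groups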
I have a dataframe which looks like this (example):
index ID time value
0 1 2h 10
1 1 2.15h 15
2 1 2.30h 5
3 1 2.45h 24
4 2 2.15h 6
5 2 2.30h 12
6 2 2.45h 18
7 3 2.15h 2
8 3 2.30h 1
I would like to keep only the times at which every ID has a row, i.e. the maximal overlap across IDs.
So:
index ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
I know I can create a df of unique times, merge each ID to it separately, and then keep only the times for which every ID is filled, but this is quite impractical. I have looked but have not found a smarter way. Does someone have an idea how to make this more practical?
Use:
cols = df.groupby(['ID', 'time']).size().unstack().dropna(axis=1).columns
df = df[df['time'].isin(cols)]
print (df)
ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
Details:
First aggregate the DataFrame with groupby and size, then reshape with unstack - NaNs are created for non-overlapping values:
print (df.groupby(['ID', 'time']).size().unstack())
time 2.15h 2.30h 2.45h 2h
ID
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 NaN
3 1.0 1.0 NaN NaN
Remove the columns containing NaN with dropna and get the column names:
print (df.groupby(['ID', 'time']).size().unstack().dropna(axis=1))
time 2.15h 2.30h
ID
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
And last, filter with isin and boolean indexing:
df = df[df['time'].isin(cols)]
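An alternative sketch that gives the same selection by counting distinct IDs per time; the frame is rebuilt from the question's example:
import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'time':  ['2h', '2.15h', '2.30h', '2.45h', '2.15h', '2.30h', '2.45h', '2.15h', '2.30h'],
    'value': [10, 15, 5, 24, 6, 12, 18, 2, 1],
})
# keep only the times that occur for every distinct ID
keep = df.groupby('time')['ID'].transform('nunique').eq(df['ID'].nunique())
print(df[keep])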
I'm trying to create a total column that sums the numbers from one column, grouped by another column. I can do this using .groupby(), but that creates a truncated column, whereas I want a column that is the same length as the original.
My code:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where each 'a' column has the same total as the other.
Returning the sum from a groupby operation in pandas produces a Series with one row per unique group key, not per original row. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
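To see the shape difference between aggregating and transforming, a quick check on the same data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})
print(df.groupby('a')['b'].sum())             # aggregated: 3 values, one per unique 'a'
print(df.groupby('a')['b'].transform('sum'))  # like-indexed: 6 values, one per original row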