My end goal is to sum, per id, only the minutes from the row where periods == 'initial' through the row where periods == 'final'.
I have thousands of ids, and they don't all have the same number of rows between initial and final.
Periods are sorted in a "journey" fashion: each record represents a period of time for its id.
Pseudocode:
For each id, iterate the rows and sum the values in column "min",
but only for the rows starting at periods == 'initial' and ending at periods == 'final'.
Example with 2 ids:
id  periods     min
1   period_x     10
1   initial       2
1   progress      3
1   progress_1    4
1   final         5
2   period_y     10
2   period_z      2
2   initial       3
2   progress_1   20
2   final         3
Desired output:
id  periods     min  sum
1   period_x     10   14
1   initial       2   14
1   progress      3   14
1   progress_1    4   14
1   final         5   14
2   period_y     10   26
2   period_z      2   26
2   initial       3   26
2   progress_1   20   26
2   final         3   26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df['min'].where(df['periods'].isin(L)).groupby(df['id']).transform('sum')
But this only sums the initial and final rows themselves; it doesn't count what is in between initial and final.
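For reference, the example data above can be rebuilt like this (values taken straight from the table):
import pandas as pd

df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'periods': ['period_x', 'initial', 'progress', 'progress_1', 'final',
                'period_y', 'period_z', 'initial', 'progress_1', 'final'],
    'min':     [10, 2, 3, 4, 5, 10, 2, 3, 20, 3],
})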
Create groups using cumsum, take the sum of group 1, and then broadcast that sum to the entire column. "Group 1" is everything per id that lies between initial and final:
import numpy as np

# Mark the boundary rows, then cumsum per id so rows from 'initial' onwards get grp >= 1;
# the 'final' row itself would land in grp 2, so force it back to 1.
df['grp'] = df['periods'].isin(['initial', 'final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())

# Sum 'min' only within group 1, then spread that sum across all rows of the id.
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform('max')
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
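An alternative sketch that skips the helper grp column: build an "inside the journey" mask per id and sum over it (this assumes exactly one initial/final pair per id):
# 1 while we are between 'initial' and 'final' inclusive, 0 otherwise
inside = df.groupby('id')['periods'].transform(
    lambda s: s.eq('initial').cumsum() - s.eq('final').cumsum().shift(fill_value=0))
# NaN out the rows outside the journey, then broadcast the per-id sum
df['sum'] = df['min'].where(inside.eq(1)).groupby(df['id']).transform('sum')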
With the dataframe below I calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How can I replace that 0 with the difference to the next lower price in the same group?
For example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, nan, and I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd

data = [
    [1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],
    [5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id','neighborhoodname','beds','baths','price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))
I think you can first remove duplicates across all the columns used for the groupby plus price, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname','beds','baths','price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price','difference_price']], how='left')
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function to back-fill the 0 values within each group, which avoids wrong outputs for one-row groups (where the merge could otherwise pull in data from another group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
I have a dataframe that looks like:
value 1 value 2
1 10
4 1
5 8
6 10
10 12
I want to go down each entry of value 1, average it with the previous entry, and then create a new column beside value 2 holding that average.
The output needs to look like:
value 1 value 2 avg
1 10 nan
4 1 2.5
5 8 4.5
6 10 5.5
10 12 8.0
How would I go about doing this?
shift
You can add a series to the shifted version of itself and divide by two:
df['avg'] = (df['value1'] + df['value1'].shift()) / 2
print(df)
value1 value2 avg
0 1 10 NaN
1 4 1 2.5
2 5 8 4.5
3 6 10 5.5
4 10 12 8.0
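A rolling window gives the same result; a minimal equivalent sketch:
# mean of each value and its predecessor (window of 2; first row has no predecessor, so NaN)
df['avg'] = df['value1'].rolling(2).mean()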
I have a dataframe that looks like the following:
ID1 ID2 Date
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018
3 4 09/07/2018
etc.
What I need to do is to flag the first time that an ID in ID1 appears in ID2. In the above example this would look like
ID1 ID2 Date Flag
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018 Y
3 4 09/07/2018
I've used the following code to tell me whether ID1 ever occurs in ID2:
ID2List = df['ID2'].tolist()
ID2List = list(set(ID2List))  # dedupe list
df['ID1 is in ID2List'] = np.where(df['ID1'].isin(ID2List), 'Yes', 'No')
But this only tells me that ID1 appears somewhere in ID2 at some point, not the event at which this first occurs.
Any help?
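For reference, the example frame can be rebuilt like this (values from the question):
import pandas as pd

df = pd.DataFrame({
    'ID1':  [1, 1, 1, 2, 1, 3],
    'ID2':  [2, 2, 2, 1, 2, 4],
    'Date': ['01/01/2018', '03/01/2018', '04/05/2018',
             '06/06/2018', '08/06/2018', '09/07/2018'],
})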
One idea is to use next with a generator expression to calculate, for each row, the index of the first match of its ID2 value in ID1. Then compare with the index and use argmax to get the position of the first True value:
# for each row: position of the first occurrence of row['ID2'] in ID1 (0 if no match)
idx = df.apply(lambda row: next((i for i, val in enumerate(df['ID1'])
                                 if row['ID2'] == val), 0), axis=1)
# flag the first row whose index is past that first-match position
df.loc[(df.index > idx).argmax(), 'Flag'] = 'Y'
print(df)
print(df)
ID1 ID2 Date Flag
0 1 2 01/01/2018 NaN
1 1 2 03/01/2018 NaN
2 1 2 04/05/2018 NaN
3 2 1 06/06/2018 Y
4 1 2 08/06/2018 NaN
5 3 4 09/07/2018 NaN
I have a dataframe like this:
ID day purchase
ID1 1 10
ID1 2 15
ID1 4 13
ID2 2 11
ID2 4 11
ID2 5 24
ID2 6 10
Desired output:
ID day purchase Txn
ID1 1 10 1
ID1 2 15 2
ID1 4 13 3
ID2 2 11 1
ID2 4 11 2
ID2 5 24 3
ID2 6 10 4
So for each ID, I want to create a counter to keep track of its transactions. In SAS, I would do something like: if First.ID then Txn=1; else Txn+1.
How do I do something like this in Python?
I got the idea of sorting by ID and day, but how do I create the customized counter?
Here is one solution. As you suggest, it involves sorting by ID and day (in case your original dataframe isn't sorted), then grouping by ID and creating a counter for each ID:
# Make sure your dataframe is sorted properly (first by ID, then by day)
df = df.sort_values(['ID', 'day'])
# group by ID
by_id = df.groupby('ID')
# Make a custom counter using the default index of dataframes (adding 1)
df['txn'] = by_id.apply(lambda x: x.reset_index()).index.get_level_values(1)+1
>>> df
ID day purchase txn
0 ID1 1 10 1
1 ID1 2 15 2
2 ID1 4 13 3
3 ID2 2 11 1
4 ID2 4 11 2
5 ID2 5 24 3
6 ID2 6 10 4
If your dataframe started out as not properly sorted, you can get back to the original order like this:
df = df.sort_index()
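For reference, pandas also has a built-in groupby.cumcount that produces the same per-group counter directly; a minimal sketch:
# cumcount numbers rows within each group starting at 0, so add 1;
# assignment aligns on the index, so the original row order is preserved
df['txn'] = df.sort_values(['ID', 'day']).groupby('ID').cumcount() + 1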
The simplest method I could come up with, though definitely not the most efficient:
df['txn'] = [0] * len(df)
prev_ID = None
for index, row in df.iterrows():
    if row['ID'] == prev_ID:
        # same ID as the previous row: continue the count
        df.loc[index, 'txn'] = counter
        counter += 1
    else:
        # new ID: restart the counter at 1
        prev_ID = row['ID']
        df.loc[index, 'txn'] = 1
        counter = 2
This outputs:
ID day purchase txn
0 ID1 1 10 1
1 ID1 2 15 2
2 ID1 4 13 3
3 ID2 2 11 1
4 ID2 4 11 2
5 ID2 5 24 3
6 ID2 6 10 4