Taking the mean value of the last N days, including NaNs - python

I have this data frame:
ID Date X 123_Var 456_Var 789_Var
A 16-07-19 3 777.0 250.0 810.0
A 17-07-19 9 637.0 121.0 529.0
A 20-07-19 2 295.0 272.0 490.0
A 21-07-19 3 778.0 600.0 544.0
A 22-07-19 6 741.0 792.0 907.0
A 25-07-19 6 435.0 416.0 820.0
A 26-07-19 8 590.0 455.0 342.0
A 27-07-19 6 763.0 476.0 753.0
A 02-08-19 6 717.0 211.0 454.0
A 03-08-19 6 152.0 442.0 475.0
A 05-08-19 6 564.0 340.0 302.0
A 07-08-19 6 105.0 929.0 633.0
A 08-08-19 6 948.0 366.0 586.0
B 07-08-19 4 509.0 690.0 406.0
B 08-08-19 2 413.0 725.0 414.0
B 12-08-19 2 170.0 702.0 912.0
B 13-08-19 3 851.0 616.0 477.0
B 14-08-19 9 475.0 447.0 555.0
B 15-08-19 1 412.0 403.0 708.0
B 17-08-19 2 299.0 537.0 321.0
B 18-08-19 4 310.0 119.0 125.0
C 16-07-19 3 777.0 250.0 810.0
C 17-07-19 9 637.0 121.0 529.0
C 20-07-19 2 NaN NaN NaN
C 21-07-19 3 NaN NaN NaN
C 22-07-19 6 741.0 792.0 907.0
C 25-07-19 6 NaN NaN NaN
C 26-07-19 8 590.0 455.0 342.0
C 27-07-19 6 763.0 476.0 753.0
C 02-08-19 6 717.0 211.0 454.0
C 03-08-19 6 NaN NaN NaN
C 05-08-19 6 564.0 340.0 302.0
C 07-08-19 6 NaN NaN NaN
C 08-08-19 6 948.0 366.0 586.0
I want to show the mean value of the last n days (based on the Date column), excluding the value of the current day.
I'm using this code (what should I do to fix it?):
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
n = 4
cols = df.filter(regex='Var').columns
df = df.set_index('Date')
df_ = df.set_index('ID', append=True).swaplevel(1,0)
df1 = df.groupby('ID').rolling(window=f'{n+1}D')[cols].count()
df2 = df.groupby('ID').rolling(window=f'{n+1}D')[cols].mean()
df3 = (df1.mul(df2)
          .sub(df_[cols])
          .div(df1[cols].sub(1))
          .add_suffix(f'_{n}'))
df4 = df_.join(df3)
Expected result:
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
A 16-07-19 3 777.0 250.0 810.0 NaN NaN NaN
A 17-07-19 9 637.0 121.0 529.0 777.000000 250.000000 810.0
A 20-07-19 2 295.0 272.0 490.0 707.000000 185.500000 669.5
A 21-07-19 3 778.0 600.0 544.0 466.000000 196.500000 509.5
A 22-07-19 6 741.0 792.0 907.0 536.500000 436.000000 517.0
A 25-07-19 6 435.0 416.0 820.0 759.500000 696.000000 725.5
A 26-07-19 8 590.0 455.0 342.0 588.000000 604.000000 863.5
A 27-07-19 6 763.0 476.0 753.0 512.500000 435.500000 581.0
A 02-08-19 6 717.0 211.0 454.0 NaN NaN NaN
A 03-08-19 6 152.0 442.0 475.0 717.000000 211.000000 454.0
A 05-08-19 6 564.0 340.0 302.0 434.500000 326.500000 464.5
A 07-08-19 6 105.0 929.0 633.0 358.000000 391.000000 388.5
A 08-08-19 6 948.0 366.0 586.0 334.500000 634.500000 467.5
B 07-08-19 4 509.0 690.0 406.0 NaN NaN NaN
B 08-08-19 2 413.0 725.0 414.0 509.000000 690.000000 406.0
B 12-08-19 2 170.0 702.0 912.0 413.000000 725.000000 414.0
B 13-08-19 3 851.0 616.0 477.0 291.500000 713.500000 663.0
B 14-08-19 9 475.0 447.0 555.0 510.500000 659.000000 694.5
B 15-08-19 1 412.0 403.0 708.0 498.666667 588.333333 648.0
B 17-08-19 2 299.0 537.0 321.0 579.333333 488.666667 580.0
B 18-08-19 4 310.0 119.0 125.0 395.333333 462.333333 528.0
C 16-07-19 3 777.0 250.0 810.0 NaN NaN NaN
C 17-07-19 9 637.0 121.0 529.0 777.000000 250.000000 810.0
C 20-07-19 2 NaN NaN NaN 707.000000 185.500000 669.5
C 21-07-19 3 NaN NaN NaN 637.000000 121.000000 529.0
C 22-07-19 6 741.0 792.0 907.0 NaN NaN NaN
C 25-07-19 6 NaN NaN NaN 741.000000 792.000000 907.0
C 26-07-19 8 590.0 455.0 342.0 741.000000 792.000000 907.0
C 27-07-19 6 763.0 476.0 753.0 590.000000 455.000000 342.0
C 02-08-19 6 717.0 211.0 454.0 NaN NaN NaN
C 03-08-19 6 NaN NaN NaN 717.000000 211.000000 454.0
C 05-08-19 6 564.0 340.0 302.0 717.000000 211.000000 454.0
C 07-08-19 6 NaN NaN NaN 564.000000 340.000000 302.0
C 08-08-19 6 948.0 366.0 586.0 564.000000 340.000000 302.0
The exact digits after the decimal point don't matter.
These threads might help:
Taking the mean value of N last days
Taking the min value of N last days

Try:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = (df.groupby('ID')[['Date', '123_Var', '456_Var', '789_Var']]
         .rolling('4D', on='Date', closed='left')
         .mean())
dfx = (df.set_index(['ID', 'Date'])
         .join(df1.reset_index().set_index(['ID', 'Date']), rsuffix='_4')
         .reset_index()
         .drop('level_1', axis=1))
print(dfx.to_string())
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
0 A 2019-07-16 3 777.0 250.0 810.0 NaN NaN NaN
1 A 2019-07-17 9 637.0 121.0 529.0 777.000000 250.000000 810.0
2 A 2019-07-20 2 295.0 272.0 490.0 707.000000 185.500000 669.5
3 A 2019-07-21 3 778.0 600.0 544.0 466.000000 196.500000 509.5
4 A 2019-07-22 6 741.0 792.0 907.0 536.500000 436.000000 517.0
5 A 2019-07-25 6 435.0 416.0 820.0 759.500000 696.000000 725.5
6 A 2019-07-26 8 590.0 455.0 342.0 588.000000 604.000000 863.5
7 A 2019-07-27 6 763.0 476.0 753.0 512.500000 435.500000 581.0
8 A 2019-08-02 6 717.0 211.0 454.0 NaN NaN NaN
9 A 2019-08-03 6 152.0 442.0 475.0 717.000000 211.000000 454.0
10 A 2019-08-05 6 564.0 340.0 302.0 434.500000 326.500000 464.5
11 A 2019-08-07 6 105.0 929.0 633.0 358.000000 391.000000 388.5
12 A 2019-08-08 6 948.0 366.0 586.0 334.500000 634.500000 467.5
13 B 2019-08-07 4 509.0 690.0 406.0 NaN NaN NaN
14 B 2019-08-08 2 413.0 725.0 414.0 509.000000 690.000000 406.0
15 B 2019-08-12 2 170.0 702.0 912.0 413.000000 725.000000 414.0
16 B 2019-08-13 3 851.0 616.0 477.0 170.000000 702.000000 912.0
17 B 2019-08-14 9 475.0 447.0 555.0 510.500000 659.000000 694.5
18 B 2019-08-15 1 412.0 403.0 708.0 498.666667 588.333333 648.0
19 B 2019-08-17 2 299.0 537.0 321.0 579.333333 488.666667 580.0
20 B 2019-08-18 4 310.0 119.0 125.0 395.333333 462.333333 528.0
21 C 2019-07-16 3 777.0 250.0 810.0 NaN NaN NaN
22 C 2019-07-17 9 637.0 121.0 529.0 777.000000 250.000000 810.0
23 C 2019-07-20 2 NaN NaN NaN 707.000000 185.500000 669.5
24 C 2019-07-21 3 NaN NaN NaN 637.000000 121.000000 529.0
25 C 2019-07-22 6 741.0 792.0 907.0 NaN NaN NaN
26 C 2019-07-25 6 NaN NaN NaN 741.000000 792.000000 907.0
27 C 2019-07-26 8 590.0 455.0 342.0 741.000000 792.000000 907.0
28 C 2019-07-27 6 763.0 476.0 753.0 590.000000 455.000000 342.0
29 C 2019-08-02 6 717.0 211.0 454.0 NaN NaN NaN
30 C 2019-08-03 6 NaN NaN NaN 717.000000 211.000000 454.0
31 C 2019-08-05 6 564.0 340.0 302.0 717.000000 211.000000 454.0
32 C 2019-08-07 6 NaN NaN NaN 564.000000 340.000000 302.0
33 C 2019-08-08 6 948.0 366.0 586.0 564.000000 340.000000 302.0
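The key here is closed='left' on the time-based window: each 4-day window covers [t - 4D, t), so the current row itself is never included, and mean() skips NaNs by default. The level_1 column dropped at the end is just the original row index, which groupby().rolling() adds as an extra index level. A minimal sketch on hypothetical toy data showing the exclusion:

import pandas as pd

toy = pd.DataFrame({
    'Date': pd.to_datetime(['2019-07-16', '2019-07-17', '2019-07-20']),
    'val': [10.0, 20.0, 30.0],
})
# closed='left' makes each 4-day window [t - 4D, t), excluding row t itself.
print(toy.rolling('4D', on='Date', closed='left')['val'].mean())
# 0     NaN   <- no earlier row falls in the window
# 1    10.0   <- only 2019-07-16 lies in [07-13, 07-17)
# 2    15.0   <- 07-16 and 07-17 both lie in [07-16, 07-20)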

Related

Subtracting a fixed date from a whole pandas data frame - python

I have this data:
customer_id purchase_amount date_of_purchase
0 760 25.0 06-11-2009
1 860 50.0 09-28-2012
2 1200 100.0 10-25-2005
3 1420 50.0 09-07-2009
4 1940 70.0 01-25-2013
5 1960 40.0 10-29-2013
6 2620 30.0 09-03-2006
7 3050 50.0 12-04-2007
8 3120 150.0 08-11-2006
9 3260 45.0 10-20-2010
10 3510 35.0 04-05-2013
11 3970 30.0 07-06-2007
12 4000 20.0 11-25-2005
13 4180 20.0 09-22-2010
14 4390 30.0 04-15-2011
15 4750 60.0 02-12-2013
16 4840 30.0 10-14-2005
17 4910 15.0 12-13-2006
18 4950 50.0 05-19-2010
19 4970 30.0 01-12-2006
20 5250 50.0 12-20-2005
Now I want to subtract each row of date_of_purchase from 01-01-2016.
I tried the following, expecting a new column days_since with the number of days:
NOW = pd.to_datetime('01/01/2016').strftime('%m-%d-%Y')
gb = customer_purchases_df.groupby('customer_id')
df2 = gb.agg({'date_of_purchase': lambda x: (NOW - x.max()).days})
Any suggestion on how I can achieve this?
Thanks in advance.
pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days
0 2395
1 1190
2 3720
3 2307
4 1071
5 794
6 3407
7 2950
8 3430
9 1899
10 1001
11 3101
12 3689
13 1927
14 1722
15 1053
16 3731
17 3306
18 2053
19 3641
20 3664
Name: date_of_purchase, dtype: int64
I'm assuming the 'date_of_purchase' column already has the datetime dtype.
>>> df
customer_id purchase_amount date_of_purchase
0 760 25.0 2009-06-11
1 860 50.0 2012-09-28
2 1200 100.0 2005-10-25
>>> df['days_since'] = df['date_of_purchase'].sub(pd.to_datetime('01/01/2016')).dt.days.abs()
>>> df
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 2009-06-11 2395
1 860 50.0 2012-09-28 1190
2 1200 100.0 2005-10-25 3720
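For completeness, the original attempt failed because strftime('%m-%d-%Y') turns NOW back into a plain string, and a string minus a Timestamp raises a TypeError. A sketch of the fixed groupby version, assuming df is the frame above and the goal is days since each customer's most recent purchase:

import pandas as pd

NOW = pd.to_datetime('2016-01-01')  # keep this a Timestamp; don't call strftime on it
df['date_of_purchase'] = pd.to_datetime(df['date_of_purchase'])
days_since = (df.groupby('customer_id')['date_of_purchase']
                .agg(lambda x: (NOW - x.max()).days))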

Add to values of a DataFrame using coordinates

I have a dataframe a:
Out[68]:
p0_4 p5_7 p8_9 p10_14 p15 p16_17 p18_19 p20_24 p25_29 \
0 1360.0 921.0 676.0 1839.0 336.0 668.0 622.0 1190.0 1399.0
1 308.0 197.0 187.0 411.0 67.0 153.0 172.0 336.0 385.0
2 76.0 59.0 40.0 72.0 16.0 36.0 20.0 56.0 82.0
3 765.0 608.0 409.0 1077.0 220.0 359.0 342.0 873.0 911.0
4 1304.0 906.0 660.0 1921.0 375.0 725.0 645.0 1362.0 1474.0
5 195.0 135.0 78.0 262.0 44.0 97.0 100.0 265.0 229.0
6 1036.0 965.0 701.0 1802.0 335.0 701.0 662.0 1321.0 1102.0
7 5072.0 3798.0 2865.0 7334.0 1399.0 2732.0 2603.0 4976.0 4575.0
8 1360.0 962.0 722.0 1758.0 357.0 710.0 713.0 1761.0 1660.0
9 743.0 508.0 369.0 1118.0 286.0 615.0 429.0 738.0 885.0
10 1459.0 1015.0 679.0 1732.0 337.0 746.0 677.0 1493.0 1546.0
11 828.0 519.0 415.0 1057.0 190.0 439.0 379.0 788.0 1024.0
12 1042.0 690.0 503.0 1204.0 219.0 451.0 465.0 1193.0 1406.0
p30_44 p45_59 p60_64 p65_74 p75_84 p85_89 p90plus
0 4776.0 8315.0 2736.0 5463.0 2819.0 738.0 451.0
1 1004.0 2456.0 988.0 2007.0 1139.0 313.0 153.0
2 291.0 529.0 187.0 332.0 108.0 31.0 10.0
3 2807.0 5505.0 2060.0 4104.0 2129.0 516.0 252.0
4 4524.0 9406.0 3034.0 6003.0 3366.0 840.0 471.0
5 806.0 1490.0 606.0 1288.0 664.0 185.0 108.0
6 4127.0 8311.0 2911.0 6111.0 3525.0 1029.0 707.0
7 16917.0 27547.0 8145.0 15950.0 9510.0 2696.0 1714.0
8 5692.0 9380.0 3288.0 6458.0 3830.0 1050.0 577.0
9 2749.0 5696.0 2014.0 4165.0 2352.0 603.0 288.0
10 4676.0 7654.0 2502.0 5077.0 3004.0 754.0 461.0
11 2799.0 4880.0 1875.0 3951.0 2294.0 551.0 361.0
12 3288.0 5661.0 1974.0 4007.0 2343.0 623.0 303.0
and a series d:
Out[70]:
2 p45_59
10 p45_59
11 p45_59
Is there a simple way to add 1 to the values in a whose index and column labels appear in d?
I have tried:
a[d] +=1
However this adds 1 to every value in the column, not just the values with indices 2, 10 and 11.
Thanking you in advance.
You might want to try this.
a.loc[list(d.index), list(d.values)] += 1
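One caveat: .loc with two lists selects the full cross product of the row and column labels, which works here only because every entry of d names the same column. If d mixed different columns, a per-pair loop is safer; a small sketch:

# Increment exactly one cell per (index, column) pair in d.
for idx, col in d.items():
    a.loc[idx, col] += 1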

Rename specific columns with numbers as headers to str+number

I originally have r csv files.
I created one dataframe with 9 columns, r of which have numbers as headers.
I would like to target only those columns and rename them to 'Apple' followed by a running number, i.e. Apple1, Apple2, ..., up to len(files).
Example:
I have 3 csv files.
The current 3 targeted columns in my dataframe are:
0 1 2
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
I would like:
Apple1 Apple2 Apple3
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
Thank you
IIUC, you can initialise an itertools.count object and reset the columns in a list comprehension.
from itertools import count
cnt = count(1)
df.columns = ['Apple{}'.format(next(cnt)) if str(x).isdigit() else x
              for x in df.columns]
This also works well when the digit headers are not contiguous but you want the new names numbered contiguously:
print(df)
1 Col1 5 Col2 500
0 1240.0 552.0 1238.0 52.0 1370.0
1 633.0 435.0 177.0 2201.0 185.0
2 1518.0 936.0 385.0 288.0 427.0
3 212.0 660.0 320.0 438.0 1403.0
4 15.0 556.0 501.0 1259.0 1298.0
5 177.0 718.0 1420.0 833.0 984.0
cnt = count(1)
df.columns = ['Apple{}'.format(next(cnt)) if str(x).isdigit() else x
              for x in df.columns]
print(df)
Apple1 Col1 Apple2 Col2 Apple3
0 1240.0 552.0 1238.0 52.0 1370.0
1 633.0 435.0 177.0 2201.0 185.0
2 1518.0 936.0 385.0 288.0 427.0
3 212.0 660.0 320.0 438.0 1403.0
4 15.0 556.0 501.0 1259.0 1298.0
5 177.0 718.0 1420.0 833.0 984.0
You can use rename with a callable (rename_axis no longer accepts a label mapper in recent pandas):
df.rename(columns=lambda x: 'Apple{}'.format(int(x)+1) if str(x).isdigit() else x)
Out[9]:
Apple1 Apple2 Apple3
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
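If you prefer an explicit mapping, rename also accepts a dict; a sketch that gives the same contiguous numbering as the count-based answer:

from itertools import count

cnt = count(1)
# Build a label -> new-name mapping only for the digit-named columns.
mapping = {c: 'Apple{}'.format(next(cnt)) for c in df.columns if str(c).isdigit()}
df = df.rename(columns=mapping)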

Delete the integers in a time index

This is part of a dataframe. As you can see, there are some integers in the time index that are not valid timestamps, so I want to delete them. How can I delete the records that have an integer as their time index?
rent_time rent_price_per_square_meter
0 2016-11-28 09:01:58 0.400000
1 2016-11-28 09:02:35 0.400000
2 2016-11-28 09:02:43 0.400000
3 2016-11-28 09:03:21 0.400000
4 2016-11-28 09:03:21 0.400000
5 2016-11-28 09:03:34 0.400000
6 2016-11-28 09:03:34 0.400000
7 2017-06-17 02:49:33 0.933333
8 2017-03-19 01:30:03 0.490196
9 2017-03-10 06:39:03 11.111111
10 2017-03-09 14:40:03 16.666667
11 908797 11.000000
12 2017-06-08 03:27:52 22.000000
13 2017-06-30 03:03:11 22.000000
14 2017-02-20 11:04:48 12.000000
15 2017-03-05 13:53:39 6.842105
16 2017-03-06 14:00:01 6.842105
17 2017-03-15 02:38:54 20.000000
18 2017-03-15 02:19:07 13.043478
19 2017-02-24 15:10:00 25.000000
20 2017-06-26 02:17:31 13.043478
21 82368 11.111111
22 2017-06-30 07:53:55 4.109589
23 2017-07-17 02:42:43 20.000000
24 2017-06-30 07:38:00 5.254237
25 2017-06-30 07:49:00 4.920635
26 2017-06-30 05:26:26 4.189189
You can use boolean indexing with to_datetime and the parameter errors='coerce', which returns NaT for values that are not datetimes, and then add notnull to keep only the rows that parsed as datetimes:
df1 = df[pd.to_datetime(df['rent_time'], errors='coerce').notnull()]
print (df1)
rent_time rent_price_per_square_meter
0 2016-11-28 09:01:58 0.400000
1 2016-11-28 09:02:35 0.400000
2 2016-11-28 09:02:43 0.400000
3 2016-11-28 09:03:21 0.400000
4 2016-11-28 09:03:21 0.400000
5 2016-11-28 09:03:34 0.400000
6 2016-11-28 09:03:34 0.400000
7 2017-06-17 02:49:33 0.933333
8 2017-03-19 01:30:03 0.490196
9 2017-03-10 06:39:03 11.111111
10 2017-03-09 14:40:03 16.666667
12 2017-06-08 03:27:52 22.000000
13 2017-06-30 03:03:11 22.000000
14 2017-02-20 11:04:48 12.000000
15 2017-03-05 13:53:39 6.842105
16 2017-03-06 14:00:01 6.842105
17 2017-03-15 02:38:54 20.000000
18 2017-03-15 02:19:07 13.043478
19 2017-02-24 15:10:00 25.000000
20 2017-06-26 02:17:31 13.043478
22 2017-06-30 07:53:55 4.109589
23 2017-07-17 02:42:43 20.000000
24 2017-06-30 07:38:00 5.254237
25 2017-06-30 07:49:00 4.920635
26 2017-06-30 05:26:26 4.189189
EDIT:
For further data processing, if you need a DatetimeIndex:
df['rent_time'] = pd.to_datetime(df['rent_time'], errors='coerce')
df = df.dropna(subset=['rent_time']).set_index('rent_time')
print (df)
rent_price_per_square_meter
rent_time
2016-11-28 09:01:58 0.400000
2016-11-28 09:02:35 0.400000
2016-11-28 09:02:43 0.400000
2016-11-28 09:03:21 0.400000
2016-11-28 09:03:21 0.400000
2016-11-28 09:03:34 0.400000
2016-11-28 09:03:34 0.400000
2017-06-17 02:49:33 0.933333
2017-03-19 01:30:03 0.490196
2017-03-10 06:39:03 11.111111
2017-03-09 14:40:03 16.666667
2017-06-08 03:27:52 22.000000
2017-06-30 03:03:11 22.000000
2017-02-20 11:04:48 12.000000
2017-03-05 13:53:39 6.842105
2017-03-06 14:00:01 6.842105
2017-03-15 02:38:54 20.000000
2017-03-15 02:19:07 13.043478
2017-02-24 15:10:00 25.000000
2017-06-26 02:17:31 13.043478
2017-06-30 07:53:55 4.109589
2017-07-17 02:42:43 20.000000
2017-06-30 07:38:00 5.254237
2017-06-30 07:49:00 4.920635
2017-06-30 05:26:26 4.189189
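If you want to inspect the offending rows before dropping them, the same coerced mask can be inverted; a small sketch, starting again from the original frame:

# Rows whose rent_time could not be parsed as a datetime.
bad = df[pd.to_datetime(df['rent_time'], errors='coerce').isna()]
print(bad)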

Python: Create a new column of date from an existing column of date by subtracting consecutive rows [duplicate]

This question already has answers here:
Adding a column thats result of difference in consecutive rows in pandas
(4 answers)
Closed 5 years ago.
Code:
import pandas as pd
df = pd.read_csv('xyz.csv', usecols=['transaction_date', 'amount'])
df=pd.concat(g for _, g in df.groupby("amount") if len(g) > 3)
df=df.reset_index(drop=True)
print(df)
Output:
transaction_date amount
0 2016-06-02 50.0
1 2016-06-02 50.0
2 2016-06-02 50.0
3 2016-06-02 50.0
4 2016-06-02 50.0
5 2016-06-02 50.0
6 2016-07-04 50.0
7 2016-07-04 50.0
8 2016-09-29 225.0
9 2016-10-29 225.0
10 2016-11-29 225.0
11 2016-12-30 225.0
12 2017-01-30 225.0
13 2016-05-16 1000.0
14 2016-05-20 1000.0
I need to add another column next to the amount column that gives the difference between consecutive rows of transaction_date, e.g.:
transaction_date amount delta(days)
0 2016-06-02 50.0 -
1 2016-06-02 50.0 0
2 2016-06-02 50.0 0
3 2016-06-02 50.0 0
4 2016-06-02 50.0 0
5 2016-06-02 50.0 0
6 2016-07-04 50.0 32
7 2016-07-04 50.0 .
8 2016-09-29 225.0 .
9 2016-10-29 225.0 .
10 2016-11-29 225.0
There are probably better methods, but you can use pandas.Series.shift; note that shift(-1) computes the gap to the next row, so each value lands one row earlier than in your expected output:
>>> df.transaction_date.shift(-1) - df.transaction_date
0 0 days
1 0 days
2 0 days
3 0 days
4 0 days
5 32 days
6 0 days
7 87 days
8 30 days
9 31 days
10 31 days
11 31 days
12 -259 days
13 4 days
14 NaT
I think you need diff + dt.days:
df['delta(days)'] = df['transaction_date'].diff().dt.days
print (df)
transaction_date amount delta(days)
0 2016-06-02 50.0 NaN
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 87.0
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
13 2016-05-16 1000.0 -259.0
14 2016-05-20 1000.0 4.0
But if you need to compute it per group, add groupby:
df['delta(days)'] = df.groupby('amount')['transaction_date'].diff().dt.days
print (df)
transaction_date amount delta(days)
0 2016-06-02 50.0 NaN
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 NaN
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
13 2016-05-16 1000.0 NaN
14 2016-05-20 1000.0 4.0
To get the exact output you've requested (sorting is optional), use shift to get the timedelta, then dt.days to get the integer:
df.transaction_date = pd.to_datetime(df.transaction_date)
df.sort_values('transaction_date', inplace=True)
df['delta(days)'] = (df['transaction_date'] - df['transaction_date'].shift(1)).dt.days
Output:
transaction_date amount delta(days)
13 2016-05-16 1000.0 NaN
14 2016-05-20 1000.0 4.0
0 2016-06-02 50.0 13.0
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 87.0
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
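As a sanity check, diff() is exactly the shift-based subtraction, so the two approaches compute the same timedeltas; a one-line sketch:

# Series.diff() is shorthand for s - s.shift(1); NaT lands in the same rows.
assert df['transaction_date'].diff().equals(
    df['transaction_date'] - df['transaction_date'].shift(1))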
