I am trying to take the diff from the previous row in a dataframe, grouping by the column "group". There are several similar questions, but I can't get this working.
date group value
0 2020-01-01 A 808
1 2020-01-01 B 331
2 2020-01-02 A 612
3 2020-01-02 B 1391
4 2020-01-03 A 234
5 2020-01-04 A 828
6 2020-01-04 B 820
7 2020-01-05 A 1075
8 2020-01-07 B 572
9 2020-01-10 B 736
10 2020-01-10 A 1436
df.sort_values(['group','date'], inplace=True)
df['diff'] = df['value'].diff()
print(df)
date value group diff
1 2020-01-03 234 A NaN
8 2020-01-01 331 B 97.0
2 2020-01-07 572 B 241.0
9 2020-01-02 612 A 40.0
5 2020-01-10 736 B 124.0
17 2020-01-01 808 A 72.0
14 2020-01-04 820 B 12.0
4 2020-01-04 828 A 8.0
18 2020-01-05 1075 A 247.0
7 2020-01-02 1391 B 316.0
10 2020-01-10 1436 A 45.0
This is the result that I need:
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196
4 2020-01-03 A 234 -378
5 2020-01-04 A 828 594
7 2020-01-05 A 1075 247
10 2020-01-10 A 1436 361
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060
6 2020-01-04 B 820 -571
8 2020-01-07 B 572 -248
9 2020-01-10 B 736 164
Shift the value column within each group to create a helper column, then subtract that column from the original value column to get the difference column.
df.sort_values(['group','date'], ascending=[True,True], inplace=True)
df['shift'] = df.groupby('group')['value'].shift()
df['diff'] = df['value'] - df['shift']
df = df[['date','group','value','diff']]
df
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
6 2020-01-05 A 1075 247.0
10 2020-01-10 A 1436 361.0
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060.0
6 2020-01-04 B 820 -571.0
8 2020-01-07 B 572 -248.0
9 2020-01-10 B 736 164.0
You can use groupby together with diff():
df = df.sort_values('date')
df['diff'] = df.groupby(['group'])['value'].diff()
which gives:
date group value diff
0 2020-01-01 A 808 NaN
1 2020-01-01 B 331 NaN
2 2020-01-02 A 612 -196.0
3 2020-01-02 B 1391 1060.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
6 2020-01-04 B 820 -571.0
7 2020-01-05 A 1075 247.0
8 2020-01-07 B 572 -248.0
10 2020-01-10 A 1436 361.0
9 2020-01-10 B 736 164.0
If you want the dataset ordered as you have it, you can add group to the sort, but it's not necessary for the operation and can be done before or after computing the differences.
df.sort_values(['group','date'])
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
7 2020-01-05 A 1075 247.0
10 2020-01-10 A 1436 361.0
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060.0
6 2020-01-04 B 820 -571.0
8 2020-01-07 B 572 -248.0
9 2020-01-10 B 736 164.0
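As a self-contained check, rebuilding the sample data from the question and computing the grouped diff (the trailing group sort only changes row order, not the computed values):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02',
                            '2020-01-03', '2020-01-04', '2020-01-04', '2020-01-05',
                            '2020-01-07', '2020-01-10', '2020-01-10']),
    'group': ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
    'value': [808, 331, 612, 1391, 234, 828, 820, 1075, 572, 736, 1436],
})

# Sorting by date first is what matters; diff is taken within each group
df = df.sort_values('date')
df['diff'] = df.groupby('group')['value'].diff()

print(df.sort_values(['group', 'date']))
```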
I've generated this dataframe:
np.random.seed(123)
len_df = 10
groups_list = ['A','B']
dates_list = pd.date_range(start='1/1/2020', periods=10, freq='D').to_list()
df2 = pd.DataFrame()
df2['date'] = np.random.choice(dates_list, size=len_df)
df2['value'] = np.random.randint(232, 1532, size=len_df)
df2['group'] = np.random.choice(groups_list, size=len_df)
df2 = df2.sort_values(by=['date'])
df2.reset_index(drop=True, inplace=True)
date group value
0 2020-01-01 A 652
1 2020-01-02 B 1174
2 2020-01-02 B 1509
3 2020-01-02 A 840
4 2020-01-03 A 870
5 2020-01-03 A 279
6 2020-01-04 B 456
7 2020-01-07 B 305
8 2020-01-07 A 1078
9 2020-01-10 A 343
I need to get rid of duplicate groups on the same date: each group should appear at most once per date.
Result
date group value
0 2020-01-01 A 652
1 2020-01-02 B 1174
2 2020-01-02 A 840
3 2020-01-03 A 870
4 2020-01-04 B 456
5 2020-01-07 B 305
6 2020-01-07 A 1078
7 2020-01-10 A 343
.drop_duplicates() is a pandas DataFrame method that allows you to do exactly that. Read more in the documentation.
df2.drop_duplicates(subset=["date", "group"], keep="first")
Out[9]:
date group value
0 2020-01-01 A 652
1 2020-01-02 B 1174
3 2020-01-02 A 840
4 2020-01-03 A 870
6 2020-01-04 B 456
7 2020-01-07 B 305
8 2020-01-07 A 1078
9 2020-01-10 A 343
You can use drop_duplicates() to drop based on a subset of columns. You can also specify which row to keep, e.g. the first or last row (keep='first' is the default).
df2 = df2.drop_duplicates(subset=['date', 'group'], keep='first')
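A small sketch of how keep changes the result, using a hypothetical three-row slice of the data above (keep='first' retains the earlier of each duplicate pair, keep='last' the later):

```python
import pandas as pd

# Two rows share the same (date, group) pair
df = pd.DataFrame({
    'date': ['2020-01-02', '2020-01-02', '2020-01-03'],
    'group': ['B', 'B', 'A'],
    'value': [1174, 1509, 870],
})

first = df.drop_duplicates(subset=['date', 'group'], keep='first')
last = df.drop_duplicates(subset=['date', 'group'], keep='last')

print(first['value'].tolist())  # [1174, 870]
print(last['value'].tolist())   # [1509, 870]
```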
You are looking for the drop_duplicates method on a DataFrame.
df2 = df2.drop_duplicates(subset=['date', 'group'], keep='first').reset_index(drop=True)
date value group
0 2020-01-01 652 A
1 2020-01-02 1174 B
2 2020-01-02 840 A
3 2020-01-03 870 A
4 2020-01-04 456 B
5 2020-01-07 305 B
6 2020-01-07 1078 A
7 2020-01-10 343 A
I have data
customer_id purchase_amount date_of_purchase
0 760 25.0 06-11-2009
1 860 50.0 09-28-2012
2 1200 100.0 10-25-2005
3 1420 50.0 09-07-2009
4 1940 70.0 01-25-2013
5 1960 40.0 10-29-2013
6 2620 30.0 09-03-2006
7 3050 50.0 12-04-2007
8 3120 150.0 08-11-2006
9 3260 45.0 10-20-2010
10 3510 35.0 04-05-2013
11 3970 30.0 07-06-2007
12 4000 20.0 11-25-2005
13 4180 20.0 09-22-2010
14 4390 30.0 04-15-2011
15 4750 60.0 02-12-2013
16 4840 30.0 10-14-2005
17 4910 15.0 12-13-2006
18 4950 50.0 05-19-2010
19 4970 30.0 01-12-2006
20 5250 50.0 12-20-2005
Now I want to subtract each row of date_of_purchase from 01-01-2016.
I tried the following, expecting a new column days_since containing the number of days:
NOW = pd.to_datetime('01/01/2016').strftime('%m-%d-%Y')
gb = customer_purchases_df.groupby('customer_id')
df2 = gb.agg({'date_of_purchase': lambda x: (NOW - x.max()).days})
Any suggestion on how I can achieve this?
Thanks in advance.
pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days
0 2395
1 1190
2 3720
3 2307
4 1071
5 794
6 3407
7 2950
8 3430
9 1899
10 1001
11 3101
12 3689
13 1927
14 1722
15 1053
16 3731
17 3306
18 2053
19 3641
20 3664
Name: date_of_purchase, dtype: int64
I'm assuming the 'date_of_purchase' column already has the datetime dtype.
>>> df
customer_id purchase_amount date_of_purchase
0 760 25.0 2009-06-11
1 860 50.0 2012-09-28
2 1200 100.0 2005-10-25
>>> df['days_since'] = df['date_of_purchase'].sub(pd.to_datetime('01/01/2016')).dt.days.abs()
>>> df
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 2009-06-11 2395
1 860 50.0 2012-09-28 1190
2 1200 100.0 2005-10-25 3720
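The same idea as a minimal runnable example, using the first three rows of the sample data (plain subtraction in this direction avoids the need for .abs()):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [760, 860, 1200],
    'date_of_purchase': pd.to_datetime(['2009-06-11', '2012-09-28', '2005-10-25']),
})

# (reference date - purchase date) as a whole number of days
df['days_since'] = (pd.Timestamp('2016-01-01') - df['date_of_purchase']).dt.days
print(df['days_since'].tolist())  # [2395, 1190, 3720]
```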
I'd like to expand my dataframe by adding a time interval for every hour during a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a helper DataFrame containing all hours in the month, created by date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
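On pandas 1.2 and later, the dummy 'a' key can be avoided, since merge supports how='cross' directly:

```python
import pandas as pd

df = pd.DataFrame({'money': [1, 4, 5], 'food': [2, 5, 7]})
hours = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})

# Every original row paired with every hour of January 2020
out = df.merge(hours, how='cross')
print(len(out))  # 2232 = 3 original rows * 31 days * 24 hours
```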
Using pandas, what is the easiest way to calculate a rolling sum over the previous n elements, for instance to calculate trailing three days of sales:
df = pandas.Series(numpy.random.randint(0,10,10), index=pandas.date_range('2020-01', periods=10))
df
2020-01-01 8
2020-01-02 4
2020-01-03 1
2020-01-04 0
2020-01-05 5
2020-01-06 8
2020-01-07 3
2020-01-08 8
2020-01-09 9
2020-01-10 0
Freq: D, dtype: int64
Desired output:
2020-01-01 8
2020-01-02 12
2020-01-03 13
2020-01-04 5
2020-01-05 6
2020-01-06 13
2020-01-07 16
2020-01-08 19
2020-01-09 20
2020-01-10 17
Freq: D, dtype: int64
You need rolling.sum:
df.rolling(3, min_periods=1).sum()
Out:
2020-01-01 8.0
2020-01-02 12.0
2020-01-03 13.0
2020-01-04 5.0
2020-01-05 6.0
2020-01-06 13.0
2020-01-07 16.0
2020-01-08 19.0
2020-01-09 20.0
2020-01-10 17.0
dtype: float64
min_periods=1 ensures the first two elements are calculated too; with a window size of 3, they would otherwise be NaN by default.
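A quick illustration of the default behaviour versus min_periods=1, on the first five values of the series above:

```python
import pandas as pd

s = pd.Series([8, 4, 1, 0, 5], index=pd.date_range('2020-01-01', periods=5))

default = s.rolling(3).sum()                # first two windows incomplete -> NaN
padded = s.rolling(3, min_periods=1).sum()  # partial windows are summed anyway

print(default.tolist())  # [nan, nan, 13.0, 5.0, 6.0]
print(padded.tolist())   # [8.0, 12.0, 13.0, 5.0, 6.0]
```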
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format string for the week number. What am I missing here?
You need another parameter to specify the day of the week - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
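The key detail is that %Y-%W alone cannot resolve a full date, so a weekday has to be appended; '-0' with %w anchors each week to its Sunday. A minimal check on two of the values above:

```python
import pandas as pd

yearweek = pd.Series(['2014-48', '2015-02'])

# Append a weekday ('-0' = Sunday) so %Y-%W-%w can resolve a full date
dates = pd.to_datetime(yearweek + '-0', format='%Y-%W-%w')
print(dates.tolist())  # [Timestamp('2014-12-07'), Timestamp('2015-01-18')]
```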