I have a dataframe of orders as below, where the column 'Value' represents cash in/out and the 'Date' column reflects when the transaction occurred.
Each transaction is paired, so that the 'Qty' out is always followed by the 'Qty' in, as reflected by the sign in the 'Qty' column:
Date Qty Price Value
0 2014-11-18 58 495.775716 -2875499
1 2014-11-24 -58 484.280147 2808824
2 2014-11-26 63 474.138699 -2987073
3 2014-12-31 -63 507.931247 3199966
4 2015-01-05 59 495.923771 -2925950
5 2015-02-05 -59 456.224370 2691723
How can I create two columns, 'n_days' and 'price_diff', giving the difference in days between the two dates of each transaction and the net 'Value' of the pair?
I have tried:
df['price_diff'] = df['Value'].rolling(2).apply(lambda x: x[0] + x[1])
but I receive a KeyError for the first observation (0).
Many thanks
Why don't you just use sum:
df['price_diff'] = df['Value'].rolling(2).sum()
Although, judging from the name, it looks like you want:
df['price_diff'] = df['Price'].diff()
And, for the two columns:
df[['Date_diff','Price_diff']] = df[['Date','Price']].diff()
Output:
Date Qty Price Value Date_diff Price_diff
0 2014-11-18 58 495.775716 -2875499 NaT NaN
1 2014-11-24 -58 484.280147 2808824 6 days -11.495569
2 2014-11-26 63 474.138699 -2987073 2 days -10.141448
3 2014-12-31 -63 507.931247 3199966 35 days 33.792548
4 2015-01-05 59 495.923771 -2925950 5 days -12.007476
5 2015-02-05 -59 456.224370 2691723 31 days -39.699401
Update: per the comment, you can try:
df['Val_sum'] = df['Value'].rolling(2).sum()[1::2]
Output:
Date Qty Price Value Val_sum
0 2014-11-18 58 495.775716 -2875499 NaN
1 2014-11-24 -58 484.280147 2808824 -66675.0
2 2014-11-26 63 474.138699 -2987073 NaN
3 2014-12-31 -63 507.931247 3199966 212893.0
4 2015-01-05 59 495.923771 -2925950 NaN
5 2015-02-05 -59 456.224370 2691723 -234227.0
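For the n_days column that was also asked about, the same slicing idea works on the date differences; a minimal sketch, assuming 'Date' is (or can be converted to) a datetime column:
df['Date'] = pd.to_datetime(df['Date'])  # no-op if already datetime
# day gap between the opening and closing row of each pair,
# placed on the closing (odd-indexed) rows like Val_sum above
df['n_days'] = df['Date'].diff().dt.days[1::2]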
Related
I've got a dataframe with two columns: one is a datetime column consisting of dates, and the other consists of quantities. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns: one is Month/Year and the other is Till Highest. I basically want to calculate the highest quantity value seen up to and including that month, grouped by month/year. An example of exactly what I want is:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case, the dataset is vast, and I have readings for almost every day of each month and each year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
# convert date to monthly period (2019-01)
.assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
# period and max quantity per month
.groupby('Date')
.agg(**{'Month/Year': ('Date', 'first'),
'Till highest': ('Quantity', 'max')})
# format periods as Jan/2019 and get cumulated max quantity
.assign(**{
'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
'Till highest': lambda d: d['Till highest'].cummax()
})
# drop the groupby index
.reset_index(drop=True)
)
Output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In R you can use cummax:
df=data.frame(Date=c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11","2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10","2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05","2020-06-16","2020-06-22"),Quantity=c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14))
data.frame(`Month/Year`=unique(format(as.Date(df$Date),"%b/%Y")),
`Till Highest`=cummax(tapply(df$Quantity,sub("-..$","",df$Date),max)),
check.names=F,row.names=NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111
I have two dataframes df1 and df2.
df1:
Id Date Remark
0 28 2010-04-08 xx
1 29 2010-10-10 yy
2 30 2012-12-03 zz
3 31 2010-03-16 aa
df2:
Id Timestamp Value Site
0 28 2010-04-08 13:20:15.120 125.0 93
1 28 2010-04-08 13:20:16.020 120.0 94
2 28 2010-04-08 13:20:18.020 135.0 95
3 28 2010-04-08 13:20:18.360 140.0 96
...
1000 29 2010-06-15 05:04:15.120 16.0 101
1001 29 2010-06-15 05:05:16.320 14.0 101
...
I would like to select all Value data 10 days before/including the Date in df1 from df2 for the same Id. For example, for Id 28, Date is 2010-04-08, so select Value where Timestamp is between 2010-03-30 00:00:00 and 2010-04-08 23:59:59 (inclusive).
Then, I want to resample the Value data using forward fill ffill and backward fill bfill at 1min frequency so that there will be 10 x 24 x 60 = 14400 values exactly for each Id.
Lastly, I'd like to rearrange the dataframe horizontally with transpose.
Expected output looks like this:
Id Date value1 value2 ... value14399 value14400 Remark
0 28 2010-04-08 125.0 125.0 ... ... xx (value1, value2 and the following values before "2010-04-08 13:20:15.120" are 125.0 as a result of backward fill, since the first value for Id 28 is 125.0)
1 29 2010-10-10 16.0 16.0 ... yy
...
I am not sure what the best way to approach this problem is, since I'm adding "another time series dimension" to the dataframe. Any idea is appreciated.
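One possible approach, sketched under a few assumptions (Date/Timestamp parse with pd.to_datetime, several readings inside the same minute can be averaged, and ten_day_window is just a helper name introduced here), is to build the full 10-day one-minute grid per Id, reindex the matching slice of df2 onto it, and fill forward then backward:
import pandas as pd

def ten_day_window(row, df2):
    # 10 calendar days including the Date day: [Date - 9 days 00:00, Date + 1 day 00:00)
    start = row['Date'] - pd.Timedelta(days=9)
    end = row['Date'] + pd.Timedelta(days=1)
    mask = (df2['Id'] == row['Id']) & (df2['Timestamp'] >= start) & (df2['Timestamp'] < end)
    s = df2.loc[mask].set_index('Timestamp')['Value']
    # full 1-minute grid: 10 x 24 x 60 = 14400 points
    grid = pd.date_range(start, periods=14400, freq='1min')
    s = s.resample('1min').mean().reindex(grid).ffill().bfill()
    return pd.Series(s.to_numpy(), index=[f'value{i}' for i in range(1, 14401)])

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
wide = df1.apply(ten_day_window, axis=1, df2=df2)
out = pd.concat([df1[['Id', 'Date']], wide, df1[['Remark']]], axis=1)
This is not a definitive implementation; the row-wise apply is slow on a large df2, but it matches the value1 ... value14400 layout in the expected output.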
I subsetted a big dataframe, slicing out only one column, Start Time, of dtype object.
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by the count (to the best of my knowledge):
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
I'm not sure why this index column appeared, and I have so far failed to rename or convert it.
What would I like to see?
My ideal output: group the time by hour (24h format is OK). It looks like the data counts every 15 min, so basically put each group of 4 consecutive rows together. It's fine for 00:15:00 to count as hour 0 and for 23:00:00 to count as hour 23.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
I would like to create afterwards a simple histogram to show the occurrence by the hour.
Appreciate any help!
IIUC,
import numpy as np
import pandas as pd
# Create dummy input dataframe
test = pd.DataFrame({'time': pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
                     'rides': np.random.randint(15000, 28000, 96)})
Let's create a DatetimeIndex from the strings, resample, aggregate with sum, and convert the DatetimeIndex to hours:
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
.rename_axis('hour').resample('H').sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
Step by step, I found the answer myself.
Using this code, I rebuilt the counts and renamed the columns:
test2 = test.value_counts().sort_index().reset_index()
test2 = test2.rename(columns={'index': 'Time', 'Start Time': 'Rides'})
This gave me the renamed columns.
The remaining question was how to summarize by the hour.
After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
test2
I came closer.
Finally, I grouped by the hour value:
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)
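For the simple histogram mentioned in the question, a minimal sketch of a bar chart of rides per hour, assuming matplotlib is installed:
import matplotlib.pyplot as plt

# bar chart of total rides per hour from the grouped result above
test3.plot.bar(x='hour', y='Rides', legend=False)
plt.xlabel('Hour of day')
plt.ylabel('Rides')
plt.show()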
I currently have a line of code I'm using to try to create a column that is based on a cumulative sum of timedelta data between dates. However, it's not performing the cumulative sum correctly everywhere, and I was also given a warning that my line of Python code won't work in the future.
The original dataset is below:
ID CREATION_DATE TIMEDIFF EDITNUMB
8211 11/26/2019 13:00 1
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1
Here is my line of python code:
df['RECUR'] = df.groupby(['ID']).TIMEDIFF.apply(lambda x: x.shift().fillna(1).cumsum())
This produces the new column 'RECUR', which is not cumulatively summing the data in the 'TIMEDIFF' column correctly:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
8211 11/26/2019 13:00 1 0 days 00:00:01.000000000
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1 0 days 00:00:02.000000000
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1 37 days 20:11:11.000000000
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1 69 days 01:52:08.000000000
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1 122 days 01:59:57.000000000
It also produces this warning:
FutureWarning: Passing integers to fillna is deprecated, will raise a TypeError in a future version. To retain the old behavior, pass pd.Timedelta(seconds=n) instead.
Any help on this will be greatly appreciated; the sum total should be 153 days starting from 11/26/19, correctly displayed cumulatively in the 'RECUR' column.
You can do:
# transform('first') would also work
df['RECUR'] = df['CREATION_DATE'] - df.groupby('ID').CREATION_DATE.transform('min')
Output:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
0 8211 2019-11-26 13:00:00 NaT 1 0 days 00:00:00
1 8211 2020-01-03 09:11:00 37 days 20:11:00 1 37 days 20:11:00
2 8211 2020-02-03 14:52:00 31 days 05:41:00 1 69 days 01:52:00
3 8211 2020-03-27 15:00:00 53 days 00:08:00 1 122 days 02:00:00
4 8211 2020-04-29 12:07:00 32 days 21:07:00 1 154 days 23:07:00
You can fillna with a timedelta of 0 seconds and do the cumsum:
df['RECUR'] = df.groupby('ID').TIMEDIFF.apply(
lambda x: x.fillna(pd.Timedelta(seconds=0)).cumsum())
df['RECUR']
# 0 0 days 00:00:00
# 1 37 days 20:11:09
# 2 69 days 01:52:06
# 3 122 days 01:59:55
# 4 154 days 23:07:18
In [621]: df = pd.DataFrame({'id':[44,44,44,88,88,90,95],
'Status': ['Reject','Submit','Draft','Accept','Submit',
'Submit','Draft'],
'Datetime': ['2018-11-24 08:56:02',
'2018-10-24 18:12:02','2018-10-24 08:12:02',
'2018-10-29 13:17:02','2018-10-24 10:12:02',
'2018-12-30 08:43:12', '2019-01-24 06:12:02']
}, columns = ['id','Status', 'Datetime'])
df['Datetime'] = pd.to_datetime(df['Datetime'])
df
Out[621]:
id Status Datetime
0 44 Reject 2018-11-24 08:56:02
1 44 Submit 2018-10-24 18:12:02
2 44 Draft 2018-10-24 08:12:02
3 88 Accept 2018-10-29 13:17:02
4 88 Submit 2018-10-24 10:12:02
5 90 Submit 2018-12-30 08:43:12
6 95 Draft 2019-01-24 06:12:02
What I am trying to get is another column, e.g. df['Time in Status'], which is the time that the id spent at that status.
I've looked at df.groupby() but only found answers (such as this one) for working out the difference between two dates (e.g. first and last), regardless of how many dates are in between.
df['Datetime'] = pd.to_datetime(df['Datetime'])
g = df.groupby('id')['Datetime']
print(df.groupby('id')['Datetime'].apply(lambda g: g.iloc[-1] - g.iloc[0]))
id
44 -32 days +23:16:00
88 -6 days +20:55:00
90 0 days 00:00:00
95 0 days 00:00:00
Name: Datetime, dtype: timedelta64[ns]
The closest I've come to getting the result is DataFrameGroupBy.diff
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
df
id Status Datetime Time in Status
0 44 Reject 2018-11-24 08:56:02 NaT
1 44 Submit 2018-10-24 18:12:02 -31 days +09:16:00
2 44 Draft 2018-10-24 08:12:02 -1 days +14:00:00
3 88 Accept 2018-10-29 13:17:02 NaT
4 88 Submit 2018-10-24 10:12:02 -6 days +20:55:00
5 90 Submit 2018-12-30 08:43:12 NaT
6 95 Draft 2019-01-24 06:12:02 NaT
However, there are two issues with this. First, how can I do this calculation starting with the earliest date and working through to the end? E.g. in row 2, instead of -1 days +14:00:00 it would be 0 days 10:00:00? Or is this easier to solve by rearranging the order of the data beforehand?
The other issue is the NaT values. If there is no date to compare with, then the current day (i.e. datetime.now) would be used. I could apply this afterwards easily enough, but I was wondering if there might be a better solution for finding and replacing all the NaT values.
Exactly, you are right: it is first necessary to sort with DataFrame.sort_values by both columns:
df = df.sort_values(['id', 'Datetime'])
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
print (df)
id Status Datetime Time in Status
2 44 Draft 2018-10-24 08:12:02 NaT
1 44 Submit 2018-10-24 18:12:02 0 days 10:00:00
0 44 Reject 2018-11-24 08:56:02 30 days 14:44:00
4 88 Submit 2018-10-24 10:12:02 NaT
3 88 Accept 2018-10-29 13:17:02 5 days 03:05:00
5 90 Submit 2018-12-30 08:43:12 NaT
6 95 Draft 2019-01-24 06:12:02 NaT
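For the remaining NaT values, one possibility, assuming the intended meaning is "time elapsed from that status until now" as the question suggests, is a plain fillna (a sketch of that assumption, not necessarily the semantics you want):
# fill NaT with the time from that row's Datetime up to now (an assumed semantics)
df['Time in Status'] = df['Time in Status'].fillna(pd.Timestamp.now() - df['Datetime'])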