[In 621]: df = pd.DataFrame({'id':[44,44,44,88,88,90,95],
'Status': ['Reject','Submit','Draft','Accept','Submit',
'Submit','Draft'],
'Datetime': ['2018-11-24 08:56:02',
'2018-10-24 18:12:02','2018-10-24 08:12:02',
'2018-10-29 13:17:02','2018-10-24 10:12:02',
'2018-12-30 08:43:12', '2019-01-24 06:12:02']
}, columns = ['id','Status', 'Datetime'])
df['Datetime'] = pd.to_datetime(df['Datetime'])
df
Out[621]:
id Status Datetime
0 44 Reject 2018-11-24 08:56:02
1 44 Submit 2018-10-24 18:12:02
2 44 Draft 2018-10-24 08:12:02
3 88 Accept 2018-10-29 13:17:02
4 88 Submit 2018-10-24 10:12:02
5 90 Submit 2018-12-30 08:43:12
6 95 Draft 2019-01-24 06:12:02
What I am trying to get is another column, e.g. df['Time in Status'] which is the time that id spent at that status.
I've looked at df.groupby() but only found answers (such as this one) for working out the difference between two dates (e.g. first and last), regardless of how many dates are in between.
df['Datetime'] = pd.to_datetime(df['Datetime'])
g = df.groupby('id')['Datetime']
print(df.groupby('id')['Datetime'].apply(lambda g: g.iloc[-1] - g.iloc[0]))
id
44 -32 days +23:16:00
88 -6 days +20:55:00
90 0 days 00:00:00
95 0 days 00:00:00
Name: Datetime, dtype: timedelta64[ns]
The closest I've come to getting the result is DataFrameGroupBy.diff
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
df
id Status Datetime Time in Status
0 44 Reject 2018-11-24 08:56:02 NaT
1 44 Submit 2018-10-24 18:12:02 -31 days +09:16:00
2 44 Draft 2018-10-24 08:12:02 -1 days +14:00:00
3 88 Accept 2018-10-29 13:17:02 NaT
4 88 Submit 2018-10-24 10:12:02 -6 days +20:55:00
5 90 Submit 2018-12-30 08:43:12 NaT
6 95 Draft 2019-01-24 06:12:02 NaT
However, there are two issues with this. First, how can I do this calculation starting with the earliest date and working through until the end? E.g. in row 2, instead of -1 days +14:00:00 it would be 0 days 10:00:00. Or is this easier to solve by rearranging the order of the data beforehand?
The other issue is the NaT. If there is no date to compare with, then the current day (i.e. datetime.now) would be used. I could apply this afterwards easy enough, but I was wondering if there might be a better solution to finding and replacing all the NaT values.
You are exactly right: first you need to sort the rows with DataFrame.sort_values on both columns, then take the diff per group:
df = df.sort_values(['id', 'Datetime'])
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
print (df)
id Status Datetime Time in Status
2 44 Draft 2018-10-24 08:12:02 NaT
1 44 Submit 2018-10-24 18:12:02 0 days 10:00:00
0 44 Reject 2018-11-24 08:56:02 30 days 14:44:00
4 88 Submit 2018-10-24 10:12:02 NaT
3 88 Accept 2018-10-29 13:17:02 5 days 03:05:00
5 90 Submit 2018-12-30 08:43:12 NaT
6 95 Draft 2019-01-24 06:12:02 NaT
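For the second part of the question (the NaT rows), one option is to fill the gaps after the diff. This is only a minimal sketch, assuming a NaT should become the time elapsed between that row's Datetime and a reference time. The name `now` is fixed here purely to keep the example reproducible; in practice you would probably use pd.Timestamp.now().

```python
import pandas as pd

df = pd.DataFrame({
    'id': [44, 44, 44, 88, 88, 90, 95],
    'Status': ['Reject', 'Submit', 'Draft', 'Accept', 'Submit', 'Submit', 'Draft'],
    'Datetime': pd.to_datetime([
        '2018-11-24 08:56:02', '2018-10-24 18:12:02', '2018-10-24 08:12:02',
        '2018-10-29 13:17:02', '2018-10-24 10:12:02',
        '2018-12-30 08:43:12', '2019-01-24 06:12:02']),
})

# Assumption: a fixed reference time so the output is reproducible.
# Replace with pd.Timestamp.now() for real use.
now = pd.Timestamp('2019-02-01')

# Sort so that diff() runs from the earliest to the latest date within each id.
df = df.sort_values(['id', 'Datetime'])
elapsed = df.groupby('id')['Datetime'].diff()

# Rows with nothing to compare against are NaT; fill them with (now - Datetime).
df['Time in Status'] = elapsed.fillna(now - df['Datetime'])
print(df)
```

Because fillna aligns on the index, only the NaT rows are replaced and the computed differences are left untouched.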
Here is the code I have so far:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Datasets/KickstarterRevised.csv')
df['deadline'] = pd.to_datetime(df['deadline'])
df['launched'] = pd.to_datetime(df['launched'])
df['difference'] = df['deadline'].sub(df['launched'], axis=0)
df['difference']
0 58 days 23:24:00
1 45 days 00:00:00
2 30 days 01:00:00
3 55 days 16:25:00
4 35 days 00:00:00
...
4994 40 days 00:00:00
4995 8 days 10:50:00
4996 38 days 18:53:00
4997 30 days 00:00:00
4998 30 days 00:00:00
Name: difference, Length: 4999, dtype: timedelta64[ns]
As you can see from your output, df['difference'] is a Series with dtype timedelta64[ns]. To get the whole days as integers, use the .dt.days accessor. (Older pandas versions also accepted .astype('timedelta64[D]'), but recent versions no longer support that conversion.) See below:
df['difference'] = df['deadline'].sub(df['launched']).dt.days
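A small self-contained sketch of extracting whole days from a timedelta column. The two rows below are made-up stand-ins for the Kickstarter data (an assumption), chosen so the differences are 58 days 23:24:00 and 45 days:

```python
import pandas as pd

# Hypothetical sample rows standing in for the CSV in the question.
df = pd.DataFrame({
    'launched': pd.to_datetime(['2020-01-01 00:36:00', '2020-03-01 00:00:00']),
    'deadline': pd.to_datetime(['2020-02-29 00:00:00', '2020-04-15 00:00:00']),
})

# Subtracting two datetime columns gives a timedelta64 Series.
df['difference'] = df['deadline'] - df['launched']

# .dt.days keeps only the whole-day component as an integer.
df['days'] = df['difference'].dt.days
print(df)
```

Note that .dt.days truncates the time-of-day part; if fractional days are wanted, df['difference'] / pd.Timedelta(days=1) is one alternative.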
I have a dataframe of orders as below, where the column 'Value' represents cash in/out and the 'Date' column reflects when the transaction occurred.
The transactions come in pairs, so each 'Qty' out is always followed by the matching 'Qty' in, as reflected by the sign in the 'Qty' column:
Date Qty Price Value
0 2014-11-18 58 495.775716 -2875499
1 2014-11-24 -58 484.280147 2808824
2 2014-11-26 63 474.138699 -2987073
3 2014-12-31 -63 507.931247 3199966
4 2015-01-05 59 495.923771 -2925950
5 2015-02-05 -59 456.224370 2691723
How can I create two columns, 'n_days' and 'price_diff' that is the difference in days between the two dates of each transaction and the 'Value'?
I have tried:
df['price_diff'] = df['Value'].rolling(2).apply(lambda x: x[0] + x[1])
but I receive a KeyError for the first observation (0).
Many thanks
Why don't you just use sum:
df['price_diff'] = df['Value'].rolling(2).sum()
Although, judging from the name, it looks like you want:
df['price_diff'] = df['Price'].diff()
And, for the two columns:
df[['Date_diff','Price_diff']] = df[['Date','Price']].diff()
Output:
Date Qty Price Value Date_diff Price_diff
0 2014-11-18 58 495.775716 -2875499 NaT NaN
1 2014-11-24 -58 484.280147 2808824 6 days -11.495569
2 2014-11-26 63 474.138699 -2987073 2 days -10.141448
3 2014-12-31 -63 507.931247 3199966 35 days 33.792548
4 2015-01-05 59 495.923771 -2925950 5 days -12.007476
5 2015-02-05 -59 456.224370 2691723 31 days -39.699401
Update: per the comment, you can try:
df['Val_sum'] = df['Value'].rolling(2).sum()[1::2]
Output:
Date Qty Price Value Val_sum
0 2014-11-18 58 495.775716 -2875499 NaN
1 2014-11-24 -58 484.280147 2808824 -66675.0
2 2014-11-26 63 474.138699 -2987073 NaN
3 2014-12-31 -63 507.931247 3199966 212893.0
4 2015-01-05 59 495.923771 -2925950 NaN
5 2015-02-05 -59 456.224370 2691723 -234227.0
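If one row per out/in pair is preferred, the pairing can also be made explicit. A rough sketch, assuming the rows really do alternate strictly (out, in, out, in, ...); the names `pair` and `summary` are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2014-11-18', '2014-11-24', '2014-11-26',
                            '2014-12-31', '2015-01-05', '2015-02-05']),
    'Qty': [58, -58, 63, -63, 59, -59],
    'Value': [-2875499, 2808824, -2987073, 3199966, -2925950, 2691723],
})

# Assumption: rows 0-1 form the first trade, rows 2-3 the second, and so on.
pair = np.arange(len(df)) // 2
g = df.groupby(pair)

summary = pd.DataFrame({
    # days between the first and last date of each pair
    'n_days': (g['Date'].last() - g['Date'].first()).dt.days,
    # net cash flow of each pair
    'value_diff': g['Value'].sum(),
})
print(summary)
```

If the data can contain unmatched rows, the integer-division pairing would no longer be valid and a different grouping key would be needed.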
I subsetted a big dataframe, selecting only one column, Start Time, which has dtype object.
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by the count (to my best knowledge)
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
I'm not sure why this index column appeared, and I haven't managed to rename or convert it.
What would I like to see?
My ideal output groups the times by hour (24h format is OK). The data seems to be counted every 15 minutes, so basically each group of 4 consecutive rows should be added together. 00:15:00 can count as hour 0, and 23:00:00 as hour 23.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
I would like to create afterwards a simple histogram to show the occurrence by the hour.
Appreciate any help!
IIUC,
import numpy as np
import pandas as pd
# Create a dummy input DataFrame: one row per 15-minute slot with a random ride count
test = pd.DataFrame({'time':pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15min').strftime('%H:%M:%S'),
'rides':np.random.randint(15000,28000,96)})
Let's create a DatetimeIndex from the strings, resample, aggregate with sum, and convert the DatetimeIndex to hours:
# select 'rides' so that the string 'time' column is not aggregated as well
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
.rename_axis('hour').resample('H')['rides'].sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
I worked out the answer myself, step by step.
First, I renamed the columns:
# test here is the two-column counts DataFrame produced by reset_index()
test2 = test.rename(columns = {'index': "Time", 'Start Time': 'Rides'})
The remaining question was how to summarize by the hour.
After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
test2
I got closer.
Finally, I grouped by the hour value:
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)
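The rename step can be avoided by counting hours directly from the raw column. A minimal sketch, assuming the Start Time values are 'HH:MM:SS' strings (or datetime.time objects, which astype(str) turns into such strings); the five sample values are made up:

```python
import pandas as pd

# Hypothetical sample standing in for taxi_2020['Start Time'].
start = pd.Series(['00:15:00', '00:15:00', '00:45:00', '01:00:00', '23:45:00'],
                  name='Start Time')

# Parse the strings and keep only the hour component (0-23).
hours = pd.to_datetime(start.astype(str), format='%H:%M:%S').dt.hour

# Count rides per hour and give both columns explicit names.
rides = (hours.value_counts()
              .sort_index()
              .rename_axis('Hour')
              .reset_index(name='Rides'))
print(rides)
```

From there, something like rides.plot.bar(x='Hour', y='Rides') should give the per-hour chart, provided matplotlib is installed.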
I currently have a line of code that tries to create a column based on a cumulative sum of timedelta data between dates. However, it does not perform the cumulative sum correctly everywhere, and I also get a warning that this line of Python code won't work in the future.
The original dataset is below:
ID CREATION_DATE TIMEDIFF EDITNUMB
8211 11/26/2019 13:00 1
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1
Here is my line of python code:
df['RECUR'] = df.groupby(['ID']).TIMEDIFF.apply(lambda x: x.shift().fillna(1).cumsum())
Which produces the new column 'RECUR' that is not summing cumulatively correctly from the data in the 'TIMEDIFF' column:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
8211 11/26/2019 13:00 1 0 days 00:00:01.000000000
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1 0 days 00:00:02.000000000
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1 37 days 20:11:11.000000000
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1 69 days 01:52:08.000000000
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1 122 days 01:59:57.000000000
Which also produces this warning:
FutureWarning: Passing integers to fillna is deprecated, will raise a TypeError in a future version. To retain the old behavior, pass pd.Timedelta(seconds=n) instead.
Any help on this would be greatly appreciated. The sum total should be 153 days starting from 11/26/19, displayed cumulatively in the 'RECUR' column.
You can do:
# transform('first') would also work
df['RECUR'] = df['CREATION_DATE'] - df.groupby('ID').CREATION_DATE.transform('min')
Output:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
0 8211 2019-11-26 13:00:00 NaT 1 0 days 00:00:00
1 8211 2020-01-03 09:11:00 37 days 20:11:00 1 37 days 20:11:00
2 8211 2020-02-03 14:52:00 31 days 05:41:00 1 69 days 01:52:00
3 8211 2020-03-27 15:00:00 53 days 00:08:00 1 122 days 02:00:00
4 8211 2020-04-29 12:07:00 32 days 21:07:00 1 154 days 23:07:00
You can fillna with a timedelta of 0 seconds and do the cumsum
df['RECUR'] = df.groupby('ID').TIMEDIFF.apply(
lambda x: x.fillna(pd.Timedelta(seconds=0)).cumsum())
df['RECUR']
# 0 0 days 00:00:00
# 1 37 days 20:11:09
# 2 69 days 01:52:06
# 3 122 days 01:59:55
# 4 154 days 23:07:18
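Depending on the pandas version, groupby(...).apply may return a result indexed by group key as well as by row, which can make the assignment back to the frame awkward. A hedged alternative sketch using transform, which keeps the original row index; the frame below is rebuilt from the question's TIMEDIFF values:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [8211] * 5,
    'TIMEDIFF': pd.to_timedelta([pd.NaT, '37 days 20:11:09', '31 days 05:40:57',
                                 '53 days 00:07:49', '32 days 21:07:23']),
})

# Treat the missing first difference as zero elapsed time, then accumulate per ID.
# transform returns a Series aligned with df's index, so it can be assigned directly.
df['RECUR'] = (df.groupby('ID')['TIMEDIFF']
                 .transform(lambda x: x.fillna(pd.Timedelta(0)).cumsum()))
print(df)
```

Filling with pd.Timedelta(0) rather than an integer also avoids the FutureWarning quoted in the question.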
I want to take a column of datetime objects and return a column of integers that are "days from that datetime until today". I can do it in an ugly way, looking for a prettier (and faster) way.
So suppose I have a dataframe with a datetime column like so:
11 2014-03-04 17:16:26+00:00
12 2014-03-10 01:35:56+00:00
13 2014-03-15 02:35:51+00:00
14 2014-03-20 05:55:47+00:00
15 2014-03-26 04:56:33+00:00
Name: datetime, dtype: object
And each element looks like:
datetime.datetime(2014, 3, 4, 17, 16, 26, tzinfo=<UTC>)
Suppose I want to calculate how many days ago each observation occurred, and return that as a simple integer. I know I can just use apply twice, but is there a vectorized/cleaner way to do it?
today = datetime.datetime.today().date()
df_dates = df['datetime'].apply(lambda x: x.date())
days_ago = today - df_dates
Which gives a timedelta64[ns] Series.
11 56 days, 00:00:00
12 50 days, 00:00:00
13 45 days, 00:00:00
14 40 days, 00:00:00
15 34 days, 00:00:00
Name: datetime, dtype: timedelta64[ns]
And then finally if I want it as an integer:
days_ago_as_int = days_ago.apply(lambda x: x.item().days)
days_ago_as_int
11 56
12 50
13 45
14 40
15 34
Name: datetime, dtype: int64
Any thoughts?
Related questions that didn't quite get at what I was asking:
Pandas Python- can datetime be used with vectorized inputs
Pandas add one day to column
Trying Karl D's answer, I'm successfully able to get today's date and the date column as desired, but something goes awry in the subtraction (different datetimes than in the original example, but shouldn't matter, right?):
converted_dates = df['date'].values.astype('datetime64[D]')
today_date = np.datetime64(dt.date.today())
print(converted_dates)
print(today_date)
print(today_date - converted_dates)
[2014-01-16 00:00:00
2014-01-19 00:00:00
2014-01-22 00:00:00
2014-01-26 00:00:00
2014-01-29 00:00:00]
2014-04-30 00:00:00
[16189 days, 0:08:20.637994
16189 days, 0:08:20.637991
16189 days, 0:08:20.637988
16189 days, 0:08:20.637984
16189 days, 0:08:20.637981]
How about (for a column named date)?
import datetime as dt
import numpy as np
# truncate both sides to whole days, then subtract
df['foo'] = (np.datetime64(dt.date.today())
- df['date'].values.astype('datetime64[D]'))
print(df)
date foo
0 2014-03-04 17:16:26 56 days
1 2014-03-10 01:35:56 50 days
2 2014-03-15 02:35:51 45 days
3 2014-03-20 05:55:47 40 days
4 2014-03-26 04:56:33 34 days
Or if you wanted it as an int:
df['foo'] = (np.datetime64(dt.date.today())
- df['date'].values.astype('datetime64[D]')).astype(int)
print(df)
date foo
0 2014-03-04 17:16:26 56
1 2014-03-10 01:35:56 50
2 2014-03-15 02:35:51 45
3 2014-03-20 05:55:47 40
4 2014-03-26 04:56:33 34
Or if it was an index
print(np.datetime64(dt.date.today()) - df.index.values.astype('datetime64[D]'))
[56 50 45 40 34]
Much later edit: how about this for a workaround?
>>> print(df)
date
0 2014-03-04 17:16:26
1 2014-03-10 01:35:56
2 2014-03-15 02:35:51
3 2014-03-20 05:55:47
4 2014-03-26 04:56:33
Try assigning today's date to a column so it gets converted to a datetime64 column by pandas and then do the arithmetic:
>>> df['today'] = dt.date.today()
>>> df['foo'] = (df['today'].values.astype('datetime64[D]')
- df['date'].values.astype('datetime64[D]'))
>>> print(df)
date today foo
0 2014-03-04 17:16:26 2014-05-14 71 days
1 2014-03-10 01:35:56 2014-05-14 65 days
2 2014-03-15 02:35:51 2014-05-14 60 days
3 2014-03-20 05:55:47 2014-05-14 55 days
4 2014-03-26 04:56:33 2014-05-14 49 days
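On current pandas versions the same result can usually be had without dropping to NumPy arrays. A minimal sketch, assuming the column is (or can be parsed into) a timezone-aware UTC datetime column; `today` is pinned to a fixed date only so the example is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2014-03-04 17:16:26', '2014-03-10 01:35:56',
                                '2014-03-15 02:35:51'], utc=True),
})

# Assumption: fixed "today" for reproducibility.
# For real use: today = pd.Timestamp.now(tz='UTC').normalize()
today = pd.Timestamp('2014-04-29', tz='UTC')

# normalize() floors each timestamp to midnight, so the difference is whole days;
# .dt.days then returns plain integers.
df['days_ago'] = (today - df['datetime'].dt.normalize()).dt.days
print(df)
```

If the column has dtype object holding datetime.datetime values, as in the question, pd.to_datetime(df['datetime'], utc=True) should convert it first.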