I have a column of datetime stamps. I need a column of total minutes elapsed from the first to the last value.
I have:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
>>> df
timestamp
0 2001-01-01 06:00:00
1 2001-01-01 06:01:00
2 2001-01-01 06:15:00
I need to add a column that gives the running total:
timestamp minutes
1-1-2001 6:00 0
1-1-2001 6:01 1
1-1-2001 6:15 15
1-1-2001 7:00 60
1-1-2001 7:35 95
I'm having a hard time manipulating the datetime Series in a way that lets me total up the elapsed time.
I've looked at a lot of posts and can't find anything that does what I'm trying to do. Would appreciate any ideas!
You can chain a few methods together:
>>> df['minutes'] = df['timestamp'].diff().fillna(0).dt.total_seconds()\
... .cumsum().div(60).astype(int)
>>> df
timestamp minutes
0 2001-01-01 06:00:00 0
1 2001-01-01 06:01:00 1
2 2001-01-01 06:15:00 15
Creation:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
Walkthrough
The easiest way to break this down is to separate each intermediate method call.
df['timestamp'].diff() gives you a Series of pandas Timedelta objects (the pandas equivalent of Python's datetime.timedelta): the difference between each value and the one before it.
>>> df['timestamp'].diff()
0 NaT
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
This contains an N/A value (NaT/not a time) because there's nothing to subtract from the first value. You can simply fill it with the zero-value for timedeltas:
>>> df['timestamp'].diff().fillna(0)
0 00:00:00
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
Now you need to get an actual number (minutes) from these objects. In .dt.total_seconds(), .dt is an "accessor", which exposes a collection of methods for working with datetime-like data:
>>> df['timestamp'].diff().fillna(0).dt.total_seconds()
0 0.0
1 60.0
2 840.0
Name: timestamp, dtype: float64
The result is the incremental change in seconds, as a float. You need it on a cumulative basis, in minutes, and as an integer. That's what the final 3 operations do:
>>> df['timestamp'].diff().fillna(0).dt.total_seconds().cumsum().div(60).astype(int)
0 0
1 1
2 15
Name: timestamp, dtype: int64
Note that astype(int) truncates rather than rounds, which matters if you have seconds that aren't fully divisible by 60.
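If the timestamps are already sorted, the diff/cumsum pair can also be collapsed into a single subtraction from the first value. A minimal equivalent sketch (assuming ascending order, so all differences are non-negative):
>>> df['minutes'] = (df['timestamp'] - df['timestamp'].iloc[0])\
...                     .dt.total_seconds().floordiv(60).astype(int)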
Related
I have code as below. My questions:
1. Why is it assigning week 1 to 2014-12-29 and '2014-1-1'? Why is it not assigning week 53 to 2014-12-29?
2. How could I assign a week number that is continuously increasing? I want '2014-12-29' and '2015-1-1' to have week 53 and '2015-1-15' to have week 55, etc.
x=pd.DataFrame(data=['2014-1-1','2014-12-29','2015-1-1','2015-1-15'],columns=['date'])
x['week_number']=pd.DatetimeIndex(x['date']).week
As far as why the week number is 1 for 12/29/2014 -- see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get its day number, and subtract it from the cumulative day differences:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min().day + 1  # day number of the start date, + 1 (see walkthrough below)
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
Next, we take the minimum of your date column and get its day number, adding 1, to use as the offset from the start of that first week:
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract the starting day number, you get the days since the starting date (here 2014-01-01), treated as if that Wednesday were day 0 of its week. This compensates for the fact that Wednesday fell in the middle of the first week:
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
Finally, the .fillna is just to take care of that NaT (which turned into a NaN when we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Lastly, use .astype() to convert the column to integers instead of floats.
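As a side note: the cumulative sum of consecutive differences is just the distance from the first date, so the same result can be computed without .diff() and .cumsum(). A sketch of that equivalent, where .clip(lower=1) plays the role of the .fillna(1) above for the first row:
In [12]: days = (x["date"] - x["date"].min()).dt.days
In [13]: ((days - 2) // 7 + 1).clip(lower=1)
Out[13]:
0     1
1    52
2    52
3    54
Name: date, dtype: int64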
The round() function in pandas rounds the time 07:30 down to 07:00, but I want to round up any time that passes the 30-minute mark (inclusive).
Eg.
07:15 to 07:00
05:25 to 05:00
22:30 to 23:00
18:45 to 19:00
How to achieve this for a column of a dataframe using pandas?
timestamps
You need to use dt.round. This is however a bit tricky, as the previous/next hour behavior at exactly xx:30 depends on the parity of the hour itself (round half to even). You can force it by adding or subtracting a small amount of time (here 1ns):
s = pd.to_datetime(pd.Series(['1/2/2021 3:45', '25/4/2021 12:30',
'25/4/2021 13:30', '12/4/2022 23:45']))
# xx:30 -> rounding depending on the hour parity (default)
s.dt.round(freq='1h')
0 2021-01-02 04:00:00
1 2021-04-25 12:00:00 <- -30min
2 2021-04-25 14:00:00 <- +30min
3 2022-12-05 00:00:00
dtype: datetime64[ns]
# 00:30 -> 00:00 (force down)
s.sub(pd.Timedelta('1ns')).dt.round(freq='1h')
0 2021-01-02 04:00:00
1 2021-04-25 12:00:00
2 2021-04-25 13:00:00
3 2022-12-05 00:00:00
dtype: datetime64[ns]
# 00:30 -> 01:00 (force up)
s.add(pd.Timedelta('1ns')).dt.round(freq='1h')
0 2021-01-02 04:00:00
1 2021-04-25 12:00:00
2 2021-04-25 13:00:00
3 2022-12-05 00:00:00
dtype: datetime64[ns]
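Applied to the examples from the question, the 1ns-addition variant gives the requested behavior (a sketch; an arbitrary date part is added here, since dt.round works on full timestamps):
s = pd.to_datetime(pd.Series(['2021-01-01 07:15', '2021-01-01 05:25',
                              '2021-01-01 22:30', '2021-01-01 18:45']))
s.add(pd.Timedelta('1ns')).dt.round(freq='1h')
0   2021-01-01 07:00:00
1   2021-01-01 05:00:00
2   2021-01-01 23:00:00
3   2021-01-01 19:00:00
dtype: datetime64[ns]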
floats
IIUC, you can use divmod (or numpy.modf) to get the integer and decimal parts, then perform simple boolean arithmetic:
s = pd.Series([7.15, 5.25, 22.30, 18.45])
s2, r = s.divmod(1) # or np.modf(s)
s2[r.ge(0.3)] += 1
s2 = s2.astype(int)
Alternative: using mod and boolean to int equivalence:
s2 = s.astype(int)+s.mod(1).ge(0.3)
output:
0 7
1 5
2 23
3 19
dtype: int64
Note on precision: it is not always easy to compare floats, due to floating-point arithmetic. For instance, using gt would fail on the 22.30 here. To ensure precision, round to 2 digits first:
s.mod(1).round(2).ge(0.3)
or use integers:
s.mod(1).mul(100).astype(int).ge(30)
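Putting the pieces together, a minimal runnable sketch of the divmod approach with the precision-safe comparison:
s = pd.Series([7.15, 5.25, 22.30, 18.45])
s2, r = s.divmod(1)            # integer part, fractional part
s2[r.round(2).ge(0.3)] += 1    # round up at .30 or above
s2 = s2.astype(int)            # -> 7, 5, 23, 19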
Here is a version that works with timestamps:
import numpy as np
import pandas as pd
# dummy data: 10 random timestamps
df = pd.DataFrame({'time': pd.to_datetime([np.random.randint(0, 10**8) for a in range(10)], unit='s')})
def custom_round(row, col, out):
    # ceil to the next hour from minute 30 onwards, floor below that
    if row[col].minute >= 30:
        row[out] = row[col].ceil('H')
    else:
        row[out] = row[col].floor('H')
    return row
df = df.apply(lambda row: custom_round(row, 'time', 'new_time'), axis=1)
Edit: using numpy:
def custom_round(df, col, out):
    # vectorized: ceil to the next hour from minute 30 onwards, floor below
    df[out] = np.where(df[col].dt.minute >= 30,
                       df[col].dt.ceil('H'),
                       df[col].dt.floor('H'))
    return df
df = custom_round(df, 'time', 'new_time')
I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here, 'date' is Week/Year.
So, I tried using the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But, return this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day of the week. How can I get 2019-12-29 as the first day of the first week of 2020, instead of 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: My original answer doesn't work for the input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer to add a custom function.
Here is a way with a custom function:
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
date = datetime.strptime(date + '/0', "%U/%Y/%w")
return date if date0 == date1 else date - timedelta(weeks=1)
df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use ISO week parsing directives, Ex:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
You can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]
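To apply this to the dataframe from the question (assuming df['date'] holds the week/year strings), the same expression works on the column directly:
df['date'] = pd.to_datetime(df['date'] + '/1', format='%V/%G/%u') - pd.Timedelta('1d')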
I would like to calculate the mean of a timedelta series, excluding the 00:00:00 values.
This is my time series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I want to replace rows 5 and 9 with NaN and then apply .mean() to the series; mean() doesn't include NaN values, so I would get the desired value.
How can I do that?
I'm trying:
`df["time_column"].replace('0 days 00:00:00', np.NaN).mean()`
but no values are replaced.
One idea is to use a zero Timedelta object:
out = df["time_column"].replace(pd.Timedelta(0), np.nan).mean()
print (out)
0 days 01:20:30
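A self-contained sketch rebuilding the series from the question (the index values are illustrative):
import numpy as np
import pandas as pd
s = pd.to_timedelta(pd.Series(['00:28:00', '01:57:00', '00:00:00',
                               '01:27:00', '00:00:00', '01:30:00']))
# zero timedeltas become NaT, which mean() skips
print(s.replace(pd.Timedelta(0), np.nan).mean())  # 0 days 01:20:30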
I have been trying to calculate the number of business days between two dates (stored in separate columns of a dataframe).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type to Timestamp, as follows:
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Anyone know how to fix it?
Note 1: I have tried np.busday_count in two ways:
1. Passing dataframe columns: t['Days'] = np.busday_count(t.MonthBegin, t.MonthEnd)
2. Passing arrays: np.busday_count(dt1, dt2)
Note 2: My dataframe has over 150K rows, so I need an efficient algorithm.
You can use bdate_range (I also corrected your input, since some of the MonthEnd values were earlier than their MonthBegin). Note that bdate_range includes both endpoints, so the counts can run one higher than np.busday_count, which excludes the end date:
[len(pd.bdate_range(x, y)) for x, y in zip(df['MonthBegin'], df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do it is:
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the format in which your dates are written:
from datetime import datetime
a = datetime.strptime('2014-06-9', '%Y-%m-%d')
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now their difference
c = b - a
will give you datetime.timedelta(21); to convert it into days, just use
c.days
which gives you the difference of 21 days. You can now use a list comprehension to get the difference between two dates, in days, across the whole column.
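For instance, the list comprehension mentioned above could look like this (the column lists here are hypothetical; note this counts calendar days, not business days):
from datetime import datetime
begins = ['2014-06-09', '2014-07-01']
ends = ['2014-06-30', '2014-07-31']
days = [(datetime.strptime(e, '%Y-%m-%d') - datetime.strptime(b, '%Y-%m-%d')).days
        for b, e in zip(begins, ends)]
# [21, 30]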
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally, numpy.busday_count has a few other optional arguments, like weekmask and holidays, which you can use according to your needs.
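For example, a custom weekmask and a holiday list can be passed directly (the holiday here is illustrative; 2014-06-16 is a Monday inside the first row's range, so the count drops from 15 to 14):
import numpy as np
np.busday_count('2014-06-09', '2014-06-30',
                weekmask='Mon Tue Wed Thu Fri',
                holidays=['2014-06-16'])  # -> 14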