Timestamp subtraction must have the same timezones - python

I keep getting the following error:
TypeError: Timestamp subtraction must have the same timezones or no
timezones
At this line
df['days_in_Month'].loc[df['Month'] == min_date_Month] = (df['Month_end'] - \
pd.to_datetime(min_date,format="%Y-%m-%d"))
My df['TransactionDate'] is a column with the following format 2019-08-23T00:00:00.000Z. I am programming on Python3.3.7.
df['Month'] = df['TransactionDate'].apply(lambda x : str(x)[:7])
df['Month_begin'] = pd.to_datetime(df['Month'], format="%Y-%m") + MonthBegin(0)
df['Month_end'] = pd.to_datetime(df['Month'], format="%Y-%m") + MonthEnd(1)
df['days_in_Month'] = (df['Month_end'] - df['Month_begin'])#.days()
print(df.columns)
print(df)
min_date = df['TransactionDate'].min()
min_date_Month = min_date[:7]
df['days_in_Month'].loc[df['Month'] == min_date_Month] = (df['Month_end'] - \
pd.to_datetime(min_date,format="%Y-%m-%d"))
df['Month_begin'].loc[df['Month'] == min_date_Month] = pd.to_datetime(min_date,format="%Y-%m-%d")

When you run a piece of your offending instruction:
pd.to_datetime(min_date, format="%Y-%m-%d")
you will get:
Timestamp('2019-11-01 00:00:00+0000', tz='UTC')
It indicates that format="%Y-%m-%d" does not prevent this function
from parsing the whole input string, so the result is with
a time zone.
To parse only the date part, run:
pd.to_datetime(min_date[:10])
(even without format) and you will get:
Timestamp('2019-11-01 00:00:00')
without the time zone.
But your instruction as a whole is also problematic.
When you run the left hand side alone:
df['days_in_Month'].loc[df['Month'] == min_date_Month]
you will get:
0 29 days
Name: days_in_Month, dtype: timedelta64[ns]
But when you run the right hand side alone:
df['Month_end'] - pd.to_datetime(min_date[:10])
you will get:
0 29 days
1 60 days
2 91 days
3 120 days
Name: Month_end, dtype: timedelta64[ns]
So you attempt to save the whole column under a single cell.
Maybe this instruction should be:
df['days_in_Month'] = df['Month_end'] - pd.to_datetime(min_date[:10])
instead?
And yet another remark: Your days_in_Month column is actually of
timedelta64 type, not the number of days.
To have the number of days in each month (as an integer), you should run:
df['days_in_Month'] = (df['Month_end'] - df['Month_begin']).dt.days + 1
Note that e.g. the difference between 2019-11-01 and 2019-11-30
is 29 days, whereas November has 30 days.
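If all you need is the number of days in each month, a shorter route may be the days_in_month accessor. This is only a sketch: the small df below is made up for illustration, and it assumes the Month column holds "YYYY-MM" strings as in the question.

```python
import pandas as pd

# Hypothetical sample data in the same "YYYY-MM" shape as df['Month'].
df = pd.DataFrame({"Month": ["2019-11", "2020-02", "2021-02"]})

# Parse to the first day of each month, then ask pandas for the month length.
month_start = pd.to_datetime(df["Month"], format="%Y-%m")
df["days_in_Month"] = month_start.dt.days_in_month

print(df)
```

This avoids the "+ 1" correction, because no subtraction between month end and month begin is involved.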

The problem is that the Z in your datetime string causes the datetime to be interpreted as being in the UTC timezone,
but your Month_end column has no timezone information attached to it.
pandas cannot subtract a timezone-aware value from a timezone-naive one, so you need to either remove the timezone from the datetime string or, better, make your other datetimes timezone-aware in UTC.
pandas makes this relatively easy:
Month_end = pandas.to_datetime(month_end_strings,utc=True)
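To make the mismatch concrete, here is a minimal sketch with two standalone timestamps (the values are made up, not taken from your data). It shows the failing subtraction and both ways out: stripping the timezone, or adding UTC to the naive side.

```python
import pandas as pd

aware = pd.to_datetime("2019-08-23T00:00:00.000Z")  # trailing Z makes this UTC-aware
naive = pd.Timestamp("2019-08-31")                  # no timezone attached

try:
    naive - aware
    mixed_ok = True
except TypeError:
    mixed_ok = False  # mixing aware and naive values is rejected

# Option 1: drop the timezone from the aware value.
diff_naive = naive - aware.tz_localize(None)

# Option 2: make the naive value UTC-aware instead.
diff_aware = naive.tz_localize("UTC") - aware

print(mixed_ok, diff_naive, diff_aware)
```

For whole columns the same idea should apply through the .dt accessor, e.g. df['col'].dt.tz_localize(None).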


Converting a Pandas series of strings of '%d:%H:%M:%S' format into datetime format

I have a Pandas series which consists of strings of '169:21:5:24', '54:9:19:29', and so on which stand for 169 days 21 hours 5 minutes 24 seconds and 54 days 9 hours 19 minutes 29 seconds, respectively.
I want to convert them to datetime object (preferable) or just integers of seconds.
The first try was
pd.to_datetime(series1, format = '%d:%H:%M:%S')
which failed with an error message
time data '169:21:5:24' does not match format '%d:%H:%M:%S' (match)
The second try
pd.to_datetime(series1)
also failed with
expected hh:mm:ss format
The first try seems to work if all the 'days' are less than 30 or 31 days, but my data includes 150 days, 250 days etc and with no month value.
Finally,
temp_list1 = [[int(subitem) for subitem in item.split(":")] for item in series1]
temp_list2 = [item[0] * 24 * 3600 + item[1] * 3600 + item[2] * 60 + item[3] for item in temp_list1]
successfully converted the Series into a list of seconds, but this is lengthy.
I wonder if there is a Pandas.Series.dt or datetime methods that can deal with such type of data.
You wrote: "I want to convert them to datetime object (preferable) or just integers of seconds."
It seems to me that you are really looking for a timedelta, because it is unclear what the year should be.
You could do that, for example, like this (with ser as your series):
ser = pd.Series(["169:21:5:24", "54:9:19:29"])
timedeltas = ser.str.split(":", n=1, expand=True).assign(td=lambda df:
pd.to_timedelta(df[0].astype("int"), unit="D") + pd.to_timedelta(df[1])
)["td"]
seconds = timedeltas.dt.total_seconds().astype("int")
datetimes = pd.Timestamp("2022") + timedeltas # year has to be provided
Result:
timedeltas:
0 169 days 21:05:24
1 54 days 09:19:29
Name: td, dtype: timedelta64[ns]
seconds:
0 14677524
1 4699169
Name: td, dtype: int64
datetimes:
0 2022-06-19 21:05:24
1 2022-02-24 09:19:29
Name: td, dtype: datetime64[ns]
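If you would rather not depend on the string accessor, a plain-Python helper gives the same numbers. This is a sketch; parse_dhms is a name I made up, and it assumes every string has exactly four colon-separated integer fields.

```python
from datetime import timedelta

def parse_dhms(text):
    """Turn 'days:hours:minutes:seconds' into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

values = ["169:21:5:24", "54:9:19:29"]
deltas = [parse_dhms(v) for v in values]
seconds = [int(d.total_seconds()) for d in deltas]
print(seconds)
```

With a Series you could apply it as series1.apply(parse_dhms).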
[PyData.Pandas]: pandas.to_datetime uses (and points to) [Python.Docs]: datetime - strftime() and strptime() Behavior which states (emphasis is mine):
%d - Day of the month as a zero-padded decimal number.
...
%j - Day of the year as a zero-padded decimal number.
So, you're using the wrong directive (correct one is %j):
>>> import pandas as pd
>>>
>>> pd.to_datetime("169:21:5:24", format="%j:%H:%M:%S")
Timestamp('1900-06-18 21:05:24')
As seen, the reference year is 1900 (as specified in the 2nd URL). If you want to use the current year, a bit of extra processing is required:
>>> import datetime
>>>
>>> cur_year_str = "{:04d}:".format(datetime.datetime.today().year)
>>> cur_year_str
'2023:'
>>>
>>> pd.to_datetime(cur_year_str + "169:21:5:24", format="%Y:%j:%H:%M:%S")
Timestamp('2023-06-18 21:05:24')
>>>
>>> # Quick leap year test
>>> pd.to_datetime("2020:169:21:5:24", format="%Y:%j:%H:%M:%S")
Timestamp('2020-06-17 21:05:24')
All in all:
>>> series = pd.Series(("169:21:5:24", "54:9:19:29"))
>>> pd.to_datetime(cur_year_str + series, format="%Y:%j:%H:%M:%S")
0 2023-06-18 21:05:24
1 2023-02-23 09:19:29
dtype: datetime64[ns]
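One caveat worth checking before choosing %j: it means "day of the year", which is 1-based, so day 169 is only 168 days after January 1. If your 169 is an elapsed duration, the two readings differ by a day. A small sketch with the standard library, using 2023 purely as an example year:

```python
from datetime import datetime, timedelta

# Reading 169 as a calendar position (day-of-year, 1-based).
as_day_of_year = datetime.strptime("2023:169", "%Y:%j")

# Reading 169 as an elapsed duration added to January 1.
as_elapsed = datetime(2023, 1, 1) + timedelta(days=169)

print(as_day_of_year, as_elapsed)
```

Note also that %j is documented as covering 001 to 366, so durations longer than a year would not fit this directive.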

Python3 - datetime function -- clarification on .days meaning

I am wanting to take user input of date1 and date2 and calculate the difference to determine how many weeks are in between. My entire program is below. The example dates I'm using are:
date1 = 2023-03-15
date2 = 2022-11-09
Output is Number of weeks: 18 -- which is correct.
My 1st question that I need help in clarifying is why do I need the .days after the days = abs(date2-date1).days? I have searched for many hours via Google, stackoverflow, Youtube and Python docs https://docs.python.org/3.9/library/datetime.html?highlight=datetime#module-datetime. I'm pretty new to Python and reading the docs sometimes trips me up, so please forgive if it's in there -- I've struggled reading through some of it. Why is the .days needed? I know that if I remove .days, the output is: Number of weeks: 18 days, 0:00:00. Where is the documentation on needing the .days listed in the datetime module docs??? Can someone help me understand this please?
My 2nd question is why do I get Number of weeks: 0 when I change .days to .seconds? (this is when I was testing things and comment out the weeks = days//7 and print out days) The one part in the docs that I think addresses this the following: https://docs.python.org/3.9/library/datetime.html?highlight=datetime#module-datetime:~:text=the%20given%20year.-,Supported%20operations%3A,!%3D.%20The%20latter%20cases%20return%20False%20or%20True%2C%20respectively.,-In%20Boolean%20contexts.... and if this is correct, am I reading it correctly that if the difference in dates are to be determined, only "days" are returned, and thus no seconds or microseconds?
Thank you for your help! Code below:
#Find the number of weeks between two given dates
from datetime import datetime
#User input for 1st date in YYYY-MM-DD format
date1 = input("Enter 1st date in YYYY-MM-DD format: ")
date1 = datetime.strptime(date1, "%Y-%m-%d")
#User input for 2nd date in YYYY-MM-DD format
date2 = input("Enter 2nd date in YYYY-MM-DD format: ")
date2 = datetime.strptime(date2, "%Y-%m-%d")
#Calculate the weeks between the 2 given dates
days = abs(date2-date1).days
weeks = days//7
print("Number of weeks: ", weeks)
Output-correct answer with .days included:
Enter 1st date in YYYY-MM-DD format: 2023-03-15
Enter 2nd date in YYYY-MM-DD format: 2022-11-09
Number of weeks: 18
Output-with no .days added:
Enter 1st date in YYYY-MM-DD format: 2023-03-15
Enter 2nd date in YYYY-MM-DD format: 2022-11-09
Number of weeks: 18 days, 0:00:00
Output-(regarding 2nd question with the .seconds put in place of .days:
Enter 1st date in YYYY-MM-DD format: 2023-03-15
Enter 2nd date in YYYY-MM-DD format: 2022-11-09
Number of weeks: 0
Subtracting two datetime.datetime objects returns a datetime.timedelta object:
>>> date1 = datetime.strptime('2023-03-15', "%Y-%m-%d")
>>> date2 = datetime.strptime('2022-11-09', "%Y-%m-%d")
>>> date2-date1
datetime.timedelta(days=-126)
>>>
From the docs:
Only days, seconds and microseconds are stored internally. Arguments are converted to those units:
A millisecond is converted to 1000 microseconds.
A minute is converted to 60 seconds.
An hour is converted to 3600 seconds.
A week is converted to 7 days.
Here is some example usage of datetime.timedelta objects. So for your second question, I believe that you're right: for two dates parsed without a time of day, .seconds is 0, because the difference is a whole number of days. For .seconds to be nonzero, the difference would need a remainder of at least one second (1,000,000 microseconds) but less than a full day (86,400 seconds).
TL;DR: The answer to both of your questions is "because that is a property of the datetime.timedelta class."
More fully, the date1 and date2 objects you create in your code are both instances of the datetime.datetime class. The - operation between them makes a timedelta object.
Why is the .days needed to avoid printing ", 0:00:00"?
By default all the date information in the timedelta object you created with the operation abs(date2-date1) is printed (including seconds and microseconds, even after modifying it with the //7 operation). When you use the . operator, you access the days attribute of the timedelta object, and only that attribute's value is used.
Why do I get "Number of weeks: 0" when I change .days to .seconds?
The value of the seconds attribute of the timedelta object you created with the operation abs(date2-date1) is integer 0.
See below:
>>> from datetime import datetime
>>> date1 = "2023-03-15"
>>> date1, type(date1)
('2023-03-15', <class 'str'>)
>>> date1 = datetime.strptime(date1, "%Y-%m-%d")
>>> date1, type(date1)
(datetime.datetime(2023, 3, 15, 0, 0), <class 'datetime.datetime'>)
>>> date2 = "2022-11-09"
>>> date2, type(date2)
('2022-11-09', <class 'str'>)
>>> date2 = datetime.strptime(date2, "%Y-%m-%d")
>>> date2, type(date2)
(datetime.datetime(2022, 11, 9, 0, 0), <class 'datetime.datetime'>)
>>> abs(date2-date1), type(abs(date2-date1))
(datetime.timedelta(days=126), <class 'datetime.timedelta'>)
>>> abs(date2-date1).days, type(abs(date2-date1).days)
(126, <class 'int'>)
>>> abs(date2-date1).seconds, type(abs(date2-date1).seconds)
(0, <class 'int'>)
See also: this discussion.
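To see .seconds become nonzero, give one of the datetimes a time of day. A quick sketch (the 06:30 is an arbitrary value I added to the dates from the question):

```python
from datetime import datetime

delta = datetime(2023, 3, 15, 6, 30) - datetime(2022, 11, 9)

print(delta.days)             # whole days in the difference
print(delta.seconds)          # leftover seconds within the last partial day
print(delta.total_seconds())  # the entire difference expressed in seconds
```

So .seconds is only the remainder after the whole days are taken out; use total_seconds() when you want the full duration.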
To be precise, you are calculating the difference in 7-day intervals, but a week is actually a time period which starts on Monday (or Sunday) and ends on Sunday (or Saturday). So the difference between 2022-10-01 (Sat) and 2022-10-04 (Tue) in weeks of the year is 1, but in days it is 3 (zero 7-day intervals in your case).
So if you need to find the distance between two dates in weeks of the year, you have to take the weekdays into account:
from datetime import date
d1 = date(2022,10,18)
d2 = date(2022,10,5)
w1 = d1.weekday()
w2 = d2.weekday()
# difference in weeks of year = 2
((d1-d2).days - (w1-w2))/7 # 2.0
# difference in days = 13
d1-d2 # datetime.timedelta(days=13)
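The same formula can be wrapped in a function and checked against the Saturday/Tuesday example above. This is a sketch; calendar_weeks_between is a made-up name, and it assumes weeks start on Monday, which is what date.weekday() implies.

```python
from datetime import date

def calendar_weeks_between(later, earlier):
    """Count how many Monday-based week boundaries lie between two dates."""
    day_diff = (later - earlier).days
    weekday_diff = later.weekday() - earlier.weekday()
    # Removing the weekday offset aligns both dates to the Monday of their week.
    return (day_diff - weekday_diff) // 7

print(calendar_weeks_between(date(2022, 10, 18), date(2022, 10, 5)))
print(calendar_weeks_between(date(2022, 10, 4), date(2022, 10, 1)))
print((date(2022, 10, 4) - date(2022, 10, 1)).days // 7)  # plain 7-day intervals
```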

How to deal with inconsistent date series in Python?

Inconsistent date formats
As shown in the photo above, the check-in and check-out dates are inconsistent. Whenever I try to convert the entire series to datetime using df['Check-in date'] = pd.to_datetime(df['Check-in date'], errors='coerce') and
df['Check-out date'] = pd.to_datetime(df['Check-out date'], errors='coerce') the days and months get mixed up. I don't really know what to do now. I also tried splitting the days months and years and re-arranging them, but I still have no luck.
My goal here is to get the total night stay of our guest but due to the inconsistency, I end up getting negative total night stays.
I'd appreciate any help here. Thanks!
You can try different formats with strptime and return a DateTime object if any of them works.
from datetime import datetime
import pandas as pd
def try_different_formats(value):
    only_date_format = "%d/%m/%Y"
    date_and_time_format = "%Y-%m-%d %H:%M:%S"
    try:
        return datetime.strptime(value, only_date_format)
    except ValueError:
        pass
    try:
        return datetime.strptime(value, date_and_time_format)
    except ValueError:
        return pd.NaT
Using your example:
df = pd.DataFrame({'Check-in date': ['19/02/2022','2022-02-12 00:00:00']})
Check-in date
0 19/02/2022
1 2022-02-12 00:00:00
The apply method will run this function on every value of the Check-in date
column. The result is a column of datetime values.
df['Check-in date'].apply(try_different_formats)
0 2022-02-19
1 2022-02-12
Name: Check-in date, dtype: datetime64[ns]
For a more pandas-specific solution, you can check out this answer.
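One pandas-only way to do the same thing is to parse the column once per expected format and combine the results. This is a sketch that assumes only the two formats from the example occur; anything else ends up as NaT.

```python
import pandas as pd

raw = pd.Series(["19/02/2022", "2022-02-12 00:00:00", "not a date"])

# Each pass only succeeds on the rows that match its format; others become NaT.
day_first = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
iso_like = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Fill the gaps of the first pass with the results of the second.
parsed = day_first.fillna(iso_like)
print(parsed)
```

Because every format is stated explicitly, days and months cannot be swapped silently. The night count would then be (check_out - check_in).dt.days on the two parsed columns.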

How do I calculate time difference in days or months in python3

I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I can understand, if you have start date and time and end date and time then you can use datetime module in python.
To use this, something like this would be used:
from datetime import datetime
# variable = datetime(year, month, day, hour, minute, second)
start = datetime(2017, 5, 8, 18, 56, 40)
end = datetime(2019, 6, 27, 12, 30, 58)
print(end - start)  # prints the difference between the two datetimes as a timedelta
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
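For anyone landing here later, the same fix can be sketched without apply by handing the '%m-%Y' format to pandas directly. The two rows below are invented sample values, not data from the project.

```python
import pandas as pd

# Hypothetical rows in the MM-YYYY format described in the question.
df = pd.DataFrame({
    "air_start_date": ["04-2009", "01-2021"],
    "air_end_date": ["07-2010", "03-2021"],
})

# The format string must describe the whole value, month and year.
start = pd.to_datetime(df["air_start_date"], format="%m-%Y")
end = pd.to_datetime(df["air_end_date"], format="%m-%Y")

df["time difference"] = (end - start).dt.days
print(df)
```

Each date is taken as the first day of its month, so the difference is measured between month starts.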
Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
Example:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
That will give you for the dummy example
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
Regarding the difference in months, you can find some suggestions on how to calculate it here. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64

Get age from timestamp

I have a dataframe with timestamp of BirthDate = 2001-10-10 11:01:04.343
How can I get an actual age?
I tried like that:
i.loc[0, "BirthDate"] = pd.to_datetime('today').normalize() - i.loc[0, "BirthDate"].normalize()
output is: 7248 days 00:00:00
but is there any better method which give me just output 19 years?
If i use:
(i.loc[0, "BirthDate"] = pd.to_datetime('today').normalize() - i.loc[0, "BirthDate"].normalize())/365
the output is:
19 days 20:34:50.958904109 and its type is <class 'pandas.timedeltas.Timedelta'>
The timedelta result is wrong because you are dividing by 365 where you shouldn't. It actually means 19.86 years.
In more detail, you are taking a duration of 7248 days and dividing it by 365, so the result is a timedelta that is 1/365 of the input duration (about 19.86 days). The proper way to get the result in years is to divide by another timedelta that represents one year.
>>> from datetime import timedelta
>>> t = timedelta(days=7248)
>>> 7248/365.0
19.85753424657534
>>> print(t)
7248 days, 0:00:00
>>> t/timedelta(days=365)
19.85753424657534
>>> # years
How exactly to represent a year is not completely well-defined. You could use timedelta(days=365.2425) for the arithmetically correct length of a year, but then of course that produces odd results if you try to convert that back to a resolution where hours and minutes are important.
First, delete the last part of the timestamp and then the following python code can be applied:
from datetime import datetime, date
def calculate_age(born):
    born = datetime.strptime(born, "%d/%m/%Y").date()
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
df['Age'] = df['Date of birth'].apply(calculate_age)
print(df)
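If the column already holds pandas timestamps, as in the question, the string parsing step should not be needed. Below is a sketch of the same birthday comparison with an explicit reference date so the result is reproducible; age_on is a made-up name, and 2021-08-14 is my assumption for "today" (it is 7248 days after the birth date in the question).

```python
from datetime import date
import pandas as pd

def age_on(born, today):
    """Completed years between a birth date and a reference date."""
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

birth = pd.Timestamp("2001-10-10 11:01:04.343")
print(age_on(birth, date(2021, 8, 14)))
print(age_on(birth, date(2021, 10, 10)))
```

In real use you would pass date.today() as the second argument, e.g. df['BirthDate'].apply(age_on, today=date.today()).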
