Date difference: different results in Excel vs. Python

I have a pandas dataframe with two date columns that include timestamps (I want to keep the timestamps).
I want to get the difference in days between those two dates, so I used the line below. It works just fine.
mergethetwo['diff_days'] = (mergethetwo['todaydate'] - mergethetwo['LastLogon']).dt.days
The problem is, when I got the difference between those two dates in Excel, it gave me a different number.
In Python, for example, the difference between
5/15/2020 1:48:00 PM (LastLogon) and 6/21/2020 12:00:00 AM (todaydate) is 36.
However, in Excel, using
=DATEDIF(LastLogon, todaydate, "d")
the difference between 5/15/2020 1:48:00 PM and 6/21/2020 12:00:00 AM is 37 days!
Why the difference? Which one should I trust? I have 30,000+ rows, so I can't go through all of them to confirm.
Appreciate your support.
Thank you.

Excel's DATEDIF with "d" seems to count "started" days (dates, as the name of the function implies), while the Python timedelta gives the actual elapsed time, 36.425 days:
import pandas as pd
td = pd.to_datetime("6/21/2020 12:00:00 AM")-pd.to_datetime("5/15/2020 1:48:00 PM")
# Timedelta('36 days 10:12:00')
td.days
# 36
td.total_seconds() / 86400
# 36.425
You will get the same result if you do todaydate-LastLogon in Excel, without using any function.
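Conversely, if you want pandas to reproduce Excel's whole-date count, drop the time-of-day before subtracting. A minimal sketch, assuming the column names from the question:
import pandas as pd

# normalize() truncates each timestamp to midnight, so the subtraction
# counts calendar dates the way Excel's DATEDIF(..., "d") does
mergethetwo['diff_days_excel'] = (
    mergethetwo['todaydate'].dt.normalize()
    - mergethetwo['LastLogon'].dt.normalize()
).dt.days
For the example above this yields 37, matching Excel.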

Related

3-day and 7-day moving average

I have data indexed by time (2023-01-01 00:15:00+00:00) and I want to calculate 3-day and 7-day averages, i.e. say I want to predict what my value will be on 2023-01-10 00:15:00+00:00, I will have to find the average of the values I have on 2023-01-09 00:15:00+00:00, 2023-01-08 00:15:00+00:00 and 2023-01-07 00:15:00+00:00 (for the three-day average). I want to do this for all the values. How can I write the Python code that does this? Part of the data is shown in the picture, which shows how the dataframe looks.
I have tried to do some indexing using the date, working from the notion that I need 24-hour lookbacks, but to no avail.
Here is the code I wrote so far (in it I tried to see if I could print the value I have 24 hrs before 2023-01-22 23:30:00+00:00):
import datetime
import pandas as pd
data = pd.read_csv("output_saldo.csv")
print(data)
target_timestamp = datetime.datetime.strptime('2023-01-22 23:30:00+00:00', '%Y-%m-%d %H:%M:%S+00:00')
target_timestamp_24h = target_timestamp - datetime.timedelta(hours=24)
print(data.loc[pd.DatetimeIndex([target_timestamp_24h])])
print(target_timestamp_24h)
Error: No axis named 2023-01-21 23:30:00 for object type DataFrame
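One way to approach this: make the timestamps the DataFrame's index and use pandas' time-based rolling windows. A minimal sketch, assuming output_saldo.csv has a timestamp column named time and a value column named value (hypothetical names):
import pandas as pd

data = pd.read_csv("output_saldo.csv", parse_dates=["time"], index_col="time")
data = data.sort_index()  # time-based rolling requires a sorted DatetimeIndex

# '3D'/'7D' windows look back 3 and 7 calendar days from each row;
# closed="left" excludes the current row, so only earlier values are averaged
data["avg_3d"] = data["value"].rolling("3D", closed="left").mean()
data["avg_7d"] = data["value"].rolling("7D", closed="left").mean()

# the .loc error above also disappears once the index is a DatetimeIndex:
# data.loc["2023-01-21 23:30:00+00:00"]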

How to get difference in days between two dates from a csv in python?

So I'm trying to get the number of days between two dates ("today" and a prior date) from a csv file. I'm using the datetime library. I can get the delta (the difference between the two dates), but I can't get the number of days: if I try to use .days it returns an error stating it cannot convert a Series to datetime.
I just want to get the number of days as an int or float to use in some code I'm working on. Can anyone give me some direction? I've tried everything and done a lot of prior research here on Stack Overflow and other websites.
Here's the code below.
import pandas as pd
from datetime import datetime

df = pd.read_csv("serv.csv", parse_dates=["PasswordLastSet"])
for row in df:
    d1 = pd.to_datetime(datetime.today())
    d2 = df["PasswordLastSet"]
    delta = d1 - d2
    days = delta.datetime.days
I tried some different libraries, tried converting the delta to datetime, and tried turning it into a string and filtering that string, but no success. I just want to get the number of days, like "10" or "60", so I can do some math with it later on.
Convert PasswordLastSet to datetime and subtract it from today's date:
df['delta']=(datetime.today() - pd.to_datetime(df['PasswordLastSet'])).dt.days
df
PasswordLastSet delta
0 2019-09-10 11:11 1142
1 2019-10-07 11:13 1115
2 2019-11-04 11:16 1087
3 2019-11-28 11:20 1063
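Since delta is now a plain integer column, ordinary arithmetic works directly; for example (the 90-day threshold is just an illustration):
df['delta'].mean()             # average password age in days
stale = df[df['delta'] > 90]   # rows older than 90 days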

Date change halfway through csv from YYYY-MM-DD to DD/MM/YY and after switch datetime no longer works

I have a csv of daily temperature data with 3 columns: dates, daily maximum temperatures, and daily minimum temperatures. I attached it here so you can see what I mean.
I am trying to break this data set into smaller datasets of 30-year periods. For the first few years of Old.csv the dates are entered as YYYY-MM-DD, but they switch to DD/MM/YY in 1900. After the date format switches, my code to split out the years no longer works. Here is what I'm using:
import pandas as pd

df2 = pd.read_csv("Old.csv")
test = df2[
    (pd.to_datetime(df2['Date']) > pd.to_datetime('1897-01-01')) &
    (pd.to_datetime(df2['Date']) < pd.to_datetime('1899-12-31'))
]
and it works... BUT when I switch to 1900 and beyond it stops. So this one doesn't work:
test = df2[
    (pd.to_datetime(df2['Date']) > pd.to_datetime('1900-01-01')) &
    (pd.to_datetime(df2['Date']) < pd.to_datetime('1905-12-31'))
]
The above code gives me an empty data set, despite working pre-1900. I'm assuming this is some sort of formatting issue, but I thought that using .to_datetime would fix that. I also tried this:
df2['Date']=pd.to_datetime(df2['Date'])
to reformat the entire column before I ran the code above, but it still didn't work. The other interesting thing is that I have a separate csv with dates consistently entered as MM/DD/YY, and that one works with the code above. Could it be an issue with the turn of the century? Does anyone know how to fix this?
You're dealing with date data in different formats; for this you could use a more flexible parser, for instance dateutil.parser.
Example:
>>> from dateutil.parser import parse
>>> df
Date
0 1897-01-01
1 1899-12-31
2 01/01/00
>>> df.Date.apply(parse)
0   1897-01-01
1   1899-12-31
2   2000-01-01
Name: Date, dtype: datetime64[ns]
and use your function on the parsed data.
As remarked in the comment above, it's still not clear whether year "00" refers to year 1900 or 2000, but maybe you can infer that from the context of the csv file.
To map all dates in the DD/MM/YY format onto the 1900s you could define your own parse function:
>>> def my_parse(d):
... if d[-3]=='/':
... d = d[:-3]+'/19'+d[-2:]
... return parse(d)
>>> df.Date.apply(my_parse)
0 1897-01-01
1 1899-12-31
2 1900-01-01
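With the column parsed, the original between-dates filter works unchanged; for example:
df2['Date'] = df2.Date.apply(my_parse)
test = df2[(df2['Date'] > pd.to_datetime('1900-01-01')) &
           (df2['Date'] < pd.to_datetime('1905-12-31'))]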
Python is reading 00 as 2000 instead of 1900, so I tried this to make 00 read as 1900:
df2.Date.dt.year.replace(2000, 1900, inplace=True)
But Python returned an error saying dates are not directly editable. So I then changed them to strings and edited that way using:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
This works, but now I need a way to loop through 1896-1968 without having to type that line out for every year.
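Rather than string-replacing each two-digit year, one alternative sketch (assuming pandas >= 2.0 for format="mixed", and taking the 1968 cutoff from the question): parse both formats in one pass, then shift any year that lands in the 2000s back a century.
import pandas as pd

# parse YYYY-MM-DD and DD/MM/YY together
dates = pd.to_datetime(df2['Date'], format='mixed', dayfirst=True)

# two-digit years 00-68 are parsed as 2000-2068; the data ends in 1968,
# so anything past that cutoff really belongs 100 years earlier
dates = dates.where(dates.dt.year <= 1968, dates - pd.DateOffset(years=100))
df2['Date'] = dates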

Pandas read and parse Excel data that shows as a datetime, but shouldn't be a datetime

I have a system I am reading from that implemented a time-tracking function in a pretty poor way: it shows the tracked working time as [hh]:mm in the cell. This is problematic when reading the data, because when you click a cell the formula bar shows 11:00:00 PM, but what that 23:00 actually represents is 23 hours of time spent, not 11 PM. So whenever the time is 24:00 or more you end up with 1/1/1900 12:00:00 AM and on up (25:00 = 1/1/1900 01:00:00 AM).
So pandas picks up the 11:00:00 PM or 1/1/1900 01:00:00 AM when it comes into the dataframe. I am at a loss as to how to put this back into an int and get the number of hours as a whole number: 24, 25, 32, etc.
Can anyone help me figure out how to turn this horribly formatted data into a number of hours in int format?
If you want 1/1/1900 01:00:00 AM to represent 25 hours of elapsed time then this tells me your reference timestamp is 12/31/1899 00:00:00. Try the following:
time_delta = pd.Timestamp('1/1/1900 01:00:00 AM') - pd.Timestamp('12/31/1899 00:00:00')
# returns Timedelta('1 days 01:00:00')
You can get the total number of seconds by using the Timedelta.total_seconds() method:
time_delta.total_seconds()
# returns 90000.0
and then you could get the number of hours with
time_delta.total_seconds() / 3600.0
# returns 25.0
So try subtracting pd.Timestamp('12/31/1899 00:00:00') from your DatetimeIndex based on the year 1900 to get a TimedeltaIndex. You can then leave your TimedeltaIndex as is or convert it to a Float64Index with TimedeltaIndex.total_seconds().
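A minimal vectorized sketch of that idea, assuming the whole column came in as Timestamps anchored to the 1900 epoch (bare times would need to be normalized onto the same epoch first):
import pandas as pd

ref = pd.Timestamp('1899-12-31 00:00:00')
delta = pd.to_datetime(df['Planned working time']) - ref
df['hours_worked'] = delta.dt.total_seconds() / 3600.0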
pandas is not at fault; it's Excel that is interpreting the data wrongly.
Set that column to text and it won't be interpreted as a date;
then save the file, open it with pandas, and it should work fine.
Otherwise, export as CSV and try opening that in pandas.
Here is where I ended up, and it does work:
for i in range(len(df['Planned working time'])):
    pwt = str(df['Planned working time'][i])
    if len(pwt.split(' ')) > 1:
        # a date part means Excel rolled past 24:00, e.g. '1900-01-01 01:00:00'
        if pwt.split(' ')[0].split('-')[0] == '1900':
            workint = 24 * int(pwt.split(' ')[0].split('-')[2]) + int(pwt.split(' ')[1].split(':')[0])
    elif len(pwt.split(' ')) == 1:
        # a bare time like '23:00:00' already holds the hour count
        if pwt.split(' ')[0].split('-')[0] != '1900':
            workint = int(pwt.split(' ')[0].split(':')[0])
    df.at[i, 'Planned working time'] = workint  # set_value was removed in pandas 1.0
Any suggested improvements are welcome, but this produces the correct int values in all cases; tested on over 14K rows of data. It would likely need refining if minutes appeared, but no minutes show up in the data and the UI on the front end doesn't appear to allow them.

Python code to average values during certain time periods in monthly data

Hello everyone, I have a csv file which contains a month's worth of data in hourly intervals. I need to get the average value of one of the columns over the time interval 12:00 AM-3:00 AM for the entire month. I am using pandas.DataFrame to try to do this.
Sample of the data I am using:
DateTime          current      voltage
11/1/2014 12:00   1.122061402  4.058617834
11/1/2014 1:00    1.120534925  4.060912132
11/1/2014 2:00    1.119349897  4.058656072
11/1/2014 3:00    1.118277733  4.060912132
11/1/2014 4:00    1.120365636  4.060912132
11/1/2014 5:00    1.120365636  4.060912132
I'd like to average column 2 from 12 AM-3 AM every day for the entire month. I am thinking a conditional statement on the time would be a good option; however, I am unsure how to apply such a condition to date/time data.
I will assume that you have already imported the file into a Pandas dataframe named df.
Confirm that your "DateTime" field is being recognized by pandas as a datetime by checking the value of df.dtypes. If not, recast it, e.g. with:
df['DateTime'] = pd.to_datetime(df['DateTime'])
Double-check that times like 12 AM, 1 PM, etc. are being handled properly. (You have not indicated anything to distinguish 12 AM from 12 PM etc. in your dataset.) If not, you will need to devise an appropriate method to correct them or re-export them from the original source.
Create a DatetimeIndex from your DateTime field:
df = df.set_index(pd.DatetimeIndex(df['DateTime']))
Now take Dmitry's suggestion (lightly modified):
>>> df.between_time('0:00', '3:00').resample('1D').mean()
The index of the result will show the beginning of the time interval being averaged.
Edited to take into account new info in the comments.
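Putting the steps together, a minimal end-to-end sketch (the filename is hypothetical; note that between_time includes both endpoints by default, so 3:00 AM readings are counted):
import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['DateTime'])  # hypothetical filename
df = df.set_index(pd.DatetimeIndex(df['DateTime']))

# daily mean of 'current' over 12:00 AM-3:00 AM (both endpoints included)
daily_avg = df.between_time('0:00', '3:00')['current'].resample('1D').mean()
print(daily_avg.head())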
