I am trying to automate some data processing on an Excel file, and in the data processing file I use a macro that, as I understand it, converts the date column from "mm/dd/yyyy" into the number of days since January 1, 1900, as per Excel convention. When I try to do a similar thing in Python 2.7.4, the result is 2 days off. For example, 2/25/2015 becomes 42060 when done by the Excel macro, but when I run this code:
import datetime
gotDay = datetime.date(2015,2,25)
epoch = datetime.date(1900,1,1)
print((gotDay-epoch).days)
I get 42058.
I could totally just add 2 to everything in Python, but I was wondering why this happens.
Thanks!
The correct answer is 42058, so there's something wrong with the Excel macro.
My guess is that Excel is counting both the start and end date as a day, when really you're only interested in the time between those dates.
It's like asking how many days there are between today and the day after tomorrow. You could say two, which is what Python is doing, or you could say three (today, the day in between, and the day after tomorrow), which is what I'm guessing your Excel macro is doing.
This is an old thread, but since it came up in Google, I wanted to put in an answer. The off-by-two problem is explained like this:
One day comes from the fact that 1900-01-01 is day #1, not day #0. In Excel, if you convert 1900/1/1 into a number, it gives the answer 1, not 0. In Python, `print((datetime.date(1900,1,1)-datetime.date(1900,1,1)).days)` prints 0.
The other day comes from Excel reproducing a bug in Lotus 1-2-3 for backwards compatibility: Lotus 1-2-3 incorrectly assumed that 1900 was a leap year.
https://support.microsoft.com/en-us/help/214326/excel-incorrectly-assumes-that-the-year-1900-is-a-leap-year
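Putting the two offsets together, a small sketch of a conversion that reproduces Excel's serial numbers for dates after 1900-02-28:
import datetime

def excel_serial(d):
    # The 1899-12-30 epoch absorbs both the day-#1 offset and the phantom
    # 1900-02-29 that Excel inherited from Lotus 1-2-3.
    return (d - datetime.date(1899, 12, 30)).days

print(excel_serial(datetime.date(2015, 2, 25)))  # 42060, matching the Excel macro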
I am trying to write some simple code that, given a datetime, tells me whether I would miss the train or not, based on the time (hour/minute) of a train timetable stored in EDT.
For example, the time of one of the entries in the timetable is 1000 (I got that time in June).
Now, for example, if I set my time to t1 = datetime(2022, 7, 1, 9, 30, tzinfo=pytz.timezone('America/New_York')), the function would return True (I should be able to catch the train, since I get to the train station at 9:30am).
And if I input the time as t2 = datetime(2022, 12, 15, 9, 30, tzinfo=pytz.timezone('America/New_York')), it should return False, since now the train is leaving at 0900 New York time in December (but I do not want to manage all the messy conversion in my code).
One way I can think of is to look at the utcoffset() of the times in June and then add that to the UTC date in December to do the comparison, but I am not sure if there is something even simpler that does not involve converting to UTC.
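For what it's worth, one approach that avoids manual offset arithmetic is to keep the timetable entry at its fixed EDT offset and let aware-datetime comparison do the conversion. A minimal sketch, assuming the 10:00 EDT departure from the example (the function name is illustrative):
from datetime import datetime, time, timezone, timedelta
import pytz

EDT = timezone(timedelta(hours=-4))  # fixed offset the timetable was recorded in
DEPARTURE = time(10, 0)              # hypothetical timetable entry: 10:00 EDT

def catches_train(arrival):
    # Build the departure on the arrival's calendar day at the fixed EDT
    # offset; aware datetimes compare correctly across zones, so no manual
    # UTC conversion is needed.
    departure = datetime.combine(arrival.date(), DEPARTURE).replace(tzinfo=EDT)
    return arrival <= departure

ny = pytz.timezone('America/New_York')
print(catches_train(ny.localize(datetime(2022, 7, 1, 9, 30))))    # True
print(catches_train(ny.localize(datetime(2022, 12, 15, 9, 30))))  # False: 09:30 EST is 10:30 EDT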
I have a dataframe with dates in string format. I convert those dates to timestamps, so that I can use this date column in the later part of the code. Everything is fine with calculations/comparisons etc., but I would like the timestamp to appear in %d.%m.%Y format, as opposed to the default %Y-%m-%d. Let me illustrate:
dt=pd.DataFrame({'date':['09.12.1998','07.04.2014']},index=[1,2])
dt
Out[4]:
date
1 09.12.1998
2 07.04.2014
dt['date_1']=pd.to_datetime(dt['date'],format='%d.%m.%Y')
dt
Out[7]:
date date_1
1 09.12.1998 1998-12-09
2 07.04.2014 2014-04-07
I would like dt['date_1'] to be displayed in the same format as dt['date']. I don't wish to use the .strftime() function because it will convert the datatype from timestamp to string.
In a nutshell: how can I get pandas to display the timestamp in the format of my choice (months could be like APR, MAY, etc.), rather than the default format (like 1998-12-09), while keeping the data type as a timestamp rather than a string?
It seems pandas hasn't implemented this option yet:
https://github.com/pandas-dev/pandas/issues/11501
Having a look at https://pandas.pydata.org/pandas-docs/stable/options.html, it looks like you can set display options to achieve some of this, although not all.
display.date_dayfirst When True, prints and parses dates with the day first, eg 20/01/2005
display.date_yearfirst When True, prints and parses dates with the year first, eg 2005/01/20
So you can have day-first, but they haven't included names for months.
On a more fundamental level, whenever you're displaying something it is a string, right? I'm not sure why you wouldn't be able to convert it when you're displaying it, without having to change the original dataframe.
Your code would be:
pd.set_option("display.date_dayfirst", True)
except that this doesn't actually work:
https://github.com/pandas-dev/pandas/issues/11501
The options have been implemented for parsing, but not for displaying.
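For what it's worth, you can build the formatted version only when displaying, leaving the stored column as a timestamp. A small sketch using the question's own frame:
import pandas as pd

dt = pd.DataFrame({'date': ['09.12.1998', '07.04.2014']}, index=[1, 2])
dt['date_1'] = pd.to_datetime(dt['date'], format='%d.%m.%Y')

# Format only at display time; dt['date_1'] itself remains datetime64.
print(dt.assign(date_1=dt['date_1'].dt.strftime('%d.%m.%Y')))
print(dt['date_1'].dtype)  # datetime64[ns]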
Hello Stael/Cezar/Droravr, thank you all for providing your input. I value your time and appreciate your help a lot. Thanks for sharing the link https://github.com/pandas-dev/pandas/issues/11501 as well. I went through it and understood that this problem ultimately boils down to a display problem, as also explained by jreback. The request to have dates displayed in the format of your choice has been marked as an enhancement, so it will probably be added in a future version.
All I wanted was to have the dates exported as dd-mm-yyyy, and by formatting the string while exporting, we could solve this problem.
So I sorted this issue out by exporting the file with:
dt.to_csv(filename, date_format='%d-%m-%Y', index=False)
date date_1
09.12.1998 09-12-1998
07.04.2014 07-04-2014
Thus, this issue stands SOLVED.
Once again, thank you all for your kind help and the precious hours you spent with this issue. Deeply appreciated.
I am looking to create an if-then statement that involves the current time of day. For example, I want something like: if it is past 2pm, then run this function.
I have tried using the time module, but I can't seem to find a way to get just the time of day without the extra stuff like the date. Any help?
Here is a start, and I think it'll be enough for you to get to the answer and use it how you need.
import time
print(time.strftime("%Y-%m-%d %H:%M"))
print(time.strftime("I can format the date and time in many ways. Time is: %H:%M"))
Output (when I ran it):
2017-06-21 10:40
I can format the date and time in many ways. Time is: 10:40
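If the goal is specifically the "past 2pm" check, a minimal sketch that compares time-of-day objects directly (rather than formatted strings) might look like this:
from datetime import datetime, time

now = datetime.now().time()   # just the time of day, no date attached
if now > time(14, 0):         # later than 2:00 pm
    print("It is past 2pm, running the afternoon task")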
I get this error when trying to append two Pandas DFs together in a for loop:
Aggdata=Aggdata.append(Newdata)
This is the full error:
File "pandas\tslib.pyx", line 4096, in pandas.tslib.tz_localize_to_utc (pandas
\tslib.c:69713)
pytz.exceptions.NonExistentTimeError: 2017-03-12 02:01:24
However, in my files I do not have such a timestamp, but I do have ones like 03/12/17 00:45:26 or 03/12/17 00:01:24, which are 2 hours before the daylight-saving switch. And if I manually delete the offending row, I get the same error for the next row with a time between 12 and 1am on the 12th of March.
My original date/time column has no TZ info, but before the concatenation I calculate another column in EST and localize it, so it carries TZ information:
`data['EST_DateTimeStamp']=pd.DatetimeIndex(pd.to_datetime(data['myDate'])).tz_localize('US/Eastern').tz_convert('US/Eastern')`
Doing some research here, I understand that 2 to 3am on the 12th should produce such an error, but why midnight to 1am? Am I localizing it incorrectly? And why is the error raised on the append line, and not the localization line?
I was able to reproduce this behavior in a very simple MCVE, saved here:
https://codeshare.io/GLjrLe
It absolutely boggles my mind that the error is raised on the third append, and only if the next 3 appends follow. In other words, if I comment out the last 3 appends, it works fine... I can't imagine what is happening.
Thank you for reading.
In case someone else may still find this helpful:
Talking about it with #hashcode55, the solution was to upgrade Pandas on my server, as this was likely a bug in my previous version of that module.
The problem seems to occur at the daylight-saving switch: there are local times that do not exist, once per year. In the opposite direction there will be duplicate times.
This could be from, say, your input dates being converted from UTC to "local time" by adding a fixed offset. When you try to localize these you will hit nonexistent times over that hour (or 30 minutes if you are in Adelaide).
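On recent pandas versions (0.24 and later, if I recall correctly), tz_localize also lets you state explicitly what should happen to such nonexistent times instead of raising. A small sketch:
import pandas as pd

idx = pd.DatetimeIndex(['2017-03-12 01:30:00', '2017-03-12 02:30:00'])

# 02:30 does not exist in US/Eastern on 2017-03-12 (clocks jump from 02:00
# to 03:00); shift it forward (or use 'NaT') instead of raising an error.
print(idx.tz_localize('US/Eastern', nonexistent='shift_forward'))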
I have a file (dozens of columns and millions of rows) that essentially looks like this:
customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...
One record in the file might look like this:
1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...
When an account is closed, two things can happen. Typically, no further snapshots for that record will exist in the data. Occasionally, further records will continue to be added but the isOpen flag will be set to 0.
I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.
My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID, accountID, and isOpen set to 1, it populates the record with a 0 in the "closedInYr" field, otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.
Thanks
I suggest using the Linux "date" command to convert the dates to Unix timestamps.
A Unix timestamp is the number of seconds elapsed since 1 January 1970. So basically a year is 60s*60m*24h*365d seconds. If the difference between the timestamps is more than this number, then it is longer than a year.
It will be something like this:
>date --date='201106' "+%s"
1604642400
So if you use Perl, which is a pretty cool file-handling language, you can parse your whole file in a few lines and use eval on your date command.
If all the snapshots for a given record appear in one row, and records that were open for the same period of time have the same length (i.e., snapshots were taken at regular intervals), then one possibility might be filtering based on row lengths. If the longest open row has length N and one year's worth of records is M, then you know an N-M row was open, at most, one year less than the longest... That approach doesn't handle the case where snapshots keep getting added with the isOpen flag set to 0, but it might at least cut down the number of searches that need to be made per row.
At least, that's an idea. More generally, searching from the end to find the last year where isOpen == 1 might cut the search down a little...
Of course, this all assumes each record is in one row. If not, maybe a melt is in order first?
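For what it's worth, a vectorized sketch of the same idea in pandas (the file name, the header row, and the YYYYMM date format are assumptions based on the sample above):
import pandas as pd

# Hypothetical input with the columns described above, one snapshot per row.
df = pd.read_csv('snapshots.csv')
df['snapshotDate'] = pd.to_datetime(df['snapshotDate'], format='%Y%m')

# Last snapshot date on which each account was still open.
last_open = (df[df['isOpen'] == 1]
             .groupby(['customerID', 'accountID'], as_index=False)['snapshotDate']
             .max()
             .rename(columns={'snapshotDate': 'lastOpenDate'}))

df = df.merge(last_open, on=['customerID', 'accountID'], how='left')

# closedInYr = 1 when the account's last open snapshot falls within one year
# after this row's snapshot date, i.e. no open record exists a year later.
df['closedInYr'] = (df['lastOpenDate'] <
                    df['snapshotDate'] + pd.DateOffset(years=1)).astype(int)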