python efficient way to validate date


I have a large file with dates in %m%d%Y format, i.e. 12012013 for 1 December 2013.
I have to perform 2 things:
1) validate the date
2) store it in a list in sorted chronological format
for validation:
from datetime import datetime

def validate(date):
    try:
        return datetime.strptime(date, '%m%d%Y')
    except ValueError:
        return None
Using datetime takes a lot of time to parse the dates. Since the format is mmddyyyy, can I validate it efficiently without using datetime?
2) For chronological order: I don't want to convert to datetime and then sort. Is there a way to sort using the strings themselves? I have checked a lot of answers, but almost all of them assume that you already have a list and then sort it.
I want to insert each date at its sorted position as I go.

The datetime module is pretty good. Still, if you want another option, you can validate with a regular expression; see "match dates using python regular expressions".
To sort the dates without converting to datetime, convert each one to yyyymmdd format and then do a string sort, or create an empty list and insert each string at the correct position by comparing it with the values already there.
I'd suggest you try it yourself :)
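As a minimal sketch of the yyyymmdd idea, assuming the strings have already been validated, the standard-library bisect module can keep a list sorted while inserting. The helper name to_sortable and the sample values are made up for illustration.

```python
import bisect

def to_sortable(mmddyyyy):
    """Reorder 'mmddyyyy' into 'yyyymmdd' so that string order equals date order."""
    return mmddyyyy[4:] + mmddyyyy[:4]

sorted_dates = []
for raw in ['12012013', '01152012', '03072013']:
    # insort finds the position by binary search and inserts there,
    # so the list is sorted after every insertion.
    bisect.insort(sorted_dates, to_sortable(raw))

print(sorted_dates)
# -> ['20120115', '20130307', '20131201']
```

The position lookup is a binary search, but inserting into a Python list still shifts elements, so for very large inputs appending everything and sorting once at the end may be faster.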

If the format is %m%d%Y, the most efficient approach is a regex (some profiling has been done on this).
For instance:
import re
import datetime

match_date = re.compile(r'(\d{2})(\d{2})(\d{4})$').match
text = '12012013'
mo = match_date(text)
if mo:
    date = datetime.date(int(mo.group(3)), int(mo.group(1)), int(mo.group(2)))
    print(date)
    # -> 2013-12-01
That way, the regex does the first level of filtering and the date constructor the second (it raises ValueError for impossible dates such as month 13). Of course, you can tighten the regex; this one is trivial, for the demo.
If you know in advance that your dates are all valid, you can skip the conversion to date and use the tuple (year, month, day) for sorting instead.
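If the dates are not known to be valid, one possible way to avoid datetime entirely is to range-check the captured groups by hand. The sketch below is an assumption-laden variant of the answer's idea, not the answer's own code: parse_mmddyyyy is a hypothetical helper, and calendar.monthrange is used only to get the number of days in a month. It returns the (year, month, day) tuple, which sorts chronologically.

```python
import re
from calendar import monthrange

_DATE_RE = re.compile(r'(\d{2})(\d{2})(\d{4})')

def parse_mmddyyyy(text):
    """Return (year, month, day) if text is a valid mmddyyyy date, else None."""
    mo = _DATE_RE.fullmatch(text)
    if mo is None:
        return None
    month, day, year = (int(g) for g in mo.groups())
    if year < 1 or not 1 <= month <= 12:
        return None
    # monthrange returns (weekday of the 1st, number of days in the month).
    if not 1 <= day <= monthrange(year, month)[1]:
        return None
    return (year, month, day)

valid = [t for t in map(parse_mmddyyyy, ['12012013', '02302013', '01152012']) if t]
valid.sort()
print(valid)
# -> [(2012, 1, 15), (2013, 12, 1)]
```

Whether this is actually faster than datetime.strptime on your data is worth measuring before committing to it.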

Related

How do I select all rows of a specific date in a dataframe which contains datetime column with multiple times on that date? [duplicate]

I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only.
I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:
[dt.to_datetime().date() for dt in df.dates]
But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?

Since version 0.15.0 this can now be easily done using .dt to access just the date component:
df['just_date'] = df['dates'].dt.date
The above returns an object-dtype column of datetime.date values. If you want to keep datetime64, you can instead normalize the time component to midnight, which sets all the times to 00:00:00:
df['normalised_date'] = df['dates'].dt.normalize()
This keeps the dtype as datetime64, but the display shows just the date value.
pandas: .dt accessor
pandas.Series.dt
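To make the difference between the two results concrete, here is a small sketch on made-up data; the column names follow the answer above and the sample timestamps are invented.

```python
import pandas as pd

df = pd.DataFrame({
    'dates': pd.to_datetime(['2015-01-08 22:44:09', '2015-01-09 03:10:00']),
})

# Python datetime.date objects: the column becomes object dtype.
df['just_date'] = df['dates'].dt.date

# Still a datetime64 column, with every time set to midnight.
df['normalised_date'] = df['dates'].dt.normalize()

print(df.dtypes)
```

The object column loses vectorized datetime operations, so the normalized column is usually the better choice if you keep working with the data.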

Simple Solution:
df['date_only'] = df['date_time_column'].dt.date

While I upvoted EdChum's answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem: it still relies on Python datetime objects, so any operation on them will not be vectorized (that is, it will be slow).
A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not "keep only date part", since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:
printing to screen
saving to csv
using the column to groupby
... and it is much more efficient, since the operation is vectorized.
EDIT: in fact, the answer the OP would have preferred is probably "recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations".

Pandas v0.13+: Use to_csv with date_format parameter
Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.
Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:
df.to_csv(filename, date_format='%Y-%m-%d')
See Python's strftime directives for formatting conventions.
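A short sketch of the date_format approach on invented data. Calling to_csv without a path returns the CSV text, which makes the effect easy to see; the frame and column names here are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'dates': pd.to_datetime(['2015-01-08 22:44:09', '2015-01-09 03:10:00']),
    'value': [1, 2],
})

# The column stays datetime64 in memory; only the written text is formatted.
csv_text = df.to_csv(index=False, date_format='%Y-%m-%d')
print(csv_text)
```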

This is a simple way to extract the date:
import pandas as pd
d='2015-01-08 22:44:09'
date=pd.to_datetime(d).date()
print(date)

Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.
You can read more about it in this answer.
It can be used as ser.dt.normalize()

Just giving a more up to date answer in case someone sees this old post.
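Adding "utc=False" when converting to datetime worked for me to drop the timezone component and keep only the date in a datetime64[ns] data type. Note that utc=False is already the default of pd.to_datetime, so if the result is still timezone-aware, .dt.tz_localize(None) removes the timezone explicitly.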
pd.to_datetime(df['Date'], utc=False)
You will be able to save it in excel without getting the error "ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel."

df['Column'] = df['Column'].dt.strftime('%m/%d/%Y')
This gives you just the dates, with no time, in your desired format. You can change the format string '%m/%d/%Y' as needed. It changes the data type of the column to 'object'.
If you want just the dates, without the time, in YYYY-MM-DD format, use:
df['Column'] = pd.to_datetime(df['Column']).dt.date
The datatype will be 'object'.
For 'datetime64' datatype, use:
df['Column'] = pd.to_datetime(df['Column']).dt.normalize()

Converting to datetime64[D]:
df.dates.values.astype('M8[D]')
Though re-assigning that to a DataFrame col will revert it back to [ns].
If you wanted actual datetime.date:
import datetime
import numpy as np
dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])

I wanted to be able to change the type for a set of columns in a data frame and then remove the time, keeping the day. floor() truncates to the day; round() and ceil() are also available, but they can move a timestamp to the next day.
df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))

On tables of >1000000 rows I've found that these are both fast, with floor just slightly faster:
df['mydate'] = df.index.floor('d')
or
df['mydate'] = df.index.normalize()
If your index has timezones and you don't want those in the result, do:
df['mydate'] = df.index.tz_localize(None).floor('d')
df.index.date is many times slower; to_datetime() is even worse. Both have the further disadvantage that the results cannot be saved to an hdf store as it does not support type datetime.date.
Note that I've used the index as the date source here; if your source is another column, you would need to add .dt, e.g. df.mycol.dt.floor('d')

This worked for me on UTC Timestamp (2020-08-19T09:12:57.945888)
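# Convert each value with pd.Timestamp; assigning the whole column at once
# avoids chained assignment (df[col][i] = ...), which may not modify df.
df['YourColumnName'] = df['YourColumnName'].apply(pd.Timestamp)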

If the column is not already in datetime format:
df['DTformat'] = pd.to_datetime(df['col'])
Once it's in datetime format you can convert the entire column to date only like this:
df['DateOnly'] = df['DTformat'].apply(lambda x: x.date())

How do I convert a xsd:duration in date format in python?

I have been given a data set that has two rows with dates in the format xsd:duration which python accounts for as a string.
The format looks like PT3H20M (for 3h20min), or PT3H (for 3h) or PT30M (for 30m). How do you convert this format to date so that I can add the times and perform comparisons on them ?
Thanks for any help
EDIT : I'm specifically looking for any built-in package/function that I don't know about that would do that relatively easily.

I would suggest that you:
1) Extract the number of hours and minutes using a regular expression.
2) Use a datetime function to convert them to UNIX time or another format (datetime.timedelta is a natural fit for durations).
3) Make the comparison or sum.
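The steps above can be sketched with the standard library alone. This is a minimal sketch that assumes the durations only ever use the time part shown in the question (PT followed by optional hours, minutes and seconds); it does not handle years, months or days, and parse_duration is a hypothetical helper name.

```python
import re
from datetime import timedelta

# Time-only subset of xsd:duration, e.g. PT3H20M, PT3H, PT30M, PT45S.
_DURATION_RE = re.compile(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?')

def parse_duration(text):
    """Convert a time-only xsd:duration string into a timedelta."""
    mo = _DURATION_RE.fullmatch(text)
    if mo is None or not any(mo.groups()):
        raise ValueError('unsupported duration: %r' % text)
    hours, minutes, seconds = mo.groups()
    return timedelta(hours=int(hours or 0),
                     minutes=int(minutes or 0),
                     seconds=float(seconds or 0))

total = parse_duration('PT3H20M') + parse_duration('PT30M')
print(total)
# -> 3:50:00
print(parse_duration('PT30M') < parse_duration('PT3H'))
# -> True
```

timedelta objects support addition and comparison directly, which covers both operations asked for in the question.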

Detect missing date in string using dateutil in python

I have a string which contains a timestamp. This timestamp may or may not contain the date it was recorded. If it does not I need to retrieve it from another source. For example:
if the string is
str='11:42:27.619' #It does not contain a date, just the time
when I use dateutil.parser.parse(str), it will return me a datetime object with today's date. How can I detect when there is no date? So I can get it from somewhere else?
I can not just test if it is today's date because the timestamp may be from today and I should use the date provided in the timestamp if it exists.
Thank you

I would first check the string's length: if it contains a date it will be longer than a bare time. Then I would proceed as you describe.
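A length check depends on the exact layout of the strings, so a slightly stricter variant of the same idea is to test whether the string parses as a bare time. The sketch below swaps in the standard-library datetime.strptime for that pre-check instead of dateutil; the helper name is_time_only and the accepted time formats are assumptions based on the single example in the question.

```python
from datetime import datetime

# Assumed layouts for a timestamp that carries no date.
TIME_ONLY_FORMATS = ('%H:%M:%S.%f', '%H:%M:%S')

def is_time_only(stamp):
    """Return True if stamp matches one of the bare-time layouts."""
    for fmt in TIME_ONLY_FORMATS:
        try:
            datetime.strptime(stamp, fmt)
            return True
        except ValueError:
            continue
    return False

print(is_time_only('11:42:27.619'))
# -> True
print(is_time_only('2013-05-01 11:42:27.619'))
# -> False
```

When is_time_only returns True, the date can be fetched from the other source before the string is handed to the parser.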

Changing many variables to datetime in Pandas - Python

I have a dataset with around 1 million rows and I'd like to convert 12 columns to datetime. Currently they are "object" type. I previously read that I could do this with:
data.iloc[:,7:19] = data.iloc[:,7:19].apply(pd.to_datetime, errors='coerce')
This does work, but the performance is extremely poor. Someone else mentioned performance could be sped up by doing:
def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date:pd.to_datetime(date) for date in s.unique()}
    return s.apply(lambda v: dates[v])
However, I'm not sure how to apply this code to my data (I'm a beginner). Does anyone know how to speed up changing many columns to datetime using this code or any other method? Thanks!

If all your dates have the same format, you can define a date-parsing function and pass it as an argument when you import the data. First import datetime, then wrap datetime.strptime with your format.
Once that function is defined, set the parse_dates option of pandas' read_csv and pass your function through the date_parser option (date_parser=yourfunction). Note that date_parser was deprecated in pandas 2.0 in favour of date_format.
I would look up the pandas API for the specific syntax.
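For columns that are already loaded, one way to combine the lookup idea from the question with a known format is sketched below. It is a sketch under assumptions: the frame, the column names and the '%Y-%m-%d' format are invented, and this lookup variant takes the format as a hypothetical fmt argument. Passing an explicit format to pd.to_datetime usually helps as well, because pandas does not have to guess the layout.

```python
import pandas as pd

def lookup(s, fmt):
    """Parse each unique string once, then map the results back onto the column."""
    unique = pd.Series(s.dropna().unique())
    parsed = pd.to_datetime(unique, format=fmt, errors='coerce')
    mapping = dict(zip(unique, parsed))
    # The final to_datetime makes sure the result is a datetime64 column.
    return pd.to_datetime(s.map(mapping))

data = pd.DataFrame({
    'start': ['2020-01-01', '2020-01-01', 'bad'],
    'end': ['2020-01-02', '2020-01-03', '2020-01-03'],
})

date_cols = ['start', 'end']
# DataFrame.apply calls lookup once per column and forwards fmt to it.
data[date_cols] = data[date_cols].apply(lookup, fmt='%Y-%m-%d')
print(data)
```

For the question's layout, date_cols could be replaced by data.columns[7:19]. The gain depends on how often dates repeat; with mostly unique values the mapping step adds overhead instead of saving time.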

How can I retrieve a timestamp from a dictionary and convert it to a datetime object?

I have a dictionary of timestamps in the form 2011-03-01 17:52:49.728883, together with ids. How can I retrieve the latest timestamp and represent it as a Python datetime object?
The point is to be able to use the latest date among the timestamps instead of the current date in the code below.
latest = datetime.now()

The structure of your dictionary is not entirely clear from your question, so I'll assume the keys are strings containing timestamps like the one in your question.
If d is the dictionary:
datetime.strptime(max(d.keys()),'%Y-%m-%d %H:%M:%S.%f')
The above code uses the fact that the lexicographical ordering of your datetime strings sorts timestamps in chronological order. If you'd rather not rely on that, you could use:
max(map(lambda dt:datetime.strptime(dt,'%Y-%m-%d %H:%M:%S.%f'),d.keys()))
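As a runnable sketch of the second variant, assuming (as the answer does) that the keys are timestamp strings and the values are ids; the sample entries are invented.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S.%f'

# Hypothetical data: timestamp strings as keys, ids as values.
d = {
    '2011-03-01 17:52:49.728883': 101,
    '2011-02-27 09:15:00.000001': 102,
    '2011-03-01 08:00:00.500000': 103,
}

# Compare the keys by their parsed value, so the result does not depend on
# string ordering, and keep the key so the matching id can be looked up.
latest_key = max(d, key=lambda ts: datetime.strptime(ts, FMT))
latest = datetime.strptime(latest_key, FMT)

print(latest, d[latest_key])
# -> 2011-03-01 17:52:49.728883 101
```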
