Vaex Datetime comparison - python

I have a vaex dataframe that is read from an HDF5 file. It has a date column that is read as a string, which I converted to datetime. I can extract the day, month, year, etc. from the date, so the conversion is correct, but I am not able to do any date comparisons. How do I perform operations such as checking that a date is between x and y?
import vaex
import datetime
vaex_df=vaex.open('filename.hdf5')
vaex_df['pDate']=vaex_df.Date.values.astype('datetime64[ns]')
The datatypes are as expected
print(vaex_df.dtypes)
## Date <class 'str'>
## pDate datetime64[ns]
Now I need to filter out rows based on some date
start_date=datetime.date(2019,10,1)
vaex_df=vaex_df[(vaex_df.pDate.dt>=start_date)]
print(vaex_df) # throws SyntaxError: invalid token
I get an invalid token when I try to look at the new dataframe.
I can extract the month and year separately and apply the filter, but that would give a wrong result:
vaex_df=vaex_df[(vaex_df.pDate.dt.month>int(str(start_date)[5:7]))&(vaex_df.pDate.dt.year>=int(str(start_date)[:4]))]
How do I do date range comparison operations in vaex?

A NumPy datetime works for the comparison.
import numpy as np
# Instead of
start_date=datetime.date(2019,10,1)
# use
start_date=np.datetime64('2019-10-01')
Then, on the vaex dataframe, compare the column directly (without .dt):
vaex_df=vaex_df[(vaex_df.pDate>=start_date)]
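The question also asks about "between x and y". A minimal sketch of that range check follows, using a plain NumPy array as a stand-in for the pDate column (the sample dates and the names p_dates, end_date, and in_range are made up for illustration). It also shows that an existing datetime.date can be wrapped in np.datetime64 rather than retyped as a string.

```python
import datetime
import numpy as np

# Hypothetical sample values standing in for the pDate column.
p_dates = np.array(
    ["2019-09-15", "2019-10-01", "2019-11-20", "2020-01-05"],
    dtype="datetime64[ns]",
)

# An existing datetime.date can be converted instead of retyped.
start_date = np.datetime64(datetime.date(2019, 10, 1))
end_date = np.datetime64("2019-12-31")

# Inclusive range check: two comparisons combined with &.
# The parentheses are required because & binds tighter than >= and <=.
in_range = (p_dates >= start_date) & (p_dates <= end_date)
selected = p_dates[in_range]
print(selected)
```

Following the answer above, the same boolean expression would presumably work as a vaex filter, e.g. vaex_df[(vaex_df.pDate >= start_date) & (vaex_df.pDate <= end_date)], but that line is an untested assumption here.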

Related

What is the Vaex function to parse a string to datetime64 with a custom format, equivalent to pandas to_datetime?

I have a date as a string (example: 3/24/2020) that I would like to convert to datetime64[ns] format. In pandas I would write:
df2['date'] = pd.to_datetime(df1["str_date"], format='%m/%d/%Y')
Using pandas to_datetime on a vaex dataframe results in an error:
ValueError: time data 'str_date' does not match format '%m/%d/%Y' (match)
I have seen a possibly duplicate question, whose answer is:
df2['pdate']=df2.date.astype('datetime64[ns]')
However, that answer is a type cast. My case requires parsing the string with a format ('%m/%d/%Y') into datetime64[ns], not just a type cast.
Solution: write a custom function, then use .apply
vaex supports apply for object operations, so you can use datetime and np.datetime64 to convert each date string, then apply the function to the column.
import numpy as np
from datetime import datetime
def convert_to_datetime(date_string):
    # The format string must match the data; "%m/%d/%Y" fits the question's '3/24/2020'
    return np.datetime64(datetime.strptime(str(date_string), "%m/%d/%Y"))
df['date'] = df.date.apply(convert_to_datetime)
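A per-value converter like the one above raises ValueError on the first string that does not match the format. As a hedged sketch, here is a variant that returns NaT instead, so one bad row does not abort the whole apply. The name safe_convert and the default format are assumptions for illustration, not part of the original answer.

```python
from datetime import datetime
import numpy as np

def safe_convert(date_string, fmt="%m/%d/%Y"):
    """Parse one date string; return NaT when it does not match fmt."""
    try:
        return np.datetime64(datetime.strptime(str(date_string), fmt))
    except ValueError:
        # Not-a-Time marker, the datetime64 counterpart of NaN.
        return np.datetime64("NaT")

good = safe_convert("3/24/2020")
bad = safe_convert("00/00/0000")
print(good, bad)
```

It would be passed to apply in the same way as convert_to_datetime.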

How to convert float value to date9 format in pandas

I am a SAS developer, currently working on SAS-to-Python migrations.
Before reading the data into a pandas dataframe, I have two columns, i.e.:
DATE NAME
01JAN1988 VARUN
11JAN1999 THARUN
After reading the data into a pandas dataframe, the DATE column is automatically read as float values. I now need to show the DATE column in DATE9 format.
Could you please provide the steps?
You can use the apply function together with the datetime module to convert the values into date objects (this needs import datetime, and strptime expects the values to be strings such as 01JAN1988):
df['DATE'] = df['DATE'].apply(lambda x: datetime.datetime.strptime(x,'%d%b%Y').date())
Output:
DATE NAME
0 1988-01-01 VARUN
1 1999-01-11 THARUN
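The answer above assumes the DATE values are still strings. If the column really arrived as floats, one plausible explanation is that they are raw SAS date values, which count days since 1 January 1960. Under that assumption (not confirmed by the question), a sketch of rendering them in DATE9 style could look like this; sas_to_date9 and SAS_EPOCH are illustrative names.

```python
from datetime import date, timedelta

# SAS date values count days from this epoch.
SAS_EPOCH = date(1960, 1, 1)

def sas_to_date9(value):
    """Convert a SAS day count (int or float) to a DATE9-style string."""
    d = SAS_EPOCH + timedelta(days=int(value))
    # %b gives the abbreviated month name (English in the default C locale);
    # DATE9 shows it in upper case.
    return d.strftime("%d%b%Y").upper()

print(sas_to_date9(10227.0))
print(sas_to_date9(14255.0))
```

On a dataframe this could be applied as df['DATE'].apply(sas_to_date9). The result is a string column, since DATE9 is a display format rather than a storage type.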

Getting columns with datetime format such as (2017-02-12 10:23:55 AM)[YYYY-MM-dd hh:mm:ss AM/PM] using pandas

I recently asked a question about identifying all the columns which are datetime. Here it is: Get all columns with datetime type using pandas?
The answer was correct for a proper date time format, however, I now realize my data isn't proper date time, it is a string formatted like "2017-02-12 10:23:55 AM" and I was advised to create a new question.
I have a huge dataframe with an unknown number of date time columns, where I do not know their names nor their position. How do I identify the column names of the date time columns which have the date of format such as YYYY-MM-dd hh:mm:ss AM/PM?
One way to do this would be to test for successful conversion:
def is_datetime(datetime_string):
    try:
        pd.to_datetime(datetime_string)
        return True
    except ValueError:
        return False
With this:
dt_columns = [c for c in df.columns if is_datetime(df[c][0])]
Note: This matches any string that can be converted to a datetime, not only the format above, and it checks only the first value of each column.
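If only the specific "YYYY-MM-dd hh:mm:ss AM/PM" layout should count, a stricter check can pass an explicit format and look at several values per column. The following is a sketch under those assumptions; matches_format, FMT, sample_size, and the sample dataframe are made up for illustration.

```python
import pandas as pd

# strptime-style pattern for values such as "2017-02-12 10:23:55 AM".
FMT = "%Y-%m-%d %I:%M:%S %p"

def matches_format(series, fmt=FMT, sample_size=5):
    """True if the first few non-null values all parse with fmt."""
    sample = series.dropna().astype(str).head(sample_size)
    if sample.empty:
        return False
    # errors="coerce" turns non-matching values into NaT instead of raising.
    parsed = pd.to_datetime(sample, format=fmt, errors="coerce")
    return bool(parsed.notna().all())

df = pd.DataFrame({
    "created": ["2017-02-12 10:23:55 AM", "2017-02-13 01:05:00 PM"],
    "name": ["a", "b"],
    "amount": [1.5, 2.0],
})
dt_columns = [c for c in df.columns if matches_format(df[c])]
print(dt_columns)
```

Sampling only a few rows keeps the check cheap on a huge dataframe, at the cost of missing columns whose bad values appear later.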

Converting objects from CSV into datetime

I've got an imported CSV file which has multiple columns with dates in the format "5 Jan 2001 10:20". (Note that the day is not zero-padded.)
If I check df.dtypes, it shows the columns as object rather than string or datetime. I need to be able to subtract two column values to work out the difference, so I'm trying to get them into a state where I can do that.
At the moment if I try the test subtraction at the end I get the error unsupported operand type(s) for -: 'str' and 'str'.
I've tried multiple methods but have run into a problem every way I've tried.
Any help would be appreciated. If I need to give any more information then I will.
As suggested by @MaxU, you can use the pd.to_datetime() method to convert the values of the given column to the appropriate dtype, like this:
df['datetime'] = pd.to_datetime(df.datetime)
You would have to do this for every column that needs to be transformed to the right dtype.
Alternatively, you can use parse_dates argument of pd.read_csv() method, like this:
df = pd.read_csv(path, parse_dates=[1,2,3])
where columns 1,2,3 are expected to contain data that can be interpreted as dates.
I hope this helps.
Convert a column to datetime using this approach:
df["Date"] = pd.to_datetime(df["Date"])
If the column has empty or unparseable values, set errors to 'coerce' so that they become NaT instead of raising an error:
df["Date"] = pd.to_datetime(df["Date"], errors='coerce')
After which you should be able to subtract two dates.
example:
import pandas
df = pandas.DataFrame(columns=['to','fr','ans'])
df.to = [pandas.Timestamp('2014-01-24 13:03:12.050000'), pandas.Timestamp('2014-01-27 11:57:18.240000'), pandas.Timestamp('2014-01-23 10:07:47.660000')]
df.fr = [pandas.Timestamp('2014-01-26 23:41:21.870000'), pandas.Timestamp('2014-01-27 15:38:22.540000'), pandas.Timestamp('2014-01-23 18:50:41.420000')]
(df.fr-df.to).astype('timedelta64[h]')
consult this answer for more details:
Calculate Pandas DataFrame Time Difference Between Two Columns in Hours and Minutes
If you want to directly load the column as datetime object while reading from csv, consider this example :
Pandas read csv dateint columns to datetime
I found that the problem was caused by missing values within the column. Using coerce=True, i.e. df["Date"] = pd.to_datetime(df["Date"], coerce=True), solves the problem. (In current pandas versions this keyword is spelled errors='coerce', as shown above.)
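Putting the answers together for the "5 Jan 2001 10:20" layout from the question, a minimal sketch might look like the following. The column names start and end and the sample rows are invented; the format string is an assumption derived from the example value in the question.

```python
import pandas as pd

# Hypothetical stand-in for the imported CSV; one value is deliberately invalid.
df = pd.DataFrame({
    "start": ["5 Jan 2001 10:20", "12 Feb 2001 08:00"],
    "end": ["5 Jan 2001 12:50", "not a date"],
})

# Day, abbreviated month name, year, 24-hour time.
fmt = "%d %b %Y %H:%M"
for col in ["start", "end"]:
    # Invalid or missing entries become NaT rather than raising.
    df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")

# Subtraction now yields timedeltas; convert them to fractional hours.
df["hours"] = (df["end"] - df["start"]).dt.total_seconds() / 3600
print(df)
```

Going through total_seconds() avoids depending on which timedelta units astype accepts in a given pandas version.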

Python cleaning dates for conversion to year only in Pandas

I have a large data set that users entered into a CSV. I converted the CSV into a dataframe with pandas. The column has over 1000 entries; here is a sample:
datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013
Then I tried converting the dates into years using
df['year']=df['datestart'].astype('timedelta64[Y]')
But it gave me an error:
ValueError: Value cannot be converted into object Numpy Time delta
Using Datetime64
df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')
it gave:
ValueError: Error parsing datetime string "03/13/2014" at position 2
Since that column was filled in by users, the majority is in the format MM/DD/YYYY, but some data was entered like this: Feb 10 2013, and there was one entry like this: 00/00/0000. I am guessing the different formats broke the processing.
Is there a try loop, an if statement, or something similar that lets me skip over problems like these?
If datetime parsing fails I will be forced to use a str.extract script, which also works:
year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")
del df['month'], df['day']
and use concat to take the year out.
With df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]') The error message is:
Message File Name Line Position
Traceback
<module> C:\Users\0\Desktop\python\Example.py 23
astype C:\Python33\lib\site-packages\pandas\core\generic.py 2062
astype C:\Python33\lib\site-packages\pandas\core\internals.py 2491
apply C:\Python33\lib\site-packages\pandas\core\internals.py 3728
astype C:\Python33\lib\site-packages\pandas\core\internals.py 1746
_astype C:\Python33\lib\site-packages\pandas\core\internals.py 470
_astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]
You first have to convert the column with the date values to datetimes with to_datetime():
df['datestart'] = pd.to_datetime(df['datestart'], coerce=True)
This should normally parse the different formats flexibly (coerce=True is important here, as it converts invalid dates to NaT; in current pandas versions it is spelled errors='coerce').
If you then want the year part of the dates, you can do the following (calling astype directly on the pandas column seems to give an error, but with values you get the underlying NumPy array):
df['datestart'].values.astype('datetime64[Y]')
The problem is that this again gives an error when you assign the result to a column, due to the NaT value (this seems to be a bug; you can work around it with df = df.dropna()). Also, when you assign it to a column, it gets converted back to datetime64[ns], as this is how pandas stores datetimes. So if you want a column with the years, I personally think it is better to do the following:
df['year'] = pd.DatetimeIndex(df['datestart']).year
This last one will return the year as an integer.
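The question also asked for a "try loop" that skips bad entries. As an alternative sketch that does not depend on how a particular pandas version handles mixed formats, each value can be tried against a short list of known layouts with the standard library. FORMATS and extract_year are illustrative names, and the two layouts are assumptions based on the sample data.

```python
from datetime import datetime

# Layouts seen in the sample: "5/5/2013" and "Feb 1 2013".
FORMATS = ("%m/%d/%Y", "%b %d %Y")

def extract_year(text):
    """Return the year as an int, or None if no known layout matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(str(text).strip(), fmt).year
        except ValueError:
            continue  # try the next layout
    return None

samples = ["5/5/2013", "Feb 1 2013", "00/00/0000", "12/21/2009"]
years = [extract_year(s) for s in samples]
print(years)
```

On the dataframe this could be used as df['year'] = df['datestart'].apply(extract_year). Entries such as 00/00/0000 end up as missing values rather than stopping the conversion.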
