PySpark: removing rows with an invalid datetime format in a column - python

My datetime field format is: 2016-10-15 00:00:00
After using schema inference while saving my data to a Parquet file, I have a few rows that don't comply with this format.
How can I remove them collectively in PySpark?
They are causing problems in my UDFs.

Assuming you're parsing the date column and rows with invalid dates end up as null, which is usually the case:
from pyspark.sql.functions import col

df.filter(col('date').isNotNull())
Alternatively, if your date is read as a string, you can parse it using unix_timestamp; values that don't match the pattern become null, so the same filter drops them:
from pyspark.sql.functions import col, unix_timestamp

(
    df
    # Rows that fail to parse yield null timestamps.
    .select(unix_timestamp('date', 'yyyy-MM-dd HH:mm:ss').cast("timestamp").alias('date'))
    # Keep only rows that parsed successfully.
    .filter(col('date').isNotNull())
)
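For a self-contained illustration, here is a minimal sketch of the whole approach; the sample rows and the 'date' column name are hypothetical:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one row matches the expected format, one does not.
df = spark.createDataFrame(
    [("2016-10-15 00:00:00",), ("15/10/2016",)],
    ["date"],
)

clean = (
    df
    .select(unix_timestamp("date", "yyyy-MM-dd HH:mm:ss").cast("timestamp").alias("date"))
    .filter(col("date").isNotNull())
)
clean.show()  # only the 2016-10-15 row remains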

Related

Python - Pandas: Datetime conversion to a specific format such as DD-MM-YYYY

I have a column emp_date which consists of different date formats such as mm/dd/yyyy, mm-dd-yyyy, and also dd-mm-yyyy, along with blank spaces, and some with timestamps. The data type of this column is object.
I want to convert these dates into one specific format, DD-MM-YYYY.
Since the column has multiple formats and blank spaces along with my target format, I am getting different errors.
Input file: CSV file
emp_date Column
10-07-2013
1/15/2012
Blank space or Null value
1/15/2023
12/13/2021
1-15-2021
Blank space or Null value
5/31/2013
Blank space or Null value
209-06-13 00:00:00
Code:
import pandas as pd

col = 'Previous Employment Start Date'
CorePreviousWorkexp_bkp[col] = pd.to_datetime(CorePreviousWorkexp_bkp[col], format='%d-%m-%Y')
or
import datetime
import pandas as pd

def format(val):
    # First attempt a lenient parse, then render as a string for re-parsing.
    a = pd.to_datetime(val, errors='coerce', cache=False).strftime('%m/%d/%Y')
    try:
        date_time_obj = datetime.datetime.strptime(a, '%d/%m/%Y')
    except ValueError:
        date_time_obj = datetime.datetime.strptime(a, '%m/%d/%Y')
    return date_time_obj.date()
Output: multiple errors due to the different formats and blank spaces.
Expected format: DD-MM-YYYY
How can I achieve this format?
Usually pd.to_datetime does a pretty good job with differently formatted dates.
I would try pd.to_datetime(df[col]) and NOT specify the format you are looking for. This allows the function to consider multiple date formats.
After this you can fill the gaps with df[col].fillna(somedate)
and then reformat df[col] as you please.
EDIT: to include @FObersteiner's comment below: for dates where mm and dd could be confused, for example 1/5/2020 vs 5/1/2020, you would likely need to parse these out yourself. Only you know which is correct.
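A minimal sketch of that approach, using sample values from the question (note that on pandas >= 2.0 you may need format='mixed' for per-element format inference; the placeholder fill date is an arbitrary choice):
import pandas as pd

s = pd.Series(['10-07-2013', '1/15/2012', None, '5/31/2013'])

# Lenient parse: blanks and unparseable strings become NaT.
# On pandas >= 2.0, add format='mixed' to infer the format per element.
dates = pd.to_datetime(s, errors='coerce')

# Fill the gaps with a placeholder date, then render as DD-MM-YYYY.
dates = dates.fillna(pd.Timestamp('1900-01-01'))
print(dates.dt.strftime('%d-%m-%Y'))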

How to write pandas date column to Databricks SQL database

I have a pandas dataframe column that has string values in the format YYYY-MM-DD HH:MM:SS.mmmmmmm, for example 2021-12-26 21:10:18.6766667. I have verified that all values are in this format, with fractional seconds in 7 digits. But the following code throws a conversion error (shown below) when it tries to insert data into an Azure Databricks SQL database:
Conversion failed when converting date and/or time from character string
Question: What could be the cause of the error and how can we fix it?
Remark: After conversion, the initial value (for example 2021-12-26 21:10:18.6766667) even gains two more digits at the end, becoming 2021-12-26 21:10:18.676666700, with 9-digit fractional seconds.
import sqlalchemy as sq
import pandas as pd
import datetime

data_df = pd.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
data_df['OrderDate'] = data_df['OrderDate'].astype('datetime64[ns]')
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
Keep the dates as plain strings, without converting them with to_datetime.
This is because Databricks SQL is based on SQLite, and SQLite expects date strings:
"In the case of SQLite, date and time types are stored as strings which are then converted back to datetime objects when rows are returned."
If the raw date strings still don't work, convert them with to_datetime and reformat into a safe format using dt.strftime:
df['OrderDate'] = pd.to_datetime(df['OrderDate']).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
Or if the column is already datetime, use dt.strftime directly:
df['OrderDate'] = df['OrderDate'].dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
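For illustration, a minimal sketch of what the truncation does, using the sample value from the question (%f renders six fractional digits, and .str[:-3] keeps three):
import pandas as pd

s = pd.Series(['2021-12-26 21:10:18.6766667'])
dt = pd.to_datetime(s)  # pandas stores this at nanosecond precision

# %f prints microseconds (6 digits); dropping the last 3 leaves milliseconds.
safe = dt.dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
print(safe[0])  # 2021-12-26 21:10:18.676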

How to convert float value to date9 format in pandas

Basically I am a SAS developer.
At the moment I am doing SAS-to-Python migrations.
Before reading into a pandas dataframe I have two columns, i.e.:
DATE       NAME
01JAN1988  VARUN
11JAN1999  THARUN
After reading into a pandas dataframe, the DATE column is automatically read as float values. Now I need to show the DATE column in date9 format.
Could you please provide the steps?
You can use the apply function to convert the values into date objects, with the datetime module doing the parsing:
import datetime

df['DATE'] = df['DATE'].apply(lambda x: datetime.datetime.strptime(x, '%d%b%Y').date())
Output:
        DATE    NAME
0 1988-01-01   VARUN
1 1999-01-11  THARUN
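If the goal is to render the dates in SAS date9. style (e.g. 01JAN1988) rather than ISO format, here is a small follow-up sketch; the vectorized pd.to_datetime call and the uppercasing step are suggestions, not part of the original answer:
import pandas as pd

df = pd.DataFrame({'DATE': ['01JAN1988', '11JAN1999'], 'NAME': ['VARUN', 'THARUN']})

# Vectorized parse; '%d%b%Y' matches strings like 01JAN1988.
df['DATE'] = pd.to_datetime(df['DATE'], format='%d%b%Y')

# date9. style: two-digit day, three-letter month, four-digit year, uppercase.
df['DATE9'] = df['DATE'].dt.strftime('%d%b%Y').str.upper()
print(df)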

Getting columns with datetime format such as (2017-02-12 10:23:55 AM)[YYYY-MM-dd hh:mm:ss AM/PM] using pandas

I recently asked a question about identifying all the columns which are datetime. Here it is: Get all columns with datetime type using pandas?
The answer was correct for a proper datetime format; however, I now realize my data isn't a proper datetime. It is a string formatted like "2017-02-12 10:23:55 AM", and I was advised to create a new question.
I have a huge dataframe with an unknown number of datetime columns, where I know neither their names nor their positions. How do I identify the column names of the datetime columns whose values have a format such as YYYY-MM-dd hh:mm:ss AM/PM?
One way to do this would be to test for successful conversion:
import pandas as pd

def is_datetime(datetime_string):
    try:
        pd.to_datetime(datetime_string)
        return True
    except ValueError:
        return False
With this:
dt_columns = [c for c in df.columns if is_datetime(df[c][0])]
Note: This tests for any string that can be converted to a datetime.
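A quick end-to-end illustration using the is_datetime helper defined above; the dataframe contents and column names are hypothetical:
import pandas as pd

df = pd.DataFrame({
    'created': ['2017-02-12 10:23:55 AM', '2017-02-13 11:00:00 PM'],
    'name': ['alice', 'bob'],
})

dt_columns = [c for c in df.columns if is_datetime(df[c][0])]
print(dt_columns)  # ['created'] -- 'alice' fails to parse, so 'name' is excluded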

incorrect date format when writing df to csv pandas

I convert a string to a date using pandas.
When I write the DF to CSV, the date comes out like '2016-08-15 instead of plain 2016-08-15, and my ETL tool is unable to read it as a date. The same is the case for all date fields.
Any suggestion to get the date format right?
df =pd.read_csv(r'/Users/tcssig/Documents/ABP_News_Aug01.csv', parse_dates=['Dates'])
df.to_csv('/Users/tcssig/Documents/Sarang.csv')
You can try this:
df = pd.read_csv(r'/Users/tcssig/Documents/ABP_News_Aug01.csv')
df['date'] = pd.to_datetime(df['date'])
df.to_csv('/Users/tcssig/Documents/Sarang.csv')
(assuming the name of the date field is 'date')
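If the ETL tool still rejects the output, you can also control the serialization explicitly with the date_format parameter of to_csv (a sketch reusing the question's paths and column name):
import pandas as pd

df = pd.read_csv(r'/Users/tcssig/Documents/ABP_News_Aug01.csv', parse_dates=['Dates'])

# date_format controls how datetime columns are written to the CSV.
df.to_csv('/Users/tcssig/Documents/Sarang.csv', date_format='%Y-%m-%d', index=False)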
