How to write pandas date column to Databricks SQL database - python

I have a pandas dataframe column that has string values in the format YYYY-MM-DD HH:MM:SS.mmmmmmm, for example 2021-12-26 21:10:18.6766667. I have verified that all values are in this format, with the fractional seconds always 7 digits long. But the following code throws a conversion error (shown below) when it tries to insert data into an Azure Databricks SQL database:
Conversion failed when converting date and/or time from character string
Question: What could be the cause of this error, and how can we fix it?
Remark: After conversion, the initial value (for example 2021-12-26 21:10:18.6766667) even gains two more digits at the end, becoming 2021-12-26 21:10:18.676666700, with 9 fractional digits.
import sqlalchemy as sq
import pandas as pd
import datetime

# engine is assumed to be an existing SQLAlchemy engine for the target database
data_df = pd.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
data_df['OrderDate'] = data_df['OrderDate'].astype('datetime64[ns]')
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})

Keep the dates as plain strings without converting them with to_datetime, or reformat them so the fractional seconds fit the target column.
The error is most likely a precision problem rather than a string problem. Pandas stores datetimes as datetime64[ns] (nanosecond resolution), which is why 2021-12-26 21:10:18.6766667 picks up two trailing zeros and becomes 2021-12-26 21:10:18.676666700 with 9 fractional digits. A SQL DATETIME column only accepts up to 3 fractional digits when converting from a string, so the longer values fail with the error above.
If the raw date strings still don't work, convert them with to_datetime and reformat into a safe millisecond-precision format using dt.strftime (%f produces 6 digits; .str[:-3] trims that to the 3 digits DATETIME can hold):
df['OrderDate'] = pd.to_datetime(df['OrderDate']).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
Or if the column is already datetime, use dt.strftime directly:
df['OrderDate'] = df['OrderDate'].dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
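Alternatively, if the target table lives in SQL Server (which is where this error message comes from), you can declare the column as DATETIME2 instead of DATETIME so that more fractional digits are accepted. A minimal sketch, assuming a SQL Server target, an existing engine, and the names from the question (note that Python datetimes carry at most 6 fractional digits, so the 7th digit from the source strings is still rounded on the way in):
import pandas as pd
import sqlalchemy as sq
from sqlalchemy.dialects.mssql import DATETIME2

# engine is assumed to be an existing SQLAlchemy engine for the target database
data_df = pd.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"')
data_df['OrderDate'] = pd.to_datetime(data_df['OrderDate'])

data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      # DATETIME2(7) accepts up to 7 fractional digits, unlike DATETIME's 3
                      'OrderDate': DATETIME2(precision=7)})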

Related

How to correctly convert the column in csv that contains the dates into JSON

In my csv file, the dates in the "ESTABLİSHMENT DATE" column are delimited by slashes, like this: 01/22/2012.
I am converting the csv format into the JSON format, which needs to be done with pandas, but the "ESTABLİSHMENT DATE" column isn't correctly translated to JSON.
df = pd.read_csv(my_csv)
df.to_json("some_path", orient="records")
I don't understand why it awkwardly adds the backslashes:
"ESTABLİSHMENT DATE":"01\/22\/2012",
However, I need to write the result to a file as the following:
"ESTABLİSHMENT DATE":"01/22/2012",
Forward slash in json file from pandas dataframe explains why the backslashes are added, and this answer shows how to use the json library to solve the issue; a short sketch of that workaround is below.
As long as the date format is 01/22/2012, the / will be escaped with \.
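A minimal sketch of the json-library workaround, assuming the standard-library json module is acceptable (json.dumps does not escape forward slashes the way DataFrame.to_json does):
import json
import pandas as pd

df = pd.DataFrame({'date': ['01/22/2012']})
# json.dumps leaves forward slashes unescaped, unlike DataFrame.to_json
print(json.dumps(df.to_dict(orient='records')))
# [{"date": "01/22/2012"}]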
To correctly convert a csv column containing dates to JSON with pandas, convert the 'date' column to a proper datetime dtype and then cast it to string before calling .to_json.
2012-01-22 is the correct datetime format, but .to_json serializes datetimes as epoch milliseconds by default (1327190400000), so after using pd.to_datetime to parse the %m/%d/%Y strings, the column type must be set to a string.
import pandas as pd

# test dataframe
df = pd.DataFrame({'date': ['01/22/2012']})

# display(df)
#          date
# 0  01/22/2012

# to JSON: the forward slashes get escaped
print(df.to_json(orient='records'))
# [out]: [{"date":"01\/22\/2012"}]

# set the date column to a proper datetime
df.date = pd.to_datetime(df.date, format='%m/%d/%Y')

# display(df)
#         date
# 0 2012-01-22

# to JSON: datetimes are serialized as epoch milliseconds
print(df.to_json(orient='records'))
# [out]: [{"date":1327190400000}]

# set the date column type to string
df.date = df.date.astype(str)

# to JSON: the dates now come through as plain strings
print(df.to_json(orient='records'))
# [out]: [{"date":"2012-01-22"}]

# as a single line of code, starting again from the original strings
df = pd.DataFrame({'date': ['01/22/2012']})
df.date = pd.to_datetime(df.date, format='%m/%d/%Y').astype(str)
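A related option, not in the original answer: if an ISO 8601 timestamp is acceptable instead of a bare date string, .to_json can produce ISO output directly via date_format (the exact fractional/timezone suffix varies across pandas versions):
df = pd.DataFrame({'date': pd.to_datetime(['01/22/2012'], format='%m/%d/%Y')})
print(df.to_json(orient='records', date_format='iso'))
# e.g. [{"date":"2012-01-22T00:00:00.000"}]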

How to extract a date from a SQL Server Table and store it in a variable in Pandas without noise, only the date

I'm trying to extract a date from a SQL Server table. My query returns it like this:
Hours = pd.read_sql_query("select * from tblAllHours",con)
Now I convert the "Start" column in the Hours dataframe like this:
Hours['Start'] = pd.to_datetime(Hours['Start'], format='%Y-%m-%d')
Then I select the row I want in the column like this:
StartDate1 = Hours.loc[Hours.Month == Sym1, 'Start'].values
Now, if I print my variable with print(StartDate1), I get this result:
[datetime.date(2020, 10, 1)]
What I need is actually 2020-10-01
How can I get this result?
You currently have a column of datetimes; the format you're requesting is a string format.
Use pandas.Series.dt.strftime to convert the datetimes to strings.
In pd.to_datetime(Hours['Start'], format='%Y-%m-%d'), format tells the parser what format your dates are already in so they can be converted to datetimes; it is not a way to choose the format the datetimes come back in.
Review pandas.to_datetime.
If you want only the values, not the Series, add .values at the end of the following command, as you did in the question.
start_date_str = Hours.Start.dt.strftime('%Y-%m-%d')
Try:
print(Hours['Start'].dt.strftime('%Y-%m-%d').values)
The result is an array of YYYY-MM-DD strings:
['2020-07-03', '2020-07-02']
This is a bit similar to How to change the datetime format in pandas.
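Putting it together with the question's names, a minimal sketch; the Hours contents and the Sym1 value below are made up to mirror the question:
import pandas as pd

# stand-in for Hours = pd.read_sql_query("select * from tblAllHours", con)
Hours = pd.DataFrame({'Month': [10, 11], 'Start': ['2020-10-01', '2020-11-01']})
Sym1 = 10

Hours['Start'] = pd.to_datetime(Hours['Start'], format='%Y-%m-%d')
StartDate1 = Hours.loc[Hours.Month == Sym1, 'Start'].dt.strftime('%Y-%m-%d').values
print(StartDate1)  # ['2020-10-01']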

Date to float from an R Tibble in Python

By using an API, I retrieved a Tibble (an R object) in Python (using rpy2.objects); it is a very large 2-dimensional table. It contains a column with dates in the format "YYYY-MM-DD" when I print the Tibble object. When I grab a date in Python (simply by indexing the Tibble), it is converted to a 5-digit float. For example, the date "2019-09-28" is converted to the float 18167.0. I'm not sure how to convert it back to a string date (e.g. "YYYY-MM-DD").
Does anyone have any ideas? I'm happy to clarify anything that I can :)
Edit: The answer I discovered with help was the following:
import pandas as pd
pd.to_datetime(18167.0, unit='D', origin='1970-01-01')
# Timestamp('2019-09-28 00:00:00')
If the Date class got converted to numeric storage mode, we can use as.Date with origin (in R):
as.Date(18167, origin = "1970-01-01")
#[1] "2019-09-28"
The Date storage mode is numeric
storage.mode(Sys.Date())
#[1] "double"
In Python, we can also do:
from datetime import datetime, date, time
date.fromordinal(int(18167) + date(1970, 1, 1).toordinal()).strftime("%Y-%m-%d")
#'2019-09-28'
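For a whole column of these day-offset floats, the same pandas conversion vectorizes. A short sketch (the series values are hypothetical, based on the example offset above):
import pandas as pd

s = pd.Series([18167.0, 18168.0])
# days since the Unix epoch -> datetimes -> formatted strings
dates = pd.to_datetime(s, unit='D', origin='unix').dt.strftime('%Y-%m-%d')
print(dates.tolist())  # ['2019-09-28', '2019-09-29']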

Pandas data frame - to_json() to_csv() don't act the same for dates in iso format

I'm getting dates from my API in ISO format.
When I'm doing:
df = DataFrame(results)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip', quoting=QUOTE_NONNUMERIC)
And I look at the CSV I see for example:
lastDeliveryDate
2018-11-21 16:25:53.990000-05:00
However,
When I do:
df = DataFrame(results)
df.to_json(path_or_buf=file_name, orient="records",compression='gzip', lines=True)
I see (other record):
"lastDeliveryDate":1543258826689
This is a problem.
When I load the data from the CSV into Google BigQuery, everything is fine; the date is parsed correctly.
But when I changed the loading to JSON, it doesn't parse the date correctly.
I see the dates in this format:
50866-01-09 23:46:40 UTC
This occurs because to_json() and to_csv() produce different results for dates in ISO format.
How can I fix this? Must I edit the data frame and convert all my date columns to regular UTC? How can I do that? And why is it needed for to_json() but not for to_csv()?
As explained in How do I translate an ISO 8601 datetime string into a Python datetime object?, I tried to do:
df["lastDeliveryDate"] = dateutil.parser.parse(df["lastDeliveryDate"])
But it gives:
TypeError: Parser must be a string or character stream, not Series
From the Pandas documentation on to_json():
date_format: {None, ‘epoch’, ‘iso’}
Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.
So, with orient="records", you'll have to set date_format="iso" to get a date-time format that can be understood later on:
df.to_json(path_or_buf=file_name, orient="records", date_format="iso",
           compression='gzip', lines=True)
Basically, dateutil.parser.parse() expects a string as its parameter, but you passed the whole column. Try it with a lambda function applied row by row:
df["lastDeliveryDate"] = df["lastDeliveryDate"].apply(lambda row: dateutil.parser.parse(row))

PySpark removing Invalid Date time format in column

My datetime field format is: 2016-10-15 00:00:00
After using infer schema while saving my data to a parquet file, I have a few rows that don't comply with this format.
How can I collectively remove them in PySpark?
They are causing problems in my UDFs.
Assuming you're parsing the date column on read, so that rows with invalid dates come out as null (which is usually the case), you can simply drop the nulls:
from pyspark.sql.functions import col

df.filter(col('date').isNotNull())
Alternatively, if your date column was read as a string, you can parse it with unix_timestamp and then drop the rows that failed to parse:
from pyspark.sql.functions import col, unix_timestamp

(
    df
    .select(unix_timestamp('date', 'yyyy-MM-dd HH:mm:ss').cast("timestamp").alias('date'))
    .filter(col('date').isNotNull())
)
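If you need to keep the other columns rather than selecting only the parsed date, you can filter on the parse result alone. A self-contained sketch under the same yyyy-MM-dd HH:mm:ss assumption (the example rows are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('2016-10-15 00:00:00', 1), ('not-a-date', 2)], ['date', 'id'])

# rows whose 'date' string fails to parse yield null and are filtered out,
# while all original columns are kept
valid_df = df.filter(unix_timestamp(col('date'), 'yyyy-MM-dd HH:mm:ss').isNotNull())
valid_df.show()  # only the 2016-10-15 row remains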
