Date to float from an R Tibble in Python

By using an API, I retrieved a Tibble (an R object) in Python (using rpy2.robjects); it is a very large 2-dimensional table. When I print the Tibble object, it contains a column with dates in the format "YYYY-MM-DD". When I grab a date in Python (simply by indexing the Tibble), it is converted to a 5-digit float. For example, the date "2019-09-28" is converted to the float 18167.0. I'm not sure how to convert it back to a string date (e.g. "YYYY-MM-DD").
Does anyone have any ideas? I'm happy to clarify anything that I can :)
Edit: The answer I discovered with help was the following
import pandas as pd
pd.to_datetime(18167.0,unit='d',origin='1970-01-01')

If the Date class got converted to numeric storage mode, we can use as.Date with origin
as.Date(18167, origin = "1970-01-01")
#[1] "2019-09-28"
The Date storage mode is numeric
storage.mode(Sys.Date())
#[1] "double"
In Python, we can also do
from datetime import datetime, date, time
date.fromordinal(int(18167) + date(1970, 1, 1).toordinal()).strftime("%Y-%m-%d")
#'2019-09-28'
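For a whole column of such day offsets (rather than a single value), a minimal pandas sketch along the same lines, assuming the floats are days since 1970-01-01:
import pandas as pd
# hypothetical column of day offsets pulled from the tibble
days = pd.Series([18167.0, 18168.0, 18169.0])
# unit='D' interprets the numbers as days since the Unix epoch (1970-01-01)
dates = pd.to_datetime(days, unit='D', origin='unix')
dates.dt.strftime('%Y-%m-%d').tolist()
#['2019-09-28', '2019-09-29', '2019-09-30']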

Related

Pandas.to_datetime doesn't recognize the format of the string to be converted to datetime

I am trying to convert some data from a .txt file to a dataframe to use it for some analysis.
The form of the data in the .txt file is as follows:
DATE_TIME VELOC MEASURE
[m/s] [l/h]
A 09.01.2023 12:45:20 ??? ???
A 09.01.2023 12:46:20 0,048 52,67
A 09.01.2023 12:47:20 0,049 53,77
A 09.01.2023 12:48:20 0,050 54,86
I load the data into a dataframe with no problem, I convert the str values of the measurements to float, etc., and everything is good, as shown in the image.
The problem comes when trying to convert the date-time column, which is a string, to pandas datetime format using this line of code:
volume_flow['DATE_TIME'] = pd.to_datetime(volume_flow['DATE_TIME'], format = '%d.%m.%Y %H:%M:S')
and I get the following error:
ValueError: time data '09.01.2023 12:46:20' does not match format '%d.%m.%Y %H:%M:S' (match)
but I don't see how the format is off.
I am really lost as to why this happens, as I have used the same code with different datetime formats before with no problem.
Furthermore, I tried using format = '%dd.%mm.%yyyy %H:%M:S' as well, with the same results, and when I let pandas.to_datetime convert it automatically it confuses the day and the month of the data. The data is between 09.01 and 12.01, so you can't really tell which is the month and which is the day just from the values.
I think you should go from this
(..., format='%d.%m.%Y %H:%M:S')
to this
(..., format='%d.%m.%Y %H:%M:%S')
You forgot the percentage character!
Check the documentation for the correct time format directives. You will note that the directive %S represents the seconds:
Second as a decimal number [00,61].
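A quick check with the sample rows above (a minimal sketch; the DataFrame construction here is illustrative, not the asker's actual loading code):
import pandas as pd
volume_flow = pd.DataFrame({'DATE_TIME': ['09.01.2023 12:46:20', '09.01.2023 12:47:20']})
# note %S, with the percent sign, for the seconds directive
volume_flow['DATE_TIME'] = pd.to_datetime(volume_flow['DATE_TIME'], format='%d.%m.%Y %H:%M:%S')
volume_flow['DATE_TIME']
#0   2023-01-09 12:46:20
#1   2023-01-09 12:47:20
#Name: DATE_TIME, dtype: datetime64[ns]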

How to write pandas date column to Databricks SQL database

I have a pandas dataframe column that has string values in the format YYYY-MM-DD HH:MM:SS.mmmmmmm, for example 2021-12-26 21:10:18.6766667. I have verified that all values are in this format, with 7 fractional-second digits. But the following code throws a conversion error (shown below) when it tries to insert data into an Azure Databricks SQL database:
Conversion failed when converting date and/or time from character string
Question: What could be a cause of the error and how can we fix it?
Remark: after conversion, the initial value (for example 2021-12-26 21:10:18.6766667) even gains two more digits at the end, becoming 2021-12-26 21:10:18.676666700, with 9 fractional-second digits.
import sqlalchemy as sq
import pandas as pd
import datetime
data_df = pd.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
data_df['OrderDate'] = data_df['OrderDate'].astype('datetime64[ns]')
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
Keep the dates as plain strings without converting to_datetime.
This is because Databricks SQL is based on SQLite, and SQLite expects date strings:
In the case of SQLite, date and time types are stored as strings which are then converted back to datetime objects when rows are returned.
If the raw date strings still don't work, convert them to_datetime and reformat into a safe format using dt.strftime:
df['OrderDate'] = pd.to_datetime(df['OrderDate']).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
Or if the column is already datetime, use dt.strftime directly:
df['OrderDate'] = df['OrderDate'].dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
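As a quick sanity check of that reformatting (a minimal sketch using the single value quoted above):
import pandas as pd
s = pd.Series(['2021-12-26 21:10:18.6766667'])
# %f gives 6 fractional digits; dropping the last 3 leaves milliseconds
pd.to_datetime(s).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3][0]
#'2021-12-26 21:10:18.676'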

Pandas to_datetime function resulting in Unix timestamp instead of Datetime for certain date strings

I am running into an issue where the Pandas to_datetime function results in a Unix timestamp instead of a datetime object for certain rows. The date format in rows that do convert to datetime and rows that convert to Unix timestamp as int appear to be identical. When the problem occurs it seems to affect all the dates in the row.
For example:
2019-01-02T10:12:28.64Z (stored as str) ends up as 1546424003423000000
While
2019-09-17T11:28:49.35Z (stored as str) converts to a datetime object.
Another date in the same row is 2019-01-02T10:13:23.423Z (stored as str) which is converting to a timestamp as well.
There isn't much code to look at, the conversion happens on a single line:
full_df.loc[mask, 'transaction_changed_datetime'] = pd.to_datetime(full_df['SaleItemChangedOn']) and
full_df.loc[pd.isnull(full_df['completed_date']), 'completed_date'] = pd.to_datetime(full_df['SaleCompletedOn'])
I've tried with errors='coerce' as well, but the result is the same. I can deal with this problem later in the code, but I would really like to understand why this is happening.
Edit
As requested, this is the MRE that reproduces the issue on my computer. Some notes on this:
The mask is somehow involved. If I remove the mask it converts fine.
If I only pass in the first row in the Dataframe (single row Dataframe) it converts fine.
import pandas as pd
from pandas import NaT, Timestamp
debug_dict = {'SaleItemChangedOn': ['2019-01-02T10:12:28.64Z', '2019-01-02T10:12:28.627Z'],
              'transaction_changed_datetime': [NaT, Timestamp('2019-01-02 11:58:47.900000+0000', tz='UTC')]}
df = pd.DataFrame(debug_dict)
mask = (pd.isnull(df['transaction_changed_datetime']))
df.loc[mask, 'transaction_changed_datetime'] = pd.to_datetime(df['SaleItemChangedOn'])
When I try the examples you mention:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':['2019-01-02T10:12:28.64Z', '2019-09-17T11:28:49.35Z', np.nan]})
pd.to_datetime(df['a'])
There doesn't seem to be any issue:
Out[74]:
0 2019-01-02 10:12:28.640000+00:00
1 2019-09-17 11:28:49.350000+00:00
2 NaT
Name: a, dtype: datetime64[ns, UTC]
Could you provide an MRE?
You might want to check if you have more than one column with the same name which is being sent to pd.to_datetime. It solved the datetime being converted to timestamp problem for me.
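For instance (a minimal sketch with hypothetical data, not the asker's actual frame), duplicate column names make the indexing return a DataFrame instead of a Series, which changes what pd.to_datetime receives:
import pandas as pd
# two columns deliberately sharing the same name
df = pd.DataFrame([['2019-01-02T10:12:28.64Z', '2019-01-02T10:13:23.423Z']], columns=['SaleItemChangedOn', 'SaleItemChangedOn'])
type(df['SaleItemChangedOn'])
#<class 'pandas.core.frame.DataFrame'>
# pd.to_datetime(df['SaleItemChangedOn']) would then not parse the strings as intended;
# select a single column first:
pd.to_datetime(df['SaleItemChangedOn'].iloc[:, 0])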
This appears to have been a bug in pandas that was fixed with the release of v1.0. The example code above now produces the expected results.

Pandas, convert datetime format mm/dd/yyyy to dd/mm/yyyy

The default format of the csv is dd/mm/yyyy. When I convert it to datetime with df['Date'] = pd.to_datetime(df['Date']), it changes the format to mm/dd/yyyy.
Then, I used df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%d/%m/%Y')
to convert to dd/mm/yyyy, but the values end up in string (object) format. However, I need them in datetime format. When I use df['Date'] = pd.to_datetime(df['Date']) again, it goes back to the previous format. Need your help.
You can use the parse_dates and dayfirst arguments of pd.read_csv, see: the docs for read_csv()
df = pd.read_csv('myfile.csv', parse_dates=['Date'], dayfirst=True)
This will read the Date column as datetime values, correctly taking the first part of the date input as the day. Note that in general you will want your dates to be stored as datetime objects.
Then, if you need to output the dates as a string you can call dt.strftime():
df['Date'].dt.strftime('%d/%m/%Y')
When I use again this: df['Date'] = pd.to_datetime(df['Date']), it gets back to the previous format.
No, you cannot simultaneously have the string format of your choice and keep your series of type datetime. As remarked here:
datetime series are stored internally as integers. Any
human-readable date representation is just that, a representation,
not the underlying integer. To access your custom formatting, you can
use methods available in Pandas. You can even store such a text
representation in a pd.Series variable:
formatted_dates = df['datetime'].dt.strftime('%m/%d/%Y')
The dtype of formatted_dates will be object, which indicates
that the elements of your series point to arbitrary Python types. In
this case, those arbitrary types happen to be all strings.
Lastly, I strongly recommend you do not convert a datetime series
to strings until the very last step in your workflow. This is because
as soon as you do so, you will no longer be able to use efficient,
vectorised operations on such a series.
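A minimal sketch of that point (the dates here are made up for illustration):
import pandas as pd
s = pd.to_datetime(pd.Series(['01/02/2021', '02/02/2021']), dayfirst=True)
s.dtype
#datetime64[ns]
formatted = s.dt.strftime('%d/%m/%Y')
formatted.dtype
#object -- plain strings in the chosen display format
pd.to_datetime(formatted, dayfirst=True).dtype
#datetime64[ns] again, shown in the default YYYY-MM-DD style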
This solution will work for all cases where a column has mixed date formats. Add more conditions to the function if needed. Pandas to_datetime() function was not working for me, but this seems to work well.
import datetime
import pandas as pd

def format(val):
    # normalise the value with pandas first, then re-parse to resolve the format
    a = pd.to_datetime(val, errors='coerce', cache=False).strftime('%m/%d/%Y')
    try:
        date_time_obj = datetime.datetime.strptime(a, '%d/%m/%Y')
    except ValueError:
        date_time_obj = datetime.datetime.strptime(a, '%m/%d/%Y')
    return date_time_obj.date()
Saving the changes to the same column.
df['Date'] = df['Date'].apply(lambda x: format(x))
Saving as CSV.
df.to_csv(f'{file_name}.csv', index=False, date_format='%s')

how to convert series integer to datetime in pandas

I want to convert an integer-type date to datetime.
For example: 20130601000011 (i.e. 2013-06-01 00:00:11).
I don't know exactly how to use pd.to_datetime.
Any advice, please?
Thanks.
P.S. My script is below:
rent_date_raw = pd.Series(1, rent['RENT_DATE'])
return_date_raw = pd.Series(1, rent['RETURN_DATE'])
rent_date = pd.Series([pd.to_datetime(date)
                       for date in rent_date_raw])
daily_rent_ts = rent_date.resample('D', how='count')
monthly_rent_ts = rent_date.resample('M', how='count')
Pandas seems to deal with your format fine as long as you convert to string first:
import pandas as pd
eg_date = 20130601000011
pd.to_datetime(str(eg_date))
Out[4]: Timestamp('2013-06-01 00:00:11')
Your data at the moment is really more of a string than an integer, since it doesn't really represent a single number. Different subparts of the string reflect different aspects of the time.
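Applied to a whole column, a minimal sketch (assuming RENT_DATE holds integers like the one above) is to cast to string and give an explicit format:
import pandas as pd
rent_date_raw = pd.Series([20130601000011, 20130601000145])  # hypothetical values
rent_date = pd.to_datetime(rent_date_raw.astype(str), format='%Y%m%d%H%M%S')
rent_date
#0   2013-06-01 00:00:11
#1   2013-06-01 00:01:45
#dtype: datetime64[ns]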
