I am writing a script that outputs CSVs whose filenames must include the execution date in YYYYMMDD format.
The execution timestamp is pulled through JDBC and ends up in my DataFrame as int64.
import pandas as pd
from dateutil import parser
Input:
x = pd.DataFrame([1493293503289], columns=['EXECUTION_TIMESTAMP'])
ts= x['EXECUTION_TIMESTAMP']
ts
Output:
0 1493293503289
Name: EXECUTION_TIMESTAMP, dtype: int64
I have written the following code, which converts the value to a pandas Timestamp and then parses it into a YYYYMMDD string:
Input:
df=pd.DataFrame(ts) # create pd data frame
ts_conv = pd.to_datetime(df['EXECUTION_TIMESTAMP'], unit='ns')[0]
parser.parse(str(ts_conv)).strftime('%Y%m%d')
Output:
'19700101'
However, ts_conv is Timestamp('1970-01-01 00:24:53.293503289'), while I know that the actual execution time is '2017-04-27-11.45.03'.
I would greatly appreciate any suggestions to convert this to the appropriate date.
Following the suggestion from @gseva, setting unit='ms' makes the parse produce the correct YYYYMMDD string. The timestamp is in milliseconds, not nanoseconds as I had specified.
Input:
ts_conv = pd.to_datetime(df['EXECUTION_TIMESTAMP'], unit='ms')[0]
parser.parse(str(ts_conv)).strftime('%Y%m%d')
Output:
'20170427'
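The round trip through dateutil is not strictly needed, since a pandas Timestamp can be formatted directly. The following is a minimal sketch, assuming the column always holds epoch milliseconds; the filename pattern `export_<date>.csv` is made up for illustration and is not from the original post.

```python
import pandas as pd

# Same sample value as in the question: epoch time in milliseconds.
x = pd.DataFrame([1493293503289], columns=['EXECUTION_TIMESTAMP'])

# Read as nanoseconds, the integer lands only minutes after the 1970 epoch.
as_ns = pd.to_datetime(x['EXECUTION_TIMESTAMP'], unit='ns')[0]

# Read as milliseconds, it gives the real execution time.
as_ms = pd.to_datetime(x['EXECUTION_TIMESTAMP'], unit='ms')[0]

# Timestamp objects support strftime directly, so no string parsing is needed.
date_str = as_ms.strftime('%Y%m%d')

# Hypothetical filename pattern, chosen only for this example.
filename = f'export_{date_str}.csv'
print(as_ns, as_ms, filename)
```

The same `strftime` call is available on a whole column through the `.dt` accessor if more than one row is involved.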
Related
I have the following problem. I want to create a date from another. To do this, I extract the year from the database date and then create the chosen date (day = 30 and month = 9) being the year extracted from the database.
The code is the following
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
But I get this error message:
"cannot convert the series to <class 'int'>"
I think dt means datetime, so the line 'dt.datetime(y,m,d)' creates a datetime object.
Should bbdd20Q3['mydate'] hold an int?
If so, try to think of another way to store the date (an 8-digit number, maybe).
Hope this helps :)
I assume that you did import datetime as dt. Then, by doing:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
you are passing a Series as the first argument to datetime.datetime, when it expects an int or something that can be converted to an int. You should create one datetime.datetime for each element of the Series, not a single datetime.datetime. Consider the following example:
import datetime
import pandas as pd
df = pd.DataFrame({"year":[2001,2002,2003]})
df["day"] = df["year"].apply(lambda x:datetime.datetime(x,9,30))
print(df)
Output:
year day
0 2001 2001-09-30
1 2002 2002-09-30
2 2003 2003-09-30
Here is some sample code with the required logic:
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['2019-12-14', '2020-12-15']})
print(df.dtypes)
# convert the date in string format to datetime object,
# if the date column(Series) is already a datetime object then this is not required
df['date'] = pd.to_datetime(df['date'])
print(f'after conversion \n {df.dtypes}')
# logic to create a new data column
df['new_date'] = pd.to_datetime({'year':df['date'].dt.year,'month':9,'day':30})
@eollon I see that you are also new to Stack Overflow. It would be better if you could add simple sample code that others can try out independently.
(Keeping the comment here since I don't have permission to comment :) )
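Applied to the column names from the question, the vectorized approach could look like the sketch below. The two sample dates are invented for illustration, and it assumes that `datedaymonthyear` holds values `pd.to_datetime` can parse.

```python
import pandas as pd

# Invented sample rows; the real bbdd20Q3 frame comes from the asker's database.
bbdd20Q3 = pd.DataFrame({'datedaymonthyear': ['2019-12-14', '2020-03-01']})

# Make sure the column is datetime so that the .dt accessor is available.
bbdd20Q3['datedaymonthyear'] = pd.to_datetime(bbdd20Q3['datedaymonthyear'])

# Build 30 September of each row's year; the scalars 9 and 30 are broadcast
# against the year Series when pandas assembles the dates.
bbdd20Q3['mydate'] = pd.to_datetime({
    'year': bbdd20Q3['datedaymonthyear'].dt.year,
    'month': 9,
    'day': 30,
})
print(bbdd20Q3)
```

This avoids calling `datetime.datetime` once per row, which can matter on large frames.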
I have a vaex dataframe that reads from an HDF5 file. It has a date column which is read as a string. I converted it to datetime. However, I am not able to do any date comparisons. I can extract the day, month, year, etc. from the date, so the conversion is correct. But how do I perform operations like "date is between x and y"?
import vaex
import datetime
vaex_df=vaex.open('filename.hdf5')
vaex_df['pDate']=vaex_df.Date.values.astype('datetime64[ns]')
The datatypes are as expected
print(vaex_df.dtypes)
## Date <class 'str'>
## pDate datetime64[ns]
Now I need to filter out rows based on some date
start_date=datetime.date(2019,10,1)
vaex_df=vaex_df[(vaex_df.pDate.dt>=start_date)]
print(vaex_df) # throws SyntaxError: invalid token
I get an invalid token when I try to look at the new dataframe.
I can extract the month and year separately and apply the filter. But that would give a wrong result
vaex_df=vaex_df[(vaex_df.pDate.dt.month>int(str(start_date)[5:7]))&(vaex_df.pDate.dt.year>=int(str(start_date)[:4]))]
How do I do date range comparison operations in vaex?
Using datetime64 from NumPy works.
import numpy as np
# Instead of
start_date=datetime.date(2019,10,1)
# use
start_date=np.datetime64('2019-10-01')
Then, on the vaex dataframe:
vaex_df=vaex_df[(vaex_df.pDate>=start_date)]
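This presumably works because the column holds datetime64 values, which compare directly against `np.datetime64` scalars. The sketch below builds a "between two dates" mask on a plain NumPy array rather than on a vaex dataframe. The sample dates and `end_date` are made up, and it is an assumption that the same boolean expression carries over unchanged to vaex columns.

```python
import numpy as np

# Invented sample dates stored with the same dtype as the pDate column.
dates = np.array(
    ['2019-09-30', '2019-10-01', '2019-11-15', '2020-01-02'],
    dtype='datetime64[ns]',
)

# Range bounds as NumPy datetime64 scalars (end_date is hypothetical).
start_date = np.datetime64('2019-10-01')
end_date = np.datetime64('2019-12-31')

# Combine both comparisons element-wise; parentheses are required around
# each comparison because & binds more tightly than >= and <=.
in_range = (dates >= start_date) & (dates <= end_date)
selected = dates[in_range]
print(selected)
```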
I want to convert the date in a column in a dataframe to a different format. Currently, it has this format: '2019-11-20T01:04:18'. I want it to have this format: 20-11-19 1:04.
I think I need to develop a loop and generate a new column for the new date format. So essentially, in the loop, I would refer to the initial column and then generate the variable for the new column in the format I want.
Can someone help me out to complete this task?
The following code works for a single value:
import datetime
d = datetime.datetime.strptime('2019-11-20T01:04:18', '%Y-%m-%dT%H:%M:%S')
print(d.strftime('%d-%m-%y %H:%M'))
This is adapted from a previous answer on this site and should help you; the comments explain each step.
You can read your data into pandas from csv or database or create some test data as shown below for testing.
>>> import pandas as pd
>>> df = pd.DataFrame({'column': {0: '26/1/2016', 1: '26/1/2016'}})
>>> # First convert the column to datetime datatype
>>> df['column'] = pd.to_datetime(df.column)
>>> # Then call strftime() through the .dt accessor, using the format codes you want
>>> df['column'] = df['column'].dt.strftime('%Y-%m-%dT%H:%M:%S')
>>> df
column
0 2016-01-26T00:00:00
1 2016-01-26T00:00:00
NB: Make sure that all the values in your column use the same date string format.
You can achieve it like this:
from datetime import datetime
df['your_column'] = df['your_column'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S').strftime('%d-%m-%y %H:%M'))
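A vectorized alternative that avoids the per-row lambda, as a sketch: parse the column with `pd.to_datetime` and format it through the `.dt` accessor. The column name `your_column` follows the line above, `new_format` is a hypothetical name, and the second sample row is invented. Note that `%H` zero-pads the hour, so this yields `01:04` rather than the `1:04` shown in the question.

```python
import pandas as pd

# First value is from the question; the second is invented for illustration.
df = pd.DataFrame({'your_column': ['2019-11-20T01:04:18', '2019-12-05T17:30:00']})

# Parse the ISO-style strings into datetime64 values in one call.
parsed = pd.to_datetime(df['your_column'], format='%Y-%m-%dT%H:%M:%S')

# Format every row at once; the result is a column of strings.
df['new_format'] = parsed.dt.strftime('%d-%m-%y %H:%M')
print(df)
```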
I am running into an issue where the Pandas to_datetime function results in a Unix timestamp instead of a datetime object for certain rows. The date format in rows that do convert to datetime and rows that convert to Unix timestamp as int appear to be identical. When the problem occurs it seems to affect all the dates in the row.
For example:
2019-01-02T10:12:28.64Z (stored as str) ends up as 1546424003423000000
While
2019-09-17T11:28:49.35Z (stored as str) converts to a datetime object.
Another date in the same row is 2019-01-02T10:13:23.423Z (stored as str) which is converting to a timestamp as well.
There isn't much code to look at; each conversion happens on a single line:
full_df.loc[mask, 'transaction_changed_datetime'] = pd.to_datetime(full_df['SaleItemChangedOn'])
and
full_df.loc[pd.isnull(full_df['completed_date']), 'completed_date'] = pd.to_datetime(full_df['SaleCompletedOn'])
I've also tried with errors='coerce', but the result is the same. I can deal with this problem later in the code, but I would really like to understand why this is happening.
Edit
As requested, here is an MRE that reproduces the issue on my computer. Some notes on this:
The mask is somehow involved. If I remove the mask it converts fine.
If I only pass in the first row in the Dataframe (single row Dataframe) it converts fine.
import pandas as pd
from pandas import NaT, Timestamp
debug_dict = {'SaleItemChangedOn': ['2019-01-02T10:12:28.64Z', '2019-01-02T10:12:28.627Z'],
'transaction_changed_datetime': [NaT, Timestamp('2019-01-02 11:58:47.900000+0000', tz='UTC')]}
df = pd.DataFrame(debug_dict)
mask = (pd.isnull(df['transaction_changed_datetime']))
df.loc[mask, 'transaction_changed_datetime'] = pd.to_datetime(df['SaleItemChangedOn'])
When I try the examples you mention:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':['2019-01-02T10:12:28.64Z', '2019-09-17T11:28:49.35Z', np.nan]})
pd.to_datetime(df['a'])
There doesn't seem to be any issue:
Out[74]:
0 2019-01-02 10:12:28.640000+00:00
1 2019-09-17 11:28:49.350000+00:00
2 NaT
Name: a, dtype: datetime64[ns, UTC]
Could you provide an MRE?
You might want to check whether more than one column with the same name is being passed to pd.to_datetime. That solved the problem of datetimes being converted to timestamps for me.
This appears to have been a bug in pandas that was fixed with the release of version 1.0. The example code above now produces the expected results.
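On older pandas versions, one possible way around the masked `.loc` assignment is to convert the whole column first and then fill only the missing entries. This is a sketch, not a confirmed fix for the bug. The sample values are adapted from the MRE in the question, with the fractional seconds padded to three digits so that parsing does not depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    # Source strings, adapted from the MRE (fractions padded to 3 digits).
    'SaleItemChangedOn': ['2019-01-02T10:12:28.640Z', '2019-01-02T10:12:28.627Z'],
    # Existing column: first row missing, second row already populated.
    'transaction_changed_datetime': pd.to_datetime(
        [None, '2019-01-02T11:58:47.900Z'], utc=True
    ),
})

# Convert the full source column once, keeping everything timezone-aware.
converted = pd.to_datetime(df['SaleItemChangedOn'], utc=True)

# Fill only the missing rows; both sides are datetime, so no integers appear.
df['transaction_changed_datetime'] = df['transaction_changed_datetime'].fillna(converted)
print(df.dtypes)
```

Because `fillna` aligns on the index, rows that already have a value are left untouched.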
I am reading from an Excel sheet. The header is a date in Month-Year format and I want to keep it that way. But when pandas reads it, it changes the format to "2014-01-01 00:00:00". I wrote the following piece of code to fix it, but it doesn't work.
import pandas as pd
import numpy as np
import datetime
from datetime import date
import time
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df.columns=pd.to_datetime(df.columns, format='%b-%y')
Which didn't do anything. On another try, I did the following:
df.columns = datetime.datetime.strptime(df.columns, '%Y-%m-%d %H:%M:%S').strftime('%b-%y')
Which returns the "must be str, not datetime.datetime" error. I don't know how to make it read the row cell by cell to get the strings!
Here is a sample data:
NaT 11/14/2015 00:00:00 12/15/2015 00:00:00 1/15/2016 00:00:00
A 5 1 6
B 6 3 3
My main problem with this is that it does not recognize the labels as the header, e.g., df['11/14/2015 00:00:00'] returns a KeyError.
Any help is appreciated.
UPDATE: Here is a photo to illustrate what I keep getting! Box 6 is the implementation of apply, and box 7 is what my data looks like.
import datetime
import pandas as pd
df = pd.DataFrame({'data': ["11/14/2015 00:00:00", "11/14/2015 00:10:00", "11/14/2015 00:20:00"]})
df["data"].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %H:%M:%S').strftime('%b-%y'))
EDIT
If you'd like to work with df.columns, you could use the map function:
df.columns = list(map(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %H:%M:%S').strftime('%b-%y'), df.columns))
You need list() if you are using Python 3.x, because map returns an iterator there.
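If the header has already been parsed into datetime values (which the "2014-01-01 00:00:00" display and the "must be str, not datetime.datetime" error in the question suggest, though that is an assumption), `strptime` is not needed because the labels are not strings. The sketch below uses data modeled on the sample in the question and calls `DatetimeIndex.strftime` on the column labels; it assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Column labels that are datetime values, as read_excel may produce them.
columns = pd.to_datetime(['2015-11-14', '2015-12-15', '2016-01-15'])
df = pd.DataFrame([[5, 1, 6], [6, 3, 3]], index=['A', 'B'], columns=columns)

# Format all labels at once; the result is an Index of plain strings.
df.columns = df.columns.strftime('%b-%y')
print(df)
```

After this, columns can be selected with the string label, for example `df['Nov-15']`.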
The problem might be that the data in Excel isn't stored in the string format you think it is. Perhaps it is stored as a number and just displayed as a date string in Excel.
Excel stores dates internally as serial numbers (days counted from its epoch), not as strings.
Check the actual values in the df.
What does this show?
from pprint import pprint
pprint(df)
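If the inspection shows plain numbers such as 42322 instead of date strings, they may be Excel serial dates. The sketch below assumes Excel's default 1900 date system, where the serial number counts days and 1899-12-30 works as the origin for modern dates. The serial values are chosen for illustration, and whether the asker's file actually contains such numbers is not known.

```python
import pandas as pd

# Illustrative Excel serial day numbers (assumed 1900 date system).
serials = pd.Series([42322, 42353, 42384])

# Interpret the numbers as days counted from the 1899-12-30 origin.
dates = pd.to_datetime(serials, unit='D', origin='1899-12-30')

# Format as Month-Year labels, the form the question wants to keep.
labels = dates.dt.strftime('%b-%y')
print(labels.tolist())
```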