Working with more than one datetime format in Python

Below is the sample data
Datetime
11/19/2020 9:48:50 AM
12/17/2020 2:41:02 PM
2020-02-11 14:44:58
2020-28-12 10:41:02
2020-05-12 06:31:39
11/19/2020 is in mm/dd/yyyy whereas 2020-28-12 is yyyy-dd-mm.
After applying pd.to_datetime below is the output that I am getting.
Date
2020-11-19 09:48:50
2020-12-17 22:41:02
2020-02-11 14:44:58
2020-28-12 10:41:02
2020-05-12 06:31:39
If the input uses slashes (/), e.g. 11/19/2020, the format is mm/dd/yyyy; if it uses dashes (-), e.g. 2020-02-11, the format is yyyy-dd-mm. But after applying pd.to_datetime, the day and month end up interchanged for the dash-separated values.
The first two outputs are correct. The bottom three need to be corrected to:
2020-11-02 14:44:58
2020-12-28 10:41:02
2020-12-05 06:31:39
Please suggest how to convert everything to a common format, i.e. yyyy-mm-dd.

Call to_datetime once per format with errors='coerce', so that values that do not match the format become NaT, then fill those missing values from the other Series with Series.fillna:
d1 = pd.to_datetime(df['datetime'], format='%Y-%d-%m %H:%M:%S', errors='coerce')
d2 = pd.to_datetime(df['datetime'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
df['datetime'] = d1.fillna(d2)
print (df)
datetime
0 2020-11-19 09:48:50
1 2020-12-17 14:41:02
2 2020-11-02 14:44:58
3 2020-12-28 10:41:02
4 2020-12-05 06:31:39
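
The same idea extends to any number of formats. Below is a minimal, self-contained sketch using the sample data from the question; the helper name parse_with_formats is hypothetical (it is not part of pandas), and it assumes every value matches at most one of the listed formats.

```python
import pandas as pd

def parse_with_formats(values, formats):
    """Try each explicit format in turn and keep the first successful parse per row.

    Hypothetical helper, not a pandas function.
    """
    result = None
    for fmt in formats:
        # Values that do not match this format become NaT
        parsed = pd.to_datetime(values, format=fmt, errors="coerce")
        result = parsed if result is None else result.fillna(parsed)
    return result

df = pd.DataFrame({"datetime": [
    "11/19/2020 9:48:50 AM",
    "12/17/2020 2:41:02 PM",
    "2020-02-11 14:44:58",
    "2020-28-12 10:41:02",
    "2020-05-12 06:31:39",
]})

df["datetime"] = parse_with_formats(
    df["datetime"],
    ["%Y-%d-%m %H:%M:%S", "%m/%d/%Y %I:%M:%S %p"],
)
print(df)
```

Rows that match none of the formats stay NaT, which makes them easy to find afterwards with df["datetime"].isna().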

Cannot convert timestamp to date in Python with correct timezone

I have a Pandas DataFrame as the following:
timestamp
1583985600000
1584072000000
1584331200000
1584417600000
1584504000000
1584590400000
There are actually other columns as well but I pasted the above one for the sake of simplicity.
I need to convert this column to a date format, creating a separate column within the same DataFrame. I tried the following:
df["date EST"] = pd.to_datetime(agg_daily_df["timestamp"],
unit='ms').dt.tz_localize('EST').astype(str)
... which gives the following result:
date EST
2020-03-12 04:00:00-05:00
2020-03-13 04:00:00-05:00
2020-03-16 04:00:00-05:00
2020-03-17 04:00:00-05:00
2020-03-18 04:00:00-05:00
2020-03-19 04:00:00-05:00
... which looks quite strange to me. The first row should actually give
2020-03-12 00:00:00.
What is it I am doing wrong here so that I get results in a strange format?
This will return time in UTC as tz-naive datetime.
pd.to_datetime(agg_daily_df["timestamp"], unit='ms')
# 1583985600000 => 2020-03-12 04:00:00
So, localizing this datetime, it results in
original: 1583985600000 =>
pd.to_datetime: 2020-03-12 04:00:00 (tz-naive) =>
tz_localize: 2020-03-12 04:00:00-05:00 (tz-aware, EST)
The issue is that the datetime must be tz-aware (here, UTC) before you convert it to another timezone.
# Add utc=True to get tz-aware time and convert to EST
(pd.to_datetime(agg_daily_df["timestamp"], unit='ms', utc=True)
.dt.tz_convert('EST'))
This way the time will be converted like this.
original: 1583985600000 =>
to_datetime with utc=True: 2020-03-12 04:00:00+00:00 (tz-aware, UTC) =>
tz_convert: 2020-03-11 23:00:00-05:00 (tz-aware, EST)
Note that the "EST" timezone is a fixed UTC-05:00 offset and does not handle daylight saving time. If you need daylight saving handling, use a region-based timezone name such as 'America/New_York'.
(pd.to_datetime(agg_daily_df["timestamp"], unit='ms', utc=True)
.dt.tz_convert('America/New_York'))
This will give you 2020-03-12 00:00:00-04:00.
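
To make the difference between tz_localize and tz_convert concrete, here is a small self-contained sketch using the first timestamp from the question. The variable names are illustrative only.

```python
import pandas as pd

ts = pd.Series([1583985600000])  # epoch milliseconds, first value from the question

# tz_localize only attaches a zone to the naive wall-clock time (which is really UTC),
# so the clock reading stays at 04:00
localized = pd.to_datetime(ts, unit="ms").dt.tz_localize("America/New_York")

# utc=True marks the values as UTC first, so tz_convert shifts the wall-clock time
converted = pd.to_datetime(ts, unit="ms", utc=True).dt.tz_convert("America/New_York")

print(localized.iloc[0])  # 2020-03-12 04:00:00-04:00
print(converted.iloc[0])  # 2020-03-12 00:00:00-04:00
```

The two results are different instants in time: only the second one corresponds to the original epoch value.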
========================================================
Update:
If you would like a tz-naive time again, remove the tzinfo with .dt.tz_localize(None):
(pd.to_datetime(agg_daily_df["timestamp"], unit='ms', utc=True)
.dt.tz_convert('America/New_York')
.dt.tz_localize(None))
Or, if you just want the time without the timezone offset shown, use .dt.strftime to format the datetimes as strings:
(pd.to_datetime(agg_daily_df["timestamp"], unit='ms', utc=True)
.dt.tz_convert('America/New_York')
.dt.strftime('%Y-%m-%d %H:%M:%S'))
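
As a quick check of both options, the following sketch builds a small Series from the first two timestamps in the question and produces a tz-naive column and a string column. It assumes the values should be shown in New York local time.

```python
import pandas as pd

ts = pd.Series([1583985600000, 1584072000000])  # epoch milliseconds
local = pd.to_datetime(ts, unit="ms", utc=True).dt.tz_convert("America/New_York")

# Option 1: drop the timezone but keep a datetime64 column (local wall-clock time)
naive = local.dt.tz_localize(None)

# Option 2: format as plain strings without the offset
text = local.dt.strftime("%Y-%m-%d %H:%M:%S")

print(naive)
print(text)
```

The first option keeps datetime arithmetic available; the second is only suitable for display or export.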

Convert multiple date formats to datetime in pandas

I have a row of messy data where date formats are different and I want them to be coherent as datetime in pandas
df:
Date
0 1/05/2015
1 15 Jul 2009
2 1-Feb-15
3 12/08/2019
When I run this part:
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
I get
Date
0 NaT
1 2009-07-15
2 NaT
3 NaT
How do I convert it all to date time in pandas?
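pd.to_datetime is capable of handling multiple date formats in the same column. Specifying a format hinders its ability to determine the format for each value, so if there are multiple formats, do not specify one. (In pandas 2.0 and later, the format is instead inferred from the first value and applied to the whole column, so pass format='mixed' there to get per-value inference.)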
import pandas as pd
df = pd.DataFrame({
'Date': ['1/05/2015', '15 Jul 2009', '1-Feb-15', '12/08/2019']
})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)
Date
0 2015-01-05
1 2009-07-15
2 2015-02-01
3 2019-12-08
*There are limitations to the ability to handle multiple date times. Mixed timezone aware and timezone unaware datetimes will not process correctly. Likewise mixed dayfirst and monthfirst notations will not always parse correctly.
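
The day-first versus month-first limitation is easy to see on a single ambiguous value. This short sketch parses the first date from the question both ways; which reading is right depends on where the data came from, which pandas cannot know.

```python
import pandas as pd

ambiguous = "1/05/2015"

month_first = pd.to_datetime(ambiguous)                 # default: January 5
day_first = pd.to_datetime(ambiguous, dayfirst=True)    # dayfirst=True: May 1

print(month_first.date(), day_first.date())
```

If a column mixes both conventions, no single setting of dayfirst will parse every row as intended.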

Cannot convert datetime values to a necessary format

I am struggling with datetime format... This is my dataframe in pandas:
Datetime Date Field
2020-01-12 00:00:00 2020-12-01 6.543916
2020-01-12 00:10:00 2020-12-01 6.505547
2020-01-12 00:20:00 2020-12-01 7.047578
2020-01-12 00:30:00 2020-12-01 6.070998
2020-01-12 00:40:00 2020-12-01 6.452112
df.dtypes
Datetime object
Date datetime64[ns]
Field float64
I need to convert Datetime to datetime64 and swap months with days to get values in the format %Y-%m-%d %H:%M:%S, e.g. 2020-12-01 00:00:00.
import pandas as pd
from datetime import datetime
df["Datetime"] = pd.to_datetime(df["Datetime"])
df["Datetime"] = df["Datetime"].apply(lambda x: datetime.strftime(x, "%Y-%m-%d %H:%M:%S"))
Still I get the same dataframe as shown above...
Consider placing the parameter "errors":
df["Datetime"] = pd.to_datetime(df["Datetime"], errors='coerce')
See if it helps you!
I think you'll get what you want with "%Y-%d-%m %H:%M:%S" instead of "%Y-%m-%d %H:%M:%S" on your last line.
EDIT: Or better even, simply replace the last 2 lines of your code by the following:
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y-%d-%m %H:%M:%S")
That way you won't get a ParserError: month must be in 1..12 from pd.to_datetime in the case where your Datetime column contains something like "2020-30-12 00:00:00"
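
A minimal sketch of that suggestion, using one value from the question plus a made-up row with day 30 (an assumption added here to show that days above 12 are handled):

```python
import pandas as pd

df = pd.DataFrame({"Datetime": ["2020-01-12 00:00:00", "2020-30-12 00:10:00"]})

# The explicit format states that the middle field is the day, not the month
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y-%d-%m %H:%M:%S")

print(df["Datetime"].dtype)  # a datetime64 dtype rather than object
print(df)
```

Once the column is datetime64, it displays as year-month-day by default, so no further strftime call is needed to get the requested layout.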

Pandas read_excel function ignoring dtype

I'm trying to read an excel file with pd.read_excel().
The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.
Example of the excel file
I've tried to specify the dtype or the converters arguments to no avail.
df = pd.read_excel('xls_test.xlsx',
dtype={'Date':str,'Time':str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
As you can see the Date column is not treated as str...
Same thing when using converters
df = pd.read_excel('xls_test.xlsx',
converters={'Date':str,'Time':str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
I have also tried to use other engine but the result is always the same.
The dtype argument seems to work as expected when reading a csv though
What am I doing wrong here ??
Edit:
I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.
Here is a simple solution. Even if you pass str as the dtype, the columns are returned as object. Use the code below to read the columns with the string dtype.
df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})
To understand more about the Pandas String Dtype use the link below,
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
Let me know if you have any issues on that !!
The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.
This means that, whatever you do with the dtype= argument, the data will be loaded as a date, and then converted to string, giving you the result you see:
>>> pd.read_excel('test.xlsx').head()
Date Time Datetime
0 2020-03-08 10:00:00 2020-03-08 10:00:00
1 2020-03-09 11:00:00 2020-03-09 11:00:00
2 2020-03-10 12:00:00 2020-03-10 12:00:00
3 2020-03-11 13:00:00 2020-03-11 13:00:00
4 2020-03-12 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date datetime64[ns]
Time object
Datetime datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
Date Time Datetime
0 2020-03-08 00:00:00 10:00:00 2020-03-08 10:00:00
1 2020-03-09 00:00:00 11:00:00 2020-03-09 11:00:00
2 2020-03-10 00:00:00 12:00:00 2020-03-10 12:00:00
3 2020-03-11 00:00:00 13:00:00 2020-03-11 13:00:00
4 2020-03-12 00:00:00 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date string
Time string
Datetime string
dtype: object
Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()
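
A small sketch of the "load as a date, then format" route. The DataFrame below is a stand-in for what pd.read_excel would return (a datetime64 Date column and a Time column of datetime.time objects); the new column names and the chosen output formats are assumptions for illustration.

```python
import datetime
import pandas as pd

# Stand-in for the result of pd.read_excel('test.xlsx')
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-08", "2020-03-09"]),
    "Time": [datetime.time(10, 0), datetime.time(11, 0)],
})

# datetime64 columns expose .dt.strftime
df["Date_str"] = df["Date"].dt.strftime("%d/%m/%y")

# datetime.time objects live in an object column, so format them element by element
df["Time_str"] = df["Time"].map(lambda t: t.strftime("%H:%M"))

print(df[["Date_str", "Time_str"]])
```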
That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:
load data
re-apply formatting
There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:
import datetime
import openpyxl
import re

# Map Excel number-format tokens to strftime directives
date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}
time_corresp = {
    'hh': '%H',  # 24-hour clock; '%h' is not an hour directive
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        # Split at the first backslash, assumed to separate the date part from the time part
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')
    # Translate each token of the date part, then of the time part
    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)
    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)
    return cell.value.strftime(dfmt + tfmt)
Which you can then use as follows:
>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'
(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)
* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.
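
To illustrate the serial-number storage, the sketch below converts Excel-style day counts back to datetimes with pandas. It assumes the workbook uses the 1900 date system; the origin '1899-12-30' is the value commonly used for that system because it compensates for Excel treating 1900 as a leap year, so it is only valid for dates after February 1900.

```python
import pandas as pd

# 43898 is the serial for 2020-03-08; the fractional part is the time of day
serials = pd.Series([43898, 43898.5])

dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates)
```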
The issue, as the other comments say, is most likely a bug.
Although not ideal, you could always do something like this:
import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str})
# this line can be then simplified to :
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)
Date Time
0 "2020-03-08 00:00:00" "10:00:00"
1 "2020-03-09 00:00:00" "11:00:00"
2 "2020-03-10 00:00:00" "12:00:00"
3 "2020-03-11 00:00:00" "13:00:00"
4 "2020-03-12 00:00:00" "14:00:00"
5 "2020-03-13 00:00:00" "15:00:00"
6 "2020-03-14 00:00:00" "16:00:00"
7 "2020-03-15 00:00:00" "17:00:00"
8 "2020-03-16 00:00:00" "18:00:00"
9 "2020-03-17 00:00:00" "19:00:00"
10 "2020-03-18 00:00:00" "20:00:00"
11 "2020-03-19 00:00:00" "21:00:00"
object
object
Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).
And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).
All dtypes can now be converted to StringDtype
You just need to specify dtype="string" when loading your data with pandas:
>>> df = pd.read_excel('xls_test.xlsx', dtype="string")
>>> df.dtypes
Date string
Time string
dtype: object
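
The same dtype can be checked without an Excel file. This sketch converts a small integer Series, chosen only as an illustration, to the "string" dtype and confirms that the result is a StringDtype column.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# "string" requests pandas' dedicated StringDtype
as_string = s.astype("string")

print(as_string.dtype)                               # string
print(isinstance(as_string.dtype, pd.StringDtype))   # True
```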

Formatting datetimes without two digits in month and day?

I have a dataframe with a column of datetimes in the following format:
df['A']
1/23/2008 15:41
3/10/2010 14:42
10/14/2010 15:23
1/2/2008 11:39
4/3/2008 13:35
5/2/2008 9:29
I need to convert df['A'] into df['Date'], df['Time'], and df['Timestamp'].
I tried to first convert df['A'] to a datetime by using
df['Datetime'] = pd.to_datetime(df['A'],format='%m/%d/%y %H:%M')
from which I would've created my three columns above, but my formatting codes for %m/%d do not pick up the single digit month and days.
Does anyone know a quick fix to this?
There's a bug in your format: %y expects a two-digit year, but your data has four-digit years, which need %Y. As @MaxU commented, if you don't pass a format argument, pandas will infer the format and convert your column to datetime automatically.
df['Timestamp'] = pd.to_datetime(df['A'])
Or, to fix your code -
df['Timestamp'] = pd.to_datetime(df['A'], format='%m/%d/%Y %H:%M')
For your first query, use dt.normalize or, dt.floor (thanks, MaxU, for the suggestion!) -
df['Date'] = df['Timestamp'].dt.normalize()
Or,
df['Date'] = df['Timestamp'].dt.floor('D')
For your second query, use dt.time.
df['Time'] = df['Timestamp'].dt.time
df.drop(columns='A')
Date Time Timestamp
0 2008-01-23 15:41:00 2008-01-23 15:41:00
1 2010-03-10 14:42:00 2010-03-10 14:42:00
2 2010-10-14 15:23:00 2010-10-14 15:23:00
3 2008-01-02 11:39:00 2008-01-02 11:39:00
4 2008-04-03 13:35:00 2008-04-03 13:35:00
5 2008-05-02 09:29:00 2008-05-02 09:29:00
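
Putting the steps together, here is a self-contained sketch on three of the sample values. It shows that %m, %d and %H accept single-digit values when parsing, so only the year directive had to change.

```python
import pandas as pd

df = pd.DataFrame({"A": ["1/23/2008 15:41", "1/2/2008 11:39", "5/2/2008 9:29"]})

# %Y parses the four-digit year; single-digit months, days and hours parse fine
df["Timestamp"] = pd.to_datetime(df["A"], format="%m/%d/%Y %H:%M")
df["Date"] = df["Timestamp"].dt.normalize()  # midnight of the same day
df["Time"] = df["Timestamp"].dt.time         # datetime.time objects
df = df.drop(columns="A")

print(df)
```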
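You might think of using %-m instead of %m, as with strftime(). However, %-m is a platform-specific extension for formatting output; when parsing, %m and %d already accept single-digit values, so the change that matters is %Y in place of %y.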
