Pandas datetimes with different formats in the same column - python

I have a pandas data frame which has datetimes with 2 different formats e.g.:
3/14/2019 5:15:32 AM
2019-08-03 05:15:35
2019-01-03 05:15:33
2019-01-03 05:15:33
2/28/2019 5:15:31 AM
2/27/2019 11:18:39 AM
...
I have tried various formats but get errors like ValueError: unconverted data remains: AM
I would like to get the format as 2019-02-28 and have the time removed

You can use pd.to_datetime().dt.strftime() to efficiently convert the entire column to datetimes and then back to strings, with pandas inferring the date format of each value:
df = pd.Series('''3/14/2019 5:15:32 AM
2019-08-03 05:15:35
2019-01-03 05:15:33
2019-01-03 05:15:33
2/28/2019 5:15:31 AM
2/27/2019 11:18:39 AM'''.split('\n'), name='date', dtype=str).to_frame()
print(pd.to_datetime(df.date).dt.strftime('%Y-%m-%d'))
0 2019-03-14
1 2019-08-03
2 2019-01-03
3 2019-01-03
4 2019-02-28
5 2019-02-27
Name: date, dtype: object
If that doesn't give you what you want, you will need to identify the different kinds of formats and apply different settings when you convert them to datetime objects:
# Classify date column by format type
df['format'] = 1
df.loc[df.date.str.contains('/'), 'format'] = 2
df['new_date'] = pd.to_datetime(df.date)  # placeholder column, overwritten below
# Convert to datetime with two different format settings
# (%I with %p handles the 12-hour clock correctly; %p is ignored when combined with %H)
df.loc[df.format == 1, 'new_date'] = pd.to_datetime(df.loc[df.format == 1, 'date'], format='%Y-%d-%m %H:%M:%S').dt.strftime('%Y-%m-%d')
df.loc[df.format == 2, 'new_date'] = pd.to_datetime(df.loc[df.format == 2, 'date'], format='%m/%d/%Y %I:%M:%S %p').dt.strftime('%Y-%m-%d')
print(df)
date format new_date
0 3/14/2019 5:15:32 AM 2 2019-03-14
1 2019-08-03 05:15:35 1 2019-03-08
2 2019-01-03 05:15:33 1 2019-03-01
3 2019-01-03 05:15:33 1 2019-03-01
4 2/28/2019 5:15:31 AM 2 2019-02-28
5 2/27/2019 11:18:39 AM 2 2019-02-27
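A variant of the two-format approach that skips the helper column: parse each format with errors='coerce' and combine the results (a sketch; it assumes the dash-separated values are ISO year-month-day, and uses %I/%p for the 12-hour clock):

```python
import pandas as pd

s = pd.Series(['3/14/2019 5:15:32 AM', '2019-08-03 05:15:35',
               '2/28/2019 5:15:31 AM'], name='date')
# Each pass parses only the rows matching its format; the rest become NaT
d1 = pd.to_datetime(s, format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
d2 = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S', errors='coerce')
out = d1.fillna(d2).dt.strftime('%Y-%m-%d')
```

Because each pass names its format explicitly, a string matching neither stays NaT instead of being silently mis-parsed.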

Assume that the column name in your DataFrame is DatStr.
The key to success is a proper conversion function, to be
applied to each date string:
def datCnv(src):
    return pd.to_datetime(src)
Then all you should do to create a true date column is to call:
df['Dat'] = df.DatStr.apply(datCnv)
When you print the DataFrame, the result is:
DatStr Dat
0 3/14/2019 5:15:32 AM 2019-03-14 05:15:32
1 2019-08-03 05:15:35 2019-08-03 05:15:35
2 2019-01-03 05:15:33 2019-01-03 05:15:33
3 2019-01-03 05:15:33 2019-01-03 05:15:33
4 2/28/2019 5:15:31 AM 2019-02-28 05:15:31
5 2/27/2019 11:18:39 AM 2019-02-27 11:18:39
Note that the to_datetime function is clever enough to recognize the
actual date format used in each case.

I had a similar issue. Luckily for me, the different formats occurred on alternating rows, so I could easily slice with .iloc. However, you could also slice the Series with .loc and a filter (detecting each format).
Then you can combine the rows with pd.concat. The order will be the same as for the rest of the DataFrame (if you assign it). If there are missing indices they will become NaN, if there are duplicated indices pandas will raise an error.
df["datetime"] = pd.concat([
pd.to_datetime(df["Time"].str.slice(1).iloc[1::2], format="%y-%m-%d %H:%M:%S.%f"),
pd.to_datetime(df["Time"].str.slice(1).iloc[::2], format="%y-%m-%d %H:%M:%S"),
])

I know it's a bit late for an answer, but I discovered a simpler way to do the same:
# errors='ignore' leaves values that don't match the format untouched, so the
# second call picks up the rows the first one could not parse.
# Note: errors='ignore' is deprecated in pandas 2.x, and astype('datetime64[D]')
# only truncates to day precision on older pandas versions.
df["date"] = pd.to_datetime(df["date"], format='%Y-%d-%m %H:%M:%S', errors='ignore').astype('datetime64[D]')
df["date"] = pd.to_datetime(df["date"], format='%m/%d/%Y %H:%M:%S %p', errors='ignore').astype('datetime64[D]')
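On pandas 2.0+, where errors='ignore' is deprecated, format='mixed' offers a similar one-liner (a sketch; it infers the format of each element individually, so it is slower than an explicit format):

```python
import pandas as pd

s = pd.Series(['3/14/2019 5:15:32 AM', '2019-08-03 05:15:35'])
# format='mixed' requires pandas >= 2.0; .dt.normalize() drops the time part
out = pd.to_datetime(s, format='mixed').dt.normalize()
```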

Related

Pandas read_excel function ignoring dtype

I'm trying to read an excel file with pd.read_excel().
The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.
Example of the excel file
I've tried to specify the dtype or the converters arguments to no avail.
df = pd.read_excel('xls_test.xlsx',
dtype={'Date':str,'Time':str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
As you can see the Date column is not treated as str...
Same thing when using converters
df = pd.read_excel('xls_test.xlsx',
converters={'Date':str,'Time':str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
I have also tried to use other engine but the result is always the same.
The dtype argument seems to work as expected when reading a csv though
What am I doing wrong here?
Edit:
I forgot to mention: I'm using the latest version of pandas (1.2.2), but I had the same problem before updating from 1.1.2.
Here is a simple solution. Even if you pass "str" as the dtype, the column still comes back as object. Use the code below to read the columns with the string Dtype instead.
df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})
To learn more about the pandas string Dtype, see:
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
Let me know if you have any issues with that!
The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.
This means that, whatever you do with the dtype= argument, the data will be loaded as a date, and then converted to string, giving you the result you see:
>>> pd.read_excel('test.xlsx').head()
Date Time Datetime
0 2020-03-08 10:00:00 2020-03-08 10:00:00
1 2020-03-09 11:00:00 2020-03-09 11:00:00
2 2020-03-10 12:00:00 2020-03-10 12:00:00
3 2020-03-11 13:00:00 2020-03-11 13:00:00
4 2020-03-12 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date datetime64[ns]
Time object
Datetime datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
Date Time Datetime
0 2020-03-08 00:00:00 10:00:00 2020-03-08 10:00:00
1 2020-03-09 00:00:00 11:00:00 2020-03-09 11:00:00
2 2020-03-10 00:00:00 12:00:00 2020-03-10 12:00:00
3 2020-03-11 00:00:00 13:00:00 2020-03-11 13:00:00
4 2020-03-12 00:00:00 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date string
Time string
Datetime string
dtype: object
Only in csv files is datetime data stored as a string in the file, so there loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime().
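For example, to reproduce a dd/mm/yy display after loading (a sketch with inline data standing in for the workbook):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2020-03-08', '2020-03-09'])})
# The column loads as datetime64; strftime re-applies the display format
df['Date'] = df['Date'].dt.strftime('%d/%m/%y')
```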
That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:
load data
re-apply formatting
There is some translation to be done between formatting types, and you can't use pandas directly; however, you can use the engine that pandas uses as a backend:
import datetime
import re

import openpyxl

date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}
time_corresp = {
    'hh': '%H',  # note: '%h' is not a valid strftime directive
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    """Render a date/time cell as the string Excel would display."""
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')
    # Translate each Excel format token into its strftime equivalent
    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)
    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)
    return cell.value.strftime(dfmt + tfmt)
Which you can then use as follows:
>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'
(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)
* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.
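As an aside, such a serial number can be converted back in pandas by supplying the Excel epoch as the origin (a sketch; 1899-12-30 is the effective origin once Excel's phantom 1900 leap day is accounted for):

```python
import pandas as pd

# Excel serial 43898 should correspond to 2020-03-08
ts = pd.to_datetime(43898, unit='D', origin='1899-12-30')
```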
As the other comments say, the issue is most likely a bug.
It's not ideal, but you could always do something like this:
import pandas as pd
# df = pd.read_excel('test.xlsx', dtype={'Date': str, 'Time': str})
# can be simplified to:
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)
Date Time
0 "2020-03-08 00:00:00" "10:00:00"
1 "2020-03-09 00:00:00" "11:00:00"
2 "2020-03-10 00:00:00" "12:00:00"
3 "2020-03-11 00:00:00" "13:00:00"
4 "2020-03-12 00:00:00" "14:00:00"
5 "2020-03-13 00:00:00" "15:00:00"
6 "2020-03-14 00:00:00" "16:00:00"
7 "2020-03-15 00:00:00" "17:00:00"
8 "2020-03-16 00:00:00" "18:00:00"
9 "2020-03-17 00:00:00" "19:00:00"
10 "2020-03-18 00:00:00" "20:00:00"
11 "2020-03-19 00:00:00" "21:00:00"
object
object
Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).
And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).
All dtypes can now be converted to StringDtype
You just need to specify dtype="string" when loading your data with pandas:
>>> df = pd.read_excel('xls_test.xlsx', dtype="string")
>>> df.dtypes
Date string
Time string
dtype: object

select pandas dataframe datetime column based on times

I have the following dataframe:
timestamp mes
0 2019-01-01 18:15:55.700 1228
1 2019-01-01 18:35:56.872 1402
2 2019-01-01 18:35:56.872 1560
3 2019-01-01 19:04:25.700 1541
4 2019-01-01 19:54:23.150 8754
5 2019-01-02 18:01:00.025 4124
6 2019-01-02 18:17:56.125 9736
7 2019-01-02 18:58:59.799 1597
8 2019-01-02 20:10:15.896 5285
How can I select only the rows where timestamp is between a start_time and end_time, for all the days in the dataframe? Basically the same role of .between_time() but here the timestamp column can't be the index since there are repeated values.
Also, this is actually a chunk from pd.read_csv() and I would have to do this for several millions of them, would it be faster if I used for example numpy datetime functionalities? I guess I could create from timestamp a time column and create a mask on that, but I'm afraid this would be too slow.
EDIT:
I added more rows and this is the expected result, say for start_time=datetime.time(18), end_time=datetime.time(19):
timestamp mes
0 2019-01-01 18:15:55.700 1228
1 2019-01-01 18:35:56.872 1402
2 2019-01-01 18:35:56.872 1560
5 2019-01-02 18:01:00.025 4124
6 2019-01-02 18:17:56.125 9736
7 2019-01-02 18:58:59.799 1597
My code (works but is slow):
df['time'] = df.timestamp.apply(lambda x: x.time())
mask = (df.time<end) & (df.time>=start)
selected = df.loc[mask]
If you have your column set to datetime:
start = df["timestamp"] >= "2019-01-01 18:15:55.700"
end = df["timestamp"] <= "2019-01-01 18:15:55.896 "
between_two_dates = start & end
df.loc[between_two_dates]
Works for me. Just convert timestamp to datetime and set it as the index:
df = pd.DataFrame({'timestamp': ['2019-01-01 18:15:55.700', '2019-01-01 18:17:55.700', '2019-01-01 18:19:55.896'], 'mes': [1228, 1402, 1560]})  # data
df['timestamp'] = pd.to_datetime(df['timestamp'])  # coerce timestamp to datetime
df.set_index('timestamp', inplace=True)  # set timestamp as the index
df.between_time('18:16', '20:15')  # select between times
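Alternatively, the per-row apply in the question can be replaced with a vectorized comparison on .dt.time, which keeps the column out of the index (a sketch on a cut-down version of the question's data):

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2019-01-01 18:15:55.700', '2019-01-01 19:54:23.150',
        '2019-01-02 18:17:56.125', '2019-01-02 20:10:15.896']),
    'mes': [1228, 8754, 9736, 5285],
})
start, end = datetime.time(18), datetime.time(19)
# .dt.time extracts plain datetime.time objects, comparable elementwise
t = df['timestamp'].dt.time
selected = df[(t >= start) & (t < end)]
```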

How to homogenize date type in a pandas dataframe column?

I have a Date column in my dataframe having dates with 2 different types (YYYY-DD-MM 00:00:00 and YYYY-DD-MM) :
Date
0 2023-01-10 00:00:00
1 2024-27-06
2 2022-07-04 00:00:00
3 NaN
4 2020-30-06
(you can use pd.read_clipboard(sep='\s\s+') after copying the previous dataframe to get it in your notebook)
I would like to have only a YYYY-MM-DD type. Consequently, I would like to have :
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaN
4 2020-06-30
How could I do this, please?
Use Series.str.replace with to_datetime and the format parameter:
df['Date'] = pd.to_datetime(df['Date'].str.replace(' 00:00:00',''), format='%Y-%d-%m')
print (df)
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaT
4 2020-06-30
Another idea is to match both formats:
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)

Is it possible to set a date when converting with pandas.to_datetime?

I have an array that looks like this:
array([(b'03:35:05.397191'),
(b'03:35:06.184700'),
(b'03:35:08.642503'), ...,
(b'05:47:15.285806'),
(b'05:47:20.189460'),
(b'05:47:30.598514')],
dtype=[('Date', 'S15')])
I want to convert it into a dataframe, using to_datetime. I could do that by simply doing this:
df = pd.DataFrame( array )
df['Date'] = pd.to_datetime( df['Date'].str.decode("utf-8") )
>>> df.Date
0 2018-03-07 03:35:05.397191
1 2018-03-07 03:35:06.184700
2 2018-03-07 03:35:08.642503
3 2018-03-07 03:35:09.155030
4 2018-03-07 03:35:09.300029
5 2018-03-07 03:35:09.303031
The problem is that it automatically sets the date as today. Is it possible to set the date as a different day, for example, 2015-01-25?
Instead of using pd.to_datetime, use pd.to_timedelta and add a date.
pd.to_timedelta(df.Date.str.decode("utf-8")) + pd.to_datetime('2017-03-15')
0 2017-03-15 03:35:05.397191
1 2017-03-15 03:35:06.184700
2 2017-03-15 03:35:08.642503
3 2017-03-15 05:47:15.285806
4 2017-03-15 05:47:20.189460
5 2017-03-15 05:47:30.598514
Name: Date, dtype: datetime64[ns]
Try this:
df['Date'] = pd.to_datetime( df['Date'].str.decode("utf-8") ).apply(lambda x: x.replace(year=2015, month=1, day=25))
Incorporating #Wen's solution for correctness :)
you could create a string with complete date-time and parse, like:
df = pd.DataFrame( array )
df['Date'] = pd.to_datetime( '20150125 ' + df['Date'].str.decode("utf-8") )
Hmm, seems like it works :-)
pd.to_datetime(df['Date'].str.decode("utf-8"))-(pd.to_datetime('today')-pd.to_datetime('2015-01-25'))
Out[376]:
0 2015-01-25 03:35:05.397191
1 2015-01-25 03:35:06.184700
2 2015-01-25 03:35:08.642503
3 2015-01-25 05:47:15.285806
4 2015-01-25 05:47:20.189460
5 2015-01-25 05:47:30.598514
Name: Date, dtype: datetime64[ns]

Handle multiple date formats in pandas dataframe

I have a dataframe (imported from Excel) which looks like this:
Date Period
0 2017-03-02 2017-03-01 00:00:00
1 2017-03-02 2017-04-01 00:00:00
2 2017-03-02 2017-05-01 00:00:00
3 2017-03-02 2017-06-01 00:00:00
4 2017-03-02 2017-07-01 00:00:00
5 2017-03-02 2017-08-01 00:00:00
6 2017-03-02 2017-09-01 00:00:00
7 2017-03-02 2017-10-01 00:00:00
8 2017-03-02 2017-11-01 00:00:00
9 2017-03-02 2017-12-01 00:00:00
10 2017-03-02 Q217
11 2017-03-02 Q317
12 2017-03-02 Q417
13 2017-03-02 Q118
14 2017-03-02 Q218
15 2017-03-02 Q318
16 2017-03-02 Q418
17 2017-03-02 2018
I am trying to convert the whole 'Period' column into a consistent format. Some elements already look like datetimes, others were converted to strings (e.g. Q217), others to ints (e.g. 2018). What is the fastest way to convert everything into datetimes? I was trying with some masking, like this:
mask = df['Period'].str.startswith('Q', na=False)
list_quarter = df_final[mask]['Period'].tolist()
quarter_convert = {'1': '31/03', '2': '30/06', '3': '31/08', '4': '30/12'}
counter = 0
for element in list_quarter:
    element = element[1:]          # drop the leading 'Q'
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = daymonth + '/' + year
    list_quarter[counter] = final
    counter += 1
However it fails when I try to substitute the modified elements in the original column:
df_nwe_final['Period'] = np.where(mask, pd.Series(list_quarter), df_nwe_final['Period'])
Of course I would need to do more or less the same with the 2018 type formats. However, I am sure I am missing something here, and there should be a much faster solution. Some fresh ideas from you would help! Thank you.
Reusing the code you show, let's first write a function that converts the Q-string to a datetime format (I adjusted the final format a little bit):
def convert_q_string(element):
    quarter_convert = {'1': '03-31', '2': '06-30', '3': '08-31', '4': '12-30'}
    element = element[1:]          # drop the leading 'Q'
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = '20' + year + '-' + daymonth
    return final
We can now use this to first convert all 'Q'-strings, and then pd.to_datetime to convert all elements to proper datetime values:
In [2]: s = pd.Series(['2017-03-01 00:00:00', 'Q217', '2018'])
In [3]: mask = s.str.startswith('Q')
In [4]: s[mask] = s[mask].map(convert_q_string)
In [5]: s
Out[5]:
0 2017-03-01 00:00:00
1 2017-06-30
2 2018
dtype: object
In [6]: pd.to_datetime(s)
Out[6]:
0 2017-03-01
1 2017-06-30
2 2018-01-01
dtype: datetime64[ns]
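A quarter string like 'Q217' can also be parsed without a lookup table by reordering it into a form PeriodIndex understands (a sketch, assuming calendar quarter ends are wanted rather than the question's custom end dates):

```python
import pandas as pd

s = pd.Series(['Q217', 'Q318'])
# 'Q217' -> '2017Q2'; PeriodIndex then knows each quarter's end date
p = pd.PeriodIndex('20' + s.str[2:] + 'Q' + s.str[1], freq='Q')
ends = p.to_timestamp(how='end').normalize()
```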
