Python Pandas reading time as decimal value

When I read time data from an xlsx file into pandas, it comes in as a decimal value.
Example: 9:23:27 AM is read as 0.391284722.
I can fix it in Excel by formatting the cell as a time, but I would prefer to use pandas all the way through and not Excel.
When I take the value and convert it into a datetime object with
df.TIME = pd.to_datetime(df.TIME)
it changes to the date 1970-01-01.
The desired time is 9:23:27 AM.
Any help is greatly appreciated.
Thank you

Demo:
Read that column as a string:
df = pd.read_excel(filename, dtype={'col_name': str})
In [51]: df
Out[51]:
          time
0   9:23:27 AM
1  12:59:59 AM
In [52]: df['time2'] = pd.to_timedelta(df['time'])
In [53]: df
Out[53]:
          time    time2
0   9:23:27 AM 09:23:27
1  12:59:59 AM 12:59:59
In [54]: df.dtypes
Out[54]:
time              object
time2    timedelta64[ns]
dtype: object
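If an actual time-of-day object is preferred over a timedelta, a small follow-up sketch (my addition, not part of the original answer): anchor the timedelta to an arbitrary midnight and keep only the clock-time component.
# the 1970-01-01 base is arbitrary; .dt.time discards the date part
df['time3'] = (pd.Timestamp('1970-01-01') + df['time2']).dt.time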
UPDATE: the float read from Excel is a fraction of a 24-hour day (not a number of seconds), so multiply it by 86400 to convert it to seconds:
Source DF:
In [85]: df
Out[85]:
       time
0  0.391285
1  0.391285
2  0.391285
Solution:
In [94]: df['time2'] = pd.to_timedelta((df['time'] * 86400).round(), unit='s')
In [95]: df
Out[95]:
       time    time2
0  0.391285 09:23:27
1  0.391285 09:23:27
2  0.391285 09:23:27
In [96]: df.dtypes
Out[96]:
time             float64
time2    timedelta64[ns]
dtype: object
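As a sanity check on the 86400 factor: a day has 24 * 60 * 60 = 86400 seconds, so
0.391284722 * 86400               # 33806.999... -> rounds to 33807 seconds
33807 == 9 * 3600 + 23 * 60 + 27  # True: exactly 9:23:27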

The question could use some clarification about the end purpose of the time column. For general purposes though, try the format keyword in to_datetime:
df.TIME = pd.to_datetime(df.TIME, format='%I:%M:%S %p')
See this website for the format codes: http://strftime.org/
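A quick runnable check of that format string (a sketch with made-up sample values; to_datetime fills in a default date of 1900-01-01, so .dt.time extracts just the clock time):
import pandas as pd

s = pd.Series(['9:23:27 AM', '12:59:59 AM'])  # hypothetical sample values
parsed = pd.to_datetime(s, format='%I:%M:%S %p')
print(parsed.dt.time)
# 0    09:23:27
# 1    00:59:59
# dtype: object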

Related

datetime.timestamp returns different values in pandas apply and dataframe selection

Question
See code below demonstrating the issue. A simple pandas dataframe is created with one row and one column containing one datetime instance. As you can see, calling timestamp() on the datetime object returns 1581894000.0. Selecting the datetime object through the dataframe and calling timestamp() gives 1581897600.0. When using pandas apply function to call datetime.timestamp on each row of column 'date', the return value becomes 1581894000.0. I would expect to get the same timestamp value in all situations.
In[19]: d = datetime(2020, 2, 17)
In[20]: d.timestamp()
Out[20]: 1581894000.0 <----------------------------------+
In[21]: df = pd.DataFrame({'date': [d]}) |
In[22]: df |
Out[22]: |
date |
0 2020-02-17 |
In[23]: df['date'][0] |
Out[23]: Timestamp('2020-02-17 00:00:00') |
In[24]: df['date'][0].timestamp() |
Out[24]: 1581897600.0 <---------------------- These should be the same
In[25]: df['date'].apply(datetime.timestamp) |
Out[25]: |
0 1.581894e+09 |
Name: date, dtype: float64 |
In[26]: df['date'].apply(datetime.timestamp)[0] |
Out[26]: 1581894000.0 <----------------------------------+
Edit
Thanks to input from @ALollz, using to_datetime and Timestamp from pandas as shown below seems to fix the problem.
In[15]: d = pd.to_datetime(datetime(2020,2,17))
In[16]: d.timestamp()
Out[16]: 1581897600.0
In[17]: df = pd.DataFrame({'date': [d]})
In[18]: df
Out[18]:
date
0 2020-02-17
In[19]: df['date'][0]
Out[19]: Timestamp('2020-02-17 00:00:00')
In[20]: df['date'][0].timestamp()
Out[20]: 1581897600.0
In[21]: df['date'].apply(pd.Timestamp.timestamp)
Out[21]:
0 1.581898e+09
Name: date, dtype: float64
In[22]: df['date'].apply(pd.Timestamp.timestamp)[0]
Out[22]: 1581897600.0
The problem is timezone awareness. pandas doesn't always play well with the datetime module, and some of its decisions diverge from the standard library; in this case, how to deal with timezone-unaware datetime objects.
This specific issue seems to have been a design choice, based on this open issue:
Yah, for tz-naive we implement timestamp as if it were UTC. Among other things, this ensures that we get the same behavior regardless of where the code is running.
So to get a consistent answer you'd need a UTC localized timezone so that datetime.timestamp used that instead of your machine's local timezone.
from datetime import datetime

import pandas as pd
import pytz

my_date = datetime(2020, 2, 17)
my_date_aware = pytz.utc.localize(my_date)
# UTC-aware agrees with pandas
datetime.timestamp(my_date_aware) - pd.to_datetime(my_date).timestamp()
# 0.0
# a naive datetime.timestamp uses the machine's local timezone
# (18000 s = 5 h, so this machine is UTC-5)
datetime.timestamp(my_date) - pd.to_datetime(my_date).timestamp()
# 18000.0
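The same check without pytz, using the standard library's timezone (a sketch, not part of the original answer):
from datetime import datetime, timezone
import pandas as pd

# attaching UTC explicitly makes datetime.timestamp ignore local time
aware = datetime(2020, 2, 17, tzinfo=timezone.utc)
aware.timestamp() - pd.to_datetime(datetime(2020, 2, 17)).timestamp()
# 0.0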

Change comma separated date format 20,190,927 in a dataframe?

I have a data frame with date columns as 20,190,927 which means: 2019/09/27.
I need to change the format to YYYY/MM/DD or something similar.
I thought of doing it manually like:
x = df_all['CREATION_DATE'].str[:2] + df_all['CREATION_DATE'].str[3:5] + "-" + \
df_all['CREATION_DATE'].str[5] + df_all['CREATION_DATE'].str[7] + "-" + df_all['CREATION_DATE'].str[8:]
print(x)
What's a more creative way of doing this? Could it be done with datetime module?
I believe this is what you want. First replace the commas with nothing so you get a yyyymmdd format, then convert to datetime with pd.to_datetime, passing the matching format. One-liner:
df['dates'] = pd.to_datetime(df['dates'].str.replace(',',''),format='%Y%m%d')
Full explanation:
import pandas as pd
a = {'dates':['20,190,927','20,191,114'],'values':[1,2]}
df = pd.DataFrame(a)
print(df)
Output; here's what the original DataFrame looks like:
        dates  values
0  20,190,927       1
1  20,191,114       2
df['dates'] = df['dates'].str.replace(',','')
df['dates'] = pd.to_datetime(df['dates'],format='%Y%m%d')
print(df)
print(df.info())
Output of the newly formatted DataFrame:
       dates  values
0 2019-09-27       1
1 2019-11-14       2
Printing .info() to ensure we have the correct format:
dates 2 non-null datetime64[ns]
values 2 non-null int64
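Since the asker wanted YYYY/MM/DD text specifically, one more step may help (a sketch building on the df above): once the column is datetime64, dt.strftime renders it in that shape. Note this turns the column back into plain strings.
df['dates'].dt.strftime('%Y/%m/%d')
# 0    2019/09/27
# 1    2019/11/14
# Name: dates, dtype: object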
Hope this helps,
import pandas as pd

date = ['20,190,927', '20,190,928', '20,190,929']
df3 = pd.DataFrame(date, columns=['Date'])
df3['Date'] = df3['Date'].replace(r'\,', '', regex=True)
df3['Date'] = pd.to_datetime(df3['Date'])
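To answer the "could it be done with the datetime module?" part directly, a per-value sketch with strptime (for whole columns the vectorised pd.to_datetime above is preferable):
from datetime import datetime

raw = '20,190,927'  # one value from the question
parsed = datetime.strptime(raw.replace(',', ''), '%Y%m%d')
print(parsed.strftime('%Y/%m/%d'))  # 2019/09/27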

Converting dataframe column of datetime data to DD/MM/YYYY string data

I have a dataframe column with datetime data in 1980-12-11T00:00:00 format.
I need to convert the whole column to DD/MM/YYYY string format.
Is there any easy code for this?
Creating a working example:
df = pd.DataFrame({'date': ['1980-12-11T00:00:00', '1990-12-11T00:00:00', '2000-12-11T00:00:00']})
print(df)
                  date
0  1980-12-11T00:00:00
1  1990-12-11T00:00:00
2  2000-12-11T00:00:00
Convert the column to datetime with pd.to_datetime() and invoke strftime():
df['date_new'] = pd.to_datetime(df.date).dt.strftime('%d/%m/%Y')
print(df)
                  date    date_new
0  1980-12-11T00:00:00  11/12/1980
1  1990-12-11T00:00:00  11/12/1990
2  2000-12-11T00:00:00  11/12/2000
You can use pd.to_datetime to convert string data to datetimes:
pd.to_datetime(df['col'])
You can then render the result in a specific string format:
pd.to_datetime(df['col']).dt.strftime('%d/%m/%Y')
When using pandas, try pandas.to_datetime:
import pandas as pd
df = pd.DataFrame({'date': ['1980-12-%sT00:00:00'%i for i in range(10,20)]})
df.date = pd.to_datetime(df.date).dt.strftime("%d/%m/%Y")
print(df)
         date
0  10/12/1980
1  11/12/1980
2  12/12/1980
3  13/12/1980
4  14/12/1980
5  15/12/1980
6  16/12/1980
7  17/12/1980
8  18/12/1980
9  19/12/1980

Converting part of rows of dataframe from numbers to datetime, but got weird numbers

I'm trying to convert some rows of dataframe from numbers to datetime, but got weird numbers.
import pandas as pd
import datetime as dt
df = pd.DataFrame({'col': [dt.datetime(2018,1,1), 1.2, 3.2, 2.1]})
mask = df['col'].apply(lambda x:type(x)==float) # find rows that are numbers
# convert numbers to datetime
df.loc[mask, 'col'] = df.loc[mask, 'col'].apply(
    lambda x: dt.datetime(2018, 5, 1) + dt.timedelta(days=(x * 365)))
print(df)
                   col
0  2018-01-01 00:00:00
1  1562976000000000000
2  1626048000000000000
3  1591358400000000000
Why do I get huge numbers in rows 1-3? I guess the reason is that the element types differ between rows, but I really want to make the change within the DataFrame. Any suggestions? Thanks!
The reason is because you have a column of mixed dtypes (datetimes and floats). Pandas, being confused, assumes that the values you're assigning are also floats, and attempts to convert the datetimes in index 1 through 3 to numbers (what you see is the number of nanoseconds since the epoch in 1970).
Here's a vectorised fix using pd.to_numeric, pd.to_timedelta, and pd.to_datetime:
(pd.to_timedelta(pd.to_numeric(df.col, errors='coerce')) * 365
 + pd.to_datetime('2018-05-01')).fillna(df.col)
0   2018-01-01 00:00:00.000000000
1   2018-05-01 00:00:00.000000365
2   2018-05-01 00:00:00.000001095
3   2018-05-01 00:00:00.000000730
Name: col, dtype: datetime64[ns]
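Note the nanosecond-scale results above: pd.to_timedelta treats bare numbers as nanoseconds by default. If the intent of the original apply was x * 365 days, a sketch passing unit='D' explicitly (my adjustment, not part of the answer above):
import pandas as pd

nums = pd.to_numeric(df['col'], errors='coerce')  # floats; the datetime row becomes NaN
fixed = (pd.to_datetime('2018-05-01')
         + pd.to_timedelta(nums * 365, unit='D')).fillna(df['col'])
# row 0 keeps 2018-01-01; rows 1-3 land 438, 1168, and 766.5 days after 2018-05-01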
I'm not terribly familiar with pandas, but it looks like the datetime series you are creating in df.loc[mask, 'col'].apply(lambda x: dt.datetime(2018,5,1) + dt.timedelta(days=(x*365))) is getting implicitly cast to an integer data type when it is assigned into df.loc[mask, 'col']. I'm not sure why Pandas would do this, but that seems to be what's causing your problem. Here is a quick solution:
import pandas as pd
import datetime as dt
df = pd.DataFrame({'col': [dt.datetime(2018, 1, 1), 1.2, 3.2, 2.1]})
df['col'] = df['col'].apply(lambda x: dt.datetime(2018, 5, 1) + dt.timedelta(days=(x * 365)) if type(x) == float else x)
What I find confusing is why Pandas converts some of the elements of the series (elements at index 1-3) to integers, while leaving other elements (the element at index 0) as is. In other words, why convert the elements being assigned into the series (df.loc[mask, 'col'].apply(lambda x: dt.datetime(2018, 5, 1) + dt.timedelta(days=(x * 365)))) from datetimes to integers, while not converting the element that already exists in the series from datetime to integer? Seems unintuitive to me, but maybe I'm missing something. @coldspeed, can you clarify?

Using 2 pandas columns as arguments for np.timedelta

Simple question:
In [1]:
df = DataFrame({'value':[4,4,4],'unit':['D','W','Y']})
df
Out[1]:
  unit  value
0    D      4
1    W      4
2    Y      4
I can create timedeltas this way (of course):
In [2]:
timedelta64(4, 'D')
Out[2]:
numpy.timedelta64(4,'D')
But I'm not being able to iterate through DataFrame columns to get a resulting Series with timedeltas:
def f(x):
    return timedelta64(x['value'], x['unit'])

df.apply(f, axis=1)
Instead, I'm getting:
TypeError: don't know how to convert scalar number to float
EDIT:
This also does not work, and returns the same error:
df['arg'] = zip(df.value, df.unit)
df.arg.apply(lambda x: timedelta64(x[0], x[1]))
So your code works for me.
df = pd.DataFrame({'value':[4,4,4],'unit':['D','W','Y']})
df.apply(f, axis=1)
0     4 days
1    4 weeks
2    4 years
dtype: object
Here's my versions:
numpy.__version__
'1.8.0'
pandas.__version__
'0.13.0rc1-32-g81053f9'
I did notice a bug that is perhaps related to your issue. You might check if you have numpy 1.7; if so, upgrade to 1.8 and see if that fixes the issue. Good luck :)
https://github.com/pydata/pandas/issues/5689
In 0.13 this is supported using the new pd.to_timedelta:
In [24]: df = DataFrame({'value':[4,4,4],'unit':['D','W','Y']})
In [25]: pd.to_timedelta(df.apply(lambda x: np.timedelta64(x['value'],x['unit']), axis=1))
Out[25]:
0      4 days, 00:00:00
1     28 days, 00:00:00
2   1460 days, 23:16:48
dtype: timedelta64[ns]
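If the apply + np.timedelta64 route feels heavy, a sketch that builds strings pd.to_timedelta can parse directly (assumption: only fixed-length units such as 'D' and 'W'; calendar units like 'Y' are not accepted by to_timedelta, which is why the answer above goes through np.timedelta64):
import pandas as pd

df = pd.DataFrame({'value': [4, 4], 'unit': ['D', 'W']})
pd.to_timedelta(df['value'].astype(str) + df['unit'])
# 0    4 days
# 1   28 days
# dtype: timedelta64[ns]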
