Convert float64 column to datetime pandas - python

I have the following pandas DataFrame column dfA['TradeDate']:
0 20100329.0
1 20100328.0
2 20100329.0
...
and I wish to transform it to a datetime.
Based on another thread on SO, I first convert it to a string and then apply the strptime function.
dfA['TradeDate'] = datetime.datetime.strptime( dfA['TradeDate'].astype('int').to_string() ,'%Y%m%d')
However, this raises a ValueError saying that my format is incorrect.
One issue I spotted is that the column is not converted to a proper string dtype, but to object.
When I try:
dfA['TradeDate'] = datetime.datetime.strptime( dfA['TradeDate'].astype(int).astype(str),'%Y%m%d')
It fails with an error saying the argument must be str, not Series.

You can use:
df['TradeDate'] = pd.to_datetime(df['TradeDate'], format='%Y%m%d.0')
print (df)
TradeDate
0 2010-03-29
1 2010-03-28
2 2010-03-29
If the column contains bad values, add errors='coerce' to replace them with NaT:
print (df)
TradeDate
0 20100329.0
1 20100328.0
2 20100329.0
3 20153030.0
4 yyy
df['TradeDate'] = pd.to_datetime(df['TradeDate'], format='%Y%m%d.0', errors='coerce')
print (df)
TradeDate
0 2010-03-29
1 2010-03-28
2 2010-03-29
3 NaT
4 NaT
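A variation on the approach above, as a minimal sketch: instead of matching the literal '.0' in the format string, the floats are first turned into plain digit strings. The helper name floats_to_dates and the sample values are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd


def floats_to_dates(values):
    """Parse values such as 20100329.0 as dates; anything unparseable becomes NaT."""
    # Non-numeric entries (for example 'yyy') become NaN here.
    numeric = pd.to_numeric(values, errors="coerce")
    # Drop the fractional part and render the digits; keep missing values missing.
    text = numeric.map(lambda v: str(int(v)) if pd.notna(v) else None)
    # Impossible dates such as 20153030 are coerced to NaT.
    return pd.to_datetime(text, format="%Y%m%d", errors="coerce")


sample = pd.Series([20100329.0, 20100328.0, 20100329.0, 20153030.0, "yyy"])
result = floats_to_dates(sample)
print(result)
```

This should give the same dates as the errors='coerce' example, with NaT in the last two rows.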

You can use to_datetime with a custom format on a string representation of the values:
import pandas as pd
pd.to_datetime(pd.Series([20100329.0, 20100328.0, 20100329.0]).astype(str), format='%Y%m%d.0')

The strptime function works on a single value, not on a Series, so you would need to apply it to each element of the column.
Try the pandas.to_datetime method instead,
e.g.
dfA = pandas.DataFrame({"TradeDate" : [20100329.0,20100328.0]})
pandas.to_datetime(dfA['TradeDate'], format = "%Y%m%d")
or
dfA['TradeDate'].astype(int).astype(str)\
.apply(lambda x:datetime.datetime.strptime(x,'%Y%m%d'))

In your first attempt you converted the column to a string and passed it to strptime, which raised a ValueError. This happens because dfA['TradeDate'].astype('int').to_string() creates a single string containing all the dates as well as their row numbers. You can change this to
dates = dfA['TradeDate'].astype('int').to_string(index=False).split()
dates
['20100329', '20100328', '20100329']
to get a list of dates. Then use a Python list comprehension to convert each element to a datetime:
dfA['TradeDate'] = [datetime.datetime.strptime(x, '%Y%m%d') for x in dates]
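To make the difference concrete, here is a small sketch that contrasts parsing the output of to_string() as a whole with parsing the values one at a time. The variable names and the three sample values are assumptions chosen to mirror the question.

```python
import datetime

import pandas as pd

trade_dates = pd.Series([20100329.0, 20100328.0, 20100329.0])

# to_string() renders the whole Series, index included, as one multi-line string.
rendered = trade_dates.astype(int).to_string()

try:
    datetime.datetime.strptime(rendered, "%Y%m%d")
    parsed_whole = True
except ValueError:
    # The combined text does not match a single date format.
    parsed_whole = False

# Parsing element by element works because each item is a plain digit string.
parsed_each = [
    datetime.datetime.strptime(text, "%Y%m%d")
    for text in trade_dates.astype(int).astype(str)
]
print(parsed_whole, parsed_each)
```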

Related

How to convert a pandas datetime column from UTC to EST

There is another question that is eleven years old with a similar title.
I have a pandas dataframe with a column of datetime.time values.
val time
a 12:30:01.323
b 12:48:04.583
c 14:38:29.162
I want to convert the time column from UTC to EST.
I tried to do dataframe.tz_localize('utc').tz_convert('US/Eastern') but it gave me the following error: RangeIndex Object has no attribute tz_localize
tz_localize and tz_convert work on the index of the DataFrame. So you can do the following:
1. Convert the "time" column to Timestamp format.
2. Set the "time" column as the index and use the conversion functions.
3. Call reset_index().
4. Keep only the time.
Try:
dataframe["time"] = pd.to_datetime(dataframe["time"],format="%H:%M:%S.%f")
output = (dataframe.set_index("time")
.tz_localize("utc")
.tz_convert("US/Eastern")
.reset_index()
)
output["time"] = output["time"].dt.time
>>> output
time val
0 15:13:12.349211 a
1 15:13:13.435233 b
2 15:13:14.345233 c
to_datetime accepts a utc argument (bool) which, when True, coerces the timestamps to UTC.
For a Series input, to_datetime returns a datetime Series whose .dt accessor has a tz_convert method. This method converts tz-aware timestamps from one time zone to another.
So, this transformation could be concisely written as
df = pd.DataFrame(
[['a', '12:30:01.323'],
['b', '12:48:04.583'],
['c', '14:38:29.162']],
columns=['val', 'time']
)
df['time'] = pd.to_datetime(df.time, utc=True, format='%H:%M:%S.%f')
# convert string to timezone aware field ^^^
df['time'] = df.time.dt.tz_convert('EST').dt.time
# convert timezone, discarding the date part ^^^
This produces the following dataframe:
val time
0 a 07:30:01.323000
1 b 07:48:04.583000
2 c 09:38:29.162000
This could also be a 1-liner as below:
pd.to_datetime(df.time, utc=True, format='%H:%M:%S.%f').dt.tz_convert('EST').dt.time
list_temp = []
for row in df['time_UTC']:
    # Build a UTC-aware timestamp for each value, then convert it to US/Eastern.
    list_temp.append(pd.Timestamp(row, tz='UTC').tz_convert('US/Eastern'))
df['time_EST'] = list_temp
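One caveat worth sketching: when only a time of day is parsed with format='%H:%M:%S.%f', pandas fills in the date 1900-01-01, so a daylight-saving zone such as US/Eastern is evaluated for that date. If the real calendar date is known, it can be attached before converting. The helper name utc_times_to_eastern and the two example dates are assumptions for illustration only.

```python
import pandas as pd


def utc_times_to_eastern(times, on_date):
    """Interpret time-of-day strings as UTC on `on_date` and return US/Eastern times."""
    # Attach the calendar date so the zone rules for that day apply.
    stamps = pd.to_datetime(
        on_date + " " + times, utc=True, format="%Y-%m-%d %H:%M:%S.%f"
    )
    return stamps.dt.tz_convert("US/Eastern").dt.time


times = pd.Series(["12:30:01.323", "12:48:04.583", "14:38:29.162"])
winter = utc_times_to_eastern(times, "2021-01-15")  # standard time, UTC-5
summer = utc_times_to_eastern(times, "2021-07-15")  # daylight time, UTC-4
print(winter.iloc[0], summer.iloc[0])
```

The same UTC time maps to a different Eastern time in winter and in summer, which is why the date matters for this conversion.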

Pandas giving me the wrong max date in a date time column?

I have a dataframe with a date column:
data['Date']
0 1/1/14
1 1/8/14
2 1/15/14
3 1/22/14
4 1/29/14
...
255 11/21/18
256 11/28/18
257 12/5/18
258 12/12/18
259 12/19/18
But, when I try to get the max date out of that column, I get:
test_data.Date.max()
'9/9/15'
Any idea why this would happen?
The column is clearly of type object, so max() compares the values as strings, character by character, and '9/9/15' sorts last. You should convert the column with pd.to_datetime() and then apply the max() aggregator:
data['Date'] = pd.to_datetime(data['Date'],errors='coerce') #You might need to pass format
print(data['Date'].max())
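A short sketch of the effect, using a handful of made-up dates in the same M/D/YY style as the question (the sample values are assumptions, not the asker's data):

```python
import pandas as pd

dates = pd.Series(["1/1/14", "9/9/15", "11/21/18", "12/19/18"])

# On strings, max() is lexicographic: "9..." sorts after "1...".
string_max = dates.max()

# After parsing, max() compares actual points in time.
parsed_max = pd.to_datetime(dates, format="%m/%d/%y").max()

print(string_max)
print(parsed_max)
```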
max() treats the values as dates (as you want) only if they are datetime objects. Building upon Seshadri's response, try:
type(data['Date'][1])
If it is a datetime object, this returns:
pandas._libs.tslibs.timestamps.Timestamp
If not, you can make that column a datetime column like so:
data['Date'] = pd.to_datetime(data['Date'],format='%m/%d/%y')
The format argument makes sure the strings are parsed with the right layout. See the full list of format codes in the Python docs.
Your date may be stored as a string. First convert the column from string to datetime. Then, max() should work.
test = pd.DataFrame(['1/1/2010', '2/1/2011', '3/4/2020'], columns=['Dates'])
Dates
0 1/1/2010
1 2/1/2011
2 3/4/2020
pd.to_datetime(test['Dates'], format='%m/%d/%Y').max()
Timestamp('2020-03-04 00:00:00')
That timestamp can be cleaned up using .dt.date:
pd.to_datetime(test['Dates'], format='%m/%d/%Y').dt.date.max()
datetime.date(2020, 3, 4)
See also: the format code table in the Python docs (used by the to_datetime format argument) and the to_datetime page in the pandas docs.

Pandas Dataframe Time column has float values

I am cleaning my database. In one of the tables, the time column has values like 0.013391204. I am unable to convert this to the time format [mm:ss]. Is there a function to convert this to the required format [mm:ss]?
The head for the column
0 20:00
1 0.013391204
2 0.013333333
3 0.012708333
4 0.012280093
Use the below reproducible data:
import pandas as pd
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333", "0.012708333", "0.012280093"]})
I expect the output to be like the first row of the column values shown above.
What is the correct time interpretation for, say, the first float entry? Is 0.013391204 meant to be 48 seconds?
If the value is a fraction of a day, the datetime module can convert the float into a time format:
import datetime
datetime.timedelta(days = 0.013391204)
str(datetime.timedelta(days = 0.013391204))
Output: '0:19:17.000026'
First convert the values with to_numeric and errors='coerce' so that non-floats become missing values, then fill those with the original values prefixed with 00: for the hours, and finally convert with to_timedelta and unit='d':
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333",
"0.012708333", "0.012280093"]})
s = pd.to_numeric(df['time'], errors='coerce').fillna(df['time'].radd('00:'))
df['new'] = pd.to_timedelta(s, unit='d')
print (df)
time new
0 20:00 00:20:00
1 0.013391204 00:19:17.000025
2 0.013333333 00:19:11.999971
3 0.012708333 00:18:17.999971
4 0.012280093 00:17:41.000035
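If the goal is literally an mm:ss string, the arithmetic can also be written out by hand. This is only a sketch under the assumption, taken from the answers above, that the floats are fractions of a day; the helper name to_mmss is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def to_mmss(value):
    """Return an 'mm:ss' string for a day fraction; pass 'mm:ss' text through unchanged."""
    text = str(value)
    if ":" in text:
        # Already in the target format, for example "20:00".
        return text
    # Scale the day fraction to seconds and round to the nearest whole second.
    total_seconds = round(float(text) * SECONDS_PER_DAY)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


raw = ["20:00", "0.013391204", "0.013333333", "0.012708333", "0.012280093"]
converted = [to_mmss(v) for v in raw]
print(converted)
```

For example, 0.013391204 * 86400 is about 1157 seconds, which is 19 minutes and 17 seconds. Values of an hour or more would show up as minutes above 59 in this sketch.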

Slicing a string to filter a pandas dataframe

Should be an easy one, just not getting anywhere with it after looking at any existing examples.
I'm trying to filter a df where a date/time in my df equals a date/time I have in another variable called "date".
Both of these are stored as strings.
The format of df['DATE'] is like this:
2017/11/28 14:19:58
The format of date is like this:
11/28/2017 14:19
I want these to return a match.
df = df[df['DATE'][:-3] == date]
Error I get is this:
raise IndexingError('Unalignable boolean Series provided as '
pandas.core.indexing.IndexingError: Unalignable boolean Series provided
as indexer (index of the boolean Series and of the indexed object do not match
Seems like interpreter treats it as I am referencing the df position, not slicing the string within.
You need to use the pd.Series.str accessor for slicing:
import pandas as pd
from datetime import datetime
s = pd.Series(['2016/09/25 12:29:18', '2017/11/28 14:19:58', '2018/01/02 03:35:12'])
date = '11/28/2017 14:19'
res = (s.str[:-3] == datetime.strptime(date, '%m/%d/%Y %H:%M').strftime('%Y/%m/%d %H:%M'))
print(res)
0 False
1 True
2 False
dtype: bool
df
DATE
0 2017/11/21 14:19:58
1 2017/11/20 14:19:58
2 2017/11/21 12:19:58
date = '11/20/2017 14:19'
df[df['DATE'].apply(lambda x :pd.to_datetime(x,infer_datetime_format=True).strftime('%m/%d/%Y %H:%M'))==date]
DATE
1 2017/11/20 14:19:58
You can convert either of them or both if you would like to do any other datetime based operations.
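As a sketch of the "convert both" option, both sides can be parsed to datetimes and compared at minute resolution, which avoids string slicing altogether. The sample frame below is an assumption built from the formats shown in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"DATE": ["2016/09/25 12:29:18", "2017/11/28 14:19:58", "2018/01/02 03:35:12"]}
)
date = "11/28/2017 14:19"

# Parse each side with its own explicit format.
target = pd.to_datetime(date, format="%m/%d/%Y %H:%M")
parsed = pd.to_datetime(df["DATE"], format="%Y/%m/%d %H:%M:%S")

# Drop the seconds before comparing, mirroring the [:-3] slice in the question.
matches = df[parsed.dt.floor("min") == target]
print(matches)
```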

Convert number to date format using Python

I am reading data from a text file with more than 14000 rows, and there is a column which has eight-digit numbers in it. The format for some of the rows is like:
01021943
02031944
00041945
00001946
The problem is that when I use the to_date function it converts the datatype of the date from object to int64, but I want it to be datetime. Second, when I use the to_datetime function, dates like
00041945 become 41945
00001946 become 1946, and hence I cannot format them properly.
You can pass the dtype parameter to read_csv to read the column col as strings, and then use to_datetime with the format parameter to specify the formatting and errors='coerce', because bad dates are then converted to NaT:
import pandas as pd
import io
temp=u"""col
01021943
02031944
00041945
00001946"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), dtype={'col': 'str'})
df['col'] = pd.to_datetime(df['col'], format='%d%m%Y', errors='coerce')
print (df)
col
0 1943-02-01
1 1944-03-02
2 NaT
3 NaT
print (df.dtypes)
col datetime64[ns]
dtype: object
Thanks Jon Clements for another solution:
import pandas as pd
import io
temp=u"""col_name
01021943
02031944
00041945
00001946"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 converters={'col_name': lambda dt: pd.to_datetime(dt, format='%d%m%Y', errors='coerce')})
print (df)
col_name
0 1943-02-01
1 1944-03-02
2 NaT
3 NaT
print (df.dtypes)
col_name datetime64[ns]
dtype: object
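The reason dtype matters can be seen in a small sketch: with the default type inference the column is read as integers and the leading zeros are lost before any date parsing happens. The two-row sample below is an assumption for illustration.

```python
import io

import pandas as pd

raw = "col\n01021943\n00041945\n"

# Default inference reads the digits as integers, dropping the leading zeros.
as_int = pd.read_csv(io.StringIO(raw))

# Forcing str keeps the original eight characters intact.
as_str = pd.read_csv(io.StringIO(raw), dtype={"col": str})

print(as_int["col"].tolist())
print(as_str["col"].tolist())
```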
As a first guess solution you could just parse it as a string into a datetime instance. Something like:
from datetime import datetime
EXAMPLE = u'01021943'
dt = datetime(int(EXAMPLE[4:]), int(EXAMPLE[2:4]), int(EXAMPLE[:2]))
...not caring very much about performance issues.
import datetime
def to_date(num_str):
    return datetime.datetime.strptime(num_str, "%d%m%Y")
Note that this will also raise an exception for zero values, because the expected behavior for such input is not clear.
If you want different behavior for zero values, you can implement it with try and except.
For example, to get None for zero values you can do:
def to_date(num_str):
    try:
        return datetime.datetime.strptime(num_str, "%d%m%Y")
    except ValueError:
        # A zero day or month (e.g. '00041945') cannot be parsed as a date.
        return None
