I am reading data from a text file with more than 14,000 rows, and there is a column which contains eight (08) digit numbers. The format for some of the rows is like:
01021943
02031944
00041945
00001946
The problem is that when I use the to_date function it converts the column's datatype from object to int64, but I want it to be datetime. Second, when using the to_datetime function, dates like
00041945 become 41945
00001946 becomes 1946, and hence I cannot format them properly.
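For reference, here is a minimal reproduction of what I see when reading without any options (the column name col is just an example):
import io
import pandas as pd

temp = u"""col
01021943
00041945"""

df = pd.read_csv(io.StringIO(temp))
print(df)         # the leading zeros are gone because the column is parsed as int64
print(df.dtypes)  # col    int64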
You can pass the dtype parameter to read_csv so that column col is read as string, and then use to_datetime with format to specify the formatting and errors='coerce', because the bad dates are then converted to NaT:
import pandas as pd
import io
temp=u"""col
01021943
02031944
00041945
00001946"""
# after testing, replace io.StringIO(temp) with your filename
df = pd.read_csv(io.StringIO(temp), dtype={'col': 'str'})
df['col'] = pd.to_datetime(df['col'], format='%d%m%Y', errors='coerce')
print (df)
         col
0 1943-02-01
1 1944-03-02
2        NaT
3        NaT
print (df.dtypes)
col    datetime64[ns]
dtype: object
Thanks Jon Clements for another solution:
import pandas as pd
import io
temp=u"""col_name
01021943
02031944
00041945
00001946"""
# after testing, replace io.StringIO(temp) with your filename
df = pd.read_csv(io.StringIO(temp),
                 converters={'col_name': lambda dt: pd.to_datetime(dt, format='%d%m%Y', errors='coerce')})
print (df)
    col_name
0 1943-02-01
1 1944-03-02
2        NaT
3        NaT
print (df.dtypes)
col_name    datetime64[ns]
dtype: object
As a first-guess solution, you could just parse the string into a datetime instance. Something like:
from datetime import datetime
EXAMPLE = u'01021943'
dt = datetime(int(EXAMPLE[4:]), int(EXAMPLE[2:4]), int(EXAMPLE[:2]))
...not caring very much about performance issues.
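If you want to apply the same idea per element of a pandas column, a rough sketch (the column name col is assumed, and the column must already hold strings) could be:
import pandas as pd
from datetime import datetime

def parse_ddmmyyyy(s):
    # slices of the ddmmyyyy string: year = s[4:], month = s[2:4], day = s[:2]
    return datetime(int(s[4:]), int(s[2:4]), int(s[:2]))

df = pd.DataFrame({'col': ['01021943', '02031944']})
df['col'] = df['col'].apply(parse_ddmmyyyy)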
import datetime

def to_date(num_str):
    return datetime.datetime.strptime(num_str, "%d%m%Y")
Note this will also throw an exception for zero values, because the expected behaviour for such input is not clear.
If you want different behaviour for zero values, you can implement it with try and except;
for example, if you want to get None for zero values you can do:
def to_date(num_str):
    try:
        return datetime.datetime.strptime(num_str, "%d%m%Y")
    except ValueError:
        return None
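For example, with the sample values from the question this behaves like:
print(to_date("01021943"))  # 1943-02-01 00:00:00
print(to_date("00001946"))  # None - a month/day of zero is not a valid date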
Related
I have a date column in a dataset where the dates are in a format like 'Apr-12', 'Jan-12'. I would like to change the format to 04-2012, 01-2012. I am looking for a function which can do this.
I think I know one guy with the same name. Jokes apart, here is the solution to your problem.
There is an inbuilt function named strptime(), which takes the string and converts it into the format you want.
You need to import datetime first, since it is part of Python's datetime package. You don't need to install anything, just import it.
Then it works like this: datetime.strptime(your_string, format_you_want)
# You can also do this: from datetime import * (this imports all the names from datetime)
from datetime import datetime

s = 'Apr-12'
date_object = datetime.strptime(s, '%b-%y')
print(date_object)
I hope this will work for you. Happy coding :)
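If you also want the output printed exactly as 04-2012 (as asked), one way, as a sketch, is to format the parsed date back out with strftime:
from datetime import datetime

date_object = datetime.strptime('Apr-12', '%b-%y')
print(date_object.strftime('%m-%Y'))  # 04-2012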
You can do the following:
import pandas as pd
df = pd.DataFrame({
    'date': ['Apr-12', 'Jan-12', 'May-12']
})
pd.to_datetime(df['date'], format='%b-%y')
This will output:
0 2012-04-01
1 2012-01-01
2 2012-05-01
Name: date, dtype: datetime64[ns]
Which means you can update your date column right away:
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
You can chain a couple of pandas methods together to get the desired output:
df = pd.DataFrame({'date_fmt':['Apr-12','Jan-12']})
df
Input dataframe:
date_fmt
0 Apr-12
1 Jan-12
Use pd.to_datetime chained with the .dt date accessor and strftime:
pd.to_datetime(df['date_fmt'], format='%b-%y').dt.strftime('%m-%Y')
Output:
0 04-2012
1 01-2012
Name: date_fmt, dtype: object
I have a DataFrame which has date fields, some of which contain None values.
I am changing all the dates, which are in milliseconds, to dd/mm/yyyy using the code below, but I face an issue when the value is None:
dataframe['dt'] =dataframe['dt']/1000.0
dataframe['dt']=datetime.datetime.fromtimestamp(dataframe['dt']).strftime('%d/%m/%Y')
I tried to disregard the None values as below, but it is not executing the condition and goes directly to the else branch:
dataframe['dt']=np.where(dataframe['dt']==pd.np.nan, pd.np.nan,dataframe['dt']/1000)
dataframe['dt']=np.where(dataframe['dt']==pd.np.nan, pd.np.nan, datetime.datetime.fromtimestamp(dataframe['dt']).strftime('%d/%m/%Y'))
It is becoming tiring now
Have you tried this? Note that the format parameter is for parsing strings, not for controlling the output, so convert with unit='ms' first and apply strftime afterwards:
pd.to_datetime(dataframe['dt'], unit='ms', errors='coerce').dt.strftime('%d/%m/%Y')
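A small self-contained sketch of how this handles None values (the sample timestamps are made up):
import pandas as pd

df = pd.DataFrame({'dt': [1564025326921, 1564025327921, None]})
df['dt'] = pd.to_datetime(df['dt'], unit='ms', errors='coerce').dt.strftime('%d/%m/%Y')
# the None row becomes NaT during conversion and stays missing after strftime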
You can try apply:
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'dt': [1564025326921, 1564025327921, None, 1564025328921]})
df['dt'] = df['dt']/1000.0
df['dt'] = df['dt'].apply(lambda t: None if np.isnan(t) else datetime.datetime.fromtimestamp(t).strftime('%d/%m/%Y'))
df.head(10)
This will output:
dt
0 24/07/2019
1 24/07/2019
2 None
3 24/07/2019
You can use the built-in pandas method pd.to_datetime to convert values to datetime objects. Once you convert to datetime, you can change the format more flexibly.
If you want to convert the column 'date' to datetime objects, you can do it as below:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y', errors='coerce')
errors='coerce' will convert non-convertible items to NaT.
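As a quick sketch of the coercion behaviour (the column name and sample values are made up):
import pandas as pd

df = pd.DataFrame({'date': ['24/07/19', 'not a date']})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y', errors='coerce')
print(df)
#         date
# 0 2019-07-24
# 1        NaT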
I have a couple of date columns that I want to convert to month/day/year format. Let's say test1 is one of the date columns; the code below works.
dfq['test1'] = dfq['test1'].apply(lambda x: x.strftime('%m/%d/%Y'))
But when there are missing values in the column as NaT, it shows the error
ValueError: NaTType does not support strftime. I have created a sample data set and intentionally kept one missing value as ' '. In that case it also shows an error.
I want to keep the missing or NaT values, so I can't remove them. Is there any other way around this?
Another question: if I want to convert all my date columns (say test1, test, test3) at the same time, is there a way to do it while using lambda/strftime?
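For reference, a minimal reproduction of the error (the column name and values are just examples):
import pandas as pd

dfq = pd.DataFrame({'test1': pd.to_datetime(['2018-01-01', None])})
dfq['test1'].apply(lambda x: x.strftime('%m/%d/%Y'))
# ValueError: NaTType does not support strftime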
You should use pd.Series.dt.strftime, which handles NaT gracefully:
import pandas as pd
s = pd.Series(['2018-01-01', 'hello'])
s = pd.to_datetime(s, errors='coerce')
# 0 2018-01-01
# 1 NaT
# dtype: datetime64[ns]
s = s.dt.strftime('%m/%d/%Y')
print(s)
# 0 01/01/2018
# 1 NaT
# dtype: object
For your second question, I do not believe datetime to str conversion can be vectorised. You can easily do this:
for col in ['col1', 'col2', 'col3']:
    df[col] = df[col].dt.strftime('%m/%d/%Y')
Or better:
for col in df.select_dtypes(include=['datetime']):
    df[col] = df[col].dt.strftime('%m/%d/%Y')
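For instance, a small sketch with made-up columns test1 and test2:
import pandas as pd

df = pd.DataFrame({
    'test1': pd.to_datetime(['2018-01-01', None]),
    'test2': pd.to_datetime(['2019-06-30', '2019-07-01']),
})
for col in df.select_dtypes(include=['datetime']):
    df[col] = df[col].dt.strftime('%m/%d/%Y')
print(df.dtypes)  # both columns are now object (strings)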
Here's another solution that's a bit more flexible, since it also works with pd.style.format(), which is where I encountered the issue. Just wrap the time formatter in a function and catch the error, returning the value unchanged (NaT) when it throws. You can then use whichever time formatting function you want in there.
def format_time_nat(t, fmt='{:%d-%b-%y}'):
    try:
        return fmt.format(t)  # or strftime
    except ValueError:
        return t

dfq['test1'] = dfq['test1'].apply(format_time_nat)

# when using pd.style.format()
colstyles = {
    'test1': format_time_nat
}
dfq.style.format(colstyles)
I have the following pandas DataFrame column dfA['TradeDate']:
0 20100329.0
1 20100328.0
2 20100329.0
...
and I wish to transform it to a datetime.
Based on another thread on SO, I convert it first to a string and then apply the strptime function.
dfA['TradeDate'] = datetime.datetime.strptime( dfA['TradeDate'].astype('int').to_string() ,'%Y%m%d')
However, this returns an error saying that my format is incorrect (ValueError).
An issue that I spotted is that the column is not properly converted to string, but to object.
When I try:
dfA['TradeDate'] = datetime.datetime.strptime( dfA['TradeDate'].astype(int).astype(str),'%Y%m%d')
It returns: must be str, not Series.
You can use:
df['TradeDate'] = pd.to_datetime(df['TradeDate'], format='%Y%m%d.0')
print (df)
TradeDate
0 2010-03-29
1 2010-03-28
2 2010-03-29
But if there are some bad values, add errors='coerce' to replace them with NaT:
print (df)
TradeDate
0 20100329.0
1 20100328.0
2 20100329.0
3 20153030.0
4 yyy
df['TradeDate'] = pd.to_datetime(df['TradeDate'], format='%Y%m%d.0', errors='coerce')
print (df)
TradeDate
0 2010-03-29
1 2010-03-28
2 2010-03-29
3 NaT
4 NaT
You can use to_datetime with a custom format on a string representation of the values:
import pandas as pd
pd.to_datetime(pd.Series([20100329.0, 20100328.0, 20100329.0]).astype(str), format='%Y%m%d.0')
The strptime function works on a single value, not on a Series. You need to apply that function to each element of the column.
Try the pandas.to_datetime method, e.g.
dfA = pandas.DataFrame({"TradeDate" : [20100329.0,20100328.0]})
pandas.to_datetime(dfA['TradeDate'], format = "%Y%m%d")
or
dfA['TradeDate'].astype(int).astype(str)\
    .apply(lambda x: datetime.datetime.strptime(x, '%Y%m%d'))
In your first attempt you tried to convert it to a string and then pass it to strptime, which resulted in a ValueError. This happens because dfA['TradeDate'].astype('int').to_string() creates a single string containing all the dates as well as their row numbers. You can change this to:
dates = dfA['TradeDate'].astype('int').to_string(index=False).split()
dates
['20100329', '20100328', '20100329']
to get a list of date strings. Then use a Python list comprehension to convert each element to datetime (assuming from datetime import datetime):
dfA['TradeDate'] = [datetime.strptime(x, '%Y%m%d') for x in dates]
I have a SQL table that contains data of the MySQL time type, as follows:
time_of_day
-----------
12:34:56
I then use pandas to read the table in:
df = pd.read_sql('select * from time_of_day', engine)
Looking at df.dtypes yields:
time_of_day timedelta64[ns]
My main issue is that, when writing my df to a csv file, the data comes out all messed up, instead of essentially looking like my SQL table:
time_of_day
0 days 12:34:56.000000000
I'd like to instead (obviously) store this record as a time, but I can't find anything in the pandas docs that talks about a time dtype.
Does pandas lack this functionality intentionally? Is there a way to solve my problem without requiring janky data casting?
Seems like this should be elementary, but I'm confounded.
Pandas does not support a time dtype series
Pandas (and NumPy) do not have a time dtype. Since you wish to avoid Pandas timedelta, you have 3 options: Pandas datetime, Python datetime.time, or Python str. Below they are presented in order of preference. Let's assume you start with the following dataframe:
df = pd.DataFrame({'time': pd.to_timedelta(['12:34:56', '05:12:45', '15:15:06'])})
print(df['time'].dtype) # timedelta64[ns]
Pandas datetime series
You can use Pandas datetime series and include an arbitrary date component, e.g. today's date. Underlying such a series are integers, which makes this solution the most efficient and adaptable.
The default date, if unspecified, is 1-Jan-1970:
df['time'] = pd.to_datetime(df['time'])
print(df)
# time
# 0 1970-01-01 12:34:56
# 1 1970-01-01 05:12:45
# 2 1970-01-01 15:15:06
You can also specify a date, such as today:
df['time'] = pd.Timestamp('today').normalize() + df['time']
print(df)
# time
# 0 2019-01-02 12:34:56
# 1 2019-01-02 05:12:45
# 2 2019-01-02 15:15:06
Pandas object series of Python datetime.time values
The Python datetime module from the standard library supports datetime.time objects. You can convert your series to an object dtype series containing pointers to a sequence of datetime.time objects. Operations will no longer be vectorised, but each underlying value will be represented internally by a number.
df['time'] = pd.to_datetime(df['time']).dt.time
print(df)
# time
# 0 12:34:56
# 1 05:12:45
# 2 15:15:06
print(df['time'].dtype)
# object
print(type(df['time'].at[0]))
# <class 'datetime.time'>
Pandas object series of Python str values
Converting to strings is only recommended for presentation purposes that are not supported by other types, e.g. Pandas datetime or Python datetime.time. For example:
df['time'] = pd.to_datetime(df['time']).dt.strftime('%H:%M:%S')
print(df)
# time
# 0 12:34:56
# 1 05:12:45
# 2 15:15:06
print(df['time'].dtype)
# object
print(type(df['time'].at[0]))
# <class 'str'>
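If the goal is just a clean CSV, the datetime.time or string representations above already write out as plain HH:MM:SS text; a minimal sketch (the output filename is hypothetical):
df['time'] = pd.to_datetime(df['time']).dt.time  # as in the datetime.time option above
df.to_csv('time_of_day.csv', index=False)        # rows are written as 12:34:56, 05:12:45, 15:15:06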
It's a hack, but you can pull out the timedelta components to create a string and convert that string to a datetime.time(h, m, s) object:
from datetime import datetime

def convert(td):
    time = [str(td.components.hours), str(td.components.minutes),
            str(td.components.seconds)]
    return datetime.strptime(':'.join(time), '%H:%M:%S').time()

df['time'] = df['time'].apply(convert)
I found a solution, but I feel like there's gotta be something more elegant than this:
def convert(x):
    return pd.to_datetime(x).strftime('%H:%M:%S')

df['time_of_day'] = df['time_of_day'].apply(convert)
df['time_of_day'] = pd.to_datetime(df['time_of_day']).apply(lambda x: x.time())
Adapted this code