pandas.to_datetime inconsistent time string format - python

I am attempting to convert the index of a pandas.DataFrame from string format to a datetime index, using pandas.to_datetime().
Import pandas:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.10.1'
Create an example DataFrame:
In [3]: d = {'data' : pd.Series([1.,2.], index=['26/12/2012', '10/01/2013'])}
In [4]: df=pd.DataFrame(d)
Look at indices. Note that the date format is day/month/year:
In [5]: df.index
Out[5]: Index([26/12/2012, 10/01/2013], dtype=object)
Convert index to datetime:
In [6]: pd.to_datetime(df.index)
Out[6]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-12-26 00:00:00, 2013-10-01 00:00:00]
Length: 2, Freq: None, Timezone: None
Already at this stage you can see that the two entries have been parsed with different formats: the first is fine, but the second has month and day swapped.
This is the call I want to make, but without the inconsistent parsing of the date strings:
In [7]: df.set_index(pd.to_datetime(df.index))
Out[7]:
data
2012-12-26 1
2013-10-01 2
I guess the first entry is parsed correctly only because the function 'knows' there is no month 26, so it cannot apply the default month/day/year interpretation there.
Is there another/better way to do this? Can I pass the format into the to_datetime() function?
Thank you.
EDIT:
I have found a way to do this, without pandas.to_datetime:
from datetime import datetime as dt
date_string_list = df.index.tolist()
datetime_list = [dt.strptime(s, '%d/%m/%Y') for s in date_string_list]
df.index = datetime_list
but it's a bit messy. Any improvements welcome.

There is a (somewhat hidden) dayfirst argument to to_datetime:
In [23]: pd.to_datetime(df.index, dayfirst=True)
Out[23]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-12-26 00:00:00, 2013-01-10 00:00:00]
Length: 2, Freq: None, Timezone: None
From pandas 0.11 onwards you can also pass an explicit format argument:
In [24]: pd.to_datetime(df.index, format='%d/%m/%Y')
Out[24]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-12-26 00:00:00, 2013-01-10 00:00:00]
Length: 2, Freq: None, Timezone: None
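Putting either option together with the original frame, a minimal sketch (re-creating the df built above; nothing here beyond the calls already shown):
import pandas as pd

d = {'data': pd.Series([1., 2.], index=['26/12/2012', '10/01/2013'])}
df = pd.DataFrame(d)

# Parse the string index day-first, then assign it back as a DatetimeIndex.
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
# Equivalently: df = df.set_index(pd.to_datetime(df.index, dayfirst=True))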

Related

Convert pandas series cell to string and datetime object

I have sliced a pandas DataFrame:
end_date = df[-1:]['end']
type(end_date)
Out[4]: pandas.core.series.Series
end_date
Out[3]:
48173 2017-09-20 04:47:59
Name: end, dtype: datetime64[ns]
How do I get rid of end_date's index value 48173 and get only the string 2017-09-20 04:47:59? I have to call a REST API with 2017-09-20 04:47:59 as a parameter, so I need a string from the pandas datetime64 Series.
How do I instead get only a datetime object [something like datetime.datetime.strptime('2017-09-20 04:47:59', '%Y-%m-%d %H:%M:%S')]? I need it because later I will have to check whether '2017-09-20 04:47:59' < datetime.datetime(2017,1,9).
I need to convert just a single cell value, not a whole column.
How to do these conversions?
It seems you need:
import pandas as pd
data = ['2017-09-20 04:47:59','2017-10-20 04:47:59','2017-09-30 04:47:59']
df = pd.DataFrame(data,columns=['end'])
df['end'] = pd.to_datetime(df['end'])
df
df will be:
end
0 2017-09-20 04:47:59
1 2017-10-20 04:47:59
2 2017-09-30 04:47:59
After that you can use the code below to get rid of the index and work with a Timestamp object:
end_date = df['end'].iloc[-1] #get last row of column end
print(type(end_date)) # pandas.tslib.Timestamp
end_date_str = end_date.strftime('%Y-%m-%d %H:%M:%S') #convert to str
print(end_date_str) # '2017-09-30 04:47:59'
print(end_date < datetime.datetime(2017,1,9)) #False
Simply cast the result to a string, and recover it using .values[0]:
In [38]: end_date
Out[38]:
48173 2017-09-20 04:47:59
Name: end, dtype: datetime64[ns]
In [39]: end_date.astype(str).values[0]
Out[39]: '2017-09-20 04:47:59'
If you want a datetime object, you have to convert it to a timestamp, and then back to a datetime object:
In [42]: end_date.values[0].item()
Out[42]: 1505882879000000000
In [43]: datetime.fromtimestamp(end_date.values[0].item()/10**9)
Out[43]: datetime.datetime(2017, 9, 20, 6, 47, 59)
Otherwise, you can strptime the string recovered in step 1:
In [48]: datetime.datetime.strptime(end_date.astype(str).values[0], '%Y-%m-%d %H:%M:%S')
Out[48]: datetime.datetime(2017, 9, 20, 4, 47, 59)
You may wonder why there is a two-hour difference between the results. This is because datetime.datetime.fromtimestamp takes my timezone into account (currently CEST, which is UTC+2). Parsing the string, on the other hand, yields no timezone information: strptime naively parses the timestamp without regard for the timezone, which leads to the two-hour discrepancy.
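A sketch of an alternative that sidesteps the timezone round-trip entirely: pull the Timestamp out with .iloc[0] and convert it directly (end_date is rebuilt here from the question's single row for illustration).
import datetime
import pandas as pd

end_date = pd.Series(pd.to_datetime(['2017-09-20 04:47:59']), index=[48173], name='end')
ts = end_date.iloc[0]                         # pandas Timestamp
py_dt = ts.to_pydatetime()                    # naive datetime.datetime(2017, 9, 20, 4, 47, 59)
as_str = ts.strftime('%Y-%m-%d %H:%M:%S')     # '2017-09-20 04:47:59'
py_dt < datetime.datetime(2017, 1, 9)         # False, the comparison the question needs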

Subtracting a fixed date from a column in Pandas

Consider
In [99]: d = pd.to_datetime({'year': [2016], 'month': [6], 'day': [1]})
In [100]: d1 = pd.to_datetime({'year': [2016], 'month': [1], 'day': [1]})
In [101]: d - d1
Out[101]:
0 152 days
dtype: timedelta64[ns]
But when I try to do this for a whole column, it gives me trouble. Consider:
df['Age'] = map(lambda x: x - pd.to_datetime({'year': [2016], 'month': [6], 'day': [1]}), df['Manager_DoB'])
df['Manager_DoB'] is a column of datetime objects.
It flags the following error:
TypeError: can only operate on a datetime with a rhs of a timedelta/DateOffset for addition and subtraction, but the operator [__rsub__] was passed
You don't need to use map*; you can subtract a Timestamp from a datetime column/Series directly:
In [11]: d = pd.to_datetime({'year':[2016], 'month':[6], 'day':[1]})
In [12]: d
Out[12]:
0 2016-06-01
dtype: datetime64[ns]
In [13]: d[0] # This is the Timestamp you are actually interested in subtracting
Out[13]: Timestamp('2016-06-01 00:00:00')
In [14]: dates = pd.date_range(start="2016-01-01", periods=4)
In [15]: dates - d[0]
Out[15]: TimedeltaIndex(['-152 days', '-151 days', '-150 days', '-149 days'], dtype='timedelta64[ns]', freq=None)
You can get the Timestamp more directly using the constructor:
In [21]: pd.Timestamp("2016-06-01")
Out[21]: Timestamp('2016-06-01 00:00:00')
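Applied back to the original question, a minimal sketch (assuming df['Manager_DoB'] is already a datetime64[ns] column, as stated):
df['Age'] = df['Manager_DoB'] - pd.Timestamp('2016-06-01')   # a timedelta64[ns] column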
*You should avoid Python's built-in map with pandas; prefer .apply or, better, vectorized operations.

How to convert Pandas time series into a dict with string key

I need to convert a Pandas time series object into dicts which have the datetime as the key. I tried dict(my_ts_obj), but the keys are Timestamps, not strings.
Thanks a million for your help!
You could use s.index.format() to convert the Timestamps into strings:
In [87]: rng = pd.date_range('12/1/2012', periods=4, freq='D')
In [88]: s = pd.Series(pd.np.random.randn(len(rng)), index=rng)
In [89]: s
Out[89]:
2012-12-01 -1.673655
2012-12-02 1.447061
2012-12-03 -0.672347
2012-12-04 0.202692
Freq: D, dtype: float64
In [90]: dict(zip(s.index.format(), s))
Out[90]:
{'2012-12-01': -1.6736553219187384,
'2012-12-02': 1.4470613776383001,
'2012-12-03': -0.67234662513200982,
'2012-12-04': 0.20269246374288372}
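An alternative sketch, if you want explicit control over the key format: call strftime on each Timestamp while building the dict (using the same s as above).
{ts.strftime('%Y-%m-%d'): value for ts, value in zip(s.index, s)}
# e.g. {'2012-12-01': -1.6736553219187384, ...}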

Pandas date_range from DatetimeIndex to Date format

Pandas date_range returns a pandas.DatetimeIndex whose entries are timestamps (date plus time). For example:
In [114]: rng = pandas.date_range('1/1/2013', '1/31/2013', freq='D')
In [115]: rng
Out[115]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-31 00:00:00]
Length: 31, Freq: D, Timezone: None
Given that I am not using timestamps in my application, I would like to convert this index to dates, so that:
In [117]: rng[0]
Out[117]: <Timestamp: 2013-01-01 00:00:00>
would instead be in the form 2013-01-01.
I am using pandas version 0.9.1
to_pydatetime returns a NumPy array of Python datetime.datetime objects:
In [8]: dates = rng.to_pydatetime()
In [9]: print(dates[0])
2013-01-01 00:00:00
In [10]: print(dates[0].strftime('%Y-%m-%d'))
2013-01-01
For me the current answer is not satisfactory, because the result is still stored internally as a timestamp with hours, minutes, and seconds.
Pandas version : 0.22.0
My solution has been to convert it to datetime.date:
In[30]: import pandas as pd
In[31]: rng = pd.date_range('1/1/2013','1/31/2013', freq='D')
In[32]: date_rng = rng.date # Here it becomes date
In[33]: date_rng[0]
Out[33]: datetime.date(2013, 1, 1)
In[34]: print(date_rng[0])
2013-01-01
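If all you actually need are date strings rather than date objects, a sketch using DatetimeIndex.strftime (note that recent pandas versions return an Index of strings here, older ones an array):
import pandas as pd

rng = pd.date_range('1/1/2013', '1/31/2013', freq='D')
rng.strftime('%Y-%m-%d')    # strings like '2013-01-01', '2013-01-02', ...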

How do I convert strings in a Pandas data frame to a 'date' data type?

I have a Pandas data frame; one of the columns contains date strings in the format YYYY-MM-DD, e.g. '2013-10-28'.
At the moment the dtype of the column is object.
How do I convert the column values to Pandas date format?
Essentially equivalent to @waitingkuo's answer, but I would use pd.to_datetime here (it seems a little cleaner, and offers some additional functionality, e.g. dayfirst):
In [11]: df
Out[11]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [12]: pd.to_datetime(df['time'])
Out[12]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]
In [13]: df['time'] = pd.to_datetime(df['time'])
In [14]: df
Out[14]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
Handling ValueErrors
If running
df['time'] = pd.to_datetime(df['time'])
throws a
ValueError: Unknown string format
then you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:
df['time'] = pd.to_datetime(df['time'], errors='coerce')
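A small self-contained sketch of the coerce behaviour (the 'not a date' value is made up for illustration):
import pandas as pd

s = pd.Series(['2013-01-01', 'not a date', '2013-01-03'])
pd.to_datetime(s, errors='coerce')
# 0   2013-01-01
# 1          NaT
# 2   2013-01-03
# dtype: datetime64[ns]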
Use astype
In [31]: df
Out[31]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [32]: df['time'] = df['time'].astype('datetime64[ns]')
In [33]: df
Out[33]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:
dfcsv = pd.read_csv('xyz.csv', parse_dates=[0])
Here the 0 refers to the column the date is in. You could also add index_col=0 if you want the date to be your index.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
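A self-contained sketch of that read_csv call (the CSV content is made up for illustration):
import io
import pandas as pd

csv_text = "time,a\n2013-01-01,1\n2013-01-02,2\n2013-01-03,3\n"
dfcsv = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0)
print(dfcsv.index.dtype)   # datetime64[ns]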
Now you can do df['column'].dt.date
Note that for datetime objects, if you don't see the hour when they're all 00:00:00, that's not pandas; that's the IPython notebook trying to make things look pretty.
If you want to get the DATE and not DATETIME format:
df["id_date"] = pd.to_datetime(df["id_date"]).dt.date
Another way to do this, which works well if you have multiple columns to convert to datetime:
cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)
It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.
#set an index by dates
df.set_index(['time'], drop=True, inplace=True)
After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
#Convert to monthly dates
df.index = df.index.to_period(freq='M')
#Convert to strings
df.index = df.index.strftime('%Y-%m')
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
For brevity, I don't show that I run the following code after each line above:
print(df.index)
print(df.index.dtype)
print(type(df.index))
This gives me the following output:
Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>
Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
For the sake of completeness, another option (perhaps not the most straightforward one), similar to the one proposed by @SSS but using the datetime library instead, is:
import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%d-%m').date())
If you start with two string (object) columns:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null object
1 endDay 110526 non-null object
import pandas as pd
df['startDay'] = pd.to_datetime(df.startDay)
df['endDay'] = pd.to_datetime(df.endDay)
After the conversion:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null datetime64[ns]
1 endDay 110526 non-null datetime64[ns]
Try converting one of the values into a timestamp using the pd.to_datetime function, and then use .map to apply the function to the entire column, as in the sketch below.
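A sketch of that suggestion, assuming the startDay column from the frame above (mapping pd.to_datetime element-wise is slower than calling it on the whole column, but it makes it easy to test a single value first):
pd.to_datetime(df['startDay'].iloc[0])                # check that one value parses as expected
df['startDay'] = df['startDay'].map(pd.to_datetime)   # then apply it to the entire column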
