Subtracting a fixed date from a column in Pandas - python

Consider
In [99]: d = pd.to_datetime({'year':[2016], 'month':[6], 'day':[1]})
In [100]: d1 = pd.to_datetime({'year':[2016], 'month':[1], 'day':[1]})
In [101]: d - d1
Out[101]:
0 152 days
dtype: timedelta64[ns]
But when I try to do this for a whole column, it gives me trouble. Consider:
df['Age'] = map(lambda x: x - pd.to_datetime({'year':[2016], 'month':[6], 'day':[1]}), df['Manager_DoB'])
df['Manager_DoB'] is a column of datetime objects.
It flags the following error:
TypeError: can only operate on a datetime with a rhs of a timedelta/DateOffset for addition and subtraction, but the operator [__rsub__] was passed

You don't need to use map*, you can subtract a Timestamp from a datetime column/Series:
In [11]: d = pd.to_datetime({'year':[2016], 'month':[6], 'day':[1]})
In [12]: d
Out[12]:
0 2016-06-01
dtype: datetime64[ns]
In [13]: d[0] # This is the Timestamp you are actually interested in subtracting
Out[13]: Timestamp('2016-06-01 00:00:00')
In [14]: dates = pd.date_range(start="2016-01-01", periods=4)
In [15]: dates - d[0]
Out[15]: TimedeltaIndex(['-152 days', '-151 days', '-150 days', '-149 days'], dtype='timedelta64[ns]', freq=None)
You can get the Timestamp more directly using the constructor:
In [21]: pd.Timestamp("2016-06-01")
Out[21]: Timestamp('2016-06-01 00:00:00')
*You should never use python's map with pandas, prefer .apply.
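Applied to the question's DataFrame, the same idea could look like the sketch below. The sample dates are made up for illustration, and the subtraction is written as reference minus date of birth so that the ages come out positive (the question subtracted in the other direction).

```python
import pandas as pd

# Hypothetical data standing in for the question's DataFrame.
df = pd.DataFrame(
    {"Manager_DoB": pd.to_datetime(["1970-01-01", "1985-06-01", "2016-06-01"])}
)

# The fixed date, built directly as a scalar Timestamp.
reference = pd.Timestamp("2016-06-01")

# Vectorized subtraction: scalar Timestamp minus a datetime64 column
# gives a timedelta64 column, no map or apply needed.
df["Age"] = reference - df["Manager_DoB"]

# Whole days as plain integers.
df["Age_days"] = df["Age"].dt.days

print(df)
```

The key point is that the right-hand side is a scalar Timestamp rather than a one-element Series, which is what tripped up the original map call.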

Related

Convert pandas series cell to string and datetime object

I have sliced the pandas dataframe.
end_date = df[-1:]['end']
type(end_date)
Out[4]: pandas.core.series.Series
end_date
Out[3]:
48173 2017-09-20 04:47:59
Name: end, dtype: datetime64[ns]
How to get rid of end_date's index value 48173 and get only 2017-09-20 04:47:59 string? I have to call REST API with 2017-09-20 04:47:59 as a parameter, so I have to get string from pandas datetime64 series.
How to get rid of end_date's index value 48173 and get only datetime object [something like datetime.datetime.strptime('2017-09-20 04:47:59', '%Y-%m-%d %H:%M:%S')]. I need it because, later I will have to check if '2017-09-20 04:47:59' < datetime.datetime(2017,1,9)
I need to convert just a single cell value, not a whole column.
How to do these conversions?
It seems you need:
import datetime
import pandas as pd
data = ['2017-09-20 04:47:59','2017-10-20 04:47:59','2017-09-30 04:47:59']
df = pd.DataFrame(data,columns=['end'])
df['end'] = pd.to_datetime(df['end'])
df
df will be:
end
0 2017-09-20 04:47:59
1 2017-10-20 04:47:59
2 2017-09-30 04:47:59
After that you can use below code to get rid of index and use as 'Timestamp' object:
end_date = df['end'].iloc[-1]  # get the last row of column 'end' as a scalar
print(type(end_date))  # a pandas Timestamp
end_date_str = end_date.strftime('%Y-%m-%d %H:%M:%S')  # convert to str
print(end_date_str)  # '2017-09-30 04:47:59'
print(end_date < datetime.datetime(2017,1,9))  # False
Simply cast the result to a string, and recover it using .values[0]:
In [38]: end_date
Out[38]:
48173 2017-09-20 04:47:59
Name: end, dtype: datetime64[ns]
In [39]: end_date.astype(str).values[0]
Out[39]: '2017-09-20 04:47:59'
If you want a datetime object, you have to convert it to a timestamp, and then back to a datetime object:
In [42]: end_date.values[0].item()
Out[42]: 1505882879000000000
In [43]: datetime.datetime.fromtimestamp(end_date.values[0].item()/10**9)
Out[43]: datetime.datetime(2017, 9, 20, 6, 47, 59)
Otherwise, you can strptime the string recovered in step 1:
In [48]: datetime.datetime.strptime(end_date.astype(str).values[0], '%Y-%m-%d %H:%M:%S')
Out[48]: datetime.datetime(2017, 9, 20, 4, 47, 59)
You may wonder why there is a two-hour difference between the results. This is because datetime.datetime.fromtimestamp converts the timestamp to my local timezone (currently CEST, which is UTC+2).
On the other hand, parsing a string yields a datetime object without any timezone information: strptime naively parses the timestamp without regard for the timezone, which leads to the two-hour discrepancy.
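For a single cell, both conversions can also be done without going through an integer timestamp, which sidesteps the timezone shift described above. A minimal sketch, assuming a one-row Series built by hand to mimic the question's end_date slice:

```python
import datetime
import pandas as pd

# Hypothetical one-row Series mimicking the question's `end_date` slice.
end_date = pd.Series(
    pd.to_datetime(["2017-09-20 04:47:59"]), index=[48173], name="end"
)

# Positional access returns the scalar Timestamp and drops the index label.
ts = end_date.iloc[0]

# String form, e.g. for a REST API parameter.
as_string = ts.strftime("%Y-%m-%d %H:%M:%S")

# Plain datetime.datetime; the values are copied as-is, with no local-time conversion.
as_datetime = ts.to_pydatetime()

print(as_string)
print(as_datetime < datetime.datetime(2017, 1, 9))
```

Since a Timestamp is itself a datetime.datetime subclass, the comparison would also work on ts directly; to_pydatetime is only needed when some other library insists on the standard-library type.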

Datetime and Timestamp equality in Python and Pandas

I've been playing around with datetimes and timestamps, and I've come across something that I can't understand.
import pandas as pd
import datetime
year_month = pd.DataFrame({'year':[2001,2002,2003], 'month':[1,2,3]})
year_month['date'] = [datetime.datetime.strptime(str(y) + str(m) + '1', '%Y%m%d') for y,m in zip(year_month['year'], year_month['month'])]
>>> year_month
month year date
0 1 2001 2001-01-01
1 2 2002 2002-02-01
2 3 2003 2003-03-01
I think the unique function is doing something to the timestamps that is changing them somehow:
first_date = year_month['date'].unique()[0]
>>> first_date == year_month['date'][0]
False
In fact:
>>> year_month['date'].unique()
array(['2000-12-31T16:00:00.000000000-0800',
'2002-01-31T16:00:00.000000000-0800',
'2003-02-28T16:00:00.000000000-0800'], dtype='datetime64[ns]')
My suspicions are that there is some sort of timezone difference underneath the functions, but I can't figure it out.
EDIT
I just checked the python commands list(set()) as an alternative to the unique function, and that works. This must be a quirk of the unique() function.
You have to convert to datetime64 to compare:
In [12]:
first_date == year_month['date'][0].to_datetime64()
Out[12]:
True
This is because unique has converted the dtype to datetime64:
In [6]:
first_date = year_month['date'].unique()[0]
first_date
Out[6]:
numpy.datetime64('2001-01-01T00:00:00.000000000+0000')
I think this is because unique returns a NumPy array, and NumPy currently has no dtype that understands Timestamp. See: Converting between datetime, Timestamp and datetime64
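Another way to compare, sketched below, is to wrap the value coming out of unique() in pd.Timestamp before the equality check. Depending on the pandas version, unique() may hand back a numpy.datetime64 or something already Timestamp-like, and pd.Timestamp accepts either, so this is meant as a version-tolerant variant rather than the only correct approach.

```python
import datetime
import pandas as pd

year_month = pd.DataFrame({"year": [2001, 2002, 2003], "month": [1, 2, 3]})
year_month["date"] = [
    datetime.datetime(int(y), int(m), 1)
    for y, m in zip(year_month["year"], year_month["month"])
]

first_unique = year_month["date"].unique()[0]  # type depends on pandas version
first_value = year_month["date"][0]            # pandas Timestamp

# Normalize to Timestamp on one side, then compare like with like.
same = pd.Timestamp(first_unique) == first_value
print(same)
```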

How to convert Pandas time series into a dict with string key

I need to convert a Pandas time series object into dicts which have the datetime as the key. I tried dict(my_ts_obj), but the keys are Timestamp, not string.
Thanks a million for your help!
You could use s.index.format() to convert the Timestamps into strings:
In [86]: import numpy as np
In [87]: rng = pd.date_range('12/1/2012', periods=4, freq='D')
In [88]: s = pd.Series(np.random.randn(len(rng)), index=rng)
In [89]: s
Out[89]:
2012-12-01 -1.673655
2012-12-02 1.447061
2012-12-03 -0.672347
2012-12-04 0.202692
Freq: D, dtype: float64
In [90]: dict(zip(s.index.format(), s))
Out[90]:
{'2012-12-01': -1.6736553219187384,
'2012-12-02': 1.4470613776383001,
'2012-12-03': -0.67234662513200982,
'2012-12-04': 0.20269246374288372}
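If you want explicit control over the key format, a sketch using DatetimeIndex.strftime might look like this. The fixed values replace the random data above so the result is reproducible, and the format string is only an example.

```python
import pandas as pd

rng = pd.date_range("2012-12-01", periods=4, freq="D")
# Fixed illustrative values instead of random numbers.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=rng)

# strftime turns every Timestamp in the index into a str with the given format;
# tolist() converts the values to plain Python floats.
as_dict = dict(zip(s.index.strftime("%Y-%m-%d"), s.tolist()))

print(as_dict)
```

Changing the format string (for example adding "%H:%M:%S") is all it takes to get finer-grained keys.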

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate people's ages from the time difference between 'entry_date' and 'dob'. To do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably due next week).
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because timedelta64[ns] scalars are not yet fully supported (the way Timestamps are now used for datetime64[ns]); that support is coming in 0.12.
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
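To convert a duration given in some other unit into whole days, use pd.Timedelta(...).days:
pd.Timedelta(36, unit='h').days
1
Note that recent pandas versions reject the ambiguous units 'Y' and 'M', and a Timedelta at the default nanosecond resolution only spans about 292 years, so a value such as 1985 years cannot be represented.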
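Putting the .dt.days approach together with the question's age calculation, a minimal sketch could look like the following. The frame is a made-up stand-in for munged_data, reusing the dates from one of the answers above, and dividing by 365.25 is the question's own approximation, not an exact calendar age.

```python
import pandas as pd

# Hypothetical stand-in for the question's `munged_data`.
munged_data = pd.DataFrame(
    {
        "dob": pd.to_datetime(["2001-01-01", "2004-06-01"]),
        "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
    }
)

# Column-wise subtraction yields a timedelta64 Series.
diff = munged_data["entry_date"] - munged_data["dob"]

# .dt.days extracts whole days as integers, fully vectorized.
munged_data["days"] = diff.dt.days

# Approximate age in years, as proposed in the question.
munged_data["age_years"] = munged_data["days"] / 365.25

print(munged_data)
```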

pandas.to_datetime inconsistent time string format

I am attempting to convert the index of a pandas.DataFrame from string format to a datetime index, using pandas.to_datetime().
Import pandas:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.10.1'
Create an example DataFrame:
In [3]: d = {'data' : pd.Series([1.,2.], index=['26/12/2012', '10/01/2013'])}
In [4]: df=pd.DataFrame(d)
Look at indices. Note that the date format is day/month/year:
In [5]: df.index
Out[5]: Index([26/12/2012, 10/01/2013], dtype=object)
Convert index to datetime:
In [6]: pd.to_datetime(df.index)
Out[6]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-12-26 00:00:00, 2013-10-01 00:00:00]
Length: 2, Freq: None, Timezone: None
Already at this stage, you can see that the date format for each entry has been formatted differently. The first is fine, the second has swapped month and day.
This is what I want to write, but avoiding the inconsistent formatting of date strings:
In [7]: df.set_index(pd.to_datetime(df.index))
Out[7]:
data
2012-12-26 1
2013-10-01 2
I guess the first entry is correct because the function 'knows' there aren't 26 months, and so does not choose the default month/day/year format.
Is there another/better way to do this? Can I pass the format into the to_datetime() function?
Thank you.
EDIT:
I have found a way to do this, without pandas.to_datetime:
from datetime import datetime as dt
date_string_list = df.index.tolist()
datetime_list = [dt.strptime(s, '%d/%m/%Y') for s in date_string_list]
df.index = datetime_list
but it's a bit messy. Any improvements welcome.
There is a (hidden?) dayfirst argument to to_datetime:
In [23]: pd.to_datetime(df.index, dayfirst=True)
Out[23]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-12-26 00:00:00, 2013-01-10 00:00:00]
Length: 2, Freq: None, Timezone: None
In pandas 0.11 (onwards) you'll be able to use the format argument:
In [24]: pd.to_datetime(df.index, format='%d/%m/%Y')
Out[24]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-12-26 00:00:00, 2013-01-10 00:00:00]
Length: 2, Freq: None, Timezone: None
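As a self-contained version of the format-based approach, the sketch below rebuilds the question's frame and replaces the string index in one step. It assumes a pandas version where to_datetime accepts the format argument, as noted above.

```python
import pandas as pd

# Same example data as in the question: day/month/year strings as the index.
df = pd.DataFrame({"data": [1.0, 2.0]}, index=["26/12/2012", "10/01/2013"])

# An explicit format removes any guessing about day/month order.
df.index = pd.to_datetime(df.index, format="%d/%m/%Y")

print(df.index)
```

An explicit format also fails loudly on strings that do not match, which is usually preferable to a silent day/month swap.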
