I have a pandas dataframe called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to calculate people's ages from the time difference between 'entry_date' and 'dob'. To do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
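To tie this back to the question, here is a minimal sketch of the age calculation built on .dt.days. The column names come from the question; the sample dates are made up for illustration, and dividing by 365.25 only approximates calendar age.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
munged_data = pd.DataFrame({
    "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
    "dob": pd.to_datetime(["2001-01-01", "2004-06-01"]),
})

# Vectorized difference in whole days (an integer Series).
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days

# Approximate age in years, rounded to the nearest integer.
munged_data["age"] = (days / 365.25).round().astype(int)
print(munged_data)
```

If an exact "birthday has passed" age is needed, comparing the year, month and day components would be more precise than this division.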
You need 0.11 for this (0.11rc1 is out; the final release is probably next week).
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0]-df.iloc[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0]-df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Suppose you have a pandas Series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the days (or whatever attribute you want) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
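On recent pandas versions the same result can probably be obtained without apply, through the .dt accessor. The sketch below uses made-up values for time_difference and also shows that .dt.days rounds towards negative infinity, whereas dividing by pd.Timedelta(days=1) keeps the fractional part.

```python
import pandas as pd

# Hypothetical values: +1.5 days and -0.5 days.
time_difference = pd.Series([
    pd.Timedelta(days=1, hours=12),
    pd.Timedelta(hours=-12),
])

# Whole-day component; negative durations are floored (-12h -> -1 day).
whole_days = time_difference.dt.days

# Fractional days as floats.
fractional_days = time_difference / pd.Timedelta(days=1)

print(whole_days.tolist())
print(fractional_days.tolist())
```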
To convert a duration given in any supported unit into whole days, just use pd.Timedelta(...).days:
pd.Timedelta(100, unit='h').days
4
I want to calculate the difference in days between a pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with a logical solution.
Please help me with the code. I am new to Python, and I run into a lot of syntax errors whenever I apply a function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 False
1 True
2 False
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
You can try the following:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime:
>>> df['col1'] = pd.to_datetime(df['col1'], infer_datetime_format=True)
Set the current datetime in order to get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
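If only whole days are wanted, a minimal sketch along the same lines is shown below. It assumes a fixed reference date so that the output is reproducible; in practice pd.Timestamp.now().normalize() could stand in for it. The three dates are taken from the question.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2013-02-16", "2013-01-29", "2013-03-21"]))

# Fixed "today" for a reproducible example (an assumption);
# use pd.Timestamp.now().normalize() for the real current date at midnight.
today = pd.Timestamp("2013-04-01")

# Subtract, then keep only the integer day component.
days_ago = (today - dates).dt.days
print(days_ago.tolist())
```

Normalizing "now" to midnight avoids the time-of-day remainder that appears in the outputs above.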
With numpy you can solve it as shown in difference-two-dates-days-weeks-months-years-pandas-python-2. Bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D', for weeks use 'W', for months use 'M' and for years use 'Y'
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
If you want days as int rather than float, use:
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
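As a small worked illustration of the difference between / and //, the sketch below uses made-up 'start' and 'end' columns (hypothetical names, not from the answer). It sticks to fixed-length units such as days and hours, since those divide a timedelta Series unambiguously.

```python
import numpy as np
import pandas as pd

# Hypothetical data: durations of 2.5 days and 0.5 days.
df = pd.DataFrame({
    "start": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 12:00"]),
    "end": pd.to_datetime(["2019-01-03 12:00", "2019-01-02 00:00"]),
})
delta = df["end"] - df["start"]

df["days_float"] = delta / np.timedelta64(1, "D")   # true division keeps fractions
df["days_int"] = delta // np.timedelta64(1, "D")    # floor division gives whole days
df["hours"] = delta / np.timedelta64(1, "h")        # same idea with another unit
print(df)
```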
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before but this suggests your pandas date series (a list-like object) is iterable and each element of this series is an instance of a class which has a to_datetime function.
Assuming my assumptions are correct, the following function would take such a series and return a list of timedeltas (objects representing the difference between two datetime objects).
from datetime import datetime
def convert(pandas_series):
    # get the current date
    now = datetime.now()
    # Use a list comprehension and each element's to_pydatetime method to calculate timedeltas.
    return [now - pandas_element.to_pydatetime() for pandas_element in pandas_series]
# assuming 'some_pandas_series' is a pandas series of Timestamps
list_of_timedeltas = convert(some_pandas_series)
I have a dataframe in this format -
var1 date
A 2017/01/01
A 2017/01/02
...
I want the date to be converted into YYYY-MM format, but df['date'].dtype is object.
How can I remove the day part from the date while keeping the data type as datetime?
Expected Output -
A - 2017/01
Thanks
You can't have a custom representation for the datetime dtype, but you have the following options:
- use strings: you can have any representation you wish, but all datetime methods and attributes are lost
- use datetime, but set the day part to 1 (as @Kopytok has already shown)
- use the period dtype, which still allows you to use some date arithmetic
Demo:
In [207]: df
Out[207]:
var1 date
0 A 2018-12-31
1 A 2017-09-07
2 B 2016-02-29
In [208]: df['new'] = df['date'].dt.to_period('M')
In [209]: df
Out[209]:
var1 date new
0 A 2018-12-31 2018-12
1 A 2017-09-07 2017-09
2 B 2016-02-29 2016-02
In [210]: df.dtypes
Out[210]:
var1 object
date datetime64[ns]
new object
dtype: object
In [211]: df['new'] + 8
Out[211]:
0 2019-08
1 2018-05
2 2016-10
Name: new, dtype: object
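To compare the three options side by side, here is a minimal sketch built on the two rows from the question. The new column names are made up for illustration, and the behavior described is what I would expect from a reasonably recent pandas, where the period column reports a period[M] dtype rather than object.

```python
import pandas as pd

df = pd.DataFrame({"var1": ["A", "A"], "date": ["2017/01/01", "2017/01/02"]})
df["date"] = pd.to_datetime(df["date"], format="%Y/%m/%d")

# Option 1: plain strings (loses all datetime functionality).
df["as_string"] = df["date"].dt.strftime("%Y-%m")

# Option 2: still datetime, with the day forced to the first of the month.
df["month_start"] = df["date"].dt.to_period("M").dt.to_timestamp()

# Option 3: monthly periods, which support simple arithmetic.
df["as_period"] = df["date"].dt.to_period("M")
shifted = df["as_period"] + 8  # eight months later

print(df)
print(shifted)
```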
It is possible to replace every date with the first day of its month:
pd.to_datetime(d["date"], format="%Y/%m/%d").apply(lambda x: x.replace(day=1))
Result:
0 2017-01-01
1 2017-01-01
I am working with python 3.5.2, pandas 0.18.1 and sqlite3.
In my database, I have a column unix_time holding INT seconds since 1970. Ideally, I want to read my dataframe from sqlite and then create a time column corresponding to the datetime or pandas.tslib.Timestamp conversion of the unix_time column, which I would only use for some processing and then drop before saving the dataframe back.
The issue is that when parsing the unix_time column using :
df = pd.read_sql_query("SELECT * FROM test", con, parse_dates=['unix_time'])
I obtain pandas.tslib.Timestamp types which is fine for my processing, but then I have to recreate my original unix_time column using :
df['unix_time'][i] = (df['unix_time'][i] - datetime(1970,1,1)).total_seconds()
which is really 'dirty'
First question : Do you have a better way?
I thought about giving up the unix time format and using only the datetime format, but the to_datetime method from pandas in fact returns pandas.tslib.Timestamp ... And anyway, doing so would force me to iterate over all rows, which is a bad solution. (It is impossible to apply to_datetime to anything other than a view over a single cell of the dataframe.)
Second question : Is it possible to apply it on a series?
My last try was with directly using df['time'] = datetime.datetime.fromtimestamp(df['unix_time']) but surprisingly, it also returns pandas.tslib.Timestamp.
In the end, knowing that I can only save unix timestamps or datetimes, my only choices for the moment are:
- parsing, but then having to convert them back to unix timestamps one by one;
- or not parsing, but then having to convert them to pandas.tslib.Timestamp one by one.
It would be great if I could convert a whole series.
Last question : Is there a way to convert a unix timestamps series to datetime (or at least pandas.tslib.Timestamp), or a pandas.tslib.Timestamp (or datetime) series to unix timestamps?
Thanks
EDIT:
During my processing, I extract a row that I want to append to my dataset. Apparently, the conversion to pandas.tslib.Timestamp happens implicitly when going from a dataframe to a Series:
df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
df['Date'] = pd.to_datetime(df.UNX, unit='s')
print(df.Date.dtypes)
print(type(df['Date'][0]))
test = df.iloc[0]
print(type(test.Date))
new_df = test.to_frame().transpose() #from here, impossible to do : new_df.to_sql("test", con) because the type for 'Date' is not supported
print(new_df.Date.dtypes)
returns
datetime64[ns]
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
object
Is there a way to convert the 'Date' in new_df from pandas.tslib.Timestamp to datetime64[ns] or datetime.datetime (or simply str) ?
IIUC you can do it this way:
In [96]: df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
In [97]: df
Out[97]:
UNX
0 1451606400
1 1451616399
2 1451626398
3 1451636397
4 1451646396
5 1451656395
6 1451666394
7 1451676393
8 1451686392
9 1451696391
Convert UNIX epoch to Python datetime:
In [98]: df['Date'] = pd.to_datetime(df.UNX, unit='s')
In [99]: df
Out[99]:
UNX Date
0 1451606400 2016-01-01 00:00:00
1 1451616399 2016-01-01 02:46:39
2 1451626398 2016-01-01 05:33:18
3 1451636397 2016-01-01 08:19:57
4 1451646396 2016-01-01 11:06:36
5 1451656395 2016-01-01 13:53:15
6 1451666394 2016-01-01 16:39:54
7 1451676393 2016-01-01 19:26:33
8 1451686392 2016-01-01 22:13:12
9 1451696391 2016-01-02 00:59:51
Convert datetime to UNIX epoch:
In [100]: df['UNX2'] = df.Date.astype('int64')//10**9
In [101]: df
Out[101]:
UNX Date UNX2
0 1451606400 2016-01-01 00:00:00 1451606400
1 1451616399 2016-01-01 02:46:39 1451616399
2 1451626398 2016-01-01 05:33:18 1451626398
3 1451636397 2016-01-01 08:19:57 1451636397
4 1451646396 2016-01-01 11:06:36 1451646396
5 1451656395 2016-01-01 13:53:15 1451656395
6 1451666394 2016-01-01 16:39:54 1451666394
7 1451676393 2016-01-01 19:26:33 1451676393
8 1451686392 2016-01-01 22:13:12 1451686392
9 1451696391 2016-01-02 00:59:51 1451696391
Check:
In [102]: df.UNX.eq(df.UNX2).all()
Out[102]: True
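The EDIT in the question (the 'Date' column turning into object dtype after to_frame().transpose()) is not covered above, so here is a hedged sketch of two possible ways around it. It assumes the same 'UNX' and 'Date' column names; selecting with a list, df.iloc[[0]], is my suggestion rather than something from the original answer.

```python
import pandas as pd

df = pd.DataFrame({"UNX": [1451606400, 1451616399]})
df["Date"] = pd.to_datetime(df["UNX"], unit="s")

# Way 1: rebuild the dtypes after the Series round trip.
row = df.iloc[0]                      # mixed-type row becomes an object Series
new_df = row.to_frame().transpose()   # every column is object here
new_df["Date"] = pd.to_datetime(new_df["Date"])
new_df["UNX"] = new_df["UNX"].astype("int64")

# Way 2: select with a list so the result stays a DataFrame with its dtypes.
one_row = df.iloc[[0]]

print(new_df.dtypes)
print(one_row.dtypes)
```

The second form avoids the object detour entirely, which may matter before handing the frame to to_sql.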
Round trip between Pandas Timestamp and Unix Seconds (since 1970-01-01):
date_in = pd.to_datetime("2021-04-07")
# type(date_in) is: pandas._libs.tslibs.timestamps.Timestamp
unix_seconds = date_in.value//10**9
date_out = pd.to_datetime(unix_seconds, unit="s")
Output:
date_in
Out[1]: Timestamp('2021-04-07 00:00:00')
unix_seconds
Out[2]: 1617753600
date_out
Out[3]: Timestamp('2021-04-07 00:00:00')
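For a whole Series, an alternative sketch that does not rely on the internal integer representation is to subtract the epoch and floor-divide by one second. The sample values below are the first two rows of the UNX example earlier in this thread.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2016-01-01 00:00:00", "2016-01-01 02:46:39"]))

# Timestamp -> Unix seconds: subtract the epoch, then floor-divide by one second.
unix_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Unix seconds -> Timestamp.
round_trip = pd.to_datetime(unix_seconds, unit="s")

print(unix_seconds.tolist())
print(round_trip)
```

Because no assumption is made about nanosecond storage, this form should keep working if the column uses a different datetime resolution.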
I have a Series of dates in datetime64 format.
I want to convert them to a series of Period with a monthly frequency. (Essentially, I want to group dates into months for analytical purposes).
There must be a way of doing this - I just cannot find it quickly.
Note: these dates are not the index of the data frame - they are just a column of data in the data frame.
Example input data (as a Series)
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-12-01']))
print (data)
My current kludge/work around looks like
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.DatetimeIndex(data).to_period('M')
data = pd.Series(data.year).astype('str') + '-' + pd.Series((data.month).astype('int')).map('{:0>2d}'.format)
data = data.where(data != '2262-04', other='No Date')
print (data)
There are currently some issues (even in master) dealing with NaT in PeriodIndex, so your approach won't work like that. But it seems that you simply want to resample, so do this. You can of course specify a function for how if you want.
In [57]: data
Out[57]:
0 2014-10-01
1 2014-10-01
2 2014-10-31
3 2014-11-15
4 2014-11-30
5 NaT
6 2014-12-01
dtype: datetime64[ns]
In [58]: df = DataFrame(dict(A = data, B = np.arange(len(data))))
In [59]: df.dropna(how='any',subset=['A']).set_index('A').resample('M',how='count')
Out[59]:
B
A
2014-10-31 3
2014-11-30 2
2014-12-31 1
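On more recent pandas versions, where the how= argument of resample is no longer available, a similar monthly count can probably be had by converting the column to monthly periods and counting. The sketch below is an assumption about how one might do it today, not part of the original answer; it reuses the question's data and its 'No Date' label.

```python
import numpy as np
import pandas as pd

data = pd.to_datetime(pd.Series(
    ['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15',
     '2014-11-30', np.nan, '2014-12-01']))

# Convert each date to its monthly period; the missing value stays NaT.
months = data.dt.to_period("M")

# Count rows per month (missing values are dropped by value_counts).
counts = months.value_counts().sort_index()

# String labels, with a placeholder for missing dates.
labels = months.astype(str).where(months.notna(), "No Date")

print(counts)
print(labels.tolist())
```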
import pandas as pd
import numpy as np
from datetime import datetime
data = pd.to_datetime(
pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data=pd.Series(['{}-{:02d}'.format(x.year,x.month) if isinstance(x, datetime) else "Nat" for x in pd.DatetimeIndex(data).to_pydatetime()])
0 2014-10
1 2014-10
2 2014-10
3 2014-11
4 2014-11
5 Nat
6 2014-01
dtype: object
This is the best I could come up with. If the only possible non-datetime objects are floats, you can change if isinstance(x, datetime) to if not isinstance(x, float).
I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename,header=None,names=('uid','iid','artist','tag','date'),index_col=indexes, parse_dates='date')
This is all fine, except that any subsequent analysis in which I do monthly resampling always represents dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'). This can lead to inconsistencies where the original data has months labeled as 'YYYY-MM-01', while any resampled data has the months labeled as 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: What is the easiest and/or fastest way to convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is to use a modified date_parser in my pd.read_table call that always converts the month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of month, but it doesn't matter.
In [39]: df = DataFrame(np.random.randn(10,2),columns=list('AB'),
index=date_range('20130101',periods=10,freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force convert them to the end-of-month in time space regardless of the day
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) as periods
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
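Another way to get the same effect, sketched here as an assumption rather than as part of the answer above, is to add a pd.offsets.MonthEnd(0) offset, which rolls each date forward to the end of its month and leaves dates already at month end unchanged. Since the question mentions a multi-index, the sketch also shows the idea applied to one level; the level names 'uid' and 'date' are borrowed from the question's read_table call.

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=3, freq="MS")

# Roll every date forward to the last day of its month.
month_end = idx + pd.offsets.MonthEnd(0)

# And back to the first day of the month via monthly periods.
month_start = month_end.to_period("M").to_timestamp()

# Same idea for the date level of a MultiIndex.
mi = pd.MultiIndex.from_product([["u1", "u2"], idx], names=["uid", "date"])
mi_end = mi.set_levels(mi.levels[1] + pd.offsets.MonthEnd(0), level="date")

print(month_end)
print(month_start)
print(mi_end)
```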
Use replace() to change the day value. You can get the last day of the month using calendar.monthrange:
from datetime import date
import calendar
d = date(2000,1,1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
UPDATE
I added an example for pandas.
Sample file date.csv:
2013-01-01, 1
2013-02-01, 2
ipython shell log.
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2