Pandas DataFrame index - month and day only - python

I'd like to have a DataFrame with a DatetimeIndex, but I only want the months and days; not years. I'd like it to look like the following:
(index) (values)
01-01 56.2
01-02 59.6
...
01-31 62.3
02-01 61.6
...
12-31 44.0
I've tried creating a date_range but this seems to require the year input, so I can't seem to figure out how to achieve the above.

You can do it this way:
In [78]: df = pd.DataFrame({'val':np.random.rand(10)}, index=pd.date_range('2000-01-01', freq='10D', periods=10))
In [79]: df
Out[79]:
val
2000-01-01 0.422023
2000-01-11 0.215800
2000-01-21 0.186017
2000-01-31 0.804285
2000-02-10 0.014004
2000-02-20 0.296644
2000-03-01 0.048683
2000-03-11 0.239037
2000-03-21 0.129382
2000-03-31 0.963110
In [80]: df.index.dtype_str
Out[80]: 'datetime64[ns]'
In [81]: df.index.dtype
Out[81]: dtype('<M8[ns]')
In [82]: df.index = df.index.strftime('%m-%d')
In [83]: df
Out[83]:
val
01-01 0.422023
01-11 0.215800
01-21 0.186017
01-31 0.804285
02-10 0.014004
02-20 0.296644
03-01 0.048683
03-11 0.239037
03-21 0.129382
03-31 0.963110
In [84]: df.index.dtype_str
Out[84]: 'object'
In [85]: df.index.dtype
Out[85]: dtype('O')
NOTE: the index dtype is now a string (object).
PS: of course you can do it in one step if you need:
In [86]: pd.date_range('2000-01-01', freq='10D', periods=5).strftime('%m-%d')
Out[86]:
array(['01-01', '01-11', '01-21', '01-31', '02-10'],
dtype='<U5')
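For the full calendar shown in the question, a minimal sketch along the same lines might look like the following. The year 2000 is an assumption, chosen because it is a leap year and so includes 02-29; the random values are placeholders for real data.

```python
import numpy as np
import pandas as pd

# Assumption: a leap year is used so that Feb 29 appears in the index.
days = pd.date_range('2000-01-01', '2000-12-31', freq='D')

# Placeholder values; real data would go here.
df = pd.DataFrame({'val': np.random.rand(len(days))},
                  index=days.strftime('%m-%d'))

print(df.head())
# The labels are plain strings, so lookups use string keys.
print(df.loc['02-29'])
```

Because the labels are zero-padded strings, they still sort in calendar order, but datetime-specific features such as resampling no longer apply to this index.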

Related

Pandas resample offset from the most recent year end date?

Can someone explain what is going on with my resampling?
For example,
In [53]: daily_3mo_treasury.resample('5Y').mean()
Out[53]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.534476
Where the last date in my time series is 2018-08-23 2.04
I really want my resample from the most recent year-end instead, so for example from 2017-12-31 to 2012-12-31 and so on.
I tried,
end = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
daily_3mo_treasury.iloc[:end].resample('5Y').mean()
In [66]: daily_3mo_treasury.iloc[:end].resample('5Y').mean()
Out[66]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.333467
dtype: float64
Where the last value in daily_3mo_treasury.iloc[:end] is 2017-12-29 1.37
How come my second 5 year resample is not ending 2017-12-31?
Edit: My index is sorted.
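From @ALollz: when you resample, the bins are anchored on the first date in your index. Starting the slice at 1992-12-31, a multiple of five years before 2017-12-31, puts the bin edges where I want them: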
sistart = daily_3mo_treasury.index.searchsorted(date(1992,12,31))
siend = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
In [95]: daily_3mo_treasury.iloc[sistart:siend].resample('5Y').mean()
Out[95]:
1992-12-31 3.080000
1997-12-31 4.562246
2002-12-31 4.050696
2007-12-31 2.925971
2012-12-31 0.360775
2017-12-31 0.278233
dtype: float64
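The effect of the first date on the bin edges can be reproduced with synthetic data. This is only a sketch: the series `s` is a hypothetical stand-in for daily_3mo_treasury, and an offset object is used instead of the '5Y' string because the year-end alias differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for daily_3mo_treasury: one value per business day.
idx = pd.bdate_range('1990-01-01', '2018-08-23')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# An offset object avoids the 'Y' vs 'YE' alias difference between versions.
five_years = pd.offsets.YearEnd(5)

# First date falls in 1993, so the bin labels run 1993, 1998, ..., 2018.
default_bins = s.loc['1993-01-04':].resample(five_years).mean()

# First date is 1992-12-31, so the labels run 1992, 1997, ..., 2017.
anchored_bins = s.loc['1992-12-31':'2017-12-31'].resample(five_years).mean()

print(default_bins.index[-1].date(), anchored_bins.index[-1].date())
```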

Extract day and month from a datetime object

I have a column with dates in string format '2017-01-01'. Is there a way to extract day and month from it using pandas?
I have converted the column to datetime dtype but haven't figured out the later part:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df.dtypes:
Date datetime64[ns]
print(df)
Date
0 2017-05-11
1 2017-05-12
2 2017-05-13
Use dt.day and dt.month (see Series.dt):
df = pd.DataFrame({'date':pd.date_range(start='2017-01-01',periods=5)})
df.date.dt.month
Out[164]:
0 1
1 1
2 1
3 1
4 1
Name: date, dtype: int64
df.date.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
Name: date, dtype: int64
You can also use dt.strftime:
df.date.dt.strftime('%m')
Out[166]:
0 01
1 01
2 01
3 01
4 01
Name: date, dtype: object
A simple form:
df['MM-DD'] = df['date'].dt.strftime('%m-%d')
Use dt to get the datetime attributes of the column.
In [60]: df = pd.DataFrame({'date': [datetime.datetime(2018,1,1),datetime.datetime(2018,1,2),datetime.datetime(2018,1,3),]})
In [61]: df
Out[61]:
date
0 2018-01-01
1 2018-01-02
2 2018-01-03
In [63]: df['day'] = df.date.dt.day
In [64]: df['month'] = df.date.dt.month
In [65]: df
Out[65]:
date day month
0 2018-01-01 1 1
1 2018-01-02 2 1
2 2018-01-03 3 1
Timing the methods provided:
Using apply:
In [217]: %timeit(df['date'].apply(lambda d: d.day))
The slowest run took 33.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 210 µs per loop
Using dt.day:
In [218]: %timeit(df.date.dt.day)
10000 loops, best of 3: 127 µs per loop
Using dt.strftime:
In [219]: %timeit(df.date.dt.strftime('%d'))
The slowest run took 40.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 284 µs per loop
We can see that dt.day is the fastest
This should do it:
df['day'] = df['Date'].apply(lambda r:r.day)
df['month'] = df['Date'].apply(lambda r:r.month)
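Putting the answers together for the frame in the question, a small sketch might look like this. The column names 'day', 'month' and 'MM-DD' are illustrative choices, not something the question prescribes.

```python
import pandas as pd

# Same shape as the question: dates stored as 'YYYY-MM-DD' strings.
df = pd.DataFrame({'Date': ['2017-05-11', '2017-05-12', '2017-05-13']})
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

df['day'] = df['Date'].dt.day                   # integer day of month
df['month'] = df['Date'].dt.month               # integer month number
df['MM-DD'] = df['Date'].dt.strftime('%m-%d')   # zero-padded string

print(df)
```

The dt accessor is vectorized, which is consistent with the timings above showing it ahead of apply.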

Modify hour in datetimeindex in pandas dataframe

I have a dataframe that looks like this:
master.head(5)
Out[73]:
hour price
day
2014-01-01 0 1066.24
2014-01-01 1 1032.11
2014-01-01 2 1028.53
2014-01-01 3 963.57
2014-01-01 4 890.65
In [74]: master.index.dtype
Out[74]: dtype('<M8[ns]')
What I need to do is update the hour in the index with the hour in the column but the following approaches don't work:
In [82]: master.index.hour = master.index.hour(master['hour'])
TypeError: 'numpy.ndarray' object is not callable
In [83]: master.index.hour = [master.index.hour(master.iloc[i,0]) for i in len(master.index.hour)]
TypeError: 'int' object is not iterable
How to proceed?
IIUC I think you want to construct a TimedeltaIndex:
In [89]:
df.index += pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[89]:
hour price
2014-01-01 00:00:00 0 1066.24
2014-01-01 01:00:00 1 1032.11
2014-01-01 02:00:00 2 1028.53
2014-01-01 03:00:00 3 963.57
2014-01-01 04:00:00 4 890.65
Just to compare against using apply:
In [87]:
%timeit df.index + pd.TimedeltaIndex(df['hour'], unit='h')
%timeit df.index + df['hour'].apply(lambda x: pd.Timedelta(x, 'h'))
1000 loops, best of 3: 291 µs per loop
1000 loops, best of 3: 1.18 ms per loop
You can see that using a TimedeltaIndex is significantly faster
master.index = pd.to_datetime(master.index.map(lambda x: x.strftime('%Y-%m-%d')) + '-' + master.hour.map(str), format='%Y-%m-%d-%H.0')
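As a self-contained sketch of the timedelta approach, the following rebuilds a small version of the `master` frame from the question. It uses pd.to_timedelta rather than the TimedeltaIndex constructor, on the assumption that recent pandas versions no longer accept the unit argument there.

```python
import pandas as pd

# Small stand-in for the `master` frame from the question.
master = pd.DataFrame(
    {'hour': [0, 1, 2, 3, 4],
     'price': [1066.24, 1032.11, 1028.53, 963.57, 890.65]},
    index=pd.DatetimeIndex(['2014-01-01'] * 5, name='day'),
)

# Convert the hour column to timedeltas and shift each index entry by it.
# .to_numpy() drops the Series index so the addition is purely positional.
offsets = pd.to_timedelta(master['hour'].to_numpy(), unit='h')
master.index = master.index + offsets

print(master)
```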

Get MM-DD-YYYY from pandas Timestamp

Dates seem to be a tricky thing in Python, and I am having a lot of trouble simply stripping the date out of a pandas Timestamp. I would like to get from 2013-09-29 02:34:44 to simply 09-29-2013.
I have a dataframe with a column Created_Date:
Name: Created_Date, Length: 1162549, dtype: datetime64[ns]
I have tried applying the .date() method on this Series, e.g. df.Created_Date.date(), but I get the error AttributeError: 'Series' object has no attribute 'date'
Can someone help me out?
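Map over the elements (the examples below use '%d-%m-%Y'; swap in '%m-%d-%Y' for the month-first format asked for in the question):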
In [239]: from operator import methodcaller
In [240]: s = Series(date_range(Timestamp('now'), periods=2))
In [241]: s
Out[241]:
0 2013-10-01 00:24:16
1 2013-10-02 00:24:16
dtype: datetime64[ns]
In [238]: s.map(lambda x: x.strftime('%d-%m-%Y'))
Out[238]:
0 01-10-2013
1 02-10-2013
dtype: object
In [242]: s.map(methodcaller('strftime', '%d-%m-%Y'))
Out[242]:
0 01-10-2013
1 02-10-2013
dtype: object
You can get the raw datetime.date objects by calling the date() method of the Timestamp elements that make up the Series:
In [249]: s.map(methodcaller('date'))
Out[249]:
0 2013-10-01
1 2013-10-02
dtype: object
In [250]: s.map(methodcaller('date')).values
Out[250]:
array([datetime.date(2013, 10, 1), datetime.date(2013, 10, 2)], dtype=object)
Yet another way you can do this is by calling the unbound Timestamp.date method:
In [273]: s.map(Timestamp.date)
Out[273]:
0 2013-10-01
1 2013-10-02
dtype: object
This method is the fastest, and IMHO the most readable. Timestamp is accessible in the top-level pandas module, like so: pandas.Timestamp. I've imported it directly for expository purposes.
The date attribute of DatetimeIndex objects does something similar, but returns a numpy object array instead:
In [243]: index = DatetimeIndex(s)
In [244]: index
Out[244]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:24:16, 2013-10-02 00:24:16]
Length: 2, Freq: None, Timezone: None
In [246]: index.date
Out[246]:
array([datetime.date(2013, 10, 1), datetime.date(2013, 10, 2)], dtype=object)
For larger datetime64[ns] Series objects, calling Timestamp.date is faster than operator.methodcaller which is slightly faster than a lambda:
In [263]: f = methodcaller('date')
In [264]: flam = lambda x: x.date()
In [265]: fmeth = Timestamp.date
In [266]: s2 = Series(date_range('20010101', periods=1000000, freq='T'))
In [267]: s2
Out[267]:
0 2001-01-01 00:00:00
1 2001-01-01 00:01:00
2 2001-01-01 00:02:00
3 2001-01-01 00:03:00
4 2001-01-01 00:04:00
5 2001-01-01 00:05:00
6 2001-01-01 00:06:00
7 2001-01-01 00:07:00
8 2001-01-01 00:08:00
9 2001-01-01 00:09:00
10 2001-01-01 00:10:00
11 2001-01-01 00:11:00
12 2001-01-01 00:12:00
13 2001-01-01 00:13:00
14 2001-01-01 00:14:00
...
999985 2002-11-26 10:25:00
999986 2002-11-26 10:26:00
999987 2002-11-26 10:27:00
999988 2002-11-26 10:28:00
999989 2002-11-26 10:29:00
999990 2002-11-26 10:30:00
999991 2002-11-26 10:31:00
999992 2002-11-26 10:32:00
999993 2002-11-26 10:33:00
999994 2002-11-26 10:34:00
999995 2002-11-26 10:35:00
999996 2002-11-26 10:36:00
999997 2002-11-26 10:37:00
999998 2002-11-26 10:38:00
999999 2002-11-26 10:39:00
Length: 1000000, dtype: datetime64[ns]
In [269]: timeit s2.map(f)
1 loops, best of 3: 1.04 s per loop
In [270]: timeit s2.map(flam)
1 loops, best of 3: 1.1 s per loop
In [271]: timeit s2.map(fmeth)
1 loops, best of 3: 968 ms per loop
Keep in mind that one of the goals of pandas is to provide a layer on top of numpy so that (most of the time) you don't have to deal with the low-level details of the ndarray. Getting the raw datetime.date objects in an array is therefore of limited use, since they don't correspond to any numpy.dtype that pandas supports (at the time of writing, pandas only supported the datetime64[ns], i.e. nanosecond, dtype). That said, sometimes you need to do this.
Maybe this only came in recently, but there are built-in methods for this. Try:
In [27]: s = pd.Series(pd.date_range(pd.Timestamp('now'), periods=2))
In [28]: s
Out[28]:
0 2016-02-11 19:11:43.386016
1 2016-02-12 19:11:43.386016
dtype: datetime64[ns]
In [29]: s.dt.to_pydatetime()
Out[29]:
array([datetime.datetime(2016, 2, 11, 19, 11, 43, 386016),
datetime.datetime(2016, 2, 12, 19, 11, 43, 386016)], dtype=object)
You can use .dt.date on a datetime64[ns] column of the dataframe.
For example: df['Created_date'] = df['Created_date'].dt.date
Input dataframe named as test_df:
print(test_df)
Result:
Created_date
0 2015-03-04 15:39:16
1 2015-03-22 17:36:49
2 2015-03-25 22:08:45
3 2015-03-16 13:45:20
4 2015-03-19 18:53:50
Checking dtypes:
print(test_df.dtypes)
Result:
Created_date datetime64[ns]
dtype: object
Extracting date and updating Created_date column:
test_df['Created_date'] = test_df['Created_date'].dt.date
print(test_df)
Result:
Created_date
0 2015-03-04
1 2015-03-22
2 2015-03-25
3 2015-03-16
4 2015-03-19
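Well, I would do it this way:
# assumed: timeStamp is the starting Timestamp and years is any sequence whose length sets the number of days
pdTime = pd.date_range(timeStamp, periods=len(years), freq="D")
# assumed: i is the position of the date to format
pdTime[i].strftime('%m-%d-%Y')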
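For the exact format asked for in the question, a short sketch using the vectorized dt.strftime accessor might look like this. The two timestamps are taken from the question and the first answer; the variable names are illustrative.

```python
import pandas as pd

# Timestamps taken from the question and the first answer.
s = pd.Series(pd.to_datetime(['2013-09-29 02:34:44', '2013-10-01 00:24:16']))

# %m-%d-%Y puts the month first, as the question asks; the result is strings.
formatted = s.dt.strftime('%m-%d-%Y')
print(formatted.tolist())  # ['09-29-2013', '10-01-2013']
```

Note that the result holds strings, not dates, so it suits display or export rather than further date arithmetic.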

Python Pandas business day range bdate_range doesn't take 1min freq?

I am trying to use bdate_range with '1min' freq to get minute by minute data on all business days.
df = pd.bdate_range('20130101 9:30','20130106 16:00',freq='1min')
The output ends with:
......
2013-01-05 23:59:00
2013-01-06 00:00:00
Notice that 2013-01-05 and 2013-01-06 fall on a weekend, and the output is not limited to the hours between 9:30 and 16:00.
I think freq='1min' completely overrides the freq='B' implied by the function name bdate_range.
I also tried using date_range. It worked for the time range from 9:30 to 16:00, but it can't exclude weekends.
Thanks!
You could do it like this:
In [28]: rng = pd.date_range('2012-01-01', '2013-01-01', freq="1min")
In [29]: rng
Out[29]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00, ..., 2013-01-01 00:00:00]
Length: 527041, Freq: T, Timezone: None
Limit to the times I want:
In [30]: x = rng[rng.indexer_between_time('9:30','16:00')]
In [31]: x
Out[31]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 09:30:00, ..., 2012-12-31 16:00:00]
Length: 143106, Freq: None, Timezone: None
Keep only the days from Monday to Friday:
In [32]: x = x[x.dayofweek<5]
In [33]: x
Out[33]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-02 09:30:00, ..., 2012-12-31 16:00:00]
Length: 102051, Freq: None, Timezone: None
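Applied to the date range from the question, the same two filtering steps can be sketched as follows. The 'min' frequency alias is used here in place of '1min'; note that this only drops weekends, so public holidays such as 2013-01-01 remain.

```python
import pandas as pd

# Every minute across the range from the question.
rng = pd.date_range('2013-01-01 09:30', '2013-01-06 16:00', freq='min')

# Keep 09:30 to 16:00 (both ends are included by default).
session = rng[rng.indexer_between_time('09:30', '16:00')]

# Keep Monday (0) through Friday (4); holidays are not removed.
session = session[session.dayofweek < 5]

print(session[0], session[-1], len(session))
```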
