I want to aggregate daily data to weekly (7-day sums), but with the last date as the 'origin'. Is it possible to do a group-by from the end date using pd.Grouper? This is what the data looks like:
This code:
df.groupby(pd.Grouper(key='date', freq='7d'))['value'].sum()
results in
2020-01-01 5
2020-01-08 12
2020-01-15 4
but I was hoping for this:
2020-01-01 0
2020-01-03 7
2020-01-10 14
The approach you have used can be shortened with pandas' resample method on df, but I think the problem is the order your dates are in; the result you expect is more of a day-wise output. What I would recommend is splitting the df and then merging the pieces back together:
df.set_index('date', inplace=True)
df_below = df[3:].resample('W').sum()
df_up = df.iloc[0:3, :].sum()
# or you can give dates instead of 0:3 in iloc
For the rows [0, 1, 2] you take the sum, and then using concat or merge you can put the two pieces back into one DataFrame, as sketched below.
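Putting those pieces together, a minimal self-contained sketch (assuming df has a 'date' column and a single 'value' column, and that the first three rows form the short leading period):
import pandas as pd

# assumption: df has columns 'date' (datetime64) and 'value'
df = df.sort_values('date').set_index('date')
# sum of the first (partial) period, labelled with its first date
df_up = df.iloc[0:3].sum().to_frame().T
df_up.index = [df.index[0]]
# weekly sums for the remaining rows
df_below = df.iloc[3:].resample('7D').sum()
# stitch the two pieces back together into one result
weekly = pd.concat([df_up, df_below])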
Feel free to ask further questions.
Let's say I have an idx = pd.DatetimeIndex with one-minute frequency. I also have a list of bad dates (each of type pd.Timestamp without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. You can further ensure your list of dates doesn't have a time part with the same idea, [x.normalize() for x in bad_dates], if you need to be extra safe.
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
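The same mask also works directly on a bare DatetimeIndex, if you want to filter idx itself rather than a DataFrame; a small sketch using the names from the question (the date_range call is just a stand-in for your one-minute index):
idx = pd.date_range('2010-01-01', freq='min', periods=9)  # stand-in for the one-minute index
clean_idx = idx[~idx.normalize().isin(bad_dates)]  # keep timestamps whose date is not in bad_dates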
Here is some simple code with date_range and indexing [ ] that I used with Pandas
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns}, index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are shown by date_range, I suppose the date format of date_range does not match the date format in the pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't realize that this question refers to the Quantopian library, not to pandas.
I got a solution on their forum. All the times returned by methods on Quantopian are timezone-aware with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates, so simply supply the timezone information to the date_range method, like this:
pd.DataFrame({
    'close': aapl_close,
    'returns': aapl_returns},
    index=pd.date_range(start=period_start, periods=6, tz='UTC'))
To get a specific date or time range in pandas, perhaps the easiest way is simple bracket notation. For example, to get the dates between 2013-01-04 and 2013-01-08 (inclusive), simply enter this:
df = pd.DataFrame({'close': aapl_close, 'returns': aapl_returns})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but the loc method has a bit more flexibility. This notation also works with datetimes if desired.
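For instance, a minimal sketch of the same slice done with Timestamp objects instead of strings (assuming df carries the tz-aware DatetimeIndex from the Quantopian data above):
start = pd.Timestamp('2013-01-04', tz='UTC')
end = pd.Timestamp('2013-01-08', tz='UTC')
my_selected_dates = df.loc[start:end]  # same inclusive slice, with explicit datetimes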
I am new to Pandas. I have the following data (stock prices)
id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
0,2015-01-02,9:00,21.02
1,2015-01-02,9:00,19.01
2,2015-01-02,9:00,50.21
0,2015-01-02,11:00,20.61
1,2015-01-02,11:00,18.70
2,2015-01-02,11:00,51.21
...
...
I want to sort by date and calculate returns for each id, across dates and across times within a date. I tried this:
import pandas as pd
import numpy as np
df = pd.read_csv("/path/to/csv", index_col=[0,2,1])
df['returns'] = df['price'].pct_change()
However, the returns are calculated across the ids in the order they appear. Any idea how to do this correctly? I would also like to access the data as
price_0 = df['id'==0]['date'=='2014-01-01'][time=='9:00']['price']
Assuming that those are the columns in your dataframe (and none are the index), you want to group by date, time, and id on price. You then unstack the id, which effectively creates a pivot table with dates and times as the rows and ids as the columns. Finally, use pct_change to achieve your objective.
returns = df.groupby(['date', 'time', 'id']).price.first().unstack().pct_change()
>>> returns
id 0 1 2
date time
1/1/15 11:00 NaN NaN NaN
9:00 -0.004583 -0.053619 0.038285
1/2/15 11:00 -0.051105 0.059490 -0.055863
9:00 0.019893 0.016578 -0.019527
It will probably be better, however, to combine the dates and times into timestamps. Assuming your dates and times are text representations, the following should work:
df['timestamp'] = df.apply(lambda row: pd.Timestamp(row.date + ' ' + row.time), axis=1)
Then, just group on the timestamp and id, and unstack the id.
returns = df.groupby(['timestamp', 'id']).price.first().unstack('id').pct_change()
>>> returns
id 0 1 2
timestamp
2015-01-01 09:00:00 NaN NaN NaN
2015-01-01 11:00:00 0.004604 0.056657 -0.036873
2015-01-02 09:00:00 -0.036664 0.019303 -0.038859
You would index the returns for a given security as follows:
>>> returns.loc['2015-01-02 9:00'].loc[1]
0.0193029490616623
I have the following subset with a starting date (DD/MM/YYYY) and Amount
Start Date Amount
1 01/01/2013 20
2 02/05/2007 10
3 01/05/2004 15
4 01/06/2014 20
5 17/08/2008 21
I'd like to create a subset of this dataframe containing only the rows where the Start Date day is 01:
Start Date Amount
1 01/01/2013 20
3 01/05/2004 15
4 01/06/2014 20
I've tried to loop through the table and use the index, but couldn't find a suitable way to iterate through the dataframe's rows.
Assuming your dates are already datetime, then the following should work; if they are strings you can convert them using to_datetime, so df['Start Date'] = pd.to_datetime(df['Start Date']), and you may also need to pass dayfirst=True if required. If you imported the data using read_csv, you could have done this at the point of import, so df = pd.read_csv('data.csv', parse_dates=[n], dayfirst=True), where n is the column index (0-based, of course); so if it was the first column, pass parse_dates=[0].
One method is to apply a lambda to the column and use the boolean index this returns to index against:
In [19]:
df[df['Start Date'].apply(lambda x: x.day == 1)]
Out[19]:
Start Date Amount
index
1 2013-01-01 20
3 2004-05-01 15
4 2014-06-01 20
Not sure if there is a built-in method that doesn't involve setting this column to be the index, which would convert it into a timeseries index.
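If the column is already datetime, the dt accessor gives an equivalent vectorised mask without touching the index; a small sketch assuming the same 'Start Date' column:
df[df['Start Date'].dt.day == 1]  # boolean mask built from the day component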