Using rolling_apply on a DataFrame object - python

I am trying to calculate Volume Weighted Average Price on a rolling basis.
To do this, I have a function vwap that does this for me, like so:
def vwap(bars):
    return ((bars.Close * bars.Volume).sum() / bars.Volume.sum()).round(2)
When I try to use this function with rolling_apply, as shown, I get an error:
import pandas.io.data as web
bars = web.DataReader('AAPL','yahoo')
print pandas.rolling_apply(bars,30,vwap)
AttributeError: 'numpy.ndarray' object has no attribute 'Close'
The error makes sense to me, because rolling_apply passes each window to the function as an ndarray (not a DataFrame), so the Close and Volume columns aren't available as attributes the way I am accessing them.
Is there a way to use rolling_apply to a DataFrame to solve my problem?

This is not directly enabled, but you can do it like this:
In [29]: bars
Out[29]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 942 entries, 2010-01-04 00:00:00 to 2013-09-30 00:00:00
Data columns (total 6 columns):
Open 942 non-null values
High 942 non-null values
Low 942 non-null values
Close 942 non-null values
Volume 942 non-null values
Adj Close 942 non-null values
dtypes: float64(5), int64(1)
window=30
In [30]: concat([ (Series(vwap(bars.iloc[i:i+window]),
index=[bars.index[i+window]])) for i in xrange(len(df)-window) ])
Out[30]:
2010-02-17 203.21
2010-02-18 202.95
2010-02-19 202.64
2010-02-22 202.41
2010-02-23 202.19
2010-02-24 201.85
2010-02-25 201.65
2010-02-26 201.50
2010-03-01 201.31
2010-03-02 201.35
2010-03-03 201.42
2010-03-04 201.09
2010-03-05 200.95
2010-03-08 201.50
2010-03-09 202.02
...
2013-09-10 485.94
2013-09-11 487.38
2013-09-12 486.77
2013-09-13 487.23
2013-09-16 487.20
2013-09-17 486.09
2013-09-18 485.52
2013-09-19 485.30
2013-09-20 485.37
2013-09-23 484.87
2013-09-24 485.81
2013-09-25 486.41
2013-09-26 486.07
2013-09-27 485.30
2013-09-30 484.74
Length: 912

A cleaned-up version for reference; hopefully I got the indexing correct:
def myrolling_apply(df, N, f, nn=1):
    ii = [int(x) for x in np.arange(0, df.shape[0] - N + 1, nn)]
    out = [f(df.iloc[i:(i + N)]) for i in ii]
    out = pd.Series(out)
    out.index = df.index[N-1::nn]
    return out

Modified @mathtick's answer to include na_fill. Also note that your function f needs to return a single value; it can't return a DataFrame with multiple columns.
def rolling_apply_df(dfg, N, f, nn=1, na_fill=True):
    ii = [int(x) for x in np.arange(0, dfg.shape[0] - N + 1, nn)]
    out = [f(dfg.iloc[i:(i + N)]) for i in ii]
    if na_fill:
        out = pd.Series(np.concatenate([np.repeat(np.nan, N-1), np.array(out)]))
        out.index = dfg.index[::nn]
    else:
        out = pd.Series(out)
        out.index = dfg.index[N-1::nn]
    return out
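For this particular VWAP case there is also a way to avoid the per-window Python loop entirely: a rolling VWAP is just a ratio of two rolling sums, which works column-wise on any pandas version (rolling_apply was later replaced by the .rolling() API). A minimal sketch on toy data; the Close/Volume column names match the question, the values are made up:

```python
import numpy as np
import pandas as pd

# Toy bars standing in for the downloaded data
idx = pd.date_range('2010-01-04', periods=10, freq='B')
bars = pd.DataFrame({'Close': np.arange(10.0, 20.0),
                     'Volume': np.arange(100.0, 110.0)}, index=idx)

window = 3
# Rolling VWAP as a ratio of rolling sums - no per-window function call needed
rolling_vwap = ((bars['Close'] * bars['Volume']).rolling(window).sum()
                / bars['Volume'].rolling(window).sum()).round(2)
```

The first window-1 entries come out as NaN, matching the na_fill behaviour above.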

Related

date calculation (TypeError: unsupported operand type(s) for -: 'str' and 'str')

I have a data set as below:
date_time srch_co srch_ci
0 2014-11-03 16:02:28 2014-12-19 2014-12-15
1 2013-03-13 19:25:01 2013-03-14 2013-03-13
2 2014-10-13 13:20:25 2015-04-10 2015-04-03
3 2013-11-05 10:40:34 2013-11-08 2013-11-07
4 2014-06-10 13:34:56 2014-08-08 2014-08-03
5 2014-12-16 14:34:39 2014-12-17 2014-12-16
And this is the information of the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
date_time 100000 non-null datetime64[ns]
srch_co 99878 non-null object
srch_ci 99878 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.3+ MB
What I would like to do is create 2 new columns using the following function:
def duration(row):
    delta = (row['srch_co'] - row['srch_ci'])/np.timedelta64(1, 'D')
    if delta <= 0:
        return np.nan
    else:
        return delta

sample['duration'] = sample.apply(duration, axis=1)

def days_in_advance(row):
    delta = (row['srch_ci'] - row['date_time'])/np.timedelta64(1, 'D')
    if delta < 0:
        return np.nan
    else:
        return delta

sample['days_in_advance'] = sample.apply(days_in_advance, axis=1)
However, the date calculation keeps hitting errors. I've searched for and tried several solutions, but they either raise errors of their own or turn the dates into inaccurate values.
The methods I've tried to use are such as:
#1)
def to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day
#2)
datetime.strptime(str(row[2]), '%Y%m%d%H%M%S')
#3)
pd.to_numeric(sample['date_time'], errors='coerce')
#4)
sample['srch_ci_int'] = sample['srch_ci'].astype(str).astype(int)
I just want to create new column that would calculate the difference of each columns:
sample["duration"] = sample["srch_co"] - sample["srch_ci"]
sample["days_in_advance"] = sample["srch_co"] - sample["date_time"]
Any tips appreciated.
You need to convert the columns srch_co and srch_ci with to_datetime first, and then use mask to replace values less than 0 with NaN (the default replacement value of mask):
sample["srch_co"] = pd.to_datetime(sample["srch_co"])
sample["srch_ci"] = pd.to_datetime(sample["srch_ci"])
sample["duration"] = (sample["srch_co"] - sample["srch_ci"])/np.timedelta64(1, 'D')
sample["days_in_advance"] = (sample["srch_co"] - sample["date_time"])/np.timedelta64(1, 'D')
cols = ['duration','days_in_advance']
sample[cols] = sample[cols].mask(sample[cols] < 0)
# the first srch_ci value was changed so the output demonstrates a NaN
print (sample)
date_time srch_co srch_ci duration days_in_advance
0 2014-11-03 16:02:28 2014-12-19 2015-12-15 NaN 45.331620
1 2013-03-13 19:25:01 2013-03-14 2013-03-13 1.0 0.190961
2 2014-10-13 13:20:25 2015-04-10 2015-04-03 7.0 178.444155
3 2013-11-05 10:40:34 2013-11-08 2013-11-07 1.0 2.555162
4 2014-06-10 13:34:56 2014-08-08 2014-08-03 5.0 58.434074
5 2014-12-16 14:34:39 2014-12-17 2014-12-16 1.0 0.392604
Seems like you're subtracting a string from a string. Make sure to convert the column to type 'date' using pd.to_datetime, and then you'll be able to subtract one day from another.
Another recommendation is to avoid for loops and to use vectorized operations, such as pd.DataFrame.subtract(series, axis=0), as that is one of the biggest advantages of using pandas over any simple list.
After you've calculated the difference, you can turn the negatives into NaN with
dataframe.loc[dataframe['duration'] < 0, 'duration'] = np.nan
(using .loc keeps the other columns intact; dataframe[dataframe['duration'] < 0] = np.nan would blank the entire matching rows).
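A quick check of the difference between masking the whole frame and masking just the one column, on toy data (column names are made up for the demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'duration': [1.0, -2.0, 3.0],
                   'other':    [10.0, 20.0, 30.0]})

# Boolean-indexing the whole frame would blank every column of row 1;
# targeting the column with .loc only touches 'duration'
df.loc[df['duration'] < 0, 'duration'] = np.nan
```

Row 1 keeps its value in 'other', while only its 'duration' becomes NaN.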

Using set_index within a custom function

I would like to convert the date observations from a column into the index for my dataframe. I am able to do this with the code below:
Sample data:
test = pd.DataFrame({'Values':[1,2,3], 'Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
Indexing code:
test['Date Index'] = pd.to_datetime(test['Date'])
test = test.set_index('Date Index')
test['Index'] = test.index.date
However, when I include this code in a function, the 'Date Index' column is created but set_index does not seem to take effect.
def date_index(df):
    df['Date Index'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date Index')
    df['Index'] = df.index.date
If I inspect the output without using the function, info() returns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
If I inspect the output of the function, info() returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
Date 3 non-null object
Values 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 120.0+ bytes
I would like the DatetimeIndex.
How can set_index be used within a function? Am I using it incorrectly?
IIUC return df is missing:
df1 = pd.DataFrame({'Values':[1,2,3], 'Exam Completed Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
def date_index(df):
    df['Exam Completed Date Index'] = pd.to_datetime(df['Exam Completed Date'])
    df = df.set_index('Exam Completed Date Index')
    df['Index'] = df.index.date
    return df
print (date_index(df1))
Exam Completed Date Values Index
Exam Completed Date Index
2016-01-01 17:49:00 1/1/2016 17:49 1 2016-01-01
2016-01-02 07:10:00 1/2/2016 7:10 2 2016-01-02
2016-01-03 15:19:00 1/3/2016 15:19 3 2016-01-03
print (date_index(df1).info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Exam Completed Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
None
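The underlying reason the return matters: set_index returns a new DataFrame by default, and rebinding the function's local name never reaches the caller's object. A minimal illustration with toy column names:

```python
import pandas as pd

def date_index_no_return(df):
    # set_index returns a NEW frame; rebinding the local name 'df'
    # does not change the caller's object
    df = df.set_index('Date Index')

def date_index_with_return(df):
    df = df.set_index('Date Index')
    return df

test = pd.DataFrame({'Values': [1, 2],
                     'Date Index': pd.to_datetime(['2016-01-01', '2016-01-02'])})

date_index_no_return(test)              # caller's frame keeps its RangeIndex
indexed = date_index_with_return(test)  # this result has the DatetimeIndex
```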

Reindexing pandas dataframe for stacking with new unique index

I have several dataframes which look like the following:
In [2]: skew
Out[2]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 96 entries, 2006-01-31 00:00:00 to 2013-12-31 00:00:00
Freq: BM
Data columns (total 3 columns):
AAPL 96 non-null values
GOOG 96 non-null values
MSFT 96 non-null values
dtypes: float64(3)
In [3]: skew.head()
Out[3]:
AAPL GOOG MSFT
2006-01-31 0.531769 -0.567731 2.132850
2006-02-28 -0.389711 0.028723 0.724277
2006-03-31 1.184884 1.009587 -0.959136
2006-04-28 1.664745 0.852869 -4.020731
2006-05-31 -0.419757 -0.288422 0.240444
In [5]: skew.index
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-01-31 00:00:00, ..., 2013-12-31 00:00:00]
Length: 96, Freq: BM, Timezone: None
I want to generate a single column from them with a unique index so that I can merge it with the columns from the other dataframes at a later point. The result would look somewhat like this, but with a unique index:
frame
Out[6]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 288 entries, 2006-01-31 00:00:00 to 2013-12-31 00:00:00
Data columns (total 3 columns):
Returns 285 non-null values
Skew 288 non-null values
WinLose 288 non-null values
dtypes: bool(1), float64(2)
In [7]: frame.head()
Out[7]:
Returns Skew WinLose
2006-01-31 NaN 0.531769 True
2006-02-28 -0.092968 -0.389711 False
2006-03-31 -0.084246 1.184884 True
2006-04-28 0.122290 1.664745 False
2006-05-31 -0.150874 -0.419757 False
i.e, something like:
In [7]: frame.head()
Out[7]:
Returns Skew WinLose
2006-01-31-AAPL NaN 0.531769 True
2006-02-28-MSFT -0.092968 -0.389711 False
2006-03-31-AAPL -0.084246 1.184884 True
2006-04-28-GOOG 0.122290 1.664745 False
2006-05-31-AAPL -0.150874 -0.419757 False
The code is:
import pandas as pd
import pandas.io.data as web

# Class parameters
names = ['AAPL', 'GOOG', 'MSFT']

# Functions
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Close']

def getWinnerLoser(stock, medRet, retsM):
    return retsM[stock].shift(-1) >= medRet.shift(-1)

def getSkew(stock, rets, period):
    return pd.rolling_skew(rets[stock], period).asfreq('BM').fillna(method='pad')
px = pd.DataFrame(data={n: get_px(n,'1/1/2006','1/1/2014') for n in names})
px = px.asfreq('B').fillna(method = 'pad')
rets = px.pct_change()
# Monthly returns and median return
retsM = px.asfreq('BM').fillna(method = 'pad').pct_change()
medRet = retsM.median(axis = 1)
# Dataframes
winLose = pd.DataFrame(data = {n: getWinnerLoser(n,medRet,retsM) for n in names})
skew = pd.DataFrame(data = {n: getSkew(n,rets,20) for n in names})
# Concatenating
retsMCon = pd.concat(retsM[n] for n in names)
winLoseCon = pd.concat(winLose[n] for n in names)
skewCon = pd.concat(skew[n] for n in names)
frame = pd.DataFrame({'Returns':retsMCon, 'Skew':skewCon, 'WinLose':winLoseCon})
I have yet to find a good solution to this
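One possible approach (not from the thread, just a sketch on toy data): instead of concatenating columns onto a duplicated date index, stack() moves the tickers into a second index level, so every (date, ticker) pair is unique and the pieces align when the frame is built:

```python
import pandas as pd

# Toy stand-ins for the skew and retsM frames (two tickers, two month-ends)
idx = pd.to_datetime(['2006-01-31', '2006-02-28'])
skew = pd.DataFrame({'AAPL': [0.53, -0.39], 'GOOG': [-0.57, 0.03]}, index=idx)
rets = pd.DataFrame({'AAPL': [0.01, -0.09], 'GOOG': [0.02, 0.04]}, index=idx)

# stack() yields Series with a unique (date, ticker) MultiIndex
frame = pd.DataFrame({'Returns': rets.stack(), 'Skew': skew.stack()})
```

The MultiIndex can then be flattened to strings like "2006-01-31-AAPL" if a single-level index is really needed, but merges work directly on the MultiIndex.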

Transforming financial data from postgres to pandas dataframe for use with Zipline

I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sorts of tips, even if no full solution, would be much appreciated. I have tried a number of things, and have gotten quite close, but run into indexing issues, Exception: Reindexing only valid with uniquely valued Index objects, in particular. [Pandas 0.10.0, Python 2.7]
I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:
ticker_symbol :: String, monthly_return :: Float, date :: Timestamp
e.g.
AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981
NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.
…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)
The below is a DataFrame (of timeseries? How do you say this?), in the format I need:
> data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA 2268 non-null values
AAPL 2268 non-null values
GE 2268 non-null values
IBM 2268 non-null values
JNJ 2268 non-null values
KO 2268 non-null values
MSFT 2268 non-null values
PEP 2268 non-null values
SPX 2268 non-null values
XOM 2268 non-null values
dtypes: float64(10)
The below is a TimeSeries, and is in the format I need.
> data.AAPL:
Date
1993-01-04 00:00:00+00:00 73.00
1993-01-05 00:00:00+00:00 73.12
...
2001-12-28 00:00:00+00:00 36.15
2001-12-31 00:00:00+00:00 35.55
Name: AAPL, Length: 2268
Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).
(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)
I suspect you are trying to set the date as the index too early. My suggestion would be to first set_index as date and company name, then you can unstack the company name and resample.
Something like this:
In [11]: df1
Out[11]:
ticker_symbol monthly_return date
0 AAPL 0.112 1992-02-28 00:00:00
1 GS 0.130 1981-11-30 00:00:00
2 GS -0.230 1981-12-22 00:00:00
df2 = df1.set_index(['date','ticker_symbol'])
df3 = df2.unstack(level=1)
df4 = df3.resample('M')
In [14]: df2
Out[14]:
monthly_return
date ticker_symbol
1992-02-28 AAPL 0.112
1981-11-30 GS 0.130
1981-12-22 GS -0.230
In [15]: df3
Out[15]:
monthly_return
ticker_symbol AAPL GS
date
1981-11-30 NaN 0.13
1981-12-22 NaN -0.23
1992-02-28 0.112 NaN
In [16]: df4
Out[16]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 124 entries, 1981-11-30 00:00:00 to 1992-02-29 00:00:00
Freq: M
Data columns:
(monthly_return, AAPL) 1 non-null values
(monthly_return, GS) 2 non-null values
dtypes: float64(2)
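The set_index + unstack step can also be written in one call with pivot; a sketch on the same toy rows (resampling to the monthly frequency would follow as in the answer):

```python
import pandas as pd

# Long-form rows as they come out of postgres (toy data)
df1 = pd.DataFrame({'ticker_symbol': ['AAPL', 'GS', 'GS'],
                    'monthly_return': [0.112, 0.130, -0.230],
                    'date': pd.to_datetime(['1992-02-28', '1981-11-30', '1981-12-22'])})

# pivot does the set_index + unstack in one step:
# dates down the index, one column per ticker, NaN where a ticker has no row
wide = df1.pivot(index='date', columns='ticker_symbol', values='monthly_return')
```

Note that pivot raises if a (date, ticker) pair is duplicated, which doubles as a sanity check on the source data.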

Pandas, How to reference Timeseries Items?

I'm trying to work with some stock market data. I have the following DataFrame:
>>> ticker
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 707 entries, 2010-01-04 00:00:00 to 2012-10-19 00:00:00
Data columns:
Open 707 non-null values
High 707 non-null values
Low 707 non-null values
Close 707 non-null values
Volume 707 non-null values
Adj Close 707 non-null values
dtypes: float64(5), int64(1)
I'll reference a random closing price:
>>> ticker ['Close'] [704]
21.789999999999999
What's the syntax to get the date of that 704th item?
Similarly, how do I get the position in the array of the following item?:
>>> ticker.Close.min ()
17.670000000000002
I know this seems pretty basic, but I've spent a lot of time scouring the documentation. If it's there, I'm absolutely missing it.
This should answer both your questions:
Note: if you want the 704th element, you should use 703, since indexing starts from zero. As you can see, df['A'].argmin() also returns 1, i.e. the second row of df.
In [682]: print df
A B C
2000-01-01 1.073247 -1.784255 0.137262
2000-01-02 -0.797483 0.665392 0.692429
2000-01-03 0.123751 0.532109 0.814245
2000-01-04 1.045414 -0.687119 -0.451437
2000-01-05 0.594588 0.240058 -0.813954
2000-01-06 1.104193 0.765873 0.527262
2000-01-07 -0.304374 -0.894570 -0.846679
2000-01-08 -0.443329 -1.437305 -0.316648
In [683]: df.index[3]
Out[683]: <Timestamp: 2000-01-04 00:00:00>
In [684]: df['A'].argmin()
Out[684]: 1
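In later pandas versions, idxmin (the label-returning sibling of argmin) plus Index.get_loc cover both directions; a toy sketch with made-up closing prices:

```python
import pandas as pd

idx = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03'])
close = pd.Series([17.90, 17.67, 21.79], index=idx, name='Close')

date_at_pos = close.index[2]                    # label at integer position 2
date_of_min = close.idxmin()                    # label where the minimum sits
pos_of_min = close.index.get_loc(date_of_min)   # and back to an integer position
```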
