Pandas, How to reference Timeseries Items? - python

I'm trying to work with some stock market data. I have the following DataFrame:
>>> ticker
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 707 entries, 2010-01-04 00:00:00 to 2012-10-19 00:00:00
Data columns:
Open 707 non-null values
High 707 non-null values
Low 707 non-null values
Close 707 non-null values
Volume 707 non-null values
Adj Close 707 non-null values
dtypes: float64(5), int64(1)
I'll reference a random closing price:
>>> ticker['Close'][704]
21.789999999999999
What's the syntax to get the date of that 704th item?
Similarly, how do I get the position in the array of the following item?
>>> ticker.Close.min()
17.670000000000002
I know this seems pretty basic, but I've spent a lot of time scouring the documentation. If it's there, I'm absolutely missing it.

This should answer both your questions:
Note: if you want the 704th element, you should use index 703, since the index starts from zero. Likewise, df['A'].argmin() below returns 1, which is the second row of the df.
In [682]: print df
A B C
2000-01-01 1.073247 -1.784255 0.137262
2000-01-02 -0.797483 0.665392 0.692429
2000-01-03 0.123751 0.532109 0.814245
2000-01-04 1.045414 -0.687119 -0.451437
2000-01-05 0.594588 0.240058 -0.813954
2000-01-06 1.104193 0.765873 0.527262
2000-01-07 -0.304374 -0.894570 -0.846679
2000-01-08 -0.443329 -1.437305 -0.316648
In [683]: df.index[3]
Out[683]: <Timestamp: 2000-01-04 00:00:00>
In [684]: df['A'].argmin()
Out[684]: 1
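For reference, here are the same lookups spelled out in current pandas, assuming ticker is the DataFrame from the question (idxmin and Index.get_loc are standard pandas API):
# Date label of the item at integer position 704:
ticker.index[704]
# Date label at which Close is smallest:
ticker['Close'].idxmin()
# Integer position of that minimum within the index:
ticker.index.get_loc(ticker['Close'].idxmin())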

Why are my forecast predictions coming out as NaN?

My problem is pretty simple, and I know I'm missing something very obvious; I just can't figure out what it is.
My test predictions for Holt-Winters are coming out as NaN and I can't figure out why. Can anyone help with this?
I'm using a Jupyter Notebook, and trying to forecast sales of one SKU using the Holt-Winters method.
Here is the code I used:
# Import the libraries needed to execute Holt-Winters
import pandas as pd
import numpy as np
%matplotlib inline
# Read the data, setting the Month column as the datetime index
df = pd.read_csv('../Data/M1045_White.csv', index_col='Month', parse_dates=True)
df.index.freq = 'MS'  # month-start frequency
df.index
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Train Test Split
train_data = df.iloc[:36] # Goes up to but not including 36
test_data = df.iloc[12:]
# Fit the Model
fitted_model = ExponentialSmoothing(train_data['Sales'], trend='mul', seasonal='mul', seasonal_periods=12).fit()
test_predictions = fitted_model.forecast(12).rename('HW M1045 White Forecast')
test_predictions
Here is the output of my predictions:
2018-05-01 NaN
2018-06-01 NaN
2018-07-01 NaN
2018-08-01 NaN
2018-09-01 NaN
2018-10-01 NaN
2018-11-01 NaN
2018-12-01 NaN
2019-01-01 NaN
2019-02-01 NaN
2019-03-01 NaN
2019-04-01 NaN
Freq: MS, Name: HW M1045 White Forecast, dtype: float64
Can someone please point out what I may have missed? This seems to be a simple problem with a simple solution, but it's kicking my butt.
Thanks!
The answer has something to do with the seasonal_periods argument being set to 12. If this is updated to 6, the predictions yield actual values. I'm not enough of a stats expert in exponential smoothing to understand why that is the case.
Reason:
Your training data contained some NaNs, so the model could neither be fit nor used to forecast.
See the non-null value count for each column; it is not the same for all of them.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
Check whether there are any missing values in the dataframe:
df.isnull().sum()
Solution:
In your case, missing value treatment is needed before training the model.
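A minimal sketch of one possible treatment, assuming the gaps are in the column being modeled (interpolate, dropna and isnull are standard pandas methods; which treatment is right depends on the data):
# Fill interior gaps by linear interpolation
df['Sales'] = df['Sales'].interpolate()
# Or drop the rows that still contain NaNs
df = df.dropna()
# Verify nothing is missing before fitting the model
df.isnull().sum()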
Thanks all. My bad: there were a few blank cells and N/A values within my dataset that caused my code to throw this error. My mistake for not doing a better job with data cleaning. I also made sure my dates were formatted correctly and that the sales data was stored as integers.

Getting rid of a hierarchical index in Pandas

I have just pivoted a dataframe to create the dataframe below:
date 2012-10-31 2012-11-30
term
red -4.043862 -0.709225
blue -18.046630 -8.137812
green -8.339924 -6.358016
The columns are supposed to be dates, and the leftmost column is supposed to contain strings.
I want to be able to run through the rows (using .apply()) and compare the values under each date column. The problem I am having is that I think the df has a hierarchical index.
Is there a way to give the whole df a new index (e.g. 1, 2, 3 etc.) and then have a flat index (but not get rid of the terms in the first column)?
EDIT: When I try to use .reset_index() I get the error ending with 'AttributeError: 'str' object has no attribute 'view''.
EDIT 2: this is what the df looks like: [screenshot omitted]
EDIT 3: here is the description of the df:
<class 'pandas.core.frame.DataFrame'>
Index: 14597 entries, 101016j to zymogens
Data columns (total 6 columns):
2012-10-31 00:00:00 14597 non-null values
2012-11-30 00:00:00 14597 non-null values
2012-12-31 00:00:00 14597 non-null values
2013-01-31 00:00:00 14597 non-null values
2013-02-28 00:00:00 14597 non-null values
2013-03-31 00:00:00 14597 non-null values
dtypes: float64(6)
Thanks in advance.
df = df.reset_index()
This will take the current index, make it a column, and then give you a fresh integer index starting from 0.
Adding example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'2012-10-31': [-4, -18, -18], '2012-11-30': [-0.7, -8, -6]}, index=['red', 'blue', 'green'])
df.index.name = 'term'  # name the index so reset_index() yields a 'term' column
df
2012-10-31 2012-11-30
red -4 -0.7
blue -18 -8.0
green -18 -6.0
df.reset_index()
term 2012-10-31 2012-11-30
0 red -4 -0.7
1 blue -18 -8.0
2 green -18 -6.0
EDIT: When I try to use .reset_index() I get the error ending with 'AttributeError: 'str' object has no attribute 'view''.
Try converting your date columns to string type columns first.
I think pandas doesn't like reset_index() here because you are trying to reset your string index into columns that consist only of dates. If you only have dates as columns, pandas will handle those columns internally as a DatetimeIndex. When calling reset_index(), pandas tries to add your string index as a further column alongside the date columns and fails somehow. Looks like a bug to me, but I'm not sure.
Example:
t = pandas.DataFrame({pandas.to_datetime('2011') : [1,2], pandas.to_datetime('2012') : [3,4]}, index=['A', 'B'])
t
2011-01-01 00:00:00 2012-01-01 00:00:00
A 1 3
B 2 4
t.columns
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00, 2012-01-01 00:00:00]
Length: 2, Freq: None, Timezone: None
t.reset_index()
...
AttributeError: 'str' object has no attribute 'view'
If you try it with string columns, it will work.
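A sketch of that workaround on the toy frame above (converting the DatetimeIndex columns to plain strings; Index.astype(str) works in recent pandas versions):
# Convert the date columns to strings, then reset_index() succeeds
t.columns = t.columns.astype(str)
t.reset_index()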

Using rolling_apply on a DataFrame object

I am trying to calculate Volume Weighted Average Price on a rolling basis.
To do this, I have a function vwap that does this for me, like so:
def vwap(bars):
    return ((bars.Close * bars.Volume).sum() / bars.Volume.sum()).round(2)
When I try to use this function with rolling_apply, as shown, I get an error:
import pandas
import pandas.io.data as web
bars = web.DataReader('AAPL', 'yahoo')
print pandas.rolling_apply(bars, 30, vwap)
AttributeError: 'numpy.ndarray' object has no attribute 'Close'
The error makes sense to me, because rolling_apply requires a Series or an ndarray as input, not a DataFrame, which is what I am passing it.
Is there a way to use rolling_apply to a DataFrame to solve my problem?
This is not directly enabled, but you can do it like this:
In [29]: bars
Out[29]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 942 entries, 2010-01-04 00:00:00 to 2013-09-30 00:00:00
Data columns (total 6 columns):
Open 942 non-null values
High 942 non-null values
Low 942 non-null values
Close 942 non-null values
Volume 942 non-null values
Adj Close 942 non-null values
dtypes: float64(5), int64(1)
window = 30
In [30]: concat([ Series(vwap(bars.iloc[i:i+window]),
                         index=[bars.index[i+window]]) for i in xrange(len(bars)-window) ])
Out[30]:
2010-02-17 203.21
2010-02-18 202.95
2010-02-19 202.64
2010-02-22 202.41
2010-02-23 202.19
2010-02-24 201.85
2010-02-25 201.65
2010-02-26 201.50
2010-03-01 201.31
2010-03-02 201.35
2010-03-03 201.42
2010-03-04 201.09
2010-03-05 200.95
2010-03-08 201.50
2010-03-09 202.02
...
2013-09-10 485.94
2013-09-11 487.38
2013-09-12 486.77
2013-09-13 487.23
2013-09-16 487.20
2013-09-17 486.09
2013-09-18 485.52
2013-09-19 485.30
2013-09-20 485.37
2013-09-23 484.87
2013-09-24 485.81
2013-09-25 486.41
2013-09-26 486.07
2013-09-27 485.30
2013-09-30 484.74
Length: 912
A cleaned up version for reference, hopefully got the indexing correct:
def myrolling_apply(df, N, f, nn=1):
    ii = [int(x) for x in np.arange(0, df.shape[0] - N + 1, nn)]
    out = [f(df.iloc[i:(i + N)]) for i in ii]
    out = pd.Series(out)
    out.index = df.index[N-1::nn]
    return out
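Hypothetical usage, with the vwap function from the question:
myrolling_apply(bars, 30, vwap)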
Modified @mathtick's answer to include na_fill. Also note that your function f needs to return a single value; it can't return a DataFrame with multiple columns.
def rolling_apply_df(dfg, N, f, nn=1, na_fill=True):
    ii = [int(x) for x in np.arange(0, dfg.shape[0] - N + 1, nn)]
    out = [f(dfg.iloc[i:(i + N)]) for i in ii]
    if na_fill:
        out = pd.Series(np.concatenate([np.repeat(np.nan, N-1), np.array(out)]))
        out.index = dfg.index[::nn]
    else:
        out = pd.Series(out)
        out.index = dfg.index[N-1::nn]
    return out
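For what it's worth, in modern pandas rolling_apply is gone, but this particular rolling VWAP can be written directly with the .rolling() window methods, with no apply at all (a sketch using the bars frame and column names from the question):
# 30-day rolling VWAP: rolling sum of price*volume over rolling sum of volume
window = 30
vwap_30 = ((bars['Close'] * bars['Volume']).rolling(window).sum()
           / bars['Volume'].rolling(window).sum()).round(2)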

Transforming financial data from postgres to pandas dataframe for use with Zipline

I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sort of tips, even if no full solution, would be much appreciated. I have tried a number of things, and have gotten quite close, but have run into indexing issues (in particular, Exception: Reindexing only valid with uniquely valued Index objects). [Pandas 0.10.0, Python 2.7]
I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:
ticker_symbol :: String, monthly_return :: Float, date :: Timestamp
e.g.
AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981
NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.
…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)
The below is a DataFrame (of timeseries? How do you say this?), in the format I need:
> data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA 2268 non-null values
AAPL 2268 non-null values
GE 2268 non-null values
IBM 2268 non-null values
JNJ 2268 non-null values
KO 2268 non-null values
MSFT 2268 non-null values
PEP 2268 non-null values
SPX 2268 non-null values
XOM 2268 non-null values
dtypes: float64(10)
The below is a TimeSeries, and is in the format I need.
> data.AAPL:
Date
1993-01-04 00:00:00+00:00 73.00
1993-01-05 00:00:00+00:00 73.12
...
2001-12-28 00:00:00+00:00 36.15
2001-12-31 00:00:00+00:00 35.55
Name: AAPL, Length: 2268
Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).
(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)
I suspect you are trying to set the date as the index too early. My suggestion would be to first set_index as date and company name, then you can unstack the company name and resample.
Something like this:
In [11]: df1
Out[11]:
ticker_symbol monthly_return date
0 AAPL 0.112 1992-02-28 00:00:00
1 GS 0.130 1981-11-30 00:00:00
2 GS -0.230 1981-12-22 00:00:00
df2 = df1.set_index(['date','ticker_symbol'])
df3 = df2.unstack(level=1)
df4 = df3.resample('M')
In [14]: df2
Out[14]:
monthly_return
date ticker_symbol
1992-02-28 AAPL 0.112
1981-11-30 GS 0.130
1981-12-22 GS -0.230
In [15]: df3
Out[15]:
monthly_return
ticker_symbol AAPL GS
date
1981-11-30 NaN 0.13
1981-12-22 NaN -0.23
1992-02-28 0.112 NaN
In [16]: df4
Out[16]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 124 entries, 1981-11-30 00:00:00 to 1992-02-29 00:00:00
Freq: M
Data columns:
(monthly_return, AAPL) 1 non-null values
(monthly_return, GS) 2 non-null values
dtypes: float64(2)
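Put together as a self-contained sketch, using the three example rows from the question (modern pandas spells the last step resample('M').mean()):
import pandas as pd
# The three example rows from the question
df1 = pd.DataFrame({
    'ticker_symbol': ['AAPL', 'GS', 'GS'],
    'monthly_return': [0.112, 0.130, -0.230],
    'date': pd.to_datetime(['1992-02-28', '1981-11-30', '1981-12-22']),
})
df2 = df1.set_index(['date', 'ticker_symbol'])  # MultiIndex rows: (date, ticker)
df3 = df2.unstack(level=1)                      # one column per ticker
df4 = df3.resample('M').mean()                  # regular month-end frequency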

Python pandas resample added dates not present in the original data

I am using pandas to convert intraday data, stored in data_m, to daily data. For some reason resample added rows for days that were not present in the intraday data. For example, 1/8/2000 is not in the intraday data, yet the daily data contains a row for that date with NaN as the value. DatetimeIndex has more entries than the actual data. Am I doing anything wrong?
data_m.resample('D', how='mean').head()
Out[13]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 NaN
data_m.resample('D', how='mean')
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4729 entries, 2000-01-04 00:00:00 to 2012-12-14 00:00:00
Freq: D
Data columns:
x 3241 non-null values
dtypes: float64(1)
What you are doing looks correct; it's just that pandas gives NaN for the mean of an empty array.
In [1]: Series().mean()
Out[1]: nan
resample converts to a regular time interval, so if there are no samples that day you get NaN.
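A minimal sketch demonstrating this (hypothetical two-point series with no data on 2000-01-02; modern .resample('D').mean() syntax):
import pandas as pd
s = pd.Series([1.0, 3.0], index=pd.to_datetime(['2000-01-01', '2000-01-03']))
s.resample('D').mean()
# 2000-01-01    1.0
# 2000-01-02    NaN  (no samples that day, so the mean of an empty set)
# 2000-01-03    3.0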
Most of the time having NaN isn't a problem. If it is, we can either use fill_method (for example 'ffill'), or if you really want to remove them you can use dropna (not recommended):
data_m.resample('D', how='mean', fill_method='ffill')
data_m.resample('D', how='mean').dropna()
Update: The modern equivalent seems to be:
In [21]: s.resample("D").mean().ffill()
Out[21]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 8780.037433
In [22]: s.resample("D").mean().dropna()
Out[22]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
See resample docs.
Prior to 0.10.0, pandas labeled resample bins with the right-most edge, which, for daily resampling, is the next day. Starting with 0.10.0, the default binning behavior for daily and higher frequencies changed to label='left', closed='left' to minimize this confusion. See http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#api-changes for more information.
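For illustration, those defaults written out explicitly (a sketch; for daily and higher frequencies these keyword values are already the defaults in 0.10.0+):
# Each bin is labeled by, and includes, its left (earlier) edge
s.resample('D', label='left', closed='left').mean()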
