My problem is pretty simple, and I know I'm missing something very obvious; I just can't figure out what it is.
My test predictions for Holt-Winters are coming out as NaN and I can't figure out why. Can anyone help with this?
I'm using a Jupyter Notebook and trying to forecast sales of one SKU using the Holt-Winters method.
Here is the code I used:
# Import the libraries needed to execute Holt-Winters
import pandas as pd
import numpy as np
%matplotlib inline
# Read the data, using the Month column as a parsed datetime index
df = pd.read_csv('../Data/M1045_White.csv',index_col='Month',parse_dates=True)
# Set the frequency of the index to month-start
df.index.freq = 'MS'
df.index
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Train Test Split: first 36 months for training, last 12 for testing
train_data = df.iloc[:36]  # goes up to but not including row 36
test_data = df.iloc[36:]   # the remaining 12 months
# Fit the Model
fitted_model = ExponentialSmoothing(train_data['Sales'],trend='mul',seasonal='mul',seasonal_periods=12).fit()
test_predictions = fitted_model.forecast(12).rename('HW M1045 White Forecast')
test_predictions
test_predictions
Here is the output of my predictions:
2018-05-01 NaN
2018-06-01 NaN
2018-07-01 NaN
2018-08-01 NaN
2018-09-01 NaN
2018-10-01 NaN
2018-11-01 NaN
2018-12-01 NaN
2019-01-01 NaN
2019-02-01 NaN
2019-03-01 NaN
2019-04-01 NaN
Freq: MS, Name: HW M1045 White Forecast, dtype: float64
Can someone please point out what I may have missed? This seems to be a simple problem with a simple solution, but it's kicking my butt.
Thanks!
The answer has something to do with the seasonal_periods parameter being set to 12. If it is changed to 6, the predictions yield actual values. I'm not enough of a stats expert in exponential smoothing to understand why this is the case.
Reason:
Your training data contained some NaNs, so the model could neither fit nor forecast.
Look at the non-null count for each column; they are not all the same.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
Check whether there are any missing values in the DataFrame:
df.isnull().sum()
Solution:
In your case, missing value treatment is needed before training the model.
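For example, a minimal sketch of that treatment (assuming the NaNs live in the Sales column of the same df; interpolating is one option, dropping the rows is another):
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Hypothetical cleanup: fill the gaps before fitting
# (multiplicative trend/seasonality also requires strictly positive values)
sales = df['Sales'].astype(float).interpolate()  # or df['Sales'].dropna()
model = ExponentialSmoothing(sales.iloc[:36], trend='mul', seasonal='mul',
                             seasonal_periods=12).fit()
preds = model.forecast(12)  # should now be numbers rather than NaN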
Thanks all. It turned out there were a few blank cells and N/A values in my dataset that caused this error. My mistake for not doing a better job of data cleaning. I also made sure my dates were formatted correctly and that the sales data was stored as integers.
Related
I have pivoted my table using Python and verified that all the columns are visible. But when I view the info, the Date column does not appear, and when I create a graph I need to use the date as the x value; Python raises KeyError: 'Date'.
Below is the query:
df2 = pd.pivot_table(df, index='Date', values='Amount', columns='Type', aggfunc='sum')
Output :
Type Customer Credit Note Payment Sales Invoice Balance \
Date
2022-01-31 927.85 685435.45 1108054.27 421690.97
2022-02-28 0.00 666665.71 1158489.98 491824.27
2022-03-31 31174.00 726719.20 908525.44 150632.24
2022-04-30 0.00 0.00 967592.69 967592.69
Type cumsum_reverse OS for the month limit vs purchases ratio \
Date
2022-01-31 1610049.20 2474027.18 0.271311
2022-02-28 1118224.93 2965851.45 0.283660
2022-03-31 967592.69 3116483.69 0.222456
2022-04-30 0.00 4084076.38 0.236918
Type OS vs Payment ratio OS vs limit ratio
Date
2022-01-31 0.277053 0.618507
2022-02-28 0.224781 0.741463
2022-03-31 0.233186 0.779121
2022-04-30 0.000000 1.021019
When I run df2.info():
Output:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4 entries, 2022-01-31 to 2022-04-30
Data columns (total 9 columns):
Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer Credit Note 4 non-null float64
1 Payment 4 non-null float64
2 Sales Invoice 4 non-null float64
3 Balance 4 non-null float64
4 cumsum_reverse 4 non-null float64
5 OS for the month 4 non-null float64
6 limit vs purchases ratio 4 non-null float64
7 OS vs Payment ratio 4 non-null float64
8 OS vs limit ratio 4 non-null float64
dtypes: float64(9)
memory usage: 320.0 bytes
As you can see, the Date column is missing from the info table; it now appears as the DatetimeIndex. I also need to create a forecasting chart based on these columns (Date, OS vs limit ratio), but when I run the query it raises KeyError: 'Date'.
Can anyone help me sort out this issue?
You can pass an object instead of a string as the index or columns parameter of pivot_table:
df2 = pd.pivot_table(df, index=df.index, values='Amount', columns='Type', aggfunc='sum')
# HERE ---^
You set the Date column as the index when you did the pivoting. If you need Date as a regular column, you can reset the index:
df2 = df2.reset_index()
This removes Date from the index and sets it as a separate column.
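For instance, a minimal sketch (hypothetical column choice) of using the restored Date column as the x value of a plot:
import matplotlib.pyplot as plt
# After reset_index(), 'Date' is an ordinary column and can be the x value
df2.plot(x='Date', y='OS vs limit ratio')
plt.show()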
I am building a straightforward regression model with statsmodels using a formula, but I am getting an error that I do not understand.
For a reproducible example my dataframe is Prices:
Prices.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 25 entries, 2012-06-30 to 2018-06-30
Freq: Q-DEC
Data columns (total 4 columns):
FB 25 non-null float64
GOOG 25 non-null float64
GDP_growth 25 non-null float64
GDP_growth_shifted 24 non-null float64
dtypes: float64(4)
memory usage: 1000.0 bytes
Prices.to_json()
'{"FB":{"1341014400000":35.26,"1348963200000":31.9133333333,"1356912000000":24.76,"1364688000000":25.43,"1372550400000":27.6966666667,"1380499200000":29.8733333333,"1388448000000":42.2733333333,"1396224000000":54.78,"1404086400000":61.9233333333,"1412035200000":64.54,"1419984000000":72.7233333333,"1427760000000":77.42,"1435622400000":80.87,"1443571200000":83.9333333333,"1451520000000":90.2466666667,"1459382400000":100.8266666667,"1467244800000":106.8833333333,"1475193600000":116.6166666667,"1483142400000":122.8233333333,"1490918400000":125.43,"1498780800000":135.3733333333,"1506729600000":152.3233333333,"1514678400000":166.5233333333,"1522454400000":174.9033333333,"1530316800000":172.17},"GOOG":{"1341014400000":306.6033333333,"1348963200000":312.19,"1356912000000":338.7,"1364688000000":363.36,"1372550400000":396.7433333333,"1380499200000":428.11,"1388448000000":459.26,"1396224000000":520.9766666667,"1404086400000":568.72,"1412035200000":556.0,"1419984000000":565.0566666667,"1427760000000":535.56,"1435622400000":536.8866666667,"1443571200000":560.1466666667,"1451520000000":627.9,"1459382400000":709.6133333333,"1467244800000":738.7766666667,"1475193600000":743.08,"1483142400000":771.7733333333,"1490918400000":801.2033333333,"1498780800000":843.9766666667,"1506729600000":915.2533333333,"1514678400000":964.72,"1522454400000":1036.6966666667,"1530316800000":1083.77},"GDP_growth":{"1341014400000":1.7,"1348963200000":0.5,"1356912000000":0.5,"1364688000000":3.6,"1372550400000":0.5,"1380499200000":3.2,"1388448000000":3.2,"1396224000000":-1.0,"1404086400000":5.1,"1412035200000":4.9,"1419984000000":1.9,"1427760000000":3.3,"1435622400000":3.3,"1443571200000":1.0,"1451520000000":0.4,"1459382400000":1.5,"1467244800000":2.3,"1475193600000":1.9,"1483142400000":1.8,"1490918400000":1.8,"1498780800000":3.0,"1506729600000":2.8,"1514678400000":2.3,"1522454400000":2.2,"1530316800000":4.1},"GDP_growth_shifted":{"1341014400000":0.5,"1348963200000":0.5,"1356912000000":3.6,"1364688000000":0.5,"1372550400000":3.2,"1380499200000":3.2,"1388448000000":-1.0,"1396224000000":5.1,"1404086400000":4.9,"1412035200000":1.9,"1419984000000":3.3,"1427760000000":3.3,"1435622400000":1.0,"1443571200000":0.4,"1451520000000":1.5,"1459382400000":2.3,"1467244800000":1.9,"1475193600000":1.8,"1483142400000":1.8,"1490918400000":3.0,"1498780800000":2.8,"1506729600000":2.3,"1514678400000":2.2,"1522454400000":4.1,"1530316800000":null}}'
My code is:
import statsmodels.formula.api as sm
result = sm.ols(formula="FB ~ GOOG + GDP_growth", data=Prices.tail(-1)).fit()
PatsyError: Error evaluating factor: NameError: name 'GDP_growth' is not defined
FB ~ GOOG + GDP_growth
^^^^^^^^^^
I'm wondering how I might approach the problem of inconsistent data formats with pandas. Initially I used a regular expression to extract a date from a large data set of URLs. That worked great; however, there is an inconsistent date format among the extracted dates:
dates
20140609
20140624
20140404
3/18/14
3/10/14
3/14/2014
20140807
20140806
2014-07-18
As you can see, the date formatting in this dataset is inconsistent. Is there a way to fix it so that all the dates share the same format?
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 122270 entries, 0 to 122269
Data columns (total 4 columns):
id 119534 non-null float64
x1 122270 non-null int64
url 122270 non-null object
date 122025 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 4.7+ MB
Use to_datetime; it seems robust enough to handle your inconsistent formatting:
In [77]:
df['dates'] = pd.to_datetime(df['dates'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 1 columns):
dates 9 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes
In [78]:
df
Out[78]:
dates
0 2014-06-09
1 2014-06-24
2 2014-04-04
3 2014-03-18
4 2014-03-10
5 2014-03-14
6 2014-08-07
7 2014-08-06
8 2014-07-18
For your sample dataset to_datetime works fine. If it didn't work for you, it will be because you have some formats that it can't convert. You can either pass errors='coerce', which sets any values that cannot be converted to NaT, or errors='raise' to report any problems.
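A minimal sketch (made-up series) of what errors='coerce' does:
import pandas as pd
s = pd.Series(['2014-06-09', '2014-06-24', 'not a date'])
# Unparseable entries become NaT instead of raising
print(pd.to_datetime(s, errors='coerce'))
# 0   2014-06-09
# 1   2014-06-24
# 2          NaT
# dtype: datetime64[ns]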
I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sort of tip, even if not a full solution, would be much appreciated. I have tried a number of things and have gotten quite close, but I keep running into indexing issues, in particular Exception: Reindexing only valid with uniquely valued Index objects. [Pandas 0.10.0, Python 2.7]
I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:
ticker_symbol :: String, monthly_return :: Float, date :: Timestamp
e.g.
AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981
NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.
…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)
The below is a DataFrame (of timeseries? How do you say this?), in the format I need:
> data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA 2268 non-null values
AAPL 2268 non-null values
GE 2268 non-null values
IBM 2268 non-null values
JNJ 2268 non-null values
KO 2268 non-null values
MSFT 2268 non-null values
PEP 2268 non-null values
SPX 2268 non-null values
XOM 2268 non-null values
dtypes: float64(10)
The below is a TimeSeries, and is in the format I need.
> data.AAPL:
Date
1993-01-04 00:00:00+00:00 73.00
1993-01-05 00:00:00+00:00 73.12
...
2001-12-28 00:00:00+00:00 36.15
2001-12-31 00:00:00+00:00 35.55
Name: AAPL, Length: 2268
Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).
(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)
I suspect you are trying to set the date as the index too early. My suggestion would be to first set_index with the date and company name; then you can unstack the company name and resample.
Something like this:
In [11]: df1
Out[11]:
ticker_symbol monthly_return date
0 AAPL 0.112 1992-02-28 00:00:00
1 GS 0.130 1981-11-30 00:00:00
2 GS -0.230 1981-12-22 00:00:00
df2 = df1.set_index(['date','ticker_symbol'])
df3 = df2.unstack(level=1)
df4 = df3.resample('M')
In [14]: df2
Out[14]:
monthly_return
date ticker_symbol
1992-02-28 AAPL 0.112
1981-11-30 GS 0.130
1981-12-22 GS -0.230
In [15]: df3
Out[15]:
monthly_return
ticker_symbol AAPL GS
date
1981-11-30 NaN 0.13
1981-12-22 NaN -0.23
1992-02-28 0.112 NaN
In [16]: df4
Out[16]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 124 entries, 1981-11-30 00:00:00 to 1992-02-29 00:00:00
Freq: M
Data columns:
(monthly_return, AAPL) 1 non-null values
(monthly_return, GS) 2 non-null values
dtypes: float64(2)
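On a modern pandas the same pipeline needs an explicit aggregation after resample; a minimal, self-contained sketch reproducing the df1 above ('M' is spelled 'ME' from pandas 2.2 on):
import pandas as pd
df1 = pd.DataFrame({
    'ticker_symbol': ['AAPL', 'GS', 'GS'],
    'monthly_return': [0.112, 0.130, -0.230],
    'date': pd.to_datetime(['1992-02-28', '1981-11-30', '1981-12-22']),
})
wide = (
    df1.set_index(['date', 'ticker_symbol'])['monthly_return']
       .unstack(level=1)      # one column per ticker, NaN where a ticker has no data
       .resample('M').mean()  # month-end bins; modern pandas requires an aggregation
)
print(wide)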
I'm trying to work with some stock market data. I have the following DataFrame:
>>> ticker
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 707 entries, 2010-01-04 00:00:00 to 2012-10-19 00:00:00
Data columns:
Open 707 non-null values
High 707 non-null values
Low 707 non-null values
Close 707 non-null values
Volume 707 non-null values
Adj Close 707 non-null values
dtypes: float64(5), int64(1)
I'll reference a random closing price:
>>> ticker['Close'][704]
21.789999999999999
What's the syntax to get the date of that 704th item?
Similarly, how do I get the position in the array of the following item?:
>>> ticker.Close.min()
17.670000000000002
I know this seems pretty basic, but I've spent a lot of time scouring the documentation. If it's there, I'm absolutely missing it.
This should answer both your questions:
Note: if you want the 704th element, you should use 703, since indexing starts from zero. As you can see, df['A'].argmin() also returns 1, which is the second row of the df.
In [682]: print df
A B C
2000-01-01 1.073247 -1.784255 0.137262
2000-01-02 -0.797483 0.665392 0.692429
2000-01-03 0.123751 0.532109 0.814245
2000-01-04 1.045414 -0.687119 -0.451437
2000-01-05 0.594588 0.240058 -0.813954
2000-01-06 1.104193 0.765873 0.527262
2000-01-07 -0.304374 -0.894570 -0.846679
2000-01-08 -0.443329 -1.437305 -0.316648
In [683]: df.index[3]
Out[683]: <Timestamp: 2000-01-04 00:00:00>
In [684]: df['A'].argmin()
Out[684]: 1
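Applied to the ticker frame from the question (assuming that same DataFrame), the two answers combine like this:
# Date of the 704th closing price (position 703, zero-based)
ticker.index[703]
# Position of the minimum close, and the date it occurred on
pos = ticker['Close'].argmin()
ticker.index[pos]
# idxmin() returns the index label (the date) directly
ticker['Close'].idxmin()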