I am using fbprophet for time-series predictions in Python and I am wondering how the yhat (prediction) column is calculated. I used the following code.
import quandl
import fbprophet
tesla = quandl.get('WIKI/TSLA')
tesla = tesla.reset_index()
tesla = tesla.rename(columns={'Date': 'ds', 'Adj. Close': 'y'})
tesla = tesla[['ds', 'y']]
prophet = fbprophet.Prophet()
prophet.fit(tesla)
future = prophet.make_future_dataframe(periods=365)
future = prophet.predict(future)
The future dataframe contains the following columns:
['ds', 'trend', 'trend_lower', 'trend_upper', 'yhat_lower', 'yhat_upper',
'seasonal', 'seasonal_lower', 'seasonal_upper', 'seasonalities',
'seasonalities_lower', 'seasonalities_upper', 'weekly', 'weekly_lower',
'weekly_upper', 'yearly', 'yearly_lower', 'yearly_upper', 'yhat']
I understand yhat is the prediction, but is it a combination of trend, seasonal, seasonalities, weekly, yearly or something else?
I have tried combining the trend, seasonal, seasonalities, weekly, yearly columns to see if they equal the yhat column, but they do not:
future['combination'] = future['trend'] + future['seasonal'] + future['weekly'] + future['yearly'] + future['seasonalities']
print(future[['combination', 'yhat']].head())
combination yhat
0 57.071956 27.681139
1 55.840545 27.337270
2 53.741200 26.704090
3 51.874192 26.148355
4 47.827763 25.065950
I have been unable to find an answer in the documentation, but apologize if I have simply missed it.
I am a beginner of fbprophet but would like to share my guess.
Even though I'm not sure it is correct, the following equation probably holds true in the default setting of prophet.
yhat = trend + seasonal
Hopefully, my answer will help you!
This seems to have changed now. Looking at Github code, yhat is calculated as below
yhat = trend * (1 + multiplicative_terms) + additive_terms
This calculation can be found in the fbprophet source code for both the Python and the R implementations.
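To make the formula above concrete, here is a minimal sketch on made-up numbers, not real Prophet output. It assumes the forecast frame exposes `trend`, `additive_terms` and `multiplicative_terms` columns, as the quoted formula suggests; the values themselves are hypothetical.

```python
import numpy as np

# Hypothetical component values, standing in for columns of a forecast frame.
trend = np.array([20.0, 21.0, 22.0])
additive_terms = np.array([1.5, -0.5, 0.25])      # e.g. combined seasonal effects
multiplicative_terms = np.array([0.0, 0.0, 0.0])  # zero when nothing is multiplicative

# The formula quoted from the source code.
yhat = trend * (1 + multiplicative_terms) + additive_terms

# With no multiplicative terms this reduces to trend + additive_terms.
assert np.allclose(yhat, trend + additive_terms)
print(yhat)
```

If the seasonal columns in the question overlap (for instance if `seasonal` already contains `weekly` and `yearly`), adding all of them to `trend` would count the same effect more than once, which might explain why that sum overshoots `yhat`.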
I tried a time series forecast with Python using the ARIMA function from statsmodels, and it gave me a different result from R's arima function.
I used the same hyper-parameters.
R's version :
fit <- arima(data[1:9000,3], order = c(3,0,3), seasonal = list(order = c(0,0,0)))
predd = forecast(fit,h=1000)
pred = cbind(data[9001:10000,3], predd$mean)
Python's version :
series = df[0:9000].copy()
model = ARIMA(series, order=(3, 0, 3))
model_fitted = model.fit()
predictions = model_fitted.predict(start=len(series), end=len(df)-1)
Attached are the resulting plots: [Plots of R's and Python's ARIMA results]
What am I doing wrong?
Is there any other Python package or ARIMA function, other than statsmodels, that I can use for a univariate time series?
Any insight or guidance would be greatly appreciated. Thank you so much in advance.
Summary: I do not know how you created the first image you showed as "R's version", but when I run the R code you gave and plot the results, they look identical to the Python results to me and do not look like the "R's version" graph you included. My best guess is that somehow you were plotting in-sample predictions when you created that image showing R's results.
See below for details.
Details:
I started by downloading the dataset "dataset.txt" from the link you gave, https://gist.github.com/DouddaS/5043a340ff7d7b35b255b4f8f74fc534
Now, if I run the following R code:
library(forecast)
y <- read.csv('dataset.txt')
fit <- arima(y[1:9000, 1], order = c(3,0,3), seasonal = list(order = c(0,0,0)))
predd = forecast(fit,h=1000)
pred = cbind(y[9001:10000,1], predd$mean)
autoplot(pred)
This gives the following plot:
And when I run the following Python code:
import pandas as pd
import statsmodels.api as sm
y = pd.read_csv('dataset.txt')
series = y.iloc[:9000, 0]  # training portion, first 9000 rows
model = sm.tsa.arima.ARIMA(series, order=(3, 0, 3))
model_fitted = model.fit()
pred = model_fitted.predict(start=len(series), end=len(y)-1)
predd = pd.concat([y.iloc[9000:, 0], pred], axis=1)
predd.plot()
Then I get the following plot:
These look basically identical to me, and R's version looks nothing like the image that was posted in the question.
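The guess about in-sample versus out-of-sample predictions can be illustrated without either library. The sketch below uses a simplified AR(1) model fitted by least squares on simulated data, not the ARIMA(3,0,3) from the question; all names and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: y[t] = 0.9 * y[t-1] + noise.
n = 1000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.9 * y[t - 1] + rng.normal()

train, test = y[:900], y[900:]

# Least-squares estimate of the AR coefficient (no intercept).
phi_hat = np.dot(train[:-1], train[1:]) / np.dot(train[:-1], train[:-1])

# In-sample: each prediction uses the actual previous observation,
# so the predictions follow the wiggles of the data.
in_sample = phi_hat * train[:-1]

# Out-of-sample: each step builds on the previous forecast, so the
# forecast decays geometrically towards the series mean (zero here).
steps = np.arange(1, len(test) + 1)
forecast = train[-1] * phi_hat ** steps

print("spread of in-sample predictions:", in_sample.std())
print("last out-of-sample forecast:", forecast[-1])
```

A plot of `in_sample` would look like a noisy series, while `forecast` flattens out quickly, which is the kind of difference described above.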
I am trying to use Prophet, the forecasting package by Facebook, and it works great with, say, 150 rows of data. But when I try to model with fewer than 100 rows, it gives me very weird predictions. When I do it in R, it gives me the same prediction for all dates, and when I do it in Python, it gives me very bad predictions.
My data is weekly from 2018 week 1 to 2019 week 40.
This is my code:
(python)
predictionSize=6
new_train_df = data[:-predictionSize]
new_test_df = data[len(data)-predictionSize:]
m_new = Prophet(weekly_seasonality=True,yearly_seasonality=True)
m_new.fit(new_train_df)
new_future = m_new.make_future_dataframe(periods=predictionSize,freq='W')
new_forecast = m_new.predict(new_future)
new_ypred = new_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(6)
Using this code gives me negative values for yhat.
My question is: are the predictions bad because the dataset is too small for Prophet?
Do let me know if you need any other information. The data has weekly seasonality and yearly seasonality.
I am new to machine learning. I am running some code that analyses a data set using pandas and quandl. The code runs fine and produces output, but I am unable to understand two lines of it. Here is the code:
import pandas as pd
import quandl
import math
df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume',]]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close'])/ df['Adj. Close']*100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close','HL_PCT','PCT_change','Adj. Volume']]
forecast_col = 'Adj. Close'
# fill the NaN values
df.fillna(-99999,inplace=True)
# this line I am unable to understand
forecast_out = int(math.ceil(0.02*len(df)))
# this line I am unable to understand
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)
print(df.head())
I am unable to understand what the 0.02 in the ceil call is for, and why this code uses the shift function. Why is -forecast_out used, given that forecast_out takes different values? Also, we have already filled the NaN positions with some data, so why do we then drop NaN? Please help.
I was following the same tutorial and got stuck on the same problem. Here is how I figured it out.
math.ceil() rounds a number up to the nearest integer. For example:
math.ceil(4.5)
returns:
5
In the code, the value being rounded up is:
0.02*len(df)
len(df) is the number of rows in the dataset, which in this case is 3424:
print(len(df))
In other words, we have 3424 days of data. We want to forecast the future, but obviously not 3424 days ahead; we take a small step into the future instead. In our case that is 69 days (2% of our total data, rounded up) beyond the last row, to see what the price will be at that point.
So to wrap this up :
forecast_out = int(math.ceil(0.02*len(df)))
Equals 69
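The arithmetic behind that value can be checked directly. This is only a worked calculation using the row count quoted above; `n_rows` and `raw` are names introduced here for illustration.

```python
import math

n_rows = 3424                       # dataset size reported above
raw = 0.02 * n_rows                 # 2% of the rows, a fractional number (about 68.48)
forecast_out = int(math.ceil(raw))  # round up to a whole number of rows

print(raw, forecast_out)
```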
Now we will use the variable forecast_out to determine the label:
df['label'] = df[forecast_col].shift(-forecast_out)
This means we shift the column up by 69 rows, so that each row's label is the stock price 69 days later.
Here is the code with more details you can try to play around with it.
forecast_col ='Adj. Close'
df.fillna(-99999,inplace=True)
forecast_out=int(math.ceil(0.02*len(df)))
print ("Dataset= " + str(len(df)))
print ("Forecasting after how many days = " + str(forecast_out))
df['label']=df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)
print(df.tail())
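To see what the negative shift does, and why `dropna` is still needed after `fillna`, here is a small sketch on a toy frame. The ten prices and the value `forecast_out = 2` are made up so the effect is easy to read.

```python
import pandas as pd

# Toy prices; the original NaNs (if any) would already have been filled.
df = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0,
                                  15.0, 16.0, 17.0, 18.0, 19.0]})

forecast_out = 2  # small value chosen only to keep the example readable

# Shift the column up by two rows: row i now carries the price of row i + 2.
df["label"] = df["Adj. Close"].shift(-forecast_out)
print(df)

# The last two rows have no future price, so shift() leaves NaN there.
# These NaNs are new (created after fillna), which is why dropna() follows.
labelled = df.dropna()
print(len(df), len(labelled))
```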
Often in machine learning, you'll have data samples and each sample has features and labels (many APIs, such as scikit-learn, expect this). In your case, each sample is a row of your dataframe. The value to predict is the forecast_col. Since you're looking at stock data, you want to predict what will happen in the future. It's meaningless to "predict" what's happening now (you can just observe it). The forecast_out value is an arbitrary choice; in this case it says how far in advance you will predict the 'Adj. Close'.
The shift method aligns the observations with the future value to predict. Then with this dataframe you can easily use scikit-learn to fit a model.
import sklearn.linear_model
lr = sklearn.linear_model.LinearRegression()
# Fit on the shifted 'label' column so the target is the future price.
lr.fit(df[['HL_PCT','PCT_change','Adj. Volume']], df['label'])
This model will make predictions from the current observed values about what's going to happen forecast_out days from now.
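The alignment described here can be sketched end to end on toy data. The example below uses NumPy least squares as a stand-in for scikit-learn's LinearRegression, and the column names `feature` and `price` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: one feature and a price that rises by 1 each day.
df = pd.DataFrame({"feature": np.arange(10.0),
                   "price": np.arange(100.0, 110.0)})

forecast_out = 2
df["label"] = df["price"].shift(-forecast_out)  # price two rows later

train = df.dropna()               # rows whose future price is known
recent = df[df["label"].isna()]   # last rows: their future is not observed yet

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(train)), train["feature"].to_numpy()])
coef, *_ = np.linalg.lstsq(A, train["label"].to_numpy(), rcond=None)

# Predict the price forecast_out rows after each of the most recent rows.
A_recent = np.column_stack([np.ones(len(recent)), recent["feature"].to_numpy()])
pred = A_recent @ coef
print(pred)
```

The rows in `recent` are exactly the ones that `dropna` removes from the training data; they are the natural inputs for an actual forecast.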
I'm trying to convert the following code from R to Python using the Statsmodels module:
model <- glm(goals ~ att + def + home - (1), data=df, family=poisson, weights=weight)
I've got a similar dataframe (named df) using pandas, and currently have the following line in Python (version 3.4 if it makes a difference):
model = sm.Poisson.from_formula("goals ~ att + def + home - 1", df).fit()
Or, using GLM:
smf.glm("goals ~ att + def + home - 1", df, family=sm.families.Poisson()).fit()
However, I can't get the weighting terms to work. Each record in the dataframe has a date, and I want more recent records to be more valuable for fitting the model than older ones. I've not seen an example of it being used, but surely if it can be done in R, it can be done on Statsmodels... right?
freq_weights is now supported on GLM with the Poisson family, but unfortunately not on sm.Poisson.
To use it, pass freq_weights when creating the GLM:
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "goals ~ att + def + home - 1"
smf.glm(formula, df, family=sm.families.Poisson(), freq_weights=df['freq_weight']).fit()
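The question asks for recent records to count more than older ones. One possible way to build such a column is sketched below; the exponential half-life scheme, the `half_life_days` value and the `date` column are assumptions for illustration, not something the answer or statsmodels prescribes.

```python
import pandas as pd

# Hypothetical match dates.
df = pd.DataFrame({"date": pd.to_datetime(["2015-01-01", "2015-06-01", "2015-12-01"])})

half_life_days = 180.0  # assumed: a record loses half its weight every 180 days

# Age of each record in days, relative to the most recent one.
age_days = (df["date"].max() - df["date"]).dt.days

# Newest record gets weight 1.0; older records get progressively less.
df["freq_weight"] = 0.5 ** (age_days / half_life_days)
print(df)
```

The resulting column could then be passed as `freq_weights=df['freq_weight']` in the call above.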
I've encountered the same issue.
There is a workaround that should lead to the same results: add the weight on a log scale (np.log(weight)) as one of the explanatory variables, with its coefficient fixed at 1 (the offset option).
I can see there is also an exposure option, which does the same thing as I explained above.
There are two solutions for setting up weights for Poisson regression. The first is to use freq_weights in the GLM function, as mentioned by MarkWPiper. The second is to just go with Poisson regression and pass the weights to exposure. As documented here: "Log(exposure) is added to the linear prediction with coefficient equal to 1." This does the same mathematical trick as mentioned by Yaron, although the parameter has a different original meaning. Sample code is as follows:
import statsmodels.api as sm
# or: from statsmodels.discrete.discrete_model import Poisson
fitted = sm.Poisson.from_formula("goals ~ att + def + home - 1", data=df, exposure=df['weight']).fit()
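The quoted sentence about log(exposure) can be checked numerically. The sketch below uses made-up design values and coefficients and only demonstrates the identity for the predicted mean; it does not fit a model.

```python
import numpy as np

# Hypothetical design matrix (intercept + one regressor) and coefficients.
X = np.array([[1.0, 0.2],
              [1.0, -0.4],
              [1.0, 1.1]])
beta = np.array([0.3, 0.8])
exposure = np.array([1.0, 2.0, 0.5])

# Adding log(exposure) to the linear predictor with coefficient 1 ...
mu_offset = np.exp(X @ beta + np.log(exposure))

# ... is the same as scaling the predicted mean by the exposure.
mu_scaled = exposure * np.exp(X @ beta)

assert np.allclose(mu_offset, mu_scaled)
print(mu_offset)
```

So passing a column as exposure changes the expected count for each row by that factor, which is worth keeping in mind when deciding whether it matches the intended meaning of a weight.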
Say I do an OLS regression using statsmodels of variable y on some explanatory variables x1 x2 x3 (contained in a dataframe df):
res = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
Is it possible to get a predicted value using only a subset of the explanatory variables? For example, I would like to get a predicted value for the observations in df using only x1 and x2 but not x3.
I have tried to do
res.predict(df[['x1','x2']])
but I get the error message: NameError: name 'x3' is not defined.
Edit: The reason I want to do this is the following. I'm running a regression of house values on house characteristics and dummies for metropolitan area, suburban status, and year. I would like to use the dummies for metropolitan area, suburban status and year to construct a price index for each location and time period.
Edit 2: This is how I ended up doing it, in case it can be helpful to anyone or someone can point me to a better way of doing it.
I'm interested in doing an OLS on the following specification:
model = 'price ~ C(MetroArea) + C(City) + C(Year) + x1 + ... + xK'
where 'x1 + ... + xK' is pseudo-code for a bunch of variables I'm using as controls but I'm not interested in, and the categorical variables are very large (e.g. 90 Metropolitan areas).
Next I fit the model with statsmodels and construct the design matrix that I'll use to predict prices using the variables of interest.
res = smf.ols(model, data=mydata).fit()
data_prediction = mydata[['MetroArea','City','Year']]
model_predict = 'C(MetroArea) + C(City) + C(Year)'
X = patsy.dmatrix(model_predict, data=data_prediction, return_type='dataframe')
The tricky part now is to select the right parameters for the variables of interest, since there are many and their names are not exactly those of their respective variables since I've used the categorical operator, C(), of patsy (e.g. variables for MetroArea look like: C(MetroArea)[0], C(MetroArea)[8], ...).
vars_interest = ['Intercept', 'MetroArea', 'City', 'Year']
params_interest = res.params[[any([word in var for word in vars_interest])
for var in res.params.index]]
Get prediction by doing the dot product of the selected parameters and variables of interest:
prediction = np.dot(X,params_interest)
In case anyone stumbles on this old question, there seems to be a cleaner solution using the information contained in the design matrix.
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
mydata = None  # placeholder: replace with your own DataFrame
vars_of_interest = ['C(MetroArea)', 'C(City)', 'C(Year)']
formula = 'price ~' + " + ".join(vars_of_interest) + ' + x1 + ... + xK'
Y, X = dmatrices(formula, mydata)
# Get the slice names from patsy
slices = X.design_info.term_name_slices
model = sm.OLS(Y, X)
res = model.fit()
prediction = np.zeros(X.shape[0])
for var in vars_of_interest:
    prediction += X[:, slices[var]].dot(res.params[slices[var]])
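The slice-based loop above can be tried on a tiny hand-built example. Here the design matrix, the parameters and the `slices` dictionary are all made up; the dictionary stands in for what patsy's `term_name_slices` would provide.

```python
import numpy as np

# Hypothetical design matrix: intercept, two dummy columns for one
# categorical term, and one control variable.
X = np.array([[1.0, 0.0, 0.0, 2.0],
              [1.0, 1.0, 0.0, 3.0],
              [1.0, 0.0, 1.0, 4.0]])
params = np.array([10.0, 1.0, 2.0, 0.5])

# Stand-in for the term-to-columns mapping.
slices = {"Intercept": slice(0, 1), "C(City)": slice(1, 3), "x1": slice(3, 4)}

# Sum the contribution of the chosen terms only; the control x1 is left out.
terms = ["Intercept", "C(City)"]
prediction = np.zeros(X.shape[0])
for term in terms:
    prediction += X[:, slices[term]].dot(params[slices[term]])

print(prediction)
```

Whether to include the intercept in the list of terms is a choice: the earlier approach in the question includes it, while the loop above only sums the categorical terms.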
What are you trying to do conceptually? When you predict using your regression you're just plugging values into an equation. So predicting "without x3" is the same as just plugging in x3=0.
In terms of implementing this, it looks like statsmodels is pretty draconian about prediction using the same variable names as you used during a fit. So this is not elegant, but works:
df2 = df.copy()
df2['x3'] = 0
res.predict(df2[['x1','x2','x3']])
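The claim that predicting "without x3" equals plugging in x3=0 can be verified with plain NumPy. The sketch below fits ordinary least squares on simulated data; the data and coefficients are made up, and `np.linalg.lstsq` stands in for the statsmodels fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Simulated regressors and response.
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

# Fit OLS with an intercept column.
X = np.column_stack([np.ones(n), x1, x2, x3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Option 1: keep all columns but set x3 to zero.
X_zeroed = X.copy()
X_zeroed[:, 3] = 0.0
pred_zeroed = X_zeroed @ beta

# Option 2: drop the x3 term and its coefficient entirely.
pred_partial = X[:, :3] @ beta[:3]

assert np.allclose(pred_zeroed, pred_partial)
```

Both options use the coefficients from the full model; neither is the same as refitting the regression without x3.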