Ecommerce item sales forecasting with pandas and statsmodels - python

I want to forecast item sales (the number of sales of each product) with pandas and statsmodels for an ecommerce business. Because item sales is a count dependent variable, I'm assuming a Poisson model would work best.
Ideally, the model will be used to decide which products to feature in ads (increasing product views) and to choose price points (changing prices) for the best performance/profitability.
So far so good. However, when I try:
...
import statsmodels.formula.api as smf
...
result = smf.poisson(formula="Item_Sales ~ Product_Detail_Views + Variant_Price + C(Product_Type)", data=df).fit()
I get:
RuntimeWarning: invalid value encountered in multiply
return -np.dot(L*X.T, X)
RuntimeWarning: invalid value encountered in greater_equal
return mu >= 0
RuntimeWarning: invalid value encountered in greater
oldparams) > tol))
And a table full of NaNs
If I use OLS with the same dataset:
result = smf.ols(formula="Item_Sales ~ Product_Detail_Views + Variant_Price + C(Product_Type)", data=df).fit()
I get an R-squared of 0.809, so the data is good. The model isn't as usable, though, as I get negative predictions, which are obviously impossible (you cannot have negative item sales).
How can I make the Poisson model work?

Looks like a data problem (e.g. missing or extreme values in the regressors), but since no sample data is shown I cannot be sure. You can try GLM with a Poisson family, or GEE with a Poisson family.
Example (the keyword is family, and it takes a family instance):
smf.glm('sedimentation ~ C(control_grid)', data=df, family=sm.families.Poisson()).fit()
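Applied to the formula from the question, a minimal sketch (assuming df holds the columns described there):
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Poisson GLM with the default log link:
result = smf.glm(
    formula="Item_Sales ~ Product_Detail_Views + Variant_Price + C(Product_Type)",
    data=df,
    family=sm.families.Poisson(),
).fit()
print(result.summary())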

Related

fbprophet yearly seasonality volatility

I am new to fbprophet and have a question about using the predict function.
As an example, I am using fbprophet to extrapolate Apple's revenue for the next 5 years. Below is the code using the default settings.
m = Prophet()
m.fit(data)
future = m.make_future_dataframe(periods=5*365)
forecast = m.predict(future)
m.plot(forecast)
m.plot_components(forecast)
plt.show()
The results: [forecast and component plots omitted]
If I choose to remove the yearly seasonality, I get a linear regression that fits much better.
My question is: why do the predicted yhat values blow up so much when yearly seasonality is included? Turning the option off produces a linear regression model, but I'm unsure whether that model is the most suitable for the data. Any suggestions would be much appreciated.
It looks like you are using monthly data, not daily data.
So instead of using periods=5*365, you can change the frequency to monthly.
Example:
future_pd = m.make_future_dataframe(periods=12 * 5, freq='MS', include_history=True)
Here freq='MS' is pandas' month-start frequency, so 12 * 5 periods covers five years of monthly dates.
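The rest of the workflow from the question is unchanged:
forecast = m.predict(future_pd)
m.plot(forecast)
m.plot_components(forecast)
plt.show()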

Why does seasonal_decompose() show clear seasonality in my time series, while adfuller() says it is stationary?

To my naked eye the series looks seasonal, yet when I use adfuller() the p-value says it is stationary.
I have also applied seasonal_decompose() to it, and the results were pretty much what I expected.
tb3['percent'].plot(figsize=(18,8))
[plot of the series omitted]
One thing to note is that my data is collected every minute.
tb3.index.freq = 'T'
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(tb3['percent'].values,freq=24*60, model='additive')
result.plot();
The result of the ETS decomposition: [plot omitted]
We can see clear seasonality, which is what I expected.
But when I use adfuller():
from statsmodels.tsa.stattools import adfuller
result = adfuller(tb3['percent'], autolag='AIC')
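adfuller returns a tuple; the first element is the test statistic and the second is the p-value:
adf_stat, p_value = result[0], result[1]
print(adf_stat, p_value)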
The p-value is less than 0.05, which means the series is stationary.
Can anyone tell me why that happens, and how I can fix it? I want to use a SARIMA model to predict future values, whereas an ARIMA model always predicts a constant future value.
An Augmented Dickey-Fuller test examines whether the coefficient c in the regression
y_t - y_{t-1} = <deterministic terms> + c y_{t-1} + <lagged differences>
is equal to 0, i.e. whether the level series has a unit root. The test usually has no power against seasonal deterministic terms, so it is not surprising that adfuller rejects the unit root even though the series is visibly seasonal: a series can be stationary around a deterministic seasonal pattern.
You can use a stationary SARIMA model, for example
SARIMAX(y, order=(p, 0, q), seasonal_order=(ps, 0, qs, 24*60))
where you set the AR, MA, seasonal AR, and seasonal MA orders as needed.
This model will be quite slow and memory-intensive, since 24 hours of minutely data implies a seasonal lag of 1440.
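For illustration, a minimal sketch (the orders here are placeholders; choose p, q, ps, qs for your data):
from statsmodels.tsa.statespace.sarimax import SARIMAX
# y is the minutely series, e.g. tb3['percent']; placeholder orders (1,0,1)(1,0,0).
mod = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 0, 0, 24 * 60))
res = mod.fit()
pred = res.forecast(steps=60)  # forecast the next hour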
The next version of statsmodels, released as 0.12.0rc0, adds initial support for deterministic processes in time series models, which may simplify modeling this type of series. In particular, it would be tempting to use a low-order Fourier deterministic sequence. An example notebook:
https://www.statsmodels.org/devel/examples/notebooks/generated/deterministics.html
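For illustration, a minimal sketch of that approach (the Fourier order and the ARMA orders are assumptions):
from statsmodels.tsa.deterministic import DeterministicProcess, Fourier
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Assuming y is a pandas Series with a minutely DatetimeIndex.
fourier = Fourier(period=24 * 60, order=3)  # low-order Fourier terms for the daily cycle
dp = DeterministicProcess(y.index, constant=True, additional_terms=[fourier])
mod = SARIMAX(y, exog=dp.in_sample(), order=(1, 0, 1))
res = mod.fit()
# Out-of-sample forecasts need the matching future deterministic terms:
pred = res.forecast(steps=60, exog=dp.out_of_sample(steps=60))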

Partial dependence plots with min/max (interval) and not only average in Python

Good day,
I have applied the LightGBM algorithm to a real estate price data set (85524 observations and 167 features). I want to see the interaction between year and real estate area size on price. The dependent variable is transformed with log1p to bring its distribution closer to normal.
I have used Python's pdpbox module to generate an interaction plot. As I understand it, the coloring is the average predicted price for each pair of variable values; however, I would like to get the interval of the interaction, i.e. its min and max. Is it possible to do so?
LGBMR.fit(df_train.drop(["Price"], axis=1, inplace=False), df_train["Price"])
feats = ['Year', 'Real estate area']
p = pdp.pdp_interact(LGBMR, df, model_features=columns, features=feats)
pdp.pdp_interact_plot(p, feats, plot_type='grid')
I am adding the pdp interaction plot. For example, in the year 2008, a real estate object of size 0.52 was predicted at an average price of 5.697, but I would like to know the min and max predicted price for this interaction.

Python / statsmodels - out of sample predictions

I am trying to perform autoregressive multiple linear regressions using statsmodels (something like y ~ y_1 + X1 + X2, not ARMA-like).
More specifically, I'm looking for a way to get out-of-sample results.
When I use the predict method, I get in-sample results, which means it uses the previous historical value of the dependent variable instead of the value estimated at the previous step.
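For illustration, a minimal sketch of the recursive scheme I mean (the train/test frames and the y_lag1 column are illustrative assumptions):
import numpy as np
import statsmodels.formula.api as smf
model = smf.ols("y ~ y_lag1 + X1 + X2", data=train).fit()
# Recursive out-of-sample forecasting: feed each prediction back in
# as the lagged value for the next step.
preds = []
y_prev = train["y"].iloc[-1]
for _, row in test.iterrows():
    step = row.to_frame().T.assign(y_lag1=y_prev)
    y_prev = float(np.asarray(model.predict(step))[0])
    preds.append(y_prev)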
Thanks for your help.

Solving the Price is Right

In Chapter 5 of Probabilistic Programming for Hackers, the author proposes the following solution to an instance of The Price is Right, where the goal is to estimate the posterior of the price of the full showcase.
As a contestant on the show, all you have is an estimate of the full price of the showcase based on historical data from previous episodes, and your own estimates of the prices of the items included in the showcase.
The chapter of the book is based on this post, and the code for the solution is shown below:
import pymc as pm
# Our belief of the full price of the showcase, based on our analysis of
# historical data from previous episodes:
mu_prior = 35000
std_prior = 7500
true_price = pm.Normal("true_price", mu_prior, 1.0 / std_prior ** 2)
# Our beliefs about the price of two items in the showcase:
prize_1 = pm.Normal("snowblower", 3000, 1.0 / (500 ** 2))
prize_2 = pm.Normal("trip_to_toronto", 5000, 1.0 / (3000 ** 2))
price_estimate = prize_1 + prize_2
# The model that relates our three priors: a potential tying the prior on
# the true price to the sum of our item estimates.
@pm.potential
def error(true_price=true_price, price_estimate=price_estimate):
    return pm.normal_like(true_price, price_estimate, 1.0 / (3e3) ** 2)
# Solving for the final price of the full showcase
mcmc = pm.MCMC([true_price, prize_1, prize_2, price_estimate, error])
mcmc.sample(50000, 10000)
trace_of_posterior_of_price_of_suite = mcmc.trace("true_price")[:]
which results in a posterior distribution for the showcase price. [plot omitted]
My questions are:
What is the Bayesian formulation of this problem? How can the author use a likelihood function to connect the priors and get a posterior?
How does pymc interpret the definition of the error potential in the code above? In the statistical graphical models literature, a potential is usually a factor (i.e. a product term) in the factorization of some distribution. What distribution (i.e. what variables) is potential referring to in this case?
Since the author uses the PyMC function normal_like in the code, does PyMC assume that you want to maximize this likelihood function? (If not, what role does it play?) The author seems to be using true_price as the observed data, and price_estimate as the mu in the normal likelihood function. Is this right? If so, what is the rationale for this?
