Forecasting (finding the right model) - Python

Using Python, I am trying to predict the future sales count of a product, using historical sales data. I am also trying to predict these counts for various groups of products.
For example, my columns look like this:
Date, Sales_count, Department, Item, Color
8/1/2018, 50, Homegoods, Hats, Red_hat
If I want to build a model that predicts the sales_count for each Department/Item/Color combo using historical data (time), what is the best model to use?
If I do linear regression of sales against time, how do I account for the various categories? Can I group them?
Or would I instead use multiple linear regression, treating the various categories as independent variables?

The best way I have come across for forecasting in Python is the SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) model in the statsmodels library. Here is the link to a very good tutorial on SARIMAX in Python.
Also, if you are able to group the data frame by your Department/Item/Color combo, you can put the groups in a loop and apply the same model to each.
Maybe you can create a key for each unique combination and forecast the sales for each key.
For example,
import pandas as pd

df = pd.read_csv('your_file.csv')
df['key'] = df['Department'] + '_' + df['Item'] + '_' + df['Color']

for key in df['key'].unique():
    temp = df.loc[df['key'] == key]  # filter only the specific group
    # aggregate the sum of sales per date; skip if not required
    temp = temp.groupby('Date')['Sales_count'].sum().reset_index()
    # write the forecasting code here from the tutorial
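A minimal sketch of what that forecasting step could look like with statsmodels' SARIMAX, assuming daily data; the (1, 1, 1) and (1, 1, 1, 7) orders are placeholders to be tuned (for example with the AIC grid search from the tutorial):

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv('your_file.csv', parse_dates=['Date'])
df['key'] = df['Department'] + '_' + df['Item'] + '_' + df['Color']

forecasts = {}
for key in df['key'].unique():
    series = (df.loc[df['key'] == key]
                .groupby('Date')['Sales_count'].sum()
                .asfreq('D', fill_value=0))           # regular daily index, 0 for missing days
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))  # placeholder orders
    forecasts[key] = model.fit(disp=False).forecast(steps=30)  # 30 days ahead per group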

Related

How do I conduct a Multilevel Model/Regression in Python?

I have yearly data over time (longitudinal data) with repeated measures for many of the subjects. I think I need multilevel modeling/regression to deal with the almost certainly correlated clusters of measurements for the same individuals over time. The data is currently in separate tables, one per year.
I was wondering if there is a way built into scikit-learn, like LinearRegression(), to conduct a multilevel regression where Level 1 is all the data over the years and Level 2 is the clustering on the subjects (a cluster for each subject's measurements over time). And if so, whether it is better to lay the longitudinal data out wide (where each subject's measures over time are all in one row) or stacked (where each measure for each year is its own row).
Is there a way to do this?
Estimation of random effects in multilevel models is non-trivial and you typically have to resort to Bayesian inference methods.
I would suggest you look into Bayesian inference packages such as pymc3 or BRMS (if you know R), where you can specify such a model. Alternatively, look at the lme4 package in R for a fully frequentist implementation of multilevel models.
Also, I think you could get some inspiration from the "sleep-deprivation" dataset, which is used as a textbook example of longitudinal data analysis (https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf, p. 4).
To get started in pymc3 have a look here:
https://github.com/fonnesbeck/Bios8366/blob/master/notebooks/Section4_7-Multilevel-Modeling.ipynb
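If you go the pymc3 route, a random-intercept model on stacked ("long") data could look roughly like the sketch below; the file name, the 'subject', 'year' and 'outcome' columns, and the priors are all placeholder assumptions. As a side note, this also suggests an answer to the layout question: the stacked format, one row per subject-year, is what these packages expect.

# A minimal random-intercept sketch in pymc3 for stacked longitudinal data.
import pandas as pd
import pymc3 as pm

df = pd.read_csv('longitudinal.csv')                     # hypothetical file
subject_idx, subjects = pd.factorize(df['subject'])      # integer code per subject
year = (df['year'] - df['year'].min()).values            # time measured from the first year

with pm.Model() as model:
    mu_a = pm.Normal('mu_a', mu=0.0, sigma=10.0)         # population-level intercept
    sigma_a = pm.HalfNormal('sigma_a', sigma=5.0)        # between-subject spread
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=len(subjects))  # one intercept per subject
    b = pm.Normal('b', mu=0.0, sigma=10.0)               # common slope over time
    sigma = pm.HalfNormal('sigma', sigma=5.0)            # residual noise
    pm.Normal('y', mu=a[subject_idx] + b * year,
              sigma=sigma, observed=df['outcome'].values)
    trace = pm.sample(2000, tune=1000, target_accept=0.9)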

Machine learning application to column mapping

I have a dataframe with a large number of rows (several hundred thousand) and several columns that show industry classification for a company, while the eighth column is the output and shows the company type, e.g. Corporate or Bank or Asset Manager or Government etc.
Unfortunately the industry classification is not consistent 100% of the time and is not finite, i.e. there are too many permutations of the industry classification columns to map them all manually in one pass. If I mapped, say, 1k rows with the correct Output column, how can I use machine learning in Python to predict the Output column based on my trained sample data? Please see the attached image, which will make it clearer.
[Image: part of the dataset]
You are trying to predict the company type based on just a couple of columns? That is not possible in general; there are a lot of companies working on exactly that problem. The best you can do is to collect a lot of data from different sources, match them, and then try scikit-learn, probably starting with a decision tree classifier.
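To illustrate that suggestion, here is a minimal scikit-learn sketch, assuming the classification columns are categorical strings, 'Output' holds the labelled company type for the ~1k mapped rows, and the file name is hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('labelled_sample.csv')          # hypothetical file with the ~1k mapped rows
X = df.drop(columns=['Output'])                  # the industry classification columns
y = df['Output']                                 # company type labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),      # encode the categorical classifications
    DecisionTreeClassifier(max_depth=10, random_state=0),
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                 # accuracy on held-out labelled rows

The fitted pipeline can then be applied with clf.predict(...) to the unlabelled rows to fill in the Output column.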

Predict sales with Python

I am a beginner in Python programming and machine learning.
I have a dataset with sales per product on monthly level.
The dataset has data from 2015 up till 2019.
With the help of Python I would like to make a prediction model that predicts the sales of the next month.
I followed this tutorial:
Sales prediction
This gave me predictions for the last 6 months lined up against the actual sales, and I managed to get a pretty accurate result. My problem is that I need the predictions per product, and if possible I would like to include the influence of the weather as well. For example, if the weather data says it will be rainy, the model should take that into account.
Does anyone know a way of doing this?
Every tip on which model to use or article to read is much appreciated!!
The most basic way of doing that would be to run an ARIMA model with external regressors (the weather measured in terms of temperature, humidity or any other feature that is expected to influence the monthly sales).
What is important is that, before fitting the model, the sales data should be transformed into log monthly changes with something like np.log(df.column).diff().
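A minimal sketch of that idea with statsmodels' SARIMAX for a single product; the file name, the 'sales', 'temperature' and 'rain_days' columns, and the (1, 0, 1) order are all assumptions to be adapted:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv('product_sales.csv', parse_dates=['Date'], index_col='Date')  # hypothetical file
y = np.log(df['sales']).diff().dropna()               # log monthly changes
exog = df[['temperature', 'rain_days']].loc[y.index]  # weather regressors aligned with y

res = SARIMAX(y, exog=exog, order=(1, 0, 1)).fit(disp=False)  # placeholder order

# forecasting one month ahead requires next month's (expected) weather values
next_weather = exog.iloc[[-1]].values                 # placeholder: reuse the last observation
print(res.forecast(steps=1, exog=next_weather))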
I've recently written a Python micro-package, salesplansuccess, which deals with predicting the current (or next) year's annual sales for individual products from historical monthly sales data. A major assumption of that model, however, is quarterly seasonality (more specifically, a repeating drift from the 2nd to the 3rd month of each quarter), which is more characteristic of wholesalers (who strive to achieve their formal quarterly sales goals by pushing sales at the end of each quarter).
The package is installed as usual with pip install salesplansuccess.
You can modify its source code to better fit your needs. It uses both ARIMA (maximum likelihood estimates) and linear regression (least squares estimates) techniques under the hood.
The minimalistic use case is below:
import pandas as pd
from salesplansuccess.api import SalesPlanSuccess
myHistoricalData = pd.read_excel('myfile.xlsx')
myAnnualPlan = 1000
sps = SalesPlanSuccess(data=myHistoricalData, plan=myAnnualPlan)
sps.fit()
sps.simulate()
sps.plot()
For more detailed illustration of its use, you may want to refer to a Jupyter Notebook illustration file at its GitHub repository.

How to include quarterly regressor in Prophet for monthly time series?

I have a monthly time series which I want to forecast using Prophet. I also have external regressors which are only available on a quarterly basis.
I have thought of the following possibilities:
- repeat the quarterly values for each month of the quarter and include them as a regressor
- linearly interpolate the quarterly values for the in-between months
What other options I can evaluate?
Which would be the most sensible thing to do in this situation?
You have to evaluate based on your business problem, but there are some questions you can ask yourself.
How are the external regressors making their predictions? Are they trained on completely different data?
If not, are they worth including?
How quickly do we expect those regressors to get "stale"? How far in the future are their predictions available? How well do they perform more than one quarter into the future?
Interpolation can be reasonable based on these factors...but don't leak information about the future to your model at training time.
Do they relate to subsets of your features?
If so, some feature engineering could be fun: combine the external regressor's output with your other data in meaningful ways.
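If you do decide to interpolate, a minimal sketch of feeding the interpolated series to Prophet as a regressor is below. The file names and the 'indicator' column are hypothetical, the import is for prophet >= 1.0 (older versions use from fbprophet import Prophet), and carrying the last known regressor value forward into the forecast horizon is a placeholder assumption.

import pandas as pd
from prophet import Prophet

monthly = pd.read_csv('monthly_series.csv', parse_dates=['ds'])        # columns: ds, y (month starts)
quarterly = pd.read_csv('quarterly_regressor.csv', parse_dates=['ds']) # columns: ds, indicator

# upsample the quarterly regressor to monthly via linear interpolation
reg = (quarterly.set_index('ds')['indicator']
       .resample('MS')
       .interpolate('linear')
       .reset_index())

df = monthly.merge(reg, on='ds', how='left')

m = Prophet()
m.add_regressor('indicator')
m.fit(df)

# the regressor must also be supplied for the forecast months; here the last
# known value is carried forward as a placeholder
future = m.make_future_dataframe(periods=3, freq='MS')
future = future.merge(reg, on='ds', how='left').ffill()
forecast = m.predict(future)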

How to Use Lagged Time-Series Variables in a Python Pandas Regression Model?

I'm creating time-series econometric regression models. The data is stored in a Pandas data frame.
How can I do lagged time-series econometric analysis using Python? I have used Eviews in the past (which is a standalone econometric program i.e. not a Python package). To estimate an OLS equation using Eviews you can write something like:
equation eq1.ls log(usales) c log(usales(-1)) log(price(-1)) tv_spend radio_spend
Note the lagged dependent and lagged price terms. It's these lagged variables that seem to be difficult to handle in Python, e.g. with scikit-learn or statsmodels (unless I've missed something).
Once I've created a model I'd like to perform tests and use the model to forecast.
I'm not interested in doing ARIMA, Exponential Smoothing, or Holt Winters time-series projections - I'm mainly interested in time-series OLS.
pandas allows you to shift your data without moving the index, such as
df.shift(1)
which lags the values by one index (each row gets the previous row's value), or
df.shift(-1)
which leads the values by one index (each row gets the next row's value).
So if you have a daily time series, you could use df.shift(1) to create a 1-day lag in your price values, such as
df['lagprice'] = df['price'].shift(1)
After that, if you want to do a simple single-regressor OLS, you can look at the scipy module here:
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
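If you need several regressors, like the Eviews equation above, the statsmodels formula API is a closer match. A minimal sketch, where the file name and the 'date' column are assumptions and the other column names mirror the Eviews example:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('sales.csv', parse_dates=['date'], index_col='date')  # hypothetical file

df['log_usales'] = np.log(df['usales'])
df['log_usales_lag1'] = df['log_usales'].shift(1)     # log(usales(-1))
df['log_price_lag1'] = np.log(df['price']).shift(1)   # log(price(-1))

model = smf.ols('log_usales ~ log_usales_lag1 + log_price_lag1 + tv_spend + radio_spend',
                data=df.dropna())                     # drop the first row lost to lagging
res = model.fit()
print(res.summary())                                  # coefficient tests and diagnostics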
