Predict Sales as Counterfactual for Experiment

Predict Sales as Counterfactual for Experiment - python

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model I'm currently considering 2 options (static time frame and rolling window) - let me know what you think. 1. Static: Use the same period last year as the dependent variable to build a specific model for this particular 3 month time frame. Data of 12 months before are used to generate features. 2. Use a rolling window logic of 3 months, dynamically creating dependent time frames and features. Not yet sure what the benefit of that would be. It uses more recent data for the model creation but feels less specific because it uses any 3 month period in a year as dependent. Not sure what the theory says for this particular example. Experiences, thoughts?
Features - currently building features per customer from one year pre-period data, e.g. Sales in individual months, 90,180,365 days prior, max/min/avg per customer, # of months with sales, tenure, etc. Takes quite a lot of time - any libraries/packages you would recommend for this?
Modelling technique - currently considering GLM, XGboost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns of the 3 month window. As a first stab, a single number, total predicted sales of customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.
Thanks,
F

Related

SARIMA with daily data - determine saisonal period

i have 30 years for daily temperature data and i have a problem with determinate the saisonal period , i cant do it with lag = 365 day python is implemented
so can i use 90 day ( 1 season) . I don't know does it make sense or not

A seasonality of 365 does not really make sense for three reasons:
Such a long seasonality often leads to long runtimes or as in your case the process breaks.
When you use exactly 365 days you are ignoring leap years, so you model distorts over longer periods (30 years for examples results a distortion of 7 days already)
The most important reason is the correlation itself. A seasonality of 365 would mean that the weather today is somehow correlated to the weather last year, which is not really the case. Of course, both days will be somewhat close to another due to meteorological seasons, but I would not rely on this correlation. Imagine last year was relatively cold and this year is relatively warm. Do you use the last days immediately before today or would you base your forecast on some day a year ago? That is not very reliable.
Your forecast has to be based on the last immediate lags before the actual day – not the days of last year. However, to improve your model you have to factor in the meteorological seasons (spring, summer, fall, winter). There are multiple ways to do so. One for example would be to use SARIMAX and pass the model an exogenous feature, e.g. the monthly average temperature or a moving average over the last years. That way you can teach the model that days in summer are generally hotter than in winter, and for the precise prediction you use the days immediately before.
See here for the documentation of statsmodels SARIMAX:
https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
There are plenty of examples on how to create these models with exogenous variables on the web.

How to include year fixed effect (in a daily panel data)

I am working on a panel dataset that includes daily stock returns of 450 firms for 5 years and daily ESG score(momentum based) for 5 years. I want to regress stock return on daily ESG scores, keeping Firm and year fixed effect. I have used linearmodels.panel function in python and set the index('Stock ticker", "Date") before running the regressions with entity and time effects. In the regression result, the number of entities shows 450, which is perfect but the time period shows 1800. I am wondering how python is capturing the time effects? Is it based on year or some other way? What I want is a year fixed effects, where for a particular year all firm will have same indicator variable. Can someone please help me to do it in the right way?
the image shows the format of the data, where panel is based on daily returns

Sounds like your model is capturing daily fixed effects instead of yearly fixed effects. This is happening because you set Date as an index, so you're telling Python that you want one fixed effect per date.
You have to create a new column that only contains the year. That is, convert the date column to datetime format (see pandas.to_datetime) and then:
# Extract year from Date
df['Year'] = pd.DatetimeIndex(df['Date']).year
# Set indices
df = df.set_index(['Ticker','Year'])
Then run your model.
I recommend using linearmodels.PanelOLS because that module is specifically made for fitting fixed effects models.
For future reference, post your code and a replicable example so we can help you out more easily.

Gathering data for machine learning (finance data)

So larger macro question here; I'm working on a machine learning model to predict long term performance of stocks using NLP of financial reports and different data from each yearly financial report. So the thing that I am wondering is what to do about different sized data.
So for example I can only get data 27 years back for one company but 100 years back for another. So I guess my question is how can I set it up so that it trains on each company as a single instance? When the ideal epoch size is not going to be constant as the amount of data per company is going to vary?
So the one thing I thought of is to standardize it, and give every company say 300 years of data, and autofill the data that doesn't exist with a impossible value that the model can learn represents no data, so that way if I set the epoch size to 300 it will see one/use a sliding window on one company.
So just wondering if this is a good solution or if there are better solutions out there. Thanks!

Predict sales with Python

I am a beginner in Python programming and machine learning.
I have a dataset with sales per product on monthly level.
The dataset has data from 2015 up till 2019.
With the help of Python I would like to make a prediction model that predicts the sales of the next month.
I followed this tutorial:
Sales prediction
This gave me a prediction of the last 6 months and lined them up with the actual sales, I managed to gat a pretty accurate prediction but my problem is that I need the predictions per product and if possible I would like to get the influence of weather in there aswell. For example if the weather data would be rainy it has to take this into account.
Does anyone know a way of doing this?
Every tip on which model to use or article to read is much appreciated!!

The most basic way of doing that would be to run an ARIMA model with external regressors (the weather measured in terms of temperature, humidity or any other feature that is expected to influence the monthly sales).
What is important is that before fitting the model, the sales data had better be transformed into log monthly changes by something like np.log(df.column).diff().
I've recently written a Python micro-package salesplansuccess, which deals with prediction of the current (or next) year's annual sales for individual products from historic monthly sales data. But a major assumption for that model is a quarterly seasonality (more specifically a repeating drift from the 2nd to the 3rd month in each quarter), which is more characteristic for wholesalers (who strives to achieve their formal quarterly sales goals by pushing sales at the end of each quarter).
The package is installed as usual with pip install salesplansuccess.
You can modify its source code for it to better fit your needs. It uses both ARIMA (maximum likelihood estimates) and linear regression (least square estimates) technics under the hood.
The minimalistic use case is below:
import pandas as pd
from salesplansuccess.api import SalesPlanSuccess
myHistoricalData = pd.read_excel('myfile.xlsx')
myAnnualPlan = 1000
sps = SalesPlanSuccess(data=myHistoricalData, plan=myAnnualPlan)
sps.fit()
sps.simulate()
sps.plot()
For more detailed illustration of its use, you may want to refer to a Jupyter Notebook illustration file at its GitHub repository.

How to pre-process transactional data to predict probability to buy?

I'm working on a model for a departament store that uses data from previous purchases to predict the customer's probability to buy today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the purchase history of the customers in Q1, Q2 and Q3 2017 to predict the probability to buy in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a photo of how customers acted last year to predict what will they do at the end of this year (which is unknown).
Data is separated by trimester, a co-worker sugested this is not correct, because I'm unintentionally giving greater weight to the indicators splitting each one in 4, when they should only be one per category.
Alternative:
Another aproach I was sugested was to use two indicators per category: Ex.'bought_in_category_A' and 'days_since_bought_A'. For me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what will happen if the customer never bought A? I cannot use a 0 since that will imply customers who never bought are closer to customers who just bought a few days ago.
Questions:
Is this structure ok or would you structure the data in another way?
Is it ok to use information from last year in this case?
Is it ok to 'split' a cateogorical variable into several binary variables? does this affect the importance given to that variable?

Unfortunately, you need a different approach in order to achieve predictive analysis.
For example the products' properties are unknown here (color, taste,
size, seasonality,....)
There is no information about the customers
(age, gender, living area etc...)
You need more "transactional"
information, (when, why - how did they buy etc......)
What is the products "lifecycle"? Does it have to do with fashion?
What branch are you in? (Retail, Bulk, Finance, Clothing...)
Meanwhile have you done any campaign? How will this be measured?
I would first (if applicable) concetrate on the categories relations and behaviour for each Quarter:
For example When n1 decreases then n2 decreases
when q1 is lower than q2 or q1/2016 vs q2/2017.
I think you should first of all, work this out with a business analyst in order to to find out the right "rules" and approach.
I do no think you could get a concrete answer with these generic-assumed data.
Usually you need data from at least 3-5 recent years to do some descent predictive analysis, depending of course, on the nature of your product.
Hope, this helped a bit.
;-)
-mwk

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.