How to pre-process transactional data to predict probability to buy? - python

I'm working on a model for a department store that uses data from previous purchases to predict the customer's probability to buy today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the purchase history of the customers in Q1, Q2 and Q3 2017 to predict the probability to buy in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a snapshot of how customers acted last year to predict what they will do at the end of this year (which is unknown).
Data is separated by quarter. A co-worker suggested this is not correct, because I'm unintentionally giving greater weight to the indicators by splitting each one in 4, when there should only be one per category.
Another approach that was suggested to me is to use two indicators per category, e.g. 'bought_in_category_A' and 'days_since_bought_A'. To me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what will happen if the customer never bought A? I cannot use a 0, since that would imply that customers who never bought are close to customers who bought just a few days ago.
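A minimal sketch of that two-indicators-per-category layout, assuming a hypothetical transaction log with columns customer_id, category and date (none of these names come from the real data). It leaves days_since_bought_X missing (NaN) for customers who never bought X instead of using 0, and keeps the bought_in_category_X flag alongside it. Whether NaN can be passed to the model directly depends on the algorithm, so treat this as one possible layout, not a recommendation:

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase (made-up data).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "category": ["A", "B", "A", "C"],
    "date": pd.to_datetime(["2017-02-10", "2017-08-01", "2017-09-20", "2017-05-05"]),
})
snapshot_date = pd.Timestamp("2017-09-30")  # end of Q3 2017

# Most recent purchase per customer and category, as days before the snapshot.
last_purchase = transactions.groupby(["customer_id", "category"])["date"].max()
days_since = (snapshot_date - last_purchase).dt.days.unstack("category")

# Customers who never bought a category end up with NaN, not 0.
bought = days_since.notna().astype(int).add_prefix("bought_in_category_")
days_since = days_since.add_prefix("days_since_bought_")
features = pd.concat([bought, days_since], axis=1)
print(features)
```

With this layout, a customer who never bought A has bought_in_category_A = 0 and a missing days_since_bought_A, so they are not placed next to customers who bought a few days ago.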
Is this structure ok or would you structure the data in another way?
Is it ok to use information from last year in this case?
Is it ok to 'split' a categorical variable into several binary variables? Does this affect the importance given to that variable?

Unfortunately, you need a different approach in order to do predictive analysis.
For example, the products' properties are unknown here (color, taste, size, seasonality, ...).
There is no information about the customers (age, gender, living area, etc.).
You need more "transactional" information (when, why, and how they bought, etc.).
What is the product "lifecycle"? Does it have to do with fashion?
What branch are you in? (Retail, Bulk, Finance, Clothing...)
Meanwhile, have you run any campaigns? How will this be measured?
I would first (if applicable) concentrate on the relations between the categories and on their behaviour in each quarter:
For example, when n1 decreases, n2 decreases as well,
when q1 is lower than q2, or when comparing q1/2016 vs q2/2017.
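As a rough illustration of looking at category relations per quarter, the sketch below correlates quarter-over-quarter changes between categories. The counts are made up purely for the example, and the table layout (one row per quarter, one column per category) is an assumption:

```python
import pandas as pd

# Hypothetical quarterly purchase counts per category (made-up numbers).
counts = pd.DataFrame(
    {"A": [120, 100, 90, 130], "B": [60, 50, 45, 70], "C": [30, 35, 40, 25]},
    index=["2017Q1", "2017Q2", "2017Q3", "2017Q4"],
)

# Quarter-over-quarter change for each category.
changes = counts.diff().dropna()

# Pairwise correlation of those changes: values near +1 mean two categories
# tend to move together, values near -1 mean they move in opposite directions.
corr = changes.corr()
print(corr.round(2))
```

With only a handful of quarters such correlations are very noisy, which is one more reason to collect a longer history first.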
I think you should, first of all, work this out with a business analyst in order to find the right "rules" and approach.
I do not think you can get a concrete answer with such generic, assumed data.
Usually you need data from at least the 3-5 most recent years to do decent predictive analysis, depending, of course, on the nature of your product.
Hope this helped a bit.


Using multivariable LSTM to predict only certain values

Okay, so, the question might be a bit tricky.
For a project I'm working on, I'm supposed to predict sales values from a store for certain products. Easy enough: I've built two functional models that, by analyzing the sales of a single product over the past 10 years, are capable of predicting its future sales.
However, here's where it gets complicated:
My dataframe looks something like this:
df={month : [...], id : [...], n_sales : [...], group : [...], brand : [...]}
Id refers to the product, whereas group refers to the type of product and brand is just the brand.
It's important to understand that a single id has only one group and one brand, whereas a group or a brand can have multiple different ids.
Finally, my data is organized by month (ascendant) and by ID (also ascendant).
Meaning that, let's say the store has 50 products (50 id's).
Then the first 50 rows of my dataset would be:
2012-01-01 | 1 ......
2012-01-01 | 2 ......
2012-01-01 | 3 ......
2012-01-01 | 50 .....
Then the next 50 rows would be the respective sales of each product for the month 2012-02-01 and so on until now.
I'm sorry if this is confusing, I'm trying to explain it as clear as I can.
Okay, I'm almost done. It's understandable that, if I isolate a single product, it would be easy to analyze the data.
I could just plot the sales from the known months alongside the sales from the prediction.
However, in order to make a more accurate prediction, I was asked to run a multivariable LSTM model, meaning that I have to take into account both group and brand. This, of course, means training my model with all the data from all the products. This is better understood with an example:
Let's say a new ice cream from Nestle was just created last November. Only analyzing the sales from that ice cream could not predict that the sales in summer will go up, since the only data the model would have is the few sales made in the cold months.
Nonetheless, if I analyze all the products, LSTM would know that, products from Nestle sell considerably more in summer and would take this into account when making the prediction for this new product.
And there's the problem. So now, getting to the question: how can I analyze all the data from all the products, but only get the predictions for a single id?
Note: It has to be with LSTM, other models aren't an option.
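One way to picture the setup: build the input windows for every product, keep track of which id each window belongs to, train on all of them, and at prediction time pass in only the windows of the id of interest. The sketch below covers only the data preparation with pandas and NumPy. The helper make_windows, the toy data and the one-hot encoding of group and brand are assumptions for illustration, and the LSTM itself is left out:

```python
import pandas as pd
import numpy as np

def make_windows(df, n_steps, feature_cols, target_col="n_sales"):
    """Build (samples, n_steps, n_features) windows separately for each id."""
    X, y, ids = [], [], []
    for product_id, g in df.sort_values("month").groupby("id"):
        values = g[feature_cols].to_numpy(dtype=float)
        target = g[target_col].to_numpy(dtype=float)
        for start in range(len(g) - n_steps):
            X.append(values[start:start + n_steps])  # n_steps months of history
            y.append(target[start + n_steps])        # sales of the following month
            ids.append(product_id)
    return np.array(X), np.array(y), np.array(ids)

# Toy data in the layout described above: 3 products, 6 months (made-up values).
months = pd.date_range("2012-01-01", periods=6, freq="MS")
df = pd.DataFrame({
    "month": months.repeat(3),
    "id": np.tile([1, 2, 3], 6),
    "n_sales": np.arange(18, dtype=float),
    "group": np.tile(["ice cream", "ice cream", "cereal"], 6),
    "brand": np.tile(["Nestle", "Other", "Nestle"], 6),
})

# One-hot encode group and brand so every time step carries them as features.
encoded = pd.get_dummies(df, columns=["group", "brand"], dtype=float)
feature_cols = ["n_sales"] + [
    c for c in encoded.columns if c.startswith(("group_", "brand_"))
]

X, y, ids = make_windows(encoded, n_steps=3, feature_cols=feature_cols)

# Train on all windows (X, y); predict only with the windows of one product.
X_single = X[ids == 3]
print(X.shape, X_single.shape)
```

The model is fitted on X and y from all products, so it can pick up patterns shared by a group or brand, while X_single restricts the output to the one id you want to report.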
And to anyone making it this far, even if you are not able to help, thank you for reading such a mess!

how to analyze numerical and categorical variables at the same time?

I'm trying to analyze the data of a food ordering application.
The data consists of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it. I want to find out which variables affect it the most.
an example of rows in the data is the following:
order id
branch id
time placed
items id
no. items
total no. items
total delivery time
total time in seconds
I want to study the effects of all the variables on the total time, even items id and branch id. Does a certain item affect the time? Do the day and the period of the day affect it as well?
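To look at the day and the period of the day, both can be derived from the time placed column. A small sketch, assuming hypothetical column names (order_id, time_placed, total_time_seconds), made-up rows and an arbitrary choice of four periods:

```python
import pandas as pd

# Hypothetical orders (made-up values).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "time_placed": pd.to_datetime([
        "2021-03-01 08:15", "2021-03-01 13:40",
        "2021-03-06 20:05", "2021-03-06 23:30",
    ]),
    "total_time_seconds": [1500, 2400, 3100, 1800],
})

# Day of the week as a label.
orders["day_of_week"] = orders["time_placed"].dt.day_name()

# Period of the day from the hour; the bin edges are an arbitrary assumption.
orders["period"] = pd.cut(
    orders["time_placed"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# Average delivery time per period, using only the periods that occur.
mean_by_period = orders.groupby("period", observed=True)["total_time_seconds"].mean()
print(mean_by_period)
```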
I used linear regression to get the correlation between total time and the numerical variables, and tried one-way ANOVA for some categorical variables, but I didn't like the results. Is there a way to analyze all the variables together without encoding the categorical variables?
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algos like regression love numbers. ML algos like classification love labels (non-numbers). You can certainly convert labeled data to 'numbered' data. One example is coding ['red','green','blue'] as [1,2,3], which would produce weird things like 'red' being lower than 'blue', and if you average a 'red' and a 'blue' you will get a 'green'. Another, more subtle example might happen when you code ['low', 'medium', 'high'] as [1,2,3]. In the latter case the ordering may make sense; however, some subtle inconsistencies might arise when 'medium' is not exactly in the middle of 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, it isn't using large, medium, and small to do its analysis, it's converting those categories to numbers. I think. Maybe someone can confirm this for me.
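A short sketch of the difference, using pandas. The integer mapping is the arbitrary [1,2,3] coding described above, one-hot encoding is shown as a common alternative that imposes no order, and an ordered categorical is shown for the low/medium/high case:

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red"], name="color")

# Arbitrary integer coding: implies red < green < blue.
as_integers = colors.map({"red": 1, "green": 2, "blue": 3})
# Averaging a 'red' (1) and a 'blue' (3) gives 2, i.e. a 'green'.
average_red_blue = (as_integers[0] + as_integers[2]) / 2

# One-hot coding: one 0/1 column per label, no order implied.
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

# For labels that do have an order, an ordered categorical keeps it explicit.
levels = pd.Categorical(
    ["low", "high", "medium"], categories=["low", "medium", "high"], ordered=True
)
level_codes = levels.codes  # positions within the declared order

print(one_hot)
print(list(level_codes))
```

Note that the ordered codes still assume equal spacing between levels, which is exactly the subtle inconsistency mentioned above.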
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But, is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation between these same things. Let's say you find a strong relationship between multiple projects that start on the second Monday of the month and all of these projects get finished off much faster than all other projects. This seems like pure coincidence, rather than causation. Or, there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades, rather than full-blown new undertakings, so the volume of work is less, and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished off faster. Tell me if I am wrong. I'm always open to feedback.

Predict Sales as Counterfactual for Experiment

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model - I'm currently considering 2 options (static time frame and rolling window). Let me know what you think.
1. Static: use the same period last year as the dependent variable to build a specific model for this particular 3-month time frame. Data from the 12 months before are used to generate features.
2. Rolling: use a rolling-window logic of 3 months, dynamically creating dependent time frames and features. I'm not yet sure what the benefit of that would be. It uses more recent data for the model creation but feels less specific, because it uses any 3-month period in a year as the dependent variable. I'm not sure what the theory says for this particular example. Experiences, thoughts?
Features - currently building features per customer from one year of pre-period data, e.g. sales in individual months, sales in the 90, 180 and 365 days prior, max/min/avg per customer, # of months with sales, tenure, etc. This takes quite a lot of time - any libraries/packages you would recommend for this?
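For reference, features of this kind can be sketched with plain pandas group-bys. The function customer_features, the column names and the toy rows below are hypothetical, and tenure is taken here to mean days since the first purchase in the data, which may differ from the actual definition:

```python
import pandas as pd

def customer_features(sales, cutoff, windows=(90, 180, 365)):
    """Aggregate pre-period sales per customer; all names are illustrative."""
    pre = sales[sales["date"] < cutoff]
    days_before = (cutoff - pre["date"]).dt.days
    by_customer = pre.groupby("customer_id")["amount"]
    features = pd.DataFrame({
        "max_amount": by_customer.max(),
        "avg_amount": by_customer.mean(),
    })
    # Total sales in the last 90 / 180 / 365 days before the cutoff.
    for w in windows:
        window_sum = pre[days_before <= w].groupby("customer_id")["amount"].sum()
        features[f"sales_{w}d"] = window_sum.reindex(features.index, fill_value=0.0)
    # Number of distinct calendar months with at least one sale.
    months = pre.assign(month=pre["date"].dt.to_period("M"))
    features["months_with_sales"] = months.groupby("customer_id")["month"].nunique()
    # Assumed tenure definition: days since the first observed purchase.
    first_purchase = pre.groupby("customer_id")["date"].min()
    features["tenure_days"] = (cutoff - first_purchase).dt.days
    return features

# Made-up sales rows for two customers.
sales = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2022-01-15", "2022-06-10", "2022-12-01", "2022-11-20", "2022-12-15"]
    ),
    "amount": [100.0, 50.0, 25.0, 80.0, 20.0],
})
features = customer_features(sales, cutoff=pd.Timestamp("2023-01-01"))
print(features)
```

Calling the same function with different cutoff dates would give the rolling-window variant, one feature table per dependent period.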
Modelling technique - currently considering GLM, XGBoost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns of the 3 month window. As a first stab, a single number, total predicted sales of customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.

Can training and evaluation sets be the same in predictive analytics?

I'm creating a model to predict the probability that customers will buy product A in a department store that sells products A through Z. The store has its own credit card, with demographic and transactional information for 140,000 customers.
There is a subset of customers (say 10,000) who currently buy A. The goal is to learn from these 10,000 customers and score the remaining 130,000 with their probability to buy A, then target the ones with the highest scores with marketing campaigns to increase A sales.
How should I define my training and evaluation sets?
Training set:
Should it be only the 10,000 who bought A or the whole 140k customers?
Evaluation set: (where the model will be used in production)
I believe this should be the 130k who haven't bought A.
The question about time:
Another alternative is to take a snapshot of the database as it was last year, use it as a training set, then take the database as it is today and score all customers with the model created from last year's info. Is this correct? When should I do this?
Which option is correct for all sets?
The training set and the evaluation set must be different. The whole point of having an evaluation set is to guard against over-fitting.
In this case, what you should do is take, say, 100,000 customers picked at random. Then use that data to try to learn what it is about customers that makes them likely to purchase A. Then use the remaining 40,000 to test how well your model works.
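The random split can be sketched with the standard library alone. The function name split_customers and the use of integer customer ids are assumptions for the example:

```python
import random

def split_customers(customer_ids, n_train, seed=0):
    """Shuffle the ids and cut them into a training part and a test part."""
    ids = list(customer_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    return ids[:n_train], ids[n_train:]

# 140,000 hypothetical customer ids: 100,000 for training, 40,000 for testing.
train_ids, test_ids = split_customers(range(140_000), n_train=100_000)
print(len(train_ids), len(test_ids))
```

Both parts contain buyers and non-buyers of A, so the model sees positive and negative examples and is then checked on customers it has never seen.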

Design pattern for ongoing survey analysis

I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).
Then we get them to answer six short questions with 5 possible values: much worse, worse, same, better, much better.
Of course, over time we will not get the same participants: some will drop out and some new ones will sign up. So I'm trying to decide how best to build a DB and code (hoping to use Python, NumPy?) to allow for ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big.
The data looks like this: demographics (UID, north/south, residential/commercial), then the answers to the 6 questions for Q1.
Same for Q2 and so on. Then we need to be able to slice, dice, and average the values of the quarterly answers by the various demographics to see trends over time.
The averaging, grouping and so forth is modestly complicated by having differing participants each quarter.
Any pointers to design patterns for this sort of DB? and analysis? Is this a sparse matrix?
Regarding the survey analysis portion of your question, I would strongly recommend looking at the survey package in R (which includes a number of useful vignettes, including "A survey analysis example"). You can read about it in detail on the webpage "survey analysis in R". In particular, you may want to have a look at the page entitled database-backed survey objects which covers the subject of dealing with very large survey data.
You can integrate this analysis into Python with RPy2 as needed.
This is a Data Warehouse. Small, but a data warehouse.
You have a Star Schema.
You have Facts:
response values are the measures
You have Dimensions:
time period. This has many attributes (year, quarter, month, day, week, etc.) This dimension allows you to accumulate unlimited responses to your survey.
question. This has some attributes. Typically your questions belong to categories or product lines or focus or anything else. You can have lots of question "category" columns in this dimension.
participant. Each participant has unique attributes and reference to a Demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.
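A minimal sketch of such a star schema using Python's built-in sqlite3 module. The table names, column names and sample rows are all made up for illustration. Participant 2 drops out after the first quarter and participant 3 joins in the second, which the fact table handles without any special treatment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_period (period_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE dim_question (question_id INTEGER PRIMARY KEY, text TEXT, category TEXT);
CREATE TABLE dim_participant (participant_id INTEGER PRIMARY KEY, region TEXT, property_type TEXT);
-- One fact row per participant, question and period.
CREATE TABLE fact_response (
    period_id INTEGER REFERENCES dim_period(period_id),
    question_id INTEGER REFERENCES dim_question(question_id),
    participant_id INTEGER REFERENCES dim_participant(participant_id),
    response INTEGER  -- 1 = much worse ... 5 = much better
);
""")
conn.executemany("INSERT INTO dim_period VALUES (?, ?, ?)", [(1, 2024, 1), (2, 2024, 2)])
conn.execute("INSERT INTO dim_question VALUES (1, 'Hypothetical question', 'general')")
conn.executemany(
    "INSERT INTO dim_participant VALUES (?, ?, ?)",
    [(1, "North", "residential"), (2, "South", "commercial"), (3, "North", "commercial")],
)
conn.executemany(
    "INSERT INTO fact_response VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 4), (1, 1, 2, 2), (2, 1, 1, 5), (2, 1, 3, 3)],
)

# Average response by quarter and region: join the facts to their dimensions.
rows = conn.execute("""
    SELECT p.year, p.quarter, d.region, AVG(f.response)
    FROM fact_response f
    JOIN dim_period p USING (period_id)
    JOIN dim_participant d USING (participant_id)
    GROUP BY p.year, p.quarter, d.region
    ORDER BY p.year, p.quarter, d.region
""").fetchall()
print(rows)
```

Slicing by another demographic is just a different column in the GROUP BY, and a new quarter only adds rows, never columns.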
Buy Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns.
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.
Also, since you're doing Data Warehousing, look at all the [Data Warehousing] questions on Stack Overflow. Read every Data Warehousing blog you can find.
There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.
On the analysis, if your six questions have been posed in a way that would lead you to believe the answers will be correlated, consider conducting a factor analysis on the raw scores first. Often comparing the factors across regions or customer types has more statistical power than comparing across questions alone. Also, the factor scores are more likely to be normally distributed (they are the weighted sum of 6 observations), while the answers to the six questions alone would not be. This allows you to apply t-tests based on the normal distribution when comparing factor scores.
One watchout, though. If you assign numeric values to answers - 1 = much worse, 2 = worse, etc. you are implying that the distance between much worse and worse is the same as the distance between worse and same. This is generally not true - you might really have to screw up to get a vote of "much worse" while just being a passive screw up might get you a "worse" score. So the assignment of cardinal (numerics) to ordinal (ordering) has a bias of its own.
The unequal number of participants per quarter isn't a problem - there are statistical t-tests that deal with unequal sample sizes.
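Welch's t-test is one such test: it does not require equal sample sizes or equal variances. A small standard-library sketch of the statistic and its approximate degrees of freedom follows. The function name welch_t and the two samples are made up for illustration:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical scores from two quarters with different numbers of participants.
q1_scores = [3, 4, 5, 4, 4]
q2_scores = [2, 3, 2]
t, dof = welch_t(q1_scores, q2_scores)
print(round(t, 3), round(dof, 2))
```

Turning t and dof into a p-value needs the t distribution. If SciPy is available, scipy.stats.ttest_ind with equal_var=False performs the same test and reports the p-value.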
