Can training and evaluation sets be the same in predictive analytics? - python

I'm creating a model to predict the probability that customers will buy product A in a department store that sells products A through Z. The store has its own credit card, with demographic and transactional information on 140,000 customers.
There is a subset of customers (say 10,000) who currently buy A. The goal is to learn from these 10,000 customers and score the remaining 130,000 with their probability of buying A, then target the ones with the highest scores with marketing campaigns to increase sales of A.
How should I define my training and evaluation sets?
Training set:
Should it be only the 10,000 who bought A or the whole 140k customers?
Evaluation set: (where the model will be used in production)
I believe this should be the 130k who haven't bought A.
The question about time:
Another alternative is to take a snapshot of the database as it was last year, use it as the training set, then take the database as it is today and score all customers with the model built from last year's data. Is this correct? When should I do this?
Which option is correct for all sets?

The training set and the evaluation set must be different. The whole point of having an evaluation set is to guard against over-fitting.
In this case, take, say, 100,000 customers picked at random, and use their data to learn what it is about customers that makes them likely to purchase A. Then use the remaining 40,000 to test how well your model works.
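The random split described above could be sketched roughly as follows. This is only an illustration using the standard library: `split_customers` and the integer ids are hypothetical stand-ins for the real customer table, and each customer would still need a label saying whether they bought A.

```python
import random


def split_customers(customer_ids, n_train, seed=0):
    """Shuffle the ids and return (train_ids, test_ids) with no overlap."""
    ids = list(customer_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Hypothetical ids 0..139999 standing in for the 140,000 cardholders.
all_ids = range(140_000)
train_ids, test_ids = split_customers(all_ids, n_train=100_000)
```

Because the split is random, both sets should contain buyers and non-buyers of A in roughly the same proportion as the full customer base.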

Related

How can we make use of feature variables whose future values are fixed to predict target value?

This question is about time series features in a regression ML model.
Suppose we are living in a space colony. The temperature there is accurately controlled, so we already know what the temperature will be next week.
Now, I have the problem of predicting ice cream sales for next week. The features are past sales records and temperature values.
In this case, I believe that the fixed temperature for next week will help raise the accuracy of the sales prediction, but I cannot work out how to use it. I would split the past data into training/validation datasets with train_test_split() as always, but I do not know how to handle the fixed future values.
Does anybody know how to do this?
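One possible way to frame this, shown only as a sketch: align each week's sales (the target) with the previous week's sales and that same week's temperature, so the known future temperature becomes an ordinary feature at prediction time. The helper `build_rows` and all numbers below are made up for illustration.

```python
def build_rows(sales, temps):
    """Pair each week's sales (target) with last week's sales and the same week's temperature."""
    rows = []
    for t in range(1, len(sales)):
        features = [sales[t - 1], temps[t]]  # lagged sales + temperature of the target week
        rows.append((features, sales[t]))
    return rows


# Made-up illustrative history: four past weeks of sales and temperatures.
past_sales = [120, 135, 150, 140]
past_temps = [20, 22, 25, 23]
next_week_temp = 26  # assumed to be known in advance, as in the space colony

train_rows = build_rows(past_sales, past_temps)

# Feature vector for the week to be predicted, in the same layout as the training rows.
next_week_features = [past_sales[-1], next_week_temp]
```

The rows in `train_rows` can then be split and fed to whichever regressor is being used, and `next_week_features` is what gets passed to it for the forecast.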

How to randomise the rebalancing of a dataset

In Python I am trying to rebalance a dataset which contains approximately 4000 transactions for a single credit card number, all ordered by time.
There is a large class imbalance between genuine and fraudulent transactions, and this data only contains about 15 fraudulent transactions, which all occurred within two days.
Naturally, I want to rebalance the dataset. However, when I did this using SMOTE, I noticed that there are now approximately 4000 additional synthetic fraudulent transactions that occur during the exact same two days as the original fraudulent transactions.
Is there any way to generate synthetic fraudulent transactions which are more randomised than this?

Predict Sales as Counterfactual for Experiment

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model - I'm currently considering 2 options (static time frame and rolling window); let me know what you think.
1. Static: Use the same period last year as the dependent variable to build a specific model for this particular 3 month time frame. Data of 12 months before are used to generate features.
2. Rolling window: Use a rolling window logic of 3 months, dynamically creating dependent time frames and features. Not yet sure what the benefit of that would be. It uses more recent data for the model creation but feels less specific because it uses any 3 month period in a year as dependent.
Not sure what the theory says for this particular example. Experiences, thoughts?
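To make the difference between the two options concrete, here is a rough sketch of the rolling window logic on aggregate monthly sales. The function `rolling_windows` and the numbers are hypothetical; the static option corresponds to keeping just one of these samples (the one whose target is the same 3 months last year).

```python
def rolling_windows(monthly_sales, feature_months=12, target_months=3):
    """Return (feature_window, target_total) pairs for every valid window start."""
    samples = []
    last_start = len(monthly_sales) - feature_months - target_months
    for start in range(last_start + 1):
        split = start + feature_months
        features = monthly_sales[start:split]  # 12 months of pre-period sales
        target = sum(monthly_sales[split:split + target_months])  # next 3 months, total
        samples.append((features, target))
    return samples


# Made-up monthly totals for 18 months.
monthly_sales = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20, 22, 24, 23, 25, 27, 26]
samples = rolling_windows(monthly_sales)
```

With 18 months of history this yields four overlapping training samples instead of one, which is the main practical gain of the rolling approach; the cost is that the targets cover different parts of the year.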
Features - currently building features per customer from one year of pre-period data, e.g. sales in individual months, sales in the 90, 180 and 365 days prior, max/min/avg per customer, # of months with sales, tenure, etc. This takes quite a lot of time - any libraries/packages you would recommend for this?
Modelling technique - currently considering GLM, XGBoost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns of the 3 month window. As a first stab, a single number, total predicted sales of customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.
Thanks,
F

How to test your prediction for insurance company dataset?

I have a dataset of an insurance company for my data science class project. My ultimate business objective in this project is to sell more policies to existing customers/customer segments.
First, I want to cluster my customers using a k-means model on RFM scores, then use the apriori algorithm to find association rules among these clusters. After that, I can find which customers/customer segments I can sell more products to. However, my teacher wants me to test my predictions and said that, since the policies are repeated every year, I cannot split my data with, for example, the last 3 months as the test data set and the other 9 months as the training data. To sum up, he wants me to test my predictions in a more accurate way. How can I test my predictions in this specific case?
My data set does not include much demographic information about customers, such as age, income, or education, so I want to use RFM scores, since I know all of the customers' purchasing records. Columns include which policy type they purchased, when they purchased it, which company they purchased it with, the price of the purchase, the region they purchased in, telephone numbers, addresses, mail addresses, etc.
Insurance types are life, car, traffic, residence insurance, fire etc.

How to pre-process transactional data to predict probability to buy?

I'm working on a model for a department store that uses data from previous purchases to predict the customer's probability to buy today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the purchase history of the customers in Q1, Q2 and Q3 2017 to predict the probability to buy in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a snapshot of how customers acted last year to predict what they will do at the end of this year (which is unknown).
Data is separated by quarter; a co-worker suggested this is not correct, because I'm unintentionally giving greater weight to the indicators by splitting each one into 4, when there should only be one per category.
Alternative:
Another approach that was suggested to me was to use two indicators per category, e.g. 'bought_in_category_A' and 'days_since_bought_A'. To me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what happens if the customer never bought A? I cannot use a 0, since that would imply that customers who never bought are close to customers who bought just a few days ago.
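As a rough illustration of the two-indicator alternative, the sketch below computes both values for one customer and one category, and marks "never bought" as a missing value instead of 0. The function `category_indicators` and the dates are hypothetical, and how the missing value is handled afterwards (imputation, a large sentinel, or a model that accepts missing values) is left open.

```python
from datetime import date


def category_indicators(purchase_dates, as_of):
    """Return a bought flag and days since last purchase (None if never bought)."""
    past = [d for d in purchase_dates if d <= as_of]
    if not past:
        # Never bought: keep the recency undefined rather than encoding it as 0.
        return {"bought": 0, "days_since_bought": None}
    return {"bought": 1, "days_since_bought": (as_of - max(past)).days}


as_of = date(2017, 9, 30)  # end of Q3 2017, the last day of known history
recent = category_indicators([date(2017, 3, 1), date(2017, 9, 20)], as_of)
never = category_indicators([], as_of)
```

Keeping the flag and the recency as separate fields lets a model tell "never bought" apart from "bought very recently", which a single numeric column with 0 cannot do.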
Questions:
Is this structure ok or would you structure the data in another way?
Is it ok to use information from last year in this case?
Is it ok to 'split' a categorical variable into several binary variables? Does this affect the importance given to that variable?

Unfortunately, you need a different approach in order to achieve predictive analysis. For example:
The products' properties are unknown here (color, taste, size, seasonality, ...).
There is no information about the customers (age, gender, living area, etc.).
You need more "transactional" information (when, why and how did they buy, etc.).
What is the product's "lifecycle"? Does it have to do with fashion?
What branch are you in? (Retail, Bulk, Finance, Clothing...)
Meanwhile, have you done any campaigns? How will this be measured?
I would first (if applicable) concentrate on the relations between categories and their behaviour for each quarter:
For example, when n1 decreases then n2 decreases
when q1 is lower than q2, or q1/2016 vs q2/2017.
I think you should, first of all, work this out with a business analyst in order to find out the right "rules" and approach.
I do not think you could get a concrete answer with these generic, assumed data.
Usually you need data from at least the 3-5 most recent years to do some decent predictive analysis, depending, of course, on the nature of your product.
Hope this helped a bit.
;-)
-mwk
