How to test your prediction for an insurance company dataset? - python

I have a dataset of an insurance company for my data science class project. My ultimate business objective in this project is to sell more policies to existing customers/customer segments.
First, I want to cluster my customers with a k-means model on RFM scores, and then use the Apriori algorithm to find association rules among these clusters. From that, I can identify which customers or customer segments I can sell more products to. However, my teacher wants me to test my prediction, and said that since the policies are renewed every year, I cannot simply split the data so that the last 3 months are the test set and the remaining 9 months are the training set. In short, he wants me to test my prediction in a more accurate way. How can I test my prediction in this specific case?
My data set does not include much demographic info about the customers, such as age, income, or education. That is why I want to use RFM scores, since I know all of the customers' purchasing records. The columns include which policy type they purchased, when they purchased it, which company they purchased it with, the price of the purchase, which region they purchased it in, telephone numbers, addresses, email addresses, etc.
Insurance types are life, car, traffic, residence insurance, fire etc.
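For context, this is roughly what I have in mind for the RFM + k-means step (a sketch only; customer_id, purchase_date and price are placeholder names for my actual columns):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def rfm_segments(transactions, n_clusters=4):
    # transactions is assumed to have customer_id, purchase_date (datetime)
    # and price columns -- placeholder names for my real fields.
    snapshot = transactions["purchase_date"].max()
    rfm = transactions.groupby("customer_id").agg(
        recency=("purchase_date", lambda d: (snapshot - d.max()).days),
        frequency=("purchase_date", "count"),
        monetary=("price", "sum"),
    )
    # Scale so no single dimension dominates the distance metric
    X = StandardScaler().fit_transform(rfm)
    rfm["segment"] = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(X)
    return rfm

# segments = rfm_segments(policy_purchases)  # policy_purchases: my transaction table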

Related

Predict real estate price of available sales in the market based on past transaction data - machine learning

I am building a project in the data science course.
I decided to go for a familiar theme of real estate but with some additions of my own.
I took a city and collected all the real estate transactions from 2000 until today.
In the model learning phase I used RandomForestRegressor and, after processing all the data, managed to reach a reasonable score of 0.8.
In the second step, I went to a website where people advertise second-hand apartments for sale and collected the data so that I have all the same features there as in the first model.
Now the problem:
The results I get are far from reality - the price of the model in many apartments is far from the price at which the apartment is sold.
I processed the data and did some comparison.
I think the problem stems from the fact that the first model is trained on all the transactions from 2000 until today, while when I run it on the new data it cannot treat them the same way:
that is, the transaction year of every listing is 2023 rather than being spread over 2000-2023, and the model does not know how to take the increase in prices over time into account.
I will point out that there is some reasonable correlation between the real prices of the assets and the prices the model predicts, but the spread is still too large and inaccurate.
In addition, another figure seems a little strange to me: I noticed that the average price per meter of an apartment sold in the city in 2022 is $11,621.85, compared to $18,405.18 for an apartment currently on the market. The average increase in recent years is 7.5 percent.
I would appreciate your ideas on how to make the model more accurate.
*The data in my model contains:
['price',
'year',
'neighborhood',
'buildingMR',
'TotalFloors',
'rooms',
'floor',
'lat',
'long',
'buildyear',
'Distance_From_Sea',
'avgroomsize']
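One direction I am considering is deflating the historical prices to today's level before training, treating the 7.5 percent figure above as a rough yearly increase (a sketch only; 'price' and 'year' are my column names, the rest is placeholder):

import pandas as pd

ANNUAL_GROWTH = 0.075   # rough yearly increase, taken from the 7.5 percent above
CURRENT_YEAR = 2023

def adjust_prices(df):
    # df is assumed to have numeric 'price' and 'year' columns.
    years_ago = CURRENT_YEAR - df["year"]
    return df.assign(price_adjusted=df["price"] * (1 + ANNUAL_GROWTH) ** years_ago)

# df = adjust_prices(df)   # then train on price_adjusted instead of price

I am not sure whether this, adding a year-based price index as a feature, or something else entirely is the better way to handle the trend.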

I want to validate the hypothesis that trend follows from social to search to sales

I have three datasets with me: social, search, and sales. I have to validate the hypothesis that the trend follows from social to search to sales. I am also asked to find the latency, and whether the latency is the same across all the theme ids. I am also supposed to represent the transitions between them pictorially.
The social dataset has claim id, published date, and number of posts.
The search dataset has claim id, date, platform, and search volume.
The sales dataset has claim id, date, and sales dollar value.
Any help with this would be much appreciated; an answer in Python is preferred.
Even steps on how to proceed would be helpful.
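My current rough idea for the latency part is to estimate, for each pair of series, the lag that maximises their correlation (a sketch only; it assumes daily series indexed by date for a single claim id):

import pandas as pd

def best_lag(leading, lagging, max_lag=60):
    # leading/lagging: daily pandas Series indexed by date for one claim id
    # (placeholder setup). Returns the shift of `leading` that correlates
    # most strongly with `lagging`.
    correlations = {
        lag: leading.shift(lag).corr(lagging)
        for lag in range(0, max_lag + 1)
    }
    return max(correlations, key=lambda lag: correlations[lag])

# social_to_search = best_lag(social["posts"], search["search_volume"])
# search_to_sales  = best_lag(search["search_volume"], sales["sales_dollars"])

I do not know if this is the right way to establish that the trend really flows from social to search to sales, so any guidance is welcome.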

Predict Sales as Counterfactual for Experiment

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for the total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales would have been expected assuming no marketing campaign.
Time frame used to train the model - I'm currently considering 2 options (static time frame and rolling window); let me know what you think.
1. Static: use the same period last year as the dependent variable to build a specific model for this particular 3-month time frame. Data from the 12 months before are used to generate features.
2. Rolling window: use a rolling-window logic of 3 months, dynamically creating dependent time frames and features. I'm not yet sure what the benefit of that would be. It uses more recent data for model creation, but feels less specific because it uses any 3-month period in the year as the dependent period. Not sure what the theory says for this particular example. Experiences, thoughts?
Features - currently building features per customer from one year of pre-period data, e.g. sales in individual months, sales 90/180/365 days prior, max/min/avg per customer, number of months with sales, tenure, etc. This takes quite a lot of time - are there any libraries/packages you would recommend for this?
Modelling technique - currently considering GLM, XGboost, S/ARIMA, and LSTM networks. Any experience here?
To clarify: even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns within the 3-month window. As a first stab, a single number - the total predicted sales of the customer base for these 3 months - would be sufficient.
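For reference, this is roughly how I picture the rolling-window setup (a sketch only; a table `sales` with customer_id, a monthly Period column `month` and `amount` stands in for my actual data):

import pandas as pd

def make_training_rows(sales, target_start, feature_months=12, target_months=3):
    # sales: one row per customer and month, with columns customer_id,
    # month (a monthly pandas Period) and amount -- placeholder names.
    target_start = pd.Period(target_start, freq="M")
    feat = sales[(sales["month"] >= target_start - feature_months)
                 & (sales["month"] < target_start)]
    targ = sales[(sales["month"] >= target_start)
                 & (sales["month"] < target_start + target_months)]

    features = feat.groupby("customer_id")["amount"].agg(
        total="sum", mean="mean", max="max", active_months="count")
    target = targ.groupby("customer_id")["amount"].sum().rename("target_sales")
    return features.join(target, how="left").fillna({"target_sales": 0})

# Static option: one window, e.g. make_training_rows(sales, "2022-01")
# Rolling option: several windows, e.g.
# frames = [make_training_rows(sales, s) for s in ["2022-01", "2022-04", "2022-07"]]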
Any experience or comment would be highly appreciated.
Thanks,
F

How to pre-process transactional data to predict probability to buy?

I'm working on a model for a department store that uses data from previous purchases to predict a customer's probability of buying today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the customers' purchase history in Q1, Q2 and Q3 2017 to predict the probability of buying in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a snapshot of how customers behaved last year to predict what they will do at the end of this year (which is unknown).
The data is separated by quarter; a co-worker suggested this is not correct, because I'm unintentionally giving greater weight to the indicators by splitting each one into 4, when there should only be one per category.
Alternative:
Another approach that was suggested to me is to use two indicators per category, e.g. 'bought_in_category_A' and 'days_since_bought_A'. To me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what happens if the customer never bought A? I cannot use a 0, since that would imply that customers who never bought are closer to customers who bought just a few days ago.
Questions:
Is this structure OK, or would you structure the data in another way?
Is it OK to use information from last year in this case?
Is it OK to 'split' a categorical variable into several binary variables? Does this affect the importance given to that variable?
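For reference, this is roughly how I am building the quarterly indicators at the moment (a sketch only; the tiny purchases table below is a stand-in for my real data):

import pandas as pd

# Stand-in for the real purchase history: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "category":    ["A", "B", "A", "C"],
    "quarter":     ["Q1", "Q3", "Q2", "Q1"],
})

indicators = (
    purchases
    .assign(flag=1)
    .pivot_table(index="customer_id",
                 columns=["category", "quarter"],
                 values="flag",
                 aggfunc="max",      # 1 if the customer bought in that category/quarter
                 fill_value=0)
)
# Flatten the column MultiIndex into names like 'bought_A_Q1'
indicators.columns = [f"bought_{cat}_{q}" for cat, q in indicators.columns]

# With real data the target would be the Q4 columns, e.g. indicators["bought_A_Q4"]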
Unfortunately, you need a different approach in order to achieve predictive analysis. For example:
- The products' properties are unknown here (color, taste, size, seasonality, ...).
- There is no information about the customers (age, gender, living area, etc.).
- You need more "transactional" information (when, why, how did they buy, etc.).
- What is the products' "lifecycle"? Does it have to do with fashion?
- What branch are you in? (Retail, Bulk, Finance, Clothing, ...)
- Meanwhile, have you done any campaign? How will this be measured?
I would first (if applicable) concentrate on the relations and behaviour of the categories for each quarter: for example, when n1 decreases then n2 decreases, when Q1 is lower than Q2, or Q1/2016 vs Q2/2017.
I think you should first of all work this out with a business analyst in order to find the right "rules" and approach.
I do not think you could get a concrete answer with this generic, assumed data.
Usually you need data from at least the 3-5 most recent years to do decent predictive analysis, depending of course on the nature of your product.
Hope this helped a bit.
;-)
-mwk

Can training and evaluation sets be the same in predictive analytics?

I'm creating a model to predict the probability that customers will buy product A in a department store that sells products A through Z. The store has its own credit card with demographic and transactional information on 140,000 customers.
There is a subset of customers (say 10,000) who currently buy A. The goal is to learn from these 10,000 customers and score the remaining 130,000 with their probability of buying A, then target the ones with the highest scores with marketing campaigns to increase sales of A.
How should I define my training and evaluation sets?
Training set:
Should it be only the 10,000 who bought A, or all 140k customers?
Evaluation set: (where the model will be used in production)
I believe this should be the 130k who haven't bought A.
The question about time:
Another alternative is to take a snapshot of the database as it was last year, use it as the training set, then take the database as it is today and evaluate all customers with the model created from last year's info. Is this correct? When should I do this?
Which option is correct for all sets?
The training set and the evaluation set must be different. The whole point of having an evaluation set is to guard against over-fitting.
In this case, what you should do is take, say, 100,000 customers picked at random, and use their data to try to learn what it is about customers that makes them likely to purchase A. Then use the remaining 40,000 to test how well your model works.
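A minimal sketch of that split, under the assumption that the customer table has already been reduced to features plus a binary bought_A label (placeholder names and toy data):

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real 140k-customer table.
customers = pd.DataFrame({
    "avg_monthly_spend": [120, 80, 300, 45, 210, 95],
    "tenure_years":      [3, 1, 7, 2, 5, 4],
    "bought_A":          [1, 0, 1, 0, 1, 0],
})

# Hold out a random portion as the evaluation set (the "40,000 of 140,000"
# idea above), stratified on the label so both sets contain buyers of A.
train, test = train_test_split(customers, test_size=0.3, random_state=0,
                               stratify=customers["bought_A"])

# Fit on `train`, measure on `test`, then score the customers who have
# not yet bought A with the fitted model.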
