Predicting customer churn using machine learning with lags - Python

I have monthly time-series data for 5,000 customers, which looks like this:
This is my first time dealing with time-series data. Can someone explain some strategies for predicting churn probability 3 or 6 months in advance?
I am confused because, for every customer, the churn label 3 or 6 months in advance will be zero (according to the target). So should I look for trends or create lag variables?
And if I use regression, I still don't know what the target variable would be.

You can create lag features and then define churn as the target for a predictive model. Since I can see you have the product category being bought, you can take lags over the last 3 months for each unique customer ID and then define churn over the following 3 months, as sketched below.
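A minimal sketch of that idea with pandas (the column names customer_id, month, monthly_spend, and churned are assumptions, since the question doesn't show the data):

import pandas as pd

# Assumed layout: one row per customer per month, churned is 0/1
df = df.sort_values(["customer_id", "month"])
g = df.groupby("customer_id")

# Lag features: each customer's activity over the previous 3 months
for lag in (1, 2, 3):
    df[f"spend_lag_{lag}"] = g["monthly_spend"].shift(lag)

# Target: does the customer churn at any point in the next 3 months?
df["churn_next_3m"] = (
    g["churned"].shift(-1).fillna(0)
    + g["churned"].shift(-2).fillna(0)
    + g["churned"].shift(-3).fillna(0)
).clip(upper=1)

This also answers the target-variable question: it becomes a binary classification target, and any classifier with predict_proba (e.g. sklearn's GradientBoostingClassifier) gives the churn probability 3 months ahead.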


Time Series Forecasting in Python

I have a dataset that contains 300 rows and 4 columns: Date, Hour, Counts (how many ads were shown on TV during that hour), and Visits (how many visits were made to the website during that hour). Here is an example of the data:
If I want to test the effect of the TV spots on visits to the website, should I treat it as a time series and use regression, for example? And what should the input table look like in that case? I know that I have to split the date into day and month, but how should I treat the Counts column: leave it as it is, if my y is to be the number of visits?
Thanks
Just to avoid a single-input, single-output regression model, you could use Hour and Counts as inputs and predict Visits.
I don't know what format the hours are in; if they are in 12-hour format, convert them to 24-hour format before feeding them to your model.
If you want to predict the next dates and hours in the time series, regression models or classical time-series models such as ARIMA, ARMA, or exponential smoothing would be useful.
But since you need to predict the effectiveness of the TV spots, I recommend generating features with the tsfresh library in Python, based on Counts, to remove the time effect, and then using a machine-learning model such as SVR or Gradient Boosting for the prediction.
In your problem:

from tsfresh import extract_features

# df has one row per (Date, Hour) observation
extracted_features = extract_features(
    df,
    column_id="Hour",       # build one feature row per hour of day
    column_kind=None,
    column_value="Counts",  # the series to extract features from
    column_sort="Date",     # chronological order within each hour
)
So your target table will be:

Hour  Feature_1    Feature_2    ...  Visits (avg)
0     min(Counts)  max(Counts)  ...  mean(Visits)
1     min(Counts)  max(Counts)  ...  mean(Visits)
2     min(Counts)  max(Counts)  ...  mean(Visits)
min() and max() are just example features; tsfresh can extract many others. See the tsfresh documentation for more information.
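To complete the picture, a minimal sketch (reusing df and extracted_features from above) of joining the extracted features with the averaged target and fitting one of the suggested models:

from sklearn.ensemble import GradientBoostingRegressor

# Target: average Visits per Hour, matching the one-row-per-Hour feature table
target = df.groupby("Hour")["Visits"].mean()

X = extracted_features.loc[target.index].fillna(0)  # tsfresh can emit NaNs
y = target.values

model = GradientBoostingRegressor().fit(X, y)
print(model.score(X, y))  # in-sample R^2, just a sanity check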

How to calculate the variance of stock returns over the last 36 months (some monthly data are missing)?

I downloaded some stock data from CRSP and need the variance of each company's stock returns over the last 36 months.
So, basically, the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks (stocks with prices < $2) from my sample. Hence, some months are missing, and e.g. April's and June's monthly returns end up directly on top of each other.
If I am not mistaken, a rolling function (grouped by PERMCO) would just take the 36 monthly returns above each row. But when months are missing, the rolling function would actually take more than 3 years of data (since the last 36 monthly returns would exceed that timeframe).
Usually I work with MS Excel. However, in this case the amount of data is too big and Excel takes forever to calculate anything. That's why I want to tackle the problem with Python.
The sample is organized as follows:
PERMNO  date  SHRCD  PERMCO  PRC  RET
When I have figured out how to make a proper table in here, I will show you a sample of my data.
What I have tried so far:
data["RET"]=data["RET"].replace(["C","B"], np.nan)
data["date"] = pd.to_datetime(date["date"])
data=data.sort_values[("PERMCO" , "date"]).reset_index()
L3Yvariance=data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are C and B instead of actual returns, thats why the first line
You can replace the missing values with the mean value. Since the variance is calculated after subtracting the mean, each filled-in point contributes 0 to the sum of squared deviations; note, though, that the extra points still increase the sample size, so the variance estimate will shrink slightly.
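If you'd rather not impute at all, a time-based rolling window sidesteps the missing-month problem, because the window is defined in calendar days rather than in rows. A minimal sketch (1095 days approximates 36 months; the min_periods threshold is an assumption to tune):

import numpy as np
import pandas as pd

data["RET"] = pd.to_numeric(data["RET"].replace(["C", "B"], np.nan))
data["date"] = pd.to_datetime(data["date"])
data = data.sort_values(["PERMCO", "date"])

# The window is defined in calendar time, so missing months can never
# stretch it beyond ~3 years of data.
l3y_var = (
    data.set_index("date")
        .groupby("PERMCO")["RET"]
        .rolling("1095D", min_periods=12)  # require at least 12 monthly returns
        .var()
        .reset_index()
)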

What method can I use in Python to determine the likelihood of a value based on past data?

I have a large "time series" dataset that looks something like this:
Date    Day of the week  Days since last Happy  Out of town  Happy
1/1/20  Monday           0                      1            1
1/2/20  Tuesday          1                      0            0
1/3/20  Wednesday        2                      0            0
1/4/20  Thursday         0                      0            1
I want to predict the "Happy" observation for future days, based on the previous observations. I know intuitively that there is a correlation between these input variables (like day of the week, days since the last happy observation, etc.) and the "Happy" observation, but I want to know how strongly they correlate with it.
What modeling technique should I use? Poisson regression? A Markov chain? Linear regression?
The question you are asking, "how to predict", is the central question of the entire machine learning field, so we cannot answer it for you in general.
If you want some simple models to start with, you can use models from the sklearn package, such as a Decision Tree or KNN.
If you are looking to understand the correlation between variables, you can use DataFrame.corr(method='pearson') to get a correlation matrix.
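A minimal sketch of both suggestions, using the four sample rows above (the numeric encoding, e.g. Monday = 0, is an assumption for illustration):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "day_of_week": [0, 1, 2, 3],            # Monday=0, Tuesday=1, ...
    "days_since_last_happy": [0, 1, 2, 0],
    "out_of_town": [1, 0, 0, 0],
    "happy": [1, 0, 0, 1],
})

print(df.corr(method="pearson"))  # how strongly each input tracks "happy"

X = df.drop(columns="happy")
clf = DecisionTreeClassifier(max_depth=3).fit(X, df["happy"])
print(clf.predict_proba(X)[:, 1])  # predicted probability of "happy" per row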

How to detect anomalies in multivariate, multiple time-series data?

I am trying to find anomalies in a huge sales-transactions dataset (more than 1 million observations) with thousands of unique customers. The same customer can purchase multiple times on the same date. The dataset contains a mix of both random and seasonal transactions. A dummy sample of my data is below:
Date      CustomerID  TransactionType  CompanyAccountNum  Amount
01.01.19  1           Sales            111xxx               100
01.01.19  1           Credit           111xxx             -3100
01.01.19  4           Sales            111xxx               100
02.01.19  3           Sales            311xxx               100
02.01.19  1           Refund           211xxx             -2100
03.01.19  4           Sales            211xxx              3100
Which algorithm/approach would suit this problem best? I have tried a multivariate FBprophet model (in Python) so far and got less-than-satisfactory results.
You may try the pyod package, with methods like Isolation Forest or HBOS.
It's advertised as 'a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data', but your mileage may vary in terms of performance, so first check out their benchmarks.
If you have time-series data, it is better to first apply methods like a moving average or exponential smoothing to remove trends and seasonality; otherwise, all data points inside seasonal or trend periods will be labeled as anomalies.
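A minimal sketch with pyod's Isolation Forest; the detrending step and the choice of features are assumptions to adapt to your data:

from pyod.models.iforest import IForest

# Crude detrending: deviation of each amount from the customer's
# recent moving average, to reduce per-customer trend/seasonality
df["amount_ma"] = (
    df.groupby("CustomerID")["Amount"]
      .transform(lambda s: s.rolling(30, min_periods=1).mean())
)
X = (df["Amount"] - df["amount_ma"]).to_frame("detrended_amount")

clf = IForest(contamination=0.01)  # assumed outlier fraction; tune it
clf.fit(X)

df["anomaly"] = clf.labels_          # 1 = outlier, 0 = inlier
df["score"] = clf.decision_scores_   # higher = more anomalous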

Probability of occurrence based on historical data

The dataset records occurrences of a particular insect at a location in a given year and month. About 30 years of data are available. Given a random location and a future year and month, I want the probability of finding that insect in that place based on the historical data.
I tried to frame it as a classification problem by labelling all available data as 1 and checking the probability of a new data point being label 1, but an error was thrown because there must be at least two classes to train on.
The data looks like this (x and y are longitude and latitude):
x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need one map per time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need some binning, as exact coordinates are otherwise essentially unique. So round the values of x and y to a reasonable precision level, or use numpy to bin the data. Then you can create a map of the counts, or use a Markov model to predict occurrences.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any exact random point is virtually 0.
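A minimal sketch of the binning idea (the 1-degree grid is an assumption; choose a resolution that matches how precise your locations are):

import numpy as np

# df is a DataFrame with the x, y, year, month columns shown above.
# Snap coordinates to a 1-degree grid.
df["x_bin"] = np.floor(df["x"])
df["y_bin"] = np.floor(df["y"])

# In how many distinct years was the insect seen in this cell, this month?
years_seen = df.groupby(["x_bin", "y_bin", "month"])["year"].nunique()

# Empirical probability: fraction of the ~30 observed years with a sighting
n_years = df["year"].nunique()
prob = (years_seen / n_years).rename("probability").reset_index()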
