I hope you guys can help me sort this out, as I feel this is above me. It might seem silly to some of you, but I am lost, and I come to you for advice.
I am new to statistics, data analysis and big data. I just started studying and I need to make a project on churn prediction. Yes, this is sort of a homework task, but I hope you can answer some of my questions.
I would be most grateful for beginner-level, step-by-step answers.
Basically, I have a very big data set (obviously) of customer activity from a cellular company covering 3 months, with the 4th month ending in churned or not churned. Each month has these columns:
['year',
'month',
'user_account_id',
'user_lifetime',
'user_intake',
'user_no_outgoing_activity_in_days',
'user_account_balance_last',
'user_spendings',
'user_has_outgoing_calls',
'user_has_outgoing_sms',
'user_use_gprs',
'user_does_reload',
'reloads_inactive_days',
'reloads_count',
'reloads_sum',
'calls_outgoing_count',
'calls_outgoing_spendings',
'calls_outgoing_duration',
'calls_outgoing_spendings_max',
'calls_outgoing_duration_max',
'calls_outgoing_inactive_days',
'calls_outgoing_to_onnet_count',
'calls_outgoing_to_onnet_spendings',
'calls_outgoing_to_onnet_duration',
'calls_outgoing_to_onnet_inactive_days',
'calls_outgoing_to_offnet_count',
'calls_outgoing_to_offnet_spendings',
'calls_outgoing_to_offnet_duration',
'calls_outgoing_to_offnet_inactive_days',
'calls_outgoing_to_abroad_count',
'calls_outgoing_to_abroad_spendings',
'calls_outgoing_to_abroad_duration',
'calls_outgoing_to_abroad_inactive_days',
'sms_outgoing_count',
'sms_outgoing_spendings',
'sms_outgoing_spendings_max',
'sms_outgoing_inactive_days',
'sms_outgoing_to_onnet_count',
'sms_outgoing_to_onnet_spendings',
'sms_outgoing_to_onnet_inactive_days',
'sms_outgoing_to_offnet_count',
'sms_outgoing_to_offnet_spendings',
'sms_outgoing_to_offnet_inactive_days',
'sms_outgoing_to_abroad_count',
'sms_outgoing_to_abroad_spendings',
'sms_outgoing_to_abroad_inactive_days',
'sms_incoming_count',
'sms_incoming_spendings',
'sms_incoming_from_abroad_count',
'sms_incoming_from_abroad_spendings',
'gprs_session_count',
'gprs_usage',
'gprs_spendings',
'gprs_inactive_days',
'last_100_reloads_count',
'last_100_reloads_sum',
'last_100_calls_outgoing_duration',
'last_100_calls_outgoing_to_onnet_duration',
'last_100_calls_outgoing_to_offnet_duration',
'last_100_calls_outgoing_to_abroad_duration',
'last_100_sms_outgoing_count',
'last_100_sms_outgoing_to_onnet_count',
'last_100_sms_outgoing_to_offnet_count',
'last_100_sms_outgoing_to_abroad_count',
'last_100_gprs_usage']
The end result of this homework should be a k-means cluster analysis and a churn prediction model.
My biggest headache regarding this dataset is:
How do I make a cluster analysis for monthly data that includes most of these variables? I looked for examples, but I only found ones analyzing either one variable across months or many variables within a single month.
I am using Python and Spark.
I think I can make it work as long as I know what to do with the months and this huge list of variables.
Thanks, your help will be greatly appreciated!
P.S. Would a code example be too much to ask?
Why would you use k-means here?
k-means will not do anything meaningful on such data. It is too sensitive to scaling and to attribute types (e.g. year, month).
Churn prediction is a supervised problem. Never use an unsupervised algorithm for a supervised problem: that means ignoring the single most valuable piece of information you have to guide the search, the churn label.
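To make the supervised framing concrete, here is a minimal sketch, under stated assumptions rather than a definitive solution: pivot the three monthly rows per user into one wide feature row, scale the features, and train a classifier with Spark ML. The file names and the label column name ("churn") are assumptions, since the question doesn't show them.

    # Minimal sketch: monthly rows -> one wide row per user -> supervised model.
    # Assumed inputs: activity.csv (months 1-3, columns as listed above) and
    # labels.csv (user_account_id, churn with churn encoded as 0/1).
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()
    activity = spark.read.csv("activity.csv", header=True, inferSchema=True)
    labels = spark.read.csv("labels.csv", header=True, inferSchema=True)

    # Pivot on month: one row per user, one column per (month, variable) pair.
    value_cols = [c for c in activity.columns
                  if c not in ("year", "month", "user_account_id")]
    wide = (activity.groupBy("user_account_id")
            .pivot("month")
            .agg(*[F.first(c).alias(c) for c in value_cols]))

    data = wide.join(labels, "user_account_id").na.fill(0)
    feature_cols = [c for c in data.columns
                    if c not in ("user_account_id", "churn")]

    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=feature_cols, outputCol="raw"),
        StandardScaler(inputCol="raw", outputCol="features"),  # scaling matters
        LogisticRegression(labelCol="churn", featuresCol="features"),
    ])
    train, test = data.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    model.transform(test).select("user_account_id", "prediction").show(5)

If the assignment still requires k-means, you could run it on the same scaled feature vectors (with year/month dropped) to describe customer segments, but don't treat the clusters as the churn predictor.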
Related
I have a dataset that I am trying to analyze for a project.
The first step of the project is basically to model the data, and I am running into some issues. The data is on house sales within the past 5 years, collecting information on buyers, cost of the house, income, age, year purchased, years in the loan, years at current job, and whether or not the house was foreclosed on (YES or NO).
The goal is to train a model to make predictions using machine learning, but I am stuck on part 1, describing the data. I am using Jupyter notebooks to analyze the data and trying to put together a linear or multilinear regression model, and I am failing. When I throw together a scatter plot, my data is all over the chart with no way to really "group" the data at an intersection point and cast a prediction line. This makes it difficult to figure out what is actually happening; perhaps the variables I am comparing are not correlated in any way.
The problem also comes in with the YES or NO data. I was thinking this might need to be converted into 0s and 1s, but then my linear regression model would put incredible weight on both ends of the spectrum. Perhaps linear regression is not the best choice?
I'm just struggling to figure out what to do and how to do it. I am kind of new to data analysis, so perhaps I am thinking about this all wrong. If anyone has any insight, it would be much appreciated.
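For reference, here is roughly what I meant by converting the YES/NO column into 0s and 1s, sketched with a logistic regression classifier (which, as I understand it, is built for binary targets) rather than plain linear regression; the column and file names are made up:

    # Hypothetical sketch: encode the YES/NO target and fit a classifier.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("house_sales.csv")  # assumed file name
    df["foreclosed"] = df["foreclosed"].map({"YES": 1, "NO": 0})

    X = df[["cost", "income", "age", "year_purchased",
            "years_in_loan", "years_at_job"]]  # made-up column names
    y = df["foreclosed"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))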
In the following picture you can see a demand problem.
My question relates to how one can/should handle fixed holidays in an LSTM model; as seen here, they contain no demand and therefore cause sudden, strong one-day deviations from the average. I am specifically not referring to the change in trend between December and January.
An ARIMA model, for example, can handle such days well.
After hours of searching the internet, all I could find was material on how to deal with a change in trend. However, that is not the case here: the trend remains the same and is only suspended for one day. I hope someone here has a paper or an approach for this kind of problem.
Since the holidays have predefined dates, why not change the value of the data at those specific dates to another value that wouldn't disturb the learning much, maybe the previous day's value or the following day's? Or you could simply remove the holiday rows from your data, and the sequence would then be unharmed by their drastic effect.
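For example, a minimal sketch of both options with pandas, assuming a daily demand series with a date index (the file name and holiday dates are placeholders):

    import pandas as pd

    # Assumed input: a CSV with 'date' and 'demand' columns, one row per day.
    demand = pd.read_csv("demand.csv", index_col="date", parse_dates=True)["demand"]
    holidays = pd.to_datetime(["2020-12-25", "2021-01-01"])  # fixed, known dates

    # Option 1: overwrite each holiday with the previous day's value.
    cleaned = demand.copy()
    prev_day = demand.shift(1)
    for day in holidays:
        if day in cleaned.index:
            cleaned.loc[day] = prev_day.loc[day]

    # Option 2: drop the holiday rows entirely before building LSTM windows.
    dropped = demand.drop(index=[d for d in holidays if d in demand.index])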
I am new to Prophet (and Stack Overflow in general ;) ) and have some issues with creating a predictive model using Python. I am trying to predict daily sales of a product, using around 5 years of data. The data looks as follows: General data plot.
The company is closed on weekends and during holidays, so there are no orders then. I accounted for this by creating a dataframe with all the weekends/holidays and passing it as the holidays parameter. Otherwise I didn't change anything in the model, so it looks like: Prophet(holidays = my weekend/holiday dataframe).
However, my model doesn't seem to work right and predicts negative values; see the following plot: prediction 1. Here are also the different component plots as extra information: trend, holidays, weekly, yearly. I also tried simply replacing the negative values in the prediction with 0, which gives a somewhat better result (see prediction 2), but I don't think this is the right way to tackle the problem. The last thing I tried was removing all the weekends from the training and prediction data. The results weren't good either: prediction 3.
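For clarity, a minimal sketch of my setup (the dates are placeholders, and the last line is the clip-to-zero workaround mentioned above):

    import pandas as pd
    from prophet import Prophet  # older versions import from fbprophet

    df = pd.read_csv("daily_sales.csv")  # columns: ds, y

    # Dataframe of closed days in Prophet's expected holiday/ds format.
    closed_days = pd.DataFrame({
        "holiday": "closed",
        "ds": pd.to_datetime(["2023-12-25", "2024-01-01"]),  # plus all weekends
    })

    m = Prophet(holidays=closed_days)
    m.fit(df)
    forecast = m.predict(m.make_future_dataframe(periods=90))
    forecast["yhat"] = forecast["yhat"].clip(lower=0)  # crude negative-value fix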
I would love to hear some tips from you guys on things I could try. If anything is unclear or you need more information, just let me know. Thank you in advance!
My suggestions:
Try normalization.
If that doesn't work, try using recurrent neural networks.
I'm currently working on a forecasting project with FB Prophet (in Python). The data on which I want to make forecasts is weekly. I also have another time series that could be useful as an additional regressor, but its past data is quarterly and its future values are only yearly. What would be the best way to make the forecast using this additional time series?
My current idea is to linearly interpolate the additional data to get weekly values, but I think stretching yearly data to weekly with this method may be too much. Does anyone know a better solution?
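To illustrate, a rough sketch of my interpolation idea (column names are placeholders, and future values of the regressor would still be needed at prediction time):

    import pandas as pd
    from prophet import Prophet

    weekly = pd.read_csv("weekly_target.csv", parse_dates=["ds"])           # ds, y
    quarterly = pd.read_csv("quarterly_regressor.csv", parse_dates=["ds"])  # ds, extra

    # Upsample the quarterly series to weekly via linear interpolation.
    extra_weekly = (quarterly.set_index("ds")["extra"]
                    .resample("W").interpolate(method="linear")
                    .reset_index())

    # Week-ending conventions may need aligning before this join.
    df = weekly.merge(extra_weekly, on="ds", how="left")

    m = Prophet()
    m.add_regressor("extra")
    m.fit(df)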
Thank you in advance for any answers.
I'm very new to the Spark and Hadoop world. I've started learning these topics on my own from the Internet. I wanted to know how we can perform outlier detection on a Spark DataFrame, given that a DataFrame in Spark is immutable. Is there any Spark package or module that can do this? I'm using the PySpark API, so I would be very grateful if someone could explain how this can be done in PySpark, and I would highly appreciate a small code example for performing outlier detection on a Spark DataFrame in PySpark (Python). Thanks a lot in advance!
To my knowledge, there is neither an API nor a package dedicated to detecting outliers, since what counts as an outlier varies with the application. However, there are a couple of well-known methods that help to identify them.
Let's first look at what the term outlier means: it simply refers to extreme values that fall outside the range of the rest of the observations. A good way to see such outliers is to visualize the data in a histogram or scatter plot; they can strongly influence the statistics and compress the meaningful data. Put another way, they exert a strong influence on the statistical summary of the data, such as the mean or the standard deviation.
This can certainly be misleading. The danger arises when we train on data that contains outliers: training takes longer as the model struggles with the out-of-range values, and we end up with a less accurate model, poor results, or an objective measure that never converges, i.e., the output/scores on the test and training sets never settle within some accuracy range as training time grows.
Although outliers are commonly regarded as undesirable entities in your data, they can also be a sign of anomalies, in which case their detection itself becomes a method for spotting fraud or improving security.
Here are some known methods for outlier detection (more details can be found in this good article):
Extreme Value Analysis,
Probabilistic and Statistical Models,
Linear Models: reducing the data dimensionality,
Proximity-based Models: mainly using clustering.
For the code, I suggest this good tutorial from MapR; a small sketch of the first method in the list above also follows below. I hope this answer helps. Good luck.
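As a starting point, here is a minimal sketch of Extreme Value Analysis via the IQR rule in PySpark; the column name ("value") and the conventional 1.5 * IQR fences are illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("data.csv", header=True, inferSchema=True)  # assumed input

    # Approximate quartiles of a (hypothetical) numeric column 'value'.
    q1, q3 = df.approxQuantile("value", [0.25, 0.75], 0.01)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # DataFrames are immutable, so derive new ones instead of mutating in place:
    outliers = df.filter((F.col("value") < lower) | (F.col("value") > upper))
    cleaned = df.filter((F.col("value") >= lower) & (F.col("value") <= upper))
    outliers.show()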