I have a dataset that I am trying to analyze for a project.
The first step of the project is to model the data, and I am running into some issues. The data covers house sales within the past 5 years and includes, for each buyer: cost of the house, income, age, year purchased, years in the loan, years at current job, and whether or not the house was foreclosed on (YES or NO).
The goal is to train a machine learning model to make predictions, but I am stuck on part 1: describing the data. I am using Jupyter notebooks to analyze the data and trying to put together a simple or multiple linear regression model, and I am failing. When I throw together a scatter plot, the points are scattered all over the chart with no obvious way to group them or fit a prediction line through them. This makes it difficult to figure out what is actually happening; perhaps the variables I am comparing are simply not correlated.
There is also the problem of the YES or NO column. I was thinking it might need to be converted into 0s and 1s, but then my linear regression model would put all of its weight at the two extremes of the spectrum. Perhaps linear regression is not the best choice here?
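For context, this is roughly what I was considering (a minimal sketch; the file name and column names here are made up, my real columns are the ones described above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names standing in for my real dataset
df = pd.read_csv("house_sales.csv")
df["foreclosed"] = df["foreclosed"].map({"YES": 1, "NO": 0})

X = df[["cost", "income", "age", "years_in_loan", "years_at_job"]]
y = df["foreclosed"]

model = LinearRegression().fit(X, y)
print(model.coef_)  # this is the fit that doesn't look meaningful to me
```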
I'm just struggling to figure out what to do and how to do it. I am kind of new to data analysis, so perhaps I am thinking of this all wrong. If anyone has any insight it would be much appreciated.
A larger, more macro-level question here: I'm working on a machine learning model to predict the long-term performance of stocks using NLP on financial reports, plus various figures from each yearly report. What I am wondering is what to do about differently sized data.
For example, I can only get data going back 27 years for one company but 100 years back for another. My question is: how can I set things up so that the model trains on each company as a single instance, when the sequence length (what I was calling the epoch size) is not going to be constant because the amount of data per company varies?
One thing I thought of is to standardize the length: give every company, say, 300 years of data and auto-fill the years that don't exist with an impossible value that the model can learn represents "no data". That way, if I set the sequence length to 300, each pass sees (or slides a window over) exactly one company. A rough sketch of what I mean is below.
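This is just a sketch of the padding idea, assuming I'd use Keras; the feature count and the sentinel value are placeholders:

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_YEARS = 300        # standardized history length
N_FEATURES = 8         # placeholder for the number of features per yearly report
PAD_VALUE = -999.0     # "impossible" value meaning no data for that year

def pad_company(history):
    """history: (n_years, N_FEATURES) array -> (MAX_YEARS, N_FEATURES) padded array."""
    padded = np.full((MAX_YEARS, N_FEATURES), PAD_VALUE, dtype=np.float32)
    padded[: len(history)] = history
    return padded

# A Masking layer tells the recurrent layer to skip the padded timesteps
model = models.Sequential([
    layers.Input(shape=(MAX_YEARS, N_FEATURES)),
    layers.Masking(mask_value=PAD_VALUE),
    layers.LSTM(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```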
So just wondering if this is a good solution or if there are better solutions out there. Thanks!
I am new to Prophet (and Stack Overflow in general ;) ) and have some issues with creating a predictive model using Python. I am trying to predict daily sales of a product, using around 5 years of data (see the general data plot).
The company is closed on weekends and during holidays, so there are no orders then. I accounted for this by creating a dataframe with all the weekends/holidays and passing it as the holidays argument. Other than that I didn't change anything about the model, so it looks like: Prophet(holidays = my weekend/holiday dataframe).
However, my model doesn't seem to work right and predicts negative values (see the plot for prediction 1). As extra information, I also have the component plots: trend, holidays, weekly, and yearly. I tried simply replacing the negative values in the prediction with 0, which gives a somewhat better result (see prediction 2), but I don't think that is the right way to tackle this problem. The last thing I tried was removing all the weekends from the training and prediction data; the results weren't good either (see prediction 3).
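Roughly, my setup looks like this (a sketch only; the date range, the dataframe names, and the weekend construction are placeholders, and the real public holidays are omitted for brevity):

```python
import pandas as pd
from prophet import Prophet   # on older installs: from fbprophet import Prophet

# Mark every weekend day as "closed"; public holidays would be appended here too
days = pd.DataFrame({"holiday": "closed",
                     "ds": pd.date_range("2015-01-01", "2019-12-31", freq="D")})
closed_days = days[days["ds"].dt.dayofweek >= 5]   # Saturdays and Sundays

m = Prophet(holidays=closed_days)
m.fit(train_df)                                     # train_df has columns ds, y
forecast = m.predict(m.make_future_dataframe(periods=90))
```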
I would love to hear some tips from you guys, for things I could try to do. If anything is unclear or you need more information, just let me know. Thank you in advance!!
My suggestions:
Try normalization
If that doesn't work try using Recurrent Neural Networks
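For the normalization step, a minimal sketch with scikit-learn (X_train and X_test stand in for your feature matrices):

```python
from sklearn.preprocessing import MinMaxScaler

# Scale every feature column to [0, 1]; fit on the training data only,
# then reuse the same scaling for the test/future data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```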
I am very sorry if this question violates SO's question guidelines, but I am stuck and I cannot find anywhere else to ask this type of question. Suppose I have a dataset containing data from three experiments that were run in three different conditions (hot, cold, comfortable). The data is arranged in a pandas dataframe with 4 columns (time, cold, comfortable, and hot), i.e. one column per condition plus the time.
When I plot the data, I can visually see the separation of the three experiments, but I would like to do it automatically with machine learning.
The x-axis represents the time and the y-axis represents the magnitude of the data. I have read about different machine learning classification techniques, but I do not understand how to set up my data so that I can 'feed' it into the classification algorithm. Namely, my questions are:
Is this programmatically feasible?
How can I arrange my data so that it can be easily fed into a classification algorithm? From what I have read so far, it seems that for the algorithm to work the data has to be in a certain layout (see for example the iris dataset, where the data is nicely labeled). How can I adapt the algorithms to fit my needs?
NOTE: Ideally, I would like a program that, given a magnitude value, classifies it as hot, comfortable, or cold. The time series is not of much relevance in my case.
Of course this is feasible.
It's not entirely clear from the original post exactly what variables/features you have available for your model, but here is a bit of general guidance. All of these machine learning problems, from classification to regression, rely on the same core assumption that you are trying to predict some outcome based on a bunch of inputs. Usually this relationship is modeled like this: y ~ X1 + X2 + X3 ..., where y is your outcome ("dependent") variable, and X1, X2, etc. are features ("explanatory" variables). More simply, we can say that using our entire feature-set matrix X (i.e. the matrix containing all of our x-variables), we can predict some outcome variable y using a variety of ML techniques.
So in your case, you'd try to predict whether it's Cold, Comfortable, or Hot based on time. This is really more of a forecasting problem than it is a ML problem, since you have a time component that looks to be one of the most important (if not the only) features in your dataset. You may want to look at some simpler time-series forecasting methods (e.g. ARIMA) instead of ML algorithms, as some of the time-series ML approaches may not be well-suited for a beginner.
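That said, if, per your note, you mainly want to map a magnitude value to a label, the data arrangement could look roughly like this (a minimal sketch; the dataframe and column names are assumed from your description):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# df has columns time, cold, comfortable, hot (as described in the question).
# Reshape from "one column per condition" into one row per observation,
# so each row becomes (magnitude, label).
long_df = df.melt(id_vars="time",
                  value_vars=["cold", "comfortable", "hot"],
                  var_name="condition", value_name="magnitude")

X = long_df[["magnitude"]]   # feature: the measured value
y = long_df["condition"]     # target: cold / comfortable / hot

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(pd.DataFrame({"magnitude": [42.0]})))   # classify a single value
```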
In any case, this should get you started, I think.
I hope you guys can help me sort this out, as I feel this is above me. It might seem silly to some of you, but I am lost and I come to you for advice.
I am new to statistics, data analysis and big data. I just started studying and I need to make a project on churn prediction. Yes, this is sort of a homework task, but I hope you can answer some of my questions.
I would be most grateful for beginner-level, step-by-step answers.
Basically, I have a very big data set (obviously) of customer activity data from a cellular company covering 3 months, with the 4th month ending in churned or not churned. Each month has these columns:
['year',
'month',
'user_account_id',
'user_lifetime',
'user_intake',
'user_no_outgoing_activity_in_days',
'user_account_balance_last',
'user_spendings',
'user_has_outgoing_calls',
'user_has_outgoing_sms',
'user_use_gprs',
'user_does_reload',
'reloads_inactive_days',
'reloads_count',
'reloads_sum',
'calls_outgoing_count',
'calls_outgoing_spendings',
'calls_outgoing_duration',
'calls_outgoing_spendings_max',
'calls_outgoing_duration_max',
'calls_outgoing_inactive_days',
'calls_outgoing_to_onnet_count',
'calls_outgoing_to_onnet_spendings',
'calls_outgoing_to_onnet_duration',
'calls_outgoing_to_onnet_inactive_days',
'calls_outgoing_to_offnet_count',
'calls_outgoing_to_offnet_spendings',
'calls_outgoing_to_offnet_duration',
'calls_outgoing_to_offnet_inactive_days',
'calls_outgoing_to_abroad_count',
'calls_outgoing_to_abroad_spendings',
'calls_outgoing_to_abroad_duration',
'calls_outgoing_to_abroad_inactive_days',
'sms_outgoing_count',
'sms_outgoing_spendings',
'sms_outgoing_spendings_max',
'sms_outgoing_inactive_days',
'sms_outgoing_to_onnet_count',
'sms_outgoing_to_onnet_spendings',
'sms_outgoing_to_onnet_inactive_days',
'sms_outgoing_to_offnet_count',
'sms_outgoing_to_offnet_spendings',
'sms_outgoing_to_offnet_inactive_days',
'sms_outgoing_to_abroad_count',
'sms_outgoing_to_abroad_spendings',
'sms_outgoing_to_abroad_inactive_days',
'sms_incoming_count',
'sms_incoming_spendings',
'sms_incoming_from_abroad_count',
'sms_incoming_from_abroad_spendings',
'gprs_session_count',
'gprs_usage',
'gprs_spendings',
'gprs_inactive_days',
'last_100_reloads_count',
'last_100_reloads_sum',
'last_100_calls_outgoing_duration',
'last_100_calls_outgoing_to_onnet_duration',
'last_100_calls_outgoing_to_offnet_duration',
'last_100_calls_outgoing_to_abroad_duration',
'last_100_sms_outgoing_count',
'last_100_sms_outgoing_to_onnet_count',
'last_100_sms_outgoing_to_offnet_count',
'last_100_sms_outgoing_to_abroad_count',
'last_100_gprs_usage']
The end result for this homework would be a k-means cluster analysis and a churn prediction model.
My biggest headache regarding this dataset is:
How do I run a cluster analysis on monthly data that includes most of these variables? I tried to find an example, but I only found examples analyzing either one variable across months or many variables for a single month.
I am using Python and Spark.
I think I can make it work as long as I know what to do with months and a huge list of variables.
Thanks, your help will be greatly appreciated!
P.S. Would a code example be too much to ask?
Why would you use k-means here?
k-means will not do anything meaningful on such data. It's too sensitive to scaling and attribute types (e.g. year, month)
Churn prediction is a supervised problem. Never use an unsupervised algorithm for a supervised problem. That means you are ignoring the single most valuable piece of information you have to guide the search.
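As a minimal supervised sketch instead (plain pandas/scikit-learn rather than Spark; the monthly_df, labels_df, and churned names are assumptions about your data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# monthly_df: one row per user per month with the columns listed in the question.
# labels_df: user_account_id plus a binary churned column from the 4th month.
# Pivot so each user becomes a single row with one column per (feature, month).
wide = monthly_df.pivot_table(index="user_account_id", columns="month")
wide.columns = [f"{feature}_m{month}" for feature, month in wide.columns]

data = wide.join(labels_df.set_index("user_account_id")["churned"]).dropna()
X = data.drop(columns="churned")
y = data["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```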
I have a bunch of contact data listing which members were contacted with which offer, which summarizes to something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey, I wanted to get some input on whether this is a sensible way to approach the problem (I have started playing around and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure what the approach for that would be in this case.
What I'm hoping to get are interpretable coefficients, so I can see that mailing the 50% off offer in the 3rd mailing is not as impactful as the $25 gift card, etc., and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of the timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some of the many possible combinations are represented, and I'm not sure what problems may arise from that. I've taken some online courses in ML but am new to it, and this is one of my first chances to work with it directly, so I'm hoping I can create something useful out of this. I have access to lots and lots of data; it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this, or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic regression is a good first option; you could also try a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move. You could also convert the discounts to numerical values if you want to keep them in a single column; however, this may not work so well for a linear model like logistic regression, as the relationship will probably not be linear.
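As a minimal sketch of the dummy-variable plus logistic regression approach (the offer and converted column names are assumptions about your data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# df: one row per contact, with an "offer" column (e.g. "50% off", "$25 giftcard")
# and a binary "converted" column for the outcome.
X = pd.get_dummies(df[["offer"]], drop_first=True)   # one dummy column per offer
y = df["converted"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Coefficients are log-odds relative to the dropped baseline offer:
# larger positive values mean higher odds of conversion for that offer.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```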
If you wanted to model the first chart directly, you could use a linear regression to predict the conversion rate. I'm not sure what the difference in doing it that way would be; it's actually something I've been wondering about for a while, and you've motivated me to post a question on stats.stackexchange.com.