Best modeling technique for multiple independent variables - Python

I have time series data with 4 independent variables and 1 dependent variable, and I'm trying to predict the value of the dependent variable from the independent variables. The data is quite complex; I've already tried linear regression, which, as expected, did not work.
I moved on to multivariate polynomial regression, but have been unsuccessful so far because I haven't been able to get the code working. I've also read that multivariate polynomial regression might not be the best approach here.
Is there any other model I could use to predict the value of the dependent variable? My data is entirely numerical, with new data coming in every day. I'm using Python for this exercise.
Any suggestions are helpful and highly appreciated.
Thank you!
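A nonlinear model such as a random forest is a common next step when linear regression underfits purely numerical data. A minimal sketch, using synthetic data standing in for the four independent variables and one dependent variable (all names and numbers here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 4 independent variables, 1 dependent variable
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# A nonlinear relationship that plain linear regression cannot capture
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + X[:, 2] * X[:, 3]

# Fit on the first 150 rows, evaluate on the remaining 50
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:150], y[:150])
r2 = model.score(X[150:], y[150:])  # R^2 on held-out rows
```

For time series with daily new data, hold out the most recent rows (as above) rather than a random split, so the evaluation mimics forecasting.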

Related

How to find correlation between categorical data and continuous data?

I have 20 nominal categorical variables, all independent, each with 2-10 categories. These variables are string-typed and will be used to predict a dependent variable called price, which is continuous.
What method do I use to measure the association of each variable with price and decide on the best variables?
Note: I have not built a machine learning model yet, and I am using Python.
I've tried f_oneway ANOVA (from scipy.stats), but it does not give a correlation; it only compares the groups to each other. I have already found correlations among the continuous variables, both independent and dependent. Help is much appreciated.
I'm not sure about sklearn, but perhaps this information will bring you a step closer.
First of all, when we speak about categorical data, we do not speak about correlation; we speak about association.
Generally speaking, you need an ANOVA, a chi-square test, or something similar to gather information on the association between a categorical variable and a continuous variable.
With ANOVA, we can calculate the inter- and intra-group variance, and then compare them.
Look at this post; it will probably make more sense than my trying to explain:
Click here
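As a concrete sketch of the ANOVA idea above (comparing inter- and intra-group variance), `scipy.stats.f_oneway` runs a one-way ANOVA on the continuous values grouped by category. The data below is made up:

```python
import numpy as np
from scipy.stats import f_oneway

# Made-up example: price observations grouped by a 3-category nominal variable
rng = np.random.default_rng(1)
price_a = rng.normal(100, 10, 50)  # prices where category == "A"
price_b = rng.normal(120, 10, 50)  # category "B" has a higher mean price
price_c = rng.normal(100, 10, 50)  # category "C"

# One-way ANOVA compares between-group variance to within-group variance;
# a small p-value suggests the category is associated with price
f_stat, p_value = f_oneway(price_a, price_b, price_c)
```

Run one such test per categorical variable and rank them by F-statistic (or p-value) to shortlist candidates.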

How to determine most impactful input variables in a dataset?

I have a neural network program that takes input variables and output variables and uses forecasted data to predict what the output variables should be. After running this program, I will have an output vector. Let's say, for example, my input matrix is 100 rows by 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and ranked the variables by that correlation, but I'm wondering if there is a better way to go about this.
If what you want is model selection, it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e. every combination of features. With 10 features that is already 2^10 = 1024 models, and it quickly becomes computationally infeasible as the feature count grows.
Another way is to take a variable you think is a good predictor and train the model on that variable only. Compute the error on the training data. Then add another variable at random, retrain the model, and recompute the training error. If the error drops, keep the variable; otherwise discard it. Keep going for all features.
A third approach is the opposite. Start by training the model on all features and sequentially drop variables (a less naïve approach would be to drop the variables you intuitively think have little explanatory power), compute the error on the training data, and compare it to decide whether to keep each feature.
There are a million ways of going about this. I've shown three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
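The forward (add-one-at-a-time) approach above can be sketched as follows. The `forward_select` helper and the data are hypothetical, and a plain linear model with training MSE stands in for whatever model and error metric you actually use:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, tol=1e-3):
    """Greedy forward selection: repeatedly add the feature that most
    reduces training MSE; stop when no feature improves it by > tol."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_err = np.inf
    while remaining:
        trials = []
        for j in remaining:
            cols = selected + [j]
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            trials.append((np.mean((y - pred) ** 2), j))
        err, j = min(trials)
        if best_err - err < tol:   # no feature helps enough: stop
            break
        best_err = err
        selected.append(j)
        remaining.remove(j)
    return selected

# Made-up data: only columns 0 and 3 actually drive y
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)
selected = forward_select(X, y)
```

Note that selecting on training error favors adding features; validating on held-out data (as chapter 7 of ESL discusses) is the more principled version.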

Understanding multiple linear regression and using Python to accomplish this?

I am wondering what package/library would be best suited for performing multiple linear regression. I've read about it, but the concept still confuses me. Currently I am trying to perform MLR on two files: one is a Shapefile, which is basically like a vector image with a lot of data about a particular state; the other is a raster image of that same state, which has a lot of associated data about the state, such as number of pixels, areas, and things like that. What I am trying to do is perform multiple linear regression on the three variables I have:
impervious surface
developed class
planted/cultivated class.
The instructions I have ask me to:
"Perform multiple linear regression between population density and area percentage of the following surface covers and calculate the R2 of the regression"
I'm not sure what this means. When I asked for further clarification, thinking it meant taking combinations of those three variables and correlating each with a variable called Population_desnity, an associate told me this:
"By multiple regression, I don't mean to run three regressions separately, with each one using one independent variable. A multiple regression is one regression with any number of independent variables, not just two. For this project, you need to use three independent variables in each regression. Search the internet to understand what a multiple linear regression is if you don't have a good understanding of it yet."
I need help understanding MLR in this context and how I would go about programming it into python.
Thank you
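In scikit-learn, "one regression with three independent variables" just means fitting on a matrix with three columns. A minimal sketch with made-up numbers standing in for the real shapefile/raster values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numbers standing in for the real data: one row per region,
# one column per surface cover (area percentage):
# impervious surface, developed class, planted/cultivated class
rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(50, 3))
pop_density = 5 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(size=50)

# ONE regression with all three independent variables at once,
# as the associate's note describes
model = LinearRegression().fit(X, pop_density)
r2 = model.score(X, pop_density)  # the R^2 the instructions ask for
```

The real work is extracting the three area-percentage columns and the population density per region from the Shapefile/raster (e.g. with geopandas/rasterio) into an array like `X` above.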

Using logistic regression for a multiple touch response model (Python/pandas)?

I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey, I wanted to get some input on whether this is a sensible way to approach it (I have started playing around and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are interpretable coefficients, so I can see that mailing the 50% off offer in the 3rd mailing is not as impactful as the $25 gift card, etc., and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of the timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some of the many possible combinations are represented, and I wonder what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work with it directly, so I'm hoping I can create something useful. I have access to lots and lots of data; it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this, or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic regression is a good first option; you could also use a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move. You could also convert the discounts to numerical values if you want to keep them in a single column; however, this may not work as well for a linear model like logistic regression, since the relationship will probably not be linear.
If you wanted to model the first chart directly, you could use a linear regression to predict the conversion rate. I'm not sure what the difference between the two approaches is; it's actually something I've been wondering about for a while, and you've motivated me to post a question on stats.stackexchange.com.
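A minimal sketch of the dummy-variable plus logistic-regression approach described above, using made-up offer names and outcomes:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up contact data: which offer each member got, and whether they converted
df = pd.DataFrame({
    "offer": ["50_pct_off", "25_giftcard", "free_ship", "25_giftcard"] * 50,
    "converted": [0, 1, 0, 1] * 50,
})

# One dummy (0/1) column per offer
X = pd.get_dummies(df["offer"])
y = df["converted"]

model = LogisticRegression().fit(X, y)

# Coefficients are on the log-odds scale: a larger coefficient means the
# offer is associated with a higher conversion probability
coefs = dict(zip(X.columns, model.coef_[0]))
```

To capture timing, you could add dummy columns for the mailing number (or offer-by-mailing interactions), which is where the sparse-matrix concern becomes real: combinations that never occur in the data get no usable coefficient.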

Multiple Logistic Regression in Python

I have a data set as such:
0,1,0,1,1,0,0,1,5
1,1,0,1,1,0,0,1,3
1,1,0,0,1,0,0,1,1
0,1,1,0,1,1,0,0,4
I'm looking for a way to run logistic regression in Python that uses several discrete values (0 or 1) to predict a numerical value (between 1 and 5). This seems useful, but it assumes the predicted variable is binary: http://www.mblondel.org/tlml/logreg.py.html#
Any suggestions?
If getting the job done in R (through one of rpy2, pyRserve, or pyper) is an option, you could use that. For questions about which statistical method to use, Cross Validated (stats.stackexchange.com) is a better place to ask.
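If staying in Python is an option, scikit-learn's `LogisticRegression` handles a 1-5 outcome out of the box via multinomial logistic regression, though it treats the five values as unordered classes. A sketch on synthetic data shaped like the sample above (the relationship between the binary columns and the outcome is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data shaped like the sample above: 8 binary predictors and an
# outcome in 1..5 (here driven by how many of the first 4 columns are 1)
rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(300, 8))
y = 1 + X[:, :4].sum(axis=1)

# Multinomial logistic regression treats 1..5 as unordered classes; an
# ordinal regression model would additionally respect their ordering
model = LogisticRegression(max_iter=1000).fit(X, y)
acc = model.score(X, y)  # training accuracy
```

Since 1-5 is ordered, an ordinal model (or even plain linear regression, if you only need a rough score) may suit the data better, which is exactly the kind of question Cross Validated can settle.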
