I am wondering what package/library would be best suited for performing multiple linear regression. I've read about it, but the concept still confuses me. Currently I am trying to perform MLR using two files I have. One is a shapefile, which is basically a vector dataset with a lot of attributes about a particular state. The other is a raster image of that same state with a lot of associated data about the state such as pixel counts, areas, and so on. What I am trying to do is perform multiple linear regression on the three variables I have:
impervious surface
developed class
planted/cultivated class.
The instructions I have ask me to:
"Perform multiple linear regression between population density and area percentage of the following surface covers and calculate the R2 of the regression"
I'm not sure what this means. When I asked an associate for further clarification, thinking it meant running combinations of those three variables and correlating each with a variable called Population_density, I was told this:
"By multiple regression, I don't mean to run three regressions separately, with each one using one independent variable. A multiple regression is one regression with any number of independent variables, not just two only. For this project, you need to use with three independent variables in each regression. Search the internet to understand what what is a multiple linear regression if you don't have a good understanding of it yet."
I need help understanding MLR in this context and how I would go about programming it in Python.
Thank you
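Here is a minimal sketch of what "one regression with three independent variables" could look like in Python with statsmodels, assuming you have already exported the per-zone values to a table; the file and column names below are hypothetical placeholders for whatever your shapefile/raster attributes are called.

```python
# A minimal sketch of one multiple linear regression with three predictors,
# using statsmodels. Column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("zonal_stats.csv")  # hypothetical export of your attribute table

X = df[["impervious_pct", "developed_pct", "planted_cultivated_pct"]]  # three independent variables
y = df["population_density"]                                           # dependent variable

X = sm.add_constant(X)           # add the intercept term
model = sm.OLS(y, X).fit()       # one regression using all three predictors together

print(model.params)              # intercept and one coefficient per predictor
print("R^2:", model.rsquared)    # the R2 asked for in the instructions
```

The key point is that all three surface-cover variables go into a single `OLS` fit, so you get one coefficient per predictor and a single R² for the combined model.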
Related
I trained a model with the xgboost algorithm in Python and graphed the predicted values vs the actual ones on a scatterplot (see image).
As you can see, there are several outliers (I drew a circle around them) which greatly damage the model, and I would like to get rid of them.
Is there a way in Python to identify the exact rows in a dataframe with multiple independent variables that generate these outliers? [predicted vs actual values]
There is something called anomaly/outlier detection; you should check that out.
Here is a link
There are several algorithms available in Python for multivariate anomaly detection in sklearn, such as DBSCAN, Isolation Forest, and One-Class SVM, and Isolation Forest is generally considered to perform well when the dataset has many attributes. However, before using anomaly/outlier detection algorithms, one needs to identify whether these values are actually outliers or whether they are natural behaviour for the dataset. If they are natural, then rather than removing the records one might have to normalize/bin, apply other feature engineering techniques, or look at more complex algorithms to fit the data. What if the relationship between the target variable and the independent variables is non-linear?
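As a rough illustration of the Isolation Forest route, here is a minimal sketch, assuming `X` is a DataFrame of the independent variables you fed to xgboost and that roughly 1% of the rows are contaminated (both are assumptions you would adjust for your data).

```python
# A minimal sketch of flagging multivariate outliers with Isolation Forest.
import pandas as pd
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=0)  # assumed ~1% outliers
labels = iso.fit_predict(X)          # -1 = outlier, 1 = inlier

outliers = X[labels == -1]           # the exact rows driving the circled points
print(outliers.index.tolist())       # row indices you could inspect or drop
```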
So I have 20 different nominal categorical variables which are independent variables. Each of these variables has 2-10 categories. These independent variables are string type and will be used to predict a dependent variable called price, which is a continuous variable.
What algorithm do I use to find the correlation of each variable and decide on the best variables?
Note: I have not built a machine learning model yet and am using Python.
I've tried f_oneway ANOVA from sklearn, but it does not give a correlation; instead it only compares between the groups themselves. I've already found the correlation between the continuous variables, both independent and dependent. Help is much appreciated.
I'm not sure about sklearn, but perhaps this information will bring you a step closer.
First of all, when we speak about categorical data, we do not speak about correlation, we speak about association.
Generally speaking, you need to use ANOVA, chi-square, or something similar to gather information on the association between a categorical variable and a continuous variable.
With ANOVA, we can calculate the inter- and intra-group variance, and then compare them.
Look at this post; it will probably make more sense than me trying to explain:
Click here
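For a concrete idea of the ANOVA route, here is a minimal sketch using scipy, assuming a DataFrame `df` with a hypothetical categorical column `"category_1"` and your continuous target `"price"`; you would repeat this for each of the 20 categorical variables and rank them.

```python
# A minimal sketch of a one-way ANOVA F-test between one categorical predictor
# and a continuous target. Column names are hypothetical.
import pandas as pd
from scipy.stats import f_oneway

# Split the target into one group of values per category level
groups = [grp["price"].values for _, grp in df.groupby("category_1")]

# Compares between-group variance to within-group variance
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value (large F) suggests the category levels differ in mean price, i.e. a stronger association with the target.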
I have time series data with 4 independent variables and 1 dependent variable. I'm trying to predict the value of the dependent variable using the independent variables. The data is quite complex; I've already tried linear regression, which as expected did not work.
I proceeded to multivariate polynomial regression, but have been unsuccessful so far because I haven't been able to get the code going. I also read somewhere that multivariate polynomial regression might not be the best approach.
Is there any other model that I could use to predict the value of the dependent variable? My data is entirely numerical, with new data coming in every day. I'm using Python for this exercise.
Any suggestions are helpful and highly appreciated.
Thank you!
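For what it's worth, here is a minimal sketch of multivariate polynomial regression with scikit-learn, assuming the four predictors are in an array `X` and the target in `y` (both hypothetical names); this is the usual way to "get the code going" for that approach.

```python
# A minimal sketch of multivariate polynomial regression via a pipeline:
# expand the features into polynomial/interaction terms, then fit a linear model.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # squares and interaction terms
    LinearRegression(),
)
model.fit(X, y)
print("R^2:", model.score(X, y))
```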
I have a bunch of contact data listing which members were contacted with which offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey, I wanted to get some input on whether this is a sensible way to approach it (I have started playing around and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable, so I can see that mailing the 50% off offer in the 3rd mailing is not as impactful as the $25 gift card, etc., and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some of the many possible combinations are represented, and I wonder what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work with it directly, so I'm hoping I can create something useful out of this. I have access to lots and lots of data; it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this, or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0), as in the second chart, then a classification model is appropriate. Logistic regression is a good first option; you could also use a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column, but this may not work so well for a linear model like logistic regression, as the relationship will probably not be linear. A sketch of the dummy-variable approach follows.
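Here is a minimal sketch of the dummy-variable plus logistic-regression idea, assuming a DataFrame `df` with hypothetical columns `"offer"` (categorical) and `"converted"` (1/0 target).

```python
# A minimal sketch: one-hot encode the offers and read the logistic coefficients.
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.get_dummies(df["offer"], prefix="offer")   # one 0/1 column per offer
y = df["converted"]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient (log-odds) per offer dummy; larger values mean the offer is
# associated with a higher conversion probability.
coefs = pd.Series(clf.coef_[0], index=X.columns).sort_values(ascending=False)
print(coefs)
```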
If you wanted to model the first chart directly, you could use linear regression to predict the conversion rate. I'm not sure what the difference is between the two approaches; it's actually something I've been wondering about for a while, and you've motivated me to post a question on stats.stackexchange.com.
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these, a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make about its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case estimating the parameters for each one and then selecting the closest match is pretty simple.
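As a rough sketch of that single-Gaussian idea, assuming `new_points` and each entry of `known_sets` are hypothetical NumPy arrays of shape (n, 2): fit a mean and covariance to each known set, score the new points under each fit, and pick the best.

```python
# A minimal sketch: fit a Gaussian to each reference set and compare log-likelihoods.
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(points, reference):
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    return multivariate_normal(mean=mu, cov=cov).logpdf(points).sum()

scores = [log_likelihood(new_points, ref) for ref in known_sets]
best_match = int(np.argmax(scores))   # index of the closest known distribution
```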
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
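A minimal sketch of matching by orientation, again assuming hypothetical (n, 2) arrays `new_points` and `known_sets`: take the first principal axis of each point set and compare directions.

```python
# A minimal sketch: compare the first principal component of each point set.
import numpy as np

def first_component(points):
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                      # unit vector along the main axis

v_new = first_component(new_points)
# |cos angle| between main axes; values near 1 mean the most similar orientation
similarity = [abs(np.dot(v_new, first_component(ref))) for ref in known_sets]
best_match = int(np.argmax(similarity))
```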
You could just fit each candidate distribution to the data, determine the chi² deviation for each one, and look at an F-test. See for instance these notes on model fitting.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, gaussian_kde in scipy.stats provides an implementation.
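A minimal sketch of that non-parametric route, with the same hypothetical (n, 2) arrays as above: estimate a density from each known point set and evaluate the new points under it.

```python
# A minimal sketch: compare point sets via Gaussian kernel density estimates.
import numpy as np
from scipy.stats import gaussian_kde

def kde_score(points, reference):
    kde = gaussian_kde(reference.T)        # gaussian_kde expects shape (dims, n)
    return np.log(kde(points.T)).sum()     # summed log-density of the new points

scores = [kde_score(new_points, ref) for ref in known_sets]
best_match = int(np.argmax(scores))
```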