I am currently working on a project for which I have simulated/mock-up data. The data consists of multiple features, of which only one affects the response variable. This is a very simplified use case, since it is only for demo purposes.
I have used a basic random forest regression (scikit-learn) to predict the dependent variable. The model performs rather well, which was expected given the simplicity of the data. What I am having problems with is plotting a regression curve of the model (Remaining Useful Life is the dependent variable and temp is the feature affecting it). I am using pyplot for this but am not getting the expected result (see below). I would have expected the plot to be roughly the bottom curve, and I am not sure why the straight lines above it are there.
To clarify what I was expecting to get:
Below is a scatter plot of the same data.
My questions regarding this:
Why is the plot coming out like this? Does it have something to do with how RF works?
Is there a way of getting a "clean" regression curve (e.g. the shape of the scatter plot, but as a single line)? If so, how can this be achieved?
Code I am using for the plot:
plt.plot(y_hat_train_rf, X_train[['temp']], color='k')
Thanks to F. Gyllenhammar's comment I have found a solution. This should be obvious to experienced people, but I will share it nevertheless. The underlying issue is that plt.plot connects the points in the order they are passed, so unsorted x-values make the line jump back and forth across the plot, which produces the straight lines above the curve.
Steps to solve (a short sketch follows below):
Create a new DataFrame that joins x and y.
Sort it by x.
Plot.
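In code, a minimal sketch of those steps, using the variable names from the question (note that plt.plot takes the x-values first and the y-values second):

import pandas as pd
import matplotlib.pyplot as plt

# Join the feature and the predictions, then sort by the feature so the
# curve is drawn once from left to right instead of criss-crossing.
df_plot = pd.DataFrame({'temp': X_train['temp'].values,
                        'RUL_pred': y_hat_train_rf})
df_plot = df_plot.sort_values('temp')

plt.plot(df_plot['temp'], df_plot['RUL_pred'], color='k')
plt.xlabel('temp')
plt.ylabel('Remaining Useful Life (predicted)')
plt.show()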
Related
I have a dataset with more than 50 columns and I'm trying to find a way in Python to run a simple linear regression between each combination of variables. The goal is to find a starting point for further analysis (i.e., I will delve deeper into those pairs that have a somewhat significant R-squared).
I've put all my columns in a list of numpy arrays. How could I go about running a simple linear regression between each combination and, for each combination, printing the R-squared? Would it also be possible to try a multiple linear regression, with up to 5-6 variables, again for each combination?
Each array has ~200 rows, so code efficiency in terms of speed should not be a big issue for this personal project.
If you are looking for columns with high R-squared values, just try a correlation matrix. To ease the visualization, I would recommend plotting it as a heat map using seaborn:
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between all numeric columns of the DataFrame
df_corr = df.corr()

# Heat map with the coefficient annotated in each cell
sns.heatmap(df_corr, cmap="coolwarm", annot=True)
plt.show()
Another suggestion I have for you is to run a Principal Component Analysis (PCA) on your dataset to find the features with the highest variability. Usually these variables are the most important and can be used to make the best predictions. Just let me know if you want more info on this technique.
This is more of an EDA problem than a Python problem. Look into some regression resources, specifically a correlation matrix. However, one possible solution could use itertools.combinations with a group size of 6. With 50 columns that gives C(50, 6) = 15,890,700 different options for running a regression, so unless you want to run more than 15 million regressions you should do some EDA to find the important features in your dataset.
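For the pairwise (simple) case, here is a minimal sketch using scipy.stats.linregress on stand-in data (replace arrays/names with your own list of column arrays and their names); for the multiple-regression variant you could draw size-5 or size-6 combinations the same way and fit, e.g., sklearn.linear_model.LinearRegression, but as noted above the combinatorics explode:

from itertools import combinations
import numpy as np
from scipy import stats

# Stand-in data: replace with your own list of ~200-row column arrays.
rng = np.random.default_rng(0)
arrays = [rng.normal(size=200) for _ in range(5)]
names = [f"col{i}" for i in range(len(arrays))]

# Simple linear regression for every pair of columns, collecting R^2
results = []
for (i, x), (j, y) in combinations(enumerate(arrays), 2):
    fit = stats.linregress(x, y)
    results.append((names[i], names[j], fit.rvalue ** 2))

# Show the pairs with the highest R^2 first
for xname, yname, r2 in sorted(results, key=lambda t: t[2], reverse=True):
    print(f"{yname} ~ {xname}: R^2 = {r2:.3f}")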
I have a dataset that you see below. The data is pretty noisy, but there is a clear linear trend that goes up and to the right. I'd like to transform the data with y = m * x to make the lines horizontal. Essentially, I'd like to do a regression on the orange lines to pull out the slope, but I don't know how to extract the different linear clusters. Is there a good method for transforming data like this? I'm using python/pandas/numpy.
It looks like you'll want to try clustering the orange points. Some clustering methods will cope with the parallel clusters. I would probably start with DBSCAN.
For more on clustering, check out the tutorial on this scikit-learn page. Your situation is a bit like the fourth row of the comparison plot shown there.
If you provide your data, I expect several people will take a look at it.
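In the meantime, here is a minimal sketch of the idea on synthetic stand-in data: cluster the points with DBSCAN, then fit a line per cluster to pull out the slopes (eps and min_samples will need tuning for the scale of your data):

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in data: three parallel lines of slope 2 with vertical offsets.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + rng.choice([0, 3, 6], 300) + rng.normal(scale=0.2, size=300)
points = np.column_stack([x, y])

# DBSCAN groups points by density, so well-separated parallel lines
# come out as separate clusters; -1 marks noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Fit a line to each cluster and report its slope and intercept
for k in set(labels) - {-1}:
    cluster = points[labels == k]
    m, b = np.polyfit(cluster[:, 0], cluster[:, 1], 1)
    print(f"cluster {k}: slope = {m:.2f}, intercept = {b:.2f}")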
I would like to know whether in Python, and more precisely in the lmfit library, there is an option for fitting data piecewise. I would like to fit data defined over different ranges and then obtain a single overall fit.
Thank you
Without a more concrete example it is hard to give a concrete answer. But if I understand your question correctly, you are looking to fit one specific region of your data, then fit another region (probably with a different functional form), and then perhaps combine the multiple regions to get a final fit.
If that is correct, then yes, this can be done with lmfit (and probably with other libraries as well). Let's say you want to fit data that is peak-like with an exponentially decaying background. First, isolate a region around the peak (it doesn't have to be perfect) and fit a peak (say, a Gaussian) to it. Then fit an exponential decay to all the data except the peak area. (Aside: numpy.where can be very useful for identifying the regions.) Finally, combine the two and fit the whole curve to peak + background.
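A minimal runnable sketch of those steps, on synthetic stand-in data, using lmfit's built-in GaussianModel and ExponentialModel:

import numpy as np
from lmfit.models import GaussianModel, ExponentialModel

# Stand-in data: a Gaussian peak near x = 5 on a decaying background.
x = np.linspace(0, 10, 201)
rng = np.random.default_rng(0)
y = (np.exp(-x / 3.0)
     + 0.5 * np.exp(-(x - 5.0) ** 2 / 0.5)
     + rng.normal(scale=0.01, size=x.size))

# 1) Fit a Gaussian to the region around the peak (numpy.where picks it out).
peak = np.where((x > 4) & (x < 6))
gmodel = GaussianModel(prefix='g_')
gfit = gmodel.fit(y[peak], gmodel.guess(y[peak], x=x[peak]), x=x[peak])

# 2) Fit an exponential decay to everything outside the peak region.
bkg = np.where((x < 4) | (x > 6))
emodel = ExponentialModel(prefix='e_')
efit = emodel.fit(y[bkg], emodel.guess(y[bkg], x=x[bkg]), x=x[bkg])

# 3) Combine the two models and fit the whole curve, seeding the
#    parameters with the values found in the piecewise fits.
model = gmodel + emodel
params = gfit.params.copy()
params.update(efit.params)
result = model.fit(y, params, x=x)
print(result.fit_report())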
If that is too vague and doesn't point you in the right direction, please make the question more specific.
Let's say the price of houses (target variable) can easily be plotted against the area of houses (predictor variable): we can see the data plotted and draw a best-fit line through it.
However, suppose we have several predictor variables (size, no. of bedrooms, locality, no. of floors, etc.). How am I going to plot all of these against the target variable and visualize them on a 2-D figure?
The computation shouldn't be an issue (the math works regardless of dimensionality), but the plotting definitely gets tricky. PCA can be hard to interpret and forcing orthogonality might not be appropriate here. I'd check out some of the advice provided here: https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Fundamentally, it depends on what you are trying to communicate. Goodness of fit? Maybe throw together multiple plots of residuals.
If you truly want a 2D figure, that's certainly not easy. One possible approach would be to reduce the dimensionality of your data to 2 using something like Principal Component Analysis and then plot it in two dimensions again, as sketched below. Reducing to 3 dimensions instead of 2 might also still work; humans can understand 3D plots drawn on a 2D screen fairly well.
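A minimal sketch of that idea on stand-in data (the predictors X and target y below are synthetic placeholders; the color encodes the target so its relation to the projected predictors stays visible):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in predictors (size, bedrooms, ...) and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Project the predictors down to two principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Scatter the projected points, colored by the target value
sc = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.colorbar(sc, label='price')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()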
You don't normally need to do linear regression by hand though, so you don't need a 2D drawing of your data either. You can just let your computer compute the linear regression, and that works perfectly fine with way more than 2 or 3 dimensions.
I'm working in a space which has 8 dimensions (i.e. 8 features). I have plotted the data points in 2D by applying PCA as well as TSNE. Now I would like also to draw the borderlines of the classifiers I use as shown here. By the way, I'm using different classifiers (SVM, GNB, Logistic Regression).
This means that I have the different 8-dimensional points which I plot in 2D using PCA or TSNE. On top of this plot I would like to plot the different classification regions as shown in the link above.
Of course the classification boundaries/regions are also 8-dimensional. How can I turn the classification boundaries/regions into 2D matching my 2D data points?
Interesting question; I once wondered about it myself.
It can be answered in several ways, with more or less detail depending on whether you want to fully understand the method or just apply it.
Since you don't give a lot of detail but do include a sklearn link, I will first answer from a technical point of view: "How can you do it with sklearn?"
There is a function for this: transform(X, y=None), which applies the PCA projection (yes, PCA is a projection from a high-dimensional space to a lower-dimensional one).
So you basically just need to call transform(your_boundaries) to apply it.
In terms of pseudocode this would give:
pca = PCA(n_components=2).fit(data)
boundaries_2d = pca.transform(boundaries)
Et voilà!
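For a fuller picture, here is a minimal runnable sketch of one common way to draw the regions themselves (a slightly different trick than transforming boundary points): sample a grid in the 2-D PCA plane, map each grid point back to the original space with inverse_transform, and ask the classifier for its prediction there. Note this relies on PCA being an invertible linear projection; t-SNE has no inverse_transform, so it only works for the PCA view. The 8-D data and the SVC below are stand-ins for your own X, y, and classifiers:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Stand-in 8-dimensional data; replace with your own X (n, 8) and labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC().fit(X, y)                 # any of your classifiers works here
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# Grid in the 2-D PCA plane, mapped back to 8-D for prediction
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
grid_8d = pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
Z = clf.predict(grid_8d).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)    # classification regions
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolor='k', s=20)
plt.show()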
Do not hesitate to give more details or ask questions; I could add some specific development if it is relevant.
Hope it helps,
pltrdy