Automatic Linear/Multiple Regression in Python with 50+ columns

I have a dataset with more than 50 columns and I'm trying to find a way in Python to run a simple linear regression between each combination of variables. The goal here is to find a starting point for furthering my analysis (i.e., I will delve deeper into those pairs that have a somewhat significant R-squared).
I've put all my columns in a list of numpy arrays. How could I go about running a simple linear regression between each combination and, for that combination, printing the R-squared? Is it also possible to try a multiple linear regression, with up to 5-6 variables, again for each combination?
Each array has ~200 rows, so code efficiency in terms of speed would not be a big issue for this personal project.

If you are looking for columns with high R-squared values, just try a correlation matrix. To ease the visualization, I would recommend plotting a heat map using seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
df_corr = df.corr()  # pairwise correlations between all numeric columns
sns.heatmap(df_corr, cmap="coolwarm", annot=True)  # annotate each cell with its correlation
plt.show()
Another suggestion I have for you is to run a Principal Component Analysis (PCA) on your dataset to find the features with the highest variability. Usually, these variables are the most important and can be used to make the best predictions. Just let me know if you want more info on this technique.
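As a rough sketch of that idea (a minimal example assuming df is the same pandas DataFrame used above; nothing here is specific to your columns):
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df)  # PCA is scale-sensitive, so standardize first
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
# Loadings: how strongly each original column contributes to each component
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=['PC%d' % (i + 1) for i in range(pca.n_components_)])
print(loadings.iloc[:, :3])  # inspect the first few components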

This is more of an EDA problem than a Python problem. Look into some regression resources, specifically a correlation matrix. However, one possible approach could use itertools.combinations with a group size of 6. For 50 columns that gives 15,890,700 different combinations, so unless you want to run more than 15 million regressions you should do some EDA first to find the important features in your dataset.
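If you do want to enumerate combinations, a minimal sketch (assuming the columns sit in a pandas DataFrame df; 'target_col' below is only a placeholder for whichever column you treat as the response) could look like this:
import itertools
from scipy import stats
import statsmodels.api as sm
# Simple linear regression for every pair of columns, recording R-squared
pair_r2 = {}
for col_x, col_y in itertools.combinations(df.columns, 2):
    slope, intercept, r_value, p_value, std_err = stats.linregress(df[col_x], df[col_y])
    pair_r2[(col_x, col_y)] = r_value ** 2
# Print the ten pairs with the highest R-squared
for pair, r2 in sorted(pair_r2.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(pair, round(r2, 3))
# Multiple regression over combinations of predictors; with 5-6 predictors per model
# the number of combinations explodes, so keep the candidate set small
multi_r2 = {}
predictors = [c for c in df.columns if c != 'target_col']
for combo in itertools.combinations(predictors, 3):
    X = sm.add_constant(df[list(combo)])
    multi_r2[combo] = sm.OLS(df['target_col'], X).fit().rsquared
best = max(multi_r2, key=multi_r2.get)
print(best, round(multi_r2[best], 3))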

Related

How do I conduct a Multilevel Model/Regression in Python?

I have yearly data over time (longitudinal data) with repeated measures for many of the subjects. I think I need multilevel modeling/regressions to deal with the sure-to-be correlated clusters of measurements for the same individuals over time. The data is currently in separate tables for each year.
I was wondering if there was a way built into scikit-learn, like LinearRegression(), that would be able to conduct a multilevel regression where Level 1 is all the data over the years, and Level 2 is clustered on the subjects (a cluster for each subject's measurements over time). And if so, whether it's better to have the longitudinal data laid out length-wise (where each subject's measures over time are all in one row) or stacked (where each measure for each year is its own row).
Is there a way to do this?
Estimation of random effects in multilevel models is non-trivial and you typically have to resort to Bayesian inference methods.
I would suggest you look into Bayesian inference packages such as pymc3 or brms (if you know R) where you can specify such a model. Alternatively, look at the lme4 package in R for a fully frequentist implementation of multilevel models.
Also, I think you could get some inspiration from the "sleep deprivation" dataset, which is used as a textbook example of longitudinal data analysis (https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf, p. 4).
To get started in pymc3, have a look here:
https://github.com/fonnesbeck/Bios8366/blob/master/notebooks/Section4_7-Multilevel-Modeling.ipynb
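As a rough illustration (not taken from the notebook above), a varying-intercept model in pymc3 might look like the sketch below; the data names (y, year, subject_idx) are made up and assume the data is stacked in long format, one row per subject per year:
import numpy as np
import pymc3 as pm
# Hypothetical long-format arrays: y (outcome), year (time covariate),
# subject_idx (integer code 0..n_subjects-1 identifying each subject)
n_subjects = len(np.unique(subject_idx))
with pm.Model() as model:
    # Population-level parameters
    mu_a = pm.Normal('mu_a', mu=0.0, sigma=10.0)
    sigma_a = pm.HalfNormal('sigma_a', sigma=5.0)
    beta = pm.Normal('beta', mu=0.0, sigma=10.0)
    # Subject-level random intercepts (the "Level 2" clustering)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_subjects)
    # Likelihood
    sigma_y = pm.HalfNormal('sigma_y', sigma=5.0)
    y_obs = pm.Normal('y_obs', mu=a[subject_idx] + beta * year, sigma=sigma_y, observed=y)
    trace = pm.sample(1000, tune=1000)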

K Means Clustering: What does it mean about my input features if the Elbow Method gives me a straight line?

I am trying to cluster retail data in order to extract groupings of customers based on 6 input features. The data has a shape of (1712594, 6) in the following format:
I've split the 'Department' categorical variable into a binary n-dimensional array using Pandas get_dummies(). I'm aware this is not optimal, but I just wanted to test it out before trying Gower distances.
The Elbow method gives the following output:
USING:
I'm using Python and scikit-learn's KMeans because the dataset is so large and the more complex models are too computationally demanding for Google Colab.
OBSERVATIONS:
I'm aware that columns 1-5 are extremely correlated, but the data is limited sales data and little to no data is captured about customers. KMeans is very sensitive to its inputs, and this may affect the WCSS in the elbow method and cause the straight line, but this is just a hunch and I don't have any quantitative backing to support the argument. I'm a junior data scientist, so my knowledge of the technical foundations of clustering models and algorithms is still developing; forgive me if I'm missing something.
WHAT I'VE DONE:
There were massive outliers that were skewing the data (this is a building goods company, so most of their sale prices and quantities fall within a certain range), but ~5% of the data contained massive quantity entries (e.g. a company buying 300,000 bricks at R3/brick) or massive price entries (e.g. a company buying an expensive piece of equipment).
I've removed them and kept ~94% of the data. I've also removed the returns made by customers (i.e. negative quantities and prices), with the idea that I may create a binary variable 'Returned' to capture this feature. Here are some metrics:
These are some metrics before removing the outliers:
and these are the metrics after outlier removal:
KMeans uses Euclidean distances. I've used both scikit-learn's StandardScaler and RobustScaler when scaling, without any significant changes from either. Here are some distribution plots and scatter plots for the 3 numeric variables:
Does anybody have any practical/intuitive reasoning as to why this may be happening? I'm open to alternative methods as well, and any help would be much appreciated! Thanks
I am not an expert, but in my experience with scikit-learn cluster analysis, when the features are really similar in magnitude k-means clustering usually does not do the job. I would first try a StandardScaler to see if normalizing the data makes the clustering more effective. The elbow plot shows the WCSS dropping steadily as n_clusters grows, and by the looks of that plot and the plots you provide, I would think the data is too similar, making it hard to separate into groups (clusters). Adding an additional feature made up of your data can do the trick.
I would try normalizing the data first with StandardScaler.
If the groups are still not very clear with a simple plot of the data, I would create another column made up of a combination of the other columns.
I would not suggest using DBSCAN, since the eps parameter (distance) would have to be tuned very finely and, as you mention, it is more computationally expensive.
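A minimal sketch of the scaling-plus-elbow suggestion (assuming X is the numeric feature matrix; the variable names are only illustrative):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)  # put all features on a comparable scale
wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(km.inertia_)  # within-cluster sum of squares
plt.plot(list(ks), wcss, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('WCSS (inertia)')
plt.show()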

PCA on large dataset

I have a large dataset consisting of 6 input variables (temperatures, pressures, flow rates, etc.) that give outputs such as yield, purity and conversion.
There are a total of approx 47600 instances and this is all in an excel spreadsheet.
I have applied both artificial neural network and random forest algorithms on this data and obtained predicted plots and accuracy metrics. (in Python)
The random forest model has a feature that gives input variable importance.
I would now like to perform a PCA on this data to firstly compare to the random forest results, as well as to obtain more information on how my input data interacts with each other to give my output.
I've watched a few YouTube videos and tutorials to get my head around PCA, however the data they use is quite different from mine.
Below is a snippet of my data. The first 6 columns are inputs and the last 3 are outputs.
How can I analyse this using PCA? I have managed to plot it in Python, however the plot is very busy and doesn't give much information.
Any help or tips are welcome! Perhaps a different analysis tool? I don't mind using Python or MATLAB.
Thank you :)
I suggest using the KarhunenLoeveSVDAlgorithm in OpenTURNS. It provides implementations of a randomized SVD algorithm. The constraint is that the number of singular values to be computed has to be set beforehand.
In order to enable the algorithm, we must set the KarhunenLoeveSVDAlgorithm-UseRandomSVD key in the ResourceMap. The KarhunenLoeveSVDAlgorithm-RandomSVDMaximumRank key then sets the number of singular values to compute (by default, it is equal to 1000).
Two implementations are provided:
Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky and Mark Tygert. An algorithm for the principal component analysis of large data sets.
These algorithms can be chosen with the KarhunenLoeveSVDAlgorithm-RandomSVDVariant key.
In the following example, I simulate a large process sample from a Gaussian process with an AbsoluteExponential covariance model.
import openturns as ot
# 2-D mesh on [-1, 1] x [-1, 1] with 10 intervals per dimension
mesh = ot.IntervalMesher([10]*2).build(ot.Interval([-1.0]*2, [1.0]*2))
s = 0.01  # truncation threshold for the Karhunen-Loeve decomposition
model = ot.AbsoluteExponential([1.0]*2)  # covariance model of the Gaussian process
sampleSize = 100000
sample = ot.GaussianProcess(model, mesh).getSample(sampleSize)
Then the random SVD algorithm is used:
ot.ResourceMap_SetAsBool('KarhunenLoeveSVDAlgorithm-UseRandomSVD', True)  # enable the random SVD
algorithm = ot.KarhunenLoeveSVDAlgorithm(sample, s)  # decompose the process sample with threshold s
algorithm.run()
result = algorithm.getResult()  # KarhunenLoeveResult holding the modes and eigenvalues
The result object contains the Karhunen-Loève decomposition of the process. This corresponds to the PCA with a regular grid (and equal weights).
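As a short follow-up (assuming the KarhunenLoeveResult API), the eigenvalues can be inspected to see how much variance each mode carries:
eigenvalues = result.getEigenvalues()  # one value per retained mode, in decreasing order
print(eigenvalues)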

How to visualize k-means of multiple columns

I'm not a data scientist, but I am intrigued by data science, machine learning, etc.
In my efforts to understand all of this, I am continuously building a dataset (daily scraping) of Grand Exchange prices from one of my favourite games, Old School RuneScape.
One of my goals is to pick a set of stocks/items that would give me the most profit. Currently I am trying out clustering with k-means, to find stocks that are similar to each other based on some basic features I could think of.
However, I have no clue if what I'm doing is correct.
For example:
(y = kmeans.fit_predict(df_items): my item_id is included in this, so is it actually considering item_id as a feature now?)
And how do I even visualise the outcome of this? I mean, what goes on the x-axis and what goes on the y-axis? I have multiple columns...
https://github.com/extreme4all/OSRS_DataSet/blob/master/NoteBooks/Stock%20Picking.ipynb
To visualize something you have to reduce the dimensionality to 2-3 dimensions, plus you can use color as a 4th dimension, or in your case to indicate the cluster number.
t-SNE is a common choice for this task; check the sklearn docs for details: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
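A minimal t-SNE sketch (assuming df_items holds the features without item_id and y holds the cluster labels from kmeans.fit_predict):
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
embedding = TSNE(n_components=2, random_state=0).fit_transform(df_items)  # project to 2-D
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='tab10', s=10)  # color points by cluster
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.show()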
Choose almost any visualization technique for multivariate data.
Scatterplot matrix
Parallel coordinates
Dimensionality reduction (PCA makes more sense for k-means than t-SNE, but also consider Fisher's LDA, LMNN, etc.)
Box plots
Violin plots
...

Principal Component Analysis with correlated features and outliers

I am performing PCA on a dataset of shape (300, 1500) using scikit-learn in Python 3.
I have the following questions in the context of the PCA implementation in scikit-learn and the generally accepted approach.
1) Before doing PCA, do I remove highly correlated columns? I have 67 columns with correlation > 0.9. Does PCA automatically handle this correlation, i.e. ignore them?
2) Do I need to remove outliers before performing PCA?
3) If I have to remove outliers, what is the best approach? When I tried to remove outliers using a z-score for each column (z-score > 3), I was left with only 15 observations. That seems like the wrong approach.
4) Finally, is there an ideal amount of cumulative explained variance that I should use to choose the number of components? In this case, around 150 components give me 90% cumulative explained variance.
With regards to using PCA, PCA will discover the axes of greatest variance in your data. Consequently:
No, you do not need to remove correlated features.
You shouldn't need to remove outliers for any a priori reason related to PCA. That said, if you think they are potentially distorting your results, either for analysis or prediction, you could consider removing them, although I don't think they are a problem for PCA per se.
That is probably not the right approach. First things first: visualize your data and look for the outliers. Also, I wouldn't assume the distribution of your data and apply a basic z-score to it. Some googling on criteria for removing outliers would be useful here.
There are various cutoffs people use with PCA. 99% can be quite common, although I don't know if there is a hard and fast rule. If your goal is prediction, then there will probably be a trade-off between speed and the accuracy of your predictions. You will need to find the cutoff that suits your needs.
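For point 4, a minimal sketch of picking a cutoff from the cumulative explained variance (assuming X is the (300, 1500) data matrix):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components reaching a chosen cutoff, e.g. 90%
n_components = int(np.argmax(cumvar >= 0.90)) + 1
print(n_components, cumvar[n_components - 1])
# sklearn can also do this selection directly: PCA(n_components=0.90)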
