Regression method with high precision - python

I would like to ask suggestions for my data set. As I am not familiar with machine learning or data science, I would like to get help from you guys.
I have four features, about a million rows each of them with one output. The final purpose is to make a fine regression with high precision. I have tried one regression method, and it seems like due to lot of samples there seems to be non-universal regression equation that fits for million rows.
Are there any methods I can try? One idea I thought of is to do multiple regression by truncating data rows, but then what should I do with all these equations to somehow make "one universal" equation or at least, minimize the number of the equation as much as possible to make quasi-universal?
Thank you in advance.

Scikit learn is a great package to do this sort of things (https://scikit-learn.org/stable/)
There exists tons of methods to perform this kind of regression task. The first I would try are LinearRegression, RandomForestRegressor, AdaBoost.
Now, you must prescribe a relevant metric to measure the success of the regression (l2 distance is the most common, but it may be ill-suited for your problem).

Related

Classification algorithms that work well with high dimensional dataset?

I have a dataset with low data points but very high dimensions/features. I wanted to know if there's any classification algorithm that work well with such dataset without having to perform dimensionality reduction techniques such as PCA, TSNE?
df.shape
(2124, 466029)
This is the classic curse of dimensionality (or p>>n) problem (with p being the number of predictors and n the number of observations).
Many techniques have been developed to try and address this problem.
You can randomly restrict your variables (you select different random subsets) and then asses their importance using cross-validation.
A preferable approach (imho) would be to use ridge-regression, lasso, or elastic net for regularization, however be aware that their oracle properties are rarely satisfied in practice.
Finally, there are algorithms that are able to deal with a very large number of predictors (and tweaks in their implementation that improve the performance when p>>n).
Examples of such models are support vector machine or random forest.
There are many resources on the topic, which are freely available.
You can have a look at these slides from Duke University for example.
Oracle properties (Lasso)
I will not explain in a sound mathematical way but I'll briefly give you some intuition.
Y= dependent variable, your target
X= regressors, your features
ε= your errors
We define a shrinkage procedure oracle if it is asymptotically able to:
Identify the right subset of regressors (i.e. retain only the features that have a true causal relationship with your dependent variable.
Has an optimal estimation rate (I'll leave the details out)
There are three assumptions that, if satisfied, make the lasso oracle.
Beta-min condition: The coefficients of the "true" regressors is above a certain threshold.
Your regressors are uncorrelated with each other.
X and ε are normally distributed and homoskedatisc
In practice you rarely have these assumptions satisfied.
What happens in that case is that your shrinkage will not necessarily retain the right variables.
This implies that you can't make statistically sound inference on the final model (you can't say X_1 explains Y for this and this other reason).
The intuition is simple. If assumption 1 is not satisfied one of the true variables might be incorrectly removed. If assumption 2 is not satisfied then a variable highly correlated with one of the true variables might be incorrectly retained in stead of the right one.
All in all, you shouldn't worry if your aim is forecasting. Your forecast will still be good! The only difference is that mathematically you can't say anymore that you are selecting the correct variables with probability -> 1.
PS: Lasso is a special case of elastic net, I vaguely remember that the oracle property of the elastic net has been proved as well but I might be wrong.
PPS: Corrections are appreciated as I haven't studied these things in a long while and there might be inaccuracies.
You could try a lasso/ridge/elastic net logistic regression.

Why random_state parameter is used in NMF and LDA algorithm ? What are the benefits of using random topics generated every time?

For Topic Modelling ,
Why random_state parameter is used in NMF and LDA algorithm ? What are the benefits of using random topics generated every time ?
The algorithms for both are stochastic - meaning they use randomness as a part of estimating a good answer. It's done that way to make it tractable, and in the case of LDA, the whole model is stochastic, providing you ideally with a probabilistic distribution (called "the posterior distribution") of answers, but instead providing a single, likely answer as an estimate.
So the answer is that using randomness in the algorithms makes a tremendously difficult problem much simpler and feasible to calculate in less than a hundred years.
If you're going to use them, I think it would do you well to study them, learn something of how they work, why they work. Using a tool that you don't understand is risky, as you don't really know what the result the tool provides actually means. One example is the numerors words in all "topics" with very low probability. The differences in these probabilities are actually meaningless - given a different sample from the posterior, you'd get different probabilities, ranked differently between words.

Algorithms to model non-linear relationship between two vectors

I want to build a model that describes a curve that fits the data shown in the scatterplot. I thought it would be straight forward using sklearn. But the choice and application of the different methods gets rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As to models, I would suggest natural cubic splines for this data. They are simple and involve cutting the data up into windows which you fit with cubic polynomials (a special case of which is a line).
You might also consider smoothing splines, and local regression.
For information on these, see the free online textbook, An Introduction to Statistical Learning.

Using logistic regression for a multiple touch response model (python/pandas)?

I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey I wanted to get some input if this is a sensible way to approach this (I have started playing around but and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable - so I can see that Mailing the 50% off offer in the 3d mailing is not as impactful as the $25 giftcard etc, and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some combinations of the many possible are respresented, and what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it so I'm hoping I could create something useful out of this. I have access to lots and lots of data, it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic Regression is a good first option, you could also a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column, however this may not work so well for a linear model like logistic regression as the correlation will probably not be linear.
If you wanted to model the first chart directly you could use a linear regressions for predicting the conversion rate, I'm not sure about the difference is in doing this, it's actually something I've been wondering about for a while, you've motivated me to post a question on stats.stackexchange.com

Predictions with ARIMA (python statsmodels)

I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
exog = np.column_stack([df_arima['log_var_diff_wk'],
df_arima['log_var_diff_mth'],
df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog = exog, order=(1,0,1))
results_ARIMA = model.fit()
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redfine exog to just be the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
df_arima_predict['log_val_diff_mth'],
df_arima_predict['log_val_diff_yr']])
arima_predict = results_ARIMA.predict(start=training_cut_date, end = '2017-01-01', dynamic = False, exog = exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?
I have a similar problem atm which I have not entirely figured out yet. It seems including multiple seasonal terms in python is still a bit tricky. R does seem to have this capacity, see here. So, one suggestion I can give you is to try this with the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not familiar with R yet).
Looking at your approach for modeling the seasonal patterns, taking the nth order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results. In such cases, model prediction might turn out fairly well. Conversely, if the differences are big, including them can easily distort prediction results. This could explain the variation you are seeing in your modeling results. Conceptually, then, what you'd want to do instead is represent the constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use the Welch signal decomposition form SciPy's signal module. What this does is return a spectral density analysis of your time series, from which you can deduce signal strength at various frequencies in your time series.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
This is about as far as I have gotten myself at this point - right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends but hey, a guy can dream, right?) edit: This blog post by Rob Hyneman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems python is not very well suited to handle multiple seasonal terms right now, R might be a better solution (see reference);
Using difference scores to account for seasonal trends seems not to capture the constant variance associated with the recurrence of the season;
One way to do this in python could be to use Fourier series representing seasonal trends (also see reference), which can be obtained using, among other ways, a Welch signal decomposition. How to use these as exogenous variables in an ARIMA to good effect is an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python

Categories