How to select the best feature from each group - python

I have four predictor variables to model a soil property. Each of these predictor variables is generated by ten different methods (four groups of ten). Is there an algorithm (in R, Python, ...) that selects the best variant from each of these groups for predicting the soil property?
Thanks.

Try PyImpetus; it is a Markov blanket-based method for finding the best features: https://github.com/atif-hassan/PyImpetus. Beyond that, if you need more specific help, please share the features/data so that others can reproduce your results.
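If you would rather do the per-group selection directly, here is a minimal sketch under stated assumptions: a pandas DataFrame df with hypothetical columns var1_m1 ... var4_m10 (four predictors, ten generation methods each) and a target column soil; all of these names are placeholders.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical layout: columns var1_m1 ... var4_m10, one target column "soil".
groups = {f"var{i}": [f"var{i}_m{j}" for j in range(1, 11)] for i in range(1, 5)}
y = df["soil"]

best_per_group = {}
for group, candidates in groups.items():
    # Score each method's version of this predictor on its own (5-fold CV R^2).
    scores = {c: cross_val_score(RandomForestRegressor(random_state=0),
                                 df[[c]], y, cv=5).mean()
              for c in candidates}
    best_per_group[group] = max(scores, key=scores.get)

print(best_per_group)  # chosen generation method for each of the four predictors

Note that this scores each candidate in isolation; with only ten candidates per group you could also loop over all 10^4 one-per-group combinations, or use a wrapper method such as PyImpetus, to account for interactions between the four predictors.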

Related

Best information criterion for ARIMA model?

I use an ARIMA model in Python:
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(x, order=(p, d, q),
                enforce_stationarity=False,
                enforce_invertibility=False).fit(disp=False)
I want to compare several models with different parameters and choose the one with fewer regressors (i.e., the simpler model).
Which information criterion should I use?
I have read about AIC and BIC.
I have also read that BIC is better than AIC at picking a simpler ARMA model, but which works best for ARIMA?
Should I use a different information criterion such as HQIC instead?
I would be grateful for useful links.
I don't think there is a general answer to your question.
In R there is the auto.arima function written by Rob Hyndman; it uses the AICc.
You can read all about it in his online book (chapter 8.7).
Note that classical information criteria (AIC, BIC, etc.) do not allow you to compare ARIMA models with different values of d or D (since the number of usable observations depends on d and D).
Here is a list of things to keep in mind when working with information criteria.
Ultimately, the final choice of model cannot (in my experience) be based on one simple figure. Rather, the final choice should be supported by various diagnostic plots as well as information criteria.
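For comparing candidate orders on the same series, a minimal sketch (assuming the pandas Series x from your snippet and a fixed d, so the criteria remain comparable across models) could look like this; only the criterion used for sorting changes:

import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

d = 1
results = []
for p, q in itertools.product(range(3), range(3)):
    fit = SARIMAX(x, order=(p, d, q),
                  enforce_stationarity=False,
                  enforce_invertibility=False).fit(disp=False)
    # recent statsmodels versions also expose fit.aicc, if available in yours
    results.append(((p, d, q), fit.aic, fit.bic, fit.hqic))

# Sort by BIC (swap in the AIC or HQIC column to compare criteria).
for order, aic, bic, hqic in sorted(results, key=lambda r: r[2]):
    print(order, round(aic, 1), round(bic, 1), round(hqic, 1))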

Handling Categorical Data with Many Values in sklearn

I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally I would one-hot encode categorical features, but there are so many values here that it would radically increase the dimensionality of my data. I have 12,000 rows of data and only about 10 other features, so I might be OK though.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any information to the classification; you should drop this column and not pass it to the classifier.
You only want to pass a classifier features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of a company does not bring any valuable information in this context, so it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, it is easy to make multiple attempts with different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. That gives you a baseline score, i.e., a score you can compare your other combinations of features against.
It costs nothing to try one-hot encoding org_id; whatever result you observe will add to your experience and knowledge of how a random forest behaves in such cases. Since you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will vastly outnumber them, and the classification results may be heavily influenced by these features.
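A minimal sketch of that experiment (assuming a pandas DataFrame df with an org_id column and a binary churned target; both names are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# One-hot encode the high-cardinality org_id column (adds ~417 Boolean columns).
X = pd.get_dummies(df.drop(columns=["churned"]),
                   columns=["org_id"], prefix="is_org_id")
y = df["churned"]

# Compare this cross-validated score against the baseline without org_id.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())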
Then I would try to reduce the number of Boolean features by finding new features that "describe" these 400+ organizations. For instance, if they are all US organizations, their state (roughly 50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design them in a smart way.
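As a sketch of that idea, with an entirely hypothetical organization-level table (you would build org_features from whatever organization data you actually have):

import pandas as pd

# Hypothetical organization-level table: one row per org_id with descriptive features.
org_features = pd.DataFrame({
    "org_id": [0.0, 1.0, 2.0],
    "state": ["CA", "NY", "TX"],
    "n_users": [120, 45, 300],
    "years_existence": [12, 3, 25],
})

# Replace the raw ID with the descriptive features.
df = df.merge(org_features, on="org_id", how="left").drop(columns=["org_id"])
df = pd.get_dummies(df, columns=["state"])  # ~50 columns instead of 400+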
Also, once you solve your problem, I would find it interesting if you came back here and wrote another answer to your own question, as I believe many people run into such problems when working with real data :)

How to determine most impactful input variables in a dataset?

I have a neural network program that takes input variables and output variables and uses forecasted data to predict what the output variables should be. After running this program, I get an output vector. Let's say, for example, my input matrix has 100 rows and 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and ranked the variables by their correlation with the output, but I'm wondering if there is a better way to go about this.
If what you are after is model selection, it is not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend reading chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into that matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e., every combination of features. Even with only 10 features that is 2^10 - 1 = 1023 models, which quickly becomes computationally expensive, especially when each model is a neural network.
Another way is to take a variable you think is a good predictor and train the model only on that variable. Compute the error on the training data. Then take another variable at random, retrain the model, and recompute the error on the training data. If the error drops, keep the variable; otherwise discard it. Keep going through all the features.
A third approach is the opposite: start by training the model on all features and sequentially drop variables (a less naïve approach would be to first drop the variables you intuitively think have little explanatory power), compute the error on the training data, and compare to decide whether to keep each feature.
There are a million ways of going about this. I've described three of the simplest, but again, you can go really deep into this subject and find all kinds of different approaches (which is why I highly recommend you read that chapter :) ).
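As a rough illustration of the second approach above, here is a hedged sketch that assumes a NumPy feature matrix X of shape (100, 10) and a target vector y, uses scikit-learn's MLPRegressor as a stand-in for your network, and uses cross-validation instead of the training error; it also adds the best variable at each step rather than a random one:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

def score(cols):
    # Mean CV R^2 of a small network trained on the selected columns.
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best = -np.inf
while remaining:
    # Try adding each remaining variable and keep the one that helps most.
    gains = [(score(selected + [c]), c) for c in remaining]
    new_best, col = max(gains)
    if new_best <= best:
        break  # no remaining variable improves the score
    best = new_best
    selected.append(col)
    remaining.remove(col)

print("Selected columns:", selected, "score:", round(best, 3))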

Selecting the best combination of variables for regression model based on reg score

Hello old faithful community,
This might be a tough one, as I can barely find any material on this.
The Problem
I have a data set of crimes committed in NSW, Australia, by council, and have merged this with average house prices by council. I'm now looking to produce a linear regression to try to predict house price from the crime in the neighbourhood. The issue is, I have 49 crime variables and only want the best ones (statistically speaking) in my model.
I've run the regression score over all variables and over subsets chosen by correlation, with results from 0.23 to 0.38, but I want to get this as good as possible, if there is a way to do so.
I've thought about looping over every possible combination, but with 49 variables that is astronomically many (2^49 is on the order of 10^14 subsets).
So, my friends - how can I python this dataframe to get the best columns?
If I might add, you may want to take a look at the Python package mlxtend, http://rasbt.github.io/mlxtend.
It features several forward/backward stepwise selection algorithms while still using scikit-learn regressors and selectors.
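For example, a minimal sketch with mlxtend's SequentialFeatureSelector (assuming a DataFrame X holding the 49 crime columns and a Series y of house prices; both names are placeholders):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Greedy forward selection of the 10 crime variables that best predict price.
sfs = SFS(LinearRegression(),
          k_features=10,   # or a (min, max) tuple, or 'best'
          forward=True,
          scoring='r2',
          cv=5)
sfs = sfs.fit(X, y)
print(sfs.k_feature_names_, sfs.k_score_)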
There is no gold standard for solving this problem, and you are right: evaluating every combination is computationally infeasible most of the time, especially with 49 variables. One method would be to implement forward or backward selection, adding/removing variables based on a user-specified p-value criterion (this is the "statistically speaking" criterion you mention). For Python implementations using statsmodels, check out these links (a short sketch follows them):
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn/24447#24447
http://planspace.org/20150423-forward_selection_with_statsmodels/
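A bare-bones version of p-value-based forward selection with statsmodels (a sketch, assuming the same placeholder DataFrame X of the 49 crime columns and Series y of prices as above) might look like this:

import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    selected = []
    candidates = list(X.columns)
    while candidates:
        # p-value of each candidate when added to the current model
        pvals = {}
        for col in candidates:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_col = min(pvals, key=pvals.get)
        if pvals[best_col] >= alpha:
            break  # no remaining variable is significant at the chosen level
        selected.append(best_col)
        candidates.remove(best_col)
    return selected

print(forward_select(X, y))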
Other approaches that are less "statistically valid" would be to define a model evaluation metric (e.g., R-squared, mean squared error, etc.) and use a variable selection approach such as LASSO, random forests, or a genetic algorithm to identify the set of variables that optimizes the metric of choice. I find that in practice, ensembling these techniques in a voting-type scheme works best, as different techniques work better for certain types of data. Check out the links below from sklearn to see some options that you can code up pretty quickly with your data (an RFE sketch follows the links):
Overview of techniques: http://scikit-learn.org/stable/modules/feature_selection.html
A stepwise procedure: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Select best features based on model: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
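As a quick sketch of the RFE option linked above (again assuming the placeholder X and y from before):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest variable until 10 remain.
rfe = RFE(LinearRegression(), n_features_to_select=10)
rfe.fit(X, y)
print(X.columns[rfe.support_])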
If you are up for it, I would try a few techniques and see whether the answers converge to the same set of features; this will give you some insight into the relationships between your variables.

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify tweets into predefined categories (e.g., sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or an SVM, as described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do this with unlabeled data. One possibility could be to use Expectation-Maximization to generate clusters and then label those clusters, but as said earlier I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on which techniques I should follow? I appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs and the accuracy may vary, because of a well-known observation:
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on hashtags and other keywords). Please describe the source of your data and how you are extracting it; I am assuming you're just getting general tweets, which can be about anything.
One thing you can do is generate a dictionary for each class you have
(e.g., Music => pop, jazz, rap, instruments, ...)
containing words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start by extracting:
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Have a look at these NLP lexical semantics slides; they will surely clarify some of the concepts.
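A small sketch of extracting those relations with NLTK's WordNet interface (assuming the wordnet corpus has been downloaded; "music" is just an example seed word):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

seed = wn.synsets("music")[0]  # first synset for the seed word

print("synonyms:", seed.lemma_names())
print("hypernyms:", [s.name() for s in seed.hypernyms()])
print("hyponyms:", [s.name() for s in seed.hyponyms()[:5]])
print("meronyms:", [s.name() for s in seed.part_meronyms()])
print("holonyms:", [s.name() for s in seed.member_holonyms()])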
Once you have a dictionary for each class, cross-compare them with the tweets you have. A tweet can be labeled with the class whose dictionary it is most similar to (you can rank the classes by the number of occurrences of words from each dictionary). This gives you labeled tweets, just as in the supervised case.
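A rough sketch of that matching step (the dictionaries and the example tweet are toy placeholders; in practice the dictionaries would come from the WordNet relations above):

class_dicts = {
    "music":  {"pop", "jazz", "rap", "guitar", "album"},
    "sports": {"goal", "match", "team", "league", "score"},
    "health": {"doctor", "diet", "fitness", "vaccine", "sleep"},
}

def label_tweet(tweet):
    words = set(tweet.lower().split())
    # Count overlapping words per class and pick the class with the most hits.
    scores = {cls: len(words & vocab) for cls, vocab in class_dicts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None = could not label

print(label_tweet("Great match today, what a goal by the team!"))  # sports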
Now the question is accuracy, but that depends on the data and the variety of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label a subset of tweets this way and use cosine similarity to identify the class of other tweets. This will help with the optimization part. But then again, it's up to you, as you know what trade-offs you can bear.
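A minimal sketch of that cosine-similarity step using TF-IDF vectors (assuming lists labeled_tweets and labels from the dictionary step and a list unlabeled_tweets; all names are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize labeled and unlabeled tweets in the same TF-IDF space.
vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_tweets)
X_unlabeled = vectorizer.transform(unlabeled_tweets)

# Assign each unlabeled tweet the label of its most similar labeled tweet.
sims = cosine_similarity(X_unlabeled, X_labeled)
nearest = sims.argmax(axis=1)
predicted = [labels[i] for i in nearest]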
The real struggle will be the machine learning part and how you manage that.
Actually, this looks like a typical use case for semi-supervised learning. There are plenty of applicable methods here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from the labeled samples onto the distribution of the unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic for dealing with clusters that have mixed labels. Furthermore, solving an optimization problem not related to the task (labeling) will obviously not be as good as actually using that knowledge.
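For the semi-supervised route, scikit-learn ships ready-made label propagation models. A brief sketch (assuming a TF-IDF matrix X for all tweets and an integer label array y where unlabeled tweets are marked with -1; these are assumptions, not part of the original question):

from sklearn.semi_supervised import LabelSpreading

# y contains class indices for the few labeled tweets and -1 everywhere else.
model = LabelSpreading(kernel="knn", n_neighbors=10)
model.fit(X.toarray(), y)        # converting to dense, as the sparse input may not be supported
predicted = model.transduction_  # labels inferred for every tweet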
You can use clustering for this task. You have to label some examples for each class first; then, using those labeled examples, you can easily identify the class of each cluster.
