I am building a machine learning program with a simple linear regression. My dataset has two types of features:
Low-dimensional features: e.g. word count, length, etc.
High-dimensional features: TF-IDF features
I want to manually assign weights to the different sets of features.
I have been trying Pipeline with FeatureUnion (transformer_weights), but I find this difficult when the TF-IDF features must be built from a vocabulary created from both the train and test data.
Could you please advise whether there is any solution other than using a pipeline?
Thanks a lot!
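One possible way to do this without a Pipeline, as a minimal sketch: fit the TfidfVectorizer once on the combined train and test text, scale each feature block by its own weight, and stack the blocks with scipy.sparse.hstack. The texts, the low_dim helper, and the weights W_LOW and W_TFIDF are made-up placeholders, not values from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real train/test documents.
train_texts = ["the cat sat", "dogs bark loudly", "the cat and the dog"]
test_texts = ["a bird sang", "the dog sat"]

def low_dim(texts):
    """Hypothetical low-dimensional features: word count and character length."""
    return np.array([[len(t.split()), len(t)] for t in texts], dtype=float)

# Build the vocabulary from train and test text together, as the question asks.
vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts + test_texts)

# Manually chosen block weights (assumed values, tune as needed).
W_LOW, W_TFIDF = 0.3, 1.0

def build(texts):
    """Scale each feature block by its weight and concatenate column-wise."""
    low = csr_matrix(low_dim(texts)) * W_LOW
    tfidf = vectorizer.transform(texts) * W_TFIDF
    return hstack([low, tfidf]).tocsr()

X_train = build(train_texts)
X_test = build(test_texts)
print(X_train.shape, X_test.shape)
```

The resulting matrices can be passed to any estimator that accepts sparse input. Note that for an unregularized, well-determined least-squares fit a column rescaling is absorbed by the coefficients, so block weights like these mainly matter when the model is regularized.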
Related
I'm working on a machine learning project where I'm trying to predict the revenue of a movie.
My dataset contains mixed data types. There are numerical features (rating, number of votes, release year,...), categorical features (genres, studio, whether the movie is for a mature audience,...) but also embeddings that consist of large feature vectors (post embeddings and movie description embeddings).
My problem is with the last data type. I'm wondering how I should handle these embeddings.
I've done some pre-processing (cleaning, one-hot encoding, label encoding,...) but I still have these embeddings. Now I would like to do some feature selection and model selection, for example with a filter method. For a linear model, I could use a correlation matrix, but I cannot compute it because the variables img_embeddings and txt_embeddings are not numerical values but 1D vectors. The same problem arises if I want to use mutual information for non-linear models.
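One option, sketched here under assumptions, is to expand each embedding vector into one numeric column per dimension, after which a filter method such as correlation with the target can be computed column by column. The data below is synthetic, and the names embeddings, E and revenue are placeholders rather than the actual columns of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, emb_dim = 200, 8

# Hypothetical column where every row holds one 1-D embedding vector.
embeddings = [rng.normal(size=emb_dim) for _ in range(n_samples)]

# Stack the per-row vectors into an (n_samples, emb_dim) numeric matrix,
# so that each embedding dimension becomes an ordinary numeric feature.
E = np.vstack(embeddings)

# Synthetic target that depends mostly on embedding dimension 0.
revenue = 3.0 * E[:, 0] + rng.normal(scale=0.1, size=n_samples)

# Filter method: Pearson correlation of every embedding dimension with the target.
corr = np.array([np.corrcoef(E[:, j], revenue)[0, 1] for j in range(emb_dim)])
ranked = np.argsort(-np.abs(corr))  # dimensions sorted by |correlation|
print(ranked)
```

Individual embedding dimensions are usually not interpretable on their own, so it may be more sensible to treat each embedding as a block, or to reduce it first, than to select single dimensions.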
In Python, can I get 100 best features out of 200k by performing Linear Discriminant Analysis on data having 2 classes?
Although LDA is often used for multi-class problems, it can also be used in binary classification problems.
You can use LDA for dimensionality reduction, which aims to reduce the number of features. Feature selection, on the other hand, is the process of selecting a subset of the existing features.
LDA is therefore a kind of feature extraction, not feature selection: it creates a new set of features rather than selecting the best of the original ones.
In essence, the original features no longer exist and new features are constructed from the available data that are not directly comparable to the original data [1].
Check this link for further reading
[1] Linear Discriminant Analysis for Dimensionality Reduction in Python
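As a small illustration on synthetic data (not the 200k-feature dataset from the question): scikit-learn's LinearDiscriminantAnalysis can produce at most min(n_classes - 1, n_features) components, so with 2 classes the transformed data has a single new feature, not 100.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic binary problem: 100 samples, 20 features, two balanced classes.
X = rng.normal(size=(100, 20))
y = np.array([0, 1] * 50)
X[y == 1, 0] += 2.0  # shift one feature so the classes are separable

# With 2 classes, n_components can be at most n_classes - 1 = 1.
lda = LinearDiscriminantAnalysis(n_components=1)
X_new = lda.fit_transform(X, y)

# X_new is a new constructed feature (a linear combination of the originals),
# not a subset of the original columns.
print(X_new.shape)
```

To actually keep 100 of the original features, a feature selection method would be needed instead.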
I've recently started exploring TensorFlow's feature columns.
If I understood the documentation correctly, feature columns are just a 'frame' for further transformations applied just before the data is fed to the model. So, to use them, I define some feature columns and create a DenseFeatures layer from them; when I fit the model, all features go through that DenseFeatures layer, are transformed, and are then passed to the first Dense layer of my NN.
My question is: is it possible to check the correlations of the transformed features with my target variable?
For example, I have a categorical feature for the day of the week (Mon/Tue/.../Sun), which I encode as 1/2/.../7. Its correlation with my target will not be the same as that of the categorical feature column (e.g. an indicator column), because with the integer encoding a model doesn't understand that 7 is the maximum of the possible sequence, whereas the categorical column is one-hot encoded with precise boundaries.
Let me know if all is clear.
Will be grateful for the help!
TensorFlow does not provide a built-in feature importance attribute the way XGBoost's scikit-learn interface does.
However, you can test the importance of a feature, or its relationship with the target, in TensorFlow as follows.
1) Shuffle the values of the feature whose relationship with the target you want to test. For example, if the feature is fea1, the value at df['fea1'][0] becomes the value at df['fea1'][4], the value at df['fea1'][2] becomes the value at df['fea1'][3], and so on.
2) Now fit the model to your modified training data and check the accuracy on the validation data.
3) If the accuracy drops drastically, the feature was strongly related to the target; if the accuracy barely changes, the feature is not very important (high error = high importance).
You can do the same with other features you introduced to your training data.
This could take some time and effort.
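The steps above can be sketched as follows. This is only an illustration: a least-squares fit in NumPy stands in for the TensorFlow model, validation MSE stands in for accuracy, and the data and the helper names fit and mse are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic data: feature 0 matters a lot, feature 1 a little, feature 2 is noise.
X = rng.normal(size=(n, 3))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]

def fit(X, y):
    """Stand-in for model.fit: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Stand-in for the validation metric: mean squared error."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

baseline = mse(fit(X_tr, y_tr), X_val, y_val)

increase = {}
for j in range(X.shape[1]):
    X_shuf = X_tr.copy()
    X_shuf[:, j] = rng.permutation(X_shuf[:, j])  # step 1: shuffle one feature
    coef = fit(X_shuf, y_tr)                      # step 2: refit on modified data
    increase[j] = mse(coef, X_val, y_val) - baseline  # step 3: compare the error

print(increase)
```

A commonly used variant, usually called permutation importance, shuffles the column in the validation data instead and reuses the already trained model, which avoids retraining once per feature.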
I have three trained neural networks whose predictions I am trying to average using EnsembleVoteClassifier from mlxtend.classifier. The problem is that my neural networks don't share the same input (I applied feature reduction and feature selection algorithms randomly and stored the results in different variables, so I have something like X_test_algo1, X_test_algo2 and X_test_algo3 and Y_test).
I am trying to average the weights, but as I said, I don't have the same X, and I didn't find any example in the documentation. How can I average the predictions of my three models model1, model2 and model3?
from mlxtend.classifier import EnsembleVoteClassifier
eclf = EnsembleVoteClassifier(clfs=[model1, model2, model3], weights=[1,1,1], refit=False)
names = ['NN1', 'NN2', 'NN3', 'Ensemble']
eclf.fit(X_train_algo1, Ytrain) #???? each model expects a different X
If it's not possible, that is okay. I am mainly interested in the formulas for Hard Voting, Soft Voting and Weighted Voting, so another, more flexible library or the explicit expressions of the formulas would be helpful too.
Why would you need a library to do that?
Simply pass the same examples through all your neural networks and get the predictions (either logits or probabilities or labels).
Hard voting: choose the label predicted most often by the classifiers.
Soft voting: average the probabilities predicted by the classifiers and choose the label with the highest average.
Weighted voting: either of the above can be weighted. Just assign a weight to each classifier and multiply its predictions by it. Weights are usually normalized to the (0, 1] range.
In principle you could also sum the logits and choose the label with the highest sum.
Oh, and weight averaging is a different technique: it requires the models to have the same architecture, and is usually done for the same initialization at different training timesteps. You can read about it in this blog post.
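As a minimal sketch of these rules in NumPy: with weights w_k and per-model class probabilities p_k(c), weighted hard voting picks argmax over c of the sum of w_k over the models whose predicted label is c, and weighted soft voting picks argmax over c of the sum of w_k * p_k(c). The probability arrays below are invented; in practice each would come from its own model and its own input, e.g. something like model1.predict(X_test_algo1), which is an assumption about the setup in the question.

```python
import numpy as np

# Hypothetical predicted probabilities, one array per model.
# Rows are samples, columns are classes.
p1 = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
p2 = np.array([[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]])
p3 = np.array([[0.20, 0.80], [0.60, 0.40], [0.10, 0.90]])

probs = np.stack([p1, p2, p3])       # shape (n_models, n_samples, n_classes)
weights = np.array([1.0, 1.0, 1.0])  # one weight per model
n_classes = probs.shape[2]

# Hard voting: each model votes for its most probable label, votes are
# weighted and summed per class, and the class with the most votes wins.
labels = probs.argmax(axis=2)                        # (n_models, n_samples)
one_hot = np.eye(n_classes)[labels]                  # (n_models, n_samples, n_classes)
hard_votes = np.tensordot(weights, one_hot, axes=1)  # (n_samples, n_classes)
hard_pred = hard_votes.argmax(axis=1)

# Soft voting: weighted average of the probabilities, then argmax.
soft_scores = np.tensordot(weights / weights.sum(), probs, axes=1)
soft_pred = soft_scores.argmax(axis=1)

print(hard_pred)  # [0 1 0]
print(soft_pred)  # [0 1 1]
```

The third sample shows how the two rules can disagree: two models lean slightly towards class 0, but the third is very confident in class 1, so the averaged probability favours class 1.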
I have a question about ensemble feature selection.
My data set consists of 1000 samples with about 30000 features, and each sample is labeled either A or B.
What I want to do is pick a set of features that can classify the label efficiently.
I used three types of methods: a univariate method (Pearson's coefficient), lasso regression and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python scikit-learn for the feature selection.
Now I am thinking of an ensemble feature selection approach, because the number of features is so large. In this case, how can I build an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and applying lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of features from the previous three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case I believe the best way to select features in your case is to select the ones that optimize your goal, which seems to be classification accuracy, thus the CV proposal.
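A minimal sketch of that proposal, assuming scikit-learn's RFECV with a linear SVC as the wrapped model: start from the union of the three feature sets and let cross-validated accuracy decide which features to keep. The data is synthetic and the three index sets are invented placeholders for the Pearson, lasso and SVM-RFE results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic stand-in for the real data (which has 1000 samples x ~30000 features).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

# Hypothetical feature index sets from the three earlier selection methods.
set_pearson = {0, 1, 2, 3, 10}
set_lasso = {1, 2, 4, 11, 12}
set_rfe = {2, 3, 4, 13}
union = np.array(sorted(set_pearson | set_lasso | set_rfe))

# RFE with cross-validation on the union, scored by classification accuracy.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5, scoring="accuracy")
selector.fit(X[:, union], y)

# Map the boolean support mask back to the original column indices.
selected = union[selector.support_]
print(selected)
```

With tens of thousands of candidate features, a larger step value would keep the run time manageable, at the cost of a coarser search.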