ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing ~90 best for now. However, after I downsize the feature count, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

train = pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]

documents = []
for i, row in train.iterrows():
    documents.append(row['boilerplate'][1:-1].lower())

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)

selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per word per document for the selected words; however, I have no idea which words were chosen, and methods like get_feature_names() are unavailable on the SelectPercentile class.
This is necessary because I need to add these features to a bunch of numeric features and only then make my training and predictions.

selector.get_support() gets you a boolean array of the columns that fell within the percentile range you specified.
train.columns.values gets you the complete list of column names for the original dataframe.
Filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it's hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)

columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference: the scikit-learn documentation on get_support.
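For the asker's TF-IDF case specifically, the same idea applies with the vectorizer's feature names playing the role of the dataframe columns. A minimal sketch, reusing the vectorizer and selector from the question's code:
import numpy as np

feature_names = np.asarray(vectorizer.get_feature_names())
selected_words = feature_names[selector.get_support()]  # the words kept by SelectPercentile
print(selected_words)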

Related

Selected Features Column Names in Scikit Learn Feature Selection

Figuring out which features were selected from the main dataframe is a very common problem data scientists face when doing feature selection with scikit-learn's feature_selection module.
# importing modules
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# creating the X (features) and Y (target) variables
X = main_df.iloc[:, 0:-1]
Y = main_df.iloc[:, -1]

# feature extraction
test = SelectKBest(score_func=f_regression, k=5)
features = test.fit_transform(X, Y)

# finding the selected column names
feature_idx = test.get_support(indices=True)
feature_names = main_df.columns[feature_idx]

# creating a dataframe of the selected features with their column names
features = pd.DataFrame(features, columns=feature_names)
features.head()
I hope my code helps the community. Any and all feedback is appreciated.

Sentiment Analysis Feature Selection based on word to label correlation

In my sentiment analysis of a dataset of 194k review texts with labels (classes 1-5), I am trying to reduce the features (words) based on a word-to-label correlation, on which a classifier can then be trained.
Using sklearn.feature_extraction.text.CountVectorizer with default parameters, I get 86.7k features. When performing fit_transform, I get a CSR sparse matrix, which I tried to put into a dataframe using toarray().
Unfortunately, an array of size (194439, 86719) causes a MemoryError. I think I need it in a dataframe in order to calculate the correlations with df.corr(). Below is my code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = data['reviewText']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)
content = X.toarray()  # here comes the MemoryError
vocab = vectorizer.get_feature_names()
df = pd.DataFrame(data=X.toarray(), columns=vocab)
corr = pd.Series(df.corrwith(df['overall']) > 0.6)
new_vocab = df[corr[corr == True].index]  # should return the features we want to use
Is there a way to filter by correlation without having to convert the matrix into a dataframe?
Most posts going in the same direction of using correlation on a dataframe do not have to handle this amount of data.
I figured out that there are other ways to implement a feature selection based on correlation: with SelectKBest and the scoring function f_regression.
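For illustration, a minimal sketch of that approach, assuming the corpus and labels from the question (the value of k is an assumption to tune). SelectKBest works on the sparse matrix directly, so the dense conversion that caused the MemoryError is never needed:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_regression

corpus = data['reviewText']
y = data['overall']

vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)  # stays sparse throughout

selector = SelectKBest(f_regression, k=1000)  # k is an assumption; tune as needed
X_reduced = selector.fit_transform(X, y)
kept_words = [w for w, keep in zip(vectorizer.get_feature_names(), selector.get_support()) if keep]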

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50
df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t
Out[15]:
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
	with 6055621 stored elements in Compressed Sparse Row format>
I have tried following the example in this post, although my aim is not to display the features, but just to select the top n for each document before training. But I get a memory error as my data is too large to be converted into a dense matrix.
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):
File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space too much more than I already have (with min_df)?
Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.
EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).
A colleague wrote some code to retrieve the indices of the n highest-ranked features:
n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document) sort the (inversed, as argsort is ascending) list and slice top n
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]
But from there, I would need to either:
retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.
There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
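For what it's worth, here is a hedged sketch of that in-place approach (mine, not from the linked post): it zeroes everything except the n largest values in each row of the csr_matrix, so the feature indices stay valid:
import numpy as np

def keep_top_n(mat, n):
    # mat: scipy.sparse.csr_matrix, modified in place
    for i in range(mat.shape[0]):
        start, end = mat.indptr[i], mat.indptr[i + 1]
        row_data = mat.data[start:end]  # a view into this row's stored values
        if row_data.size > n:
            low = np.argpartition(row_data, -n)[:-n]  # everything below the top n
            row_data[low] = 0.0
    mat.eliminate_zeros()  # drop the explicit zeros from storage
    return mat

df_t = keep_top_n(df_t, 50)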
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2), binary=True, max_features=50)
TFIDF = vect.fit_transform(df['processed_cv_data'])
The max_features parameter passed to the TfidfVectorizer picks out the top 50 features ordered by term frequency across the corpus, not by tf-idf score.
You can view the features by using:
print(vect.get_feature_names())
As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.
If you are looking for an alternative that takes the relationship to the target variable into account, you can use sklearn's SelectKBest. Setting k=50 will filter your data down to the 50 best features. The metric used for selection can be specified via the score_func parameter.
Example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
You can also chain it in a pipeline:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])
You could break your matrix into multiple smaller ones to free the memory, then concatenate the results:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train').data
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)

n = 10
# process 500 rows at a time; negate the scores so argsort (ascending) yields the top n indices
df_top = [np.argsort(-df_t[i: i + 500, :].toarray(), axis=1)[:, :n]
          for i in range(0, df_t.shape[0], 500)]
np.concatenate(df_top, axis=0).shape
>> (11314, 10)

NLP in Python: Obtain word names from SelectKBest after vectorizing

I can't seem to find an answer to my exact problem. Can anyone help?
A simplified description of my dataframe ("df"): it has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating whether the resolution time was above average or not ("y").
I did bag-of-words on the text:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])
My matrix is 6290 x 4650. No problem getting the word names (i.e. feature names) :
feature_names = vectorizer.get_feature_names()
feature_names
Next, I want to know which of these 4650 words are most associated with above-average resolution times, and to reduce the matrix I may want to use in a predictive model. I do a chi-squared test to find the top 20 most important words.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=20)
selector.fit(matrix, y)
top_words = selector.get_support().nonzero()
# Pick only the most informative columns in the data.
chi_matrix = matrix[:,top_words[0]]
Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I was trying this:
chi_matrix.feature_names[selector.get_support(indices=True)].tolist()
Or
chi_matrix.feature_names[features.get_support()]
These give me an error: feature_names not found. What am I missing?
After figuring out what I really wanted to do (thanks Daniel) and doing more research, I found a couple of other ways to meet my objective.
Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(df["Notes"])

chi2score = chi2(X, df['AboveAverage'])[0]
wscores = zip(vectorizer.get_feature_names(), chi2score)
wchi2 = sorted(wscores, key=lambda x: x[1])
topchi2 = zip(*wchi2[-20:])
show = list(topchi2)
show
Way 2 - This is the way I used because it was the easiest for me to understand and produced a nice output listing the word, chi2 score, and p-value. Another thread on here: Sklearn Chi2 For Feature Selection
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(df["Notes"])
y = df['AboveAverage']

# Select the 10 features with the highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X, y)

# Look at the scores the selector returned for each feature
chi2_scores = pd.DataFrame(list(zip(vectorizer.get_feature_names(), chi2_selector.scores_, chi2_selector.pvalues_)),
                           columns=['ftr', 'score', 'pval'])
chi2_scores
I had a similar problem recently, but I was not constrained to using the 20 most relevant words. Rather, I could select the words whose chi-squared score was higher than a set threshold. I will give you the method I used to achieve this second task. The reason this is preferable to just taking the first n words by chi-squared score is that those 20 words may have extremely low scores and thus contribute next to nothing to the classification task.
Here is how I have done it for a binary classification task:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

THRESHOLD_CHI = 5  # or whatever you like; you may try
                   # for threshold_chi in [1,2,3,4,5,6,7,8,9,10]
                   # and measure the f1 scores

X = df['text']
y = df['labels']

cv = CountVectorizer()
cv_sparse_matrix = cv.fit_transform(X)
cv_dense_matrix = cv_sparse_matrix.todense()

chi2_stat, pval = chi2(cv_dense_matrix, y)
chi2_reshaped = chi2_stat.reshape(1, -1)
which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
which_ones_to_keep = np.repeat(which_ones_to_keep, axis=0, repeats=which_ones_to_keep.shape[1])
The result is a matrix containing ones where the terms have a chi-squared score above the threshold and zeroes where they are below it. This matrix can then be combined via np.dot with either a count matrix or a tf-idf matrix, and the result subsequently passed to the fit method of a classifier.
If you do this, the columns of the which_ones_to_keep matrix correspond to the columns of the CountVectorizer object, so you can determine which terms were relevant for the given labels by comparing the non-zero columns of the which_ones_to_keep matrix to the indices of .get_feature_names(), or you can just forget about it and pass the result directly to a classifier.
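As a hedged sketch of that last name-recovery step (reusing cv, chi2_stat, and THRESHOLD_CHI from the code above):
import numpy as np

kept = np.where(chi2_stat > THRESHOLD_CHI)[0]          # column indices above the threshold
kept_terms = np.asarray(cv.get_feature_names())[kept]  # the corresponding vocabulary terms
print(kept_terms)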

How to use vectors as features in scikit learn

I'm trying to use vector representations of words as features for a scikit-learn classifier. I have tried SVC. Here is the code:
from sklearn.svm import SVC
import csv
import numpy as np
from gensim.models import word2vec
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('text.model.bin', binary=True)

with open('1000.csv', newline='') as csvfile:
    listwords = csv.reader(csvfile)
    features = []
    labels = []
    n = 0
    for row in listwords:
        if n >= 199:
            break
        try:
            line = [int(row[2]), np.array(model[row[0]])]
            features.append([line])
            labels.append([row[1]])
            n += 1
        except KeyError:
            features.append([])
            n += 1

clf = SVC()
clf = clf.fit(features, labels)

vocab_obj = model.vocab['anne']
print(clf.predict([vocab_obj.count, model['anne']]))
The function model[X] returns a vector.
I get the error : ValueError: setting an array element with a sequence.
How am I supposed to do this?
There appear to be several issues w.r.t. the representation of your labels and features.
As far as I can see from your code, labels appears to be a list of lists (possibly) containing strings, probably looking like [['0'], ['1'], ...]; however, the fit() function expects a numpy array of integers. When building your labels list, try using labels.append(int(row[1])) (skip the cast to int if row[1] is already an integer). If your labels are category names (e.g. sports, politics, or whatever), then you'd need to use a LabelEncoder first. Before calling fit() you might also want to convert your labels list to a numpy array: y = np.array(labels).
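A minimal sketch of that label-preparation step (the category names here are hypothetical):
from sklearn.preprocessing import LabelEncoder

labels = ['sports', 'politics', 'sports']  # hypothetical string labels
y = LabelEncoder().fit_transform(labels)   # -> array([1, 0, 1])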
Your features list appears to have a similar problem, but looks as if it might be a triply nested list. The fit() function expects the data matrix to be an n_samples x n_features matrix. If you are working with word vectors, then n_features is the dimensionality of your word vectors and n_samples is the number of documents in your csv file.
For getting a document representation from the word vectors, you'd need to compose them in some way. Commonly, simply adding or averaging all vectors in a document has been found to be a strong baseline; see the sketch below. Note that it's hard to tell from your example what the meaning of int(row[2]) in line = [int(row[2]), np.array(model[row[0]])] is.
I'd encourage you to post some more information about what a single line in your csv file looks like if you're still struggling to get this to work.
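To illustrate the averaging baseline mentioned above, a hedged sketch (the function and variable names are mine, not from the question; dim must match your model's vector size):
import numpy as np

def doc_vector(tokens, model, dim=300):
    # average the vectors of the in-vocabulary words; skip words the model doesn't know
    vecs = [model[w] for w in tokens if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# X = np.vstack([doc_vector(text.split(), model) for text in texts])
# clf.fit(X, y)  # X is now a proper n_samples x n_features matrix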
