How to use vectors as features in scikit learn - python

I'm trying to use vector representations of words as features for a scikit-learn classifier. I have tried SVC. Here is the code:
from sklearn.svm import SVC
import csv
import numpy as np
from gensim.models import word2vec
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('text.model.bin', binary=True)
with open('1000.csv', newline='') as csvfile:
    listwords = csv.reader(csvfile)
    features = []
    labels = []
    n = 0
    for row in listwords:
        if n >= 199:
            break
        try:
            line = [int(row[2]), np.array(model[row[0]])]
            features.append([line])
            labels.append([row[1]])
            n += 1
        except KeyError:
            pass
            features.append([])
            n += 1
clf = SVC()
clf = clf.fit(features, labels)
vocab_obj = model.vocab['anne']
print (clf.predict([vocab_obj.count,model['anne']]))
The function model[X] returns a vector.
I get the error: ValueError: setting an array element with a sequence.
How am I supposed to do this?

There appear to be several issues w.r.t. the representation of your labels and features.
As far as I can see from your code, labels appears to be a list of lists (possibly) containing strings (probably looking like [['0'], ['1'], ...]); however, the fit() function expects a numpy array of integers. When building your labels list, try using labels.append(int(row[1])) (skip the cast to int if row[1] is already an integer). If your labels are category names (e.g. sports, politics, or whatever), then you'd need to use a LabelEncoder first. Before calling fit() you might also want to convert your labels list to a numpy array: y = np.array(labels).
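For example, a minimal sketch of building the label vector (assuming, as in your loop, that row[1] holds the label for each sample):
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = []
for row in listwords:
    labels.append(row[1])        # one flat label per sample, not a nested list

le = LabelEncoder()
y = le.fit_transform(labels)     # maps category names to integers, e.g. ['pos', 'neg', ...] -> [1, 0, ...]
# or, if the labels are already numeric strings:
# y = np.array([int(l) for l in labels])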
Your features list appears to have a similar problem, but looks as if it might be a triply nested list. The fit() function expects the data matrix to be an n_samples x n_features matrix. If you are working with word vectors, then n_features is the dimensionality of your word vectors and n_samples is the number of documents in your csv file.
For getting a document representation from the word vectors you'd need to compose them in some way. Commonly, simply adding or averaging all vectors in a document has been found to be a strong baseline. Note that it's hard to tell from your example what int(row[2]) in line = [int(row[2]), np.array(model[row[0]])] is meant to represent.
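A minimal sketch of the averaging approach (assuming, purely for illustration, that the document text lives in row[0] as a whitespace-separated string; adapt this to your actual csv layout):
import numpy as np

def document_vector(model, text):
    # average the word2vec vectors of the in-vocabulary words of one document
    vectors = [model[w] for w in text.split() if w in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# rows is a hypothetical list of csv rows; X ends up with shape n_samples x n_features
X = np.array([document_vector(model, row[0]) for row in rows])
y = np.array(labels)
clf = SVC().fit(X, y)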
I'd encourage you to post more information about what a single line of your csv file looks like if you're still struggling to get this to work.

Related

Python Naive Bayes training issues - cannot perform reduce with flexible type

I am aiming to train a Naive Bayes model to classify positive and negative sentences. I have 6000 words to train the model, and want to test it with sentences such as "I am feeling happy".
I am currently using numpy with sklearn but am having issues with a type error: "cannot perform reduce with flexible type".
I have a list of words like words = ["good", "happy", "bad", "sad"] and a corresponding list wordTypes = ["positive", "positive", "negative", "negative"]. I have tried to simplify it just to get something working, so the result is the code below. I have also tried using "str" rather than "positive" and "negative" in X, as well as parentheses instead of square brackets.
from sklearn.naive_bayes import GaussianNB
import numpy as np
nv = GaussianNB()
X = [["happy","positive"], ["good","positive"], ["bad","negative"], ["poor", "negative"]]
y = ["positive","positive", "negative","negative"]
nv.fit(X,y)
I have also tried it as below, but get the following error: ValueError: Expected 2D array, got 1D array instead: array=['happy' 'good' 'bad' 'poor']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
X = ["happy", "good", "bad", "poor"]
y = ["positive","positive", "negative","negative"]
Am I misunderstanding how they should be structured?
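(For reference, a minimal sketch of the usual approach, not from the original post: vectorize the raw words into numeric counts with CountVectorizer and fit a Naive Bayes variant suited to count features, such as MultinomialNB, since the classifier cannot work with strings directly.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

words = ["happy", "good", "bad", "poor"]
word_types = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(words)   # numeric sparse matrix, n_samples x n_features
nb = MultinomialNB().fit(X, word_types)

print(nb.predict(vectorizer.transform(["I am feeling happy"])))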

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern='[A-Za-z][\w\-]*', max_df=0.25)
n = 50
df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t
Out[15]:
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
with 6055621 stored elements in Compressed Sparse Row format>
I have tried following the example in this post, although my aim is not to display the features, but just to select the top n for each document before training. But I get a memory error as my data is too large to be converted into a dense matrix.
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):
File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space too much more than I already have (with min_df)?
Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.
EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).
A colleague wrote some code to retrieve the indices of the n highest-ranked features:
n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int) # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), sort the negated row (argsort is ascending) and slice the top n
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]
But from there, I would need to either:
retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.
There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
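A rough sketch of the second option might look like the following (keep_top_n_per_row is a hypothetical helper, not taken from the linked post): iterate over the rows of the csr_matrix and zero out everything below the n-th largest value, so the shape and feature indices stay unchanged.
import numpy as np

def keep_top_n_per_row(mat, n):
    # return a copy of a CSR matrix keeping only the n largest values in each row
    mat = mat.tocsr().copy()
    for i in range(mat.shape[0]):
        start, end = mat.indptr[i], mat.indptr[i + 1]
        row_data = mat.data[start:end]
        if row_data.size > n:
            drop = np.argsort(row_data)[:-n]   # everything below the n-th largest value
            row_data[drop] = 0
    mat.eliminate_zeros()                      # physically remove the zeroed entries
    return mat

df_t_reduced = keep_top_n_per_row(df_t, n)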
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2), binary=True, max_features=50)
TFIDF = vect.fit_transform(df['processed_cv_data'])
The max_features parameter passed in the TfidfVectorizer will pick out the top 50 features ordered by their term frequency but not by their Tf-idf score.
You can view the features by using:
print(vect.get_feature_names())
As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.
If you are looking for an alternative way which takes the relationship to the target variable into account, you can use sklearn's SelectKBest. By setting k=50, this will filter your data for the best features. The metric to use for selection can be specified as the parameter score_func.
Example:
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern='[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
You can also chain it in a pipeline:
pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])
You could break your matrix into multiple chunks to limit memory use, then just concatenate the results:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train').data
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern='[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)

n = 10
# process 500 documents at a time so only a small dense chunk exists in memory;
# negate before argsort (ascending) so the first n columns are the highest-scoring indices
df_top = [np.argsort(-df_t[i: i + 500, :].toarray(), axis=1)[:, :n]
          for i in range(0, df_t.shape[0], 500)]
np.concatenate(df_top, axis=0).shape
>> (11314, 10)

How can I read in a sparse matrix saved in a CSV file? (Python 3.6.4)

Using CountVectorizer, I extracted feature vectors from thousands of emails and saved them in a CSV file:
import csv
from sklearn.feature_extraction.text import CountVectorizer

dictionary = open(r'''C:\Users\User\Desktop\csmp3\stemmedDictionary.txt''', "r")
dic = list(set(dictionary.read().splitlines()))
cv = CountVectorizer(vocabulary=dic, binary=True)
#~PRESENCE FEATURE VECTOR~#
#TRAIN
pdt = open(r'''C:\Users\User\Desktop\csmp3\presence-dataset-training-stemmed.csv''', "w")
matWriter = csv.writer(pdt, delimiter=',')
for i in range(1, 2):  # 45252
    processed_email = open(r'''C:\Users\User\Desktop\csmp3\processed\processed''' + str(i) + '''.txt''', "r")
    presence_array = cv.transform(processed_email)
    matWriter.writerow(presence_array)
    processed_email.close()
pdt.close()
This is part of a spam-filtering project using Naive Bayes, and our data set is rather large. I'm hoping to use this sparse matrix with Bernoulli Naive Bayes' partial_fit method. I just can't quite figure out how to load the sparse matrix from the file. I've already tried numpy.loadtxt but it gives me "ValueError: could not convert string to float: "
Any help would be appreciated! Thank you!
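(One hedged alternative, not from the original post: skip the CSV round-trip entirely and save the sparse matrix in scipy's native .npz format, which avoids the string-parsing problem; the file names below are simplified placeholders.)
import scipy.sparse as sp

rows = []
for i in range(1, 2):  # 45252 in the full run
    with open('processed' + str(i) + '.txt', 'r') as processed_email:
        rows.append(cv.transform(processed_email))

presence_matrix = sp.vstack(rows).tocsr()              # one sparse matrix for the whole set
sp.save_npz('presence-training-stemmed.npz', presence_matrix)

# later, load it back for BernoulliNB's partial_fit
presence_matrix = sp.load_npz('presence-training-stemmed.npz')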

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best features for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append(row['boilerplate'][1:-1].lower())
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of all the tfidf scores per word per document of the selected words, however I have no idea which words were chosen, and methods like "get_feature_names()" are unavailable for the class SelectPercentile.
This is necessary because I need to add these features to a bunch of numeric features and only then do my training and predictions.
selector.get_support() gets you a boolean array of the columns that fell within the percentile range you specified.
train.columns.values gets you the complete list of column names for the original dataframe.
Filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it's hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)            # features only; y_train holds the "y" target column
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference:
about get_support
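For the TfidfVectorizer case in the question, the same idea could look like this (a sketch assuming the fitted vectorizer and selector from the question's code):
import numpy as np

feature_names = np.asarray(vectorizer.get_feature_names())  # use get_feature_names_out() in newer scikit-learn
support = selector.get_support()                            # boolean mask of the kept columns
print(feature_names[support])                               # the words that survived SelectPercentile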

How to use the actual feature names instead of "X" in scikit-learn DecisionTreeRegressor?

I suppose this is possible since the docstring of the fit function says:
X : array-like, shape = [n_samples, n_features]
Now I have: [image of the rendered decision tree, with nodes labeled X[0], X[1], ... rather than feature names]
I can certainly generate a string representation of the decision tree and then replace X[...] with the actual feature names. But I wonder whether the fit function could take feature names directly as part of its inputs. I tried the following formats for each sample:
[1, 2, "feature_1", "feature_2"]
[[1, 2], ["feature_1", "feature_2"]]
but neither worked. What does that shape mean? Could you please give me an example?
The fit function itself doesn't support anything like that. However, you can draw the decision tree, including feature labels, with the export_graphviz member function. (Isn't this how you generated the tree above?). Essentially, you'd do something like this:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
t = tree.DecisionTreeClassifier()
fitted_tree = t.fit(iris.data, iris.target)
# export_graphviz writes the .dot file itself when out_file is a path
tree.export_graphviz(fitted_tree, out_file='filename.dot', feature_names=iris.feature_names)
This will produce a 'dot' file, which graphviz (which must be installed separately) can then "render" into a traditional image format (postscript, png, etc.). For example, to make a png file, you'd run:
dot -Tpng filename.dot > filename.png
The dot file itself is a plain-text format and fairly self-explanatory. If you wanted to tweak the text, a simple find-replace in the text editor of your choice would work. There are also python modules for directly interacting with graphviz and its files. PyDot seems to be pretty popular, but there are others too.
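For instance, with pydot installed, rendering the dot file from Python might look like this (a sketch, not from the original answer):
import pydot

(graph,) = pydot.graph_from_dot_file('filename.dot')   # parse the dot file produced above
graph.write_png('filename.png')                        # render to PNG via graphviz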
The shape reference in fit's documentation just refers to the layout of X, the training data matrix. Specifically, it expects the first index to vary over training examples, while the 2nd index refers to features. For example, suppose your data's shape is (150, 4), as is the case for iris.data. The fit function will interpret it as containing 150 training examples, each of which consists of four values.
X should be a 2 dimensional numpy ndarray where each row corresponds to a sample and each column represents the values of a feature. That shape refers to the number of rows and columns of the feature data X.
An example of a valid X which contains 3 samples and 2 features:
import numpy as np
X = np.array([[2, 2], [2, 0], [0, 2]])
y = np.array([0, 1, 1])
print(X.shape)  # Output: (3, 2)
where the first sample has the value 2 for both the first and the second feature.
If you have a representation of the feature data in a list of dict (each dict corresponds to a single sample) like so
D = [
    {'feature1': 2, 'feature2': 2},
    {'feature1': 2, 'feature2': 0},
    {'feature1': 0, 'feature2': 2},
]
then you can use DictVectorizer to produce the matrix X:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
X = v.fit_transform(D)
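As a small aside (not in the original answer), v.get_feature_names() (or get_feature_names_out() in newer scikit-learn versions) returns the column order of X, here ['feature1', 'feature2'], which is exactly what you would pass as feature_names to export_graphviz.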
