I'm working on a Bag of Words problem with a dataset that has two columns - summary and solution. I'm using KNN for it. The train dataset has 91 columns and the test dataset has 15 columns.
To generate the vectors, I'm using the following piece of code.
from sklearn.feature_extraction.text import CountVectorizer

# Build the vocabulary from the training texts and vectorize them
vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
print(train_bow_set)
print(vectorizer.vocabulary_)
Then I trained the classifier:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
Now, I'm testing it.
y_pred = classifier.predict(test_bow_set)
This is the error I get when I test it:
sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query()
**ValueError: query data dimension must match training data dimension**
You are probably fitting the vectorizer on the test data again instead of using its transform function. Make sure you do the following:
test_bow_set = vectorizer.transform(test_dataset)
You are fitting the vectorizer again:
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
You need to keep the vectorizer from training (all of the preprocessing elements, actually) and only call transform afterwards. Fitting again rebuilds the vocabulary and profoundly changes the results:
train_bow_set = vectorizer.transform(dataset[0]).todense()
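Putting both answers together, a minimal end-to-end sketch (assuming, as above, that dataset[0] holds the training texts, dataset[1] the training labels, and test_dataset the test texts):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

vectorizer = CountVectorizer()

# Fit the vocabulary on the training texts only
train_bow_set = vectorizer.fit_transform(dataset[0])

# Reuse the same fitted vectorizer for the test texts
test_bow_set = vectorizer.transform(test_dataset)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
y_pred = classifier.predict(test_bow_set)

Both matrices now have the same number of columns (the size of the training vocabulary), so the dimension-mismatch error goes away.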
Related
I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analyser in Python. However, here's what I don't understand: it seems to me that the data they use is already labeled. So how do I use the training I did on the labeled data to then apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question seems stupid, but I don't have a lot of experience with machine learning and nltk ...
EDIT:
Here is the code I'm working on:
import pandas as pd
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)

vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)

X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)

predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:,1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because I otherwise got a ValueError telling me strings can't be converted to float.
how do I use the training I did on the labeled data to then apply to
unlabeled data?
This is exactly the problem that supervised ML tries to solve: given known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns will hopefully be useful for predicting the labels of unseen, unlabeled data.
For example, in a sentiment-analysis (sad, happy) problem, the patterns a model may discover during training could look like this:
Existence of one or more of these words means sad:
("misery", "sad", "displaced people", "homeless", ...)
Existence of one or more of these words means happy:
("win", "delightful", "wedding", ...)
If a new text document is given, we search for these patterns inside it and label it accordingly.
As a side note: we usually do not use the whole labeled dataset for training; instead, we hold out a small portion of it (separate from the training set) to validate the model and verify that it discovered truly generic patterns, not ones tailored specifically to the training data.
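Applied to the code in the EDIT above, a minimal sketch of the prediction step (variable names taken from that snippet): fit the vectorizer once on the labeled data, then only transform the unlabeled data and predict with the already-trained classifier:

# Reuse the vectorizer fitted on chat_data; do NOT call fit_transform() again,
# otherwise the vocabulary (and feature layout) changes under the classifier.
unlabeled = chatData.iloc[:, 1].values
unlabeled_features = vectorizer.transform(unlabeled.astype('U')).toarray()
predictions = text_classifier.predict(unlabeled_features)
print(predictions)

# Optional: map the encoded predictions back to the original label names
print(enc.inverse_transform(predictions))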
I have built my naive Bayes classifier model for NLP using bag of words, and now I want to predict the output for a single external input. How can I do it? Please see this GitHub link for correction, thanks:
https://github.com/Kundan8296/Machine-Learning/blob/master/NLP.ipynb
You need to apply the same preprocessing steps that you applied to your training data and use the result as the input to the classifier. Make sure you don't use fit_transform() on the new data; use transform() only.
# Change this part of your preprocessing so you can keep the original vectorizer.
vect = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
bag_of_words = vect.fit_transform(corpus)
...
...
# Now, when predicting, use this:
new_data = ...  # your new input
new_x = vect.transform(new_data)
y_pred = classifier.predict(new_x)
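For a single external input, note two assumptions in this setup: transform expects a sequence of documents, and tokenizer=lambda doc: doc means each document is already a list of tokens. A hedged sketch (the tokenization here is a placeholder; use whatever your notebook used to build corpus):

single_input = "this movie was great"   # one raw external string
tokens = single_input.split()           # placeholder tokenizer - match your training preprocessing

new_x = vect.transform([tokens])        # wrap in a list: transform expects a sequence of documents
y_pred = classifier.predict(new_x)
print(y_pred[0])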
I am trying to predict a cluster for a bunch of test documents in a trained k-means model using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)
k = 10
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
The model is generated without any problem, with 10 clusters. But when I try to predict on a list of documents, I get an error.
predicted_cluster = model.predict(test_documents)
Error message:
ValueError: could not convert string to float...
Do I need to use PCA to reduce the number of features, or do I need to preprocess the text documents?
You need to transform test_documents the same way the train documents were transformed:
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)
Make sure you only call transform on the test documents, and that you use the same vectorizer object that was used for fit() or fit_transform() on the train documents.
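predict then returns an array of cluster indices in the range 0 to k-1, one per test document. The original ValueError came from passing raw strings to predict; no PCA or extra dimensionality reduction is needed, since the fitted vectorizer already produces numeric features with exactly the dimensionality the model expects.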
I am trying to use OneVsRestClassifier on my data set. I extracted the features on which the model will be trained and fitted a linear SVC on them. After fitting, when I try to predict on the very data the model was fitted on, I get all zeros. Is this an implementation issue, or is my feature extraction not good enough? I think that since I am predicting on the same data the model was fitted on, I should get 100% accuracy, but instead my model predicts all zeros. Here is my code -
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# arrFinal contains all the features and the labels. The last 16 columns are labels,
# features are columns 1 to 521, and the 17th column from the end is not used.
X = np.array(arrFinal[:, 1:-17])
X = X.astype(float)
Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear'))
clf.fit(X, Y)
ans = clf.predict(Xtest)
print(ans)
print("\n\n\n")
Is there something wrong with my implementation of OneVsRestClassifier?
After looking at your data, it appears the values may be too small for the C value. Try scaling them with sklearn.preprocessing.StandardScaler:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array(arrFinal[:, 1:-17])
X = X.astype(float)
scaler = StandardScaler()
X = scaler.fit_transform(X)  # standardize features to zero mean and unit variance
Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear', C=100))
clf.fit(X, Y)
ans = clf.predict(Xtest)
print(ans)
print("\n\n\n")
From here, you should look at tuning C with cross-validation, either with a learning curve or a grid search.
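For example, a minimal grid-search sketch over C (the grid values here are illustrative assumptions, not tuned for this data):

from sklearn.model_selection import GridSearchCV

# 'estimator__C' reaches the SVC wrapped inside OneVsRestClassifier
param_grid = {'estimator__C': [0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(OneVsRestClassifier(SVC(kernel='linear')), param_grid, cv=5)
grid.fit(X, Y)
print(grid.best_params_)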
I am trying to build a sentiment analyser using the scikit-learn LinearSVC classifier. The problem is that the classifier classifies every sentence as positive. A second question: why does the function predict() return a list with one classified label per text? I thought it would return just a single text/number, the classified label. Here is a sample cut from the code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(input='content', decode_error='ignore')
vect_train_x = vectorizer.fit_transform(training_data)  # training_data is a list of sentences
scaler = StandardScaler(with_mean=False)  # with_mean=False is required for sparse input
X_train = scaler.fit_transform(vect_train_x)  # compute the scale on the training data and transform it
vect_test_x = vectorizer.transform(test)  # the sentence(s) that need to be classified
X_test = scaler.transform(vect_test_x)
clf = LinearSVC()
clf.fit(X_train, labels)
print(vect_test_x)
print(clf.predict(X_test))  # returns a list such as ['Positive' 'Positive' 'Positive' ...]
I would be very grateful if you could explain what exactly I am not understanding. I tried to read the documentation, but without any examples I could not understand it. My training data consists of 100,000 positive and 100,000 negative sentences.
Came across this; I had the same issue. What solved my problem was to put the input into a list first, then convert that to an np.array, which is then run through the same fitted vectorizer and scaler before being passed to the predict function:

import numpy as np

new_array = []
new_array.append(Input)  # Input is a string, read from a file or from input()
X_test = np.array(new_array)
print(clf.predict(scaler.transform(vectorizer.transform(X_test))))
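As with the other answers in this thread, the key point is that predict expects the same feature representation the classifier was trained on, so the raw string has to go through the already-fitted vectorizer (and scaler) first. And predict always returns one label per input document, which is why a single sentence comes back as a one-element array rather than a bare value.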