I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
s = s.lower()
# remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
s = re.sub('\s\W',' ',s)
s = re.sub('\W\s',' ',s)
# make sure we didn't introduce any double spaces
s = re.sub('\s+',' ',s)
return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.
Related
this is a sentiment analysis model made with tf-idf for feature extraction
i want to know how can i save this model and reuse it.
i tried saving it this way but when i load it , do same pre-processing on the test text and fit_transform on it it gave an error that the model expected X numbers of features but got Y
this is how i saved it
filename = "model.joblib"
joblib.dump(model, filename)
and this is the code for my tf-idf model
import pandas as pd
import re
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
from nltk.corpus import stopwords
processed_text = ['List of pre-processed text']
y = ['List of labels']
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(processed_text).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
text_classifier = BernoulliNB()
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
edit:
just to exact where to put every line
so after:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
then
tfidf_obj = tfidfconverter.fit(processed_text)//this is what will be used again
joblib.dump(tfidf_obj, 'tf-idf.joblib')
then you do the rest of the steps you will save the classifier after training as well so after:
text_classifier.fit(X_train, y_train)
put
joblib.dump(model, "classifier.joblib")
now when you want to predict any text
tf_idf_converter = joblib.load("tf-idf.joblib")
classifier = joblib.load("classifier.joblib")
now u have List of sentences to predict
sent = []
classifier.predict(tf_idf_converter.transform(sent))
now print that for a list of sentiments for each sentece
You can first fit tfidf to your training set using:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
tfidf_obj = tfidfconverter.fit(processed_text)
Then find a way to store the tfidf_obj for instance using pickle or joblib e.g:
joblib.dump(tfidf_obj, filename)
Then load the saved tfidf_obj and apply transform only on your test set
loaded_tfidf = joblib.load(filename)
test_new = loaded_tfidf.transform(X_test)
which is the correct way to extract features from multiple text columns and apply any classification algorithm on it?
please suggest me, if i am going wrong
example dataset
Independent Variables : Description1,Description2, State, NumericCol1,NumericCol2
Dependent Variable : TargetCategory
Code:
########### Feature Exttraction for Text Data #####################
######### Description1 (it can be any wordembedding technique like countvectorizer, tfidf, word2vec,bert..etc)
tfidf = TfidfVectorizer(max_features = 500,
ngram_range = (1,3),
stop_words = "english")
X_Description1 = tfidf.fit_transform(df["Description1"].tolist())
######### Description2 (it can be any wordembedding technique like countvectorizer, tfidf, word2vec,bert..etc)
tfidf = TfidfVectorizer(max_features = 500,
ngram_range = (1,3),
stop_words = "english")
X_Description2 = tfidf.fit_transform(df["Description2"].tolist())
######### State (have 100 unique entries thats why used BinaryEncoder)
import category_encoders as ce
binary_encoder= ce.BinaryEncoder(cols=['state'],return_df=True)
X_state = binary_encoder.fit_transform(df["state"])
import scipy
X = scipy.sparse.hstack((X_Description1,
X_Description2,
X_state,
df[["NumericCol1", "NumericCol2"]].to_numpy())).tocsr()
y = df['TargetCategory']
##### train Test Split ########
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=111)
##### Create Model Model ######
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics
# Baseline Random forest based Model
rfc = RandomForestClassifier(criterion = 'gini', n_estimators=1000, verbose=1, n_jobs = -1,
class_weight = 'balanced', max_features = 'auto')
rfcg = rfc.fit(X_train,y_train) # fit on training data
####### Prediction ##########
predictions = rfcg.predict(X_test)
print('Baseline: Accuracy: ', round(accuracy_score(y_test, predictions)*100, 2))
print('\n Classification Report:\n', classification_report(y_test,predictions))
The way to use multiple columns as input in scikit-learn is by using the ColumnTransformer.
Here is an example on how to use it with heterogeneous data.
Im trying to build a text-classification model on a database of site reviews (3 classes).
i cleaned the DF, tokenized it (with countVectorizer) and Tfidf (TfidfTransformer) and built MNB model.
now after i trained and evaluated the model, i want to get a list of the wrong predictions so i can pass them through LIME and explore the words that confuse the model.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)
#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
tfidf_x, y, test_size=0.3, random_state=101
)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
my objective is to get the original indices of the reviews that the model predicted wrongly.
I managed to get the result like this:
predictions = c.predict(preprocessed_df['review_text'])
df2= preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
df2[df2['business_category']!=df2['prediction']]
im sure there is a more elegant way...
It seems like there is another problem in your code, generally the TfIdf vectorizer is fit on the training data only and in order to get the test data in the same format we do the transform operation. This is primarily done to avoid data leakage. Please refer to TfidfVectorizer: should it be used on train only or train+test. I have modified your code to suit your need.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=101
)
transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(x_train)
x_test_tf = transformer.transform(x_test)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)
incorrect_docs = x_test[predmnb == y_test]
I am learning sentiment analysis and I have a data frame of reviews, which I have to evaluate given a list of words, and get the weights assigned to those words. Unfortunately, when I try to fit the regression I get the following error:
"ValueError: Found input variables with inconsistent numbers of samples: [11, 133401]"
What am I missing on?
CSV file
import pandas
import sklearn
import numpy as np
products = pandas.read_csv('amazon_baby.csv')
selected_words=["awesome", "great", "fantastic", "amazing", "love", "horrible", "bad", "terrible", "awful", "wow", "hate"]
#ignore all 3* reviews
products = products[products['rating'] != 3]
#positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >=4
#create a separate column for each word
for word in selected_words:
products[word]=[len(re.findall(word,x)) for x in products['review'].tolist()]
# Define X and y
X = products[selected_words]
y = products['sentiment']
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train) #here is where I get the error
CountVectorizer() expects an iterable of strings and returns vectors that represents the counts of words. You already implemented this with the for loop and now trying to fit CountVectorizer() to counts of your selected words.
Assuming you want to just want to use your selected words as features
logreg.fit(X_train, y_train)
without the transformation will be fine.
Or if you would like to use all the words as features you could change your X to include the full review
X = products['review'].astype(str)
and then fit the CountVectorizer() and then use
logreg.fit(X_train_dtm, y_train)
I have got a dataset which contains just two useful columns for training my model, first is news heading and the second is category of news.
So, I got the following training command running successfully using python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
s = s.lower()
# remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
s = re.sub('\s\W',' ',s)
s = re.sub('\W\s',' ',s)
# make sure we didn't introduce any double spaces
s = re.sub('\s+',' ',s)
return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is, how can I give a new set of data (e.g. Just news heading) and tell the program to predict the news category using python sklearn command?
P.S. My training data is like:
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on the **accuracy you can do the following:**
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
Finally, we can see all the available metrics here !
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To project back these values and get the labels you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
You are very close. Just need two more lines of code. Use this link, explains Naives Bayes using Sci Kit,
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below, import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))