SciKit-Learn: Trouble Using train_test_split

I'm using Pandas and SciKit-Learn to do some basic data cleaning and then ML. I have a words_df DataFrame that's 983 rows x 33,600 columns. The columns are mostly from running TFIDF as below:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
corpus = result_df['_text'].tolist()
count_vect = CountVectorizer(min_df=1, stop_words='english')
dtm = count_vect.fit_transform(corpus)
word_counts = dtm.toarray()
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(word_counts)
words_df = pd.DataFrame(tfidf.todense(), columns=count_vect.get_feature_names())
I extract an X and a Y (input instances and their target values, in my case page views). X is a DataFrame and Y is a Series (I just use words_df['_pageviews']).
I then run:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
Unfortunately, I get this error:
TypeError: Expected sequence or array-like, got estimator _title
Is this because one of my columns is called _title? I'm not sure what else could be causing this error.
Thanks!
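For reference, scikit-learn's input validation raises this exact TypeError when the object it receives has a fit attribute, because it then assumes it was handed an estimator; the message prints the offending object itself, which is likely why the column name _title shows up. Since pandas exposes columns as attributes, the usual culprit is a column literally named fit, an ordinary English word that CountVectorizer can easily produce from a corpus. A minimal check and a possible workaround, assuming words_df, X, and Y from above:
# if this prints True, the word 'fit' from the corpus became a column,
# which makes the DataFrame look like an estimator to scikit-learn
print('fit' in words_df.columns)
# possible workaround: pass plain NumPy arrays, which have no 'fit' attribute
X_train, X_test, Y_train, Y_test = train_test_split(
    X.values, Y.values, test_size=0.2)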

Related

Make predictions with a trained model in Python

I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import numpy as np
import pandas as pd
import re  # used by normalize_text below
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.
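For what it's worth, the dimension mismatch comes from fitting a second CountVectorizer on the new texts: it builds a different vocabulary, so the resulting matrix has different columns than the ones nb was trained on. A minimal sketch of the second step, reusing the vectorizer that was fitted on the training data (without creating a new one):
# new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# transform (not fit_transform) maps the new texts onto the training
# vocabulary, so the column count matches what nb expects
classes = vectorizer.transform(test['TEXT'])
classificacao = nb.predict(classes)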

sklearn DecisionTreeClassifier with CountVectorizer and additional predictor

I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas dataframe with columns labeled 'Impression' (text), 'Volume' (floats), and 'Cancer' (label). I've been using only Impression to predict Cancer but would like to use Impression and Volume to predict Cancer instead.
My code previously that ran without issue:
X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
I've tried a few different ways to add the Volume predictor:
1) Only fit_transform the Impression column
X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
This throws the error
TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.
2) Call fit_transform on both Impression and Volume. Same code as above, except for the fit_transform line:
X_train = vectorizer.fit_transform(X_train)
This of course throws the error:
ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)
I'm pretty sure method #1 is the right way to go but I haven't been able to find any tutorials or solutions for how I can add the float predictor to this text classification model.
Any help would be appreciated!
ColumnTransformer() solves exactly this problem. Instead of manually appending the output of CountVectorizer to the other columns, you can set the remainder param to 'passthrough' in ColumnTransformer.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True, display='diagram')

data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123, 1, 2, 123],
                     'Cancer': [1, 0, 0, 1]})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)

ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')

pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
Use scikit-learn version 0.23.0 or later to see the visual representation of pipeline objects (the display param in set_config).
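As a follow-up note: once fitted, the pipeline applies the same column routing and vectorization to new data, so prediction is a single call (new_rows is a made-up example frame with the same feature columns):
# hypothetical new data; the ColumnTransformer picks out 'Impression'
# for the CountVectorizer and passes 'Volume' through untouched
new_rows = pd.DataFrame({'Impression': ['another short text'],
                         'Volume': [42]})
print(pipeline.predict(new_rows))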
You can use scipy's hstack to combine the two feature sets:
from scipy.sparse import hstack
import numpy as np
# vectorize the text column, then append Volume as an extra column;
# hstack takes a list, and Volume must be reshaped to a column vector
X_train_text = vectorizer.fit_transform(X_train['Impression'])
X_train_new = hstack([X_train_text, np.array(X_train['Volume'])[:, None]])
Now your new training matrix contains both features. And if I may advise: use TfidfVectorizer instead of CountVectorizer, since TF-IDF considers the importance of words in each document/Impression, while CountVectorizer only counts occurrences of words, so a word like "the" would get higher weight than the words that really matter to us.
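On the prediction side (a sketch, assuming the split from method #1 above and that dt was then fit on X_train_new): the test rows must be stacked the same way, using transform rather than fit_transform so the columns line up:
# build the test matrix in the same column order as the training matrix
X_test_text = vectorizer.transform(X_test['Impression'])
X_test_new = hstack([X_test_text, np.array(X_test['Volume'])[:, None]])
y_pred = dt.predict(X_test_new)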

how to get a list of wrong predictions on validation set

I'm trying to build a text-classification model on a database of site reviews (3 classes).
I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF (TfidfTransformer), and built an MNB model.
Now, after training and evaluating the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)
df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)
# transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)
# split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
My objective is to get the original indices of the reviews that the model predicted wrongly.
I managed to get the result like this:
predictions = c.predict(preprocessed_df['review_text'])
df2 = preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
df2[df2['business_category'] != df2['prediction']]
I'm sure there is a more elegant way...
It seems like there is another problem in your code: generally, the TF-IDF vectorizer is fit on the training data only, and the test data is then brought into the same format with the transform operation. This is primarily done to avoid data leakage. Please refer to TfidfVectorizer: should it be used on train only or train+test. I have modified your code to suit your need.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)
df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# split the raw text first, so the vectorizer never sees the test data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101
)
# TfidfVectorizer is equivalent to CountVectorizer followed by
# TfidfTransformer; fit on the training split only, transform both
vectorizer = TfidfVectorizer()
x_train_tf = vectorizer.fit_transform(x_train)
x_test_tf = vectorizer.transform(x_test)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)
# select the rows the model got wrong (note: !=, not ==)
incorrect_docs = x_test[predmnb != y_test]
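Since x_test in this version is still the raw text Series, it keeps the index of cleaned_df, so the original row labels of the misclassified reviews fall out directly:
# original indices of the wrong predictions, usable with LIME later
wrong_idx = incorrect_docs.index
print(cleaned_df.loc[wrong_idx, ["review_text", "business_category"]])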

Python Sklearn variables with inconsistent numbers of samples

I am learning sentiment analysis and I have a data frame of reviews, which I have to evaluate given a list of words, and get the weights assigned to those words. Unfortunately, when I try to fit the regression I get the following error:
"ValueError: Found input variables with inconsistent numbers of samples: [11, 133401]"
What am I missing?
CSV file
import re
import pandas
import sklearn
import numpy as np
products = pandas.read_csv('amazon_baby.csv')
selected_words = ["awesome", "great", "fantastic", "amazing", "love", "horrible", "bad", "terrible", "awful", "wow", "hate"]
# ignore all 3* reviews
products = products[products['rating'] != 3]
# positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >= 4
# create a separate column for each word
for word in selected_words:
    products[word] = [len(re.findall(word, x)) for x in products['review'].tolist()]
# Define X and y
X = products[selected_words]
y = products['sentiment']
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train) #here is where I get the error
CountVectorizer() expects an iterable of strings and returns vectors that represent the counts of words. You already implemented this with the for loop, and you are now trying to fit CountVectorizer() to the counts of your selected words. Iterating over a DataFrame yields its column names, so the vectorizer sees only your 11 word columns as "documents"; that is where the mismatch between 11 and 133,401 samples comes from.
Assuming you just want to use your selected words as features,
logreg.fit(X_train, y_train)
without the transformation will be fine.
Or, if you would like to use all the words as features, you could change your X to include the full review text
X = products['review'].astype(str)
then fit the CountVectorizer() and use
logreg.fit(X_train_dtm, y_train)
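A minimal sketch of that second option, reusing the split parameters and variable names from the question:
# use the raw review text and let CountVectorizer build the vocabulary;
# fit on the training split only, then transform both splits
X = products['review'].astype(str)
y = products['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(logreg.score(X_test_dtm, y_test))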

Machine Learning Algorithm does not work after Vectorizing a feature that is of type text

I am trying to classify, and my features are a combination of words, numbers, and free text. I am trying to vectorize the feature that is of type text, but when I run it through a classifying algorithm it throws the following error.
line 51, in <module>
    classifier.fit(X_train, y_train.values.ravel())
ValueError: setting an array element with a sequence.
Below is my code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv('data.csv')
df = df[pd.notnull(df['memo'])]
df = df[pd.notnull(df['name'])]
# factorize type, name, and categorized account
df['type_id'] = df.txn_type.factorize()[0]
df['name_id'] = df.name.factorize()[0]
df['categorizedAccountId'] = df.categorizedAccount.factorize()[0]
my_list = df['categorizedAccountId'].tolist()
print(my_list)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
memoFeatures = tfidf.fit_transform(df.memo)
df['memo_id'] = pd.Series(memoFeatures, index=df.index)
X = df.loc[:, ['type_id', 'name_id', 'memo_id']]
y = df.loc[:, ['categorizedAccountId']]
X_train, X_test, y_train, y_test = train_test_split(X, y)
'''print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
'''
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
confusion_matrix = confusion_matrix(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))
Here are a few rows of my data. The top row has the column labels, and categorizedAccount is the class:
"txn_type","name","memo","account","amount","categorizedAccount"
"Journal","","ABC.com 11/29/16 Payments",0,207.24,"1072 ABC.com Money Out Clearing"
"Bill Payment","College Tuition Fund","Multiple inv. (details on stub)",164,-207.24,"1072 ABC.com Money Out Clearing"
OK, so I have implemented some modifications to your code, which I paste here. This snippet goes immediately after you read the CSV and drop the null rows. You will have to implement the train_test_split yourself, though.
df['categorizedAccount'] = df['categorizedAccount'].astype('category')
df['all_text'] = df['txn_type'] + ' ' + df['name'] + ' ' + df['memo']
X = df['all_text']
y = df['categorizedAccount']
X_train = X  # Change these four lines for train_test_split;
X_test = X   # I don't have enough rows in the mock dataset to implement it,
y_train = y  # and it returns an error
y_test = y
tfidf = TfidfVectorizer()
X_train_transformed = tfidf.fit_transform(X_train)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_transformed, y_train)
X_test_transformed = tfidf.transform(X_test)
y_pred = classifier.predict(X_test_transformed)
classifier.score(X_test_transformed, y_test)  # score against the true labels
A few comments though:
from sklearn.feature_extraction.text import TfidfVectorizer
Imported once, ok
from io import StringIO
Unnecessary as far as I can see
from sklearn.feature_extraction.text import TfidfVectorizer
Why do you import it again?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
TfidfVectorizer does the job of both CountVectorizer and TfidfTransformer. From sklearn: "Equivalent to CountVectorizer followed by TfidfTransformer." See here for more
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
Not used, do not import.
Additionally:
1) It is not clear what you are trying to do with factorize. TfidfVectorizer automatically performs tokenization for any string of text that you provide it. All columns that you have selected in your original code contain only strings, so it makes more sense to concatenate them and let tfidf do the tokenization, rather than trying to do it yourself.
2) Use the Pipeline constructor; it will save your life (see the sketch after this list).
3) X = df.loc[:, ['type_id', 'name_id', 'memo_id']]: this type of slicing looks very bad; just call df[['column_name_1', 'column_name_2', 'column_name_3']]
4) And remember PEP20, "Simple is better than complex"!
As a last advice, when developing a ML model it's always better to start with something plain and simple, and then develop further once you have something that works.
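To make point 2 concrete, here is a minimal sketch of the same tfidf-plus-logistic-regression flow as a single Pipeline (variable names reused from the snippet above):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# chaining the vectorizer and the classifier means fit/predict handle the
# transform step automatically, so test data can never leak into the fit
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(random_state=0)),
])
pipe.fit(X_train, y_train)  # X_train is the raw 'all_text' column
y_pred = pipe.predict(X_test)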
