Oversampling after splitting the dataset - Text classification - python

I am having some issues with the steps to follow for over-sampling a dataset.
What I have done is the following:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Separate input features and target
y_up = df.Label
X_up = df.drop(columns=['Date', 'Links', 'Paths'])
# setting up testing and training sets
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)
class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]
# upsample the minority class to the size of the majority class
class_1_upsampled = resample(class_1,
                             replace=True,
                             n_samples=len(class_0),
                             random_state=27)
# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])
Since my dataset looks like:
Label Text
1 bla bla bla
0 once upon a time
1 some other sentences
1 a few sentences more
1 this is my dataset!
I applied a vectorizer to transform the strings into numbers:
X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]
X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)
Then I applied the logistic regression function:
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)
However, I got the following error at this step:
X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
pred_up_log = upsampled_log.predict(X_test_up)
ValueError: X has 3021 features per sample; expecting 5542
Since I was told that I should apply the oversampling after splitting my dataset into train and test sets, I had not vectorised the test set.
My questions are then the following:
Is it right to vectorise the test set afterwards, like this: X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)?
Is it right to apply the over-sampling after splitting the dataset into training and test sets?
Alternatively, I tried the SMOTE function. The code below works, but if possible I would prefer to use plain random oversampling rather than SMOTE.
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df['Text'],df['Label'], test_size=0.2,random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, y_train_up)  # fit_resample in newer imbalanced-learn versions
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)
nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(count_vect.transform(X_test_up))
print(accuracy_score(y_test_up,y_pred))
Any comments and suggestions will be appreciated.
Thanks

It is better to do the CountVectorizer and TfidfTransformer steps on the whole dataset, then split into train and test, and keep the result as a sparse matrix without converting it back into a DataFrame.
For example, here is a dataset:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'Text':['This is bill','This is mac','here’s an old saying',
'at least old','data scientist years','data science is data wrangling',
'This rings particularly','true for data science leaders',
'who watch their data','scientists spend days',
'painstakingly picking apart','ossified corporate datasets',
'arcane Excel spreadsheets','Does data science really',
'they just delegate the job','Data Is More Than Just Numbers',
'The reason that',
'data wrangling is so difficult','data is more than text and numbers'],
'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})
We perform the vectorization and transformation, followed by the split:
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)
X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_tfidf,df['Label'].values,
test_size=0.2,random_state=42)
Upsampling can then be done by resampling the row indices of the minority class:
class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                         np.random.choice(class_1, len(class_0), replace=True)))
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])
And the prediction will work:
upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])
You may have concerns about data leakage, that is, some information from the test set going into the training set through the use of TfidfTransformer() on the full dataset. I have honestly yet to see concrete proof or a demonstration of this, but below is an alternative where the tf-idf transformation is fitted on the training split only:
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_counts,df['Label'].values,
test_size=0.2,random_state=42)
class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                         np.random.choice(class_1, len(class_0), replace=True)))
tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsample_y = y_train_up[up_idx]
upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain, upsample_y)
X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)
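If you want to be strict and avoid fitting anything on the test data at all, a minimal sketch would be to split the raw text first and fit both the CountVectorizer and the TfidfTransformer on the training split only (variable names below are mine):
# split the raw text before any fitting, so the test set never influences the vocabulary or the idf weights
X_train_txt, X_test_txt, y_train_up, y_test_up = train_test_split(
    df['Text'], df['Label'].values, test_size=0.2, random_state=42)

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# fit on the training text only, then reuse the fitted objects on the test text
X_train_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(X_train_txt))
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test_txt))

# upsample the smaller class by resampling row indices, as above
class_0 = np.where(y_train_up == 0)[0]
class_1 = np.where(y_train_up == 1)[0]
up_idx = np.concatenate((class_0,
                         np.random.choice(class_1, len(class_0), replace=True)))

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_tfidf[up_idx, :], y_train_up[up_idx])
upsampled_log.predict(X_test_tfidf)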

Related

Make predictions with a trained model in Python

I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.
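For what it is worth, a likely fix (a sketch assuming the intent is to reuse the vectorizer and model already fitted on the training data; the column names are taken from the question) is to transform the new text with the fitted vectorizer instead of fitting a new one:
# new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# reuse the vectorizer fitted on the training data: transform only, do not fit again
classes = vectorizer.transform(test['TEXT'])
classificacao = nb.predict(classes)
This keeps the new features in the same vocabulary (and therefore the same number of columns) the model was trained on, which is what the dimension-mismatch error is about.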

Training a sklearn classifier with more than a single feature

I'm currently training a LinearSVC classifier with a single feature vectorizer. I'm processing news articles, which are stored in separate files. Those files originally had a title, a textual body, a date, an author and sometimes an image, but I ended up keeping only the textual body as a feature. I'm doing it this way:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics, preprocessing

# Loading the files (plain files with just the news content; no date, author or other features)
data_train = load_files(self.TRAIN_FOLDER, encoding=self.ENCODING)
data_test = load_files(self.TEST_FOLDER, encoding=self.ENCODING)
unlabeled = load_files(self.UNLABELED_FOLDER, encoding=self.ENCODING)
categories = data_train.target_names
# Get the sparse matrix of each dataset
y_train = data_train.target
y_test = data_test.target
# Vectorizing
vectorizer = TfidfVectorizer(encoding=self.ENCODING, use_idf=True, norm='l2', binary=False, sublinear_tf=True, min_df=0.001, max_df=1.0, ngram_range=(1, 2), analyzer='word')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
X_unlabeled = vectorizer.transform(self.data_unlabeled.data)
# Instantiating the classifier
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
# Fitting the model according to the training set and predicting
scaler = preprocessing.StandardScaler(with_mean=False)
scaler = scaler.fit(X_train)
normalized_X_train = scaler.transform(X_train)
clf.fit(normalized_X_train, y_train)
normalized_X_test = scaler.transform(X_test)
pred = clf.predict(normalized_X_test)
accuracy_score = metrics.accuracy_score(y_test, pred)
recall_score = metrics.recall_score(y_test, pred)
precision_score = metrics.precision_score(y_test, pred)
But now I would like to include other features, such as the date or the author, and all the simpler examples I found use a single feature. So I'm not really sure how to proceed. Should I have all the information in a single file? How do I differentiate authors from content? Should I use a vectorizer for each feature? If so, should I fit a model with the different vectorized features? Or should I have a different classifier for each feature? Can you suggest something to read (explained for newbies)?
Thanks in advance,
The output of TfidfVectorizer is a scipy.sparse.csr.csr_matrix object. You may use scipy.sparse.hstack to add more features to it. Alternatively, you may convert the feature space you already have into a numpy array or pandas DataFrame and then add the new features (which you might have created from other vectorizers) as new columns. Either way, your final X_train and X_test should include all the features in one place. You may also need to standardize them before training.
I do not have your data so here is an example on some dummy data:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
X_train = pd.DataFrame(X_train.todense())
X_train['has_image'] = [1, 0, 0, 1] # just adding a dummy feature for demonstration
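For the sparse route, a minimal sketch using scipy.sparse.hstack on the same dummy corpus (the extra column is again just an invented illustrative feature) might look like this:
import numpy as np
from scipy.sparse import hstack, csr_matrix

X_text = vectorizer.fit_transform(corpus)           # sparse tf-idf features
extra = csr_matrix(np.array([[1], [0], [0], [1]]))  # one extra numeric feature per document
X_all = hstack([X_text, extra]).tocsr()             # all features in one sparse matrix
If you need scaling without densifying the matrix, something like sklearn.preprocessing.MaxAbsScaler works directly on sparse input.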

Python supervised ML text classification into different categories with probability

I am working with a large dataset of tweets, of which I have manually classified a small subset into four categories and used it for training. The manually classified categories have about twenty tweets each, while the dataset has tens of thousands of tweets. Here is the code I used to train the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
tweets = []
labels_list = []
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                        encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(tweets).toarray()
labels = labels_list
X_train, X_test, y_train, y_test = train_test_split(tweets, labels,
                                                     random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
Whenever I type
print(clf.predict(count_vect.transform(["Some random content"])))
the model accurately outputs the label that the tweet belongs to, provided I fill in content that resembles the training data. However, if I type in total nonsense, it will still output some category that I know the text doesn't belong to.
My goal is to find the 100 tweets most likely to belong to a given category. However, the four categories mentioned above are not representative of the entire dataset, so I need to know: is there some sort of probability threshold I could use to exclude a tweet from the 100 if its probability is too low?
I tried looking into multinomial logistic regression but I could not find any sort of probability output, so maybe I am just doing something wrong, or perhaps there is another way; I would like to know!
You can use the .predict_proba() method on your clf to get the probability of every class for every tweet. Then, to get the top 100 tweets for, say, class 0, you sort all your tweets by the probability of class 0 and take the top 100.
You can do it easily with pandas for instance:
import pandas as pd

# Xtest_tfidf is the tf-idf feature matrix of the tweets you want to rank
probsd = pd.DataFrame(clf.predict_proba(Xtest_tfidf))
top_100_class_0_tweets = probsd.sort_values(0, ascending=False).head(100).index
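If you also want a rough probability cut-off, as asked in the question, one option (the 0.8 value below is only an illustrative threshold, not a recommendation) is to keep only the tweets whose best class probability is high enough before taking the top 100:
import numpy as np

probs = clf.predict_proba(Xtest_tfidf)     # shape: (n_tweets, n_classes)
best = probs.max(axis=1)                   # highest class probability per tweet
confident = np.where(best >= 0.8)[0]       # drop tweets the model is unsure about
top_100_class_0 = confident[np.argsort(probs[confident, 0])[::-1]][:100]  # top 100 for class 0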

Supervised machine learning with scikit-learn

This is the first time I'm doing supervised machine learning. This is a pretty advanced topic (at least for me) and I find it hard to specify a question, since I'm not sure what is going wrong.
# Create a training list and test list (looks something like this):
train = [('this hostel was nice',2),('i hate this hostel',1)]
test = [('had a wonderful time',2),('terrible experience',1)]
# Loading modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
# Use a BOW representation of the reviews
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in train])
test_features = vectorizer.fit([r[0] for r in test])
# Fit a naive bayes model to the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
# Use the classifier to predict classification of test dataset
predictions = nb.predict(test_features)
actual=[r[1] for r in test]
Here I get the error:
float() argument must be a string or a number, not 'CountVectorizer'
This confuses me, since the original ratings that I have zipped up with the reviews are:
type(ratings_new[0])
int
You should change the line
test_features = vectorizer.fit([r[0] for r in test])
to:
test_features = vectorizer.transform([r[0] for r in test])
The reason is that you have already used your training data to fit the vectorizer, so you don't need to fit it again on your test data; instead, you need to transform it. (The error mentions 'CountVectorizer' because fit returns the fitted vectorizer object itself, not a feature matrix, so test_features ends up holding a CountVectorizer rather than numeric data.)

How do I convert new data into the PCA components of my training data?

Suppose I have some text sentences that I want to cluster using kmeans.
sentences = [
"fix grammatical or spelling errors",
"clarify meaning without changing it",
"correct minor mistakes",
"add related resources or links",
"always respect the original author"
]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1,verbose=1)
km.fit(X)
Now I could predict which of the classes a new text would fall into,
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
However, say I apply PCA to reduce 10,000 features to 50.
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=50,whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)
I can no longer do the same thing to predict the cluster for a new text, because the output of the vectorizer no longer matches the feature space the model was trained on:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50
So how do I transform my new text into the lower dimensional feature space?
You want to use pca.transform on your new data before feeding it to the model. This will perform dimensionality reduction using the same PCA model that was fitted when you ran pca.fit_transform on your original data. You can then use your fitted model to predict on this reduced data.
Basically, think of it as fitting one large model, which consists of stacking three smaller models. First you have a CountVectorizer model that determines how to process data. Then you run a RandomizedPCA model that performs dimensionality reduction. And finally you run a KMeans model for clustering. When you fit the models, you go down the stack and fit each one. And when you want to do prediction, you also have to go down the stack and apply each one.
# Initialize models
vectorizer = CountVectorizer(min_df=1)
pca = RandomizedPCA(n_components=50, whiten=True)
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)
# Fit models
X = vectorizer.fit_transform(sentences)
X2 = pca.fit_transform(X)
km.fit(X2)
# Predict with models
X_new = vectorizer.transform(["hello world"])
X2_new = pca.transform(X_new)
km.predict(X2_new)
Use a Pipeline:
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import RandomizedPCA
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> sentences = [
... "fix grammatical or spelling errors",
... "clarify meaning without changing it",
... "correct minor mistakes",
... "add related resources or links",
... "always respect the original author"
... ]
>>> vectorizer = CountVectorizer(min_df=1)
>>> svd = TruncatedSVD(n_components=5)
>>> km = KMeans(n_clusters=2, init='random', n_init=1)
>>> pipe = make_pipeline(vectorizer, svd, km)
>>> pipe.fit(sentences)
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=1))])
>>> pipe.predict(["hello, world"])
array([0], dtype=int32)
(Showing TruncatedSVD because RandomizedPCA will stop working on text frequency matrices in an upcoming release; it actually performed an SVD, not full PCA, anyway.)
