Using three different labels in machine learning - python

I'm new to machine learning. I'm adapting code that classifies emails as spam or ham to a different dataset. My dataset doesn't have a single spam/ham label; instead it has two separate label columns (age and gender). When I pass both label columns to train_test_split in the code block below, I get the error "too many values to unpack". How can I include all of my labels?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, import_data['age'], import_data['gender'], test_size = 0.20, random_state = 0)
Full code:
import numpy as np
import pandas
import nltk
from nltk.corpus import stopwords
import string
# Import Data.
import_data = pandas.read_csv('/root/Desktop/%20/%100.csv' , encoding='cp1252')
# To See Columns Headers.
print(import_data.columns)
# To Remove Duplications.
import_data.drop_duplicates(inplace = True)
# To Find Data Size.
print(import_data.shape)
#Tokenization (a list of tokens), will be used as the analyzer
#1.Punctuations are [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
#2.Stop words in natural language processing, are useless words (data).
def process_text(text):
    '''
    What will be covered:
    1. Remove punctuation
    2. Remove stopwords
    3. Return list of clean text words
    '''
    #1
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    #2
    clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    #3
    return clean_words
#Show the Tokenization (a list of tokens )
print(import_data['text'].head().apply(process_text))
# Convert the text into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(import_data['text'])
#Split data into 80% training & 20% testing data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, import_data['gender'], import_data['frequency'], test_size = 0.20, random_state = 0)
#Get the shape of messages_bow
print(messages_bow.shape)

train_test_split splits each argument you pass to it into train and test sets. Since you are splitting three separate types of data, you need 6 variables:
X_train, X_test, age_train, age_test, gender_train, gender_test = train_test_split(messages_bow, import_data['age'], import_data['gender'], test_size=0.20, random_state=0)
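As a minimal sketch of the point above, the following uses small made-up arrays in place of messages_bow and the two label columns (the values are hypothetical). It shows that three inputs produce six outputs and that the rows stay aligned across all of them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for messages_bow and the two label columns.
X = np.arange(20).reshape(10, 2)
age = np.array([21, 35, 42, 18, 50, 29, 33, 47, 26, 38])
gender = np.array(["f", "m", "f", "m", "f", "m", "f", "m", "f", "m"])

# Three input arrays -> six output arrays (a train and a test part for each).
X_train, X_test, age_train, age_test, gender_train, gender_test = train_test_split(
    X, age, gender, test_size=0.20, random_state=0
)

print(X_train.shape, X_test.shape)
print(len(age_train), len(age_test), len(gender_train), len(gender_test))
```

Each label array can then be used to train its own classifier on the same X_train.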

Related

Make predictions with a trained model on Python

I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.
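A likely cause, sketched here under assumptions: the second CountVectorizer is fitted on the new data, so it learns a different vocabulary and produces a matrix with a different number of columns than the one the model was trained on. The sketch below uses made-up texts and labels (hypothetical stand-ins for the TEXT and codigo columns) and reuses the vectorizer that was fitted on the training data, calling only transform on the new texts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the training and new CSV files.
train_texts = ["great product love it", "terrible quality broke fast",
               "love the color", "broke after a week"]
train_labels = ["positive", "negative", "positive", "negative"]
new_texts = ["love this great item", "it broke quickly"]

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_texts)  # the vocabulary is learned once, here
nb = MultinomialNB().fit(x_train, train_labels)

# Reuse the SAME fitted vectorizer: transform only, no second fit.
x_new = vectorizer.transform(new_texts)
predictions = nb.predict(x_new)

print(x_train.shape[1] == x_new.shape[1])
print(predictions)
```

Words in the new texts that were never seen during training are simply ignored by transform.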

Splitting data into training and testing sets raises 'Found input variables with inconsistent numbers of samples: [1000, 23486]'

My project is to classify reviews as good or bad using NLP. I have imported the data, tokenised it, and vectorised it with a bag-of-words model. Now I have to split the data into testing and training sets, and I get an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]".
My file has a column called Review Text, and I want to classify the reviews as good or bad. I have attached the TSV file that I am using for this project, along with the code. Please help me correct the error and suggest any changes to my approach.
My data file here
import numpy as np
import pandas as pd
import nltk
import matplotlib
dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in
              set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
The problem is that X and y must have the same number of samples. Here X is built from only the first 1000 reviews, while y covers all 23486 rows.
If you want to use just 1000 reviews, keep the same for loop and select the matching rows for y:
y = dataset.iloc[:1000, 6].values
Otherwise, if you want to use the whole dataset, change the range of the for loop so that it iterates over every row.
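Both options can be sketched with a single row count that drives the corpus and the labels alike. This is only an illustration on a made-up miniature dataset: the label column name and the n_rows variable are assumptions, and stemming and stop-word removal are left out to keep it short.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset; the "label" column name is an assumption.
dataset = pd.DataFrame({
    "Review Text": ["Love it!", "Too small", "Great fit",
                    "Poor fabric", "Nice colour", "Fell apart"],
    "label": [1, 0, 1, 0, 1, 0],
})

n_rows = len(dataset)  # set this to 1000 to keep only the first 1000 reviews

corpus = []
for text in dataset["Review Text"].iloc[:n_rows]:
    cleaned = re.sub("[^a-zA-Z]", " ", str(text)).lower()
    corpus.append(" ".join(cleaned.split()))

X = CountVectorizer(max_features=1500).fit_transform(corpus).toarray()
y = dataset["label"].iloc[:n_rows].values  # exactly the same rows as the corpus

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X.shape[0], len(y))
```

Because the same n_rows slices both the text and the labels, X and y cannot drift apart in length.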

Twitter sentiment analysis on a string

I've written a program that takes a Twitter dataset of tweets and labels (0 for neutral sentiment and 1 for negative sentiment) and predicts which category a tweet belongs to.
The program works well on the training and test sets. However, I'm having trouble applying the predict function to a single string, and I'm not sure how to do that.
I have tried cleaning the string the same way I cleaned the dataset before calling the predict function, but the values returned have the wrong shape.
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re
#Loading dataset
dataset = pd.read_csv('tweet.csv')
#List to hold cleaned tweets
clean_tweet = []
#Cleaning tweets
for i in range(len(dataset)):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = re.sub('#[\w]*',' ',dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    clean_tweet.append(tweet)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(clean_tweet)
X = X.toarray()
y = dataset.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.naive_bayes import GaussianNB
n_b = GaussianNB()
n_b.fit(X_train, y_train)
y_pred = n_b.predict(X_test)
some_tweet = "this is a mean tweet" # How to apply predict function to this string
Use cv.transform([cleaned_new_tweet]) on your new string to transform your new Tweet to your existing document-term matrix. That will return the Tweet in the correct shape.
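A minimal sketch of that suggestion, using a few made-up cleaned tweets and labels (hypothetical data, with the stemming step omitted): the fitted cv is reused, and the single string is wrapped in a list so that transform returns one row with the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical cleaned tweets and labels (0 = neutral, 1 = negative).
clean_tweets = ["nice day today", "you are mean", "lunch was fine", "mean and rude tweet"]
labels = [0, 1, 0, 1]

cv = CountVectorizer(max_features=3000)
X = cv.fit_transform(clean_tweets).toarray()
n_b = GaussianNB().fit(X, labels)

cleaned_new_tweet = "this is a mean tweet"
# transform (not fit_transform) expects an iterable of documents, hence the list.
new_X = cv.transform([cleaned_new_tweet]).toarray()
prediction = n_b.predict(new_X)

print(new_X.shape, X.shape)
print(prediction)
```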
tl;dr
When the classifier is a Pipeline that starts with a CountVectorizer, .predict() expects a list of strings, not a single string. So you need to put some_tweet in a list. E.g. new_tweet = ["this is a mean tweet"]
Your code
You had some issues in your code that I tried fixing for you...
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re
#Loading dataset
dataset = pd.read_csv('tweet.csv')
# Define cleaning function
# You can define it once as a function so it can be easily reused elsewhere
def clean_tweet(tweet: str):
    tweet = re.sub(r'#\w*', ' ', tweet)  # remove hashtags first, while '#' is still in the text
    tweet = re.sub('[^a-zA-Z]', ' ', tweet)  # BUG in the original: pass the tweet you already modified, not the original tweet again
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    return tweet
#List to hold cleaned tweets and labels
X = [clean_tweet(tweet) for tweet in dataset['tweet']] # you can create your X directly with your new function
y = dataset.iloc[:, 1].values
# Define a single model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
# Use Pipeline as your classifier, this way you don't need to keep calling a transform and fit all the time.
classifier = Pipeline(
    [
        ('cv', CountVectorizer(max_features=300)),
        # GaussianNB needs dense input, so convert the sparse count matrix (this replaces X.toarray())
        ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
        ('n_b', GaussianNB())
    ]
)
# Previously you fitted your CountVectorizer BEFORE splitting into train/test. That is a big mistake.
# First you split into train/test and then you train all the steps of your model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Here you train all steps of your Pipeline in one go.
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# Predicting new tweets
some_tweet = "this is a mean tweet"
some_tweet = clean_tweet(some_tweet) # re-use your clean function
predicted = classifier.predict([some_tweet]) # put the tweet inside a list!!!!

Python Sklearn variables with inconsistent numbers of samples

I am learning sentiment analysis and I have a data frame of reviews, which I have to evaluate given a list of words, and get the weights assigned to those words. Unfortunately, when I try to fit the regression I get the following error:
"ValueError: Found input variables with inconsistent numbers of samples: [11, 133401]"
What am I missing?
CSV file
import re
import pandas
import sklearn
import numpy as np
products = pandas.read_csv('amazon_baby.csv')
selected_words=["awesome", "great", "fantastic", "amazing", "love", "horrible", "bad", "terrible", "awful", "wow", "hate"]
#ignore all 3* reviews
products = products[products['rating'] != 3]
#positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >=4
#create a separate column for each word
for word in selected_words:
    products[word]=[len(re.findall(word,x)) for x in products['review'].tolist()]
# Define X and y
X = products[selected_words]
y = products['sentiment']
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train) #here is where I get the error
CountVectorizer() expects an iterable of strings and returns vectors that represent the word counts. You have already computed these counts with the for loop, and now you are trying to fit CountVectorizer() to the counts of your selected words. When it is given a DataFrame, it iterates over the column names, which is why it reports 11 samples.
Assuming you just want to use your selected words as features,
logreg.fit(X_train, y_train)
without the transformation will be fine.
Or, if you would like to use all the words as features, you could change your X to include the full review
X = products['review'].astype(str)
and then fit the CountVectorizer() and use
logreg.fit(X_train_dtm, y_train)
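The two options can be sketched side by side on a made-up miniature stand-in for amazon_baby.csv. The reviews, the shortened word list, and the variable names with the Xc_/Xt_ prefixes are all hypothetical; only the structure follows the answer.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical miniature stand-in for the reviews data.
products = pd.DataFrame({
    "review": ["love it, great toy", "awful, I hate it", "great and amazing",
               "terrible and bad", "love love love", "bad, just awful",
               "amazing, wow", "horrible product"],
    "sentiment": [True, False, True, False, True, False, True, False],
})
selected_words = ["great", "love", "amazing", "bad", "awful", "hate"]
y = products["sentiment"]

# Option 1: the hand-built counts are already numeric features, so fit directly.
for word in selected_words:
    products[word] = [len(re.findall(word, x)) for x in products["review"]]
X_counts = products[selected_words]
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_counts, y, test_size=0.25, random_state=42)
logreg_counts = LogisticRegression().fit(Xc_train, yc_train)

# Option 2: vectorize the raw review text and fit on the document-term matrix.
X_text = products["review"].astype(str)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_text, y, test_size=0.25, random_state=42)
vect = CountVectorizer()
Xt_train_dtm = vect.fit_transform(Xt_train)
Xt_test_dtm = vect.transform(Xt_test)
logreg_text = LogisticRegression().fit(Xt_train_dtm, yt_train)

print(Xc_train.shape, Xt_train_dtm.shape, len(yt_train))
```

In both cases the number of rows passed to fit matches the number of labels, which is what the original error was complaining about.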

Classification with n-grams

I want to use a sklearn classifier using n-gram features. Furthermore, I want to do cross-validation to find the best order of the n-grams. However, I am a bit stuck on how I can fit all the pieces together.
For now, I have the following code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
text = ... # This is the input text. A list of strings
labels = ... # These are the labels of each sentence
# Find the optimal order of the ngrams by cross-validation
scores = pd.Series(index=range(1,6), dtype=float)
folds = KFold(n_splits=3)
for n in range(1,6):
    count_vect = CountVectorizer(ngram_range=(n,n), stop_words='english')
    X = count_vect.fit_transform(text)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
    clf = MultinomialNB()
    score = cross_val_score(clf, X_train, y_train, cv=folds, n_jobs=-1)
    scores.loc[n] = np.mean(score)
# Evaluate the classifier using the best order found
order = scores.idxmax()
count_vect = CountVectorizer(ngram_range=(order,order), stop_words='english')
X = count_vect.fit_transform(text)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf = clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print('Accuracy is {}'.format(acc))
However, I feel like this is the wrong way to do it, since I create a train-test split in every loop.
If I do a train-test split beforehand and apply the CountVectorizer to both parts separately, then these parts have different shapes, which causes problems when using clf.fit and clf.score.
How can I solve this?
EDIT: If I try to build a vocabulary first, I still have to build several vocabularies, since the vocabulary for unigrams is different from that of bigrams, etc.
To give an example:
# unigram vocab
vocab = set()
for sentence in text:
    for word in sentence:
        if word not in vocab:
            vocab.add(word)
len(vocab) # 47291
# bigram vocab
vocab = set()
for sentence in text:
    bigrams = nltk.ngrams(sentence, 2)
    for bigram in bigrams:
        if bigram not in vocab:
            vocab.add(bigram)
len(vocab) # 326044
This again leads me to the same problem of needing to apply the CountVectorizer for every n-gram size.
You need to set the vocabulary parameter first. One way or another, both parts must be vectorized against the same vocabulary; otherwise the dimensions cannot match. If you do the train/test split first and fit a CountVectorizer on each part separately, there may be words in one set that are not present in the other, and that is where the dimension mismatch comes from.
The documentation says:
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
Further down you'll find a description for vocabulary.
vocabulary:
Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
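A minimal sketch of the vocabulary parameter in use, on a made-up toy corpus. Building the vocabulary from the training part only is an assumption made for this illustration, as are the names vocab_builder and order; the answer does not say which text the vocabulary should come from.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, already split into two parts.
train_text = ["the cat sat on the mat", "the dog barked"]
test_text = ["a bird sat on the fence"]

order = 2  # the n-gram order being evaluated

# Learn the n-gram vocabulary once (here from the training part, an assumption).
vocab_builder = CountVectorizer(ngram_range=(order, order))
vocab_builder.fit(train_text)

# Pass that fixed vocabulary in, so every part is mapped to the same columns.
count_vect = CountVectorizer(ngram_range=(order, order),
                             vocabulary=vocab_builder.vocabulary_)
X_train = count_vect.transform(train_text)
X_test = count_vect.transform(test_text)

print(X_train.shape, X_test.shape)
```

One vocabulary is still needed per n-gram order, but each one is built once and then shared by the training and test parts, so their column counts agree.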
