I'm trying to do text classification for a large corpus (732,066 tweets) in Python.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
#dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
# Importing the dataset
cols = ["text","geocoordinates0","geocoordinates1","grid"]
dataset = pd.read_csv('tweets.tsv', delimiter = '\t', usecols=cols, quoting = 3, error_bad_lines=False, low_memory=False)
# Removing Non-ASCII characters
def remove_non_ascii_1(dataset):
    return ''.join([i if ord(i) < 128 else ' ' for i in dataset])
# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 732066):
    review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
This is the error I get. It's where I'm stuck and unable to proceed with the rest of the text classification:
Traceback (most recent call last):
File "<ipython-input-2-3fac33122b74>", line 2, in <module>
review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])
File "C:\Anaconda3\envs\py35\lib\re.py", line 182, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
Thanks in advance for the help.
Try
str(dataset.loc[dataset.index[i], 'text'])
That'll convert it to a str object, from whatever type it was before.
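The usual cause of that TypeError is that some rows in the 'text' column are NaN (a float), so re.sub receives something that isn't a string. A minimal sketch of two ways to handle it up front, assuming the same dataset variable (pick one; adjust the column name to your file):
# Option 1: coerce the whole column to strings before the cleaning loop
dataset['text'] = dataset['text'].astype(str)
# Option 2: drop rows whose text is missing and reset the index
dataset = dataset.dropna(subset=['text']).reset_index(drop=True)
# Then iterate over however many rows are actually left
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])
    # ... rest of the cleaning loop as before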
Related
I have a CSV file of
lemma,trained
iran seizes bitcoin mining machines power spike,-1
... (goes on for 1054 lines)
And my code looks like:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('lemma copy.csv')
X = df.iloc[:, 0].values
y = df.iloc[:, 1].values
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
I am getting the error
Traceback (most recent call last):
File "/home/arctesian/Scripts/School/EE/Algos/Qual/bayes/sklean.py", line 20, in <module>
X_train = sc_X.fit_transform(X_train)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 867, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 809, in fit
return self.partial_fit(X, y, sample_weight)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 844, in partial_fit
X = self._validate_data(
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 577, in _validate_data
X = check_array(X, input_name="X", **check_params)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/utils/validation.py", line 856, in check_array
array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'twitter ios beta lays groundwork bitcoin tips'
Printing this out shows that the random split puts that line first, so it must be a problem with encoding the text data as numbers. How do I fix this problem?
Sometimes searching for the right question on Stack Overflow (or the internet as a whole) is difficult. The reason you're having trouble finding an answer is that your question is really an NLP question: your CSV contains lemmas (text), not numeric features.
You'll have to preprocess your data in some way, such as by using word vectors. Word vectors are essentially a model trained on a large corpus of text so that each word can be represented by a vector of length N. I'm greatly simplifying this, of course.
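For example, with gensim's downloader you can load a small pretrained model and look up one vector per word. This is only a sketch, assuming the gensim package is installed and can download the model; the model name is just one of the available options:
import gensim.downloader as api
# Downloads a small pretrained GloVe model the first time it is called
wv = api.load("glove-wiki-gigaword-50")
vec = wv["power"]   # a 50-dimensional numpy vector for the word "power"
print(vec.shape)    # (50,)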
Another strategy is to use the bag of words approach. A bag of words takes the count of each word that appears in your corpus. You use the bag of words rather than the original strings to train your models. Here's a very small example using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["I like cats", "meow", "Espeon is a cool Pokemon",
          "my friend has lots of pet fish",
          "my pet cat wants to eat my friend's fish", "spams spam", "not spam",
          "someone please hire me for a job", "nlp is cool",
          "this corpus isn't actually large enough to use counter vectorizer well"]
count_vec = CountVectorizer(ngram_range=(1, 3), stop_words="english").fit(corpus)
corpus_cv = count_vec.transform(corpus)
I skipped steps to keep the code concise, but the above is the gist of using CountVectorizer.
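Once the vectorizer is fitted, the matrix it produces (corpus_cv above) is what you train on instead of the raw strings, and any new text must be transformed with the same fitted vectorizer. A minimal sketch continuing the toy example; the labels are made up purely to show the shape of the calls:
from sklearn.naive_bayes import MultinomialNB
# Dummy labels, one per document in the toy corpus above
labels = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
clf = MultinomialNB()
clf.fit(corpus_cv, labels)
# New, unseen text goes through the SAME fitted vectorizer
new_doc = count_vec.transform(["my cat is cool"])
print(clf.predict(new_doc))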
So I fixed it by using @joshua megauth's method and getting rid of pandas. I did this:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from coalas import csvReader as c
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# df = pd.read_csv('lemma copy.csv')
def vect(X):
    features = vectorizer.fit_transform(X)
    features_nd = features.toarray()
    return features_nd

def test():
    y_pred = classifer.predict(X_test)
    print(accuracy_score(y_pred, y_test))

if __name__ == "__main__":
    c.importCSV('lemma copy.csv')
    vectorizer = CountVectorizer(
        analyzer='word',
        lowercase=False,
    )
    X = c.lemma
    # y = c.Best
    y = c.trained
    features_nd = vect(X)
    X_train, X_test, y_train, y_test = train_test_split(features_nd, y, test_size=0.2, random_state=0)
    sc_X = StandardScaler()
    # print(X_train)
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)  # transform only, so the test set uses the scaler fitted on the training data
    classifer = GaussianNB()
    classifer.fit(X_train, y_train)
    test()
I adapted the following code from Susan Li's post, but ran into an error when the code tries to tokenize text using NLTK's resources (or there could be something wrong with the keyed vectors loaded from the web). The error occurs in the 5th code block (see below; the keyed vectors may take a while to load from the web):
## 1. load packages and data
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import sent_tokenize
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup
%matplotlib inline
df = pd.read_csv('https://www.dropbox.com/s/b2w7iqi7c92uztt/stack-overflow-data.csv?dl=1')
df = df[pd.notnull(df['tags'])]
my_tags = ['java','html','asp.net','c#','ruby-on-rails','jquery','mysql','php','ios','javascript','python','c','css','android','iphone','sql','objective-c','c++','angularjs','.net']
## 2. cleaning
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|#,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    text = BeautifulSoup(text, "lxml").text  # strip HTML
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols matched by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # delete stopwords
    return text
df['post'] = df['post'].apply(clean_text)
## 3. train test split
X = df.post
y = df.tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 42)
## 4. load keyed vectors from the web: will take a while to load
import gensim
word2vec_path = "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
wv = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
wv.init_sims(replace=True)
## 5. this is where it goes wrong
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens
train, test = train_test_split(df, test_size=0.3, random_state = 42)
test_tokenized = test.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values
train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values
X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)
## 6. perform logistic regression test
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(X_train_word_average, train['tags'])
y_pred = logreg.predict(X_test_word_average)
print('accuracy %s' % accuracy_score(y_pred, test.tags))
print(classification_report(test.tags, y_pred,target_names=my_tags))
Update on part 5 (per @luigigi's comments)
## 5. download nltk and use apply() function without using lambda
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import sent_tokenize
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens
train, test = train_test_split(df, test_size=0.3, random_state = 42)
test_tokenized = test['post'].apply(w2v_tokenize_text).values
train_tokenized = train['post'].apply(w2v_tokenize_text).values
X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)
## now run the test
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(X_train_word_average, train['tags'])
y_pred = logreg.predict(X_test_word_average)
print('accuracy %s' % accuracy_score(y_pred, test.tags))
print(classification_report(test.tags, y_pred,target_names=my_tags))
This should work.
The nltk tokenizer expects the punkt resource, so you have to download it first:
nltk.download('punkt')
Also, you don't need a lambda expression to apply your tokenizer function. You can simply use:
test_tokenized = test['post'].apply(w2v_tokenize_text).values
train_tokenized = train['post'].apply(w2v_tokenize_text).values
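For reference, the code also relies on word_averaging_list from the original post, which isn't shown in the question. A minimal sketch of what such a helper typically looks like (this is an assumption, not the post's exact implementation): average the word2vec vectors of the tokens the model knows, and stack one row per document.
import numpy as np

def word_averaging(wv, words):
    # Average the vectors of all tokens present in the model's vocabulary
    vecs = [wv[w] for w in words if w in wv]
    if not vecs:
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)

def word_averaging_list(wv, tokenized_texts):
    return np.vstack([word_averaging(wv, words) for words in tokenized_texts])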
My project is to classify reviews as good or bad using NLP. I have imported the data and done the tokenisation and vectorisation using the bag-of-words model. Now I have to split the data into training and test sets, and I am getting an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]".
My file has a column called Review Text, and I want to classify the reviews as good or bad. I have attached the TSV file that I am using for this project. Please help me correct the error, and point out any change in approach I could make. I have attached the code here too.
My data file here
import numpy as np
import pandas as pd
import nltk
import matplotlib
dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
OK, the problem is that X and y must have the same number of samples.
If you want to use just 1,000 reviews, you can keep the same for loop, and then when selecting y you just do:
y = dataset.iloc[:1000, 6].values
Otherwise, if you want to use the whole dataset, you must edit the loop so it iterates over every row, as in the sketch below.
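A minimal sketch of the full-dataset version, reusing the imports and cv from your code (str() guards against missing values in Review Text):
corpus = []
ps = PorterStemmer()
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
    corpus.append(' '.join(review))

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values   # now X and y both have len(dataset) samples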
I've written a program that takes Twitter data containing tweets and labels (0 for neutral sentiment and 1 for negative sentiment) and predicts which category a tweet belongs to.
The program works well on the training and test sets. However, I'm having a problem applying the prediction function to a single string; I'm not sure how to do that.
I have tried cleaning the string the way I cleaned the dataset before calling the predict function, but the values returned are in the wrong shape.
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re
#Loading dataset
dataset = pd.read_csv('tweet.csv')
#List to hold cleaned tweets
clean_tweet = []
#Cleaning tweets
for i in range(len(dataset)):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = re.sub('#[\w]*', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    clean_tweet.append(tweet)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(clean_tweet)
X = X.toarray()
y = dataset.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.naive_bayes import GaussianNB
n_b = GaussianNB()
n_b.fit(X_train, y_train)
y_pred = n_b.predict(X_test)
some_tweet = "this is a mean tweet" # How to apply predict function to this string
Use cv.transform([cleaned_new_tweet]) on your new string to transform the new tweet so it matches your existing document-term matrix (same vocabulary, same columns). That will return the tweet in the correct shape.
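Putting that together with the code above, a short sketch that reuses your cv and n_b objects (the helper name is just for illustration):
def clean_one_tweet(text):
    # Same cleaning as the training loop, applied to a single string
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower().split()
    text = [ps.stem(token) for token in text if token not in set(stopwords.words('english'))]
    return ' '.join(text)

some_tweet = "this is a mean tweet"
features = cv.transform([clean_one_tweet(some_tweet)]).toarray()  # dense 2-D array with a single row
print(n_b.predict(features))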
tl;dr
.predict() expects a list of strings. So you need to add some_tweet to a list. E.g. new_tweet = ["this is a mean tweet"]
Your code
You had some issues in your code that I tried fixing for you...
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re
#Loading dataset
dataset = pd.read_csv('tweet.csv')
# Define cleaning function
# You can define it once as a function so it can be easily re-used else where
def clean_tweet(tweet: str):
    tweet = re.sub('[^a-zA-Z]', ' ', tweet)  # use the function argument, not dataset['tweet'][i]
    tweet = re.sub('#[\w]*', ' ', tweet)  # BUG in your original: you need to pass the tweet you just modified here, not the original tweet again
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    return tweet
#List to hold cleaned tweets and labels
X = [clean_tweet(tweet) for tweet in dataset['tweet']] # you can create your X directly with your new function
y = dataset.iloc[:, 1].values
# Define a single model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
# Use Pipeline as your classifier, this way you don't need to keep calling a transform and fit all the time.
classifier = Pipeline(
    [
        ('cv', CountVectorizer(max_features=300)),
        # GaussianNB needs a dense array, so convert CountVectorizer's sparse output
        ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
        ('n_b', GaussianNB())
    ]
)
# Before you trained your CountVectorizer BEFORE splitting into train/test. That is a biiig mistake.
# First you split to train/split and then you train all the steps of your model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Here you train all steps of your Pipeline in one go.
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# Predicting new tweets
some_tweet = "this is a mean tweet"
some_tweet = clean_tweet(some_tweet) # re-use your clean function
predicted = classifier.predict([some_tweet]) # put the tweet inside a list!!!!
I am trying to classify, and my features are a combination of words, numbers, and text. I am trying to vectorize the feature that is of type text, but when I run it through a classification algorithm it throws the following error.
line 51, in
classifier.fit(X_train, y_train.values.ravel())
ValueError: setting an array element with a sequence.
Below is my code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv('data.csv')
df = df[pd.notnull(df['memo'])]
df = df[pd.notnull(df['name'])]
# factorize type, name, and categorized account
df['type_id'] = df.txn_type.factorize()[0]
df['name_id'] = df.name.factorize()[0]
df['categorizedAccountId'] = df.categorizedAccount.factorize()[0]
my_list = df['categorizedAccountId'].tolist()
print(my_list)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
memoFeatures = tfidf.fit_transform(df.memo)
df['memo_id'] = pd.Series(memoFeatures, index=df.index)
X = df.loc[:, ['type_id', 'name_id', 'memo_id']]
y = df.loc[:, ['categorizedAccountId']]
X_train, X_test, y_train, y_test = train_test_split(X, y)
'''print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
'''
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
confusion_matrix = confusion_matrix(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))
And here are a few rows of my data. The top row has the column names, and categorizedAccount is the class:
"txn_type","name","memo","account","amount","categorizedAccount"
"Journal","","ABC.com 11/29/16 Payments",0,207.24,"1072 ABC.com Money Out Clearing"
"Bill Payment","College Tuition Fund","Multiple inv. (details on stub)",164,-207.24,"1072 ABC.com Money Out Clearing"
OK, so I have implemented some modifications to your code, which I paste here. This snippet goes immediately after you read the CSV and drop the null rows. You will have to implement the train_test_split yourself, though.
df['categorizedAccount'] = df['categorizedAccount'].astype('category')
df['all_text'] = df['txn_type'] + ' ' + df['name'] + ' ' + df['memo']
X = df['all_text']
y = df['categorizedAccount']
X_train = X # Change these four lines for train_test_split
X_test = X # I don't have enough rows in the mock dataset to implement it,
y_train = y # And it returns an error
y_test = y
tfidf = TfidfVectorizer()
X_train_transformed = tfidf.fit_transform(X_train)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_transformed, y_train)
X_test_transformed = tfidf.transform(X_test)
y_pred = classifier.predict(X_test_transformed)
classifier.score(X_test_transformed, y_test)  # score against the true labels, not the predictions
A few comments though:
from sklearn.feature_extraction.text import TfidfVectorizer
Imported once, ok
from io import StringIO
Unnecessary as far as I can see
from sklearn.feature_extraction.text import TfidfVectorizer
Why do you import it again?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
TfidfVectorizer does the job of both CountVectorizer and TfidfTransformer. From sklearn: "Equivalent to CountVectorizer followed by TfidfTransformer." See here for more
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
Not used, do not import.
Additionally:
1) It is not clear what you are trying to do with factorize. TfidfVectorizer automatically performs tokenization for any string of text that you provide it. All columns that you have selected in your original code contain only strings, so it makes more sense to concatenate them and let tfidf do the tokenization, rather than trying to do it yourself.
2) Use the Pipeline constructor, it will save your life (see the sketch after this list).
3) X = df.loc[:, ['type_id', 'name_id', 'memo_id']] This type of slicing looks very bad; just call df[['column_name_1','column_name_2','column_name_3']]
4) And remember PEP20, "Simple is better than complex"!
As a last advice, when developing a ML model it's always better to start with something plain and simple, and then develop further once you have something that works.
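To illustrate point 2, here is a minimal Pipeline sketch for this kind of text classification. It reuses the all_text and categorizedAccount columns built in the snippet above; everything else is just an assumption to keep it self-contained:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df['all_text']
y = df['categorizedAccount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The pipeline fits the vectorizer and the classifier in one call,
# and applies the same (already fitted) vectorizer at predict time
clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(random_state=0)),
])
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))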