I have a CSV file that looks like this:
lemma,trained
iran seizes bitcoin mining machines power spike,-1
... (goes on for 1054 lines)
And my code looks like:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('lemma copy.csv')
X = df.iloc[:, 0].values
y = df.iloc[:, 1].values
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
I am getting the error
Traceback (most recent call last):
File "/home/arctesian/Scripts/School/EE/Algos/Qual/bayes/sklean.py", line 20, in <module>
X_train = sc_X.fit_transform(X_train)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 867, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 809, in fit
return self.partial_fit(X, y, sample_weight)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 844, in partial_fit
X = self._validate_data(
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 577, in _validate_data
X = check_array(X, input_name="X", **check_params)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/utils/validation.py", line 856, in check_array
array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'twitter ios beta lays groundwork bitcoin tips'
Printing y shows that the random split puts that line first, so the problem must be in how the string data is converted to numbers. How do I fix this?
Sometimes searching for the right question on Stack Overflow (or the internet as a whole) is difficult. The reason you're having trouble finding an answer is that your question is really about NLP, since your CSV contains lemmas.
You'll have to preprocess your data in some way, such as by using word vectors. A word-vector model is trained on a large corpus of text so that each word can be represented by an N-length vector; I'm greatly simplifying this, of course.
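As a rough sketch of the word-vector idea (mine, not from the original answer; it assumes gensim and its downloadable 50-dimensional GloVe model), one common trick is to average the vectors of the words in each headline:
import numpy as np
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')  # maps each known word to a 50-dim vector

def sentence_vector(sentence):
    # average the vectors of the words the model knows; zeros if none match
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

x = sentence_vector('twitter ios beta lays groundwork bitcoin tips')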
Another strategy is the bag-of-words approach: count how often each word appears in your corpus, and train your models on those counts rather than on the original strings. Here's a very small example using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I like cats", "meow", "Espeon is a cool Pokemon",
          "my friend has lots of pet fish",
          "my pet cat wants to eat my friend's fish", "spams spam",
          "not spam", "someone please hire me for a job", "nlp is cool",
          "this corpus isn't actually large enough to use counter vectorizer well"]

count_vec = CountVectorizer(ngram_range=(1, 3), stop_words="english").fit(corpus)
corpus_cv = count_vec.transform(corpus)
I skipped steps to keep the code concise, but the above is the gist of using CountVectorizer.
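As a quick sanity check (my addition; get_feature_names_out assumes scikit-learn 1.0 or newer), you can inspect what the vectorizer learned:
print(corpus_cv.shape)                        # (n_documents, n_features)
print(count_vec.get_feature_names_out()[:5])  # first few learned n-grams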
So I fixed it by using Joshua Megauth's method and getting rid of pandas. I did this:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from coalas import csvReader as c
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# df = pd.read_csv('lemma copy.csv')
def vect(X):
    features = vectorizer.fit_transform(X)
    features_nd = features.toarray()
    return features_nd

def test():
    y_pred = classifier.predict(X_test)
    print(accuracy_score(y_pred, y_test))

if __name__ == "__main__":
    c.importCSV('lemma copy.csv')
    vectorizer = CountVectorizer(
        analyzer='word',
        lowercase=False,
    )
    X = c.lemma
    # y = c.Best
    y = c.trained
    features_nd = vect(X)
    X_train, X_test, y_train, y_test = train_test_split(features_nd, y, test_size=0.2, random_state=0)
    sc_X = StandardScaler()
    # print(X_train)
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)  # transform only, so the test set reuses the training fit
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    test()
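For what it's worth (an aside of mine, not part of the original fix): StandardScaler plus GaussianNB is an unusual pairing for word counts; MultinomialNB is designed for count features and needs no scaling:
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(features_nd, y, test_size=0.2, random_state=0)
clf = MultinomialNB().fit(X_train, y_train)         # raw counts, no scaler
print(accuracy_score(clf.predict(X_test), y_test))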
I'm trying to extract features from a CSV text data file with two columns, "Label" and "text_stemmed". A few days ago the project was running fine and showing output, but now there is an error and I haven't been able to find a solution. I'm a beginner in Python; any help would be appreciated.
My code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
import random
df= pd.read_csv('updated1.csv', encoding='UTF-8')
df.head()
df.loc[df["Label"]=='Acquittal',"Label",]=0
df.loc[df["Label"]=='Convictal',"Label",]=1
df_x=df["text_stemmed"]
df_y=df["Label"]
cv = TfidfVectorizer(min_df=1,stop_words='english')
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
cv = TfidfVectorizer(min_df=1,stop_words='english')
x_traincv = cv.fit_transform(["Hi How are you How are you doing","Hi what's up","Wow that's awesome"])
cv1 = TfidfVectorizer(min_df=1, ngram_range=(1,1), stop_words='english')
x_traincv=cv1.fit_transform(x_train)
cv1 = TfidfVectorizer(min_df=1,stop_words='english', ngram_range = ('1,1'))
a=x_traincv.toarray()
a
cv1.inverse_transform(a)
The Error:
NotFittedError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5228/2571295384.py in <module>
----> 1 cv1.inverse_transform(a)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in inverse_transform(self, X)
1270 List of arrays of terms.
1271 """
-> 1272 self._check_vocabulary()
1273 # We need CSR format for fast row manipulations.
1274 X = check_array(X, accept_sparse='csr')
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _check_vocabulary(self)
470 self._validate_vocabulary()
471 if not self.fixed_vocabulary_:
--> 472 raise NotFittedError("Vocabulary not fitted or provided")
473
474 if len(self.vocabulary_) == 0:
NotFittedError: Vocabulary not fitted or provided
You're re-creating the TfidfVectorizer here:
cv1 = TfidfVectorizer(min_df=1, ngram_range=(1,1), stop_words='english')
x_traincv=cv1.fit_transform(x_train)
cv1 = TfidfVectorizer(min_df=1,stop_words='english', ngram_range = ('1,1'))
The second version of it is written to cv1, and that one is never fitted, so it has no vocabulary.
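A minimal fix (my sketch, keeping your variable names) is to drop the second assignment, so the vectorizer you call inverse_transform on is the one that was fitted:
cv1 = TfidfVectorizer(min_df=1, ngram_range=(1, 1), stop_words='english')
x_traincv = cv1.fit_transform(x_train)  # cv1 is fitted here

a = x_traincv.toarray()
cv1.inverse_transform(a)  # works now: the fitted vocabulary is available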
What is the data you would like to train the model on?
My aim is to predict an array of five or six numbers, based on CSV data with six columns. The script below is supposed to predict only one number from an array of 5. I assumed I could work my way up to the full 5 or 6 from there, but I might be wrong about that.
MRE:
import csv
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('subdata.csv')
ft = [9,8,15,4,6]
fintest = np.array(ft)
def train():
    df.astype(np.float64)
    df.drop(['One'], axis=1)
    X = df
    y = X['One']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(X_train)
    test_scaled = scaler.transform(X_test)
    tree_model = DecisionTreeRegressor()
    rf_model = RandomForestRegressor()
    tree_model.fit(train_scaled, y_train)
    rf_model.fit(train_scaled, y_train)
    rfp = rf_model.predict(fintest.reshape(1, -1))
    tmp = tree_model.predict(fintest.reshape(1, -1))
    print(rfp)
    print(tmp)

train()
Could you please clarify what I am asking this script to predict in the final rfp and tmp lines?
My data is a CSV with six numeric columns, 'Zero' through 'Five' (screenshot omitted).
The script as-is currently gives an error:
Traceback (most recent call last):
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 43, in <module>
train()
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 37, in train
rfp = rf_model.predict(fintest.reshape(1, -1))
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 784, in predict
X = self._validate_X_predict(X)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py", line 402, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 437, in _validate_data
self._check_n_features(X, reset=reset)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 365, in _check_n_features
raise ValueError(
ValueError: X has 5 features, but DecisionTreeRegressor is expecting 6 features as input.
By adding a sixth digit to the ft array I can get around this error, but I receive wildly inaccurate outputs that appear to have no correlation with the data whatsoever. For example, setting ft to [9,8,15,4,6,2] (the first row in the CSV file) and setting X and y to use the 'Four' label, I get an output of [37.22] and [37.].
My other questions will probably be answered by my first. But here they are:
Could you also please clarify why I need to pass an array of 6?
And why are my predictions so close together (all ~35), no matter what array I pass for the prediction?
The way you defined your X is wrong: it contains all 6 features.
Your y is contained in your X the way you defined it:
X = df        # 6 features
y = X['One']  # 1 feature
I think what you wanted to do was something like this:
X = df[['Two', 'Three', 'Four', 'Five', 'Zero']]
y = df['One']
It depends on your data; from what I can see, your data is an example without context, and you are trying to predict the 'One' column using two different models, which doesn't make much sense to me.
The error arises because X is the whole dataframe (df.drop is not assigned back, so column 'One' is still in it): the models are trained on six features, while fintest supplies only five.
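For completeness, here is a sketch of a corrected train() (my rewrite, assuming the six columns are named 'Zero' through 'Five' as above):
def train():
    X = df.drop(columns=['One'])  # drop returns a new frame; keep the result
    y = df['One']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(X_train)
    rf_model = RandomForestRegressor()
    rf_model.fit(train_scaled, y_train)
    # scale the query row exactly like the training data before predicting
    rfp = rf_model.predict(scaler.transform(fintest.reshape(1, -1)))
    print(rfp)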
I am trying to classify, and my features are a combination of words, numbers, and text. I am trying to vectorize the feature that is of type text, but when I run it through a classifying algorithm it throws the following error:
line 51, in
    classifier.fit(X_train, y_train.values.ravel())
ValueError: setting an array element with a sequence.
Below is my code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv('data.csv')
df = df[pd.notnull(df['memo'])]
df = df[pd.notnull(df['name'])]
# factorize type, name, and categorized account
df['type_id'] = df.txn_type.factorize()[0]
df['name_id'] = df.name.factorize()[0]
df['categorizedAccountId'] = df.categorizedAccount.factorize()[0]
my_list = df['categorizedAccountId'].tolist()
print(my_list)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
memoFeatures = tfidf.fit_transform(df.memo)
df['memo_id'] = pd.Series(memoFeatures, index=df.index)
X = df.loc[:, ['type_id', 'name_id', 'memo_id']]
y = df.loc[:, ['categorizedAccountId']]
X_train, X_test, y_train, y_test = train_test_split(X, y)
'''print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
'''
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
confusion_matrix = confusion_matrix(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))
And here are a few rows of my data. The top row has the column labels, and categorizedAccount is the class:
"txn_type","name","memo","account","amount","categorizedAccount"
"Journal","","ABC.com 11/29/16 Payments",0,207.24,"1072 ABC.com Money Out Clearing"
"Bill Payment","College Tuition Fund","Multiple inv. (details on stub)",164,-207.24,"1072 ABC.com Money Out Clearing"
OK, so I have implemented some modifications to your code, which I paste here. This snippet goes immediately after you read the CSV and drop the null rows. You have to implement the train_test_split yourself, though.
df['categorizedAccount'] = df['categorizedAccount'].astype('category')
df['all_text'] = df['txn_type'] + ' ' + df['name'] + ' ' + df['memo']
X = df['all_text']
y = df['categorizedAccount']
X_train = X # Change these four lines for train_test_split
X_test = X # I don't have enough rows in the mock dataset to implement it,
y_train = y # And it returns an error
y_test = y
tfidf = TfidfVectorizer()
X_train_transformed = tfidf.fit_transform(X_train)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_transformed, y_train)
X_test_transformed = tfidf.transform(X_test)
y_pred = classifier.predict(X_test_transformed)
classifier.score(X_test_transformed, y_test)  # score against the true labels, not the predictions
A few comments though:
from sklearn.feature_extraction.text import TfidfVectorizer
Imported once, ok
from io import StringIO
Unnecessary as far as I can see
from sklearn.feature_extraction.text import TfidfVectorizer
Why do you import it again?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
TfidfVectorizer does the job of both CountVectorizer and TfidfTransformer. From the sklearn docs: "Equivalent to CountVectorizer followed by TfidfTransformer." See the scikit-learn documentation for more.
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
Not used, do not import.
Additionally:
1) It is not clear what you are trying to do with factorize. TfidfVectorizer automatically performs tokenization for any string of text that you provide it. All columns that you have selected in your original code contain only strings, so it makes more sense to concatenate them and let tfidf do the tokenization, rather than trying to do it yourself.
2) Use the Pipeline constructor; it will save your life (see the sketch after this list).
3) X = df.loc[:, ['type_id', 'name_id', 'memo_id']]: this kind of slicing looks very bad; just call df[['column_name_1','column_name_2','column_name_3']].
4) And remember PEP20, "Simple is better than complex"!
As a last advice, when developing a ML model it's always better to start with something plain and simple, and then develop further once you have something that works.
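To illustrate point 2, a minimal sketch of the same model as a Pipeline (my addition; the step names are arbitrary):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(random_state=0)),
])
pipe.fit(X_train, y_train)         # raw text goes in; no manual transform step
print(pipe.score(X_test, y_test))  # the fitted vectorizer is reused on the test set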
# -*- coding: utf-8 -*-
"""
Created on Sun Jun 3 01:36:10 2018
#author: Sharad
"""
import numpy as np
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
dbfile=open("D:/df_train_api.pk", 'rb')
df=pickle.load(dbfile)
y=df[['label']]
features=['groups']
X=df[features].copy()
X.columns
y.columns
# for splitting into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
# for vectorizing
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
The problem lies in the vectorization: it gives me an X_train_counts of size [1, 1], and I don't know why. That's why MultinomialNB can't fit; X has 1 sample while y_train has 3185.
I'm new to machine learning. Any help would be much appreciated.
traceback:
Traceback (most recent call last):
File "<ipython-input-52-5b5949203f76>", line 1, in <module>
runfile('C:/Users/Sharad/.spyder-py3/hypothizer.py', wdir='C:/Users/Sharad/.spyder-py3')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Sharad/.spyder-py3/hypothizer.py", line 37, in <module>
clf = MultinomialNB().fit(X_train_tfidf, y_train)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 579, in fit
X, y = check_X_y(X, y, 'csr')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3185]
CountVectorizer (and by inheritence, TfidfTransformer and TfidfVectorizer) expects an iterable of raw documents in fit() and fit_transform():
raw_documents : iterable
An iterable which yields either str, unicode or file objects.
So internally it will do this:
for doc in raw_documents:
    do_processing(doc)
When you pass a pandas DataFrame object to it, only the column names are yielded by the for ... in X loop, and hence only a single document is processed (instead of the data inside that column).
You need to do this:
X = df[features].values.ravel()
Or else do this:
X=df['groups'].copy()
There is a difference between the code above and the code you are using. You are doing this:
X=df[features].copy()
Here features is already a list of columns. So essentially this becomes:
X=df[['groups']].copy()
The difference is the double brackets here (which return a DataFrame) versus the single brackets in my code (which return a Series).
for value in X works as expected when X is a Series, but only yields column names when X is a DataFrame.
Hope this is clear.
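A tiny demonstration of that difference (my example, with a made-up frame):
import pandas as pd

df = pd.DataFrame({'groups': ['some text', 'more text']})
print(list(df['groups']))    # ['some text', 'more text'] -- the documents
print(list(df[['groups']]))  # ['groups'] -- just the column name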
I'm trying to do text classification on a large corpus (732,066 tweets) in Python.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
#dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
# Importing the dataset
cols = ["text","geocoordinates0","geocoordinates1","grid"]
dataset = pd.read_csv('tweets.tsv', delimiter = '\t', usecols=cols, quoting = 3, error_bad_lines=False, low_memory=False)
# Removing Non-ASCII characters
def remove_non_ascii_1(dataset):
    return ''.join([i if ord(i) < 128 else ' ' for i in dataset])
# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 732066):
    review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
This is the error I get; this is where I'm stuck, unable to proceed with the rest of the machine-learning text classification:
Traceback (most recent call last):
File "<ipython-input-2-3fac33122b74>", line 2, in <module>
review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])
File "C:\Anaconda3\envs\py35\lib\re.py", line 182, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
Thanks in advance for the help.
Try
str(dataset.loc[dataset.index[i], 'text'])
That'll convert it to a str object from whatever type it was before.
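For context (my note, not from the original answer): this TypeError usually appears because some rows of the 'text' column are NaN, which pandas stores as floats, and re.sub rejects non-strings. A sketch of a one-time fix before the loop:
dataset['text'] = dataset['text'].fillna('').astype(str)  # NaN becomes '', everything is str
review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])     # now safe inside the loop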