I am using scikit-learn to preprocess text data. My aim is to get a vector representation of my data (features and label). What I have done is vectorize the features using TF-IDF, but afterwards the dataset doubles in row count because X.ravel() is used:
X before (30376, 2)
X after (60752, 41331)
My problem is: when I have two features in the X vector, how can I get the vector representation correctly?
df = pd.read_csv('Dataset.csv',encoding='latin1')
df = df.dropna()
X = np.array(df.drop(['Type'], 1))
y = np.array(df['Type'])
#print(X)
print("Extracting features from the training data using a sparse vectorizer")
vectorizer= TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')
X = vectorizer.fit_transform(X.ravel().astype('U'))
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X = imp.fit_transform(X)
X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Also, when I use train_test_split I get this error, and I don't understand what it is referring to:
TypeError: Singleton array array(TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=0.5, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
stop_words='english', strip_accents=None, sublinear_tf=True,
token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None), dtype=object) cannot be considered a valid collection.
Any suggestions? Thanks.
Either X or y has a wrong shape.
Here is an excerpt from validation.py, which is used to validate the passed data sets:
if hasattr(x, 'shape'):
    if len(x.shape) == 0:  # <----- NOTE !!!
        raise TypeError("Singleton array %r cannot be considered"
                        " a valid collection." % x)
    return x.shape[0]
else:
    return len(x)
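To keep one row per sample when X holds two text columns, one option is to vectorize each column separately and stack the results horizontally instead of ravel-ing them into one long column. A minimal sketch, assuming both feature columns are text (the column names text_a and text_b are hypothetical, not from the question):

import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv('Dataset.csv', encoding='latin1').dropna()
y = df['Type'].values

# One vectorizer per text column; hstack keeps the row count at n_samples
# instead of doubling it the way ravel() does.
vec_a = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
vec_b = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = hstack([vec_a.fit_transform(df['text_a'].astype(str)),
            vec_b.fit_transform(df['text_b'].astype(str))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)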
Related
This is code from a Jupyter notebook using Python and the sklearn precision_score function. I want to get the score using x_test & y_test, but got this error:
"Classification metrics can't handle a mix of continuous-multioutput and binary targets"
Code:
diabetes = pd.read_csv("datasets/diabetes.csv")
x = diabetes.drop(diabetes.columns[-1], axis=1)
y = diabetes.iloc[:,-1]
scaler = StandardScaler()
scaler.fit_transform(x)
svc = SVC()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3,
random_state = RANDOM_STATE)
svc.fit(x_train,y_train)
precision_score_svc = precision_score(x_test, y_test)
If you check the help page:
sklearn.metrics.precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
[...]
Parameters
y_true 1d array-like, or label indicator array / sparse matrix
Ground truth (correct) target values.
y_pred 1d array-like, or label indicator array / sparse matrix
Estimated targets as returned by a classifier.
This means you need to provide the true labels followed by the predicted labels. It should be:
precision_score_svc = precision_score(y_test,svc.predict(x_test))
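As a side note, the result of scaler.fit_transform(x) in the question's code is discarded, so the SVC is trained on unscaled data. A sketch of the corrected flow under the question's setup (RANDOM_STATE is given a value here only to make the snippet self-contained):

import pandas as pd
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

RANDOM_STATE = 42  # hypothetical; the question does not show its value

diabetes = pd.read_csv("datasets/diabetes.csv")
x = diabetes.drop(diabetes.columns[-1], axis=1)
y = diabetes.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
                                                    random_state=RANDOM_STATE)

# Fit the scaler on the training split only and keep the returned arrays.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

svc = SVC()
svc.fit(x_train, y_train)

# True labels first, predicted labels second.
precision_score_svc = precision_score(y_test, svc.predict(x_test))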
I'm trying to build a Logistic Regression model which can predict a new instance's class.
Here is what I've done:
path = 'diabetes.csv'
df = pd.read_csv(path, header = None)
print "Classifying with Logistic Regression"
values = df.values
X = values[1:,0:8]
y = values[1:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
model=LogisticRegression()
model.fit(X_train,y_train)
X_test = []
X_test.append(int(pregnancies_info))
X_test.append(int(glucose_info))
X_test.append(int(blood_press_info))
X_test.append(int(skin_thickness_info))
X_test.append(int(insulin_info))
X_test.append(float(BMI_info))
X_test.append(float(dpf_info))
X_test.append(int(age_info))
#X_test = np.array(X_test).reshape(-1, 1)
print X_test
y_pred=model.predict(X_test)
if y_pred == 0:
    Label(login_screen, text="Healthy").pack()
if y_pred == 1:
    Label(login_screen, text="Diabetes Mellitus").pack()
pregnancies_entry.delete(0, END)
glucose_entry.delete(0, END)
blood_press_entry.delete(0, END)
skin_thickness_entry.delete(0, END)
insulin_entry.delete(0, END)
BMI_entry.delete(0, END)
dpf_entry.delete(0, END)
age_entry.delete(0, END)
But I got this error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
If I uncomment the line X_test = np.array(X_test).reshape(-1, 1), this error appears:
File "/anaconda2/lib/python2.7/site-packages/sklearn/linear_model/base.py", line 305, in decision_function
% (X.shape[1], n_features))
ValueError: X has 1 features per sample; expecting 8
You have to give it as
X_test = np.array(X_test).reshape(1, -1)
or you can directly do
y_pred = model.predict([X_test])
The reason is that the predict function expects a 2D array with dimensions (n_samples, n_features). When you have only one record for which you need a prediction, create a list of lists and feed it in. Hope it helps.
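For illustration, a quick shape check (the eight feature values are made up):

import numpy as np

features = [1, 85, 66, 29, 0, 26.6, 0.351, 31]  # hypothetical patient record

row = np.array(features).reshape(1, -1)
print(row.shape)  # (1, 8): one sample with eight features -- what predict expects

col = np.array(features).reshape(-1, 1)
print(col.shape)  # (8, 1): eight samples with one feature each -- hence the
                  # error "X has 1 features per sample; expecting 8"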
I am trying to train a linear regression model using sklearn to predict the likes of given tweets. I have the following features/attributes:
['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]
I use TfidfVectorizer for extracting keywords. The problem is that, depending on the size of the training data, the number of keywords differs, and therefore the number of independent attributes differs. Because of this there is a mismatch of attributes between training and testing data, and I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
It works fine when I split the same data into train and test and predict with test as below.
Program for training and prediction
def train_favourite_prediction(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # now you can save it to a file
    joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    return None
def predict_favourites(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    # and later you can load it
    regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
    print(coeff_df)
    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print("the large training just finished")
    return None
Code for fitting the vectorizer
Have a look at Applying Tfidfvectorizer on list of pos tags gives ValueError to understand the format of my 'text' column.
def ready_for_training(dataset):
    dataset = dataset.head(1000)
    dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
    dataset['text'] = dataset['text'].apply(
        lambda row: [item for sublist in row for item in sublist])
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    keyword_response = tfidf.fit_transform(dataset['text'])
    keyword_matrix = pd.DataFrame(keyword_response.todense(), columns=tfidf.get_feature_names())
    keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]
    dataset['sentiments'] = dataset['sentiments'].map(eval)
    dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['neg', 'neu', 'pos'], axis=1)
    dataset['emotions'] = dataset['emotions'].map(eval)
    dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['id', 'month', 'text'], axis=1)
    result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)
    return result
What I need is to predict 'favourite_count' when a single new tweet is given. When I extract the keywords for this tweet I get only a few, while I trained with 1000+ keywords. I have stored the trained knowledge in a .pkl file. How should I handle this mismatch of attributes? To fill the missing columns of the testing tweet, as in Keep same dummy variable in training and testing data, I would need the training set as a dataframe; but I have only stored the trained knowledge as a .pkl and cannot access the columns in the trained knowledge.
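One common way to keep the feature space fixed is to persist the fitted TfidfVectorizer next to the regressor and, at prediction time, call transform rather than fit_transform, so every new tweet is mapped onto exactly the training-time keyword columns. A sketch under that assumption (file names are hypothetical; tfidf and regressor are the objects from the code above):

import joblib

# At training time: save the fitted vectorizer together with the model.
joblib.dump(tfidf, 'knowledge_vectorizer.pkl')
joblib.dump(regressor, 'knowledge_favourite.pkl')

# At prediction time: reload both, and only transform the new tweet.
tfidf = joblib.load('knowledge_vectorizer.pkl')
regressor = joblib.load('knowledge_favourite.pkl')
keyword_matrix_new = tfidf.transform(new_tweet_tokens)  # new_tweet_tokens: hypothetical
# Append the non-keyword columns (sentiment, emotions, ...) in the same
# order as during training before calling regressor.predict.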
If I have this input:
"a1,b1,c1,d1;A1,B1,C1,D1;α1,β1,γ1,θ1;Label1"
"... ... "
"an,bn,cn,dn;An,Bn,Cn,Dn;αn,βn,γn,θn;Labelx"
Array expression:
[
[[a1,b1,c1,d1],[A1,B1,C1,D1],[α1,β1,γ1,θ1],[Label1]],
... ... ... ...
[[an,bn,cn,dn],[An,Bn,Cn,Dn],[αn,βn,γn,θn],[Labelx]]
]
Instance:
[... ... ... ...
[[58.32,453.65,980.50,540.23],[774.40,428.79,1101.96,719.79],[503.70,624.76,1128.00,1064.26],[1]],
[[0,0,0,0],[871.05,478.17,1109.37,698.36],[868.63,647.56,1189.92,1040.80],[1]],
[[169.34,43.41,324.46,187.96],[50.24,37.84,342.39,515.21],[0,0,0,0],[0]]]
Like this:
There are 3 rectangles, and the label means intersect, contain, or something else.
I want to use 3 or N features to train a model with an SVM.
I have just learned the "python Iris SVM" code. What should I do?
Here is my try:
from sklearn import svm
import numpy as np
import matplotlib as mpl
from sklearn.model_selection import train_test_split

def label_type(s):
    it = {b'Situation_1': 0, b'Situation_2': 1, b'Unknown': 2}
    return it[s]

path = 'C:/Users/SEARECLUSE/Desktop/MNIST_DATASET/temp_test.data'
data = np.loadtxt(path, dtype=list, delimiter=';', converters={3: label_type})

x, y = np.split((data), (3,), axis=1)
x = x[:, :3]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1,
                                                    train_size=0.6)
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())
Report Error:
Line: clf.fit(x_train, y_train.ravel())
ValueError: could not convert string to float:
If I try to convert the data:
x, y = np.split(float(data), (3,), axis=1)
Report Error:
Line: x, y = np.split(float(data), (3,), axis=1)
TypeError: only length-1 arrays can be converted to Python scalars
SVMs were not initially designed to handle multidimensional data. I suggest you flatten your input features:
x, y = np.split((data), (3,), axis=1)
x = x[:, :3]
# flatten the features
x = np.reshape(x,(len(x),-1))
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1,
train_size=0.6)
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())
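To see what the reshape does, assuming each sample originally holds 3 rectangles of 4 coordinates each:

import numpy as np

x = np.zeros((100, 3, 4))        # hypothetical: 100 samples, 3 rectangles, 4 coords
x = np.reshape(x, (len(x), -1))
print(x.shape)                   # (100, 12): one flat feature row per sample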
I have a few questions before I go for an answer:
Q1. What kind of data are you using to train the SVM model? Is it image data? If so, is it RGB data? The way you explained your data, it seems you intend to do image classification using an SVM. Correct me if I am wrong.
Assumption
Let's say you have image data. Then convert it to grayscale, and convert the entire dataset into a numpy array (check the numpy module for how to do that), as in the sketch below.
Once your data is a numpy array, you can apply your model.
Let me know if that helps.
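A minimal sketch of that conversion, assuming Pillow is available, all images share the same dimensions, and image_paths is a hypothetical list of file names:

import numpy as np
from PIL import Image

image_paths = ['img_0.png', 'img_1.png']  # hypothetical

# Convert each image to grayscale ('L' mode) and flatten it into one feature row.
X = np.array([np.asarray(Image.open(p).convert('L')).ravel()
              for p in image_paths])
print(X.shape)  # (n_images, width * height)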
I'm performing a multi-class classification task using scikit-learn. In the setup I created, I want to compare different classification algorithms.
I use a pipeline where text is inserted as X and Y is the class (multi-class, N = 5). Textual features are extracted in the pipeline using TfidfVectorizer().
KNN does the job, but the other classifiers give this: ValueError: bad input shape (670, 5)
Full traceback:
"/Users/Robbert/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
The code I use:
def read_data(f):
    data = []
    for row in csv.reader(open(f), delimiter=';'):
        if row:
            plottext = row[8]
            target = {'Age': row[4]}
            data.append((plottext, target))
    (X, Ycat) = zip(*data)
    Y = DictVectorizer().fit_transform(Ycat)
    Y = preprocessing.LabelBinarizer().fit_transform(Y)
    return (X, Y)

X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
###KNN Pipeline
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', KNeighborsClassifier(n_neighbors=350, weights='uniform'))])
###Logistic regression Pipeline
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', LogisticRegression())])
##SVC
train_pipeline = Pipeline([
('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
('clf', SVC(C=1, kernel='rbf', gamma=0.001, probability=True))])
##Decision tree
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', DecisionTreeClassifier(random_state=0))])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
How is it possible that KNN accepts this array shape while the other classifiers don't? And how can I change this shape?
If you compare the documentation for the fit(X, y) function in KNeighborsClassifier and SVC, you will see that only the former accepts y in the form [n_samples, n_outputs].
Possible solution: why do you need the LabelBinarizer at all? Just do not use it.
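A sketch of read_data without the binarization step, so Y stays a flat sequence of class labels, which is what SVC, LogisticRegression, and DecisionTreeClassifier expect:

import csv

def read_data(f):
    data = []
    for row in csv.reader(open(f), delimiter=';'):
        if row:
            data.append((row[8], row[4]))  # (plot text, age class)
    (X, Y) = zip(*data)
    return (X, Y)  # Y has shape (n_samples,)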
If your Y vector is of size (n_samples, n_classes) and contains at least a single row with more than one non-zero element, then you are solving a multi-label classification problem. If that is the case, the multiclass and multilabel algorithms page in the scikit-learn docs lists KNN as one of the classifiers that supports multi-label classification. You might want to try out other classifiers from that list (a usage sketch follows the list):
* sklearn.tree.DecisionTreeClassifier
* sklearn.tree.ExtraTreeClassifier
* sklearn.ensemble.ExtraTreesClassifier
* sklearn.neural_network.MLPClassifier
* sklearn.neighbors.RadiusNeighborsClassifier
* sklearn.ensemble.RandomForestClassifier
* sklearn.linear_model.RidgeClassifierCV
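For instance, a minimal sketch that keeps the binarized (n_samples, 5) Y and swaps the SVC for one of the listed classifiers (X_train, Y_train, and X_test as produced by the question's code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

train_pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
    ('clf', RandomForestClassifier(random_state=0))])

train_pipeline.fit(X_train, Y_train)  # accepts Y of shape (n_samples, n_classes)
predicted = train_pipeline.predict(X_test)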