Text Classification Using Python

I have a list of texts in a text variable, with their labels. I would like to build a classifier that can predict the label of new input text.
I am thinking of using the scikit-learn package in Python with an SVM model.
I realize that the text needs to be converted to vector form, so I am trying TfidfVectorizer and CountVectorizer.
This is my code so far using TfidfVectorizer:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
label = ['organisasi','organisasi','organisasi','organisasi','organisasi','lokasi','lokasi','lokasi','lokasi','lokasi']
text = ['Partai Anamat Nasional','Persatuan Sepak Bola', 'Himpunan Mahasiswa','Organisasi Sosial','Masyarakat Peduli','Malioboro','Candi Borobudur','Taman Pintar','Museum Sejarah','Monumen Mandala']
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(text)
y = label
klasifikasi = svm.SVC()
klasifikasi = klasifikasi.fit(X,y) #training
test_text = ['Partai Perjuangan']
test_vector = vectorizer.fit_transform(test_text)
prediksi = klasifikasi.predict([test_vector]) #test
print(prediksi)
I also tried CountVectorizer with the same code above.
Both show the same error:
ValueError: setting an array element with a sequence.
How can I solve this problem? Thanks

The error is due to this line:
prediksi = klasifikasi.predict([test_vector])
Most scikit-learn estimators require an array of shape [n_samples, n_features]. The test_vector output from TfidfVectorizer is already in that shape, ready to use with estimators. You don't need to wrap it in square brackets ([ and ]); the wrapping turns it into a list, which is unsuitable.
Try using it like this:
prediksi = klasifikasi.predict(test_vector)
But even then you will get an error, because of this line:
test_vector = vectorizer.fit_transform(test_text)
Here you are fitting the vectorizer differently from the way it was fitted when training the klasifikasi estimator. fit_transform() is just a shortcut for calling fit() (learning the vocabulary) and then transform(). For test data, always use the transform() method, never fit() or fit_transform().
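For reference, the equivalence in two lines (using the vectorizer and text from the question):
# vectorizer.fit_transform(text) is equivalent to:
vectorizer.fit(text)            # learn the vocabulary and idf weights
X = vectorizer.transform(text)  # map the documents onto that vocabulary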
So the correct code is:
test_vector = vectorizer.transform(test_text)
prediksi = klasifikasi.predict(test_vector)
#Output: array(['organisasi'], dtype='|S10')
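Putting both fixes together, a minimal end-to-end sketch of the corrected script (same data as above):
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
label = ['organisasi'] * 5 + ['lokasi'] * 5
text = ['Partai Anamat Nasional', 'Persatuan Sepak Bola', 'Himpunan Mahasiswa',
        'Organisasi Sosial', 'Masyarakat Peduli', 'Malioboro', 'Candi Borobudur',
        'Taman Pintar', 'Museum Sejarah', 'Monumen Mandala']
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(text)             # fit on the training text only
klasifikasi = svm.SVC()
klasifikasi.fit(X, label)                      # training
test_text = ['Partai Perjuangan']
test_vector = vectorizer.transform(test_text)  # transform only, no refitting
prediksi = klasifikasi.predict(test_vector)    # no extra brackets
print(prediksi)                                # should print ['organisasi']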

Related

How to get string as Y output using Linear regression Python

I have this rating prediction model using linear regression:
status = pd.DataFrame({'rating': [10.5,20.30,30.12,40.24,50.55,60.6,70.2], 'B': ['Bad','Not bad','Good','I like it','Very good','The best','Deserve an oscar']})
x = status.iloc[:,:-1].values
y = status.iloc[:,-1].values
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=0.4,random_state=0)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x,y)
input = 40.24
lr.predict([[input]])
So with 40.24 as my input X value, I was expecting 'I like it' as the output, but it throws an error instead because the expected output is a string. Here's the error: ValueError: could not convert string to float: 'Bad'. How do I make it capable of having a string as output?
Hi, that's because scikit-learn (machine learning models in general) requires numbers as labels, so the strings have to be encoded first. I am not sure what the classes are in this case, but you can encode them with scikit-learn's LabelEncoder (OneHotEncoder is meant for input features; LogisticRegression expects a 1-D vector of class labels).
Also, do change it to logistic regression, since this is really a classification problem.
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
# 1. INSTANTIATE
enc = LabelEncoder()
# 2. FIT AND TRANSFORM: maps each string label to an integer class
encoded_labels = enc.fit_transform(y)
encoded_labels.shape
clf = LogisticRegression(random_state=0).fit(x, encoded_labels)
Or you can just manually map it out, whichever way you prefer (e.g. Bad -> 0, Good -> 1).
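To get a string back out at prediction time, invert the encoding; a minimal sketch using the enc and clf from above:
pred = clf.predict([[40.24]])       # integer class predicted by the model
print(enc.inverse_transform(pred))  # maps back to the string label, e.g. ['I like it']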
You cannot do a linear regression if your target feature has a categorical dtype.
The first rule of performing a linear regression is that the target feature must be continuous: the y = mx + c function only takes numbers as input, tests the function against numerical values, and predicts a numerical value.
That is why it gets trained but fails to predict.
You need to encode your target feature.
Please self-study these concepts.
Hope this helps.
Your labels are categorical, whereas regression labels should be continuous numerical values.
Consider treating this as a classification problem rather than a regression one.
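For what it's worth, scikit-learn classifiers accept string labels directly, so the classification version needs no manual encoding; a minimal sketch on the same data:
import pandas as pd
from sklearn.linear_model import LogisticRegression
status = pd.DataFrame({'rating': [10.5, 20.30, 30.12, 40.24, 50.55, 60.6, 70.2],
                       'B': ['Bad', 'Not bad', 'Good', 'I like it', 'Very good', 'The best', 'Deserve an oscar']})
x = status[['rating']].values
y = status['B'].values                # string labels are fine for classifiers
clf = LogisticRegression().fit(x, y)
print(clf.predict([[40.24]]))         # predicts one of the string labels directly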

Scikit learn: TypeError: float() argument must be a string or a number, not 'Bunch'

I want to apply SVM using the following approach, but apparently the Bunch type is not appropriate.
Usually, with Bunch (a dictionary-like object), the interesting attributes are 'data', the data to learn, and 'target', the classification labels, and you can access the .data and .target information accordingly. How can I make the code below work?
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.calibration import CalibratedClassifierCV
# Fetch the data below using scikit-learn, which stores it in a Bunch
newsgroups_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'), categories = cats)
newsgroups_test = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'), categories = cats)
vectorizer = TfidfVectorizer(stop_words='english') #new
vectors = vectorizer.fit_transform(newsgroups_train.data) #new
vectors_test = vectorizer.transform(newsgroups_test.data) #new
max_abs_scaler = preprocessing.MaxAbsScaler()
scaled_train_data = max_abs_scaler.fit_transform(vectors)#corrected
scaled_test_data = max_abs_scaler.transform(vectors_test)
clf=CalibratedClassifierCV(OneVsRestClassifier(SVC(C=1)))
clf.fit(scaled_train_data, train_labels)
predictions=clf.predict(scaled_test_data)
proba=clf.predict_proba(scaled_test_data)
In the clf.fit line, in the position of train_labels, I put vectorizer.vocabulary_.keys(), but it gives: ValueError: bad input shape (). What should I do to get the training labels and make it work?
You are trying to apply a numerical scaling operation to text data, which is logically incorrect. If you look at the official documentation of MaxAbsScaler, its function is to:
Scale each feature by its maximum absolute value
If you want to find the vectors of the text data, then you need to use something like CountVectorizer; see this example from the official documentation here.
Alternatively, you can try TfidfTransformer as well. Here is an example of using it with the newsgroups data.
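As for the train_labels question itself: the labels for fetch_20newsgroups live in the Bunch's .target attribute, so a minimal sketch of the fit call would be:
clf.fit(scaled_train_data, newsgroups_train.target)  # integer class labels from the Bunch
predictions = clf.predict(scaled_test_data)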

KNN query data dimension must match training data dimension

I'm working on a bag-of-words problem with a dataset that has two columns: summary and solution. I'm using KNN for it. The train dataset has 91 columns and the test dataset has 15 columns.
To generate the vectors, I'm using the following piece of code.
vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
print( vectorizer.fit_transform(dataset[0]).todense() )
print( vectorizer.vocabulary_ )
I trained it:
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
Now I'm testing it:
y_pred = classifier.predict(test_bow_set)
This is the error I get when I test it:
sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query()
**ValueError: query data dimension must match training data dimension**
I guess you are fitting the vectorizer on the test data again instead of using the transform function. Make sure you are doing the following:
test_bow_set = vectorizer.transform(test_dataset)
You are fitting the vectorizer again:
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
You need to keep the vectorizer from training (all the preprocessing elements, actually) and only call transform on the test data; fitting again will profoundly change the results:
test_bow_set = vectorizer.transform(test_dataset).todense()
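In other words, the dimension mismatch disappears once both matrices come from the same fitted vocabulary; a short sketch (train_texts, test_texts, and train_labels are illustrative names):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(train_texts)  # learns the vocabulary once
test_bow_set = vectorizer.transform(test_texts)        # same vocabulary, same dimension
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, train_labels)
y_pred = classifier.predict(test_bow_set)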

What are X_train and y_train?

I want to start developing an application using machine learning. I want to classify text as spam or not spam. I have 2 files, spam.txt and ham.txt, each containing thousands of sentences. Say I want to use a classifier such as LogisticRegression.
For example, as I saw on the Internet, to fit my model I need to do this:
`lr = LogisticRegression()
model = lr.fit(X_train, y_train)`
So here comes my question: what actually are X_train and y_train? How can I obtain them from my sentences? I searched on the Internet and did not understand; this is my last resort, as I am pretty new to this topic. Thank you!
According to the documentation (see here):
X corresponds to your float feature matrix of shape (n_samples, n_features) (aka. the design matrix of your training set)
y is the float target vector of shape (n_samples,) (the label vector). In your case, label 0 could correspond to a spam example, and 1 to a ham one
The question is now about how to get a float feature matrix from text data.
A common scheme is to use a tf-idf vectorisation (more on this here), which is available in sklearn.
The vectorisation can be chained with the logistic regression via the Pipeline API of sklearn.
This is roughly how the code would look:
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
# prepare string data
with open('spam.txt', 'r') as f:
    spam = f.readlines()
with open('ham.txt', 'r') as f:
    ham = f.readlines()
text_train = list(chain(spam, ham))
# prepare labels
labels_train = np.concatenate((np.zeros(len(spam)),np.ones(len(ham))))
# build pipeline
vectorizer = TfidfVectorizer()
regressor = LogisticRegression()
pipeline = Pipeline([('vectorizer', vectorizer), ('regressor', regressor)])
# fit pipeline
pipeline.fit(text_train, labels_train)
# test predict
test = ["Is this spam or ham?"]
pipeline.predict(test) # returns 0. (spam) or 1. (ham)

Feature selection with sklearn - ValueError: X has a different shape than during fitting

:) Very sorry in advance if my code looks like something a total newbie would write. Below is a portion of my Python code; I am fiddling with sklearn and machine learning techniques.
I trained several Naive Bayes models on different datasets and stored them in trained_models.
Prior to this step, I created an object chi_squared of the SelectPercentile class, using the chi2 function, for feature selection. From my understanding, I should write data_feature_reduced = chi_squared.transform(some_data) and then use data_feature_reduced at training time, i.e. nb.fit(data_feature_reduced, data.target).
This is what I did, and I stored the resulting nb objects (and some other information) in the list trained_models.
I am now attempting to apply these models on a different set of data ( actually from the same source, if that matters to the question )
for name, model, intra_result, dev, training_data, chi_squarer in trained_models:
    cross_results = []
    new_vect = StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)
    for data in demframes:
        data_name = data[0]
        X_test_data = new_vect.fit_transform(data[1].values.astype('U'))
        Y_test_data = data[2]
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0], model.score(X_test_data, Y_test_data)))
I have to admit that I am a bit of a stranger to the feature selection part.
Here is the error that I get:
ValueError: X has a different shape than during fitting.
at the line chi_squared_test_data = chi_squarer.transform(X_test_data).
I am assuming I am doing feature selection in an incorrect manner. Where did I go wrong?
Thanks to everyone for their help!
I will just paste the comment from @Vivek-Kumar that helped me solve my problem:
This error is due to this line new_vect.fit_transform(). Like your
trained models, you should use the same StemmedVectorizer which was
used at training time.
The same StemmedVectorizer object will transform X_test_data to the same shape it had during training. Currently, you are using a different object and fitting on it (fit_transform is fit plus transform), hence the shape is different. Hence the error.
Why not use a pipeline to make it simple? That way you don't have to transform twice and take care of the shapes.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
chi_squarer = SelectKBest(chi2, k=100) # change accordingly
lr = LogisticRegression() # or naive bayes
clf = Pipeline([('chi_sq', chi_squarer), ('model', lr)])
# for training:
clf.fit(training_data, targets)
# for predictions:
clf.predict(test_data)
You can also add the new_vect to the pipeline, as sketched below.
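A rough sketch of the full pipeline with the vectorizer included (raw_training_texts, targets, and raw_test_texts are illustrative names):
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vect', new_vect),      # the StemmedVectorizer defined earlier
    ('chi_sq', chi_squarer),
    ('model', lr),
])
clf.fit(raw_training_texts, targets)  # raw text in; vectorizing, selection, and fitting happen inside
predictions = clf.predict(raw_test_texts)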
