How to get a string as the y output using linear regression in Python

I have this rating prediction model using linear regression:
import pandas as pd
status = pd.DataFrame({'rating': [10.5,20.30,30.12,40.24,50.55,60.6,70.2], 'B': ['Bad','Not bad','Good','I like it','Very good','The best','Deserve an oscar']})
x = status.iloc[:,:-1].values
y = status.iloc[:,-1].values
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=0.4,random_state=0)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x,y)
input = 40.24
lr.predict([[input]])
So with 40.24 as my input X value I was expecting 'I like it' as the output, but it throws an error instead, because the expected output is a string. Here's the error: ValueError: could not convert string to float: 'Bad'. How do I make the model capable of giving a string as output?

Hi, that's because scikit-learn (or rather, machine learning in general) requires numeric labels as input. I'm not sure what the classes are in this case, but you can use OneHotEncoder from scikit-learn.
Also, change to logistic regression, since predicting discrete labels is a classification problem:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
# 1. INSTANTIATE
enc = OneHotEncoder()
# 2. FIT (the encoder expects a 2-D array, hence the reshape)
enc.fit(y.reshape(-1, 1))
# 3. TRANSFORM
onehotlabels = enc.transform(y.reshape(-1, 1)).toarray()
onehotlabels.shape
# LogisticRegression expects a 1-D target, so fit it on the original labels
clf = LogisticRegression(random_state=0).fit(x, y)
Or you can just manually map the labels yourself, whichever way you prefer
(e.g. Bad -> 0, Not bad -> 1, and so on).
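A minimal sketch of the manual-mapping route (the dictionary below is only an illustration built from the labels in the question):
import pandas as pd
from sklearn.linear_model import LogisticRegression
status = pd.DataFrame({'rating': [10.5,20.30,30.12,40.24,50.55,60.6,70.2], 'B': ['Bad','Not bad','Good','I like it','Very good','The best','Deserve an oscar']})
# Map each string label to an integer class id
mapping = {label: i for i, label in enumerate(status['B'].unique())}
status['B_encoded'] = status['B'].map(mapping)
x = status[['rating']].values
y = status['B_encoded'].values
clf = LogisticRegression(random_state=0).fit(x, y)
# Map the predicted class id back to its string label
inverse = {i: label for label, i in mapping.items()}
print(inverse[clf.predict([[40.24]])[0]])  # likely 'I like it' on this toy data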

You cannot do a linear regression if your target feature has a categorical dtype.
The first rule of linear regression is that the target feature must be continuous: the y = mx + c function only takes numbers as input, tests the function against numerical values, and predicts a numerical value.
That is why your model gets trained but fails to predict.
You need to encode your target feature.
Please self-study these concepts.
Hope this helps.

Your labels are categorical, whereas regression labels should be continuous numerical values.
You could treat this as a classification problem rather than a regression one.
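For what it's worth, scikit-learn classifiers also accept string targets directly, so a minimal sketch of the classification route for the original data could look like this (toy data from the question; seven classes trained on seven samples won't generalize, this only illustrates the mechanics):
import pandas as pd
from sklearn.linear_model import LogisticRegression
status = pd.DataFrame({'rating': [10.5,20.30,30.12,40.24,50.55,60.6,70.2], 'B': ['Bad','Not bad','Good','I like it','Very good','The best','Deserve an oscar']})
x = status[['rating']].values
y = status['B'].values  # string labels are fine for a classifier
clf = LogisticRegression(random_state=0).fit(x, y)
print(clf.predict([[40.24]]))  # e.g. ['I like it']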

Related

Multiclass with logreg

So I'm trying to find a simple approach (not Dijkstra's algorithm) to a shortest-path problem.
Without reproducing everything: I have 3 paths and 50 samples of them (i.e. shape (50,3)), and I have identified the shortest path for each sample using the min function.
For x_train:
newx_train = np.zeros((50,3))
newx_train[:,0] = p1_train
newx_train[:,1] = p2_train
newx_train[:,2] = p3_train
[x_train] <- just randomly generated numbers
and subsequently y_train (since I'm generating it, I pass the min function through it):
newy_train = np.zeros((50,3))  # initialise before filling in the 1s
newy_train[np.arange(newx_train.shape[0]),newx_train.argmin(axis=1)] = 1
print(newy_train)
[newy_train] <- taking the min puts a 1 in each row at the position of the minimum value
So i get something like
[[1,0,0],
[0,1,0],
[1,0,0],
[0,0,1]]
Based on the generated x_train and y_train, I am trying to implement SVM and logistic regression to see how well they perform for multi-class prediction, and then I'll compute the confusion matrix and the accuracy.
My question is: how do I go about using multi-class with logistic regression? When I run a fit on x_train and y_train, python understandably throws an error that y should be a 1-D array but got shape (50,3) instead.
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
LogReg.fit(newx_train, newy_train[:,0])
ylog_pred = LogReg.predict(newx_test)
print(ylog_pred)
The above code naturally works for the binary case (assuming only 2 paths), since predicting '1' for one column (index 0) necessarily means the other column is '0'. But this does not work for multi-class. Could anyone help with it?
I think you're just missing how to interpret the y.
LogisticRegression expects the y column to contain the actual target labels, not a one-hot encoding, so you need something like
newy_train = np.argmax(newy_train, axis=1) # index of max across each row
Then you should be able to fit something with
LogReg.fit(newx_train,newy_train)
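Putting it together, a minimal end-to-end sketch (with randomly generated path lengths standing in for the question's data):
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
newx_train = rng.random((50, 3))  # 50 samples of 3 candidate path lengths
newx_test = rng.random((10, 3))
# The class label is simply the index of the shortest path,
# so nothing needs to be one-hot encoded
newy_train = newx_train.argmin(axis=1)
LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
LogReg.fit(newx_train, newy_train)
ylog_pred = LogReg.predict(newx_test)
print(ylog_pred)  # predicted shortest-path index per test sample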

Python: statsmodels - what does .predict(X) actually predict?

I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import pandas as pd
import statsmodels.api as sm
# Step 1) Load data into dataframe
df = pd.read_csv('my_data.csv')
# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df["dependent_variable"]
# Step 3) using OLS -fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make predictions
predictions
I'm not sure what predictions is showing. Is it predicting the next x amount of rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model on your data. The predict method returns the array of fitted values given the trained model.
In other words, from the statsmodels documentation:
Return linear predicted values from a design matrix.
It is similar to scikit-learn: after model = sm.OLS(y, X).fit() you have a trained model, and predictions = model.predict(X) does not predict the next x rows; it predicts from your X, the training dataset. The model fitted by ordinary least squares is a function of x, and the output is:
$$ \hat{y}=f(x) $$
If you want predictions for new X, you need to split X into training and testing datasets.
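A quick sanity check that predict(X) just returns the in-sample fitted values (toy data, not the question's CSV):
import numpy as np
import statsmodels.api as sm
X = sm.add_constant(np.arange(10, dtype=float))  # intercept + one regressor
y = 3.0 + 2.0 * np.arange(10)
model = sm.OLS(y, X).fit()
fitted = model.predict(X)  # the same rows the model was trained on
print(np.allclose(fitted, model.fittedvalues))  # True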
Actually, you are doing it wrong.
The predict method is used to predict values the model has not yet seen.
After separating the dependent and independent variables, you can split the data into two parts, train and test:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This makes X_train 80% of your total data, with only the independent variables.
You can then pass X_test to the predict method and compare the result against y_test to check how well the model is performing.
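A minimal sketch of that train/test workflow with statsmodels (column names are the placeholders from the question):
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
df = pd.read_csv('my_data.csv')
X = sm.add_constant(df[['independent_variable']])
y = df['dependent_variable']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = sm.OLS(y_train, X_train).fit()
predictions = model.predict(X_test)  # predictions for rows the model never saw
print(predictions.head())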

Why can't I predict new data using SVM and KNN?

I'm new to machine learning and I just learned KNN and SVM with sklearn. How do I make a prediction for new data using SVM or KNN? I have tried both. They make good predictions only on data that is already known; when I try to predict new data, they give an incorrect prediction.
Here is my code:
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
print(clf.predict([[20]]))
print(clf.score(x, y))
Output:
[12.]
1.0
This code makes a good prediction as long as the value to predict is within the range of x_train. But when I try to predict, for example, 20, or anything above the range of x_train, the output is always 12, which is the last element of y. I don't know what I am doing wrong in the code.
The code is behaving as mathematically described by a support vector machine.
You must understand how your data are being interpreted by the algorithm. You have 11 data points, and you are giving each one a different class. The SVM ends up basically dividing the number line into 11 segments (one for each of the 11 classes you defined):
import numpy as np
import matplotlib.pyplot as plt
# Sample the trained classifier along the number line to see the segments
data = [(x, clf.predict([[x]])[0]) for x in np.linspace(1, 20, 300)]
plt.scatter([p[0] for p in data], [p[1] for p in data])
plt.show()
The answer by AILearning tells you how to fit your given toy problem, but make sure you also understand why your code wasn't doing what you thought it was. For any finite set of examples there are infinitely many functions that fit the data. Your fundamental issue is that you are confusing regression and classification. From the sounds of it, you want a simple regression model that extrapolates a fitted function from the data points, but your code builds a classification model.
You have to use a regression model rather than a classification model. For SVM-based regression, use svm.SVR():
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
output:
[50.12]
0.9996

Python ValueError: Unknown label type: 'continuous'

I'm a beginner here and I am trying for the life of me to understand this other Stack Overflow post that asks the same question as mine:
Logistic Regression: Unknown label type: 'continuous'
This is my machine learning code below, and the shell output gives me ValueError: Unknown label type: 'continuous'.
I think I understand that I am "passing floats to a classifier which expects categorical values as the target vector. If you convert it to int it will be accepted as input (although it will be questionable if that's the right way to do it). It would be better to convert your training scores by using scikit's LabelEncoder function."
Can someone give me a tip on how to incorporate scikit's LabelEncoder function into my code? Is this implemented prior to stating the classifier's X and y? Whatever I try, I am doing something wrong. Thank you.
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors, utils
import pandas as pd
df = pd.read_csv('C:\\Users\\bbartling\\Documents\\Python\\WB Data\\WB_RTU6data.csv', index_col='Date', parse_dates=True)
print(df.head())
print(df.tail())
print(df.shape)
print(df.columns)
print(df.info())
print(df.describe())
df.dropna(inplace=True)  # drop missing rows before building the arrays
X = np.array(df.drop(['VAV6znt'],1))
y = np.array(df['VAV6znt'])
accuracies = []
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.50)
clf = neighbors.KNeighborsClassifier(n_neighbors=50)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
Your VAV6znt column is a float, which means you are trying to estimate a numerical value from the data. That makes it a regression problem, yet you are using KNeighborsClassifier, which is a classification estimator.
Try using KNeighborsRegressor, or any other estimator with Regressor in its name, as sketched below.
Converting the targets to int as you did will work, but will not give good results, because it would mean you have as many classes in your data as there are unique ints in it, which is obviously wrong.
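A minimal sketch of that swap (using sklearn.model_selection in place of the deprecated cross_validation module):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50)
reg = KNeighborsRegressor(n_neighbors=50)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 for a regressor, not classification accuracy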

Text Classification Using Python

I have a list of texts in a text variable with their labels, and I would like to make a classifier that can predict the label of new input text.
I am thinking of using the scikit-learn package in Python with an SVM model.
I realize that the text needs to be converted to vector form, so I am trying TfidfVectorizer and CountVectorizer.
This is my code so far using TfidfVectorizer:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
label = ['organisasi','organisasi','organisasi','organisasi','organisasi','lokasi','lokasi','lokasi','lokasi','lokasi']
text = ['Partai Anamat Nasional','Persatuan Sepak Bola', 'Himpunan Mahasiswa','Organisasi Sosial','Masyarakat Peduli','Malioboro','Candi Borobudur','Taman Pintar','Museum Sejarah','Monumen Mandala']
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(text)
y = label
klasifikasi = svm.SVC()
klasifikasi = klasifikasi.fit(X,y) #training
test_text = ['Partai Perjuangan']
test_vector = vectorizer.fit_transform(test_text)
prediksi = klasifikasi.predict([test_vector]) #test
print(prediksi)
I also tried CountVectorizer with the same code above.
Both show the same error:
ValueError: setting an array element with a sequence.
How do I solve this problem? Thanks.
The error is due to this line:
prediksi = klasifikasi.predict([test_vector])
Most scikit-learn estimators require an array of shape [n_samples, n_features]. The test_vector output from TfidfVectorizer is already in that shape, ready to use with estimators. You don't need to wrap it in square brackets ([ and ]); the wrapping turns it into a list, which is unsuitable.
Try using it like this:
prediksi = klasifikasi.predict(test_vector)
But even then you will get an error, because of this line:
test_vector = vectorizer.fit_transform(test_text)
Here you are refitting the vectorizer on the test data, so its vocabulary no longer matches what the klasifikasi estimator learned. fit_transform() is just a shortcut for calling fit() (learning the data) and then transform(). For test data, always use the transform() method, never fit() or fit_transform().
So the correct code will be:
test_vector = vectorizer.transform(test_text)
prediksi = klasifikasi.predict(test_vector)
#Output: array(['organisasi'], dtype='|S10')
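For completeness, the corrected pipeline in one piece (same data and variable names as the question):
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
label = ['organisasi']*5 + ['lokasi']*5
text = ['Partai Anamat Nasional','Persatuan Sepak Bola','Himpunan Mahasiswa','Organisasi Sosial','Masyarakat Peduli','Malioboro','Candi Borobudur','Taman Pintar','Museum Sejarah','Monumen Mandala']
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(text)  # fit the vocabulary on the training text only
klasifikasi = svm.SVC().fit(X, label)
test_vector = vectorizer.transform(['Partai Perjuangan'])  # transform only
prediksi = klasifikasi.predict(test_vector)
print(prediksi)  # ['organisasi']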
