sklearn LogisticRegression bad input shape error fix [duplicate]

I have to predict the type of program a student is in based on other attributes.
prog is a categorical variable indicating what type of program a student is in: “General” (1), “Academic” (2), or “Vocational” (3)
Ses is a categorical variable indicating someone’s socioeconomic class: “Low” (1), “Middle” (2), and “High” (3)
read, write, math, science are their scores on different tests
honors indicates whether or not they are enrolled in an honors program
The CSV file was shared as an image. My code:
import pandas as pd
import numpy as np

# df is the student DataFrame loaded from the CSV above
df1 = pd.get_dummies(df, drop_first=True)
X = df1.drop(columns=['prog_general', 'prog_vocation'])
y = df1.loc[:, ['prog_general', 'prog_vocation']]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression(multi_class='multinomial', solver='newton-cg')
model = clf.fit(X_train, y_train)
But here I am getting the following error:
ValueError: bad input shape (140, 2).

As such, LogisticRegression does not handle multiple target columns. But this is not the case for all models in scikit-learn: for example, tree-based models such as DecisionTreeClassifier can handle multi-output targets natively.
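For instance, a DecisionTreeClassifier accepts a 2-D y directly; a quick sketch with random toy data:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(20, 4)
y = np.random.randint(0, 2, size=(20, 2))   # two binary targets at once

tree = DecisionTreeClassifier().fit(X, y)   # no wrapper needed
print(tree.predict(X[:3]))                  # shape (3, 2): one column per target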
To make this work for LogisticRegression, you need a MultiOutputClassifier wrapper.
Example:
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_multilabel_classification(n_classes=3, random_state=0)
clf = MultiOutputClassifier(estimator=LogisticRegression()).fit(X, y)
clf.predict(X[-2:])
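Alternatively, for this particular dataset prog is a single three-class target, so an arguably simpler fix is to dummy-encode the features only and keep the target as one column. A sketch, assuming df still holds the original prog column:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Dummy-encode the features only; keep the multi-class target as a single column
X = pd.get_dummies(df.drop(columns=['prog']), drop_first=True)
y = df['prog']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# y_train is now 1-D, so there is no "bad input shape" error
clf = LogisticRegression(multi_class='multinomial', solver='newton-cg')
clf.fit(X_train, y_train)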

Related

Does the fit function of QSVC require float values as parameters?

Following is my code. The error seems to come from the qsvc.fit() line, but I can't understand why. One of the error lines says "TypeError: Invalid parameter values, expected Sequence[Sequence[float]]." I'm pretty sure I have passed arrays as parameters to the fit function, but do they need to be of float type, given that labels are generally strings? Sorry, this is my first time trying this, so these questions may seem naive.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from qiskit import Aer
from qiskit.circuit.library import ZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from qiskit.algorithms.state_fidelities import ComputeUncompute
from qiskit.primitives import Sampler
from qiskit.utils import QuantumInstance
from qiskit_machine_learning.algorithms import PegasosQSVC
data=pd.read_csv('train.csv')
X = data.loc[1:1000,["marital","balance","loan"]].values
Y = data.iloc[:1000,-1].values
x_train, x_test, y_train, y_test = train_test_split(X, Y)
data_feature_map = ZFeatureMap(feature_dimension=3, reps=1)
sampler = Sampler()
fidelity = ComputeUncompute(sampler=sampler)
data_kernel = FidelityQuantumKernel(fidelity=fidelity, feature_map=data_feature_map)
pegasos_qsvc = PegasosQSVC(quantum_kernel=data_kernel, C=1000, num_steps=100)
pegasos_qsvc.fit(x_train, y_train)
qsvc_score = pegasos_qsvc.score(x_test, y_test)
print(f"QSVC classification test score: {qsvc_score}")
Yes, the string values in your features and labels need to be encoded as numbers before fitting: for example, use values such as 0, 1, and 2 to represent the categories in the "marital" and "loan" columns. scikit-learn has a LabelEncoder (and an OrdinalEncoder for feature columns) to help with such a conversion.
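A minimal sketch of that encoding step, assuming the same train.csv layout as in the question, and also aligning X and Y on the same rows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

data = pd.read_csv('train.csv')
features = data.iloc[:1000][["marital", "balance", "loan"]].copy()

# Map string categories (e.g. "married", "single") to ordinal floats;
# "balance" is already numeric, so only the string columns need encoding
features[["marital", "loan"]] = OrdinalEncoder().fit_transform(features[["marital", "loan"]])
X = features.values.astype(float)

# Encode the string class labels as integers for y
Y = LabelEncoder().fit_transform(data.iloc[:1000, -1])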

Error "name: 'Predictions' is not defined when using a confusion matrix

I'm getting the error while running:
# Creating a confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions))
Error: NameError: name 'predictions' is not defined
The entire code is:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
from sklearn.model_selection import train_test_split
# Importing the dataset
loans = pd.read_csv('loan_borowwer_data.csv')
loans.describe()
loans.isnull().sum()
loans.info()
plt.hist(loans['fico'],color='blue',edgecolor='white',bins=5)
plt.title('Histogram of fico')
plt.xlabel('fico')
plt.ylabel('frequency')
plt.show()
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue', bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red', bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')
#Create a list of elements containing the string purpose. Name this list cat_feats.
cat_feats = ['purpose']
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
final_data.info()
final_data.head()
# Train-Test Split
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
# Training Decision Tree Model
# Let us create a DecisionTreeClassifier from sklearn.tree.
# Decision tree model
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
# Evaluating Decision Tree
# Creating a confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions))
NameError: name 'predictions' is not defined
Please help
You are missing a step before creating the confusion matrix. After you declare and fit() the model on the training data, you need to run a prediction on the test data. It would be something like predictions = dtree.predict(X_test), which creates the predicted y values for the X_test rows. Once that is done, you can run the confusion matrix and it should produce the matrix you want. Hope this helps.
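In code, the missing step (using the same dtree and X_test as above) is:
from sklearn.metrics import confusion_matrix

# Predict on the held-out test set after fitting
predictions = dtree.predict(X_test)
print(confusion_matrix(y_test, predictions))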

Fitting a linear model using PolynomialFeatures

I want to create some random data and try to improve my model with PolynomialFeatures; however, I'm running into a little trouble doing so.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import random
import pandas as pd
import numpy as np
import statsmodels.api as sm
#create some artificial data
x=np.linspace(-1,1,1000)
x=pd.array(random.choices(x,k=1000))
y=x**2+np.random.randn(1000)
#divide sample
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
#define data frame to futher use for PolynomialFeatures
df=pd.DataFrame([x_train,x_test])
df=df.transpose()
data = df
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
model = sm.OLS(y_train,data).fit()
And then I get the error: ValueError: unrecognized data structures: <class 'pandas.core.arrays.numpy_.PandasArray'> / <class 'numpy.ndarray'>
Do you have any ideas what should be done to make my regression work properly?
Use the to_numpy() function to convert the pandas array to a NumPy array:
model = sm.OLS(y_train.to_numpy(), data).fit()
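Note also that df stacks x_train and x_test as two side-by-side columns, so the polynomial transform mixes test values into the training design matrix. A fuller sketch using plain NumPy arrays throughout, which sidesteps the PandasArray mismatch entirely, might look like:
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)                  # plain ndarray, no PandasArray involved
y = x**2 + rng.standard_normal(1000)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# PolynomialFeatures expects a 2-D array, hence the reshape
trans = PolynomialFeatures(degree=2)
X_train_poly = trans.fit_transform(x_train.reshape(-1, 1))

model = sm.OLS(y_train, X_train_poly).fit()
print(model.params)   # should recover roughly [0, 0, 1] for intercept, x, x**2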

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit-learn [1], but there the one-vs-all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one-vs-all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
# Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. to their original values)
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)

multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
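Not a full answer, but one likely culprit: roc_curve itself only supports binary targets, so after binarising you have to compute one curve per class column, and an ROC curve needs continuous scores (e.g. from predict_proba) rather than hard class predictions. A sketch of that per-class loop, reusing bdt_clf from above:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

# Probability scores of shape (n_samples, n_classes), not hard predictions
y_score = bdt_clf.predict_proba(X_test)
classes = bdt_clf.classes_
y_test_bin = label_binarize(y_test, classes=classes)

# roc_curve is binary-only, so compute one curve per class column
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    print(cls, auc(fpr, tpr))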

How do cross_val_score and GridSearchCV work?

I am new to Python and I have been trying to figure out how GridSearchCV and cross_val_score work.
After finding odd results, I set up a sort of validation experiment, but I still do not understand what I am doing wrong.
To try to simplify, I am using GridSearchCV in the simplest possible way and trying to validate and understand what is happening:
Here it is:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error,make_scorer,r2_score,mean_absolute_error
from math import sqrt
I create a cross-validation object (for GridSearchCV and cross_val_score) and a train/test dataset for the pipeline and simple linear regression. I have checked that the two datasets are identical:
train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]
train_data = list(range(0,15))
test_data = list(range(15,21))
X_train, y_train=X[train_data,:],y[train_data]
X_test, y_test=X[test_data,:],y[test_data]
Here is what I do:
Instantiate a simple linear model and use it with the manually split data:
lr=LinearRegression()
lm=lr.fit(X,y)
lmscore_train=lm.score(X_train,y_train)
->r2=0.4686662249071524
lmscore_test=lm.score(X_test,y_test)
->r2 0.6264021467338086
Now I try to do the exact same thing using a pipeline:
pipe_steps = ([('est', LinearRegression())])
pipe=Pipeline(pipe_steps)
p=pipe.fit(X,y)
pscore_train=p.score(X_train,y_train)
->r2=0.4686662249071524
pscore_test=p.score(X_test,y_test)
->r2 0.6264021467338086
LinearRegression and the pipeline match perfectly.
Now I try to do the same using cross_val_score with the predefined split kf:
cv_scores = cross_val_score(lm, X, y, cv=kf)
->r2 = -1.234474757883921470e+01?!?! (this is supposed to be the test score)
Now let's try gridsearchCV
scoring = {'r_squared':'r2'}
grid_parameters = [{}]
gridsearch=GridSearchCV(p, grid_parameters, verbose=3,cv=kf,scoring=scoring,return_train_score='true',refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_
from cv_results_ I get once again
->mean_test_r_squared->r2->-1.234474757883921292e+01
So cross_val_score and GridSearchCV do match one another in the end, but the score is totally off and different from what it should be.
Will you please help me solve this puzzle?
cross_val_score and GridSearchCV will first split the data, train the model on the train split only, and then score on the test split.
Here you are training on the full data and then scoring on the test data; hence you don't match the results of cross_val_score.
Instead of this:
lm=lr.fit(X,y)
Try this:
lm=lr.fit(X_train, y_train)
Same for pipeline:
Instead of p=pipe.fit(X,y), do this:
p=pipe.fit(X_train, y_train)
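Put together, a like-for-like comparison (a sketch reusing the X, y, and kf defined above) would be:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Train only on the 15 training rows, exactly as cross_val_score does internally
lm = LinearRegression().fit(X_train, y_train)
manual_test_score = lm.score(X_test, y_test)

# PredefinedSplit defines a single fold here, so one score comes back
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(manual_test_score, cv_scores[0])  # these two should now agree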
You can look at my answers for more detail:
https://stackoverflow.com/a/42364900/3374996
https://stackoverflow.com/a/42230764/3374996
