Following is my code. The error seems to come from the pegasos_qsvc.fit() line, but I can't understand why. One of the error lines says "TypeError: Invalid parameter values, expected Sequence[Sequence[float]]." I'm fairly sure I passed arrays as parameters to the fit function, but do they need to be of float type? Labels are generally strings. Sorry, this is my first time trying this, so these questions may seem naive.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from qiskit import Aer
from qiskit.circuit.library import ZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from qiskit.algorithms.state_fidelities import ComputeUncompute
from qiskit.primitives import Sampler
from qiskit.utils import QuantumInstance
from qiskit_machine_learning.algorithms import PegasosQSVC
data = pd.read_csv('train.csv')
# take the first 1000 rows; use iloc for both X and Y so the rows stay aligned
X = data.iloc[:1000][["marital", "balance", "loan"]].values
Y = data.iloc[:1000, -1].values
x_train, x_test, y_train, y_test = train_test_split(X, Y)
data_feature_map = ZFeatureMap(feature_dimension=3, reps=1)
sampler = Sampler()
fidelity = ComputeUncompute(sampler=sampler)
data_kernel = FidelityQuantumKernel(fidelity=fidelity, feature_map=data_feature_map)
pegasos_qsvc = PegasosQSVC(quantum_kernel=data_kernel, C=1000, num_steps=100)
pegasos_qsvc.fit(x_train, y_train)
qsvc_score = pegasos_qsvc.score(x_test, y_test)
print(f"QSVC classification test score: {qsvc_score}")
The error refers to the feature values, not the labels: the feature map binds each row of X as circuit parameters, so every feature must be a float, while your "marital" and "loan" columns hold strings. You can use values such as 0, 1 and 2 to represent the categories. sklearn has a LabelEncoder to help with such a conversion.
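A minimal sketch of the encoding step, assuming "marital" and "loan" hold categorical strings while "balance" is already numeric:

from sklearn.preprocessing import LabelEncoder

# encode each categorical column as integers in place
for col in ["marital", "loan"]:
    data[col] = LabelEncoder().fit_transform(data[col])

# every feature column is now numeric, so the feature map receives floats
X = data.iloc[:1000][["marital", "balance", "loan"]].values.astype(float)
Y = data.iloc[:1000, -1].values

Strictly speaking, LabelEncoder is intended for targets; OrdinalEncoder is the feature-oriented equivalent and does the same job column-wise.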
I want to create some random data and try to improve my model with PolynomialFeatures, but I'm running into a little trouble doing so.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import random
import pandas as pd
import numpy as np
import statsmodels.api as sm
#create some artificial data
x = np.linspace(-1, 1, 1000)
x = pd.array(random.choices(x, k=1000))
y = x**2 + np.random.randn(1000)
#divide the sample
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
#define a data frame for further use with PolynomialFeatures
df = pd.DataFrame([x_train, x_test])
df = df.transpose()
data = df
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
model = sm.OLS(y_train, data).fit()
And then I get the error: ValueError: unrecognized data structures: <class 'pandas.core.arrays.numpy_.PandasArray'> / <class 'numpy.ndarray'>
Do you have any idea what should be done to make my regression work properly?
Use the to_numpy() method to convert the pandas array to a NumPy array:
model = sm.OLS(y_train.to_numpy(),data).fit()
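Alternatively, the PandasArray can be avoided at the source by generating x as a plain NumPy array; a minimal sketch of that variant, using the same variable names as above:

import random
import numpy as np

x = np.linspace(-1, 1, 1000)
x = np.array(random.choices(x, k=1000))  # plain ndarray instead of pd.array(...)
y = x**2 + np.random.randn(1000)
# the rest of the pipeline is unchanged; OLS then receives types it recognizes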
I have to predict the type of program a student is in based on other attributes.
prog is a categorical variable indicating what type of program a student is in: “General” (1), “Academic” (2), or “Vocational” (3)
ses is a categorical variable indicating someone's socioeconomic class: "Low" (1), "Middle" (2), or "High" (3)
read, write, math, and science are their scores on different tests
honors: whether they have enrolled or not
(The csv file was provided as an image.)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# df = pd.read_csv(...)  # the data described above, loaded into a DataFrame
df1 = pd.get_dummies(df, drop_first=True)
X = df1.drop(columns=['prog_general', 'prog_vocation'], axis=1)
y = df1.loc[:, ['prog_general', 'prog_vocation']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
clf = LogisticRegression(multi_class='multinomial', solver='newton-cg')
model = clf.fit(X_train, y_train)
But here I am getting the following error:
ValueError: bad input shape (140, 2).
As such, LogisticRegression does not handle multiple targets, but this is not the case for every model in sklearn: all tree-based models (e.g. DecisionTreeClassifier) can handle multi-output natively.
To make this work with LogisticRegression, you need to wrap it in a MultiOutputClassifier.
Example:
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_multilabel_classification(n_classes=3, random_state=0)
clf = MultiOutputClassifier(estimator=LogisticRegression()).fit(X, y)
clf.predict(X[-2:])
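As a side note on the design: the wrapper is only needed because the target was dummy-encoded into two columns. If the goal is a single categorical prediction, it may be simpler to keep the target as one multiclass column; a sketch, assuming df is the raw frame and its target column is named prog:

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.get_dummies(df.drop(columns=['prog']), drop_first=True)  # encode the features only
y = df['prog']                                                  # one multiclass target column
clf = LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(X, y)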
I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without the one-vs-all technique, as the number of classes is very high and that approach might be inefficient.
I have tried the tips from the scikit-learn documentation [1], but there the one-vs-all technique is used, which is substantially different: in my approach I get one prediction per event, i.e. with n classes the prediction is a single value among the n classes, whereas with one-vs-all the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df['target_var'] = factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
# Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. back to the original labels)
reversefactor = dict(zip(range(9), definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried calling the roc_curve function directly, and also tried binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)

multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
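One way to sort this out (a sketch, not code from the original post): roc_curve only accepts a single binary column of y_true at a time, so with a multilabel-indicator matrix you compute one curve per class, and the scores should be class probabilities (e.g. predict_proba) rather than hard predictions:

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import roc_curve, auc

def per_class_roc(y_test, y_score):
    # binarize the true labels into an (n_samples, n_classes) indicator matrix
    lb = LabelBinarizer().fit(y_test)
    y_test_bin = lb.transform(y_test)
    # compute one ROC curve (and its AUC) per class column
    curves = {}
    for i, cls in enumerate(lb.classes_):
        fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
        curves[cls] = (fpr, tpr, auc(fpr, tpr))
    return curves

curves = per_class_roc(y_test, bdt_clf.predict_proba(X_test))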
I have typed in the following lines of code:
# import relevant statistical packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import sklearn.linear_model as skl
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
# import data
url = "/<...>/Smarket.csv" # relative url within my computer
Smarket = pd.read_csv(url, index_col = 'SlNo')
X3 = Smarket[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
Y3 = Smarket['Direction']
X_train, X_test, y_train, y_test = train_test_split(X3, Y3, test_size=0.2016)
data_1 = pd.concat([pd.DataFrame(y_train), X_train], axis = 1)
model_1 = sm.formula.glm(formula = 'y_train~X_train', data = data_1, family= sm.families.Binomial()).fit()
X_new = model_1.predict(X_test)
It is on the last line of code that I receive the following error:
PatsyError: Number of rows mismatch between data argument and X_train (252 versus 998)
y_train~X_train
^^^^^^^
I am just unable to understand why I am getting this error. I gather it might be because of a mismatch in the number of rows between X_test and X_train. How do I need to change my code to get the predicted values?
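The mismatch arises because the formula 'y_train~X_train' makes patsy pick up the whole arrays from the calling environment rather than columns of data_1, so predict(X_test) (252 rows) is validated against X_train (998 rows). A sketch of the usual fix, referring to column names in the formula (this assumes the Smarket Direction column holds 'Up'/'Down' labels):

train = X_train.copy()
train['Direction'] = (y_train == 'Up').astype(int)  # assumption: binary 'Up'/'Down' labels

model_1 = sm.formula.glm(
    formula='Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume',
    data=train,
    family=sm.families.Binomial()).fit()

# predict now only needs the named columns, so X_test can have any number of rows
X_new = model_1.predict(X_test)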
I want to estimate the model from the data I've used here in scikit-learn. I am using the DecisionTreeClassifier.score function, but when running the code I receive a ValueError:
Can't handle mix of continuous and multiclass.
Here is the code I use:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
nba = pd.read_excel(r"C:\Users\user\Desktop\nba.xlsx")
X = nba.drop('平均得分', axis = 1)
y = nba['平均得分']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20)
nba_tree = DecisionTreeClassifier()
nba_tree.fit(X_train, y_train.astype('int'))
y_pred = nba_tree.predict(X_test)
nba_tree.score(X_test, y_test)
It looks like your target variable 平均得分 ("average score") is a continuous variable, so you are probably trying to solve a regression problem. If that is the case, try DecisionTreeRegressor instead of DecisionTreeClassifier.
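A minimal sketch of that change, keeping the rest of the code the same:

from sklearn.tree import DecisionTreeRegressor

nba_tree = DecisionTreeRegressor()
nba_tree.fit(X_train, y_train)           # no .astype('int') cast needed for a regressor
y_pred = nba_tree.predict(X_test)
print(nba_tree.score(X_test, y_test))    # score() returns R^2 for regressors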