I want to create some random data and try to improve my model with PolynomialFeatures, but I'm running into trouble doing so.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import random
import pandas as pd
import numpy as np
import statsmodels.api as sm
#create some artificial data
x=np.linspace(-1,1,1000)
x=pd.array(random.choices(x,k=1000))
y=x**2+np.random.randn(1000)
#divide sample
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
#define data frame for further use with PolynomialFeatures
df=pd.DataFrame([x_train,x_test])
df=df.transpose()
data = df
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
model = sm.OLS(y_train,data).fit()
And then I get the error: ValueError: unrecognized data structures: <class 'pandas.core.arrays.numpy_.PandasArray'> / <class 'numpy.ndarray'>
Do you have any ideas what should be done to make my regression work properly?
Use the to_numpy() function to convert the pandas array to a NumPy array:
model = sm.OLS(y_train.to_numpy(),data).fit()
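Alternatively, you can avoid the mixed types from the start by building everything on plain ndarrays. Here is a minimal sketch of the same pipeline (the uniform sampling here replaces random.choices and is an assumption, not the original sampling scheme):
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# plain ndarray instead of pd.array, so statsmodels recognizes it
x = np.random.uniform(-1, 1, 1000)
y = x**2 + np.random.randn(1000)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# PolynomialFeatures expects a 2-D input, hence the reshape
trans = PolynomialFeatures(degree=2)
X_train_poly = trans.fit_transform(x_train.reshape(-1, 1))
model = sm.OLS(y_train, X_train_poly).fit()
print(model.summary())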
Following is my code. The error seems to be in the qsvc.fit() line but I can't understand why. One of the error lines says "TypeError: Invalid parameter values, expected Sequence[Sequence[float]]." I'm fairly sure I have passed arrays to the fit function, but do they need to be float type? The labels are generally strings. Sorry, this is my first time trying this, so these questions may seem naive.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from qiskit import Aer
from qiskit.circuit.library import ZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from qiskit.algorithms.state_fidelities import ComputeUncompute
from qiskit.primitives import Sampler
from qiskit.utils import QuantumInstance
from qiskit_machine_learning.algorithms import PegasosQSVC
data=pd.read_csv('train.csv')
X = data.loc[1:1000,["marital","balance","loan"]].values
Y = data.iloc[:1000,-1].values
x_train, x_test, y_train, y_test = train_test_split(X, Y)
data_feature_map = ZFeatureMap(feature_dimension=3, reps=1 )
sampler = Sampler()
fidelity = ComputeUncompute(sampler=sampler)
data_kernel = FidelityQuantumKernel(fidelity=fidelity, feature_map=data_feature_map)
pegasos_qsvc = PegasosQSVC(quantum_kernel=data_kernel, C=1000, num_steps=100)
pegasos_qsvc.fit(x_train, y_train)
qsvc_score = pegasos_qsvc.score(x_test, y_test)
print(f"QSVC classification test score: {qsvc_score}")
The features have to be numeric: columns like "marital" and "loan" hold strings, which is why fit complains about expecting Sequence[Sequence[float]]. Map each category to a number such as 0, 1, and 2 first; sklearn has a LabelEncoder to help with such a conversion.
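A hedged sketch of that conversion, reusing the data frame loaded above (the exact category strings in train.csv are an assumption here, and iloc is used for both X and Y so the rows line up):
from sklearn.preprocessing import LabelEncoder

for col in ["marital", "loan"]:  # categorical string columns
    data[col] = LabelEncoder().fit_transform(data[col])
X = data.iloc[:1000][["marital", "balance", "loan"]].to_numpy(dtype=float)
# encode the string labels as well, so y is numeric too
Y = LabelEncoder().fit_transform(data.iloc[:1000, -1])
Note that LabelEncoder is really meant for targets; for feature columns, OrdinalEncoder or one-hot encoding is usually the preferred choice.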
How can I fix the error it throws? ValueError: Found input variables with inconsistent numbers of samples: [645471, 78]
Full code attached:
#Importing the numpy to perform Linear Algebraic operations on the data
import numpy as np
#Import pandas library to perform the data preprocessing
import pandas
#importing the Keras deep learning framework of Python
import keras
#Importing the Sequential model from keras
from keras.models import Sequential
#Importing the types of layers in the Neural Network that we are going to have
from keras.layers import Dense
#Importing the train_test_split function which is useful in dividing the dataset into the training and testing data
from sklearn.model_selection import train_test_split
#Importing the StandardScaler function to perform the standardisation/scaling of the data
from sklearn.preprocessing import StandardScaler, LabelEncoder
#Importing the metrics for the performance evaluation of our deep learning model
from sklearn import metrics
from keras.utils import np_utils, normalize, to_categorical
data = pandas.read_csv("C:/Users/bam/train.csv", header=0, dtype=object)
X = data.iloc[:, 0:78]
y = data.iloc[:78]
#I have split the dataset in a ratio of 80:20 between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 23)
#Creating an object of StandardScaler
sc = StandardScaler()
#Scaling the data using the StandardScaler() object
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
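The two numbers in the error match the indexing above: X = data.iloc[:, 0:78] keeps every row (645471 of them), while data.iloc[:78] selects the first 78 rows rather than a column, so y ends up with only 78 samples. A sketch of the presumably intended target selection (assuming the label really is the last column, index 78):
# iloc[:78] takes the first 78 rows; the target column needs ", 78"
X = data.iloc[:, 0:78]  # first 78 columns as features
y = data.iloc[:, 78]    # column 78 as the label, one value per row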
I am pretty new to machine learning and for the past two days I have been trying to get rid of the Unknown label type: 'continuous' error.
My code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
dataset = pd.read_csv(r'allData.csv', sep=',')
X = dataset.iloc[:, 1:3].values
y = dataset.iloc[:, 4].values
train_features, test_features, train_lables, test_lables = train_test_split(X, y, test_size=10, random_state=10)
feature_scaler = StandardScaler()
train_features = feature_scaler.fit_transform(train_features)
test_features = feature_scaler.transform(test_features)
classifier = RandomForestClassifier(n_estimators=300, random_state=10)
all_accuracies = cross_val_score(estimator=classifier, X=train_features, y=train_lables, cv="warn")
#all_accuracies = cross_val_score(estimator=classifier, X=train_features, y=train_lables, cv=3)
#print(all_accuracies)
The error comes up at the cross_val_score call and I do not understand why I am getting the Unknown label type: 'continuous' error.
Any help would be appreciated.
If it helps, my data is all numerical, with 4 columns and 300 rows.
You are using RandomForestClassifier on a continuous target. If the problem you are solving is regression, you should use RandomForestRegressor instead.
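A minimal sketch of that swap, reusing the variables from the question (and replacing cv="warn", which is just an old default sentinel, with an explicit fold count):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

regressor = RandomForestRegressor(n_estimators=300, random_state=10)
all_scores = cross_val_score(estimator=regressor, X=train_features,
                             y=train_lables, cv=3)
print(all_scores)  # R^2 per fold, the default score for regressors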
I want to estimate the model from the data I've used here in scikit-learn. I am using the DecisionTreeClassifier.score function, but when running the code I receive a ValueError:
Can't handle mix of continuous and multiclass.
Here is the code I use:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
nba = pd.read_excel(r"C:\Users\user\Desktop\nba.xlsx")
X = nba.drop('平均得分', axis = 1)  # '平均得分' means "average score"
y = nba['平均得分']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.20)
nba_tree = DecisionTreeClassifier()
nba_tree.fit(X_train, y_train.astype('int'))
y_pred = nba_tree.predict(X_test)
nba_tree.score(X_test, y_test)
It looks like your target variable 平均得分 ("average score") is continuous. Probably you are trying to solve a regression problem: you fit on y_train.astype('int'), so the predictions are multiclass, but score then compares them against the continuous y_test, which is exactly the mix the error names. If it is regression, try DecisionTreeRegressor instead of DecisionTreeClassifier.
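A sketch of the regression version, assuming the target really is continuous:
from sklearn.tree import DecisionTreeRegressor

nba_tree = DecisionTreeRegressor()
nba_tree.fit(X_train, y_train)         # no astype('int') needed
y_pred = nba_tree.predict(X_test)
print(nba_tree.score(X_test, y_test))  # R^2 for regressors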
I use the dataset from UCI repo: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency
Then I do the following:
from pandas import DataFrame, read_excel
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old versions
dataset = read_excel('/Users/Half_Pint_boy/Desktop/ENB2012_data.xlsx')
dataset = dataset.drop(['X1','X4'], axis=1)
trg = dataset[['Y1','Y2']]
trn = dataset.drop(['Y1','Y2'], axis=1)
Then I define the models and split the data for validation:
models = [LinearRegression(),
RandomForestRegressor(n_estimators=100, max_features ='sqrt'),
KNeighborsRegressor(n_neighbors=6),
SVR(kernel='linear'),
LogisticRegression()
]
Xtrn, Xtest, Ytrn, Ytest = train_test_split(trn, trg, test_size=0.4)
I'm creating a regression model for predicting values but have a problem. Here is the code:
TestModels = DataFrame()
tmp = {}
for model in models:
    m = str(model)
    tmp['Model'] = m[:m.index('(')]
    for i in range(Ytrn.shape[1]):
        model.fit(Xtrn, Ytrn[:,i])
        tmp[str(i+1)] = r2_score(Ytest[:,0], model.predict(Xtest))
    TestModels = TestModels.append([tmp])
TestModels.set_index('Model', inplace=True)
It shows unhashable type: 'slice' for the line model.fit(Xtrn, Ytrn[:,i]).
How can I avoid this and make it work?
Thanks!
I think I had a similar problem before! Try converting your data to NumPy arrays before feeding them to the sklearn estimators; that most probably fixes the unhashable-slice error, since Ytrn[:,i] is not valid indexing on a DataFrame but works fine on an ndarray. For instance, you can do:
Xtrn_array = Xtrn.to_numpy()  # .as_matrix() in pandas < 1.0, since removed
Ytrn_array = Ytrn.to_numpy()
and use Xtrn_array and Ytrn_array when you fit your data to estimators.
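A hedged sketch of the fixed loop with arrays on both sides (shown for LinearRegression only, since LogisticRegression would reject the continuous Y1/Y2 targets; the per-target slice Ytest_a[:, i] also replaces the hard-coded [:, 0] so each column is scored against its own predictions):
Xtrn_a, Xtest_a = Xtrn.to_numpy(), Xtest.to_numpy()
Ytrn_a, Ytest_a = Ytrn.to_numpy(), Ytest.to_numpy()

model = LinearRegression()
for i in range(Ytrn_a.shape[1]):
    model.fit(Xtrn_a, Ytrn_a[:, i])  # positional slices work on ndarrays
    print(f"Y{i+1} R^2:", r2_score(Ytest_a[:, i], model.predict(Xtest_a)))
Alternatively, keep the DataFrames and slice positionally with Ytrn.iloc[:, i], which is the pandas equivalent of Ytrn_a[:, i].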