I'm working on a classification problem with the iris dataset. I can create a pairplot on the raw dataset, which looks like this when hue='species'.
But how can I use hue after splitting the dataset into X_train and y_train, since the species class is separated from the features?
X = DATA.drop(['class'], axis = 'columns')
y = DATA['class'].values
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.20,random_state =42)
gbl_pl=[]
gbl_pl.append(('standard_scaler_gb',
StandardScaler()))
gblpq=Pipeline((gbl_pl))
scaled_df=gblpq.fit_transform(X_train,y_train)
sns.pairplot(data=scaled_df)
plt.show()
Output: (image)
Expectation: (image)
(Something like this, but using the split dataset and excluding the test data)
You could concatenate y_train as a column to X_train.
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
import numpy as np
iris = sns.load_dataset('iris')
X = iris.drop(columns='species')
y = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
sns.pairplot(data=pd.concat([X_train, y_train], axis=1), hue=y_train.name)
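If you also want to plot the scaled features, as in the question, note that StandardScaler returns a NumPy array, so the column names and the index are lost. Here is a minimal sketch of one way around that. It assumes the iris data is loaded through sklearn's load_iris (so the label column is called 'target' rather than 'species'), and the name train_df is only for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load iris as a DataFrame; 'target' holds the integer class labels.
iris = load_iris(as_frame=True).frame
X = iris.drop(columns='target')
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# fit_transform returns a plain array, so rebuild a DataFrame that
# reuses the training columns and index.
scaled = StandardScaler().fit_transform(X_train)
train_df = pd.DataFrame(scaled, columns=X_train.columns, index=X_train.index)

# Re-attach the labels; the shared index keeps the rows aligned.
train_df[y_train.name] = y_train
```

Calling sns.pairplot(data=train_df, hue=y_train.name) should then color the scaled training points by class.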
Related
How can I use this dataset, "MC1", to plot a KNN decision boundary figure?
Here is my code. I have tried to use iloc and loc, but neither worked.
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from yellowbrick.contrib.classifier import DecisionViz
from yellowbrick.features import RadViz
from yellowbrick.style import set_palette
set_palette('flatui')
data_set = pd.read_csv('MC1.csv')
X, y = data_set
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = tts(X, y, test_size=.4, random_state=42)
visualizer = RadViz(size=(500, 400))
viz = DecisionViz(
KNeighborsClassifier(5), title="Nearest Neighbors",classes=['Y', 'N']
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
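One likely culprit is the line X, y = data_set: unpacking a DataFrame iterates over its column labels, not over features and target. The layout of MC1.csv is not shown, so the sketch below assumes feature columns followed by a single label column, and uses a small synthetic frame (hypothetical columns f1, f2, label) in place of the real file:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for pd.read_csv('MC1.csv'): two features and a label.
rng = np.random.default_rng(0)
data_set = pd.DataFrame({
    'f1': rng.normal(size=100),
    'f2': rng.normal(size=100),
})
data_set['label'] = np.where(data_set['f1'] + data_set['f2'] > 0, 'Y', 'N')

# Unpacking a DataFrame yields its column labels, so select the
# columns by position instead.
X = data_set.iloc[:, :-1].values   # every column except the last
y = data_set.iloc[:, -1].values    # last column as the target

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

knn = KNeighborsClassifier(5).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
```

The resulting X_train, y_train, X_test and y_test can then be passed to viz.fit and viz.draw as in the question.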
I'm trying to fit a linear kernel ridge regression model on a dataset with 8 features.
import pandas as pd
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls'
urllib.request.urlretrieve(url, './Concrete_Data.xls')
data = pd.read_excel('./Concrete_Data.xls')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
new_col_names = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer","CoarseAggregate", "FineAggregate", "Age", "CC_Strength"]
curr_col_names = list(data.columns)
mapper = {}
for i, name in enumerate(curr_col_names):
    mapper[name] = new_col_names[i]
data = data.rename(columns=mapper)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
from sklearn.kernel_ridge import KernelRidge
kr = KernelRidge(alpha=1.0)
kr.fit(x_train, y_train)
y_pred_kr = kr.predict(y_test)
When I try to run the code, there is an error that says the expected array is meant to be 2D but is a 1D array. Could someone let me know what I am possibly doing wrong?
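The error most likely comes from the last line: predict is given y_test, the 1D target vector, where it expects the 2D feature matrix x_test. A minimal sketch of the corrected call, using synthetic data as a stand-in for the concrete dataset (the random arrays are an assumption made only so the example runs offline):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the concrete data: 8 features, 1 target.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

kr = KernelRidge(alpha=1.0)   # kernel defaults to 'linear'
kr.fit(x_train, y_train)

# predict takes the 2D feature matrix, not the 1D target vector.
y_pred_kr = kr.predict(x_test)
```

y_test is then only needed afterwards, to compare against y_pred_kr.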
I'm trying to simply plot a regression line, however I get messy lines. Is it because I fitted the model with 2 features, so the only appropriate visualization would be a 3d plane?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
# prepare data
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)[['AGE','RM']]
y = boston.target
# split dataset into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=33)
# apply linear regression on dataset
lm = LinearRegression()
lm.fit(X_train, y_train)
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
#plot relationship between RM and price
plt.scatter(X_train['RM'],
y_train,
c='g',
s=40,
alpha=0.5)
plt.plot(X_train['RM'], pred_train, color='r')
plt.title('Relationship between RM and Price')
plt.ylabel('Price')
plt.xlabel('RM')
You are right. You are training on two features, AGE and RM, but plotting in 2D against only one of them, RM. Try a 3D plot instead. In general, linear regression with two features fits a plane. This is still linear regression, which is why we use the term "hyperplane": it reduces to a line for a single feature, a plane for two features, and so on.
Here is the output in 3D:
# Figure.gca() no longer accepts projection= in recent Matplotlib; use add_subplot instead
plt3d = plt.figure().add_subplot(projection='3d')
plt3d.view_init(azim=135)
plt3d.plot_trisurf(X_train['RM'].values, X_train['AGE'].values, pred_train, alpha=0.7, antialiased=True)
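To make the "plane" statement concrete, here is a small sketch that checks a two-feature fit against the plane equation y = b0 + b1*x1 + b2*x2. The data is synthetic (an assumption, standing in for AGE, RM and the price), so the coefficients are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the two features (AGE, RM) and the price.
rng = np.random.default_rng(33)
X = rng.uniform(size=(100, 2))
y = 3.0 - 1.5 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lm = LinearRegression().fit(X, y)

# The fitted model is the plane  y = b0 + b1 * x1 + b2 * x2.
manual = lm.intercept_ + lm.coef_[0] * X[:, 0] + lm.coef_[1] * X[:, 1]
same = np.allclose(manual, lm.predict(X))
```

Because each prediction depends on both features, plotting the predictions against only one of them scatters them around instead of tracing a single line.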
The problem is that plt.plot connects the points in the order they are given, so the values have to be sorted before plotting:
plt.plot(np.sort(X_train['RM']), np.sort(pred_train), color='r')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
# prepare data
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)[['AGE','RM']]
y = boston.target
# split dataset into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=33)
# apply linear regression on dataset
lm = LinearRegression()
lm.fit(X_train, y_train)
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
#plot relationship between RM and price
plt.scatter(X_train['RM'],
y_train,
c='g',
s=40,
alpha=0.5)
plt.plot(np.sort(X_train['RM']), np.sort(pred_train), color='r')
plt.title('Relationship between RM and Price')
plt.ylabel('Price')
plt.xlabel('RM')
plt.show()
The result:
(output plot)
A 3D plot would probably make it easier to see the relationship between the covariates RM and AGE. (3D plot)
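One caveat: sorting RM and the predictions separately no longer pairs each prediction with its own RM value. An alternative way to get a clean 2D line, sketched here under the assumption that holding AGE at its training mean is acceptable, is to predict over a sorted grid of RM values. The data is synthetic and the name grid is only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the AGE and RM columns and the price target.
rng = np.random.default_rng(33)
X_train = pd.DataFrame({
    'AGE': rng.uniform(0, 100, size=200),
    'RM': rng.uniform(4, 9, size=200),
})
y_train = 9.0 * X_train['RM'] - 0.07 * X_train['AGE'] + rng.normal(size=200)

lm = LinearRegression().fit(X_train, y_train)

# Vary RM over a sorted grid while holding AGE at its training mean.
grid = pd.DataFrame({
    'AGE': np.full(50, X_train['AGE'].mean()),
    'RM': np.linspace(X_train['RM'].min(), X_train['RM'].max(), 50),
})
line = lm.predict(grid)
```

plt.plot(grid['RM'], line, color='r') then draws a straight line through the scatter, showing the fitted effect of RM at average AGE.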
I want to evaluate the model on the data I've used here in scikit-learn. I am using the DecisionTreeClassifier.score function, but when I run the code I get a ValueError:
Can't handle mix of continuous and multiclass.
Here is the code I use:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
nba = pd.read_excel(r"C:\Users\user\Desktop\nba.xlsx")
X = nba.drop('平均得分', axis = 1)
y = nba['平均得分']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.20)
nba_tree = DecisionTreeClassifier()
nba_tree.fit(X_train, y_train.astype('int'))
y_pred = nba_tree.predict(X_test)
nba_tree.score(X_test, y_test)
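It looks like your target variable 平均得分 (average score) is continuous. The error arises because the tree is fitted on y_train cast to int, while score compares its integer predictions against the float values in y_test. You are probably trying to solve a regression problem. If that is the case, try DecisionTreeRegressor instead of DecisionTreeClassifier.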
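A minimal sketch of the regressor version, using synthetic data as a stand-in for the NBA spreadsheet (the arrays are an assumption, since the file is not available):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the NBA table: 5 features, continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 10.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# No cast to int is needed: the regressor accepts continuous targets.
nba_tree = DecisionTreeRegressor(random_state=0)
nba_tree.fit(X_train, y_train)

# For regressors, score returns R^2 rather than accuracy.
r2 = nba_tree.score(X_test, y_test)
```

Note that score now reports the coefficient of determination, so values are not directly comparable to a classification accuracy.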
I have a data sample of 750x256.
Rows = 750
Columns = 256
If I use a 20% test split, I get 600 samples for X_train and 150 samples for y_train.
The problem then occurs when fitting the decision tree:
it says Number of y_train=150 does not match number of samples=600
But if I set test_size to 50%, it works.
Is there a way around this? I don't want to use 50% for my test_size.
Any help would be great!
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
#Load the data
dataset = pd.read_csv('new_york.csv')
dataset['Higher'] = dataset['2016-12'].gt(dataset['2016-11']).astype(int)
X = dataset.iloc[:, 6:254].values
y = dataset.iloc[:, 255].values
#Taking care of missing data
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, :248])
X[:, :248] = imputer.transform(X[:, :248])
#Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_test, y_train = train_test_split(X, y, test_size = .2, random_state = 0)
#let's build our first model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
train_test_split() returns X_train, X_test, y_train, y_test, in that order. You have y_train and y_test in the wrong order.
A 50% split does not raise an error because y_train and y_test then have the same size (but obviously the wrong values).
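A quick way to confirm the unpacking order is to check the sizes of the four returned pieces. The sketch below uses placeholder arrays with the same number of rows as the question (750); the contents are an assumption and do not matter here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the question (750).
X = np.zeros((750, 248))
y = np.arange(750)

# Correct order: both X parts first, then both y parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

shapes = (X_train.shape[0], X_test.shape[0], y_train.shape[0], y_test.shape[0])
```

With the names in this order, X_train and y_train both have 600 rows, so clf.fit(X_train, y_train) no longer complains about mismatched sample counts.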