I am trying to plot an 8x8 correlation matrix between the different feature scores and the corresponding chances of admit. May I know how I am supposed to do so?
import tensorflow as tf
import numpy as np
import pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
admit_data = np.genfromtxt('admission_predict.csv', delimiter= ',')
X_data, Y_data = admit_data[1:,1:8], admit_data[1:,-1]
x_train, x_test, y_train_, y_test_ = train_test_split(
    X_data,
    Y_data,
    test_size=0.3,
    random_state=42
)
scaler = preprocessing.StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # transform only; refitting on the test set would leak information
no_labels = 1  # single target column
y_train = y_train_.reshape(len(y_train_), no_labels)
y_test = y_test_.reshape(len(y_test_), no_labels)
data = admit_data
df = pd.DataFrame(data, columns = ['Serial No.','GRE Score','TOEFL Score','University Rating','SOP','LOR','CGPA','Research','Chance of Admit'])
df.corr()
This is the code I'm reading now, and my file looks like this.
Please help me plot an 8x8 correlation matrix; as of now my code doesn't return one.
What about:
import matplotlib.pyplot as plt
cors = df.corr()
plt.matshow(cors)  # draw the correlation matrix as a colored grid
plt.yticks(range(cors.shape[1]), cors.columns, fontsize=7)
plt.xticks(range(cors.shape[1]), cors.columns, fontsize=7, rotation=90)
plt.colorbar()
plt.show()
To use all columns except the "Serial No." column, use this cors instead:
cors = df.drop("Serial No.", axis=1).corr()
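If you prefer annotated cells, a seaborn heatmap is a common alternative; this is only a sketch layered on the same df (seaborn is an extra dependency not used above):
import seaborn as sns
# correlations over the seven features plus Chance of Admit -> 8x8
cors = df.drop("Serial No.", axis=1).corr()
sns.heatmap(cors, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()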
Related
I'm having a classification problem with the iris dataset. I can create a pairplot on the raw dataset, which looks like this when hue='species'.
But how can I use hue after splitting the dataset into X_train, y_train, given that the species class has been separated out?
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import seaborn as sns
import matplotlib.pyplot as plt

# DATA is the iris dataframe with a 'class' column
X = DATA.drop(['class'], axis='columns')
y = DATA['class'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

gbl_pl = []
gbl_pl.append(('standard_scaler_gb', StandardScaler()))
gblpq = Pipeline(gbl_pl)
scaled_df = gblpq.fit_transform(X_train, y_train)
sns.pairplot(data=scaled_df)
plt.show()
Output:
Expectation:
(Something like this with the split dataset, excluding the test data.)
You could concatenate y_train as a column to X_train.
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd

iris = sns.load_dataset('iris')
X = iris.drop(columns='species')
y = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# join the target back onto the training features so pairplot can color by it
sns.pairplot(data=pd.concat([X_train, y_train], axis=1), hue=y_train.name)
plt.show()
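If you want the plot on the scaled training data from your pipeline instead, one option (a sketch reusing scaled_df and gblpq from the question) is to wrap the transformed array back into a DataFrame so pairplot gets named columns:
# scaled_df is the ndarray returned by gblpq.fit_transform(X_train, y_train)
scaled = pd.DataFrame(scaled_df, columns=X_train.columns, index=X_train.index)
sns.pairplot(data=pd.concat([scaled, y_train], axis=1), hue=y_train.name)
plt.show()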
I'm trying to fit a linear kernel ridge regression model on a dataset with 8 features.
import pandas as pd
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls'
urllib.request.urlretrieve(url, './Concrete_Data.xls')
data = pd.read_excel('./Concrete_Data.xls')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
new_col_names = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer","CoarseAggregate", "FineAggregate", "Age", "CC_Strength"]
curr_col_names = list(data.columns)
mapper = {}
for i, name in enumerate(curr_col_names):
    mapper[name] = new_col_names[i]
data = data.rename(columns=mapper)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
from sklearn.kernel_ridge import KernelRidge
kr = KernelRidge(alpha=1.0)
kr.fit(x_train, y_train)
y_pred_kr = kr.predict(y_test)
When I try to run the code, I get an error saying the expected array is meant to be 2D but a 1D array was passed. Could someone let me know what I am doing wrong?
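The error almost certainly comes from the last line: predict should receive the test features, not the target vector, and y_test is 1-D, which is what triggers the 2D-array message. A minimal fix:
# predict from the scaled test features, not the target vector
y_pred_kr = kr.predict(x_test)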
I have a data set and want to apply scaling and then PCA to a subset of a pandas dataframe, returning just the components plus the columns that were not transformed. Using the mpg data set from seaborn, the training set for predicting mpg looks like this:
Now let's say I want to leave cylinders and displacement alone, scale everything else, and reduce it to 2 components. I'd expect the result to be 4 total columns: the original 2 plus the 2 components.
How can I use ColumnTransformer to apply the scaling to a subset of columns, then the PCA, and return only the components and the 2 passthrough columns?
MWE
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
pd.DataFrame(trans)
I strongly suspect I am misunderstanding how this step works: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)]). I assumed it operates on the last 4 columns, first scaling and then applying PCA, finally returning the 2 components. Instead I get 8 columns: the first 4 are scaled, the next 2 appear to be the components (likely computed on the unscaled data), and the last two are the columns I passed through.
I think this works, but I don't know whether it is the idiomatic scikit-learn way to solve it:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))       # columns to scale (all but the first two)
dtm_i2 = list(range(0, len(X_train.columns) - 2))  # in the output of preprocess, the scaled columns come first
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)
pd.DataFrame(trans)
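A more conventional way to express this (a sketch reusing the names above; Pipeline is an extra import) is to chain the scaler and the PCA in a single Pipeline and hand that to one ColumnTransformer, so the columns are guaranteed to be scaled before PCA sees them:
from sklearn.pipeline import Pipeline

scale_then_pca = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))])
preprocess = ColumnTransformer(
    transformers=[('scale_pca', scale_then_pca, dtm_i)],
    remainder='passthrough')  # cylinders and displacement pass through untouched
trans = preprocess.fit_transform(X_train)  # 4 columns: 2 components, then the 2 passthrough
pd.DataFrame(trans)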
I am new to data science. I am trying to find the feature importance ranking for my dataset. I already applied a random forest and got the output.
Here is my code:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# importing dataset
dataset=pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:,3:12].values
Y = dataset.iloc[:,13].values
# encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#country
labelencoder_X_1= LabelEncoder()
X[:,1]=labelencoder_X_1.fit_transform(X[:,1])
#gender
labelencoder_X_2= LabelEncoder()
X[:,2]=labelencoder_X_2.fit_transform(X[:,2])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
# splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.20)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
In the importance part I almost copied the example shown in:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
Here is the code:
# feature importance
importances = regressor.feature_importances_
std = np.std([tree.feature_importances_ for tree in regressor.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
I am expecting the output shown in the documentation. Can anyone help me please? Thanks in advance.
My dataset is here:
You have a lot of features, so they cannot all be seen in a single plot.
Just plot some of them.
Here I plot the 20 most important:
# Plot the feature importances of the forest
plt.figure(figsize=(18,9))
plt.title("Feature importances")
n=20
_ = plt.bar(range(n), importances[indices][:n], color="r", yerr=std[indices][:n])
plt.xticks(range(n), indices)
plt.xlim([-1, n])
plt.show()
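If you want readable tick labels instead of bare column indices, one option (a sketch; feature_names here is a placeholder you would replace with the real column names of X after one-hot encoding) is:
n = 20
feature_names = np.array(["feature %d" % i for i in range(X.shape[1])])  # placeholder names
plt.figure(figsize=(18, 9))
plt.title("Feature importances")
plt.bar(range(n), importances[indices][:n], color="r", yerr=std[indices][:n])
plt.xticks(range(n), feature_names[indices][:n], rotation=90)
plt.xlim([-1, n])
plt.show()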
My code in case you need it: https://filebin.net/be4h27swglqf3ci3
Output:
I am trying to use my machine learning model on a dataset where I have only two columns. While standard-scaling them, I got the error "expected 2D array, got 1D array".
Below is the code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
y_pred = sc_y.inverse_transform(y_pred)
# Visualising the SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
When I try to put
y = sc_y.fit_transform([y])
like this, I receive no error, but when I execute the next 3 lines I receive another error:
bad input shape (1, 10)
Can anyone help me with this?
The StandardScaler in sklearn expects its input X in the following format:
X : numpy array of shape [n_samples, n_features]
So reshape the array to (-1, 1) if you have only one feature column:
sc_X.fit_transform(X.reshape(-1, 1))
This should work!
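For the snippet in this question specifically, it is y that is 1-D (X from iloc[:, 1:2] is already 2-D), so a minimal sketch of the fix, keeping the rest of the code unchanged, might be:
# scale y as a column vector, then flatten it back to 1-D for SVR
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
regressor.fit(X, y)

# the prediction input must also be a 2-D, scaled array
y_pred = regressor.predict(sc_X.transform(np.array([[6.5]])))
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))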