LDA processing failing with "Variables are collinear" warning in Python

I am using Python 3.6.1 | Anaconda 4.4.0
I am a novice in ML and practicing while learning. I picked up a Kaggle dataset to practice LDA for dimensionality reduction. Two points of confusion arose:
I started getting the warning "Variables are collinear."
Even though I am using n_components = 2, the output matrix x_train shows only 1 feature.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
datasets = pd.read_csv('mushrooms.csv')
X_df = datasets.iloc[:, 1:] # Independent variables
y_df = datasets.iloc[:, 0] # Dependent variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
X_df = X_df.apply(LabelEncoder().fit_transform)
x = OneHotEncoder(sparse=False).fit_transform(X_df.values)
y = LabelEncoder().fit_transform(y_df.values)
# Splitting dataset in to training set and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
#---------------------------------------------
# Applying LDA (Linear Discriminant Analysis)
#---------------------------------------------
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
x_train = lda.fit_transform(x_train, y_train)
x_test = lda.transform(x_test)

This suggests just what the warning message says: some of your variables are collinear. In other words, the elements of one vector are a linear function of the elements of another, for example
0, 1, 2, 3
3, 5, 7, 9
where the second row is 2 * (first row) + 3. In this case, LDA can't separate their influences on the target.
I can't diagnose anything specific, since you didn't provide the suggested minimal reproducible example (MCVE).
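Two hedged notes on the specifics here. One-hot encoding a categorical feature makes its dummy columns sum to a constant 1, an exact linear dependence (the "dummy variable trap") that by itself triggers the collinearity warning; dropping one level per feature removes it. And scikit-learn caps LDA's n_components at min(n_classes - 1, n_features); the mushroom target is binary, so at most one discriminant component comes back no matter that 2 were requested. A minimal sketch of the encoding fix (drop='first' needs scikit-learn >= 0.21, newer than the version in the question):
from sklearn.preprocessing import OneHotEncoder
# Dropping the first level of each feature breaks the exact linear
# dependence among its dummy columns.
encoder = OneHotEncoder(drop='first', sparse=False)  # sparse_output=False on >= 1.2
x = encoder.fit_transform(X_df.values)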

Related

How to use a GradientBoostingRegressor in scikit-learn with 3 output dimensions

I am trying to map 13-dimensional input data to 3-dimensional output data using the RandomForestRegressor and GradientBoostingRegressor of scikit-learn. While this works fine for the RandomForestRegressor, the GradientBoostingRegressor raises ValueError: y should be a 1d array, got an array of shape (16127, 3) instead.
I don't really understand why I get this error when using GradientBoostingRegressor and not when using RandomForestRegressor. As far as I understand, both of them use decision trees as a weak learner and combine them to get a good result. Of course I know that I could transform the 3-dimensional output labels into a 1-dimensional array, but this does not make sense, as I want to map to a 3-dimensional output vector. Any idea how I can do this using the GradientBoostingRegressor?
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Read data from csv files
Input_data_features = pd.read_csv("C:/Users/wi9632/Desktop/TestData_InputFeatures.csv", sep=';')
Input_data_labels = pd.read_csv("C:/Users/wi9632/Desktop/TestData_OutputLabels.csv", sep=';')
Input_data_features = Input_data_features.values
Input_data_labels = Input_data_labels.values
# standardize input features X and output labels Y
scaler_standardized_X = StandardScaler()
Input_data_features = scaler_standardized_X.fit_transform(Input_data_features)
scaler_standardized_Y = StandardScaler()
Input_data_labels = scaler_standardized_Y.fit_transform(Input_data_labels)
# Split dataset into train, validation, and test
index_X_Train_End = int(0.7 * len(Input_data_features))
index_X_Validation_End = int(0.9 * len(Input_data_features))
X_train = Input_data_features[0: index_X_Train_End]
X_valid = Input_data_features[index_X_Train_End: index_X_Validation_End]
X_test = Input_data_features[index_X_Validation_End:]
Y_train = Input_data_labels[0: index_X_Train_End]
Y_valid = Input_data_labels[index_X_Train_End: index_X_Validation_End]
Y_test = Input_data_labels[index_X_Validation_End:]
#Define a random forest model and train it
model_randomForest = RandomForestRegressor()
model_randomForest.fit(X_train, Y_train)
#Predict the test data with Random Forest
Y_pred_randomForest = model_randomForest.predict(X_test)
print(f"Random Forest Prediction: {Y_pred_randomForest}")
#Define a gradient boosting model and train it (-->Here I get the ValueError)
model_gradientBoosting = GradientBoostingRegressor()
model_gradientBoosting.fit(X_train, Y_train)
#Predict the test data with Gradient Boosting
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")
Here is the test data: https://filetransfer.io/data-package/ABCrGPzt#link
Reminder: As I could not solve my problem, I would like to remind you of this question. Does anybody have an idea how to tackle this problem?
RandomForestRegressor supports multi-output regression (see the docs); GradientBoostingRegressor does not.
You can use MultiOutputRegressor + GradientBoostingRegressor for the problem. See this answer.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
params = {'n_estimators': 5000, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 2}
estimator = MultiOutputRegressor(GradientBoostingRegressor(**params))
estimator.fit(train_data, train_targets)
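Applied to the variables from the question, a minimal sketch looks like this (MultiOutputRegressor simply fits one GradientBoostingRegressor per column of Y_train, so the 2D targets can be passed as-is):
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
# One booster is trained per target column.
model_gradientBoosting = MultiOutputRegressor(GradientBoostingRegressor())
model_gradientBoosting.fit(X_train, Y_train)
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)  # shape (n_samples, 3)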

Unable to Allocate Memory for an Array with Size

I'm an NLP noob working on a project, and I need to calculate the accuracy for a few different methods; however, I keep getting memory errors when running the code, for example "Unable to allocate 14.2 GiB for an array with shape (38045, 50000) and data type float64", even though I cast to a uint8 data type and changed the memory allocation in the Windows advanced settings. My code is below.
import sklearn
import numpy as np
import sklearn.feature_extraction.text
import pandas as pd
df = pd.read_csv('amprocessed.csv')
labels = df.iloc[:, 0]
import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=50000, dtype="uint8")
#vectorizer = TfidfVectorizer()
X = (vectorizer.fit_transform(df["Source"]).toarray()).astype(dtype="uint8")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
xscale = scaler.fit_transform(X).astype(dtype=np.uint8)
from sklearn import svm
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(xscale, labels, test_size=0.2, random_state=42)
clf = svm.SVC(kernel='linear') # Linear Kernel
clf.fit(x_train, y_train).astype(dtype=np.uint8)
y_pred = clf.predict(x_test)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)
The problem here is that you are transforming the output of CountVectorizer into a np.array. CountVectorizer outputs a sparse matrix (scipy.sparse.csr.csr_matrix), which is an efficient way to store such data.
Instead of holding, for each document, a np.array of shape (50000,) with almost all values equal to 0 and very few equal to 1, a sparse matrix only stores the values that are not 0. This greatly reduces the memory footprint, as shown in this example:
from scipy.sparse import csr_matrix
import numpy as np
import sys
X = np.zeros((100_000))
X[0] = 1
print(f'size (bytes) of np.array {sys.getsizeof(X)}')
X_sparse = csr_matrix(X)
print(f'size (bytes) of Sparse Matrix {sys.getsizeof(X_sparse)}')
Outputs:
size (bytes) of np.array 800104
size (bytes) of Sparse Matrix 48
Therefore you should modify your preprocessing code to keep the sparse matrix, i.e. drop the .toarray() call:
X = vectorizer.fit_transform(df["Source"])
In addition, fit returns the fitted estimator rather than an array, so the .astype call must go; it should simply be written as follows:
clf.fit(x_train, y_train)
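Putting it together, a sketch of a fully sparse pipeline (note: MinMaxScaler does not accept sparse input, so it is swapped here for MaxAbsScaler, which does; SVC itself handles scipy sparse matrices directly):
from sklearn.preprocessing import MaxAbsScaler
# Keep the csr_matrix all the way through instead of densifying it.
X = vectorizer.fit_transform(df["Source"])
xscale = MaxAbsScaler().fit_transform(X)  # sparse-friendly scaling
x_train, x_test, y_train, y_test = train_test_split(xscale, labels, test_size=0.2, random_state=42)
clf = svm.SVC(kernel='linear')
clf.fit(x_train, y_train)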

how to pass float argument in predict function in Python?

I was following a course on machine learning where the instructor passes a float argument to the predict function for polynomial linear regression, and it works for him. However, when I run the code it throws an error stating
"Expected 2D array, got scalar array instead".
I have tried turning the scalar into an array, but it does not seem to work.
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
The code seems to run smoothly for the instructor. However, I am getting the following error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This is the error that I am getting.
Actually the predict function accepts a 2D array as input, so you can put 6.5 inside double brackets like this: [[6.5]]
lin_reg.predict([[6.5]])
This will work.
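Equivalently, following the hint in the error message, the scalar can be reshaped explicitly; and note that the polynomial model from the same script needs its input passed through poly_reg.transform first (a sketch using the variables defined above):
import numpy as np
lin_reg.predict(np.array(6.5).reshape(1, -1))  # one sample, one feature
# For the degree-4 model, expand the features before predicting.
lin_reg_2.predict(poly_reg.transform([[6.5]]))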
Welcome to stackoverflow! You're more likely to get your question answered with a minimal reproducible example, and show at least a portion of any required external files. In this case, I think I've boiled it down to the essentials:
import pandas as pd
# Importing the dataset
salaries = [('Junior', 1, 50000),
('Associate', 2, 60000),
('Senior', 3, 70000),
('Manager', 4, 80000)]
df = pd.DataFrame(salaries)
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Predicting a new result with Linear Regression
print(lin_reg.predict(6.5))
Although I can't be sure exactly what is in the Position_Salaries.csv, I assume based on other arguments that it looks something like what I've shown. Running that example returns the expected result of 76100 in python 3.6 with sklearn 0.19. If you still get an error, try updating sklearn
pip install --upgrade scikit-learn
If you're still getting an error after that, not sure where the difference is, but you can spoof a 2d array by passing the argument like this: lin_reg.predict([[6.5]])

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without the one-vs-all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the scikit-learn documentation [1], but there the one-vs-all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one-vs-all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var = factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=250,
learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. to their original values)
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried the roc_curve function directly, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
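For what it's worth, roc_curve itself only scores one binary problem at a time, which is why it rejects the binarized multilabel-indicator arrays. The usual workaround is a per-class loop over the binarized labels, paired with the per-class scores from predict_proba; a hedged sketch reusing the names from the code above:
y_score = bdt_clf.predict_proba(X_test)  # shape (n_samples, 9)
lb = preprocessing.LabelBinarizer().fit(y_test)
y_test_bin = lb.transform(y_test)        # shape (n_samples, 9)
for i in range(len(lb.classes_)):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f'class {lb.classes_[i]} (AUC = {auc(fpr, tpr):.2f})')
plt.legend()
plt.show()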

How can I create a Linear Regression Model from a split dataset?

I've just split my data into a training and testing set, and my plan is to train a Linear Regression model and check its performance using my testing split.
My current code is:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0], 1)
split = np.random.rand(len(df)) <= 0.75
training_set = df[split]
testing_set = df[~split]
Is there a proper method I should be using to plot a Linear Regression model from an external file such as a .csv?
Since you want to use scikit-learn, here's an approach using sklearn.linear_model.LinearRegression:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# x_vars is your list of feature column names; y_var is the target column name
X_train, y_train = training_set[x_vars], training_set[y_var]
X_test, y_test = testing_set[x_vars], testing_set[y_var]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Depending on whether you need more descriptive output, you might also look into using statsmodels for linear regression.
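Since the question also asks about plotting: for a single feature, a quick sketch with matplotlib (x_vars and y_var are the placeholder names from above):
import matplotlib.pyplot as plt
# Actual test targets vs. the feature, with model predictions overlaid.
plt.scatter(X_test[x_vars[0]], y_test, label='actual')
plt.scatter(X_test[x_vars[0]], predictions, label='predicted')
plt.xlabel(x_vars[0])
plt.ylabel(y_var)
plt.legend()
plt.show()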
