What does this error mean with StratifiedShuffleSplit? - python

I'm totally new to Data Science in general and was hoping someone could explain why this does not work:
I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data.
Next, I try to use sklearn's StratifiedShuffleSplit function to divide the data into training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, random_state=0) # can use test_size=0.8
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]): # Generate indices to split data into training and test set.
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.

I think the problem is that your data_y is a 2D matrix, but according to the sklearn.model_selection.StratifiedShuffleSplit documentation it should be a 1D vector. Try encoding each row of data_y as an integer (it will be interpreted as a class), and then call split.
Or possibly your y is a regression variable (continuous numerical data). In that case nearly every value is unique, so each value is treated as its own class with a single member, which is what the error message describes. (Vivek's link)
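If the target really is continuous, one possible workaround is to stratify on a binned copy of it rather than on the raw values. The sketch below is only an illustration: the DataFrame is synthetic (the numbers are made up, only the column names follow the question), and the choice of 5 quantile bins via pd.qcut is an arbitrary assumption, not something from the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Advertising data (values are invented).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
    "sales": rng.uniform(1, 27, 200),
})

# Bin the continuous target into 5 quantile-based groups (assumed bin count).
# Each bin then has many members, so stratification becomes possible.
sales_bin = pd.qcut(data["sales"], q=5, labels=False)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(data, sales_bin):
    # split() yields positional indices, so iloc is used here.
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]
```

The bins are used only to drive the split; the model would still be trained on the original continuous sales column.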

Related

Multiclass with logreg

So I'm trying to find a simple approach (not Dijkstra's algorithm) for a shortest path problem.
Without reproducing everything, I have 3 paths and 50 samples of them (i.e. shape (50,3)), and I have identified the shortest path for each sample using the min function,
with x_train being
newx_train = np.zeros((50,3))
newx_train[:,0] = p1_train
newx_train[:,1] = p2_train
newx_train[:,2] = p3_train
[x_train] <- just random numbers generated
and subsequently y_train (since I'm generating it, I pass the min function through it)
newy_train[np.arange(newx_train.shape[0]),newx_train.argmin(axis=1)]=1
print(newy_train)
[newy_train] <- passing min will show a 1 for each row where the minimum value is
So I get something like
[[1,0,0],
[0,1,0],
[1,0,0],
[0,0,1]]
Based on the generated x_train and y_train, I am trying to implement SVM and logreg to see how well they perform for multi-class prediction, and then I'll compute the confusion matrix and accuracy.
My question is, how do I go about using multi-class for logreg? When I run a fit on x_train, y_train, Python understandably throws an error saying that y should be a 1-D array but got (50,3) instead.
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial')
LogReg.fit(newx_train,newy_train[:,0])
ylog_pred = LogReg.predict(newx_test)
print(ylog_pred)
The above code naturally works for binary (assuming only 2 paths) since predicting '1' for one column (index 0) would naturally mean the other column is a '0'. But this would not work for multi-class. Could anyone help with it?
I think you're just missing the part with how to interpret the y.
LogisticRegression expects the y column to not be one-hot encoded and to actually be the target labels, so you need something like
newy_train = np.argmax(newy_train, axis=1) # index of max across each row
Then you should be able to fit something with
LogReg.fit(newx_train,newy_train)
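Putting the pieces together, a minimal end-to-end sketch might look like the following. The random input data is an assumption standing in for the asker's p1_train, p2_train and p3_train, and the multi_class argument is left out on purpose since the default lbfgs solver already handles multi-class targets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in for the three path lengths (assumed data).
rng = np.random.default_rng(0)
newx_train = rng.random((50, 3))

# One-hot matrix marking the shortest path in each row, as in the question.
newy_train = np.zeros((50, 3))
newy_train[np.arange(newx_train.shape[0]), newx_train.argmin(axis=1)] = 1

# Collapse the one-hot rows into a 1-D vector of class labels (0, 1 or 2).
labels = np.argmax(newy_train, axis=1)

LogReg = LogisticRegression(solver="lbfgs")
LogReg.fit(newx_train, labels)
ylog_pred = LogReg.predict(newx_train)
```

The predictions come back as class indices, so they can be compared directly against labels when computing accuracy.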

Python's "StandardScaler" and "LabelEncoder", and "fit" and "fit_transform" do not work with a CSV which contains both float and string

I was learning the MLP regressor in Google Colaboratory and ran the source code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array(table)
scaler.fit(data)
y_index = data.shape[1]-1
sd_x = (scaler.var_[:y_index])**0.5
sd_y = (scaler.var_[y_index])**0.5
mean_x = scaler.mean_[:y_index]
mean_y = scaler.mean_[y_index]
x = (data[:, :y_index]).astype(np.float32)
y = (data[:, y_index]).astype(np.float32)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25)
print('Separate training and testing sets!')
It gave the error ValueError: could not convert string to float: 'Photo Editor & Candy Camera & Grid & ScrapBook'.
So I checked the question RandomForestClassfier.fit(): ValueError: could not convert string to float. I also tried sklearn-LinearRegression: could not convert string to float: '--'.
I changed from fit(data) to fit_transform(data), but the same error persisted. Then I changed from StandardScaler to LabelEncoder, and from scaler = StandardScaler() to scaler = LabelEncoder(). But a different error appeared: ValueError: bad input shape (10841, 13) on the line scaler.fit_transform(data).
You can check the CSV on Kaggle here. The CSV contains both strings and numbers without quotation marks (except the prices, which are enclosed in double quotation marks).
From the documentation of sklearn's LabelEncoder: This transformer should be used to encode target values, i.e. y, and not the input X.
Particularly, it's not intended to fit a LabelEncoder on the full dataset.
If you just want to replace the values of the categorical (i.e, string-valued) columns by unique and numeric ids, one way to go is to apply the label encoder (before splitting the data) on each column you want to encode individually. As your sample code imports pandas, I assume that your data has been loaded into a pandas.DataFrame like
df = pd.read_csv('/path/to/googleplaystore.csv')
From there, you can apply the encoder on each column:
df['App'] = LabelEncoder().fit_transform(df['App'].values)
You may also want to have a look at how to handle categorical data within pandas.
However, even after doing this for each non-numeric column in your dataset, there is still a long way before fitting a model on the encoded data (you may want to apply one-hot encoding onto these columns afterwards, but this heavily depends on the model you want to use).
StandardScaler is a preprocessing class from sklearn that takes numeric entries and standardizes them to zero mean and unit variance (it rescales the values but does not change the shape of their distribution). It doesn't deal with text data, which explains the first error.
LabelEncoder is another preprocessing class from sklearn that takes data and maps them to a numeric encoded representation.
Ex: ["apple","banana","apple","banana"] to [0,1,0,1]
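As a small illustration of that mapping, the snippet below runs LabelEncoder on the same four strings. Classes are numbered in sorted order, so "apple" becomes 0 and "banana" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["apple", "banana", "apple", "banana"])

print(codes)             # [0 1 0 1]
print(encoder.classes_)  # ['apple' 'banana']

# The mapping can be reversed to recover the original strings.
restored = encoder.inverse_transform(codes)
```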
Your dataset has missing values, so you should deal with them first, by imputing, dropping, or some similar approach.
Then you should convert the type of each column (all but Rating are considered object, i.e. string) so that each datatype is handled properly.
table = pd.read_csv('googleplaystore.csv')
# check dataset info
table.info()
# check missing values
table.isna().sum()
To be honest, I think this is more of a conceptual problem than a technical one. As other users told you, StandardScaler must be used on numeric columns, but most of your dataframe columns are of object type. You should probably use OneHotEncoder on them; all transformers in sklearn behave in a similar way.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(X) # your data without target column
# ...blabla...
Finally, I recommend you read about Pipelines from sklearn, I think they are more elegant than a lot of messy code. You can put preprocessing and model steps on the same pipeline, for example here.
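A rough sketch of what such a pipeline could look like is shown below. The tiny DataFrame and its column names (Category, Reviews, Rating) are invented for illustration and only loosely modelled on the Play Store data; LinearRegression is likewise just a placeholder model, not the regressor from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data: one string column, one numeric column, one target.
df = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY", "TOOLS", "GAME"],
    "Reviews": [100.0, 2500.0, 40.0, 900.0, 15.0, 300.0],
    "Rating": [4.1, 4.5, 3.9, 4.3, 3.5, 4.0],
})
X = df[["Category", "Reviews"]]
y = df["Rating"]

# Scale the numeric column and one-hot encode the string column separately.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Reviews"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

Because each transformer only sees the columns assigned to it, StandardScaler never receives strings and the "could not convert string to float" error does not arise.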

xgboost - feature mismatch when I predict on my test data

I'm using xgboost to train on some data and then I want to score it on a test set.
My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. Training is fine, but the problem happens when I score the model on the test set.
I get an error of "feature_names_mismatch" and it lists the columns that are missing. My dataset is already in a matrix (numpy array format).
The mismatch in feature names is valid, since some dummy categories may not be present in the test set. So if this happens, is there a way for the model to still work?
If I understood your problem correctly, you have some categorical values which appear in the train set but not in the test set. This usually happens when you create dummy variables (converting categorical features using one-hot encoding etc.) separately for train and test instead of doing it on the entire dataset. The following code can help:
for col in featurs_object:
    X[col]=pd.Categorical(X[col],categories=df[col].dropna().unique())
    X_col = pd.get_dummies(X[col])
    X = X.drop(col,axis=1)
    X_col.columns = X_col.columns.tolist()
    frames = [X_col, X]
    X = pd.concat(frames,axis=1)
X = pd.concat([X,df_continous],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state = 1)
featurs_object : consists of all categorical columns which you want to include for model building.
df : your entire dataset (post cleanup)
df_continous : Subset of df, with only continuous features.
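When the train and test sets have already been encoded separately, a different technique from the one above is to align the test columns to the training columns with DataFrame.reindex. The sketch below uses two invented toy frames; the column names color and size are hypothetical.

```python
import pandas as pd

# Toy data: the test set is missing the "green" and "blue" categories.
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"color": ["red", "red"], "size": [4.0, 5.0]})

X_train = pd.get_dummies(train, columns=["color"])
X_test = pd.get_dummies(test, columns=["color"])

# Give the test set exactly the training columns, in the same order.
# Dummy columns absent from the test set are added and filled with 0;
# columns not seen during training would be dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

After this step both frames expose the same feature names, which is what the model checks when it predicts.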

How to remove minority classes with less than a certain number of examples before performing SMOTE, python

I have a dataset which contains 100 columns of features (100-dimensional feature vectors generated from word2vec), and my target is a categorical variable for each row in the dataset. There are around 1000 different categories (classes) in total, and the number of rows is around 75000. The issue with the dataset is that it is highly imbalanced: except for the top 200 classes, all the remaining classes have very few samples, and some classes have fewer than 6 samples.
Now I want to perform oversampling on this data using SMOTE to generate more examples for the minority classes. I want to ignore the classes that have fewer than 6 samples, because that is the point where SMOTE raises a ValueError. Is there any way I can handle this in the code so that classes with fewer than 6 samples are ignored while performing SMOTE? And will doing that help in solving the error that I am facing currently?
Code & Error message for reference:
dataset = pd.read_csv(r'C:\vectors.csv')
X = dataset.iloc[:, 3:103]
y = dataset.iloc[:, 0]
from imblearn.over_sampling import SMOTE
smote = SMOTE(k_neighbors = 1)
smote_Xtrain, smote_y_train = smote.fit_sample(X, y)
I am currently getting this error: ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2, even though I have set k_neighbors = 1.
Any help on this will be highly appreciated.
You can list the unique entries for each class and count them with the following command: df['VARIABLE'].value_counts(dropna=False) (set dropna=True if you don't want NaN to appear).
With that, you can write your own routine that sets a threshold and automatically removes the classes that appear fewer times than the threshold, or puts them into a new big class, "Other" for example.
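A minimal sketch of that thresholding idea is shown below, using pandas only. The toy dataset, the column name label, and the threshold of 6 are assumptions chosen to mirror the question. The error message suggests that SMOTE searches for k_neighbors + 1 neighbours within a class, which would explain why a class with a single sample fails even with k_neighbors = 1.

```python
import pandas as pd

# Toy dataset: classes "c" and "d" are below the threshold.
dataset = pd.DataFrame({
    "label": ["a"] * 8 + ["b"] * 7 + ["c"] * 3 + ["d"] * 1,
    "f0": range(19),
})

min_samples = 6  # assumed threshold, taken from the question
counts = dataset["label"].value_counts()
keep = counts[counts >= min_samples].index

# Option 1: drop rows belonging to rare classes.
filtered = dataset[dataset["label"].isin(keep)]

# Option 2: merge rare classes into a single "Other" class instead.
merged_labels = dataset["label"].where(dataset["label"].isin(keep), "Other")
```

The filtered frame (or the merged labels) can then be split into X and y and passed to the oversampler.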

Error with scikit-learn training: Unable to allocate array with shape

I have the following columns: Col1: string, Col2:float, Col3:float. During prediction I want to predict the value of Col3:
import pickle
import numpy as np
from sklearn import model_selection
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
x_columns_to_encode = ['Col1']
x_columns_to_scale = ['Col2']
y_columns_to_scale = ['Col3']
# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
# Scale and Encode Separate Columns
x_scaled_columns = scaler.fit_transform(df1[x_columns_to_scale])
x_encoded_columns = ohe.fit_transform(df1[x_columns_to_encode])
y_scaled_columns = scaler.fit_transform(df1[y_columns_to_scale])
df = np.concatenate([x_scaled_columns, x_encoded_columns], axis=1)
validation_size = 0.50
seed = 7
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(df, y_scaled_columns, test_size=validation_size, random_state=seed)
bestScore = 0.0
model = linear_model.LinearRegression()
score = model.fit(x_train, y_train).score(x_validation, y_validation)
print(score)
When I run this code I get error:
"Unable to allocate array with shape (2763330, 25380) and data type float64"
Can someone please help me understand where I am making a mistake?
One hot encoding generates a new column for every unique class in the categorical column. If your categorical column has too many unique classes, you can run out of memory.
It might help to show us what your data looks like, so we can provide better suggestions.
In the meantime, you could try the following options:
Using a sparse representation for the one hot encoded matrix. This saves a lot of space, since a one hot encoded matrix has a lot of entries which are 0.
Using a label encoding. However, be careful, because label encoding maps classes to integers, which many models will interpret as an ordinal relationship between classes. If there is no such relationship, it's better to use a tree-based model: trees split on thresholds rather than fitting a mathematical equation to the values, so the artificial ordering is far less of a concern when the decision tree makes a split.
Refer to this for a clear explanation.
Making a logical grouping of some of the classes. For example, if your categorical column has the classes 'tiger', 'lion' and 'cheetah', you could group these together under one class, say 'big_cats'. This reduces the number of unique classes, and hence fewer new columns are added.
Drop the categorical column entirely.
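The first option (a sparse representation) could be sketched roughly as follows. The data here is synthetic and far smaller than in the question; only the column names Col1, Col2 and Col3 are taken from it. OneHotEncoder returns a SciPy sparse matrix by default, so the sketch simply avoids requesting dense output and stacks the scaled numeric column onto it with scipy.sparse.hstack.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in: 2000 rows and up to 500 distinct strings in Col1.
rng = np.random.default_rng(0)
df1 = pd.DataFrame({
    "Col1": rng.choice([f"cat_{i}" for i in range(500)], size=2000),
    "Col2": rng.normal(size=2000),
    "Col3": rng.normal(size=2000),
})

# Default OneHotEncoder output is sparse: only the 1s are stored.
ohe = OneHotEncoder(handle_unknown="ignore")
x_encoded = ohe.fit_transform(df1[["Col1"]])

x_scaled = StandardScaler().fit_transform(df1[["Col2"]])

# Stack the columns without ever materialising a dense matrix.
X = sparse.hstack([sparse.csr_matrix(x_scaled), x_encoded]).tocsr()
y = df1["Col3"].to_numpy()

model = LinearRegression().fit(X, y)
```

Each row stores at most two non-zero values regardless of how many categories Col1 contains, which is where the memory saving comes from.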
