Error with scikit-learn training: Unable to allocate array with shape - python

I have the following columns: Col1: string, Col2: float, Col3: float. I want to predict the value of Col3:
import pickle
import numpy as np
from sklearn import model_selection
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
x_columns_to_encode = ['Col1']
x_columns_to_scale = ['Col2']
y_columns_to_scale = ['Col3']
# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
# Scale and Encode Separate Columns
x_scaled_columns = scaler.fit_transform(df1[x_columns_to_scale])
x_encoded_columns = ohe.fit_transform(df1[x_columns_to_encode])
y_scaled_columns = scaler.fit_transform(df1[y_columns_to_scale])
df = np.concatenate([x_scaled_columns, x_encoded_columns], axis=1)
validation_size = 0.50
seed = 7
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(df, y_scaled_columns, test_size=validation_size, random_state=seed)
bestScore = 0.0
model = linear_model.LinearRegression()
score = model.fit(x_train, y_train).score(x_validation, y_validation)
print(score)
When I run this code I get the error:
"Unable to allocate array with shape (2763330, 25380) and data type float64"
Can someone please help me understand where I am making a mistake?

One-hot encoding generates a new column for every unique class in the categorical column. If your categorical column has too many unique classes, you can run out of memory: your error corresponds to a dense array of 2,763,330 rows by 25,380 one-hot columns, which is roughly 560 GB of float64 values.
It might help to show us what your data looks like, so we can provide better suggestions.
In the meantime, you could try the following options:
Using a sparse representation for the one-hot encoded matrix (a short sketch follows after these options). This saves a lot of space, since a one-hot encoded matrix consists mostly of zeros.
Using a label encoding. Be careful, though: label encoding assumes an ordinal relationship between classes. If there is no ordinal relationship, a tree-based model is a better fit, because trees don't use a single mathematical equation to build the model, so the artificial ordering is not relevant when the tree makes a split.
Refer to this for a clear explanation.
Making a logical grouping of some of the classes. For example, if your categorical column has the classes 'tiger', 'lion' and 'cheetah', you could group them under a single class, say 'big_cats'. This reduces the number of unique classes, and hence the number of new columns added.
Dropping the categorical column entirely.
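For the first of these options, here is a minimal sketch reusing df1 and the column lists from the question; the idea is simply not to force a dense array (so no sparse=False):
from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Leave the encoder at its default, which returns a scipy sparse matrix instead of a dense array.
ohe = OneHotEncoder()
x_encoded_columns = ohe.fit_transform(df1[x_columns_to_encode])
scaler = StandardScaler()
x_scaled_columns = scaler.fit_transform(df1[x_columns_to_scale])
# scipy's hstack keeps the combined matrix sparse; train_test_split and LinearRegression both accept sparse input.
X = hstack([x_scaled_columns, x_encoded_columns]).tocsr()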

Related

How to get feature names when using onehot encoder on only certain columns sklearn

I have read many posts on this that reference get_feature_names() from sklearn, which now appears to be deprecated and replaced by get_feature_names_out; I can't get either of them to work. It also appears that there is no way to use get_feature_names (or get_feature_names_out) with the ColumnTransformer class. So I am trying to fit and transform my numeric columns with a SimpleImputer and then a StandardScaler, and to SimpleImpute ('most_frequent') and OneHotEncode the categorical variables. I run them all individually, since I can't put them in a pipeline, and then I try get_feature_names, which results in:
ValueError: input_features should have length equal to number of features (5), got 11
I have also tried getting feature names for just the categorical features as well as just the numeric ones, and each one gives the following errors respectively:
ValueError: input_features should have length equal to number of features (5), got 121942
and
ValueError: input_features should have length equal to number of features (5), got 121942
I am completely lost, and I am also open to an easier way to get the feature names, so that I can make sure the prod data I run this model on after training/testing has exactly the same features as the ones the model was trained to expect (which is the root issue here).
If I'm "barking up the wrong tree" by trying to get the feature names for the reasoning outlined in the root issue I'm also more than willing to be corrected. Here is my code:
#ONE HOT
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# !pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
numeric_columns = X.select_dtypes(include=['int64','float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns
si_num = SimpleImputer(strategy='median')
si_cat = SimpleImputer(strategy='most_frequent')
ss = StandardScaler()
ohe = OneHotEncoder()
si_num.fit_transform(X[numeric_columns])
si_cat.fit_transform(X[cat_columns])
ss.fit_transform(X[numeric_columns])
ohe.fit_transform(X[cat_columns])
ohe.get_feature_names(X[numeric_columns])  # raises the ValueError: the encoder was fit on cat_columns and expects those column names (or no argument), not the numeric columns
Thanks!
I think this should work as a single composite estimator that does all your transformations and provides get_feature_names_out:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_pipe = Pipeline([
    ("imp", si_num),
    ("scale", ss),
])
cat_pipe = Pipeline([
    ("imp", si_cat),
    ("ohe", ohe),
])
preproc = ColumnTransformer([
    ("num", num_pipe, numeric_columns),
    ("cat", cat_pipe, cat_columns),
])
Ideally, you should save the fitted composite and use that to transform production data, rather than using the feature names to reconcile different categories.
You should also fit this composite only on the training set, transforming the test set separately.
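A minimal usage sketch for the composite above; X_train and X_test here are illustrative names for your own train/test split:
# Fit the preprocessor on the training data only, then reuse it as-is.
X_train_t = preproc.fit_transform(X_train)
X_test_t = preproc.transform(X_test)
# Names of the transformed output columns (recent scikit-learn versions).
feature_names = preproc.get_feature_names_out()
print(len(feature_names), X_train_t.shape[1])  # these two should match
If the production data can contain categories that were not seen during fitting, consider creating the ohe with handle_unknown='ignore'.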

Python's "StandardScaler" and "LabelEncoder", and "fit" and "fit_transform" do not work with a CSV which contains both float and string

I was learning the MLP regressor on Google Colaboratory and ran this code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array(table)  # table is the DataFrame loaded from the Kaggle CSV mentioned below
scaler.fit(data)
y_index = data.shape[1]-1
sd_x = (scaler.var_[:y_index])**0.5
sd_y = (scaler.var_[y_index])**0.5
mean_x = scaler.mean_[:y_index]
mean_y = scaler.mean_[y_index]
x = (data[:, :y_index]).astype(np.float32)
y = (data[:, y_index]).astype(np.float32)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25)
print('Separate training and testing sets!')
It gave the error ValueError: could not convert string to float: 'Photo Editor & Candy Camera & Grid & ScrapBook'.
So I checked the question RandomForestClassfier.fit(): ValueError: could not convert string to float. I also tried sklearn-LinearRegression: could not convert string to float: '--'.
I changed from fit(data) to fit_transform(data), but the same error persisted. Then I changed from StandardScaler to LabelEncoder, i.e. from scaler = StandardScaler() to scaler = LabelEncoder(). But a different error appeared: ValueError: bad input shape (10841, 13) on the line scaler.fit_transform(data).
You can check the CSV on Kaggle here. It contains both strings and numbers without quotation marks (except the prices, which are wrapped in double quotation marks).
From the documentation of sklearn's LabelEncoder: This transformer should be used to encode target values, i.e. y, and not the input X.
In particular, it's not intended for fitting a LabelEncoder on the full dataset.
If you just want to replace the values of the categorical (i.e. string-valued) columns with unique numeric ids, one way to go is to apply the label encoder (before splitting the data) to each column you want to encode individually. Since your sample code imports pandas, I assume that your data has been loaded into a pandas.DataFrame like
df = pd.read_csv('/path/to/googleplaystore.csv')
From there, you can apply the encoder on each column:
df['App'] = LabelEncoder().fit_transform(df['App'].values)
You may also want to have a look at how to handle categorical data within pandas.
However, even after doing this for each non-numeric column in your dataset, there is still a long way to go before you can fit a model on the encoded data (you may want to apply one-hot encoding to these columns afterwards, but that heavily depends on the model you want to use).
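For reference, a rough sketch of that per-column encoding, assuming the DataFrame has been loaded as df; keeping one encoder per column lets you invert the mapping later:
from sklearn.preprocessing import LabelEncoder
encoders = {}
for col in df.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    # astype(str) avoids errors on missing values (they become the literal string 'nan')
    df[col] = encoders[col].fit_transform(df[col].astype(str))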
StandardScaler is a preprocessing class from sklearn that takes numeric entries and standardizes them to zero mean and unit variance. It doesn't deal with text data, which explains the first error.
LabelEncoder is another preprocessing class from sklearn that maps categorical values to a numeric encoded representation.
Ex: ["apple","banana","apple","banana"] to [0,1,0,1]
Your dataset has missing values; you should deal with them first, by imputing, dropping, or some similar approach.
Then you should convert the column types (all but Rating are read as object, i.e. string) so that each datatype is handled properly.
import pandas as pd

table = pd.read_csv('googleplaystore.csv')
# check dataset info
table.info()
# check missing values
table.isna().sum()
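A rough sketch of those two steps; the exact column names are assumptions based on the Kaggle file, so adjust as needed:
import pandas as pd
table = pd.read_csv('googleplaystore.csv')
# Simplest way to handle missing values: drop the affected rows (imputing is the gentler option).
table = table.dropna()
# Convert numeric-looking text columns; errors='coerce' turns unparsable entries into NaN.
table['Reviews'] = pd.to_numeric(table['Reviews'], errors='coerce')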
To be honest, I think this is more of a conceptual problem than a technical one. As other users told you, StandardScaler must be used on numeric columns, but most of your dataframe columns are of object type. You should probably use OneHotEncoder on them; all transformers in sklearn behave in a similar way.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(X) # your data without target column
# ...blabla...
Finally, I recommend you read about Pipelines in sklearn; I think they are more elegant than a lot of messy code. You can put preprocessing and model steps in the same pipeline, for example here.
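As a sketch only (the choice of estimator, and what X and y are, depend on what you actually want to predict), a pipeline that chains the encoder and a model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
pipe = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown='ignore')),
    ("reg", LinearRegression()),  # placeholder estimator; swap in the model you want
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
pipe.fit(X_train, y_train)  # encoding and fitting happen in one step
print(pipe.score(X_test, y_test))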

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine the number of unique categories, and the number of cases for each category
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
               'first_order_channel','days_since_first_order','total_number_of_orders','return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Testing on 75% of the data
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using, but I am trying to predict 'size_womenswear'. There are 8 different sizes that I have encoded to predict, and I have moved this column to the end of the dataframe, so y is the dependent variable and x holds the independent variables (all the other columns).
I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
Can't comment, so just throwing out some ideas:
Maybe you need to deal with class imbalance, and try another model that fits the data better? Try the xgboost or lightgbm packages; given good data they usually perform pretty well, but it really depends on the data.
Also, with the way you split train and test, do the resulting train and test sets have a similar distribution for your y? That's very important.
Last thing: for classification models, performance measurement can be a bit tricky, so try some other metrics. Look at F1 scores, or draw a confusion matrix and see what your predictions vs. y look like; perhaps your model is predicting everything as one or just a few classes.
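A small sketch of those checks, reusing the variables from the question's code (yTrain, yTest, yPredicted):
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
# Class balance of the target: one dominant size makes a plain accuracy figure misleading.
print(pd.Series(yTrain).value_counts(normalize=True))
# Per-class precision/recall/F1 and the confusion matrix show whether the model is lumping most samples into one or two sizes.
print(classification_report(yTest, yPredicted))
print(confusion_matrix(yTest, yPredicted))
Passing stratify=y to train_test_split also keeps the class distribution similar between the train and test sets.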

scikit-learn logistic regression feature importance

I'm looking for a way to get an idea of the impact of the features I'm using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understood that the .coef_ attribute gets me the information I'm after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).
The first few lines of my matrix:
phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True
Where the first line is the header, followed by the data (I use the preprocessing module's LabelEncoder in my code to convert this to ints).
Now, when I do a
print(classifier.coef_)
I get
[[ 0.84768459 -0.56344453 0.00365928 0.21441586 -1.70290447 -0.18460676
1.6167634 0.08556331 0.02152226 -0.05111953 0.07310608 -0.073653 ]]
which contains 12 columns/elements. I'm confused by this, since my data contains 13 columns (plus the 14th one with the label, I'm separating the features from the labels later on in my code).
I was wondering if maybe sklearn expects/assumes the first column to be the id and doesn't actually use the value of this column? But I cannot find any info on this.
Any help here would be much appreciated!
Not sure how to edit my original question in a way that it would still make sense for future reference, so I'll post a minimal example here:
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy
headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]
df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
df = df.apply(lambda x: d[x.name].transform(x))
testrows = []
trainrows = []
splitIndex = len(matrix)/10
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)
classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)
I think I may have found the source of the error (thanks @Alexey Trofimov for pointing me in the right direction). My code at first contained:
train_features = traindf.iloc[:,1:len(headers)-1]
Which was copied from another script, where I did have ids as the first column in my matrix and hence didn't want to take them into account. The len(headers)-1, if I understand things correctly, is there so that the actual label is not taken into account. Testing this in a real-world scenario, deleting the -1 results in a perfect f-score, which would make sense, since the label itself would then be included as a feature and the model would always predict correctly...
So I now changed this to
train_features = traindf.iloc[:,0:len(headers)-1]
as in the code snippet, and now get 13 columns (in X_train.shape, and consequently in classifier.coef_).
I think this solved my issue, but am still not 100% convinced, so if someone could point out an error in this line of reasoning/my code above, I'd be grateful to hear about it.
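For completeness, here is a small sketch that pairs each coefficient with its column name, reusing headers and classifier from the snippet above. Note that coefficients on label-encoded categorical columns are hard to read as importances, since the integer codes imply an ordering that isn't really there:
feature_names = headers[:-1]  # all columns except the is_topic label
for name, coef in zip(feature_names, classifier.coef_[0]):
    print(f"{name}: {coef:.4f}")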

What does this error mean with StratifiedShuffleSplit?

I'm totally new to Data Science in general and was hoping someone could explain why this does not work:
I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data.
Next, I try to use sklearn's StratifiedShuffleSplit function to divide the data into training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, random_state=0) # can use test_size=0.8
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]):  # Generate indices to split data into training and test sets.
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.
I think the problem is that your data_y is a 2D matrix,
but as I see in the sklearn.model_selection.StratifiedShuffleSplit docs, it should be a 1D vector. Try encoding each row of data_y as an integer (it will be interpreted as a class label), and then use split.
Or possibly your y is a regression variable (continuous numerical data). (Vivek's link)
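If sales is indeed continuous, a common workaround (a sketch, assuming data is the Advertising DataFrame from the question) is to stratify on a binned copy of the target rather than on the raw values:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
# Bin the continuous target into a handful of quantile-based classes, used only for stratification.
sales_bins = pd.qcut(data["sales"], q=5, labels=False)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(data.drop("sales", axis=1), sales_bins):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]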
