Pandas Array Exception: Data must be 1-Dimensional

This is my Python script for applying the Markov blanket algorithm to my dataset:
df1 = read_csv("input-binary-120-training.csv")
Y1 = df1[df1.CategoryL == 1].CategoryL
X1 = minmax_scale(df1[df1.CategoryL == 1].ix[:, 1:24], axis = 0)
y_train = Y1.values
df2 = read_csv("input-binary-120-test.csv")
Y2 = df2[df2.CategoryL == 1].CategoryL
X2 = minmax_scale(df2[df2.CategoryL == 1].ix[:, 1:24], axis = 0)
y_test = Y2.values
x_test = X2.reshape(X2.shape[0], X2.shape[1], 1)
seed(2017)
kfold = KFold(n_splits=5, random_state=27, shuffle=True)
scores = list()
# Create a PyImpetus classification object and initialize with required parameters
model = PPIMBC(LogisticRegression(random_state=27, max_iter=1000, class_weight="balanced"), cv=0, num_simul=20, simul_type=0, simul_size=0.2, random_state=27, sig_test_type="non-parametric", verbose=2, p_val_thresh=0.05)
x_train = model.fit_transform(X1, Y1)
x_test = model.transform(x_test)
print("Markov Blanket: ", model.MB)
But for the line x_train = model.fit_transform(X1, Y1) I got the exception:
Data must be 1-Dimensional.
I used X1.flatten() but it doesn't work. Could you please advise me about this issue?
Full error:
x_train = model.fit_transform(X1, Y1)
File "/home/osboxes/Downloads/Thesis/PyImpetus.py", line 326, in fit_transform
self.fit(data, Y)
File "/home/osboxes/Downloads/Thesis/PyImpetus.py", line 299, in fit
final_MB, final_feat_imp = self._find_MB(data.copy(), Y)
File "/home/osboxes/Downloads/Thesis/PyImpetus.py", line 221, in _find_MB
Y = np.reshape(Y, (-1, 1))
File "<__array_function__ internals>", line 6, in reshape
File "/home/osboxes/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 299, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)
File "/home/osboxes/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 55, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
File "/home/osboxes/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 48, in _wrapit
result = wrap(result)
File "/home/osboxes/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 1999, in __array_wrap__
return self._constructor(result, **d).__finalize__(self)
File "/home/osboxes/venv/lib/python3.6/site-packages/pandas/core/series.py", line 311, in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
File "/home/osboxes/venv/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 729, in sanitize_array
raise Exception("Data must be 1-dimensional")
Exception: Data must be 1-dimensional

Try reshaping Y1, with either Y1 = Y1[:, 0] or Y1 = Y1.ravel(), so that it is one-dimensional.
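For illustration, a minimal sketch (assuming the PPIMBC API exactly as used in the question): pass a plain 1-D NumPy array instead of the pandas Series, so the library's internal np.reshape is not forced to wrap a 2-D result back into a Series.

y1 = df1[df1.CategoryL == 1].CategoryL.values   # 1-D ndarray instead of a Series
x_train = model.fit_transform(X1, y1)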


Matrix and vector shape in TVP-VAR in the statespace mlemodels

Thanks to everyone in advance for their time!
I am trying to run a TVP-VAR (time-varying parameter VAR) for a panel using the state-space MLEModel classes in statsmodels. I get an error while trying to fit the model. My understanding is that the issue mostly concerns the dimensions of the state covariance matrix. I suspect I will get a similar error later when I deal with the column shape.
The error is:
ValueError: Invalid dimensions for state covariance matrix matrix: requires 702 rows, got 3
How could I solve that? The full error and traceback are shown below:
preliminary = tvppanelvarmodel.fit(maxiter=1000)
Traceback (most recent call last):
File "/var/folders/m6/68zljfsj2t9_dzgpwwslj29r0000gp/T/ipykernel_6232/3038987883.py", line 1, in <module>
preliminary = tvppanelvarmodel.fit(maxiter=1000)
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 704, in fit
mlefit = super(MLEModel, self).fit(start_params, method=method,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/base/model.py", line 563, in fit
xopt, retvals, optim_settings = optimizer._fit(f, score, start_params,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/base/optimizer.py", line 241, in _fit
xopt, retvals = func(objective, gradient, start_params, fargs, kwargs,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/base/optimizer.py", line 651, in _fit_lbfgs
retvals = optimize.fmin_l_bfgs_b(func, start_params, maxiter=maxiter,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/lbfgsb.py", line 197, in fmin_l_bfgs_b
res = _minimize_lbfgsb(fun, x0, args=args, jac=jac, bounds=bounds,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/lbfgsb.py", line 306, in _minimize_lbfgsb
sf = _prepare_scalar_function(fun, x0, jac=jac, args=args, epsilon=eps,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/optimize.py", line 261, in _prepare_scalar_function
sf = ScalarFunction(fun, x0, args, grad, hess,
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 140, in __init__
self._update_fun()
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 233, in _update_fun
self._update_fun_impl()
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in update_fun
self.f = fun_wrapped(self.x)
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 134, in fun_wrapped
return fun(np.copy(x), *args)
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/base/model.py", line 531, in f
return -self.loglike(params, *args) / nobs
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 933, in loglike
self.update(params, transformed=True, includes_fixed=True,
File "/var/folders/m6/68zljfsj2t9_dzgpwwslj29r0000gp/T/ipykernel_6232/3786466608.py", line 104, in update
self['state_cov'] = np.diag([params[2]**2, params[3]**2, params[4]**2]) # W
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 239, in __setitem__
return self.ssm.__setitem__(key, value)
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/representation.py", line 420, in __setitem__
setattr(self, key, value)
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/representation.py", line 54, in __set__
value = self._set_matrix(obj, value, shape)
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/representation.py", line 68, in _set_matrix
validate_matrix_shape(
File "/opt/anaconda3/envs/spyder-env/lib/python3.10/site-packages/statsmodels/tsa/statespace/tools.py", line 1474, in validate_matrix_shape
raise ValueError('Invalid dimensions for %s matrix: requires %d'
ValueError: Invalid dimensions for state covariance matrix matrix: requires 702 rows, got 3
When I defined the class initially, I did the following
class TVPVAR(sm.tsa.statespace.MLEModel):
    def __init__(self, y):
        # Create a matrix with [y_t' : y_{t-1}'] for t = 2, ..., T
        augmented = sm.tsa.lagmat(y, 1, trim='both', original='in', use_pandas=True)
        # Separate into y_t and z_t = [1 : y_{t-1}']
        p = y.shape[1]
        y_t = augmented.iloc[:, :p]
        z_t = sm.add_constant(augmented.iloc[:, p:])
        nobs = y.shape[0]
        T = y.shape[0]
        # Recall that the length of the state vector is p * (p + 1)
        k_states = p * (p + 1)
        super(TVPVAR, self).__init__(y_t, exog=None, k_states=k_states, k_posdef=k_states)
        self.k_y = p
        self.k_states = p * (p + 1)
        self.nobs = T

        self['design'] = np.zeros((self.k_y, self.k_states, 1))
        self['transition'] = np.eye(k_states)  # G
        self['selection'] = np.eye(k_states)   # R = 1

    def update_variances(self, obs_cov, state_cov_diag):
        self['obs_cov'] = obs_cov
        self['state_cov'] = np.diag(state_cov_diag)  # W

        init = initialization.Initialization(self.k_states)
        init.set((0, 2), 'diffuse')
        init.set((2, 4), 'stationary')
        self.ssm.initialize(init)

    # OTHER CODE

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self['transition', 2, 2] = params[0]
        self['transition', 3, 2] = params[1]
        self['state_cov'] = np.diag([params[2]**2, params[3]**2, params[4]**2])  # W
How can I define the dimensions of the state covariance matrix and the vector shape? Thanks for your inputs.
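One hedged sketch of where the mismatch may come from (an assumption, not a verified answer): in statsmodels' state-space representation the state_cov matrix must be square with k_posdef rows, and __init__ above sets k_posdef=k_states, so the 3x3 diagonal assigned in update() cannot fit. Two possible ways to reconcile the shapes, with the shock indices (2, 3, 4) taken from the question's update():

# Option 1: declare only the 3 shocks that are actually estimated
super(TVPVAR, self).__init__(y_t, exog=None, k_states=k_states, k_posdef=3)
selection = np.zeros((k_states, 3))
selection[2, 0] = selection[3, 1] = selection[4, 2] = 1.0   # route each shock to its state
self['selection'] = selection
# ... in update(), a 3x3 state_cov is then dimensionally valid:
self['state_cov'] = np.diag([params[2]**2, params[3]**2, params[4]**2])

# Option 2: keep k_posdef == k_states and assign a full-size matrix in update()
state_cov = np.zeros((k_states, k_states))
state_cov[2, 2], state_cov[3, 3], state_cov[4, 4] = params[2]**2, params[3]**2, params[4]**2
self['state_cov'] = state_cov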

ValueError: could not convert string to float -- problem with labels

I'm getting the following error when trying to extract features. I have to do the split before the feature extraction. The y contains the labels, so I'm not sure why I'm getting the error. The data is wav files and the labels are originally text.
for cls in os.listdir(path):
    for sound in tqdm(os.listdir(os.path.join(path, cls))):
        wav = librosa.load(os.path.join(os.path.join(path, cls, sound)), sr=16000)[0].astype(np.float32)
        tmp_samples.append(wav[0])
        tmp_labels.append(cls)

tmp_labels = np.array(tmp_labels)

X_train, y_train, X_test, y_test = train_test_split(tmp_samples, tmp_labels, test_size=0.60, shuffle=True)

encoder = LabelBinarizer()
y_test = encoder.fit_transform(y_test)

minmax_scaler = MinMaxScaler()

X_train = np.asarray(X_train).reshape(-1, 1)
X_train = minmax_scaler.fit_transform(X_train)
X_test = np.asarray(X_test).reshape(-1, 1)
X_test = minmax_scaler.fit_transform(X_test)

y_test = encoder.fit_transform(y_test)

for x, y in zip(X_test, y_test):
    extract_features(x[0], y, model, plain_samples, plain_labels)

def extract_features(wav, cls, model, plain_samples, plain_labels):
    for feature in model(wav)[1]:
        plain_samples.append(feature)
        plain_labels.append(cls)
Error:
Traceback (most recent call last):
File "optunaCopy.py", line 523, in <module>
main(sys.argv[1:])
File "optunaCopy.py", line 439, in main
X_train, y_train , X_test , y_test,X_valid,y_valid = create_dataset(path)
File "optunaCopy.py", line 129, in create_dataset
X_test = minmax_scaler.fit_transform( X_test )
File "C:\Users\x\anaconda3\envs\yamnet\lib\site-packages\sklearn\base.py", line 844, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\x\anaconda3\envs\yamnet\lib\site-packages\sklearn\preprocessing\_data.py", line 416, in fit
return self.partial_fit(X, y)
File "C:\Users\x\anaconda3\envs\yamnet\lib\site-packages\sklearn\preprocessing\_data.py", line 458, in partial_fit
force_all_finite="allow-nan",
File "C:\Users\x\anaconda3\envs\yamnet\lib\site-packages\sklearn\base.py", line 557, in _validate_data
X = check_array(X, **check_params)
File "C:\Users\x\anaconda3\envs\yamnet\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\x\anaconda3\envs\yamnet\lib\site-packages\numpy\core\_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'hh'
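A likely source of the stray label string (my observation, not from the original thread): scikit-learn's train_test_split returns the splits in the order X_train, X_test, y_train, y_test, so unpacking it as X_train, y_train, X_test, y_test puts the label array into X_test, and MinMaxScaler then fails on a class name such as 'hh'. A minimal corrected unpacking would look like:

from sklearn.model_selection import train_test_split

# note the return order: features first, then labels
X_train, X_test, y_train, y_test = train_test_split(
    tmp_samples, tmp_labels, test_size=0.60, shuffle=True)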

In TensorFlow 1.x, can Keras not be trained with tf.data?

I want to use the tf.data library for training speed, but my code raises the error message below.
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 675, in fit
steps_name='steps_per_epoch')
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 169, in model_iteration
ins = _prepare_feed_values(model, inputs, targets, sample_weights, mode)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 535, in _prepare_feed_values
extract_tensors_from_dataset=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2471, in _standardize_user_data
exception_prefix='input')
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 517, in standardize_input_data
standardize_single_array(x, shape) for (x, shape) in zip(data, shapes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 517, in <listcomp>
standardize_single_array(x, shape) for (x, shape) in zip(data, shapes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 442, in standardize_single_array
if (x.shape is not None and len(x.shape) == 1 and
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 827, in __len__
raise ValueError("Cannot take the length of shape with unknown rank.")
ValueError: Cannot take the length of shape with unknown rank.
I don't know why, or how I can fix it.
My dataset code is as follows.
def _py_parse_line(line):
    line = line.decode('utf-8')
    parsed_line = line.split("\t")
    label = int(parsed_line[0])
    rawSentence = str(parsed_line[1])
    morphemePOS = str(parsed_line[2])
    NE_LIST = str(parsed_line[3])

    preprocessedDatum = FTV.featureToVectorLexMorpDictNElist(
        [rawSentence], [morphemePOS], [NE_LIST]
    )
    syllable_feature = preprocessedDatum[0][0]
    sdiDict_feature = preprocessedDatum[1][0]
    morphemePos_feature = preprocessedDatum[2][0]
    neDict_feautre = preprocessedDatum[3][0]

    return syllable_feature, sdiDict_feature, morphemePos_feature, neDict_feautre, label


def _decode_tsv(line):
    type_list = [tf.int32, tf.int8, tf.int32, tf.int8, tf.int64]
    data_list = tf.py_func(_py_parse_line, [line], type_list)
    label = data_list[4]
    features = {"lexical_input": data_list[0],
                "dictInfo_input": data_list[1],
                "morphemePos_input": data_list[2],
                "ne_input": data_list[3]}
    d = features, label
    return d


dataset = tf.data.TextLineDataset("/root/Workspace/ias_sdi_trainer/generatedModel/sdi_nugu.dtg.wholeData.txt")
dataset = dataset.shuffle(4096)
dataset = dataset.map(_decode_tsv)
dataset = dataset.batch(batchSize)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
I looked inside the tensors with make_one_shot_iterator() and tf.Session, and the data was there as I expected.
I saw this question too, but it did not help me: Cannot take the length of Shape with unknown rank
Is there any good way to use tf.data with Keras?
I want to use data from text lines in tf.data.
Added:
My TensorFlow version is 1.14.
I used the dataset in fit, as below:
sdi_model.fit(dataset, epochs=MAX_EPOC)
I also tried using the dataset with a generator. The code is as below:
dataFetcher = dataset.make_one_shot_iterator()

def gen(dataFetcher):
    with tf.Session() as sess:
        while True:
            next_elem = dataFetcher.get_next()
            x_batch, y_batch = sess.run(next_elem)
            yield x_batch, y_batch

sdi_model.fit_generator(gen(dataFetcher), steps_per_epoch=100, epochs=MAX_EPOC)
With the generator, I received an error message: "sess is empty graph".
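A hedged sketch of one common workaround (an assumed cause, not confirmed in this thread): tf.py_func returns tensors of unknown rank, and Keras' input standardization fails on unknown-rank shapes, so restoring the static shapes inside the map function usually resolves the error. The lengths below (SYL_LEN, DICT_LEN, POS_LEN, NE_LEN) are placeholders for the real feature sizes produced by FTV.featureToVectorLexMorpDictNElist:

def _decode_tsv(line):
    type_list = [tf.int32, tf.int8, tf.int32, tf.int8, tf.int64]
    data_list = tf.py_func(_py_parse_line, [line], type_list)
    # py_func loses static shape information; set it back explicitly
    data_list[0].set_shape([SYL_LEN])
    data_list[1].set_shape([DICT_LEN])
    data_list[2].set_shape([POS_LEN])
    data_list[3].set_shape([NE_LEN])
    data_list[4].set_shape([])          # scalar label
    features = {"lexical_input": data_list[0],
                "dictInfo_input": data_list[1],
                "morphemePos_input": data_list[2],
                "ne_input": data_list[3]}
    return features, data_list[4]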

TensorFlow/Keras: Why do I get "ValueError: Incompatible conversion from float32 to uint8" when calling fit?

I use TensorFlow 1.12 with eager execution. When I call
model.fit(train, steps_per_epoch=int(np.ceil(num_train_samples / BATCH_SIZE)), epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, validation_data=val, validation_steps=int(np.ceil(num_val_samples / BATCH_SIZE)))
I get the following error:
ValueError: Incompatible type conversion requested to type 'uint8' for variable of type 'float32'
As far as I know, I am not converting from uint8 to float32 anywhere, at least not explicitly.
My datasets are generated as follows:
train = tf.data.Dataset.from_generator(generator=train_sample_fetcher, output_types=(tf.uint8, tf.float32))
train = train.repeat()
train = train.batch(BATCH_SIZE)
train = train.shuffle(10)
val = tf.data.Dataset.from_generator(generator=val_sample_fetcher, output_types=(tf.uint8, tf.float32))
employing the following generator functions:
def train_sample_fetcher():
    return sample_fetcher()

def val_sample_fetcher():
    return sample_fetcher(is_validations=True)

def sample_fetcher(is_validations=False):
    sample_names = [filename[:-4] for filename in os.listdir(DIR_DATASET + "ndarrays/")]
    if not is_validations:
        sample_names = sample_names[:int(len(sample_names) * TRAIN_VAL_SPLIT)]
    else:
        sample_names = sample_names[int(len(sample_names) * TRAIN_VAL_SPLIT):]
    for sample_name in sample_names:
        rgb = tf.image.decode_jpeg(tf.read_file(DIR_DATASET + sample_name + ".jpg"))
        rgb = tf.image.resize_images(rgb, (HEIGHT, WIDTH))
        #d = tf.image.decode_jpeg(tf.read_file(DIR_DATASET + "depth/" + sample_name + ".jpg"))
        #d = tf.image.resize_images(d, (HEIGHT, WIDTH))
        #rgbd = tf.concat([rgb,d], axis=2)
        onehots = tf.convert_to_tensor(np.load(DIR_DATASET + "ndarrays/" + sample_name + ".npy"), dtype=tf.float32)
        yield rgb, onehots
---------------------------------------------------------------------------------
For reference, the full stacktrace:
Traceback (most recent call last):
File "tensorflow/python/ops/gen_nn_ops.py", line 976, in conv2d
"data_format", data_format, "dilations", dilations)
tensorflow.python.eager.core._FallbackException: Expecting int64_t value for attr strides, got numpy.int32
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "dir/to/my_script.py", line 100, in <module>
history = model.fit(train, steps_per_epoch=int(np.ceil(num_train_samples / BATCH_SIZE)), epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, validation_data=val, validation_steps=int(np.ceil(num_val_samples / BATCH_SIZE)))
File "/tensorflow/python/keras/engine/training.py", line 1614, in fit
validation_steps=validation_steps)
File "/tensorflow/python/keras/engine/training_eager.py", line 705, in fit_loop
batch_size=batch_size)
File "/tensorflow/python/keras/engine/training_eager.py", line 251, in iterator_fit_loop
model, x, y, sample_weights=sample_weights, training=True)
File "/tensorflow/python/keras/engine/training_eager.py", line 511, in _process_single_batch
training=training)
File "/tensorflow/python/keras/engine/training_eager.py", line 90, in _model_loss
outs, masks = model._call_and_compute_mask(inputs, **kwargs)
File "/tensorflow/python/keras/engine/network.py", line 856, in _call_and_compute_mask
mask=masks)
File "/tensorflow/python/keras/engine/network.py", line 1029, in _run_internal_graph
computed_tensor, **kwargs)
File "/tensorflow/python/keras/engine/network.py", line 856, in _call_and_compute_mask
mask=masks)
File "/tensorflow/python/keras/engine/network.py", line 1031, in _run_internal_graph
output_tensors = layer.call(computed_tensor, **kwargs)
File "/tensorflow/python/keras/layers/convolutional.py", line 194, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/tensorflow/python/ops/nn_ops.py", line 868, in __call__
return self.conv_op(inp, filter)
File "/tensorflow/python/ops/nn_ops.py", line 520, in __call__
return self.call(inp, filter)
File "/tensorflow/python/ops/nn_ops.py", line 204, in __call__
name=self.name)
File "/tensorflow/python/ops/gen_nn_ops.py", line 982, in conv2d
name=name, ctx=_ctx)
File "/tensorflow/python/ops/gen_nn_ops.py", line 1015, in conv2d_eager_fallback
_attr_T, _inputs_T = _execute.args_to_matching_eager([input, filter], _ctx)
File "/tensorflow/python/eager/execute.py", line 195, in args_to_matching_eager
ret = [internal_convert_to_tensor(t, dtype, ctx=ctx) for t in l]
File "/tensorflow/python/eager/execute.py", line 195, in <listcomp>
ret = [internal_convert_to_tensor(t, dtype, ctx=ctx) for t in l]
File "/tensorflow/python/framework/ops.py", line 1146, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/tensorflow/python/ops/variables.py", line 828, in _TensorConversionFunction
"of type '%s'" % (dtype.name, v.dtype.name))
ValueError: Incompatible type conversion requested to type 'uint8' for variable of type 'float32'
The first thrown error is about NumPy ndarrays, but I convert those to TensorFlow tensors right after I import them. Any suggestions are greatly appreciated! I checked for any np.int32 types, but was not able to find any.
The output of tf.image.resize_images is a tensor of type float, and therefore the rgb tensor returned from sample_fetcher() is a float tensor. However, when calling the Dataset.from_generator() method, you are specifying the output type of the first generated element as tf.uint8 (i.e. output_types=(tf.uint8, tf.float32)). Therefore a conversion is requested that cannot actually be performed. Change it to tf.float32 (i.e. output_types=(tf.float32, tf.float32)) for both the train and validation generators, and the problem should be fixed.
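For concreteness, the answer's fix applied to the question's code (same variable names as above):

train = tf.data.Dataset.from_generator(generator=train_sample_fetcher,
                                       output_types=(tf.float32, tf.float32))
val = tf.data.Dataset.from_generator(generator=val_sample_fetcher,
                                     output_types=(tf.float32, tf.float32))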

How to use SMOTENC inside pipeline (Error: Some of the categorical indices are out of range)?

I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote:
# Data
XX = pd.read_csv('Financial Distress.csv')
y = np.array(XX['Financial Distress'].values.tolist())
y = np.array([0 if i > -0.50 else 1 for i in y])
Na = np.array(pd.read_csv('Na.csv', header=None).values)
XX = XX.iloc[:, 3:127]

# Use get_dummies to convert categorical features into dummy ones
dis_features = ['x121']
X = pd.get_dummies(XX, columns=dis_features)

# Divide data into train and test
indices = np.arange(y.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, stratify=y, test_size=0.3, random_state=42)

num_indices = list(X)[:X.shape[1] - 37]
cat_indices = list(X)[X.shape[1] - 37:]
num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices)),
        # Numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])
pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)

# Grid search to determine best params
cv = StratifiedKFold(n_splits=5, random_state=42)
rg_cv = GridSearchCV(pipeline_with_resampling, param_grid, cv=cv, scoring='f1')
rg_cv.fit(X_train, y_train)
As indicated, I have 5 categorical features. In fact, indices 123 to 160 all belong to one categorical feature with 37 possible values, which get_dummies converted into 37 columns. Unfortunately, it throws the following error:
Traceback (most recent call last):
File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Final Logit/SMOTENC/logit-final - Copy.py", line 424, in <module>
rg_cv.fit(X_train, y_train)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 722, in fit
self._run_search(evaluate_candidates)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 1191, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 711, in evaluate_candidates
cv.split(X, y, groups)))
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
if self.dispatch_one_batch(iterator):
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
self.results = batch()
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 237, in fit
Xt, yt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 200, in _fit
cloned_transformer, Xt, yt, **fit_params_steps[name])
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 342, in __call__
return self.func(*args, **kwargs)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 576, in _fit_resample_one
X_res, y_res = sampler.fit_resample(X, y, **fit_params)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\base.py", line 85, in fit_resample
output = self._fit_resample(X, y)
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote.py", line 940, in _fit_resample
self._validate_estimator()
File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote.py", line 933, in _validate_estimator
' should be between 0 and {}'.format(self.n_features_))
ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 160
Thanks in advance.
As shown below, two pipelines should be used:
num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:120, 121:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 120]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))
cat_indices = [94, 96, 98, 99, 120]

from imblearn.pipeline import make_pipeline

pipeline = Pipeline(steps=[
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),
        # Numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])
pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices), pipeline)
You cannot apply get_dummies to your categorical variables and then use SMOTENC on them, because SMOTENC already implements the equivalent of get_dummies internally in its algorithm, and doing both will bias your model.
However, I recommend using SMOTE() instead of SMOTENC(); in that case you must first apply get_dummies.
You cannot use a scikit-learn pipeline with an imblearn pipeline. The imblearn pipeline implements fit_sample as well as fit_predict; the sklearn pipeline only implements fit_predict. You cannot combine them.
First, don't do the get_dummies. Then change the way you build categorical_features, and pass a list of booleans indicating whether each column is categorical or not.
Try this:
cat_cols = []
for col in x.columns:
    if x[col].dtype == 'object':  # or 'category' if that's the case
        cat_cols.append(True)
    else:
        cat_cols.append(False)
Then pass cat_cols to your SMOTENC:
smote_nc = SMOTENC(categorical_features=cat_cols, random_state=0)
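A short usage sketch building on this answer (rg, param_grid, cv and the un-dummied X_train/y_train are assumed from the question; an illustration, not a tested pipeline):

from imblearn.pipeline import make_pipeline

# resample with SMOTENC on the raw columns, then fit the classifier
pipeline_with_resampling = make_pipeline(smote_nc, rg)
rg_cv = GridSearchCV(pipeline_with_resampling, param_grid, cv=cv, scoring='f1')
rg_cv.fit(X_train, y_train)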
