XGBoost feature_importances_ gives a 0-valued vector

I have experimented with XGBClassifier() on a large dataset of shape [400000, 93].
The data contains a lot of NaN values, so I used imputation from the sklearn package:
imputer = Imputer()
imputed_x = imputer.fit_transform(data)
data = imputed_x
but the feature importance values look like this:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Notice that there is a single 1 and the rest are 0. As a result, the metrics are:
precision: 1.0
recall: 1.0
accuracy: 1.0
training_accuracy: 1.0
Why can't the model fit the data properly?
Example code fragment:
from xgboost import XGBClassifier
model_xboost = XGBClassifier(max_depth=5, n_estimators=100)
# train
model_xboost.fit(train_data, train_labels)
print(model_xboost.feature_importances_)

In the feature importances, there is a single 1 and the rest are 0. It looks as if you have included a column in the training data that is closely related to the target column, so that feature ends up perfectly correlated with the target!
For example, I've come across a classification problem where I used the patient's background and medical parameters to predict whether or not the patient has cancer. There was one column called "data_source" which became the most significant feature. That's purely because patients who come from "XXX Cancer Hospital" will surely have cancer!
This is a good example of unintended data leakage.

You have one feature that is perfectly correlated with the target, with a correlation value of 1.0. That means you have effectively trained your model on the target itself. You must remove that feature from the training data.
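One way to check for this kind of leakage before training is to look at the correlation between each feature column and the labels. The snippet below is only a minimal sketch on synthetic data: the helper name find_leaky_features, the 0.999 threshold, and the leaked column index are all made up for illustration, and a plain Pearson correlation will not catch every form of leakage.

```python
import numpy as np

def find_leaky_features(X, y, threshold=0.999):
    """Return indices of columns whose |Pearson correlation| with y is >= threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    leaky = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # A constant column (or constant target) has no defined correlation.
        if col.std() == 0 or y.std() == 0:
            continue
        corr = np.corrcoef(col, y)[0, 1]
        if abs(corr) >= threshold:
            leaky.append(j)
    return leaky

# Synthetic stand-in for the real data: 200 rows, 5 features, binary labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 3] = y  # hypothetical leaked column: an exact copy of the label

leaky = find_leaky_features(X, y)
# Drop the suspicious columns before fitting the model.
X_clean = np.delete(X, leaky, axis=1)
print(leaky, X_clean.shape)
```

If a column is flagged, it is worth asking where it comes from before simply dropping it, as in the "data_source" example above.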

Related

Iterate over padded area in 2D array in python

Assume I have a 2D array in Python and I add some padding. How can I iterate over the new padded area only?
For example
1 2 3
4 5 6
7 8 9
Becomes
x x x x x x x
x x x x x x x
x x 1 2 3 x x
x x 4 5 6 x x
x x 7 8 9 x x
x x x x x x x
x x x x x x x
How can I loop over only the x's?

Not sure if I understand what you are trying to do, but if you are using numpy, you can use masks:
import numpy as np
arr = np.arange(1, 10).reshape(3, 3)
# mask full of True's
mask = np.ones((7, 7), dtype=bool)
# setting the interior of the mask as False
mask[2:-2, 2:-2] = False
# using zero padding as example
pad_arr = np.zeros((7, 7))
pad_arr[2:-2, 2:-2] = arr
print(pad_arr)
# loop for elements of the padding, where mask == True
for value in pad_arr[mask]:
    print(value)
Returns:
[[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 2. 3. 0. 0.]
[0. 0. 4. 5. 6. 0. 0.]
[0. 0. 7. 8. 9. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]]
and 0.0 40 times (the padded values)
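If you need the positions of the padded cells rather than just their values, the same mask can be turned into coordinates. This is a small sketch that assumes a symmetric padding of width 2 (the name pad and the value -1 written into the padding are just for illustration); it uses np.pad to build the padded array and np.argwhere to list the padded cells.

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)
pad = 2  # assumed padding width on every side

# Zero-pad the array on all four sides.
padded = np.pad(arr, pad_width=pad, mode="constant", constant_values=0)

# True on the padded border, False on the original interior.
mask = np.ones(padded.shape, dtype=bool)
mask[pad:-pad, pad:-pad] = False

# (row, col) coordinates of every padded cell.
padding_coords = [tuple(idx) for idx in np.argwhere(mask)]

# Loop over the padded cells only, e.g. to overwrite them.
for r, c in padding_coords:
    padded[r, c] = -1

print(padded)
```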

How to convert 2D numpy array to One Hot Encoding?

I am trying to apply one-hot encoding to the following data, but I am confused by the output. Before encoding, the shape of the data is (5, 10), and after encoding it is (5, 20). But each letter should be encoded as a 4-element vector, so after one-hot encoding the shape should be (5, 40) instead of (5, 20). How can I solve this?
import numpy as np
from sklearn.preprocessing import OneHotEncoder
X = [['A', 'G', 'T', 'G', 'T', 'C', 'T', 'A', 'A', 'C'],
     ['A', 'G', 'T', 'G', 'T', 'C', 'T', 'A', 'A', 'C'],
     ['G', 'C', 'C', 'A', 'C', 'T', 'C', 'G', 'G', 'T'],
     ['G', 'C', 'C', 'A', 'C', 'T', 'C', 'G', 'G', 'T'],
     ['G', 'C', 'C', 'A', 'C', 'T', 'C', 'G', 'G', 'T']]
Y = np.array(X)
print('Shape of numpy array', Y.shape)
# one hot encoding
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(Y)
print(onehot_encoded)
print('Shape of one hot encoding', onehot_encoded.shape)
Output:
Shape of numpy array (5, 10)
[[1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0.]
[1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0.]
[0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]
[0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]
[0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]]
Shape of one hot encoding (5, 20)

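OneHotEncoder already encodes each column separately, but it only creates categories for the values it actually sees in that column. Every column of your data contains just 2 distinct letters, hence 10 x 2 = 20 columns. To get 4 new columns for each column in your ndarray, encode every column against the full set of classes: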
X = np.array(X)
# Get unique classes.
classes = np.unique(X)
# Replace classes with integers.
X = np.searchsorted(classes, X)
# Get an identity matrix.
eye = np.eye(classes.shape[0])
# Iterate over all columns
# and get one-hot encoding for each column.
X = np.concatenate([eye[i] for i in X.T], axis=1)
X.shape
# (5, 40)
Consider the following example:
[['A', 'G'],
['C', 'C'],
['T', 'A']]
You will get 8 (2 x 4) columns in your one-hot encoded ndarray:
Column 0 | Column 1
A C G T  | A C G T
1 0 0 0  | 0 0 1 0
0 1 0 0  | 0 1 0 0
0 0 0 1  | 1 0 0 0
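The same result can also be obtained by comparing every cell against a fixed alphabet, which does not depend on all four letters appearing in the data. This is only a sketch: the ALPHABET constant and the function name one_hot_fixed_alphabet are made up for illustration, and it assumes every cell holds one of the four letters.

```python
import numpy as np

# Assumed full set of classes, in the column order used in the table above.
ALPHABET = np.array(["A", "C", "G", "T"])

def one_hot_fixed_alphabet(rows, alphabet=ALPHABET):
    """One-hot encode a 2D array of letters against a fixed alphabet."""
    X = np.asarray(rows)
    # Broadcast (n_rows, n_cols, 1) against (n_letters,) to get a boolean
    # array of shape (n_rows, n_cols, n_letters).
    matches = X[..., None] == alphabet
    # Flatten the last two axes: n_letters columns per original column.
    return matches.astype(float).reshape(X.shape[0], -1)

encoded = one_hot_fixed_alphabet([["A", "G"],
                                  ["C", "C"],
                                  ["T", "A"]])
print(encoded.shape)
print(encoded)
```

Applied to the (5, 10) array from the question, this gives a (5, 40) array.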

how to know precinct of an image prediction?

I want to know the prediction confidence for a single image:
classes = model.predict(image)
print(classes)
Output:
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
But I want it to show something like:
[0.95, 0.20 , 0.30 , 0.0 , 0.25 .........]

My conditional variable on my if statement is being changed by the statement, even though it doesn't appear in the statement. Why? [duplicate]

I want to create two matrices and then change the numbers in the second matrix depending on the numbers in the first. So I wrote an if statement on the first matrix which, when true, should change the second matrix. However, it changes both matrices.
My code works perfectly with single numbers. The problem only occurs when I try it with matrices.
import numpy as np
n = 3
matr = np.zeros((n,n))
matr[0][0] = 1
matr2 = matr
print(matr)
[[1. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
print(matr2)
[[1. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
if matr[0][0] == 1:
    matr2[0][0] = 9
print(matr)
[[9. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
print(matr2)
[[9. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
Because "matr" is never assigned to inside my if statement, it shouldn't be altered, right? With plain numbers it behaves as I expect:
x = 1
y = x
if x == 1:
    y = 9
print(x)
1
print(y)
9

Those 2 variables are just two references to the same matrix, not 2 different matrices; matr2 = matr just creates a new reference to the same matrix.
The statement matr2[0][0] = 9 modifies the one and only matrix that exists in your example, and it is exactly the same as using matr[0][0] = 9.
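To get a second, independent matrix, make an explicit copy instead of a plain assignment. A minimal sketch of the difference (the variable names alias and independent are just for illustration):

```python
import numpy as np

matr = np.zeros((3, 3))
matr[0, 0] = 1

alias = matr               # second name for the SAME array
independent = matr.copy()  # new array with its own data

if matr[0, 0] == 1:
    independent[0, 0] = 9  # only the copy changes

print(matr[0, 0])         # the original is untouched
print(independent[0, 0])
print(alias is matr)      # the alias is the very same object
```

np.shares_memory(matr, independent) can be used to confirm that the two arrays no longer share their data.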

transform an adjacency list into a sparse adjacency matrix using python

Using scipy, I was able to transform my data into the following format:
(row, col) (weight)
(0, 0) 5
(0, 47) 5
(0, 144) 5
(0, 253) 4
(0, 513) 5
...
(6039, 3107) 5
(6039, 3115) 3
(6039, 3130) 4
(6039, 3132) 2
How can I transform this into an array or sparse matrix with zeros for missing weight values as such? (based on the data above, column 1 to 46 should be filled with zeros, and so on...)
0 1 2 3 ... 47 48 49 50
1 [0 0 0 0 ... 5 0 0 0 0
2 2 0 1 0 ... 4 0 5 0 0
3 3 1 0 5 ... 1 0 0 4 2
4 0 0 0 4 ... 5 0 1 3 0
5 5 1 5 4 ... 0 0 3 0 1]
I know it is better in terms of memory to keep the data in the format above, but I need it as a matrix for experimentation.

scipy.sparse does it for you.
from scipy.sparse import dok_matrix
your_data = [((2, 7), 1)]
XDIM, YDIM = 10, 10  # Replace with your values
smat = dok_matrix((XDIM, YDIM))
# Assign the entries one by one (dok_matrix.update() is not allowed
# in recent SciPy versions).
for (row, col), weight in your_data:
    smat[row, col] = weight
dense = smat.toarray()
print(dense)
'''
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
'''
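If you only need the dense array, you can also fill it directly with NumPy index arrays. The sketch below uses the first few entries from the question; taking the shape from the largest indices is an assumption, so replace it with your real matrix dimensions.

```python
import numpy as np

# First few ((row, col), weight) entries from the question.
entries = [((0, 0), 5), ((0, 47), 5), ((0, 144), 5), ((0, 253), 4), ((0, 513), 5)]

rows = np.array([r for (r, _), _ in entries])
cols = np.array([c for (_, c), _ in entries])
weights = np.array([w for _, w in entries], dtype=float)

# Assumption: infer the shape from the largest indices present.
shape = (rows.max() + 1, cols.max() + 1)

dense = np.zeros(shape)
# Every missing (row, col) pair stays 0.
dense[rows, cols] = weights

print(dense.shape)
print(dense[0, 45:49])
```

The same three arrays can be passed to scipy.sparse.coo_matrix((weights, (rows, cols)), shape=shape) if you want to keep a sparse matrix and call .toarray() only when needed.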
