Is this data representation correct for One-Hot Encoding? (Python)

I am trying to encode the mushroom dataset here (https://www.kaggle.com/uciml/mushroom-classification) using One-Hot Encoding. Here is the code that I used (in Python) for the encoding:
from sklearn.preprocessing import OneHotEncoder
second_df = OneHotEncoder(handle_unknown='ignore').fit_transform(new_df)
print(second_df)
The result of my code is shown in this image, and it leaves me quite confused: Result for the encoding.
Is this result the right representation for my One-Hot Encoding? If not, what shall I do to fix the code?

The output looks a bit unusual because OneHotEncoder returns a sparse matrix by default:
OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
Sparse output is printed as (row, col) non_zero_value triples, where every coordinate not listed is zero:
(0, 1) 1.0 # value 1.0 at row 0, col 1
(0, 7) 1.0 # value 1.0 at row 0, col 7
...
To get a dense array instead,
either set sparse=False:
OneHotEncoder(sparse=False).fit_transform(new_df)
or chain toarray:
OneHotEncoder().fit_transform(new_df).toarray()
Output:
array([[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 1., 0.],
[1., 0., 0., ..., 0., 1., 0.],
...,
[0., 0., 1., ..., 0., 1., 0.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 1., 0.]])
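Note: in newer scikit-learn releases (1.2 and later) the sparse parameter was renamed to sparse_output, and the old name was later removed, so on a recent install the dense version would be (a sketch against the question's new_df):
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2: sparse_output=False returns a dense numpy array directly
second_df = OneHotEncoder(handle_unknown='ignore', sparse_output=False).fit_transform(new_df)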

Related

Creating many state vectors and saving them in a file

I want to create m matrices, each an n x 1 numpy array. Each matrix should have exactly two nonzero entries, in two consecutive rows, with every other row equal to 0: matrix number 1 should have entries m[0,:] = m[1,:] = 1 with all other elements 0, and similarly the last matrix should have entries m[n-2,:] = m[n-1,:] = 1 with the rest 0. So from one matrix to the next, the nonzero elements shift down by two rows. And finally, I would like them to be stored in a dictionary or in a file.
What would be a neat way to do this?
Is this what you're looking for?
In [2]: num_rows = 10 # should be divisible by 2
In [3]: np.repeat(np.eye(num_rows // 2), 2, axis=0)
Out[3]:
array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.]])
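If you want them as m separate n x 1 column vectors in a dictionary, as the question literally asks, here is a sketch building on the same idea (the names stacked and vectors are just illustrations):
import numpy as np

n = 10                                          # number of rows, divisible by 2
stacked = np.repeat(np.eye(n // 2), 2, axis=0)  # shape (n, n//2)
# column m has its ones at rows 2m and 2m+1
vectors = {m: stacked[:, m].reshape(n, 1) for m in range(n // 2)}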
In terms of storage in a file, you can use np.save and np.load.
Note that the default data type for np.eye will be float64. If you expect your values to be small when you begin integrating or whatever you're planning on doing with your state vectors, I'd recommend setting the data type appropriately (like np.uint8 for positive integers < 256 for example).
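For example, a minimal sketch of building, saving and reloading the stacked vectors (the filename is just an illustration):
import numpy as np

num_rows = 10  # should be divisible by 2
states = np.repeat(np.eye(num_rows // 2, dtype=np.uint8), 2, axis=0)
np.save('states.npy', states)      # write the array to disk
restored = np.load('states.npy')   # read it back unchanged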

Understanding Variance Threshold

I am working on a text classification problem in Python, where I build a training array of {0, 1} values indicating whether each word occurs in the text or not.
array([[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
...,
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.]])
Since I want to run an SVM on it, I want to reduce the number of features. In scikit-learn I found this: https://scikit-learn.org/stable/modules/feature_selection.html
with the Variance Threshold set to:
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
x_train_red = sel.fit_transform(x_train)
The reduction shrinks my shape from:
(7808, 2000)
to:
(7808, 97)
Does it only remove features where every row has a 1, or where every row has a 0, or how does it work?
From the documentation you can see that for Boolean features the variance is p(1 - p), where p is the fraction of samples in which the feature is 1. The threshold .8 * (1 - .8) = 0.16 means that any column whose variance does not exceed 0.16 is eliminated, i.e. any column where one of the two values occurs in more than 80% of the samples. So it deletes both the columns with rare occurrences (words that appear in hardly any of your texts) and the columns with near-universal occurrences (words that appear in almost all of them), since in both cases the variance is close to 0.
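A minimal sketch of this behaviour on a toy boolean array:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1, 1],
              [0, 1, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])
# column 0 is all zeros and column 1 is all ones (variance 0 for both);
# column 2 has p = 0.6, so its variance p * (1 - p) = 0.24 clears the threshold
sel = VarianceThreshold(threshold=.8 * (1 - .8))
print(sel.fit_transform(X))   # only the third column survives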

Numpy tuple-index based 2d array additive assignment

If we have a 2d array, for example 100x100, and we want to add values to it at given coordinates, how can we keep a repeated coordinate from overwriting the value it received before?
import numpy as np
vm_map = np.zeros((100,100))
a = np.array([[0,1], [10,10], [40,40], [40,40]])
vm_map[tuple(a.T)] = vm_map[tuple(a.T)] + [1,.5,.3, .2]
print(vm_map[40,40])
We would like this code block to print .5, the sum of the two values sent to [40,40]; instead it prints .2, the last value assigned at that coordinate, because fancy-indexed assignment does not accumulate duplicate indices.
You could use np.add.at for an in-place addition at the specified coordinates:
vm_map = np.zeros((100,100))
a = np.array([[0,1], [10,10], [40,40], [40,40]])
np.add.at(vm_map, tuple(zip(*a)), [1,.5,.3, .2])
vm_map
array([[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
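And checking the coordinate from the question confirms the two contributions were summed rather than overwritten:
print(vm_map[40, 40])   # 0.5 (= .3 + .2)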

Issues using Keras np_utils.to_categorical

I'm trying to turn an array of integer class labels into an array of one-hot vectors that Keras will be able to use to fit my model. Here's the relevant part of the code:
Y_train = np.hstack(np.asarray(dataframe.output_vector)).reshape(len(dataframe),len(output_cols))
dummy_y = np_utils.to_categorical(Y_train)
Below is an image showing what Y_train and dummy_y actually are.
I couldn't find any documentation for to_categorical that could help me.
Thanks in advance.
np_utils.to_categorical is used to convert an array of class labels (integers from 0 to nb_classes - 1) into a matrix of one-hot vectors.
Here is the official docstring, with an example:
In [1]: from keras.utils import np_utils # from keras import utils as np_utils
Using Theano backend.
In [2]: np_utils.to_categorical?
Signature: np_utils.to_categorical(y, num_classes=None)
Docstring:
Convert class vector (integers from 0 to nb_classes) to binary class matrix, for use with categorical_crossentropy.
# Arguments
y: class vector to be converted into a matrix
nb_classes: total number of classes
# Returns
A binary matrix representation of the input.
File: /usr/local/lib/python3.5/dist-packages/keras/utils/np_utils.py
Type: function
In [3]: y_train = [1, 0, 3, 4, 5, 0, 2, 1]
In [4]: """ Assuming the labeled dataset has six classes in total (0 to 5), y_train is the true label array """
In [5]: np_utils.to_categorical(y_train, num_classes=6)
Out[5]:
array([[ 0., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0.]])
from keras.utils.np_utils import to_categorical
UPDATED: keras.utils.np_utils doesn't exist in newer versions; in that case use:
from tensorflow.keras.utils import to_categorical
In both cases the call looks like
to_categorical(y, num_classes)
It assumes the class values are integers starting at 0 (if they were strings, you would label-encode them first), so the classes always run from 0 to n_classes - 1.
For example, consider the array {1, 2, 3, 4, 2}.
Each output row has one column per class, in the order [class 0, class 1, class 2, class 3, class 4]:
array([[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 1., 0., 0.]])
Let's look at another example. For an array with only 3 distinct labels, Y = {4, 8, 9, 4, 9}, to_categorical(Y) will output a matrix with 10 columns, because the width is determined by the largest label (9), not by the number of distinct classes:
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
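Because the width is driven by the largest label, sparse label values waste columns. A minimal sketch that remaps the labels to 0..n-1 first (using scikit-learn's LabelEncoder as one option; that choice is an assumption, not part of the question):
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

Y = np.array([4, 8, 9, 4, 9])
Y_compact = LabelEncoder().fit_transform(Y)   # -> [0, 1, 2, 0, 2]
print(to_categorical(Y_compact))              # a (5, 3) one-hot matrix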

Inexplicable behavior when using vlen with h5py

I am using h5py to build a dataset. Since I want to store arrays whose number of rows differs, I use the h5py special dtype vlen. However, I am seeing behavior I can't explain; maybe you can help me understand what is happening:
>>> import h5py
>>> import numpy as np
>>> fp = h5py.File(datasource_fname, mode='w')
>>> dt = h5py.special_dtype(vlen=np.dtype('float32'))
>>> train_targets = fp.create_dataset('target_sequence', shape=(9549, 5,), dtype=dt)
>>> test
Out[130]:
array([[ 0., 1., 1., 1., 0., 1., 1., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]])
>>> train_targets[0] = test
>>> train_targets[0]
Out[138]:
array([ array([ 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1.], dtype=float32),
array([ 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.], dtype=float32),
array([ 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.], dtype=float32),
array([ 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.], dtype=float32),
array([ 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32)], dtype=object)
I do expect train_targets[0] to be of this shape, but I can't recognize my rows in the array. They seem to be totally jumbled, yet consistently so: every time I run the above code, train_targets[0] looks the same.
To clarify: the first element in my train_targets, in this case test, has shape (5,11), whereas the second element might be of shape (5,38), which is why I use vlen.
Thank you for your help
Mat
I think
train_targets[0] = test
has stored your (5, 11) array in Fortran (column-major) order in a row of train_targets. According to the (9549, 5) shape, that's a row of 5 elements. And since it is vlen, each element is a 1d array of length 11.
That's what you get back in train_targets[0] - an array of 5 arrays, each shape (11,), with values taken from test (order F).
So I think there are 2 issues - what a 2d shape means, and what vlen allows.
My version of h5py is pre v2.3, so I only get string vlen. But I suspect your problem may be that vlen only works with 1d arrays, an extension, so to speak, of byte strings.
Does the 5 in shape=(9549, 5,) have anything to do with 5 in the test.shape? I don't think it does, at least not as numpy and h5py see it.
When I make a file following the string vlen example:
>>> f = h5py.File('foo.hdf5')
>>> dt = h5py.special_dtype(vlen=str)
>>> ds = f.create_dataset('VLDS', (100,100), dtype=dt)
and then do:
ds[0]='this one string'
and look at ds[0], I get an object array with 100 elements, each being this string. That is, I've set a whole row of ds.
ds[0,0]='another'
is the correct way to set just one element.
vlen is 'variable length', not 'variable shape'. While the https://www.hdfgroup.org/HDF5/doc/TechNotes/VLTypes.html documentation is not entirely clear on this, I think you can store 1d arrays with shape (11,) and (38,) with vlen, but not 2d ones.
Actually, the train_targets output is reproduced with:
In [54]: test1 = np.empty((5,), dtype=object)
In [55]: for i in range(5):
   ....:     test1[i] = test.T.flatten()[i:i+11]
It's 11 values taken from the transpose (F order), but shifted by one for each sub-array.
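Given that each vlen cell holds a 1d array, the assignment that keeps your rows intact is element by element, one (11,) row per cell; here is a sketch under the shapes from the question:
>>> for i in range(test.shape[0]):       # 5 rows of test
...     train_targets[0, i] = test[i]    # each cell receives one (11,) float32 row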
