Merge numeric and text features for category classification - python

I'm trying to classify product items to predict their category based on the product title and base price.
An example (product title, price, category):
['notebook sony vaio vgn-z770td dockstation', 3000.0, u'MLA54559']
Previously I was only using the product title for the prediction task, but I'd like to include the price to see if the accuracy improves.
The problem with my code is that I can't merge the text and numeric features. I've been reading some questions here on SO, and this is my code excerpt:
import numpy as np
from scipy import sparse
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# extracting features from text
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform([e[0] for e in training_set])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# extracting numerical features
X_train_price = np.array([e[1] for e in training_set])
X = sparse.hstack([X_train_tfidf, X_train_price])  # this is where the problem begins
clf = svm.LinearSVC().fit(X, [e[2] for e in training_set])
I try to merge the data types with sparse.hstack but I get the following error:
ValueError: blocks[0,:] has incompatible row dimensions
I guess the problem lies in X_train_price (a list of prices), but I don't know how to format it so that the sparse function works successfully.
These are the shapes of both arrays:
>>> X_train_tfidf.shape
(65845, 23136)
>>> X_train_price.shape
(65845,)

It looks to me like this should be as simple as stacking the arrays. If scikit-learn follows the conventions I'm familiar with, then each row in X_train_tfidf is a training datapoint, and there are a total of 65845 points. So you just have to do an hstack -- as you said you tried to do.
However, you need to make sure the dimensions are compatible! In vanilla numpy you get this error otherwise:
>>> a = numpy.arange(15).reshape(5, 3)
>>> b = numpy.arange(15, 20)
>>> numpy.hstack((a, b))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/shape_base.py", line 270, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
ValueError: arrays must have same number of dimensions
Reshape b to have the correct dimensions -- noting that a 1-d array of shape (5,) is totally different from a 2-d array of shape (5, 1).
>>> b
array([15, 16, 17, 18, 19])
>>> b.reshape(5, 1)
array([[15],
       [16],
       [17],
       [18],
       [19]])
>>> numpy.hstack((a, b.reshape(5, 1)))
array([[ 0,  1,  2, 15],
       [ 3,  4,  5, 16],
       [ 6,  7,  8, 17],
       [ 9, 10, 11, 18],
       [12, 13, 14, 19]])
So in your case, you want an array of shape (65845, 1) instead of (65845,). I might be missing something because you are using sparse arrays, but the principle ought to be the same. I have no idea what sparse format you're using based on the above code, so I just picked one to test:
>>> a = scipy.sparse.lil_matrix(numpy.arange(15).reshape(5, 3))
>>> scipy.sparse.hstack((a, b.reshape(5, 1))).toarray()
array([[ 0,  1,  2, 15],
       [ 3,  4,  5, 16],
       [ 6,  7,  8, 17],
       [ 9, 10, 11, 18],
       [12, 13, 14, 19]])
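Applied to your case, a minimal sketch (assuming training_set is the list of [title, price, category] rows from your question) would reshape the price vector into a column before stacking:
import numpy as np
from scipy import sparse
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform([e[0] for e in training_set])
X_train_tfidf = TfidfTransformer().fit_transform(X_train_counts)

# reshape (n_samples,) -> (n_samples, 1) so both blocks have the same number of rows
X_train_price = np.array([e[1] for e in training_set]).reshape(-1, 1)

X = sparse.hstack([X_train_tfidf, X_train_price])  # now (65845, 23137): tfidf columns + 1 price column
clf = svm.LinearSVC().fit(X, [e[2] for e in training_set])
One general remark (not from the answer above): LinearSVC is sensitive to feature scale, so you may want to scale the price column (for example with sklearn's MinMaxScaler) before stacking, since raw prices will dwarf the TF-IDF values.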

Related

PyTorch: different outputs with transpose

Say I have a tensor of shape (B, N^2, C)
and I want to reshape it into (B, C, N, N).
I think I have the two choices below:
A = torch.rand(5, 100, 20)  # Original Tensor
# First Method
B = A.transpose(2, 1)
B = B.reshape(5, 20, 10, 10)
# Second Method
C = A.view(5, 20, 10, 10)
Both methods work, but the outputs are slightly different and I cannot catch the difference between them.
Thanks
The difference between B and C is that you have used torch.transpose, which swaps two axes and therefore changes the layout of the memory. The view at the end is just a nice interface for you to access your data; it has no effect on the underlying data of your tensor. What it comes down to is a contiguous memory data buffer.
If you take a smaller example, something we can grasp more easily:
>>> A = torch.rand(1, 4, 3)
tensor([[[0.4543, 0.9766, 0.0123],
         [0.7447, 0.2732, 0.7260],
         [0.7814, 0.4766, 0.8939],
         [0.3444, 0.0387, 0.8581]]])
Here swapping axis=1 and axis=2 comes down to a batched transpose (in mathematical terms):
>>> B = A.transpose(2, 1)
tensor([[[0.4543, 0.7447, 0.7814, 0.3444],
         [0.9766, 0.2732, 0.4766, 0.0387],
         [0.0123, 0.7260, 0.8939, 0.8581]]])
In terms of memory layout, A has the following memory arrangement:
>>> A.flatten()
tensor([0.4543, 0.9766, 0.0123, 0.7447, 0.2732, 0.7260, 0.7814, 0.4766, 0.8939,
        0.3444, 0.0387, 0.8581])
B, however, has a different layout. By layout I mean memory arrangement; I am not referring to its shape, which is irrelevant here:
>>> B.flatten()
tensor([0.4543, 0.7447, 0.7814, 0.3444, 0.9766, 0.2732, 0.4766, 0.0387, 0.0123,
        0.7260, 0.8939, 0.8581])
As I said, reshaping (i.e. building a view on top of a tensor) doesn't change its memory layout; it's an abstraction level that makes tensors easier to manipulate.
So in the end, yes you end up with two different results: C shares the same data as A, while B is a copy and has a different memory layout.
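A small check (my own sketch, not part of the question's code) that makes this concrete:
import torch

A = torch.rand(1, 4, 3)
B = A.transpose(2, 1).reshape(1, 3, 4)  # reshape copies here, since the transposed view isn't contiguous
C = A.view(1, 3, 4)                     # pure view: shares A's storage

print(torch.equal(B, C))                # False in general: the elements are ordered differently
print(C.data_ptr() == A.data_ptr())     # True: C is just a different "header" over A's data
print(B.data_ptr() == A.data_ptr())     # False: B lives in its own, freshly copied buffer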
Transposing/permuting and view/reshape are NOT the same!
reshape and view only affect the shape of a tensor, but do not change the underlying order of elements.
In contrast, transpose and permute change the underlying order of elements in the tensor. See this answer, and this one for more details.
Here's an example, with B=1, N=3 and C=2, the first channel has even numbers 0..16, and the second channel has odd numbers 1..17:
A = torch.arange(2*9).view(1,9,2)
tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5],
         [ 6,  7],
         [ 8,  9],
         [10, 11],
         [12, 13],
         [14, 15],
         [16, 17]]])
If you correctly transpose and then reshape, you get the correct split into even and odd channels:
A.transpose(1,2).view(1,2,3,3)
tensor([[[[ 0,  2,  4],
          [ 6,  8, 10],
          [12, 14, 16]],

         [[ 1,  3,  5],
          [ 7,  9, 11],
          [13, 15, 17]]]])
However, if you only change the shape (i.e., using view or reshape) you incorrectly "mix" the values from the two channels:
A.view(1,2,3,3)
tensor([[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8]],

         [[ 9, 10, 11],
          [12, 13, 14],
          [15, 16, 17]]]])
Update (Aug 31st, 2022)
Take a look at this simple example:
# original tensor
x = torch.arange(12).view(3,4)
x.data_ptr() # -> 94308398597888
x.stride() # -> (4, 1)
# transpose
x1 = x.transpose(0, 1)
x1.data_ptr() # -> 94308398597888 (same data)
x1.stride() # -> (1, 4) efficient stride representation can handle this
# messing around a bit more:
x1.view(3,4)
# strides cannot cut it anymore - we get an error
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
# using reshape:
x2 = x1.reshape(3, 4)
x2.data_ptr() # -> 94308399099200 (NOT the same data)
x2.stride() # -> (4, 1)

How to slice arrays with a percentage of overlapping

I have a set of data like this:
numpy.array([[3, 7],[5, 8],[6, 19],[8, 59],[10, 42],[12, 54], [13, 32], [14, 19], [99, 19]])
which I want to split into a number of chunks with a percentage of overlap, for each column separately... For example, splitting column 1 into 3 chunks with 50% overlap gives (as a 2-D array):
[[ 3,  5,  6,  8],
 [ 6,  8, 10, 12],
 [10, 12, 13, 14]]
(ignoring the last chunk, which would be [13, 14, 99] and not the same size as the rest).
I'm trying to write a function that takes the array, the number of chunks and the overlap percentage and returns the result.
That's a window function, so use skimage.util.view_as_windows:
from skimage.util import view_as_windows
out = view_as_windows(in_arr[:, 0], window_shape = 4, step = 2)
If you need numpy only, you can use this recipe
For numpy only, quite fast approach is:
import numpy as np

def rolling(a, window, step):
    shape = ((a.size - window) // step + 1, window)
    strides = (step * a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
And you can call it like so:
rolling(arr[:,0].copy(), 4, 2)
Remark: I've got unexpected outputs for rolling(arr[:,0], 4, 2) so just took a copy instead.
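If you want a helper that takes the array, the number of chunks and the overlap percentage directly, as described in the question, here is a rough sketch built on the same as_strided idea (the function name and the way I derive the window and step are my own):
import numpy as np

def chunk_column(a, n_chunks, overlap):
    # window w and step s must satisfy w + (n_chunks - 1) * s <= len(a),
    # with s = w * (1 - overlap); trailing elements that don't fill a window are dropped
    w = int(a.size // (1 + (n_chunks - 1) * (1 - overlap)))
    s = int(w * (1 - overlap))
    shape = (n_chunks, w)
    strides = (s * a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

arr = np.array([[3, 7], [5, 8], [6, 19], [8, 59], [10, 42],
                [12, 54], [13, 32], [14, 19], [99, 19]])
print(chunk_column(arr[:, 0].copy(), 3, 0.5))
# [[ 3  5  6  8]
#  [ 6  8 10 12]
#  [10 12 13 14]]
As in the remark above, take a .copy() of the column first, because as_strided on a non-contiguous column view gives wrong results.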

keras ValueError: invalid literal for int() with base 10: [duplicate]

I have a list:
code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n', 'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that', 'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']
And I want to convert to one hot encoding. I tried:
to_categorical(code)
And I get an error: ValueError: invalid literal for int() with base 10: '<s>'
What am I doing wrong?
keras only supports one-hot-encoding for data that has already been integer-encoded. You can manually integer-encode your strings like so:
Manual encoding
# this integer encoding is purely based on position, you can do this in other ways
integer_mapping = {x: i for i,x in enumerate(code)}
vec = [integer_mapping[word] for word in code]
# vec is
# [0, 1, 2, 3, 16, 5, 6, 22, 8, 22, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Using scikit-learn
from sklearn.preprocessing import LabelEncoder
import numpy as np
code = np.array(code)
label_encoder = LabelEncoder()
vec = label_encoder.fit_transform(code)
# array([ 2, 6, 7, 9, 19, 1, 16, 0, 17, 0, 3, 10, 5, 21, 11, 18, 19,
# 4, 22, 14, 13, 12, 0, 20, 8, 15])
You can now feed this into keras.utils.to_categorical:
from keras.utils import to_categorical
to_categorical(vec)
Instead, you can use
pandas.get_dummies(y_train)
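For the list in the question, that would look roughly like this (a sketch; get_dummies returns a DataFrame with one 0/1 column per unique string):
import pandas as pd

code = ['<s>', 'are', 'defined', 'in', 'the', 'the']  # shortened example
one_hot = pd.get_dummies(code)   # DataFrame of shape (len(code), n_unique_strings)
print(one_hot.values)            # plain 0/1 numpy array, if that's what you need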
tf.keras.layers.CategoryEncoding
In TF 2.6.0, One Hot Encoding (OHE) or Multi Hot Encoding (MHE) can be implemented using tf.keras.layers.CategoryEncoding, tf.keras.layers.StringLookup, and tf.keras.layers.IntegerLookup.
I don't think this is available in TF 2.4.x, so it must have been added later.
See Classify structured data using Keras preprocessing layers for the actual implementation.
from tensorflow.keras import layers

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a layer that turns strings into integer indices.
    if dtype == 'string':
        index = layers.StringLookup(max_tokens=max_tokens)
    # Otherwise, create a layer that turns integer values into integer indices.
    else:
        index = layers.IntegerLookup(max_tokens=max_tokens)
    # Prepare a `tf.data.Dataset` that only yields the feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Encode the integer indices.
    encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())
    # Apply multi-hot encoding to the indices. The lambda function captures the
    # layers, so you can use them, or include them in the Keras Functional model later.
    return lambda feature: encoder(index(feature))
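For the flat string list in the question you don't even need the dataset-based helper above; a minimal sketch (assuming TF >= 2.6, the names are my own) can apply the lookup and encoding layers directly:
import tensorflow as tf

code = ['<s>', 'are', 'defined', 'in', 'the', 'LSTM']  # shortened example

# map each string to an integer index (index 0 is reserved for out-of-vocabulary tokens)
lookup = tf.keras.layers.StringLookup(vocabulary=sorted(set(code)))

# one-hot encode the integer indices
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup.vocabulary_size(), output_mode='one_hot')

one_hot = encoder(lookup(tf.constant(code)))
print(one_hot.shape)  # (6, vocabulary_size)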
Try converting it to a numpy array first:
from numpy import array
and then:
to_categorical(array(code))

How to separate 2 output arrays of sklearn kneighbors() Python?

I am a beginner in Python and I am using NearestNeighbors from sklearn like this:
print(neigh.kneighbors([[0.00015217, 0.00050968, 0.00044049, 0.00014538,
                         0.00077339, 0.0020284, 0.00047572]]))
And the output is:
(array([[1.01980586e-08, 7.73354596e-05, 7.73354596e-05, 1.20134585e-04,
         1.39792434e-04, 1.48002389e-04, 1.98794609e-04, 4.63512739e-04,
         5.31436554e-04, 5.36960418e-04, 5.72679303e-04, 6.28187320e-04,
         6.67923141e-04, 7.51928163e-04, 8.97313642e-04, 1.00023442e-03,
         1.06114362e-03, 1.11943158e-03, 1.12626043e-03, 1.20185118e-03,
         1.51073901e-03, 1.71592746e-03, 1.73362257e-03]]),
 array([[ 0, 16, 15, 19,  1, 23,  5,  8, 20,  9,  6, 10, 17,  3, 21, 22,
         14,  2, 13,  7, 11, 12, 18]], dtype=int64))
I would like to export these data to CSV because I need both arrays in CSV files. How can I separate these arrays?
hh = neigh.kneighbors([[0.00015217, 0.00050968, 0.00044049, 0.00014538,
                        0.00077339, 0.0020284, 0.00047572]])
first_array = hh[0]
second_array = hh[1]
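To get each array into its own CSV file, which is what the question is ultimately after, one option is numpy.savetxt (the file names here are just placeholders):
import numpy as np

distances, indices = neigh.kneighbors([[0.00015217, 0.00050968, 0.00044049, 0.00014538,
                                        0.00077339, 0.0020284, 0.00047572]])

np.savetxt('distances.csv', distances, delimiter=',')
np.savetxt('indices.csv', indices, delimiter=',', fmt='%d')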

Munging PyTorch's tensor shape from (C, B, H) to (B, C*H)

Given an input tensor of shape (C, B, H) torch.Size([2, 5, 32]) of some neural net layers, where
channels = 2
batch_size = 5
hidden_size = 32
The goal is to flatten the channels and manipulate the input tensor to the shape (B, C*H) torch.Size([5, 2 * 32]), where:
batch_size = 5
hidden_size = 32 * 2
I've tried to do the following:
import torch
t = torch.rand([2, 5, 32])
# Changed from (channels, batch_size, hidden_size)
# -> (batch_size, channels, hidden_size)
t = t.permute(1, 0, 2)
# Reshape using view(), where batch_size is t.size(0)
# and -1 is to flatten the left over values to the other dimension.
z = t.contiguous().view(t.size(0), -1)
print(z.shape)
print(z)
[out]:
torch.Size([5, 64])
tensor([[0.3911, 0.9586, 0.2104, 0.3937, 0.9976, 0.3378, 0.0630, 0.6676, 0.0806,
0.9311, 0.5219, 0.1697, 0.7442, 0.5162, 0.2555, 0.0826, 0.5502, 0.9700,
0.3375, 0.5012, 0.9025, 0.8176, 0.1465, 0.1848, 0.3460, 0.9999, 0.7892,
0.7577, 0.6615, 0.2620, 0.6868, 0.2003, 0.4840, 0.8354, 0.9253, 0.3172,
0.9516, 0.8962, 0.1272, 0.2268, 0.6510, 0.5166, 0.6772, 0.9616, 0.9826,
0.5254, 0.9191, 0.4378, 0.7048, 0.8808, 0.0299, 0.1102, 0.9710, 0.8714,
0.7256, 0.9684, 0.6117, 0.1957, 0.8663, 0.4742, 0.2843, 0.6548, 0.9592,
0.1559],
[0.2333, 0.0858, 0.5284, 0.2965, 0.3863, 0.3370, 0.6940, 0.3387, 0.3513,
0.1022, 0.3731, 0.3575, 0.7095, 0.0053, 0.7024, 0.4091, 0.3289, 0.5808,
0.5640, 0.8847, 0.7584, 0.8878, 0.9873, 0.0525, 0.7731, 0.2501, 0.9926,
0.5226, 0.0925, 0.0300, 0.4176, 0.0456, 0.4643, 0.4497, 0.5920, 0.9519,
0.6647, 0.2379, 0.4927, 0.9666, 0.1675, 0.9887, 0.7741, 0.5668, 0.7376,
0.4452, 0.7449, 0.1298, 0.9065, 0.3561, 0.5813, 0.1439, 0.2115, 0.5874,
0.2038, 0.1066, 0.3843, 0.6179, 0.8321, 0.9428, 0.1067, 0.5045, 0.9324,
0.3326],
[0.6556, 0.1479, 0.9288, 0.9238, 0.1324, 0.0718, 0.6620, 0.2659, 0.7162,
0.7559, 0.7564, 0.2120, 0.3943, 0.9497, 0.7520, 0.8455, 0.4444, 0.4708,
0.8371, 0.6365, 0.3616, 0.0326, 0.1581, 0.4973, 0.6701, 0.9245, 0.8274,
0.3464, 0.7044, 0.5376, 0.0441, 0.5210, 0.8603, 0.7396, 0.2544, 0.3514,
0.5686, 0.3283, 0.7248, 0.4303, 0.9531, 0.5587, 0.8703, 0.1585, 0.9161,
0.9043, 0.9778, 0.4489, 0.9463, 0.8655, 0.5576, 0.1135, 0.1268, 0.3424,
0.1504, 0.2265, 0.1734, 0.1872, 0.3995, 0.1191, 0.0532, 0.6109, 0.1662,
0.6937],
[0.6342, 0.1922, 0.1758, 0.4625, 0.7654, 0.6509, 0.2908, 0.1546, 0.4768,
0.3779, 0.2490, 0.0086, 0.6170, 0.5425, 0.6953, 0.4730, 0.5834, 0.8326,
0.0165, 0.8236, 0.0023, 0.7479, 0.5621, 0.9894, 0.5957, 0.0857, 0.6087,
0.5667, 0.5478, 0.8197, 0.9228, 0.7329, 0.4434, 0.5894, 0.9860, 0.6133,
0.2395, 0.4718, 0.8830, 0.6361, 0.6104, 0.6630, 0.5084, 0.7604, 0.7591,
0.3601, 0.6888, 0.6767, 0.9178, 0.5291, 0.0591, 0.4320, 0.7875, 0.5038,
0.4419, 0.0319, 0.3719, 0.5843, 0.0334, 0.3525, 0.0023, 0.1205, 0.4040,
0.7908],
[0.0989, 0.8436, 0.0425, 0.6247, 0.6091, 0.4778, 0.2692, 0.4785, 0.9217,
0.9604, 0.6355, 0.4686, 0.9414, 0.7722, 0.8013, 0.1660, 0.6578, 0.6414,
0.6814, 0.6212, 0.4124, 0.7102, 0.7416, 0.7404, 0.9842, 0.6542, 0.0106,
0.3826, 0.5529, 0.8079, 0.9855, 0.3012, 0.2341, 0.9353, 0.6597, 0.7177,
0.8214, 0.1438, 0.4729, 0.6747, 0.9310, 0.4167, 0.3689, 0.8464, 0.9395,
0.9407, 0.8419, 0.5486, 0.1786, 0.1423, 0.9900, 0.9365, 0.3996, 0.1862,
0.6232, 0.7547, 0.7779, 0.4767, 0.6218, 0.9079, 0.6153, 0.1488, 0.5960,
0.4015]])
Although permute() + view() achieves the desired output, are there other ways to perform the same operation? Is there a better way that can directly reshape without first permuting the order of the dimensions?
Let's look "behind the curtain" and see why one must have both permute/transpose and view in order to go from a C-B-H to B-C*H:
Elements of tensors are stored as a long contiguous vector in memory. For instance, if you look at a 2-3-4 tensor it has 24 elements stored at 24 consecutive places in memory. This tensor also has a "header" that tells pytorch to treat these 24 values as a 2-by-3-by-4 tensor. This is done by storing not only the size of the tensor, but also "strides": the "stride" one needs to jump in order to get to the next element along each dimension. In our example, size=(2,3,4) and strides=(12, 4, 1) (you can check this out yourself, and you can see more about it here).
Now, if you only want to change the size to 2-(3*4) you do not need to move any item of the tensor in memory, only to update the "header" of the tensor. By setting size=(2, 12) and strides=(12, 1) you are done!
Alternatively, if you want to "transpose" the tensor to 3-2-4 that's a bit more tricky, but you can still do that by manipulating the strides. Setting size=(3, 2, 4) and strides=(4, 12, 1) gives you exactly what you want without moving any of the real tensor elements in memory.
However, once you have manipulated the strides, you cannot trivially change the size of the tensor, because now you would need two different "stride" values for one (or more) dimensions. This is why you must call contiguous() at this point.
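You can inspect these numbers yourself; here is a small sketch (my own, just to illustrate the sizes and strides discussed above) using the same 2-3-4 example:
import torch

t = torch.arange(24).view(2, 3, 4)
print(t.shape, t.stride())                 # torch.Size([2, 3, 4]) (12, 4, 1)

# flattening the last two dims only rewrites the "header": no data moves
print(t.view(2, 12).stride())              # (12, 1)

# permuting also only rewrites the header (the strides get swapped)...
p = t.permute(1, 0, 2)
print(p.shape, p.stride())                 # torch.Size([3, 2, 4]) (4, 12, 1)

# ...but now a plain view is impossible; contiguous() copies the data first
print(p.is_contiguous())                   # False
print(p.contiguous().view(3, 8).stride())  # (8, 1)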
Summary
If you want to move from shape (C, B, H) to (B, C*H) you must use permute, contiguous and view operations; otherwise, you will just scramble the entries of your tensor.
A small example with 2-3-4 tensor:
a =
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
If you just change the view of the tensor you get
a.view(3,8)
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23]])
Which is not what you want!
You need to have
a.permute(1,0,2).contiguous().view(3, 8)
array([[ 0,  1,  2,  3, 12, 13, 14, 15],
       [ 4,  5,  6,  7, 16, 17, 18, 19],
       [ 8,  9, 10, 11, 20, 21, 22, 23]])
Einops allows doing such element rearrangements in one (readable) line
from einops import rearrange
import torch
t = torch.rand([2, 5, 32])
y = rearrange(t, 'c b h -> b (c h)')
y.shape # prints torch.Size([5, 64])
