PyTorch expects each tensor to be equal size - python

When running this code: embedding_matrix = torch.stack(embeddings)
I got this error:
RuntimeError: stack expects each tensor to be equal size, but got [7, 768] at entry 0 and [8, 768] at entry 1
I'm trying to get embeddings using BERT via:
split_sent = sent.split()
tokens_embedding = []
j = 0
for full_token in split_sent:
    curr_token = ''
    x = 0
    for i, _ in enumerate(tokenized_sent[1:]):
        token = tokenized_sent[i+j]
        piece_embedding = bert_embedding[i+j]
        if token == full_token and curr_token == '':
            tokens_embedding.append(piece_embedding)
            j += 1
            break
sent_embedding = torch.stack(tokens_embedding)
embeddings.append(sent_embedding)
embedding_matrix = torch.stack(embeddings)
Does anyone know how I can fix this?

As per the PyTorch docs for torch.stack(), all input tensors must have the same shape. I don't know how you will be using the embedding_matrix, but you have two options: either pad your tensors to a common length (i.e., append zeros at the end up to a user-defined length; this is recommended if you will train with the stacked tensor, refer to this tutorial), which makes them equal in size, or simply concatenate them with something like torch.cat(embeddings, dim=0).
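For example, a minimal sketch of both options (the two random tensors below are just stand-ins for the real sentence embeddings):
import torch
from torch.nn.utils.rnn import pad_sequence

embeddings = [torch.randn(7, 768), torch.randn(8, 768)]  # stand-ins for sentence embeddings

# Option 1: zero-pad to the longest sentence, then stack -> shape (2, 8, 768)
embedding_matrix = pad_sequence(embeddings, batch_first=True)

# Option 2: concatenate along the token dimension -> shape (7 + 8, 768)
embedding_matrix = torch.cat(embeddings, dim=0)
print(embedding_matrix.shape)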

Related

Convert a string to one hot encoding matrix and then feed to neural network

I have a lot of DNA sequence data, which has been read into xtrain. Each sample has a label (it is a classification problem), which has been read into ytrain.
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, lower=True)
tokenizer.fit_on_texts("ATCGN")
# number of distinct characters, should be 5 in this case
max_id = len(tokenizer.word_index)
print(tokenizer.word_index)
{'a': 1, 't': 2, 'c': 3, 'g': 4, 'n': 5}
One sequence looks like this: "---ATCGATN---".
I want to split each sequence into fixed-length (e.g., 4) sub-seqs. Take the seq above as an example: "ATCG", "TCGA", "CGAT", "GATN". Each sub-seq will be represented by one row in the matrix. Then one-hot encoding is used to represent each character, so "A" is something like [0,0,0,0,1] and "T" is something like [0,0,0,1,0]. Concatenating the encodings of all characters in the sub-seq gives us the encoding for the sub-seq. So "ATCG" will be something like [0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,...]
In this way, each sequence will be turned into a matrix of size (number_of_sub-seq, len_of_sub-seq * 5), where 5 comes from tokenizer.word_index.
The following code tries to accomplish this. I am pretty new to TensorFlow, so I cannot figure out how to convert between the types or print out the real values of tensors. The line [x_encoded] = np.array(tokenizer.texts_to_sequences([x])) - 1 gives me the error AttributeError: 'Tensor' object has no attribute 'lower'.
def seq2mat(x, y):
    x = tf.strings.regex_replace(x, "-", "")
    x = tf.strings.regex_replace(x, 'K', 'N')
    [x_encoded] = np.array(tokenizer.texts_to_sequences([x])) - 1
    x_dataset = x_dataset.window(kmer_len, shift=3, drop_remainder=True)
    x_flat = x_dataset.flat_map(lambda window: window.batch(kmer_len))
    x_1hot = x_flat.map(lambda kmer: tf.one_hot(kmer, depth=max_id))
    # try to stack them into a matrix
    x_np_mat = []
    for item in x_1hot:
        line = np.array(item)
        x_np_mat.append(line.flatten())
    x_np_mat = np.array(x_np_mat)
    return x_np_mat
batch_size = 8
kmer_len = 8
train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
train_data = train_dataset.shuffle(buffer_size=1000, seed=1)
train_data = train_data.map(seq2mat)
train_data = train_data.batch(batch_size).prefetch(1)
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, lower=True)
tokenizer.fit_on_texts("ATCGN")
x = []
seq = np.array(tokenizer.texts_to_sequences('ATCGN'))
a = keras.utils.to_categorical(seq[:,0]-1)
for i in a:
    x = x + list(i)
print(x)
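For reference, here is a minimal NumPy-only sketch of the sub-sequence one-hot idea described in the question; seq_to_onehot_matrix and char_to_idx are hypothetical helpers, not part of the original code:
import numpy as np

char_to_idx = {'a': 0, 't': 1, 'c': 2, 'g': 3, 'n': 4}

def seq_to_onehot_matrix(seq, kmer_len=4, shift=1):
    # Clean the sequence, map characters to indices, slide a window over it,
    # and flatten the one-hot rows of each window into one matrix row.
    seq = seq.lower().replace('-', '')
    ids = np.array([char_to_idx[c] for c in seq])
    eye = np.eye(len(char_to_idx))
    rows = []
    for start in range(0, len(ids) - kmer_len + 1, shift):
        window = ids[start:start + kmer_len]
        rows.append(eye[window].flatten())        # (kmer_len * 5,)
    return np.array(rows)                         # (num_subseqs, kmer_len * 5)

print(seq_to_onehot_matrix("---ATCGATN---").shape)   # (4, 20): ATCG, TCGA, CGAT, GATN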

Transforming different arrays into a loop

I was wondering if it is possible to transform the following process into a loop so that I can use one word for this (not as a vector):
Data0 = np.zeros(dem0.shape, dtype=np.int32)
Data0[zipp[0] >= 0 ] = 1
Data1 = np.zeros(dem1.shape, dtype=np.int32)
Data1[zipp[1] >= 0 ] = 1
Data2 = np.zeros(dem2.shape, dtype=np.int32)
Data2[zipp[2] >= 0 ] = 1
Data3 = np.zeros(dem3.shape, dtype=np.int32)
Data3[zipp[3] >= 0 ] = 1
As you can see, there is one shape per layer (four layers total). I am trying to assign the corresponding "zipp" vector position to each dem shape for each layer I have (in the vector zipp, each zipp[i] is the array for the corresponding dem).
What I want is to replace with the number 1 those values greater than or equal to zero in the array contained in zipp[i], for each layer/shape/dem.
However, I must deliver the result as a word, not a vector or array, so I've been thinking of a loop but haven't figured it out just yet.
Thank you :)
I'm not quite sure what you mean by delivering the result "as a word not a vector or array", but assuming all of these arrays have the same shape you can reduce this to a couple of lines (maybe someone else knows how to do it in 1):
data = np.zeros_like(zipp, dtype=np.int32)
data[zipp >= 0] = 1
If you just want to return a boolean array of where zipp is greater than or equal to 0, you can do that in 1 line like this:
bool = np.greater_equal(zipp, 0)
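If you do want it in one line, here is a sketch (the random zipp below is just a stand-in for the four stacked layers, assuming they all have the same shape):
import numpy as np

zipp = np.random.randn(4, 5, 5)       # stand-in for the four stacked dem layers

data = (zipp >= 0).astype(np.int32)   # 1 where zipp >= 0, else 0
mask = np.greater_equal(zipp, 0)      # the boolean version, without shadowing the built-in bool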

How to store matrices of different sizes (like a cell array in Matlab)?

I am writing Python code (and also relatively new to Python) to do some classification using neural networks. The number of neurons in the hidden layers varies, which means I have various matrices of different size that I want to save.
If this were MATLAB, I would just do
for layer_iteration = 1 : number_of_layers
    cell_matrix{layer_iteration} = create_matrix(size_current, size_previous)
end
which would store different sized matrices in cell_matrix{layer_iteration}.
I wasn't quite sure how to do this in Python, so my first thought was to create the matrices at each iteration, flatten them, and append them. Then I create an index array of the same size as the flattened array, where each value is an integer that points to the correct layer_iteration. I can then use indexing in later operations to find the correct slice of the 1D array and reshape it. This is some of the code:
layer_sizes = [2, 10, 6, 2]
total_amount_of_values = 92  # e.g. 2*10 + 10*6 + 6*2
W = np.zeros(total_amount_of_values)
idx_W = np.zeros((total_amount_of_values,), dtype=np.int)
it_idx = 0
for idx, val in enumerate(layer_sizes):
    to_idx = idx + 1
    if idx == 0:
        start_idx = 0
    else:
        start_idx = end_idx
    if it_idx < len(layer_sizes)-1:
        tmp_W = create_matrix(val, layer_sizes[to_idx])
        end_idx = start_idx + layer_sizes[idx + 1] * val
        W[start_idx:end_idx] = tmp_W
        idx_W[start_idx:end_idx] = int(it_idx)
        it_idx += 1
If I want to use W for matrix multiplication later on, all I need to do is:
tmp_W = W[idx_W == iteration]
tmp_W = np.reshape(tmp_W, (size_1, size_2))
This works fine, but I've realised it doesn't seem Pythonic at all. I am now at a part of my code where using my technique is complicating things. I would really benefit from a simpler way of storing different-sized matrices, both in terms of my current code and, more importantly, my education in Python.
What would be the best solution?
Thanks!
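For reference, the closest Python analogue to a MATLAB cell array is a plain list holding arrays of different shapes; a minimal sketch (the create_matrix below is a hypothetical stand-in for the question's own helper):
import numpy as np

layer_sizes = [2, 10, 6, 2]

def create_matrix(size_current, size_previous):   # hypothetical stand-in
    return np.random.randn(size_current, size_previous)

W = []   # plays the role of cell_matrix{...}
for prev, curr in zip(layer_sizes[:-1], layer_sizes[1:]):
    W.append(create_matrix(curr, prev))

# W[i] is the weight matrix for layer i; no flattening or index bookkeeping needed.
print([w.shape for w in W])   # [(10, 2), (6, 10), (2, 6)]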

filtering "empty" values from Tensorflow

I wrote this code to keep only the values from a Dataset that are <= 6.
import tensorflow as tf
import tensorflow.contrib.data as ds
def make_graph():
    inits = []
    filter_value = tf.constant([6], dtype=tf.int64)
    source = ds.Dataset.range(10)
    batched = source.batch(3)
    batched_iter = batched.make_initializable_iterator()
    batched_next = batched_iter.get_next()
    inits.append(batched_iter.initializer)
    predicate = tf.less_equal(batched_next, filter_value, name="less_than_filter")
    true_coordinates = tf.where(predicate)
    reshaped = tf.reshape(true_coordinates, [-1])
    # need to turn bools into 1 and 0 elsewhere
    found = tf.gather(params=batched_next, indices=reshaped)
    return found, inits  # prepend final tensor

def run_graph(final_tensor, initializers, rounds):
    with tf.Session() as sess:
        init_ops = (tf.local_variables_initializer(), tf.global_variables_initializer())
        sess.run(init_ops)
        summary_writer = tf.summary.FileWriter(graph=sess.graph, logdir=".")
        while rounds > 0:
            for i in initializers:
                sess.run(i)
            try:
                while True:
                    final_result = sess.run(final_tensor)
                    print("Got result: {r}".format(r=final_result))
            except tf.errors.OutOfRangeError:
                print("Got out of range error")
            rounds -= 1
        summary_writer.flush()

def run():
    final_tensor, initializers = make_graph()
    run_graph(final_tensor=final_tensor,
              initializers=initializers,
              rounds=1)

if __name__ == "__main__":
    run()
However, the results are as follows:
Got result: [0 1 2]
Got result: [3 4 5]
Got result: [6]
Got result: []
Got out of range error
Is there a way to filter this empty Tensor? I tried to brainstorm ways to do this, maybe with a tf.while loop, but I'm not sure whether I'm missing something or such an operation (i.e. an OpKernel "dropping" an input by not producing output based on its value) is not possible in Tensorflow.
Keeping only values <= 6 BEFORE batching:
dataset = ds.Dataset.range(10)
dataset = dataset.filter( lambda v : v <= 6 )
dataset = dataset.batch(3)
batched_iter = dataset.make_initializable_iterator()
This will generate batches containing only the data you want. Note that it's generally better to filter out the unwanted data before building the batches. This way, empty tensors will not be generated by the iterator.
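For completeness, a sketch of the same idea with the eager tf.data API (assuming TensorFlow 2.x, where no Session or initializable iterator is needed):
import tensorflow as tf

dataset = tf.data.Dataset.range(10)            # 0 .. 9
dataset = dataset.filter(lambda v: v <= 6)     # drop values > 6 before batching
dataset = dataset.batch(3)

for batch in dataset:                          # eager iteration over the batches
    print(batch.numpy())
# [0 1 2]
# [3 4 5]
# [6]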

Too many indices for array

I am trying to create a 3D image mat1 from the data given to me by an object. But I am getting the error for the last line: mat1[x,y,z] = mat[x,y,z] + (R**2/U**2)**pf1[l,m,beta]:
IndexError: too many indices for array
What could possibly be the problem here?
Following is my code :
mat1 = np.zeros((1024,1024,360), dtype=np.int32)
k = 498
gamma = 0.00774267
R = 0.37
g = np.zeros(1024)
g[0:512] = np.linspace(0,1,512)
g[513:] = np.linspace(1,0,511)
pf = np.zeros((1024,1024,360))
pf1 = np.zeros((1024,1024,360))
for b in range(0,1023):
    for beta in range(0,359):
        for a in range(0,1023):
            pf[a,b,beta] = (R/(((R**2)+(a**2)+(b**2))**0.5))*mat[a,b,beta]
        pf1[:,b,beta] = np.convolve(pf[:,b,beta], g, 'same')
for x in range(0,1023):
    for y in range(0,1023):
        for z in range(0,359):
            for beta in range(0,359):
                a = R*((-x*0.005)*(sin(beta)) + (y*0.005)*(cos(beta)))/(R + (x*0.005)*(cos(beta)) + (y*0.005)*(sin(beta)))
                b = z*R/(R + (x*0.005)*(cos(beta)) + (y*0.005)*(sin(beta)))
                U = R + (x*0.005)*(cos(beta)) + (y*0.005)*(sin(beta))
                l = math.trunc(a)
                m = math.trunc(b)
                if (0 <= l < 1024 and 0 <= m < 1024):
                    mat1[x,y,z] = mat[x,y,z] + (R**2/U**2)**pf1[l,m,beta]
The line where you do the convolution:
pf1 = np.convolve(pf[:,b,beta],g)
generates a 1-dimensional array, not the 3-dimensional one that your call in the last line, pf1[l,m,beta], expects.
To solve this you can use:
pf1[:,b,beta] = np.convolve(pf[:,b,beta],g,'same')
and you also need to predefine pf1:
pf1 = np.zeros((1024,1024,360))
Note that the convolution f*g (np.convolve(f,g)) normally returns an array of length |f|+|g|-1. If you use np.convolve with the parameter 'same', however, it returns an array whose length is the maximum of the lengths of f and g (i.e. max(|f|,|g|)).
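A quick illustration of the two output lengths:
import numpy as np

f = np.arange(5)    # |f| = 5
g = np.ones(3)      # |g| = 3
print(np.convolve(f, g).shape)            # (7,) -> |f| + |g| - 1
print(np.convolve(f, g, 'same').shape)    # (5,) -> max(|f|, |g|)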
Edit:
Furthermore you have to be sure that the dimensions of the matrices and the indices you use are correct, for example:
You define mat1 = np.zeros((100,100,100),dtype=np.int32), thus a 100x100x100 matrix, but in the last line you do mat1[x,y,z], where the variables x, y and z clearly go beyond those dimensions (they run over the range of the mat matrix). You probably have to change the dimensions of mat1 to those as well:
mat1 = np.zeros((1024,1024,360),dtype=np.int32)
Also be sure that the last variable indices you calculate (l and m) are within the dimensions of pf1.
Edit 2: The range(a,b) function yields the integers from a up to, but not including, b. So instead of range(0,1023), for example, you should write range(0,1024) (or shorter: range(1024)).
Edit 3: To check if l or m exceed the dimensions you could add an error as soon as they do:
l = math.trunc(a)
if l >= 1024:
    print('l exceeded bounds: ', l)
m = math.trunc(b)
if m >= 1024:
    print('m exceeded bounds: ', m)
Edit 4: Note that your code, especially the last nested for loop, will take a long time! That loop results in 1024*1024*360*360 = 135895449600 iterations. With a small time estimation I did (measuring the running time of the code inside your for loop), your code might take about 5 days to run.
A small, easy optimization you could do: instead of calculating the sin and cos several times, store the values in variables:
sinbeta = sin(beta)
cosbeta = cos(beta)
but it will probably still take several days. You might want to check how to optimize your calculations or calculate it with a C program for example.
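A sketch of that idea taken one step further, precomputing the trigonometric terms for all beta values at once (this assumes beta is used in radians, exactly as the original loop does):
import numpy as np

betas = np.arange(360)
sin_b = np.sin(betas)   # sin_b[beta] replaces sin(beta) inside the loops
cos_b = np.cos(betas)   # cos_b[beta] replaces cos(beta) inside the loops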
