I have two 3000x3 arrays and I'd like to compute the 1-to-1 (row-wise) Euclidean distance between them. For example, vec1 is
1 1 1
2 2 2
3 3 3
4 4 4
...
The vec2 is
2 2 2
3 3 3
4 4 4
5 5 5
...
I'd like to get the results as
1.73205081
1.73205081
1.73205081
1.73205081
...
I tried scipy.spatial.distance.cdist(vec1, vec2), but it returns a 3000x3000 matrix, whereas I only need the main diagonal. I also tried np.sqrt(np.sum((vec1-vec2)**2 for vec1,vec2 in zip(vec1,vec2))), and it didn't work for my purpose. Is there any way to compute the distances? I'd appreciate any comments.
cdist gives you back a 3000 x 3000 array because it computes the distance between every pair of row vectors in your two input arrays.
To compute only the distances between corresponding row indices, you could use np.linalg.norm:
a = np.repeat((np.arange(3000) + 1)[:, None], 3, 1)
b = a + 1
dist = np.linalg.norm(a - b, axis=1)
Or using standard vectorized array operations:
dist = np.sqrt(((a - b) ** 2).sum(1))
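For instance, with small stand-in arrays shaped like the question's data (only the first four rows of the example), the one-liner gives exactly the expected per-row distances:
import numpy as np

# Stand-ins for the asker's 3000x3 arrays (only the first rows are shown above).
vec1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=float)
vec2 = vec1 + 1

# One distance per row: the norm of each row-wise difference.
dist = np.linalg.norm(vec1 - vec2, axis=1)
print(dist)  # [1.73205081 1.73205081 1.73205081 1.73205081]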
Here's another way that works. It still uses np.linalg.norm, but it also parses the data from strings, if that is something you need.
import numpy as np

vec1 = '''1 1 1
2 2 2
3 3 3
4 4 4'''
vec2 = '''2 2 2
3 3 3
4 4 4
5 5 5'''

process_vec1 = np.array([])
process_vec2 = np.array([])
# Split each string into lines and parse the whitespace-separated numbers.
for line in vec1.splitlines():
    process_vec1 = np.append(process_vec1, list(map(float, line.split())))
for line in vec2.splitlines():
    process_vec2 = np.append(process_vec2, list(map(float, line.split())))
process_vec1 = process_vec1.reshape((len(process_vec1) // 3, 3))
process_vec2 = process_vec2.reshape((len(process_vec2) // 3, 3))
dist = np.linalg.norm(process_vec1 - process_vec2, axis=1)
print(dist)
[1.73205081 1.73205081 1.73205081 1.73205081]
I'm trying to take the cumulative max of array1 until it reaches the level of array2; then it should restart the accumulation from those points.
So: (RsiMa and DeltaFastAtrRsi are arrays)
long = (RsiMa - DeltaFastAtrRsi)
longband = np.fmax.accumulate(long)
but at the points where longband >= RsiMa it should instead become
longband = long
and the max accumulation should restart from that point.
UPPER LINE = RsiMa (array2)
LOWER LINE = longband (array1)
I NEED TO DO THIS WITHOUT LOOPS!!! (NUMPY)
EDIT EXAMPLE:
                          0  1  2  3  4  5  6
RsiMa                     4  4  4  3  5  2  1
long                      1  2  3  2  4  1  0.5
np.fmax.accumulate(long)  1  2  3  3  4  4  4
expected output           1  2  3  2  4  1  0.5
                                   ^        ^  ^
At the highlighted points (2, 1, 0.5) the running max reached RsiMa, so the expected output equals the long value there, and the accumulation restarts from those points.
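One way to get this resetting cumulative max without writing the loop over the arrays yourself is to pack each (long, RsiMa) pair into a complex number and run a custom binary function with np.frompyfunc(...).accumulate. A sketch using the example data above; note the caveat that frompyfunc still invokes a Python function per element internally, so this avoids the loop only syntactically, not in performance:
import numpy as np

rsi_ma = np.array([4, 4, 4, 3, 5, 2, 1], dtype=float)
long_ = np.array([1, 2, 3, 2, 4, 1, 0.5])

def step(prev, cur):
    # Running max of the 'long' values seen so far.
    cand = max(prev.real, cur.real)
    # Reset to the raw 'long' value when the running max reaches RsiMa
    # (carried in the imaginary part); otherwise keep accumulating.
    return (cur.real if cand >= cur.imag else cand) + 1j * cur.imag

# Pack each (long, RsiMa) pair into one complex number so the binary
# ufunc sees both values at every accumulation step.
packed = (long_ + 1j * rsi_ma).astype(object)
acc = np.frompyfunc(step, 2, 1).accumulate(packed, dtype=object)
longband = acc.astype(complex).real
print(longband)  # [1.  2.  3.  2.  4.  1.  0.5]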
I am generating a normal distribution, keeping the mean and std exactly the same across runs by using np.random.seed(0). I am trying to shuffle r except for the first and the last elements of the array, but it keeps the remaining elements at the same locations in the array, as shown in the current output. I also present the expected output.
import numpy as np
np.random.seed(0)
mu, sigma = 50, 2.0 # mean and standard deviation
Nodes=10
r = np.random.normal(mu, sigma, Nodes)
sort_r = np.sort(r);
r1=sort_r[::-1]
r1=r1.reshape(1,Nodes)
r2 = r.copy()
np.random.shuffle(r2.ravel()[1:])
r2=r2.reshape(1,Nodes) #actual radius values in mu(m)
maximum = r2.max()
indice1 = np.where(r2 == maximum)
r2[indice1] = r2[0][0]
r2[0][0] = maximum
r2[0][Nodes-1] = maximum #+0.01*maximum
print("r2 with max at (0,0)=",[r2])
The current output for many runs is
r2 with max at (0,0)= [array([[54.4817864 , 51.90017684, 53.52810469, 53.73511598, 48.04544424,
51.95747597, 50.80031442, 50.821197 , 49.7935623 , 54.4817864 ]])]
The expected output is (shuffling all elements randomly except the first and the last element)
Run 1: r2 with max at (0,0)= [array([[54.4817864 , 53.52810469, 51.90017684, 53.73511598, 48.04544424, 49.7935623 , 50.80031442, 50.821197  , 51.95747597, 54.4817864 ]])]
Run 2: r2 with max at (0,0)= [array([[54.4817864 , 51.90017684, 53.52810469, 48.04544424, 53.73511598, 51.95747597, 49.7935623 , 50.80031442, 50.821197  , 54.4817864 ]])]
It's not clear from your question what you include in a run.
If, as it seems, you're initializing the distribution and the seed on every run, shuffling it once will always give you the same result. That's by design: the random state is fixed, so just as your random numbers are reproducible, the shuffle operation will also return the same result every time.
Let me show you what I mean with some simpler code than yours:
# reinit distribution and seed at each run
for run in range(5):
    np.random.seed(0)
    a = np.random.randint(10, size=10)
    np.random.shuffle(a)
    print(f'{run}:{a}')
Which will print:
0:[2 3 9 0 3 7 4 5 3 5]
1:[2 3 9 0 3 7 4 5 3 5]
2:[2 3 9 0 3 7 4 5 3 5]
3:[2 3 9 0 3 7 4 5 3 5]
4:[2 3 9 0 3 7 4 5 3 5]
What you want is to initialize your distribution once and shuffle it at each run:
# init distribution and just shuffle it at each run
np.random.seed(0)
a = np.random.randint(10, size=10)
for run in range(5):
    np.random.shuffle(a)
    print(f'{run}:{a}')
Which will print:
0:[2 3 9 0 3 7 4 5 3 5]
1:[9 0 3 4 2 5 7 3 3 5]
2:[2 0 3 3 3 5 7 5 4 9]
3:[5 3 5 3 0 2 7 4 9 3]
4:[3 9 3 2 5 7 3 4 0 5]
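To also keep the last element in place (the second half of what the question asks), shuffle the slice r2[1:-1] instead of r2.ravel()[1:]; a slice is a view, so an in-place shuffle reorders the middle of the original array. A minimal sketch with made-up values (for the reshaped (1, Nodes) array in the question it would be r2[0, 1:-1]):
import numpy as np

r2 = np.array([54.48, 51.90, 53.53, 53.74, 48.05,
               51.96, 50.80, 50.82, 49.79, 54.48])

# r2[1:-1] is a view into r2, so shuffling it in place leaves the
# first and last elements untouched.
np.random.shuffle(r2[1:-1])
print(r2)  # first and last entries are still 54.48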
I have a scenario where the input matrix is 5*5:
4 4 4 1 1
4 4 4 0 2
4 4 4 5 1
4 4 4 2 3
4 4 4 4 4
If I give any input like 2*3 or 3*3, it should find the reoccurrences of that submatrix in the matrix. For example, for
3*3
it has subsets of the matrix like this, from row 1:
4 4 4
4 4 4
4 4 4
Likewise, if I specify 3*3, that subset of the matrix has 3 occurrences. Is it possible to do this in Python?
Say your original 5x5 matrix is called input_matrix. You could do this to find the number of times the submatrix at the top-left repeats:
input_size = input_matrix.shape[0]
# However many rows and columns the sub-matrix should have.
num_rows = 3
num_cols = 3
# The sub-matrix to find the number of occurrences of.
orig_matrix = input_matrix[:num_rows, :num_cols]
num_occurrences = 0
# Iterate through all positions where the sub-matrix could still fit
# inside the input matrix.
for row_num in range(input_size - num_rows + 1):
    for col_num in range(input_size - num_cols + 1):
        # Get the sub-matrix at those indices.
        this_matrix = input_matrix[row_num:row_num + num_rows, col_num:col_num + num_cols]
        # Get the difference between the original sub-matrix and the current one.
        # If the difference is 0 at all points, the submatrices are the same.
        if np.count_nonzero(orig_matrix - this_matrix, axis=None) == 0:
            num_occurrences += 1
num_occurrences contains the number you want. I'm sure there are more efficient ways to do this, but here's what I've got.
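For larger inputs, a vectorized alternative (a sketch assuming NumPy 1.20+ for sliding_window_view) compares every window against the top-left submatrix at once:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

input_matrix = np.array([[4, 4, 4, 1, 1],
                         [4, 4, 4, 0, 2],
                         [4, 4, 4, 5, 1],
                         [4, 4, 4, 2, 3],
                         [4, 4, 4, 4, 4]])
orig_matrix = input_matrix[:3, :3]

# All 3x3 windows, shape (3, 3, 3, 3): one window per valid top-left corner.
windows = sliding_window_view(input_matrix, (3, 3))
# A window counts when it matches the reference submatrix everywhere.
num_occurrences = int((windows == orig_matrix).all(axis=(-2, -1)).sum())
print(num_occurrences)  # 3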
If I have a 3-D matrix like:
cor =: 3 3 3 $ i.5
cor
0 1 2
3 4 0
1 2 3
4 0 1
2 3 4
0 1 2
3 4 0
1 2 3
4 0 1
and a 2-D matrix like:
d =: 3 3 $ i.5
d
0 1 2
3 4 0
1 2 3
This is really simple to calculate in J: put the rank conjunction "2 (apply to 2-D cells) after the - sign.
d -"2 cor
0 0 0
0 0 0
0 0 0
_4 1 1
1 1 _4
1 1 1
_3 _3 2
2 2 _3
_3 2 2
But I am still a NumPy novice. When I try:
cor - d
ValueError: Unable to coerce to Series/DataFrame, dim must be <= 2: (59, 59, 59)
Is there any way to do this kind of matrix manipulation in Python with NumPy?
Thanks in advance.
This is the Python for-loop code that I wanted to change into NumPy:
def pcor(df):
    cor = df.corr()
    n = df.shape[1]  # number of indices
    pcor = np.empty((n, n, n))
    d = np.empty((n, n, n))
    for x in range(n):
        for y in range(n):
            for m in range(n):
                if x == y:
                    pcor[x, y, m] = float('nan')
                else:
                    pcor[x, y, m] = (cor.iloc[x, y] - cor.iloc[x, m] * cor.iloc[y, m]) / ((1 - cor.iloc[x, m] ** 2) * (1 - cor.iloc[y, m] ** 2)) ** (1 / 2)
                d[x, y, m] = cor.iloc[x, y] - pcor[x, y, m]  # <-- this part!
You need to align the shape of d (currently (3, 3)) with the shape of cor (currently (3, 3, 3)) before subtracting. Try cor - d[None, :, :]: the None (np.newaxis) adds a new leading axis while the colons keep d's existing two axes, so d is broadcast across cor's first dimension (use d[None, :, :] - cor to match the sign of the J example). In fact, when both operands are plain NumPy arrays, broadcasting aligns trailing axes automatically and cor - d already works; the ValueError in your traceback comes from mixing a pandas object with a 3-D array, which pandas can't coerce, so convert the DataFrame with .to_numpy() first (for the loop version, cor.to_numpy()[:, :, None] - pcor adds the axis that varies over m).
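For the J example specifically, here is a small runnable NumPy demo; np.resize cycles the values the way J's $ does when filling a larger shape:
import numpy as np

# J:  cor =: 3 3 3 $ i.5   and   d =: 3 3 $ i.5
cor = np.resize(np.arange(5), (3, 3, 3))
d = np.resize(np.arange(5), (3, 3))

# Broadcasting aligns trailing axes, so d is subtracted from every
# 2-D cell of cor, matching J's  d -"2 cor.
print(d - cor)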
I have this Numpy code:
def uniq(seq):
    """
    Like Unix tool uniq. Removes repeated entries.
    :param seq: numpy.array. (time,) -> label
    :return: seq
    """
    diffs = np.ones_like(seq)
    diffs[1:] = seq[1:] - seq[:-1]
    idx = diffs.nonzero()
    return seq[idx]
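For reference, here's what it does on a small input (only consecutive repeats are removed):
import numpy as np

seq = np.array([1, 1, 2, 3, 3, 4])
print(uniq(seq))  # [1 2 3 4]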
Now, I want to extend this to support 2D arrays and make it use Theano. It should be fast on the GPU.
I will get an array with multiple sequences as multiple batches in the format (time,batch), and a time_mask which indirectly specifies the length of each sequence.
My current try:
def uniq_with_lengths(seq, time_mask):
    # seq is (time,batch) -> label
    # time_mask is (time,batch) -> 0 or 1
    num_batches = seq.shape[1]
    diffs = T.ones_like(seq)
    diffs = T.set_subtensor(diffs[1:], seq[1:] - seq[:-1])
    time_range = T.arange(seq.shape[0]).dimshuffle([0] + ['x'] * (seq.ndim - 1))
    idx = T.switch(T.neq(diffs, 0) * time_mask, time_range, -1)
    seq_lens = T.sum(T.ge(idx, 0), axis=0)  # (batch,) -> len
    max_seq_len = T.max(seq_lens)

    # I don't know any better way without scan.
    def step(batch_idx, out_seq_b1):
        out_seq = seq[T.ge(idx[:, batch_idx], 0).nonzero(), batch_idx][0]
        return T.concatenate((out_seq, T.zeros((max_seq_len - out_seq.shape[0],), dtype=seq.dtype)))

    out_seqs, _ = theano.scan(
        step,
        sequences=[T.arange(num_batches)],
        outputs_info=[T.zeros((max_seq_len,), dtype=seq.dtype)]
    )
    # out_seqs is (batch,max_seq_len)
    return out_seqs.T, seq_lens
How to construct out_seqs directly?
I would do something like out_seqs = seq[idx] but I'm not exactly sure how to express that.
Here's a quick answer that only addresses part of your task:
def compile_theano_uniq(x):
    diffs = x[1:] - x[:-1]
    diffs = tt.concatenate([tt.ones_like([x[0]], dtype=diffs.dtype), diffs])
    y = diffs.nonzero_values()
    return theano.function(inputs=[x], outputs=y)

theano_uniq = compile_theano_uniq(tt.vector(dtype='int32'))
The key is nonzero_values().
Update: I can't imagine any way to do this without using theano.scan. To be clear, and using 0 as padding, I'm assuming that given the input
1 1 2 3 3 4 0
1 2 2 2 3 3 4
1 2 3 4 5 0 0
you would want the output to be
1 2 3 4 0 0 0
1 2 3 4 0 0 0
1 2 3 4 5 0 0
or even
1 2 3 4 0
1 2 3 4 0
1 2 3 4 5
You could identify the indexes of the items you want to keep without using scan. But then either a new tensor needs to be constructed from scratch, or the values you want to keep somehow need to be moved to make the sequences contiguous. Neither approach seems feasible without theano.scan.