Using np.newaxis to compute sum of squared differences - python

In chapter 2 of "Python Data Science Handbook" by Jake VanderPlas, he computes the sum of squared differences of several 2-d points using the following code:
import numpy as np
rand = np.random.RandomState(42)
X = rand.rand(10,2)
dist_sq = np.sum((X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2, axis=-1)
Two questions:
Why is a third axis created? What is the best way to visualize what is going on?
Is there a more intuitive way to perform this calculation?

Why is a third axis created? What is the best way to visualize what is going on?
Adding new dimensions before adding or subtracting is a relatively common trick for generating all pairs via broadcasting (None is the same as np.newaxis here):
>>> a = np.arange(10)
>>> a[:,None]
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
>>> a[None,:]
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> a[:,None] + 100*a[None,:]
array([[ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
[ 1, 101, 201, 301, 401, 501, 601, 701, 801, 901],
[ 2, 102, 202, 302, 402, 502, 602, 702, 802, 902],
[ 3, 103, 203, 303, 403, 503, 603, 703, 803, 903],
[ 4, 104, 204, 304, 404, 504, 604, 704, 804, 904],
[ 5, 105, 205, 305, 405, 505, 605, 705, 805, 905],
[ 6, 106, 206, 306, 406, 506, 606, 706, 806, 906],
[ 7, 107, 207, 307, 407, 507, 607, 707, 807, 907],
[ 8, 108, 208, 308, 408, 508, 608, 708, 808, 908],
[ 9, 109, 209, 309, 409, 509, 609, 709, 809, 909]])
Your example does the same, just with 2-vectors instead of scalars at the innermost level:
>>> X[:,np.newaxis,:].shape
(10, 1, 2)
>>> X[np.newaxis,:,:].shape
(1, 10, 2)
>>> (X[:,np.newaxis,:] - X[np.newaxis,:,:]).shape
(10, 10, 2)
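Thus the 'magical subtraction' is just every point in X subtracted from every other point: entry [i, j] of the result holds X[i] - X[j]. Squaring and summing over the last axis then gives the squared distance for each pair.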
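One way to picture the extra axis is to write the same computation as two explicit loops. The sketch below is only an illustration (the name dist_sq_loops is made up here and is not from the book); it assumes the same X as in the question and checks that both versions agree:

```python
import numpy as np

rand = np.random.RandomState(42)
X = rand.rand(10, 2)

# Broadcasting version: (10, 1, 2) - (1, 10, 2) -> (10, 10, 2), then sum over the last axis
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

# Loop version (hypothetical name): entry [i, j] is the squared distance
# between point i and point j
dist_sq_loops = np.empty((len(X), len(X)))
for i in range(len(X)):
    for j in range(len(X)):
        diff = X[i] - X[j]                      # a 2-vector
        dist_sq_loops[i, j] = np.sum(diff ** 2)

print(np.allclose(dist_sq, dist_sq_loops))
```

The new axis plays the role of the second loop index: one copy of X varies along axis 0, the other along axis 1.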
Is there a more intuitive way to perform this calculation?
Yes, use scipy.spatial.distance.pdist for pairwise distances. To get an equivalent result to your example:
from scipy.spatial.distance import pdist, squareform
dist_sq = squareform(pdist(X))**2


How to concat corresponding elements (which are integers) of two 2D arrays of the same shape?

I have two 10x8 arrays, C and D. I need to concat the corresponding elements of these two arrays and store the result in another 10x8 array. For example, if C = [[1, 2, 3, 4, 5, 6, 7, 8],[9, 10, 11, 12, 13, 14, 15, 16],[8 elements],... [10th row which has 8 elements]] and D = [[100, 99, 98, 97, 96, 95, 94, 93],[92, 90, 89, 88, 87, 86, 85, 84],[8 elements],... [10th row which has 8 elements]]. I need another 10x8 array, E, which looks like E = [[1100, 299, 398, 497, 596, 695, 794, 893], [992, 1090, 1189, 1288, 1387, 1486, 1585, 1684],... [10th row which contains concatenation of the corresponding 8 elements in the 10th row of C and D]]. How do I obtain this? Appreciate your help!
Nested list comprehension:
>>> C = [[1, 2, 3, 4, 5, 6, 7, 8],[9, 10, 11, 12, 13, 14, 15, 16]]
>>> D = [[100, 99, 98, 97, 96, 95, 94, 93],[92, 90, 89, 88, 87, 86, 85, 84]]
>>> [[int(f'{c}{d}') for c, d in zip(lc, ld)] for lc, ld in zip(C, D)]
[[1100, 299, 398, 497, 596, 695, 794, 893],
[992, 1090, 1189, 1288, 1387, 1486, 1585, 1684]]
Just for fun, here is a functional solution:
>>> from functools import partial
>>> from itertools import starmap
>>> list(map(list, map(partial(map, int), map(partial(starmap, '{0}{1}'.format), map(zip, C, D)))))
[[1100, 299, 398, 497, 596, 695, 794, 893],
[992, 1090, 1189, 1288, 1387, 1486, 1585, 1684]]
Just loop over the elements and concatenate each corresponding pair. If the two arrays have the same dimensions, there is no need for two separate loops: a single pass over both arrays is enough, which makes this a very easy method.
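If C and D are NumPy arrays rather than nested lists, the same string-based idea can be written without an explicit Python loop. This is a minimal sketch, assuming non-negative integers and using the two sample rows from the question:

```python
import numpy as np

C = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
              [9, 10, 11, 12, 13, 14, 15, 16]])
D = np.array([[100, 99, 98, 97, 96, 95, 94, 93],
              [92, 90, 89, 88, 87, 86, 85, 84]])

# Convert both arrays to strings, join them element by element,
# then convert the joined strings back to integers
E = np.char.add(C.astype(str), D.astype(str)).astype(int)
print(E)
```

The result has the same shape as C and D, so it extends to the full 10x8 case unchanged.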

numpy How to get elements from a two-dimensional array, when each row of a slice has a different number of columns? [closed]

How to get elements from a two-dimensional array, when each row of a slice has a different number of columns?
buffer = np.zeros((32, 32, 3), 'u1') # this is our data buffer 2d.
buffer[2:5, (2:4, 3:7, 0:11)] # does not work.
# vertical interval: 2..5; horizontal intervals: 1..3, 4..9, 7..10
multi_intervals = ((2, 5), ((1, 3), (4, 9), (7, 10)))
# our very slow function.
def gen_xy_indices(y_interval, x_multi_intervals):
    x_multi_ranges = list(map(lambda x: np.arange(*x), x_multi_intervals))
    y_range = np.arange(*y_interval)
    y_indices = np.repeat(y_range, list(map(len, x_multi_ranges)))
    x_indices = np.concatenate(x_multi_ranges)
    return x_indices, y_indices
ix, iy = gen_xy_indices(*multi_intervals)
buffer[iy, ix].shape == (10, 3) # works, but slowly.
# IS THERE A FASTER WAY TO DO THIS?! (in python with numpy)
You can use np.repeat and np.concatenate.
>>> import numpy as np
>>>
>>> class By_Row:
...     def __getitem__(self, idx):
...         y, *x = (np.arange(i.start, i.stop, i.step) for i in idx)
...         return y.repeat(np.fromiter((i.size for i in x), int, y.size)), np.concatenate(x)
...
>>>
>>> b_ = By_Row()
>>>
>>> A = sum(np.ogrid[:600:100, :12])
>>> A
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311],
[400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411],
[500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511]])
>>> A[b_[2:5, 2:4, 3:7, 0:11]]
array([202, 203, 303, 304, 305, 306, 400, 401, 402, 403, 404, 405, 406,
407, 408, 409, 410])
Here's one way you could do it:
x = range(2,5)
y = range(17)
divs = [(2,4), (3,7), (12,17)]
y_vals = []
x_vals = []
for d, div in enumerate(divs):
    y_grp = y[div[0]:div[1]]
    y_vals += y_grp
    x_vals += [x[d]]*len(y_grp)
print(x_vals)
print(y_vals)
> [2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
> [2, 3, 3, 4, 5, 6, 12, 13, 14, 15, 16]
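The two lists can then be passed together as a fancy index. The following is a small sketch under the assumption that x_vals holds row indices and y_vals holds column indices; the array A is a made-up example in which A[r, c] == 100*r + c, so the selected positions are easy to read off:

```python
import numpy as np

# Hypothetical test array: the value encodes its own position (100*row + column)
A = np.arange(6)[:, None] * 100 + np.arange(17)

x_vals = [2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]          # row index of each element
y_vals = [2, 3, 3, 4, 5, 6, 12, 13, 14, 15, 16]     # column index of each element

# One (row, column) pair per output element
picked = A[x_vals, y_vals]
print(picked)
```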

Avoid nested for with numpy arrays

I need to implement a generic operation on the elements of some np 2D arrays (A,B,C). In pseudo-code
for i in A.height:
    for j in A.width:
        A[i,j] = f(B[i,j],C[i,j])
where f() is concatenating the bits of the two variables by means of struct.pack(), struct.unpack()
x = struct.pack('2B', B[i, j], C[i, j])
y = struct.unpack('H', x)
This code takes a really long time to execute (0.25 s for 640*480 matrices; maybe that is normal, but I could use something faster), so I was wondering if anybody could suggest a more Pythonic way of achieving the same result that would also improve performance.
Your function:
In [310]: def foo(a,b):
     ...:     x = struct.pack('2B', a,b)
     ...:     return struct.unpack('H',x)[0]
np.vectorize is a convenient way of broadcasting arrays: it passes scalar values to the function. It does not speed up the code, though (the related np.frompyfunc may give a 2x speedup relative to plain iteration).
In [311]: fv = np.vectorize(foo)
In [312]: fv(np.arange(5)[:,None],np.arange(10))
Out[312]:
array([[ 0, 256, 512, 768, 1024, 1280, 1536, 1792, 2048, 2304],
[ 1, 257, 513, 769, 1025, 1281, 1537, 1793, 2049, 2305],
[ 2, 258, 514, 770, 1026, 1282, 1538, 1794, 2050, 2306],
[ 3, 259, 515, 771, 1027, 1283, 1539, 1795, 2051, 2307],
[ 4, 260, 516, 772, 1028, 1284, 1540, 1796, 2052, 2308]])
I can replicate those values with a simple math expression on the same arrays:
In [313]: np.arange(5)[:,None]+np.arange(10)*256
Out[313]:
array([[ 0, 256, 512, 768, 1024, 1280, 1536, 1792, 2048, 2304],
[ 1, 257, 513, 769, 1025, 1281, 1537, 1793, 2049, 2305],
[ 2, 258, 514, 770, 1026, 1282, 1538, 1794, 2050, 2306],
[ 3, 259, 515, 771, 1027, 1283, 1539, 1795, 2051, 2307],
[ 4, 260, 516, 772, 1028, 1284, 1540, 1796, 2052, 2308]])
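This only works for a limited range of values (each input must fit in one byte, 0 to 255, which is all '2B' accepts, and the result above reflects a little-endian machine), but it gives an idea of how you can properly 'vectorize' calculations in numpy.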
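Applied to whole arrays, the same arithmetic can be written with a shift and a bitwise OR. This is a minimal sketch, assuming B and C hold values in 0..255 and that the little-endian layout (B in the low byte, C in the high byte) is the one wanted; the random test arrays are made up for the check against the struct version:

```python
import struct
import numpy as np

rng = np.random.RandomState(0)
B = rng.randint(0, 256, size=(4, 6)).astype(np.uint8)
C = rng.randint(0, 256, size=(4, 6)).astype(np.uint8)

# Widen to 16 bits first so the shift does not overflow, then put C in the high byte
A = B.astype(np.uint16) | (C.astype(np.uint16) << 8)

# Reference: the per-element struct version with an explicit little-endian byte order
expected = np.array([[struct.unpack('<H', struct.pack('<2B', int(b), int(c)))[0]
                      for b, c in zip(row_b, row_c)]
                     for row_b, row_c in zip(B, C)])
print(np.array_equal(A, expected))
```

The whole 640*480 case is then a single array expression instead of one struct call per element.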
Depends on what 'f' does... Not sure if this is what you mean
b = np.arange(3*4).reshape(3,4)
c = np.arange(3*4).reshape(3,4)[::-1]
b
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
c
array([[ 8, 9, 10, 11],
[ 4, 5, 6, 7],
[ 0, 1, 2, 3]])
def f(b, c):
    """some function"""
    a = b + c
    return a
a = f(b, c)
a
array([[ 8, 10, 12, 14],
[ 8, 10, 12, 14],
[ 8, 10, 12, 14]])

Python - Split array into multiple arrays dependent on array values

I have a list which needs to be split into multiple lists of differing size. The values in the original list randomly increase in size until the split point, where the value drops before continuing to increase. The values must remain in order after being split.
E.g.
Original list
[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631, 70, 119,
125, 130, 134, 139, 144, 149, 154, 159, 614, 669, 100, 136, 144, 149, 153,
158, 163, 167, 173, 179, 62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117,
124, 426, 100, 129, 135, 140, 145, 151]
After split:
[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631]
[70, 119, 125, 130, 134, 139, 144, 149, 154, 159, 614, 669]
[100, 136, 144, 149, 153, 158, 163, 167, 173, 179]
[62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117, 124, 426]
[100, 129, 135, 140, 145, 151]
I have searched for a solution, finding numpy.where and numpy.diff as likely candidates, but I'm unsure how to implement.
Thanks for the help!
Approach #1
Using NumPy's numpy.split to get a list of arrays as output -
import numpy as np
arr = np.array(a) # a is input list
out = np.split(arr,np.flatnonzero(arr[1:] < arr[:-1])+1)
Approach #2
Using a list comprehension to split the list directly and thus avoid numpy.split for efficiency purposes -
idx = np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
out = [a[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Output for given sample -
In [52]: idx = np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
In [53]: [a[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Out[53]:
[[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631],
[70, 119, 125, 130, 134, 139, 144, 149, 154, 159, 614, 669],
[100, 136, 144, 149, 153, 158, 163, 167, 173, 179],
[62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117, 124, 426],
[100, 129, 135, 140, 145, 151]]
We are using np.diff here, which is fed a list in this case and then computes the differences between consecutive elements. A better alternative would be to convert to an array first and then compare shifted slices of it instead of actually computing the difference values. Thus, we could get idx like this as well -
arr = np.asarray(a)
idx = np.r_[0, np.flatnonzero(arr[1:] < arr[:-1])+1, len(arr)]
Let's time it and see if there's any improvement -
In [84]: a = np.random.randint(0,100,(1000,100)).cumsum(1).ravel().tolist()
In [85]: %timeit np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
100 loops, best of 3: 3.24 ms per loop
In [86]: arr = np.asarray(a)
In [87]: %timeit np.asarray(a)
100 loops, best of 3: 3.05 ms per loop
In [88]: %timeit np.r_[0, np.flatnonzero(arr[1:] < arr[:-1])+1, len(arr)]
10000 loops, best of 3: 77 µs per loop
In [89]: 3.05+0.077
Out[89]: 3.127
So, there is only a marginal improvement with the shift-and-compare method, with the conversion np.asarray(a) eating up most of the runtime.
I know you tagged numpy. But here's an implementation without any dependencies too:
lst = [100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631, 70, 119,
125, 130, 134, 139, 144, 149, 154, 159, 614, 669, 100, 136, 144, 149, 153,
158, 163, 167, 173, 179, 62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117,
124, 426, 100, 129, 135, 140, 145, 151]
def split(lst):
    last_pos = 0
    for i in range(1, len(lst)):
        if lst[i] < lst[i-1]:
            yield lst[last_pos:i]
            last_pos = i
    if(last_pos <= len(lst)-1):
        yield lst[last_pos:]
print([x for x in split(lst)])
If you want to use numpy.diff and numpy.where, you can try
import numpy
a = numpy.array(original_list)  # original_list is your original list
numpy.split(a, numpy.where(numpy.diff(a) < 0)[0] + 1)
Explanation:
numpy.diff(a) calculates the difference of each item and its preceding one, and returns an array.
numpy.diff(a) < 0 returns a boolean array, where each element is replaced by whether it satisfies the predicate, in this case less than zero. This is the result of numpy.ndarray overloading the comparison operators.
numpy.where takes this boolean array and returns the indices where the element is not zero. In this context False evaluates to zero, so you take the indices of True.
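[0] takes the first (and only) element of the tuple that numpy.where returns, i.e. the index array for the only axis.
+ 1 shifts each index by one, because you want to break off just after those positions, not before.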
Finally, numpy.split breaks them off at the given indices.
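The steps above can be traced on a short list. The values below are a shortened, made-up version of the list in the question, and the intermediate names (d, drops, cut) are only for illustration:

```python
import numpy as np

a = np.array([100, 564, 572, 70, 119, 125, 62, 72])

d = np.diff(a)                  # a[k+1] - a[k] for each neighbouring pair
drops = d < 0                   # True where the value falls
cut = np.where(drops)[0] + 1    # index of the first element of each new piece
parts = np.split(a, cut)        # list of arrays, split at those indices

for p in parts:
    print(p)
```

Here the drops occur at positions 2 and 5 of the difference array, so the cuts land at indices 3 and 6 of the original.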

How to get list of all possible sums of n*m matrix rows

I have this 4x10 (nxm) data matrix in csv:
1, 5, 19, 23, 7, 51, 18, 20, 35, 41
15, 34, 17, 8, 11, 93, 13, 46, 3, 10
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
10, 9, 8, 7, 6, 5, 4, 3, 2, 1
First, I try to get a list of all possible sums from the first n/2 rows. With remaining last n/2 rows I do the same.
By all possible sums of the first rows I mean the following:
Example:
Row 1: 1, 2, 3
Row 2: 3, 2, 1
All possible sums list: 1 + [3, 2, 1]; 2 + [3, 2, 1]; 3 + [3, 2, 1]
Final list: [4, 3, 2, 5, 4, 3, 6, 5, 4]
(At the moment I do not want to remove duplicates)
For my logic I have this code:
import csv
def loadCsv(filename):
    # On Python 3 the csv module expects a text-mode file opened with newline=""
    with open(filename, newline="") as f:
        dataset = list(csv.reader(f))
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
data = loadCsv('btest2.txt')
divider = len(data) // 2
firstPossibleSumsList = []
secondPossibleSumsList = []
#Possible sum list for the first n/2 rows:
for i in range(len(data[0])):
    for j in range(len(data[0])):
        firstPossibleSumsList.append(data[0][i] + data[1][j])
#Possible sum list for the last n/2 rows:
for i in range(len(data[0])):
    for j in range(len(data[0])):
        secondPossibleSumsList.append(data[2][i] + data[3][j])
The problem is that I divided rows manually by using data[0][i], data[1][i], data[2][i] and so on. I want to do it more efficiently and by involving divider variable, but I can't figure out how. In my code I depend on integers 0, 1, 2, 3, but I wanted to split matrix rows into halves regardless of matrix dimensions.
One option is to think of it as a sum of a vector and transposed vector. Then you could do:
import numpy as np
data = np.array(loadCsv('btest2.txt'))
firstPossibleSumsArray = (data[0,:,np.newaxis] + data[1]).flatten()
#output for the first two rows:
array([16., 35., 18., 9., 12., 94., 14., 47., 4., 11.,
20., 39., 22., 13., 16., 98., 18., 51., 8., 15.,
34., 53., 36., 27., 30., 112., 32., 65., 22., 29.,
38., 57., 40., 31., 34., 116., 36., 69., 26., 33.,
22., 41., 24., 15., 18., 100., 20., 53., 10., 17.,
66., 85., 68., 59., 62., 144., 64., 97., 54., 61.,
33., 52., 35., 26., 29., 111., 31., 64., 21., 28.,
35., 54., 37., 28., 31., 113., 33., 66., 23., 30.,
50., 69., 52., 43., 46., 128., 48., 81., 38., 45.,
56., 75., 58., 49., 52., 134., 54., 87., 44., 51.])
The final flatten turns the 10x10 array into a flat array of 100 elements, which should not be necessary.
Downside of using arrays is that they are not as flexible when it comes to resizing/appending data.
Edit:
The full code could be something like:
div = data.shape[0] // 2   # number of rows in each half of the matrix
row_len_squared = int(data.shape[1]**2)
firstPossibleSumsArray = np.empty( int((div*(div-1))/2 * row_len_squared), dtype=int )
idx = 0
for row in range(div):
    for col in range(row+1,div):
        firstPossibleSumsArray[idx:idx+row_len_squared] = \
            (data[row,:,np.newaxis] + data[col]).flatten()
        idx += row_len_squared
#repeat the process for the second possible sums array by replacing the range
#in the first loop from range(div) to range(div,2*div) (and range(row+1,div) with range(row+1,2*div))
This will go through each row, and sum it with the remaining rows in matrix half (row #1 + row #2, ..., row #1 + row #n, row #2 + row #3 etc.)
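The same idea can be wrapped in a helper so that both halves go through identical code. This is only a sketch: pair_sums is a hypothetical name, the 4x3 matrix is made up (its first two rows are the small example from the question), and each half is assumed to contain at least two rows:

```python
from itertools import combinations
import numpy as np

def pair_sums(rows):
    """All element-by-element sums for every pair of rows (hypothetical helper)."""
    rows = np.asarray(rows)
    chunks = [(rows[i][:, np.newaxis] + rows[j]).ravel()
              for i, j in combinations(range(len(rows)), 2)]
    return np.concatenate(chunks)

data = np.array([[1, 2, 3],
                 [3, 2, 1],
                 [10, 20, 30],
                 [1, 1, 1]])

divider = len(data) // 2              # rows per half, whatever the matrix size
first = pair_sums(data[:divider])     # sums within the first n/2 rows
second = pair_sums(data[divider:])    # sums within the last n/2 rows
print(first)
print(second)
```

Slicing with divider removes the hard-coded row numbers 0, 1, 2, 3 from the loops.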
