Unexpected results with functools reduce in python - python

When I run the code below, I just get the first row (split) of the matrix back, not the sum of its elements as I expected. Is my understanding incorrect, or did I do something stupid?
My objective is to get the sum of all elements in each row.
import numpy as np
from functools import reduce
matrix = 100*np.random.rand(4,4)
matrix=matrix.astype(int)
print(matrix)
s_matrix = np.vsplit(matrix, 4)
sum_test = reduce((lambda a,b : a+b), list(s_matrix[0]))
print(sum_test)
Output:
[[79 75 33 26]
[49 45 16 19]
[58 33 83 55]
[40 14 2 93]]
[79 75 33 26]
Expected:
[213, 129, 229, 149]

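Check the expression you're using: print(list(s_matrix[0])). I think you'll find that it's a list holding a single row, because s_matrix[0] is still a 2-D array of shape (1, 4):
[array([79, 75, 33, 26])]
Thus reduce has only one element to work with and returns it unchanged; the lambda is never called.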
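A quick way to check this is to inspect what reduce actually receives. The sketch below uses the matrix printed in the question as fixed input (rather than random data) so the numbers can be compared with the expected output; the variable names first_block, items and row_total are illustrative only.

```python
import numpy as np
from functools import reduce

# The matrix printed in the question, hard-coded for reproducibility.
matrix = np.array([[79, 75, 33, 26],
                   [49, 45, 16, 19],
                   [58, 33, 83, 55],
                   [40, 14,  2, 93]])

first_block = np.vsplit(matrix, 4)[0]   # shape (1, 4): still 2-D
items = list(first_block)               # one element: the row itself

# With a single item and no initial value, reduce never calls the lambda
# and simply returns that item.
result = reduce(lambda a, b: a + b, items)

# Iterating over the 1-D row instead hands reduce the four scalars.
row_total = reduce(lambda a, b: a + b, first_block[0])

print(first_block.shape, len(items))
print(result)
print(row_total)
```

Indexing the split piece once more (first_block[0]) gives the 1-D row, and only then does reduce see individual numbers.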

You can use reduce() for this by continually adding the results to a list in the accumulator:
import numpy as np
from functools import reduce
matrix = 100*np.random.rand(4,4)
matrix=matrix.astype(int)
sum_test = reduce(lambda a,b : a+[sum(b)], list(matrix), [])
print(sum_test)
...but you really shouldn't. One of the major points of NumPy is that you can avoid explicit Python loops. Instead, just use the array's sum method. You can pass it an axis to tell it to sum each row instead of the whole array:
import numpy as np
matrix = np.random.randint(0, 100, [4,4])
print(matrix)
print(matrix.sum(axis = 1))
Result
[[64 89 97 15]
[12 47 81 31]
[52 81 37 78]
[27 64 79 50]]
[265 171 248 220]

sum_test = reduce((lambda a,b : a+b), list(s_matrix[0]))
The line above is your problem:
you are passing only the first row of your matrix to reduce instead of the whole matrix.
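One caveat when passing the whole matrix: with a lambda of a+b, reduce adds the rows to each other element-wise, which yields column sums. A minimal sketch, again assuming the matrix printed in the question, shows that reducing over the transpose gives the row sums the question asks for:

```python
import numpy as np
from functools import reduce

matrix = np.array([[79, 75, 33, 26],
                   [49, 45, 16, 19],
                   [58, 33, 83, 55],
                   [40, 14,  2, 93]])

# Reducing over the rows adds them element-wise: one total per column.
col_sums = reduce(lambda a, b: a + b, list(matrix))

# Reducing over the columns (rows of the transpose): one total per row.
row_sums = reduce(lambda a, b: a + b, list(matrix.T))

print(col_sums)
print(row_sums)
```

In practice matrix.sum(axis=1) does the same job without reduce.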

Related

Select only values below 50 from array, add 5 then multiply by 2. The other values should remain unchanged

I have a python array that i got using
array = np.arange(2,201,2).reshape(25,4)
which gave me this:
[[ 2 4 6 8]
[ 18 20 22 24]
[ 34 36 38 40]
[ 50 52 54 56]
[ 66 68 70 72]
[ 82 84 86 88]
[ 98 100 102 104]
[114 116 118 120]
[130 132 134 136]
[146 148 150 152]
[162 164 166 168]
[178 180 182 184]
[194 196 198 200]]
but now i'm instructed to select only the values below 50 from "array", add 5 to these values, and then multiply by 2. The other values should remain unchanged and everything should be saved as "array". This is a school assignment so I don't have the output but basically the output should be the array in the same 25x4 shape and the first ~3 rows will be changed (since those are the ones under 50) and the other rows/values will be the same (since they're over 50). I've tried the following code:
for i in array:
    if array < 50:
        print((i+5)*2)
    else:
        print(i)
and I'm getting an error that says -
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
any help would be greatly appreciated since I can't find any other articles with similar questions
There are 2 ways to address this question. A Python one and a numpy one (numpy is not Python...).
Python way:
You have a sequence of sequence containers. You can use a double iteration to test the values one at a time and replace the ones that have to be:
for row in array:                       # iterate over the rows
    for i, val in enumerate(row):       # then the values in the row
        if val < 50:                    # test them
            row[i] = (val + 5) * 2      # and replace
This works as long as the outer iteration gives you direct access to the row container. This is true for both Python containers (lists) and numpy arrays, but is not guaranteed for every type of container. The super safe way is to keep the indexes and modify array directly:
for i in range(len(array)):
    for j in range(len(array[i])):
        if array[i][j] < 50:
            array[i][j] = (array[i][j] + 5) * 2
Numpy way:
The power of numpy is that it provides high-speed iteration over its arrays. In numpy terminology this is called vectorization. You should first extract the relevant indexes and then change the values in one single vectorized operation:
ix = np.where(array < 50)
array[ix] = (array[ix] + 5) * 2
For large arrays, this second way should be at least an order of magnitude faster than the first one.
For your question, the correct way is the one that matches your current lesson, either Python or numpy...
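The ValueError in the question arises because array < 50 compares the whole array at once and yields a boolean array, which an if statement cannot reduce to a single True or False. That same boolean array can be used directly as a mask. The sketch below is one way to write it; the names original, mask and expected are illustrative, and the np.where line is only there as a cross-check.

```python
import numpy as np

array = np.arange(2, 201, 2).reshape(25, 4)
original = array.copy()            # kept only for the cross-check below

mask = array < 50                  # boolean array, same shape as array
array[mask] = (array[mask] + 5) * 2

# Equivalent non-in-place form: pick the new value where the mask is True.
expected = np.where(original < 50, (original + 5) * 2, original)

print(array[:7])
```

Because the mask is computed before any value is modified, results that end up above 50 are not processed a second time.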
import numpy as np
array = np.arange(2,201,2).reshape(25,4)
values = [ (element+5)*2 if element < 50 else element for innerList in array for element in innerList ]
print(values)

How to concatenate 4 numpy matrices along the x axis?

I am trying to concatenate 4 numpy matrices along the x axis. Below is the code I have written.
print(dt.shape)
print(condition.shape)
print(uc.shape)
print(rt.shape)
x = np.hstack((dt, condition, uc, rt))
print(x.shape)
I am getting the following output.
(215063, 1)
(215063, 1112)
(215063, 1)
(215063, 1)
I am getting the following error.
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
Final output should be
(215063, 1115)
I would recommend using numpy.concatenate. I used it to merge two images into a single image. It lets you concatenate along either of the two axes, X or Y. For more info on this visit this link
Your code is OK. To confirm it, I performed the following test
on smaller arrays:
dt = np.arange(1,6).reshape(-1,1)
condition = np.arange(11,41).reshape(-1,6)
uc = np.arange(81,86).reshape(-1,1)
rt = np.arange(41,46).reshape(-1,1)
print(dt.shape, condition.shape, uc.shape, rt.shape)
x = np.hstack((dt, condition, uc, rt))
print(x.shape)
print(x)
and got:
(5, 1) (5, 6) (5, 1) (5, 1)
(5, 9)
[[ 1 11 12 13 14 15 16 81 41]
[ 2 17 18 19 20 21 22 82 42]
[ 3 23 24 25 26 27 28 83 43]
[ 4 29 30 31 32 33 34 84 44]
[ 5 35 36 37 38 39 40 85 45]]
So there is probably something wrong with your data.
Try running np.hstack on your own arrays, leaving out one of them in turn.
If the call succeeds when a particular array is left out, then
that array is the source of the problem.
You should then look closely at that array to find out what is wrong with it.
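A quicker check than dropping arrays one by one is to print the type, ndim and shape of every input before stacking. The sketch below uses small hypothetical stand-ins for the arrays in the question; condition is deliberately made 1-D here so that np.hstack would fail in the same way, and the reshape step is one possible repair, not necessarily the right one for the real data.

```python
import numpy as np

# Hypothetical stand-ins; 'condition' is 1-D on purpose to trigger the mismatch.
dt = np.arange(5).reshape(-1, 1)
condition = np.arange(5)
uc = np.arange(5).reshape(-1, 1)
rt = np.arange(5).reshape(-1, 1)

arrays = {"dt": dt, "condition": condition, "uc": uc, "rt": rt}

# Report what each input really is; the odd one out has a different ndim.
for name, a in arrays.items():
    print(name, type(a).__name__, getattr(a, "ndim", None), getattr(a, "shape", None))

# Force every input to 2-D with one row per sample, then stack column-wise.
fixed = [a.reshape(len(a), -1) for a in arrays.values()]
x = np.hstack(fixed)
print(x.shape)
```

If one of the real inputs turns out not to be a plain ndarray at all, the type column in the printout will show that too.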

numpy: take multiple range subsets of the same size

What I'm looking for
# I have an array
x = np.arange(0, 100)
# I have a size n
n = 10
# I have a random set of numbers
indexes = np.random.randint(n, 100, 10)
# What I want is a matrix where every row i is the i-th element of indexes plus the previous n elements
res = np.empty((len(indexes), n), int)
for (i, v) in np.ndenumerate(indexes):
    res[i] = x[v-n:v]
To reformulate, as I wrote in the title, what I am looking for is a way to take multiple subsets (of the same size) of an initial array.
Just to add a detail: this loopy version works, I just want to know if there is a numpyish way to achieve this more elegantly.
The following does what you are asking for. It uses numpy.lib.stride_tricks.as_strided to create a special view on the data which can be indexed in the desired way.
import numpy as np
from numpy.lib import stride_tricks
x = np.arange(100)
k = 10
i = np.random.randint(k, len(x)+1, size=(5,))
xx = stride_tricks.as_strided(x, strides=np.repeat(x.strides, 2), shape=(len(x)-k+1, k))
print(i)
print(xx[i-k])
Sample output:
[ 69 85 100 37 54]
[[59 60 61 62 63 64 65 66 67 68]
[75 76 77 78 79 80 81 82 83 84]
[90 91 92 93 94 95 96 97 98 99]
[27 28 29 30 31 32 33 34 35 36]
[44 45 46 47 48 49 50 51 52 53]]
A bit of explanation. Arrays store not only data but also a small "header" with layout information. Among this information are the strides, which tell how to translate linear memory to nd. There is a stride for each dimension, which is just the offset at which the next element along that dimension can be found. So the strides for a 2d array are (row offset, element offset). as_strided lets you manipulate an array's strides directly; by setting the row offset to the same value as the element offset we create a view that looks like
0 1 2 ...
1 2 3 ...
2 3 4
. .
. .
. .
Note that no data are copied at this stage; for example, all the 2s refer to the same memory location in the original array. This is why this solution should be quite efficient.
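Newer NumPy releases (1.20 and later) ship numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of overlapping view without setting strides by hand. The sketch below is an alternative to the as_strided call, not the answerer's method; the seed and the names windows and res are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100)
k = 10

rng = np.random.default_rng(0)              # fixed seed, for repeatability only
i = rng.integers(k, len(x) + 1, size=5)     # end positions, k <= i <= len(x)

windows = sliding_window_view(x, k)         # read-only view, shape (len(x)-k+1, k)
res = windows[i - k]                        # row j holds x[i[j]-k : i[j]]

print(i)
print(res)
```

The view is read-only by default, which guards against the accidental writes that a hand-built as_strided view allows.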

Index two sets of columns in an array

I am trying to slice columns out of an array and assign to a new variable, like so.
array1 = array[:,[0,1,2,3,15,16,17,18,19,20]]
Is there a short cut for something like this?
I tried this, but it threw an error:
array1 = array[:,[0:3,15:20]]
This is probably really simple but I can't find it anywhere.
Use np.r_:
Translates slice objects to concatenation along the first axis.
import numpy as np
arr = np.arange(100).reshape(5, 20)
cols = np.r_[:3, 15:20]
print(arr[:, cols])
[[ 0 1 2 15 16 17 18 19]
[20 21 22 35 36 37 38 39]
[40 41 42 55 56 57 58 59]
[60 61 62 75 76 77 78 79]
[80 81 82 95 96 97 98 99]]
At the end of the day, probably only a little less verbose than what you have now, but could come in handy for more complex cases.
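For plain Python lists, the most straightforward way is to use concatenation:
array1 = array[0:3] + array[15:20]
This does not carry over to NumPy arrays, where + adds element-wise instead of concatenating, so for the 2D array in the question the column slices have to be joined explicitly. NumPy's s_ only builds slice objects: a comma-separated pair such as np.s_[0:3, 15:20] indexes two different axes rather than joining two ranges along one axis, which is what np.r_ does. You can read about it here.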
Also, if your slice follows a pattern (i.e. get 5, skip 10, get 5 etc), you can use itertools.compress, as explained by user ncoghlan in this answer.
You could use list(range(0, 4)) + list(range(15, 20))
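For a 2D NumPy array, joining two column ranges can also be spelled out with np.concatenate. This is a minimal sketch on the same 5x20 example array used earlier in this thread; it takes columns 0-2 and 15-19, matching the attempted slice in the question.

```python
import numpy as np

arr = np.arange(100).reshape(5, 20)

# Slice each column range separately, then join them side by side.
array1 = np.concatenate((arr[:, 0:3], arr[:, 15:20]), axis=1)

# The same selection through an index array built by np.r_.
same = arr[:, np.r_[0:3, 15:20]]

print(array1)
```

Both forms return a new array rather than a view, since the selected columns are not contiguous in the source.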

Find multiple maximum values in a 2d array fast

The situation is as follows:
I have a 2D numpy array. Its shape is (1002, 1004). Each element contains a value between 0 and Inf. What I now want to do is determine the first 1000 maximum values and store the corresponding indices in to a list named x and a list named y. This is because I want to plot the maximum values and the indices actually correspond to real time x and y position of the value.
What I have so far is:
x = numpy.zeros(500)
y = numpy.zeros(500)
for idx in range(500):
    x[idx] = numpy.unravel_index(full.argmax(), full.shape)[0]
    y[idx] = numpy.unravel_index(full.argmax(), full.shape)[1]
    full[full == full.max()] = 0.
print(os.times())
Here full is my 2D numpy array. As can be seen from the for loop, I only determine the first 500 maximum values at the moment. This however already takes about 5 s. For the first 1000 maximum values, the user time should actually be around 0.5 s. I've noticed that a very time consuming part is setting the previous maximum value to 0 each time. How can I speed things up?
Thank you so much!
If you have numpy 1.8 or later, you can use the argpartition function or method.
Here's a script that calculates x and y:
import numpy as np
# Create an array to work with.
np.random.seed(123)
full = np.random.randint(1, 99, size=(8, 8))
# Get the indices for the largest `num_largest` values.
num_largest = 8
indices = (-full).argpartition(num_largest, axis=None)[:num_largest]
# OR, if you want to avoid the temporary array created by `-full`:
# indices = full.argpartition(full.size - num_largest, axis=None)[-num_largest:]
x, y = np.unravel_index(indices, full.shape)
print("full:")
print(full)
print("x =", x)
print("y =", y)
print("Largest values:", full[x, y])
print("Compare to: ", np.sort(full, axis=None)[-num_largest:])
Output:
full:
[[67 93 18 84 58 87 98 97]
[48 74 33 47 97 26 84 79]
[37 97 81 69 50 56 68 3]
[85 40 67 85 48 62 49 8]
[93 53 98 86 95 28 35 98]
[77 41 4 70 65 76 35 59]
[11 23 78 19 16 28 31 53]
[71 27 81 7 15 76 55 72]]
x = [0 2 4 4 0 1 4 0]
y = [6 1 7 2 7 4 4 1]
Largest values: [98 97 98 98 97 97 95 93]
Compare to: [93 95 97 97 97 98 98 98]
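As the output above shows, argpartition returns the largest values in no particular order. If the plot needs them ranked, only the selected candidates have to be sorted afterwards. The sketch below assumes random data of the shape given in the question and uses num_largest = 1000; the seed and the names flat and top are illustrative.

```python
import numpy as np

rng = np.random.default_rng(123)
full = rng.random((1002, 1004))
num_largest = 1000

flat = full.ravel()

# Unordered flat indices of the num_largest largest values.
top = np.argpartition(flat, flat.size - num_largest)[-num_largest:]

# Sort just those candidates by value, largest first.
top = top[np.argsort(flat[top])[::-1]]

x, y = np.unravel_index(top, full.shape)
print(full[x[:5], y[:5]])
```

Sorting 1000 candidates is cheap compared with sorting the full array of roughly a million elements.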
You could loop through the array as @Inspired suggests, but looping through NumPy arrays item by item tends to be slower than using NumPy functions, since the NumPy functions are written in C/Fortran while the item-by-item loop runs Python code.
So, although sorting is O(n log n), it may be quicker than a Python-based one-pass O(n) solution. Below np.unique performs the sort:
import numpy as np
def nlargest_indices(arr, n):
    uniques = np.unique(arr)
    threshold = uniques[-n]
    return np.where(arr >= threshold)
full = np.random.random((1002,1004))
x, y = nlargest_indices(full, 10)
print(full[x, y])
print(x)
# [ 2 7 217 267 299 683 775 825 853]
print(y)
# [645 621 132 242 556 439 621 884 367]
Here is a timeit benchmark comparing nlargest_indices (above) to
def nlargest_indices_orig(full, n):
    full = full.copy()
    x = np.zeros(n)
    y = np.zeros(n)
    for idx in range(n):
        x[idx] = np.unravel_index(full.argmax(), full.shape)[0]
        y[idx] = np.unravel_index(full.argmax(), full.shape)[1]
        full[full == full.max()] = 0.
    return x, y
In [97]: %timeit nlargest_indices_orig(full, 500)
1 loops, best of 3: 5 s per loop
In [98]: %timeit nlargest_indices(full, 500)
10 loops, best of 3: 133 ms per loop
For timeit purposes I needed to copy the array inside nlargest_indices_orig, lest full get mutated by the timing loop.
Benchmarking the copying operation:
def base(full, n):
    full = full.copy()
In [102]: %timeit base(full, 500)
100 loops, best of 3: 4.11 ms per loop
shows this added about 4ms to the 5s benchmark for nlargest_indices_orig.
Warning: nlargest_indices and nlargest_indices_orig may return different results if arr contains repeated values.
nlargest_indices finds the n largest values in arr and then returns the x and y indices corresponding to the locations of those values.
nlargest_indices_orig finds the n largest values in arr and then returns one x and y index for each large value. If there is more than one x and y corresponding to the same large value, then some locations where large values occur may be missed.
They also return indices in a different order, but I suppose that does not matter for your purpose of plotting.
If you want to know the indices of the n max/min values in the 2d array, my solution (for the largest) is
indx = divmod((-full).argpartition(num_largest,axis=None)[:num_largest],full.shape[1])
This finds the indices of the largest values in the flattened array and then recovers the row and column in the 2d array as the quotient and remainder of dividing by the number of columns.
Nevermind. Benchmarking shows the unravel method is twice as fast at least for num_largest = 3.
I'm afraid that the most time-consuming part is recalculating the maximum. In fact, you have to calculate the maximum of 1002*1004 numbers 500 times, which gives you about 500 million comparisons.
Probably you should write your own algorithm to find the solution in one pass: keep only the 1000 greatest numbers (or their indices) somewhere while scanning your 2D array (without modifying the source array). I think that some sort of binary heap (have a look at heapq) would be suitable for the storage.
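The one-pass heap idea can be sketched with heapq.nlargest from the standard library, which keeps a bounded heap internally while scanning. This is an illustration under assumptions: the array is smaller than the one in the question so it runs quickly, and the seed and names flat and top are made up for the example.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
full = rng.random((200, 300))     # smaller than the question's array, for speed
n = 10

flat = full.ravel()

# Scan the flat indices once, keeping the n whose values are largest.
# The source array is not modified.
top = heapq.nlargest(n, range(flat.size), key=flat.__getitem__)

x, y = np.unravel_index(top, full.shape)
print(full[x, y])                 # largest first
```

Because the scan runs in Python rather than compiled code, this is likely to be slower than the argpartition approach on large arrays, even though it makes a single pass.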
