numpy: take multiple range subsets of the same size - python

What I'm looking for
# I have an array
x = np.arange(0, 100)
# I have a size n
n = 10
# I have a random set of numbers
indexes = np.random.randint(n, 100, 10)
# What I want is a matrix where every row i is the i-th element of indexes plus the previous n elements
res = np.empty((len(indexes), n), int)
for (i, v) in np.ndenumerate(indexes):
    res[i] = x[v-n:v]
To reformulate, as I wrote in the title, what I am looking for is a way to take multiple subsets (of the same size) of an initial array.
Just to add a detail: this loopy version works; I just want to know if there is a more elegant, NumPy-idiomatic way to achieve this.

The following does what you are asking for. It uses numpy.lib.stride_tricks.as_strided to create a special view on the data which can be indexed in the desired way.
import numpy as np
from numpy.lib import stride_tricks
x = np.arange(100)
k = 10
i = np.random.randint(k, len(x)+1, size=(5,))
xx = stride_tricks.as_strided(x, strides=np.repeat(x.strides, 2), shape=(len(x)-k+1, k))
print(i)
print(xx[i-k])
Sample output:
[ 69 85 100 37 54]
[[59 60 61 62 63 64 65 66 67 68]
[75 76 77 78 79 80 81 82 83 84]
[90 91 92 93 94 95 96 97 98 99]
[27 28 29 30 31 32 33 34 35 36]
[44 45 46 47 48 49 50 51 52 53]]
A bit of explanation. Arrays store not only data but also a small "header" with layout information. Among this are the strides, which tell how to translate linear memory to n dimensions. There is a stride for each dimension, which is just the offset at which the next element along that dimension can be found. So the strides for a 2D array are (row offset, element offset). as_strided lets you manipulate an array's strides directly; by setting the row offset equal to the element offset we create a view that looks like
0 1 2 ...
1 2 3 ...
2 3 4
. .
. .
. .
Note that no data are copied at this stage; for example, all the 2s refer to the same memory location in the original array, which is why this solution should be quite efficient.
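On newer NumPy versions (1.20 and later, as far as I know), the same windowed view can be built with sliding_window_view, which works out the strides for you and returns a read-only view. The sketch below assumes that function is available; the seed is arbitrary and only there to make the run repeatable. The broadcasting variant at the end is an alternative that avoids stride tricks altogether, at the cost of copying the selected rows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100)
k = 10
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
i = rng.integers(k, len(x) + 1, size=5)

# Row j of `windows` is x[j:j+k]; this is a view, no data is copied.
windows = sliding_window_view(x, k)
res_view = windows[i - k]

# Alternative: build a (5, k) index matrix by broadcasting.
res_index = x[(i - k)[:, None] + np.arange(k)]

# Reference result, computed the same way as the loop in the question.
res_loop = np.array([x[v - k:v] for v in i])
```

All three results should hold the same rows; which one is preferable probably depends on how large x and k are.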

Related

How do I sort columns of numerical file data in python

I'm trying to write a piece of code in python to graph some data from a tab separated file with numerical data.
I'm very new to Python so I would appreciate it if any help could be dumbed down a little bit.
Basically, I have this file and I would like to take two columns from it, sort them each in ascending order, and then graph those sorted columns against each other.
First of all, you should not post code as images, since the editor here has functionality to insert and format code.
It's as simple as calling x.sort() and y.sort(), since both of them are slices from data, so that should work fine (assuming they are 1-dimensional arrays).
Here is an example:
import numpy as np
array = np.random.randint(0, 100, size=20)
print(array)
Output:
[89 47 4 10 29 21 91 95 32 12 97 66 59 70 20 20 36 79 23 4]
So if we use the method mentioned before:
array.sort()  # sorts in place and returns None
print(array)
Output:
[ 4 4 10 12 20 20 21 23 29 32 36 47 59 66 70 79 89 91 95 97]
Easy as that :)
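One detail worth keeping in mind: ndarray.sort() sorts in place and returns None, while np.sort() returns a sorted copy and leaves its input untouched. A small sketch of the difference, using made-up values:

```python
import numpy as np

values = np.array([89, 47, 4, 10, 29])

# np.sort returns a new sorted array; `values` itself is unchanged.
sorted_copy = np.sort(values)

# ndarray.sort sorts in place and returns None.
in_place = values.copy()
result = in_place.sort()
```

If x and y are views into the same data array, calling x.sort() would also reorder that column inside data, so np.sort may be the safer choice when the original table is still needed.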

Iteration through a 3D array using a 2D query window by using Numpy transpose

This question is a generalized version of a question which I have asked before:
Reshaping a Numpy Array into lexicographical list of cubes of shape (n, n, n)
The question is, given an nd-array of shape (x, y, z) and a query window (p, q), with the restriction that x % p == 0 and y % q == 0, how do I transpose the matrix in such a way that it has shape (p, q, -1) and maintains the ordering proposed in the original question. The idea is that I can quickly take slices of a specific shape instead of having to iterate to the relevant indices.
In the original post, this answer was proposed:
N = 4
a = np.arange(N**3).reshape(N,N,N)
b = a.reshape(2,N//2,2,N//2,N).transpose(1,3,0,2,4).reshape(N//2,N//2,N*4)
with output:
print(b):
[[[ 0 1 2 3 8 9 10 11 32 33 34 35 40 41 42 43]
[ 4 5 6 7 12 13 14 15 36 37 38 39 44 45 46 47]]
[[16 17 18 19 24 25 26 27 48 49 50 51 56 57 58 59]
[20 21 22 23 28 29 30 31 52 53 54 55 60 61 62 63]]]
This would correspond to input shape (4, 4, 4), query shape (2, 2) and output shape (2, 2, -1).
The accepted answer in the original question is close to what I need, but its output shape is dependent on the shape of the nd-array. That is not the behavior that I am looking for as I'd like to use any query shape (p, q) for any input shape (x, y, z).
I am not very proficient in using Numpy transpose to implement these kinds of operations (I have tried to use this answer and generalize its myself without success), so it would be greatly appreciated if, when answered, the answer could be supplemented with a bit of an explanation about the approach which the answerer took or point to some resources which could help me out with this!
Hope that makes it clear!
It only needs a simple modification; think of (p, q) = (2, 2) in this case. So something like this:
a.reshape(p, x//p, q, y//q, -1).transpose(3,1,2,0,4).reshape(p,q,-1)
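As a cross-check, here is a sketch under one reading of the question: output[i, j] collects element (i, j) of every p × q block, in block order. It keeps the transpose order (1, 3, 0, 2, 4) from the answer quoted in the question, so for N = 4 and (p, q) = (2, 2) it reproduces the output shown there. The name window_view is hypothetical and chosen only for illustration.

```python
import numpy as np

def window_view(a, p, q):
    """Hypothetical helper: regroup an (x, y, z) array into shape (p, q, -1)."""
    x, y, z = a.shape
    assert x % p == 0 and y % q == 0
    # Split each of the first two axes into (block index, offset inside block),
    # then move the two offset axes to the front and flatten the rest.
    return (a.reshape(x // p, p, y // q, q, z)
             .transpose(1, 3, 0, 2, 4)
             .reshape(p, q, -1))

N = 4
a = np.arange(N**3).reshape(N, N, N)
b = window_view(a, 2, 2)

# Non-cubic input: shape (6, 4, 3) with a (3, 2) window.
c = np.arange(6 * 4 * 3).reshape(6, 4, 3)
w = window_view(c, 3, 2)
```

Under this reading, w[i, j] equals c[i::p, j::q].ravel(), which gives a quick way to check whether the ordering matches what you need.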

Remove Specific Indices From 2D Numpy Array

If I have a set of data that's of shape (1000,1000) and I know that the values I need from it are contained within the indices (25:888,11:957), how would I go about separating the two sections of data from one another?
I couldn't figure out how to get np.delete() to like the specific 2D case and I also need both the good and the bad sections of data for analysis, so I can't just specify my array bounds to be within the good indices.
I feel like there's a simple solution I'm missing here.
Is this how you want to divide the array?
In [364]: arr = np.ones((1000,1000),int)
In [365]: beta = arr[25:888, 11:957]
In [366]: beta.shape
Out[366]: (863, 946)
In [367]: arr[:25,:].shape
Out[367]: (25, 1000)
In [368]: arr[888:,:].shape
Out[368]: (112, 1000)
In [369]: arr[25:888,:11].shape
Out[369]: (863, 11)
In [370]: arr[25:888,957:].shape
Out[370]: (863, 43)
I'm imagining a square with a rectangle cut out of the middle. It's easy to specify that rectangle, but the frame has to be viewed as 4 rectangles, unless it is described via a mask of what is missing.
Checking that I got everything:
In [376]: x = np.array([_366,_367,_368,_369,_370])
In [377]: np.multiply.reduce(x, axis=1).sum()
Out[377]: 1000000
Let's say your original numpy array is my_arr
Extracting the "Good" Section:
This is easy because the good section has a rectangular shape.
good_arr = my_arr[25:888, 11:957]
Extracting the "Bad" Section:
The "bad" section doesn't have a rectangular shape. Rather, it has the shape of a rectangle with a rectangular hole cut out of it.
So, you can't really store the "bad" section alone, in any array-like structure, unless you're ok with wasting some extra space to deal with the cut out portion.
What are your options for the "Bad" Section?
Option 1:
Be happy and content with having extracted the good section. Let the bad section remain as part of the original my_arr. While iterating through my_arr, you can always discriminate between good and bad items based on the indices. The disadvantage is that, whenever you want to process only the bad items, you have to do it through a nested double loop, rather than using NumPy's vectorized features.
Option 2:
Suppose you want to perform operations such as row-wise or column-wise totals on only the bad items of my_arr, and suppose you don't want the overhead of nested for loops. You can create a NumPy masked array. With a masked array, you can perform most of your usual NumPy operations, and NumPy will automatically exclude masked-out items from the calculations. Note that internally some extra memory is needed, just to record which items are masked.
The code below illustrates how you can create a masked array called masked_arr from your original array my_arr:
import numpy as np
my_size = 10 # In your case, 1000
r_1, r_2 = 2, 8 # In your case, r_1 = 25, r_2 = 888 (the end of a range is exclusive, as in the slice 25:888)
c_1, c_2 = 3, 5 # In your case, c_1 = 11, c_2 = 957 (exclusive, as in the slice 11:957)
# Using nested list comprehension, build a boolean mask as a list of lists, of shape (my_size, my_size).
# The mask will have False everywhere, except in the sub-region [r_1:r_2, c_1:c_2], which will have True.
mask_list = [[True if ((r in range(r_1, r_2)) and (c in range(c_1, c_2))) else False
              for c in range(my_size)] for r in range(my_size)]
# Your original, complete 2d array. Let's just fill it with some "toy data"
my_arr = np.arange((my_size * my_size)).reshape(my_size, my_size)
print (my_arr)
masked_arr = np.ma.masked_where(mask_list, my_arr)
print ("masked_arr is:\n", masked_arr, ", and its shape is:", masked_arr.shape)
The output of the above is:
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
masked_arr is:
[[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 -- -- 25 26 27 28 29]
[30 31 32 -- -- 35 36 37 38 39]
[40 41 42 -- -- 45 46 47 48 49]
[50 51 52 -- -- 55 56 57 58 59]
[60 61 62 -- -- 65 66 67 68 69]
[70 71 72 -- -- 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]] , and its shape is: (10, 10)
Now that you have a masked array, you will be able to perform most of the numpy operations on it, and numpy will automatically exclude the masked items (the ones that appear as "--" when you print the masked array)
Some examples of what you can do with the masked array:
# Now, you can print column-wise totals, of only the bad items.
print (masked_arr.sum(axis=0))
# Or row-wise totals, for that matter.
print (masked_arr.sum(axis=1))
The output of the above is:
[450 460 470 192 196 500 510 520 530 540]
[45 145 198 278 358 438 518 598 845 945]
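The boolean mask can also be built without a Python-level loop, by assigning to a slice of an all-False array. A minimal sketch with the same toy dimensions as above; inverting the mask gives a second masked array that exposes only the good section:

```python
import numpy as np

my_size = 10
r_1, r_2 = 2, 8
c_1, c_2 = 3, 5

my_arr = np.arange(my_size * my_size).reshape(my_size, my_size)

# True inside the good rectangle, False everywhere else.
good_mask = np.zeros(my_arr.shape, dtype=bool)
good_mask[r_1:r_2, c_1:c_2] = True

# Masking the good rectangle leaves only the bad (frame) items visible.
bad_only = np.ma.masked_array(my_arr, mask=good_mask)
# Masking everything else leaves only the good items visible.
good_only = np.ma.masked_array(my_arr, mask=~good_mask)

col_totals_bad = bad_only.sum(axis=0)
row_totals_bad = bad_only.sum(axis=1)
```

If the 2D layout of the bad items is not needed, my_arr[~good_mask] returns them as a flat 1D array.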

Understand tensorflow slice operation

I am confused about the following code:
import tensorflow as tf
import numpy as np
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.framework import dtypes
'''
Randomly crop a tensor, then return the crop position
'''
def random_crop(value, size, seed=None, name=None):
    with ops.name_scope(name, "random_crop", [value, size]) as name:
        value = ops.convert_to_tensor(value, name="value")
        size = ops.convert_to_tensor(size, dtype=dtypes.int32, name="size")
        shape = array_ops.shape(value)
        check = control_flow_ops.Assert(
            math_ops.reduce_all(shape >= size),
            ["Need value.shape >= size, got ", shape, size],
            summarize=1000)
        shape = control_flow_ops.with_dependencies([check], shape)
        limit = shape - size + 1
        begin = tf.random_uniform(
            array_ops.shape(shape),
            dtype=size.dtype,
            maxval=size.dtype.max,
            seed=seed) % limit
        return tf.slice(value, begin=begin, size=size, name=name), begin
sess = tf.InteractiveSession()
size = [10]
a = tf.constant(np.arange(0, 100, 1))
print (a.eval())
a_crop, begin = random_crop(a, size = size, seed = 0)
print ("offset: {}".format(begin.eval()))
print ("a_crop: {}".format(a_crop.eval()))
a_slice = tf.slice(a, begin=begin, size=size)
print ("a_slice: {}".format(a_slice.eval()))
assert (tf.reduce_all(tf.equal(a_crop, a_slice)).eval() == True)
sess.close()
outputs:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
offset: [46]
a_crop: [89 90 91 92 93 94 95 96 97 98]
a_slice: [27 28 29 30 31 32 33 34 35 36]
There are two tf.slice calls:
(1). called in function random_crop, such as tf.slice(value, begin=begin, size=size, name=name)
(2). called as a_slice = tf.slice(a, begin=begin, size=size)
The parameters (value, begin and size) of those two slice operations are the same.
However, why are the printed values of a_crop and a_slice different, while tf.reduce_all(tf.equal(a_crop, a_slice)).eval() is True?
Thanks
EDIT1
Thanks @xdurch0, I understand the first question now.
TensorFlow's random_uniform seems to behave like a random generator.
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
size = [10]
np_begin = np.random.randint(0, 50, size=1)
tf_begin = tf.random_uniform(shape = [1], minval=0, maxval=50, dtype=tf.int32, seed = 0)
a = tf.constant(np.arange(0, 100, 1))
a_slice = tf.slice(a, np_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, np_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, tf_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, tf_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
sess.close()
output
a_slice: [42 43 44 45 46 47 48 49 50 51]
a_slice: [42 43 44 45 46 47 48 49 50 51]
a_slice: [41 42 43 44 45 46 47 48 49 50]
a_slice: [29 30 31 32 33 34 35 36 37 38]
The confusing thing here is that tf.random_uniform (like every random operation in TensorFlow) produces a new, different value on each evaluation call (each call to .eval() or, in general, each call to tf.Session.run). So if you evaluate a_crop you get one thing, and if you then evaluate a_slice you get a different thing, but if you evaluate tf.reduce_all(tf.equal(a_crop, a_slice)) you get True, because everything is computed in a single evaluation step, so only one random value is produced and it determines the value of both a_crop and a_slice. As another example, if you run tf.stack([a_crop, a_slice]).eval() you will get a tensor with two equal rows; again, only one random value was produced. More generally, if you call tf.Session.run with multiple tensors to evaluate, all the computations in that call will use the same random values.
As a side note, if you actually need a random value in a computation that you want to keep for a later computation, the easiest thing would be to retrieve it with tf.Session.run, along with any other needed computation, and feed it back later through feed_dict; or you could have a tf.Variable and store the random value there. A more advanced possibility would be to use partial_run, an experimental API that allows you to evaluate part of the computation graph and continue evaluating it later, while maintaining the same state (i.e. the same random values, among other things).

Index two sets of columns in an array

I am trying to slice columns out of an array and assign to a new variable, like so.
array1 = array[:,[0,1,2,3,15,16,17,18,19,20]]
Is there a short cut for something like this?
I tried this, but it threw an error:
array1 = array[:,[0:3,15:20]]
This is probably really simple but I can't find it anywhere.
Use np.r_:
Translates slice objects to concatenation along the first axis.
import numpy as np
arr = np.arange(100).reshape(5, 20)
cols = np.r_[:3, 15:20]
print(arr[:, cols])
[[ 0 1 2 15 16 17 18 19]
[20 21 22 35 36 37 38 39]
[40 41 42 55 56 57 58 59]
[60 61 62 75 76 77 78 79]
[80 81 82 95 96 97 98 99]]
At the end of the day, probably only a little less verbose than what you have now, but could come in handy for more complex cases.
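For simple cases like this with plain Python lists, the most straightforward way is to use concatenation:
array1 = array[0:3] + array[15:20]
Note that for NumPy arrays + means element-wise addition, so there you would use np.concatenate on the two column slices instead.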
For more complicated cases, you'll need an index helper such as NumPy's r_, which accepts multiple slices with gaps, separated by commas (s_, by contrast, builds a tuple of slice objects, one per axis). You can read about it here.
Also, if your slice follows a pattern (i.e. get 5, skip 10, get 5 etc), you can use itertools.compress, as explained by user ncoghlan in this answer.
You could use list(range(0, 4)) + list(range(15, 20))
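For reference, a short sketch comparing these approaches on a 2D array. It assumes the goal is columns 0 to 2 and 15 to 19, as in the np.r_ example above; adjust the bounds if the end columns should be included.

```python
import numpy as np

arr = np.arange(100).reshape(5, 20)

# np.r_ expands the slices into one integer index array.
cols_r = np.r_[0:3, 15:20]

# Plain Python ranges concatenated as lists give the same indices.
cols_list = list(range(0, 3)) + list(range(15, 20))

via_r = arr[:, cols_r]
via_list = arr[:, cols_list]

# Concatenating two column slices also works, without an index array.
via_concat = np.concatenate((arr[:, 0:3], arr[:, 15:20]), axis=1)
```

All three produce the same (5, 8) result; the index-array forms are probably the most convenient when the same column selection is reused in several places.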
