How to concatenate a 2D array with a chunk size - Python

I have the following 2D array and I would like to find a way to generate another 2D array whose rows are concatenated in chunks of a given size.
array_2d = [
    [0,0,0,0,0,0,0,1],
    [0,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,1,1],
    [0,0,0,0,0,1,1,1]
]
For example, with a chunk size of 2 the above 2D array will be changed to:
array_2d = [
    [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1],
    [0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1]
]
Note that the last chunk has been zero-padded on the left with a row of zeros, since there were not enough rows left to fill it.
Thanks for your help.

import numpy as np

array_2d = [
    [0,0,0,0,0,0,0,1],
    [0,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,1,1],
    [0,0,0,0,0,1,1,1]
]

def chunkconcat(chunk, init_data):
    data_arr = init_data
    # pad with rows of zeros until the row count is a multiple of the chunk size
    while (len(data_arr) % chunk) != 0:
        data_arr.append([0 for _ in range(len(data_arr[0]))])
    # divide the rows into groups of `chunk` rows
    divided_data = [data_arr[i * chunk:(i + 1) * chunk] for i in range(len(data_arr) // chunk)]
    print(divided_data)
    new_arr = []
    # take each group in reverse order and concatenate its rows into one long row
    for group in divided_data:
        tmp = []
        for _ in range(chunk):
            tmp.append(group[-1])
            group.pop()
        new_arr.append(np.concatenate(tmp))
    final_data = np.array(new_arr)
    print(final_data)

chunkconcat(2, array_2d)
I wrote it out explicitly so you can follow the solution and build your own variations on it.
Basically, you start by padding with rows of zeros so that the number of rows is divisible by the chunk size.
Afterwards you divide the data set into groups of the desired chunk size. Reversing each group into a temporary list gets the order right before concatenating, and the concatenated row is then appended to the final list ('matrix').
The final step converts it into a numpy array so you have a real matrix.
Using predefined functions and list comprehensions you could solve this in a few lines, but I guess that would not be as instructive.
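For completeness, here is a compact sketch of that idea (not part of the original answer; it assumes the same reversed-within-chunk ordering and the array_2d defined above):
import numpy as np

def chunkconcat_compact(chunk, init_data):
    rows = [list(r) for r in init_data]
    # pad with zero rows until the row count divides evenly by the chunk size
    while len(rows) % chunk:
        rows.append([0] * len(rows[0]))
    # reverse each group of `chunk` rows, then concatenate it into one long row
    return np.array([np.concatenate(rows[i:i + chunk][::-1])
                     for i in range(0, len(rows), chunk)])

print(chunkconcat_compact(2, array_2d))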
Take care.

Related

Fastest way to create a numpy array with consecutive integers but ignoring a specific number

I need to generate a numpy array filled with consecutive numbers but ignoring a specific number.
For example, I need a numpy array from 0 to 5 but ignoring 3. The result would be [0, 1, 2, 4, 5].
My current solution is very slow when the array size I need is large. Here is my testing code; it took 2m34s on my i7-6770 machine with Python 3.6.5.
import numpy as np
length = 150000
for _ in range(10000):
    skip = np.random.randint(length)
    indexing = np.asarray([i for i in range(length) if i != skip])
Hence, I would like to know if there's a better way. Thanks
Instead of ignoring a number, split your array into two ranges, leaving the number you're ignoring out. Then use np.arange to make the arrays and concatenate them.
def range_with_ignore(start, stop, ignore):
    return np.concatenate([
        np.arange(start, ignore),
        np.arange(ignore + 1, stop)
    ])
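For example, applied to the case from the question:
>>> range_with_ignore(0, 6, 3)
array([0, 1, 2, 4, 5])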

Speeding up fancy indexing with numpy

I have two numpy arrays, each with a shape of (10000, 10000).
One is a value array and the other is an index array.
Value = np.random.rand(10000, 10000)
Index = np.random.randint(0, 1000, (10000, 10000))
I want to make a list (or a 1D numpy array) by summing the values in the "Value" array grouped by the "Index" array. For example, for each index i, find the positions where Index equals i and sum the corresponding values:
NewArray = np.zeros(1000)  # one slot per possible index value
for i in range(1000):
    NewArray[i] = np.sum(Value[np.where(Index == i)])
However, this is too slow since I have to run this loop over 300,000 arrays.
I tried to come up with some logical indexing method like
NewArray[Index] += Value[Index]
But it didn't work.
The next thing I tried was using a dictionary:
for k, v in list(zip(Index.flatten(), Value.flatten())):
    NewDict[k].append(v)  # NewDict here is a collections.defaultdict(list)
and
for i in NewDict:
    NewDict[i] = np.sum(NewDict[i])
But it was slow too.
Is there any smart way to speed this up?
I had two thoughts. First, try masking; it speeds this up by about 4x:
for i in range(1000):
    NewArray[i] = np.sum(Value[Index == i])
Alternatively, you can sort your arrays to put the values you're adding together in contiguous memory space. Masking or using where() has to gather all your values together each time you call sum on the slice. By front-loading this gathering, you might be able to speed things up considerably:
# flatten your arrays
vals = Value.ravel()
inds = Index.ravel()

s = np.argsort(inds)       # these are the indices that will sort your Index array
v_sorted = vals[s].copy()  # the copy orders the values in memory instead of just providing a view
i_sorted = inds[s].copy()

# 1 greater than your max index value; this gives you the end of each run of equal indices
searches = np.searchsorted(i_sorted, np.arange(0, i_sorted[-1] + 2))

NewArray = np.zeros(len(searches) - 1)
for i in range(len(searches) - 1):
    st = searches[i]
    nd = searches[i + 1]
    NewArray[i] = v_sorted[st:nd].sum()
This method takes 26 sec on my computer vs. 400 sec using the old way. Good luck. If you want to read more about contiguous memory and performance, check this discussion out.
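As an aside (this was not in the original answer), when the index values are non-negative integers like these, np.bincount can compute all of the per-index sums in a single vectorized call, which may be worth trying as well:
# weights= makes bincount add up the Value entries for each Index value;
# minlength guarantees a slot for every index from 0 to 999
NewArray = np.bincount(Index.ravel(), weights=Value.ravel(), minlength=1000)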

How can I efficiently translate a large, non-rectangular 2D list to an even larger rectangular 2D array?

I have a 2-D list of shape (300,000, X), where each of the sublists has a different size (X) and contains integers between 0 and 25. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.
My first thought was to pad all sublists shorter than the longest sublist with a filler value (-1) in order to create a rectangular array. For my current dataset, the longest sublist has length 5037.
My conversion code is below:
for seq in new_format:
    seq.extend([-1] * (length - len(seq)))
However, when there are 300,000 sequences in new_format, and length-len(seq) is generally >4000, the process causes a MemoryError due to its enormous size. Most of the sublists become much longer when extended to size 5037 in order to equalize the sublists. How can I make this more space-efficient or avoid the problem entirely?
My advice? Don't construct a Python list to initialize the array. That will be too memory heavy. Since your values fall between 0 and 25, and you want a filler of -1, you can use np.int8.
First, initialize an adequately shaped array with the appropriate filler value:
>>> arr = np.full((300000, 5037), -1, dtype=np.int8)
Then simply loop over your existing data and set the values as needed.
>>> for i, row in enumerate(data):
... for j, val in enumerate(row):
... arr[i, j] = val
...
This will give you a nice and compact array of about 1.5 gigs:
>>> arr.nbytes*1e-9
1.5111
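If the element-by-element Python loop turns out to be slow for 300,000 rows, a small variation (a sketch, not from the original answer) is to assign each row with a single slice so NumPy copies the whole row at once:
>>> for i, row in enumerate(data):
...     arr[i, :len(row)] = row
...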

Null numpy array to be appended to

I'm writing feature selection code. Basically, I get the output from a featureselection function and concatenate it to the numpy array data:
data = np.zeros([1, 4114])  # put feature length here
for i in range(1, N):
    filename = splitpath + str(i) + '.tiff'
    feature = featureselection(filename)
    data = np.vstack((data, feature))
data = data[1:, :]  # remove the first zeros row
However, this is not a robust implementation, as I need to know the feature length (4114) beforehand.
Is there any null (empty) numpy array, like the empty list [] in Python?
Appending to a numpy array in a loop is inefficient; there might be some situations when it cannot be avoided, but this doesn't seem to be one of them. If you know the size of the array that you'll end up with, it's best to just pre-allocate the array, something like this:
data = np.zeros([N, 4114])
for i in range(1, N):
    filename = splitpath + str(i) + '.tiff'
    feature = featureselection(filename)
    data[i] = feature
Sometimes you don't know the size of the final array. There are several ways to deal with this case, but the simplest is probably to use a temporary list, something like:
data = []
for i in range(1, N):
    filename = splitpath + str(i) + '.tiff'
    feature = featureselection(filename)
    data.append(feature)
data = np.array(data)
Just for completeness, you can also do data = np.zeros([0, 4114]), but I would recommend against that and suggest one of the methods above.
If you don't want to assume the size before creating the first array, you can use lazy initialization.
data = None
for i in range(1, N):
    filename = splitpath + str(i) + '.tiff'
    feature = featureselection(filename)
    if data is None:
        data = np.zeros((0, feature.size))
    data = np.vstack((data, feature))

if data is None:
    print('no features')
else:
    print(data.shape)

Select cells randomly from NumPy array - without replacement

I'm writing some modelling routines in NumPy that need to select cells randomly from a NumPy array and do some processing on them. All cells must be selected without replacement (as in, once a cell has been selected it can't be selected again, but all cells must be selected by the end).
I'm transitioning from IDL where I can find a nice way to do this, but I assume that NumPy has a nice way to do this too. What would you suggest?
Update: I should have stated that I'm trying to do this on 2D arrays, and therefore get a set of 2D indices back.
How about using numpy.random.shuffle or numpy.random.permutation if you still need the original array?
If you need to change the array in place, then you can create an index array like this:
your_array = <some numpy array>
index_array = numpy.arange(your_array.size)
numpy.random.shuffle(index_array)
print(your_array[index_array[:10]])
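If you would rather not shuffle in place, numpy.random.permutation returns a shuffled result directly; a quick sketch of that variant:
index_array = numpy.random.permutation(your_array.size)  # an already-shuffled index array
print(your_array.ravel()[index_array[:10]])              # same idea as above, via a flat view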
All of these answers seemed a little convoluted to me.
I'm assuming that you have a multi-dimensional array from which you want to generate an exhaustive list of indices. You'd like these indices shuffled so you can then access each of the array elements in a random order.
The following code will do this in a simple and straight-forward manner:
#!/usr/bin/python
import numpy as np

# Define a two-dimensional array
# Use any number of dimensions, and dimensions of any size
d = np.zeros(30).reshape((5, 6))

# Get a list of indices for an array of this shape
indices = list(np.ndindex(d.shape))

# Shuffle the indices in-place
np.random.shuffle(indices)

# Access array elements using the indices to do cool stuff
for i in indices:
    d[i] = 5
print(d)
Printing d verified that all elements have been accessed.
Note that the array can have any number of dimensions and that the dimensions can be of any size.
The only downside to this approach is that if d is large, then indices may become pretty sizable. Therefore, it would be nice to have a generator. Sadly, I can't think of how to build a shuffled iterator off-handedly.
Extending the nice answer from @WoLpH.
For a 2D array I think it will depend on what you want or need to know about the indices.
You could do something like this:
data = np.arange(25).reshape((5, 5))
x, y = np.where(data == data)  # a condition that holds everywhere, so we get every index
idx = list(zip(x, y))
np.random.shuffle(idx)
OR
data = np.arange(25).reshape((5, 5))
grid = np.indices(data.shape)
idx = list(zip(grid[0].ravel(), grid[1].ravel()))
np.random.shuffle(idx)
You can then use the list idx to iterate over randomly ordered 2D array indices as you wish, and to get the values at that index out of the data which remains unchanged.
Note: you could also generate the randomly ordered indices via itertools.product, in case you are more comfortable with that set of tools.
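For example, a quick sketch of the itertools.product variant (using the same 5x5 data array as above):
import itertools

idx = list(itertools.product(range(data.shape[0]), range(data.shape[1])))
np.random.shuffle(idx)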
Use random.sample to generate ints in 0 .. A.size with no duplicates, then split them into index pairs:
import random
import numpy as np

def randint2_nodup(nsample, A):
    """ uniform int pairs, no dups:
        r = randint2_nodup(nsample, A)
        A[r]
        for jk in zip(*r):
            ... A[jk]
    """
    assert A.ndim == 2
    sample = np.array(random.sample(range(A.size), nsample))  # nodup ints
    return sample // A.shape[1], sample % A.shape[1]  # pairs

if __name__ == "__main__":
    import sys
    nsample = 8
    ncol = 5
    exec("\n".join(sys.argv[1:]))  # run this.py N= ...
    A = np.arange(0, 2 * ncol).reshape((2, ncol))
    r = randint2_nodup(nsample, A)
    print("r:", r)
    print("A[r]:", A[r])
    for jk in zip(*r):
        print(jk, A[jk])
Let's say you have an array of data points of size 8x3
data = np.arange(50,74).reshape(8,-1)
If you truly want to sample, as you say, all the indices as 2D pairs, the most compact way to do this that I can think of is:
# generate a permutation of data's size, coerced to data's shape
idxs = divmod(np.random.permutation(data.size), data.shape[1])
# iterate over it
for x, y in zip(*idxs):
    # do something to data[x, y] here
    pass
More generally, though, one often does not need to access a 2D array as a 2D array simply to shuffle it, in which case one can be yet more compact: just make a 1D view onto the array and save yourself some index-wrangling.
flat_data = data.ravel()
flat_idxs = np.random.permutation(flat_data.size)
for i in flat_idxs:
    # do something to flat_data[i] here
    pass
This still modifies the 2D "original" array as you'd like, since ravel() gives a view here. To see this, try:
flat_data[12] = 1000000
print(data[4, 0])
# returns 1000000
People using NumPy version 1.7 or later can also use the built-in function numpy.random.choice.
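For instance, a brief sketch of that approach: choice with replace=False draws flat indices without duplicates, and np.unravel_index turns them back into 2D pairs.
flat = np.random.choice(data.size, size=data.size, replace=False)  # every cell exactly once
rows, cols = np.unravel_index(flat, data.shape)
for x, y in zip(rows, cols):
    # do something with data[x, y] here
    pass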
