I have a large data set stored as a 300000x860 numpy array and performing any operations on it takes a very long time. Is there any way to extract the first 10000 elements of the numpy array without looping through the first 10000 elements and appending each to a new array?
Yes, that is exactly what numpy indexing is for:
x.shape # gives (300000, 860)
first_k = x[:10000,:]
This would extract the first 10000 rows from the array, but I am not sure what you mean by first 10000 elements since you have a 2D array.
The first 10000 elements would be:
rows, cols = old_array.shape
ex_rows = 10000 // cols  # number of complete rows covered (integer division, not np.floor, so it can be used as an index)
ex_cols = 10000 % cols   # leftover elements from the next row
ex_array = np.concatenate((old_array[:ex_rows].ravel(), old_array[ex_rows, :ex_cols]))
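A simpler alternative (my addition, not part of the original answer) is to take a flat view and slice it; for a contiguous array, ravel returns a view rather than a copy:
first_10000 = old_array.ravel()[:10000]  # first 10000 elements in row-major order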
I have a pandas dataframe of shape (7761940, 16). I converted it into a list of 7762 numpy arrays using np.array_split, each array of shape (1000, 16).
Now I need to take a slice of the first 50 elements from each array and create a new array of shape (388100, 16) from them. The number 388100 comes from 7762 arrays multiplied by 50 elements.
I know it is a sort of slicing and indexing but I could not manage it.
If you split the array, you waste memory. If you pad the array to allow a nice reshape, you also waste memory. This is not a huge problem, but it can be avoided. One way is to use the arcane np.lib.stride_tricks.as_strided function. This function is dangerous, and we would be breaking some of its rules, but as long as you only want the first 50 elements of each chunk, and the last chunk is longer than 50 elements, everything will be fine:
x = ... # your data as a numpy array
chunks = int(np.ceil(x.shape[0] / 1000))
view = np.lib.stride_tricks.as_strided(x, shape=(chunks, 1000, x.shape[-1]), strides=(x.strides[0] * 1000, *x.strides))
This will create a view of shape (7762, 1000, 16) into the original memory, without making a copy. Since your original array does not have a multiple of 1000 rows, the last plane will have some memory that doesn't belong to you. As long as you don't try to access it, it won't hurt you.
Now accessing the first 50 elements of each plane is trivial:
data = view[:, :50, :]
You can unravel the first dimensions to get the final result:
data.reshape(-1, x.shape[-1])
A much healthier way would be to pad and reshape the original.
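For reference, a minimal sketch of that pad-and-reshape route (my own illustration, assuming the chunk size of 1000 from the question; np.pad copies the data, which is exactly the memory cost discussed above):
pad = (-x.shape[0]) % 1000                         # rows needed to reach a multiple of 1000
padded = np.pad(x, ((0, pad), (0, 0)))             # zero-pad the tail (makes a copy)
view3d = padded.reshape(-1, 1000, x.shape[-1])     # shape (7762, 1000, 16)
data = view3d[:, :50, :].reshape(-1, x.shape[-1])  # shape (388100, 16)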
After getting help from the comments and doing some research, I came up with a solution:
my_data = np.array_split(dataframe, 7762)  # split the dataframe into a list of 7762 ndarrays,
                                           # each of shape (1000, 16)
my_list = []  # define a new list object
for i in range(7762):  # iterate over the 7762 ndarrays
    my_list.append(my_data[i][0:50, :])  # append the first 50 rows of each ndarray to my_list
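To get the final (388100, 16) array, one more step (not shown in the original post) stacks the list back together:
result = np.concatenate(my_list)  # shape (388100, 16)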
You can do something like this:
Split the data of size (7762000 x 16) into (7762 x 1000 x 16):
data_first_split = np.array(np.array_split(data, 7762))  # array_split returns a list, so stack it back into one 3-D array
Slice the data to (7762 x 50 x 16) to get the first 50 elements of each chunk:
data_second_split = data_first_split[:, :50, :]
Reshape to get (388100 x 16):
data_final = np.reshape(data_second_split, (7762 * 50, 16))
As @hpaulj mentioned, you can also do it using np.vstack. IMO you should also give numpy's strides a look.
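A sketch of that np.vstack route (my reading of the suggestion; it uses a list comprehension, so it is not loop-free):
data_final = np.vstack([chunk[:50] for chunk in np.array_split(data, 7762)])  # shape (388100, 16)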
I have the following code:
x = range(100)
M = len(x)
sample = np.zeros((M, 41632))
for i in range(M):
    lista = np.load('sample' + str(i) + '.npy')
    for j in range(41632):
        sample[i, j] = np.array(lista[j])
    print(i)
to create an array made of sample_i numpy arrays.
sample0, sample1, sample2, etc. are numpy arrays and my expected output is an Mx41632 array like this:
sample = [[sample0],[sample1],[sample2],...]
How can I make this operation more compact and faster, without the for loop? M can also reach 1 million.
Or, how can I append to my sample array if the starting point is, for example, 1000 instead of 0?
Thanks in advance
Initial load
You can make your code a lot faster by avoiding the inner loop and not initialising sample to zeros.
x = range(100)
M = len(x)
sample = np.empty((M, 41632))
for i in range(M):
    sample[i, :] = np.load('sample' + str(i) + '.npy')
In my tests this took the reading code from 3 seconds down to 60 milliseconds!
Adding rows
In general it is very slow to change the size of a numpy array. You can append a row once you have loaded the data in this way:
sample = np.insert(sample, len(sample), newrow, axis=0)
but this is almost never what you want to do, because it is so slow.
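If you really do need to add many rows, a faster pattern (my suggestion, not part of the original answer) is to collect them in a Python list and stack once at the end:
extra = [np.load('sample' + str(i) + '.npy') for i in range(1000, 1100)]  # hypothetical range
sample = np.vstack([sample] + extra)  # a single allocation instead of one per insert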
Better storage: HDF5
Also if M is very large you will probably start running out of memory.
I recommend that you have a look at PyTables which will allow you to store your sample results in one HDF5 file and manipulate the data without loading it into memory. This will in general be a lot faster than the .npy files you are using now.
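A minimal sketch of that idea (my illustration; the file name and count are assumptions, using the PyTables EArray API):
import numpy as np
import tables

with tables.open_file('samples.h5', mode='w') as f:
    earray = f.create_earray(f.root, 'sample', tables.Float64Atom(), shape=(0, 41632))
    for i in range(100):  # hypothetical number of sample files
        earray.append(np.load('sample' + str(i) + '.npy').reshape(1, -1))  # append one row on disk
with tables.open_file('samples.h5', mode='r') as f:
    first_ten = f.root.sample[:10]  # reads only these rows into memory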
It is quite simple with numpy. Consider this example:
import numpy as np
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
#create an array with 4 rows and 3 columns
arr = np.zeros([4,3])
arr[:,:] = l
You can also insert rows or columns separately:
#insert the first row
arr[0,:] = l[0]
You just have to ensure that the dimensions match.
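The column counterpart (a small addition for symmetry) works the same way:
#insert the first column
arr[:, 0] = [row[0] for row in l]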
I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique but it sorts the array another time to achieve it.
The reason I need the array sorted a priori is because the returned keys from the argsort function will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ** 27 elements (they can grow even bigger!)
# note: np.random.random_integers is deprecated, so np.random.randint is used here
slices = np.random.randint(1, 2 ** 32, size=2 ** 27, dtype=np.int64)
values = np.random.randint(1, 2 ** 32, size=2 ** 27, dtype=np.int64)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')
Both slices and values are 1-D arrays with the same (huge) number of elements.
Thanks in advance.
You can check whether two or more equal values sit next to each other (non-unique values in a sorted array) by comparing the differences of adjacent elements to 0:
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach that makes use of slicing: instead of actually computing differences, we can just compare each element against the previous one, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
I have a pretty big numpy matrix (a 2-D array with more than 1000 * 1000 cells) and another 2-D array of indexes of the form [[x1,y1],[x2,y2],...,[xn,yn]], which is also quite large (n > 1000). I want to extract all the cells of the matrix whose (x, y) coordinates appear in the index array, as efficiently as possible, i.e. without loops. If the array were an array of tuples I could just do
cells = matrix[array]
and get what I want, but the array is not in that format, and I couldn't find an efficient way to convert it to the desired form...
You can make your array into a tuple of arrays like this:
tuple(array.T)
This matches the output format of np.where(), which can be used directly as an index.
cells=matrix[tuple(array.T)]
You can also use standard numpy integer-array indexing to get @Divakar's answer from the comments:
cells=matrix[array[:,0],array[:,1]]
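For a concrete (hypothetical) example of both forms:
matrix = np.arange(9).reshape(3, 3)
array = np.array([[0, 1], [2, 2], [1, 0]])
matrix[tuple(array.T)]            # array([1, 8, 3])
matrix[array[:, 0], array[:, 1]]  # same result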
I have spent the last hour trying to figure this out.
Suppose we have
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
This does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
thanks 😊
One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
This is advanced ("fancy") indexing with integer arrays rather than basic indexing, so the returned array is actually a new array in memory, not a view of a.
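An equivalent one-liner (my addition; requires NumPy 1.15 or newer) uses np.take_along_axis, which pairs naturally with argmin:
np.take_along_axis(a, amin_index[:, None], axis=1).ravel()  # same values as above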
This is because argmin returns the index of the column for each of the rows (with axis=1), so you need to access each row at that particular column:
a[range(a.shape[0]), amin_index]
Why not simply use np.amin(np.abs(a), axis=1)? It's much simpler if you don't need the intermediate amin_index array from argmin (though note it returns the minima of np.abs(a), so the sign of the original entries is lost).
Numpy's reference page is an excellent resource, see "Indexing".
Edit: timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop