Removing blank rows from Memmapped array - python

I have around 6000 json.gz files totalling 24GB which I need to do various calculations on.
Because I have no clue how many lines I'm going to pick up from each JSON file (I reject lines with invalid data), I estimated a maximum of 2000 lines per file.
I created a memmapped NumPy array with shape (6000*2000, 10) and parsed the data from the json.gz files into it [total size = 2.5 GB].
In the end it turned out that, because of the overestimation, the last 10-15% of the rows are all zeros. Because of the nature of my computation I need to remove these invalid rows from the memmapped array. The priority is time, and after that memory.
What would be the best method to do this? I'm programmatically aware of the exact indices of the rows to be removed.
Create another memmapped array with the correct shape and size, and slice the original array into it.
Use the delete() function
Use Masking
Something else?

You can use arr.base.resize to truncate or enlarge the array, then arr.flush() to save the change to disk:
In [169]: N = 10**6
In [170]: filename = '/tmp/test.dat'
In [171]: arr = np.memmap(filename, mode='w+', shape=(N,))
In [172]: arr.base.resize(N//2)
In [173]: arr.flush()
In [174]: arr = np.memmap(filename, mode='r')
In [175]: arr.shape
Out[175]: (500000,)
In [176]: arr.size == N//2
Out[176]: True
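For the 2D case in the question, the same trick works by converting the number of valid rows into a byte count. A rough sketch, assuming the zero rows are all at the end, that valid_rows is known, and that arr.base is the underlying mmap object as in the session above (the filename, dtype and row count below are placeholders):
import numpy as np

ncols = 10
arr = np.memmap('parsed.dat', dtype='float64', mode='r+',
                shape=(6000 * 2000, ncols))      # placeholder filename/dtype
valid_rows = 10000000                            # hypothetical number of good rows
arr.base.resize(valid_rows * ncols * arr.dtype.itemsize)  # truncate the file in place
arr.flush()
del arr                                          # release the old, oversized view

arr = np.memmap('parsed.dat', dtype='float64', mode='r+',
                shape=(valid_rows, ncols))       # reopen with the correct shape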

default values for numpy ndarray

I was working with numpy.ndarray and something interesting happened.
I created an array with the shape of (2, 2) and left everything else with the default values.
It created an array for me with these values:
array([[2.12199579e-314, 0.00000000e+000],
       [5.35567160e-321, 7.72406468e-312]])
I created another array with the same default values and it also gave me the same result.
Then I created a new array (using the default values and the shape (2, 2)) and filled it with zeros using the 'fill' method.
The interesting part is that now whenever I create a new array with ndarray it gives me an array with 0 values.
So what is going on behind the scenes?
See https://numpy.org/doc/stable/reference/generated/numpy.empty.html#numpy.empty:
(Precisely as @Michael Butscher commented)
np.empty([2, 2]) creates an array without touching the contents of the memory chunk allocated for the array; thus, the array may look as if filled with some more or less random values.
np.ndarray([2, 2]) does the same.
Other creation methods, however, fill the memory with some values:
np.zeros([2, 2]) fills the memory with zeros,
np.full([2, 2], 9) fills the memory with nines, etc.
Now, if you create a new array via np.empty() after creating (and disposing of, i.e. automatically garbage collected) an array filled with e.g. ones, your new array may be allocated the same chunk of memory and thus look as if "filled" with ones.
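A minimal sketch of the difference (the apparent reuse depends on the allocator, so the first print is not guaranteed to show ones):
import numpy as np

a = np.ones((2, 2))
del a                        # the buffer becomes available for reuse
b = np.empty((2, 2))         # may be handed the same, still ones-filled, buffer
print(b)                     # possibly [[1., 1.], [1., 1.]] -- but arbitrary

print(np.zeros((2, 2)))      # always zeros: the memory is explicitly initialized
print(np.full((2, 2), 9))    # always nines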
np.empty explicitly says it returns:
Array of uninitialized (arbitrary) data of the given shape, dtype, and
order. Object arrays will be initialized to None.
It's compiled code so I can't say for sure, but I strongly suspect it just calls np.ndarray with the shape and dtype.
ndarray describes itself as a low-level constructor and lists many better alternatives.
In an ipython session I can make two arrays:
In [2]: arr = np.empty((2,2), dtype='int32'); arr
Out[2]:
array([[ 927000399, 1267404612],
[ 1828571807, -1590157072]])
In [3]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[3]:
array([[ 927000399, 1267404612],
[ 1828571807, -1590157072]])
The values are the same, but when I check the "location" of their data buffers, I see that they are different:
In [4]: arr.__array_interface__['data'][0]
Out[4]: 2213385069328
In [5]: arr1.__array_interface__['data'][0]
Out[5]: 2213385068176
We can't use that number in code to fiddle with the values, but it's useful as a human-readable indicator of where the data is stored. (Do you understand the basics of how arrays are stored, with shape, dtype, strides, and data-buffer?)
Why the "uninitialized values" are the same is anyone's guess; mine is that it's just an artifact of how that bit of memory was used before. np.empty stresses that we shouldn't attach any significance to those values.
Doing the ndarray call again produces different values and a different location:
In [9]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[9]:
array([[1469865440, 515],
[ 0, 0]])
In [10]: arr1.__array_interface__['data'][0]
Out[10]: 2213403372816
apparent reuse
If I don't assign the array to a variable, or otherwise "hang on to it", numpy may reuse the data buffer memory:
In [17]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[17]: 2213403374512
In [18]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[18]: 2213403374512
In [19]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[19]: 2213403374512
In [20]: np.empty((2,2), dtype='int').__array_interface__['data'][0]
Out[20]: 2213403374512
Again, we shouldn't read too much into this reuse, and certainly not count on it for any calculations.
object dtype
If we specify the object dtype, then the values are initialized to None. This dtype contains references/pointers to objects in memory, and "random" pointers wouldn't be safe.
In [14]: arr1 = np.ndarray((2,2), dtype='object'); arr1
Out[14]:
array([[None, None],
[None, None]], dtype=object)
In [15]: arr1 = np.ndarray((2,2), dtype='U3'); arr1
Out[15]:
array([['', ''],
['', '']], dtype='<U3')

How to extract a sample from a large numpy array

I have a large data set stored as a 300000x860 numpy array and performing any operations on it takes a very long time. Is there any way to extract the first 10000 elements of the numpy array without looping through the first 10000 elements and appending each to a new array?
Yes, that is exactly what numpy indexing is for:
x.shape # gives (300000, 860)
first_k = x[:10000,:]
This would extract the first 10000 rows from the array, but I am not sure what you mean by first 10000 elements since you have a 2D array.
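For what it's worth, basic slicing like this returns a view, not a copy, so none of the 300000 x 860 elements are copied; a quick check (names as in the answer above):
first_k = x[:10000, :]
print(first_k.base is x)     # True: the slice shares x's data buffer
print(first_k.shape)         # (10000, 860)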
The first 10000 elements (in row-major order) would be the first ex_rows complete rows plus ex_cols entries of the next row:
cols = old_array.shape[1]   # 860
ex_rows = 10000 // cols     # number of complete rows
ex_cols = 10000 % cols      # leftover elements from the next row
ex_array = np.concatenate([old_array[:ex_rows].ravel(), old_array[ex_rows, :ex_cols]])

Create array from slices of numpy arrays contained in a list object

I have a pandas dataframe of shape (7761940, 16). I converted it into a list of 7762 numpy arrays using np.array_split, each array of shape (1000, 16).
Now I need to take a slice of the first 50 rows from each array and create a new array of shape (388100, 16) from them. The number 388100 comes from 7762 arrays multiplied by 50 rows.
I know this is some sort of slicing and indexing, but I could not manage it.
If you split the array, you waste memory. If you pad the array to allow a nice reshape, you waste memory. This is not a huge problem, but it can be avoided. One way is to use the arcane np.lib.stride_tricks.as_strided function. This function is dangerous, and we would break some rules with it, but as long as you only want the first 50 elements of each chunk, and the last chunk has more than 50 elements, everything will be fine:
x = ...  # your data as a numpy array
chunks = int(np.ceil(x.shape[0] / 1000))
view = np.lib.stride_tricks.as_strided(
    x, shape=(chunks, 1000, x.shape[-1]),
    strides=(x.strides[0] * 1000, *x.strides))  # row stride * chunk length; assumes x is C-contiguous
This will create a view of shape (7762, 1000, 16) into the original memory, without making a copy. Since your original array does not have a multiple of 1000 rows, the last plane will have some memory that doesn't belong to you. As long as you don't try to access it, it won't hurt you.
Now accessing the first 50 elements of each plane is trivial:
data = view[:, :50, :]
You can unravel the first dimensions to get the final result:
data.reshape(-1, x.shape[-1])
A much healthier way would be to pad and reshape the original.
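A sketch of that pad-and-reshape approach, assuming x has shape (7761940, 16) and a chunk length of 1000 (the zero padding only affects the tail of the last chunk, which is never selected because only the first 50 rows of each chunk are kept):
chunk = 1000
n_chunks = int(np.ceil(x.shape[0] / chunk))               # 7762
pad = n_chunks * chunk - x.shape[0]                       # rows needed to complete the last chunk
padded = np.pad(x, ((0, pad), (0, 0)), mode='constant')   # one zero-padded copy
data = padded.reshape(n_chunks, chunk, -1)[:, :50, :]     # first 50 rows of each chunk
result = data.reshape(-1, x.shape[-1])                    # shape (388100, 16)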
After getting help from friends' comments and doing some research, I came up with a solution:
my_data = np.array_split(dataframe, 7762)  # split the dataframe into a list of 7762 ndarrays,
                                           # each of shape (1000, 16)
my_list = []                               # new list to collect the slices
for i in range(0, 7762):                   # iterate over the 7762 ndarrays
    my_list.append(my_data[i][0:50, :])    # append the first 50 rows of each ndarray
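To get the final (388100, 16) array from my_list, the slices can then be stacked, for example:
result = np.vstack(my_list)   # shape (388100, 16)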
You can do something like this:
Split the data of size (7762000 x 16) to (7762 x 1000 x 16)
data_first_split = np.array_split(data, 7762)
Slice the data to 7762 x 50 x 16, taking the first 50 rows of each chunk (np.array_split returns a list, so stack the slices into an array)
data_second_split = np.stack([chunk[:50] for chunk in data_first_split])
Reshape to get 388100 x 16
data_final = np.reshape(data_second_split, (7762 * 50, 16))
As @hpaulj mentioned, you can also do it using np.vstack. IMO you should also give numpy strides a look.

Saving (and averaging) Large Sparse Numpy Array

I have written a code that generates a large 3d numpy array of data observations (floats). The dimensions are (33,000 x 2016 x 53), which corresponds to (#obs.locations x 5min_intervals_perweek x weeks_in_leapyear). It is very sparse (about 1.5% of entries are filled).
Currently I do this by calling:
my3Darray = np.zeros((33000, 2016, 53))
or
my3Darray = np.empty((33000, 2016, 53))
My loop then indexes into the array one entry at a time and updates 1.5% with floats (this part is actually very fast). I then need to:
Save each 2D (33000 x 2016) slice as a CSV or other 'general format' data file
Take the mean over the 3rd dimension (so I should get a 33000 x 2016 matrix)
I have tried saving with:
for slice_2d_week_i in xrange(nweeks):
    weekfile = str(slice_2d_week_i)
    np.savetxt(weekfile, my3Darray[:, :, slice_2d_week_i], delimiter=",")
However, this is extremely slow and the empty entries in the output show up as
0.000000000000000000e+00
which makes the file sizes huge.
Is there a more efficient way to save (possibly leaving blanks for entries that were never updated)? Is there a better way to allocate the array besides np.zeros or np.empty? And how can I take the mean over the 3rd dimension while ignoring non-updated entries (mean(my3Darray, 3) does not ignore the 0 entries)?
You can save in one of numpy's binary formats, here's one I use: np.savez.
You can average with np.sum(a, axis=2) / np.sum(a != 0, axis=2). Keep in mind that this will still give you NaN's when there are zeros in the denominator.
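A toy-sized sketch of both suggestions (the array shape, filename and key are placeholders; the real array would be the (33000, 2016, 53) one):
import numpy as np

a = np.zeros((30, 20, 5))                 # stand-in for the big 3D array
a[0, 0, 0] = 1.5                          # ...fill ~1.5% of the entries...

np.savez('weeks.npz', data=a)             # binary save; np.savez_compressed shrinks it further
a_back = np.load('weeks.npz')['data']

with np.errstate(invalid='ignore', divide='ignore'):
    means = np.sum(a, axis=2) / np.sum(a != 0, axis=2)   # NaN where a cell was never updated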

Efficiently remove rows/columns of numpy image array

I am trying to remove rows or columns from an image represented by a Numpy array. My image is of type uint16 and 2560 x 2176. As an example, say I want to remove the first 16 columns to make it 2560 x 2160.
I'm a MATLAB-to-Numpy convert, and in MATLAB would use something like:
A = rand(2560, 2176);
A(:, 1:16) = [];
As I understand, this deletes the columns in place and saves a lot of time by not copying to a new array.
For Numpy, previous posts have used commands like numpy.delete. However, the documentation is clear that this returns a copy, and so I must reassign the copy to A. This seems like it would waste a lot of time copying.
import numpy as np
A = np.random.rand(2560, 2176)
A = np.delete(A, np.r_[:16], 1)
Is this truly as fast as an in-place deletion? I feel I must be missing a better method or not understanding how python handles array storage during deletion.
Relevant previous posts:
Removing rows in NumPy efficiently
Documentation for numpy.delete
Why not just do a slice? Here I'm removing the first 3000 columns instead of 16 to make the memory usage more clear:
import numpy as np
a = np.empty((5000, 5000))
a = a[:, 3000:]
This effectively reduces the size of the array in memory, as can be seen:
In [31]: a = np.zeros((5000, 5000), dtype='d')
In [32]: whos
Variable Type Data/Info
-------------------------------
a ndarray 5000x5000: 25000000 elems, type `float64`, 200000000 bytes (190 Mb)
In [33]: a = a[:, 3000:]
In [34]: whos
Variable Type Data/Info
-------------------------------
a ndarray 5000x2000: 10000000 elems, type `float64`, 80000000 bytes (76 Mb)
For this size of array a slice seems to be about 10,000x faster than your delete option:
%timeit a=np.empty((5000,5000), dtype='d'); a=np.delete(a, np.r_[:3000], 1)
1 loops, best of 3: 404 ms per loop
%timeit a=np.empty((5000,5000), dtype='d'); a=a[:, 3000:]
10000 loops, best of 3: 39.3 us per loop
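One caveat: the slice is a view, so the original 5000 x 5000 buffer stays allocated for as long as the view refers to it. If the goal is to actually free that memory, an explicit copy can be made (paying for one copy up front):
a = np.empty((5000, 5000))
a = a[:, 3000:].copy()   # materialize the slice; the original buffer can now be garbage collected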
