Multiply numpy array by large numpy packedbit array - python

I have a very large binary array which I compress using
import numpy as np
arr1 = np.random.randint(0, 2, (100, 100))
bitArray = np.packbits(arr1)
How can I then multiply another numpy integer array by this packed array,
arr2 = np.random.randint(0,10,(100,100))
result = MULTIPLY(arr2,bitArray)
treating the values as standard ones and zeros such that the result would be the same as
np.dot(arr2,arr1)
without ever converting the bitarray out of the packed format?
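One way to get there (a sketch, not a built-in numpy operation) is to keep the packed bytes and unpack only one byte-column, i.e. 8 original columns, at a time while filling in the product, so the full array is never expanded at once. The helper below assumes arr1 was packed along its last axis with np.packbits(arr1, axis=1):
import numpy as np

def packed_dot(arr2, bit_array, n_cols):
    # computes arr2 @ arr1 from the packed form, expanding only 8 columns at a time
    result = np.empty((arr2.shape[0], n_cols), dtype=arr2.dtype)
    for b in range(bit_array.shape[1]):
        block = np.unpackbits(bit_array[:, b:b + 1], axis=1)  # shape (rows, 8), values 0/1
        start, stop = b * 8, min(b * 8 + 8, n_cols)
        result[:, start:stop] = arr2 @ block[:, :stop - start]
    return result

arr1 = np.random.randint(0, 2, (100, 100))
arr2 = np.random.randint(0, 10, (100, 100))
bit_array = np.packbits(arr1, axis=1)  # packed row-wise, shape (100, 13)
assert np.array_equal(packed_dot(arr2, bit_array, 100), np.dot(arr2, arr1))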

Related

Saving a numpy array in binary does not improve disk usage compared to uint8

I'm saving numpy arrays while trying to use as little disk space as possible.
Along the way I realized that saving a boolean numpy array does not improve disk usage compared to a uint8 array.
Is there a reason for that or am I doing something wrong here?
Here is a minimal example:
import sys
import numpy as np
rand_array = np.random.randint(0, 2, size=(100, 100), dtype=np.uint8) # create a random dual state numpy array
array_uint8 = rand_array * 255 # array, type uint8
array_bool = np.array(rand_array, dtype=bool) # array, type bool
print(f"size array uint8 {sys.getsizeof(array_uint8)}")
# ==> size array uint8 10120
print(f"size array bool {sys.getsizeof(array_bool)}")
# ==> size array bool 10120
np.save("array_uint8", array_uint8, allow_pickle=False, fix_imports=False)
# size in fs: 10128
np.save("array_bool", array_bool, allow_pickle=False, fix_imports=False)
# size in fs: 10128
The uint8 and bool data types both occupy one byte of memory per element, so arrays of equal dimensions will always occupy the same amount of memory. If you are aiming to reduce your footprint, you can pack the boolean values as bits into a uint8 array using numpy.packbits, which stores the binary data in a significantly smaller array.
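For illustration, a small sketch of the packed route (exact byte counts are approximate and include the .npy header):
import numpy as np

rand_array = np.random.randint(0, 2, size=(100, 100), dtype=np.uint8)
packed = np.packbits(rand_array)  # 10000 bits -> 1250 bytes, one byte per 8 values
np.save("array_packed", packed, allow_pickle=False, fix_imports=False)
# size in fs: ~1378 instead of ~10128

# unpackbits plus a reshape recovers the original 0/1 array (trim the padding bits)
restored = np.unpackbits(packed)[:rand_array.size].reshape(rand_array.shape)
assert np.array_equal(restored, rand_array)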

Best way to store and represent many 1D numpy float arrays to one 1D numpy array

I'm converting .bed files into 1D numpy float arrays, and I will need these arrays again later. For roughly 700 .bed files, saving each one as its own 1D numpy array is very costly.
My current solution is to convert them into string arrays and concatenate them successively, so that I can retrieve them later in the order in which they were concatenated, e.g. getting the last array as shown below.
import numpy as np
array_size=10000
number_of_files=700
sample1 = np.random.uniform(low=0.5, high=13.3, size=(1,array_size))
s1 = np.array(["%.2f" % x for x in sample1.reshape(sample1.size)])
np.save('test',s1)
for i in range(number_of_files):
    sample = np.random.uniform(low=0.5, high=13.3, size=(1,array_size))
    s = np.array(["%.2f" % x for x in sample.reshape(sample.size)])
    s_temp=np.load('test.npy',mmap_mode='r')
    s_new=['%s_%s' %(x,y) for x,y in zip(s_temp,s)]
    np.save('test',s_new)
result=np.load('test.npy')
last=[x.split('_')[100] for x in result]
However, my test code shows that this string-array approach is much more costly.
Storing a 1D numpy float array of 250M elements takes 1.9 GB, so 700 files would make about 1330 GB.
Storing a 1D numpy string array of 10K elements takes 143 MB, so (250M / 10K) * 143 MB would make about 3575 GB.
Do you have any better solution for this representation and later retrieval problem?
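One possible alternative (a sketch, not an answer from the original thread): since every file has the same length, the arrays can be kept as rows of a single 2D float32 array, which preserves their order and makes retrieval a plain slice. At 4 bytes per value this is roughly 28 MB for 700 files of 10,000 values each:
import numpy as np

array_size = 10000
number_of_files = 700

all_samples = np.empty((number_of_files, array_size), dtype=np.float32)
for i in range(number_of_files):
    sample = np.random.uniform(low=0.5, high=13.3, size=array_size)
    all_samples[i] = sample  # row i is file i, 4 bytes per value instead of a string

np.save('all_samples', all_samples)  # a single .npy file, ~28 MB here
last = np.load('all_samples.npy', mmap_mode='r')[-1]  # read back only the last row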

How to read a numpy ndarray from a block of memory?

I have a block of memory that stores a 2D array of float32 numbers.
For example, the shape is (1000, 10), and what I have in memory is something like a C array with 10000 elements.
Can I turn this into a numpy array just by specifying the shape and dtype?
Reading a memory-mapped array from disk uses the numpy.memmap() function. The data type and the shape need to be specified again, as this information is not stored in the file.
Let's call the file on disk that contains the data memmapped.dat:
import numpy as np
array = np.memmap('memmapped.dat', dtype=np.float32, shape=(1000, 10))
Ref: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html and https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
It turns out numpy supports interpreting a buffer as a 1-D array via numpy.frombuffer, which can then be reshaped:
import numpy as np
def load_array(data, shape):
    # interpret the raw bytes as float32 values and reshape to the requested 2D shape
    return np.frombuffer(data, dtype=np.float32).reshape(shape)
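For example, a usage sketch of load_array from above (the byte string here just stands in for the block of memory):
import numpy as np

data = np.arange(10000, dtype=np.float32).tobytes()  # stand-in for the raw memory block
arr = load_array(data, (1000, 10))
print(arr.shape, arr.dtype)  # (1000, 10) float32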

Pandas Series.as_matrix() doesn't properly convert a series of nd arrays into a single nd array

I have a pandas dataframe where one column is labeled "feature_vector" and contains 1d numpy arrays of numbers. I need to use this data in a scikit-learn model, so I need it as a single numpy array. Naturally I call DataFrame["feature_vector"].as_matrix() to get the numpy array from the correct series. The only problem is that as_matrix() returns a 1d numpy array in which each element is itself a 1d numpy array containing one vector, and when this is passed to an sklearn model's .fit() function it throws an error. What I need instead is a 2d numpy array rather than a 1d array of 1d arrays. I wrote this workaround, which presumably uses unnecessary memory and computation time:
x = dataframe["feature_vector"].as_matrix()
#x is a 1d array of 1d arrays.
l = []
for e in x:
    l.append(e)
x = np.array(l)
#x is now a single 2d array.
Is this a bug in pandas .as_matrix()? Is there a better workaround that doesn't require me to change the structure of the original dataframe?
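A shorter route (a sketch, not from the original thread): np.stack builds the 2D array from a column of equal-length 1D arrays in one call, and Series.to_numpy() replaces the now-deprecated .as_matrix():
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature_vector": [np.arange(3), np.arange(3, 6)]})
x = np.stack(df["feature_vector"].to_numpy())  # stacks the row vectors into one 2D array
print(x.shape)  # (2, 3)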

How to copy data from memory to numpy array in Python

For example, I have a variable that points to a vector containing many elements in memory. I want to copy the elements of that vector into a numpy array; what should I do other than copying them one by one? Thanks.
I am assuming that your vector can be represented like this:
from array import array
x = array('l', [1, 3, 10, 5, 6])  # an array using Python's built-in array module
Converting it to a numpy array is then simply:
import numpy as np
y = np.array(x)
If the data is packed in a buffer in native float format:
a = np.frombuffer(buf, dtype=float, count=N)  # np.fromstring is deprecated for binary data
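A small usage sketch (buf is just a stand-in for the memory block): np.frombuffer shares memory with the buffer, so add .copy() if you want an independent numpy array:
import numpy as np

buf = np.arange(5, dtype=float).tobytes()  # stand-in for the raw memory block
a = np.frombuffer(buf, dtype=float, count=5).copy()  # .copy() detaches it from the buffer
print(a)  # [0. 1. 2. 3. 4.]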
