Python space+time efficient Data Structure to store 2D Bit Arrays

I want to create a 2D binary (bit) array in Python in a space- and time-efficient way, as my 2D bit array would be around 1 million rows by 50,000 columns of 0's or 1's, and I would also be performing bitwise operations over these huge elements. My array would look something like:
0 1 0 1
1 1 1 0
1 0 0 0
...
In C++, the most space-efficient way for me would be to create a kind of array of integers where each element represents 32 bits, and then use the shift operators coupled with the bitwise operators to carry out operations.
Now I know that there is a bitarray module in Python, but I am unable to create a 2D structure using a list of bitarrays. How can I do this?
Another way I know from C++ would be to create something like map<id, vector<int>>, where I could then manipulate the vectors as mentioned above. Should I use the dictionary equivalent in Python?
Even if you just suggest some way to use a bit array for this task, it would also be great to know whether I can have multiple threads operate on a slice of the bitarray, so that I can make it multithreaded. Thanks for the help!!
EDIT:
I can even go on to create my own data structure for this if need be. However, I just wanted to check before reinventing the wheel.

As per my comment, you may be able to use sets
0 1 0 1
1 1 1 0
1 0 0 0
can be represented as
set([(1,0), (3,0), (0,1), (1,1), (2, 1), (0,2)])
or
{(1,0), (3,0), (0,1), (1,1), (2, 1), (0,2)}
An AND is then equivalent to the intersection of the two sets,
and an OR is the union of the two sets.
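To make the mapping concrete, here is a minimal sketch (the first set is the example matrix above; the second set is invented for illustration):
a = {(1, 0), (3, 0), (0, 1), (1, 1), (2, 1), (0, 2)}   # positions holding a 1
b = {(1, 0), (1, 1), (2, 2)}                           # another bit matrix, made up for the example
both = a & b     # AND: bits set in both -> {(1, 0), (1, 1)}
either = a | b   # OR: bits set in either
flipped = a ^ b  # XOR: bits set in exactly one of the two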

How about the following:
In [11]: from bitarray import bitarray
In [12]: arr = [bitarray(50) for i in xrange(10)]
This creates a 10x50 bit array, which you can access as follows:
In [15]: arr[0][1] = True
In [16]: arr[0][1]
Out[16]: True
Bear in mind that a 1Mx50K array would require about 6GB of memory (and a 64-bit build of Python on a 64-bit OS).
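If you go this route, here is a rough sketch of row-wise bitwise operations (assuming the bitarray package; the row count is scaled down for illustration):
from bitarray import bitarray

rows, cols = 1000, 50000              # scale up to 1000000 x 50000 if the ~6GB fits
arr = [bitarray(cols) for i in range(rows)]
for row in arr:
    row.setall(False)                 # bitarray(n) starts out with undefined contents

arr[0][1] = True
arr[1][1] = True
both = arr[0] & arr[1]                # bitwise AND of two rows
either = arr[0] | arr[1]              # bitwise OR
arr[0] ^= arr[1]                      # in-place XOR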
whether I can have multiple threads operate on a slice of the bitarray so that I can make it multithreaded
That shouldn't be a problem, with the usual caveats. Bear in mind that due to the GIL you are unlikely to achieve performance improvements through multithreading.

Can you use numpy?
>>> import numpy
>>> A = numpy.zeros((50000, 1000000), dtype=bool)
EDIT: This doesn't seem to be the most space-efficient option; it uses 50GB (1 byte per bool). Does anyone know if numpy has a way to use packed bools?
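Not part of the original answer, but numpy can pack booleans eight per byte with np.packbits/np.unpackbits; the bitwise work is then done on the packed uint8 bytes rather than by indexing single bits. A rough sketch on a deliberately small block:
import numpy as np

bools = np.random.rand(4, 50000) < 0.5                    # a small boolean block (4 rows)
packed = np.packbits(bools, axis=1)                       # shape (4, 6250), one bit per value
row_and = packed[0] & packed[1]                           # bitwise AND on the packed bytes
restored = np.unpackbits(row_and)[:50000].astype(bool)    # back to booleans if needed
At 1 bit per value, the full 1M x 50K array would take roughly 6GB instead of 50GB.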

Related

Python - Multiprocess shared memory with python numpy array?

Here I have two numpy arrays, and a function that takes those arrays as input, does some numpy calculation, and returns the result. It works as it is, but it's slow, and I think we can use multiprocessing to make it a bit faster.
Anyway, here's my code :
import numpy as np

A = ...  # 4-dimensional big numpy array
B = ...  # 1-dimensional numpy array

def function(A, B):
    P = np.einsum("ijkl,ij->kl", A, B)
    return P.astype(np.uint8)

result = function(A, B)
I'm still quite new to this multiprocessing stuff, but I think we should be able to put arrays A and B in shared memory (maybe using multiprocessing.Array()?) and then spawn multiple processes to compute function(A, B). But I still can't quite understand how to put all of that into code.
EDIT:
Alright, so it seems like the approach above doesn't work, so let's try another case. Say the length of array A along its first axis is 120, and I want Process No. 1 to use only 3/4 of array A, from index 0 to 89, together with all of array B.
Then I also want Process No. 2 to use 3/4 of array A, but from index 30 to 119, together with all of array B. Will that help? Of course I can make the A array even larger so that its parts can be computed by even more processes, but the thing is, will this concept work?
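No answer is shown for this question here, but one common direction (a sketch only, assuming Python 3.8+ and that A is a read-only input) is multiprocessing.shared_memory, which lets each process view the same buffer as a numpy array without copying it:
import numpy as np
from multiprocessing import Process, shared_memory

def worker(shm_name, shape, dtype, start, stop):
    shm = shared_memory.SharedMemory(name=shm_name)      # attach to the existing block
    A = np.ndarray(shape, dtype=dtype, buffer=shm.buf)   # zero-copy view onto shared memory
    chunk = A[start:stop]                                # e.g. rows 0..89 or 30..119
    # ... do the per-process work on `chunk` here ...
    shm.close()

if __name__ == '__main__':
    A = np.random.rand(120, 4, 5, 6)                     # illustrative shape
    shm = shared_memory.SharedMemory(create=True, size=A.nbytes)
    shared_A = np.ndarray(A.shape, dtype=A.dtype, buffer=shm.buf)
    shared_A[:] = A                                      # copy into shared memory once
    p1 = Process(target=worker, args=(shm.name, A.shape, A.dtype, 0, 90))
    p2 = Process(target=worker, args=(shm.name, A.shape, A.dtype, 30, 120))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    shm.close()
    shm.unlink()
Whether splitting the work this way is actually faster depends on how much of the runtime is spent inside numpy versus on process startup and result collection.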

avoid for loop in python normal(size={}))

My goal is to create an array where each element is a normal(size=n) sample for the corresponding element n of it.
I am trying to optimize:
it = 2 ** arange(6, 25)
M = zeros(len(it))
for x in range(len(it)):
    M[x] = normal(size=it[x])
These attempts have not worked so far:
N = zeros(len(it))
it = 2 ** arange(6, 25)
N = (normal(size=it))
Further I tried:
N = (normal(size=it[:]))
Given my data, I believe such manual work, or a for loop, is really inefficient, so I am trying to come up with vectorized operations.
I receive:
File "mtrand.pyx", line 1335, in numpy.random.mtrand.RandomState.normal
File "common.pyx", line 557, in numpy.random.common.cont
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
You've not been very precise about where these functions are coming from, but I'm guessing that by normal(size=it[:]) you mean:
import numpy as np
it = 2 ** np.arange(6, 25)
np.random.normal(size=it)
which would be telling numpy to create a 19-dimensional array (i.e. len(it)) that contains 6 × 10^85 elements (i.e. np.prod(it.astype(float)), as floats because the number overflows an int64). numpy is saying that it can't do that, which seems like a reasonable thing to do.
Numpy doesn't like the "ragged arrays" you're trying to create, neither do most matrix/numeric libraries, hence support is limited!
I'm unsure why you consider that the "loop is really inefficient". You're creating ~33 million floats from 19 iterations of a simple Python loop. The vast majority of the time will be spent in highly optimised Numpy library code, and some tiny (basically unmeasurable) amount of time will be spent evaluating your Python bytecode.
If you really want a one-liner then you can do:
X = [np.random.normal(size=2**i) for i in range(6, 25)]
which makes the split between Numpy and Python worlds more obvious.
Note that on my laptop, the Python code executes in ~5µs while the Numpy code runs for ~800ms. So you're trying to optimise the 0.0006% part!
Note that it's not always a win to use Numpy's vectorization, it only helps with larger arrays, for example the above loop is "faster" than:
X = [np.random.normal(size=i) for i in 2**np.arange(6, 25)]
4.8 vs 5.1 µs for the Python code, because of the time spent marshalling objects into/out of the Numpy world. Again, none of this matters, just use whichever solution makes your code easier to understand. A few microseconds is nothing compared to seconds.

Fast way to construct a matrix in Python

I have been browsing through the questions, and could find some help, but I prefer having confirmation by asking it directly. So here is my problem.
I have a (numpy) array u of dimension N, from which I want to build a square matrix k of dimension N^2. Basically, each matrix element k(i,j) is defined as k(i,j) = exp(-|u_i - u_j|^2).
My first naive way to do it was like this, which is, I believe, Fortran-like:
for i in range(N):
    for j in range(N):
        k[i][j] = np.exp(np.sum(-(u[i]-u[j])**2))
However, this is extremely slow. For N=1000, for example, it is taking around 15 seconds.
My other way to proceed is the following (inspired by other questions/answers):
i, j = np.ogrid[:N,:N]
k = np.exp(np.sum(-(u[i]-u[j])**2,axis=2))
This is way faster, as for N=1000, the result is almost instantaneous.
So I have two questions.
1) Why is the first method so slow, and why is the second one so fast?
2) Is there a faster way to do it? For N=10000, it is already starting to take quite some time, so I really don't know if this is the "right" way to do it.
Thank you in advance!
P.S: the matrix is symmetric, so there must also be a way to make the process faster by calculating only the upper half of the matrix, but my question was more related to the way to manipulate arrays, etc.
First, a small remark: there is no need for np.sum if u can be rewritten as u = np.arange(N), which seems to be the case since you wrote that it is of dimension N.
1) First question:
Accessing indices in Python is slow, so it's best not to use [] if there is a way to avoid it. Moreover, you call np.exp and np.sum multiple times, whereas they can be called on whole vectors and matrices. So your second proposal is better, since you compute your k all at once instead of element by element.
2) Second question:
Yes, there is. You should consider using only numpy functions and no explicit indexing (around 3 times faster):
k = np.exp(-np.power(np.subtract.outer(u,u),2))
(NB: You can keep **2 instead of np.power, which is a bit faster but has smaller precision)
Edit (taking into account that u is an array of tuples):
With tuple data, it's a bit more complicated:
ma = np.subtract.outer(u[:,0],u[:,0])**2
mb = np.subtract.outer(u[:,1],u[:,1])**2
k = np.exp(-np.add(ma, mb))
You'll have to use np.subtract.outer twice, since doing it in one go would return a 4-dimensional array (and compute lots of useless data), whereas u[i]-u[j] returns a 3-dimensional array.
I used np.add instead of np.sum since it keeps the array dimensions.
NB: I checked with
N = 10000
u = np.random.random_sample((N,2))
It returns the same as your proposals (but 1.7 times faster).
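For reference, an equivalent broadcasting formulation (a sketch, not from the original answer) that handles the (N, 2) case in one expression:
import numpy as np

N = 1000
u = np.random.random_sample((N, 2))
# (N, 1, 2) - (1, N, 2) broadcasts to (N, N, 2); sum the squared differences over the last axis
k = np.exp(-np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=2))
It produces the same k as the two np.subtract.outer calls above.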

memory efficient way to make large zeros matrix python

I am currently trying to make a really large matrix and I am unsure how to do so in a memory-efficient way.
I was trying to use numpy, which worked fine for my smaller case (2750086 x 300).
However, I now have a larger one, 2750086 x 1000, which is just too big for me to run.
I thought about making it out of ints, but I will add float values to it, so I'm unsure how that could affect it.
I tried to find something about making a sparse zero-filled array, but couldn't find any great topics/questions/suggestions here or elsewhere.
Anyone got any good advice? I am currently using Python, so I am kind of looking for a pythonic solution, but I am willing to try other languages.
Thanks
edit:
Thanks for the advice. I've tried scipy.sparse.csr_matrix, which managed to create the matrix but greatly increased the time to go through it.
Here's roughly what I am doing:
matrix = scipy.sparse.csr_matrix((df.shape[0], 300))
## matrix = np.zeros((df.shape[0],
for i, q in enumerate(df['column'].values):
    matrix[i, :] = function(q)
where function is pretty much a vector operation function on that row.
Now, if I run the loop on the np.zeros matrix, it goes through quite easily, in about 10 minutes.
If I try to do the same with the scipy sparse matrix, it takes about 50 hours, which is not that reasonable.
Any advice?
Edit 2:
scipy.sparse.lil_matrix did the trick.
The loop takes about 20 minutes and uses way less memory than np.zeros.
Thanks.
Edit 3:
Still memory expensive. I decided not to store the data in a matrix: process one row at a time, get the relevant value/metric out of it, store the value in the original df, and run again.
Try scipy.sparse.csr_matrix:
from scipy.sparse import *
from scipy import *
a=csr_matrix( (2750086,1000), dtype=int8 )
Then a is
<2750086x1000 sparse matrix of type '<class 'numpy.int8'>'
with 0 stored elements in Compressed Sparse Row format>
For example, if you do:
from scipy.sparse import *
from scipy import *
a=csr_matrix( (5,4), dtype=int8 ).todense()
print(a)
You get:
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
Another option is to use scipy.sparse.lil_matrix:
a = scipy.sparse.lil_matrix((2750086,1000), dtype=int8 )
This seems to be more efficient for setting elements (like a[1,1]=2).
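Following the edits in the question, a rough sketch of that pattern (reusing the df and function placeholders from the question, and assuming each row is mostly zeros, otherwise a sparse matrix saves nothing): fill a lil_matrix row by row, then convert once at the end.
import numpy as np
import scipy.sparse

matrix = scipy.sparse.lil_matrix((df.shape[0], 300), dtype=np.float64)
for i, q in enumerate(df['column'].values):
    matrix[i, :] = function(q)    # lil_matrix supports cheap incremental row assignment
matrix = matrix.tocsr()           # convert once filling is done, for fast arithmetic and row slicing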

Stocking large numbers into numpy array

I have a dataset on which I'm trying to apply some arithmetical method.
The thing is, it gives me relatively large numbers, and when I do it with numpy, they're stored as 0.
The weird thing is, when I compute the numbers separately, they have an int value; they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]:36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have of course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]:True
How can I compute this large numbers using numpy without seeing them turned into zeros?
Note that if we remove the factor of 10 at the beginning, the problem changes slightly (but I think it might be for a similar reason):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]:311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate that the largest number of the int64 type in Python has been crossed, don't they?
As Nils Werner already mentioned, numpy's native C types cannot store numbers that large, but Python itself can, since its int objects use an arbitrary-length implementation.
So what you can do is tell numpy not to convert the numbers to ctypes but use the python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.
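You can check these limits directly with np.iinfo, for example:
import numpy as np

print(np.iinfo(np.int64).max)    # 9223372036854775807, roughly 9.2e18
print(np.iinfo(np.uint64).max)   # 18446744073709551615, roughly 1.8e19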
