I need to create a bit array in Python. So far, I've discovered that one can generate very memory-efficient arrays using the bitarray module.
However, my final intention is to use the @vectorize decorator from Numba. Numba works with only a limited set of Python and NumPy features, and bitarray is not one of them.
My question is, what's the best memory-efficient way of creating bit arrays using the structures that are supported by Numba?
I would go with numpy arrays, but I've done a quick memory test and it doesn't look good:
>>> import numpy as np
>>> import random
>>> from bitarray import bitarray
>>> from sys import getsizeof
>>> N = 10000
>>> a = bitarray(N)
>>> print(type(a), getsizeof(a))
<class 'bitarray.bitarray'> 96
>>> b = np.random.randint(0, 2, N)
>>> print(type(b), b.nbytes)
<class 'numpy.ndarray'> 40000
>>> c = [random.randint(0, 1) for i in range(N)]
>>> print(type(c), getsizeof(c))
<class 'list'> 87624
(to say nothing of the plain list)
EDIT: As a side question, does anyone have any idea why getsizeof returns such an unrealistically low number for bitarray? I've only just noticed this.
You can simply specify the data type:
N = 1000
b = np.random.randint(0, 2, N)
print(type(b), getsizeof(b))
<class 'numpy.ndarray'> 4096
c = np.random.randint(0, 2, N, dtype=bool)
print(type(c), getsizeof(c))
<class 'numpy.ndarray'> 1096
And for your side question: numpy builds much more into the array object than bitarray does, so it is less efficient in terms of the total memory of the object itself.
EDIT:
The memory footprint of an object in Python includes more than its raw data: the object header, references to attributes (such as shape in numpy, which is a tuple of integers), and so on. A Python list, in turn, is a dynamic array of references to separately allocated int objects, each carrying its own header (see the data structures section of the official docs), which is why it is by far the largest of the three.
Taking all of that into consideration, best practice is to use an appropriate data structure that works well in your pipeline, and to specify types whenever possible. Since you are using Numba, numpy seems the best fit. Memory is not always the issue.
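If memory is the overriding concern, a further option is to pack eight flags into each byte with np.packbits and keep the packed uint8 array, which Numba does support. A minimal sketch (plain NumPy for the packing; inside a Numba kernel you would index the uint8 array and do the bit twiddling yourself):

import numpy as np

N = 10000
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, N, dtype=np.uint8)  # one flag per byte

packed = np.packbits(bits)  # eight flags per byte
print(bits.nbytes)    # 10000
print(packed.nbytes)  # 1250

# Unpacking restores the flags (unpackbits pads to a multiple of 8;
# here N is already divisible by 8, so the slice is a no-op).
assert np.array_equal(np.unpackbits(packed)[:N], bits)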
I'm using Python to analyse some large files and I'm running into memory issues, so I've been using sys.getsizeof() to try to keep track of the usage, but its behaviour with numpy arrays is bizarre. Here's an example involving a map of albedos that I'm having to open:
>>> import numpy as np
>>> import struct
>>> from sys import getsizeof
>>> f = open('Albedo_map.assoc', 'rb')
>>> getsizeof(f)
144
>>> albedo = struct.unpack('%df' % (7200*3600), f.read(7200*3600*4))
>>> getsizeof(albedo)
207360056
>>> albedo = np.array(albedo).reshape(3600,7200)
>>> getsizeof(albedo)
80
Well, the data's still there, but the size of the object, a 3600x7200 pixel map, has gone from ~200 MB to 80 bytes. I'd like to hope that my memory issues are over and I can just convert everything to numpy arrays, but I feel that this behaviour, if true, would in some way violate some law of information theory or thermodynamics, so I'm inclined to believe that getsizeof() doesn't work with numpy arrays. Any ideas?
You can use array.nbytes for numpy arrays, for example:
>>> import numpy as np
>>> from sys import getsizeof
>>> a = [0] * 1024
>>> b = np.array(a)
>>> getsizeof(a)
8264
>>> b.nbytes
8192
The nbytes field will give you the size in bytes of all the elements of a numpy.array:
size_in_bytes = my_numpy_array.nbytes
Notice that this does not measure the "non-element attributes of the array object", so the actual size in bytes can be a few bytes larger than this.
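As a side note on the bizarre getsizeof results in the question above: numpy arrays do implement __sizeof__, but the data buffer is only counted for an array that owns its memory. A view, such as the result of reshape, reports just the small object header, which is presumably what happened with the reshaped albedo array. A minimal sketch:

import numpy as np
from sys import getsizeof

owner = np.zeros(1024)        # owns its 8192-byte buffer
view = owner.reshape(32, 32)  # shares the buffer with `owner`

print(getsizeof(owner))  # roughly 8192 plus a small header
print(getsizeof(view))   # header only; the buffer belongs to `owner`
print(view.nbytes)       # 8192, regardless of ownership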
In Python notebooks I often want to filter out 'dangling' numpy.ndarrays, in particular the ones stored in _1, _2, etc. that were never really meant to stay alive.
I use this code to get a listing of all of them and their size.
Not sure if locals() or globals() is better here.
import sys
import numpy
from humanize import naturalsize
for size, name in sorted(
        (value.nbytes, name)
        for name, value in locals().items()
        if isinstance(value, numpy.ndarray)):
    print("{:>30}: {:>8}".format(name, naturalsize(size)))
I just accidentally discovered that I can mix SymPy expressions with NumPy arrays:
>>> import numpy as np
>>> import sympy as sym
>>> x, y, z = sym.symbols('x y z')
>>> np.ones(5)*x
array([1.0*x, 1.0*x, 1.0*x, 1.0*x, 1.0*x], dtype=object)
# I was expecting this to throw an error!
# sum works and collects terms etc. as I would expect
>>> np.sum(np.array([x+0.1,y,z+y]))
x + 2*y + z + 0.1
# dot works too
>>> np.dot(np.array([x,y,z]),np.array([z,y,x]))
2*x*z + y**2
>>> np.dot(np.array([x,y,z]),np.array([1,2,3]))
x + 2*y + 3*z
This is quite useful for me, because I'm doing both numerical and symbolic calculations in the same program. However, I'm curious about the pitfalls and limitations of this approach: it seems, for example, that neither np.sin nor sym.sin works on NumPy arrays containing SymPy objects, since both give an error.
However, this numpy-sympy integration doesn't appear to be documented anywhere. Is it just an accident of how these libraries are implemented, or is it a deliberate feature? If the latter, when is it designed to be used, and when would it be better to use sympy.Matrix or other solutions? Can I expect to keep some of numpy's speed when working with arrays of this kind, or will it just drop back to Python loops as soon as a sympy symbol is involved?
In short I'm pleased to find this feature exists, but I would like to know more about it!
This is just NumPy's support for arrays of objects. It is not specific to SymPy. NumPy examines the operands and finds that not all of them are scalars; there are some objects involved. So it calls that object's __mul__ or __rmul__ and puts the result into an array of objects. For example, with mpmath objects:
>>> import mpmath as mp
>>> np.ones(5) * mp.mpf('1.23')
array([mpf('1.23'), mpf('1.23'), mpf('1.23'), mpf('1.23'), mpf('1.23')],
dtype=object)
or lists:
>>> np.array([[2], 3])*5
array([list([2, 2, 2, 2, 2]), 15], dtype=object)
>>> np.array([2, 3])*[[1, 1], [2]]
array([list([1, 1, 1, 1]), list([2, 2, 2])], dtype=object)
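To see the mechanism in isolation, here is a minimal sketch with a made-up class (the name Tagged is purely illustrative):

import numpy as np

class Tagged:
    def __init__(self, label):
        self.label = label
    def __rmul__(self, other):
        # NumPy falls back to this when the left operand's __mul__ fails.
        return Tagged("{}*{}".format(other, self.label))
    def __repr__(self):
        return "Tagged({!r})".format(self.label)

print(np.ones(3) * Tagged("x"))
# array([Tagged('1.0*x'), Tagged('1.0*x'), Tagged('1.0*x')], dtype=object)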
Can I expect to keep some of numpy's speed when working with arrays of this kind,
No. NumPy object arrays have no performance benefits over Python lists; there is probably more overhead in accessing elements than there would be in a list. See also: Storing Python objects in a Python list vs. a fixed-length Numpy array.
There is no reason to use such arrays if a more specific data structure is available.
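A rough timing sketch of that point (absolute numbers vary by machine, but the gap is typically an order of magnitude or more):

import numpy as np
import timeit

n = 100_000
obj_arr = np.arange(n, dtype=object)      # elements are Python ints
f64_arr = np.arange(n, dtype=np.float64)  # native machine doubles

print(timeit.timeit(obj_arr.sum, number=100))  # slow: per-element dispatch
print(timeit.timeit(f64_arr.sum, number=100))  # fast: one vectorized loop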
I just came across a relevant note in the numpy 1.15.1 release notes (https://docs.scipy.org/doc/numpy-1.15.1/release.html):
Comparison ufuncs accept dtype=object, overriding the default bool
This allows object arrays of symbolic types, which override == and other operators to return expressions, to be compared elementwise with np.equal(a, b, dtype=object).
I think that means that this works now, where it didn't before:
In [9]: np.array([x+.1, 2*y])==np.array([.1+x, y*2])
Out[9]: array([ True, True])
import numpy as np
a = np.random.randn(1, 2)
b = np.zeros((1, 2))
print("Data type of a: ", type(a))
print("Data type of b: ", type(b))
Output:
Data type of a:  <class 'numpy.ndarray'>
Data type of b:  <class 'numpy.ndarray'>
In np.zeros() we declare the array by giving the shape in two brackets, whereas in np.random.randn() we give it in one bracket. Is there any specific reason for the difference in syntax, given that both return the same data type?
In an effort to ease the transition for Matlab users to NumPy, some convenience functions like randn were built which use the same call signature as their Matlab equivalents.
The more NumPy-centric (as opposed to Matlab-centric) NumPy functions (such as np.zeros) expect the size (or shape) to be a tuple. This allows other parameters like dtype and order to be passed to the function as well.
The Matlab-centric functions assume all the arguments are part of the size.
np.random.randn is one of NumPy's Matlab-centric convenience functions, modeled after Matlab's randn. The more NumPy-centric alternative to np.random.randn is np.random.standard_normal.
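A quick sketch of the three spellings side by side:

import numpy as np

a = np.random.randn(2, 3)               # Matlab-style: dimensions as separate args
b = np.random.standard_normal((2, 3))   # NumPy-style: shape as a tuple
c = np.zeros((2, 3), dtype=np.float32)  # the tuple leaves room for dtype, order, ...

print(a.shape, b.shape, c.shape)  # (2, 3) (2, 3) (2, 3)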
I have a dataset to which I'm trying to apply some arithmetical method.
The thing is, it gives me relatively large numbers, and when I do it with numpy, they're stored as 0.
The weird thing is that when I compute the numbers separately, they have an int value; they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]: 36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have of course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]: True
How can I compute these large numbers using numpy without seeing them turned into zeros?
Note that if we remove the factor of 10 at the beginning, the problem changes slightly (but I think it might have a similar cause):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]: 311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate that the maximum of the int64 type has been crossed, don't they?
As Nils Werner already mentioned, numpy's native C types cannot hold numbers that large, but Python itself can, since its int objects use an arbitrary-length implementation.
So what you can do is tell numpy not to convert the numbers to C types but to use the Python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.
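You can confirm both the limit and the overflow directly with a short check:

import numpy as np

print(np.iinfo(np.int64).max)  # 9223372036854775807, about 9.2e18
print(150 ** 28)               # exact Python int, about 8.5e60

x = np.array([18], dtype=np.int64)
print(150 ** x)  # silently wrapped int64 result, not the true value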
I'm using Numpy and Python. I need to copy data, WITHOUT numeric conversion, between np.uint64 and np.float64, e.g. 1.5 <-> 0x3ff8000000000000.
I'm aware of float.hex, but its output format is a long way from uint64:
In [30]: a=1.5
In [31]: float.hex(a)
Out[31]: '0x1.8000000000000p+0'
I'm also aware of various string input routines for the other direction.
Can anybody suggest more direct methods? After all, it's just a simple copy and type change, but Python/numpy seem really rigid about converting the data on the way.
Use an intermediate array and the frombuffer method to "cast" one array type into the other:
>>> v = 1.5
>>> fa = np.array([v], dtype='float64')
>>> ua = np.frombuffer(fa, dtype='uint64')
>>> ua[0]
4609434218613702656 # 0x3ff8000000000000
Since frombuffer creates a view into the original buffer, this is efficient even for reinterpreting data in large arrays.
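The same reinterpretation can also be written with ndarray.view, which returns a view of the same buffer under a new dtype:

>>> fa = np.array([1.5], dtype='float64')
>>> hex(fa.view('uint64')[0])
'0x3ff8000000000000'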
So, what you need is to see the 8 bytes that represent the float64 in memory as an integer number (representing this int64 number as a hexadecimal string is another matter: that is just its representation).
The Struct and Union functionality bundled with the stdlib's ctypes may be nice for you, with no need for numpy. It has a Union type that works much like C-language unions and allows you to do this:
>>> import ctypes
>>> class Conv(ctypes.Union):
... _fields_ = [ ("float", ctypes.c_double), ("int", ctypes.c_uint64)]
...
>>> c = Conv()
>>> c.float = 1.5
>>> print(hex(c.int))
0x3ff8000000000000
The built-in "hex" function is a way to get the hexadecimal representation of the number.
You can use the struct module as well: pack the number as a double and unpack the bytes as an unsigned integer. I think it is both less readable and less efficient than using a ctypes Union:
>>> import struct
>>> hex(struct.unpack("<Q", struct.pack("<d", 1.5))[0])
'0x3ff8000000000000'
Since you are using numpy, however, you can simply change the array's dtype "on the fly" and manipulate the whole array as integers with zero copy:
>>> import numpy
>>> x = numpy.array((1.5,), dtype=numpy.double)
>>> x[0]
1.5
>>> x.dtype = numpy.dtype("uint64")
>>> x[0]
4609434218613702656
>>> hex(x[0])
'0x3ff8000000000000'
This is by far the most efficient way of doing it, whatever your purpose in getting the raw bytes of the float64 numbers.
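And because only the dtype label changed, never the underlying bytes, reassigning the dtype recovers the original float:

>>> x.dtype = numpy.dtype("float64")
>>> x[0]
1.5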