Python allocates integers automatically based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be fully loaded into memory.
So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer.
See the documentation for the array module: https://docs.python.org/library/array.html
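For illustration, a minimal sketch of how such an array behaves (the 2-byte item size is typical, not guaranteed):
from array import array

a = array("h")               # signed short
a.extend([1, -2, 30000])     # each value is stored in a C short
print(a.itemsize)            # usually 2
# a.append(40000)            # out of range for a signed short: raises OverflowError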
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module that packs C-style structs into a string:
From the documentation (https://docs.python.org/library/struct.html):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
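Note that the quoted output assumes a big-endian machine with 4-byte longs; a hedged variant with an explicit byte-order prefix gives a platform-independent layout:
from struct import pack, unpack, calcsize

data = pack('<hhl', 1, 2, 3)      # '<' = little-endian, standard sizes, no padding
print(data)                       # b'\x01\x00\x02\x00\x03\x00\x00\x00'
print(unpack('<hhl', data))       # (1, 2, 3)
print(calcsize('<hhl'))           # 8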
You can use NumPy's fixed-width integer types, such as np.int8 or np.int16.
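A small sketch, assuming NumPy is installed:
import numpy as np

a = np.array([1, 2, 3], dtype=np.int16)   # 2 bytes per element
print(a.dtype, a.itemsize, a.nbytes)      # int16 2 6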
Armin's suggestion of the array module is probably best. Two possible alternatives:
You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure. Ugly, but it can be made to work (see the sketch below).
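A rough sketch of that bit-manipulation trick, treating each half as an unsigned 16-bit value (the helper names are made up):
def pack_pair(low, high):
    # store two 16-bit values in one Python int
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)

def unpack_pair(packed):
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF

p = pack_pair(1234, 65535)
print(unpack_pair(p))   # (1234, 65535)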
It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when the number was larger than 32 bits).
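For reference, a minimal sketch of the IIBTree approach, assuming the BTrees package (which provides ZODB's BTree implementations) is installed:
from BTrees.IIBTree import IIBTree

t = IIBTree()          # keys and values are C 32-bit ints, not Python objects
t[42] = 7
print(t[42])           # 7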
You can also pack multiple integers of any size into a single large Python integer.
For example, as seen below, in Python 3 on a 64-bit x86 system a 1024-bit integer takes 164 bytes of memory, which means one byte stores around 6.24 bits on average. With even larger integers you get an even higher storage density, for example around 7.50 bits per byte with an integer 2**20 bits wide.
Obviously you will need some wrapper logic to access the individual short numbers stored in the larger integer, which is easy to implement (a sketch follows the numbers below).
One issue with this approach is that data access slows down because of the large-integer arithmetic involved.
If you access a big batch of consecutively stored integers at once, so that the large integer is touched less often, the slower access won't be much of an issue.
I guess using numpy would be the easier approach.
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
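A hypothetical wrapper for the scheme above, storing unsigned 16-bit slots inside one large integer; a sketch, not a definitive implementation:
class PackedShorts:
    def __init__(self):
        self.big = 0                     # all slots live in this one integer

    def get(self, i):
        return (self.big >> (16 * i)) & 0xFFFF

    def set(self, i, value):
        shift = 16 * i
        self.big = (self.big & ~(0xFFFF << shift)) | ((value & 0xFFFF) << shift)

p = PackedShorts()
p.set(3, 54321)
print(p.get(3))   # 54321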
Using a bytearray in Python, which is basically a C unsigned char array under the hood, is a better solution than using large integers. There is no overhead for manipulating a byte array, and it has much less storage overhead than large integers. It's possible to get a storage density of 7.99+ bits per byte with bytearrays.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
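A sketch of reading and writing 16-bit slots in a bytearray with the struct module (little-endian chosen arbitrarily):
import struct

buf = bytearray(2 * 1000)                        # room for 1000 unsigned shorts
struct.pack_into('<H', buf, 2 * 10, 54321)       # write slot 10
print(struct.unpack_from('<H', buf, 2 * 10)[0])  # 54321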
Related
I'm developing a program that will deal with approx. 90 billion records, so I need to manage memory carefully. Which is larger in memory: an 8-character string or an 8-digit int?
Details:
-Python 3.7.4
-64 bits
Edit1:
Following the advice of user8080blablabla I got:
sys.getsizeof(99999999)
28
sys.getsizeof("99999999")
57
Seriously? An 8-character string is 57 bytes long?!
An int will generally take less memory than its representation as a string, because it is more compact. However, because Python int values are objects, they still take quite a lot of space each compared to primitive values in other languages: the integer object 1 takes up 28 bytes of memory on my machine.
>>> import sys
>>> sys.getsizeof(1)
28
If minimising memory use is your priority, and there is a maximum range the integers can be in, consider using the array module. It can store numeric data (or Unicode characters) in an array, in a primitive data type of your choice, so that each value isn't an object taking up 28+ bytes.
>>> from array import array
>>> arr = array('I') # unsigned int in C
>>> arr.extend(range(10000))
>>> arr.itemsize
4
>>> sys.getsizeof(arr)
40404
The actual number of bytes used per item is dependent on the machine architecture. On my machine, each number takes 4 bytes; there are 404 bytes of overhead for an array of length 10,000. Check arr.itemsize on your machine to see if you need a different primitive type; fewer than 4 bytes is not enough for an 8-digit number.
That said, you should not be trying to fit 90 billion numbers in memory, at 4 bytes each; this would take 360GB of memory. Look for a solution which doesn't require holding every record in memory at once.
You ought to remember that strings are represented as Unicode in Python, so storing a digit in a string can take upwards of 4 bytes per character, which is why you see such a large discrepancy between int and str (an interesting read on the topic).
If you are worried about memory allocation I would instead recommend using pandas to manage the backend for you when it comes to manipulating large datasets.
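For example, a hedged sketch of asking pandas for a narrow integer dtype up front (the file and column names are placeholders):
import pandas as pd

df = pd.read_csv('records.csv', dtype={'value': 'int32'})  # hypothetical file/column
print(df['value'].memory_usage(deep=True))                 # bytes used by the column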
Is it possible to make Python use less than 12 bytes for an int?
>>> x=int()
>>> x
0
>>> sys.getsizeof(x)
12
I am not a computer specialist but isn't 12 bytes excessive?
The smallest int I want to store is 0, the largest int 147097614, so I shouldn't really need more than 4 bytes.
(There is probably something I misunderstand here as I couldn't find an answer anywhere on the net. Keep that in mind.)
In python, ints are objects just like everything else. Because of that, there is a little extra overhead just associated with the fact that you're using an object which has some associated meta-data.
If you're going to use lots of ints, and it makes sense to lay them out in an array-like structure, you should look into numpy. Numpy ndarray objects will have a little overhead associated with them for the various pieces of meta-data that the array objects keep track of, but the actual data is stored as the datatype you specify (e.g. numpy.int32 for a 4-byte integer.)
Thus, if you have:
import numpy as np
a = np.zeros(5000,dtype=np.int32)
The array will take only slightly more than 4*5000 = 20000 bytes of your memory.
Size of an integer object includes the overhead of maintaining other object information along with its value. The additional information can include object type, reference count and other implementation-specific details.
If you store many integers and want to optimize the space spent, use the array module, specifically arrays constructed with array.array('i').
Integers in python are objects, and are therefore stored with extra overhead.
You can read more information about it here
The integer type in cpython is stored in a structure like so:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro that expands out into a reference count and a pointer to the type object.
So you can see that:
long ob_ival - 4 bytes for a long.
Py_ssize_t ob_refcnt - I would assume size_t is 4 bytes here.
PyTypeObject *ob_type - a pointer, so another 4 bytes.
12 bytes in total!
How do I represent minimum and maximum values for integers in Python? In Java, we have Integer.MIN_VALUE and Integer.MAX_VALUE.
See also: What is the maximum float in Python?.
Python 3
In Python 3, this question doesn't apply. The plain int type is unbounded.
However, you might actually be looking for information about the current interpreter's word size, which will be the same as the machine's word size in most cases. That information is still available in Python 3 as sys.maxsize, which is the maximum value representable by a signed word. Equivalently, it's the size of the largest possible list or in-memory sequence.
Generally, the maximum value representable by an unsigned word will be sys.maxsize * 2 + 1, and the number of bits in a word will be math.log2(sys.maxsize * 2 + 2). See this answer for more information.
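A quick check of those relationships:
import math
import sys

print(sys.maxsize)                     # e.g. 9223372036854775807 on a 64-bit build
print(sys.maxsize * 2 + 1)             # maximum unsigned word value
print(math.log2(sys.maxsize * 2 + 2))  # number of bits in a word, e.g. 64.0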
Python 2
In Python 2, the maximum value for plain int values is available as sys.maxint:
>>> sys.maxint # on my system, 2**63-1
9223372036854775807
You can calculate the minimum value with -sys.maxint - 1 as shown in the docs.
Python seamlessly switches from plain to long integers once you exceed this value. So most of the time, you won't need to know it.
If you just need a number that's bigger than all others, you can use
float('inf')
in similar fashion, a number smaller than all others:
float('-inf')
This works in both python 2 and 3.
The sys.maxint constant has been removed from Python 3.0 onward, instead use sys.maxsize.
Integers
PEP 237: Essentially, long renamed to int. That is, there is only one built-in integral type, named int; but it behaves mostly like the old long type.
...
The sys.maxint constant was removed, since there is no longer a limit to the value of integers. However, sys.maxsize can be used as an integer larger than any practical list or string index. It conforms to the implementation’s “natural” integer size and is typically the same as sys.maxint in previous releases on the same platform (assuming the same build options).
For Python 3, it is
import sys
max_size = sys.maxsize
min_size = -sys.maxsize - 1
In Python 2, integers will automatically switch from a fixed-size int representation into a variable-width long representation once you pass the value sys.maxint, which is either 2**31 - 1 or 2**63 - 1 depending on your platform. Notice the L that gets appended here:
>>> 9223372036854775807
9223372036854775807
>>> 9223372036854775808
9223372036854775808L
From the Python manual:
Numbers are created by numeric literals or as the result of built-in functions and operators. Unadorned integer literals (including binary, hex, and octal numbers) yield plain integers unless the value they denote is too large to be represented as a plain integer, in which case they yield a long integer. Integer literals with an 'L' or 'l' suffix yield long integers ('L' is preferred because 1l looks too much like eleven!).
Python tries very hard to pretend its integers are mathematical integers and are unbounded. It can, for instance, calculate a googol with ease:
>>> 10**100
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
You may use 'inf' like this:
import math
bool_true = 0 < math.inf
bool_false = 0 < -math.inf
Refer: math — Mathematical functions
If you want the max for array or list indices (equivalent to size_t in C/C++), you can use numpy:
np.iinfo(np.intp).max
This is the same as sys.maxsize; however, the advantage is that you don't need to import sys just for this.
If you want the max for the native int on the machine:
np.iinfo(np.intc).max
You can look at other available types in doc.
For floats you can also use sys.float_info.max.
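Put together, assuming NumPy is installed:
import sys
import numpy as np

print(np.iinfo(np.intp).max)   # max index value, same as sys.maxsize
print(np.iinfo(np.intc).max)   # max of the machine's native C int
print(sys.float_info.max)      # largest representable float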
sys.maxsize is not actually the maximum integer value that is supported. You can double maxsize, or multiply it by itself, and it stays a valid and correct value.
However, if you try sys.maxsize ** sys.maxsize, it will hang your machine for a significant amount of time. As many have pointed out, the byte and bit size does not seem to be relevant, because it practically doesn't exist. Python just happily expands its integers when it needs more memory space, so in general there is no limit.
Now, if you're talking about packing or storing integers in a safe way where they can later be retrieved with integrity, then of course that is relevant. I'm really not sure about packing, but I know Python's pickle module handles those things well. String representations obviously have no practical limit.
So really, the bottom line is: what is your application's limit? What does it require for numeric data? Use that limit instead of Python's practically nonexistent integer limit.
I rely heavily on commands like this.
python -c 'import sys; print(sys.maxsize)'
Max int returned: 9223372036854775807
For more references on sys, see:
https://docs.python.org/3/library/sys.html
https://docs.python.org/3/library/sys.html#sys.maxsize
The code given below will help you. For the maximum value you can use sys.maxsize, and for the minimum you can negate the same value and use it.
import sys
ni=sys.maxsize
print(ni)
I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary buffer (or whatever Python stores the result of an "fread" in), I can access the individual bytes in it with buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on Python 2.4 (and can't upgrade, big corporation servers) - so that apparently limits my options. Sorry for not mentioning that earlier, I wasn't aware it had so many fewer features.
Secondly, let's make this ultra-simple.
Let's say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex - this equates to the ASCII letter 'L' followed by four 'S' characters, and nothing else in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.
What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
import struct
f=open('myfile.binary', 'rb')
contents=f.read(5)
data=list(contents)
data[0:4]=struct.pack(">i", 0x12345678)
print data
Have a look at the struct module. You need the pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.
The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
with open("filename") as f:
f.seek(0, 2)
size = f.tell()
f.seek(0)
data = array.array("i")
assert data.itemsize == 4
data.fromfile(f, size // 4)
data[2] = new_value
# use data.tofile(g) to write the data back to a new file g
You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(file=id, dtype=numpy.uint32)
Then access the data at the desired location and make your changes.
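A hedged sketch of that workflow, using placeholder file names and assuming the file length is a multiple of 4 bytes:
import numpy as np

data = np.fromfile('myfile.binary', dtype=np.uint32)  # read the whole file as 32-bit words
data[2] = 0x12345678                                  # patch the third word
data.tofile('myfile.patched')                         # write the modified buffer back out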
The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (i.e. x86 machines), this number can be represented as a length-4 byte string, b"\x6d\x3b\x00\x00", which prints as b"m;\x00\x00". To convert the four bytes back into an integer, we simply do a bit of base conversion, which in this case is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213
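In Python 3 the same conversion is built in; a quick check of the example above:
value = int.from_bytes(b"\x6d\x3b\x00\x00", byteorder="little")
print(value)                                    # 15213
print((15213).to_bytes(4, byteorder="little"))  # b'm;\x00\x00'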
I need to get the numbers of one line at random, put each line in another array, and then get the numbers of one column.
I have a big file, more than 400MB. In that file there are 13496*13496 numbers, meaning 13496 rows and 13496 columns. I want to read them into an array.
This is my code:
i = 0
_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
while (i<13496):
    print "i="+str(i)
    _strlf = _L1file.readline()
    _strlf = _strlf.split('\t')
    _strlf = _strlf[:-1]
    _L1[i] = _strlf
    i += 1
_L1file.close()
And this is my error message:
MemoryError:
File "D:\research\space-function\ART3.py", line 30, in <module>
_strlf = _strlf.split('\t')
You might want to approach your problem in another way: process the file line by line. I don't see a need to store the whole big file in an array. Otherwise, you might want to tell us what you are actually trying to do.
for line in open("400MB_file"):
    # do something with line.
Or
f=open("file")
for linenum,line in enumerate(f):
if linenum+1 in [2,3,10]:
print "there are ", len(line.split())," columns" #assuming you want to split on spaces
print "100th column value is: ", line.split()[99]
if linenum+1>10:
break # break if you want to stop after the 10th line
f.close()
This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numbers, for example). I'm not even taking your particular runtime's array metadata into account, though this would typically be a tiny overhead on a simple array.
Assuming each array element is just a single byte, your computer needs around 180MB of RAM to hold it in memory in its entirety. Trying to process it could be impractical.
You need to think about the problem a different way; as has already been mentioned, a line-by-line approach might be a better option. Or perhaps processing the grid in smaller units, perhaps 10x10 or 100x100, and aggregating the results. Or maybe the problem itself can be expressed in a different form, which avoids the need to process the entire dataset altogether...?
If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.
Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe.
On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large: even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes (I don't know how long your strings are) per entry.
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each.
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container) then the returned size doesn't include the size of the contained list objects; only of the structure used to hold those objects. Here are some values on my machine.
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10)) # 72 + 8 bytes for each pointer
152
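A rough comparison along those lines, assuming a 64-bit CPython similar to the one above:
import sys
from array import array

n = 10000
as_list = [float(i) for i in range(n)]
as_array = array('d', as_list)

# The list size excludes the float objects themselves (24 bytes each here),
# while the array stores raw 8-byte doubles.
print(sys.getsizeof(as_list) + n * sys.getsizeof(0.0))  # roughly 320 KB
print(sys.getsizeof(as_array))                          # roughly 80 KB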
MemoryError exception:
Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.
It seems that, at least in your case, reading the entire file into memory is not a doable option.
Replace this:
_strlf = _strlf[:-1]
with this:
_strlf = [float(val) for val in _strlf[:-1]]
You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.
You might want to include some handling for null values.
You can also go to numpy's array type, but your problem may be too small to bother.
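A sketch combining those suggestions, parsing line by line and keeping each row as a compact array of C floats (4 bytes per value) instead of a list of strings; it assumes the same tab-separated layout as the question:
from array import array

rows = []
with open('distanceCMD.function.txt') as f:
    for line in f:
        fields = line.split('\t')[:-1]
        rows.append(array('f', (float(val) for val in fields)))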