My understanding is that os.urandom(size) outputs a random string of bytes of the given "size", but then:
import os
import sys
print(sys.getsizeof(os.urandom(42)))
>>>
75
Why is this not 42?
And a related question:
import base64
import binascii
print(sys.getsizeof(base64.b64encode(os.urandom(42))))
print(sys.getsizeof(binascii.hexlify(os.urandom(42))))
>>>
89
117
Why are these so different? Which encoding would be the most memory efficient way to store a string of bytes such as that given by os.urandom?
Edit: It seems like quite a stretch to say that this question is a duplicate of What is the difference between len() and sys.getsizeof() methods in python? My question is not about the difference between len() and getsizeof(). I was confused about the memory used by Python objects in general, which the answer to this question has clarified for me.
Python byte string objects are more than just the characters that comprise them. They are fully fledged objects. As such they require more space to accommodate the object's components such as the type pointer (needed to identify what kind of object the bytestring even is) and the length (needed for efficiency and because Python bytestrings can contain null bytes).
The simplest object, an object instance, requires space:
>>> sys.getsizeof(object())
16
The second part of your question is simply because the strings produced by b64encode() and hexlify() have different lengths; the latter being 28 characters longer which, unsurprisingly, is the difference in the values reported by sys.getsizeof().
>>> s1 = base64.b64encode(os.urandom(42))
>>> s1
b'CtlMjDM9q7zp+pGogQci8gr0igJsyZVjSP4oWmMj2A8diawJctV/8sTa'
>>> s2 = binascii.hexlify(os.urandom(42))
>>> s2
b'c82d35f717507d6f5ffc5eda1ee1bfd50a62689c08ba12055a5c39f95b93292ddf4544751fbc79564345'
>>> len(s2) - len(s1)
28
>>> sys.getsizeof(s2) - sys.getsizeof(s1)
28
Unless you use some form of compression, there is no encoding that will be more efficient than the binary string that you already have, and this is particularly true in this case because the data is random, which makes it inherently incompressible.
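As a quick CPython-specific sanity check of both points, you can subtract the payload length from the reported size to see the fixed bytes-object overhead (the exact constant varies by version and platform), and compare the encoded lengths directly:
import base64
import binascii
import os
import sys

data = os.urandom(42)
# Fixed per-object overhead of a bytes object (33 bytes on many 64-bit builds).
print(sys.getsizeof(data) - len(data))

b64 = base64.b64encode(data)      # 4 characters per 3 input bytes -> 56
hexed = binascii.hexlify(data)    # 2 characters per input byte    -> 84
print(len(b64), len(hexed), len(hexed) - len(b64))   # 56 84 28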
In Python 2, an empty string occupies exactly 37 bytes:
>>> print sys.getsizeof('')
37
In Python 3.6, the same call outputs 49 bytes:
>>> print(sys.getsizeof(''))
49
Now I thought that this was due to the fact that in Python 3 all strings are Unicode. But, to my surprise, here are some confusing outputs:
Python 2.7
>>> print sys.getsizeof(u'')
52
>>> print sys.getsizeof(u'1')
56
Python 3.6
>>> print(sys.getsizeof(''))
49
>>> print(sys.getsizeof('1'))
50
An empty string is not the same size.
Four additional bytes are needed when adding a character in Python 2, but only one in Python 3.
Why is the memory footprint different between the two versions?
EDIT
I specified the exact version of my Python environment, because there are differences between different Python 3 builds.
There are reasons of course, but really, it should not matter for any practical purpose. If you have a Python system in which you have to keep so many strings in memory that you get close to the system memory, you should optimize it by (1) trying to lazily load/create strings in memory or (2) using a byte-oriented, efficient binary structure to deal with your data, such as those provided by NumPy or Python's own bytearray.
The change for the empty string literal (unicode literal for Py2) could be due to any implementation detail between the versions you are looking at, and it should not matter even if you were writing C code to interact directly with Python strings: even that code should only touch the strings via the API.
Now, the specific reason why a string in Python 3 grows by just 1 byte per character, while in Python 2 it grows by 4 bytes, is PEP 393.
Before Python 3.3, any (unicode) string in Python used either a fixed 2 bytes or a fixed 4 bytes of memory for each character - and the Python interpreter and Python modules using native code had to be compiled to use just one of these. I.e. you effectively could have incompatible Python binaries, even if the versions matched, due to the string-width option picked at build time - the builds were known as "narrow builds" and "wide builds". With the above-mentioned PEP 393, Python strings have their character size determined when they are instantiated, depending on the widest Unicode code point they contain. Strings whose code points all fall within the first 256 code points (equivalent to the Latin-1 character set) use only 1 byte per character.
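If you want to check which kind of build you are running (or confirm the distinction is gone in 3.3+), sys.maxunicode is the usual indicator:
import sys

# 0xffff (65535) on a Python 2 "narrow" build, 0x10ffff (1114111) on a
# "wide" build and on any Python 3.3+ interpreter.
print(hex(sys.maxunicode))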
Internally, Python 3 now stores strings in one of four different encodings, choosing the encoding separately for each string. These encodings are ASCII, LATIN-1, UCS-2, and UTF-32. Each one is capable of representing a different subset of Unicode characters and has the useful property that the element at index i is also the Unicode code point at index i.
In [1]: import sys
In [2]: sys.getsizeof('\xFF')
Out[2]: 74
In [3]: sys.getsizeof('X\xFF')
Out[3]: 75
In [4]: sys.getsizeof('\u0100')
Out[4]: 76
In [5]: sys.getsizeof('X\u0100')
Out[5]: 78
In [6]: sys.getsizeof('\U00010000')
Out[6]: 80
In [7]: sys.getsizeof('X\U00010000')
Out[7]: 84
You can see that adding an additional character, in this case 'X', to a string causes that string to take up additional space depending on the values contained in the rest of the string.
This system was proposed in PEP 393 and implemented in Python 3.3. Earlier versions of Python use the older Unicode representation, which always used 2 or 4 bytes per... element (I hesitate to say "character"), depending on compile-time options, and the two could not be mixed.
When I ran the code below, I got 3 and 36 as the answers, respectively.
import sys

x = "abd"
print len(x)
print sys.getsizeof(x)
Can someone explain the difference between them?
They are not the same thing at all.
len() queries for the number of items contained in a container. For a string that's the number of characters:
Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).
sys.getsizeof() on the other hand returns the memory size of the object:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Python string objects are not simple sequences of characters, 1 byte per character.
Specifically, the sys.getsizeof() function includes the garbage collector overhead if any:
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
String objects do not need to be tracked (they cannot create circular references), but string objects do need more memory than just the bytes per character. In Python 2, the str.__sizeof__ method returns (in C code):
Py_ssize_t res;
res = PyStringObject_SIZE + PyString_GET_SIZE(v) * Py_TYPE(v)->tp_itemsize;
return PyInt_FromSsize_t(res);
where PyStringObject_SIZE is the C struct header size for the type, PyString_GET_SIZE basically is the same as len() and Py_TYPE(v)->tp_itemsize is the per-character size. In Python 2.7, for byte strings, the size per character is 1, but it's PyStringObject_SIZE that is confusing you; on my Mac that size is 37 bytes:
>>> sys.getsizeof('')
37
For unicode strings the per-character size goes up to 2 or 4 (depending on compilation options). On Python 3.3 and newer, Unicode strings take up between 1 and 4 bytes per character, depending on the contents of the string.
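A rough empirical check of that formula (shown in Python 3 syntax; the same pattern holds for Python 2 byte strings, just with a different header size):
import sys

base = sys.getsizeof(b"")        # fixed per-object header size
for n in (1, 10, 100):
    s = b"x" * n
    # The per-character size for byte strings is 1, so the difference
    # grows exactly with the length.
    print(n, sys.getsizeof(s) - base)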
The key difference is that len() gives the actual number of elements in a container, whereas sys.getsizeof() gives the size in bytes of the memory the object occupies.
For more information, read the Python docs, available at
https://docs.python.org/3/library/sys.html#module-sys
What does sys.getsizeof return for a standard string? I am noticing that this value is much higher than what len returns.
I will attempt to answer your question from a broader point of view. You're referring to two functions and comparing their outputs. Let's take a look at their documentation first:
len():
Return the length (the number of items) of an object. The argument may
be a sequence (such as a string, bytes, tuple, list, or range) or a
collection (such as a dictionary, set, or frozen set).
So in the case of a string, you can expect len() to return the number of characters.
sys.getsizeof():
Return the size of an object in bytes. The object can be any type of
object. All built-in objects will return correct results, but this
does not have to hold true for third-party extensions as it is
implementation specific.
So in the case of a string (as with many other objects) you can expect sys.getsizeof() to return the size of the object in bytes. There is no reason to think that it should be the same as the number of characters.
Let's have a look at some examples:
>>> first = "First"
>>> len(first)
5
>>> sys.getsizeof(first)
42
This example confirms that the size is not the same as the number of characters.
>>> second = "Second"
>>> len(second)
6
>>> sys.getsizeof(second)
43
We can notice that if we look at a string one character longer, its size is one byte bigger as well. We don't know if it's a coincidence or not though.
>>> together = first + second
>>> print(together)
FirstSecond
>>> len(together)
11
If we concatenate the two strings, their combined length is equal to the sum of their lengths, which makes sense.
>>> sys.getsizeof(together)
48
Contrary to what someone might expect though, the size of the combined string is not equal to the sum of their individual sizes. But it still seems to be the length plus something - in particular, something worth 37 bytes. Now you need to realize that it's 37 bytes in this particular case, using this particular Python implementation, etc. You should not rely on that at all. Still, we can take a look at what those 37 bytes are (approximately) used for.
In CPython (probably the most widely used implementation of Python), string objects are implemented as PyStringObject. This is the C source code (I use the 2.7.9 version):
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
* ob_sstate != 0 iff the string object is in stringobject.c's
* 'interned' dictionary; in this case the two references
* from 'interned' to this object are *not counted* in ob_refcnt.
*/
} PyStringObject;
You can see that there is something called PyObject_VAR_HEAD, one int, one long and a char array. The char array will always contain one more character, to store the '\0' at the end of the string. This, along with the int, the long and PyObject_VAR_HEAD, takes up the additional 37 bytes. PyObject_VAR_HEAD is defined in another C source file and refers to other implementation-specific stuff; you need to explore it if you want to find out exactly where the 37 bytes go. Plus, the documentation mentions that sys.getsizeof()
adds an additional garbage collector overhead if the object is managed
by the garbage collector.
Overall, you don't need to know what exactly takes the something (the 37 bytes here) but this answer should give you a certain idea why the numbers differ and where to find more information should you really need it.
To quote the documentation:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Built-in strings are not simple character sequences - they are fully fledged objects, with garbage collection overhead, which probably explains the size discrepancy you're noticing.
Consider the following code:
arr = []
for (str, id, flag) in some_data:
arr.append((str, id, flag))
Imagine the input strings being 2 chars long on average and 5 chars at most, and some_data having 1 million elements.
What will the memory requirement of such a structure be?
May it be that a lot of memory is wasted for the strings? If so, how can I avoid that?
In this case, because the strings are quite short, and there are so many of them, you stand to save a fair bit of memory by using intern on the strings. Assuming there are only lowercase letters in the strings, a two-character string has only 26 * 26 = 676 possibilities, so there must be a lot of repetitions in this list; intern will ensure that those repetitions don't result in unique objects, but all refer to the same base object.
It's possible that Python already interns short strings; but looking at a number of different sources, it seems this is highly implementation-dependent. So calling intern in this case is probably the way to go; YMMV.
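As a minimal sketch of what that might look like (assuming Python 3, where intern lives in the sys module; some_data here is a small stand-in for the real iterable from the question):
import sys

some_data = [("ab", 1, True), ("cd", 2, False), ("ab", 3, True)]  # stand-in data

arr = []
for (s, id_, flag) in some_data:
    # sys.intern() returns a canonical copy, so repeated strings share one object.
    arr.append((sys.intern(s), id_, flag))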
As an elaboration on why this is very likely to save memory, consider the following:
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43
Adding single characters to a string adds only a byte to the size of the string itself, but every string takes up 40 bytes on its own.
In recent Python 3 (64-bit) versions, string instances take up 49+ bytes. But also keep in mind that if you use non-ASCII characters, the memory usage jumps up even more:
>>> sys.getsizeof('t')
50
>>> sys.getsizeof('я')
76
Notice how even if one character in a string is non-ASCII, all other characters will take up more space (2 or 4 bytes each):
>>> sys.getsizeof('t12345')
55 # +5 bytes, compared to 't'
>>> sys.getsizeof('я12345')
86 # +10 bytes, compared to 'я'
This has to do with the internal representation of strings since Python 3.3. See PEP 393 -- Flexible String Representation for more details.
Python, in general, is not very memory efficient when it comes to having lots of small objects, not just strings. See these examples:
>>> sys.getsizeof(1)
28
>>> sys.getsizeof(True)
28
>>> sys.getsizeof([])
56
>>> sys.getsizeof(dict())
232
>>> sys.getsizeof((1,1))
56
>>> sys.getsizeof([1,1])
72
Interning strings could help, but make sure you don't have too many unique values, as that could do more harm than good.
It's hard to tell how to optimize your specific case, as there is no single universal solution. You could save a lot of memory if you somehow serialize the data from multiple items into a single byte buffer, for example, but that could complicate your code or affect performance too much. In many cases it won't be worth it, but if I were in a situation where I really needed to optimize memory usage, I would also consider writing that part in a language like Rust (it's not too hard to create a native Python module via PyO3, for example).
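For illustration only, here is a hypothetical sketch of that byte-buffer idea using the struct module, assuming each record is a string of at most 5 ASCII characters, a 32-bit id and a one-byte flag:
import struct

record = struct.Struct("=5sIB")   # fixed 10-byte record, no padding

def pack_items(items):
    buf = bytearray()
    for s, id_, flag in items:
        buf += record.pack(s.encode("ascii"), id_, flag)
    return bytes(buf)

def unpack_item(buf, index):
    s, id_, flag = record.unpack_from(buf, index * record.size)
    return s.rstrip(b"\x00").decode("ascii"), id_, flag

items = [("ab", 1, 1), ("cde", 2, 0)]
buf = pack_items(items)
print(unpack_item(buf, 1))   # ('cde', 2, 0)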
If your strings are that short, it is likely there will be a significant number of duplicates. Python interning will optimise this so that each distinct string is stored only once and referenced multiple times, rather than storing the string multiple times...
These strings should be interned automatically as they are created.
Python allocates integers automatically based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be fully loaded into memory.
So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer.
documentation for the array module
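A minimal usage sketch (values are converted back to regular Python ints whenever you read them out):
from array import array

a = array("h")            # signed 16-bit integers
a.extend([1, 2, 3, -4])
print(a.itemsize)         # 2 bytes per element
print(a[3])               # -4, read back as an ordinary Python int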
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module, which packs C-style structs in a string:
From the documentation (https://docs.python.org/library/struct.html):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
You can use NumPy's integer dtypes, such as np.int8 or np.int16.
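For example, a million 16-bit values then stay at roughly 2 MB of data plus a small fixed overhead for the array object itself:
import numpy as np

values = np.zeros(1000000, dtype=np.int16)
print(values.itemsize)    # 2 bytes per element
print(values.nbytes)      # 2000000 bytes of data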
Armin's suggestion of the array module is probably best. Two possible alternatives:
You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, that's pretty simple to do.
You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure. Ugly, but it can be made to work (a sketch of this follows below).
It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
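A hypothetical sketch of the bit-packing alternative mentioned above, storing two unsigned 16-bit values in the lower and upper halves of one int:
def pack_pair(lo, hi):
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack_pair(packed):
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF

p = pack_pair(1000, 65535)
print(unpack_pair(p))     # (1000, 65535)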
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when a number was larger than 32 bits.)
You can also store multiple integers of any size in a single large integer.
For example, as seen below, in Python 3 on a 64-bit x86 system, a 1024-bit integer takes 164 bytes of memory. That means that, on average, one byte can store around 6.24 bits. And if you go with even larger integers you can get an even higher bit-storage density, for example around 7.50 bits per byte with a 2**20-bit-wide integer.
Obviously you will need some wrapper logic to access the individual short numbers stored in the larger integer, which is easy to implement.
One issue with this approach is that your data access will slow down due to the use of large-integer operations.
If you access a big batch of consecutively stored integers at once, to minimize the number of accesses to the large integer, the slower access won't be much of an issue.
I guess using NumPy would be the easier approach.
>>> import sys
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
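A hypothetical wrapper around this "many shorts in one big integer" idea, where slot i occupies bits 16*i through 16*i + 15:
class PackedShorts:
    def __init__(self):
        self.value = 0

    def get(self, i):
        return (self.value >> (16 * i)) & 0xFFFF

    def set(self, i, v):
        mask = 0xFFFF << (16 * i)
        self.value = (self.value & ~mask) | ((v & 0xFFFF) << (16 * i))

ps = PackedShorts()
ps.set(3, 12345)
print(ps.get(3))          # 12345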
Using a bytearray in Python, which is basically a C unsigned char array under the hood, will be a better solution than using large integers. There is no extra overhead for manipulating a bytearray, and it has much less storage overhead than large integers. It's possible to get a storage density of 7.99+ bits per byte with bytearrays.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
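A minimal sketch of using a bytearray as a packed array of signed 16-bit values, two bytes per slot (Python 3 syntax; int.to_bytes/from_bytes are standard):
buf = bytearray(2 * 1000000)       # room for one million 16-bit values

def set_short(i, v):
    buf[2 * i:2 * i + 2] = v.to_bytes(2, "little", signed=True)

def get_short(i):
    return int.from_bytes(buf[2 * i:2 * i + 2], "little", signed=True)

set_short(42, -1234)
print(get_short(42))               # -1234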