Python: size of strings in memory

Python: size of strings in memory - python

Consider the following code:
arr = []
for (str, id, flag) in some_data:
arr.append((str, id, flag))
Imagine the input strings being 2 chars long in average and 5 chars max and some_data having 1 million elements.
What will the memory requirement of such a structure be?
May it be that a lot of memory is wasted for the strings? If so, how can I avoid that?

In this case, because the strings are quite short, and there are so many of them, you stand to save a fair bit of memory by using intern on the strings. Assuming there are only lowercase letters in the strings, that's 26 * 26 = 676 possible strings, so there must be a lot of repetitions in this list; intern will ensure that those repetitions don't result in unique objects, but all refer to the same base object.
It's possible that Python already interns short strings; but looking at a number of different sources, it seems this is highly implementation-dependent. So calling intern in this case is probably the way to go; YMMV.
As an elaboration on why this is very likely to save memory, consider the following:
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43
Adding single characters to a string adds only a byte to the size of the string itself, but every string takes up 40 bytes on its own.

In recent Python 3 (64-bit) versions, string instances take up 49+ bytes. But also keep in mind that if you use non-ASCII characters, the memory usage jumps up even more:
>>> sys.getsizeof('t')
50
>>> sys.getsizeof('я')
76
Notice how even if one character in a string is non-ASCII, all other characters will take up more space (2 or 4 bytes each):
>>> sys.getsizeof('t12345')
55 # +5 bytes, compared to 't'
>>> sys.getsizeof('я12345')
86 # +10 bytes, compared to 'я'
This has to do with the internal representation of strings since Python 3.3. See PEP 393 -- Flexible String Representation for more details.
Python, in general, is not very memory efficient, when it comes to having lots of small objects, not just for strings. See these examples:
>>> sys.getsizeof(1)
28
>>> sys.getsizeof(True)
28
>>> sys.getsizeof([])
56
>>> sys.getsizeof(dict())
232
>>> sys.getsizeof((1,1))
56
>>> sys.getsizeof([1,1])
72
Internalizing strings could help, but make sure you don't have too many unique values, as that could do more harm than good.
It's hard to tell how to optimize your specific case, as there is no single universal solution. You could save up a lot of memory if you somehow serialize data from multiple items into a single byte buffer, for example, but then that could complicate your code or affect performance too much. In many cases it won't be worth it, but if I were in a situation where I really needed to optimize memory usage, I would also consider writing that part in a language like Rust (it's not too hard to create a native Python module via PyO3 for example).

If your strings are so short, it is likely there will be a significant number of duplicates. Python interning will optimise it so that these strings are stored only once and the reference used multiple tiems, rather than storing the string multiple times...
These strings should be automatically interned as there are.

Related

What takes more memory, an 8 char string or 8 digit int?

I'm developing a program that will deal with approx. 90 billion records, so I need to manage memory carefully. Which is larger in memory: 8 char string or 8 digit int?
Details:
-Python 3.7.4
-64 bits
Edit1:
following the advice of user8080blablabla I got:
sys.getsizeof(99999999)
28
sys.getsizeof("99999999")
57
seriously? a 8 char string is 57 bytes long?!?

An int will generally take less memory than its representation as a string, because it is more compact. However, because Python int values are objects, they still take quite a lot of space each compared to primitive values in other languages: the integer object 1 takes up 28 bytes of memory on my machine.
>>> import sys
>>> sys.getsizeof(1)
28
If minimising memory use is your priority, and there is a maximum range the integers can be in, consider using the array module. It can store numeric data (or Unicode characters) in an array, in a primitive data type of your choice, so that each value isn't an object taking up 28+ bytes.
>>> from array import array
>>> arr = array('I') # unsigned int in C
>>> arr.extend(range(10000))
>>> arr.itemsize
4
>>> sys.getsizeof(arr)
40404
The actual number of bytes used per item is dependent on the machine architecture. On my machine, each number takes 4 bytes; there are 404 bytes of overhead for an array of length 10,000. Check arr.itemsize on your machine to see if you need a different primitive type; fewer than 4 bytes is not enough for an 8-digit number.
That said, you should not be trying to fit 90 billion numbers in memory, at 4 bytes each; this would take 360GB of memory. Look for a solution which doesn't require holding every record in memory at once.

You ought to remember that strings are represented as Unicodes in Python, therefore storing a digit in a string can take an upwards of 4-bytes per character to store, which is why you see such a large discrepancy between int and str (interesting read on the topic).
If you are worried about memory allocation I would instead recommend using pandas to manage the backend for you when it comes to manipulating large datasets.

Why are Unicode strings having a different memory footprint in Python 2 and 3? [duplicate]

This question already has an answer here:
How is unicode represented internally in Python?
(1 answer)
Closed 4 years ago.
In Python 2, an empty string occupy exactly 37 bytes,
>>>> print sys.getsizeof('')
37
In Python 3.6, the same call would output 49 bytes,
>>>> print(sys.getsizeof(''))
49
Now I thought that this was due to the fact that in Python 3, all strings are now Unicode. But, to my surprise here are some confusing outputs,
Python 2.7
>>>> print sys.getsizeof(u'')
52
>>>> print sys.getsizeof(u'1')
56
Python 3.6
>>>>print(sys.getsizeof(''))
49
>>>>print(sys.getsizeof('1'))
50
An empty string is not the same size.
4 additional bytes are needed when adding a character in Python 2 and only one for Python 3
Why is the memory footprint different between the two versions ?
EDIT
I specified the exact version of my python environment, because between different Python 3 builds there are differences.

There are reasons of course, but really, it should not matter for any practical purposes. If you have a Python system in which you have to keep so many strings in memory as to get close to the system memory, you should optimize it by (1) trying to lazily load/create strings in memory or (2) Use a byte-oriented efficient binary structure to deal with your data, such as those provided by Numpy, or Python's own bytearray.
The change for the empty string literal (unicode literal fro Py2) could bedue to any implementation details between the versions you are looking at, which should not matter even if were writting C code to interact directly with Python strings: even those should only touch the strings via the API.
Now, the specific reason for why the string in Python 3 just increases its size by "1" byte, while in Python 2 it increases the size by 4 bytes is PEP 393.
Before Python 3.3, any (unicode) string in Python would use either fixed 2 bytes or fixed 4 bytes of memory for each character - and the Python interpreter and Python modules using native code would have to be compiled to use just one of these kinds. I.E. you efectively could have incompatible Python binaries, even if the versions matched, due to the string-width optoin picked up at build time - the builds were known as "narrow build" and "wide build". With the above mentioned PEP 391, Python strings have their character sized determined when they are instantiate, depending on the size of the widest unicode codepoint it contains. Strings that contain points that are contained in the first 256 codepoints (equivalent to the Latin-1 character set) use only 1 byte per character.

Internally, Python 3 now stores strings in four different encodings, and chooses a different encoding for each string. These encodings are ASCII, LATIN-1, UCS-2, and UTF-32. Each one is capable of representing a different subset of Unicode characters and has the useful property that the element at index i is also the Unicode code point at index i.
In [1]: import sys
In [2]: sys.getsizeof('\xFF')
Out[2]: 74
In [3]: sys.getsizeof('X\xFF')
Out[3]: 75
In [4]: sys.getsizeof('\u0100')
Out[4]: 76
In [5]: sys.getsizeof('X\u0100')
Out[5]: 78
In [6]: sys.getsizeof('\U00010000')
Out[6]: 80
In [7]: sys.getsizeof('X\U00010000')
Out[7]: 84
You can see that adding an additional character, in this case 'X', to a string causes that string to take up additional space depending on the values contained in the rest of the string.
This system was proposed in PEP-0393 and implemented in Python 3.3. Earlier versions of Python use the older unicode representation, which always used 2 or 4 bytes per... element (I hesitate to say "character"), depending on compile-time options and they could not be mixed.

How much memory does this byte string actually take up?

My understanding is that os.urandom(size) outputs a random string of bytes of the given "size", but then:
import os
import sys
print(sys.getsizeof(os.urandom(42)))
>>>
75
Why is this not 42?
And a related question:
import base64
import binascii
print(sys.getsizeof(base64.b64encode(os.urandom(42))))
print(sys.getsizeof(binascii.hexlify(os.urandom(42))))
>>>
89
117
Why are these so different? Which encoding would be the most memory efficient way to store a string of bytes such as that given by os.urandom?
Edit: It seems like quite a stretch to say that this question is a duplicate of What is the difference between len() and sys.getsizeof() methods in python? My question is not about the difference between len() and getsizeof(). I was confused about the memory used by Python objects in general, which the answer to this question has clarified for me.

Python byte string objects are more than just the characters that comprise them. They are fully fledged objects. As such they require more space to accommodate the object's components such as the type pointer (needed to identify what kind of object the bytestring even is) and the length (needed for efficiency and because Python bytestrings can contain null bytes).
The simplest object, an object instance, requires space:
>>> sys.getsizeof(object())
16
The second part of your question is simply because the strings produced by b64encode() and hexlify() have different lengths; the latter being 28 characters longer which, unsurprisingly, is the difference in the values reported by sys.getsizeof().
>>> s1 = base64.b64encode(os.urandom(42))
>>> s1
b'CtlMjDM9q7zp+pGogQci8gr0igJsyZVjSP4oWmMj2A8diawJctV/8sTa'
>>> s2 = binascii.hexlify(os.urandom(42))
>>> s2
b'c82d35f717507d6f5ffc5eda1ee1bfd50a62689c08ba12055a5c39f95b93292ddf4544751fbc79564345'
>>> len(s2) - len(s1)
28
>>> sys.getsizeof(s2) - sys.getsizeof(s1)
28
Unless you use some form of compression, there is no encoding that will be more efficient than the binary string that you already have, and this is particularly true in this case because the data is random which is inherently incompressible.

What does python sys getsizeof for string return?

What does sys.getsizeof return for a standard string? I am noticing that this value is much higher than what len returns.

I will attempt to answer your question from a broader point of view. You're referring to two functions and comparing their outputs. Let's take a look at their documentation first:
len():
Return the length (the number of items) of an object. The argument may
be a sequence (such as a string, bytes, tuple, list, or range) or a
collection (such as a dictionary, set, or frozen set).
So in case of string, you can expect len() to return the number of characters.
sys.getsizeof():
Return the size of an object in bytes. The object can be any type of
object. All built-in objects will return correct results, but this
does not have to hold true for third-party extensions as it is
implementation specific.
So in case of string (as with many other objects) you can expect sys.getsizeof() the size of the object in bytes. There is no reason to think that it should be the same as the number of characters.
Let's have a look at some examples:
>>> first = "First"
>>> len(first)
5
>>> sys.getsizeof(first)
42
This example confirms that the size is not the same as the number of characters.
>>> second = "Second"
>>> len(second)
6
>>> sys.getsizeof(second)
43
We can notice that if we look at a string one character longer, its size is one byte bigger as well. We don't know if it's a coincidence or not though.
>>> together = first + second
>>> print(together)
FirstSecond
>>> len(together)
11
If we concatenate the two strings, their combined length is equal to the sum of their lengths, which makes sense.
>>> sys.getsizeof(together)
48
Contrary to what someone might expect though, the size of the combined string is not equal to the sum of their individual sizes. But it still seems to be the length plus something. In particular, something worth 37 bytes. Now you need to realize that it's 37 bytes in this particular case, using this particular Python implementation etc. You should not rely on that at all. Still, we can take a look why it's 37 bytes what they are (approximately) used for.
String objects are in CPython (probably the most widely used implementation of Python) implemented as PyStringObject. This is the C source code (I use the 2.7.9 version):
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
* ob_sstate != 0 iff the string object is in stringobject.c's
* 'interned' dictionary; in this case the two references
* from 'interned' to this object are *not counted* in ob_refcnt.
*/
} PyStringObject;
You can see that there is something called PyObject_VAR_HEAD, one int, one long and a char array. The char array will always contain one more character to store the '\0' at the end of the string. This, along with the int, long and PyObject_VAR_HEAD take the additional 37 bytes. PyObject_VAR_HEAD is defined in another C source file and it refers to other implementation-specific stuff, you need to explore if you want to find out where exactly are the 37 bytes. Plus, the documentation mentions that sys.getsizeof()
adds an additional garbage collector overhead if the object is managed
by the garbage collector.
Overall, you don't need to know what exactly takes the something (the 37 bytes here) but this answer should give you a certain idea why the numbers differ and where to find more information should you really need it.

To quote the documentation:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Built in strings are not simple character sequences - they are full fledged objects, with garbage collection overhead, which probably explains the size discrepancy you're noticing.

A RAM error of big array

I need to get the numbers of one line randomly, and put each line in other array, then get the numbers of one col.
I have a big file, more than 400M. In that file, there are 13496*13496 number, means 13496 rows and 13496 cols. I want to read them to a array.
This is my code:
_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
while (i<13496):
print "i="+str(i)
_strlf = _L1file.readline()
_strlf = _strlf.split('\t')
_strlf = _strlf[:-1]
_L1[i] = _strlf
i += 1
_L1file.close()
And this is my error message:
MemoryError:
File "D:\research\space-function\ART3.py", line 30, in <module>
_strlf = _strlf.split('\t')

you might want to approach your problem in another way. Process the file line by line. I don't see a need to store the whole big file into array. Otherwise, you might want to tell us what you are actually trying to do.
for line in open("400MB_file"):
# do something with line.
Or
f=open("file")
for linenum,line in enumerate(f):
if linenum+1 in [2,3,10]:
print "there are ", len(line.split())," columns" #assuming you want to split on spaces
print "100th column value is: ", line.split()[99]
if linenum+1>10:
break # break if you want to stop after the 10th line
f.close()

This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numerics, for example). I'm not even taking your particular runtimes' array metadata into account, though this would typically be a tiny overhead on a simple array.
Assuming each array element is just a single byte, your computer needs around 180MB of RAM to hold it in memory in its' entirety. Trying to process it could be impractical.
You need to think about the problem a different way; as has already been mentioned, a line-by-line approach might be a better option. Or perhaps processing the grid in smaller units, perhaps 10x10 or 100x100, and aggregating the results. Or maybe the problem itself can be expressed in a different form, which avoids the need to process the entire dataset altogether...?
If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.

Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 Gb of overhead for the size of array you describe.
On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large: even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 Gb. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes (I don't know how long your strings are) per entry.
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each.
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container) then the returned size doesn't include the size of the contained list objects; only of the structure used to hold those objects. Here are some values on my machine.
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10)) # 72 + 8 bytes for each pointer
152

MemoryError exception:
Raised when an operation runs out of
memory but the situation may still be
rescued (by deleting some objects).
The associated value is a string
indicating what kind of (internal)
operation ran out of memory. Note that
because of the underlying memory
management architecture (C’s malloc()
function), the interpreter may not
always be able to completely recover
from this situation; it nevertheless
raises an exception so that a stack
traceback can be printed, in case a
run-away program was the cause.
It seems that, at least in your case, reading the entire file into memory is not a doable option.

Replace this:
_strlf = _strlf[:-1]
with this:
_strlf = [float(val) for val in _strlf[:-1]]
You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.
You might want to include some handling for null values.
You can also go to numpy's array type, but your problem may be too small to bother.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.