Two Python strings with the same characters, a == b,
may share memory, id(a) == id(b),
or may be in memory twice, id(a) != id(b).
Try
ab = "ab"
print id( ab ), id( "a"+"b" )
Here Python recognizes that the newly created "a"+"b" is the same
as the "ab" already in memory -- not bad.
Now consider an N-long list of state names
[ "Arizona", "Alaska", "Alaska", "California" ... ]
(N ~ 500000 in my case).
I see 50 different id() s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again:
the "same" list now has N different id() s, way more memory, see below.
How come -- can anyone explain Python string memory allocation ?
""" when does Python allocate new memory for identical strings ?
ab = "ab"
print id( ab ), id( "a"+"b" ) # same !
list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
but list > file > mem again: N ids, mem ~ N * (4 + S)
"""
from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys
states = dict(
AL = "Alabama",
AK = "Alaska",
AZ = "Arizona",
AR = "Arkansas",
CA = "California",
CO = "Colorado",
CT = "Connecticut",
DE = "Delaware",
FL = "Florida",
GA = "Georgia",
)
def nid(alist):
""" nr distinct ids """
return "%d ids %d pickle len" % (
len( set( map( id, alist ))),
len( cPickle.dumps( alist, 0 ))) # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents
N = 10000
exec( "\n".join( sys.argv[1:] )) # var=val ...
random.seed(1)
# big list of random names of states --
names = []
for j in xrange(N):
name = copy( random.choice( states.values() ))
names.append(name)
print "%d strings in mem: %s" % (N, nid(names) ) # 10 ids, even with copy()
# list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split() # same as > file > mem again
assert joinsplit == names
print "%d strings from a file: %s" % (N, nid(joinsplit) )
# 10000 strings in mem: 10 ids 42149 pickle len
# 10000 strings from a file: 10000 ids 188080 pickle len
# Python 2.6.4 mac ppc
Added 25jan:
There are two kinds of strings in Python memory (or any program's):
Ustrings, in a Ucache of unique strings: these save memory, and make a == b fast if both are in Ucache
Ostrings, the others, which may be stored any number of times.
intern(astring) puts astring in the Ucache (Alex +1);
other than that we know nothing at all about how Python moves Ostrings to the Ucache --
how did "a"+"b" get in, after "ab" ?
("Strings from files" is meaningless -- there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.
A historical footnote: SPITBOL uniquified all strings ca. 1970.
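The question's Python 2 script can be reproduced in modern Python. A minimal Python 3 sketch (state names shortened to two; the counts in the comments reflect typical CPython behaviour), showing that a join/split round trip reallocates every string, and that sys.intern restores sharing:

```python
import sys

names = ["Alaska", "Arizona"] * 5000        # literal strings: shared, 2 ids
print(len({id(s) for s in names}))          # 2

roundtrip = "\n".join(names).split("\n")    # simulates list > file > mem again
print(len({id(s) for s in roundtrip}))      # 10000: every string reallocated

interned = [sys.intern(s) for s in roundtrip]   # an explicit "Ucache"
print(len({id(s) for s in interned}))       # 2 again
```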
Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, is just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: one more reference to a suitable existing object when locating such an object is cheap and easy, but just a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.
So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
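A roll-your-own "constants-pool" of the kind described above can be sketched in a few lines (the class and method names here are made up for illustration):

```python
class ConstantPool:
    """Keep one canonical copy of each hashable immutable value."""
    def __init__(self):
        self._cache = {}

    def intern(self, obj):
        # setdefault returns the already-stored equal object, if any
        return self._cache.setdefault(obj, obj)

pool = ConstantPool()
key1 = ("ab", int("12345"))     # built at run time: two distinct tuples
key2 = ("ab", int("12345"))
a = pool.intern(key1)
b = pool.intern(key2)
print(a is key1, b is key1)     # True True: second call returned the first tuple
```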
I strongly suspect that Python is behaving like many other languages here - recognising string constants within your source code and using a common table for those, but not applying the same rules when creating strings dynamically. This makes sense as there will only be a finite set of strings within your source code (although Python lets you evaluate code dynamically, of course) whereas it's much more likely that you'll be creating huge numbers of strings in the course of your program.
This process is generally called interning - and indeed by the looks of this page it's called interning in Python, too.
A side note: it is very important to know the lifetime of objects in Python. Note the following session:
Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False
Your thinking that by printing the IDs of two separate expressions and noting “they are equal ergo the two expressions must be equal/equivalent/the same” is faulty. A single line of output does not necessarily imply all of its contents were created and/or co-existed at the same single moment in time.
If you want to know if two objects are the same object, ask Python directly (using the is operator).
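The point about lifetimes can be made explicit: the two temporaries' addresses may coincide only because the first is freed before the second is allocated (the id equality is typical CPython behaviour, not guaranteed):

```python
a, b = "a", "b"
id1 = id(a + b)            # the temporary "ab" is freed as soon as id() returns
id2 = id(b + a)            # the new temporary "ba" may reuse the same address
print(id1 == id2)          # often True in CPython, yet the objects never coexisted
print((a + b) is (b + a))  # False: with both alive at once, they must differ
```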
x = 42
y = 42
x == y #True
x is y #True
In this interaction, x and y should be == (same value) but not is (same object), because we ran two different literal expressions. Because small integers and strings are cached and reused, though, is tells us they reference the same single object.
In fact, if you really want to look under the hood, you can always ask Python how many references there are to an object: the getrefcount function in the standard sys module returns the object's reference count. This behavior reflects one of the many ways Python optimizes its model for execution speed.
Learning Python
I found a good article to explain the intern behavior of CPython:
http://guilload.com/python-string-interning/
In short:
A string object in CPython has a flag that indicates whether it is interned.
Interned strings are stored in an ordinary dictionary whose keys and values are pointers to the string objects; this applies to the string class only.
Interning helps Python reduce memory consumption, because identical strings can refer to the same memory address, and it speeds up comparisons, because interned strings can be compared by pointer alone.
Python interns strings during compilation, which means only literal strings (or strings that can be computed at compile time, like 'hello' + 'world') are interned.
For your question: only strings of length 0 or 1, or strings containing only ASCII letters, digits, and underscores (a-z, A-Z, 0-9, _), are interned.
Interning only makes sense because Python strings are immutable; otherwise it would not work.
It is a really good article; I strongly suggest visiting his site and checking out the other ones -- worth our time.
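The compile-time rule summarized above can be observed directly: 'hello' + 'world' is folded to a constant and interned, while the same concatenation done at run time is not:

```python
a = "hello" + "world"   # folded at compile time -> interned constant
b = "helloworld"
print(a is b)           # True

prefix = "hello"
c = prefix + "world"    # computed at run time -> fresh, non-interned object
print(c == b, c is b)   # True False
```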
See this link: https://docs.python.org/3/c-api/long.html#c.PyLong_FromLong
The current implementation keeps an array of integer objects for all integers between -5 and 256; when you create an int in that range, you actually just get back a reference to the existing object. So, it should be possible to change the value of 1. I suspect the behavior of Python, in this case, is undefined. :-)
What do the bold lines mean in this context?
It means that integers in Python are actual objects with a "value"-field to hold the integer's value. In Java, you could express Python's integers like so (leaving out a lot of details, of course):
class PyInteger {
private int value;
public PyInteger(int val) {
this.value = val;
}
public PyInteger __add__(PyInteger other) {
return new PyInteger(this.value + other.value);
}
}
In order to not have hundreds of Python integers with the same value around, it caches some integers, along the lines of:
PyInteger[] cache = {
new PyInteger(0),
new PyInteger(1),
new PyInteger(2),
...
}
However, what would happen if you did something like this (let's ignore that value is private for a moment):
PyInteger one = cache[1]; // the PyInteger representing 1
one.value = 3;
Suddenly, every time you used 1 in your program, you would actually get back 3, because the object representing 1 has an effective value of 3.
Indeed, you can do that in Python! That is: it is possible to change the effective numeric value of an integer in Python. There is an answer in this reddit post. I copy it here for completeness, though (original credits go to Veedrac):
import ctypes
def deref(addr, typ):
return ctypes.cast(addr, ctypes.POINTER(typ))
deref(id(29), ctypes.c_int)[6] = 100
#>>>
29
#>>> 100
29 ** 0.5
#>>> 10.0
The Python specification itself does not say anything about how integers are to be stored or represented internally. It also does not say which integers should be cached, or that any should be cached at all. In short: there is nothing in the Python specifications defining what should happen if you do something silly like this ;-).
We could even go slightly further...
In reality, the field value above is actually an array of integers, emulating an arbitrary large integer value (for a 64-bit integer, you just combine two 32-bit fields, etc). However, when integers start to get large and outgrow a standard 32-bit integer, caching is no longer a viable option. Even if you used a dictionary, comparing integer arrays for equality would be too much of an overhead with too little gain.
You can actually check this yourself by using is to compare identities:
>>> 3 * 4 is 12
True
>>> 300 * 400 is 120000
False
>>> 300 * 400 == 120000
True
In a typical Python system, there is exactly one object representing the number 12. 120000, on the other hand, is hardly ever cached. So, above, 300 * 400 yields a new object representing 120000, which is different from the object created for the number on the right hand side.
Why is this relevant? If you change the value of a small number like 1 or 29, it will affect all calculations that use that number. You will most likely seriously break your system (until you restart). But if you change the value of a large integer, the effects will be minimal.
Changing the value of 12 to 13 means that 3 * 4 will yield 13. Changing the value of 120000 to 130000 has much less effect: 300 * 400 will still yield (a new) 120000 and not 130000.
As soon as you take other Python implementations into the picture, things can get even harder to predict. MicroPython, for instance, does not have objects for small numbers, but emulates them on the fly, and PyPy might well just optimise your changes away.
Bottom line: the exact behaviour of numbers that you tinker with is truly undefined; it depends on several factors and the exact implementation.
Answer to a question in the comments: What is the significance of 6 in Veedrac's code above?
All objects in Python share a common memory layout. The first field is a reference counter that tells you how many other objects are currently referring to this object. The second field is a reference to the object's class or type. Since integers do not have a fixed size, the third field is the size of the data part (you can find the relevant definitions here (general objects) and here (integers/longs)):
struct longObject {
native_int ref_counter; // offset: +0 / +0
PyObject* type; // offset: +1 / +2
native_int size; // offset: +2 / +4
unsigned short value[]; // offset: +3 / +6
}
On a 32-bit system, native_int and PyObject* both occupy 32 bits, whereas on a 64-bit system they occupy 64 bits, naturally. So, if we access the data as 32 bits (using ctypes.c_int) on a 64-bit system, the actual value of the integer is to be found at offset +6. If you change the type to ctypes.c_long, on the other hand, the offset is +3.
Because id(x) in CPython returns the memory address of x, you can actually check this yourself. Based on the above deref function, let's do:
>>> deref(id(29), ctypes.c_long)[3]
29
>>> deref(id(29), ctypes.c_long)[1]
10277248
>>> id(int) # memory address of class "int"
10277248
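On a modern 64-bit CPython 3 the same layout check can be done with a pointer-sized type (this only reads memory, so it is safe to run; the offsets assume the standard CPython object header of refcount followed by type pointer):

```python
import ctypes

def deref(addr, typ):
    return ctypes.cast(addr, ctypes.POINTER(typ))

n = 1000001                              # a non-cached integer, safe to inspect
words = deref(id(n), ctypes.c_ssize_t)   # view the object as machine words
print(words[1] == id(int))               # word 0: refcount, word 1: type pointer
```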
Since the object is returned by reference, if you change the object it will change for everything in the program.
So taking the value 1 as an example, you could change it to 42.
This is only possible because the C API gives you internal access to the Python interpreter; it feels unlikely you could do this from within a Python script itself (without using something like cffi for example).
Another way to think about what would happen if you “changed the value at 1’s address to 17” in the internals is just printing each element in range(3) -- you would see 0, 17, 2.
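The small-integer cache mentioned in the linked documentation is easy to observe, comparing with is on values computed at run time (literal-vs-literal is comparisons now raise a SyntaxWarning in CPython):

```python
small_a, small_b = 256, int("256")   # int() builds the value at run time
big_a, big_b = 257, int("257")

print(small_a is small_b)   # True: -5..256 come from the shared cache
print(big_a == big_b)       # True: equal values...
print(big_a is big_b)       # ...but typically False in CPython: separate objects
```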
I have an object which needs to be "tagged" with 0-3 strings (out of a set of 20-some possibilities); these values are all unique and order doesn't matter. The only operation that needs to be done on the tags is checking if a particular one is present or not (specific_value in self.tags).
However, there's an enormous number of these objects in memory at once, to the point that it pushes the limits of my old computer's RAM. So saving a few bytes can add up.
With so few tags on each object, I doubt the lookup time is going to matter much. But: is there a memory difference between using a tuple and a frozenset here? Is there any other real reason to use one over the other?
Tuples are very compact. Sets are based on hash tables, and depend on having "empty" slots to make hash collisions less likely.
For a recent enough version of CPython, sys._debugmallocstats() displays lots of potentially interesting info. Here under a 64-bit Python 3.7.3:
>>> from sys import _debugmallocstats as d
>>> tups = [tuple("abc") for i in range(1000000)]
tuple("abc") creates a tuple of 3 1-character strings, ('a', 'b', 'c'). Here I'll edit out almost all the output:
>>> d()
Small block threshold = 512, in 64 size classes.
class size num pools blocks in use avail blocks
----- ---- --------- ------------- ------------
...
8 72 17941 1004692 4
Since we created a million tuples, it's a very good bet that the size class using 1004692 blocks is the one we want ;-) Each of the blocks consumes 72 bytes.
Switching to frozensets instead, the output shows that those consume 224 bytes each, a bit over 3x more:
>>> tups = [frozenset(t) for t in tups]
>>> d()
Small block threshold = 512, in 64 size classes.
class size num pools blocks in use avail blocks
----- ---- --------- ------------- ------------
...
27 224 55561 1000092 6
In this particular case, the other answer you got happens to give the same results:
>>> import sys
>>> sys.getsizeof(tuple("abc"))
72
>>> sys.getsizeof(frozenset(tuple("abc")))
224
While that's often true, it's not always so, because an object may require allocating more bytes than it actually needs, to satisfy HW alignment requirements. getsizeof() doesn't know anything about that, but _debugmallocstats() shows the number of bytes Python's small-object allocator actually needs to use.
For example,
>>> sys.getsizeof("a")
50
On a 32-bit box, 52 bytes actually need to be used, to provide 4-byte alignment. On a 64-bit box, 8-byte alignment is currently required, so 56 bytes need to be used. Under Python 3.8 (not yet released), on a 64-bit box 16-byte alignment is required, and 64 bytes will need to be used.
But ignoring all that, a tuple will always need less memory than any form of set with the same number of elements - and even less than a list with the same number of elements.
sys.getsizeof seems like the stdlib option you want... but I feel queasy about your whole use case
import sys
t = ("foo", "bar", "baz")
f = frozenset(("foo","bar","baz"))
print(sys.getsizeof(t))
print(sys.getsizeof(f))
https://docs.python.org/3.7/library/sys.html#sys.getsizeof
All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
...So don't get comfy with this solution
EDIT: Obviously @TimPeters' answer is more correct...
If you're trying to save memory, consider
Trading off some elegance for some memory savings by extracting the data structure of which tags are present into an external (singleton) data structure
Using a "flags" (bitmap) type approach, where each tag is mapped to a bit of a 32-bit integer. Then all you need is a (singleton) dict mapping from the object (identity) to a 32-bit integer (flags). If no flags are present, no entry in the dictionary.
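A minimal sketch of the bitmap approach (the tag vocabulary and helper names are made up for illustration; since the side table is keyed on id(obj), entries must be removed when objects die, which is not handled here):

```python
# Hypothetical tag vocabulary: each tag owns one bit of an integer.
TAG_BITS = {tag: 1 << i for i, tag in enumerate(
    ["fragile", "urgent", "archived", "verified"])}

tag_flags = {}   # singleton side table: id(obj) -> int bitmap

def add_tag(obj, tag):
    tag_flags[id(obj)] = tag_flags.get(id(obj), 0) | TAG_BITS[tag]

def has_tag(obj, tag):
    return bool(tag_flags.get(id(obj), 0) & TAG_BITS[tag])

class Item:
    pass

item = Item()
add_tag(item, "urgent")
print(has_tag(item, "urgent"), has_tag(item, "fragile"))   # True False
```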
There is a possibility to reduce memory if you replace the tuple with a type from the recordclass library:
>>> from recordclass import make_arrayclass
>>> Triple = make_arrayclass("Triple", 3)
>>> from sys import getsizeof as sizeof
>>> sizeof(Triple("ab","cd","ef"))
40
>>> sizeof(("ab","cd","ef"))
64
The difference is equal to the sizeof(PyGC_Head) + sizeof(Py_ssize_t).
P.S.: The numbers are measured on 64-bit Python 3.8.
I was surprised that sys.getsizeof( 10000*[x] )
is 40036 regardless of x: 0, "a", 1000*"a", {}.
Is there a deep_getsizeof
which properly considers elements that share memory ?
(The question came from looking at in-memory database tables like
range(1000000) -> province names: list or dict ?)
(Python is 2.6.4 on a mac ppc.)
Added:
10000*["Mississippi"] is 10000 pointers to one "Mississippi",
as several people have pointed out. Try this:
nstates = [AlabamatoWyoming() for j in xrange(N)]
where AlabamatoWyoming() -> a string "Alabama" .. "Wyoming".
What's deep_getsizeof(nstates) ?
(How can we tell ?
a proper deep_getsizeof: difficult, ~ gc tracer
estimate from total vm
inside knowledge of the python implementation
guess.
Added 25jan:
see also when-does-python-allocate-new-memory-for-identical-strings
10000 * [x] will produce a list of 10000 references to the same object, so the sizeof is actually closer to correct than you think. However, a deep sizeof is very problematic because it's impossible to tell Python when you want to stop the measurement. Every object references a typeobject. Should the typeobject be counted? What if the reference to the typeobject is the last one, so that deleting the object would make the typeobject go away as well? What about multiple (different) objects in the list referring to the same string object? Should it be counted once, or multiple times?
In short, getting the size of a data structure is very complicated, and sys.getsizeof() should never have been added :S
Have a look at guppy/heapy; I haven't played around with it too much myself, but a few of my co-workers have used it for memory profiling with good results.
The documentation could be better, but this howto does a decent job of explaining the basic concepts.
If your list only holds objects of the same length, you could get a more accurate estimate by doing this:
def getSize(array):
    return sys.getsizeof(array) + len(array) * sys.getsizeof(array[0])
Obviously it's not going to work as well for strings of variable length.
If you only want to calculate the size for debugging or during development and you don't care about performance, you can iterate over all items recursively and calculate the total size. Note that this solution is not going to handle multiple references to the same object correctly.
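A quick-and-dirty recursive version can even handle shared references by remembering the ids it has already counted. This is a development-time estimate only: it ignores container types it does not know about, and it keeps every visited object's id alive only for the duration of the call:

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Rough recursive size estimate; each distinct object is counted once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

shared = 10000 * ["Mississippi"]           # 10000 pointers to one string
print(deep_getsizeof(shared) ==
      sys.getsizeof(shared) + sys.getsizeof("Mississippi"))   # True
```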
I wrote a tool called RememberMe exactly for this. Basic usage:
from rememberme import memory
a = [1, 2, 3]
b = [a, a, a]
print(memory(a)) # 172 bytes
print(memory(b)) # 260 bytes. Duplication counted only once.
Hope it helps.
mylist = 10000 * [x] means create a list of size 10000 with 10000 references to object x.
Object x is not copied - only a single one exists in memory!!!
So to use getsizeof, it would be: sys.getsizeof(mylist) + sys.getsizeof(x)
For example, how much memory is required to store a list of one million (32-bit) integers?
alist = range(1000000) # or list(range(1000000)) in Python 3.0
"It depends." Python allocates space for lists in such a way as to achieve amortized constant time for appending elements to the list.
In practice, what this means with the current implementation is... the list always has space allocated for a power-of-two number of elements. So range(1000000) will actually allocate a list big enough to hold 2^20 elements (~ 1.045 million).
This is only the space required to store the list structure itself (which is an array of pointers to the Python objects for each element). A 32-bit system will require 4 bytes per element, a 64-bit system will use 8 bytes per element.
Furthermore, you need space to store the actual elements. This varies widely. For small integers (-5 to 256 currently), no additional space is needed, but for larger numbers Python allocates a new object for each integer, which takes 10-100 bytes and tends to fragment memory.
Bottom line: it's complicated and Python lists are not a good way to store large homogeneous data structures. For that, use the array module or, if you need to do vectorized math, use NumPy.
PS- Tuples, unlike lists, are not designed to have elements progressively appended to them. I don't know how the allocator works, but don't even think about using it for large data structures :-)
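The list-vs-array difference is easy to see with the standard array module. These are shallow sizes only; the list's size excludes the int objects its pointers refer to, which makes the real gap even larger:

```python
import sys
from array import array

n = 1_000_000
as_list = list(range(n))          # n pointers to int objects
as_array = array("i", range(n))   # n packed 4-byte C ints

print(sys.getsizeof(as_list) > sys.getsizeof(as_array))   # True on 64-bit builds
```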
Useful links:
How to get memory size/usage of python object
Memory sizes of python objects?
if you put data into dictionary, how do we calculate the data size?
However they don't give a definitive answer. The way to go:
Measure memory consumed by Python interpreter with/without the list (use OS tools).
Use a third-party extension module which defines some sort of sizeof(PyObject).
Update:
Recipe 546530: Size of Python objects (revised)
import asizeof
N = 1000000
print asizeof.asizeof(range(N)) / N
# -> 20 (python 2.5, WinXP, 32-bit Linux)
# -> 33 (64-bit Linux)
Addressing "tuple" part of the question
Declaration of CPython's PyTuple in a typical build configuration boils down to this:
struct PyTuple {
size_t refcount; // tuple's reference count
typeobject *type; // tuple type object
size_t n_items; // number of items in tuple
PyObject *items[1]; // contains space for n_items elements
};
The size of a PyTuple instance is fixed during its construction and cannot be changed afterwards. The number of bytes occupied by a PyTuple can be calculated as
sizeof(size_t) x 2 + sizeof(void*) x (n_items + 1).
This gives shallow size of tuple. To get full size you also need to add total number of bytes consumed by object graph rooted in PyTuple::items[] array.
It's worth noting that tuple construction routines make sure that only single instance of empty tuple is ever created (singleton).
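Both observations are easy to confirm from Python itself. The per-item cost should be one pointer (4 bytes on 32-bit builds, 8 on 64-bit), and the empty tuple is a shared singleton:

```python
import sys

per_item = sys.getsizeof((1, 2)) - sys.getsizeof((1,))
print(per_item in (4, 8))          # True: one pointer per extra item

a = ()
b = tuple([])
print(a is b)                      # True: CPython reuses the empty-tuple instance
```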
References:
Python.h,
object.h,
tupleobject.h,
tupleobject.c
A new function, getsizeof(), takes a
Python object and returns the amount
of memory used by the object, measured
in bytes. Built-in objects return
correct results; third-party
extensions may not, but can define a
__sizeof__() method to return the object’s size.
kveretennicov#nosignal:~/py/r26rc2$ ./python
Python 2.6rc2 (r26rc2:66712, Sep 2 2008, 13:11:55)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
>>> import sys
>>> sys.getsizeof(range(1000000))
4000032
>>> sys.getsizeof(tuple(range(1000000)))
4000024
Obviously returned numbers don't include memory consumed by contained objects (sys.getsizeof(1) == 12).
This is implementation specific, I'm pretty sure. Certainly it depends on the internal representation of integers - you can't assume they'll be stored as 32-bit since Python gives you arbitrarily large integers so perhaps small ints are stored more compactly.
On my Python (2.5.1 on Fedora 9 on core 2 duo) the VmSize before allocation is 6896kB, after is 22684kB. After one more million element assignment, VmSize goes to 38340kB. This very grossly indicates around 16000kB for 1000000 integers, which is around 16 bytes per integer. That suggests a lot of overhead for the list. I'd take these numbers with a large pinch of salt.