Python deep getsizeof list with contents?

I was surprised that sys.getsizeof( 10000*[x] )
is 40036 regardless of x: 0, "a", 1000*"a", {}.
Is there a deep_getsizeof
which properly considers elements that share memory ?
(The question came from looking at in-memory database tables like
range(1000000) -> province names: list or dict ?)
(Python is 2.6.4 on a mac ppc.)
Added:
10000*["Mississippi"] is 10000 pointers to one "Mississippi",
as several people have pointed out. Try this:
nstates = [AlabamatoWyoming() for j in xrange(N)]
where AlabamatoWyoming() -> a string "Alabama" .. "Wyoming".
What's deep_getsizeof(nstates) ?
(How can we tell ?
- a proper deep_getsizeof: difficult, ~ gc tracer
- estimate from total vm
- inside knowledge of the python implementation
- guess.)
Added 25jan:
see also when-does-python-allocate-new-memory-for-identical-strings

10000 * [x] will produce a list of 10000 times the same object, so the sizeof is actually closer to correct than you think. However, a deep sizeof is very problematic because it's impossible to tell Python when you want to stop the measurement. Every object references a typeobject. Should the typeobject be counted? What if the reference to the typeobject is the last one, so if you deleted the object the typeobject would go away as well? What about if you have multiple (different) objects in the list refer to the same string object? Should it be counted once, or multiple times?
In short, getting the size of a data structure is very complicated, and sys.getsizeof() should never have been added :S

Have a look at guppy/heapy; I haven't played around with it too much myself, but a few of my co-workers have used it for memory profiling with good results.
The documentation could be better, but this howto does a decent job of explaining the basic concepts.
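For reference, a minimal usage sketch (assuming guppy is installed; the sample list is just hypothetical stand-in data) -- hpy().heap() prints a breakdown of live objects by type, which is usually enough to see what is eating memory:
from guppy import hpy
hp = hpy()
nstates = [str(i % 50) for i in range(10000)]  # hypothetical stand-in data
print(hp.heap())  # per-type breakdown of live objects, including sizes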

If your list is only holding objects of the same size, you can get a more accurate estimate by doing this:
import sys

def getSize(array):
    return sys.getsizeof(array) + len(array) * sys.getsizeof(array[0])
Obviously it's not going to work as well for strings of variable length.
If you only want to calculate the size for debugging or during development and you don't care about the performance, you could iterate over all items recursively and calculate the total size. Note that this solution is not going to handle multiple references to the same object correctly.
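For completeness, a minimal recursive sketch along those lines (not from the original answer): it sums sys.getsizeof over a container and its elements, and a seen set of ids keeps shared objects from being counted twice.
import sys

def deep_getsizeof(obj, seen=None):
    """ rough recursive size estimate; counts each distinct object once """
    if seen is None:
        seen = set()
    if id(obj) in seen:          # shared reference, already counted
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

print(deep_getsizeof(10000 * ["Mississippi"]))  # the shared string is counted only once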

I wrote a tool called RememberMe exactly for this. Basic usage:
from rememberme import memory
a = [1, 2, 3]
b = [a, a, a]
print(memory(a)) # 172 bytes
print(memory(b)) # 260 bytes. Duplication counted only once.
Hope it helps.

mylist = 10000 * [x] creates a list of size 10000 with 10000 references to object x.
Object x is not copied - only a single one exists in memory!!!
So to use getsizeof, it would be: sys.getsizeof(mylist) + sys.getsizeof(x)
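A quick check of that claim (a small editorial sketch, not part of the original answer):
import sys
x = 1000 * "a"
mylist = 10000 * [x]
print(all(item is x for item in mylist))         # True: 10000 references to one string
print(sys.getsizeof(mylist) + sys.getsizeof(x))  # list overhead plus the single shared string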

Related

difference in id() between list and string type in python [duplicate]

Efficient and not memory consuming way to find all possible pairs in list

I have a dictionary called lemma_all_context_dict, and it has approximately 8000 keys. I need a list of all possible pairs of these keys.
I used:
pairs_of_words_list = list(itertools.combinations(lemma_all_context_dict.keys(), 2))
However, when using this line I get a MemoryError. I have 8GB of RAM but perhaps I get this error anyway because I've got a few very large dictionaries in this code.
So I tried a different way:
pairs_of_words_list = []
for p_one in range(len(lemma_all_context_dict.keys())):
    for p_two in range(p_one+1,len(lemma_all_context_dict.keys())):
        pairs_of_words_list.append([lemma_all_context_dict.keys()[p_one],lemma_all_context_dict.keys()[p_two]])
But this piece of code takes around 20 minutes to run... does anyone know of a more efficient way to solve the problem? Thanks
I don't think that this question is a duplicate because what I'm asking - and I don't think this has been asked - is how to implement this stuff without my computer crashing :-P
Don't build a list, since that's the reason you get a memory error (you even create two lists, since that's what .keys() does). You can iterate over the iterator (that's their purpose):
for a, b in itertools.combinations(lemma_all_context_dict, 2):
    print a, b
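For a rough sense of scale (an editorial back-of-the-envelope, not from the answer): 8000 keys give 8000*7999/2 = 31,996,000 pairs, so materialising them all is what exhausts memory, while the lazy iterator keeps only one pair alive at a time.
n = 8000
n_pairs = n * (n - 1) // 2    # 31,996,000 pairs
approx_bytes = n_pairs * 80   # rough guess: ~80 bytes per small pair object plus its list slot (CPython 64-bit)
print(n_pairs, approx_bytes // 2**20)   # about 2441 MiB before counting the strings themselves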

Iterate two or more lists / numpy arrays... and compare each item with each other and avoid loops in python

I am new to python and my problem is the following:
I have defined a function func(a,b) that returns a value, given two input values.
Now I have my data stored in lists or numpy arrays A, B and would like to use func for every combination. (A and B have over one million entries.)
ATM I use this snippet:
for p in A:
    for k in B:
        value = func(p,k)
This takes a really long time.
So i was thinking that maybe something like this:
C=(map(func,zip(A,B)))
But this method only works pairwise... Any ideas?
Thanks for help
First issue
You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loop is to make your function f accept (NumPy) arrays as input, and do the calculation on the whole array at once (i.e., no looping as seen from Python). Check any NumPy tutorial to get an introduction.
Second issue
If A and B have over a million entries each, there are one trillion combinations. For 64 bits numbers, that means you'll need 7.3 TiB of space just to store the result of your calculation. Do you have enough hard drive to just store the result?
Third issue
If A and B were much smaller, in your particular case you'd be able to do this:
values = f(*meshgrid(A, B))
meshgrid returns the cartesian product of A and B, so it's simply a way to generate the points that have to be evaluated.
Summary
You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)
Working with terabytes of data is hard. Do you really need that much data?
Any solution that calls a function f 1e12 times in a loop is bound to be slow, especially in CPython (which is the default Python implementation; if you're not really sure which one you have and you're using NumPy, you're using CPython).
I suppose itertools.product does what you need:
from itertools import product
pro = product(A,B)
C = map(lambda x: func(*x), pro)
Since product returns a generator, it doesn't require additional memory (note, though, that in Python 2 map itself builds a list; itertools.imap keeps the whole pipeline lazy).
One million times one million is one trillion. Calling f one trillion times will take a while.
Unless you have a way of reducing the number of values to compute, you can't do better than the above.
If you use NumPy, you should definitely look at the np.vectorize function, which is designed for this kind of problem...
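To make the vectorization idea concrete, a small broadcasting sketch (the names A, B and func here are only illustrative; func must be written with array operations for this to work):
import numpy as np

A = np.arange(1000, dtype=np.float64)
B = np.arange(1000, dtype=np.float64)

def func(a, b):
    return a * a + b   # example elementwise function

# A[:, None] has shape (1000, 1) and B[None, :] has shape (1, 1000); broadcasting
# evaluates func on every combination at once, with no Python-level loop.
C = func(A[:, None], B[None, :])
print(C.shape)   # (1000, 1000)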

Repeatedly appending to a large list (Python 2.6.6)

I have a project where I am reading in ASCII values from a microcontroller through a serial port (looks like this : AA FF BA 11 43 CF etc)
The input is coming in quickly (38 two character sets / second).
I'm taking this input and appending it to a running list of all measurements.
After about 5 hours, my list has grown to ~ 855000 entries.
I'm given to understand that the larger a list becomes, the slower list operations become. My intent is to have this test run for 24 hours, which should yield around 3M results.
Is there a more efficient, faster way to append to a list than list.append()?
Thanks Everyone.
I'm given to understand that the larger a list becomes, the slower list operations become.
That's not true in general. Lists in Python are, despite the name, not linked lists but arrays. There are operations that are O(n) on arrays (copying and searching, for instance), but you don't seem to use any of these. As a rule of thumb: If it's widely used and idiomatic, some smart people went and chose a smart way to do it. list.append is a widely-used builtin (and the underlying C function is also used in other places, e.g. list comprehensions). If there was a faster way, it would already be in use.
As you will see when you inspect the source code, lists are overallocating, i.e. when they are resized, they allocate more than needed for one item so the next n items can be appended without needing another resize (which is O(n)). The growth isn't constant, it is proportional to the list size, so resizing becomes rarer as the list grows larger. Here's the snippet from listobject.c:list_resize that determines the overallocation:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
As Mark Ransom points out, older Python versions (<2.7, 3.0) have a bug that makes the GC sabotage this. If you have such a Python version, you may want to disable the gc. If you can't because you generate too much garbage (that slips refcounting), you're out of luck though.
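To see the over-allocation described above in action, a small editorial sketch: the allocated size of a list only jumps at the occasional resize, not on every append.
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:          # a resize happened on this append
        print(len(lst), size)
        last = size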
One thing you might want to consider is writing your data to a file as it's collected. I don't know (or really care) if it will affect performance, but it will help ensure that you don't lose all your data if power blips. Once you've got all the data, you can suck it out of the file and jam it in a list or an array or a numpy matrix or whatever for processing.
Appending to a Python list has amortized constant cost. It is not affected by the number of items in the list (in theory). In practice appending to a list will get slower once you run out of memory and the system starts swapping.
http://wiki.python.org/moin/TimeComplexity
It would be helpful to understand why you actually append things into a list. What are you planning to do with the items? If you don't need all of them, you could build a ring buffer (see the sketch below); if you don't need to do computation, you could write the list to a file, etc.
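If the ring-buffer route fits, collections.deque already provides one; a minimal sketch (the values here are stand-ins for the serial readings):
from collections import deque

last_readings = deque(maxlen=1000)   # oldest items are dropped automatically
for value in range(5000):            # stand-in for values read from the serial port
    last_readings.append(value)
print(len(last_readings))            # 1000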
First of all, 38 two-character sets per second, 1 stop bit, 8 data bits, and no parity, is only 760 baud, not fast at all.
But anyway, my suggestion, if you're worried about having overly large lists/don't want to use one huge list, is just to store a list on disk once it reaches a certain size and start a new list, repeating until you've gotten all the data, then combining all the lists into one once you're done receiving the data.
Though you may skip the sublists completely and just go with nmichaels' suggestion, writing the data to a file as you get it and using a small circular buffer to hold the received data that has not yet been written.
It might be faster to use numpy if you know how long the array is going to be and you can convert your hex codes to ints:
import numpy
a = numpy.zeros(3000000, numpy.int32)
for i in range(3000000):
    a[i] = int(scanHexFromSerial(),16)
This will leave you with an array of integers (which you could convert back to hex with hex()), but depending on your application maybe that will work just as well for you.

when does Python allocate new memory for identical strings?

Two Python strings with the same characters, a == b,
may share memory, id(a) == id(b),
or may be in memory twice, id(a) != id(b).
Try
ab = "ab"
print id( ab ), id( "a"+"b" )
Here Python recognizes that the newly created "a"+"b" is the same
as the "ab" already in memory -- not bad.
Now consider an N-long list of state names
[ "Arizona", "Alaska", "Alaska", "California" ... ]
(N ~ 500000 in my case).
I see 50 different id() s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again:
the "same" list now has N different id() s, way more memory, see below.
How come -- can anyone explain Python string memory allocation ?
""" when does Python allocate new memory for identical strings ?
ab = "ab"
print id( ab ), id( "a"+"b" ) # same !
list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
but list > file > mem again: N ids, mem ~ N * (4 + S)
"""
from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys
states = dict(
AL = "Alabama",
AK = "Alaska",
AZ = "Arizona",
AR = "Arkansas",
CA = "California",
CO = "Colorado",
CT = "Connecticut",
DE = "Delaware",
FL = "Florida",
GA = "Georgia",
)
def nid(alist):
    """ nr distinct ids """
    return "%d ids %d pickle len" % (
        len( set( map( id, alist ))),
        len( cPickle.dumps( alist, 0 ))) # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents
N = 10000
exec( "\n".join( sys.argv[1:] )) # var=val ...
random.seed(1)
# big list of random names of states --
names = []
for j in xrange(N):
    name = copy( random.choice( states.values() ))
    names.append(name)
print "%d strings in mem: %s" % (N, nid(names) ) # 10 ids, even with copy()
# list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split() # same as > file > mem again
assert joinsplit == names
print "%d strings from a file: %s" % (N, nid(joinsplit) )
# 10000 strings in mem: 10 ids 42149 pickle len
# 10000 strings from a file: 10000 ids 188080 pickle len
# Python 2.6.4 mac ppc
Added 25jan:
There are two kinds of strings in Python memory (or any program's):
- Ustrings, in a Ucache of unique strings: these save memory, and make a == b fast if both are in Ucache
- Ostrings, the others, which may be stored any number of times.
intern(astring) puts astring in the Ucache (Alex +1);
other than that we know nothing at all about how Python moves Ostrings to the Ucache --
how did "a"+"b" get in, after "ab" ?
("Strings from files" is meaningless -- there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.
A historical footnote: SPITBOL uniquified all strings ca. 1970.
Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, are just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: use one more reference to a suitable existing object when locating such an object is cheap and easy, and just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time of searching.
So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
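A minimal sketch of such a pool for the file-reading case in the question (Python 2 style; the filename is hypothetical): passing each string through the builtin intern() restores the one-object-per-distinct-name property.
names = [intern(line.strip()) for line in open("state_names.txt")]  # hypothetical file of state names
# Equal names now share a single object again, so len(set(map(id, names)))
# drops back to the number of distinct states.  (In Python 3 this is sys.intern.)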
I strongly suspect that Python is behaving like many other languages here - recognising string constants within your source code and using a common table for those, but not applying the same rules when creating strings dynamically. This makes sense as there will only be a finite set of strings within your source code (although Python lets you evaluate code dynamically, of course) whereas it's much more likely that you'll be creating huge numbers of strings in the course of your program.
This process is generally called interning - and indeed by the looks of this page it's called interning in Python, too.
A side note: it is very important to know the lifetime of objects in Python. Note the following session:
Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False
Your reasoning -- printing the IDs of two separate expressions, noting that they are equal, and concluding that the two expressions must be equal/equivalent/the same -- is faulty. A single line of output does not necessarily imply all of its contents were created and/or co-existed at the same single moment in time.
If you want to know if two objects are the same object, ask Python directly (using the is operator).
x = 42
y = 42
x == y #True
x is y #True
In this interaction, X and Y should be == (same value), but not is (same object) because we ran two different literal expressions. Because small integers and strings are cached and reused, though, is tells us they reference the same single object.
In fact, if you really want to look under the hood, you can always ask Python how many references there are to an object: the getrefcount function in the standard sys module returns the object's reference count.
This behavior reflects one of the many ways Python optimizes its model for execution speed.
Learning Python
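A tiny demonstration of the sys.getrefcount call mentioned above (editorial sketch; note that the reported count includes the temporary reference created by the call itself):
import sys

x = 42
print(sys.getrefcount(x))   # large: the small int 42 is shared all over the interpreter
s = "a distinctly non-shared string %d" % 12345
print(sys.getrefcount(s))   # typically 2: the name s plus the call's argument reference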
I found a good article to explain the intern behavior of CPython:
http://guilload.com/python-string-interning/
In short:
A string object in CPython has a flag that indicates whether it is interned.
Strings are interned by storing them in an ordinary dictionary whose keys and values are pointers to the string objects; only the string type is accepted.
Interning helps Python reduce memory consumption, because objects can refer to the same memory address, and it speeds up comparisons, because only the strings' pointers have to be compared.
Python does the interning at compile time, which means only string literals (or strings that can be computed at compile time, like 'hello' + 'world') are interned.
For your question: only strings with length 0 or 1, or strings containing only ASCII letters and digits (a-z, A-Z, 0-9), are interned.
Interning works in Python because strings are immutable; otherwise it would not make sense.
This is a really good article; I strongly suggest visiting his site and checking out the other ones -- worth our time.
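A small CPython-specific illustration of that compile-time rule (editorial sketch): strings built at run time are not interned automatically, but intern() restores the sharing.
s1 = "".join(["ab", "cd"])    # built at run time, so not interned
s2 = "".join(["ab", "cd"])
print(s1 == s2, s1 is s2)     # equal values (True), but two separate objects (False)
try:
    intern_ = intern          # Python 2 builtin
except NameError:
    from sys import intern as intern_   # Python 3
print(intern_(s1) is intern_(s2))       # True: both now reference one cached copy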
