Python list boundary

Is there a boundary on the size of lists and dictionaries in Python?
If there is, what is the limit?

I think by boundary you mean an upper bound on the number of elements in a list or dict. Python does not define any limit on them, so they can be as big as the memory available on your machine permits.

Actually, the current hash implementation for built-in Python objects uses 32-bit hashes, so at a point close to 2^32 elements in a dictionary (assuming you have memory for that many) you will start to get a lot of collisions, and dictionary usage will slow down significantly. But that won't prevent it from working.
(Python developers are looking at making this hash 64-bit in future builds, at which point this will no longer be an issue.)
As for an absolute limit, CPython caps container lengths at sys.maxsize (see below), but in practice the limiting factor is the available system memory.
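If you want a quick sanity check of that ceiling on your own machine, sys.maxsize (the largest value a Py_ssize_t can hold, and hence the largest possible container length in CPython) is easy to inspect. This is only a theoretical ceiling; you will run out of memory long before reaching it:
import sys

# Largest length any CPython container (list, dict, tuple, ...) can have.
# Typically 2**63 - 1 on a 64-bit build and 2**31 - 1 on a 32-bit build.
print(sys.maxsize)
print(sys.maxsize == 2**63 - 1)  # True on a typical 64-bit platform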

The amount of memory you have is the limit.

Related

At what point am I using too much memory on a Mac?

I've tried really hard to figure out why my Python program is using 8 gigs of memory. I've even used gc.get_objects() and measured the size of each object, and only one of them was larger than 10 megs. Still, all of the objects, and there were about 100,000 of them, added up to 5.5 gigs. On the other hand, my computer is working fine, and the program is running at a reasonable speed. So is the fact that I'm using so much memory cause for concern?
As @bnaecker said, this doesn't have a simple (i.e., yes/no) answer. It's only a problem if the combined RSS (resident set size) of all running processes exceeds the available memory, thus causing excessive demand paging.
You didn't say how you calculated the size of each object. Hopefully it was by using sys.getsizeof(), which should accurately include the overhead associated with each object. If you used some other method (such as calling the __sizeof__() method directly) then your answer will be far lower than the correct value. However, even sys.getsizeof() won't account for wasted space due to memory alignment. For example, consider this experiment (using Python 3.6 on macOS):
In [25]: x = 'x' * 8193
In [26]: sys.getsizeof(x)
Out[26]: 8242
In [28]: 8242/4
Out[28]: 2060.5
Notice that last value. It implies that the object is using 2060 and 1/2 words of memory, which is wrong, since all allocations consume a multiple of a word. In fact, it looks to me like sys.getsizeof() does not correctly account for word alignment and padding of either the underlying object or the data structure that describes the object, which means the value is smaller than the amount of memory actually used by the object. Multiplied by 100,000 objects, that could represent a substantial amount of memory.
Also, many memory allocators will round up large allocations to a page size (typically a multiple of 4 KiB), which results in "wasted" space that is probably not going to be included in the sys.getsizeof() return value.
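If you want to reproduce that kind of accounting yourself, here is a minimal sketch (the function name is mine) that sums sys.getsizeof() over everything the garbage collector tracks; per the caveats above, it will still undercount alignment padding and allocator rounding:
import gc
import sys

def tracked_size_bytes():
    # Sum the reported size of every object the GC currently tracks.
    # This undercounts: getsizeof() ignores alignment padding and
    # allocator page rounding, and objects the GC doesn't track are
    # missed entirely.
    return sum(sys.getsizeof(obj) for obj in gc.get_objects())

print('Approximate tracked bytes: %d' % tracked_size_bytes())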

What is the maximum tuple array size in Python 3?

I am building a web scraper that stores data retrieved from four different websites into a tuple array. I later iterate through the tuple and save the entire lot as both CSV and Excel.
Are tuples, or arrays in general, limited by the machine's RAM/disk space?
Thanks
According to the docs, this is given by sys.maxsize:
sys.maxsize
An integer giving the maximum value a variable of type Py_ssize_t can
take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a
64-bit platform.
And interestingly enough, the Python 3 documentation on the data model gives more implementation details under object.__len__:
CPython implementation detail: In CPython, the length is required to
be at most sys.maxsize. If the length is larger than sys.maxsize
some features (such as len()) may raise OverflowError.
I believe tuples and lists are limited by the size of the machine's virtual memory, unless you're on a 32-bit system, in which case you're limited by the smaller word size. Also, lists are dynamically resized when they grow too small, over-allocating by (I believe) about 12% each time, so there's a little overhead there as well.
If you're concerned you're going to run out of virtual memory, it might be a good idea to write to a file or files instead.
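For instance, here is a minimal sketch of that approach (the file name and row shape are made up for illustration): stream each scraped row straight to disk with the standard csv module, so that only one row is ever held in memory at a time:
import csv

# Stand-in for your scraped tuples: a generator, so rows are produced lazily.
rows = ((i, 'site-%d' % (i % 4), 'payload') for i in range(1000000))

with open('scraped.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'source', 'data'])  # header row
    for row in rows:
        writer.writerow(row)  # one row in memory at a time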

Unexpectedly high memory usage in Google App Engine

I have a Python GAE app that stores data in each instance, and the memory usage is much higher than I’d expected. As an illustration, consider this test code which I’ve added to my app:
from google.appengine.ext import webapp

bucket = []

class Memory(webapp.RequestHandler):
    def get(self):
        global bucket
        n = int(self.request.get('n'))
        size = 0
        for i in range(n):
            text = '%10d' % i
            bucket.append(text)
            size += len(text)
        self.response.out.write('Total number of characters = %d' % size)
A call to this handler with a value for query variable n will cause the instance to add n strings to its list, each 10 characters long.
If I call this with n=1 (to get everything loaded) and then check the instance memory usage on the production server, I see a figure of 29.4MB. If I then call it with n=100000 and check again, memory usage has jumped to 38.9MB. That is, my memory footprint has increased by 9.5MB to store only one million characters, nearly ten times what I’d expect. I believe that characters consume only one byte each, but even if that’s wrong there’s still a long way to go. Overhead of the list structure surely can’t explain it. I tried adding an explicit garbage collection call, but the figures didn’t change. What am I missing, and is there a way to reduce the footprint?
(Incidentally, I tried using a set instead of a list and found that after calling with n=100000 the memory usage increased by 13MB. That suggests that the set overhead for 100000 strings is 3.5MB more than that of lists, which is also much greater than expected.)
I know that I'm really late to the party here, but this isn't surprising at all...
Consider a string of length 1:
s = '1'
That's pretty small, right? Maybe somewhere on the order of 1 byte? Nope.
>>> import sys
>>> sys.getsizeof('1')
38
So there are approximately 37 bytes of overhead associated with each string that you create (the object header, reference count, length, and cached hash all have to be stored somewhere).
Additionally, it's usually most efficient for your CPU to store items based on "word size" rather than byte size; on lots of systems, a "word" is 4 bytes. I don't know for certain, but I wouldn't be surprised if Python's memory allocator plays tricks there too to keep it running fairly quickly.
Also, don't forget that lists are represented as over-allocated arrays (to prevent huge performance problems each time you .append). It is possible that, when you make a list of 100k elements, Python actually allocates pointers for 110k or more.
Finally, regarding set: that's probably fairly easily explained by the fact that sets are even more over-allocated than lists (they need to avoid all those hash collisions, after all). They end up having large jumps in memory usage as the set size grows, in order to have enough free slots in the array to avoid hash collisions:
>>> sys.getsizeof(set([1]))
232
>>> sys.getsizeof(set([1, 2]))
232
>>> sys.getsizeof(set([1, 2, 3]))
232
>>> sys.getsizeof(set([1, 2, 3, 4]))
232
>>> sys.getsizeof(set([1, 2, 3, 4, 5]))
232
>>> sys.getsizeof(set([1, 2, 3, 4, 5, 6])) # resize!
744
The overhead of the list structure doesn't explain what you're seeing directly, but memory fragmentation does. And strings have a non-zero overhead in terms of underlying memory, so counting string lengths is going to undercount significantly.
I'm not an expert, but this is an interesting question. It seems like it's more of a Python memory management issue than a GAE issue. Have you tried running it locally and comparing the memory usage on your local dev_appserver vs. deployed on GAE? That should indicate whether it's the GAE platform or just Python.
Secondly, the Python code you used is simple but not very efficient; a list comprehension instead of the for loop should be more efficient and should reduce the memory usage a bit:
''.join(['%10d' % i for i in range(n)])
Under the covers your growing string must be constantly reallocated. Every time through the for loop, there's a discarded string left lying around. I would have expected that triggering the garbage collector after your for loop should have cleaned up the extra strings though.
Try triggering the garbage collector before you check the memory usage.
import gc
gc.collect()
print(len(gc.get_objects()))  # run this inside your handler, before measuring
That should give you an idea if the garbage collector hasn't cleaned out some of the extra strings.
This is largely a response to dragonx.
The sample code exists only to illustrate the problem, so I wasn't concerned with small efficiencies. I am instead concerned about why the application consumes around ten times as much memory as there is actual data. I can understand there being some memory overhead, but this much?
Nonetheless, I tried using a list comprehension (without the join, to match my original) and the memory usage increases slightly, from 9.5MB to 9.6MB. Perhaps that's within the margin of error. Or perhaps the large range() expression sucks it up; it's released, no doubt, but better to use xrange(), I think. With the join the instance variable is set to one very long string, and the memory footprint unsurprisingly drops to a sensible 1.1MB, but this isn't the same case at all. You get the same 1.1MB just setting the instance variable to one million characters without using a list comprehension.
I'm not sure I agree that with my loop "there's a discarded string left lying around." I believe that the string is added to the list (by reference, if that's proper to say) and that no strings are discarded.
I had already tried explicit garbage collection, as my original question states. No help there.
Here's a telling result. Changing the length of the strings from 10 to some other number causes a proportional change in memory usage, but there's a constant in there as well. My experiments show that for every string added to the list there's an 85-byte overhead, no matter what the string length. Is this the cost for strings or for putting the strings into a list? I lean toward the latter. Creating a list of 100,000 Nones consumes 4.5MB, or around 45 bytes per None. This isn't as bad as for strings, but it's still pretty bad. And as I mentioned before, it's worse for sets than it is for lists.
I wish I understood why the overhead (or fragmentation) was this bad, but the inescapable conclusion seems to be that large collections of small objects are extremely expensive. You're probably right that this is more of a Python issue than a GAE issue.
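As a rough illustration of that conclusion (exact sizes vary by Python version and platform), compare the getsizeof() total for 100,000 ten-character strings held individually against the same million characters held as a single string:
import sys

n = 100000
many = ['%10d' % i for i in range(n)]  # 100,000 separate string objects
one = ''.join(many)                    # the same characters as one object

total = sum(sys.getsizeof(s) for s in many) + sys.getsizeof(many)
print('list of small strings: %.1f MB' % (total / 1e6))
print('single big string:     %.1f MB' % (sys.getsizeof(one) / 1e6))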

Time complexity of Python set operations?

What is the time complexity of each of Python's set operations in Big O notation?
I am using Python's set type for an operation on a large number of items. I want to know how each operation's performance will be affected by the size of the set. For example, add, and the test for membership:
myset = set()
myset.add('foo')
'foo' in myset
Googling around hasn't turned up any resources, but it seems reasonable that the time complexity for Python's set implementation would have been carefully considered.
If it exists, a link to something like this would be great. If nothing like this is out there, then perhaps we can work it out?
Extra marks for finding the time complexity of all set operations.
According to the Python wiki's Time complexity page, set is implemented as a hash table, so you can expect lookup/insert/delete to be O(1) on average. Unless your hash table's load factor is too high, in which case you face collisions and O(n).
P.S. For some reason they claim O(n) for the delete operation, which looks like a typo.
P.P.S. This is true for CPython, pypy is a different story.
The other answers do not talk about two crucial operations on sets: union and intersection. In the worst case, union takes O(n+m), whereas intersection takes O(min(n, m)), provided that there are not many elements in the sets with the same hash. A list of time complexities of common operations can be found here: https://wiki.python.org/moin/TimeComplexity
The in operation should be independent of the size of the container, i.e., O(1), given an optimal hash function. This should be nearly true for Python strings. Hashing strings is always critical; Python should be clever there, so you can expect near-optimal results.
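A quick, unscientific way to convince yourself of the O(1) behaviour (absolute timings are machine-dependent) is to time membership tests on sets of very different sizes with timeit:
import timeit

for n in (10**2, 10**4, 10**6):
    setup = 's = set(range(%d)); probe = %d' % (n, n - 1)
    t = timeit.timeit('probe in s', setup=setup, number=10**6)
    # The per-lookup time should stay roughly flat as n grows.
    print('n=%8d: %.3f s per million lookups' % (n, t))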

Is it possible to give a Python dict an initial capacity (and is it useful)?

I am filling a Python dict with around 10,000,000 items. My understanding of dicts (or hash tables) is that when too many elements get into them, they need to resize, an operation that costs quite some time.
Is there a way to tell a Python dict that you will be storing at least n items in it, so that it can allocate memory from the start? Or will this optimization do no good for my running speed?
(And no, I have not checked whether the slowness of my small script is because of this; I actually wouldn't know how to do that. It is, however, something I would do in Java: set the initial capacity of the HashSet right.)
First off, I've heard rumors that you can set the size of a dictionary at initialization, but I have never seen any documentation or PEP describing how this would be done.
With this in mind I ran an analysis on your quantity of items, described below. While it may take some time to resize the dictionary each time, I would recommend moving ahead without worrying about it, at least until you can test its performance.
The two rules that concern us in determining resizing are the number of elements and the factor of resizing. A dictionary will resize itself when it is 2/3 full, on the addition of the element putting it over the 2/3 mark. Below 50,000 elements it will increase by a factor of 4, above that amount by a factor of 2. Using your estimate of 10,000,000 elements (between 2^23 and 2^24), your dictionary will resize itself 15 times (7 times below 50k, 8 times above). Another resize would occur just past 11,100,000.
Resizing and replacing the current elements in the hashtable does take some time, but I wonder if you'd notice it with whatever else you have going on in the code nearby. I just put together a timing suite comparing inserts at five places along each boundary from dictionary sizes of 2^3 through 2^24, and the "border" additions average 0.4 nanoseconds longer than the "non-border" additions. This is 0.17% longer... probably acceptable. The minimum for all operations was 0.2085 microseconds, and max was 0.2412 microseconds.
Hope this is insightful, and if you do check the performance of your code please follow up with an edit! My primary resource for dictionary internals was the splendid talk given by Brandon Rhodes at PyCon 2010: The Mighty Dictionary
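For what it's worth, here is a small sketch that counts resizes under the rules described above (resize on crossing 2/3 full, growing 4x below 50,000 elements and 2x above). The exact thresholds and growth factors vary across CPython versions, so treat the output as approximate:
def count_resizes(n_items, slots=8):
    # Simulate dict growth: a resize happens on the insertion that
    # pushes the table past 2/3 full.
    resizes = 0
    while True:
        threshold = (2 * slots) // 3 + 1  # insertion that crosses 2/3 full
        if threshold > n_items:
            return resizes
        slots *= 4 if threshold < 50000 else 2
        resizes += 1

# ~14 resizes under these simplified rules, close to the estimate of 15 above.
print(count_resizes(10000000))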
Yes, you can; here is a comparison I found in another question that is related to yours:
import itertools

d = {}
for i in xrange(4000000):
    d[i] = None
# 722ms

d = dict(itertools.izip(xrange(4000000), itertools.repeat(None)))
# 634ms

dict.fromkeys(xrange(4000000))
# 558ms

s = set(xrange(4000000))
dict.fromkeys(s)
# Not including set construction: 353ms
Those are different ways to pre-populate a dictionary with a given number of keys; dict.fromkeys on a pre-built set is fastest, presumably because the dict can be sized for the whole key set at once.
